# Mask

A mask is a Series with the same index as your DataFrame, where every value is a boolean.

In [1]:
import pandas

wentworth = pandas.read_csv("data/rainfall/IDCJAC0009_047045_1800_Data.csv")

# get a series
year = wentworth["Year"]

# create a boolean series from that
mask = year < 1934

mask



0         True
1         True
2         True
3         True
4         True
         ...  
32288    False
32289    False
32290    False
32291    False
32292    False
Name: Year, Length: 32293, dtype: bool

Notice that the mask has exactly the same index as the DataFrame it was taken from. This is good news, because it means we can use it to filter the rows of the DataFrame.
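If you want to check the "same index" claim for yourself, here is a minimal sketch. The `toy` DataFrame and its values are made up for illustration (they are not from the Wentworth file), and I have given it an unusual index on purpose so you can see the mask copy it.

```python
import pandas

# Hypothetical toy data standing in for the Wentworth file
toy = pandas.DataFrame(
    {"Year": [1933, 1933, 1934, 1945], "Month": [1, 12, 1, 1]},
    index=[10, 11, 12, 13],
)

# Comparing a Series to a number gives a boolean Series
toy_mask = toy["Year"] < 1934

print(toy_mask.dtype)                    # bool
print(toy_mask.index.equals(toy.index))  # True
```

The comparison is done row by row, so each label in the index keeps its own True or False.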

The key is that _if you put a mask in the square brackets of a DataFrame, instead of getting columns, you get all the rows where the mask is True_.

In [3]:
wentworth[mask]

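|     | Product code | Bureau of Meteorology station number | Year | Month | Day | Rainfall amount (millimetres) | Period over which rainfall was measured (days) | Quality |
|-----|--------------|--------------------------------------|------|-------|-----|-------------------------------|------------------------------------------------|---------|
| 0   | IDCJAC0009   | 47045                                | 1933 | 1     | 1   |                               |                                                |         |
| 1   | IDCJAC0009   | 47045                                | 1933 | 1     | 2   |                               |                                                |         |
| 2   | IDCJAC0009   | 47045                                | 1933 | 1     | 3   |                               |                                                |         |
| 3   | IDCJAC0009   | 47045                                | 1933 | 1     | 4   |                               |                                                |         |
| 4   | IDCJAC0009   | 47045                                | 1933 | 1     | 5   |                               |                                                |         |
| ... | ...          | ...                                  | ...  | ...   | ... | ...                           | ...                                            | ...     |
| 360 | IDCJAC0009   | 47045                                | 1933 | 12    | 27  | 0.0                           |                                                | Y       |
| 361 | IDCJAC0009   | 47045                                | 1933 | 12    | 28  | 0.0                           |                                                | Y       |
| 362 | IDCJAC0009   | 47045                                | 1933 | 12    | 29  | 0.0                           |                                                | Y       |
| 363 | IDCJAC0009   | 47045                                | 1933 | 12    | 30  | 0.0                           |                                                | Y       |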
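The real file is too long to check by eye, so here is a small sketch of the same filtering on made-up data. The `rain` DataFrame and its numbers are hypothetical; only the column names are borrowed from the Wentworth file.

```python
import pandas

# Hypothetical toy data, not real measurements
rain = pandas.DataFrame({
    "Year": [1933, 1934, 1933, 1945],
    "Rainfall amount (millimetres)": [0.0, 2.4, 1.2, 0.0],
})

early = rain["Year"] < 1934

# Only the rows where the mask is True survive
early_rows = rain[early]

print(list(early_rows.index))  # [0, 2]
print(early.sum())             # 2
```

Two things are worth noticing: the surviving rows keep their original index labels, and because True counts as 1, summing a mask tells you how many rows it will keep.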


# Exercises

1. Show only the _rainfall amount_ (not the other columns) for 1945 in Wentworth
2. Show only the _rainfall amounts_ for every January in Wentworth
3. Show only the _rainfall amount_ (not the other columns) for January 1945 in Wentworth (hint: you need to use some boolean algebra)
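In case the boolean algebra hint is a bit cryptic, here is a rough sketch of how masks combine, using a made-up `toy` DataFrame (the years and months are hypothetical and deliberately not the ones in the exercises). pandas uses `&` for "and", `|` for "or" and `~` for "not"; the plain Python words `and`, `or` and `not` raise an error when used on a Series.

```python
import pandas

# Hypothetical toy data
toy = pandas.DataFrame({
    "Year": [1950, 1950, 1951, 1951],
    "Month": [1, 2, 1, 2],
})

is_1950 = toy["Year"] == 1950
is_feb = toy["Month"] == 2

both = is_1950 & is_feb    # True only where both masks are True
either = is_1950 | is_feb  # True where at least one mask is True
not_1950 = ~is_1950        # flips every value

print(both.tolist())      # [False, True, False, False]
print(either.tolist())    # [True, True, False, True]
print(not_1950.tolist())  # [False, False, True, True]
```

The result of combining two masks is itself a mask, so it can go in the square brackets like any other. If you write the comparisons inline rather than naming them first, wrap each one in parentheses, because `&` and `|` bind more tightly than `==` and `<`.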

Masks are a simple concept, but they are used absolutely everywhere. I like to make sure each mask has a name, just like in the above example, but many pandas programmers put the mask right in the square brackets, so the above example looks like this:

In [10]:
wentworth[wentworth["Year"] < 1934]

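|     | Product code | Bureau of Meteorology station number | Year | Month | Day | Rainfall amount (millimetres) | Period over which rainfall was measured (days) | Quality |
|-----|--------------|--------------------------------------|------|-------|-----|-------------------------------|------------------------------------------------|---------|
| 0   | IDCJAC0009   | 47045                                | 1933 | 1     | 1   |                               |                                                |         |
| 1   | IDCJAC0009   | 47045                                | 1933 | 1     | 2   |                               |                                                |         |
| 2   | IDCJAC0009   | 47045                                | 1933 | 1     | 3   |                               |                                                |         |
| 3   | IDCJAC0009   | 47045                                | 1933 | 1     | 4   |                               |                                                |         |
| 4   | IDCJAC0009   | 47045                                | 1933 | 1     | 5   |                               |                                                |         |
| ... | ...          | ...                                  | ...  | ...   | ... | ...                           | ...                                            | ...     |
| 360 | IDCJAC0009   | 47045                                | 1933 | 12    | 27  | 0.0                           |                                                | Y       |
| 361 | IDCJAC0009   | 47045                                | 1933 | 12    | 28  | 0.0                           |                                                | Y       |
| 362 | IDCJAC0009   | 47045                                | 1933 | 12    | 29  | 0.0                           |                                                | Y       |
| 363 | IDCJAC0009   | 47045                                | 1933 | 12    | 30  | 0.0                           |                                                | Y       |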


I personally don't like this style, but most pandas programmers will use it - maybe they like showing off.
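Whichever style you prefer, the two give the same answer. Here is a small sketch on made-up data (the `toy` DataFrame and its numbers are hypothetical) that checks this, and also shows one common way to pick out a single column from the filtered rows using `.loc`.

```python
import pandas

# Hypothetical toy data, not real measurements
toy = pandas.DataFrame({
    "Year": [1933, 1933, 1934],
    "Rainfall amount (millimetres)": [0.0, 1.2, 3.4],
})

# Named style
mask = toy["Year"] < 1934
named = toy[mask]

# Inline style
inline = toy[toy["Year"] < 1934]

print(named.equals(inline))  # True

# .loc takes a row selector (here the mask) and a column label
rain_only = toy.loc[mask, "Rainfall amount (millimetres)"]
print(rain_only.tolist())    # [0.0, 1.2]
```

The named style costs an extra line, but the name tells the reader what the filter means, and the mask can be reused later without retyping the comparison.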

# Exercise

Show only the _rainfall amount_ (not the other columns) for January 1945 in Wentworth, but do it in just one line of Python!

In [17]:
import pandas

wentworth = pandas.read_csv("data/rainfall/IDCJAC0009_047045_1800_Data.csv")

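# Each comparison needs its own parentheses because & binds more tightly than ==.
# This prints every column for January 1945; to show only the rainfall amount,
# add ["Rainfall amount (millimetres)"] after the filter's closing square bracket.
print(wentworth[(wentworth["Year"]==1945) & (wentworth["Month"]==1)])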

     Product code  Bureau of Meteorology station number  Year  Month  Day  \
4383   IDCJAC0009                                 47045  1945      1    1   
4384   IDCJAC0009                                 47045  1945      1    2   
4385   IDCJAC0009                                 47045  1945      1    3   
4386   IDCJAC0009                                 47045  1945      1    4   
4387   IDCJAC0009                                 47045  1945      1    5   
4388   IDCJAC0009                                 47045  1945      1    6   
4389   IDCJAC0009                                 47045  1945      1    7   
4390   IDCJAC0009                                 47045  1945      1    8   
4391   IDCJAC0009                                 47045  1945      1    9   
4392   IDCJAC0009                                 47045  1945      1   10   
4393   IDCJAC0009                                 47045  1945      1   11   
4394   IDCJAC0009                                 47045  1945      1   12   