# Boolean masks

Recall that Boolean describes any binary variable whose only possible values are true and false.

With `pandas`, **Boolean masking** (also called **Boolean indexing**) overlays a Boolean grid onto a dataframe's index to select only the rows that align with the True values of the grid.

In [1]:
import pandas as pd

data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14]
        }
df = pd.DataFrame(data)
df

    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
4  Jupiter      69911     80
5   Saturn      58232     83
6   Uranus      25362     27
7  Neptune      24622     14


In [3]:
# the objective is to keep planets that have fewer than 20 moons and filter out the rest.
print(df['moons'] < 20)

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: moons, dtype: bool


This results in a Series object of `dtype` bool that shares the dataframe's row index. Each entry is True or False depending on whether that row satisfies the given condition. This is the Boolean mask.
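The properties of the mask described above can be checked directly. The following is a minimal sketch that rebuilds a reduced version of the dataframe (only the `planet` and `moons` columns) so it runs on its own. Counting the True values with `sum()` is an addition for illustration, not something the original walkthrough does.

```python
import pandas as pd

# Reduced copy of the planets dataframe (radius column omitted for brevity)
df = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
})

mask = df['moons'] < 20

print(mask.dtype)                   # bool
print(mask.index.equals(df.index))  # True: the mask has one entry per dataframe row
print(mask.sum())                   # 5: True counts as 1, so this is the number of matching rows
```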

In [4]:
# To apply this mask to the dataframe, simply insert this statement into selector brackets and apply it to your dataframe.

print(df[ df['moons'] < 20 ])

    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
7  Neptune      24622     14


In [8]:
'''You can also assign the Boolean mask to a named variable and then apply that to your dataframe.
This doesn't modify your dataframe. It returns a new, filtered dataframe and leaves the original unchanged.
'''

mask = df['moons'] < 20
print(mask)

print(df[mask])

print(df)


0     True
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: moons, dtype: bool
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
7  Neptune      24622     14
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
4  Jupiter      69911     80
5   Saturn      58232     83
6   Uranus      25362     27
7  Neptune      24622     14


In [9]:
# you can assign the result to a named variable
mask = df['moons'] < 20
df2 = df[mask]
df2

    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
7  Neptune      24622     14


In [11]:
# If you want to select just the planet column as a Series object, you can pass the mask to regular selection tools like loc[]

mask = df['moons'] < 20
df.loc[mask, 'planet']

0    Mercury
1      Venus
2      Earth
3       Mars
7    Neptune
Name: planet, dtype: object
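The same `loc[]` pattern extends to more than one column. The sketch below assumes the planets dataframe from earlier (rebuilt here so the block runs on its own) and passes a list of column names instead of a single name, which returns a dataframe rather than a Series. The variable name `subset` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
})

mask = df['moons'] < 20

# Rows are chosen by the mask, columns by the list of names
subset = df.loc[mask, ['planet', 'radius_km']]
print(subset)
```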

# Complex logical statements

In statements that use multiple conditions, pandas uses the following element-wise logical operators (rather than Python's `and`, `or`, and `not` keywords) to indicate which data to keep and which to filter out.

|Operator|Logic|
|--------|-----|
|&|and|
|\||or|
|~|not|

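**Important: Each component of a multi-condition logical statement must be in parentheses.** In Python, `&` and `|` bind more tightly than comparison operators such as `<` and `==`, so without parentheses the statement will throw an error or, worse, return something that isn't what you intended.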
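To see why the parentheses matter, here is a minimal sketch using a dataframe with only the `moons` column. Because `|` is evaluated before `<` and `>`, the unparenthesized statement is parsed as `df['moons'] < (10 | df['moons']) > 50`, a chained comparison that pandas cannot reduce to a single True or False. The names `bad_mask`, `good_mask`, and `error_message` are illustrative, and the exact error text may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'moons': [0, 0, 1, 2, 80, 83, 27, 14]})

error_message = None
try:
    # Missing parentheses: `10 | df['moons']` is evaluated first
    bad_mask = df['moons'] < 10 | df['moons'] > 50
except ValueError as err:
    error_message = str(err)
    print('ValueError:', error_message)

# With parentheses, each comparison is evaluated before the `|`
good_mask = (df['moons'] < 10) | (df['moons'] > 50)
print(good_mask.tolist())
```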

In [12]:
mask = (df['moons'] < 10) | (df['moons'] > 50)
mask

0     True
1     True
2     True
3     True
4     True
5     True
6    False
7    False
Name: moons, dtype: bool

Notice that each condition is self-contained in a set of parentheses, and the two conditions are separated by the logical operator `|` (or).

In [15]:
mask = (df['moons'] < 10) | (df['moons'] > 50)
df[mask]

    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
4  Jupiter      69911     80
5   Saturn      58232     83


In [18]:
# select all planets that have more than 20 moons, but not planets with 80 moons and not planets with a radius less than 50,000 km
mask = (df['moons'] > 20) & ~(df['moons'] == 80) & ~(df['radius_km'] < 50000)

df[mask]

   planet  radius_km  moons
5  Saturn      58232     83


In [19]:
# This returns the same result, with each negated condition (~) rewritten as its direct equivalent
mask = (df['moons'] > 20) & (df['moons'] != 80) & (df['radius_km'] >= 50000)
df[mask]

   planet  radius_km  moons
5  Saturn      58232     83
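The equivalence of the two masks can be confirmed in code rather than by comparing the printed outputs. This is a minimal sketch that rebuilds the planets dataframe so it runs on its own. The names `mask_not` and `mask_direct` are illustrative, and the comparison relies on the data having no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
})

# Version that negates conditions with ~
mask_not = (df['moons'] > 20) & ~(df['moons'] == 80) & ~(df['radius_km'] < 50000)

# Version that states each condition directly
mask_direct = (df['moons'] > 20) & (df['moons'] != 80) & (df['radius_km'] >= 50000)

print(mask_not.equals(mask_direct))            # True
print(df.loc[mask_direct, 'planet'].tolist())  # ['Saturn']
```

The direct form is usually easier to read, while `~` is handy when the condition being negated is itself a compound expression.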
