In [1]:
import pandas as pd


In [2]:
df = pd.read_csv("datasets/Admission_Predict.csv", index_col=0)
df.columns = [item.lower().strip() for item in df.columns]
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


# Boolean Mask

A Boolean mask is an **array** which can be of **one dimension** like a Series, or **two dimensions** like a DataFrame, where **each value in the array is either True or False**. This array is essentially overlaid on top of the data structure that we're querying, and **any cell aligned with a True value will be admitted into our final result, and any cell aligned with a False value will not**.

Boolean masks are created by applying operators directly to the pandas Series or DataFrame objects. 
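As a minimal sketch of both shapes of mask, the cell below uses a small hypothetical DataFrame (the `scores` data is made up for illustration and is not the admissions dataset). Comparing the whole DataFrame gives a two-dimensional mask, while comparing a single column gives a one-dimensional one.

```python
import pandas as pd

# Hypothetical toy data, not the admissions dataset.
scores = pd.DataFrame({"gre": [337, 314, 330], "toefl": [118, 103, 115]})

# Comparing a whole DataFrame with a scalar gives a two-dimensional mask
# with the same shape, index, and columns as the original.
mask_2d = scores > 115

# Comparing a single column gives a one-dimensional mask (a Boolean Series).
mask_1d = scores["gre"] > 320

print(mask_2d)
print(mask_1d)
```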

For instance, in our graduate admission dataset, we might be interested in seeing only those students that have a chance of admit higher than 0.7. To build a Boolean mask for this query, we project the chance of admit column using the indexing operator and apply the greater than operator with a comparison value of 0.7. This essentially broadcasts the comparison operator, greater than, across the column, and the result is returned as a Boolean Series. The resultant Series keeps the original index, and the value of each cell is either True or False depending on whether that student has a chance of admit higher than 0.7.

In [3]:
admit_mask = df["chance of admit"] > 0.7
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

So, what do you do with the Boolean mask once you have formed it?

Well, you can just lay it on top of the data **to "hide" the data you don't want**, which **is represented** by all of the **False values**.

# .where() :

**To lay a Boolean mask on the original Series or DataFrame object**, we use the **.where()** method.

In [4]:
serie = pd.Series([1,2,3])
serie_mask = serie > 2
serie.where(serie_mask)

0    NaN
1    NaN
2    3.0
dtype: float64

In [5]:
df.where(admit_mask)

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.00,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.80
5,,,,,,,,
...,...,...,...,...,...,...,...,...
396,324.0,110.0,3.0,3.5,3.5,9.04,1.0,0.82
397,325.0,107.0,3.0,3.0,3.5,9.11,1.0,0.84
398,330.0,116.0,4.0,5.0,4.5,9.45,1.0,0.91
399,,,,,,,,


We see that the resulting DataFrame keeps the original index, and only the data which met the condition was retained. All of the rows which did not meet the condition have NaN data instead, but these rows were not dropped from our dataset.
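The output above also shows integer columns such as the GRE score displayed as 337.0 rather than 337. A minimal sketch of why, using a hypothetical three-row `toy` DataFrame rather than the real dataset: NaN is a floating point value, so a column that receives NaN is converted to a float type.

```python
import pandas as pd

# Hypothetical toy data standing in for the admissions dataset.
toy = pd.DataFrame({"gre": [337, 314, 330], "chance": [0.92, 0.65, 0.90]})

masked = toy.where(toy["chance"] > 0.7)

# The row count is unchanged: the failing row is kept, filled with NaN.
print(len(toy), len(masked))

# The integer column became float, because NaN cannot be stored in an int column.
print(toy["gre"].dtype, masked["gre"].dtype)
```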

# .dropna() 

**To drop** the rows that contain **NaN** values from a **Series or DataFrame object**, we use the **.dropna()** method. It returns a new object and leaves the original unchanged.

In [6]:
serie.where(serie_mask).dropna()

2    3.0
dtype: float64

In [7]:
df.where(admit_mask).dropna().head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9


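The pandas devs created a shorthand **syntax** which effectively **combines .where() and .dropna()**, doing both at once. Because it selects the matching rows directly instead of filling the others with NaN, integer columns keep their integer type.

**To use this syntax**, we **pass a Boolean mask to the indexing operator**.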

In [8]:
df[df["chance of admit"] > 0.7].head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
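The two approaches are close but not identical. As a hedged sketch on a hypothetical `toy` DataFrame (the data and the missing `lor` value are invented for illustration), the cell below shows two differences: the indexing operator keeps the original dtypes, and it does not drop rows that already contained NaN before masking.

```python
import pandas as pd

# Hypothetical toy data; row 3 has a missing "lor" value on purpose.
toy = pd.DataFrame(
    {"gre": [337, 314, 330], "lor": [4.5, 3.0, None], "chance": [0.92, 0.65, 0.90]},
    index=[1, 2, 3],
)
mask = toy["chance"] > 0.7

via_where = toy.where(mask).dropna()  # masks, then drops every row containing NaN
via_index = toy[mask]                 # selects the rows where the mask is True

# Row 3 passes the mask, but .dropna() removes it because "lor" was already NaN.
print(via_where.index.tolist())
print(via_index.index.tolist())

# Only the indexing operator keeps "gre" as an integer column.
print(via_where["gre"].dtype, via_index["gre"].dtype)
```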


# Indexing operator applications on DataFrame:

1. It can be called with **a string parameter to project a single column**.

In [9]:
df["gre score"].head()

Serial No.
1    337
2    324
3    316
4    322
5    314
Name: gre score, dtype: int64

2. We can send it **a list of columns as strings** to project several columns.

In [10]:
df[["gre score", "toefl score"]].head()

Serial No.,gre score,toefl score
1,337,118
2,324,107
3,316,104
4,322,110
5,314,103


3. We can send it **a Boolean mask, which lays the mask on the DataFrame or Series object and drops the rows where the mask is False**.

In [11]:
df[df["chance of admit"] > 0.7].head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9


Each of these is mimicking functionality from either **.loc[] or .where().dropna()**. Note that .loc is an indexer used with square brackets, not a method called with parentheses.
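For comparison, here is a small sketch of the same three selections written with the .loc indexer. The `toy` DataFrame is hypothetical and only borrows the column names from the dataset.

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names.
toy = pd.DataFrame(
    {
        "gre score": [337, 314, 330],
        "toefl score": [118, 103, 115],
        "chance of admit": [0.92, 0.65, 0.90],
    },
    index=pd.Index([1, 2, 3], name="Serial No."),
)

# 1. A single column: same as toy["gre score"]
single = toy.loc[:, "gre score"]

# 2. A list of columns: same as toy[["gre score", "toefl score"]]
several = toy.loc[:, ["gre score", "toefl score"]]

# 3. A Boolean mask selecting rows: same as toy[toy["chance of admit"] > 0.7]
filtered = toy.loc[toy["chance of admit"] > 0.7]

print(filtered)
```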

# Combining multiple Boolean masks:

In computer science, Boolean values are combined with "and" (both must be True for the result to be True) or "or" (only one needs to be True). With Boolean masks we want the same thing cell by cell, so a natural first attempt is Python's and keyword.

In [12]:
(df["chance of admit"] > 0.7) and (df["chance of admit"] < 0.9)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The problem is that both operands are Series objects. Python's and and or keywords need a single True or False from each operand, and pandas refuses to reduce a whole Series to one truth value, so it raises a ValueError.
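A minimal sketch that reproduces the error in a controlled way, using a hypothetical three-value `chance` Series and a try/except so the cell keeps running:

```python
import pandas as pd

# Hypothetical values standing in for the "chance of admit" column.
chance = pd.Series([0.92, 0.65, 0.80])

message = None
try:
    # "and" asks Python for bool(chance > 0.7), which pandas refuses to compute.
    result = (chance > 0.7) and (chance < 0.9)
except ValueError as err:
    message = str(err)

print(message)
```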

1. The pandas devs have overloaded the **pipe | and ampersand & operators to handle this** for us. They combine two masks element by element: & for "and", | for "or".

In [13]:
(df["chance of admit"] > 0.7) & (df["chance of admit"] < 0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

The result is a Boolean mask formed by combining the two other masks.
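The pipe operator works the same way for "or". As a small sketch on a hypothetical `chance` Series, this selects the values outside the 0.7 to 0.9 range:

```python
import pandas as pd

# Hypothetical values standing in for the "chance of admit" column.
chance = pd.Series([0.92, 0.65, 0.80])

# True where either condition holds; parentheses keep each comparison separate.
either = (chance < 0.7) | (chance > 0.9)

print(either)
```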

A common error for new pandas users is to combine Boolean comparisons with the & operator without putting parentheses around the individual comparisons.

In [14]:
df["chance of admit"] > 0.7 & df["chance of admit"] < 0.9

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]

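The problem is operator precedence: & binds more tightly than > and <, so Python first tries to evaluate 0.7 & df["chance of admit"], which is a bitwise and between a float and a pandas Series.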
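To make the precedence issue concrete, the sketch below runs the unparenthesized expression on a hypothetical `chance` Series inside a try/except, then the parenthesized version that works. The exact error text may differ between pandas versions.

```python
import pandas as pd

# Hypothetical values standing in for the "chance of admit" column.
chance = pd.Series([0.92, 0.65, 0.80])

caught = None
try:
    # Parsed as: chance > (0.7 & chance) < 0.9, so 0.7 & chance is evaluated first.
    chance > 0.7 & chance < 0.9
except TypeError as err:
    caught = err

print(type(caught).__name__)

# With parentheses, each comparison is evaluated before the & combines them.
both = (chance > 0.7) & (chance < 0.9)
print(both)
```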

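2. Another option is to get rid of the comparison operators completely and instead use the built-in methods which mimic them, such as **.gt()** and **.lt()**. Since these are method calls, no extra parentheses are needed around each term.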

In [15]:
df["chance of admit"].gt(0.7) & df["chance of admit"].lt(0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

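3. **.gt() and .lt()** are built into **Series and DataFrame** objects, so **we can chain them** too, but be careful: chaining does **not** combine the two conditions. Here .lt(0.9) is applied to the Boolean result of .gt(0.7), where True counts as 1 and False as 0. The output below is therefore True exactly where the chance of admit is not greater than 0.7, which differs from the mask we built with &.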

In [16]:
df["chance of admit"].gt(0.7).lt(0.9)

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool
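The sketch below compares the chained call with the & version on a hypothetical `chance` Series. It also shows Series.between() as one way to express a range test in a single call; this method is not used elsewhere in this notebook, and inclusive="neither" assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical values standing in for the "chance of admit" column.
chance = pd.Series([0.92, 0.65, 0.80])

# Chained: .lt(0.9) compares the Booleans from .gt(0.7) with 0.9,
# so True (1) becomes False and False (0) becomes True.
chained = chance.gt(0.7).lt(0.9)

# Combined: both conditions are tested against the original values.
combined = chance.gt(0.7) & chance.lt(0.9)

# A single call for "strictly between 0.7 and 0.9".
between = chance.between(0.7, 0.9, inclusive="neither")

print(chained.tolist())
print(combined.tolist())
print(between.tolist())
```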