
# Querying a DataFrame

The first step is to understand boolean masking, the heart of fast and efficient querying in NumPy and pandas. It is analogous to the bit masking used in other areas of computer science.

A boolean mask is an array, either one-dimensional like a Series or two-dimensional like a DataFrame, in which every value is either True or False. The mask is essentially overlaid on top of the data structure we're querying: any cell aligned with a True value is admitted into the final result, while cells aligned with a False value are not.
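
As a minimal sketch of this overlay idea, the snippet below applies a hand-written mask to a tiny Series. The values and the index labels are hypothetical and are not taken from the admissions dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the admissions dataset
scores = pd.Series([0.92, 0.65, 0.80], index=['a', 'b', 'c'])

# A boolean mask with the same index as the data
mask = pd.Series([True, False, True], index=['a', 'b', 'c'])

# Only the cells aligned with True are admitted into the result
selected = scores[mask]
print(selected)
```

Here the entry labelled 'b' is left out because its mask value is False.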

In [1]:
import pandas as pd

admissions_df = pd.read_csv('dataset/Admission_Predict.csv', index_col=0)

admissions_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [3]:
# Let's clean up the poorly named columns by stripping stray whitespace from each name
cols = list(admissions_df.columns)
cols = [x.strip() for x in cols]

admissions_df.columns = cols

admissions_df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA',
       'Research', 'Chance of Admit'],
      dtype='object')

Boolean masks are created by applying comparison operators (such as > or ==) directly to pandas Series or DataFrame objects.

In [6]:
# Scanning the data to find students with a chance of admit higher than 0.7
ad_mask = admissions_df['Chance of Admit'] > 0.7
ad_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: Chance of Admit, Length: 400, dtype: bool

The result of broadcasting a comparison operator is a True or False value for each element. Underneath, pandas applies the comparison through vectorization, evaluating the whole array in optimized compiled code instead of a Python-level loop. The result here is a Series, since only one column is being operated on.

We can then apply the where() function, which lays the mask over the data and replaces the rows where the mask is False with NaN, and follow it with dropna() to remove those rows. Because NaN is a floating-point value, the integer columns are displayed as floats in the result.
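
To make the broadcasting concrete, here is a small sketch that compares the vectorized form with an explicit Python loop. The three values are hypothetical; the loop is shown only for comparison and is not how pandas works internally.

```python
import pandas as pd

# Hypothetical values standing in for the 'Chance of Admit' column
chance = pd.Series([0.92, 0.76, 0.65], name='Chance of Admit')

# Vectorized: the scalar 0.7 is broadcast against every element
vectorized = chance > 0.7

# Equivalent element-by-element loop, shown only for comparison
looped = pd.Series([value > 0.7 for value in chance], name='Chance of Admit')

print(vectorized.tolist())
print(looped.tolist())
```

Both forms produce the same booleans, but the vectorized one avoids a Python-level loop.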

In [8]:
admissions_df.where(ad_mask).dropna().head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9
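
The two steps can be looked at separately. The sketch below uses a small hypothetical DataFrame (not the admissions file) to show that where() keeps the original shape and fills masked rows with NaN, and that dropna() then removes them.

```python
import pandas as pd

# Hypothetical three-row frame that borrows two column names from the dataset
df = pd.DataFrame({'GRE Score': [337, 314, 330],
                   'Chance of Admit': [0.92, 0.65, 0.90]})
mask = df['Chance of Admit'] > 0.7

masked = df.where(mask)    # same shape; the row where the mask is False becomes NaN
cleaned = masked.dropna()  # rows containing NaN are removed

print(masked)
print(cleaned.dtypes)
```

Note that 'GRE Score' starts as an integer column and comes back as float64, because NaN cannot be stored in a plain integer column.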


But that's not the most sophisticated way to do it. We can get the effect of where() and dropna() in a single step: in typical fashion, the pandas developers overloaded the indexing operator to accept a boolean mask.

In [10]:
admissions_df[admissions_df['Chance of Admit'] > 0.7].head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
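
One difference between the two approaches is worth keeping in mind: dropna() removes every row that contains a NaN, including rows whose values were already missing before the mask was applied, while boolean indexing only looks at the mask. The sketch below illustrates this with a hypothetical frame that has one missing CGPA value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a pre-existing missing value in row 1
df = pd.DataFrame({'CGPA': [9.65, np.nan, 8.21],
                   'Chance of Admit': [0.92, 0.76, 0.65]})
mask = df['Chance of Admit'] > 0.7

via_where = df.where(mask).dropna()  # also drops row 1, because its CGPA is missing
via_index = df[mask]                 # keeps row 1, since its mask value is True

print(list(via_where.index))
print(list(via_index.index))
```

Boolean indexing also leaves the column dtypes unchanged, since no NaN values are introduced.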


## Combining two masks

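pandas uses the pipe | and ampersand & operators to combine two boolean Series element-wise, as logical OR and AND respectively. Each comparison must be wrapped in parentheses, because | and & bind more tightly than comparison operators such as > and <.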

In [13]:
# Students with chance of admit between 0.7 and 0.9
ad_08 = (admissions_df['Chance of Admit'] > 0.7) & (admissions_df['Chance of Admit'] < 0.9)
ad_08.head()

Serial No.
1    False
2     True
3     True
4     True
5    False
Name: Chance of Admit, dtype: bool
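
The sketch below, on a hypothetical four-value Series, shows the element-wise operators side by side, including ~ for negation, and demonstrates that Python's own `and` keyword cannot be used in their place.

```python
import pandas as pd

# Hypothetical values, not taken from the admissions dataset
chance = pd.Series([0.92, 0.76, 0.65, 0.50])

both = (chance > 0.7) & (chance < 0.9)    # element-wise AND
either = (chance < 0.6) | (chance > 0.9)  # element-wise OR
negated = ~(chance > 0.7)                 # element-wise NOT

# Python's `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess, so it raises ValueError
try:
    (chance > 0.7) and (chance < 0.9)
    and_failed = False
except ValueError as err:
    and_failed = True
    print('Python `and` fails:', err)
```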

We can also get rid of the comparison operators and use the gt() and lt() methods, which stand for 'greater than' and 'less than', respectively.

In [15]:
ad_08 = (admissions_df['Chance of Admit'].gt(0.7)) & (admissions_df['Chance of Admit'].lt(0.9))
ad_08.head()

Serial No.
1    False
2     True
3     True
4     True
5    False
Name: Chance of Admit, dtype: bool
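
gt() and lt() have inclusive siblings, ge() and le() ('greater than or equal' and 'less than or equal'). As a small sketch on hypothetical values that sit exactly on the boundaries:

```python
import pandas as pd

# Hypothetical values placed on and between the two bounds
chance = pd.Series([0.70, 0.80, 0.90])

strict = chance.gt(0.7) & chance.lt(0.9)     # bounds excluded
inclusive = chance.ge(0.7) & chance.le(0.9)  # bounds included

print(strict.tolist())
print(inclusive.tolist())
```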

In [16]:
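# Chaining gt(0.7).lt(0.9) does not work here: lt() would compare the booleans
# returned by gt() (True == 1, False == 0) with 0.9, not the original scores.
# between() expresses the range in one call (inclusive='neither' needs pandas 1.3+).
ad_08 = admissions_df['Chance of Admit'].between(0.7, 0.9, inclusive='neither')
ad_08.head()

Serial No.
1    False
2     True
3     True
4     True
5    False
Name: Chance of Admit, dtype: bool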
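
One caveat about chaining comparison methods directly: each method operates on the result of the previous one, so a chained lt() receives booleans instead of the original numbers. The sketch below uses three hypothetical values to show what gt(0.7).lt(0.9) actually computes.

```python
import pandas as pd

# Hypothetical values, not taken from the admissions dataset
chance = pd.Series([0.92, 0.76, 0.65])

# lt(0.9) is applied to the booleans from gt(0.7): True counts as 1, False as 0,
# so the chain simply inverts the first mask
chained = chance.gt(0.7).lt(0.9)

# Combining two masks built from the original values gives the intended range
intended = chance.gt(0.7) & chance.lt(0.9)

print(chained.tolist())
print(intended.tolist())
```

In this sketch the chained result equals `~chance.gt(0.7)`, which is not the "between 0.7 and 0.9" mask.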