In this lecture we're going to talk about querying DataFrames. The first step in the process is to understand 
Boolean masking. Boolean masking is the heart of fast and efficient querying in numpy and pandas, and it's 
analogous to bit masking used in other areas of computational science. By the end of this lecture you'll 
understand how Boolean masking works, and how to apply it to a DataFrame to get out the data you're interested in.

A Boolean mask is an array which can be of one dimension like a Series, or two dimensions like a DataFrame, 
where each of the values in the array is either True or False. This array is essentially overlaid on top 
of the data structure that we're querying. Any cell aligned with a True value will be admitted into 
our final result, and any cell aligned with a False value will not.
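
To make the overlay idea concrete, here is a minimal sketch on tiny hand-built data. The values and the names `scores`, `mask`, `grid` and `grid_mask` are made up for illustration and are not part of the admissions dataset.

```python
import pandas as pd

# Hypothetical one-dimensional data and a mask with the same index
scores = pd.Series([0.92, 0.76, 0.65], index=[1, 2, 5])
mask = pd.Series([True, False, True], index=[1, 2, 5])

# Only the cells lined up with True are admitted into the result
selected = scores[mask]
print(selected)

# A two-dimensional mask works the same way, cell by cell
grid = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
grid_mask = pd.DataFrame({"a": [True, False], "b": [False, True]})

# Cells lined up with False are blanked out as NaN
overlaid = grid.where(grid_mask)
print(overlaid)
```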

In [1]:
# Let's start with an example and import our graduate admission dataset. First we'll bring in pandas
import pandas as pd

# Then we'll load in our CSV file
df = pd.read_csv('C:\\Users\\asus\\Desktop\\Coursera\\Applied Data Science with Python\\(1) Introduction to Data Science in Python\\dataset\\Admission_Predict.csv', index_col=0)

# And we'll clean up a couple of poorly named columns like we did in a previous lecture
df.columns = [x.lower().strip() for x in df.columns]

# And we'll take a look at the results
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


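Boolean masks are created by applying operators directly to a pandas Series or DataFrame. For instance, in our graduate admission dataset, we might want to know which students have a chance of admission higher than 0.7. To build a Boolean mask for this query, we pull out the chance of admit column using the indexing operator and apply the greater-than operator with a comparison value of 0.7. This returns a Boolean Series, which keeps the original index and holds either True or False for each row.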

In [2]:
admit_mask = df['chance of admit'] > 0.7
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

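So what can you do with the Boolean mask once you have created it? One simple option is to lay it over the data and hide the rows you don't want, which are the ones where the mask is False. This is done by calling the .where() function on the original DataFrame.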

In [3]:
df.where(admit_mask).head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


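We see that the resulting DataFrame keeps the original index, but only the data that met the condition was retained. Every row that did not meet the condition is filled with NaN, but those rows were not dropped from the dataset. Notice also that the integer columns are now displayed as floats, since NaN is a floating-point value.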
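
A small sketch of this behaviour, using a hand-built frame that stands in for the first five rows shown above rather than the full CSV (so the frame here is an assumption, not the real dataset). It also shows the optional `other` argument of `.where()`, which substitutes a value of our choosing instead of NaN; the -1 is an arbitrary placeholder.

```python
import pandas as pd

# Hand-built stand-in for the first five rows of the admissions data
df = pd.DataFrame(
    {
        "gre score": [337, 324, 316, 322, 314],
        "chance of admit": [0.92, 0.76, 0.72, 0.80, 0.65],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Serial No."),
)
admit_mask = df["chance of admit"] > 0.7

# Same shape and index as df, but row 5 is now all NaN
masked = df.where(admit_mask)
print(masked)

# The integer column had to become float in order to hold NaN
print(df["gre score"].dtype, "->", masked["gre score"].dtype)

# Optionally replace the failing rows with a value instead of NaN
flagged = df.where(admit_mask, other=-1)
print(flagged)
```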

In [4]:
# The next step is, if we don't want the NaN data, we use the dropna() function

df.where(admit_mask).dropna().head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9


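The returned DataFrame now has all of the NaN rows dropped. Notice that the index labels of the dropped rows are gone as well, so the serial numbers run 1, 2, 3, 4, 6, and 5 is missing.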

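In practice the .where() function isn't used that often. pandas has a shorter form that combines .where() and .dropna() in one step, using nothing more than the indexing operator.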

In [5]:
df[df['chance of admit'] > 0.7].head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
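
The two forms select the same rows here, with one caveat worth knowing. The sketch below uses a hand-built frame standing in for the first six rows (an assumption for illustration) and a deliberately injected missing value. It suggests that `.dropna()` also removes rows that had NaN to begin with, while plain indexing keeps them.

```python
import pandas as pd

# Hand-built stand-in for the first six rows of the admissions data
df = pd.DataFrame(
    {
        "gre score": [337.0, 324.0, 316.0, 322.0, 314.0, 330.0],
        "chance of admit": [0.92, 0.76, 0.72, 0.80, 0.65, 0.90],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Serial No."),
)
admit_mask = df["chance of admit"] > 0.7

# On complete data both forms keep the same rows
by_index = df[admit_mask]
by_where = df.where(admit_mask).dropna()
print(list(by_index.index), list(by_where.index))

# Hypothetical gap: pretend student 1 has no GRE score on record
df_gap = df.copy()
df_gap.loc[1, "gre score"] = float("nan")
gap_mask = df_gap["chance of admit"] > 0.7

# Indexing keeps student 1, since the mask is True for that row
kept_by_index = df_gap[gap_mask]
# dropna() discards student 1 because of the unrelated missing value
kept_by_where = df_gap.where(gap_mask).dropna()
print(list(kept_by_index.index), list(kept_by_where.index))
```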


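We can also combine multiple Boolean masks. In plain Python, as in many other languages, you would use and when both conditions must be True, and or when only one of them needs to be True.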

In [6]:
# Written with Python's and keyword, the combination raises a ValueError
# (the truth value of a Series is ambiguous)
(df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)

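But this raises an error, because Python doesn't know how to compare two Series using and or or. Instead, the pandas authors overloaded the & and | operators to act as element-wise and and or, which solves the problem.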
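
Here is a minimal sketch of the difference, on a hand-built Series that stands in for six values of the chance of admit column (an assumption for illustration). The names `above`, `below` and `error_message` are ours. The `~` operator, which inverts a mask element-wise, is included for completeness.

```python
import pandas as pd

# Hand-built stand-in for six values of the chance of admit column
chance = pd.Series(
    [0.92, 0.76, 0.72, 0.80, 0.65, 0.90],
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Serial No."),
    name="chance of admit",
)
above = chance > 0.7
below = chance < 0.9

# The and keyword asks for a single True/False per Series, which is ambiguous
error_message = None
try:
    both = above and below
except ValueError as err:
    error_message = str(err)
print(error_message)

# The element-wise operators compare the masks row by row
both = above & below      # True only where both masks are True
either = above | below    # True where at least one mask is True
not_above = ~above        # flips every True and False
print(both)
```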

In [7]:
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)

# Note that the parentheses cannot be omitted, because & binds more tightly than > and <

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool
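
The parentheses matter because of Python's order of operations: `&` is evaluated before `>` and `<`. The sketch below, on a hand-built stand-in Series (an assumption for illustration), records what happens without them. The exact exception type may depend on the pandas version, so the code only notes its name.

```python
import pandas as pd

# Hand-built stand-in for six values of the chance of admit column
chance = pd.Series(
    [0.92, 0.76, 0.72, 0.80, 0.65, 0.90],
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Serial No."),
    name="chance of admit",
)

failure = None
try:
    # Python reads this as: chance > (0.7 & chance) < 0.9
    unparenthesized = chance > 0.7 & chance < 0.9
except (TypeError, ValueError) as err:
    failure = type(err).__name__
print(failure)

# With parentheses, each comparison is evaluated first, then combined
parenthesized = (chance > 0.7) & (chance < 0.9)
print(parenthesized)
```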

In [8]:
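# Another approach that avoids comparison operators altogether is to use the built-in methods that mimic them,
# which lets us drop the parentheses. Here we only need .gt() and .lt(), for greater than and less than.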

df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool
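
Greater than and less than are not the only comparison methods. As a rough sketch on a hand-built stand-in Series (an assumption for illustration), the block below checks each method against its operator form and then builds an inclusive range. The thresholds are arbitrary.

```python
import pandas as pd

# Hand-built stand-in for six values of the chance of admit column
chance = pd.Series(
    [0.92, 0.76, 0.72, 0.80, 0.65, 0.90],
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Serial No."),
    name="chance of admit",
)

# Each method should give the same mask as the matching operator
checks = {
    "gt": chance.gt(0.8).equals(chance > 0.8),
    "lt": chance.lt(0.8).equals(chance < 0.8),
    "ge": chance.ge(0.8).equals(chance >= 0.8),
    "le": chance.le(0.8).equals(chance <= 0.8),
    "eq": chance.eq(0.8).equals(chance == 0.8),
    "ne": chance.ne(0.8).equals(chance != 0.8),
}
print(checks)

# Method calls are evaluated before &, so no parentheses are needed
inclusive = chance.ge(0.76) & chance.le(0.9)
print(inclusive)
```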

In [9]:
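# These methods are built into Series and DataFrame, so they can be chained. Be careful, though: here .lt(0.9)
# is applied to the Boolean result of .gt(0.7), not to the original column, so the output is not the same
# mask as in the previous cell.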

df['chance of admit'].gt(0.7).lt(0.9)

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool
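
The output above differs from the earlier mask because the chained `.lt(0.9)` compares the True/False values from `.gt(0.7)` with 0.9. True counts as 1 and False as 0, so the chain ends up inverting the first mask. A minimal sketch on a hand-built stand-in Series (an assumption for illustration) makes this visible:

```python
import pandas as pd

# Hand-built stand-in for six values of the chance of admit column
chance = pd.Series(
    [0.92, 0.76, 0.72, 0.80, 0.65, 0.90],
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Serial No."),
    name="chance of admit",
)

# Chained form: .lt(0.9) sees Booleans, not admission chances
chained = chance.gt(0.7).lt(0.9)

# Intended form: both comparisons run on the original column
intended = chance.gt(0.7) & chance.lt(0.9)

# The chained result is simply the inverse of the first mask
is_inverse = chained.equals(~chance.gt(0.7))
print(chained.tolist())
print(intended.tolist())
print(is_inverse)
```

For a range check, combining the two masks with `&` is the safer choice. pandas also provides `Series.between` for this purpose, though how its bounds are treated is worth checking in the documentation for your version.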

In this lecture, we have learned to query a DataFrame using Boolean masking, which is extremely important 
and often used in the world of data science. With Boolean masking, we can select data based on the criteria 
we desire and, frankly, you'll use it everywhere. We've also seen that there are many different ways to query
the DataFrame, and the interesting side effects that come up when doing so.