We've seen a preview of how pandas handles missing values using the None type and NumPy NaN values. Missing
values are common in data cleaning, and they can be there for any number of reasons. I just want to touch
on a few here.

For instance, if you are running a survey and a respondent didn't answer a question, the missing value is
actually an omission. This kind of missing data is called **Missing at Random (MAR)** if there are other
variables that might be used to predict the variable which is missing. In my work when I deliver surveys I
often find that missing data, say the interest in being involved in a follow-up study, has some correlation
with another data field, like gender or ethnicity. If there is no relationship to other variables, then we
call this data **Missing Completely at Random (MCAR)**.

These are just two examples of missing data, and there are many more. For instance, data might be missing
because it wasn't collected, either because the process responsible for collecting it (such as a
researcher) never recorded it, or because it wouldn't make sense to collect it. This last case is extremely
common when you start joining DataFrames together from multiple sources, such as joining a list of people at
a university with a list of offices in the university (students generally don't have offices).
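
As a small illustration of the join case, the sketch below left-joins a made-up list of people with a made-up list of offices. The names `people` and `offices` and all of their values are hypothetical and are not part of the course data.

```python
import pandas as pd

# Hypothetical data: two staff members have offices, the student does not.
people = pd.DataFrame({"name": ["Ana", "Ben", "Cy"],
                       "role": ["staff", "student", "staff"]})
offices = pd.DataFrame({"name": ["Ana", "Cy"],
                        "office": ["B101", "C202"]})

# A left join keeps every person; anyone without an office gets a missing value.
joined = people.merge(offices, on="name", how="left")
print(joined)
```

Here the missing office for the student is not an error in collection; the value simply doesn't exist.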

Let's look at some ways of handling missing data in pandas.

In [1]:
# Let's import pandas
import pandas as pd

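Pandas is pretty good at detecting missing values. Most missing values are recorded as NaN, NULL, None, or N/A, but sometimes they are not labeled so clearly or consistently.

The pandas read_csv() function has a parameter called na_values that lets us specify which values should be treated as missing. It accepts a scalar, string, list, or dictionary.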
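
A minimal sketch of na_values, assuming a made-up CSV in which missing scores were typed as "missing" or "-". The CSV text and the names `csv_text`, `df_demo`, and `per_column` are hypothetical; io.StringIO just lets read_csv() read from a string instead of a file.

```python
import io
import pandas as pd

# Hypothetical CSV where missing scores were typed as "missing" or "-".
csv_text = "student,score\nann,90\nbob,missing\ncat,-\n"

# List form: these strings count as missing in every column.
df_demo = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "-"])

# Dictionary form: the markers apply only to the named column.
per_column = pd.read_csv(io.StringIO(csv_text),
                         na_values={"score": ["missing", "-"]})

print(df_demo["score"].isnull().tolist())
```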

In [2]:
# Let's load a piece of data from a file called class_grades.csv
df = pd.read_csv('C:\\Users\\asus\\Desktop\\Coursera\\Applied Data Science with Python\\(1) Introduction to Data Science in Python\\dataset\\class_grades.csv')
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


In [3]:
# We can use the .isnull() function to create a Boolean mask of the whole DataFrame

mask=df.isnull()
mask.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [4]:
# Another useful operation is to drop every row that contains any missing value. This is done with the .dropna() function.

df.dropna().head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
12,8,97.16,103.71,72.5,93.52,63.33
13,7,91.28,83.53,81.25,99.81,92.22


Note that the rows with index 2, 3, 7, and 11 are now gone.
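
dropna() also takes parameters that narrow what gets dropped. A minimal sketch on a made-up frame (the name `toy` and its values are hypothetical), using the `how` and `subset` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical grades: row 1 is partly missing, row 2 is entirely missing.
toy = pd.DataFrame({"Midterm": [64.0, np.nan, np.nan],
                    "Final": [52.5, 48.9, np.nan]})

any_missing = toy.dropna()                 # drop rows with at least one missing value
all_missing = toy.dropna(how="all")        # drop rows only if every value is missing
final_only = toy.dropna(subset=["Final"])  # only consider the Final column

print(any_missing.index.tolist(), all_missing.index.tolist(), final_only.index.tolist())
```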

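Pandas also has a handy function, fillna(). It takes several parameters; you can pass in a single value (called a scalar value) to replace all of the missing data with that one value.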

In [5]:
# So, if we wanted to fill all missing values with 0, we would use fillna

df.fillna(0, inplace = True)
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


Note the inplace parameter of fillna(): it makes pandas modify the original DataFrame directly instead of returning a copy.
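
For comparison, here is a small sketch of the non-inplace form, where fillna() returns a new DataFrame and leaves the original untouched. The frame `grades` and its values are made up for illustration. Passing a dictionary fills each column with its own value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
grades = pd.DataFrame({"Midterm": [64.0, np.nan], "Final": [np.nan, 48.9]})

# Without inplace, fillna() returns a new DataFrame.
filled = grades.fillna(0)

# A dictionary gives one fill value per column.
per_column = grades.fillna({"Midterm": 50.0, "Final": 0.0})

# The original still has both of its missing values.
print(grades.isnull().sum().sum())
```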

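If empty or whitespace fields are themselves values of interest, we can also use the na_filter option to turn off missing-value detection, though in practice this is pretty rare.
In data without any NAs, passing na_filter=False can improve the performance of reading a large file.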
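
A small sketch of what na_filter changes, using a made-up two-row CSV (the text and the names `default` and `unfiltered` are hypothetical). With the default settings, both the empty field and the string "NA" are read as missing; with na_filter=False they are kept as ordinary strings.

```python
import io
import pandas as pd

# Hypothetical CSV: one empty note and one note that literally says "NA".
csv_text = "name,note\nann,\nbob,NA\n"

default = pd.read_csv(io.StringIO(csv_text))
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default["note"].isnull().tolist())
print(unfiltered["note"].tolist())
```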

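Sometimes missing values are informative in their own right. For example, consider logs from online learning systems. These systems often send playback statistics to the server periodically (say, every 30 seconds). These messages can get big, because they can carry the whole state of the playback system, such as the size of the video, which video is being rendered to the screen, and how loud the volume is.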

In [6]:
# If we load the data file log.csv, we can see an example of what this might look like.

df = pd.read_csv("C:\\Users\\asus\\Desktop\\Coursera\\Applied Data Science with Python\\(1) Introduction to Data Science in Python\\dataset\\log.csv")
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


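In this data the first column is a timestamp in Unix epoch format, the next column is the user name, followed by the web page they're visiting and the video they're playing. Each row of the DataFrame has a playback position, and each time the playback position increases by one, the timestamp increases by about 30 seconds. This doesn't hold for Bob: the data show that Bob has paused his playback, so as time passes the playback position doesn't change.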

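Note how difficult it is to extract this information from the data, because it isn't ordered by timestamp as one might expect. There are also many missing values in the paused and volume columns. Sending this information when it hasn't changed would be inefficient, so the system just inserts null values when nothing has changed.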

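Next, let's look at two methods for filling missing values, ffill and bfill.

ffill is forward filling: it replaces each missing value with the value from the previous row.
bfill is backward filling: it replaces each missing value with the value from the next row.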
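
A minimal sketch of both directions on a made-up Series (the name `volume` and its values are hypothetical). Note that a gap of several rows is filled from the nearest non-missing value in the fill direction, and that a gap with nothing on that side stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps.
volume = pd.Series([10.0, np.nan, np.nan, 7.0, np.nan])

forward = volume.ffill()   # each gap takes the last value seen above it
backward = volume.bfill()  # each gap takes the next value below it

print(forward.tolist())
print(backward.tolist())   # the final gap has nothing below it, so it stays NaN
```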

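So, if we want to use these two fill methods, the data needs to be sorted. Data that comes from traditional database management systems usually has no order guarantee, just like the following data: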

In [7]:
# In pandas we can sort either by index or by values. Here we make the timestamp the index and then sort by index.

df = df.set_index('time')
df = df.sort_index()
df.head(20)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


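We can see that the index isn't unique, because several users may use the system at the same time. So let's reset the index and use a multi-level index made up of the timestamp and the user name.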

In [8]:
df = df.reset_index()
df = df.set_index(['time', 'user'])
df

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [9]:
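# Now that the data is sorted by index, we can fill the missing values using ffill.
# (fillna(method='ffill') is deprecated in recent pandas releases; .ffill() is the equivalent call.)

df = df.ffill()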
df

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
1469974514,cheryl,intro.html,8,False,10.0
1469974524,sue,advanced.html,25,False,10.0
1469974544,cheryl,intro.html,9,False,10.0
1469974554,sue,advanced.html,26,False,10.0
1469974574,cheryl,intro.html,10,False,10.0
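
One caveat with a plain forward fill on this kind of index: rows from different users are interleaved, so a gap can be filled from another user's row. If that isn't what you want, one possible approach is to forward fill within each user. The sketch below uses a small made-up frame shaped like log.csv; the name `log` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical log: cheryl and sue report different volumes, then send nulls.
log = pd.DataFrame({
    "time": [1, 1, 2, 2],
    "user": ["cheryl", "sue", "cheryl", "sue"],
    "volume": [10.0, 5.0, np.nan, np.nan],
}).set_index(["time", "user"]).sort_index()

# Plain ffill: cheryl's gap at time 2 is filled from sue's row just above it.
plain = log.ffill()

# Grouped ffill: each user's gap is filled from that user's own earlier rows.
per_user = log.groupby(level="user").ffill()

print(plain)
print(per_user)
```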


In [10]:
# We can also do customized fill-in to replace values with the replace() function. It allows replacement from
# several approaches: value-to-value, list, dictionary, regex. Let's generate a simple example

df = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                   'B': [3, 6, 3, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [11]:
# We can replace 1's with 100, let's try the value-to-value approach
df.replace(1, 100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [12]:
# How about changing two values? Let's try the list approach. For example, we want to change 1's to 100 and 3's
# to 300
df.replace([1, 3], [100, 300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
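
The dictionary approach can be sketched the same way. The frame below repeats the example data under a new, hypothetical name (`df_small`) so that the block stands alone. A flat dictionary maps old values to new ones in every column, and a nested dictionary restricts the replacement to one column.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                         'B': [3, 6, 3, 8, 9],
                         'C': ['a', 'b', 'c', 'd', 'e']})

# Flat dictionary: replace 1 with 100 and 3 with 300 wherever they appear.
replaced = df_small.replace({1: 100, 3: 300})

# Nested dictionary: replace 1 with 100 in column 'A' only.
only_a = df_small.replace({'A': {1: 100}})

print(replaced)
print(only_a)
```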


In [14]:
# What's really cool about pandas replacement is that it supports regex too!
# Let's look at our data from the dataset logs again

df = pd.read_csv("C:\\Users\\asus\\Desktop\\Coursera\\Applied Data Science with Python\\(1) Introduction to Data Science in Python\\dataset\\log.csv")
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


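To replace using a regex, we make the first parameter, to_replace, the regex pattern we want to match, the second parameter, value, the value we want to substitute on a match, and then we pass a third parameter, regex=True.

Take a moment to think about this problem: suppose we want to detect all HTML pages in the video column, let's say that just means they end with ".html", and we want to replace them with the keyword "webpage". How could we accomplish this?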

In [15]:
# Here's my solution, first matching any number of characters then ending in .html
# (the dot is escaped so that it matches a literal "." rather than any character)

df.replace(to_replace=r".*\.html$", value="webpage", regex=True)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
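
A detail worth noting about patterns like this one: in a regex an unescaped `.` matches any character, so it is the escaped form `\.` that requires a literal ".html" ending. A small sketch on made-up file names (the Series `videos` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical names: only the first one really ends in ".html".
videos = pd.Series(["intro.html", "notes_html", "clip.mp4"])

# Unescaped dot: "notes_html" also matches, because "." accepts the underscore.
loose = videos.replace(to_replace=".*.html$", value="webpage", regex=True)

# Escaped dot: only names ending in a literal ".html" are replaced.
strict = videos.replace(to_replace=r".*\.html$", value="webpage", regex=True)

print(loose.tolist())
print(strict.tolist())
```

On the log data both patterns give the same result, since every video name there ends in ".html" or in something quite different.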


One last note on missing values. When you use statistical functions on DataFrames, these functions typically
ignore missing values. For instance, if you try to calculate the mean value of a DataFrame, the underlying
NumPy function will ignore missing values. This is usually what you want, but you should be aware that values
are being excluded. Why you have missing values really matters, depending upon the problem you are trying to
solve. It might be unreasonable to infer missing values, for instance, if the data shouldn't exist in the
first place.
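
To make the exclusion concrete, here is a minimal sketch on a made-up Series (the name `scores` and its values are hypothetical). It compares the default mean, the mean with skipna=False, and the mean after filling with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value.
scores = pd.Series([80.0, np.nan, 100.0])

skipped = scores.mean()                 # missing value ignored: (80 + 100) / 2
strict = scores.mean(skipna=False)      # missing value propagates, result is NaN
zero_filled = scores.fillna(0).mean()   # treating missing as 0: (80 + 0 + 100) / 3

print(skipped, strict, zero_filled)
```

The three answers differ, which is why the reason behind the missing value should drive the choice.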