# Chap07 Missing Data

In [1]:
import pandas as pd
import numpy as np

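## Counting and Dropping Missing Values
### Summarizing missing information
1. Use `isna` or `isnull` to check whether **each cell** is missing.
2. Combine the result with `mean` to get the proportion of missing values in each column.
3. To select the rows of a DataFrame where a given column is missing or non-missing, apply `isna` or `notna` (non-missing) to that Series and use the result for boolean indexing.
   - To select rows where several columns are all missing, have at least one missing value, or have no missing values, combine `isna`/`notna` with `any`/`all`:
     - `isna().any(axis=1)`: at least one missing
     - `isna().all(axis=1)`: all missing
     - `notna().any(axis=1)`: at least one non-missing
     - `notna().all(axis=1)`: none missing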
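
The four combinations above can be sketched on a small frame. The `toy` DataFrame below is a hypothetical example made up for illustration, not the `learn_pandas.csv` data used in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    'A': [1.0, np.nan, np.nan, 4.0],
    'B': [1.0, 2.0, np.nan, np.nan],
})

mask = toy.isna()  # True where a cell is missing

at_least_one_missing = toy[mask.any(axis=1)]          # rows 1, 2, 3
all_missing = toy[mask.all(axis=1)]                   # row 2
at_least_one_present = toy[toy.notna().any(axis=1)]   # rows 0, 1, 3
none_missing = toy[toy.notna().all(axis=1)]           # row 0

# Proportion of missing values per column (0.5 for both A and B)
missing_ratio = mask.mean()
print(missing_ratio)
```

Passing `axis=1` as a keyword makes each reduction run across the columns of a row, so every mask has one boolean per row.
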
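### Dropping missing information with `dropna()`
1. Main parameters
   - `axis`: the axis direction; the default `0` drops rows
   - `how`: the drop rule, either `any` or `all`
   - `thresh`: a threshold on the number of non-missing values; rows or columns along the chosen axis that have fewer non-missing values than this are dropped
   - `subset`: the labels to check for missing values (for example, the columns to inspect when dropping rows)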
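
As a rough illustration of how these parameters interact, here is a minimal sketch on a hypothetical three-column frame (`toy` and its values are assumptions, not data from this chapter).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, np.nan],
    'y': [np.nan, np.nan, 3.0, 4.0],
    'z': [1.0, 2.0, 3.0, np.nan],
})

# how='any': drop a row if any of its cells is missing -> keeps row 2
rows_any = toy.dropna(how='any')

# how='all': drop a row only if every cell is missing -> keeps all four rows
rows_all = toy.dropna(how='all')

# subset: only look at columns x and z when deciding -> keeps rows 0 and 2
rows_subset = toy.dropna(how='any', subset=['x', 'z'])

# axis=1 with thresh=3: keep columns with at least 3 non-missing values -> keeps z
cols_thresh = toy.dropna(axis=1, thresh=3)

# The thresh call can also be written as a boolean column selection
cols_equiv = toy.loc[:, toy.notna().sum() >= 3]
```
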

In [5]:
df = pd.read_csv('./data/learn_pandas.csv', usecols=['Grade','Name','Gender','Height','Weight','Transfer'])
df.isnull().head()

| | Grade | Name | Gender | Height | Weight | Transfer |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False |
| 3 | False | False | False | True | False | False |
| 4 | False | False | False | False | False | False |


In [7]:
df.isnull().mean()

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64

In [9]:
df[df.Height.notna()].head()

| | Grade | Name | Gender | Height | Weight | Transfer |
|---|---|---|---|---|---|---|
| 0 | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
| 1 | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
| 2 | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
| 4 | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N |
| 5 | Freshman | Xiaoli Qian | Female | 158.0 | 51.0 | N |


In [20]:
sub_set = df[['Height','Weight','Transfer']]
df[sub_set.notna().any(axis=1)] # rows with at least one non-missing value

| | Grade | Name | Gender | Height | Weight | Transfer |
|---|---|---|---|---|---|---|
| 0 | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
| 1 | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
| 2 | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
| 3 | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N |
| 4 | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | Junior | Xiaojuan Sun | Female | 153.9 | 46.0 | N |
| 196 | Senior | Li Zhao | Female | 160.9 | 50.0 | N |
| 197 | Senior | Chengqiang Chu | Female | 153.9 | 45.0 | N |
| 198 | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N |


In [24]:
res = df.dropna(how='any', subset=['Height', 'Weight'])
res.shape
# Equivalent operation
res = df.loc[df[['Height', 'Weight']].notna().all(axis=1)]

In [25]:
# Drop columns with more than 15 missing values
res = df.dropna(axis=1, thresh=df.shape[0]-15)
res.head()
# Equivalent operation
res = df.loc[:, ~(df.isna().sum() > 15)]



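## Filling and Interpolating Missing Values
### Filling with `fillna`
1. Common parameters
   - `value`: the fill value, either a scalar or a dict mapping index labels to values
   - `method`: the fill method, `ffill` (fill with the preceding element) or `bfill` (fill with the following element); recent pandas versions deprecate this argument in favor of the `ffill` and `bfill` methods
   - `limit`: the maximum number of consecutive missing values to fill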
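
The parameters above can be sketched on a short Series with unique labels. The data is hypothetical, and the sketch uses the `ffill` and `bfill` methods, which perform the same forward and backward fill as the `method` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical Series (an assumption for illustration only)
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 2.0, np.nan], index=list('abcdef'))

# Scalar value: every missing entry becomes 0.0
by_scalar = s.fillna(0.0)

# Dict value: only the listed labels are filled; c and d stay missing
by_mapping = s.fillna({'a': -1.0, 'f': -2.0})

# Forward fill, at most one consecutive missing value per gap: d stays missing
forward = s.ffill(limit=1)

# Backward fill without a limit: the trailing f has no later value and stays missing
backward = s.bfill()
```
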
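### Interpolating with `interpolate`
1. Common parameters
   - `method`: the interpolation method
     - The default is linear interpolation, `linear`
     - Nearest-neighbor interpolation, `nearest`: a missing element takes the value of the closest non-missing element
     - Index interpolation, `index`: linear interpolation based on the index values
   - `limit_direction`: controls the direction
     - The default is forward-limited interpolation, `forward`
     - Backward-limited interpolation, `backward`
     - Bidirectional interpolation, `both`
   - `limit`: the maximum number of consecutive missing values to interpolate
2. **Note**
   - When the `polynomial` method is selected in `interpolate`, pandas internally calls `scipy.interpolate.interp1d(*, *, kind=order)`, which in turn calls `make_interp_spline`. The result is therefore spline interpolation, not polynomial-fit interpolation of the kind `polyfit` performs in NumPy. When the `spline` method is selected, pandas calls `scipy.interpolate.UnivariateSpline` rather than ordinary spline interpolation. The documentation for this part is rather confusing and the parameter design is questionable, so when using either method, choose the interpolation method carefully according to your actual needs.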
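
To see how `limit_direction` and `limit` interact under the default linear method, here is a minimal sketch on a hypothetical Series with leading, interior, and trailing gaps. With `limit=1`, only the one missing value adjacent to a known value in the allowed direction is filled.

```python
import numpy as np
import pandas as pd

# Hypothetical Series (an assumption for illustration only)
s = pd.Series([np.nan, np.nan, 1.0, np.nan, np.nan, np.nan, 2.0, np.nan, np.nan])

# Forward: fill one value after each known value; leading gap is untouched
forward = s.interpolate(limit_direction='forward', limit=1)

# Backward: fill one value before each known value; trailing gap is untouched
backward = s.interpolate(limit_direction='backward', limit=1)

# Both: fill one value on each side of every known value
both = s.interpolate(limit_direction='both', limit=1)

print(pd.DataFrame({'forward': forward, 'backward': backward, 'both': both}))
```
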


In [27]:
s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

In [29]:
s.ffill(limit=1)  # same result as fillna(method='ffill', limit=1), which recent pandas versions deprecate

a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64

In [30]:
s.fillna(s.mean())

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

In [31]:
s.fillna({'a':100,'d':200})

a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64

In [32]:
df.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean())).head()

0    158.900000
1    166.500000
2    188.900000
3    163.075862
4    174.000000
Name: Height, dtype: float64

In [35]:
s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])
s.values

array([nan, nan,  1., nan, nan, nan,  2., nan, nan])

In [39]:
res = s.interpolate(limit_direction='both')
res

0    1.00
1    1.00
2    1.00
3    1.25
4    1.50
5    1.75
6    2.00
7    2.00
8    2.00
dtype: float64

In [42]:
s

0    NaN
1    NaN
2    1.0
3    NaN
4    NaN
5    NaN
6    2.0
7    NaN
8    NaN
dtype: float64

In [43]:
s.interpolate('nearest')

0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    2.0
6    2.0
7    NaN
8    NaN
dtype: float64

In [44]:
s = pd.Series([0, np.nan, 10], index=[0,1,10])
s

0      0.0
1      NaN
10    10.0
dtype: float64

In [45]:
s.interpolate()

0      0.0
1      5.0
10    10.0
dtype: float64

In [46]:
s.interpolate('index')

0      0.0
1      1.0
10    10.0
dtype: float64
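
The difference between the two results above comes down to which x-coordinates are used. The sketch below repeats the comparison on a different, hypothetical Series and cross-checks the `index` result against `np.interp`; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with unevenly spaced labels (an assumption for illustration only)
s = pd.Series([0.0, np.nan, 20.0], index=[0, 2, 10])

# 'linear' ignores the labels and treats points as equally spaced:
# the gap sits halfway between 0 and 20, giving 10.0
by_position = s.interpolate()

# 'index' uses the labels as x-coordinates:
# 0 + (2 - 0) / (10 - 0) * (20 - 0) = 4.0
by_index = s.interpolate(method='index')

# Cross-check with NumPy's piecewise-linear interpolation over the known points
known = s.dropna()
expected = np.interp(2, known.index.to_numpy(dtype=float), known.to_numpy())

print(by_position[2], by_index[2], expected)
```

When the index is evenly spaced, the two methods give the same values; they differ only when the labels carry unequal spacing, as here.
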