## Detecting missing values

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
print(df)
df.isna()

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4


|   | A | B | C | D |
|---|---|---|---|---|
| 0 | True | False | True | False |
| 1 | False | False | True | False |
| 2 | True | True | True | False |
| 3 | True | False | True | False |
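The boolean mask returned by `isna()` is usually summarized rather than read cell by cell. The sketch below rebuilds the same frame and counts the missing values per column, then flags the rows that contain at least one missing value. The variable names `missing_per_column` and `rows_with_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same frame as in the example above.
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# True counts as 1 when summed, so this is the number of missing values per column.
missing_per_column = df.isna().sum()

# any(axis=1) marks the rows that contain at least one missing value.
rows_with_missing = df.isna().any(axis=1)

print(missing_per_column)   # A: 3, B: 1, C: 4, D: 0
print(rows_with_missing)    # True for every row, because column C is entirely missing
```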


## Handling missing values

### Dropping missing values

In [2]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})
df

|   | name | toy | born |
|---|---|---|---|
| 0 | Alfred | NaN | NaT |
| 1 | Batman | Batmobile | 1940-04-25 |
| 2 | Catwoman | Bullwhip | NaT |


Drop every row that contains at least one missing value (the default):

In [3]:
df.dropna()

|   | name | toy | born |
|---|---|---|---|
| 1 | Batman | Batmobile | 1940-04-25 |


Drop every column that contains at least one missing value:

In [4]:
df.dropna(axis=1)

|   | name |
|---|---|
| 0 | Alfred |
| 1 | Batman |
| 2 | Catwoman |


Drop a row only when all of its values are missing. No row here is entirely missing, so nothing is removed:

In [5]:
df.dropna(how="all")

|   | name | toy | born |
|---|---|---|---|
| 0 | Alfred | NaN | NaT |
| 1 | Batman | Batmobile | 1940-04-25 |
| 2 | Catwoman | Bullwhip | NaT |
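Because the frame above has no fully missing row, `how="all"` leaves it unchanged. The following sketch uses a small hypothetical frame (`demo`, not part of the original notebook) in which one row is entirely missing, to contrast `how="all"` with the default `how="any"`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, row 2 only partly.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                     "y": ["a", np.nan, np.nan]})

dropped_all = demo.dropna(how="all")  # removes only row 1
dropped_any = demo.dropna(how="any")  # the default: removes rows 1 and 2

print(dropped_all)
print(dropped_any)
```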


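Drop rows by threshold: `thresh=2` keeps only the rows that have at least two non-missing values. In this three-column frame, that removes rows with two or more missing values: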

In [6]:
df.dropna(thresh=2)

|   | name | toy | born |
|---|---|---|---|
| 1 | Batman | Batmobile | 1940-04-25 |
| 2 | Catwoman | Bullwhip | NaT |
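Since `thresh` counts non-missing values rather than missing ones, it can help to reproduce the result by hand. This sketch counts the non-missing cells per row and keeps the rows that reach the threshold. The names `non_missing`, `by_mask`, and `by_thresh` are illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the example above.
df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                   "toy": [np.nan, "Batmobile", "Bullwhip"],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Number of non-missing cells in each row: 1, 3, 2.
non_missing = df.notna().sum(axis=1)

# Keeping rows with at least two non-missing cells matches dropna(thresh=2).
by_mask = df[non_missing >= 2]
by_thresh = df.dropna(thresh=2)

print(by_mask.equals(by_thresh))
```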


Drop rows that have missing values in the specified columns (`subset`):

In [7]:
df.dropna(subset=["toy", "born"])

|   | name | toy | born |
|---|---|---|---|
| 1 | Batman | Batmobile | 1940-04-25 |
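With both `toy` and `born` in `subset`, the result happens to equal a plain `dropna()`. Restricting `subset` to one column shows the difference: in the sketch below only `toy` is checked, so Catwoman is kept even though her `born` value is missing. The name `only_toy` is illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the example above.
df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                   "toy": [np.nan, "Batmobile", "Bullwhip"],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Only the "toy" column is checked; missing values in "born" are ignored.
only_toy = df.dropna(subset=["toy"])

print(only_toy)
```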


### Replacing missing values

In [8]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | 2.0 | NaN | 0 |
| 1 | 3.0 | 4.0 | NaN | 1 |
| 2 | NaN | NaN | NaN | 5 |
| 3 | NaN | 3.0 | NaN | 4 |


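Replace each missing value with the value that precedes it. `ffill` is short for "forward fill". With `axis=1` the preceding value comes from the column to the left; with `axis=0` it comes from the row above. A missing value with nothing before it stays missing.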

In [10]:
df.ffill(axis=1)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | 2.0 | 2.0 | 0.0 |
| 1 | 3.0 | 4.0 | 4.0 | 1.0 |
| 2 | NaN | NaN | NaN | 5.0 |
| 3 | NaN | 3.0 | 3.0 | 4.0 |


In [11]:
df.ffill(axis=0)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | 2.0 | NaN | 0 |
| 1 | 3.0 | 4.0 | NaN | 1 |
| 2 | 3.0 | 4.0 | NaN | 5 |
| 3 | 3.0 | 3.0 | NaN | 4 |
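Forward fill has two close relatives that are not used in the notebook above and are shown here only as a supplementary sketch: the `limit` parameter, which caps how many consecutive missing values are filled, and `bfill` ("backward fill"), which copies the next valid value backward instead. The Series `a` reproduces column A of the frame.

```python
import numpy as np
import pandas as pd

# Column A of the frame above, as a standalone Series.
a = pd.Series([np.nan, 3.0, np.nan, np.nan], name="A")

forward = a.ffill()                 # NaN, 3.0, 3.0, 3.0
forward_limited = a.ffill(limit=1)  # NaN, 3.0, 3.0, NaN (fills at most one gap in a row)
backward = a.bfill()                # 3.0, 3.0, NaN, NaN (nothing follows the last two)

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```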


Fill missing values with a specified value:

In [12]:
df.fillna(0)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 0.0 | 0 |
| 1 | 3.0 | 4.0 | 0.0 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 5 |
| 3 | 0.0 | 3.0 | 0.0 | 4 |


Fill each column with a different value by passing a dict that maps column names to fill values:

In [13]:
trans = {"A": 9, "B": 8, "C": 7, "D": 6}
df.fillna(value=trans)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 9.0 | 2.0 | 7.0 | 0 |
| 1 | 3.0 | 4.0 | 7.0 | 1 |
| 2 | 9.0 | 8.0 | 7.0 | 5 |
| 3 | 9.0 | 3.0 | 7.0 | 4 |


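Fill with each column's mean, computed by `mean()`. Column C is entirely missing, so its mean is NaN and the column stays unfilled: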

In [14]:
df.fillna(df.mean())

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 3.0 | 2.0 | NaN | 0 |
| 1 | 3.0 | 4.0 | NaN | 1 |
| 2 | 3.0 | 3.0 | NaN | 5 |
| 3 | 3.0 | 3.0 | NaN | 4 |
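To see where the fill values come from, it can help to inspect `df.mean()` on its own. The sketch below rebuilds the frame and prints the per-column means: missing values are skipped when averaging, and a column with no valid values has a mean of NaN, which is why column C is left as it was. The names `means` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the example above.
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# Per-column means, ignoring NaN: A = 3.0, B = (2 + 4 + 3) / 3 = 3.0, C = NaN, D = 2.5.
means = df.mean()

# fillna aligns the Series of means with the columns by name.
filled = df.fillna(means)

print(means)
print(filled)
```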


**Advanced: filling by interpolation with `interpolate()`**

In [15]:
df.interpolate()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | 2.0 | NaN | 0 |
| 1 | 3.0 | 4.0 | NaN | 1 |
| 2 | 3.0 | 3.5 | NaN | 5 |
| 3 | 3.0 | 3.0 | NaN | 4 |
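The 3.5 in column B comes from the default linear method, which treats the rows as equally spaced and draws a straight line between the neighbouring valid values. The sketch below recomputes that value by hand for column B and compares it with `interpolate()`. The names `b`, `by_hand`, and `interpolated` are illustrative.

```python
import numpy as np
import pandas as pd

# Column B of the frame above: the gap at position 2 sits between 4.0 and 3.0.
b = pd.Series([2.0, 4.0, np.nan, 3.0], name="B")

# Straight line from position 1 (value 4.0) to position 3 (value 3.0),
# evaluated at position 2: 4.0 + (3.0 - 4.0) * (2 - 1) / (3 - 1) = 3.5.
by_hand = b.iloc[1] + (b.iloc[3] - b.iloc[1]) * (2 - 1) / (3 - 1)

# Default method is linear interpolation over equally spaced positions.
interpolated = b.interpolate()

print(by_hand)                # 3.5
print(interpolated.tolist())  # [2.0, 4.0, 3.5, 3.0]
```

In column A the leading NaN has no earlier value to interpolate from and stays missing, while the trailing NaNs are filled with the last valid value, 3.0, as the output table shows.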
