# Handling Missing Data

Missing data comes in two forms:
- None
- np.nan (NaN)

## 1. None

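None is a built-in Python object of type NoneType; a NumPy array that contains it is stored with dtype object. As a result, None cannot take part in arithmetic: any such operation raises a TypeError.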
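As a minimal sketch of that behavior (the array contents here are made up for illustration), the snippet below checks the type of None, shows the object dtype NumPy falls back to, and catches the error raised by arithmetic:

```python
import numpy as np

# None is the only instance of the built-in NoneType.
none_type_name = type(None).__name__

# An array that mixes numbers with None cannot use a numeric dtype,
# so NumPy stores it as generic Python objects.
arr = np.array([1, 2, None])
arr_dtype = arr.dtype

# Summing the array eventually evaluates `int + None`, which raises TypeError.
try:
    arr.sum()
    sum_failed = False
except TypeError:
    sum_failed = True

print(none_type_name, arr_dtype, sum_failed)
```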

In [1]:
# Check the data type of None
type(None)


## 2. np.nan (NaN)

np.nan is a float, so it can take part in calculations, but ordinary arithmetic involving it returns NaN.
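A short sketch of NaN propagation, using a made-up three-element array. It also shows `np.nansum`, a NumPy function that skips NaN instead of propagating it, and the fact that NaN never compares equal to itself:

```python
import numpy as np

# np.nan is an ordinary Python float.
nan_type = type(np.nan)

# A numeric array can hold NaN and keep a float dtype.
arr = np.array([1, 2, np.nan])

plain_sum = arr.sum()            # NaN propagates through the sum
nan_aware_sum = np.nansum(arr)   # NaN-aware variant ignores the missing value

# NaN is not equal to anything, including itself; use np.isnan to test for it.
self_equal = (np.nan == np.nan)

print(nan_type, arr.dtype, plain_sum, nan_aware_sum, self_equal)
```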

In [2]:
# Check the data type of np.nan
import numpy as np
type(np.nan)


## 3. None and NaN in pandas

### 1) pandas treats both None and np.nan as np.nan

In [1]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

Create a DataFrame

In [2]:
df = DataFrame(data=np.random.randint(10,50,size=(8,8)))
df

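| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 39 | 29 | 15 | 34 | 17 | 15 | 10 |
| 1 | 20 | 13 | 34 | 20 | 38 | 22 | 34 | 15 |
| 2 | 30 | 45 | 41 | 28 | 26 | 22 | 19 | 27 |
| 3 | 37 | 29 | 47 | 12 | 37 | 49 | 28 | 44 |
| 4 | 30 | 29 | 40 | 34 | 40 | 41 | 20 | 20 |
| 5 | 32 | 49 | 23 | 38 | 25 | 13 | 25 | 40 |
| 6 | 23 | 46 | 31 | 22 | 15 | 37 | 21 | 48 |
| 7 | 41 | 39 | 35 | 26 | 46 | 48 | 33 | 24 |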


In [18]:
# Set some elements to NaN

In [4]:
df.iloc[1,3] = None
df.iloc[2,2] = None
df.iloc[4,2] = None
df.iloc[6,7] = np.nan

In [5]:
df

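| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 39 | 29.0 | 15.0 | 34 | 17 | 15 | 10.0 |
| 1 | 20 | 13 | 34.0 | NaN | 38 | 22 | 34 | 15.0 |
| 2 | 30 | 45 | NaN | 28.0 | 26 | 22 | 19 | 27.0 |
| 3 | 37 | 29 | 47.0 | 12.0 | 37 | 49 | 28 | 44.0 |
| 4 | 30 | 29 | NaN | 34.0 | 40 | 41 | 20 | 20.0 |
| 5 | 32 | 49 | 23.0 | 38.0 | 25 | 13 | 25 | 40.0 |
| 6 | 23 | 46 | 31.0 | 22.0 | 15 | 37 | 21 | NaN |
| 7 | 41 | 39 | 35.0 | 26.0 | 46 | 48 | 33 | 24.0 |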
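The output above shows the affected columns switching to float, with None and np.nan both displayed as NaN. The same conversion can be seen in a small Series; this is only a sketch with made-up values, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# In numeric data, pandas converts None to NaN at construction time,
# which forces the dtype to float64.
s = pd.Series([1, None, np.nan])

# Both missing entries are detected the same way.
null_mask = s.isnull().tolist()

print(s.dtype, null_mask)
```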


### 2) Null-handling operations in pandas

- ``isnull()``
- ``notnull()``
- ``dropna()``: filter out missing data
- ``fillna()``: fill in missing data

In [8]:
df.isnull()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | True | False | False | False | False |
| 2 | False | False | True | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | True | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False |
| 6 | False | False | False | False | False | False | False | True |
| 7 | False | False | False | False | False | False | False | False |


In [14]:
# Flag the rows that contain no missing values
# (pair notnull() with all(), and isnull() with any())
df.notnull().all(axis=1)

0     True
1    False
2    False
3     True
4    False
5     True
6    False
7     True
dtype: bool

In [15]:
df.loc[df.notnull().all(axis=1)]

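| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 39 | 29.0 | 15.0 | 34 | 17 | 15 | 10.0 |
| 3 | 37 | 29 | 47.0 | 12.0 | 37 | 49 | 28 | 44.0 |
| 5 | 32 | 49 | 23.0 | 38.0 | 25 | 13 | 25 | 40.0 |
| 7 | 41 | 39 | 35.0 | 26.0 | 46 | 48 | 33 | 24.0 |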


(1) Predicate functions
- ``isnull()``
- ``notnull()``

- Combine them with any()/all(): df.notnull().all() or df.isnull().any()
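The pairing of these predicates with any()/all() can be sketched on a small, made-up DataFrame (the column names `a`, `b`, `c` are hypothetical). `axis=0` reduces down each column and `axis=1` reduces across each row:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# Which columns contain at least one missing value?
cols_with_null = df_demo.isnull().any(axis=0)

# Which rows are complete, i.e. every value is present?
rows_complete = df_demo.notnull().all(axis=1)

# How many values are missing in each column?
null_counts = df_demo.isnull().sum()

print(cols_with_null, rows_complete, null_counts, sep="\n")
```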

In [4]:
# Filter out the null values in df (keep only rows that have no nulls)


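(2) Filtering function
- ``dropna()``: drops rows or columns that contain missing values. The axis parameter selects which: 0 drops rows (the default), 1 drops columns.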
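To illustrate the axis choice, here is a small sketch on made-up data. It also uses the `how` and `subset` parameters of `dropna`, which the text above does not cover, to show less aggressive filtering:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

# axis=0 drops every row with a NaN; column "c" is all NaN, so no row survives.
rows_kept = df_demo.dropna(axis=0)

# axis=1 drops every column with a NaN; only "b" is complete.
cols_kept = df_demo.dropna(axis=1)

# how="all" drops a column only when all of its values are missing.
all_nan_cols_dropped = df_demo.dropna(axis=1, how="all")

# subset limits the check to the listed columns when dropping rows.
rows_checked_on_a = df_demo.dropna(subset=["a"])

print(rows_kept.shape, list(cols_kept.columns))
```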

In [17]:
df.dropna(axis=0)

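| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 39 | 29.0 | 15.0 | 34 | 17 | 15 | 10.0 |
| 3 | 37 | 29 | 47.0 | 12.0 | 37 | 49 | 28 | 44.0 |
| 5 | 32 | 49 | 23.0 | 38.0 | 25 | 13 | 25 | 40.0 |
| 7 | 41 | 39 | 35.0 | 26.0 | 46 | 48 | 33 | 24.0 |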


(3) Fill function (Series/DataFrame)
- ``fillna()``: the value and method parameters

In [18]:
df

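| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 39 | 29.0 | 15.0 | 34 | 17 | 15 | 10.0 |
| 1 | 20 | 13 | 34.0 | NaN | 38 | 22 | 34 | 15.0 |
| 2 | 30 | 45 | NaN | 28.0 | 26 | 22 | 19 | 27.0 |
| 3 | 37 | 29 | 47.0 | 12.0 | 37 | 49 | 28 | 44.0 |
| 4 | 30 | 29 | NaN | 34.0 | 40 | 41 | 20 | 20.0 |
| 5 | 32 | 49 | 23.0 | 38.0 | 25 | 13 | 25 | 40.0 |
| 6 | 23 | 46 | 31.0 | 22.0 | 15 | 37 | 21 | NaN |
| 7 | 41 | 39 | 35.0 | 26.0 | 46 | 48 | 33 | 24.0 |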


You can choose between forward filling and backward filling

In [21]:
df.fillna(method='ffill',axis=1)

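| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | 39.0 | 29.0 | 15.0 | 34.0 | 17.0 | 15.0 | 10.0 |
| 1 | 20.0 | 13.0 | 34.0 | 34.0 | 38.0 | 22.0 | 34.0 | 15.0 |
| 2 | 30.0 | 45.0 | 45.0 | 28.0 | 26.0 | 22.0 | 19.0 | 27.0 |
| 3 | 37.0 | 29.0 | 47.0 | 12.0 | 37.0 | 49.0 | 28.0 | 44.0 |
| 4 | 30.0 | 29.0 | 29.0 | 34.0 | 40.0 | 41.0 | 20.0 | 20.0 |
| 5 | 32.0 | 49.0 | 23.0 | 38.0 | 25.0 | 13.0 | 25.0 | 40.0 |
| 6 | 23.0 | 46.0 | 31.0 | 22.0 | 15.0 | 37.0 | 21.0 | 21.0 |
| 7 | 41.0 | 39.0 | 35.0 | 26.0 | 46.0 | 48.0 | 33.0 | 24.0 |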


method controls the fill direction: ffill (forward, copying the previous value) or bfill (backward, copying the next value)
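The two directions can be compared on a small, made-up Series. This sketch uses the dedicated `ffill()` and `bfill()` methods, since recent pandas releases deprecate the `method` argument of `fillna`; the results are the same as `fillna(method='ffill')` and `fillna(method='bfill')` on older versions. It also shows the `value` form of `fillna`:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill copies the last valid value downward;
# the leading NaN has nothing before it and stays missing.
forward = s.ffill()

# Backward fill copies the next valid value upward;
# the trailing NaN has nothing after it and stays missing.
backward = s.bfill()

# Filling with a constant replaces every NaN regardless of position.
constant = s.fillna(0)

print(forward.tolist(), backward.tolist(), constant.tolist(), sep="\n")
```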

============================================

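Exercise 7:

1. Briefly describe the difference between None and NaN.

2. Suppose Zhang San and Li Si take a mock exam, but Zhang San, having suddenly figured out the meaning of life, skips the English exam, so his English score is recorded as None. Create a DataFrame for this and name it ddd3.

3. The teacher decides to fill in Zhang San's English score with his math score. How would you do that?
    How would you fill it with Li Si's English score instead?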

============================================
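One possible approach to parts 2 and 3 is sketched below. The subject names and scores are invented for illustration, and the layout (one row per student, with the math column directly to the left of the English column) is an assumption that the fill directions depend on:

```python
import pandas as pd

# Hypothetical scores; Zhang San's English result is recorded as None.
ddd3 = pd.DataFrame(
    {
        "Chinese": [90.0, 85.0],
        "Math": [80.0, 95.0],
        "English": [None, 88.0],
    },
    index=["Zhang San", "Li Si"],
)

# Fill from the math score: forward fill along each row (axis=1),
# which copies the value from the column to the left.
filled_from_math = ddd3.ffill(axis=1)

# Fill from Li Si's English score: backward fill down each column (axis=0),
# which copies the value from the row below.
filled_from_li_si = ddd3.bfill(axis=0)

print(filled_from_math, filled_from_li_si, sep="\n")
```

With a different row or column order, the direction (ffill or bfill) and the axis would need to change accordingly.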

In [24]:
data = pd.read_excel('测试数据.xlsx')
data.drop(labels=['none1','none2'],axis=1,inplace=True)
data

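| | time | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-27 17:00:00 | -24.8 | -18.2 | -20.8 | -18.8 | NaN | NaN | NaN |
| 1 | 2019-01-27 17:01:00 | -23.5 | -18.8 | -20.5 | -19.8 | -15.2 | -14.5 | -16.0 |
| 2 | 2019-01-27 17:02:00 | -23.2 | -19.2 | NaN | NaN | -13.0 | NaN | -14.0 |
| 3 | 2019-01-27 17:03:00 | -22.8 | -19.2 | -20.0 | -20.5 | NaN | -12.2 | -9.8 |
| 4 | 2019-01-27 17:04:00 | -23.2 | -18.5 | -20.0 | -18.8 | -10.2 | -10.8 | -8.8 |
| 5 | 2019-01-27 17:05:00 | NaN | NaN | -19.0 | -18.2 | -10.0 | -10.5 | -10.8 |
| 6 | 2019-01-27 17:06:00 | NaN | -18.5 | -18.2 | -17.5 | NaN | NaN | NaN |
| 7 | 2019-01-27 17:07:00 | -24.8 | -18.0 | -17.5 | -17.2 | -14.2 | -14.0 | -12.5 |
| 8 | 2019-01-27 17:08:00 | -25.2 | -17.8 | NaN | NaN | -16.2 | NaN | -14.5 |
| 9 | 2019-01-27 17:09:00 | -24.8 | -18.2 | NaN | -17.5 | NaN | -15.5 | -16.0 |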


In [27]:
data.shape

(1060, 8)

In [26]:
data.dropna(axis=0).shape

(927, 8)

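In [34]:
# Forward-fill down each column, then backward-fill any gaps left at the top
data.fillna(method='ffill',axis=0,inplace=True)
data.fillna(method='bfill',axis=0,inplace=True)

In [35]:
# Check which columns still contain null values (none remain after filling)
data.isnull().any(axis=0)

time    False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
dtype: bool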
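Chaining a forward fill with a backward fill, as in cell In [34], matters because a forward fill alone cannot repair a gap in the first row: there is no earlier value to copy. The sketch below demonstrates this on a tiny made-up frame (the column names `t1` and `t2` are hypothetical), using `ffill()` and `bfill()`, which are equivalent to the `fillna(method=...)` calls:

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "t1": [np.nan, -23.5, np.nan, -22.8],
    "t2": [-18.2, np.nan, np.nan, -19.2],
})

# Step 1: forward fill down each column. The NaN in the first row of "t1"
# has no preceding value, so it is still missing afterwards.
after_ffill = readings.ffill(axis=0)
remaining = int(after_ffill.isnull().sum().sum())

# Step 2: backward fill picks up whatever the forward fill could not reach.
filled = after_ffill.bfill(axis=0)

print(remaining, filled.isnull().any(axis=0).tolist())
```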