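# CH5 Data Preprocessing (Washing the Rice and Vegetables)

Whether the data comes from internal business systems or from external sources, it is rarely clean: expect missing values, duplicate records, and outliers.<br>
Data analysis has a well-known saying: Garbage in, garbage out.<br>
In other words, if the data is not cleaned before analysis or modeling, the resulting models and reports will be garbage as well.<br>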

In [2]:
import pandas as pd
import numpy as np

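## 5.1 Handling Missing Values

### 5.1.1 Inspecting Missing Values

In pandas, the info() method shows not only each column's type but also its non-null count, which reveals whether the column has missing values.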

In [3]:
data_dict = {"编号":["A1","A2","A3","A4"],"年龄":[54,16,47,41],
        "性别":["男",np.nan,"女","男"],
        "注册时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11"]}
df = pd.DataFrame(data_dict)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
编号      4 non-null object
年龄      4 non-null int64
性别      3 non-null object
注册时间    4 non-null object
dtypes: int64(1), object(3)
memory usage: 208.0+ bytes


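The 性别 column reports 3 non-null object: it holds 3 non-null values while every other column holds 4, so exactly 1 value in 性别 is missing.<br>
Alternatively, call isnull() to test every cell directly.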

In [4]:
data_dict = {"编号":["A1","A2","A3","A4"],"年龄":[54,16,47,41],
        "性别":["男",np.nan,"女","男"],
        "注册时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11"]}
df = pd.DataFrame(data_dict)
df.isnull()

Unnamed: 0,编号,年龄,性别,注册时间
0,False,False,False,False
1,False,False,True,False
2,False,False,False,False
3,False,False,False,False


For convenience, sum the True values in each column to get a per-column count of missing values.

In [5]:
df.isnull().sum(axis = 0)

编号      0
年龄      0
性别      1
注册时间    0
dtype: int64
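
The per-column counts can also be expressed as proportions, which are easier to compare against an acceptable threshold. The following is a minimal sketch on the same sample data; the variable names missing_ratio and rows_with_missing are illustrative only.

```python
import numpy as np
import pandas as pd

# Same sample data as above: one missing value in the 性别 column.
df = pd.DataFrame({
    "编号": ["A1", "A2", "A3", "A4"],
    "年龄": [54, 16, 47, 41],
    "性别": ["男", np.nan, "女", "男"],
    "注册时间": ["2018-8-8", "2018-8-9", "2018-8-10", "2018-8-11"],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of missing entries in that column.
missing_ratio = df.isnull().mean()
print(missing_ratio)

# any(axis=1) flags the rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```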

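### 5.1.2 Dropping Missing Values

dropna() removes missing values. By default it drops every row that contains at least one missing value.<br>
Pass how = "all" to drop only the rows in which every value is missing.<br>
To be safe, inspect which rows contain missing values before dropping anything.<br>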

In [6]:
data_dict = {"编号":["A1","A2",np.nan,"A4"],"年龄":[54,16,np.nan,41],
        "性别":["男",np.nan,np.nan,"男"],
        "注册时间":["2018-8-8","2018-8-9",np.nan,"2018-8-11"]}
df = pd.DataFrame(data_dict)

In [7]:
# Show the rows where 性别 is missing
df[df["性别"].isnull()]

Unnamed: 0,编号,年龄,性别,注册时间
1,A2,16.0,,2018-8-9
2,,,,


In [8]:
df.dropna()

Unnamed: 0,编号,年龄,性别,注册时间
0,A1,54.0,男,2018-8-8
3,A4,41.0,男,2018-8-11


In [9]:
df.dropna(how = "all")

Unnamed: 0,编号,年龄,性别,注册时间
0,A1,54.0,男,2018-8-8
1,A2,16.0,,2018-8-9
3,A4,41.0,男,2018-8-11
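
Between the two extremes of dropping any row with a gap and dropping only fully empty rows, dropna() also accepts the subset and thresh parameters. The sketch below reuses the sample data from this section; the chosen columns and the threshold of 3 are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "编号": ["A1", "A2", np.nan, "A4"],
    "年龄": [54, 16, np.nan, 41],
    "性别": ["男", np.nan, np.nan, "男"],
    "注册时间": ["2018-8-8", "2018-8-9", np.nan, "2018-8-11"],
})

# Drop a row only if one of the listed columns is missing.
by_subset = df.dropna(subset=["编号", "年龄"])

# Keep only the rows that have at least 3 non-missing values.
by_thresh = df.dropna(thresh=3)

# dropna() returns a new DataFrame; df itself still has all 4 rows.
print(by_subset.index.tolist())
print(by_thresh.index.tolist())
print(len(df))
```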


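### 5.1.3 Filling Missing Values

Data is valuable, so some missing values are acceptable as long as their share stays within a reasonable range.<br>
In practice, filling is usually preferable to deleting rows outright.<br>
Common filling strategies:<br>
1. Fill with 0<br>
2. Fill numeric columns with the mode, mean, or median<br>
&nbsp;&nbsp;Fill text columns with the mode<br>
3. Export the rows that have missing values, have colleagues on the business side confirm the correct values by hand, then write them back<br>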

In [10]:
data_dict = {"编号":["A1","A2","A3","A4"],"年龄":[54,16,np.nan,41],
        "性别":["男",np.nan,"女","男"],
        "注册时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11"]}
df = pd.DataFrame(data_dict)
df["性别"].fillna(value="Other")

0        男
1    Other
2        女
3        男
Name: 性别, dtype: object

In [11]:
data_dict = {"编号":["A1","A2","A3","A4"],"年龄":[54,16,np.nan,41],
        "性别":["男",np.nan,"女","男"],
        "注册时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11"]}
df = pd.DataFrame(data_dict)
df.fillna({"性别":"男","年龄":df["年龄"].mean()})

Unnamed: 0,编号,年龄,性别,注册时间
0,A1,54.0,男,2018-8-8
1,A2,16.0,男,2018-8-9
2,A3,37.0,女,2018-8-10
3,A4,41.0,男,2018-8-11
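
The example above fills 年龄 with the mean and hard-codes the value for 性别. As a hedged variation on strategy 2, the sketch below computes the median and the mode from the data itself; the names age_median, gender_mode, and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "编号": ["A1", "A2", "A3", "A4"],
    "年龄": [54, 16, np.nan, 41],
    "性别": ["男", np.nan, "女", "男"],
    "注册时间": ["2018-8-8", "2018-8-9", "2018-8-10", "2018-8-11"],
})

# median() skips missing values by default.
age_median = df["年龄"].median()

# mode() returns a Series because several values can tie; take the first one.
gender_mode = df["性别"].mode().iloc[0]

# Fill each column with its own replacement value.
filled = df.fillna({"年龄": age_median, "性别": gender_mode})
print(filled)
```

The median is less sensitive to extreme values than the mean, which can matter when the column also contains outliers.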


## 5.2 Handling Duplicates

Like missing values, duplicate records distort statistics and modeling,<br>
so they need to be handled as well.

In [12]:
data_dict = {"订单编号":["A1","A2","A3","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","孙凤","赵恒","赵恒"],
             "唯一识别码":[101,102,103,103,104,104],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-10","2018-8-11","2018-8-12"]}
df = pd.DataFrame(data_dict)

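### 5.2.1 Inspecting Duplicates

In pandas, duplicated() flags duplicate rows. By default a row counts as a duplicate only if all of its columns match an earlier row; pass subset to compare only the listed columns, so a row is flagged when all of those columns match. Passing keep=False flags every row in a duplicate group rather than only the later occurrences.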

In [13]:
df.duplicated(subset = ['唯一识别码','成交时间'])

0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

In [14]:
df.loc[df.duplicated(keep=False, subset = ['唯一识别码','成交时间']), :]

Unnamed: 0,订单编号,客户姓名,唯一识别码,成交时间
2,A3,孙凤,103,2018-8-10
3,A3,孙凤,103,2018-8-10
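
To see how much the choice of subset matters, the following sketch counts duplicates on the same data under two criteria. Using 客户姓名 alone as the criterion is only an illustration, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    "订单编号": ["A1", "A2", "A3", "A3", "A4", "A5"],
    "客户姓名": ["张通", "李谷", "孙凤", "孙凤", "赵恒", "赵恒"],
    "唯一识别码": [101, 102, 103, 103, 104, 104],
    "成交时间": ["2018-8-8", "2018-8-9", "2018-8-10",
                 "2018-8-10", "2018-8-11", "2018-8-12"],
})

# Default: every column must match an earlier row.
full_row_dups = df.duplicated()

# Only the customer name must match an earlier row.
name_dups = df.duplicated(subset=["客户姓名"])

# Summing a boolean Series counts the True values.
print(full_row_dups.sum())
print(name_dups.sum())
print(df[name_dups])
```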


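### 5.2.2 Dropping Duplicates

drop_duplicates() removes duplicate rows. By default it also compares entire rows; pass subset to drop rows only when all of the listed columns match.<br>
The keep parameter controls which row survives: the default "first" keeps the first occurrence, "last" keeps the last one, and False drops every row in the duplicate group.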

In [15]:
data_dict = {"订单编号":["A1","A2","A3","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","孙凤","赵恒","赵恒"],
             "唯一识别码":[101,102,103,103,104,104],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-10","2018-8-11","2018-8-12"]}
df = pd.DataFrame(data_dict)

In [16]:
df.drop_duplicates()

Unnamed: 0,订单编号,客户姓名,唯一识别码,成交时间
0,A1,张通,101,2018-8-8
1,A2,李谷,102,2018-8-9
2,A3,孙凤,103,2018-8-10
4,A4,赵恒,104,2018-8-11
5,A5,赵恒,104,2018-8-12


In [17]:
df.drop_duplicates(subset = ['唯一识别码','成交时间'])

Unnamed: 0,订单编号,客户姓名,唯一识别码,成交时间
0,A1,张通,101,2018-8-8
1,A2,李谷,102,2018-8-9
2,A3,孙凤,103,2018-8-10
4,A4,赵恒,104,2018-8-11
5,A5,赵恒,104,2018-8-12


In [18]:
df.drop_duplicates(subset = ['唯一识别码','成交时间'], keep = False)

Unnamed: 0,订单编号,客户姓名,唯一识别码,成交时间
0,A1,张通,101,2018-8-8
1,A2,李谷,102,2018-8-9
4,A4,赵恒,104,2018-8-11
5,A5,赵恒,104,2018-8-12
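
The examples above cover the default and keep = False. A minimal sketch of keep = "last" follows, assuming we want the most recent order per customer and that the rows are already in chronological order; the name latest is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "订单编号": ["A1", "A2", "A3", "A3", "A4", "A5"],
    "客户姓名": ["张通", "李谷", "孙凤", "孙凤", "赵恒", "赵恒"],
    "唯一识别码": [101, 102, 103, 103, 104, 104],
    "成交时间": ["2018-8-8", "2018-8-9", "2018-8-10",
                 "2018-8-10", "2018-8-11", "2018-8-12"],
})

# For each 唯一识别码, keep only the last row in which it appears.
latest = df.drop_duplicates(subset=["唯一识别码"], keep="last")
print(latest)

# drop_duplicates() returns a new DataFrame; df still has 6 rows.
print(len(df))
```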


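## 5.3 Outlier Detection and Handling

### 5.3.1 Identifying Outliers

Common ways to identify outliers:<br>
1. Colleagues on the business side define a threshold, and any value above or below it counts as an outlier<br>
2. Draw a box plot (box-and-whisker plot): values above the upper fence (Q3+1.5×IQR) or below the lower fence (Q1-1.5×IQR) are outliers<br>
3. If the data is approximately normally distributed, values more than 3σ (3 standard deviations) from the mean are outliers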
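
The three rules can be sketched in pandas as follows. The data is a made-up series of 20 ordinary values plus one extreme value, and the business limit of 100 is a hypothetical threshold chosen for illustration. The series is not actually normally distributed, so the 3σ part only demonstrates the mechanics.

```python
import pandas as pd

# Hypothetical sample: 20 ordinary values plus one extreme value.
values = pd.Series(list(range(20, 40)) + [200])

# Rule 1: a business-defined threshold (the limit 100 is made up).
by_rule = values[values > 100]

# Rule 2: box-plot fences based on the interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
by_iqr = values[(values < lower_fence) | (values > upper_fence)]

# Rule 3: more than 3 standard deviations away from the mean.
mean = values.mean()
std = values.std()
by_sigma = values[(values - mean).abs() > 3 * std]

print(by_rule.tolist(), by_iqr.tolist(), by_sigma.tolist())
```

With only a handful of observations the 3σ rule may never trigger, because a single extreme value inflates the standard deviation it is measured against.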

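### 5.3.2 Handling Outliers

Common ways to handle outliers:<br>
1. The most common approach is to delete them<br>
2. Treat them as missing values and fill them<br>
3. Treat them as special cases and investigate why they occurred<br>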
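
A minimal sketch of the first two options, assuming outliers are identified with the box-plot fences and that the median is an acceptable fill value. Both are illustrative choices, as are the sample data and variable names.

```python
import pandas as pd

# Hypothetical sample: 20 ordinary values plus one extreme value.
values = pd.Series(list(range(20, 40)) + [200], dtype="float64")

# Flag outliers with the box-plot fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Option 1: delete the outliers.
removed = values[~is_outlier]

# Option 2: turn the outliers into missing values, then fill them.
as_missing = values.mask(is_outlier)  # NaN wherever is_outlier is True
filled = as_missing.fillna(as_missing.median())

print(len(removed))
print(filled.iloc[-1])
```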

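## 5.4 Data Type Conversion

The types most often encountered in pandas are listed below; most of the names come from NumPy:<br>
int: integer<br>
float: floating point<br>
object: arbitrary Python objects; text columns appear as object in the outputs of this chapter<br>
string_: NumPy fixed-length byte-string type<br>
unicode_: NumPy fixed-length Unicode type<br>
datetime64[ns]: date and time values

Besides the info() method mentioned earlier, the dtype attribute shows the type of an individual column.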

In [19]:
data_dict = {"订单编号":["A1","A2","A3","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","孙凤","赵恒","赵恒"],
             "唯一识别码":[101,102,103,103,104,104],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-10","2018-8-11","2018-8-12"]}
df = pd.DataFrame(data_dict)

In [20]:
df["订单编号"].dtype

dtype('O')

In [21]:
df["唯一识别码"].dtype

dtype('int64')

In pandas, astype() converts a column to another type. It returns a new Series and leaves the original column unchanged.

In [22]:
df["唯一识别码"].astype("float64")
# df["唯一识别码"].dtype

0    101.0
1    102.0
2    103.0
3    103.0
4    104.0
5    104.0
Name: 唯一识别码, dtype: float64
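
To make a conversion stick, assign the result back to the column. The sketch below also converts the 成交时间 strings to a datetime type with pd.to_datetime(), which is one common way to obtain the datetime64 type listed above; relying on automatic parsing of these date strings is an assumption that suits this sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "订单编号": ["A1", "A2", "A3", "A3", "A4", "A5"],
    "客户姓名": ["张通", "李谷", "孙凤", "孙凤", "赵恒", "赵恒"],
    "唯一识别码": [101, 102, 103, 103, 104, 104],
    "成交时间": ["2018-8-8", "2018-8-9", "2018-8-10",
                 "2018-8-10", "2018-8-11", "2018-8-12"],
})

# astype() returns a new Series, so assign it back to keep the change.
df["唯一识别码"] = df["唯一识别码"].astype("float64")

# Parse the date strings into real datetime values.
df["成交时间"] = pd.to_datetime(df["成交时间"])

print(df.dtypes)

# Datetime columns expose date parts through the .dt accessor.
print(df["成交时间"].dt.day.tolist())
```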

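## 5.5 Setting the Index

An index is the key used to look up data; its purpose is to make lookups easy.<br>
It is like coming home from the market and putting fruit and vegetables in the chiller compartment and meat in the freezer, so that everything is easy to find later.<br>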

In [23]:
df = pd.DataFrame([ ["A1","张通",101,"2018-8-8"],
               ["A2","李谷",102,"2018-8-9"],
               ["A3","孙凤",103,"2018-8-10"],
               ["A4","赵恒",104,"2018-8-11"],
               ["A5","赵恒",104 ,"2018-8-12"]])

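### 5.5.1 Adding an Index to a Table Without One

Assign to the columns attribute to set the column index (the column labels), and to the index attribute to set the row index.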

In [24]:
df.columns = ["订单编号","客户姓名","唯一识别码","成交时间"]
df.index = [1,2,3,4,5]
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,成交时间
1,A1,张通,101,2018-8-8
2,A2,李谷,102,2018-8-9
3,A3,孙凤,103,2018-8-10
4,A4,赵恒,104,2018-8-11
5,A5,赵恒,104,2018-8-12
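
When the labels are known up front, they can also be passed to the DataFrame constructor instead of being assigned afterwards. A minimal sketch with the first three rows of the table; the name rows is illustrative.

```python
import pandas as pd

rows = [["A1", "张通", 101, "2018-8-8"],
        ["A2", "李谷", 102, "2018-8-9"],
        ["A3", "孙凤", 103, "2018-8-10"]]

# columns sets the column labels, index sets the row labels.
df = pd.DataFrame(rows,
                  columns=["订单编号", "客户姓名", "唯一识别码", "成交时间"],
                  index=[1, 2, 3])
print(df)
```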


### 5.5.2 Setting a New Index

In [25]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","赵恒"],
             "唯一识别码":[101,102,103,104,104],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-12"]}
df = pd.DataFrame(data_dict)
df.index = [1,2,3,4,5]
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,成交时间
1,A1,张通,101,2018-8-8
2,A2,李谷,102,2018-8-9
3,A3,孙凤,103,2018-8-10
4,A4,赵恒,104,2018-8-11
5,A5,赵恒,104,2018-8-12


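Setting a new index usually means changing the row index. A table may already come with a row index, but if it is not the one we want, we specify a column to use instead with set_index().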

In [26]:
df.set_index("订单编号")

订单编号,客户姓名,唯一识别码,成交时间
A1,张通,101,2018-8-8
A2,李谷,102,2018-8-9
A3,孙凤,103,2018-8-10
A4,赵恒,104,2018-8-11
A5,赵恒,104,2018-8-12
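
Once a column has become the row index, its values can be used directly for lookups with loc. The sketch below also shows that set_index() returns a new DataFrame and that drop=False keeps the column in place; the names indexed and kept are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "订单编号": ["A1", "A2", "A3", "A4", "A5"],
    "客户姓名": ["张通", "李谷", "孙凤", "赵恒", "赵恒"],
    "唯一识别码": [101, 102, 103, 104, 104],
    "成交时间": ["2018-8-8", "2018-8-9", "2018-8-10", "2018-8-11", "2018-8-12"],
})

# set_index() returns a new DataFrame; df keeps its original index.
indexed = df.set_index("订单编号")

# Look up a row by its order number.
row = indexed.loc["A3"]
print(row["客户姓名"])

# drop=False keeps 订单编号 as an ordinary column as well.
kept = df.set_index("订单编号", drop=False)
print(kept.columns.tolist())
```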


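### 5.5.3 Renaming Index Labels

Renaming an index means changing the existing index labels. rename() takes the mapping from old labels to new ones as a dictionary, and it is used more often for columns than for rows.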

In [27]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","赵恒"],
             "唯一识别码":[101,102,103,104,104],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-12"]}
df = pd.DataFrame(data_dict)
df.index = [1,2,3,4,5]
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,成交时间
1,A1,张通,101,2018-8-8
2,A2,李谷,102,2018-8-9
3,A3,孙凤,103,2018-8-10
4,A4,赵恒,104,2018-8-11
5,A5,赵恒,104,2018-8-12


In [28]:
df.rename(columns = {"订单编号":"新订单编号","客户姓名":"新客户姓名"}, index = {1:"一",2:"二",3:"三"})

Unnamed: 0,新订单编号,新客户姓名,唯一识别码,成交时间
一,A1,张通,101,2018-8-8
二,A2,李谷,102,2018-8-9
三,A3,孙凤,103,2018-8-10
4,A4,赵恒,104,2018-8-11
5,A5,赵恒,104,2018-8-12
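
Two details of rename() are easy to miss: it returns a new DataFrame, and by default it silently ignores mapping keys that do not match any label. The sketch below illustrates both; the column name 不存在的列 is deliberately made up so that it matches nothing.

```python
import pandas as pd

df = pd.DataFrame({
    "订单编号": ["A1", "A2", "A3"],
    "客户姓名": ["张通", "李谷", "孙凤"],
})

# The unmatched key is ignored by default; df itself is not modified.
renamed = df.rename(columns={"订单编号": "新订单编号", "不存在的列": "x"})
print(renamed.columns.tolist())
print(df.columns.tolist())

# errors="raise" turns an unmatched key into a KeyError instead.
try:
    df.rename(columns={"不存在的列": "x"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```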


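### 5.5.4 Resetting the Index

Likewise, when the existing index is not what we want, reset_index() resets it directly. This is used most often after grouping and with pivot tables.<br>
reset_index() turns the row index levels back into ordinary columns and replaces them with the default index starting at 0; its level parameter limits this to the given levels. Note the drop parameter as well: for set_index() it defaults to True, meaning the column used as the index is removed from the columns, while for reset_index() it defaults to False, and passing drop=True discards the old index instead of turning it into columns.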

In [31]:
data_dict = {"Z1":["A","A","B","B"],
             "Z2":["a","a","b","b"],
             "C1":[1,3,5,7],
             "C2":[2,4,6,8]}
df = pd.DataFrame(data_dict)
df = df.set_index(keys=['Z1','Z2'])
df

Z1,Z2,C1,C2
A,a,1,2
A,a,3,4
B,b,5,6
B,b,7,8


In [32]:
df.reset_index()

Unnamed: 0,Z1,Z2,C1,C2
0,A,a,1,2
1,A,a,3,4
2,B,b,5,6
3,B,b,7,8


In [33]:
df.reset_index(level = 0)

Z2,Z1,C1,C2
a,A,1,2
a,A,3,4
b,B,5,6
b,B,7,8
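
Since reset_index() is used mostly after grouping, here is a minimal sketch of that case on the same data, together with drop=True. Summing C1 and C2 is an arbitrary aggregation chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Z1": ["A", "A", "B", "B"],
    "Z2": ["a", "a", "b", "b"],
    "C1": [1, 3, 5, 7],
    "C2": [2, 4, 6, 8],
})

# Grouping by two columns produces a two-level row index.
grouped = df.groupby(["Z1", "Z2"]).sum()

# Turn both index levels back into ordinary columns.
flat = grouped.reset_index()
print(flat)

# drop=True throws the old index away instead of keeping it as columns.
dropped = grouped.reset_index(drop=True)
print(dropped)
```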
