# Lab 2. Data Preprocessing

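* Handling missing data
* Normalization
* Dimensionality reduction: removing some attributes
* Discretization: converting numeric attributes into categorical ones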

## 1. Handling Missing Data

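<p>Real-world datasets often contain missing values. Pandas uses two markers for them: None and NaN (Not a Number). None is a Python object, so data that holds it has object dtype; NaN is a floating-point value.</p>
<p>Pandas also provides methods to detect, remove, and replace missing values.</p>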

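* isnull(): creates a boolean mask that marks the missing values.

* notnull(): the opposite of isnull().

* dropna(): returns a copy of the data with missing values removed.

* fillna(): returns a copy of the data with missing values filled in.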
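The difference between the two markers can be checked directly. The following is a minimal sketch on made-up values (the variable names are illustrative only): NumPy falls back to object dtype when an array holds None, while pandas converts None to NaN when the other values are numeric.

```python
import numpy as np
import pandas as pd

# A NumPy array holding None falls back to the generic object dtype.
arr_none = np.array([1, None, 3])
# A NumPy array holding np.nan stays a floating-point array.
arr_nan = np.array([1, np.nan, 3])
# pandas converts None to NaN when the remaining values are numeric.
ser = pd.Series([1, None, 3])

print(arr_none.dtype)         # object
print(arr_nan.dtype)          # float64
print(ser.dtype)              # float64
print(ser.isnull().tolist())  # [False, True, False]
```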

### 1.1 Detecting Missing Values

In [1]:
import pandas as pd
import numpy as np
import os

In [6]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [7]:
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool

### 1.2 Selecting Rows That Contain Missing Values

In [4]:
df = pd.read_csv(os.path.join(os.getcwd(), 'data', 'births.csv'))
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15547 entries, 0 to 15546
Data columns (total 5 columns):
year      15547 non-null int64
month     15547 non-null int64
day       15067 non-null float64
gender    15547 non-null object
births    15547 non-null int64
dtypes: float64(1), int64(3), object(1)
memory usage: 607.4+ KB


(15547, 5)

In [5]:
# drop_duplicates() ensures that a row with several missing values is output only once
na_data = df[df.isnull().values == True].drop_duplicates()
# na_data.head(10)
#na_data.info()
na_data.shape

(480, 5)

<font color='red'>How many records contain missing data?</font>   <font color='green'>480 records contain missing data.</font>

Next, to show why drop_duplicates() is needed, set one more column of a row in na_data to NaN so that the row contains two missing values.

In [10]:
na_data.loc[15067,'births']=None
na_data.loc[15067]
#data.head(10)

year      1989
month        1
day        NaN
gender       F
births     NaN
Name: 15067, dtype: object

In [11]:
na_data[na_data.isnull().values==True]

Unnamed: 0,year,month,day,gender,births
15067,1989,1,,F,
15067,1989,1,,F,
15068,1989,1,,M,164052.0
15069,1989,2,,F,146710.0
15070,1989,2,,M,154047.0
15071,1989,3,,F,165889.0
15072,1989,3,,M,174433.0
15073,1989,4,,F,155689.0
15074,1989,4,,M,163432.0
15075,1989,5,,F,163800.0


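<font color='red'>How many records with missing data are reported now?</font>   <font color='green'>The selection now returns 481 rows, because row 15067 contains two missing values and is listed twice; the number of distinct records with missing data is still 480.</font>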
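The duplicate appears because the two-dimensional mask lists a row once per missing value. An alternative is to collapse the mask to one boolean per row with any(axis=1). The sketch below uses a made-up toy frame (toy, row_mask and rows_with_na are illustrative names, not part of the lab data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has two missing values, row 2 has one.
toy = pd.DataFrame({
    'day':    [1.0, np.nan, np.nan, 4.0],
    'births': [10.0, np.nan, 30.0, 40.0],
})

# any(axis=1) reduces the 2-D mask to a single boolean per row,
# so every row is selected at most once.
row_mask = toy.isnull().any(axis=1)
rows_with_na = toy[row_mask]

print(rows_with_na.shape)   # (2, 2)
print(int(row_mask.sum()))  # 2
```

With a one-dimensional mask no call to drop_duplicates() is needed, which also avoids accidentally removing rows that are genuine duplicates of each other.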

### 1.3 Removing Missing Values

The dropna() method is the main way to remove missing values in Pandas.

In [12]:
df_nona = df.dropna()
df_nona.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15067 entries, 0 to 15066
Data columns (total 5 columns):
year      15067 non-null int64
month     15067 non-null int64
day       15067 non-null float64
gender    15067 non-null object
births    15067 non-null int64
dtypes: float64(1), int64(3), object(1)
memory usage: 706.3+ KB


By default, dropna() drops <b>any</b> <b>entire row</b> that contains a missing value.

In [None]:
df_nona = df.dropna(axis=1)
df_nona.info()

df.dropna(axis=1) and df.dropna(axis='columns') drop <b>any</b> <b>entire column</b> that contains a missing value.

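The how and thresh parameters control how many missing values a row or column must contain before it is dropped: how='all' drops only those in which every value is missing (the default, how='any', drops on a single missing value), and thresh=n keeps those that have at least n non-missing values.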

In [None]:
df_nona = df.dropna(how='all')
#df_nona = df.dropna(thresh=2)
df_nona.info()
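The births data has at most one missing value per row, so how='all' and thresh are hard to observe on it. A small made-up frame (toy is a hypothetical name) makes the effect visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 1 has one valid value,
# row 2 is entirely missing.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, np.nan, np.nan],
})

drop_any = toy.dropna()           # default how='any': keeps row 0 only
drop_all = toy.dropna(how='all')  # drops only the fully missing row 2
keep_two = toy.dropna(thresh=2)   # keeps rows with >= 2 non-missing values

print(drop_any.index.tolist())  # [0]
print(drop_all.index.tolist())  # [0, 1]
print(keep_two.index.tolist())  # [0]
```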

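### 1.4 Filling Missing Values
<p>Pandas provides the fillna() method for filling missing values. Its method parameter (ffill or bfill) selects the fill strategy: ffill fills with the previous valid value, and bfill fills with the next valid value.</p>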

In [None]:
df_nona = df.fillna(0)
df_nona.info()

Compare a row before and after filling:

In [None]:
na_idx = df[df.isnull().values==True].index.tolist()
len(na_idx)

In [None]:
df.loc[na_idx[0]]

In [None]:
df_nona.loc[na_idx[0]]

In [None]:
df_nona = df.fillna(method='ffill')
df_nona.loc[na_idx[0]]

In [None]:
df_nona = df.fillna(method='bfill')
df_nona.loc[na_idx[0]]

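The value is still NaN here. From this row through the last row the field is always NaN, so there is no later valid value and bfill has no effect. After filling, always check the result to confirm that every NaN has been filled.

Confirm whether the last record that contains NaN is also the last row of the dataset.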

In [None]:
na_idx[-1]
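The same situation can be reproduced on a short made-up series, together with a check for leftover NaN values. This sketch uses Series.ffill() and Series.bfill(), the method forms equivalent to fillna(method=...), which newer pandas releases recommend over the method argument; the series s is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series whose missing values run to the very end.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan])

forward = s.ffill()   # each gap takes the previous valid value
backward = s.bfill()  # trailing gaps have no later value to copy

# Counting the remaining NaN values is a quick post-fill check.
print(int(forward.isnull().sum()))   # 0
print(int(backward.isnull().sum()))  # 2
```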

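fillna() also accepts an axis parameter, which determines whether the valid value is taken from the previous row (axis=0) or from the previous column (axis=1).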

In [None]:
df_nona = df.fillna(method='ffill',axis=1)
df_nona.loc[na_idx[0]]

Filling with the mean is another common approach.

In [None]:
# Fill only the 'day' column with its own mean
fill = df['day'].mean()
df_nona = df.fillna({'day': fill})
df_nona.loc[na_idx[0]]

## 2. Normalization

In [2]:
path = os.path.join(os.getcwd(), 'data', 'iris.csv')
df = pd.read_csv(path, header=None)
df.columns = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class']
df['class'] = df['class'].astype('category')
print(df.describe())
df.head(20)

       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000


Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [3]:
# Min-max normalization within each class: transform() applies the formula per group
df_norm = df.groupby('class').transform(lambda x: (x - x.min()) / (x.max() - x.min()))
df_norm['class']=df['class']
print(df_norm.describe())
df_norm.head(20)

       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      0.508889    0.522044     0.517963    0.421530
std       0.233120    0.203712     0.224650    0.262814
min       0.000000    0.000000     0.000000    0.000000
25%       0.333333    0.375000     0.343750    0.200000
50%       0.500000    0.523810     0.523810    0.375000
75%       0.666667    0.642857     0.666667    0.625000
max       1.000000    1.000000     1.000000    1.000000


Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,0.533333,0.571429,0.444444,0.2,Iris-setosa
1,0.4,0.333333,0.444444,0.2,Iris-setosa
2,0.266667,0.428571,0.333333,0.2,Iris-setosa
3,0.2,0.380952,0.555556,0.2,Iris-setosa
4,0.466667,0.619048,0.444444,0.2,Iris-setosa
5,0.733333,0.761905,0.777778,0.6,Iris-setosa
6,0.2,0.52381,0.444444,0.4,Iris-setosa
7,0.466667,0.52381,0.555556,0.2,Iris-setosa
8,0.066667,0.285714,0.444444,0.2,Iris-setosa
9,0.4,0.380952,0.555556,0.0,Iris-setosa


In [14]:
from pandas.api.types import is_categorical_dtype

# Min-max normalization over each whole column; the categorical column is returned unchanged
df_norm = df.apply(lambda x: x if is_categorical_dtype(x) else (x - np.min(x))/(np.max(x)-np.min(x)))
print(df_norm.describe())
df_norm.head(20)

       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      0.428704    0.439167     0.467571    0.457778
std       0.230018    0.180664     0.299054    0.317984
min       0.000000    0.000000     0.000000    0.000000
25%       0.222222    0.333333     0.101695    0.083333
50%       0.416667    0.416667     0.567797    0.500000
75%       0.583333    0.541667     0.694915    0.708333
max       1.000000    1.000000     1.000000    1.000000


Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,0.222222,0.625,0.067797,0.041667,Iris-setosa
1,0.166667,0.416667,0.067797,0.041667,Iris-setosa
2,0.111111,0.5,0.050847,0.041667,Iris-setosa
3,0.083333,0.458333,0.084746,0.041667,Iris-setosa
4,0.194444,0.666667,0.067797,0.041667,Iris-setosa
5,0.305556,0.791667,0.118644,0.125,Iris-setosa
6,0.083333,0.583333,0.067797,0.083333,Iris-setosa
7,0.194444,0.583333,0.084746,0.041667,Iris-setosa
8,0.027778,0.375,0.067797,0.041667,Iris-setosa
9,0.166667,0.458333,0.084746,0.0,Iris-setosa
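The two cells above both apply min-max normalization, (x - min) / (max - min), but over different scopes: once within each class and once over the whole column. A minimal sketch on made-up data (toy, min_max and the column names are illustrative) shows how the results differ:

```python
import pandas as pd

# Hypothetical data: two groups on very different scales.
toy = pd.DataFrame({
    'length': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    'group':  ['a', 'a', 'a', 'b', 'b', 'b'],
})

def min_max(x):
    """Rescale a Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

# Global scaling uses the minimum and maximum of the whole column.
global_scaled = min_max(toy['length'])
# Per-group scaling uses each group's own minimum and maximum.
group_scaled = toy.groupby('group')['length'].transform(min_max)

print(group_scaled.tolist())  # [0.0, 0.5, 1.0, 0.0, 0.5, 1.0]
print(global_scaled.tolist())
```

Per-group scaling removes the difference in scale between groups, whereas global scaling preserves it; which one is appropriate depends on the analysis.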


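<font color='red'>Consult references and implement z-score normalization.</font>

Zero-mean normalization, also called standard-deviation standardization (z-score), transforms the data so that it has mean 0 and standard deviation 1. The formula is: (data - data.mean()) / data.std()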

In [8]:
df_zzz = df.transform(lambda x: (x - np.mean(x))/np.std(x))
print(df_zzz.describe())
df_zzz.head(20)

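TypeError: ('Categorical cannot perform the operation mean', 'occurred at index class')

The call fails because transform() is applied to every column, including the categorical class column, which does not support mean. The formula has to be applied to the numeric columns only.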
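One possible way to complete the exercise is to select the numeric columns first and apply the formula (data - data.mean()) / data.std() to them. This is a sketch on a made-up frame, not the only solution; toy, numeric and z are hypothetical names. Note that pandas std() uses the sample standard deviation (ddof=1), while np.std() defaults to ddof=0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two numeric columns and one categorical column.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [10.0, 20.0, 30.0, 40.0],
    'label': pd.Categorical(['a', 'b', 'a', 'b']),
})

# Keep only the numeric columns, so the categorical one is never touched.
numeric = toy.select_dtypes(include='number')

# z-score: subtract each column's mean and divide by its standard deviation.
z = (numeric - numeric.mean()) / numeric.std()
# Re-attach the categorical column unchanged.
z['label'] = toy['label']

print(z[['x', 'y']].mean().round(6).tolist())
print(z[['x', 'y']].std().round(6).tolist())
```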

## 3. Dimensionality Reduction

Dimensionality reduction usually relies on the classes in sklearn.decomposition, which are covered later.

[Reference]

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition
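As a preview, the sketch below uses PCA, one of the classes in sklearn.decomposition, on randomly generated data. The choice of PCA, the random data and the number of components are assumptions made for illustration. PCA projects the data onto new axes rather than deleting original attributes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with 4 numeric attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Project the 4 attributes onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```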

## 4. Discretization

The cut() function in Pandas makes it easy to bin data.

For a description of cut(), see:

https://www.cnblogs.com/sench/p/10128216.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

In addition, the pandas qcut() function bins data by percentile (quantile).

In [19]:
df.petallength = pd.cut(df.petallength, 5)
df.petallength.values

[(0.994, 2.18], (0.994, 2.18], (0.994, 2.18], (0.994, 2.18], (0.994, 2.18], ..., (4.54, 5.72], (4.54, 5.72], (4.54, 5.72], (4.54, 5.72], (4.54, 5.72]]
Length: 150
Categories (5, interval[float64]): [(0.994, 2.18] < (2.18, 3.36] < (3.36, 4.54] < (4.54, 5.72] < (5.72, 6.9]]

<font color=red>Is this equal-depth (equal-frequency) or equal-width binning? Consult references and implement the other binning method.</font>
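For comparison, here is a minimal sketch on a made-up series with one outlier (values is hypothetical). It shows how cut() and qcut() distribute the same values: cut() splits the value range into intervals of equal width, while qcut() chooses the edges so that each bin holds about the same number of values.

```python
import pandas as pd

# Hypothetical values with a single large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut(): two intervals of equal width over the range [1, 100].
equal_width = pd.cut(values, 2)
# qcut(): two intervals split at the median.
equal_freq = pd.qcut(values, 2)

print(equal_width.value_counts().sort_index().tolist())  # [9, 1]
print(equal_freq.value_counts().sort_index().tolist())   # [5, 5]
```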