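# Data Preprocessing (Missing Data, One-hot Encoding, Feature Scaling)
Data quality and feature selection set the upper bound on what machine learning can achieve; the model only approximates that bound.<br>
Academic discussion tends to center on the model, but in industry about 80% of the time is spent on preprocessing the data,<br>
which includes data acquisition, cleaning, feature selection, and feature processing. This gives a sense of why data preprocessing matters.<br>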

Common preprocessing steps are:<br>
1. Handling missing values<br>
2. Handling categorical data (ordinal and nominal)<br>
3. Feature scaling<br>

In [19]:
import numpy as np
import pandas as pd
from sklearn import datasets
from io import StringIO
import math

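## 1. Handling Missing Values
There are two main ways to handle missing values:<br>
* Drop them, if there is enough data<br>
* Impute (fill in) a value<br>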

In [13]:
csv_data = '''A,B,C,D,E
            5.0,2.0,3.0,,6
            1.0,6.0,,8.0,5
            0.0,11.0,12.0,4.0,5
            3.0,,3.0,5.0,
            5.0,1.0,4.0,2.0,4
           '''
df = pd.read_csv(StringIO(csv_data))

In [9]:
# Optional: load the same table from a local file instead (assumes preText.csv exists)
df = pd.read_csv('preText.csv')

In [16]:
df

Unnamed: 0,A,B,C,D,E
0,5.0,2.0,3.0,,6.0
1,1.0,6.0,,8.0,5.0
2,0.0,11.0,12.0,4.0,5.0
3,3.0,,3.0,5.0,
4,5.0,1.0,4.0,2.0,4.0


In [75]:
df.fillna(0)

Unnamed: 0,A,B,C,D,E
0,5.0,2.0,3.0,0.0,6.0
1,1.0,6.0,0.0,8.0,5.0
2,0.0,11.0,12.0,4.0,5.0
3,3.0,0.0,3.0,5.0,0.0
4,5.0,1.0,4.0,2.0,4.0


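## Missing Data (handling null values)

Use the pandas method dropna to remove missing values. By default, dropna drops an entire row if any of its columns is null.<br>
This can be adjusted with parameters: setting how to 'all' drops a row only when all of its values are null, and subset drops a row only when the specified columns are null.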
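
The three behaviours can be compared side by side. This is a minimal sketch on a small made-up frame (the name `demo` and its values are illustrative and are not the data used in this notebook):

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 2 is entirely empty, row 3 is complete
demo = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [np.nan, 2.0, np.nan, 5.0],
    "C": [3.0, 6.0, np.nan, 7.0],
})

drop_any = demo.dropna()                 # default how='any': drop rows with at least one NaN
drop_all = demo.dropna(how="all")        # drop only rows in which every value is NaN
drop_subset = demo.dropna(subset=["B"])  # drop rows in which column B is NaN

print(drop_any)
print(drop_all)
print(drop_subset)
```

Only the complete row survives the default call, while `how="all"` removes just the fully empty row.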

In [76]:
df.dropna(axis=0)

Unnamed: 0,A,B,C,D,E
2,0.0,11.0,12.0,4.0,5.0
4,5.0,1.0,4.0,2.0,4.0


In [97]:
print("NaN <- mean:")
print(df.fillna(df.mean()))

NaN <- mean:
     A     B     C     D    E
0  5.0   2.0   3.0  4.75  6.0
1  1.0   6.0   5.5  8.00  5.0
2  0.0  11.0  12.0  4.00  5.0
3  3.0   5.0   3.0  5.00  5.0
4  5.0   1.0   4.0  2.00  4.0


In [107]:
print("NaN <- median:")
print(df.fillna(df.median()))

NaN <- median:
     A     B     C    D    E
0  5.0   2.0   3.0  4.5  6.0
1  1.0   6.0   3.5  8.0  5.0
2  0.0  11.0  12.0  4.0  5.0
3  3.0   4.0   3.0  5.0  5.0
4  5.0   1.0   4.0  2.0  4.0


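Dropping only the rows where column C is null, which is consistent with df.dropna(subset=['C']), gives:

Unnamed: 0,A,B,C,D,E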
0,5.0,2.0,3.0,,6.0
2,0.0,11.0,12.0,4.0,5.0
3,3.0,,3.0,5.0,
4,5.0,1.0,4.0,2.0,4.0


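To impute values, use the fillna function. The examples in this section fill missing entries with a fixed value of 0, the mean, the mode, or the median.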

In [111]:
print("NaN <- 0:")
print(df.fillna(0))

NaN <- 0:
     A     B     C    D    E
0  5.0   2.0   3.0  0.0  6.0
1  1.0   6.0   0.0  8.0  5.0
2  0.0  11.0  12.0  4.0  5.0
3  3.0   0.0   3.0  5.0  0.0
4  5.0   1.0   4.0  2.0  4.0


In [110]:
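print("NaN <- mode:")
# df.mode() returns a DataFrame (one row per tied mode); take its first row
# so that every NaN in a column is filled with that column's mode
print(df.fillna(df.mode().iloc[0]))

NaN <- mode:
     A     B     C    D    E
0  5.0   2.0   3.0  2.0  6.0
1  1.0   6.0   3.0  8.0  5.0
2  0.0  11.0  12.0  4.0  5.0
3  3.0   1.0   3.0  5.0  5.0
4  5.0   1.0   4.0  2.0  4.0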
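
The same kind of imputation can also be done with scikit-learn's `SimpleImputer`, which stores the fill values it learned so they can be reused on new data. This is a minimal sketch, not part of the original notebook; the frame `toy` reuses columns A, B, and C from the table above and the other names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Columns A, B, C from the example table, with the same missing entries
toy = pd.DataFrame({
    "A": [5.0, 1.0, 0.0, 3.0, 5.0],
    "B": [2.0, 6.0, 11.0, np.nan, 1.0],
    "C": [3.0, np.nan, 12.0, 3.0, 4.0],
})

imputer = SimpleImputer(strategy="median")  # also accepts "mean", "most_frequent", "constant"
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)  # the per-column medians learned in fit
print(filled)
```

The learned medians for B and C (4.0 and 3.5) agree with the fillna(df.median()) output shown earlier.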


Unnamed: 0,A,B,C,D,E
0,5.0,2.0,3.0,4.5,6.0
1,1.0,6.0,,8.0,5.0
2,0.0,11.0,12.0,4.0,5.0
3,3.0,5.0,3.0,5.0,4.0
4,5.0,1.0,4.0,2.0,4.0


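## 2. Categorical Data
Because each sample is represented as a point in space, every feature must be numeric. Categorical data,<br>
such as XL, L, M, S, XS or Male, Female, Not Specified, therefore has to be converted to numbers before it can be represented in that space.<br>
Ordinal categories are usually replaced directly with numbers. For example, XL, L, M, S, XS are categories, but because they have a size order they can be replaced with 10, 7, 5, 3, 1.<br>Male, Female, and Not Specified, on the other hand, are equivalent with no order among them, so we need an encoding that places all three at the same distance from the origin.<br>
One-hot encoding solves this problem. It first splits the single Gender column into three columns, one each for Male, Female, and Not Specified.<br>
User 1 then has the attributes (1,0,0), user 2 has (0,1,0), and user 3 has (0,0,1). All three users are at distance 1 from the origin, which is the result we want.<br>
However, one-hot encoding is only suitable when the number of categories is small. With too many categories it generates a large number of features, which causes other problems (for example, the curse of dimensionality).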
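
The distance argument can be checked numerically. A minimal sketch using the three one-hot vectors from the text (the variable names are illustrative):

```python
import numpy as np

# One-hot vectors for the three unordered categories
male = np.array([1, 0, 0])
female = np.array([0, 1, 0])
not_specified = np.array([0, 0, 1])
one_hot = np.stack([male, female, not_specified])

# Distance of each category from the origin
dist_to_origin = np.linalg.norm(one_hot, axis=1)
# Distance between every pair of categories
pairwise = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)

print(dist_to_origin)  # all equal to 1
print(pairwise)        # every pair of distinct categories is sqrt(2) apart

# Integer codes 0, 1, 2 would instead imply an order and unequal gaps
codes = np.array([0, 1, 2])
print(np.abs(codes[:, None] - codes[None, :]))
```

With integer codes, the first and third categories end up twice as far apart as neighbouring ones, which is the artificial ordering one-hot encoding avoids.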

In [10]:
df2 = pd.DataFrame(
    [['green', 'M', 10.1, 1],
    ['red', 'L', 13.5, 2],
    ['blue', 'XL', 15.3, 1]]
)
df2.columns = ['color', 'size', 'price', 'classlabel']
df2

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,1
1,red,L,13.5,2
2,blue,XL,15.3,1


Since size is ordinal, we only need to convert it to numeric values.

In [26]:
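from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
df3 = df2.copy()  # copy so that df2 itself is not modified
# Caution: LabelEncoder assigns codes in sorted (alphabetical) order,
# L=0, M=1, XL=2, so the size order M < L < XL is not preserved
df3['size'] = enc.fit_transform(df2['size'])
df3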


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,1
1,red,0,13.5,2
2,blue,2,15.3,1
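
LabelEncoder assigns its codes in alphabetical order, which is why the output above has M=1, L=0, XL=2. When the order matters, an explicit mapping keeps it. This is a minimal sketch assuming the intended order is M < L < XL; `size_mapping`, `df_sizes`, and `df_ordinal` are illustrative names:

```python
import pandas as pd

df_sizes = pd.DataFrame(
    [['green', 'M', 10.1, 1],
     ['red', 'L', 13.5, 2],
     ['blue', 'XL', 15.3, 1]],
    columns=['color', 'size', 'price', 'classlabel'],
)

# Hand-written mapping that encodes the assumed order M < L < XL
size_mapping = {'M': 1, 'L': 2, 'XL': 3}

df_ordinal = df_sizes.copy()  # leave the original frame untouched
df_ordinal['size'] = df_ordinal['size'].map(size_mapping)
print(df_ordinal)
```

This yields size values 1, 2, 3 for M, L, XL.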


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,1
1,red,2,13.5,2
2,blue,3,15.3,1


For color we use one-hot encoding. In pandas,<br>
one-hot encoding is available through the get_dummies function, as in the example below.


In [20]:
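# dtype=int keeps the 0/1 output shown below (recent pandas versions return booleans by default)
pd.get_dummies(df2['color'], dtype=int)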

Unnamed: 0,blue,green,red
0,0,1,0
1,0,0,1
2,1,0,0


Unnamed: 0,size,price,classlabel
0,1,10.1,1
1,2,13.5,2
2,3,15.3,1


Unnamed: 0,color_blue,color_green,color_red,size,price,classlabel
0,0,1,0,1,10.1,1
1,0,0,1,2,13.5,2
2,1,0,0,3,15.3,1
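
A combined table like the one above can be produced by prefixing the dummy columns and joining them back to the remaining columns. This is one possible sketch and not necessarily how the table was generated; `df_items`, `color_dummies`, and `encoded` are illustrative names:

```python
import pandas as pd

df_items = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
    'price': [10.1, 13.5, 15.3],
    'classlabel': [1, 2, 1],
})

# One-hot encode color; prefix='color' gives color_blue, color_green, color_red
color_dummies = pd.get_dummies(df_items['color'], prefix='color', dtype=int)

# Put the dummy columns first, followed by the columns that were not encoded
encoded = pd.concat([color_dummies, df_items.drop(columns='color')], axis=1)
print(encoded)
```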


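# 3. Feature Scaling
## Normalization
The most common normalization is scaling each feature to the 0-1 range. After normalization the data lies between 0 and 1:<br>
the original maximum becomes 1 and the original minimum becomes 0. The formula is shown below.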


In [2]:
from IPython.display import Math

In [3]:
Math(r'x^{(i)}_{norm}=\frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}')



$$x^{(i)}_{norm}=\frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}$$

In [27]:
import numpy as np
import pandas as pd
from sklearn import datasets
from io import StringIO

iris = datasets.load_iris()
x = pd.DataFrame(iris['data'], columns=iris['feature_names'])
print("target_names: "+str(iris['target_names']))
y = pd.DataFrame(iris['target'], columns=['target_names'])
data = pd.concat([x,y], axis=1)
data.head(3)

target_names: ['setosa' 'versicolor' 'virginica']


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target_names
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0


In [33]:
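from sklearn.preprocessing import Normalizer
# Caution: Normalizer rescales each row (sample) to unit L2 norm; it is not the
# column-wise 0-1 scaling of the formula above (MinMaxScaler implements that)
scaler = Normalizer()
scaler.fit_transform(data)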

array([[0.80377277, 0.55160877, 0.22064351, 0.0315205 , 0.        ],
       [0.82813287, 0.50702013, 0.23660939, 0.03380134, 0.        ],
       [0.80533308, 0.54831188, 0.2227517 , 0.03426949, 0.        ],
       [0.80003025, 0.53915082, 0.26087943, 0.03478392, 0.        ],
       [0.790965  , 0.5694948 , 0.2214702 , 0.0316386 , 0.        ],
       [0.78417499, 0.5663486 , 0.2468699 , 0.05808704, 0.        ],
       [0.78010936, 0.57660257, 0.23742459, 0.0508767 , 0.        ],
       [0.80218492, 0.54548574, 0.24065548, 0.0320874 , 0.        ],
       [0.80642366, 0.5315065 , 0.25658935, 0.03665562, 0.        ],
       [0.81803119, 0.51752994, 0.25041771, 0.01669451, 0.        ],
       [0.80373519, 0.55070744, 0.22325977, 0.02976797, 0.        ],
       [0.786991  , 0.55745196, 0.26233033, 0.03279129, 0.        ],
       [0.82307218, 0.51442011, 0.24006272, 0.01714734, 0.        ],
       [0.8025126 , 0.55989251, 0.20529392, 0.01866308, 0.        ],
       [0.81120865, 0.55945424, 0.
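
Normalizer rescales every row to unit length, so the array above is not the column-wise 0-1 scaling described by the formula. A minimal sketch of that formula using scikit-learn's `MinMaxScaler`, applied to the feature columns only and checked against a manual computation (the names `features`, `scaled`, and `manual` are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
features = pd.DataFrame(iris['data'], columns=iris['feature_names'])

# Column-wise scaling: (x - x_min) / (x_max - x_min)
minmax = MinMaxScaler()
scaled = pd.DataFrame(minmax.fit_transform(features), columns=features.columns)

# The same formula written out by hand
manual = (features - features.min()) / (features.max() - features.min())

print(scaled.head())
print(scaled.min().tolist(), scaled.max().tolist())
```

Each column now has minimum 0 and maximum 1, and the scaler output matches the manual formula.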

## Standardization
After standardization, the data has a mean of 0 and a standard deviation of 1.

In [4]:
Math(r'x^{(i)}_{std}=\frac{x^{(i)}-\mu_{x}}{\sigma_{x}}')



$$x^{(i)}_{std}=\frac{x^{(i)}-\mu_{x}}{\sigma_{x}}$$

In [34]:
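from sklearn.preprocessing import StandardScaler
# Note: data still contains the label column target_names, so it is standardized too;
# in practice, scale only the feature columns (x)
scaler = StandardScaler()
scaler.fit_transform(data)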

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00, -1.22474487e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00, -1.22474487e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00, -1.22474487e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00, -1.22474487e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00, -1.22474487e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00, -1.22474487e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00, -1.22474487e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00, -1.22474487e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00, -1.22474487e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00
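
To make the standardization formula concrete, here is a small sketch with made-up numbers (the array `x` is illustrative). It uses the population standard deviation (ddof=0), which is also what StandardScaler uses:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # mean of the feature: 5.0
sigma = x.std()   # population standard deviation (ddof=0): 2.0

x_std = (x - mu) / sigma
print(x_std)                      # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
print(x_std.mean(), x_std.std())  # mean 0 and standard deviation 1
```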

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target_names
0,0.222222,1.015602,0.067797,0.041667,0
1,0.166667,-0.131539,0.067797,0.041667,0
2,0.111111,0.327318,0.050847,0.041667,0
3,0.083333,0.097889,0.084746,0.041667,0
4,0.194444,1.24503,0.067797,0.041667,0
