# Making Predictions with Scikit-Learn
### Scikit-Learn provides support in three areas.
1. Obtaining data: ***sklearn.datasets***
2. Preparing data: ***sklearn.preprocessing*** 
3. Machine learning: ***sklearn Estimator API*** 

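There are many ways to obtain data (files, databases, web crawlers, Kaggle Datasets, and so on).<br>
The simplest is to import one of the datasets built into sklearn. Because they are readily available and need no download, they are commonly called **toy datasets**.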
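
As a minimal sketch of loading built-in toy datasets: `load_wine` and `load_digits` are not used elsewhere in this notebook and are shown here only as further examples of loaders shipped in `sklearn.datasets`.

```python
from sklearn import datasets

# return_X_y=True returns the feature matrix and the target vector directly
X_iris, y_iris = datasets.load_iris(return_X_y=True)
X_wine, y_wine = datasets.load_wine(return_X_y=True)
X_digits, y_digits = datasets.load_digits(return_X_y=True)

for name, X in [("iris", X_iris), ("wine", X_wine), ("digits", X_digits)]:
    print(name, X.shape)
```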

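# Basic workflow

* Load the data and pre-process it
* Split into training and test sets 
* Fit the model
* Predict 
* Evaluate (the score may be an error value, an accuracy, or another metric)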
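
The five steps can be sketched end to end as follows. This is only an illustration: the choice of `LogisticRegression`, `max_iter=1000`, and `random_state=0` are assumptions and do not come from this notebook, which focuses on the data-handling steps.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load the data (pre-processing is omitted in this sketch)
X, y = datasets.load_iris(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Fit the model on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Predict on the held-out test set
y_pred = model.predict(X_test)

# 5. Evaluate (accuracy here; other metrics are possible)
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
```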


In [1]:
%matplotlib inline

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


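## Loading the Iris dataset and pre-processing

Iris Flowers dataset

This project uses the Iris Data Set. Each sample has 4 features and 1 class label. The dataset contains 3 classes with 50 samples each, 150 samples in total.

Attribute information:

    sepal length (cm)
    sepal width (cm)
    petal length (cm)
    petal width (cm)
    class:
        Iris Setosa
        Iris Versicolour
        Iris Virginica

The feature values are numeric and all share the same unit (centimeters).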
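
The sample, feature, and class counts stated above can be checked directly on the loaded object. A small sketch:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)              # (n_samples, n_features)
print(iris.feature_names)

counts = np.bincount(iris.target)   # number of samples per class label
print(dict(zip(iris.target_names, counts)))
```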

![Iris Flowers](images/iris_data.PNG)


In [2]:
iris = datasets.load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

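* Print the keys of iris (the `filename` key holds the file location)
* View the first 10 rows
* Check the data type
* Print the class labels of the samples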

In [11]:
print(iris.keys())
print(iris.data[0:10])
print(type(iris.data))
print(iris.target_names)
print(iris.target)

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
<class 'numpy.ndarray'>
['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [12]:
# we only take the first two features. 
X=iris.data[:,:2]
print(X.shape)
Y=iris.target
print(Y.shape)

(150, 2)
(150,)


In [13]:
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [15]:
# Build a pandas DataFrame (optional; the NumPy arrays can also be used directly)
x = pd.DataFrame(iris.data, columns=iris['feature_names'])
x.head(10)

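| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |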


In [19]:
print ("target_names: "+str(iris['target_names']))

target_names: ['setosa' 'versicolor' 'virginica']


In [17]:
# Create the target column and its data
y=pd.DataFrame(iris['target'], columns=['target'])
y.head(10)

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |


In [26]:
# Merge the feature columns with the target column
iris_data=pd.concat([x,y],axis=1)
# The following line keeps only the selected columns
iris_data=iris_data[['sepal length (cm)','petal length (cm)','target']]
iris_data.head(10)

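| | sepal length (cm) | petal length (cm) | target |
|---|---|---|---|
| 0 | 5.1 | 1.4 | 0 |
| 1 | 4.9 | 1.4 | 0 |
| 2 | 4.7 | 1.3 | 0 |
| 3 | 4.6 | 1.5 | 0 |
| 4 | 5.0 | 1.4 | 0 |
| 5 | 5.4 | 1.7 | 0 |
| 6 | 4.6 | 1.4 | 0 |
| 7 | 5.0 | 1.5 | 0 |
| 8 | 4.4 | 1.4 | 0 |
| 9 | 4.9 | 1.5 | 0 |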
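
As an alternative to concatenating the feature and target frames by hand, recent scikit-learn versions can return a ready-made DataFrame. The sketch below assumes a version that supports `as_frame=True`; the `frame` key in the `iris.keys()` output above suggests this one does.

```python
from sklearn import datasets

# .frame holds the four feature columns plus a 'target' column
iris_df = datasets.load_iris(as_frame=True).frame

# Keep the same three columns as in the manual approach
subset = iris_df[['sepal length (cm)', 'petal length (cm)', 'target']]
print(subset.head(10))
```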


In [30]:
# Keep only the rows whose target is 0 or 1
iris_data=iris_data[iris_data['target'].isin([0,1])]
iris_data
# Number of samples in the original dataset (before filtering)
print(iris['data'].size/len(iris['feature_names']))

150.0


## Splitting into training and test sets
> train_test_split()

In [45]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(iris_data[['sepal length (cm)','petal length (cm)']],iris_data['target'],test_size=0.2)


In [46]:
X_train.head()
#X_train.shape

| | sepal length (cm) | petal length (cm) |
|---|---|---|
| 7 | 5.0 | 1.5 |
| 37 | 4.9 | 1.4 |
| 16 | 5.4 | 1.3 |
| 1 | 4.9 | 1.4 |
| 25 | 5.0 | 1.6 |


In [47]:
X_test.head()
#X_test.shape

| | sepal length (cm) | petal length (cm) |
|---|---|---|
| 85 | 6.0 | 4.5 |
| 26 | 5.0 | 1.6 |
| 53 | 5.5 | 4.0 |
| 97 | 6.2 | 4.3 |
| 55 | 5.7 | 4.5 |


In [48]:
Y_train.head()
#Y_train.shape

7     0
37    0
16    0
1     0
25    0
Name: target, dtype: int32

In [50]:
Y_test.head()
Y_test.shape

(20,)
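
The split above draws different rows on every run, and the class ratio in the test set is left to chance. A reproducible, class-balanced variant might look like the sketch below; `random_state=42` and the use of `stratify` are assumptions added for illustration and are not part of the original cells.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_df = datasets.load_iris(as_frame=True).frame
binary = iris_df[iris_df['target'].isin([0, 1])]   # 100 rows, 50 per class

X_tr, X_te, y_tr, y_te = train_test_split(
    binary[['sepal length (cm)', 'petal length (cm)']],
    binary['target'],
    test_size=0.2,
    random_state=42,             # assumed seed; any fixed integer works
    stratify=binary['target'],   # keep the 0/1 ratio equal in both subsets
)
print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```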

# Appendix 

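>Normalization and standardization serve a similar purpose.<br>
Both pre-process the data so that every feature falls into a common numeric range, so that no feature is treated differently during modeling merely because of its scale.<br> 
* Normalization usually rescales the data to a chosen range, typically [0, 1], which removes the effect of the features' units on the model.<br> 
* Standardization usually rescales the data to zero mean and unit variance; it shifts and scales the values but does not change the shape of their distribution.<br> 

Both are therefore applied to the data itself, to remove the feature-importance bias caused by differences in numeric scale.<br>
Scaled data can also speed up training and help the algorithm converge.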
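
A small worked example on a made-up five-element array (the values are chosen only for illustration) shows the difference between the two rescalings:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max normalization: maps the smallest value to 0 and the largest to 1
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit variance (population std, ddof=0)
x_std = (x - x.mean()) / x.std()

print(x_norm)                      # values 0, 0.25, 0.5, 0.75, 1
print(x_std.mean(), x_std.std())   # approximately 0 and 1
```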

### Standardization (z-score)
    Compute the mean and standard deviation on the training set so that the same transformation can later be reapplied to the test set.

In [71]:
from IPython.display import Math
Math(r'x^{(i)}_{norm}=\frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}')


<IPython.core.display.Math object>

In [72]:
Math(r'x^{(i)}_{std}=\frac{x^{(i)}-\mu_{x}}{\sigma_{x}}')

<IPython.core.display.Math object>

In [73]:
def norm_stats(dfs):
    # Column-wise statistics; ddof=0 gives the population standard deviation
    minimum = dfs.min()
    maximum = dfs.max()
    mu = dfs.mean()
    sigma = dfs.std(ddof=0)
    return (minimum, maximum, mu, sigma)

def z_score(col, stats):
    m, M, mu, s = stats
    df = pd.DataFrame()
    for c in col.columns:
        df[c] = (col[c]-mu[c])/s[c]  # z-score of each column
    return df

In [None]:
stats = norm_stats(X_train)  
arr_x_train = np.array(z_score(X_train, stats)) 
arr_x_train

## Using sklearn

In [78]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)  #Compute the statistics to be used for later scaling.
print(sc.mean_)  #mean
print(sc.scale_) #standard deviation

[5.4225  2.72875]
[0.66010889 1.44820525]
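
`scale_` holds the population standard deviation (`ddof=0`), so a hand-written z-score agrees with `StandardScaler` only when it uses the same convention. A quick consistency check, sketched here on the full Iris data rather than on the `X_train` split above:

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Column-wise z-score with the population standard deviation (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)
via_sklearn = StandardScaler().fit_transform(df)

print(np.allclose(manual.to_numpy(), via_sklearn))
```

With the pandas default (`ddof=1`) the two results would differ slightly, because the sample standard deviation is a little larger.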


In [79]:
#transform: (x-u)/std.
X_train_std = sc.transform(X_train)
X_train_std[:5]

array([[-0.64004591, -0.84846399],
       [-0.79153607, -0.91751497],
       [-0.03408529, -0.98656596],
       [-0.79153607, -0.91751497],
       [-0.64004591, -0.779413  ]])

The scaler instance can then be used on new data to transform it the same way it did on the training set:

In [80]:
X_test_std = sc.transform(X_test)
print(X_test_std[:10])

[[ 0.87485566  1.22306559]
 [-0.64004591 -0.779413  ]
 [ 0.11740487  0.87781066]
 [ 1.17783597  1.08496361]
 [ 0.42038519  1.22306559]
 [ 0.7233655   1.01591263]
 [ 0.26889503  0.80875967]
 [ 0.26889503  0.60160671]
 [-1.24600654 -0.91751497]
 [-0.03408529 -0.71036202]]


You can also use the fit_transform method (i.e., fit and then transform) on the training set. The test set should only be transformed with the scaler fitted on the training data, so it is not refitted here.

In [81]:
X_train_std = sc.fit_transform(X_train)  
X_test_std = sc.transform(X_test)  # reuse the training statistics; do not refit on the test set
print(X_test_std[:10])


[[ 0.87485566  1.22306559]
 [-0.64004591 -0.779413  ]
 [ 0.11740487  0.87781066]
 [ 1.17783597  1.08496361]
 [ 0.42038519  1.22306559]
 [ 0.7233655   1.01591263]
 [ 0.26889503  0.80875967]
 [ 0.26889503  0.60160671]
 [-1.24600654 -0.91751497]
 [-0.03408529 -0.71036202]]


In [83]:
print('mean of X_train_std:',np.round(X_train_std.mean(),4))
print('std of X_train_std:',X_train_std.std())

mean of X_train_std: -0.0
std of X_train_std: 1.0
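
One way to make sure the scaler is fitted on the training set only is to bundle it with a model in a pipeline. The sketch below assumes `LogisticRegression` as the classifier and `random_state=0`, neither of which comes from this notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# fit() learns the scaling statistics and the classifier from the training set;
# score() re-applies those training statistics to the test set before predicting
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

test_acc = clf.score(X_test, y_test)
print(f"test accuracy: {test_acc:.3f}")
```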


## Min-Max Normalization
    Transforms features by scaling each feature to a given range.
    The transformation is given by:

    X' = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X: N-dimensional data
    


In [84]:
x1 = np.random.normal(50, 6, 100)  # np.random.normal(mu, sigma, size)
y1 = np.random.normal(5, 0.5, 100)

x2 = np.random.normal(30,6,100)
y2 = np.random.normal(4,0.5,100)
plt.scatter(x1,y1,c='b',marker='s',s=20,alpha=0.8)
plt.scatter(x2,y2,c='r', marker='^', s=20, alpha=0.8)

print(np.sum(x1)/len(x1))
print(np.sum(x2)/len(x2))

50.04860955125084
30.046657805652554


In [87]:
x_val = np.concatenate((x1,x2))
y_val = np.concatenate((y1,y2))
x_val
#x_val.shape

array([46.29646511, 49.12001187, 44.08948687, 49.33175531, 43.32044656,
       40.05608262, 49.57192497, 55.85577243, 44.02372791, 59.71833842,
       50.20901126, 48.52587009, 50.60215327, 46.82302648, 57.29807436,
       49.62448995, 56.83221545, 53.86979178, 46.85908611, 56.21877435,
       47.14617458, 44.93547917, 67.07130449, 56.56480235, 54.53824364,
       44.99734967, 54.31081418, 60.65562205, 55.95256839, 39.63565309,
       50.02912232, 30.39421395, 48.10327177, 44.19484327, 51.55302819,
       51.40576151, 50.2433001 , 48.41950413, 53.3579891 , 43.99525593,
       53.25902344, 58.53831643, 43.07982717, 54.8963132 , 48.58525828,
       53.41306618, 54.39912984, 42.22718253, 47.16067275, 50.47394019,
       44.33380847, 45.55709921, 49.03281478, 66.16191662, 45.09698201,
       43.26095067, 54.29524822, 52.94405442, 62.7445844 , 46.80424539,
       60.4457845 , 49.35433768, 50.21156329, 50.70530019, 40.25157998,
       52.77467997, 53.27969908, 58.50537905, 43.06679218, 44.72

In [88]:
def minmax_norm(X):
    return (X - X.min(axis=0)) / ((X.max(axis=0) - X.min(axis=0)))

In [89]:
minmax_norm(x_val[:10])

array([0.31737877, 0.46098115, 0.20513436, 0.47175018, 0.16602184,
       0.        , 0.48396493, 0.80355428, 0.20178993, 1.        ])

In [91]:
from sklearn.preprocessing import MinMaxScaler
x_val=x_val.reshape(-1, 1) #1D>2D
print(x_val.shape)
scaler = MinMaxScaler().fit(x_val)  # default range 0~1
print(scaler.data_max_)
print(scaler.transform(x_val)[:10])

(200, 1)
[67.07130449]
[[0.59169714]
 [0.64719033]
 [0.5483218 ]
 [0.65135188]
 [0.5332073 ]
 [0.46905041]
 [0.6560721 ]
 [0.77957308]
 [0.5470294 ]
 [0.85548687]]
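
The hand-written `minmax_norm` output and the `MinMaxScaler` output above differ because the minimum and maximum were taken from different sets of points (the first 10 values versus all 200). On the same data the two approaches agree, as this sketch suggests; the seeded generator is an assumption added for reproducibility and does not reproduce the exact numbers above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)            # assumed seed, for reproducibility
values = rng.normal(50, 6, 200)

def minmax_norm(X):
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

first10_only = minmax_norm(values[:10])   # min/max taken from 10 points
full_manual = minmax_norm(values)         # min/max taken from all 200 points
full_sklearn = MinMaxScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(np.allclose(full_manual, full_sklearn))
print(first10_only.min(), first10_only.max())
```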
