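# Making Predictions with Scikit-Learn
### Scikit-Learn provides support in three areas.
1. Obtaining data: ***sklearn.datasets***
2. Preparing data: ***sklearn.preprocessing***
3. Machine learning: ***sklearn Estimator API***

There are many ways to obtain data (files, databases, web scraping, Kaggle Datasets, and so on).<br>
The simplest is to import one of the datasets built into sklearn. Because they are readily available and need no download, they are usually called **toy datasets**.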
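
As a minimal sketch of what loading toy datasets looks like, the snippet below loads three of the built-in datasets and prints their shapes. The choice of `load_wine` and `load_digits` next to `load_iris` is only for illustration; the rest of this notebook uses Iris alone.

```python
from sklearn import datasets

# Each loader returns a Bunch object with .data (features) and .target (labels)
iris = datasets.load_iris()
wine = datasets.load_wine()
digits = datasets.load_digits()

for name, ds in [("iris", iris), ("wine", wine), ("digits", digits)]:
    print(name, ds.data.shape, ds.target.shape)
```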

# Basic structure of a Scikit-Learn program

1. Read the data and pre-process
```
pd.read_csv('data.csv')
```

2. Split into training and test sets
```
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2)
```

3. Fit the model
```
model = svm.SVC( ... )
model.fit(x_train,y_train)
```

4. Predict
```
pred_train_Y = model.predict(x_train)
pred_test_Y = model.predict(x_test)
```

5. Evaluate
```
model.score(x_train,y_train)
model.score(x_test,y_test)
```
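
The five steps above can be put together into one runnable script. This is a sketch under a few assumptions: the built-in Iris data stands in for `data.csv`, `svm.SVC()` is used with its default parameters, and `random_state=0` is chosen arbitrarily so the split is repeatable.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# 1. Read the data (Iris stands in for 'data.csv' here)
iris = datasets.load_iris()
x, y = iris.data, iris.target

# 2. Split into training and test sets (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# 3. Fit the model (default SVC parameters, an assumption)
model = svm.SVC()
model.fit(x_train, y_train)

# 4. Predict on both subsets
pred_train = model.predict(x_train)
pred_test = model.predict(x_test)

# 5. Evaluate: score() returns the mean accuracy for classifiers
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

Comparing the two scores is a quick overfitting check: a training accuracy far above the test accuracy suggests the model has memorized the training data.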

In [1]:
%matplotlib inline

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

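## Loading the Iris dataset and preprocessing

Iris Flowers dataset

This project uses the Iris Data Set. Each sample has 4 features and 1 class label. The dataset contains 3 classes with 50 samples each, for 150 samples in total.

Attribute information:

    Sepal length (cm)
    Sepal width (cm)
    Petal length (cm)
    Petal width (cm)
    Class:
        Iris Setosa
        Iris Versicolour
        Iris Virginica

The feature values are numeric and all share the same unit (centimeters).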

![Iris Flowers](images/iris_data.PNG)


In [2]:
iris = datasets.load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

* Print the keys of iris and the file location
* View the first 10 rows
* Check the data type
* Print the class labels of the samples

In [3]:
# keys of the iris object
print(iris.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


In [4]:
# filename of the iris data
iris.filename

'iris.csv'

In [5]:
# first 10 rows of the iris data
iris.data[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [6]:
# data type of the iris data
type(iris.data)

numpy.ndarray

In [7]:
# class names
print(iris.target_names)

# class label of each sample
print(iris.target)

['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [8]:
# take all rows and the first 2 feature columns
x = iris.data[:,:2]
print(x.shape)

y = iris.target
print(y.shape)

(150, 2)
(150,)


In [9]:
# Build a pandas DataFrame (optional; the NumPy array works as well)
x = pd.DataFrame(iris.data, columns=iris['feature_names'])
x.head(10)

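| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |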


In [10]:
# target class names
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [11]:
# build the target column
y = pd.DataFrame(iris.target, columns=['target'])
y.head()

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [12]:
# concatenate the feature columns and the target column
iris_data = pd.concat([x,y],axis=1)
iris_data.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
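
The same combined table can also be obtained in one call. The sketch below assumes a scikit-learn version whose `load_iris` accepts `as_frame=True` (the `'frame'` key printed earlier suggests it does), and compares the result with the DataFrame built by hand from `x` and `y`.

```python
import pandas as pd
from sklearn import datasets

# as_frame=True makes the loader return pandas objects;
# .frame holds the four features plus a 'target' column
frame = datasets.load_iris(as_frame=True).frame

# Manual construction, as in the cells above
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=['target'])
iris_data = pd.concat([x, y], axis=1)

print(frame.shape)
print((frame.to_numpy() == iris_data.to_numpy()).all())
```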


In [13]:
# select the two length features and the target
iris_data[['sepal length (cm)', 'petal length (cm)', 'target']]

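| | sepal length (cm) | petal length (cm) | target |
|---|---|---|---|
| 0 | 5.1 | 1.4 | 0 |
| 1 | 4.9 | 1.4 | 0 |
| 2 | 4.7 | 1.3 | 0 |
| 3 | 4.6 | 1.5 | 0 |
| 4 | 5.0 | 1.4 | 0 |
| ... | ... | ... | ... |
| 145 | 6.7 | 5.2 | 2 |
| 146 | 6.3 | 5.0 | 2 |
| 147 | 6.5 | 5.2 | 2 |
| 148 | 6.2 | 5.4 | 2 |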


In [14]:
# select the two length features and the target
iris_data_filter = iris_data[['sepal length (cm)', 'petal length (cm)', 'target']]

# keep only the rows whose target is 0 or 1
iris_data_filter = iris_data_filter[iris_data_filter['target'].isin([0,1])]

iris_data_filter

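| | sepal length (cm) | petal length (cm) | target |
|---|---|---|---|
| 0 | 5.1 | 1.4 | 0 |
| 1 | 4.9 | 1.4 | 0 |
| 2 | 4.7 | 1.3 | 0 |
| 3 | 4.6 | 1.5 | 0 |
| 4 | 5.0 | 1.4 | 0 |
| ... | ... | ... | ... |
| 95 | 5.7 | 4.2 | 1 |
| 96 | 5.7 | 4.2 | 1 |
| 97 | 6.2 | 4.3 | 1 |
| 98 | 5.1 | 3.0 | 1 |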


In [15]:
# number of samples (rows)
iris.data.shape[0]

150

In [16]:
# number of rows
print(f'Number of rows: {len(iris_data_filter)}')
# number of elements (rows x columns)
print(f'Number of elements: {iris_data_filter.size}')

Number of rows: 100
Number of elements: 300


## Splitting into training and test sets
> train_test_split()

In [17]:
# Load the iris dataset
iris = datasets.load_iris()

# Put all features into a DataFrame
x = pd.DataFrame(iris.data, columns=iris.feature_names)
# Put the target variable into a DataFrame
y = pd.DataFrame(iris.target, columns=['target'])

In [18]:
from sklearn.model_selection import train_test_split

# Split the dataset with train_test_split (30% held out for testing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [19]:
x_train.head()
x_train.shape

(105, 4)

In [20]:
x_test.head()
x_test.shape

(45, 4)
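
The sketch below illustrates two properties of `train_test_split` that the call above relies on or could benefit from. A fixed `random_state` makes the split repeatable, and the optional `stratify` argument (not used in this notebook, shown here as a suggestion) keeps the class proportions equal in both subsets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target

# Same random_state -> the same rows end up in the test set every time
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.3, random_state=42)
same_split = bool((a_test == b_test).all())

# stratify=y preserves the 50/50/50 class balance in train and test
_, _, ys_train, ys_test = train_test_split(
    x, y, test_size=0.3, random_state=42, stratify=y
)

print(a_train.shape, a_test.shape)
print(same_split)
print(np.bincount(ys_train), np.bincount(ys_test))
```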

# Appendix 

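>Normalization and standardization are closely related.<br>
Both preprocess the data so that the values fall into a common numeric range, so that no feature is treated differently during modeling merely because of its scale.<br> 
* Normalization usually confines the data to a required range, typically [0, 1], which removes the effect of the features' units on the model.<br> 
* Standardization usually means rescaling the data to a mean of 0 and a variance of 1 (the shape of the distribution is unchanged).<br> 

Normalization and standardization therefore operate on the data itself, removing the feature-importance bias that differences in magnitude would otherwise introduce.<br>
Scaled data can also speed up training and help the algorithm converge.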
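
As a small worked example of the two definitions, the sketch below applies both transformations to five made-up values. The array is purely illustrative, and the z-score here uses the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up sample values

# Min-max normalization: rescale linearly so min -> 0 and max -> 1
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): subtract the mean, divide by the std
standardized = (values - values.mean()) / values.std()

print(normalized)                               # [0.   0.25 0.5  0.75 1.  ]
print(standardized.mean(), standardized.std())
```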

### Standardization (z-score)
The mean and standard deviation are computed on the training set so that the same transformation can later be reapplied to the test set.

In [21]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()

# Put all features into a DataFrame
x = pd.DataFrame(iris.data, columns=iris.feature_names)
# Put the target variable into a DataFrame
y = pd.DataFrame(iris.target, columns=['target'])

# Split the dataset with train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [22]:
from IPython.display import Math, display

# Min-Max normalization formula
display(Math(r'x_{\text{norm}}^{(i)} = \frac{x^{(i)} - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}'))

# Z-Score standardization formula
display(Math(r'x_{\text{std}}^{(i)} = \frac{x^{(i)} - \mu_{x}}{\sigma_{x}}'))

$$x_{\text{norm}}^{(i)} = \frac{x^{(i)} - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}$$

$$x_{\text{std}}^{(i)} = \frac{x^{(i)} - \mu_{x}}{\sigma_{x}}$$

In [23]:
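# Compute the per-column statistics needed for scaling
def norm_stats(dfs):
    minimum = dfs.min()
    maximum = dfs.max()
    mu = dfs.mean()
    sigma = dfs.std()  # pandas default is the sample std (ddof=1)
    return (minimum, maximum, mu, sigma)

# Z-score standardization using precomputed statistics
def z_score(col, stats):
    m, M, mu, s = stats  # only mu and s are used here
    df = pd.DataFrame()
    for c in col.columns:
        df[c] = (col[c] - mu[c]) / s[c]
    return df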

In [24]:
# Compute the statistics on the training set
stats = norm_stats(x_train)

# Standardize x_train and x_test with the training-set statistics
x_train_normalized = z_score(x_train, stats)
x_test_normalized = z_score(x_test, stats)

# First five rows of the standardized x_train
x_train_normalized.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 81 | -0.411443 | -1.455024 | -0.099036 | -0.321854 |
| 133 | 0.548591 | -0.500165 | 0.714277 | 0.351347 |
| 137 | 0.668595 | 0.21598 | 0.946652 | 0.755267 |
| 75 | 0.908603 | -0.022735 | 0.30762 | 0.216707 |
| 109 | 1.628629 | 1.409555 | 1.295215 | 1.697748 |


In [25]:
# Convert the standardized data to NumPy arrays
arr_x_train = x_train_normalized.to_numpy()
arr_y_train = y_train.to_numpy().ravel()
arr_x_test = x_test_normalized.to_numpy()
arr_y_test = y_test.to_numpy().ravel()

# First few rows of the converted NumPy arrays
print(arr_x_train[:5])
print(arr_y_train[:5])

[[-0.41144304 -1.45502429 -0.09903606 -0.32185409]
 [ 0.54859072 -0.5001646   0.71427682  0.35134669]
 [ 0.66859494  0.21598017  0.94665192  0.75526716]
 [ 0.90860338 -0.02273475  0.30762038  0.21670654]
 [ 1.6286287   1.40955478  1.29521458  1.69774825]]
[1 2 2 1 2]


## Using sklearn

In [26]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()

# Put all features into a DataFrame
x = pd.DataFrame(iris.data, columns=iris.feature_names)
# Put the target variable into a DataFrame
y = pd.DataFrame(iris.target, columns=['target'])

# Split the dataset with train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [27]:
from sklearn.preprocessing import StandardScaler

# Z-score standardization with StandardScaler
scaler = StandardScaler()

# Fit on the training set
scaler.fit(x_train)

# Inspect the mean and standard deviation
print("Mean of the features in the training set:")
print(scaler.mean_)  # mean
print("\nStandard deviation of the features in the training set:")
print(scaler.scale_) # standard deviation

Mean of the features in the training set:
[5.84285714 3.00952381 3.87047619 1.23904762]

Standard deviation of the features in the training set:
[0.82932642 0.41691013 1.71313824 0.73917525]
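
The values StandardScaler produces differ slightly from those of the hand-written `z_score` (for example -0.4134 versus -0.4114 for the first entry). A likely explanation, sketched below, is the choice of standard deviation: `scale_` holds the population standard deviation (`ddof=0`), whereas pandas' `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`). The variable names `pop_std` and `sample_std` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=iris.feature_names)
x_train, x_test = train_test_split(x, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(x_train)

pop_std = x_train.std(ddof=0).to_numpy()     # divides by n
sample_std = x_train.std(ddof=1).to_numpy()  # divides by n - 1 (pandas default)

print(np.allclose(scaler.scale_, pop_std))     # matches the population std
print(np.allclose(scaler.scale_, sample_std))  # does not match the sample std
```

With 105 training rows the two estimates differ by a factor of sqrt(105/104), about 0.5%, which is harmless for modeling but enough to show up in the printed digits.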


In [28]:
# Transform the training set with the fitted StandardScaler
x_train_std = scaler.transform(x_train)

# First five rows of the standardized x_train
print(x_train_std[:5])
print("\nStandardized x_train (first 5 rows):")
pd.DataFrame(x_train_std, columns=iris.feature_names).head()

[[-0.4134164  -1.46200287 -0.09951105 -0.32339776]
 [ 0.55122187 -0.50256349  0.71770262  0.35303182]
 [ 0.67180165  0.21701605  0.95119225  0.75888956]
 [ 0.91296121 -0.02284379  0.30909579  0.2177459 ]
 [ 1.63643991  1.41631528  1.30142668  1.70589097]]

Standardized x_train (first 5 rows):


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | -0.413416 | -1.462003 | -0.099511 | -0.323398 |
| 1 | 0.551222 | -0.502563 | 0.717703 | 0.353032 |
| 2 | 0.671802 | 0.217016 | 0.951192 | 0.75889 |
| 3 | 0.912961 | -0.022844 | 0.309096 | 0.217746 |
| 4 | 1.63644 | 1.416315 | 1.301427 | 1.705891 |


The scaler instance can then be used on new data to transform it the same way it did on the training set:

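- On the training set, use fit_transform to compute the mean and standard deviation and standardize the data.
- On the test set, use transform only, so that it is standardized with the mean and standard deviation computed on the training set.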

In [29]:
# Transform the test set with the scaler fitted on the training set
x_test_std = scaler.transform(x_test)

# First five rows of the standardized x_test
print(x_test_std[:5])
print("\nStandardized x_test (first 5 rows):")
pd.DataFrame(x_test_std, columns=iris.feature_names).head()

[[ 0.3100623  -0.50256349  0.484213   -0.05282593]
 [-0.17225683  1.89603497 -1.26695916 -1.27039917]
 [ 2.23933883 -0.98228318  1.76840592  1.43531914]
 [ 0.18948252 -0.26270364  0.36746819  0.35303182]
 [ 1.15412078 -0.50256349  0.54258541  0.2177459 ]]

Standardized x_test (first 5 rows):


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 0.310062 | -0.502563 | 0.484213 | -0.052826 |
| 1 | -0.172257 | 1.896035 | -1.266959 | -1.270399 |
| 2 | 2.239339 | -0.982283 | 1.768406 | 1.435319 |
| 3 | 0.189483 | -0.262704 | 0.367468 | 0.353032 |
| 4 | 1.154121 | -0.502563 | 0.542585 | 0.217746 |
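
One way to make the fit-on-train, transform-on-test rule hard to break is to chain the scaler and the model in a pipeline. The sketch below is a suggestion rather than part of the original workflow: it assumes `make_pipeline` with a default `SVC` as the classifier.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# fit() learns the scaling statistics from the training data only;
# score()/predict() reuse those statistics on whatever data is passed in
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(x_train, y_train)

test_accuracy = clf.score(x_test, y_test)
fitted_mean = clf.named_steps['standardscaler'].mean_
print(f"test accuracy: {test_accuracy:.3f}")
print(np.allclose(fitted_mean, x_train.mean(axis=0)))
```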


You can also use the fit_transform method (i.e., fit and then transform) on the training set. The test set must still go through transform only, so that it is scaled with the training-set statistics.

In [30]:
x_train_std_1 = scaler.fit_transform(x_train)  
x_test_std_1 = scaler.transform(x_test)  # transform only: do not refit on the test set
print(x_train_std_1[:5])
print(x_test_std_1[:5])

[[-0.4134164  -1.46200287 -0.09951105 -0.32339776]
 [ 0.55122187 -0.50256349  0.71770262  0.35303182]
 [ 0.67180165  0.21701605  0.95119225  0.75888956]
 [ 0.91296121 -0.02284379  0.30909579  0.2177459 ]
 [ 1.63643991  1.41631528  1.30142668  1.70589097]]
[[ 0.3100623  -0.50256349  0.484213   -0.05282593]
 [-0.17225683  1.89603497 -1.26695916 -1.27039917]
 [ 2.23933883 -0.98228318  1.76840592  1.43531914]
 [ 0.18948252 -0.26270364  0.36746819  0.35303182]
 [ 1.15412078 -0.50256349  0.54258541  0.2177459 ]]


In [31]:
print('mean of X_train_std:',np.round(x_train_std.mean(),4))
print('std of X_train_std:',x_train_std.std())

mean of X_train_std: 0.0
std of X_train_std: 0.9999999999999998


## Min-Max Normalization
    Transforms features by scaling each feature to a given range.
    The transformation is given by:

    X' = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X: N-dimensional data
    


In [32]:
x1 = np.random.normal(50, 6, 100)  # np.random.normal(mu, sigma, size)
y1 = np.random.normal(5, 0.5, 100)

x2 = np.random.normal(30,6,100)
y2 = np.random.normal(4,0.5,100)
plt.scatter(x1,y1,c='b',marker='s',s=20,alpha=0.8)
plt.scatter(x2,y2,c='r', marker='^', s=20, alpha=0.8)

print(np.sum(x1)/len(x1))
print(np.sum(x2)/len(x2))

50.044024198152336
29.139759188598415


In [33]:
x_val = np.concatenate((x1,x2))
y_val = np.concatenate((y1,y2))

x_val.shape

(200,)

In [34]:
def minmax_norm(X):
    return (X - X.min(axis=0)) / ((X.max(axis=0) - X.min(axis=0)))

In [35]:
# Note: the min and max are taken over these 10 values only, not the full array
minmax_norm(x_val[:10])

array([0.43262738, 0.57933131, 0.30007754, 0.79865669, 1.        ,
       0.20253716, 0.44474513, 0.12746552, 0.20677465, 0.        ])

## Using sklearn

In [36]:
x1 = np.random.normal(50, 6, 100)  # np.random.normal(mu, sigma, size)
y1 = np.random.normal(5, 0.5, 100)

x2 = np.random.normal(30,6,100)
y2 = np.random.normal(4,0.5,100)

x_val = np.concatenate((x1,x2))
y_val = np.concatenate((y1,y2))

In [37]:
from sklearn.preprocessing import MinMaxScaler

print(x_val.shape)
x_val=x_val.reshape(-1, 1) # 1D to 2D
print(x_val.shape)
print('---------------------------')

scaler = MinMaxScaler().fit(x_val)  # default range 0~1

print(scaler.data_max_)
print(scaler.data_min_)
print('---------------------------')
print(scaler.transform(x_val)[:10])

(200,)
(200, 1)
---------------------------
[65.97202715]
[13.6369082]
---------------------------
[[0.68000793]
 [0.65540459]
 [0.99509995]
 [0.60335064]
 [0.66972912]
 [0.69398207]
 [0.75124294]
 [0.51857719]
 [0.57317544]
 [0.80607758]]
