# Classifying Wines with the KNN Algorithm
Dataset source: scikit-learn (already cleaned)  

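This article introduces a supervised learning algorithm. Supervised learning is divided into regression and classification: the previous article, "Predicting Boston House Prices with Linear Regression", covered regression, and this one covers classification.

The dataset records 13 features of each wine sample, which are used to identify 3 wine classes: class_0, class_1, and class_2. X holds continuous variables and y is a categorical variable; the KNN (k-nearest neighbors) algorithm is used for the classification.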
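
As a rough illustration of the idea behind KNN, the sketch below classifies a query point by majority vote among its k nearest training points. The helper `knn_predict` and the toy 2-D points are made up for this example and are not part of scikit-learn or of the wine dataset; the rest of the article uses scikit-learn's own implementation.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among those neighbours is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data invented for illustration: two well-separated clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
train_labels = ["class_0", "class_0", "class_0",
                "class_1", "class_1", "class_1"]

prediction = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
print(prediction)  # class_0
```

Because the vote depends on raw distances, features measured on large scales can dominate the result, which is one reason the data is standardized later in this article.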

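As before, we follow the 8 steps of machine learning  
1. Collect data (Dataset)
2. Clean data (Data Cleaning)  
3. Feature engineering (Feature Engineering)
4. Split the data into training and test sets (Split)  
5. Choose an algorithm (Learning Algorithm)  
6. Train the model (Train Model)  
7. Score the model (Score Model)  
8. Evaluate the model (Evaluate Model)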

![Figure:](https://github.com/Yi-Huei/bin/blob/master/images/ML_process.png?raw=true)  
Image source: https://yourfreetemplates.com/free-machine-learning-diagram/

[Download the cheat sheet provided by scikit-learn](https://github.com/Yi-Huei/bin/blob/master/images/Scikit_Learn_Cheat_Sheet.pdf)

## Step 1. Load the Data
Because the data has already been collected and cleaned, steps 1 and 2 of the 8 machine-learning steps can be skipped.

In [1]:
# Load the scikit-learn dataset
from sklearn import datasets
ds = datasets.load_wine()
print(ds.DESCR)  # view the dataset description

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [2]:
# Use pandas to inspect the X data
import pandas as pd
X = pd.DataFrame(ds.data, columns=ds.feature_names) # X
X

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


In [3]:
# Load y
y = ds.target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

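At this point the data is essentially loaded. When facing unfamiliar data for the first time, however, you should take a few steps to check whether it is clean, such as viewing its summary information, checking for null values, and checking the number of columns and rows.

The checks are introduced one by one below.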

In [4]:
# Label names
ds.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')

In [5]:
# Information about X
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
dtypes: fl

None of the 13 features has null values, and all of them have dtype float64.

In [6]:
# Check for null values
X.isnull()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,False,False,False,False,False,False,False,False,False,False,False,False,False
174,False,False,False,False,False,False,False,False,False,False,False,False,False
175,False,False,False,False,False,False,False,False,False,False,False,False,False
176,False,False,False,False,False,False,False,False,False,False,False,False,False


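Each cell can be inspected for null values: True -> null, False -> not null.

With a large amount of data, however, checking cell by cell is impractical, so the following method can be used instead.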

In [7]:
X.isnull().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64
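
The text above also mentions checking the number of rows and columns. A minimal sketch of that check, together with a single yes/no answer for missing values and a count of samples per class, might look like the following; the variable names `n_rows`, `n_cols`, `has_missing`, and `class_counts` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

ds = datasets.load_wine()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

# Number of rows (samples) and columns (features)
n_rows, n_cols = X.shape
print(n_rows, n_cols)  # 178 13

# One boolean for the whole table: is any cell null?
has_missing = bool(X.isnull().values.any())
print(has_missing)  # False

# Number of samples in each class (index = class label)
class_counts = np.bincount(y)
print(class_counts)  # [59 71 48]
```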

In [8]:
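# The features in X are continuous variables, so we can view their descriptive statistics
X.describe().transpose()  # transpose rows and columns, closer to the format used in journal papers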

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
alcohol,178.0,13.000618,0.811827,11.03,12.3625,13.05,13.6775,14.83
malic_acid,178.0,2.336348,1.117146,0.74,1.6025,1.865,3.0825,5.8
ash,178.0,2.366517,0.274344,1.36,2.21,2.36,2.5575,3.23
alcalinity_of_ash,178.0,19.494944,3.339564,10.6,17.2,19.5,21.5,30.0
magnesium,178.0,99.741573,14.282484,70.0,88.0,98.0,107.0,162.0
total_phenols,178.0,2.295112,0.625851,0.98,1.7425,2.355,2.8,3.88
flavanoids,178.0,2.02927,0.998859,0.34,1.205,2.135,2.875,5.08
nonflavanoid_phenols,178.0,0.361854,0.124453,0.13,0.27,0.34,0.4375,0.66
proanthocyanins,178.0,1.590899,0.572359,0.41,1.25,1.555,1.95,3.58
color_intensity,178.0,5.05809,2.318286,1.28,3.22,4.69,6.2,13.0


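## Step 4. Split the Data
To keep the training data and test data from contaminating each other during standardization, the order of steps 3 and 4 is swapped.

The split uses the train_test_split function from sklearn.model_selection. Its parameters are:
- X
- y 
- test_size= the "number" or "proportion" of samples in the test set
- random_state= sets the random seed, which ensures that every split yields the same data (do not set it when working on a real project)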
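
To make the two parameters concrete, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are invented for this example). It shows that a fixed random_state reproduces the same split, and that test_size accepts either a proportion or an absolute count.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, 2 classes
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# Same seed twice -> identical splits
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True

# test_size as a proportion: 20% of 20 samples -> 4 test samples
print(split_a[0].shape, split_a[1].shape)  # (16, 2) (4, 2)

# test_size as an absolute count: exactly 5 test samples
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=5, random_state=0)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```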

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [10]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((142, 13), (36, 13), (142,), (36,))

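## Step 3. Standardization
**Note: training data and test data are standardized differently. The scaler is fit on the training data only (fit_transform), and the test data is then transformed with those same statistics (transform).**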

In [11]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()

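# Standardize the training data (learn mean and std from it, then transform)
X_train_std = scaler.fit_transform(X_train)

# Standardize the test data (reuse the training mean and std)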
X_test_std = scaler.transform(X_test)
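
The difference between fit_transform and transform is easier to see on a tiny made-up example. In the sketch below (the arrays `X_train_demo` and `X_test_demo` are invented for illustration), the mean and standard deviation are learned from the training values only, and the test value is scaled with those same numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[4.0]])

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean_ and scale_
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses them, learns nothing

# Statistics come from the training data only
print(scaler_demo.mean_)   # [2.]
print(scaler_demo.scale_)  # population std of [1, 2, 3], about 0.8165

# The test value is scaled as (4 - 2) / 0.8165, about 2.449
print(X_test_scaled)
```

Calling fit_transform on the test data instead would compute a separate mean and standard deviation from the test set, so the two sets would no longer be on the same scale.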

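## Step 5. Choose the Algorithm: KNN
scikit-learn's KNN class, [reference](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

Parameter description:  
n_neighbors=5: each point is classified by comparing it with its 5 nearest neighbors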

In [12]:
from sklearn import neighbors
clf_5 = neighbors.KNeighborsClassifier(n_neighbors=5)

## Step 6. Train

In [13]:
clf_5.fit(X_train_std, y_train)

KNeighborsClassifier()

In [14]:
# Use the trained model to predict the labels (y) of the test data
y_pred = clf_5.predict(X_test_std)
y_pred

array([0, 1, 2, 2, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 0, 2, 1, 2, 2,
       0, 0, 2, 2, 0, 0, 0, 2, 1, 0, 0, 1, 2, 0])

In [15]:
# Compare y_test with y_pred
y_test == y_pred

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True])

True -> the prediction is correct, False -> the prediction is wrong

## Step 7. Score the Model

In [16]:
# Accuracy of the model
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

1.0
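
Accuracy is simply the fraction of predictions that match the true labels, that is, the mean of the True/False comparison shown earlier. A minimal sketch on made-up labels (`y_true_demo` and `y_pred_demo` are invented for illustration):

```python
import numpy as np

# Made-up labels: 4 of the 5 predictions are correct
y_true_demo = np.array([0, 1, 2, 2, 1])
y_pred_demo = np.array([0, 1, 2, 1, 1])

# True counts as 1 and False as 0, so the mean is the share of correct predictions
accuracy_demo = np.mean(y_true_demo == y_pred_demo)
print(accuracy_demo)  # 0.8
```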

## Step 8. Evaluate the Model
Method: same algorithm, different parameters

In [17]:
# Increase the number of neighbors from 5 to 11
clf_11 = neighbors.KNeighborsClassifier(n_neighbors=11)
clf_11.fit(X_train_std, y_train)

KNeighborsClassifier(n_neighbors=11)

In [18]:
y_pred_11 = clf_11.predict(X_test_std)
accuracy_score(y_test, y_pred_11)  # score the n_neighbors=11 predictions, not y_pred

1.0
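
The same comparison can be extended to several values of n_neighbors with a loop. The sketch below is one possible way to do it, not the procedure used in this article: the list of k values and random_state=0 are assumptions made here so the run is repeatable, and the exact accuracies depend on the split.

```python
from sklearn import datasets, neighbors, preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ds = datasets.load_wine()
# random_state=0 is an assumption for repeatability, not a value from the article
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Train one model per k and record its test accuracy
scores = {}
for k in (1, 3, 5, 7, 9, 11):
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train_std, y_train)
    scores[k] = accuracy_score(y_test, clf.predict(X_test_std))

for k, acc in scores.items():
    print(f"n_neighbors={k}: accuracy={acc:.3f}")
```

With only 36 test samples, one misclassified wine changes the accuracy by almost 3 percentage points, so small differences between k values should be read with caution.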

## Conclusion
A classification model can be evaluated with a confusion matrix.

We use confusion_matrix from scikit-learn, [reference](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

In [19]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_11)

array([[13,  0,  0],
       [ 0, 12,  0],
       [ 0,  0, 11]], dtype=int64)

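In scikit-learn's confusion matrix, the rows are the actual values of the 3 wine classes class_0, class_1, class_2, and the columns are the predicted values for class_0, class_1, class_2.

The diagonal pairs each class with itself, so it holds the counts of correct predictions; ideally every other cell is 0.

**In the run shown above, every off-diagonal cell is 0, so none of the 36 test samples was misclassified. A nonzero count in row class_2, column class_1 would mean that samples whose actual class is class_2 were predicted as class_1.**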
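
To see how a misclassification shows up in the matrix, here is a small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are invented for illustration). Two samples whose actual class is 2 are predicted as class 1, so the count 2 appears in row 2, column 1.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: the last two class-2 samples are predicted as class 1
y_true_demo = [0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 2, 1, 1]

# First argument is the actual labels, second is the predictions
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[2 0 0]
#  [0 2 0]
#  [0 2 1]]

# Row index = actual class, column index = predicted class
print(cm[2, 1])  # 2
```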

## Complete Code
Because each data split is random, the results differ slightly from those above.

In [20]:
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

ds = datasets.load_wine()
X = ds.data
y = ds.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Set n_neighbors=5
clf_5 = neighbors.KNeighborsClassifier(n_neighbors=5)
clf_5.fit(X_train_std, y_train)

y_pred_5 = clf_5.predict(X_test_std)
print("n_neighbors=5 accuracy ->", accuracy_score(y_test, y_pred_5))

# Set n_neighbors=11
clf_11 = neighbors.KNeighborsClassifier(n_neighbors=11)
clf_11.fit(X_train_std, y_train)

y_pred_11 = clf_11.predict(X_test_std)
print("n_neighbors=11 accuracy ->", accuracy_score(y_test, y_pred_11))

n_neighbors=5 accuracy -> 0.9722222222222222
n_neighbors=11 accuracy -> 0.9444444444444444
