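## [Assignment Focus]
Use the linear regression models in scikit-learn to train on a variety of datasets. Make sure you understand the **data type** of what you feed into the model, and the meaning of each model parameter.

## Assignment
Try other datasets from sklearn datasets (wine, boston, ...) to train your own linear regression model.

### HINT: Check the type of the label to determine whether the dataset's goal is classification or regression, then train with the appropriate model!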
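
One quick way to act on this hint is to inspect the target array before choosing a model. The sketch below relies on `sklearn.utils.multiclass.type_of_target`. The two small arrays imitate the first few Boston and wine targets printed in this notebook, and the helper name `suggest_task` is hypothetical.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

def suggest_task(y):
    """Hypothetical helper: guess the task type from the label array."""
    kind = type_of_target(y)
    # 'continuous' targets call for a regressor; 'binary'/'multiclass' for a classifier
    return "regression" if kind.startswith("continuous") else "classification"

boston_like = np.array([24.0, 21.6, 34.7, 33.4, 36.2])  # float house prices
wine_like = np.array([0, 0, 1, 1, 2, 2])                # integer class ids

print(suggest_task(boston_like))
print(suggest_task(wine_like))
```

Treat the result as a hint only: a float target whose values all happen to be whole numbers may be reported as multiclass, so it is still worth reading the dataset description.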

In [1]:
# Import everything we need
import os
import copy
import time
import math
import numpy             as np
import pandas            as pd
import seaborn           as sns
import datetime          as dt
import warnings
import matplotlib.pyplot as plt
from scipy                   import stats
from itertools               import compress
from sklearn.metrics         import roc_curve,mean_squared_error,r2_score,accuracy_score,precision_score,recall_score,fbeta_score
from sklearn.ensemble        import GradientBoostingRegressor,GradientBoostingClassifier,RandomForestClassifier
from sklearn.datasets        import load_boston, load_wine
from sklearn.linear_model    import LogisticRegression,LinearRegression,Lasso
from sklearn.preprocessing   import LabelEncoder, MinMaxScaler, StandardScaler,OneHotEncoder
from sklearn.model_selection import cross_val_score,train_test_split
from IPython.display         import YouTubeVideo

# Short aliases for long names (metric functions and default model/transformer instances)
MSE  = mean_squared_error
ACC  = accuracy_score
MME  = MinMaxScaler()
LE   = LabelEncoder()
LR   = LogisticRegression()
LIR  = LinearRegression()
GBR  = GradientBoostingRegressor()
GBC  = GradientBoostingClassifier()
RFC  = RandomForestClassifier()
OHE  = OneHotEncoder()

# Some necessary settings
warnings.filterwarnings('ignore')
%matplotlib inline

# Set the path to the data folder and name it data_folder
data_folder = 'C:/Users/Ynitsed/Documents/GitHub/2nd-ML100Days/data'

In [2]:
# Load the Boston data (load_boston was removed in scikit-learn 1.2, so this cell needs an older release)
t001 = load_boston()

In [3]:
t001

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3

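I wasn't sure at first what kind of object `load_boston()` returns. It is a dictionary-like `Bunch`.  
It holds two main pieces of data: X, stored under `data`, and Y, stored under `target`.  
`X = t001.data`, `Y = t001.target`  
The column names for X are available as `feature_names`. Y is a plain 1-D array with no column name of its own, so I simply call its column `target`.  
Either way, convert both to DataFrames first to take a look.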
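
The same structure can be explored on any bundled dataset. A minimal sketch, assuming `load_wine` as the example because `load_boston` is no longer shipped with recent scikit-learn releases; the names `X` and `y` are illustrative:

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()            # a Bunch: dict-like, also supports attribute access
print(sorted(wine.keys()))    # includes 'data', 'target', 'feature_names', ...

X = pd.DataFrame(wine.data, columns=wine.feature_names)
# target is a plain 1-D array with no column name, so we name it ourselves
y = pd.Series(wine.target, name="target")

print(type(wine.data), wine.data.shape)
print(X.shape, y.shape)
```

Keeping `y` as a Series (1-D) rather than a one-column DataFrame is a design choice: most estimators expect a 1-D target.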

In [4]:
train_X_t1 = pd.DataFrame(t001.data, columns=t001.feature_names)

In [5]:
print(train_X_t1.shape)
train_X_t1.head()

(506, 13)


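|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |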


In [6]:
train_Y_t1 = pd.DataFrame({"target": t001.target})

In [7]:
print(train_Y_t1.shape)
train_Y_t1.head()

(506, 1)


|   | target |
|---|---|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |


In [8]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(train_X_t1, train_Y_t1, test_size=0.1, random_state=4)

In [9]:
print(x_train.shape)
print(x_train.head())
print(y_train.shape)
print(y_train.head())
print(x_test.shape)
print(x_test.head())
print(y_test.shape)
print(y_test.head())

(455, 13)
        CRIM   ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  \
169  2.44953  0.0  19.58   0.0  0.605  6.402   95.2  2.2625   5.0  403.0   
402  9.59571  0.0  18.10   0.0  0.693  6.404  100.0  1.6390  24.0  666.0   
295  0.12932  0.0  13.92   0.0  0.437  6.678   31.1  5.9604   4.0  289.0   
134  0.97617  0.0  21.89   0.0  0.624  5.757   98.4  2.3460   4.0  437.0   
117  0.15098  0.0  10.01   0.0  0.547  6.021   82.6  2.7474   6.0  432.0   

     PTRATIO       B  LSTAT  
169     14.7  330.04  11.32  
402     20.2  376.11  20.31  
295     16.0  396.90   6.27  
134     21.2  262.76  17.31  
117     17.8  394.51  10.30  
(455, 1)
     target
169    22.3
402    12.1
295    28.6
134    15.6
117    19.2
(51, 13)
        CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD    TAX  \
8    0.21124  12.5   7.87   0.0  0.524  5.631  100.0  6.0821  5.0  311.0   
289  0.04297  52.5   5.32   0.0  0.405  6.565   22.9  7.3172  6.0  293.0   
68   0.13554  12.5   6.07   0.0  0.
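
The shapes above, (455, 13) and (51, 13), follow from `test_size=0.1` on 506 rows. A small sketch of the arithmetic, assuming the test share is rounded up and the remainder goes to training; the helper name `split_sizes` is hypothetical:

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: train/test sizes for a float test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # the rest goes to training
    return n_train, n_test

print(split_sizes(506, 0.1))  # Boston: 506 rows
print(split_sizes(178, 0.1))  # wine: 178 rows
```

The second call reproduces the 160/18 split seen for the wine data later in this notebook.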

In [10]:
# Fit the linear regression with LIR (super convenient)
# After fit() returns, LIR holds the fitted regression model
LIR.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
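
Since the assignment asks for an understanding of the model's parameters, it may help to look at what `fit()` actually learns. A minimal sketch, assuming the bundled `load_diabetes` dataset as a stand-in for Boston; the variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X, y = data.data, data.target

model = LinearRegression(fit_intercept=True)  # learn a bias term as well as weights
model.fit(X, y)

# One learned weight per feature, plus a single intercept
for name, weight in zip(data.feature_names, model.coef_):
    print(f"{name:>4}: {weight: .2f}")
print("intercept:", model.intercept_)

# A prediction is just intercept + X @ coef
manual = model.intercept_ + X @ model.coef_
print(np.allclose(manual, model.predict(X)))
```

With `fit_intercept=False` the model would be forced through the origin, which is rarely what you want unless the data is already centered.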

In [11]:
# Feed x_test into the fitted regression model to get y_pred, the predicted target values.
y_pred = LIR.predict(x_test)

In [12]:
print(y_pred.shape)
print(pd.DataFrame(y_pred).head())

(51, 1)
           0
0  11.460308
1  26.802693
2  17.434789
3  17.556310
4  37.391564
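
The prediction shape (51, 1) rather than (51,) is a consequence of the data type passed to `fit()`: `y_train` was a one-column DataFrame, which is 2-D. A small sketch on synthetic data (all values here are made up for illustration) shows the difference:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y_1d = X @ np.array([1.0, -2.0, 0.5]) + 3.0  # shape (20,)
y_2d = y_1d.reshape(-1, 1)                   # shape (20, 1), like a one-column DataFrame

pred_1d = LinearRegression().fit(X, y_1d).predict(X)
pred_2d = LinearRegression().fit(X, y_2d).predict(X)

print(pred_1d.shape)  # 1-D target in, 1-D predictions out
print(pred_2d.shape)  # 2-D target in, 2-D predictions out
print(np.allclose(pred_1d, pred_2d.ravel()))
```

The predicted values are the same either way; only the array shape differs. Classifiers such as `LogisticRegression` flatten a column target internally, which is why the wine predictions later in this notebook come out as (18,).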


In [13]:
# How far are the predictions y_pred from the actual y_test?
print("Mean squared error: %.2f"% MSE(y_test, y_pred))

Mean squared error: 17.04


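On its own, I honestly can't tell whether this value is high or low. MSE is in squared units of the target, so its square root (RMSE, about 4.13 here) is easier to compare against the target values themselves, which mostly fall between about 13 and 36 in the output above.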
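
To put an MSE in context, it can help to report RMSE and R² alongside it and to compare against a trivial baseline. A minimal sketch, assuming the bundled `load_diabetes` dataset as a stand-in (since `load_boston` is not available in recent scikit-learn releases); the variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=4)

y_pred = LinearRegression().fit(x_train, y_train).predict(x_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)            # same units as the target
r2 = r2_score(y_test, y_pred)  # 1.0 is perfect; 0.0 matches "always predict the test mean"

# Baseline: ignore the features and always predict the training mean
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}  baseline MSE={baseline_mse:.2f}")
```

If the model's MSE is not clearly below the baseline MSE, the features are contributing little, whatever the absolute number looks like.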

In [14]:
# Load the wine data
t002 = load_wine()

In [15]:
t002

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [16]:
train_X_t1 = pd.DataFrame(t002.data, columns=t002.feature_names)

In [17]:
print(train_X_t1.shape)
train_X_t1.head()

(178, 13)


|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


In [18]:
train_Y_t1 = pd.DataFrame({"target": t002.target})

In [19]:
print(train_Y_t1.shape)
train_Y_t1.head()

(178, 1)


|   | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [20]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(train_X_t1, train_Y_t1, test_size=0.1, random_state=4)

In [21]:
print(x_train.shape)
print(x_train.head())
print(y_train.shape)
print(y_train.head())
print(x_test.shape)
print(x_test.head())
print(y_test.shape)
print(y_test.head())

(160, 13)
     alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
93     12.29        2.83  2.22               18.0       88.0           2.45   
156    13.84        4.12  2.38               19.5       89.0           1.80   
91     12.00        1.51  2.42               22.0       86.0           1.45   
165    13.73        4.36  2.26               22.5       88.0           1.28   
124    11.87        4.31  2.39               21.0       82.0           2.86   

     flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
93         2.25                  0.25             1.99             2.15  1.15   
156        0.83                  0.48             1.56             9.01  0.57   
91         1.25                  0.50             1.63             3.60  1.05   
165        0.47                  0.52             1.15             6.62  0.78   
124        3.03                  0.21             2.91             2.80  0.75   

     od280/od315_of_diluted_

In [22]:
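# Fit the logistic regression with LR (super convenient)
# Despite its name, logistic regression is a classifier; after fit() returns, LR holds the fitted model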
LR.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [23]:
# Feed x_test into the fitted classifier to get y_pred, the predicted class labels.
y_pred = LR.predict(x_test)

In [24]:
print(y_pred.shape)
print(pd.DataFrame(y_pred).head())

(18,)
   0
0  2
1  2
2  0
3  0
4  1


In [25]:
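# How far are the predictions y_pred from the actual y_test?
# (MSE on class ids is only a rough check; accuracy in the next cell is the proper metric)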
print("Mean squared error: %.2f"% MSE(y_test, y_pred))

Mean squared error: 0.06


In [26]:
# The labels here are the classes 0, 1, 2, so this is a classification task and accuracy (ACC) is appropriate.
print("Accuracy: ", ACC(y_test, y_pred))

Accuracy:  0.9444444444444444


This matches 17 correct predictions out of 18: 17/18 ≈ 0.9444.
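
The two reported numbers are consistent with each other: a single mistake between neighbouring class ids gives an MSE of 1/18 ≈ 0.06 and an accuracy of 17/18. The sketch below reproduces that arithmetic with hypothetical label arrays (the actual test labels are not reproduced here) and shows why MSE is a poor fit for class labels.

```python
import numpy as np

# Hypothetical labels: 18 test samples with exactly one mistake (a class 1 predicted as 2)
y_true = np.array([2, 2, 0, 0, 1, 1, 0, 2, 1, 0, 1, 2, 0, 1, 1, 0, 2, 1])
y_pred = y_true.copy()
y_pred[5] = 2

accuracy = np.mean(y_true == y_pred)   # fraction of exact matches
mse = np.mean((y_true - y_pred) ** 2)  # depends on how the classes are numbered

print(accuracy)
print(round(mse, 2))

# Same single mistake, but between classes 0 and 2: accuracy unchanged, MSE four times larger
y_pred_far = y_true.copy()
y_pred_far[2] = 2
acc_far = np.mean(y_true == y_pred_far)
mse_far = np.mean((y_true - y_pred_far) ** 2)
print(acc_far, mse_far)
```

Because the class ids are arbitrary labels, the "distance" between them carries no meaning, so accuracy (or a confusion matrix) is the more trustworthy summary.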