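## [Assignment Focus]
Use the Lasso and Ridge models in Sklearn to train on a variety of datasets. Make sure you understand the **data type** of what is passed to the model for training, and the meaning of each model parameter.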

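There are many kinds of machine learning models, but the training data they expect mostly follows a fixed format. Make sure you understand that format, so that when you pick up a new model you can start training as quickly as possible!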
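The "fixed format" mentioned above can be sketched as follows. This is a minimal illustration on synthetic data (the arrays `X`, `y`, `true_w` and `x_single` are made up here and are not part of the assignment): scikit-learn estimators generally expect a 2-D feature array of shape `(n_samples, n_features)` and, for single-output regression, a 1-D target array of shape `(n_samples,)`.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data for illustration only: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # shape (n_samples, n_features)
true_w = np.array([1.5, 0.0, -2.0])                # hypothetical ground-truth weights
y = X @ true_w + rng.normal(scale=0.1, size=100)   # shape (n_samples,)

print(type(X).__name__, X.dtype, X.shape)
print(type(y).__name__, y.dtype, y.shape)

# A single feature must still be 2-D: reshape (n_samples,) to (n_samples, 1).
x_single = X[:, 0].reshape(-1, 1)

model = linear_model.LinearRegression().fit(X, y)
single_model = linear_model.LinearRegression().fit(x_single, y)

# One coefficient per feature column.
print(model.coef_.shape, single_model.coef_.shape)
```

The same `fit(X, y)` / `predict(X)` pattern applies to `Lasso` and `Ridge`, which is why knowing the array shapes carries over from one model to the next.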

## Practice
Try using other sklearn datasets (boston, ...) to train your own linear regression model, then add appropriate regularization and observe how training behaves.
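One possible shape for this exercise is sketched below. It assumes the diabetes dataset bundled with scikit-learn (`datasets.load_diabetes`) as the "other dataset", and `alpha=1.0` for both regularized models; both choices are illustrative, not prescribed by the assignment. The variable names (`xd_train`, `results`, and so on) are hypothetical.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: use the bundled diabetes dataset as an alternative to Boston.
diabetes = datasets.load_diabetes()
xd_train, xd_test, yd_train, yd_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=4)

# Plain linear regression next to L2 (Ridge) and L1 (Lasso) regularization.
models = {
    "LinearRegression": linear_model.LinearRegression(),
    "Ridge(alpha=1.0)": linear_model.Ridge(alpha=1.0),
    "Lasso(alpha=1.0)": linear_model.Lasso(alpha=1.0),
}

results = {}
for name, model in models.items():
    model.fit(xd_train, yd_train)
    results[name] = mean_squared_error(yd_test, model.predict(xd_test))
    print(f"{name:18}: test MSE = {results[name]:.2f}")
```

Comparing the three test errors side by side makes it easier to see whether a given penalty strength helps or hurts on a particular dataset.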

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
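# Load the Boston dataset
# (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version)
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=4)

# Create a linear regression model
regr = linear_model.LinearRegression()

# Fit the model on the training data
regr.fit(x_train, y_train)

# Predict on the test data
y_pred = regr.predict(x_test)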

In [3]:
y_pred

array([12.07495986, 26.9894969 , 17.58803353, 18.15584511, 36.92091659,
       25.43267386, 31.09256932, 19.72549907, 19.66103377, 22.96358632,
       28.38841214, 28.48925986, 18.99690357, 32.41097504, 21.52350275,
       15.25945122, 21.23364112, 11.6220597 , 11.37109662, 13.63515584,
        5.62431971, 17.35323315, 20.80951594, 22.51311312, 16.39055556,
       20.32352451, 17.88994185, 14.23445109, 21.1187098 , 17.50765806,
       14.54295525, 23.63289896, 34.32419647, 22.23027161, 16.82396516,
       20.16274383, 30.67665825, 35.61882904, 23.50372003, 24.66451121,
       36.91269871, 32.33290254, 19.11785719, 32.19546605, 33.42795148,
       25.52705821, 40.63477427, 18.21762788, 19.34587461, 23.80167377,
       33.42122982, 26.1451108 , 18.10363121, 28.19906437, 13.37486655,
       23.34019279, 24.44952678, 33.54973856, 16.71263275, 36.56402224,
       15.69684554, 18.55447039, 32.14543203, 15.49568061, 39.02363234,
       27.38174402, 31.96333419, 10.09436162, 19.13214621, 21.73

In [4]:
print(regr.coef_)

[-1.15966452e-01  4.71249231e-02  8.25980146e-03  3.23404531e+00
 -1.66865890e+01  3.88410651e+00 -1.08974442e-02 -1.54129540e+00
  2.93208309e-01 -1.34059383e-02 -9.06296429e-01  8.80823439e-03
 -4.57723846e-01]


In [5]:
# Gap between predicted and actual values, measured with MSE
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))

Mean squared error: 25.42
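To make the metric concrete, here is a small sketch that computes MSE by hand and checks it against `mean_squared_error`. The four target/prediction pairs are invented for illustration and are unrelated to the Boston results; `r2_score` (imported in the first cell) is shown alongside as a scale-free companion metric.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# MSE is the mean of the squared residuals: (0.25 + 0 + 4 + 1) / 4 = 1.3125
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# R^2 = 1 - SS_res / SS_tot; 1.0 would be a perfect fit.
r2 = r2_score(y_true, y_hat)

print(f"manual MSE: {mse_manual:.4f}, sklearn MSE: {mse_sklearn:.4f}, R^2: {r2:.4f}")
```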


In [6]:
# Create a Lasso regression model (linear regression with an L1 penalty)
lasso = linear_model.Lasso(alpha=1.0)

# Fit the model on the training data
lasso.fit(x_train, y_train)

# Predict on the test data
y_pred2 = lasso.predict(x_test)

In [7]:
y_pred2

array([11.56930154, 26.32668758, 18.94404342, 14.49905024, 33.60514309,
       24.629156  , 31.17314247, 18.42507137, 15.83396035, 23.33431095,
       28.89063128, 28.11400124, 21.35805085, 30.21375325, 22.27253304,
       14.60583691, 23.34591179,  8.82672312, 12.84778968, 16.43172375,
        8.86471891, 21.99613214, 21.26035226, 21.87466221, 19.30195521,
       20.33940901, 15.02519703, 15.68821883, 19.77543932, 16.80036872,
       13.08931044, 26.57292582, 32.01194594, 23.07528864, 18.09196699,
       16.77979041, 29.26675028, 31.68145973, 26.07651487, 25.11252925,
       33.77239594, 31.8482782 , 19.60210441, 30.84240219, 28.07766984,
       26.16173816, 36.81584179, 18.2252528 , 20.25516321, 23.69928341,
       32.56397348, 25.90397903, 16.62569205, 27.58761516, 15.28492743,
       23.94100266, 23.69804716, 32.72159708, 19.43476306, 32.25972294,
       17.81970989, 20.28199548, 30.68465409, 15.89602224, 36.7309312 ,
       29.03264874, 28.86324959,  7.24827001, 18.72630174, 21.47

In [8]:
# Print the coefficient for each feature: three of them (INDUS, CHAS, NOX) are now 0, showing that Lasso regression can perform feature selection
lasso.coef_

array([-0.06494981,  0.04581458, -0.        ,  0.        , -0.        ,
        1.18140024,  0.01109101, -0.73695809,  0.23350042, -0.01551065,
       -0.69270805,  0.00763157, -0.6927848 ])
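Reading zeros off a printed array gets error-prone as the number of features grows. The sketch below shows one way to list the dropped features programmatically. It uses synthetic data from `make_regression` so that it runs on its own; the feature names `x0`..`x9` and the variables `kept` and `dropped` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem: 10 features, only 3 of them informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
feature_names = np.array([f"x{i}" for i in range(X.shape[1])])  # hypothetical names

model = Lasso(alpha=1.0).fit(X, y)

# Lasso sets eliminated coefficients to exactly 0, so an equality test is enough.
zero_mask = model.coef_ == 0
kept = feature_names[~zero_mask]
dropped = feature_names[zero_mask]

print(f"kept    ({kept.size}): {list(kept)}")
print(f"dropped ({dropped.size}): {list(dropped)}")
```

In the notebook, the same mask could be applied to `boston.feature_names` and `lasso.coef_`.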

In [9]:
# Gap between predicted and actual values, measured with MSE
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred2))

Mean squared error: 28.95


With Lasso at $\alpha = 1.0$, the test MSE actually increases (28.95, versus 25.42 for plain linear regression).
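As a reading aid for this result, the objective that scikit-learn documents for `Lasso` can be written as below, where $n$ is the number of training samples, $X$ and $y$ are the training features and targets, and $w$ is the coefficient vector (the symbols are notation chosen here, not taken from the notebook):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

With $\alpha = 0$ this reduces to ordinary least squares. A larger $\alpha$ trades training fit for smaller coefficients, so a higher test MSE is a possible outcome rather than a contradiction; it suggests that $\alpha = 1.0$ may be too strong a penalty for this particular train/test split.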

In [10]:
# Create a Lasso regression model with a weaker penalty
lasso2 = linear_model.Lasso(alpha=0.5)

# Fit the model on the training data
lasso2.fit(x_train, y_train)

# Predict on the test data
y_pred3 = lasso2.predict(x_test)

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred3))

Mean squared error: 26.94


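With Lasso at $\alpha = 0.5$, the MSE (26.94) is still larger than without regularization (25.42), but smaller than at $\alpha = 1.0$ (28.95).

Test several different $\alpha$ values at once: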

In [11]:
MSEs = []
coefs = []
for i in range(1, 21):
    # Create a Lasso model with alpha = 0.1, 0.2, ..., 2.0
    model = linear_model.Lasso(alpha=0.1 * i)
    # Fit the model on the training data
    model.fit(x_train, y_train)
    # Predict on the test data
    y_pred4 = model.predict(x_test)
    MSE = mean_squared_error(y_test, y_pred4)
    MSEs.append(MSE)
    coefs.append(model.coef_)
    print(f"Mean squared error for alpha =  {0.1 * i:.1f}: {MSE:.2f}")

Mean squared error for alpha =  0.1: 26.45
Mean squared error for alpha =  0.2: 26.60
Mean squared error for alpha =  0.3: 26.65
Mean squared error for alpha =  0.4: 26.76
Mean squared error for alpha =  0.5: 26.94
Mean squared error for alpha =  0.6: 27.22
Mean squared error for alpha =  0.7: 27.59
Mean squared error for alpha =  0.8: 27.98
Mean squared error for alpha =  0.9: 28.43
Mean squared error for alpha =  1.0: 28.95
Mean squared error for alpha =  1.1: 29.54
Mean squared error for alpha =  1.2: 30.19
Mean squared error for alpha =  1.3: 30.91
Mean squared error for alpha =  1.4: 31.66
Mean squared error for alpha =  1.5: 32.00
Mean squared error for alpha =  1.6: 32.37
Mean squared error for alpha =  1.7: 32.77
Mean squared error for alpha =  1.8: 33.19
Mean squared error for alpha =  1.9: 33.63
Mean squared error for alpha =  2.0: 34.09


In [12]:
min_ = min(MSEs)
min_index = MSEs.index(min_)
print(f"when alpha = {0.1 * (min_index + 1):.1f}, there is the minimal value of MSE, {min_:.2f}.")

when alpha = 0.1, there is the minimal value of MSE, 26.45.


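Stepping $\alpha$ in increments of 0.1, the smallest MSE on this grid occurs at $\alpha = 0.1$, which is the lower edge of the grid (and it is still larger than the MSE without Lasso, 25.42).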
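Because the best value sits at the edge of the grid, and because the loop above picks $\alpha$ by looking at the test set, one alternative worth trying is cross-validation on the training data over a wider, logarithmic grid. The sketch below uses `LassoCV` for that. It runs on synthetic data so that it is self-contained; the grid bounds, `cv=5`, and all variable names are assumptions made for illustration, not settings from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (13 features, like Boston).
X, y = make_regression(n_samples=300, n_features=13, n_informative=8,
                       noise=10.0, random_state=4)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)

# Logarithmic grid from 0.001 to 10, so values below 0.1 are also explored.
alphas = np.logspace(-3, 1, 30)

# 5-fold cross-validation on the training set only; the test set stays untouched.
cv_model = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(x_tr, y_tr)

test_mse = mean_squared_error(y_te, cv_model.predict(x_te))
print(f"alpha chosen by CV: {cv_model.alpha_:.4f}, test MSE: {test_mse:.2f}")
```

Selecting $\alpha$ this way keeps the test MSE as an honest estimate, since the test data plays no part in choosing the hyperparameter.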

The corresponding coefficients:

In [13]:
print(coefs[min_index])

[-0.10618872  0.04886351 -0.04536655  1.14953069 -0.          3.82353877
 -0.02089779 -1.23590613  0.26008876 -0.01517094 -0.74673362  0.00963864
 -0.49877104]


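- Compare the relative change: (coefficient without Lasso - coefficient with Lasso) / (coefficient without Lasso)

Close to 0: the coefficient is almost unchanged by Lasso

Close to 1: the coefficient is almost eliminated by Lasso

Otherwise: the coefficient moved to a different value (a ratio above 1 means it changed sign; a negative ratio means its magnitude grew)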

In [14]:
for fld, val in zip(boston.feature_names, (regr.coef_ - coefs[min_index])/regr.coef_):
    print(f"{fld:10}: {val:+.5f}")

CRIM      : +0.08432
ZN        : -0.03689
INDUS     : +6.49245
CHAS      : +0.64455
NOX       : +1.00000
RM        : +0.01559
AGE       : -0.91768
DIS       : +0.19814
RAD       : +0.11296
TAX       : -0.13166
PTRATIO   : +0.17606
B         : -0.09428
LSTAT     : -0.08968


- The coefficient that changed the least is RM
- The coefficient that became 0 is NOX
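The relative-change ratio can be checked by hand on three of the features. The sketch below uses coefficients rounded from the outputs above (RM, NOX, INDUS); the helper `relative_change` is a hypothetical name introduced here, and it assumes no coefficient in `before` is exactly 0, since that would divide by zero.

```python
import numpy as np

def relative_change(before, after):
    """(before - after) / before, elementwise. Assumes no zero in `before`."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    return (before - after) / before

# Rounded from regr.coef_ and coefs[min_index] in the cells above.
names = ["RM", "NOX", "INDUS"]
before = np.array([3.884, -16.687, 0.00826])   # plain linear regression
after = np.array([3.824, 0.0, -0.04537])       # Lasso with alpha = 0.1

ratios = relative_change(before, after)
sign_flipped = np.sign(before) * np.sign(after) < 0

for name, r, flipped in zip(names, ratios, sign_flipped):
    note = " (sign flipped)" if flipped else ""
    print(f"{name:6}: {r:+.5f}{note}")
```

RM stays near 0 (barely changed), NOX is exactly 1 (eliminated), and INDUS comes out well above 1 because its coefficient was tiny to begin with and changed sign, which is why its ratio in the table looks so large.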