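# Predicting Boston House Prices with Linear Regression
This chapter begins the machine learning part of the series, starting with supervised learning.  
A machine learning project can be divided into 8 steps:  
1. Collect data (Dataset)
2. Clean data (Data cleaning)  
3. Feature engineering (Feature Engineering)
4. Split data into training and test sets (Split)  
5. Choose an algorithm (Learning Algorithm)  
6. Train the model (Train Model)  
7. Score the model (Score Model)  
8. Evaluate the model (Evaluate Model)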

![Machine learning process](https://github.com/Yi-Huei/bin/blob/master/images/ML_process.png?raw=true)  
Image source: https://yourfreetemplates.com/free-machine-learning-diagram/

This article applies supervised machine learning to the Boston house prices dataset, which is one of the datasets bundled with scikit-learn.  [Reference](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston)  

[Other scikit-learn datasets](https://scikit-learn.org/stable/datasets/index.html)

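#### Step 1: Load the data
Because this dataset has already been collected and cleaned by scikit-learn, we skip the first and second steps and load the data directly.  

Data cleaning is covered in a later chapter. 

scikit-learn datasets can be loaded as shown below. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older release.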

In [1]:
from sklearn import datasets
ds = datasets.load_boston()

# Print the dataset description
print(ds.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

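Next, use a few tools to check whether every column is numeric and whether there are any missing values.  
Non-numeric or missing values must be handled before training.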

In [2]:
# Use pandas to view the feature data as a table
# Set X (the features)
import pandas as pd
X = pd.DataFrame(ds.data, columns=ds.feature_names)
X.head()

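| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |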


In [3]:
# Set y (the target)
y = ds.target
y

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21

In [4]:
# Check whether the data contains missing values
X.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64
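
The Boston data needs no repair, but the check above matters when a dataset does contain gaps or text columns. The following is a minimal sketch of one common way to handle both, using a hypothetical toy table (the column names `rooms`, `age` and `river`, the median fill and the 0/1 encoding are illustrative assumptions, not part of the Boston dataset). In a real workflow the fill value would normally be computed from the training split only, after the split in Step 4.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one missing value and one text column
df = pd.DataFrame({
    "rooms": [6.5, np.nan, 7.1, 5.9],
    "age": [65.2, 78.9, 61.1, 45.8],
    "river": ["yes", "no", "no", "yes"],
})

print(df.dtypes)          # 'river' is a text column, not numeric
print(df.isnull().sum())  # 'rooms' has 1 missing value

# Fill the missing value with the column median (6.5 here)
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

# Encode the text column as 0/1 so that every feature is numeric
df["river"] = df["river"].map({"no": 0, "yes": 1})

print(df)
```

Which fill strategy and encoding are appropriate depends on the data; this only illustrates the mechanics.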

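#### Step 4: Split the data
Once the data has been checked, split all of it into "training data" and "test data".  
**The split is done first, before any preprocessing, so that information from the test data cannot leak into training.**  

The function used is **train_test_split(X, y, test_size=.2)**  
Argument 1: X, the feature data, returned as X training data and X test data  
Argument 2: y, the target data, returned as y training data and y test data  
Argument 3: test_size, the size of the test set, given as a proportion (float) or as a number of samples (int)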

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((404, 13), (102, 13), (404,), (102,))
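
The following sketch illustrates the two forms of `test_size` on stand-in arrays with the same number of rows as the Boston data. It also uses `random_state`, a `train_test_split` parameter not used in this article, which fixes the shuffle so that a split can be reproduced. The arrays and the seed value 42 are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 506 rows, like the Boston data
X = np.arange(506 * 2).reshape(506, 2)
y = np.arange(506)

# test_size as a proportion: 20% of 506 rows, rounded up to 102
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (404, 2) (102, 2)

# test_size as a count: exactly 100 test rows
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(
    X, y, test_size=100, random_state=42)
print(X_train_n.shape, X_test_n.shape)  # (406, 2) (100, 2)

# The same random_state reproduces the same split
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```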

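#### Step 3: Standardization
Step 3 is carried out after Step 4.

After the split, standardize the feature data (X) with the formula: (X - mean) / standard deviation  
After standardization each feature has mean 0 and standard deviation 1 on the training data. The values are not confined to -1~1 (the output below contains 1.985, for example)  
The training data and the test data use different methods  

Class used: StandardScaler  
Training data: .fit_transform(X_train)  
Test data: .transform(X_test)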

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_std = scaler.fit_transform(X_train)

In [7]:
# The test data is not used for fitting; it is only transformed
X_test_std = scaler.transform(X_test)
X_test_std[0]

array([-0.44325292, -0.50069537, -0.9819652 , -0.28828791, -0.97410716,
       -0.40003485, -0.71771697,  1.98525198, -0.76272677, -0.36502562,
        0.18532535,  0.32533435, -0.29225545])
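
To make the difference between `fit_transform` and `transform` concrete, here is a minimal NumPy sketch of the same arithmetic on synthetic data (the random data and variable names are assumptions for illustration). The mean and standard deviation are computed from the training rows only and then reused for the test rows. `np.std` defaults to the population standard deviation, the same convention `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# "fit": learn the statistics from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "transform": apply the same statistics to both sets
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(np.allclose(X_train_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_train_std.std(axis=0), 1.0))   # True
print(np.abs(X_train_std).max() > 1)               # True: values exceed 1
```

The test set's own mean and standard deviation will generally be close to, but not exactly, 0 and 1, because its statistics were never used.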

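#### Step 5: Choose an algorithm
The target y is a continuous variable and the features X are all numeric, so a regression algorithm is appropriate. Here we use linear regression.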

In [8]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

#### Step 6: Train the model

In [9]:
lr.fit(X_train_std, y_train)

LinearRegression()

In [10]:
# Get the coefficients of X
lr.coef_

array([-0.70195335,  1.25332393,  0.0933845 ,  0.84521626, -2.06943217,
        2.36674247,  0.39483133, -3.05148657,  2.6412023 , -2.1215265 ,
       -2.13543496,  0.96667552, -4.26289182])

In [11]:
# Get the intercept
lr.intercept_

22.734900990099018
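
The coefficients and the intercept fully describe a fitted linear model: a prediction is the dot product of the standardized features with the coefficients, plus the intercept. The sketch below reproduces this with plain NumPy on synthetic data (the data, the "true" weights and the use of `np.linalg.lstsq` are assumptions for illustration, not a description of scikit-learn's internals). It also shows that when the features are standardized, the least-squares intercept equals the mean of the training target, which is consistent with the value of about 22.73 printed above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Least squares with an explicit column of ones for the intercept
A = np.column_stack([np.ones(len(X_std)), X_std])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = params[0], params[1:]

# A prediction is features . coefficients + intercept
y_pred = X_std @ coef + intercept

print(np.isclose(intercept, y.mean()))  # True
```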

In [12]:
# Feed the test features into the model to obtain the predicted y
y_pred = lr.predict(X_test_std)
y_pred

array([16.73535844, 13.35131583, 33.81343572, 19.01283458,  7.6881888 ,
       12.95292327, 22.91989793, 33.02834692, 20.77686757, 23.71186536,
       25.75324787, 23.54060354, 32.60199259, 29.50916785, 21.71074893,
       26.73921957, 34.7785217 , 18.84080233, 26.79901   , 28.90976809,
        5.96180825, 18.86266551, 24.67976802, 30.78860219, 23.45275874,
       26.58078929, 18.75464728, 25.09208322, 25.22506736, 14.04493588,
       22.07629585, 15.94206003, 24.1475928 , 27.43695147, 19.43057252,
       19.90017036, 28.41905569, 25.06703573, 31.96285583, 15.67593437,
       18.51947218, 15.51547675, 24.32351045, 12.18952919, 29.23190159,
       21.45258683, 24.1849763 , 38.21148435, 23.72136631, 27.35125682,
       19.02360907, 18.66760868, 24.59908304,  9.55628503, 27.80871231,
       13.83374535,  0.74344617, 22.78626904, 22.73849408, 23.04196331,
        6.38672211, 30.14834607,  7.40952164, 27.16265518, 28.08077322,
       28.61752959, 20.73193188, 12.32054012, 19.76129853, 27.62

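#### Step 7: Score Model
In Step 6 we used the trained model to compute predicted y values for the test data. These can be compared with the original (actual) y values using $R^2$ and $MSE$. The $R^2$ formula is:

$
R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}
$

$\bar{y}$ is the mean of the actual values, $\hat{y_i}$ is the predicted value of y, and ${y_i}$ is the actual value.  

The $R^2$ formula takes the mean of the actual values as a baseline: it compares the squared distance between the predictions and the actual values with the squared distance between the actual values and the baseline. For an ordinary least-squares fit with an intercept, evaluated on its own training data, this equals $\frac{\sum_{i=1}^n (\hat{y_i} - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$, the share of the variance in y that the predictions explain.

$R^2$ is at most 1. A value of 1 means the predictions match the actual values exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that. Unlike a correlation coefficient, it is not limited to -1 ~ +1.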
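
As a quick check on how $R^2$ behaves, here is a minimal sketch that computes it by hand, assuming the definition $1 - SS_{res}/SS_{tot}$ (the one `r2_score` implements). The helper name `r2_manual` and the toy values are made up for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])

perfect = r2_manual(y_true, y_true)               # predictions equal the truth
baseline = r2_manual(y_true, np.full(4, 25.0))    # always predict the mean
reversed_ = r2_manual(y_true, y_true[::-1])       # predictions in reverse order

print(f"{perfect:.1f}")    # 1.0
print(f"{baseline:.1f}")   # 0.0
print(f"{reversed_:.1f}")  # -3.0
```

The last case shows that a sufficiently bad model can score below -1.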

**MSE (mean squared error), illustrated:**
<img src="https://github.com/Yi-Huei/bin/blob/master/images/MSE.png?raw=true" width="500px" />

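Drawing a vertical line at a given x gives two points, the predicted y and the actual y. Take the difference between them, square it, and average over all points to get the MSE. The formula is:

$
MSE = \frac{1}{n} \sum_{i=1}^n (\hat{y_i} - y_i)^2
$  

**The smaller the MSE, the smaller the gap between the predicted and actual values.**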
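
A small worked example may help, assuming the usual definition of MSE as the mean of the squared differences. The toy values are made up. The square root (RMSE) is not used elsewhere in this article; it is included only because it is expressed in the same units as y and is therefore easier to read.

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 4.0, 4.0, 8.0])

errors = y_pred - y_true     # [1, 0, -2, 0]
mse = np.mean(errors ** 2)   # (1 + 0 + 4 + 0) / 4
rmse = np.sqrt(mse)          # same units as y

print(f"MSE = {mse:.2f}")    # MSE = 1.25
print(f"RMSE = {rmse:.2f}")  # RMSE = 1.12
```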

The code for both is as follows:

In [13]:
from sklearn.metrics import mean_squared_error, r2_score
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f'R2 = {r2:.2f}')
print(f'MSE = {mse:.2f}')

R2 = 0.68
MSE = 20.16


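#### Step 8: Evaluate
Train with different algorithms, obtain $R^2$ and $MSE$ for each, and compare which algorithm performs better.

Switch the algorithm to SVR and compute the scores again.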

In [14]:
from sklearn.svm import SVR  # SVM = support vector machine; SVR is its regression variant for continuous targets
svr = SVR()
svr.fit(X_train_std, y_train)

SVR()

In [15]:
y_pred = svr.predict(X_test_std)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f'R2 = {r2:.2f}')
print(f'MSE = {mse:.2f}')

R2 = 0.74
MSE = 16.30


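#### Code
The complete code for this example follows. Because train_test_split shuffles the data randomly on every run, the results differ from those above.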

In [16]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR

ds = datasets.load_boston()

X = ds.data
y = ds.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Linear regression
lr = LinearRegression()
lr.fit(X_train_std, y_train)

y_pred = lr.predict(X_test_std)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'LinearRegression R2 = {r2:.2f}')
print(f'LinearRegression MSE = {mse:.2f}')

# SVM (support vector machine)
svr = SVR()
svr.fit(X_train_std, y_train)
y_pred = svr.predict(X_test_std)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f'SVM R2 = {r2:.2f}')
print(f'SVM MSE = {mse:.2f}')

LinearRegression R2 = 0.78
LinearRegression MSE = 23.12
SVM R2 = 0.69
SVM MSE = 32.50
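
The two runs in this article rank the models differently, which suggests that a single random split is a noisy basis for comparison. One possible way to reduce that noise, sketched below, is to repeat the split with several seeds and compare the average scores. This sketch uses synthetic data from `make_regression` rather than the Boston dataset, and the helper `evaluate`, the seed range and the sample sizes are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the Boston dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

def evaluate(model, seed):
    """Split, standardize, train and return the test R^2 for one seed."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)
    X_test_std = scaler.transform(X_test)
    model.fit(X_train_std, y_train)
    return r2_score(y_test, model.predict(X_test_std))

seeds = range(5)
scores = {
    "LinearRegression": [evaluate(LinearRegression(), s) for s in seeds],
    "SVR": [evaluate(SVR(), s) for s in seeds],
}

for name, values in scores.items():
    print(f"{name}: mean R2 = {np.mean(values):.2f}, std = {np.std(values):.2f}")
```

On this synthetic data the target is linear by construction, so the ranking here says nothing about which model suits the Boston data; the point is only the repeated-split procedure.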
