# Predicting Boston House Prices with Scikit-Learn
Data source: Scikit-learn  
URL: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston

In [1]:
from sklearn import datasets

In [2]:
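# Step 1: load the data
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version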
ds = datasets.load_boston()

In [3]:
# Inspect the data
print(ds.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

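Use a few tools to check whether every column is numeric and whether any values are missing.  
Non-numeric or missing values must be handled before training.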
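
The Boston data turns out to need no cleaning, so the handling step is never shown. As a minimal sketch of what the check and one possible treatment could look like, the example below uses a hypothetical toy DataFrame (the column names `rooms` and `river` are invented for illustration). Median filling and 0/1 mapping are only one choice among several.

```python
import pandas as pd

# Hypothetical toy data: one missing number and one text column
df = pd.DataFrame({
    "rooms": [6.5, None, 7.1, 5.9],
    "river": ["yes", "no", "no", "yes"],
})

# Columns whose dtype is not numeric
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column
null_counts = df.isnull().sum()

# One possible treatment (an assumption, not the only option):
# fill missing numbers with the column median, map text categories to 0/1
cleaned = df.copy()
cleaned["rooms"] = cleaned["rooms"].fillna(cleaned["rooms"].median())
cleaned["river"] = cleaned["river"].map({"no": 0, "yes": 1})

print(non_numeric)
print(null_counts.to_dict())
print(cleaned)
```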

In [4]:
# View the data as a table with pandas
# Set X
import pandas as pd
X = pd.DataFrame(ds.data, columns=ds.feature_names)
X.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2    4.09  1.0  296.0     15.3   396.9   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8   396.9   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7   396.9   5.33


In [5]:
# Set y
y = ds.target
y

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21

In [6]:
# Check whether the data contains missing values
X.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

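After confirming the data is clean, split it. This step divides the full dataset into "training data" and "test data".  
The function used is train_test_split(X, y, test_size=.2)  
Argument 1: X, the feature data  
Argument 2: y, the target values  
Argument 3: test_size, the size of the test set, given either as a proportion or as a number of samples  
Return values, in order: X training data, X test data, y training data, y test data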
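
To illustrate the two ways of specifying `test_size`, here is a small sketch on hypothetical stand-in data (10 samples, 2 features). The `random_state=0` argument is an addition not used in this notebook; it only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: row i of X is [2*i, 2*i + 1], and y[i] = i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size as a proportion: 20% of 10 samples gives 2 test samples
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# test_size as a count: exactly 3 test samples
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(X, y, test_size=3, random_state=0)
print(X_tr3.shape, X_te3.shape, y_tr3.shape, y_te3.shape)

# Rows of X and entries of y are shuffled together, so they stay paired
print(np.array_equal(X_te[:, 0] // 2, y_te))
```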

In [7]:
# Step 4: split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((404, 13), (102, 13), (404,), (102,))

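After splitting, standardize the feature data (X). The formula is: (X - mean) / standard deviation  
After standardization, each training feature has mean 0 and standard deviation 1; individual values are not confined to the range -1 to 1  
The training and test data use different methods, so that the mean and standard deviation are learned from the training data only  
Class used: StandardScaler  
Training data: .fit_transform(X_train)  
Test data: .transform(X_test)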
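
As a rough sketch of what the two methods do, the formula can be written out by hand with NumPy on a hypothetical one-column toy example. StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

# Hypothetical toy feature column, already split into train and test parts
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])
X_test = np.array([[0.0], [5.0]])

# "fit": the statistics come from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply (X - mean) / std to both sets with the same statistics
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(X_train_std.ravel())
print(X_test_std.ravel())
print(X_train_std.mean(), X_train_std.std())
```

The standardized training column has mean 0 and standard deviation 1, yet its largest value is well above 1, which shows that the result is not bounded by -1 and 1.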

In [8]:
# Step 3: standardization: (X - mean) / SD
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_std = scaler.fit_transform(X_train)

In [9]:
# The test data is not used for fitting; it is only transformed
X_test_std = scaler.transform(X_test)
X_test_std[0]

array([-0.37972154, -0.49728974, -1.19241966, -0.28828791, -0.93860831,
       -0.16691346,  0.05770603, -0.15897378, -0.84923229, -0.74943601,
       -0.20116703,  0.36349628, -0.14759848])

In [10]:
# Step 5: choose an algorithm: linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

# Step 6: train
lr.fit(X_train_std, y_train)

LinearRegression()

In [11]:
# Coefficients of X
lr.coef_

array([-1.13086873,  1.06647572,  0.09167089,  0.69005858, -2.15799533,
        2.7635342 ,  0.14636638, -3.0951146 ,  2.80086855, -1.96523454,
       -2.14636742,  0.78589611, -3.68985391])

In [12]:
# Intercept
lr.intercept_

22.853217821782206
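
The coefficients and intercept together define the fitted model: a prediction is the dot product of a feature row with `coef_`, plus `intercept_`. The sketch below checks this on hypothetical noise-free synthetic data (the generating weights 3, -2 and offset 5 are invented for illustration), not on the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3*x1 - 2*x2 + 5 with no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

model = LinearRegression().fit(X, y)

# Manual prediction: feature rows times coefficients, plus the intercept
manual = X @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(X)))
```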

In [13]:
y_pred = lr.predict(X_test_std)
y_pred

array([24.86506452, 30.50978142, 25.95808216, 20.91013385, 26.76976524,
       17.47831168, 22.75126252, 15.74795015, 11.78185986, 18.01164711,
        6.51789504, 14.468403  , 41.32876333, 20.63154787, 28.98507235,
       20.49719447, 15.6633102 , 18.7601521 , 12.60578741, 25.0365703 ,
       21.56033218,  9.18549737, 30.1432942 , 23.23614955, 27.21801088,
       23.90065237, 22.81803001, 14.59620406, 22.81550803, 20.65691606,
       13.71387575, 37.18302166, 31.07855835, 24.1029135 , 20.29577775,
       25.1586361 , 14.60589919,  5.97260756, 33.54154977, 15.89420022,
       21.37829032, 31.69840924, 15.68904361, 35.61237928, 20.02220057,
       24.01108848, 13.31577341, 29.37567087, 30.90517383, 16.61991774,
       32.43519019, 34.71755942, 19.1950607 , 25.27558389, 14.10012207,
       28.52198618, 35.45877545, 35.03648936, 21.45419416, 22.84688099,
       17.44160813, 20.01773496, 35.9940572 , 19.51637448, 27.88802734,
       14.76237167, 25.34667352, 20.62646166, 20.72873639, 17.43

In [14]:
# Step 7: scoring: R-squared and MSE
from sklearn.metrics import mean_squared_error, r2_score
print(f'R2 = {r2_score(y_test, y_pred):.2f}')
print(f'MSE = {mean_squared_error(y_test, y_pred):.2f}')

R2 = 0.79
MSE = 15.42
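
For reference, both metrics can be computed by hand. The sketch below uses hypothetical values rather than this notebook's predictions: MSE is the mean of the squared residuals, and R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'R2 = {r2:.3f}')
print(f'MSE = {mse:.3f}')
```

A lower MSE and an R2 closer to 1 both indicate a better fit. MSE is expressed in squared units of the target.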


In [15]:
# Step 8: evaluation: switch the algorithm to SVR, then recompute the scores
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train_std, y_train)

SVR()

In [16]:
y_pred = svr.predict(X_test_std)
print(f'R2 = {r2_score(y_test, y_pred):.2f}')
print(f'MSE = {mean_squared_error(y_test, y_pred):.2f}')

R2 = 0.76
MSE = 17.19
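
Because train_test_split shuffles randomly, the scores above change from run to run. One possible way to compare the two algorithms on the same split is sketched below. It runs on hypothetical synthetic data, and `make_pipeline` and `random_state` are additions that this notebook does not use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# A fixed random_state gives both models the same split on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for name, estimator in [("LinearRegression", LinearRegression()), ("SVR", SVR())]:
    # The pipeline fits the scaler on the training data only, then the model
    model = make_pipeline(StandardScaler(), estimator)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = (r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred))
    print(f'{name}: R2 = {results[name][0]:.2f}, MSE = {results[name][1]:.2f}')
```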
