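## [Assignment Focus]
Learn how to use the hyper-parameter search in Sklearn to find the best hyper-parameters

### Assignment
Try a different dataset and use hyper-parameter search to see whether the best combination of hyper-parameters can be found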

In [373]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split,KFold,GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier as XGBclr
from xgboost import XGBRegressor as XGBregr
import numpy as np
import pandas as pd
import copy
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_curve
from sklearn.metrics import mean_squared_error
# accuracy_score is used in the Titanic cells below
from sklearn.metrics import accuracy_score


In [416]:
# Titanic dataset - XGBoost classifier

In [417]:
data_path = 'data/'
df = pd.read_csv(data_path + 'titanic_train.csv')

df_T = df['Survived']
df = df.drop(['PassengerId', 'Survived'] , axis=1)
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [418]:
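# Both categorical and numeric features are needed, so use the simplest possible feature engineering
LEncoder = LabelEncoder()
MMEncoder = MinMaxScaler()
for c in df.columns:
    df[c] = df[c].fillna(-1)
    if df[c].dtype == 'object':
        # Cast to str so the filled -1 and the original strings share one type
        df[c] = LEncoder.fit_transform(df[c].astype(str))
    df[c] = MMEncoder.fit_transform(df[c].values.reshape(-1, 1))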
df.head()



Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1.0,0.121348,1.0,0.283951,0.125,0.0,0.769118,0.014151,0.0,1.0
1,0.0,0.213483,0.0,0.481481,0.125,0.0,0.876471,0.139136,0.557823,0.333333
2,1.0,0.396629,0.0,0.333333,0.0,0.0,0.983824,0.015469,0.0,1.0
3,0.0,0.305618,0.0,0.444444,0.125,0.0,0.072059,0.103644,0.380952,1.0
4,1.0,0.016854,1.0,0.444444,0.0,0.0,0.694118,0.015713,0.0,1.0
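
The encode-then-scale loop above can be tried in isolation on a tiny frame. This is a minimal sketch, not the notebook's exact code: the `toy` values are made up, and missing categories are filled with the string `"missing"` (an assumption) rather than `-1`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy data: one categorical and one numeric column, each with a gap
toy = pd.DataFrame({
    "Sex": ["male", "female", None, "female"],
    "Age": [22.0, 38.0, None, 35.0],
})

encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    if is_numeric_dtype(toy[col]):
        # Numeric column: fill gaps with -1, as the notebook does
        values = toy[col].fillna(-1).to_numpy(dtype=float)
    else:
        # Categorical column: fill gaps with a string, then map labels to integers
        labels = toy[col].fillna("missing").astype(str).tolist()
        values = LabelEncoder().fit_transform(labels).astype(float)
    # Rescale each column to the range [0, 1]
    encoded[col] = MinMaxScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(encoded)
```

Label encoding assigns integers in sorted label order, so the resulting numbers carry no real ordering. That is usually acceptable for tree-based models such as the ones used here.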


In [419]:
df_D = df.values
train_X, test_X, train_Y, test_Y = train_test_split(df_D, df_T, test_size=0.25,random_state = 42)

In [420]:
clf = XGBclr(
    random_state=7)

In [421]:
clf.fit(train_X,train_Y)
Pred_Y = clf.predict(test_X)





In [422]:
# First look at the result obtained with the default parameters
clf.fit(train_X, train_Y)
acc = accuracy_score(test_Y, Pred_Y)
print("Accuracy: ", acc)

Accuracy:  0.7847533632286996


In [423]:
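# Define the hyper-parameter combinations to search
n_estimators = [100,200,300]
max_depth = [1,3,5]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)

## Build the search object from the model and the parameter grid (n_jobs=-1 uses all CPU cores in parallel)
grid_search = GridSearchCV(clf, param_grid, scoring="accuracy", n_jobs=-1, verbose=1)

# Start searching for the best parameters
grid_result = grid_search.fit(train_X, train_Y)

# This run uses 3-fold cross-validation (the default in older scikit-learn releases; 0.22 and later default to 5-fold).
# With 9 parameter combinations, 27 models are trained in total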

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Done   4 out of  27 | elapsed:    0.1s remaining:    1.3s




[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:    1.1s finished


In [424]:
# Print the best score and the best parameters
print("Best Accuracy: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best Accuracy: 0.805389 using {'max_depth': 5, 'n_estimators': 100}
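
`best_score_` reports only the winning combination. To see how every candidate performed, the fitted search object also exposes `cv_results_`. The sketch below is illustrative: it uses synthetic data from `make_classification` and a `GradientBoostingClassifier` (both assumptions, chosen so the example runs without the Titanic file or xgboost), with a smaller grid than the notebook's.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic set)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"n_estimators": [20, 40], "max_depth": [1, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=7),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

# One row per parameter combination, with mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

Comparing `std_test_score` against the gap between candidates helps judge whether the top-ranked combination is meaningfully better or within fold-to-fold noise.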


In [425]:
XGBrclr_bestparam = XGBclr(max_depth=grid_result.best_params_['max_depth'],
                           n_estimators=grid_result.best_params_['n_estimators'])

In [426]:
XGBrclr_bestparam.fit(train_X,train_Y)
Pred_Y = XGBrclr_bestparam.predict(test_X)



In [427]:
# Evaluate the model rebuilt with the best parameters
acc = accuracy_score(test_Y, Pred_Y)
print("Accuracy: ", acc)

Accuracy:  0.7982062780269058
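
Rebuilding the model by hand is not the only option. With the default `refit=True`, `GridSearchCV` retrains the best candidate on the full training set and exposes it as `best_estimator_`. A minimal sketch on synthetic data (the dataset, the model and the grid are assumptions, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, split the same way as in the notebook
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=7),
    {"n_estimators": [20, 40], "max_depth": [1, 3]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# Already refit on all of X_train with the best parameters
best_model = search.best_estimator_
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(search.best_params_, test_acc)
```

Because `best_estimator_` is a clone of the estimator passed to the search, it also keeps settings such as `random_state`, which a manual rebuild can easily drop.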


In [428]:
# Boston housing dataset - GradientBoostingRegression

In [399]:
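# Load the Boston housing dataset
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older release)
boston = datasets.load_boston()
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=42)

# Build the model
regr = GradientBoostingRegressor(random_state=7)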

In [400]:
# First look at the result obtained with the default parameters: an MSE of about 8.91
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
print(metrics.mean_squared_error(y_test, y_pred))

8.913775994322064


In [401]:
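# Define the hyper-parameter combinations to search
n_estimators = [100,200,300]
max_depth = [1,3,5]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)

## Build the search object from the model and the parameter grid (n_jobs=-1 uses all CPU cores in parallel)
# Pass the regressor `regr` here; the run recorded below passed the Titanic classifier `clf`,
# which explains the XGBoost warning in its output
grid_search = GridSearchCV(regr, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)

# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)

# This run uses 3-fold cross-validation (the default in older scikit-learn releases; 0.22 and later default to 5-fold).
# With 9 parameter combinations, 27 models are trained in total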

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:   49.6s finished


Parameters: { scale_pos_weight } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




In [402]:
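# Print the best score and the best parameters
# (with scoring="neg_mean_squared_error", best_score_ is the negated MSE, hence the minus sign)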
print("Best mean_squared_error: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best mean_squared_error: -29.996913 using {'max_depth': 5, 'n_estimators': 100}
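
The negative value follows from scikit-learn's convention that a higher score is always better, so error metrics are exposed in negated form. Flipping the sign recovers the usual MSE. A small sketch on synthetic regression data (the dataset and grid are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Boston set)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=7),
    {"n_estimators": [20, 40], "max_depth": [1, 3]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# best_score_ is -MSE; negate it to report the cross-validated MSE itself
best_mse = -search.best_score_
print("Best CV MSE: %f using %s" % (best_mse, search.best_params_))
```

Note that this is the mean MSE over the validation folds, so it is not directly comparable to an MSE measured on a separate test set.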


In [403]:
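# Rebuild the model with the best parameters
# (note: this is an XGBRegressor, not the GradientBoostingRegressor used for the baseline above,
# so the two MSE values are not a like-for-like comparison)
XGBregr_bestparam = XGBregr(max_depth=grid_result.best_params_['max_depth'],
                            n_estimators=grid_result.best_params_['n_estimators'])

# Train the model
XGBregr_bestparam.fit(x_train, y_train)

# Predict on the test set
y_pred = XGBregr_bestparam.predict(x_test)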

In [404]:
# In this run the MSE after tuning is about 9.57, which is higher than the 8.91 baseline
print(metrics.mean_squared_error(y_test, y_pred))

9.57366320194703
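
Grid search is one of the search strategies scikit-learn offers. When the grid grows large, `RandomizedSearchCV` samples a fixed number of combinations instead of trying them all. The sketch below is only an illustration of that alternative: the synthetic data, the sampling ranges and `n_iter=5` are assumptions, not settings taken from this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the Boston set)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Integer distributions to sample from: n_estimators in [20, 59], max_depth in [1, 3]
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(1, 4),
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=7),
    param_distributions,
    n_iter=5,  # number of sampled combinations
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("Best CV MSE: %f using %s" % (-search.best_score_, search.best_params_))
```

The cost is controlled by `n_iter` rather than by the size of the grid, which makes it easier to widen the ranges without multiplying the number of fits.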
