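## [Assignment Focus]
Learn how to use the hyper-parameter search tools in scikit-learn to find the best hyperparameters.

### Assignment
Apply hyper-parameter search to a different dataset and see whether it can find a better combination of hyperparameters.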

In [1]:
# Import the required packages
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
# Load the handwritten digits dataset
digit = datasets.load_digits()
digit.keys()

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

In [3]:
# Inspect the dataset description
print(digit['DESCR'])

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

In [4]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(digit.data, digit.target, test_size=0.2, random_state=10)
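
As a quick sanity check on the split, the sketch below repeats the same call and prints the resulting shapes. It relies on the copy of the digits data bundled with scikit-learn, which holds 1797 samples rather than the 5620 instances listed in the description for the full UCI collection.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Same dataset and split settings as the cell above.
digit = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digit.data, digit.target, test_size=0.2, random_state=10
)

print("All samples:", digit.data.shape)
print("Training set:", X_train.shape)
print("Test set:", X_test.shape)
```

With 360 images in the test set, each misclassified image changes the reported accuracy by roughly 0.28 percentage points, which is worth keeping in mind when comparing scores later.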

In [5]:
# Build a gradient boosting model with the default hyperparameters
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [6]:
# Accuracy with the default hyperparameters
y_pred = gbc.predict(X_test)
print('The Accuracy of Unadjusted Hyper-Parameters:', metrics.accuracy_score(y_test, y_pred))

The Accuracy of Unadjusted Hyper-Parameters: 0.9666666666666667


In [12]:
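# Define the hyperparameter search space
import numpy as np
# np.linspace includes the endpoint 0.2 by design; np.arange(0.05, 0.2, 0.05)
# only reaches 0.2 through floating-point rounding, which is fragile
learning_rate = np.linspace(0.05, 0.2, 4).tolist()
n_estimators = [40,70,150,200]
# 'auto' is accepted by the scikit-learn version used here; later releases deprecated it
max_features = ['auto','sqrt','log2']

hyperpararmeters = {'learning_rate':learning_rate, 'n_estimators':n_estimators, 'max_features': max_features}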
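
To get a feel for the cost of the search before running it, the grid can be counted with `ParameterGrid`. This is a minimal sketch that assumes the learning-rate range expands to the four values 0.05, 0.1, 0.15 and 0.2 (the best value of 0.2 reported later suggests it does); the names `param_grid`, `n_candidates`, and `n_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the cell above, written out explicitly.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.15, 0.2],
    "n_estimators": [40, 70, 150, 200],
    "max_features": ["auto", "sqrt", "log2"],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 4 * 3 combinations
n_folds = 5                                    # matches cv=5 in the grid search
n_fits = n_candidates * n_folds                # fits needed during cross-validation

print("Candidate combinations:", n_candidates)
print("Model fits with 5-fold CV:", n_fits)
```

Grid search cost grows multiplicatively with every added hyperparameter, so it can help to check this count before widening the grid.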

In [13]:
# Set up a grid search evaluated with cross-validation
gridCV = GridSearchCV(gbc, param_grid=hyperpararmeters, cv=5, n_jobs=4)

# Run the grid search with 5-fold cross-validation
gridCV.fit(X_train, y_train)

# Retrieve the best hyperparameter combination
best_learning_rate = gridCV.best_params_['learning_rate']
best_n_estimators = gridCV.best_params_['n_estimators']
best_max_features = gridCV.best_params_['max_features']

# Print the best hyperparameter values
print('Best learning rate:', best_learning_rate)
print('Best n_estimators:', best_n_estimators)
print('Best max_features:', best_max_features)

Best learning rate: 0.2
Best n_estimators: 150
Best max_features: log2
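
Beyond `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and `cv_results_`, which show how close the competing combinations were. The sketch below is a small stand-in rather than the search above: it assumes the iris dataset and a deliberately tiny grid (`small_grid`, a hypothetical name) so that it runs in a few seconds.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Tiny illustrative grid: 2 * 2 = 4 candidate combinations.
small_grid = {"learning_rate": [0.05, 0.2], "n_estimators": [10, 30]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid=small_grid,
    cv=5,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Mean CV accuracy of best params:", search.best_score_)

# Mean and standard deviation of the validation score for every candidate.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, round(mean, 4), round(std, 4))
```

If several candidates score within one standard deviation of each other, the "best" combination may change from run to run, so the ranking is better read as a guide than as a definitive answer.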


In [14]:
# Retrain the model with the best hyperparameters
gbc_best = GradientBoostingClassifier(learning_rate=best_learning_rate, n_estimators=best_n_estimators, max_features=best_max_features)
gbc_best.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.2, loss='deviance', max_depth=3,
                           max_features='log2', max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=150,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
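
Retraining by hand works, but it may not be necessary: with the default `refit=True`, `GridSearchCV` refits the best combination on the whole training set and stores the result in `best_estimator_`. The following sketch illustrates this on a stand-in dataset (wine) with a small hypothetical grid; it is not the search performed above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [10, 30]},
    cv=3,
)
search.fit(X_train, y_train)

# best_estimator_ is already trained on all of X_train (refit=True by default).
best_model = search.best_estimator_

# Calling predict on the search object delegates to best_estimator_.
same = bool((search.predict(X_test) == best_model.predict(X_test)).all())
print("Predictions identical:", same)
print("Test accuracy:", best_model.score(X_test, y_test))
```

Reusing `best_estimator_` avoids one extra training run and guarantees that the evaluated model is exactly the one selected by the search.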

In [15]:
# Predict on the test set
y_pred = gbc_best.predict(X_test)

# Accuracy after hyperparameter tuning
print('The Accuracy of adjusted Hyper-Parameters:', metrics.accuracy_score(y_test, y_pred))

The Accuracy of adjusted Hyper-Parameters: 0.975
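
Grid search is not the only hyper-parameter search in scikit-learn. As one possible way to approach the assignment, the sketch below swaps in `RandomizedSearchCV`, which samples a fixed number of combinations from distributions instead of trying every grid point. The dataset (iris), the sampling ranges, and `n_iter=5` are assumptions chosen to keep the example fast, not values taken from this notebook.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Distributions to sample from (illustrative ranges).
param_distributions = {
    "learning_rate": uniform(loc=0.05, scale=0.15),  # continuous values in [0.05, 0.20]
    "n_estimators": randint(20, 60),                 # integers from 20 to 59
    "max_features": ["sqrt", "log2", None],          # sampled uniformly from the list
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,
    random_state=0,    # makes the sampled combinations reproducible
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
```

The number of fits here is controlled by `n_iter` rather than by the size of the search space, which tends to make randomized search easier to budget when many hyperparameters are involved.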
