## [Assignment Focus]
Learn how to use hyper-parameter search in scikit-learn to find the best hyperparameters.

### Assignment
Use a different dataset and apply hyper-parameter search to see whether a best combination of hyperparameters can be found.

In [30]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

In [31]:
bc = datasets.load_breast_cancer()
print(bc.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [32]:
bc.target[-10:]

array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

In [33]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X = bc.data
y = bc.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
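
The split above is unseeded, so the accuracies reported later will differ from run to run. A minimal sketch of a reproducible, class-stratified split follows; `random_state=42` and `stratify=y` are assumptions added for illustration and are not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

# random_state fixes the shuffle; stratify keeps the class ratio similar
# in both splits. Both values are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# With 569 samples and test_size=0.1, the test split holds 57 samples.
print(X_train.shape, X_test.shape)
```

Re-running this cell yields the same partition each time, which makes later accuracy comparisons repeatable.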

In [34]:
# Baseline model: GradientBoostingClassifier with default hyperparameters

gbc = GradientBoostingClassifier()
gbc.fit(X_train_std, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [35]:
# Grid Search for hyperparameters
param_grid = {'n_estimators': [100, 200, 300], 'learning_rate':[0.1, 0.5, 1.0]}

grid_search = GridSearchCV(gbc, param_grid, scoring='accuracy', n_jobs=-1, verbose=1, cv=5)
grid_result = grid_search.fit(X_train_std, y_train)

print("Grid Best Accuracy : %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    3.9s finished


Grid Best Accuracy : 0.962891 using {'learning_rate': 0.1, 'n_estimators': 300}
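
The best score alone hides how close the other candidates were. As a sketch, `cv_results_` can be used to rank every candidate by its mean cross-validated accuracy. To keep the run short, the example uses a smaller grid than the notebook, `cv=3`, and a fixed `random_state`; all three are assumptions for illustration. Scaling is omitted here because tree-based splits are largely unaffected by per-feature rescaling.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_breast_cancer(return_X_y=True)

# Smaller grid than the notebook's so the sketch runs quickly (assumption).
param_grid = {"n_estimators": [20, 40], "learning_rate": [0.1, 0.5]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

results = search.cv_results_
# rank_test_score is 1 for the best candidate; sort candidates by it.
order = results["rank_test_score"].argsort()
for i in order:
    print(
        results["params"][i],
        "mean=%.4f" % results["mean_test_score"][i],
        "std=%.4f" % results["std_test_score"][i],
    )
```

If the top candidates differ by less than their fold-to-fold standard deviation, the choice between them is not well supported by the data.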


In [36]:
# Random Search for hyperparameters
import numpy as np
param_dist = {'n_estimators':range(100, 300, 10), 'learning_rate':np.arange(0.1, 1.0, 0.1)}

random_search = RandomizedSearchCV(gbc, param_dist, scoring='accuracy', n_jobs=-1, verbose=1, cv=5)
random_result = random_search.fit(X_train_std, y_train)

print("Random Best Accuracy : %f using %s" %(random_result.best_score_, random_result.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    2.8s finished


Random Best Accuracy : 0.962891 using {'n_estimators': 110, 'learning_rate': 0.5}
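
`RandomizedSearchCV` evaluates only `n_iter` sampled candidates (10 by default, which matches the 10 candidates in the log above) rather than every combination. The sketch below assumes continuous distributions from `scipy.stats` in place of the fixed value lists; the ranges, `n_iter=5`, `cv=3`, and `random_state=0` are illustrative choices to keep the run short and repeatable.

```python
from scipy.stats import loguniform, randint
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = datasets.load_breast_cancer(return_X_y=True)

# Distributions are sampled afresh for each candidate (ranges are assumptions).
param_dist = {
    "n_estimators": randint(20, 60),         # integers in [20, 60)
    "learning_rate": loguniform(0.05, 1.0),  # log-uniform over [0.05, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_dist,
    n_iter=5,          # number of sampled candidates
    scoring="accuracy",
    cv=3,
    random_state=0,    # makes the sampled candidates repeatable
)
search.fit(X, y)

print("Random Best Accuracy : %f using %s" % (search.best_score_, search.best_params_))
```

Sampling the learning rate on a log scale spends as many draws between 0.05 and 0.1 as between 0.5 and 1.0, which is often a reasonable default for rate-like parameters.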


In [37]:
# train model with the best parameters of Grid and Random search

gbc_grid = GradientBoostingClassifier(**grid_result.best_params_)
gbc_grid.fit(X_train_std, y_train)

gbc_random = GradientBoostingClassifier(**random_result.best_params_)
gbc_random.fit(X_train_std, y_train)

# predict with gbc, gbc_grid and gbc_random

y_pred = gbc.predict(X_test_std)
y_grid = gbc_grid.predict(X_test_std)
y_random = gbc_random.predict(X_test_std)

print("Basic Accuracy : ", metrics.accuracy_score(y_test, y_pred))
print("Accuracy of Grid Search   : ", metrics.accuracy_score(y_test, y_grid))
print("Accuracy of Random Search : ", metrics.accuracy_score(y_test, y_random))

Basic Accuracy :  0.9649122807017544
Accuracy of Grid Search   :  0.9473684210526315
Accuracy of Random Search :  0.9649122807017544
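
Because `refit=True` is the default, a fitted search object has already retrained the best configuration on the full training data, so building a new classifier from `best_params_` is optional. A sketch, assuming a small grid, `cv=3`, and fixed seeds for speed and repeatability:

```python
from sklearn import datasets, metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": [20, 40], "learning_rate": [0.1, 0.5]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# best_estimator_ is already refit on all of X_train with best_params_.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test, y_pred)
print("Test accuracy:", test_accuracy)

# search.predict delegates to best_estimator_, so the predictions agree.
assert (search.predict(X_test) == y_pred).all()
```

In the output above, 0.9649 and 0.9474 correspond to 55 and 54 correct predictions out of 57 test samples, so the gap between the models is a single sample and should not be read as a clear ranking.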
