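## [Assignment Focus]
Learn how to use hyper-parameter search in scikit-learn to find the best hyper-parameters.

### Assignment
Try hyper-parameter search on several different datasets and see whether it finds the best hyper-parameter combination.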
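
As a minimal sketch of the workflow the assignment asks for, the cell below runs a grid search on a synthetic regression problem. The `make_regression` data, the grid values, and the variable names are illustrative assumptions and are not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grid: 2 x 2 = 4 candidates.
param_grid = {"n_estimators": [50, 100], "max_depth": [1, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated neg-MSE, so negate it to get an MSE.
cv_mse = -search.best_score_
# With the default refit=True, best_estimator_ is already refit on all of X_train.
test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print("best params:", search.best_params_)
print("cv mse: %f, test mse: %f" % (cv_mse, test_mse))
```

The cross-validated MSE and the held-out test MSE are reported separately because they measure different things and need not agree.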

In [1]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
# future
#from sklearn.ensemble import BaggingClassifier
#from sklearn.ensemble import RandomForestClassifier
#from sklearn.ensemble import ExtraTreesClassifier

In [2]:
def default_gbr(clf, sets):
    """Fit `clf` as given (no tuning) and return its MSE on the test split."""
    x_train, x_test, y_train, y_test = sets
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    mse = metrics.mean_squared_error(y_test, y_pred)
    #print('default mse %f' % mse)
    return mse

In [3]:
def best_gbr(clf, sets, param_grid, scoring="neg_mean_squared_error"):
    """Grid-search `param_grid` on the training split; return the best model's test MSE."""
    x_train, x_test, y_train, y_test = sets

    # grid search with 3-fold cross-validation on the training split only
    grid_search = GridSearchCV(clf, param_grid, scoring=scoring, n_jobs=4, verbose=1, cv=3)
    grid_result = grid_search.fit(x_train, y_train)

    # NOTE: despite the "acc" label, this is the mean cross-validated MSE (the negated neg-MSE score)
    print("grid search acc: %f using %s" % (-grid_result.best_score_, grid_result.best_params_))

    # GridSearchCV (refit=True by default) has already re-fit the best parameters on x_train;
    # reusing best_estimator_ keeps clf's other settings, such as random_state
    clf_bestparam = grid_result.best_estimator_
    y_pred = clf_bestparam.predict(x_test)
    mse_bestparam = metrics.mean_squared_error(y_test, y_pred)
    #print('after grid mse %f' % mse_bestparam)
    return mse_bestparam

In [4]:
def cmp_gbr(name, clf, sets, param_grid):
    """Print the test MSE of the untuned model next to the grid-searched one."""
    mse0 = default_gbr(clf, sets)
    mse1 = best_gbr(clf, sets, param_grid)
    print('mse of %s: default=%f, best=%f'% (name, mse0, mse1))

In [5]:
gbr = GradientBoostingRegressor(random_state=7)

In [6]:
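# NOTE: iris and wine are classification datasets; their integer class labels
# are treated as regression targets here.
iris = datasets.load_iris()
# NOTE: load_boston was removed in scikit-learn 1.2, so this line needs an older version.
boston = datasets.load_boston()
wine = datasets.load_wine()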

In [7]:
iris_sets = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)
boston_sets = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)
wine_sets = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

In [10]:
gbr_param_grid = dict(n_estimators=[50,75,100,200], max_depth=[1,3,5,7])
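
The grid above has 4 x 4 = 16 combinations, and with `cv=3` in `best_gbr` each search performs 48 fits. A quick way to check the size of a grid before running it is sketched below using `ParameterGrid` from `sklearn.model_selection`; the names `candidates` and `n_folds` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

gbr_param_grid = dict(n_estimators=[50, 75, 100, 200], max_depth=[1, 3, 5, 7])

# ParameterGrid enumerates every combination that GridSearchCV would try.
candidates = list(ParameterGrid(gbr_param_grid))
n_folds = 3  # matches cv=3 in best_gbr

print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
# prints: 16 candidates, 48 fits
```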

In [11]:
for name,sets in zip(('iris', 'boston', 'wine'), (iris_sets, boston_sets, wine_sets)):
    cmp_gbr(name, gbr, sets, gbr_param_grid)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  48 out of  48 | elapsed:    0.5s finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


grid search acc: 0.020517 using {'max_depth': 5, 'n_estimators': 50}
mse of iris: default=0.020740, best=0.026081
Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=4)]: Done  48 out of  48 | elapsed:    1.0s finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


grid search acc: 10.783516 using {'max_depth': 3, 'n_estimators': 200}
mse of boston: default=10.599670, best=11.450995
Fitting 3 folds for each of 16 candidates, totalling 48 fits
grid search acc: 0.036940 using {'max_depth': 3, 'n_estimators': 50}
mse of wine: default=0.031755, best=0.029182


[Parallel(n_jobs=4)]: Done  48 out of  48 | elapsed:    0.5s finished
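
In the output above, the grid-searched model has a higher test MSE than the default for iris and boston, even though the default settings of `GradientBoostingRegressor` (`n_estimators=100`, `max_depth=3`) are themselves in the grid. The search ranks candidates by cross-validated score on the training split, which does not guarantee a lower MSE on one particular test split. One way to look into this is to list every candidate's cross-validated MSE. The sketch below does that on synthetic data; the helper `rank_candidates` and the `make_regression` dataset are hypothetical stand-ins, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


def rank_candidates(search):
    """Return (rank, mean CV MSE, std, params) tuples, best candidate first."""
    results = search.cv_results_
    rows = zip(
        results["rank_test_score"],
        -results["mean_test_score"],  # scores are neg-MSE, so negate them
        results["std_test_score"],
        results["params"],
    )
    return sorted(rows, key=lambda row: row[0])


# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)
search = GridSearchCV(
    GradientBoostingRegressor(random_state=1),
    {"n_estimators": [50, 100], "max_depth": [1, 3]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

ranked = rank_candidates(search)
for rank, mse, std, params in ranked:
    print("rank %d: cv mse %.3f (+/- %.3f) %s" % (rank, mse, std, params))
```

When the top candidates differ by less than the fold-to-fold spread, the choice among them is largely noise, and a different ordering on the test split is not surprising.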
