<center>
<img src="../../img/ml_theme.png">
# Continuing Professional <br> Education at HSE University
#### Program "Practical Data Analysis and Machine Learning"
<img src="../../img/faculty_logo.jpg" height="240" width="240">
## Author: Yury Kashnitsky, <br> lecturer at the Faculty of Computer Science, HSE University
</center>
This material is distributed under the <a href="https://opensource.org/licenses/MS-RL">Ms-RL</a> license. It may be used for any purpose except commercial ones, provided that the author is credited.

# <center>Session 8. Advanced Classification and Regression Methods</center>
## <center>Part 7. Tuning XGBoost Hyperparameters. The Hyperopt Library</center>

In [2]:
import numpy as np

from xgboost.sklearn import XGBClassifier
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import log_loss

from scipy.stats import randint, uniform
# Names used in the Hyperopt section at the end of the notebook.
from hyperopt import fmin, hp, tpe, STATUS_OK, Trials

**Generate synthetic data.**

In [3]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, 
                           n_repeated=2, random_state=42)

**We will use 10-fold stratified cross-validation.**

In [4]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
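To make the word "stratified" concrete, here is a minimal standard-library sketch of the idea: the indices of each class are dealt across the folds separately, so every fold keeps roughly the overall class ratio. The helper `stratified_folds` and the 700/300 label split are hypothetical illustrations, not the algorithm `StratifiedKFold` actually uses internally.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed=42):
    """Assign sample indices to folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of one class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical imbalanced labels: 700 negatives and 300 positives.
labels = [0] * 700 + [1] * 300
folds = stratified_folds(labels, n_splits=10)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_counts[0])
```

With these toy labels every fold ends up with 70 negatives and 30 positives, so each validation fold has the same class balance as the full data set.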

### Grid-Search

In [5]:
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}
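Before running an exhaustive search it helps to count how many models will be trained. The sketch below enumerates the same grid with `itertools.product`. The name `params_grid_demo` is introduced only for this illustration, and the learning rates are written out as the three values that `np.linspace(1e-16, 1, 3)` produces.

```python
from itertools import product

# Same grid as above, with the learning rates written out explicitly.
params_grid_demo = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': [1e-16, 0.5, 1.0],
}
n_splits = 10  # number of cross-validation folds

names = sorted(params_grid_demo)
combinations = [
    dict(zip(names, values))
    for values in product(*(params_grid_demo[name] for name in names))
]
n_candidates = len(combinations)
n_fits = n_candidates * n_splits
print(n_candidates, n_fits)  # 36 360
```

The grid has 3 × 4 × 3 = 36 candidates, so 10-fold cross-validation trains 360 models, plus one final refit on the full data when `refit=True`.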

**Define the dictionary of fixed parameters separately.**

In [6]:
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1,
    'seed': 42
}

In [8]:
xgb_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)

In [9]:
%%time
xgb_grid.fit(X, y)

CPU times: user 833 ms, sys: 134 ms, total: 967 ms
Wall time: 7.19 s


GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=42, shuffle=True),
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=42, silent=1, subsample=1),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'learning_rate': array([  1.00000e-16,   5.00000e-01,   1.00000e+00]), 'n_estimators': [5, 10, 25, 50], 'max_depth': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

**The `cv_results_` attribute can be used to plot validation curves.**

In [11]:
xgb_grid.cv_results_

{'mean_fit_time': array([ 0.01221633,  0.01780193,  0.04689682,  0.07434978,  0.01564403,
         0.03670058,  0.06118083,  0.14782541,  0.01960669,  0.06404665,
         0.12422216,  0.19214203,  0.01614695,  0.01762159,  0.04554641,
         0.0733228 ,  0.02001672,  0.02961311,  0.05675511,  0.12664104,
         0.02019458,  0.03863971,  0.08547418,  0.20366635,  0.01574569,
         0.03481348,  0.05126011,  0.09232423,  0.01588628,  0.04012156,
         0.06307402,  0.10700021,  0.01951032,  0.03550386,  0.094115  ,
         0.14071109]),
 'mean_score_time': array([ 0.00165932,  0.00131259,  0.00336685,  0.00097911,  0.00106299,
         0.00105286,  0.0008981 ,  0.00096128,  0.00090954,  0.00085635,
         0.00105612,  0.00159659,  0.00104072,  0.00082819,  0.0017103 ,
         0.00118191,  0.000913  ,  0.00112689,  0.00105858,  0.00127876,
         0.00095367,  0.00088651,  0.00099692,  0.00207779,  0.00126097,
         0.00103703,  0.0014528 ,  0.0024158 ,  0.00114775,  0.00

**Or simply use the best parameter combination.**

In [12]:
print("Best accuracy obtained: {0}".format(xgb_grid.best_score_))
print("Parameters:")
for key, value in xgb_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Best accuracy obtained: 0.891
Parameters:
	learning_rate: 0.5
	n_estimators: 50
	max_depth: 3
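Returning to the validation curves mentioned earlier: one simple way to get the points of such a curve is to group the entries of `cv_results_` by a single hyperparameter and keep the best mean test score for each value. The sketch below relies only on the `'params'` and `'mean_test_score'` keys of `cv_results_`. The helper `validation_curve_points` is hypothetical, and `toy_results` holds made-up placeholder scores that only show the expected structure.

```python
from collections import defaultdict

def validation_curve_points(cv_results, param_name):
    """Return sorted (value, best mean test score) pairs for one hyperparameter."""
    best = defaultdict(lambda: float('-inf'))
    for params, mean_score in zip(cv_results['params'], cv_results['mean_test_score']):
        value = params[param_name]
        best[value] = max(best[value], mean_score)
    return sorted(best.items())

# Placeholder dictionary with made-up scores, NOT results from the search above.
toy_results = {
    'params': [
        {'max_depth': 1, 'n_estimators': 5},
        {'max_depth': 1, 'n_estimators': 50},
        {'max_depth': 3, 'n_estimators': 5},
        {'max_depth': 3, 'n_estimators': 50},
    ],
    'mean_test_score': [0.70, 0.80, 0.75, 0.85],
}
points = validation_curve_points(toy_results, 'n_estimators')
print(points)  # [(5, 0.75), (50, 0.85)]
```

With the fitted search object, the same helper could be called as `validation_curve_points(xgb_grid.cv_results_, 'n_estimators')` and the resulting pairs passed to any plotting library.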


### Randomized Grid-Search
**The randomized version often works quite well and, more importantly, runs much faster.
Now we create a dictionary with parameter distributions:**

In [20]:
params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001), # discrete uniform distribution over 1..1000
    'learning_rate': uniform(), # continuous uniform distribution on [0, 1]
    'subsample': [0.5, 0.75, 1.],
    'colsample_bytree': [0.5, 0.75, 1.]
}
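The following standard-library sketch shows roughly how one candidate is drawn from such a dictionary: a list is sampled uniformly, while a distribution is asked for a random draw. It is an illustration under stated assumptions, not scikit-learn's implementation. `sample_candidate` is a hypothetical helper, and the two lambdas stand in for `randint(1, 1001)` and `uniform()` from `scipy.stats`.

```python
import random

def sample_candidate(space, rng):
    """Draw one parameter combination: lists are sampled uniformly, callables are invoked."""
    candidate = {}
    for name, spec in space.items():
        if callable(spec):
            candidate[name] = spec(rng)
        else:
            candidate[name] = rng.choice(spec)
    return candidate

space = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    # random.randint includes both ends, so this mirrors scipy's randint(1, 1001).
    'n_estimators': lambda rng: rng.randint(1, 1000),
    # Uniform draw from [0, 1), standing in for scipy's uniform().
    'learning_rate': lambda rng: rng.random(),
    'subsample': [0.5, 0.75, 1.0],
    'colsample_bytree': [0.5, 0.75, 1.0],
}

rng = random.Random(42)
candidates = [sample_candidate(space, rng) for _ in range(10)]
for candidate in candidates[:3]:
    print(candidate)
```

Ten draws here correspond to `n_iter=10`: the cost of the search is fixed by the number of sampled candidates, not by the size of the parameter space.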

**Initialize `RandomizedSearchCV` so that it randomly samples 10 parameter combinations.**

In [26]:
rs_grid = RandomizedSearchCV(
    estimator=XGBClassifier(**params_fixed),
    param_distributions=params_dist_grid,
    n_iter=10,
    cv=cv,
    scoring='accuracy',
    random_state=42,
)

In [27]:
%%time
rs_grid.fit(X, y)

CPU times: user 1min 34s, sys: 38.4 s, total: 2min 12s
Wall time: 1min 29s


RandomizedSearchCV(cv=StratifiedKFold(n_splits=10, random_state=42, shuffle=True),
          error_score='raise',
          estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=42, silent=1, subsample=1),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11111f898>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11111f710>, 'max_depth': [1, 2, 3, 4], 'gamma': [0, 0.5, 1], 'colsample_bytree': [0.5, 0.75, 1.0], 'subsample': [0.5, 0.75, 1.0]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score=True, scoring='accuracy', verbose=0)

In [28]:
rs_grid.best_estimator_

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1.0,
       gamma=0.5, learning_rate=0.056411579027100256, max_delta_step=0,
       max_depth=4, min_child_weight=1, missing=None, n_estimators=492,
       nthread=-1, objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=42, silent=1, subsample=0.75)

In [29]:
rs_grid.best_params_

{'colsample_bytree': 1.0,
 'gamma': 0.5,
 'learning_rate': 0.056411579027100256,
 'max_depth': 4,
 'n_estimators': 492,
 'subsample': 0.75}

In [30]:
rs_grid.best_score_

0.89900000000000002

### Hyperopt (Python 2 only for now)
**The Hyperopt library implements a considerably wider range of hyperparameter search algorithms for various models. As an example, we will minimize log_loss on a validation set.**

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3, random_state=42)

**Define the function to be minimized.**

In [40]:
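def score(params):
    print("Training with params:")
    print(params)
    # Work on a copy so that the dictionary passed in by hyperopt is not mutated.
    params = dict(params)
    # n_estimators is passed to xgb.train as the number of boosting rounds.
    num_round = int(params.pop('n_estimators'))
    # hp.quniform yields floats, while the tree depth must be an integer.
    params['max_depth'] = int(params['max_depth'])
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_test, label=y_test)
    model = xgb.train(params, dtrain, num_round)
    # With binary:logistic, predict returns the probability of the positive class.
    predictions = model.predict(dvalid)
    loss = log_loss(y_test, predictions)
    print("\tScore {0}\n\n".format(loss))
    return {'loss': loss, 'status': STATUS_OK}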
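For reference, the quantity being minimized is the binary logarithmic loss, the mean of -(y·log p + (1 - y)·log(1 - p)) over the validation samples. A minimal standard-library sketch is given below. `binary_log_loss` is a hypothetical helper written for illustration, and the clipping constant `eps` is an assumption made here to avoid `log(0)`.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

good = binary_log_loss([1, 0], [0.9, 0.1])  # confident and correct
bad = binary_log_loss([1, 0], [0.1, 0.9])   # confident and wrong
print(good, bad)
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is why log loss is sensitive to poorly calibrated probabilities.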

In [43]:
def optimize(trials):
    space = {
             'n_estimators' : 150,
             'eta' : hp.quniform('eta', 0.025, 0.5, 0.025),
             'max_depth' : hp.quniform('max_depth', 4, 10, 2),
             'min_child_weight' : hp.quniform('min_child_weight', 1, 6, 1),
             'subsample' : hp.quniform('subsample', 0.5, 1, 0.25),
             'gamma' : 0,
             'colsample_bytree' : hp.quniform('colsample_bytree', 0.5, 1, 0.25),
             'eval_metric': 'merror',
             'objective': 'binary:logistic',
             'nthread' : 4,
             'silent' : 1
             }
    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=20)

    return best
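The Hyperopt documentation describes `hp.quniform(label, low, high, q)` as returning a value like `round(uniform(low, high) / q) * q`. The sketch below reproduces that formula with the standard library to show which values the search space above can produce. `quniform_sketch` is a hypothetical stand-in, not Hyperopt's own sampler.

```python
import random

def quniform_sketch(rng, low, high, q):
    """Draw uniform(low, high) and snap it to the nearest multiple of q, as a float."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)
depth_draws = [quniform_sketch(rng, 4, 10, 2) for _ in range(1000)]
subsample_draws = [quniform_sketch(rng, 0.5, 1, 0.25) for _ in range(1000)]

print(sorted(set(depth_draws)))
print(sorted(set(subsample_draws)))
```

The draws are always floats (for example `4.0` rather than `4`), so integer-valued parameters such as `max_depth` need an explicit `int(...)` conversion before they are handed to the model.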

In [45]:
trials = Trials()
best_params = optimize(trials)
print(best_params)

Training with params:
{'colsample_bytree': 0.75, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 2.0, 'n_estimators': 150, 'subsample': 1.0, 'eta': 0.125, 'objective': 'binary:logistic', 'max_depth': 4.0, 'gamma': 0}
	Score 0.282255703456


Training with params:
{'colsample_bytree': 0.75, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 4.0, 'n_estimators': 150, 'subsample': 0.5, 'eta': 0.42500000000000004, 'objective': 'binary:logistic', 'max_depth': 6.0, 'gamma': 0}
	Score 0.367044530358


Training with params:
{'colsample_bytree': 0.5, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 6.0, 'n_estimators': 150, 'subsample': 1.0, 'eta': 0.1, 'objective': 'binary:logistic', 'max_depth': 6.0, 'gamma': 0}
	Score 0.294940195541


Training with params:
{'colsample_bytree': 1.0, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 2.0, 'n_estimators': 150, 'subsample': 0.5, 'eta': 0.4, 'objective': 'bina