# MLClass. "Applied Data Analysis"
# Module "Machine Learning with Python"
<img src="../img/mlclass_logo.jpg" height="240" width="240">
## Authors: Yury Kashnitsky, lecturer at the Faculty of Computer Science, HSE University, and Evgeny Kolmakov, master's student at the Faculty of Computational Mathematics and Cybernetics, Moscow State University
This material is distributed under the <a href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-Share Alike 4.0</a> license. It may be used for any purpose, provided that the course author and affiliation are credited.

# Lesson 6. Neural Networks. Boosting. Blending. Stacking.
## Part 2. An Example of Using the XGBoost Library

XGBoost <a href="https://github.com/dmlc/xgboost/blob/master/doc/parameter.md">parameters</a>

In [1]:
import pickle
import xgboost as xgb

import numpy as np
from sklearn.cross_validation import KFold, train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_iris, load_digits, load_boston
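
The imports above match the scikit-learn release this notebook was written against. In later releases, `sklearn.cross_validation` and `sklearn.grid_search` were merged into `sklearn.model_selection`, and `load_boston` was removed in version 1.2. The sketch below assumes such a newer release is installed and only illustrates how the `KFold` call pattern changes: the number of folds is passed as `n_splits`, and the index pairs come from `.split(X)` rather than from iterating over the object itself. The array `X_demo` is a stand-in introduced for illustration, not data from this lesson.

```python
import numpy as np
from sklearn.model_selection import KFold  # newer home of KFold

# Stand-in data with 150 rows (the same row count as Iris); values do not matter here.
X_demo = np.arange(150 * 4).reshape(150, 4)

# The fold count is now a constructor argument; the data is passed to split().
kf = KFold(n_splits=5, shuffle=True, random_state=13)

fold_sizes = []
seen = []
for train_index, test_index in kf.split(X_demo):
    # Across the five splits, each row appears in exactly one test fold.
    fold_sizes.append(len(test_index))
    seen.extend(test_index.tolist())

print(fold_sizes)
```

Because the shuffling is tied to the library version, folds produced this way should not be expected to match the outputs recorded in this notebook.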

**Classification example on the Iris dataset**

In [2]:
iris = load_iris()
X = iris['data']
y = iris['target']
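# 5-fold cross-validation with shuffling
kf = KFold(y.shape[0], n_folds=5, shuffle=True, random_state=13)
for train_index, test_index in kf:
    # Fit a fresh classifier with default parameters on the training folds
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    # Rows are true classes, columns are predicted classes
    print(confusion_matrix(actuals, predictions))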

[[ 9  0  0]
 [ 0  8  0]
 [ 0  2 11]]
[[10  0  0]
 [ 0  9  1]
 [ 0  1  9]]
[[9 0 0]
 [0 9 1]
 [0 2 9]]
[[12  0  0]
 [ 0 10  1]
 [ 0  0  7]]
[[10  0  0]
 [ 0 11  0]
 [ 0  0  9]]
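
Each printed matrix can be reduced to a single accuracy figure: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total count. A minimal sketch using the first fold's matrix from the output above (the helper name `accuracy_from_confusion` is hypothetical):

```python
# Confusion matrix printed for the first fold above
# (rows: true classes, columns: predicted classes).
fold_1 = [[9, 0, 0],
          [0, 8, 0],
          [0, 2, 11]]


def accuracy_from_confusion(matrix):
    """Share of correctly classified samples: diagonal sum over total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


fold_1_accuracy = accuracy_from_confusion(fold_1)
print(round(fold_1_accuracy, 4))  # 28 of 30 test samples are on the diagonal
```

The two off-diagonal samples sit in the last row, meaning two samples of the third class were predicted as the second class.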


**Regression example on the Boston housing dataset**

In [3]:
boston = load_boston()
y = boston['target']
X = boston['data']
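# 5-fold cross-validation with shuffling
kf = KFold(y.shape[0], n_folds=5, shuffle=True, random_state=17)
for train_index, test_index in kf:
    # Fit a fresh regressor with default parameters on the training folds
    xgb_model = xgb.XGBRegressor().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    # Mean squared error on the held-out fold
    print(mean_squared_error(actuals, predictions))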

8.09713971121
8.92640829762
18.2997739166
6.73819019067
8.47392738812
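
Per-fold errors are easier to compare once they are averaged and converted to the units of the target. The sketch below takes the five MSE values printed above, averages them, and converts each to RMSE; since the Boston target is the median home value in thousands of dollars, RMSE is in the same units. The variable names are illustrative.

```python
import math

# Per-fold MSE values copied from the output above.
fold_mse = [8.09713971121, 8.92640829762, 18.2997739166,
            6.73819019067, 8.47392738812]

# Cross-validated estimate: the mean over the folds.
mean_mse = sum(fold_mse) / len(fold_mse)

# RMSE is in the units of the target (thousands of dollars).
fold_rmse = [math.sqrt(m) for m in fold_mse]

# Index of the fold with the largest error (0-based).
worst_fold = max(range(len(fold_mse)), key=fold_mse.__getitem__)

print(mean_mse, worst_fold)
```

The third fold has roughly twice the error of the others, which is a reminder that a single split can give a misleading picture on a dataset of this size.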


**Cross-validated grid search**

In [4]:
X = boston['data']
y = boston['target']

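xgb_model = xgb.XGBRegressor()
# Try every combination of tree depth and number of trees (3 x 3 = 9 candidates)
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2, 4, 6],
                    'n_estimators': [50, 100, 200]}, verbose=1)
clf.fit(X, y)
# Best mean cross-validated score and the parameters that achieved it
print(clf.best_score_)
print(clf.best_params_)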

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    4.9s finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.598487920717
{'n_estimators': 100, 'max_depth': 4}
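
The log line "Fitting 3 folds for each of 9 candidates, totalling 27 fits" follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A minimal sketch of that enumeration in plain Python (this is an illustration of the counting, not the library's internal code):

```python
from itertools import product

# The grid passed to GridSearchCV above.
param_grid = {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]}
n_folds = 3  # taken from the log line "Fitting 3 folds ..."

# One candidate per element of the Cartesian product of the value lists.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Since no `scoring` argument was passed, `best_score_` is presumably the regressor's default score (R² for scikit-learn regressors), averaged over the folds, rather than an MSE.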


**Using pickle to save trained models**

In [5]:
# Files must be opened in binary mode for pickle
with open("best_boston.pkl", "wb") as f:
    pickle.dump(clf, f)
with open("best_boston.pkl", "rb") as f:
    clf2 = pickle.load(f)
# The restored model should reproduce the original predictions
print(np.allclose(clf.predict(X), clf2.predict(X)))

True


**Early stopping**

In [6]:
digits = load_digits()

X = digits['data']
y = digits['target']
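X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
# Stop adding trees once the multiclass error rate (merror) on the evaluation
# set has not improved for 10 consecutive rounds
clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="merror",
        eval_set=[(X_test, y_test)])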

Will train until validation_0 error hasn't decreased in 10 rounds.
[0]	validation_0-merror:0.168889
[1]	validation_0-merror:0.162222
[2]	validation_0-merror:0.151111
[3]	validation_0-merror:0.142222
[4]	validation_0-merror:0.131111
[5]	validation_0-merror:0.128889
[6]	validation_0-merror:0.124444
[7]	validation_0-merror:0.111111
[8]	validation_0-merror:0.113333
[9]	validation_0-merror:0.111111
[10]	validation_0-merror:0.111111
[11]	validation_0-merror:0.111111
[12]	validation_0-merror:0.106667
[13]	validation_0-merror:0.113333
[14]	validation_0-merror:0.108889
[15]	validation_0-merror:0.104444
[16]	validation_0-merror:0.102222
[17]	validation_0-merror:0.102222
[18]	validation_0-merror:0.102222
[19]	validation_0-merror:0.104444
[20]	validation_0-merror:0.104444
[21]	validation_0-merror:0.100000
[22]	validation_0-merror:0.093333
[23]	validation_0-merror:0.086667
[24]	validation_0-merror:0.086667
[25]	validation_0-merror:0.084444
[26]	validation_0-merror:0.088889
[27]	validation_0-merror:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
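
The stopping rule in the log ("Will train until validation_0 error hasn't decreased in 10 rounds") can be replayed on the recorded values. The sketch below is a simplified reading of that rule, not XGBoost's actual implementation; the function name `early_stopping_round` is hypothetical. It is applied to rounds 0 through 26 only, because the value for round 27 is cut off in the output above.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round); stop_round is None if never triggered."""
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score < best_score:
            # Strict improvement resets the patience counter.
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds.
            return best_round, round_idx
    return best_round, None


# validation_0-merror for rounds 0..26, copied from the log above.
logged_merror = [
    0.168889, 0.162222, 0.151111, 0.142222, 0.131111, 0.128889, 0.124444,
    0.111111, 0.113333, 0.111111, 0.111111, 0.111111, 0.106667, 0.113333,
    0.108889, 0.104444, 0.102222, 0.102222, 0.102222, 0.104444, 0.104444,
    0.100000, 0.093333, 0.086667, 0.086667, 0.084444, 0.088889,
]

best_round, stop_round = early_stopping_round(logged_merror, patience=10)
print(best_round, stop_round)
```

Within the visible part of the log the longest stretch without improvement is four rounds, so the rule is not triggered there. Note also that the evaluation set here is the test split, so the reported error is an optimistic estimate once it has been used to choose the number of trees.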