<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course. Session 2
Author: Yury Kashnitsky, research programmer at Mail.ru Group and senior lecturer at the HSE Faculty of Computer Science. This material is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. It may be used for any purpose (edited, corrected, built upon) except commercial ones, provided the author is credited.

# <center>Topic 10. Boosting
## <center>Part 8. Evaluating XGBoost Results

## Loading libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb

## Loading and preparing the data

We will use customer churn data from a telecom company as the example.

In [2]:
df = pd.read_csv('../../data/telecom_churn.csv')

In [3]:
df.head()

| | State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | No | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | No | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | No | No | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | Yes | No | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | Yes | No | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


**We simply encode the states as integers, and make the features International plan (international roaming enabled), Voice mail plan (voicemail enabled) and the target Churn binary.**

In [4]:
state_enc = LabelEncoder()
df['State'] = state_enc.fit_transform(df['State'])
df['International plan'] = (df['International plan'] == 'Yes').astype('int')
df['Voice mail plan'] = (df['Voice mail plan'] == 'Yes').astype('int')
df['Churn'] = (df['Churn']).astype('int')
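
To make the transformation above concrete, here is a minimal pure-Python sketch of the same idea on a few hypothetical rows. States are mapped to integer codes in sorted order (which is how `LabelEncoder` assigns them), and Yes/No flags become 1/0. The toy records and the helper `fit_label_codes` are illustrative assumptions, not part of the course code.

```python
# Toy records standing in for a few rows of the churn table (hypothetical values).
rows = [
    {"State": "OH", "International plan": "No", "Churn": False},
    {"State": "KS", "International plan": "Yes", "Churn": True},
    {"State": "NJ", "International plan": "No", "Churn": False},
    {"State": "OH", "International plan": "Yes", "Churn": False},
]


def fit_label_codes(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


state_codes = fit_label_codes(row["State"] for row in rows)

# Replace the state by its code and turn the flags into 0/1 integers.
encoded = [
    {
        "State": state_codes[row["State"]],
        "International plan": int(row["International plan"] == "Yes"),
        "Churn": int(row["Churn"]),
    }
    for row in rows
]

print(state_codes)
print(encoded[0])
```

Note that such integer codes impose an arbitrary order on the states; tree-based models usually tolerate this, which is presumably why the simple encoding is used here.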

**Split the data into training and test sets in a 7:3 ratio and create the corresponding `DMatrix` objects.**

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Churn', axis=1), df['Churn'],
                                                    test_size=0.3, random_state=42)
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

**Set the XGBoost parameters.**

In [6]:
params = {
    'objective':'binary:logistic',
    'max_depth':3,
    'silent':1,
    'eta':0.5
}

num_rounds = 10

**We will track model quality on both the training set and the validation set (the test set plays the role of validation here).**

In [7]:
watchlist  = [(dtest,'test'), (dtrain,'train')]

## Using built-in metrics
XGBoost implements most of the popular metrics for classification, regression and ranking:

- `rmse` - [root mean square error](https://www.wikiwand.com/en/Root-mean-square_deviation)
- `mae` - [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error?oldformat=true)
- `logloss` - [negative log-likelihood](https://en.wikipedia.org/wiki/Likelihood_function?oldformat=true)
- `error` - binary classification error rate (the default for binary classification in the XGBoost version used here; newer releases default to `logloss`)
- `merror` - multiclass classification error rate
- `auc` - [area under curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic?oldformat=true)
- `ndcg` - [normalized discounted cumulative gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain?oldformat=true)
- `map` - [mean average precision](https://en.wikipedia.org/wiki/Information_retrieval?oldformat=true)
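
As a rough illustration of what two of the metrics in this list measure, the sketch below computes `error` and `logloss` by hand on a tiny hypothetical sample. The function names and the toy data are assumptions made for this example; this is not XGBoost's internal implementation.

```python
import math


def error_rate(labels, probs, threshold=0.5):
    """Share of objects whose thresholded prediction differs from the true label."""
    wrong = sum(int((p > threshold) != bool(y)) for y, p in zip(labels, probs))
    return wrong / len(labels)


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical labels and predicted probabilities of class 1.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

print(error_rate(labels, probs))
print(log_loss(labels, probs))
```

The error rate only looks at which side of the threshold a prediction falls on, while log loss also penalizes confident wrong probabilities, so the two can rank models differently.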

In [8]:
xgb_model = xgb.train(params, dtrain, num_rounds, watchlist)

[0]	test-error:0.1	train-error:0.091299
[1]	test-error:0.09	train-error:0.088298
[2]	test-error:0.073	train-error:0.067724
[3]	test-error:0.067	train-error:0.060437
[4]	test-error:0.058	train-error:0.046292
[5]	test-error:0.056	train-error:0.049721
[6]	test-error:0.057	train-error:0.046292
[7]	test-error:0.056	train-error:0.043292
[8]	test-error:0.052	train-error:0.044149
[9]	test-error:0.053	train-error:0.042435


**To track log loss instead, simply add it to the `params` dictionary under the `eval_metric` key.**

In [9]:
params['eval_metric'] = 'logloss'
xgb_model = xgb.train(params, dtrain, num_rounds, watchlist)

[0]	test-logloss:0.431523	train-logloss:0.426057
[1]	test-logloss:0.326082	train-logloss:0.319245
[2]	test-logloss:0.268074	train-logloss:0.261545
[3]	test-logloss:0.237364	train-logloss:0.232012
[4]	test-logloss:0.211304	train-logloss:0.203699
[5]	test-logloss:0.19646	train-logloss:0.190836
[6]	test-logloss:0.189223	train-logloss:0.180746
[7]	test-logloss:0.183324	train-logloss:0.174471
[8]	test-logloss:0.180598	train-logloss:0.16928
[9]	test-logloss:0.178481	train-logloss:0.165865


**Several metrics can be tracked at once.**

In [10]:
params['eval_metric'] = ['logloss', 'auc']
xgb_model = xgb.train(params, dtrain, num_rounds, watchlist)

[0]	test-logloss:0.431523	test-auc:0.831107	train-logloss:0.426057	train-auc:0.834741
[1]	test-logloss:0.326082	test-auc:0.897047	train-logloss:0.319245	train-auc:0.888422
[2]	test-logloss:0.268074	test-auc:0.902286	train-logloss:0.261545	train-auc:0.895616
[3]	test-logloss:0.237364	test-auc:0.912461	train-logloss:0.232012	train-auc:0.901253
[4]	test-logloss:0.211304	test-auc:0.919258	train-logloss:0.203699	train-auc:0.908693
[5]	test-logloss:0.19646	test-auc:0.921123	train-logloss:0.190836	train-auc:0.911472
[6]	test-logloss:0.189223	test-auc:0.922836	train-logloss:0.180746	train-auc:0.914537
[7]	test-logloss:0.183324	test-auc:0.924101	train-logloss:0.174471	train-auc:0.91686
[8]	test-logloss:0.180598	test-auc:0.934289	train-logloss:0.16928	train-auc:0.933808
[9]	test-logloss:0.178481	test-auc:0.934831	train-logloss:0.165865	train-auc:0.937508
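
For intuition about the `auc` column, AUC can be read as the probability that a randomly chosen positive object receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on hypothetical data. The quadratic pairwise loop is for illustration only and is not how a library would compute it.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one object of each class")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.7]

print(roc_auc(labels, scores))
```

Unlike `error`, this value does not depend on a classification threshold, only on the ordering of the scores.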


## Creating a custom evaluation metric

**To create a custom evaluation metric, it is enough to define a function that takes two arguments: a vector of predicted probabilities and a `DMatrix` object holding the true labels.  
In this example the function simply returns the number of objects the classifier got wrong, where an object is assigned to class 1 when its predicted probability of class 1 exceeds the threshold 0.5. 
We then pass this function to `xgb.train` (the `feval` parameter); if lower values of the metric are better, we additionally need to specify `maximize=False`.**


In [11]:
# custom evaluation metric
def misclassified(pred_probs, dmatrix):
    labels = dmatrix.get_label() # obtain true labels
    preds = pred_probs > 0.5 # obtain predicted values
    return 'misclassified', np.sum(labels != preds)
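
The same pattern extends to a configurable threshold. Below is a minimal sketch of a factory that returns a metric function with the `(pred_probs, dmatrix) -> (name, value)` shape described above. `make_misclassified` and `FakeDMatrix` are hypothetical names introduced for this example. `FakeDMatrix` only mimics `get_label()` so that the snippet runs without XGBoost or the dataset.

```python
class FakeDMatrix:
    """Hypothetical stand-in exposing only get_label(), so the sketch runs without XGBoost."""

    def __init__(self, labels):
        self._labels = list(labels)

    def get_label(self):
        return self._labels


def make_misclassified(threshold=0.5):
    """Build a metric function that counts errors at the given probability threshold."""

    def metric(pred_probs, dmatrix):
        labels = dmatrix.get_label()  # true labels
        wrong = sum(int((p > threshold) != bool(y)) for p, y in zip(pred_probs, labels))
        return "misclassified@{}".format(threshold), wrong

    return metric


strict = make_misclassified(threshold=0.7)
name, value = strict([0.9, 0.6, 0.3, 0.8], FakeDMatrix([1, 1, 0, 0]))
print(name, value)
```

With a real `DMatrix`, the returned function could be passed the same way as `misclassified`. For the plain error rate at a non-default threshold, XGBoost also provides the built-in `error@t` metric.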

In [12]:
xgb_model = xgb.train(params, dtrain, num_rounds, watchlist, feval=misclassified, maximize=False)

[0]	test-logloss:0.431523	test-auc:0.831107	train-logloss:0.426057	train-auc:0.834741	test-misclassified:100	train-misclassified:213
[1]	test-logloss:0.326082	test-auc:0.897047	train-logloss:0.319245	train-auc:0.888422	test-misclassified:90	train-misclassified:206
[2]	test-logloss:0.268074	test-auc:0.902286	train-logloss:0.261545	train-auc:0.895616	test-misclassified:73	train-misclassified:158
[3]	test-logloss:0.237364	test-auc:0.912461	train-logloss:0.232012	train-auc:0.901253	test-misclassified:67	train-misclassified:141
[4]	test-logloss:0.211304	test-auc:0.919258	train-logloss:0.203699	train-auc:0.908693	test-misclassified:58	train-misclassified:108
[5]	test-logloss:0.19646	test-auc:0.921123	train-logloss:0.190836	train-auc:0.911472	test-misclassified:56	train-misclassified:116
[6]	test-logloss:0.189223	test-auc:0.922836	train-logloss:0.180746	train-auc:0.914537	test-misclassified:57	train-misclassified:108
[7]	test-logloss:0.183324	test-auc:0.924101	train-logloss:0.174471	train-auc

**The `evals_result` parameter lets you save the metric values at every iteration.**

In [13]:
evals_result = {}
xgb_model = xgb.train(params, dtrain, num_rounds, watchlist, feval=misclassified, maximize=False, 
                      evals_result=evals_result)

[0]	test-logloss:0.431523	test-auc:0.831107	train-logloss:0.426057	train-auc:0.834741	test-misclassified:100	train-misclassified:213
[1]	test-logloss:0.326082	test-auc:0.897047	train-logloss:0.319245	train-auc:0.888422	test-misclassified:90	train-misclassified:206
[2]	test-logloss:0.268074	test-auc:0.902286	train-logloss:0.261545	train-auc:0.895616	test-misclassified:73	train-misclassified:158
[3]	test-logloss:0.237364	test-auc:0.912461	train-logloss:0.232012	train-auc:0.901253	test-misclassified:67	train-misclassified:141
[4]	test-logloss:0.211304	test-auc:0.919258	train-logloss:0.203699	train-auc:0.908693	test-misclassified:58	train-misclassified:108
[5]	test-logloss:0.19646	test-auc:0.921123	train-logloss:0.190836	train-auc:0.911472	test-misclassified:56	train-misclassified:116
[6]	test-logloss:0.189223	test-auc:0.922836	train-logloss:0.180746	train-auc:0.914537	test-misclassified:57	train-misclassified:108
[7]	test-logloss:0.183324	test-auc:0.924101	train-logloss:0.174471	train-auc

In [14]:
evals_result

{'test': {'auc': [0.831107,
   0.897047,
   0.902286,
   0.912461,
   0.919258,
   0.921123,
   0.922836,
   0.924101,
   0.934289,
   0.934831],
  'logloss': [0.431523,
   0.326082,
   0.268074,
   0.237364,
   0.211304,
   0.19646,
   0.189223,
   0.183324,
   0.180598,
   0.178481],
  'misclassified': [100.0,
   90.0,
   73.0,
   67.0,
   58.0,
   56.0,
   57.0,
   56.0,
   52.0,
   53.0]},
 'train': {'auc': [0.834741,
   0.888422,
   0.895616,
   0.901253,
   0.908693,
   0.911472,
   0.914537,
   0.91686,
   0.933808,
   0.937508],
  'logloss': [0.426057,
   0.319245,
   0.261545,
   0.232012,
   0.203699,
   0.190836,
   0.180746,
   0.174471,
   0.16928,
   0.165865],
  'misclassified': [213.0,
   206.0,
   158.0,
   141.0,
   108.0,
   116.0,
   108.0,
   101.0,
   103.0,
   99.0]}}
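
Once the history is stored in a dictionary, it is easy to post-process. The sketch below, using the test-set values copied from the output above, finds the best round for each metric and converts the error counts into error rates. The helper `best_round` is a hypothetical name, and the test set size of 1000 is an assumption suggested by the logged pair of 100 errors and an error rate of 0.1 in the first round.

```python
# Test-set values copied from the evals_result output above.
evals_result = {
    "test": {
        "logloss": [0.431523, 0.326082, 0.268074, 0.237364, 0.211304,
                    0.19646, 0.189223, 0.183324, 0.180598, 0.178481],
        "misclassified": [100.0, 90.0, 73.0, 67.0, 58.0,
                          56.0, 57.0, 56.0, 52.0, 53.0],
    }
}


def best_round(history, minimize=True):
    """Return (round index, value) of the best entry; the earliest round wins ties."""
    pick = min if minimize else max
    best_value = pick(history)
    return history.index(best_value), best_value


for metric_name, history in evals_result["test"].items():
    print(metric_name, best_round(history))

# Assumption: 1000 test objects (100 errors correspond to an error rate of 0.1).
n_test = 1000
error_rates = [count / n_test for count in evals_result["test"]["misclassified"]]
print(error_rates[:3])
```

Here the two metrics disagree on the best round: log loss keeps improving through the last round, while the error count is lowest one round earlier.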

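## Early stopping
**Early stopping halts training if the error has not decreased for several iterations. Note that XGBoost monitors the last entry of the watchlist, so with the watchlist defined above the stopping criterion is the training error, as the log below confirms. To stop on the validation error, the validation set should be listed last.**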

In [15]:
params['eval_metric'] = 'error'
num_rounds = 1500

xgb_model = xgb.train(params, dtrain, num_rounds, watchlist, early_stopping_rounds=10)

[0]	test-error:0.1	train-error:0.091299
Multiple eval metrics have been passed: 'train-error' will be used for early stopping.

Will train until train-error hasn't improved in 10 rounds.
[1]	test-error:0.09	train-error:0.088298
[2]	test-error:0.073	train-error:0.067724
[3]	test-error:0.067	train-error:0.060437
[4]	test-error:0.058	train-error:0.046292
[5]	test-error:0.056	train-error:0.049721
[6]	test-error:0.057	train-error:0.046292
[7]	test-error:0.056	train-error:0.043292
[8]	test-error:0.052	train-error:0.044149
[9]	test-error:0.053	train-error:0.042435
[10]	test-error:0.057	train-error:0.042006
[11]	test-error:0.055	train-error:0.041577
[12]	test-error:0.052	train-error:0.040291
[13]	test-error:0.054	train-error:0.039434
[14]	test-error:0.051	train-error:0.039863
[15]	test-error:0.054	train-error:0.039434
[16]	test-error:0.053	train-error:0.035148
[17]	test-error:0.053	train-error:0.035577
[18]	test-error:0.052	train-error:0.033862
[19]	test-error:0.055	train-error:0.032576
[20]	t

In [16]:
print("Booster best train score: {}".format(xgb_model.best_score))
print("Booster best iteration: {}".format(xgb_model.best_iteration))

Booster best train score: 0.000857
Booster best iteration: 122
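
The stopping rule itself can be sketched independently of XGBoost. The run above monitors the training error, which keeps decreasing for a long time. The snippet below applies a simplified patience rule to the `test-error` values of rounds 0 to 19 copied from the log above, to show what monitoring the validation error might look like. The function `early_stopping` is a hypothetical helper, and details such as requiring a strict improvement are assumptions that may differ from XGBoost's exact behaviour.

```python
def early_stopping(scores, patience, minimize=True):
    """Return (best_round, stop_round); stop_round is None if training was never cut short."""
    best_round, best_score = 0, scores[0]
    for current_round, score in enumerate(scores[1:], start=1):
        improved = score < best_score if minimize else score > best_score
        if improved:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, current_round
    return best_round, None


# test-error values for rounds 0..19, copied from the training log above.
test_error = [0.1, 0.09, 0.073, 0.067, 0.058, 0.056, 0.057, 0.056, 0.052, 0.053,
              0.057, 0.055, 0.052, 0.054, 0.051, 0.054, 0.053, 0.053, 0.052, 0.055]

print(early_stopping(test_error, patience=5))
print(early_stopping(test_error, patience=10))
```

With a patience of 5 the rule stops at round 13 and keeps round 8 as the best. With a patience of 10 it is not triggered within these 20 rounds, and round 14 is the best so far.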


## Cross-validation with XGBoost
**Let's demonstrate the `xgboost.cv` function.**

In [17]:
num_rounds = 10
hist = xgb.cv(params, dtrain, num_rounds, nfold=10, metrics={'error'}, seed=42)
hist

| | test-error-mean | test-error-std | train-error-mean | train-error-std |
|---|---|---|---|---|
| 0 | 0.101288 | 0.015851 | 0.094897 | 0.006412 |
| 1 | 0.095708 | 0.015119 | 0.087887 | 0.004854 |
| 2 | 0.084549 | 0.010345 | 0.070577 | 0.007276 |
| 3 | 0.065236 | 0.008098 | 0.054411 | 0.005094 |
| 4 | 0.057082 | 0.008378 | 0.047449 | 0.004794 |
| 5 | 0.057081 | 0.010866 | 0.046829 | 0.003861 |
| 6 | 0.055365 | 0.012357 | 0.043014 | 0.002255 |
| 7 | 0.055794 | 0.011355 | 0.041488 | 0.002909 |
| 8 | 0.054077 | 0.011548 | 0.039962 | 0.002812 |
| 9 | 0.051931 | 0.014292 | 0.039199 | 0.001905 |


Notes:

- the output is a DataFrame by default (this can be changed with the `as_pandas` parameter),
- metrics are passed as a parameter (several can be given),
- custom metrics can be used as well (the `feval` and `maximize` parameters),
- early stopping is also available (`early_stopping_rounds`).
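
To clarify what the `-mean` and `-std` columns of the cross-validation output summarize, here is a rough pure-Python sketch: the objects are dealt into folds, a model would be evaluated on each fold in turn, and the per-fold errors are aggregated for every boosting round. The helper names and the per-fold errors are hypothetical. Using the population standard deviation is an assumption, and stratification and other details of `xgb.cv` are not modelled.

```python
import random
import statistics


def kfold_indices(n_objects, n_folds, seed=42):
    """Shuffle the indices and deal them into n_folds nearly equal validation folds."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_folds] for fold in range(n_folds)]


def summarize(fold_errors):
    """Mean and population standard deviation of the per-fold errors."""
    return statistics.mean(fold_errors), statistics.pstdev(fold_errors)


folds = kfold_indices(n_objects=23, n_folds=5)
print([len(fold) for fold in folds])

# Hypothetical per-fold validation errors for a single boosting round.
fold_errors = [0.05, 0.06, 0.04, 0.07, 0.03]
mean_error, std_error = summarize(fold_errors)
print(round(mean_error, 4), round(std_error, 4))
```

A large standard deviation relative to the differences between rounds, as in the table above, suggests that small changes in the mean test error should be interpreted with care.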