<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course. Session No. 3
Author: Yury Kashnitsky, research programmer at Mail.ru Group and senior lecturer at the HSE Faculty of Computer Science. This material is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. It may be used for any purpose (edited, corrected, or used as a basis) except commercial ones, provided the author is credited.

# <center>Homework No. 10
## <center> Predicting flight departure delays

Your task is to beat at least 2 benchmarks in the [competition](https://www.kaggle.com/c/flight-delays-spring-2018) on Kaggle Inclass. There are no detailed instructions, only a brief outline of how the second benchmark was obtained with Xgboost. Hopefully, at this stage of the course a quick glance at the data is enough to see that this is the kind of task where gradient boosting does well. Most likely Xgboost, but there are also quite a few categorical features here...

<img src='../../img/xgboost_meme.jpg' width=40% />

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import roc_auc_score



In [2]:
train = pd.read_csv('../../data/flight_delays_train.csv')
test = pd.read_csv('../../data/flight_delays_test.csv')

In [3]:
train.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [4]:
test.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


So, given the departure time, the carrier code, the origin and destination airports, and the distance between them, we need to predict whether the departure is delayed by more than 15 minutes. As the simplest benchmark, take logistic regression and the two features that are easiest to use: `DepTime` and `Distance`. This model scores 0.68202 on the LB.

In [5]:
X_train, y_train = train[['Distance', 'DepTime']].values, train['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_test = test[['Distance', 'DepTime']].values

X_train_part, X_valid, y_train_part, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=17)

scaler = StandardScaler()
X_train_part = scaler.fit_transform(X_train_part)
X_valid = scaler.transform(X_valid)

In [6]:
logit = LogisticRegression()

logit.fit(X_train_part, y_train_part)
logit_valid_pred = logit.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, logit_valid_pred)

0.67956914653526068

In [7]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logit.fit(X_train_scaled, y_train)
logit_test_pred = logit.predict_proba(X_test_scaled)[:, 1]

pd.Series(logit_test_pred, name='dep_delayed_15min').to_csv('logit_2feat.csv', index_label='id', header=True)
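
The cell above writes the submission with `to_csv(..., index_label='id', header=True)`. A minimal sketch of what that call produces, using made-up probabilities in place of `logit_test_pred` and an in-memory buffer instead of a file (both are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical probabilities standing in for logit_test_pred
pred = pd.Series([0.1, 0.9, 0.25], name='dep_delayed_15min')

buf = io.StringIO()
# index_label names the index column; header=True writes the column names
pred.to_csv(buf, index_label='id', header=True)
lines = buf.getvalue().splitlines()
print(lines)
```

The first line is the `id,dep_delayed_15min` header, followed by one `id,probability` row per test object.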

The second benchmark shown on the competition leaderboard was obtained as follows:
- The `Distance` and `DepTime` features were used unchanged
- A "route" feature was created from the original `Origin` and `Dest`
- The `Month`, `DayofMonth`, `DayOfWeek`, `UniqueCarrier` and "route" features were one-hot encoded (`LabelBinarizer`)
- A holdout set was split off
- Logistic regression and gradient boosting (xgboost) were trained. The boosting hyperparameters were tuned with cross-validation: first those that control model complexity, then the number of trees was fixed at 500 and the learning rate was tuned
- Using `cross_val_predict`, cross-validated predictions of both models were made (predicted probabilities specifically), and a linear blend of the logistic regression and gradient boosting outputs was tuned, of the form $w_1 * p_{logit} + (1 - w_1) * p_{xgb}$, where $p_{logit}$ is the class 1 probabilities predicted by logistic regression and $p_{xgb}$ is the same for xgboost. The weight $w_1$ was chosen by hand.
- The answer for the test set was the same combination of the two models' outputs, but with both models trained on the entire training set.
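
The blending step from the list above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier`, and the weight is scanned over a small grid rather than chosen by hand as in the benchmark.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the flight-delay features (assumption, not the real data)
X, y = make_classification(n_samples=600, n_features=10, random_state=17)

logit = LogisticRegression(max_iter=1000)
# GradientBoostingClassifier stands in for XGBClassifier here
gbm = GradientBoostingClassifier(n_estimators=50, random_state=17)

# Out-of-fold predicted probabilities of class 1 for both models
p_logit = cross_val_predict(logit, X, y, cv=3, method='predict_proba')[:, 1]
p_gbm = cross_val_predict(gbm, X, y, cv=3, method='predict_proba')[:, 1]

# Scan w_1 over a grid; the benchmark itself picked the weight manually
weights = np.linspace(0, 1, 11)
scores = [roc_auc_score(y, w * p_logit + (1 - w) * p_gbm) for w in weights]
best_w = float(weights[int(np.argmax(scores))])
best_auc = max(scores)
print(best_w, round(best_auc, 4))
```

Because the grid contains both 0 and 1, the best blend is never worse on the out-of-fold predictions than either model alone. Whether it helps on the leaderboard is a separate question.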

The plan above is not binding; it is simply how the author of the assignment obtained the solution. You may prefer not to follow it and instead add, say, a couple of good features and train a forest of a thousand trees.

Good luck!

## Try 1

In [8]:
y_train = train['dep_delayed_15min'].apply(lambda x: 0 if x == 'N' else 1)
train['path'] = train['Origin'] + train['Dest']
test['path'] = test['Origin'] + test['Dest']
train.drop(['Origin', 'Dest', 'dep_delayed_15min'], axis=1, inplace=True)
test.drop(['Origin', 'Dest'], axis=1, inplace=True)

cat_cols = ['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'path']
dummy_train = pd.get_dummies(train[cat_cols])
dummy_new = pd.get_dummies(test[cat_cols])
dummy_new = dummy_new.reindex(columns = dummy_train.columns, fill_value=0)
train = pd.concat([train[['DepTime', 'Distance']], dummy_train], axis=1)
test = pd.concat([test[['DepTime', 'Distance']], dummy_new], axis=1)

train['DepHour'] = train['DepTime'] // 100
test['DepHour'] = test['DepTime'] // 100
train.drop('DepTime', axis=1, inplace=True)
test.drop('DepTime', axis=1, inplace=True)

from scipy import sparse
train_sp = sparse.csr_matrix(train)
test_sp = sparse.csr_matrix(test)
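
The `reindex` call in the cell above is what keeps the train and test dummy columns aligned. A toy sketch of that behaviour, with made-up carrier codes (the `toy_*` and `d_*` names are hypothetical):

```python
import pandas as pd

# Toy frames: 'DL' appears only in train, 'ZZ' only in test (hypothetical values)
toy_train = pd.DataFrame({'UniqueCarrier': ['AA', 'DL', 'AA']})
toy_test = pd.DataFrame({'UniqueCarrier': ['AA', 'ZZ']})

d_train = pd.get_dummies(toy_train)
d_test = pd.get_dummies(toy_test)

# Align test columns to train: columns missing in test are filled with 0,
# columns unseen in train are dropped
d_test = d_test.reindex(columns=d_train.columns, fill_value=0)
print(list(d_test.columns))
```

A test row whose category never occurred in train (here `ZZ`) ends up as an all-zero row for that feature.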

In [9]:
train.head()

Unnamed: 0,Distance,Month_c-1,Month_c-10,Month_c-11,Month_c-12,Month_c-2,Month_c-3,Month_c-4,Month_c-5,Month_c-6,...,path_XNALAX,path_XNALGA,path_XNAORD,path_XNASLC,path_YAKCDV,path_YAKJNU,path_YUMIPL,path_YUMLAX,path_YUMPHX,DepHour
0,732,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,19
1,834,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,15
2,416,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,14
3,872,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,10
4,423,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,18


In [10]:
xgbc = XGBClassifier(max_depth=6, n_estimators=500, silent=False)
xgbc.fit(train_sp, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=6,
       min_child_weight=1, missing=None, n_estimators=500, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=False, subsample=1)

In [12]:
answer = xgbc.predict_proba(test_sp)
sub = pd.read_csv('sample_submission.csv')
sub['dep_delayed_15min'] = answer[:, 1]
sub.to_csv('subm1.csv', index=None)

## Try 2

In [23]:
train = pd.read_csv('../../data/flight_delays_train.csv')
test = pd.read_csv('../../data/flight_delays_test.csv')

In [24]:
train.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [25]:
test.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


In [26]:
train['Route'] = train.apply(lambda x: x['Origin'] + '-' + x['Dest'], axis=1)
train.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min,Route
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N,ATL-DFW
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N,PIT-MCO
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N,RDU-CLE
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N,DEN-MEM
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y,MDW-OMA


In [27]:
test['Route'] = test.apply(lambda x: x['Origin'] + '-' + x['Dest'], axis=1)
test.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,Route
0,c-7,c-25,c-3,615,YV,MRY,PHX,598,MRY-PHX
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235,LAS-HOU
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577,GSP-ORD
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377,BWI-MHT
4,c-6,c-6,c-3,1505,UA,ORD,STL,258,ORD-STL


In [30]:
scaler = StandardScaler()
month_encoder = LabelBinarizer(sparse_output=True)
dom_encoder = LabelBinarizer(sparse_output=True)
dow_encoder = LabelBinarizer(sparse_output=True)
carier_encoder = LabelBinarizer(sparse_output=True)
route_encoder = LabelBinarizer(sparse_output=True)

In [31]:
from scipy.sparse import csr_matrix, hstack

X_train = csr_matrix(hstack([scaler.fit_transform(train[['Distance', 'DepTime']].values),
                            month_encoder.fit_transform(train[['Month']]),
                            dom_encoder.fit_transform(train[['DayofMonth']]),
                            dow_encoder.fit_transform(train[['DayOfWeek']]),
                            carier_encoder.fit_transform(train[['UniqueCarrier']]),
                            route_encoder.fit_transform(train[['Route']])]))
y_train = train['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values

In [32]:
X_test = csr_matrix(hstack([scaler.transform(test[['Distance', 'DepTime']].values),
                            month_encoder.transform(test[['Month']]),
                            dom_encoder.transform(test[['DayofMonth']]),
                            dow_encoder.transform(test[['DayOfWeek']]),
                            carier_encoder.transform(test[['UniqueCarrier']]),
                            route_encoder.transform(test[['Route']])]))
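
The two cells above build one sparse design matrix from scaled numeric columns and one-hot encoded categorical ones. A minimal sketch of the same idea on toy data (the values and the `distance`, `carrier`, `encoder` names are assumptions for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import LabelBinarizer

# Toy data (hypothetical): one numeric column and one categorical column
distance = np.array([[732.0], [834.0], [416.0], [872.0]])
carrier = np.array(['AA', 'US', 'XE', 'AA'])

encoder = LabelBinarizer(sparse_output=True)
carrier_ohe = encoder.fit_transform(carrier)  # sparse, one column per class

# Wrap the dense block in a sparse matrix so every block is sparse
X = hstack([csr_matrix(distance), carrier_ohe]).tocsr()
print(X.shape, list(encoder.classes_))
```

Note that `LabelBinarizer` emits a single column when a feature has only two classes, so the number of encoded columns does not always equal the number of categories.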

In [33]:
X_train_part, X_valid, y_train_part, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=17)

In [35]:
xgb = XGBClassifier(learning_rate=0.1,
                    n_estimators=500,
                    max_depth=5,
                    min_child_weight=2,
                    seed=17,
                    silent = True)
xgb.fit(X_train_part, y_train_part, early_stopping_rounds=10, eval_metric="auc", eval_set=[(X_valid, y_valid)])

[0]	validation_0-auc:0.69023
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.693579
[2]	validation_0-auc:0.69744
[3]	validation_0-auc:0.698552
[4]	validation_0-auc:0.699263
[5]	validation_0-auc:0.699259
[6]	validation_0-auc:0.699734
[7]	validation_0-auc:0.70017
[8]	validation_0-auc:0.700813
[9]	validation_0-auc:0.701183
[10]	validation_0-auc:0.701343
[11]	validation_0-auc:0.701917
[12]	validation_0-auc:0.702324
[13]	validation_0-auc:0.702686
[14]	validation_0-auc:0.703288
[15]	validation_0-auc:0.703522
[16]	validation_0-auc:0.703821
[17]	validation_0-auc:0.704155
[18]	validation_0-auc:0.704436
[19]	validation_0-auc:0.704748
[20]	validation_0-auc:0.704844
[21]	validation_0-auc:0.705316
[22]	validation_0-auc:0.705659
[23]	validation_0-auc:0.706068
[24]	validation_0-auc:0.706369
[25]	validation_0-auc:0.70667
[26]	validation_0-auc:0.706893
[27]	validation_0-auc:0.707193
[28]	validation_0-auc:0.7075
[29]	validation_0-auc:0.707813
[30]	validation_0-auc:

[259]	validation_0-auc:0.720835
[260]	validation_0-auc:0.720821
[261]	validation_0-auc:0.720838
[262]	validation_0-auc:0.72083
[263]	validation_0-auc:0.720825
[264]	validation_0-auc:0.720982
[265]	validation_0-auc:0.720975
[266]	validation_0-auc:0.720985
[267]	validation_0-auc:0.720982
[268]	validation_0-auc:0.721008
[269]	validation_0-auc:0.721016
[270]	validation_0-auc:0.721152
[271]	validation_0-auc:0.721165
[272]	validation_0-auc:0.721204
[273]	validation_0-auc:0.721199
[274]	validation_0-auc:0.721195
[275]	validation_0-auc:0.721239
[276]	validation_0-auc:0.721251
[277]	validation_0-auc:0.721295
[278]	validation_0-auc:0.721258
[279]	validation_0-auc:0.721244
[280]	validation_0-auc:0.721274
[281]	validation_0-auc:0.721286
[282]	validation_0-auc:0.721325
[283]	validation_0-auc:0.72131
[284]	validation_0-auc:0.721294
[285]	validation_0-auc:0.721314
[286]	validation_0-auc:0.721325
[287]	validation_0-auc:0.72132
[288]	validation_0-auc:0.721338
[289]	validation_0-auc:0.721397
[290]	valid

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=5,
       min_child_weight=2, missing=None, n_estimators=500, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=17, silent=True, subsample=1)
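
The log above reports the validation AUC after every boosting round, and training was asked to stop once the metric had not improved for 10 rounds. A plain-Python sketch of such a stopping rule (this is not xgboost's implementation; `early_stopping_round` and the AUC values are hypothetical):

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a metric that should increase.

    Training stops at the first round that is `patience` rounds past the
    best one; if that never happens, stop_round is the last round.
    """
    best_round, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(scores) - 1

# Hypothetical validation AUC values, not taken from the log above
aucs = [0.690, 0.693, 0.697, 0.696, 0.695, 0.6969]
print(early_stopping_round(aucs, patience=3))
```

With `patience=3` the best round is 2 and training halts at round 5, three rounds after the last improvement.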

In [39]:
%%time
from sklearn.model_selection import GridSearchCV

searcher = GridSearchCV(xgb,
                   {'max_depth': [4,5,6],
                    'min_child_weight': [4,5,6]}, verbose=1, n_jobs=1, scoring='roc_auc', cv=3)
searcher.fit(X_train, y_train)
print(searcher.best_score_)
print(searcher.best_params_)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  8.7min finished


0.734050802052
{'max_depth': 6, 'min_child_weight': 6}
Wall time: 9min 5s


In [41]:
xgb = XGBClassifier(learning_rate=0.1,
                    n_estimators=500,
                    max_depth=6,
                    min_child_weight=6,
                    seed=17,
                    silent = True)
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=6,
       min_child_weight=6, missing=None, n_estimators=500, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=17, silent=True, subsample=1)

In [43]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [44]:
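# Caution: xgb was fit on all of X_train, which contains X_valid,
# so the scores below are optimistic, not a true holdout estimate.
print(accuracy_score(y_valid, xgb.predict(X_valid)))
print(roc_auc_score(y_valid, xgb.predict_proba(X_valid)[:, 1]))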

0.825866666667
0.769922407351
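
The model scored above was fit on the full `X_train`, of which `X_valid` is a part, so the validation rows were seen during training. That may be why this AUC is higher than the cross-validated 0.734 from the grid search. A small sketch of the effect on pure noise, where a decision tree stands in for the boosting model (synthetic data, an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(17)
# Pure noise: features carry no information about the labels
X = rng.normal(size=(1000, 5))
y = rng.randint(0, 2, size=1000)

X_part, X_val, y_part, y_val = train_test_split(X, y, test_size=0.3, random_state=17)

# Fit on everything (validation rows included) vs. on the training part only
leaky = DecisionTreeClassifier(random_state=17).fit(X, y)
honest = DecisionTreeClassifier(random_state=17).fit(X_part, y_part)

leaky_auc = roc_auc_score(y_val, leaky.predict_proba(X_val)[:, 1])
honest_auc = roc_auc_score(y_val, honest.predict_proba(X_val)[:, 1])
print(round(leaky_auc, 3), round(honest_auc, 3))
```

The leaky model looks near-perfect on rows it has memorized, while the honest one stays close to chance, which is the correct answer for noise.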


In [45]:
xgb = XGBClassifier(learning_rate=0.01,
                    n_estimators=5000,
                    max_depth=6,
                    min_child_weight=6,
                    seed=17,
                    silent = True)
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.01, max_delta_step=0, max_depth=6,
       min_child_weight=6, missing=None, n_estimators=5000, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=17, silent=True, subsample=1)

In [47]:
print(accuracy_score(y_valid, xgb.predict(X_valid)))
print(roc_auc_score(y_valid, xgb.predict_proba(X_valid)[:, 1]))

0.825566666667
0.766154357271


In [48]:
xgb_test_pred = xgb.predict_proba(X_test)[:, 1]
pd.Series(xgb_test_pred, name='dep_delayed_15min').to_csv('try2.csv', index_label='id', header=True)

## Try 3

In [70]:
new_train = pd.read_csv('../../data/flight_delays_train.csv')
new_test = pd.read_csv('../../data/flight_delays_test.csv')

In [71]:
from sklearn.model_selection import train_test_split

In [72]:
X_tr, X_val, y_tr, y_val = train_test_split(train_sp, y_train, test_size=0.33)

In [73]:
from sklearn.model_selection import GridSearchCV

In [74]:
import xgboost as xgb
from sklearn import metrics
def modelfit(alg, dtrain, target, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain, label=target)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(dtrain, target, eval_metric='auc')
    
    dtrain_predictions = alg.predict(dtrain)
    dtrain_predprob = alg.predict_proba(dtrain)[:,1]
        
    print ("\nModel Report")
    print ("Accuracy : %.4g" % metrics.accuracy_score(target, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(target, dtrain_predprob))
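
`modelfit` above uses `xgb.cv` with early stopping to choose `n_estimators` before the final fit. A related but simpler idea is sketched below: pick the number of trees from a single validation split. This is a swapped-in technique, not the same procedure. It assumes synthetic data and uses scikit-learn's `GradientBoostingClassifier` with `staged_predict_proba` in place of xgboost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=27)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=27)

gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, subsample=0.8, random_state=27)
gbm.fit(X_tr, y_tr)

# Validation AUC after each boosting stage
val_auc = [roc_auc_score(y_val, proba[:, 1])
           for proba in gbm.staged_predict_proba(X_val)]
best_n = int(np.argmax(val_auc)) + 1  # stage index 0 corresponds to 1 tree
print(best_n, round(max(val_auc), 4))
```

Note that `modelfit` reports accuracy and AUC on the training data, so those numbers say little about generalization. A validation curve like `val_auc` is the more informative signal.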

In [75]:
xgb1 = XGBClassifier(
    learning_rate =0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective= 'binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train_sp, y_train)


Model Report
Accuracy : 0.8268
AUC Score (Train): 0.784399


In [76]:
pred = xgb1.predict_proba(test_sp)[:, 1]

In [78]:
sub = pd.read_csv('sample_submission.csv')
sub['dep_delayed_15min'] = pred
sub.to_csv('try3.csv', index=None)

## Try 4

In [2]:
train_df = pd.read_csv('../../data/flight_delays_train.csv')
test_df = pd.read_csv('../../data/flight_delays_test.csv')

In [3]:
train_df['flight'] = train_df['Origin'] + '-->' + train_df['Dest']
test_df['flight'] = test_df['Origin'] + '-->' + test_df['Dest']

In [4]:
categ_feat_idx = np.where(train_df.drop('dep_delayed_15min', axis=1).dtypes == 'object')[0]
categ_feat_idx

array([0, 1, 2, 4, 5, 6, 8], dtype=int64)
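
The indices above are the positions of the object-typed columns, which are later passed to CatBoost as `cat_features`. A toy sketch of the same lookup (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical); string columns are created with an explicit
# object dtype so the check does not depend on pandas' string-inference defaults
toy = pd.DataFrame({
    'Month': pd.Series(['c-8', 'c-4'], dtype=object),
    'DepTime': [1934, 1548],
    'UniqueCarrier': pd.Series(['AA', 'US'], dtype=object),
    'Distance': [732, 834],
})

# Positions of the object-typed (categorical) columns
cat_idx = np.where(toy.dtypes == object)[0]
print(cat_idx.tolist())
```

The positions refer to column order, so they stay valid only if the array passed to `fit` keeps the same column layout.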

In [5]:
X_train = train_df.drop('dep_delayed_15min', axis=1).values
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_test = test_df.values

In [6]:
X_train_part, X_valid, y_train_part, y_valid = train_test_split(X_train, y_train, 
                                                                test_size=0.3, 
                                                                random_state=17)

In [8]:
from catboost import CatBoostClassifier

In [9]:
ctb = CatBoostClassifier(random_seed=17)

In [10]:
%%time
ctb.fit(X_train_part, y_train_part,
        cat_features=categ_feat_idx)

0:	learn: 0.6760207	total: 337ms	remaining: 5m 36s
1:	learn: 0.6604472	total: 562ms	remaining: 4m 40s
2:	learn: 0.6455956	total: 855ms	remaining: 4m 44s
3:	learn: 0.6320420	total: 1.1s	remaining: 4m 33s
4:	learn: 0.6197660	total: 1.4s	remaining: 4m 39s
5:	learn: 0.6082641	total: 1.69s	remaining: 4m 40s
6:	learn: 0.5975882	total: 1.92s	remaining: 4m 32s
7:	learn: 0.5875524	total: 2.19s	remaining: 4m 31s
8:	learn: 0.5784343	total: 2.45s	remaining: 4m 29s
9:	learn: 0.5694783	total: 2.7s	remaining: 4m 27s
10:	learn: 0.5614182	total: 2.96s	remaining: 4m 25s
11:	learn: 0.5537577	total: 3.27s	remaining: 4m 29s
12:	learn: 0.5467238	total: 3.57s	remaining: 4m 31s
13:	learn: 0.5398879	total: 3.83s	remaining: 4m 29s
14:	learn: 0.5336101	total: 4.15s	remaining: 4m 32s
15:	learn: 0.5276577	total: 4.38s	remaining: 4m 29s
16:	learn: 0.5224512	total: 4.63s	remaining: 4m 27s
17:	learn: 0.5173353	total: 4.87s	remaining: 4m 25s
18:	learn: 0.5128165	total: 5.14s	remaining: 4m 25s
19:	learn: 0.5082362	tota

158:	learn: 0.4255965	total: 46.2s	remaining: 4m 4s
159:	learn: 0.4254997	total: 46.5s	remaining: 4m 4s
160:	learn: 0.4254405	total: 46.8s	remaining: 4m 3s
161:	learn: 0.4253790	total: 47s	remaining: 4m 3s
162:	learn: 0.4252815	total: 47.3s	remaining: 4m 2s
163:	learn: 0.4251821	total: 47.6s	remaining: 4m 2s
164:	learn: 0.4251357	total: 47.9s	remaining: 4m 2s
165:	learn: 0.4250812	total: 48.2s	remaining: 4m 2s
166:	learn: 0.4249976	total: 48.5s	remaining: 4m 1s
167:	learn: 0.4249535	total: 48.7s	remaining: 4m 1s
168:	learn: 0.4248721	total: 49s	remaining: 4m 1s
169:	learn: 0.4248383	total: 49.3s	remaining: 4m
170:	learn: 0.4247541	total: 49.6s	remaining: 4m
171:	learn: 0.4246577	total: 49.8s	remaining: 3m 59s
172:	learn: 0.4245672	total: 50.2s	remaining: 3m 59s
173:	learn: 0.4245131	total: 50.5s	remaining: 3m 59s
174:	learn: 0.4244301	total: 50.8s	remaining: 3m 59s
175:	learn: 0.4243746	total: 51.2s	remaining: 3m 59s
176:	learn: 0.4243293	total: 51.5s	remaining: 3m 59s
177:	learn: 0.42

313:	learn: 0.4177207	total: 1m 28s	remaining: 3m 13s
314:	learn: 0.4176794	total: 1m 28s	remaining: 3m 12s
315:	learn: 0.4176494	total: 1m 29s	remaining: 3m 12s
316:	learn: 0.4176338	total: 1m 29s	remaining: 3m 12s
317:	learn: 0.4176134	total: 1m 29s	remaining: 3m 11s
318:	learn: 0.4175811	total: 1m 29s	remaining: 3m 11s
319:	learn: 0.4175218	total: 1m 30s	remaining: 3m 11s
320:	learn: 0.4174530	total: 1m 30s	remaining: 3m 11s
321:	learn: 0.4174142	total: 1m 30s	remaining: 3m 10s
322:	learn: 0.4173721	total: 1m 30s	remaining: 3m 10s
323:	learn: 0.4173525	total: 1m 31s	remaining: 3m 10s
324:	learn: 0.4173194	total: 1m 31s	remaining: 3m 9s
325:	learn: 0.4173090	total: 1m 31s	remaining: 3m 9s
326:	learn: 0.4172717	total: 1m 31s	remaining: 3m 9s
327:	learn: 0.4172506	total: 1m 32s	remaining: 3m 8s
328:	learn: 0.4172079	total: 1m 32s	remaining: 3m 8s
329:	learn: 0.4171841	total: 1m 32s	remaining: 3m 8s
330:	learn: 0.4171491	total: 1m 32s	remaining: 3m 7s
331:	learn: 0.4171178	total: 1m 33s

467:	learn: 0.4133199	total: 2m 18s	remaining: 2m 37s
468:	learn: 0.4132868	total: 2m 18s	remaining: 2m 36s
469:	learn: 0.4132512	total: 2m 18s	remaining: 2m 36s
470:	learn: 0.4132287	total: 2m 19s	remaining: 2m 36s
471:	learn: 0.4131964	total: 2m 19s	remaining: 2m 35s
472:	learn: 0.4131575	total: 2m 19s	remaining: 2m 35s
473:	learn: 0.4131277	total: 2m 20s	remaining: 2m 35s
474:	learn: 0.4131126	total: 2m 20s	remaining: 2m 35s
475:	learn: 0.4130967	total: 2m 20s	remaining: 2m 35s
476:	learn: 0.4130798	total: 2m 21s	remaining: 2m 34s
477:	learn: 0.4130555	total: 2m 21s	remaining: 2m 34s
478:	learn: 0.4130171	total: 2m 21s	remaining: 2m 34s
479:	learn: 0.4130016	total: 2m 21s	remaining: 2m 33s
480:	learn: 0.4129663	total: 2m 22s	remaining: 2m 33s
481:	learn: 0.4129528	total: 2m 22s	remaining: 2m 33s
482:	learn: 0.4129232	total: 2m 22s	remaining: 2m 32s
483:	learn: 0.4128987	total: 2m 22s	remaining: 2m 32s
484:	learn: 0.4128626	total: 2m 23s	remaining: 2m 31s
485:	learn: 0.4128466	total:

620:	learn: 0.4098826	total: 3m 4s	remaining: 1m 52s
621:	learn: 0.4098390	total: 3m 5s	remaining: 1m 52s
622:	learn: 0.4098119	total: 3m 6s	remaining: 1m 52s
623:	learn: 0.4097851	total: 3m 6s	remaining: 1m 52s
624:	learn: 0.4097807	total: 3m 7s	remaining: 1m 52s
625:	learn: 0.4097482	total: 3m 7s	remaining: 1m 52s
626:	learn: 0.4097361	total: 3m 8s	remaining: 1m 52s
627:	learn: 0.4096970	total: 3m 8s	remaining: 1m 51s
628:	learn: 0.4096682	total: 3m 9s	remaining: 1m 51s
629:	learn: 0.4096349	total: 3m 9s	remaining: 1m 51s
630:	learn: 0.4095938	total: 3m 10s	remaining: 1m 51s
631:	learn: 0.4095773	total: 3m 10s	remaining: 1m 51s
632:	learn: 0.4095649	total: 3m 11s	remaining: 1m 50s
633:	learn: 0.4095485	total: 3m 12s	remaining: 1m 50s
634:	learn: 0.4095390	total: 3m 12s	remaining: 1m 50s
635:	learn: 0.4095183	total: 3m 13s	remaining: 1m 50s
636:	learn: 0.4095065	total: 3m 13s	remaining: 1m 50s
637:	learn: 0.4094904	total: 3m 14s	remaining: 1m 50s
638:	learn: 0.4094790	total: 3m 14s	re

773:	learn: 0.4067897	total: 4m 3s	remaining: 1m 11s
774:	learn: 0.4067628	total: 4m 4s	remaining: 1m 10s
775:	learn: 0.4067443	total: 4m 4s	remaining: 1m 10s
776:	learn: 0.4067238	total: 4m 5s	remaining: 1m 10s
777:	learn: 0.4066984	total: 4m 6s	remaining: 1m 10s
778:	learn: 0.4066862	total: 4m 6s	remaining: 1m 9s
779:	learn: 0.4066765	total: 4m 7s	remaining: 1m 9s
780:	learn: 0.4066630	total: 4m 7s	remaining: 1m 9s
781:	learn: 0.4066540	total: 4m 7s	remaining: 1m 9s
782:	learn: 0.4066383	total: 4m 8s	remaining: 1m 8s
783:	learn: 0.4066063	total: 4m 9s	remaining: 1m 8s
784:	learn: 0.4065839	total: 4m 9s	remaining: 1m 8s
785:	learn: 0.4065734	total: 4m 9s	remaining: 1m 8s
786:	learn: 0.4065566	total: 4m 10s	remaining: 1m 7s
787:	learn: 0.4065390	total: 4m 10s	remaining: 1m 7s
788:	learn: 0.4065207	total: 4m 11s	remaining: 1m 7s
789:	learn: 0.4064996	total: 4m 11s	remaining: 1m 6s
790:	learn: 0.4064815	total: 4m 11s	remaining: 1m 6s
791:	learn: 0.4064625	total: 4m 12s	remaining: 1m 6s
7

929:	learn: 0.4039166	total: 4m 51s	remaining: 21.9s
930:	learn: 0.4038886	total: 4m 51s	remaining: 21.6s
931:	learn: 0.4038789	total: 4m 51s	remaining: 21.3s
932:	learn: 0.4038616	total: 4m 52s	remaining: 21s
933:	learn: 0.4038482	total: 4m 52s	remaining: 20.7s
934:	learn: 0.4038309	total: 4m 52s	remaining: 20.4s
935:	learn: 0.4038120	total: 4m 53s	remaining: 20.1s
936:	learn: 0.4038005	total: 4m 53s	remaining: 19.7s
937:	learn: 0.4037787	total: 4m 54s	remaining: 19.4s
938:	learn: 0.4037561	total: 4m 54s	remaining: 19.1s
939:	learn: 0.4037311	total: 4m 54s	remaining: 18.8s
940:	learn: 0.4037154	total: 4m 55s	remaining: 18.5s
941:	learn: 0.4037013	total: 4m 55s	remaining: 18.2s
942:	learn: 0.4036847	total: 4m 55s	remaining: 17.9s
943:	learn: 0.4036727	total: 4m 56s	remaining: 17.6s
944:	learn: 0.4036422	total: 4m 56s	remaining: 17.3s
945:	learn: 0.4036265	total: 4m 56s	remaining: 16.9s
946:	learn: 0.4036158	total: 4m 57s	remaining: 16.6s
947:	learn: 0.4035882	total: 4m 57s	remaining: 1

<catboost.core.CatBoostClassifier at 0x159197da1d0>

In [11]:
ctb_valid_pred = ctb.predict_proba(X_valid)[:, 1]

In [12]:
roc_auc_score(y_valid, ctb_valid_pred)

0.75379639269529775

In [13]:
%%time
ctb.fit(X_train, y_train,
        cat_features=categ_feat_idx)

0:	learn: 0.6759628	total: 477ms	remaining: 7m 56s
1:	learn: 0.6603862	total: 1.06s	remaining: 8m 51s
2:	learn: 0.6458010	total: 1.75s	remaining: 9m 42s
3:	learn: 0.6320782	total: 2.31s	remaining: 9m 34s
4:	learn: 0.6195228	total: 2.83s	remaining: 9m 23s
5:	learn: 0.6077608	total: 3.19s	remaining: 8m 48s
6:	learn: 0.5968867	total: 3.56s	remaining: 8m 24s
7:	learn: 0.5869267	total: 4.02s	remaining: 8m 18s
8:	learn: 0.5777886	total: 4.36s	remaining: 8m
9:	learn: 0.5687755	total: 4.88s	remaining: 8m 3s
10:	learn: 0.5603156	total: 5.56s	remaining: 8m 19s
11:	learn: 0.5524557	total: 6.22s	remaining: 8m 32s
12:	learn: 0.5452347	total: 6.67s	remaining: 8m 26s
13:	learn: 0.5383743	total: 7.01s	remaining: 8m 14s
14:	learn: 0.5318879	total: 7.5s	remaining: 8m 12s
15:	learn: 0.5265261	total: 8.23s	remaining: 8m 26s
16:	learn: 0.5212310	total: 8.94s	remaining: 8m 36s
17:	learn: 0.5163792	total: 9.55s	remaining: 8m 41s
18:	learn: 0.5116802	total: 10.1s	remaining: 8m 43s
19:	learn: 0.5075286	total: 

158:	learn: 0.4241676	total: 1m 12s	remaining: 6m 25s
159:	learn: 0.4240980	total: 1m 13s	remaining: 6m 26s
160:	learn: 0.4240219	total: 1m 13s	remaining: 6m 25s
161:	learn: 0.4239521	total: 1m 14s	remaining: 6m 24s
162:	learn: 0.4238176	total: 1m 14s	remaining: 6m 23s
163:	learn: 0.4237428	total: 1m 15s	remaining: 6m 22s
164:	learn: 0.4236928	total: 1m 15s	remaining: 6m 22s
165:	learn: 0.4236444	total: 1m 15s	remaining: 6m 21s
166:	learn: 0.4235479	total: 1m 16s	remaining: 6m 20s
167:	learn: 0.4234804	total: 1m 16s	remaining: 6m 19s
168:	learn: 0.4234320	total: 1m 17s	remaining: 6m 19s
169:	learn: 0.4233578	total: 1m 17s	remaining: 6m 18s
170:	learn: 0.4232882	total: 1m 18s	remaining: 6m 18s
171:	learn: 0.4231465	total: 1m 18s	remaining: 6m 17s
172:	learn: 0.4230904	total: 1m 18s	remaining: 6m 16s
173:	learn: 0.4230003	total: 1m 19s	remaining: 6m 16s
174:	learn: 0.4229363	total: 1m 19s	remaining: 6m 15s
175:	learn: 0.4228811	total: 1m 20s	remaining: 6m 14s
176:	learn: 0.4227970	total:

312:	learn: 0.4160801	total: 2m 18s	remaining: 5m 3s
313:	learn: 0.4160542	total: 2m 18s	remaining: 5m 3s
314:	learn: 0.4160299	total: 2m 19s	remaining: 5m 2s
315:	learn: 0.4160055	total: 2m 19s	remaining: 5m 2s
316:	learn: 0.4159735	total: 2m 20s	remaining: 5m 1s
317:	learn: 0.4159563	total: 2m 20s	remaining: 5m 1s
318:	learn: 0.4159259	total: 2m 20s	remaining: 5m
319:	learn: 0.4158851	total: 2m 21s	remaining: 5m
320:	learn: 0.4158483	total: 2m 21s	remaining: 4m 59s
321:	learn: 0.4158197	total: 2m 21s	remaining: 4m 58s
322:	learn: 0.4158055	total: 2m 22s	remaining: 4m 58s
323:	learn: 0.4157727	total: 2m 22s	remaining: 4m 57s
324:	learn: 0.4157400	total: 2m 23s	remaining: 4m 57s
325:	learn: 0.4156882	total: 2m 23s	remaining: 4m 56s
326:	learn: 0.4156664	total: 2m 23s	remaining: 4m 56s
327:	learn: 0.4156143	total: 2m 24s	remaining: 4m 55s
328:	learn: 0.4155985	total: 2m 24s	remaining: 4m 55s
329:	learn: 0.4155900	total: 2m 25s	remaining: 4m 54s
330:	learn: 0.4155565	total: 2m 25s	remain

466:	learn: 0.4116748	total: 3m 20s	remaining: 3m 49s
467:	learn: 0.4116513	total: 3m 21s	remaining: 3m 48s
468:	learn: 0.4115959	total: 3m 21s	remaining: 3m 48s
469:	learn: 0.4115762	total: 3m 22s	remaining: 3m 48s
470:	learn: 0.4115621	total: 3m 22s	remaining: 3m 47s
471:	learn: 0.4115470	total: 3m 23s	remaining: 3m 47s
472:	learn: 0.4115085	total: 3m 24s	remaining: 3m 47s
473:	learn: 0.4114942	total: 3m 24s	remaining: 3m 47s
474:	learn: 0.4114824	total: 3m 25s	remaining: 3m 46s
475:	learn: 0.4114596	total: 3m 25s	remaining: 3m 46s
476:	learn: 0.4114447	total: 3m 25s	remaining: 3m 45s
477:	learn: 0.4114331	total: 3m 26s	remaining: 3m 45s
478:	learn: 0.4114078	total: 3m 26s	remaining: 3m 44s
479:	learn: 0.4113856	total: 3m 27s	remaining: 3m 44s
480:	learn: 0.4113688	total: 3m 28s	remaining: 3m 44s
481:	learn: 0.4113550	total: 3m 29s	remaining: 3m 44s
482:	learn: 0.4113318	total: 3m 30s	remaining: 3m 45s
483:	learn: 0.4113104	total: 3m 30s	remaining: 3m 44s
484:	learn: 0.4112924	total:

619:	learn: 0.4085549	total: 4m 41s	remaining: 2m 52s
620:	learn: 0.4085263	total: 4m 41s	remaining: 2m 52s
621:	learn: 0.4085124	total: 4m 42s	remaining: 2m 51s
622:	learn: 0.4085030	total: 4m 42s	remaining: 2m 50s
623:	learn: 0.4084813	total: 4m 42s	remaining: 2m 50s
624:	learn: 0.4084622	total: 4m 43s	remaining: 2m 49s
625:	learn: 0.4084540	total: 4m 43s	remaining: 2m 49s
626:	learn: 0.4084312	total: 4m 44s	remaining: 2m 49s
627:	learn: 0.4084184	total: 4m 44s	remaining: 2m 48s
628:	learn: 0.4084031	total: 4m 45s	remaining: 2m 48s
629:	learn: 0.4083900	total: 4m 45s	remaining: 2m 47s
630:	learn: 0.4083796	total: 4m 45s	remaining: 2m 47s
631:	learn: 0.4083568	total: 4m 46s	remaining: 2m 46s
632:	learn: 0.4083454	total: 4m 46s	remaining: 2m 46s
633:	learn: 0.4083271	total: 4m 47s	remaining: 2m 45s
634:	learn: 0.4083043	total: 4m 47s	remaining: 2m 45s
635:	learn: 0.4082924	total: 4m 47s	remaining: 2m 44s
636:	learn: 0.4082860	total: 4m 48s	remaining: 2m 44s
637:	learn: 0.4082576	total:

772:	learn: 0.4059875	total: 5m 48s	remaining: 1m 42s
773:	learn: 0.4059621	total: 5m 49s	remaining: 1m 41s
774:	learn: 0.4059207	total: 5m 49s	remaining: 1m 41s
775:	learn: 0.4059050	total: 5m 50s	remaining: 1m 41s
776:	learn: 0.4058965	total: 5m 50s	remaining: 1m 40s
777:	learn: 0.4058846	total: 5m 51s	remaining: 1m 40s
778:	learn: 0.4058602	total: 5m 51s	remaining: 1m 39s
779:	learn: 0.4058483	total: 5m 52s	remaining: 1m 39s
780:	learn: 0.4058400	total: 5m 52s	remaining: 1m 38s
781:	learn: 0.4058096	total: 5m 52s	remaining: 1m 38s
782:	learn: 0.4057945	total: 5m 53s	remaining: 1m 37s
783:	learn: 0.4057710	total: 5m 53s	remaining: 1m 37s
784:	learn: 0.4057579	total: 5m 54s	remaining: 1m 37s
785:	learn: 0.4057148	total: 5m 54s	remaining: 1m 36s
786:	learn: 0.4056991	total: 5m 55s	remaining: 1m 36s
787:	learn: 0.4056823	total: 5m 55s	remaining: 1m 35s
788:	learn: 0.4056556	total: 5m 56s	remaining: 1m 35s
789:	learn: 0.4056470	total: 5m 56s	remaining: 1m 34s
790:	learn: 0.4056314	total:

927:	learn: 0.4035165	total: 7m	remaining: 32.6s
928:	learn: 0.4035028	total: 7m 1s	remaining: 32.2s
929:	learn: 0.4034959	total: 7m 1s	remaining: 31.7s
930:	learn: 0.4034732	total: 7m 1s	remaining: 31.3s
931:	learn: 0.4034647	total: 7m 2s	remaining: 30.8s
932:	learn: 0.4034567	total: 7m 2s	remaining: 30.4s
933:	learn: 0.4034368	total: 7m 3s	remaining: 29.9s
934:	learn: 0.4034068	total: 7m 3s	remaining: 29.5s
935:	learn: 0.4033978	total: 7m 4s	remaining: 29s
936:	learn: 0.4033856	total: 7m 4s	remaining: 28.5s
937:	learn: 0.4033794	total: 7m 4s	remaining: 28.1s
938:	learn: 0.4033754	total: 7m 5s	remaining: 27.6s
939:	learn: 0.4033540	total: 7m 5s	remaining: 27.2s
940:	learn: 0.4033310	total: 7m 6s	remaining: 26.7s
941:	learn: 0.4033161	total: 7m 6s	remaining: 26.3s
942:	learn: 0.4032989	total: 7m 6s	remaining: 25.8s
943:	learn: 0.4032877	total: 7m 7s	remaining: 25.3s
944:	learn: 0.4032740	total: 7m 7s	remaining: 24.9s
945:	learn: 0.4032562	total: 7m 8s	remaining: 24.4s
946:	learn: 0.403

<catboost.core.CatBoostClassifier at 0x159197da1d0>

In [14]:
ctb_test_pred = ctb.predict_proba(X_test)[:, 1]

In [15]:
sample_sub = pd.read_csv('sample_submission.csv', index_col='id')
sample_sub['dep_delayed_15min'] = ctb_test_pred
sample_sub.to_csv('try4.csv')

In [16]:
sample_sub.head()

id,dep_delayed_15min
0,0.040439
1,0.062803
2,0.041464
3,0.266344
4,0.290884
