<center>
<img src="../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist @ Mail.Ru Group <br>All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

# <center> Assignment #10 (demo)
## <center> Gradient boosting

Your task is to beat at least 2 benchmarks in this [Kaggle Inclass competition](https://www.kaggle.com/c/flight-delays-spring-2018). You won't be given detailed instructions here, only a brief description of how the second benchmark was achieved using Xgboost. Hopefully, at this stage of the course, a quick look at the data is enough to see that this is the kind of task where gradient boosting performs well. Most likely that will be Xgboost; however, there are plenty of categorical features here.

<img src='../img/xgboost_meme.jpg' width=40% />

In [1]:
import warnings

warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

In [2]:
train = pd.read_csv("../data/flight_delays_train.csv")
test = pd.read_csv("../data/flight_delays_test.csv")

In [3]:
train.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [4]:
test.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


Given the flight departure time, carrier code, departure airport, destination airport, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes. As the simplest benchmark, let's take an Xgboost classifier and the two features that are easiest to use: `DepTime` and `Distance`. Such a model scores 0.68202 on the leaderboard (LB).

In [5]:
X_train = train[["Distance", "DepTime"]].values
y_train = train["dep_delayed_15min"].map({"Y": 1, "N": 0}).values
X_test = test[["Distance", "DepTime"]].values

X_train_part, X_valid, y_train_part, y_valid = train_test_split(
    X_train, y_train, test_size=0.3, random_state=17
)

We'll train Xgboost with default parameters on part of the data and estimate ROC AUC on the holdout set.

In [6]:
# `random_state` is the scikit-learn-style name for the seed parameter
xgb_model = XGBClassifier(random_state=17)

xgb_model.fit(X_train_part, y_train_part)
xgb_valid_pred = xgb_model.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, xgb_valid_pred)



0.7001228657578435
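
For intuition about the holdout score above: ROC AUC can be read as the share of (delayed, on-time) flight pairs in which the delayed flight gets the higher predicted score, with ties counted as one half. The sketch below checks this reading against `roc_auc_score` on a few toy labels and scores. The toy values are made up for illustration, and the helper name `pairwise_auc` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Toy labels and predicted scores (illustrative only, not competition data)
y_toy = [0, 0, 1, 1, 0, 1]
p_toy = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc_pairs = pairwise_auc(y_toy, p_toy)
auc_sklearn = roc_auc_score(y_toy, p_toy)
print(auc_pairs, auc_sklearn)
```

In this toy case 8 of the 9 pairs are ordered correctly, so both numbers equal 8/9. The pairwise version is quadratic in the number of samples and is only meant as an explanation, not as a replacement for `roc_auc_score`.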

Now we do the same with the whole training set, make predictions for the test set, and form a submission file. This is how you beat the first benchmark.

In [7]:
xgb_model.fit(X_train, y_train)
xgb_test_pred = xgb_model.predict_proba(X_test)[:, 1]

pd.Series(xgb_test_pred, name="dep_delayed_15min").to_csv(
    "xgb_2feat.csv", index_label="id", header=True
)



The second benchmark in the leaderboard was achieved as follows:

- Features `Distance` and `DepTime` were taken unchanged
- A feature `Flight` was created from features `Origin` and `Dest`
- Features `Month`, `DayofMonth`, `DayOfWeek`, `UniqueCarrier` and `Flight` were transformed with OHE (`LabelBinarizer`)
- Logistic regression and gradient boosting (xgboost) were trained. Xgboost hyperparameters were tuned via cross-validation: first, the hyperparameters responsible for model complexity were optimized; then the number of trees was fixed at 500 and the learning rate was tuned.
- Out-of-fold predicted probabilities were obtained via cross-validation using `cross_val_predict`. A linear mixture of logistic regression and gradient boosting predictions was formed as $w_1 \cdot p_{logit} + (1 - w_1) \cdot p_{xgb}$, where $p_{logit}$ is the probability of class 1 predicted by logistic regression, and $p_{xgb}$ is the same for xgboost. The weight $w_1$ was selected manually.
- A similar combination of predictions was made for the test set.
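
The steps above could be sketched roughly as follows. This is not the author's actual code: it runs on a small synthetic table (made-up values with the same column names), uses `OneHotEncoder` in place of `LabelBinarizer`, scales the two numeric features for the linear model (an assumption of this sketch), uses 50 trees instead of 500 to stay fast, and picks $w_1$ from a small hand-written grid. If `xgboost` is not installed, it falls back to scikit-learn's `GradientBoostingClassifier`, which is a different implementation of gradient boosting.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import OneHotEncoder, StandardScaler

try:
    from xgboost import XGBClassifier as Booster
except ImportError:  # fallback so the sketch still runs without xgboost
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Synthetic stand-in for the training table (all values are made up).
rng = np.random.default_rng(17)
n = 400
toy = pd.DataFrame(
    {
        "Month": rng.choice([f"c-{m}" for m in range(1, 13)], n),
        "DayofMonth": rng.choice([f"c-{d}" for d in range(1, 29)], n),
        "DayOfWeek": rng.choice([f"c-{d}" for d in range(1, 8)], n),
        "DepTime": rng.integers(500, 2300, n),
        "UniqueCarrier": rng.choice(["AA", "US", "WN"], n),
        "Origin": rng.choice(["ATL", "PIT", "DEN"], n),
        "Dest": rng.choice(["DFW", "MCO", "ORD"], n),
        "Distance": rng.integers(100, 2500, n),
    }
)
# Made-up target: later departures are delayed more often.
p_delay = 0.1 + 0.4 * (toy["DepTime"].to_numpy() - 500) / 1800
y_toy = (rng.random(n) < p_delay).astype(int)

# `Flight` combines origin and destination into one categorical feature.
toy["Flight"] = toy["Origin"] + "-->" + toy["Dest"]

cat_cols = ["Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Flight"]
X_cat = OneHotEncoder().fit_transform(toy[cat_cols])
# Scaling is an addition of this sketch, meant to help the linear model converge.
X_num = StandardScaler().fit_transform(toy[["Distance", "DepTime"]].astype(float))
X_all = hstack([csr_matrix(X_num), X_cat]).tocsr()

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
logit = LogisticRegression(max_iter=1000, random_state=17)
booster = Booster(n_estimators=50, max_depth=3, learning_rate=0.1, random_state=17)

# Out-of-fold probabilities of class 1 for each model.
p_logit = cross_val_predict(logit, X_all, y_toy, cv=skf, method="predict_proba")[:, 1]
p_xgb = cross_val_predict(booster, X_all, y_toy, cv=skf, method="predict_proba")[:, 1]

# Try a few blend weights by hand and keep the best out-of-fold ROC AUC.
scores = {
    w1: roc_auc_score(y_toy, w1 * p_logit + (1 - w1) * p_xgb)
    for w1 in (0.0, 0.25, 0.5, 0.75, 1.0)
}
best_w1 = max(scores, key=scores.get)
p_blend = best_w1 * p_logit + (1 - best_w1) * p_xgb
print(scores, best_w1)
```

For the test set, the same two models would be refit on the full training data and their test probabilities mixed with the chosen weight. The encoder and scaler are fit on the whole table before cross-validation here for brevity; wrapping them in a `Pipeline` would keep each fold fully separate.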

Following the same steps is not mandatory. That's just a description of how the author of this assignment achieved the result. You may prefer a different route and, say, add a couple of good features and train a random forest of a thousand trees.
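
As one illustration of what "a couple of good features" might look like, the `c-N` strings can be decoded into integers and `DepTime` can be split into an hour and minutes since midnight. The sketch below uses three rows copied from `train.head()` above. It assumes that `DepTime` is an HHMM-style integer, which is how the values shown look but is not stated in the assignment, and the helper and new column names are hypothetical.

```python
import pandas as pd

# Three rows copied from train.head() above
sample = pd.DataFrame(
    {
        "Month": ["c-8", "c-4", "c-9"],
        "DayofMonth": ["c-21", "c-20", "c-2"],
        "DayOfWeek": ["c-7", "c-3", "c-5"],
        "DepTime": [1934, 1548, 1422],
    }
)


def add_basic_features(df):
    """Return a copy of df with decoded calendar columns and time-of-day features."""
    out = df.copy()
    for col in ["Month", "DayofMonth", "DayOfWeek"]:
        # "c-8" -> 8
        out[col + "_num"] = out[col].str.replace("c-", "", regex=False).astype(int)
    # Assumption: DepTime is encoded as HHMM, e.g. 1934 means 19:34.
    out["dep_hour"] = out["DepTime"] // 100
    out["dep_minute_of_day"] = out["dep_hour"] * 60 + out["DepTime"] % 100
    return out


features = add_basic_features(sample)
print(features[["Month_num", "DayOfWeek_num", "dep_hour", "dep_minute_of_day"]])
```

Whether such features actually help should be checked on the holdout set or with cross-validation, as with any other change to the model.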

Good luck!

In [9]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.1.1-cp38-none-macosx_10_6_universal2.whl (22.0 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1
Note: you may need to restart the kernel to use updated packages.


In [11]:
from catboost import CatBoostClassifier

In [14]:
catboost_model = CatBoostClassifier(random_seed=17)

In [18]:
catboost_model.fit(X_train, y_train)
catboost_test_pred = catboost_model.predict_proba(X_test)[:, 1]

pd.Series(catboost_test_pred, name="dep_delayed_15min").to_csv(
    "catboost_2feat.csv", index_label="id", header=True
)

Learning rate set to 0.07361
0:	learn: 0.6536546	total: 10.4ms	remaining: 10.4s
1:	learn: 0.6211785	total: 19ms	remaining: 9.47s
2:	learn: 0.5938502	total: 27.1ms	remaining: 8.99s
3:	learn: 0.5703896	total: 34.6ms	remaining: 8.62s
4:	learn: 0.5516728	total: 41.4ms	remaining: 8.24s
5:	learn: 0.5347917	total: 47.8ms	remaining: 7.92s
6:	learn: 0.5215205	total: 54ms	remaining: 7.66s
7:	learn: 0.5101607	total: 61.2ms	remaining: 7.58s
8:	learn: 0.5008439	total: 68.1ms	remaining: 7.5s
9:	learn: 0.4930374	total: 75.4ms	remaining: 7.46s
10:	learn: 0.4866230	total: 82.1ms	remaining: 7.38s
11:	learn: 0.4811460	total: 88.7ms	remaining: 7.3s
12:	learn: 0.4762913	total: 95.6ms	remaining: 7.26s
13:	learn: 0.4725988	total: 103ms	remaining: 7.27s
14:	learn: 0.4685880	total: 110ms	remaining: 7.2s
15:	learn: 0.4653948	total: 117ms	remaining: 7.22s
16:	learn: 0.4627732	total: 125ms	remaining: 7.22s
17:	learn: 0.4606552	total: 131ms	remaining: 7.17s
18:	learn: 0.4587443	total: 138ms	remaining: 7.12s
...

164:	learn: 0.4415086	total: 1.22s	remaining: 6.16s
165:	learn: 0.4414900	total: 1.23s	remaining: 6.16s
166:	learn: 0.4414664	total: 1.24s	remaining: 6.16s
167:	learn: 0.4414412	total: 1.24s	remaining: 6.16s
168:	learn: 0.4414133	total: 1.25s	remaining: 6.15s
169:	learn: 0.4413979	total: 1.26s	remaining: 6.14s
170:	learn: 0.4413800	total: 1.27s	remaining: 6.14s
171:	learn: 0.4413468	total: 1.27s	remaining: 6.13s
172:	learn: 0.4413236	total: 1.28s	remaining: 6.13s
173:	learn: 0.4412926	total: 1.29s	remaining: 6.12s
174:	learn: 0.4412636	total: 1.3s	remaining: 6.11s
175:	learn: 0.4412419	total: 1.3s	remaining: 6.11s
176:	learn: 0.4412213	total: 1.31s	remaining: 6.1s
177:	learn: 0.4411957	total: 1.32s	remaining: 6.1s
178:	learn: 0.4411532	total: 1.33s	remaining: 6.09s
179:	learn: 0.4411237	total: 1.33s	remaining: 6.08s
180:	learn: 0.4410987	total: 1.34s	remaining: 6.08s
181:	learn: 0.4410784	total: 1.35s	remaining: 6.07s
182:	learn: 0.4410547	total: 1.36s	remaining: 6.06s
...

344:	learn: 0.4372583	total: 2.64s	remaining: 5.01s
345:	learn: 0.4372379	total: 2.65s	remaining: 5s
346:	learn: 0.4372149	total: 2.65s	remaining: 5s
347:	learn: 0.4371973	total: 2.66s	remaining: 4.99s
348:	learn: 0.4371705	total: 2.67s	remaining: 4.98s
349:	learn: 0.4371493	total: 2.68s	remaining: 4.97s
350:	learn: 0.4371288	total: 2.69s	remaining: 4.97s
351:	learn: 0.4371046	total: 2.69s	remaining: 4.96s
352:	learn: 0.4370805	total: 2.7s	remaining: 4.96s
353:	learn: 0.4370533	total: 2.71s	remaining: 4.95s
354:	learn: 0.4370405	total: 2.72s	remaining: 4.94s
355:	learn: 0.4370173	total: 2.73s	remaining: 4.93s
356:	learn: 0.4370012	total: 2.73s	remaining: 4.92s
357:	learn: 0.4369776	total: 2.74s	remaining: 4.92s
358:	learn: 0.4369585	total: 2.75s	remaining: 4.91s
359:	learn: 0.4369472	total: 2.76s	remaining: 4.9s
360:	learn: 0.4369269	total: 2.77s	remaining: 4.89s
361:	learn: 0.4369012	total: 2.77s	remaining: 4.89s
362:	learn: 0.4368796	total: 2.78s	remaining: 4.88s
...

527:	learn: 0.4339106	total: 4.06s	remaining: 3.63s
528:	learn: 0.4338923	total: 4.07s	remaining: 3.63s
529:	learn: 0.4338758	total: 4.08s	remaining: 3.62s
530:	learn: 0.4338667	total: 4.09s	remaining: 3.61s
531:	learn: 0.4338400	total: 4.1s	remaining: 3.6s
532:	learn: 0.4338312	total: 4.11s	remaining: 3.6s
533:	learn: 0.4338155	total: 4.11s	remaining: 3.59s
534:	learn: 0.4338082	total: 4.12s	remaining: 3.58s
535:	learn: 0.4337926	total: 4.13s	remaining: 3.58s
536:	learn: 0.4337765	total: 4.14s	remaining: 3.57s
537:	learn: 0.4337561	total: 4.15s	remaining: 3.56s
538:	learn: 0.4337341	total: 4.15s	remaining: 3.55s
539:	learn: 0.4337222	total: 4.16s	remaining: 3.54s
540:	learn: 0.4337132	total: 4.17s	remaining: 3.54s
541:	learn: 0.4337028	total: 4.18s	remaining: 3.53s
542:	learn: 0.4336849	total: 4.18s	remaining: 3.52s
543:	learn: 0.4336647	total: 4.19s	remaining: 3.51s
544:	learn: 0.4336361	total: 4.2s	remaining: 3.5s
545:	learn: 0.4336162	total: 4.21s	remaining: 3.5s
...

709:	learn: 0.4310882	total: 5.49s	remaining: 2.24s
710:	learn: 0.4310792	total: 5.5s	remaining: 2.24s
711:	learn: 0.4310542	total: 5.51s	remaining: 2.23s
712:	learn: 0.4310378	total: 5.52s	remaining: 2.22s
713:	learn: 0.4310210	total: 5.53s	remaining: 2.21s
714:	learn: 0.4310041	total: 5.53s	remaining: 2.21s
715:	learn: 0.4309855	total: 5.54s	remaining: 2.2s
716:	learn: 0.4309728	total: 5.55s	remaining: 2.19s
717:	learn: 0.4309614	total: 5.56s	remaining: 2.18s
718:	learn: 0.4309451	total: 5.57s	remaining: 2.17s
719:	learn: 0.4309359	total: 5.57s	remaining: 2.17s
720:	learn: 0.4309261	total: 5.58s	remaining: 2.16s
721:	learn: 0.4309106	total: 5.59s	remaining: 2.15s
722:	learn: 0.4309015	total: 5.6s	remaining: 2.14s
723:	learn: 0.4308983	total: 5.6s	remaining: 2.14s
724:	learn: 0.4308825	total: 5.61s	remaining: 2.13s
725:	learn: 0.4308635	total: 5.62s	remaining: 2.12s
726:	learn: 0.4308516	total: 5.63s	remaining: 2.11s
727:	learn: 0.4308389	total: 5.63s	remaining: 2.1s
...

888:	learn: 0.4286701	total: 6.93s	remaining: 865ms
889:	learn: 0.4286614	total: 6.93s	remaining: 857ms
890:	learn: 0.4286447	total: 6.94s	remaining: 849ms
891:	learn: 0.4286299	total: 6.95s	remaining: 842ms
892:	learn: 0.4286240	total: 6.96s	remaining: 834ms
893:	learn: 0.4286174	total: 6.97s	remaining: 826ms
894:	learn: 0.4285973	total: 6.98s	remaining: 818ms
895:	learn: 0.4285829	total: 6.98s	remaining: 811ms
896:	learn: 0.4285743	total: 6.99s	remaining: 803ms
897:	learn: 0.4285459	total: 7s	remaining: 795ms
898:	learn: 0.4285299	total: 7.01s	remaining: 787ms
899:	learn: 0.4285147	total: 7.02s	remaining: 780ms
900:	learn: 0.4285006	total: 7.02s	remaining: 772ms
901:	learn: 0.4284888	total: 7.03s	remaining: 764ms
902:	learn: 0.4284759	total: 7.04s	remaining: 756ms
903:	learn: 0.4284712	total: 7.05s	remaining: 748ms
904:	learn: 0.4284604	total: 7.05s	remaining: 741ms
905:	learn: 0.4284419	total: 7.06s	remaining: 733ms
906:	learn: 0.4284346	total: 7.07s	remaining: 725ms
...