<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg" />
</center> 
     
## <center>  [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 

#### <center> Author: [Yury Kashnitsky](https://yorko.github.io) (@yorko) 

# <center>Assignment #2. Fall 2019
## <center> Part 2. Gradient boosting

**In this assignment, you're asked to beat a baseline in the ["Flight delays" competition](https://www.kaggle.com/c/flight-delays-fall-2018).**

This time we decided to share a pretty decent CatBoost baseline. Your job is to improve the provided solution.

Prior to working on the assignment, you'd better check out the corresponding course material:
 1. [Classification, Decision Trees and k Nearest Neighbors](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb?flush_cache=true), the same as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-3-decision-trees-and-knn) 
 2. Ensembles:
  - [Bagging](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part1_bagging.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-1-bagging)
  - [Random Forest](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part2_random_forest.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-2-random-forest)
  - [Feature Importance](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part3_feature_importance.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-3-feature-importance)
 3. Gradient boosting:
  - [Gradient boosting](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic10_boosting/topic10_gradient_boosting.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-10-gradient-boosting)
  - Logistic regression, Random Forest, and LightGBM in the "Kaggle Forest Cover Type Prediction" competition: [Kernel](https://www.kaggle.com/kashnitsky/topic-10-practice-with-logit-rf-and-lightgbm)
 4. You can also practice with demo assignments, which are simpler and already shared with solutions:
  - "Decision trees with a toy task and the UCI Adult dataset": [assignment](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees) + [solution](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees-solution)
  - "Logistic Regression and Random Forest in the credit scoring problem": [assignment](https://www.kaggle.com/kashnitsky/assignment-5-logit-and-rf-for-credit-scoring) + [solution](https://www.kaggle.com/kashnitsky/a5-demo-logit-and-rf-for-credit-scoring-sol)
 5. There are also 7 video lectures on trees, forests, boosting and their applications: [mlcourse.ai/video](https://mlcourse.ai/video) 
 6. mlcourse.ai tutorials on [categorical feature encoding](https://www.kaggle.com/waydeherman/tutorial-categorical-encoding) (by Wayde Herman) and [CatBoost](https://www.kaggle.com/mitribunskiy/tutorial-catboost-overview) (by Mikhail Tribunskiy)
 7. Last but not least: [Public Kernels](https://www.kaggle.com/c/flight-delays-fall-2018/notebooks) in this competition

### Your task is to:
  beat **"A2 baseline (10 credits)"** on Public LB (**0.75914** LB score)

 



In [None]:
import warnings
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**Read the data**

In [None]:
PATH_TO_DATA = Path('../input/flight-delays-fall-2018/')

In [None]:
train_df = pd.read_csv(PATH_TO_DATA / 'flight_delays_train.csv')

In [None]:
train_df.head()

In [None]:
test_df = pd.read_csv(PATH_TO_DATA / 'flight_delays_test.csv')

In [None]:
test_df.head()

### Create new features

In [None]:
train_df['flight'] = train_df['Origin'] + '-->' + train_df['Dest']
test_df['flight'] = test_df['Origin'] + '-->' + test_df['Dest']

In [None]:
# Split departure time (hhmm) into hour and minutes
train_df['DepHour'] = train_df['DepTime'] // 100
# Assign the result back: an in-place replace on a selected column may not modify the frame
train_df['DepHour'] = train_df['DepHour'].replace(to_replace=24, value=0)

test_df['DepHour'] = test_df['DepTime'] // 100
test_df['DepHour'] = test_df['DepHour'].replace(to_replace=24, value=0)

train_df['DepMinutes'] = train_df['DepTime'] % 100
test_df['DepMinutes'] = test_df['DepTime'] % 100

In [None]:
# Strip the two-character prefix from Month, DayofMonth, DayOfWeek and cast to int
train_df['Month'] = train_df['Month'].str[2:].astype('int')
train_df['DayofMonth'] = train_df['DayofMonth'].str[2:].astype('int')
train_df['DayOfWeek'] = train_df['DayOfWeek'].str[2:].astype('int')

test_df['Month'] = test_df['Month'].str[2:].astype('int')
test_df['DayofMonth'] = test_df['DayofMonth'].str[2:].astype('int')
test_df['DayOfWeek'] = test_df['DayOfWeek'].str[2:].astype('int')


In [None]:
train_df

In [None]:
# Split flights into delayed and not delayed to compare them by 'DepHour'
df_delay = train_df[train_df['dep_delayed_15min']=='Y']
df_not_delay = train_df[train_df['dep_delayed_15min']=='N']

In [None]:
plt.figure(figsize=(18,7));
plt.subplot(1,2,1);
df_delay['DepHour'].value_counts().sort_index().plot(kind='bar')
plt.subplot(1,2,2);
df_not_delay['DepHour'].value_counts().sort_index().plot(kind='bar');

In [None]:
# Map the minute of the hour to its 15-minute quarter (1..4)
def minute_4_part(x):
    for i in range(4):
        if x >= i*15 and x < (i+1)*15:
            return i+1
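
The quarter-hour feature can also be computed without a Python loop. The sketch below is an illustration only: the sample departure times are hypothetical, and it assumes minutes are always below 60 (for larger values `minute_4_part` returns `None`, while integer division would return 5 or more).

```python
import pandas as pd

# Hypothetical departure times in hhmm format (not taken from the data set)
dep_time = pd.Series([5, 714, 1530, 2359])

dep_hour = dep_time // 100     # hour part
dep_minutes = dep_time % 100   # minute part

# Integer division by 15 gives the quarter index 0..3; adding 1 matches minute_4_part
quarter = dep_minutes // 15 + 1

# Same formula as the 'all_part_time' feature: quarter-hour slot of the day
all_part_time = dep_hour * 4 + quarter
print(all_part_time.tolist())
```

With 24 hours and 4 quarters per hour, the feature takes values from 1 to 96.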

In [None]:
# apply minute_4_part and create new feature
train_df['all_part_time'] = train_df['DepHour']*4+train_df['DepMinutes'].apply(minute_4_part)
test_df['all_part_time'] = test_df['DepHour']*4+test_df['DepMinutes'].apply(minute_4_part)

In [None]:
train_df

In [None]:
plt.figure(figsize=(18,7));
plt.subplot(1,2,1);
df_delay['Month'].value_counts().sort_index().plot(kind='bar')
plt.subplot(1,2,2);
df_not_delay['Month'].value_counts().sort_index().plot(kind='bar');

In [None]:
# Create new feature for monthly workload
def month_peak(x):
    if x in (12, 6, 7):
        return 0  # month_alot
    elif x in (4, 5, 9, 2):
        return 1  # month_alot
    else:
        return 2  # norm

In [None]:
#apply function to 'month_peak'
train_df['month_peak'] = train_df['Month'].apply(month_peak)

test_df['month_peak'] = test_df['Month'].apply(month_peak)

In [None]:
# Create new feature 'season'
def season(x):
    if x==12 or x==1 or x==2:
        return  0 #winter
    elif x>=3 and x<=5:
        return 1 #spring
    elif x>=6 and x<=8:
        return 2 #summer
    elif x>=9 and x<=11:
        return 3 #autumn
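
As a cross-check of the season codes, the same mapping can be written as a single arithmetic expression. This is a sketch of an equivalent formulation, not the notebook's method; the `months` series is made up for illustration.

```python
import pandas as pd

# All twelve month numbers, used only to illustrate the mapping
months = pd.Series(range(1, 13))

# month % 12 moves December to 0, so integer division by 3 yields
# 0 = winter (12, 1, 2), 1 = spring (3-5), 2 = summer (6-8), 3 = autumn (9-11)
season_code = (months % 12) // 3
print(dict(zip(months.tolist(), season_code.tolist())))
```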

In [None]:
#apply function to 'season'
train_df['season_time'] = train_df['Month'].apply(season)

test_df['season_time'] = test_df['Month'].apply(season)

In [None]:
plt.figure(figsize=(18,7));
plt.subplot(1,2,1);
df_delay['DayOfWeek'].value_counts().sort_index().plot(kind='bar')
plt.subplot(1,2,2);
df_not_delay['DayOfWeek'].value_counts().sort_index().plot(kind='bar');

In [None]:
# Create new feature 'traffic_week'
def traffic_week(x):
    if x==1 or x==4 or x==5 or x==7:
        return 1 #high_delay
    else:
        return 0 #low_delay

In [None]:
#apply function to 'traffic_week'
train_df['traffic_week'] = train_df['DayOfWeek'].apply(traffic_week)

test_df['traffic_week'] = test_df['DayOfWeek'].apply(traffic_week)

In [None]:
print(train_df.head())
print(test_df.head())

In [None]:
# Map the target from 'Y'/'N' to 1/0
train_df['dep_delayed_15min'] = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0})

In [None]:
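# Create a correlation heatmap; only numeric columns can be correlated,
# so string columns (carrier, airports, flight) are excluded first
plt.figure(figsize=(12, 8))
sns.heatmap(train_df.select_dtypes('number').corr(), linewidths=1, annot=True);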

### Remember indexes of categorical features (to be passed to CatBoost)

In [None]:
# Cast these features to 'object' so they are picked up as categorical below
train_df['Month'] = train_df['Month'].astype('object')
test_df['Month'] = test_df['Month'].astype('object')

train_df['DepHour'] = train_df['DepHour'].astype('object')
test_df['DepHour'] = test_df['DepHour'].astype('object')


train_df['month_peak'] = train_df['month_peak'].astype('object')
test_df['month_peak'] = test_df['month_peak'].astype('object')

train_df['season_time'] = train_df['season_time'].astype('object')
test_df['season_time'] = test_df['season_time'].astype('object')

train_df['traffic_week']=train_df['traffic_week'].astype('object')
test_df['traffic_week']=test_df['traffic_week'].astype('object')

train_df['all_part_time']=train_df['all_part_time'].astype('object')
test_df['all_part_time']=test_df['all_part_time'].astype('object')

In [None]:
test_df.info()

In [None]:
colstodel = ['dep_delayed_15min']

In [None]:
# Save the positional indexes of categorical (object-dtype) features
categ_feat_idx = np.where(train_df.drop(columns=colstodel).dtypes == 'object')[0]
categ_feat_idx
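
To see what the index lookup returns, here is a minimal sketch on a toy frame. The frame, its values, and the names `toy`, `features`, and `toy_cat_idx` are hypothetical and only mirror the structure of the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up)
toy = pd.DataFrame({
    'Month': [1, 2],
    'Distance': [732, 834],
    'Origin': ['ATL', 'PIT'],
    'target': [0, 1],
})

# Mark the categorical columns explicitly, as the notebook does with astype('object')
toy['Month'] = toy['Month'].astype('object')
toy['Origin'] = toy['Origin'].astype('object')

# Drop the target first so positions refer to the feature matrix only
features = toy.drop(columns=['target'])
toy_cat_idx = np.where(features.dtypes == 'object')[0]

print(toy_cat_idx)
print(features.columns[toy_cat_idx].tolist())
```

The positions are only meaningful as long as the feature matrix handed to the model keeps the same column order.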

In [None]:
train_df.drop(columns=colstodel).head()

### Build X_train, y_train, X_test and split the training set 70%:30%

In [None]:
X_train = train_df.drop(columns=colstodel).values
y_train = train_df['dep_delayed_15min']
# test_df has no target column, so nothing needs to be dropped
X_test = test_df.values

In [None]:
X_train_part, X_valid, y_train_part, y_valid = train_test_split(X_train, y_train, 
                                                                test_size=0.3, 
                                                                random_state=17)
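
One optional variation, which this notebook does not use, is a stratified split that keeps the share of delayed flights the same in both parts. The sketch below runs on synthetic data; the arrays and the 20% positive rate are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and an imbalanced binary target
rng = np.random.RandomState(17)
X_toy = rng.rand(1000, 3)
y_toy = (rng.rand(1000) < 0.2).astype(int)

# stratify=y_toy preserves the class ratio in the 70% and 30% parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=17, stratify=y_toy)

print(y_toy.mean(), y_tr.mean(), y_va.mean())
```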

### Create a CatBoost classifier

In [None]:
ctb1 = CatBoostClassifier(iterations=1200,
                          learning_rate = 0.05,
                          eval_metric='AUC',
                          max_depth=None,
                          random_seed=17, 
                          silent=True)

In [None]:
%%time
ctb1.fit(X_train_part, y_train_part, 
         eval_set=(X_valid, y_valid),
         cat_features=categ_feat_idx,
         plot=True);

In [None]:
ctb_valid_pred_1 = ctb1.predict_proba(X_valid)[:, 1]

In [None]:
roc_auc_score(y_valid, ctb_valid_pred_1)
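
For intuition about the metric: ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The toy labels and scores below are made up to show this.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (illustrative values only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Because only the ordering of scores matters, the metric is unaffected by any monotonic rescaling of the predicted probabilities.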

**I got a ROC AUC of 0.802809 on the hold-out set.**

In [None]:
# Refit on the full training set (100%) before predicting on the test set
ctb1.fit(X_train, y_train,
        cat_features=categ_feat_idx);
ctb_test_pred_3 = ctb1.predict_proba(X_test)[:, 1]

In [None]:
ctb_test_pred_3

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    
    sample_sub = pd.read_csv(PATH_TO_DATA / 'sample_submission.csv', 
                             index_col='id')
    sample_sub['dep_delayed_15min'] = ctb_test_pred_3
    sample_sub.to_csv('ctb_pred_4.csv')

In [None]:
!head ctb_pred_4.csv

### ROC AUC: 0.76364



<img src='https://habrastorage.org/webt/fs/42/ms/fs42ms0r7qsoj-da4x7yfntwrbq.jpeg' width=50%>
