<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist @ Mail.Ru Group <br>All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

# <center> Assignment #10 (demo)
## <center> Gradient boosting

Your task is to beat at least 2 benchmarks in this [Kaggle Inclass competition](https://www.kaggle.com/c/flight-delays-spring-2018). Here you won’t be provided with detailed instructions. We only give you a brief description of how the second benchmark was achieved using Xgboost. Hopefully, at this stage of the course, it's enough for you to take a quick look at the data in order to understand that this is the type of task where gradient boosting will perform well. Most likely it will be Xgboost, however, we’ve got plenty of categorical features here.

<img src='../../img/xgboost_meme.jpg' width=40% />

In [215]:
import warnings

from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline

warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

In [216]:
train = pd.read_csv("../../data/flight_delays_train.csv")
test = pd.read_csv("../../data/flight_delays_test.csv")

In [217]:
train.head()

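| | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | c-8 | c-21 | c-7 | 1934 | AA | ATL | DFW | 732 | N |
| 1 | c-4 | c-20 | c-3 | 1548 | US | PIT | MCO | 834 | N |
| 2 | c-9 | c-2 | c-5 | 1422 | XE | RDU | CLE | 416 | N |
| 3 | c-11 | c-25 | c-6 | 1015 | OO | DEN | MEM | 872 | N |
| 4 | c-10 | c-7 | c-6 | 1828 | WN | MDW | OMA | 423 | Y |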


In [218]:
test.head()

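| | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance |
|---|---|---|---|---|---|---|---|---|
| 0 | c-7 | c-25 | c-3 | 615 | YV | MRY | PHX | 598 |
| 1 | c-4 | c-17 | c-2 | 739 | WN | LAS | HOU | 1235 |
| 2 | c-12 | c-2 | c-7 | 651 | MQ | GSP | ORD | 577 |
| 3 | c-3 | c-25 | c-7 | 1614 | WN | BWI | MHT | 377 |
| 4 | c-6 | c-6 | c-3 | 1505 | UA | ORD | STL | 258 |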


Given flight departure time, carrier's code, departure airport, destination location, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes. As the simplest benchmark, let's take an Xgboost classifier and the two features that are easiest to use: DepTime and Distance. Such a model scores 0.68202 on the LB.

In [219]:
X_train = train[["Distance", "DepTime"]].values
y_train = train["dep_delayed_15min"].map({"Y": 1, "N": 0}).values
X_test = test[["Distance", "DepTime"]].values

X_train_part, X_valid, y_train_part, y_valid = train_test_split(
    X_train, y_train, test_size=0.3, random_state=17
)

We'll train Xgboost with default parameters on part of data and estimate holdout ROC AUC.

In [220]:
xgb_model = XGBClassifier(seed=17)

xgb_model.fit(X_train_part, y_train_part)
xgb_valid_pred = xgb_model.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, xgb_valid_pred)

0.7001348346148775

Now we do the same with the whole training set, make predictions for the test set and form a submission file. This is how you beat the first benchmark.

In [221]:
xgb_model.fit(X_train, y_train)
xgb_test_pred = xgb_model.predict_proba(X_test)[:, 1]

pd.Series(xgb_test_pred, name="dep_delayed_15min").to_csv(
    "xgb_2feat.csv", index_label="id", header=True
)

The second benchmark in the leaderboard was achieved as follows:

- Features `Distance` and `DepTime` were taken unchanged
- A feature `Flight` was created from features `Origin` and `Dest`
- Features `Month`, `DayofMonth`, `DayOfWeek`, `UniqueCarrier` and `Flight` were transformed with OHE (`LabelBinarizer`)
- Logistic regression and gradient boosting (xgboost) were trained. Xgboost hyperparameters were tuned via cross-validation. First, the hyperparameters responsible for model complexity were optimized, then the number of trees was fixed at 500 and learning step was tuned.
- Predicted probabilities were made via cross-validation using `cross_val_predict`. A linear mixture of logistic regression and gradient boosting predictions was set in the form $w_1 * p_{logit} + (1 - w_1) * p_{xgb}$, where $p_{logit}$ is a probability of class 1, predicted by logistic regression, and $p_{xgb}$ – the same for xgboost. $w_1$ weight was selected manually.
- A similar combination of predictions was made for test set. 
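
The weight selection in the last two steps can be sketched as a small grid scan. This is a minimal illustration, not the author's code: the probabilities below are made up and stand in for the out-of-fold output of `cross_val_predict`, and the helper names `roc_auc` and `best_blend_weight` are hypothetical. The AUC is computed directly from its pairwise definition so that the sketch only needs NumPy.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Memory use is quadratic,
    so this is only suitable for small samples.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def best_blend_weight(y_true, p_logit, p_xgb, weights=None):
    """Return (w1, auc) maximising the AUC of w1 * p_logit + (1 - w1) * p_xgb."""
    if weights is None:
        weights = np.linspace(0.0, 1.0, 11)  # 0.0, 0.1, ..., 1.0
    results = [(w, roc_auc(y_true, w * p_logit + (1 - w) * p_xgb)) for w in weights]
    return max(results, key=lambda pair: pair[1])


# Made-up out-of-fold probabilities, standing in for cross_val_predict output.
y = np.array([0, 0, 1, 1, 0, 1])
p_logit = np.array([0.2, 0.6, 0.5, 0.9, 0.1, 0.4])
p_xgb = np.array([0.3, 0.2, 0.4, 0.7, 0.5, 0.6])

w1, auc = best_blend_weight(y, p_logit, p_xgb)
print(f"best w1 = {w1:.1f}, blended AUC = {auc:.3f}")
```

Because the grid includes `w1 = 0` and `w1 = 1`, the selected blend is never worse on these predictions than either model alone. The weight should be chosen on out-of-fold predictions, not on the test set.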

Following the same steps is not mandatory. That’s just a description of how the result was achieved by the author of this assignment. Perhaps you might not want to follow the same steps, and instead, let’s say, add a couple of good features and train a random forest of a thousand trees.

Good luck!

---
In order to beat the benchmarks, we need to score higher than 0.682 on Kaggle.

First, we need to do exploratory data analysis to answer two questions:
1. Are there any null values?
2. Do we need to do feature engineering?

In [222]:
train.isnull().sum()

Month                0
DayofMonth           0
DayOfWeek            0
DepTime              0
UniqueCarrier        0
Origin               0
Dest                 0
Distance             0
dep_delayed_15min    0
dtype: int64

No null values were found. Let's check the columns datatypes:

In [223]:
train.dtypes

Month                object
DayofMonth           object
DayOfWeek            object
DepTime               int64
UniqueCarrier        object
Origin               object
Dest                 object
Distance              int64
dep_delayed_15min    object
dtype: object

In [224]:
train["dep_delayed_15min"].value_counts()

dep_delayed_15min
N    80956
Y    19044
Name: count, dtype: int64

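The **classes are split roughly 80/20** (about 81% `N` and 19% `Y`), so the target is imbalanced. Therefore, we could use **StratifiedKFold** later or just **GridSearchCV**, since for a classifier with an integer `cv` it does **stratification** by default.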
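
As a quick illustration of what stratification buys us, the sketch below builds toy labels with the same 80/20 balance (not the real data) and checks the share of positives in each validation fold. The fold count and `random_state` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with the same 80/20 class balance as the training set.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

# Share of positive labels in each validation fold.
fold_positive_rates = [y[valid_idx].mean() for _, valid_idx in skf.split(X, y)]
print(fold_positive_rates)
```

Every fold keeps the overall class ratio, so the ROC AUC estimated on each fold is computed on a comparable mix of delayed and on-time flights.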

We can start by converting dep_delayed_15min to uint8 (more memory efficient than the default int64). I did not choose the bool type here for compatibility, since some libraries accept only ints.

In [225]:
train["Target"] = train["dep_delayed_15min"].map({'Y': 1, 'N': 0})
train["Target"] = train["Target"].astype('uint8')
train.drop("dep_delayed_15min", axis=1, inplace=True)
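
To put a number on the memory argument, here is a small sketch on a made-up label array of the same length as the training set. It compares the storage needed by `int64` and `uint8`.

```python
import numpy as np

labels = np.array([1, 0, 0, 1] * 25_000)  # 100,000 made-up binary labels
as_int64 = labels.astype("int64")
as_uint8 = labels.astype("uint8")

# Bytes used by the raw arrays: 8 bytes per value versus 1 byte per value.
print(as_int64.nbytes, as_uint8.nbytes)  # 800000 100000
```

The values are identical in both arrays; only the storage per element changes.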

Next, the author mentioned that they created a new feature called `Flight`. However, one-hot encoding such a feature could make our dataset very high-dimensional. Therefore, let's not take this approach for now.

In [226]:
origin_unique_count = train["Origin"].nunique()
dest_unique_count = train["Dest"].nunique()
print("Unique origin locations:", origin_unique_count)
print("Unique destinations:", dest_unique_count)
print("If we use One Hot Encoding on the feature 'Flight', we could potentially have:",
      origin_unique_count * dest_unique_count, "new OHE generated features.")

Unique origin locations: 289
Unique destinations: 289
If we use One Hot Encoding on the feature 'Flight', we could potentially have: 83521 new OHE generated features.
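
Note that the product above is an upper bound: one-hot encoding creates one column per distinct value that actually occurs, so a `Flight` feature would add one column per observed origin-destination pair. The sketch below shows the difference on a hypothetical handful of routes, not on the real data.

```python
# Hypothetical sample of (Origin, Dest) pairs; the real data has far more rows.
flights = [
    ("ATL", "DFW"), ("PIT", "MCO"), ("ATL", "DFW"),
    ("DEN", "MEM"), ("MDW", "OMA"), ("DFW", "ATL"),
]

origins = {origin for origin, _ in flights}
dests = {dest for _, dest in flights}
observed_routes = set(flights)  # distinct pairs that actually occur

upper_bound = len(origins) * len(dests)
print("upper bound:", upper_bound, "observed routes:", len(observed_routes))
```

On the real training set, the same comparison could be made by counting the distinct `Origin` and `Dest` combinations before deciding whether the feature is too wide.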


The author also said they left `DepTime` untouched. However, if we one-hot encoded this feature, it could take any value from 0 to 2359 and produce a new column for each distinct value, which would make our dataset very high-dimensional as well.

The smart move here is to split the time into hours and minutes. Assuming valid clock times, one-hot encoding would then add at most 24 + 60 = 84 features (hours 0 to 23 and minutes 0 to 59).

In [227]:
def handle_time_features(df):
    """
    Splits DepTime (HHMM) into DepHour and DepMinute and removes the original DepTime column.
    """
    df["DepHour"] = df["DepTime"] // 100
    df["DepMinute"] = df["DepTime"] % 100
    df.drop("DepTime", axis=1, inplace=True)
    return df

train = handle_time_features(train)
test = handle_time_features(test)

In [228]:
train.head(5)

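| | Month | DayofMonth | DayOfWeek | UniqueCarrier | Origin | Dest | Distance | Target | DepHour | DepMinute |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c-8 | c-21 | c-7 | AA | ATL | DFW | 732 | 0 | 19 | 34 |
| 1 | c-4 | c-20 | c-3 | US | PIT | MCO | 834 | 0 | 15 | 48 |
| 2 | c-9 | c-2 | c-5 | XE | RDU | CLE | 416 | 0 | 14 | 22 |
| 3 | c-11 | c-25 | c-6 | OO | DEN | MEM | 872 | 0 | 10 | 15 |
| 4 | c-10 | c-7 | c-6 | WN | MDW | OMA | 423 | 1 | 18 | 28 |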


We can see that we successfully split the time.

Next up, we should extract the numbers from the Month, DayofMonth and DayOfWeek columns. For logistic regression this would not matter, since we would be using a one-hot encoder anyway, but for XGBoost we can use the extracted values directly without OHE.

In [229]:
def handle_date_columns(df):
    df.rename(columns={"DayofMonth": "DayOfMonth"}, inplace=True) # Fix inconsistent naming in the dataset

    df["Month"] = df["Month"].str.split('-').str[1].astype('uint8')
    df["DayOfMonth"] = df["DayOfMonth"].str.split('-').str[1].astype('uint8')
    df["DayOfWeek"] = df["DayOfWeek"].str.split('-').str[1].astype('uint8')

    return df

train = handle_date_columns(train)
test = handle_date_columns(test)

In [230]:
train.head(5)

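| | Month | DayOfMonth | DayOfWeek | UniqueCarrier | Origin | Dest | Distance | Target | DepHour | DepMinute |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 21 | 7 | AA | ATL | DFW | 732 | 0 | 19 | 34 |
| 1 | 4 | 20 | 3 | US | PIT | MCO | 834 | 0 | 15 | 48 |
| 2 | 9 | 2 | 5 | XE | RDU | CLE | 416 | 0 | 14 | 22 |
| 3 | 11 | 25 | 6 | OO | DEN | MEM | 872 | 0 | 10 | 15 |
| 4 | 10 | 7 | 6 | WN | MDW | OMA | 423 | 1 | 18 | 28 |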


We can see that our data is looking much cleaner and more usable.

Next, let's split the data:

In [231]:
X = train.drop("Target", axis=1)
y = train["Target"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Next, let's create **2 different pipelines: for Logistic Regression and XGBoost.**

**Logistic Regression pipeline:**
1. Encode the departure hour and minute with sine and cosine to create cyclical features,
2. Scale numeric features,
3. One-hot encode the categorical features,
4. Use GridSearchCV for selecting:
    1. Top X % of features
    2. Best regularization value
    3. Choose whether L1 or L2 regularization is better
5. Train on X_train data
6. Predict and calculate ROC AUC

In [232]:
class CyclicalTimeFeatures(BaseEstimator, TransformerMixin):
    """Add sine/cosine encodings of the DepHour and DepMinute columns."""

    def __init__(self, time_col='DepTime'):
        # Note: time_col is stored but not used by transform().
        self.time_col = time_col

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data.
        return self

    def transform(self, X):
        X = X.copy()  # The original dataset is not altered
        # Hours repeat every 24 units, minutes every 60.
        X["DepHour_sin"] = np.sin(2 * np.pi * X["DepHour"] / 24)
        X["DepHour_cos"] = np.cos(2 * np.pi * X["DepHour"] / 24)
        X["DepMinute_sin"] = np.sin(2 * np.pi * X["DepMinute"] / 60)
        X["DepMinute_cos"] = np.cos(2 * np.pi * X["DepMinute"] / 60)
        return X
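
The point of the sine and cosine encoding is that times near midnight end up close to each other. A minimal sketch with the standard library only, using hypothetical helper names `encode_cyclic` and `circle_distance`:

```python
import math


def encode_cyclic(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b, period):
    """Euclidean distance between two values after cyclic encoding."""
    return math.dist(encode_cyclic(a, period), encode_cyclic(b, period))


# On the raw scale, hour 23 and hour 0 are 23 units apart; on the circle they are neighbours.
print(circle_distance(23, 0, 24))  # same as the distance between hours 0 and 1
print(circle_distance(0, 12, 24))  # opposite points on the circle
```

A linear model sees the raw hour as a single monotonic quantity, so this encoding matters more for logistic regression than for tree-based models.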

In [233]:
categorical_features = ["UniqueCarrier", "Origin", "Dest"]
numeric_features = ["Distance", "DepHour_sin", "DepHour_cos", "DepMinute_sin", "DepMinute_cos"]

In [234]:
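# Columns not listed below (Month, DayOfMonth, DayOfWeek, DepHour, DepMinute)
# are dropped, because ColumnTransformer uses remainder="drop" by default.
preprocessor = ColumnTransformer([
    ("scaler", StandardScaler(), numeric_features),
    ("encoder", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])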

logregPipeline = Pipeline([
    ("cyclic", CyclicalTimeFeatures()),
    ("preprocessor", preprocessor),
    ("select", SelectPercentile()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=3000, n_jobs=-1, random_state=17))
])

In [235]:
param_grid = {
    "select__percentile": [10, 30, 50, 70, 80, 100],
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.01, 0.1, 1, 10],
}

grid_logreg = GridSearchCV(
    logregPipeline,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
    verbose=1,
    n_jobs=-1
)
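
The cost of this search is easy to estimate up front: the number of candidates is the product of the list lengths in the grid, and each candidate is fitted once per fold. A small sketch of the arithmetic for the grid above:

```python
from itertools import product
from math import prod

param_grid = {
    "select__percentile": [10, 30, 50, 70, 80, 100],
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.01, 0.1, 1, 10],
}
cv_folds = 5

n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 48 240

# Enumerating the combinations explicitly gives the same count.
assert n_candidates == len(list(product(*param_grid.values())))
```

With `refit=True` (the default), GridSearchCV performs one additional fit of the best candidate on the whole training data.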

**XGBoost pipeline:**
1. One-hot encode the features (the encoder is applied to every column, numeric ones included),
2. Use GridSearchCV for selecting:
    1. Best max depth
    2. Best learning rate
    3. Best number of estimators
    4. Best fraction for random row subsampling
    5. Best fraction for random feature subsampling
3. Train on X_train data
4. Predict and calculate ROC AUC

In [236]:
xgb_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("clf", XGBClassifier(random_state=17, use_label_encoder=False, eval_metric='logloss'))
])

param_grid = {
    "clf__n_estimators": [100, 200],
    "clf__max_depth": [3, 5],
    "clf__learning_rate": [0.1, 0.01],
    "clf__subsample": [0.7, 1],
    "clf__colsample_bytree": [0.7, 1]
}

grid_xgb = GridSearchCV(
    xgb_pipeline,
    param_grid,
    scoring='roc_auc',
    cv=5,
    verbose=1,
    n_jobs=-1
)
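
A `OneHotEncoder` placed directly in a pipeline encodes every column it receives, including numeric ones such as `Distance`. If the goal is to pass numeric columns to the booster unchanged, one option is a `ColumnTransformer` with `remainder="passthrough"`. The sketch below contrasts the two on a tiny made-up frame. It is an alternative to consider, not what the pipeline above does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with the same kinds of columns as the prepared training data.
df = pd.DataFrame({
    "UniqueCarrier": ["AA", "US", "AA"],
    "Origin": ["ATL", "PIT", "DEN"],
    "Distance": [732, 834, 872],
    "DepHour": [19, 15, 10],
})

categorical = ["UniqueCarrier", "Origin"]

# Encode only the categorical columns and forward the rest as they are.
encode_categoricals_only = ColumnTransformer(
    [("encoder", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # numeric columns are forwarded unchanged
)
encoded = encode_categoricals_only.fit_transform(df)

# For comparison: a bare OneHotEncoder treats every column as categorical.
encode_everything = OneHotEncoder(handle_unknown="ignore").fit_transform(df)

print(encoded.shape, encode_everything.shape)
```

In the first case `Distance` and `DepHour` stay as two numeric columns, so trees can split on thresholds. In the second, each distinct distance or hour becomes its own indicator column.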

Great. Let's fit both models and get the results.

In [237]:
%%time

grid_logreg.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


Fitted `GridSearchCV` and the parameters of its best pipeline:

| Component | Parameters |
|---|---|
| `GridSearchCV` | `estimator=Pipeline(step...liblinear'))])`, `param_grid={'clf__C': [0.01, 0.1, ...], 'clf__penalty': ['l1', 'l2'], 'select__percentile': [10, 30, ...]}`, `scoring='roc_auc'`, `n_jobs=-1`, `refit=True`, `cv=5`, `verbose=1`, `pre_dispatch='2*n_jobs'`, `return_train_score=False` |
| `CyclicalTimeFeatures` | `time_col='DepTime'` |
| `ColumnTransformer` | `transformers=[('scaler', ...), ('encoder', ...)]`, `remainder='drop'`, `sparse_threshold=0.3`, `verbose=False`, `verbose_feature_names_out=True`, `force_int_remainder_cols='deprecated'` |
| `StandardScaler` | `copy=True`, `with_mean=True`, `with_std=True` |
| `OneHotEncoder` | `categories='auto'`, `sparse_output=True`, `dtype=<class 'numpy.float64'>`, `handle_unknown='ignore'`, `feature_name_combiner='concat'` |
| `SelectPercentile` | `score_func=<function f_c...001C8A90D4F40>`, `percentile=100` |
| `LogisticRegression` | `penalty='l1'`, `dual=False`, `tol=0.0001`, `C=0.1`, `fit_intercept=True`, `intercept_scaling=1`, `random_state=17`, `solver='liblinear'`, `max_iter=3000` |


The selected value `percentile=100` for SelectPercentile shows that feature selection provided no real benefit: GridSearchCV found that keeping all features gives the best result.

In [238]:
%%time

grid_xgb.fit(X_train, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


Fitted `GridSearchCV` and the parameters of its best pipeline:

| Component | Parameters |
|---|---|
| `GridSearchCV` | `estimator=Pipeline(step...=None, ...))])`, `param_grid={'clf__colsample_bytree': [0.7, 1], 'clf__learning_rate': [0.1, 0.01], 'clf__max_depth': [3, 5], 'clf__n_estimators': [100, 200], ...}`, `scoring='roc_auc'`, `n_jobs=-1`, `refit=True`, `cv=5`, `verbose=1`, `pre_dispatch='2*n_jobs'`, `return_train_score=False` |
| `OneHotEncoder` | `categories='auto'`, `sparse_output=True`, `dtype=<class 'numpy.float64'>`, `handle_unknown='ignore'`, `feature_name_combiner='concat'` |
| `XGBClassifier` | `objective='binary:logistic'`, `colsample_bytree=1`, `enable_categorical=False` |


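Great, now let's look at the best ROC-AUC found by each search. Note that `best_score_` is the mean ROC-AUC over the 5 cross-validation folds of the training data, not a score on the data the model was fitted on.

In [239]:
print("Logistic Regression cross-validated ROC-AUC:", grid_logreg.best_score_)
print("XGBoost cross-validated ROC-AUC:", grid_xgb.best_score_)

Logistic Regression cross-validated ROC-AUC: 0.692761856071308
XGBoost cross-validated ROC-AUC: 0.7321453549611016


Cross-validated ROC-AUC:
* LogReg: 0.693
* XGBoost: 0.732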
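
The hold-out part created earlier (`X_valid`, `y_valid`) gives a second, independent estimate, because the grid search never sees it. The sketch below shows the pattern on synthetic data with a plain logistic regression as a stand-in. The data set, the grid and the random seeds are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 80/20), not the flight-delay set.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=17
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

cv_auc = grid.best_score_  # mean AUC over the 5 validation folds of X_train
holdout_auc = roc_auc_score(y_valid, grid.predict_proba(X_valid)[:, 1])
print(f"CV AUC: {cv_auc:.3f}, hold-out AUC: {holdout_auc:.3f}")
```

In the notebook, the same two lines at the end could be applied to `grid_logreg` and `grid_xgb` with the existing `X_valid` and `y_valid`.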

Now, let's get the weighted average of the two models' predictions for the test dataset.

In [240]:
logreg_test_pred = grid_logreg.predict_proba(test)[:, 1]
xgb_test_pred = grid_xgb.predict_proba(test)[:, 1]

In [241]:
w = 0.4
combined_pred = w * logreg_test_pred + (1 - w) * xgb_test_pred

In [242]:
submission = pd.Series(combined_pred, name="dep_delayed_15min")
submission.index.name = "id"
submission.to_csv("submission_combined.csv", header=True)

After submission I received a score of 0.72, while the highest score on the leaderboard was 0.76.

This means the approach worked: the two benchmarks were beaten, and the score is reasonably close to the best one.