## A Gradient Boosting Machine (CatBoost) application for Click Through Rate (CTR) prediction

In this Jupyter notebook we explore using a Gradient Boosting Machine (CatBoost) to predict CTR. The predictions subsequently serve as input to a non-linear bidding strategy.

The model is trained on a training set. A validation set is used for early stopping, feature importances and hyperparameter optimisation. The final CTR predictions are then obtained by retraining the model on the combined training and validation sets with the selected hyperparameters.

# 1. Data Preparation

### 1.1 CatBoost installation

Install CatBoost if it is not already installed.

In [1]:
!pip install catboost
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/98/03/777a0e1c12571a7f3320a4fa6d5f123dba2dd7c0bca34f4f698a6396eb48/catboost-0.12.2-cp36-none-manylinux1_x86_64.whl (55.5MB)
    100% |████████████████████████████████| 55.5MB 551kB/s
Installing collected packages: catboost
Successfully installed catboost-0.12.2
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


### Imports

In [45]:
import numpy as np
import pandas as pd
from pickle import dump, load

np.random.seed(42)

# The following lines are needed to read the data from a personal Google Drive. Change this path to wherever you store the data sets.
from google.colab import drive
drive.mount('/content/drive')
dir = 'drive/My Drive/Colab Notebooks/data/RTB-data/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 1.2 Data Loading

In [0]:
# The datasets were converted to pickle files beforehand, because loading a pickle is much quicker than parsing a CSV file.
with open(dir + 'train.pkl', 'rb') as f:
    train_df = load(f)
with open(dir + 'validation.pkl', 'rb') as f:
    val_df = load(f)
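
The one-off conversion from CSV to pickle mentioned in the comment above could look roughly like the sketch below. The file names and the tiny example frame are placeholders, not the notebook's actual data, and the sketch writes to a temporary directory so that it can be run anywhere.

```python
import os
import tempfile
from pickle import dump, load

import pandas as pd

# Hypothetical stand-in for the raw data; the real notebook works with the RTB files instead.
raw = pd.DataFrame({'click': [0, 1, 0], 'slotprice': [5, 20, 80]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'train.csv')  # placeholder file name
    pkl_path = os.path.join(tmp, 'train.pkl')  # placeholder file name
    raw.to_csv(csv_path, index=False)

    # One-off conversion: parse the CSV once and store the DataFrame as a pickle.
    df_from_csv = pd.read_csv(csv_path)
    with open(pkl_path, 'wb') as f:
        dump(df_from_csv, f)

    # Later sessions only need to unpickle, which skips the CSV parsing step.
    with open(pkl_path, 'rb') as f:
        restored = load(f)

print(restored.equals(df_from_csv))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.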


### 1.3 Feature Preparation

Let's see how many NaN values we have:

In [4]:
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

adexchange      49829
domain         137135
url             86812
urlid         2430981
keypage        504990
usertag        497479
dtype: int64

As we can observe, **`adexchange`**, **`domain`**, **`url`**, **`urlid`**, **`keypage`**, and **`usertag`** have a substantial number of missing values. How do we deal with this?


* Leave them out for now.
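
To judge how substantial the gaps are, it can help to look at the share of missing values per column rather than the raw counts. The sketch below does this on a small made-up frame. The 30% cut-off and the helper name `columns_above_missing_share` are illustrative assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd


def columns_above_missing_share(df, threshold):
    """Return the names of columns whose fraction of NaN values exceeds `threshold`."""
    missing_share = df.isnull().mean()  # mean of a boolean mask = fraction of NaNs
    return missing_share[missing_share > threshold].index.tolist()


# Toy data standing in for train_df (values are invented for illustration).
toy = pd.DataFrame({
    'hour': [1, 2, 3, 4],
    'domain': ['a.com', np.nan, 'b.com', 'c.com'],  # 25% missing
    'urlid': [np.nan, np.nan, np.nan, 'x'],         # 75% missing
})

to_drop = columns_above_missing_share(toy, threshold=0.30)
reduced = toy.drop(to_drop, axis=1)
print(to_drop)
print(reduced.columns.tolist())
```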


The features `bidid`, `userid`, `url`, `urlid`, `bidprice`, `payprice` and possibly `keypage` are candidates for removal, because they are either almost unique for each case or carry no meaningful signal for CatBoost training.

Let's exclude the features that we do not want to include in the CatBoost training.

In [31]:
features_delete = ['bidprice', 'payprice']
features_nan = ['adexchange', 'domain', 'url', 'urlid', 'keypage', 'usertag']
train_df = train_df.drop(features_delete + features_nan, axis=1)
val_df = val_df.drop(features_delete + features_nan, axis=1)
print(list(train_df.columns))  # show the remaining columns

['click', 'weekday', 'hour', 'bidid', 'userid', 'useragent', 'IP', 'region', 'city', 'slotid', 'slotwidth', 'slotheight', 'slotvisibility', 'slotformat', 'slotprice', 'creative', 'advertiser']


Let's have a look at the feature data types. String-typed features have to be passed to CatBoost as categorical features. The cell below marks every non-float column as categorical.

In [32]:
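# X_train is only created in section 2.1, so use the same feature columns taken from train_df here.
print(train_df.drop('click', axis=1).dtypes)
# np.float was removed in NumPy 1.24; the builtin float is equivalent.
categorical_features_indices = np.where(train_df.drop('click', axis=1).dtypes != float)[0]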

weekday            int64
hour               int64
bidid             object
userid            object
useragent         object
IP                object
region             int64
city               int64
slotid            object
slotwidth          int64
slotheight         int64
slotvisibility    object
slotformat        object
slotprice          int64
creative          object
advertiser         int64
dtype: object
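
The non-float rule used above marks every column that is not a float as categorical, so integer columns such as `hour` or `slotprice` are treated as categories alongside the string columns. The sketch below shows this on a made-up frame. As an alternative, which is an assumption rather than this notebook's choice, it also selects only the `object` columns.

```python
import numpy as np
import pandas as pd

# Toy frame with the same mix of dtypes as the feature table (values invented).
toy = pd.DataFrame({
    'hour': pd.Series([0, 13, 22], dtype='int64'),
    'useragent': pd.Series(['a', 'b', 'a'], dtype=object),
    'slotprice': pd.Series([5, 20, 80], dtype='int64'),
    'ratio': pd.Series([0.1, 0.5, 0.9], dtype='float64'),
})

# Rule used in the notebook: everything that is not float counts as categorical.
non_float_idx = np.where(toy.dtypes != float)[0]

# Alternative (assumption): only object-typed columns count as categorical.
object_idx = np.where(toy.dtypes == object)[0]

print(toy.columns[non_float_idx].tolist())
print(toy.columns[object_idx].tolist())
```

Which rule is preferable depends on whether integer columns such as `slotprice` should be read as ordered numbers or as unordered categories.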


# 2. CatBoost Training and Testing

### 2.1 Negative Downsampling

Because the dataset is highly imbalanced, we perform negative downsampling: every training sample keeps all observations from the minority class ('click'), while a varying number of observations from the majority class ('no click') is drawn by sampling without replacement.

In [0]:
# 1793 clicks in the training set
factor = 1  # a 1:1 ratio of majority samples and minority samples performs best
X_minority = train_df[train_df.click == 1]
# DataFrame.sample draws without replacement by default; the builtin int replaces np.int, which was removed in NumPy 1.24.
X_majority = train_df[train_df.click == 0].sample(n=int(factor * len(X_minority)))
X_sample = pd.concat([X_minority, X_majority])
X_sample = X_sample.sample(frac=1).reset_index(drop=True)  # Randomly shuffle rows

# Split features from target variable
X_train, y_train = X_sample.drop('click', axis=1), X_sample.click
X_val, y_val = val_df.drop('click', axis=1), val_df.click
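
One side effect of negative downsampling is worth keeping in mind when the predictions feed a bidding strategy: a model trained on the balanced sample sees far more clicks than occur in the full data, so its predicted probabilities are inflated. A common correction, which is an assumption here and not a step taken in this notebook, rescales a predicted probability `q` using the fraction `w` of negatives that were kept: `p = q / (q + (1 - q) / w)`. The function name `recalibrate` and the example numbers below are hypothetical.

```python
def recalibrate(q, w):
    """Map a probability q predicted on downsampled data back to the original scale.

    w is the negative sampling rate: kept negatives / all negatives, with 0 < w <= 1.
    """
    if not 0 < w <= 1:
        raise ValueError('w must lie in (0, 1]')
    return q / (q + (1.0 - q) / w)


# Hypothetical example: 1,000 clicks, 999,000 non-clicks, and a 1:1 downsample
# that keeps 1,000 of the non-clicks.
w = 1000 / 999000

# A prediction of 0.5 on the balanced sample corresponds to the base click rate.
p = recalibrate(0.5, w)
print(p)
```

In this example the rescaled value comes out at about 0.001, which is the click rate of the hypothetical full data set. With `w = 1` (no downsampling) the prediction is left unchanged.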

### 2.2 Model Training

Let's start by training a model with default parameters.

In [0]:
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import roc_auc_score, auc

In [0]:
cbc = CatBoostClassifier(
    #custom_loss=['Accuracy'],
    random_seed=42,
    logging_level='Silent'
)

In [76]:
cbc.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    #eval_set=(X_val, y_val),
    #logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

### 2.3 Model Application
Let's get some predictions for the validation set and see how the CatBoost model with default hyperparameter settings performs in terms of the AUC score.

In [94]:
predictions = cbc.predict(X_val)
predictions_probs = cbc.predict_proba(X_val)
print('Example predicted probabilities (no click, click): \n {}'.format(predictions_probs[885:895]))
print('Example predicted classes (0=no click, 1=click): \n {}'.format(predictions[885:895]))
print('Example true classes (0=no click, 1=click): \n {}'.format(np.array(y_val[885:895]).T))

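# The score below is computed from the hard class predictions, not from the predicted probabilities.
roc_auc = roc_auc_score(y_val, predictions)  # named roc_auc to avoid shadowing sklearn.metrics.auc
print()
print('ROC AUC: {}'.format(roc_auc))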


Example predicted probabilities (no click, click): 
 [[0.81850567 0.18149433]
 [0.6722476  0.3277524 ]
 [0.82022442 0.17977558]
 [0.15722246 0.84277754]
 [0.27770268 0.72229732]
 [0.63254862 0.36745138]
 [0.17472864 0.82527136]
 [0.84511416 0.15488584]
 [0.66141743 0.33858257]
 [0.61479392 0.38520608]]
Example predicted classes (0=no click, 1=click): 
 [0. 0. 0. 1. 1. 0. 1. 0. 0. 0.]
Example true classes (0=no click, 1=click): 
 [0 0 0 0 0 0 1 0 0 0]

ROC AUC: 0.7015829757331972
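
ROC AUC is a ranking metric, so it is usually computed from the predicted probability of the positive class rather than from thresholded class labels, which collapse the ranking to two levels. The sketch below contrasts the two on made-up scores. In this notebook's terms the probability-based variant would presumably be `roc_auc_score(y_val, predictions_probs[:, 1])`, which is a suggestion and not a result reported here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and click probabilities, for illustration only.
y_true = np.array([0, 0, 0, 1, 1])
click_prob = np.array([0.10, 0.60, 0.70, 0.65, 0.90])

# Hard labels at the 0.5 threshold, analogous to what predict() returns.
hard_labels = (click_prob > 0.5).astype(int)

auc_from_probs = roc_auc_score(y_true, click_prob)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_probs, auc_from_labels)
```

On this toy data the probability-based score is higher, because thresholding throws away the ordering among all examples that fall on the same side of 0.5.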
