# Predicting ad click-through rates

We have all encountered online ads in various formats while browsing the web. Online advertising is a big industry in which companies spend a lot of money to increase their sales.

What matters to us is that it is one of the fields that uses machine learning extensively. Our job here is to target ads well so as to maximise the click-through rate.

The click-through rate (CTR) is the ratio of clicks on an ad to its total number of views. Pretty straightforward. So let's jump in and see how we can predict it.
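
To make the definition concrete, here is a minimal sketch of the ratio. The function name and the numbers are made up for illustration and are not taken from the dataset.

```python
def click_through_rate(clicks: int, views: int) -> float:
    """Return clicks / views, or 0.0 for an ad that was never shown."""
    if views == 0:
        return 0.0
    return clicks / views


# Hypothetical ad: shown 1000 times, clicked 30 times.
ctr = click_through_rate(clicks=30, views=1000)
print("CTR = {0:.1%}".format(ctr))
```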

To study various techniques related to this application, we are going to use data from a Kaggle competition (Avazu). It is a relatively large dataset, so we are going to use only the first 300,000 rows in this example.

In [1]:
import pandas as pd
n_rows = 300000
df = pd.read_csv("train.gz", nrows=n_rows)
df.head(5)

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.000009e+18,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79
1,1.000017e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
2,1.000037e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
3,1.000064e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15706,320,50,1722,0,35,100084,79
4,1.000068e+19,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,...,1,0,18993,320,50,2161,0,35,-1,157


In [3]:
X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'], axis=1).values
Y = df['click'].values

In [4]:
X.shape

(300000, 19)

# Training and Validation sets

Normally, we would split our data into training and validation sets by selecting rows at random. However, our data is stored in chronological order, as the hour column shows, and it would not make sense to predict the past from future data. So we will simply take the first 90% of the data as the training set and the remaining 10% as the validation set (named X_test and Y_test in the code).

In [5]:
n_train = int(n_rows * 0.9)
X_train = X[:n_train]
Y_train = Y[:n_train]
X_test = X[n_train:]
Y_test = Y[n_train:]
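
A positional split like this only respects time if the rows really are sorted. The sketch below shows one way to check that on a small made-up frame (the toy values and the 80/20 fraction are assumptions for illustration); the same check could be run on df['hour'].

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "hour": [14102100, 14102100, 14102101, 14102102, 14102103],
    "click": [0, 1, 0, 0, 1],
})

# True only when the hour column never decreases from one row to the next.
is_sorted = toy["hour"].is_monotonic_increasing

# Same idea as above: the first part trains, the rest validates.
n_toy_train = int(len(toy) * 0.8)
toy_train = toy.iloc[:n_toy_train]
toy_valid = toy.iloc[n_toy_train:]

# Every training row should be at least as old as every validation row.
no_leak = toy_train["hour"].max() <= toy_valid["hour"].min()
print(is_sorted, no_leak)
```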


# One Hot Encoding

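Now we will use one-hot encoding to convert our categorical features, many of which are stored as strings, into numerical one-hot encoded values. Setting handle_unknown='ignore' means that a category seen only in the validation data is encoded as all zeros instead of raising an error.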

In [6]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X_train_enc = enc.fit_transform(X_train)

In [9]:
print(X_train_enc[0])

  (0, 2)	1.0
  (0, 6)	1.0
  (0, 188)	1.0
  (0, 2608)	1.0
  (0, 2679)	1.0
  (0, 3771)	1.0
  (0, 3885)	1.0
  (0, 3929)	1.0
  (0, 4879)	1.0
  (0, 7315)	1.0
  (0, 7319)	1.0
  (0, 7475)	1.0
  (0, 7824)	1.0
  (0, 7828)	1.0
  (0, 7869)	1.0
  (0, 7977)	1.0
  (0, 7982)	1.0
  (0, 8021)	1.0
  (0, 8189)	1.0


In [10]:
X_test_enc = enc.transform(X_test)
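
To see what the encoder does, here is a small sketch on made-up data with two categorical columns. The arrays and the name toy_enc are hypothetical; only OneHotEncoder and its handle_unknown option come from the code above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two made-up categorical columns.
toy_train = np.array([["a", "x"], ["b", "x"], ["a", "y"]], dtype=object)
# "c" never appears in the training data.
toy_test = np.array([["b", "y"], ["c", "x"]], dtype=object)

toy_enc = OneHotEncoder(handle_unknown="ignore")
toy_train_enc = toy_enc.fit_transform(toy_train)  # learns the categories
toy_test_enc = toy_enc.transform(toy_test)        # reuses them, no refit

print(toy_enc.categories_)
print(toy_train_enc.toarray())
# The unseen "c" gets all zeros in the first column's slots.
print(toy_test_enc.toarray())
```

Each original column expands into one output column per category, which is why the 19 input columns above turn into thousands of sparse columns.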

# Finding the best model

It's time to train a model on our dataset. We are using tree-based methods here, starting with a decision tree classifier. Scikit-learn has a very good implementation of this type of model, and we are going to use it. To find the best set of hyperparameters, we are going to use grid search, with ROC AUC as our classification metric because our classes are imbalanced: only about 17% of the samples are clicks.


In [11]:
from sklearn.tree import DecisionTreeClassifier
parameters = {'max_depth': [3, 10, None]}
decision_tree = DecisionTreeClassifier(criterion='gini', min_samples_split=30)

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(decision_tree, parameters, n_jobs=-1, cv=3, scoring='roc_auc')

grid_search.fit(X_train_enc, Y_train)
print(grid_search.best_params_)

{'max_depth': 10}
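
Why ROC AUC rather than accuracy? The sketch below uses made-up labels with the same 17% positive rate to show that a model which never predicts a click still looks decent on accuracy, while ROC AUC exposes it as no better than chance.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 17 clicks out of 100 samples.
y_true = np.array([1] * 17 + [0] * 83)

# A useless "model" that always predicts no click.
always_no_click = np.zeros(len(y_true), dtype=int)

acc = accuracy_score(y_true, always_no_click)
auc = roc_auc_score(y_true, always_no_click)
print('accuracy: {0:.2f}, ROC AUC: {1:.2f}'.format(acc, auc))
```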


After finding the best model, we use it to predict click probabilities on the held-out set with the code below.

In [12]:
decision_tree_best = grid_search.best_estimator_
pos_prob = decision_tree_best.predict_proba(X_test_enc)[:, 1]


In [13]:
from sklearn.metrics import roc_auc_score
print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(Y_test, pos_prob)))

The ROC AUC on testing set is: 0.719


In [14]:
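import numpy as np
# Baseline: mark samples as clicks at random, at the overall click rate
# (51211 / 300000 is about 17%)
pos_prob = np.zeros(len(Y_test))
click_index = np.random.choice(len(Y_test), int(len(Y_test) * 51211.0/300000), replace=False)
pos_prob[click_index] = 1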

In [15]:
print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(Y_test, pos_prob)))


The ROC AUC on testing set is: 0.502
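
One way to read these two numbers: ROC AUC equals the probability that a randomly chosen clicked sample gets a higher score than a randomly chosen non-clicked one, with ties counted as half. The helper below is a small illustrative sketch of that definition (pairwise_auc is a made-up name, and the quadratic loop is only suitable for tiny inputs).

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
print(pairwise_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs correct: 0.75
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # all ties: 0.5
```

Scores that carry no information about the label land near 0.5, which is what the random baseline above shows.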


# Random Forest Classifier

We tried a decision tree classifier and got a ROC AUC of 0.72. That may not look great, but CTR depends on many factors beyond those provided to us, so this is considered a reasonable score. For comparison, we built a random baseline using NumPy, which scored about 0.5 ROC AUC, so our model is clearly doing better than chance.

Enough with the decision tree. We will move on to a random forest classifier and, following the same steps, see what score this model gives us.


In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100, criterion='gini', min_samples_split=30, n_jobs=-1)
grid_search = GridSearchCV(random_forest, parameters, n_jobs=-1, cv=3, scoring='roc_auc')
grid_search.fit(X_train_enc, Y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

random_forest_best = grid_search.best_estimator_
pos_prob = random_forest_best.predict_proba(X_test_enc)[:, 1]
print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(Y_test, pos_prob)))

Using the random forest classifier, we get a score of 0.76, which is a good improvement over the decision tree. Other algorithms such as LightGBM or XGBoost may give even higher scores. They are separate libraries rather than part of scikit-learn, but both provide scikit-learn-compatible estimators and can be used in a similar way.
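
To give a feel for boosted trees without installing anything extra, here is a rough sketch using scikit-learn's own GradientBoostingClassifier as a stand-in for LightGBM or XGBoost. It runs on synthetic data from make_classification, so the dataset, the variable names, and the resulting score are assumptions for illustration and say nothing about the ad data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 17% positives), not the Avazu data.
X_toy, y_toy = make_classification(n_samples=1000, n_features=20,
                                   n_informative=5, weights=[0.83, 0.17],
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=0)

# Trees are added one after another, each correcting the previous ones.
boosted = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                     random_state=0)
boosted.fit(X_tr, y_tr)

toy_auc = roc_auc_score(y_te, boosted.predict_proba(X_te)[:, 1])
print('ROC AUC on the synthetic hold-out set: {0:.3f}'.format(toy_auc))
```

The same fit and predict_proba calls should work with X_train_enc and X_test_enc, though training on the full encoded data may be slow.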


Some notes about the hyperparameters:
1. criterion = 'gini' sets the function used to measure the quality of a split. Scikit-learn supports other criteria too, such as entropy; we are using Gini impurity here.
2. n_jobs = -1 means we use all available CPU cores for the task.
3. n_estimators = 100 sets the number of trees in our random forest.
4. max_depth (the entry in parameters) limits how deep each tree can grow. None means nodes are expanded until the leaves are pure or hold fewer than min_samples_split samples.
5. min_samples_split = 30 is the minimum number of samples required to split an internal node, so it controls when new leaves are made.
6. class_weight sets how much weight each class gets during training, which can help with imbalanced classes.

GridSearchCV parameters:
1. The first is the estimator, that is, our model.
2. The parameter grid lists which values to try for each hyperparameter when searching for the best model. Here I only searched over max_depth, but we could also include min_samples_split and class_weight.
3. n_jobs again.
4. cv = 3 means 3-fold cross-validation.
5. scoring is the metric used to compare the candidates, here roc_auc.
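
As a sketch of the wider search mentioned in point 2, the grid below adds min_samples_split and class_weight next to max_depth. It runs on synthetic data so that it finishes quickly; the candidate values and the names param_grid and search are assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded ad data.
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   weights=[0.83, 0.17], random_state=0)

# Every combination is tried: 3 * 3 * 2 = 18 candidates.
param_grid = {
    'max_depth': [3, 10, None],
    'min_samples_split': [10, 30, 100],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(DecisionTreeClassifier(criterion='gini', random_state=0),
                      param_grid, cv=3, scoring='roc_auc')
search.fit(X_toy, y_toy)

print(search.best_params_)
print('Best mean cross-validated ROC AUC: {0:.3f}'.format(search.best_score_))
```

Keep in mind that the number of fits grows multiplicatively with each added parameter, so a grid like this costs six times as much as the max_depth-only search.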



So, that was all about using tree-based methods for ad click prediction. Remember that there are other models, such as LightGBM and XGBoost, that can give better results and are widely used for exactly that reason. But let us say you know all the algorithms and always use the best ones. What then?

The main thing we always have at our disposal is our own reasoning and how we use it to generate and select features for our model. It is quite possible for a simple algorithm like logistic regression to outperform an ensemble algorithm like XGBoost when it is given better features. So let's keep that in mind and spend more time understanding the data than choosing the best algorithm.
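
As one small example of generating features, the hour column we dropped earlier could be turned into an hour of day and a day of week. This sketch assumes the column is encoded as YYMMDDHH, which matches values like 14102100 in the sample rows; the toy frame and the new column names are made up for illustration.

```python
import pandas as pd

# Made-up values in the same style as the hour column.
toy = pd.DataFrame({"hour": [14102100, 14102113, 14102523]})

# Assumption: the integer is YYMMDDHH, so parse it as a timestamp.
timestamp = pd.to_datetime(toy["hour"].astype(str), format="%y%m%d%H")

toy["hour_of_day"] = timestamp.dt.hour       # 0 to 23
toy["day_of_week"] = timestamp.dt.dayofweek  # Monday=0 ... Sunday=6
print(toy)
```

Features like these could then be one-hot encoded along with the other columns, and whether they help is something to check on the validation set.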