# Click-Prediction | Final Project MSDS 699 - Machine Learning Laboratory

Author : _Emre Okcular_

Date : 03.12.2021

## Table of Contents
* Introduction
* Dataset
* Preprocessing and Exploratory Data Analysis
* Training Models
* Summary 

### Introduction

In this project, the predictions of whether or not people clicked on these ads will be made with machine learning techniques. The goal is to find more accurate predictions. This information will be used in optimizing the revenue from ads and ad auction systems.

### Dataset

Author: Tencent Inc.
Source: [KDD Cup](https://www.kddcup2012.org/) - 2012
Please cite:

This data is derived from the 2012 KDD Cup. The data is subsampled to 1% of the original number of instances, downsampling the majority class (click=0) so that the target feature is reasonably balanced (5 to 1).

The data is about advertisements shown alongside search results in a search engine, and whether or not people clicked on these ads.
The task is to build the best possible model to predict whether a user will click on a given ad.

A search session contains information on user id, the query issued by the user, ads displayed to the user, and target feature indicating whether a user clicked at least one of the ads in this session. The number of ads displayed to a user in a session is called ‘depth’. The order of an ad in the displayed list is called ‘position’. An ad is displayed as a short text called ‘title’, followed by a slightly longer text called ’description’, and a URL called ‘display URL’.
To construct this dataset each session was split into multiple instances. Each instance describes an ad displayed under a certain setting (‘depth’, ‘position’). Instances with the same user id, ad id, query, and setting are merged. Each ad and each user have some additional properties located in separate data files that can be looked up using ids in the instances.

The dataset has the following features:
* Click – binary variable indicating whether a user clicked on at least one ad.
* Impression - the number of search sessions in which AdID was impressed by UserID who issued Query.
* Url_hash - URL is hashed for anonymity
* AdID
* AdvertiserID - some advertisers consistently optimize their ads, so the title and description of their ads are more attractive than those of others’ ads.
* Depth - number of ads displayed to a user in a session
* Position - order of an ad in the displayed list
* QueryID - is the key of the data file 'queryid_tokensid.txt'. 
* KeywordID - is the key of 'purchasedkeyword_tokensid.txt' 
* TitleID - is the key of 'titleid_tokensid.txt'
* DescriptionID - is the key of 'descriptionid_tokensid.txt' 
* UserID – is also the key of 'userid_profile.txt' . 0 is a special value denoting that the user could be identified.

In [1]:
!wget https://raw.githubusercontent.com/emreokcular/click-prediction/main/requirements.txt
!pip install -r requirements.txt

In [1]:
import numpy as np
import pandas as pd

from scipy.io.arff import loadarff # For loading .arff data file.

from sklearn.compose            import *
from sklearn.ensemble           import RandomForestClassifier, ExtraTreesClassifier
from sklearn.experimental       import enable_iterative_imputer
from sklearn.impute             import *
from sklearn.linear_model       import LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier, SGDClassifier
from sklearn.metrics            import accuracy_score, balanced_accuracy_score, f1_score, precision_score, average_precision_score , make_scorer, recall_score
from sklearn.pipeline           import Pipeline
from sklearn.preprocessing      import *
from sklearn.tree               import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.model_selection    import train_test_split
from sklearn.preprocessing      import LabelEncoder
from sklearn.model_selection    import RandomizedSearchCV
from sklearn.model_selection    import cross_validate, KFold
from sklearn.base               import BaseEstimator

from pandas_profiling import ProfileReport # For EDA

from sklearn import set_config # For visualizing pipelines
set_config(display='diagram')

import warnings
warnings.filterwarnings("ignore")

#### Helper Functions and Classes:

In [2]:
def print_scores(y_test,y_pred):
    """
    Prints Accuracy and F1 Scores
    """
    score = accuracy_score(y_test, y_pred)
    print(f"The Accuracy score for is: %.{3}f" % score)
    score = f1_score(y_pred, y_test, average="weighted")
    print(f"The F1 score for is: %.{3}f" % score)

    
class DummyEstimator(BaseEstimator):
    def fit(self): pass
    def score(self): pass

## Preprocessing and Exploratory Data Analysis

In [3]:
raw_data_arff = loadarff('phpfGCaQC.arff')
df = pd.DataFrame(raw_data_arff[0])

In [4]:
df["click"] = df['click'].str.decode("utf-8").astype(int)
df["ad_id"] = df["ad_id"].astype(int)
df["advertiser_id"] = df["advertiser_id"].astype(int)
df["depth"] = df["depth"].astype(int)
df["position"] = df["position"].astype(int)
df["query_id"] = df["query_id"].astype(int)
df["keyword_id"] = df["keyword_id"].astype(int)
df["title_id"] = df["title_id"].astype(int)
df["description_id"] = df["description_id"].astype(int)
df["user_id"] = df["user_id"].astype(int)
df["impression"] = df["impression"].astype(int)
df["url_hash"] = df["url_hash"].astype(float)

In [5]:
df.head()

Unnamed: 0,click,impression,url_hash,ad_id,advertiser_id,depth,position,query_id,keyword_id,title_id,description_id,user_id
0,0,1,1.071003e+19,8343295,11700,3,3,7702266,21264,27892,1559,0
1,1,1,1.736385e+19,20017077,23798,1,1,93079,35498,4,36476,562934
2,0,1,8.915473e+18,21348354,36654,1,1,10981,19975,36105,33292,11621116
3,0,1,4.426693e+18,20366086,33280,3,3,0,5942,4057,4390,8778348
4,0,1,1.15726e+19,6803526,10790,2,1,9881978,60593,25242,1679,12118311


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39948 entries, 0 to 39947
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   click           39948 non-null  int64  
 1   impression      39948 non-null  int64  
 2   url_hash        39948 non-null  float64
 3   ad_id           39948 non-null  int64  
 4   advertiser_id   39948 non-null  int64  
 5   depth           39948 non-null  int64  
 6   position        39948 non-null  int64  
 7   query_id        39948 non-null  int64  
 8   keyword_id      39948 non-null  int64  
 9   title_id        39948 non-null  int64  
 10  description_id  39948 non-null  int64  
 11  user_id         39948 non-null  int64  
dtypes: float64(1), int64(11)
memory usage: 3.7 MB


In [34]:
profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_widgets()
profile.to_file("Pandas Profiling Report.html")

HBox(children=(FloatProgress(value=0.0, description='Summarize dataset', max=27.0, style=ProgressStyle(descrip…




HBox(children=(FloatProgress(value=0.0, description='Generate report structure', max=1.0, style=ProgressStyle(…




HBox(children=(FloatProgress(value=0.0, description='Render widgets', max=1.0, style=ProgressStyle(description…

VBox(children=(Tab(children=(Tab(children=(GridBox(children=(VBox(children=(GridspecLayout(children=(HTML(valu…

HBox(children=(FloatProgress(value=0.0, description='Render HTML', max=1.0, style=ProgressStyle(description_wi…




HBox(children=(FloatProgress(value=0.0, description='Export report to file', max=1.0, style=ProgressStyle(desc…




In the pandas profiling warning report it says that there are 22 duplicate rows in dataset. These rows are dropped.

In [7]:
df = df.drop_duplicates()

In [8]:
df["url_hash"].nunique()

6941

We have one categorical variable which has 6941 unique values. Type of this variable is float for numeric calculations in models

In [9]:
df.isnull().sum()

click             0
impression        0
url_hash          0
ad_id             0
advertiser_id     0
depth             0
position          0
query_id          0
keyword_id        0
title_id          0
description_id    0
user_id           0
dtype: int64

We have no null values but null values encoded as 0 in the dataset so we have zeros to impute.

In [10]:
np.sum(df==0)

click             33201
impression            0
url_hash              0
ad_id                 0
advertiser_id         0
depth                 0
position              0
query_id            570
keyword_id          569
title_id            355
description_id      355
user_id            9627
dtype: int64

In [11]:
print("Percentage of ones in total :",round(len(df[df["click"]==1])/len(df)*100,2))

Percentage of ones in total : 16.84


In [12]:
y = df["click"]
X = df.iloc[:,1:]

Train test split is applied with 20:80 ratio

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, shuffle=True , random_state=99)

## Training Models

Firstly, LogisticRegression, DecisionTreeClassifier and RandomForestClassifier models are used for algorithm search with default parameters.

In [14]:
# Create a pipeline
pipe_dummy = Pipeline([('clf', DummyEstimator())]) # Placeholder Estimator

In [15]:
# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'clf': [LogisticRegression()]},
                {'clf': [DecisionTreeClassifier()]},
                {'clf': [RandomForestClassifier()]}]


In [16]:
# Random search for default models
clf_algos_rand = RandomizedSearchCV(estimator=pipe_dummy, 
                                    param_distributions=search_space, 
                                    n_iter=25,
                                    cv=5, 
                                    n_jobs=-1,
                                    verbose=1)

In [17]:
#  Fit  search
best_model = clf_algos_rand.fit(X_train, y_train);

# View best model
best_model.best_estimator_.get_params()['clf']

Fitting 5 folds for each of 3 candidates, totalling 15 fits


In [18]:
y_pred = best_model.predict(X_test)
print_scores(y_test,y_pred)

The Accuracy score for is: 0.833
The F1 score for is: 0.881


Since the default parameter algorithm search gives Logistic Regression and Decision Tree classifiers as the best estimator randomly, we will examine these two algorithm closely by applyting hyperparameter grid search.

#### Logistic Regression Classifier with Grid Search

First, we will try Logistic Regression with grid search.

In [19]:
con_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 ,strategy='mean', add_indicator=True)), 
                        ('scaler', MinMaxScaler())])

cat_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 , strategy='most_frequent')),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessing = ColumnTransformer([('categorical', cat_pipe,  (X_train.dtypes == object)), 
                                    ('continuous',  con_pipe, ~(X_train.dtypes == object))])

pipe_lm = Pipeline([('preprocessing', preprocessing), 
                    ('lm', LogisticRegression())])

pipe_lm.fit(X_train, y_train)

We will continue to increase this score with hyperparameter search and CV techniques.

In [20]:
hyperparameters = {"n_jobs": -1, "class_weight":"balanced_subsample", "max_depth":2, "max_leaf_nodes":10 }

In [21]:
param_grid_lr = {'lm__penalty': ["l1", "l2"], # Type of regularization, for penalizing outlier features
              'lm__C': [0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9, 1], # The amount of regularization
              'lm__solver': ["newton-cg", "lbfgs", "liblinear", "sag"] # Solver for finding the parameters
             }

In [22]:
clf_rand = RandomizedSearchCV(estimator = pipe_lm,
                                   param_distributions=param_grid_lr,
                                   n_iter = 50,
                                   cv=5,
                                   n_jobs=-1)
clf_rand.fit(X_train, y_train) 

In [23]:
list(clf_rand.best_estimator_.get_params().items())[-15:-1]

[('lm__C', 0.1),
 ('lm__class_weight', None),
 ('lm__dual', False),
 ('lm__fit_intercept', True),
 ('lm__intercept_scaling', 1),
 ('lm__l1_ratio', None),
 ('lm__max_iter', 100),
 ('lm__multi_class', 'auto'),
 ('lm__n_jobs', None),
 ('lm__penalty', 'l1'),
 ('lm__random_state', None),
 ('lm__solver', 'liblinear'),
 ('lm__tol', 0.0001),
 ('lm__verbose', 0)]

In [24]:
y_pred = clf_rand.predict(X_test)
print_scores(y_test,y_pred)

The Accuracy score for is: 0.832
The F1 score for is: 0.908


#### Random Forest Classifier with Grid Search

In [25]:
con_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 ,strategy='mean', add_indicator=True)), 
                        ('scaler', MinMaxScaler())])

cat_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 , strategy='most_frequent')),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessing = ColumnTransformer([('categorical', cat_pipe,  (X_train.dtypes == object)), 
                                    ('continuous',  con_pipe, ~(X_train.dtypes == object))])

pipe_rf = Pipeline([('preprocessing', preprocessing), 
                    ('rf', RandomForestClassifier())])

pipe_rf.fit(X_train, y_train)

In [26]:
param_grid_rf = {'rf__max_features': ["sqrt", "log2"], # Define the number of features to be used
              'rf__min_samples_leaf': [3, 4, 5, 6, 7, 8, 9, 10], # Number of samples in leaf
              'rf__max_depth': [None, 1, 2, 3, 5, 7, 8, 9, 10], # Tree depth
              'rf__n_estimators': [100, 120, 140, 160, 180] # Number of sub trees
             }

In [27]:
clf_rand = RandomizedSearchCV(estimator = pipe_rf,
                                   param_distributions=param_grid_rf,
                                   n_iter = 50,
                                   cv=5,
                                   n_jobs=-1)
best_model = clf_rand.fit(X_train, y_train) 
best_model

In [28]:
best_model.best_estimator_.get_params()['rf']

In [29]:
y_pred = clf_rand.predict(X_test)
print_scores(y_test,y_pred)

The Accuracy score for is: 0.839
The F1 score for is: 0.898


#### Extra Tree Classifier with Grid Search

In [30]:
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

In [31]:
con_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 ,strategy='median', add_indicator=True)), 
                        ('scaler', MinMaxScaler())])

cat_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 , strategy='most_frequent')),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessing = ColumnTransformer([('categorical', cat_pipe,  (X_train.dtypes == object)), 
                                    ('continuous',  con_pipe, ~(X_train.dtypes == object))])

pipe_xt = Pipeline([('preprocessing', preprocessing), 
                    ('xt', ExtraTreesClassifier())])

pipe_xt.fit(X_train, y_train)

In [32]:
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_micro': 'recall_macro',
           "f1":"f1_macro"}
scores = cross_validate(pipe_xt, 
                          X_train,
                          y_train, 
                          cv=kfold,
                          scoring=scoring)

In [33]:
sc_acc = scores["test_acc"].mean()
print(f"The mean training test accuracy - {sc_acc:.3f}")
sc_f = scores["test_f1"].mean()
print(f"The mean training test F1 Score - {sc_f:.3f}")

The mean training test accuracy - 0.825
The mean training test F1 Score - 0.536


## Summary 

To wrap up, Decision Tree model with random grid search has the highest accuracy score. We wanted to deterimne how accurate are our predictions by using accuracy metric and also f1 score is used which is a combined mesaure of precision and recall. We increased our scores slightly. For future work boosting and deep learning methods could be used for enhancement with more cpu power.

**Models**|**Hyperparameters**|**accuracy\_score**|**f1\_score**
:-----:|:-----:|:-----:|:-----:
RandomForestClassifier|sklearn default|0.833|0.881
LogisticRegression with Randomized Grid Search|See notebook|0.832|0.908
RandomForestClassifier with Randomized Grid Search|See notebook|0.839|0.898
ExtraTreesClassifier with 5 fold CV|sklearn default|0.825|0.536

---