# Exploratory Data Analysis for Click-Through Rate Prediction

The dataset was provided by the mobile advertising company Avazu as part of a [Kaggle competition](https://www.kaggle.com/competitions/avazu-ctr-prediction/overview).

*Dataset Citation: Steve Wang, Will Cukierski. (2014). Click-Through Rate Prediction. Kaggle. https://kaggle.com/competitions/avazu-ctr-prediction*


# Modeling

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
contentRoot = '/content/drive/MyDrive/Github/machinelearning/click-through-prediction'

# Read data

In [None]:
# # Create a list to store the dataframes
# dataframes = []

# # Read each file in chunks
# for i in range(0,17):
#   file_path = '{}/data/prepped_data_chunk_{}.csv.gz'.format(contentRoot, i)
#   print('Processing ' + file_path)
#   for chunk in pd.read_csv(file_path, chunksize=100000):
#     # Apply the transformation
#     chunk = chunk[X_columns]
#     chunk = transformer.fit_transform(chunk)
#     # Add the transformed chunk to the list of dataframes
#     dataframes.append(chunk)
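
The commented-out loop above calls `fit_transform` on every chunk, which would refit the encoder each time and could give each chunk a different set of one-hot columns. A minimal sketch of an alternative, assuming the raw chunks fit in memory once concatenated: collect the selected columns per chunk and fit the transformer once afterwards. The in-memory CSV and the `columns` subset are hypothetical stand-ins for the real chunk files and `X_columns`.

```python
import io
import pandas as pd

# Hypothetical stand-in for one prepped_data_chunk_*.csv.gz file.
rows = [f"{i % 24},{i % 2},{int(i % 5 == 0)}" for i in range(10)]
csv_text = "hour,banner_pos,click\n" + "\n".join(rows)

columns = ["hour", "banner_pos"]  # illustrative subset of the feature columns

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Only select columns here; no encoder is fitted per chunk.
    chunks.append(chunk[columns + ["click"]])

# One combined frame, so an encoder can later be fitted exactly once.
full_df = pd.concat(chunks, ignore_index=True)
print(full_df.shape)
```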

In [None]:
df = pd.read_csv(f"{contentRoot}/data/sample_10_prepped_data.csv.gz")

In [None]:
print('Data: {}'.format(str(df.shape)))

Data: (4042897, 26)


## Define test and training datasets

Now that the column preprocessing is complete, we may define our test and training data.

In [None]:
X_columns = ['hour', 'day_of_week', 'C1', 'banner_pos', 'site_id', 'site_domain',
             'site_category', 'app_id', 'app_domain', 'app_category', 'device_model',
             'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18',
             'C19', 'C20', 'C21']

In [None]:
# One-hot encode every feature column; all of them are treated as categorical.
transformer = make_column_transformer(
    (OneHotEncoder(), X_columns),
    remainder='passthrough')

In [None]:
X = transformer.fit_transform(df[X_columns])
y = df['click']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)
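
The cells above fit the transformer on the full dataset and split afterwards, so the encoder sees the categories present in the test rows. A possible variation, not what this notebook does, is to split first and fit the encoder on the training rows only. The sketch below uses a small made-up frame and assumes `handle_unknown="ignore"` is acceptable for categories that appear only in the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only; values are made up.
toy = pd.DataFrame({
    "site_category": ["a", "b", "a", "c", "b", "a", "d", "a", "b", "c"],
    "banner_pos":    [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "click":         [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
})
feature_cols = ["site_category", "banner_pos"]

# Split the raw rows first.
raw_train, raw_test, y_tr, y_te = train_test_split(
    toy[feature_cols], toy["click"], test_size=0.3, random_state=42
)

# Fit the encoder on training rows only; unseen test categories encode as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
X_tr = encoder.fit_transform(raw_train)
X_te = encoder.transform(raw_test)
print(X_tr.shape, X_te.shape)
```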

## Baseline Model

In [None]:
dummy_estimator = DummyClassifier().fit(X_train, y_train)
dummy_train_score = dummy_estimator.score(X_train, y_train)
dummy_test_score = dummy_estimator.score(X_test, y_test)
print('Train Score with Dummy Classifier is {}'.format(dummy_train_score))
print('Test Score with Dummy Classifier is {}'.format(dummy_test_score))

Train Score with Dummy Classifier is 0.8300775928992904
Test Score with Dummy Classifier is 0.8303668158994781
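
With its default settings, `DummyClassifier` always predicts the most frequent class, so its accuracy equals the share of that class. The scores above therefore suggest that roughly 83% of the sampled rows belong to one class (presumably non-clicks). The sketch below checks this relationship on synthetic labels; the 17% positive rate is an assumption chosen to resemble the scores above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
y_toy = (rng.random(1000) < 0.17).astype(int)  # synthetic labels, about 17% positives
X_toy = np.zeros((1000, 1))                    # features are ignored by DummyClassifier

dummy = DummyClassifier()  # default strategy predicts the most frequent class
dummy.fit(X_toy, y_toy)

majority_share = max(y_toy.mean(), 1 - y_toy.mean())
print(dummy.score(X_toy, y_toy), majority_share)
```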


## Grid Search for the best model

In [None]:
scores_dict = {'Model':[],
                 'Best Estimator': [],
                 'Train Time':[],
                 'Train Accuracy':[],
                 'Test Accuracy':[]}


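def train_model(model_name, model, param_grid):
  """Grid-search `model` over `param_grid` with 3-fold CV and record the results in scores_dict."""
  print(f"Training {model}")
  gscv = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
  gscv.fit(X_train, y_train)

  # best_estimator_ is refit on the full training set once the search finishes
  scores_dict['Model'].append(model_name)
  scores_dict['Best Estimator'].append(gscv.best_estimator_)
  scores_dict['Train Accuracy'].append(gscv.best_estimator_.score(X_train, y_train))
  scores_dict['Test Accuracy'].append(gscv.best_estimator_.score(X_test, y_test))
  # Average time of a single CV fit (seconds), averaged over all parameter candidates
  scores_dict['Train Time'].append(gscv.cv_results_['mean_fit_time'].mean())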
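
The `Train Time` recorded by `train_model` is the mean of `mean_fit_time` over all parameter candidates, so it describes an average single CV fit rather than the total search time. As a rough sketch on synthetic data (not the Avazu sample), the per-candidate timings and scores can be inspected directly from `cv_results_`, and `refit_time_` gives the time spent refitting the best estimator.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8]}, cv=3)
search.fit(X_demo, y_demo)

# One row per parameter candidate.
per_candidate = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_fit_time", "mean_test_score", "rank_test_score"]
]
print(per_candidate)
print("best params:", search.best_params_)
print("refit time (s):", search.refit_time_)
```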

## Support Vector Machine

In [None]:
svc = SVC()
# gamma must be 'scale', 'auto', or a non-negative float; SVC rejects negative values.
train_model('Support Vector Machine', svc, {'kernel':['linear', 'rbf'], 'gamma': ['scale', 0.01, 0.1]})

Training SVC()


In [None]:
print(scores_dict)

## Decision Tree Classifier

In [None]:
dtree = DecisionTreeClassifier()
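# Note: max_depth must be None or an integer >= 1, so the max_depth=0 candidate
# fails parameter validation in each of its 3 CV folds (see the warning below).
train_model('Decision Tree', dtree, {'max_depth':[0,4,10,100]})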

Training DecisionTreeClassifier()


3 fits failed out of a total of 12.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py", line 889, in fit
    super().fit(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py", line 177, in fit
    self._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/util
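
The warning reports 3 failed fits out of 12, which matches one of the four `max_depth` candidates failing in all 3 folds. `DecisionTreeClassifier` requires `max_depth` to be `None` or an integer of at least 1, so `max_depth=0` is the likely cause. The sketch below, on synthetic data, shows how `error_score="raise"` surfaces such a failure immediately, followed by a grid that uses `None` (unlimited depth) instead of 0; whether `None` matches the original intent is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# error_score="raise" stops at the first failing fit instead of scoring it as NaN.
try:
    GridSearchCV(DecisionTreeClassifier(), {"max_depth": [0, 4]}, cv=3,
                 error_score="raise").fit(X_demo, y_demo)
    failed = False
except Exception as exc:
    failed = True
    print(type(exc).__name__)

# A grid containing only valid values; None means the tree depth is not limited.
valid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                            {"max_depth": [None, 4, 10, 100]}, cv=3)
valid_search.fit(X_demo, y_demo)
print(valid_search.best_params_)
```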

In [None]:
print(scores_dict)

{'Model': ['Decision Tree'], 'Best Estimator': [DecisionTreeClassifier(max_depth=10)], 'Train Time': [360.3462048570315], 'Train Accuracy': [0.8340323254866473], 'Test Accuracy': [0.8341627709482468]}


## Logistic Regression

In [None]:
logistic_regression = LogisticRegression()
train_model('Logistic Regression', logistic_regression, {'max_iter':[10000]})

Training LogisticRegression()


In [None]:
print(scores_dict)

{'Model': ['Logistic Regression'], 'Best Estimator': [LogisticRegression(max_iter=10000)], 'Train Time': [288.8871890703837], 'Train Accuracy': [0.8350623509952378], 'Test Accuracy': [0.8351026903130591]}


## K Nearest Neighbors

In [None]:
knn = KNeighborsClassifier()
train_model('K Nearest Neighbors', knn, {'n_neighbors':[3,9,15]})

Training KNeighborsClassifier()


In [None]:
print(scores_dict)

## Summary of results after Grid Search of various models

Because one-hot encoding makes the data very high-dimensional, and the training split holds roughly 2.8 million rows, SVM and KNN could not complete execution in a reasonable time (~xx hours). The results for LogisticRegression and DecisionTree are shown below.

LogisticRegression was fit with `max_iter=10000`, the only value in its grid. For DecisionTree, the grid search selected `max_depth=10` from the candidates tried.
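
Since the kernel SVM did not finish, one option that might scale better is a linear SVM trained with stochastic gradient descent. `SGDClassifier` is already imported at the top of this notebook but was not evaluated here, so the following is only a sketch on a synthetic sparse matrix; the matrix size, density, and label rule are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# Synthetic sparse 0/1 matrix standing in for one-hot encoded features.
n_rows, n_cols = 2000, 500
X_sparse = sparse.random(n_rows, n_cols, density=0.02, format="csr", random_state=0)
X_sparse.data[:] = 1.0

# Made-up label rule: positive if any of the first 50 columns is active.
y_toy = (np.asarray(X_sparse[:, :50].sum(axis=1)).ravel() > 0).astype(int)

# loss="hinge" gives a linear SVM objective optimized with SGD.
linear_svm = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)
linear_svm.fit(X_sparse, y_toy)
print(linear_svm.score(X_sparse, y_toy))
```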

In [None]:
output_dict = {'Model': ['Decision Tree', 'Logistic Regression'],
               'Best Estimator': [DecisionTreeClassifier(max_depth=10), LogisticRegression(max_iter=10000)],
               'Train Time': [360.3462048570315, 288.8871890703837],
               'Train Accuracy': [0.8340323254866473, 0.8350623509952378],
               'Test Accuracy': [0.8341627709482468, 0.8351026903130591]}

results_df = pd.DataFrame(output_dict)
results_df
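
To put these accuracies in context, the short calculation below compares each model's test accuracy with the dummy baseline's test score recorded earlier in this notebook. All numbers are copied from the outputs above; only the variable names are new.

```python
# Test score of the DummyClassifier baseline, copied from the output above.
baseline_test = 0.8303668158994781

# Test accuracies copied from the grid search results above.
test_accuracy = {
    "Decision Tree": 0.8341627709482468,
    "Logistic Regression": 0.8351026903130591,
}

# Improvement over the baseline, in percentage points.
lift_pp = {name: (acc - baseline_test) * 100 for name, acc in test_accuracy.items()}
for name, lift in lift_pp.items():
    print(f"{name}: +{lift:.2f} percentage points over baseline")
```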

## Findings

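LogisticRegression and DecisionTreeClassifier improved test accuracy over the dummy baseline (0.8304) by about 0.5 and 0.4 percentage points, reaching 0.8351 and 0.8342 respectively. While this is usable, the gain is small, and exploration needs to continue to identify other models, such as ensemble techniques, that might work better.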
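
As one example of the ensemble direction mentioned above, a random forest could be tried next. It was not evaluated in this notebook, so the sketch below runs on synthetic, imbalanced data only, and the hyperparameters (`n_estimators=50`, `max_depth=10`) are illustrative assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with roughly an 83/17 class split, for illustration only.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     weights=[0.83, 0.17], random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=42, stratify=y_demo)

# An ensemble of depth-limited trees; hyperparameters are not tuned.
forest = RandomForestClassifier(n_estimators=50, max_depth=10,
                                n_jobs=-1, random_state=0)
forest.fit(X_a, y_a)
forest_test_score = forest.score(X_b, y_b)
print(forest_test_score)
```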

In [None]:
# dtree_output = {'Model': ['Decision Tree'], 'Best Estimator': [DecisionTreeClassifier(max_depth=10)], 'Train Time': [360.3462048570315], 'Train Accuracy': [0.8340323254866473], 'Test Accuracy': [0.8341627709482468]}

# logreg_output = {'Model': ['Logistic Regression'], 'Best Estimator': [LogisticRegression(max_iter=10000)], 'Train Time': [288.8871890703837], 'Train Accuracy': [0.8350623509952378], 'Test Accuracy': [0.8351026903130591]}