# Final Project - Criteo Labs Display Advertising Challenge
__`MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Spring 2019`__

__`Team: Chi Iong Ansjory, Catherine Cao, Scott Xu`__

Table of Contents:
- [0. Background](#background)
- [1. Question Formulation](#question_formulation)
- [2. Algorithm Explanation](#algorithm_explanation)
- [3. EDA & Discussion of Challenges](#eda_challenges)
- [4. Algorithm Implementation](#algorithm_implementation)
- [5. Application of Course Concepts](#course_concepts_application)

<a id='background'></a>
# 0. Background

Criteo Labs is a leading global technology company that specializes in performance display advertising, working with over 4,000 e-commerce companies around the world. Its technology takes an algorithmic approach to determining which users see an advertisement, when, and for which products, across billions of unique advertisements that are created and displayed at high speed every day.

Display advertising is a billion-dollar business and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept confidential. Through the Kaggle research competition, Criteo Labs is sharing a week’s worth of data for participants to develop models that predict advertisement click-through rate (CTR). The core question: given a user and the page being visited, what is the probability that the user will click on a given advertisement?

Source: https://www.kaggle.com/c/criteo-display-ad-challenge

The smaller version of the dataset is no longer available from Kaggle, so the full-size version from Criteo Labs must be used instead.

Source: https://www.kaggle.com/c/criteo-display-ad-challenge/data (smaller version, no longer available); http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/ (full-size version)

### Notebook Set-Up

In [1]:
# imports
import re
import ast
import time
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx
import matplotlib.pyplot as plt

import os

In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

<a id='question_formulation'></a>
# 1. Question Formulation

The goal of this analysis is to benchmark ML algorithms for CTR estimation and identify which one predicts click probability most accurately.
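
Accuracy of CTR estimates is commonly measured with logarithmic loss, the metric the Kaggle competition used for scoring; it penalizes confident but wrong probabilities heavily. The sketch below is a minimal plain-Python version. The function name `log_loss`, the clipping constant `eps`, and the sample values are illustrative assumptions, not part of the original notebook.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted click probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# hypothetical labels (1 = click) and predicted click probabilities
labels = [0, 0, 1, 0]
probs = [0.1, 0.2, 0.7, 0.4]
print(log_loss(labels, probs))
```

Lower values are better; a model that always predicts 0.5 scores ln(2), which gives a convenient reference point when comparing algorithms.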

<a id='algorithm_explanation'></a>
# 2. Algorithm Explanation

<a id='eda_challenges'></a>
# 3. EDA & Discussion of Challenges

The main challenge is that the dataset has no column labels, so we cannot apply our existing knowledge of how online ads are served and how CTR is computed to interpret the data. We therefore have to put extra effort into exploratory analysis to understand the relationships between features and process them appropriately.
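
With unlabeled columns, a reasonable first pass is to profile each column position by its missing rate and number of distinct values. The following is a minimal sketch, assuming comma-separated records like those in the toy file; the helper `profile_columns` and the shortened sample rows are hypothetical.

```python
from collections import defaultdict

def profile_columns(lines, sep=","):
    """Return {column position: (missing_rate, n_distinct_non_missing)}."""
    missing = defaultdict(int)
    distinct = defaultdict(set)
    n_rows = 0
    for line in lines:
        n_rows += 1
        for pos, value in enumerate(line.rstrip("\n").split(sep)):
            if value == "":
                missing[pos] += 1          # empty field = missing value
            else:
                distinct[pos].add(value)   # track the distinct observed values
    positions = sorted(set(missing) | set(distinct))
    return {pos: (missing[pos] / n_rows, len(distinct[pos])) for pos in positions}

# hypothetical, shortened records: label, two numeric fields, one hashed categorical
sample = ["0,1,,68fd1e64", "0,2,5,68fd1e64", "1,,7,287e684f"]
for pos, (miss, levels) in profile_columns(sample).items():
    print(pos, round(miss, 2), levels)
```

Columns with many distinct hashed values are candidates for special handling, while columns with high missing rates call for an imputation strategy.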

In [None]:
# Load data. Keep this relative path access to data
dir_name = os.getcwd()
#train_filename = os.path.join(dir_name, 'data/train.txt')
#test_filename = os.path.join(dir_name, 'data/test.txt')
toy_filename = os.path.join(dir_name, 'data/toy.txt')

# Reading the data
#train = pd.read_csv(train_filename, sep="\t", header=None)
#test = pd.read_csv(test_filename, sep="\t", header=None)
toy = pd.read_csv(toy_filename, sep=",", header=None)  # toy.txt is comma-separated, as the raw lines printed later show

#print("Original shapes of train and test datasets:")
#train.shape, test.shape
toy.shape

<a id='algorithm_implementation'></a>
# 4. Algorithm Implementation

In [4]:
# start Spark Session
from pyspark.sql import SparkSession
app_name = "fpj_notebook"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext

from pyspark.sql.types import *
import pyspark.sql.functions as F

In [31]:
#from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
#from pyspark.mllib.util import MLUtils

In [9]:
# load the raw data into an RDD
toyRDD = sc.textFile('data/toy.txt')

In [10]:
# take a look at raw data
toyRDD.take(5)

['0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16',
 '0,2,0,44,1,102,8,2,2,4,1,1,,4,68fd1e64,f0cf0024,6f67f7e5,41274cd7,25c83c98,fe6b92e5,922afcc0,0b153874,a73ee510,2b53e5fb,4f1b46f3,623049e6,d7020589,b28479f6,e6c5b5cd,c92f3b61,07c540c4,b04e4670,21ddcdc9,5840adea,60f6221e,,3a171ecb,43f13e8b,e8b83407,731c3655',
 '0,2,0,1,14,767,89,4,2,245,1,3,3,45,287e684f,0a519c5c,02cf9876,c18be181,25c83c98,7e0ccccf,c78204a1,0b153874,a73ee510,3b08e48b,5f5e6091,8fe001f4,aa655a2f,07d13a8f,6dc710ed,36103458,8efede7f,3412118d,,,e587c466,ad3062eb,3a171ecb,3b183c5c,,',
 '0,,893,,,4392,,0,0,0,,0,,,68fd1e64,2c16a946,a9a87e68,2e17d6f6,25c83c98,fe6b92e5,2e8a689b,0b153874,a73ee510,efea433b,e51ddf94,a30567ca,3516f6e6,07d13a8f,18231224,52b8680f,1e88c74f,74ef3502,,,6b3a5ca6,,3a171ecb,9117a34a,,',
 '0,3,-1,,0,

In [11]:
# split variables
temp_var = toyRDD.map(lambda x: x.split(","))

In [12]:
# creating header for data
numeric_features = ['I'+str(i) for i in range(1, 14)]
categorical_features = ['C'+str(i) for i in range(1, 27)]
header = ['target'] + numeric_features + categorical_features

In [13]:
# create pyspark dataframe
toy_df = temp_var.toDF(header)
toy_df.show()

+------+---+---+---+---+-----+---+---+---+----+---+---+---+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|target| I1| I2| I3| I4|   I5| I6| I7| I8|  I9|I10|I11|I12|I13|      C1|      C2|      C3|      C4|      C5|      C6|      C7|      C8|      C9|     C10|     C11|     C12|     C13|     C14|     C15|     C16|     C17|     C18|     C19|     C20|     C21|     C22|     C23|     C24|     C25|     C26|
+------+---+---+---+---+-----+---+---+---+----+---+---+---+---+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|     0|  1|  1|  5|  0| 1382|  4| 15|  2| 181|  1|  2|   |  2|68fd1e64|80e26c9b|fb936136|7b4723c4|25c83c9

In [15]:
# cast integer variables to integer type
for var in ['target'] + numeric_features:
    toy_df = toy_df.withColumn(var, toy_df[var].cast(IntegerType()))

In [16]:
# check schema
toy_df.printSchema()

root
 |-- target: integer (nullable = true)
 |-- I1: integer (nullable = true)
 |-- I2: integer (nullable = true)
 |-- I3: integer (nullable = true)
 |-- I4: integer (nullable = true)
 |-- I5: integer (nullable = true)
 |-- I6: integer (nullable = true)
 |-- I7: integer (nullable = true)
 |-- I8: integer (nullable = true)
 |-- I9: integer (nullable = true)
 |-- I10: integer (nullable = true)
 |-- I11: integer (nullable = true)
 |-- I12: integer (nullable = true)
 |-- I13: integer (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: string (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: string (nullable = true)
 |-- C6: string (nullable = true)
 |-- C7: string (nullable = true)
 |-- C8: string (nullable = true)
 |-- C9: string (nullable = true)
 |-- C10: string (nullable = true)
 |-- C11: string (nullable = true)
 |-- C12: string (nullable = true)
 |-- C13: string (nullable = true)
 |-- C14: string (nullable = true)
 |-- C15: string

In [17]:
# how many records do we have
total_count = toy_df.count()
total_count

100

In [18]:
# take a look at data
pd.DataFrame(toy_df.take(5), columns=toy_df.columns).transpose()

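| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| target | 0 | 0 | 0 | 0 | 0 |
| I1 | 1 | 2 | 2 | | 3 |
| I2 | 1 | 0 | 0 | 893 | -1 |
| I3 | 5 | 44 | 1 | | |
| I4 | 0 | 1 | 14 | | 0 |
| I5 | 1382 | 102 | 767 | 4392 | 2 |
| I6 | 4 | 8 | 89 | | 0 |
| I7 | 15 | 2 | 4 | 0 | 3 |
| I8 | 2 | 2 | 2 | 0 | 0 |
| I9 | 181 | 4 | 245 | 0 | 0 |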


In [19]:
# splitting on ',' leaves missing categorical fields as empty strings rather than nulls; replace them with None
toy_df = toy_df.replace('', None, categorical_features)
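
The empty strings come from the parsing step: splitting a line on commas turns each missing field into `''` rather than a null. The integer cast appears to turn those empty strings into nulls on its own (the `describe()` counts below 100 in the numeric summary are consistent with that), which is why only the string columns need the explicit replacement. A plain-Python sketch of the behavior, with a hypothetical helper `parse_record` and a made-up record:

```python
def parse_record(line, sep=","):
    """Split a raw record and map empty fields to None."""
    return [field if field != "" else None for field in line.split(sep)]

raw = "0,3,,68fd1e64,,3a171ecb"
print(raw.split(","))     # missing fields appear as ''
print(parse_record(raw))  # missing fields appear as None
```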

In [20]:
num_summary = toy_df.select(numeric_features).describe()
num_summary_pd = num_summary.toPandas().transpose()
num_summary_pd.columns = num_summary_pd.iloc[0]
num_summary_pd = num_summary_pd[1:]
num_summary_pd

| summary | count | mean | stddev | min | max |
| --- | --- | --- | --- | --- | --- |
| I1 | 51 | 4.862745098039215 | 13.827537174555909 | 0 | 88 |
| I2 | 100 | 86.65 | 318.3916454604139 | -1 | 2382 |
| I3 | 71 | 28.591549295774648 | 77.56758849357283 | 1 | 632 |
| I4 | 78 | 7.551282051282051 | 11.1956014506364 | 0 | 75 |
| I5 | 98 | 19274.11224489796 | 73309.81640234843 | 0 | 681386 |
| I6 | 78 | 139.35897435897436 | 302.6821647214993 | 0 | 2106 |
| I7 | 98 | 11.10204081632653 | 21.22218727901376 | 0 | 147 |
| I8 | 100 | 11.82 | 13.772597592482123 | 0 | 49 |
| I9 | 98 | 109.18367346938776 | 191.2973260325174 | 0 | 1306 |
| I10 | 51 | 0.6078431372549019 | 0.6656855525711526 | 0 | 3 |


In [22]:
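# for each categorical feature: most frequent non-null value (mode), its count, and the number of distinct levels
cat_rows = []
for i in categorical_features:
    cached = toy_df.groupBy(i).count().cache()
    mode = cached.filter(F.col(i).isNotNull()).orderBy('count', ascending=False).limit(1).toPandas()
    mode['level'] = cached.count()  # number of groups, including the null group if present
    mode['var'] = i
    mode.rename(columns={i: 'mode'}, inplace=True)
    cat_rows.append(mode)
    cached.unpersist()
# DataFrame.append is removed in recent pandas versions, so concatenate once at the end
cat_summary = pd.concat(cat_rows, ignore_index=True)
cat_summary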

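| | mode | count | level | var |
| --- | --- | --- | --- | --- |
| 0 | 05db9164 | 50 | 16 | C1 |
| 1 | 38a947a1 | 11 | 48 | C2 |
| 2 | d032c263 | 10 | 83 | C3 |
| 3 | c18be181 | 13 | 77 | C4 |
| 4 | 25c83c98 | 62 | 8 | C5 |
| 5 | 7e0ccccf | 42 | 6 | C6 |
| 6 | 3f4ec687 | 4 | 96 | C7 |
| 7 | 0b153874 | 58 | 15 | C8 |
| 8 | a73ee510 | 90 | 2 | C9 |
| 9 | 3b08e48b | 27 | 69 | C10 |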


In [23]:
# impute missing values - numeric features impute with mean, categorical features impute with mode
num_means = list(num_summary_pd['mean'].astype('float64').round().astype('int64'))
cat_modes = list(cat_summary['mode'])
impute = dict(zip(numeric_features+categorical_features, num_means+cat_modes))
impute

{'I1': 5,
 'I2': 87,
 'I3': 29,
 'I4': 8,
 'I5': 19274,
 'I6': 139,
 'I7': 11,
 'I8': 12,
 'I9': 109,
 'I10': 1,
 'I11': 2,
 'I12': 1,
 'I13': 9,
 'C1': '05db9164',
 'C2': '38a947a1',
 'C3': 'd032c263',
 'C4': 'c18be181',
 'C5': '25c83c98',
 'C6': '7e0ccccf',
 'C7': '3f4ec687',
 'C8': '0b153874',
 'C9': 'a73ee510',
 'C10': '3b08e48b',
 'C11': 'c4adf918',
 'C12': 'dfbb09fb',
 'C13': '85dbe138',
 'C14': '07d13a8f',
 'C15': '7ac43a46',
 'C16': '84898b2a',
 'C17': 'e5ba7672',
 'C18': '5aed7436',
 'C19': '21ddcdc9',
 'C20': 'a458ea53',
 'C21': '0014c32a',
 'C22': 'ad3062eb',
 'C23': '32c7478e',
 'C24': '3b183c5c',
 'C25': 'e8b83407',
 'C26': '49d68486'}
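
The imputation rule used here (rounded mean for integer features, most frequent value for categorical ones) can be illustrated on plain Python lists. This is a sketch with hypothetical helper names and made-up values, not the Spark code path used in the notebook.

```python
from collections import Counter

def mean_impute_value(values):
    """Rounded mean of the non-missing integers."""
    present = [v for v in values if v is not None]
    return int(round(sum(present) / len(present)))

def mode_impute_value(values):
    """Most frequent non-missing value."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0]

def fill(values, replacement):
    """Replace every None with the given replacement value."""
    return [replacement if v is None else v for v in values]

i1 = [1, 2, None, 3, None]                        # hypothetical integer feature
c1 = ["68fd1e64", None, "68fd1e64", "287e684f"]   # hypothetical categorical feature
print(fill(i1, mean_impute_value(i1)))
print(fill(c1, mode_impute_value(c1)))
```

Both helpers assume at least one non-missing value per column; a column that is entirely missing would need a separate default.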

In [24]:
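import json
# load precomputed imputation values from file; this overrides the dictionary built from the toy data above
with open("imputation_int.json") as f:
    impute = json.load(f)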

In [25]:
# impute missing and take a look at imputed data
toy_df_impute = toy_df.fillna(impute)
toy_df_impute.cache()
pd.DataFrame(toy_df_impute.take(10), columns=toy_df_impute.columns).transpose()

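| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| target | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| I1 | 1 | 2 | 2 | 4 | 3 | 4 | 4 | 1 | 4 | 4 |
| I2 | 1 | 0 | 0 | 893 | -1 | -1 | 1 | 4 | 44 | 35 |
| I3 | 5 | 44 | 1 | 27 | 27 | 27 | 2 | 2 | 4 | 27 |
| I4 | 0 | 1 | 14 | 7 | 0 | 7 | 7 | 0 | 8 | 1 |
| I5 | 1382 | 102 | 767 | 4392 | 2 | 12824 | 3168 | 0 | 19010 | 33737 |
| I6 | 4 | 8 | 89 | 116 | 0 | 116 | 116 | 0 | 249 | 21 |
| I7 | 15 | 2 | 4 | 0 | 3 | 0 | 0 | 1 | 28 | 1 |
| I8 | 2 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 31 | 2 |
| I9 | 181 | 4 | 245 | 0 | 0 | 6 | 2 | 0 | 141 | 3 |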


In [54]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.sql.functions import lit
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
#from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [48]:
def preprocess(df):
    """Index and one-hot encode selected categorical columns, then assemble one feature vector."""
    stages = []
    categorical_features_index = []
    # only a subset of the categorical columns is encoded here
    for categoricalCol in ['C6','C9','C14','C17','C20','C23','C25']:
        # map each string value to a numeric index
        stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index')
        # expand the index into a one-hot vector
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
        categorical_features_index += [categoricalCol + 'classVec']
        stages += [stringIndexer, encoder]

    # combine numeric features and encoded categorical features into a single "features" column
    vector_assembler = VectorAssembler(
        inputCols=numeric_features + categorical_features_index,
        outputCol="features")

    stages += [vector_assembler]
    pipeline = Pipeline(stages=stages)

    # fit the indexers and encoders on df, then apply them to the same data
    pipelineModel = pipeline.fit(df)
    df_temp = pipelineModel.transform(df)

    return df_temp
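
Conceptually, each categorical column goes through two steps: `StringIndexer` assigns integer indices by descending frequency (its default ordering), and the one-hot encoder turns each index into an indicator vector, dropping the last category by default. The plain-Python sketch below illustrates that idea with hypothetical helper names and made-up values; tie-breaking between equally frequent values may differ from Spark's.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Dense indicator vector; with drop_last the final category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# hypothetical categorical column
column = ["a73ee510", "a73ee510", "7cc72ec2", "a73ee510", "7e0ccccf", "7cc72ec2"]
mapping = fit_string_index(column)
encoded = [one_hot(mapping[v], len(mapping)) for v in column]
print(mapping)
print(encoded[:3])
```

Spark stores these vectors in sparse form; dense lists are used here only for readability.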

In [52]:
start = time.time()
df_temp = preprocess(toy_df_impute)
print("Wall time: {} seconds".format(time.time() - start))
df_temp.printSchema()

Wall time: 1.7658843994140625 seconds
root
 |-- target: integer (nullable = true)
 |-- I1: integer (nullable = false)
 |-- I2: integer (nullable = false)
 |-- I3: integer (nullable = false)
 |-- I4: integer (nullable = false)
 |-- I5: integer (nullable = false)
 |-- I6: integer (nullable = false)
 |-- I7: integer (nullable = false)
 |-- I8: integer (nullable = false)
 |-- I9: integer (nullable = false)
 |-- I10: integer (nullable = false)
 |-- I11: integer (nullable = false)
 |-- I12: integer (nullable = false)
 |-- I13: integer (nullable = false)
 |-- C1: string (nullable = false)
 |-- C2: string (nullable = false)
 |-- C3: string (nullable = false)
 |-- C4: string (nullable = false)
 |-- C5: string (nullable = false)
 |-- C6: string (nullable = false)
 |-- C7: string (nullable = false)
 |-- C8: string (nullable = false)
 |-- C9: string (nullable = false)
 |-- C10: string (nullable = false)
 |-- C11: string (nullable = false)
 |-- C12: string (nullable = false)
 |-- C13: string (nulla

In [53]:
(trainingData, testData) = df_temp.randomSplit([0.7, 0.3])

In [56]:
# https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier
start = time.time()
gbt = GBTClassifier(labelCol="target", featuresCol="features", maxIter=10)
model = gbt.fit(trainingData)

predictions = model.transform(testData)
evaluator = BinaryClassificationEvaluator(labelCol = "target")

print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))
print("Wall time: {} seconds".format(time.time() - start))

Test Area Under ROC: 0.41866028708133973
Wall time: 10.904360294342041 seconds
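
Area under the ROC curve can be read as the probability that a randomly chosen clicked example is scored above a randomly chosen non-clicked one. A value of 0.5 corresponds to random ranking, so a value below 0.5, as reported here on a small test split of the 100-row toy set, indicates ranking worse than chance and is likely unstable at this sample size. A pairwise plain-Python sketch of the metric follows; it runs in quadratic time and is for illustration only. `pairwise_auc` is a hypothetical helper, not the Spark evaluator's implementation.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# hypothetical labels (1 = click) and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))
```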


In [57]:
predictions

DataFrame[target: int, I1: int, I2: int, I3: int, I4: int, I5: int, I6: int, I7: int, I8: int, I9: int, I10: int, I11: int, I12: int, I13: int, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string, C9: string, C10: string, C11: string, C12: string, C13: string, C14: string, C15: string, C16: string, C17: string, C18: string, C19: string, C20: string, C21: string, C22: string, C23: string, C24: string, C25: string, C26: string, C6Index: double, C6classVec: vector, C9Index: double, C9classVec: vector, C14Index: double, C14classVec: vector, C17Index: double, C17classVec: vector, C20Index: double, C20classVec: vector, C23Index: double, C23classVec: vector, C25Index: double, C25classVec: vector, features: vector, rawPrediction: vector, probability: vector, prediction: double]

In [58]:
predictions.select("rawPrediction").show(10)

+--------------------+
|       rawPrediction|
+--------------------+
|[0.74513602147748...|
|[0.97951028809689...|
|[0.07378347550427...|
|[-1.5491457122103...|
|[0.38339226534002...|
|[1.27888079864318...|
|[-0.6106425104610...|
|[1.11183510870168...|
|[1.14495891103410...|
|[-1.2038285656869...|
+--------------------+
only showing top 10 rows



In [47]:
#from pyspark.mllib.regression import LabeledPoint
#y_toy_train = [item[0] for item in toy_df_impute]
#x_toy_train = [item[1:] for item in toy_df_impute]
#toy_df_impute_lp = LabeledPoint(y_toy_train, x_toy_train)

#toy_df_impute_lp = toy_df_impute.rdd.map(lambda x: LabeledPoint(x[0], x[1:])).collect()

TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'NoneType'>, respectively.

In [46]:
#categorical_features = ['C6','C9','C14','C17','C20','C23','C25']

#model = GradientBoostedTrees.trainClassifier(toy_df_impute,
#                                             categoricalFeaturesInfo=categorical_features,
#                                             numIterations=3)

AssertionError: the data should be RDD of LabeledPoint

<a id='course_concepts_application'></a>
# 5. Application of Course Concepts