# Logistic regression on Criteo dataset with Spark MLlib

In this exercise, we will build a logistic regression model on the Criteo dataset using Spark MLlib.

First, a few imports to help you find what you need for the exercise. Reminder: you can easily access the documentation of a method by typing method? in a notebook cell.

Then we create a Spark application running locally on your computer.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import types as st
from pyspark.sql import functions as sf
from pyspark.ml.feature import FeatureHasher, QuantileDiscretizer  # QuantileDiscretizer is used in Q8
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark import StorageLevel

In [None]:
%matplotlib inline

In [None]:
from scipy.special import expit
import matplotlib.pyplot as plot

In [None]:
ss = SparkSession \
    .builder \
    .appName("criteo-lr") \
    .master("local[4]") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.driver.memory", "4g") \
    .config("spark.ui.port", "0") \
    .getOrCreate()
ss

## Q1: Load the data as a Spark DataFrame

Use the function ss.read.csv to read the train.txt file.  
Don't forget to specify that the separator is "\t" and that there is no header.  
For the schema, you can either infer the schema automatically on a subset of the dataset using inferSchema and samplingRatio parameters or build the schema using st.StructType, st.StructField, st.IntegerType and st.StringType.

For faster execution when building the code for the next questions, sample the dataset at 1% and persist the data.

## Q2: Simple stats

Using the Spark DataFrame API, compute the number of training examples, the number of positive and negative examples, and the average probability of the positive class (that is, the fraction of positive examples).

## Q3: Train Test Split

Split the DataFrame into train and test DataFrames, using 80% of the rows for training and 20% for testing.

## Q4: Feature Hashing

For now we will restrict ourselves to the categorical features.  
Hash all categorical features on a 2^16 space using FeatureHasher.

Observe a few lines of the hashed features. How are the features represented after hashing?

## Q5: Fit a logistic regression model

Create and fit a logistic regression model using LogisticRegression.  
Use L2 regularization.  
Limit the maximum number of iterations of the optimization algorithm to something small for fast iteration.

Print the model intercept and compute the sigmoid of the intercept (using scipy expit for example). What do you notice?
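
As a hint for interpreting the result, the sigmoid is the inverse of the log-odds (logit) function, so an intercept equal to the log-odds of some rate p maps back to p. The sketch below checks this with the standard library only; the rate 0.25 is a made-up value, not a measurement on Criteo, and scipy's expit computes the same sigmoid.

```python
import math


def sigmoid(x):
    """Logistic function, the same quantity scipy.special.expit computes."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log-odds of a probability p in (0, 1); inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


positive_rate = 0.25  # hypothetical rate, for illustration only
intercept = logit(positive_rate)

print(intercept, sigmoid(intercept))
```

It may be worth comparing the sigmoid of your fitted intercept with the positive rate computed in Q2.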

Using matplotlib, plot a histogram of the model coefficients. What do you notice?

## Q6: Set up a pipeline

Your model consists of two steps: feature hashing and logistic regression. Assemble both steps into a Pipeline for ease of use.

## Q7: Model evaluation

Compute your model's predictions on the test DataFrame.  
Using BinaryClassificationEvaluator, compute the areaUnderROC metric on the test DataFrame.

## [OPT] Q8: Bucketization of integer features

Using QuantileDiscretizer, bucketize every integer feature into 20 buckets. Then add those features to the hasher and rerun training and evaluation. By how much does the AUC metric improve?

Don't forget to close the Spark application at the end.

In [None]:
ss.stop()