# Distributed gradient descent

In this exercise, we will build from scratch a logistic regression model and train it with distributed gradient descent.

As in the other exercise, we start with a few imports (fewer than before, since we won't use MLlib) and create a local Spark application.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql import types as st
from pyspark.sql import functions as sf
from pyspark.sql import Row, DataFrame
from pyspark import RDD
from pyspark import StorageLevel

In [0]:
%matplotlib inline

In [0]:
import numpy as np
import urllib.request
import math
import matplotlib.pyplot as plot
from typing import Tuple, Dict

In [0]:
ss = SparkSession \
    .builder \
    .appName("criteo-lr") \
    .master("local[4]") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.driver.memory", "4g") \
    .config("spark.ui.port", "0") \
    .getOrCreate()
ss

In [0]:
toy_dataset_url = 'https://www.dropbox.com/s/dle2t3szhljfevh/criteo_toy_dataset.txt?dl=1'
urllib.request.urlretrieve(toy_dataset_url, "criteo_toy_dataset.txt")
dbutils.fs.mv("file:/databricks/driver/criteo_toy_dataset.txt", "dbfs:/train.txt")

## Q0: Load the data as a Spark DataFrame

This is exactly the same as Q1 for the other exercise.  

We will assume in the rest of the code that your dataframe is called df, that categorical_features is the list of the categorical feature column names, and that the label column is called 'label'.

## Convert the input to a vector using one hot encoding

Unlike the previous exercise, we will use one hot encoding to transform the raw features to our input vector. We will restrict ourselves to a subset of the categorical features, the ones with a small number of distinct modalities. Using one hot encoding on this subset of features will give us a vector of dimension ~100. This will allow us to work with dense vectors. For feature hashing to work well, we have to use a much larger dimension (look at the 2^16 in the previous exercise) where sparse vectors are required.

### Selecting a subset of features based on the number of modalities

In [0]:
num_modalities = {} 
for cat_feat in categorical_features:
    num_modalities[cat_feat] = df \
        .filter(sf.col(cat_feat).isNotNull()) \
        .select(cat_feat) \
        .distinct() \
        .count()
num_modalities

We will use all categorical features with fewer than 50 distinct modalities.

In [0]:
low_card_cat_feat = [cat_feat for cat_feat, count in num_modalities.items() if count < 50]
low_card_cat_feat

### Building dict for one hot encoding

For one hot encoding, you need a dictionary that maps each (feature, modality) pair to its index in the vector.  
First let's collect the list of modalities for each feature.

In [0]:
modalities = {}
for cat_feat in low_card_cat_feat:
    rows = df\
        .filter(sf.col(cat_feat).isNotNull())\
        .select(cat_feat)\
        .distinct()\
        .collect()
    modalities[cat_feat] = [row[cat_feat] for row in rows]
    # Previous line is to unpack the data from List[Row[str]] to a List[str]

Then let's build the dictionary.  
We put in the dictionary all the modalities collected in the previous step, plus one extra entry per feature to account for the possibility of that feature being absent. Giving an index to the 'absent' modality of a feature allows our model to assign a weight to that event, which may improve model quality.

In [0]:
one_hot_encoder = {cat_feat:{} for cat_feat in low_card_cat_feat}
index = 0
for cat_feat in low_card_cat_feat:
    for value in modalities[cat_feat]:
        one_hot_encoder[cat_feat][value] = index
        index += 1
    one_hot_encoder[cat_feat][None] = index
    index += 1

### Converting our input to a vector

We simply apply the previously generated dictionary and put a 1 in the vector at the index of each (feature,modality).  
The dimension of our vector is the total number of entries in the dictionary (all distinct modalities, plus one 'absent' slot per feature) + 1. The extra dimension holds the weight of the intercept. Treating the intercept as a feature that every example has simplifies the code below.

In [0]:
dimension = 1 + sum(len(mapping) for mapping in one_hot_encoder.values())
dimension

In [0]:
def row_to_vector(
    row: Row, dimension: int, encoder: Dict[str, Dict[str, int]]
) -> np.ndarray:
    x = np.zeros(dimension)
    x[-1] = 1 # for intercept
    for feat in encoder.keys():
        value = row[feat]
        index = encoder[feat].get(value, None)
        # index is None when this modality is not in our dictionary,
        # which can happen if the test set contains a modality
        # that was absent from the training set used to build the dictionary.
        # Our vector has no slot for such modalities, so we skip them.
        if index is not None:
            x[index] = 1
    return x

## Q1: Convert the dataframe to an RDD[vector, label]

Using the function row_to_vector, convert the dataframe to an RDD where each element of the RDD is the pair (vector, label).  
Print the first few elements of this RDD.

## Computing the prediction and the loss

The prediction of the logistic regression model is defined as the dot product between the feature vector and the model weights followed by the sigmoid.
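
Written out, with $x$ the feature vector (including the constant 1 used for the intercept) and $w$ the model weights (both symbols are chosen here for illustration and do not appear in the original text), this reads:

$$\hat{y} = \sigma(w^\top x), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Since $\sigma$ maps any real number into $(0, 1)$, $\hat{y}$ can be read as the predicted probability of the positive label.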

In [0]:
def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))
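
A side note on numerical range: `math.exp(-x)` raises `OverflowError` once `-x` exceeds roughly 709, which the formula above can hit for very negative inputs. The plot below only covers [-10, 10], so this does not matter here, but if it ever becomes a problem during training, a common workaround is to branch on the sign of the input. The name `stable_sigmoid` is hypothetical; this is a minimal sketch, not part of the exercise.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Sigmoid that never calls math.exp on a large positive argument."""
    if x >= 0:
        # -x <= 0, so math.exp(-x) is at most 1 and cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x));
    # math.exp(x) is below 1 here and underflows harmlessly to 0.0.
    z = math.exp(x)
    return z / (1.0 + z)
```

Both branches compute the same mathematical function; only the evaluation order differs.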

In [0]:
X = np.arange(-10, 10, 0.01)

In [0]:
plot.plot(X, [sigmoid(x) for x in X])

In [0]:
def point_predict(x: np.ndarray, model: np.ndarray) -> float:
    # implement me !
    raise NotImplementedError

The logistic loss for a single example is:

In [0]:
def point_loss(prediction: float, y: int) -> float:
    return - y * math.log(prediction) - (1-y) * math.log(1-prediction)

The closer the prediction is to the label, the lower the loss.

In [0]:
for pred, label in [(0.9, 1), (0.1, 0), (0.1, 1), (0.9, 0)]:
    print(f'For a predicted positive probability of {pred}, when the label is {"positive" if label == 1 else "negative"}, the loss is {point_loss(pred, label)}')
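
One edge case worth keeping in mind: `math.log(0)` raises `ValueError`, so the loss is undefined if a prediction is exactly 0 or 1. A possible guard, sketched below under the assumption that a tiny clipping constant is acceptable, is to clamp the prediction before taking the logarithm. The name `clipped_point_loss` and the default `eps` are hypothetical choices.

```python
import math


def clipped_point_loss(prediction: float, y: int, eps: float = 1e-12) -> float:
    """Logistic loss with the prediction clamped to [eps, 1 - eps]."""
    # Clamping keeps both logarithms finite even for predictions of 0 or 1.
    p = min(max(prediction, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)
```

For predictions strictly inside (eps, 1 - eps) this returns the same value as the unclipped formula.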

## Q2: Compute the loss

Given an RDD of pairs (vector, label), a model and the number of training examples, compute the average loss for this model on this RDD.
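
The shape of the computation is a map followed by a reduce: each example is mapped to its loss, the losses are summed, and the sum is divided by the number of examples. The sketch below emulates partitions with plain Python lists so that it runs without Spark; on an actual RDD the same idea could be written with `map` followed by `sum`. The helper names, the toy data and the use of lists instead of NumPy arrays are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (vector, label)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def point_predict(x: Sequence[float], model: Sequence[float]) -> float:
    # Dot product between features and weights, followed by the sigmoid.
    return sigmoid(sum(xi * wi for xi, wi in zip(x, model)))


def point_loss(prediction: float, y: int) -> float:
    return -y * math.log(prediction) - (1 - y) * math.log(1 - prediction)


def average_loss(partitions: List[List[Example]],
                 model: Sequence[float], n: int) -> float:
    # "Map" side: every partition computes the sum of its own losses.
    partial_sums = [
        sum(point_loss(point_predict(x, model), y) for x, y in part)
        for part in partitions
    ]
    # "Reduce" side: combine the partial sums and normalise by n.
    return sum(partial_sums) / n


# Toy data: two partitions, last coordinate is the intercept feature.
partitions = [
    [([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 1.0], 0)],
    [([1.0, 0.0, 1.0], 0)],
]
zero_model = [0.0, 0.0, 0.0]
loss_at_zero = average_loss(partitions, zero_model, n=3)
```

With an all-zero model every prediction is 0.5, so the average loss equals log(2) whatever the labels are, which makes a convenient sanity check.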

## Q3: Compute the gradient of the loss

Start by completing the function that computes the gradient on a single example.

In [0]:
def point_gradient(x: np.ndarray, y: int, model: np.ndarray) -> np.ndarray:
    # implement me !
    raise NotImplementedError

Given an RDD of pairs (vector, label), a model and the number of training examples, use this function to compute the gradient of the loss.
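
For the logistic loss, the gradient on one example works out to $(\sigma(w^\top x) - y)\,x$, where $w$ denotes the model and $x$ the feature vector (notation assumed here). A useful habit when implementing a gradient by hand is to compare it with a finite-difference estimate. The sketch below does this on one toy example with plain Python lists; all function names and values are hypothetical, and it is meant as a checking aid rather than a reference solution.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dot(x: Sequence[float], w: Sequence[float]) -> float:
    return sum(xi * wi for xi, wi in zip(x, w))


def loss_on_example(x: Sequence[float], y: int, model: Sequence[float]) -> float:
    p = sigmoid(dot(x, model))
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


def point_gradient_sketch(x: Sequence[float], y: int,
                          model: Sequence[float]) -> List[float]:
    # d loss / d w_j = (sigmoid(w . x) - y) * x_j
    error = sigmoid(dot(x, model)) - y
    return [error * xj for xj in x]


def numerical_gradient(x: Sequence[float], y: int,
                       model: Sequence[float], h: float = 1e-6) -> List[float]:
    # Central differences: perturb one weight at a time by +h and -h.
    grad = []
    for j in range(len(model)):
        up, down = list(model), list(model)
        up[j] += h
        down[j] -= h
        grad.append((loss_on_example(x, y, up) - loss_on_example(x, y, down)) / (2 * h))
    return grad


x, y, model = [1.0, 0.0, 1.0], 1, [0.3, -0.2, 0.1]
analytic = point_gradient_sketch(x, y, model)
numeric = numerical_gradient(x, y, model)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

If `max_gap` is not tiny, the analytic gradient and the loss disagree and one of them has a bug. The gradient over the whole dataset is then the sum of the per-example gradients divided by the number of examples.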

## Q4: Smart initialization

Initialize your model to zero except for the intercept, which should be initialized with the logit of the average probability of the positive label. The logit is the inverse of the sigmoid; its implementation is left to you below.  
Compare the loss of this smart model to the model which is always zero.  
What is the prediction using this smart model?
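
To see why this initialization is a sensible starting point, it helps to work through a small numeric case. The sketch below assumes a hypothetical positive rate of 0.25 and uses the closed form $\mathrm{logit}(p) = \log\frac{p}{1-p}$; the names `logit_sketch` and `base_rate` are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit_sketch(p: float) -> float:
    # Inverse of the sigmoid, defined for 0 < p < 1.
    return math.log(p / (1.0 - p))


base_rate = 0.25  # hypothetical average probability of the positive label
intercept = logit_sketch(base_rate)

# With all other weights at zero, the dot product is the intercept alone,
# so every example receives the same prediction.
smart_prediction = sigmoid(intercept)

# Average loss of a constant prediction p when a fraction p of labels is 1.
smart_loss = (-base_rate * math.log(base_rate)
              - (1 - base_rate) * math.log(1 - base_rate))

# The all-zero model predicts 0.5 everywhere, giving a loss of log(2).
zero_loss = math.log(2)
```

In this toy case the constant prediction equals the base rate and its loss is lower than that of the all-zero model, which is the comparison the question asks you to make on the real data.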

In [0]:
def logit(x: float) -> float:
    # implement me !
    raise NotImplementedError

In [0]:
# Should be close to 0 if logit is indeed the inverse of sigmoid
np.sum(np.abs([x - logit(sigmoid(x)) for x in X]))

## Q5: Distributed Gradient Descent

Write a train function that takes as input the training dataframe, the dictionary for the encoder, a maximum number of iterations and a learning rate, and that outputs a model. Print the initial loss, the loss at every step and the final loss to make sure it decreases.
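
The overall loop has a simple structure: at each iteration every partition computes the sum of its per-example gradients (and losses), the driver adds those partial sums, divides by the number of examples, and takes one step against the gradient. The sketch below shows that structure on toy data with partitions emulated as Python lists, so it runs without Spark. It takes partitions of already encoded vectors rather than a dataframe and an encoder, and it starts from an all-zero model for brevity; those simplifications, like all names in it, are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (vector, label)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def partition_stats(part: List[Example],
                    model: Sequence[float]) -> Tuple[List[float], float]:
    """Sum of per-example gradients and sum of losses over one partition."""
    grad = [0.0] * len(model)
    loss = 0.0
    for x, y in part:
        p = sigmoid(sum(xi * wi for xi, wi in zip(x, model)))
        loss += -y * math.log(p) - (1 - y) * math.log(1 - p)
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return grad, loss


def train_sketch(partitions: List[List[Example]], dimension: int,
                 max_iter: int, learning_rate: float):
    n = sum(len(part) for part in partitions)
    model = [0.0] * dimension
    losses = []
    for _ in range(max_iter):
        # "Map" side: each partition works independently on the current model.
        stats = [partition_stats(part, model) for part in partitions]
        # "Reduce" side: combine partial sums and average over all examples.
        grad = [sum(g[j] for g, _ in stats) / n for j in range(dimension)]
        losses.append(sum(l for _, l in stats) / n)
        # Gradient step.
        model = [w - learning_rate * g for w, g in zip(model, grad)]
    return model, losses


# Toy data: two partitions, last coordinate is the intercept feature.
partitions = [
    [([1.0, 0.0, 1.0], 1), ([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 1.0], 0)],
    [([0.0, 1.0, 1.0], 0), ([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1)],
]
model, losses = train_sketch(partitions, dimension=3, max_iter=50, learning_rate=0.5)
```

Each entry of `losses` is the loss measured before the corresponding step, so the list can be printed to check that training makes progress. Only the model and the partial sums travel between driver and workers, never the data itself, which is what makes the scheme distributed.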

## Q6: Weight analysis

Print the intercept and compare it to the average probability of positive.  
Print the weight associated with every feature and modality.

## [OPT] Q7: Sparse vectors and feature hashing

Replace the one hot encoding scheme by feature hashing and use all categorical features.  
Replace all usage of dense vectors by sparse vectors.  
Compare the performance of your model to that of the Spark MLlib model.
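
As a possible starting point, feature hashing replaces the dictionary lookup with a hash of the (feature, modality) pair taken modulo the vector size, and a sparse vector can be represented as a mapping from index to value. The sketch below uses `zlib.crc32` because it gives the same result in every Python process, whereas the built-in `hash` of a string can differ between processes. The column names, the `{index: value}` representation and the choice of placing the intercept at index `dimension` are all assumptions made for illustration.

```python
import zlib
from typing import Dict, Optional


def hashed_index(feature: str, value: Optional[str], dimension: int) -> int:
    # Hash the "feature=value" string into the range [0, dimension).
    key = f"{feature}={value}".encode("utf-8")
    return zlib.crc32(key) % dimension


def row_to_sparse_vector(row: Dict[str, Optional[str]],
                         dimension: int) -> Dict[int, float]:
    # The intercept lives in one extra slot, at index `dimension`.
    x: Dict[int, float] = {dimension: 1.0}
    for feature, value in row.items():
        idx = hashed_index(feature, value, dimension)
        # On a collision, the two features simply add up in the same slot.
        x[idx] = x.get(idx, 0.0) + 1.0
    return x


def sparse_dot(x: Dict[int, float], model: Dict[int, float]) -> float:
    # Only the non-zero entries of x contribute to the dot product.
    return sum(v * model.get(i, 0.0) for i, v in x.items())


row = {"C1": "a", "C2": None}  # hypothetical column names
x = row_to_sparse_vector(row, dimension=2 ** 16)
```

A missing value is hashed like any other modality here, so it gets its own slot just as the 'absent' entry did with one hot encoding.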

In [0]:
ss.stop()