### HW 13.4: Criteo Phase 2 baseline


>SPECIAL NOTE:
Please share your findings with the class via the Google Group as they become available. You will get brownie points for this. Once results are shared, please use them and build on them.

>The Criteo data for this challenge is located in the following S3/Dropbox buckets:

>On Dropbox see:
     https://www.dropbox.com/sh/dnevke9vsk6yj3p/AABoP-Kv2SRxuK8j3TtJsSv5a?dl=0

>Raw Data:  (Training, Validation and Test data)
https://console.aws.amazon.com/s3/home?region=us-west-1#&bucket=criteo-dataset&prefix=rawdata/

>Hashed Data: Training, Validation and Test data in hash encoded (10,000 buckets) and sparse representation
https://console.aws.amazon.com/s3/home?region=us-west-1#&bucket=criteo-dataset&prefix=processeddata/


>Using the training dataset, validation dataset and testing dataset in the Criteo bucket perform the following experiment:

>-- Write Spark code (borrow from Phase 1 of this project) to train a logistic regression model with the following hyperparameters:

>-- Number of buckets for hashing: 1,000
-- Logistic Regression: no regularization term
-- Logistic Regression: step size = 10

>Report the AWS cluster configuration that you used and how long in minutes and seconds it takes to complete this job.

>Report in tabular form the AUC value (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) for the Training, Validation, and Testing datasets.
Report in tabular form the log loss for the Training, Validation, and Testing datasets.

>Don't forget to put a caption on your tables (above each table).
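
The log loss and the captioned table requested above can be sketched in plain Python, without Spark. This is a minimal illustration, not the notebook's method: `mean_log_loss` and `format_table` are hypothetical helper names, and the probabilities and labels are made-up toy values rather than Criteo results.

```python
from math import log

def mean_log_loss(probs, labels, eps=1e-12):
    """Average binary log loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -log(p) if y == 1 else -log(1.0 - p)
    return total / len(labels)

def format_table(caption, rows):
    """Return a Markdown table with the caption placed above it.

    rows is a list of (dataset name, metric value) pairs.
    """
    lines = [caption, '', '| Dataset | Value |', '|---|---|']
    for name, value in rows:
        lines.append('| %s | %.4f |' % (name, value))
    return '\n'.join(lines)

# Toy predictions and labels, made up for illustration only.
toy_probs = [0.9, 0.2, 0.6, 0.4]
toy_labels = [1, 0, 1, 0]

table = format_table('Table 1: Log loss on a toy example',
                     [('Toy', mean_log_loss(toy_probs, toy_labels))])
print(table)
```

For the actual report, the single toy row would be replaced by one row each for the Training, Validation, and Testing datasets.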


In [1]:
# spark functions 
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# helper libraries 
from collections import defaultdict
from math import log  # natural log, used by computeLogLoss
import hashlib
import numpy as np
import datetime as DT

In [2]:
def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
    """Produce a one-hot-encoding from a list of features and an OHE dictionary.

    Note:
        If a (featureID, value) tuple doesn't have a corresponding key in OHEDict it should be
        ignored.

    Args:
        rawFeats (list of (int, str)): The features corresponding to a single observation.  Each
            feature consists of a tuple of featureID and the feature's value. (e.g. sampleOne)
        OHEDict (dict): A mapping of (featureID, value) to unique integer.
        numOHEFeats (int): The total number of unique OHE features (combinations of featureID and
            value).

    Returns:
        SparseVector: A SparseVector of length numOHEFeats with indices equal to the unique
            identifiers for the (featureID, value) combinations that occur in the observation and
            with values equal to 1.0.
    """
    ref = {}
    for k in rawFeats:
        if k in OHEDict:
            ref.update({OHEDict[k]:1})
    return SparseVector(numOHEFeats, ref)

def parseOHEPoint(point, OHEDict, numOHEFeats, delim):
    """Obtain the label and feature vector for this raw observation.

    Note:
        You must use the function `oneHotEncoding` in this implementation or later portions
        of this lab may not function as expected.

    Args:
        point (str): A delimited string where the first value is the label and the rest
            are features.
        OHEDict (dict of (int, str) to int): Mapping of (featureID, value) to unique integer.
        numOHEFeats (int): The number of unique features in the training dataset.
        delim (str): The field delimiter, e.g. a tab for the raw Criteo files.

    Returns:
        LabeledPoint: Contains the label for the observation and the one-hot-encoding of the
            raw features based on the provided OHE dictionary.
    """
    splits = point.split(delim)
    fields = [ (i,v) for i,v in enumerate(splits[1:]) ]
    lp = LabeledPoint(splits[0], oneHotEncoding(fields, OHEDict, numOHEFeats))
    return lp

def hashFunction(numBuckets, rawFeats, printMapping=False):
    """Calculate a feature dictionary for an observation's features based on hashing.

    Note:
        Use printMapping=True for debug purposes and to better understand how the hashing works.

    Args:
        numBuckets (int): Number of buckets to use as features.
        rawFeats (list of (int, str)): A list of features for an observation.  Represented as
            (featureID, value) tuples.
        printMapping (bool, optional): If true, the mappings of featureString to index will be
            printed.

    Returns:
        dict of int to float:  The keys will be integers which represent the buckets that the
            features have been hashed to.  The value for a given key will contain the count of the
            (featureID, value) tuples that have hashed to that key.
    """
    mapping = {}
    for ind, category in rawFeats:
        featureString = category + str(ind)
        mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
    if(printMapping): print mapping
    sparseFeatures = defaultdict(float)
    for bucket in mapping.values():
        sparseFeatures[bucket] += 1.0
    return dict(sparseFeatures)

def parseHashPoint(point, numBuckets, delim):
    """Create a LabeledPoint for this observation using hashing.

    Args:
        point (str): A delimited string where the first value is the label and the rest are
            features.
        numBuckets (int): The number of buckets to hash to.
        delim (str): The field delimiter, e.g. a tab for the raw Criteo files.

    Returns:
        LabeledPoint: A LabeledPoint with a label (0.0 or 1.0) and a SparseVector of hashed
            features.
    """
    splits = point.split(delim)
    fields = [ (i,v) for i,v in enumerate(splits[1:]) ]
    vec = SparseVector(numBuckets, hashFunction(numBuckets, fields))
    return LabeledPoint(splits[0], vec)

def computeLogLoss(p, y):
    """Calculates the value of log loss for a given probability and label.

    Note:
        log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
        and when p is 1 we need to subtract a small value (epsilon) from it.

    Args:
        p (float): A probability between 0 and 1.
        y (int): A label.  Takes on the values 0 and 1.

    Returns:
        float: The log loss value.
    """
    epsilon = 10e-12
    if p==0:
        p+=epsilon
    elif p==1:
        p-=epsilon
    if y==1:
        return -log(p)
    elif y==0:
        return -log(1-p)
    else:
        raise Exception('y not in {0,1}')

def getP(x, w, intercept):
    """Calculate the predicted probability for an observation given model weights.

    Note:
        evaluateResults calls getP, which is not defined elsewhere in this notebook.
        This definition is the usual logistic form and is assumed to match the Phase 1 helper.

    Args:
        x (SparseVector): The features of the observation.
        w (DenseVector): The model weights.
        intercept (float): The model intercept.

    Returns:
        float: A probability between 0 and 1.
    """
    rawPrediction = x.dot(w) + intercept
    # Bound the raw prediction so that exp() cannot overflow.
    rawPrediction = max(min(rawPrediction, 20), -20)
    return 1.0 / (1.0 + np.exp(-rawPrediction))

def evaluateResults(model, data):
    """Calculates the log loss for the data given the model.

    Args:
        model (LogisticRegressionModel): A trained logistic regression model.
        data (RDD of LabeledPoint): Labels and features for each observation.

    Returns:
        float: Log loss for the data.
    """
    probs = data.map(lambda x: (getP(x.features, model.weights, model.intercept), x.label))
    logloss = probs.map(lambda x: computeLogLoss(x[0],x[1])).reduce(lambda x,y: x+y) / probs.count()
    return logloss
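
The bucketing done by `hashFunction` above can be tried without Spark. The following is a small sketch under stated assumptions: `hash_features` is a hypothetical stand-in that mirrors the notebook's logic, the feature tuples are invented, and the `.encode('utf-8')` call is added only so the same code also runs on Python 3.

```python
import hashlib
from collections import defaultdict

def hash_features(num_buckets, raw_feats):
    """Map (featureID, value) tuples to bucket counts, mirroring hashFunction."""
    mapping = {}
    for ind, category in raw_feats:
        feature_string = category + str(ind)
        digest = hashlib.md5(feature_string.encode('utf-8')).hexdigest()
        # Interpret the hex digest as an integer and fold it into a bucket.
        mapping[feature_string] = int(digest, 16) % num_buckets
    counts = defaultdict(float)
    for bucket in mapping.values():
        counts[bucket] += 1.0  # colliding features add up in the same bucket
    return dict(counts)

# Invented sample observation: (featureID, value) tuples.
sample = [(0, 'mouse'), (1, 'black'), (2, 'salmon')]

wide = hash_features(1000, sample)   # collisions are unlikely with many buckets
narrow = hash_features(1, sample)    # every feature lands in bucket 0
print(wide)
print(narrow)
```

Fewer buckets mean a smaller model but more collisions, which is the trade-off behind choosing 1,000 buckets for this baseline.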

In [3]:
train_data_location = 's3n://criteo-dataset/rawdata/train'
test_data_location = 's3n://criteo-dataset/rawdata/test'
validation_data_location = 's3n://criteo-dataset/rawdata/validation'

In [4]:
train_data = sc.textFile(train_data_location, 100).cache()
validation_data = sc.textFile(validation_data_location, 100).cache()
test_data = sc.textFile(test_data_location, 100).cache()

In [5]:
train_data.first()

u'0\t1\t1\t5\t0\t1382\t4\t15\t2\t181\t1\t2\t\t2\t68fd1e64\t80e26c9b\tfb936136\t7b4723c4\t25c83c98\t7e0ccccf\tde7995b8\t1f89b562\ta73ee510\ta8cd5504\tb2cb9c98\t37c9c164\t2824a5f6\t1adce6ef\t8ba8b39a\t891b62e7\te5ba7672\tf54016b9\t21ddcdc9\tb1252a9d\t07b5194c\t\t3a171ecb\tc5c50484\te8b83407\t9727dd16'

In [6]:
hashTrainData = train_data.map(lambda x: parseHashPoint(x, 1000, '\t')).cache()
hashValidationData = validation_data.map(lambda x: parseHashPoint(x, 1000, '\t')).cache()
hashTestData = test_data.map(lambda x: parseHashPoint(x, 1000, '\t')).cache()

In [None]:
from pyspark.mllib.classification import LogisticRegressionWithSGD

# fixed hyperparameters
numIters = 50
stepSize = 10.
regParam = 0.
regType = 'l2'
includeIntercept = True

starttime = DT.datetime.now()
model0 = LogisticRegressionWithSGD.train( data= hashTrainData
                                         , iterations= numIters
                                         , step= stepSize
                                         , regParam= regParam
                                         , regType= regType
                                         , intercept= includeIntercept )
endtime = DT.datetime.now()
# total_seconds() also counts whole days, which .seconds silently drops
runtime = int((endtime-starttime).total_seconds())
runtime_hours = runtime // 3600
runtime_minutes = (runtime % 3600) // 60
runtime_seconds = runtime % 60
print 'job finished in {} hours, {} minutes and {} seconds'.\
        format(runtime_hours, runtime_minutes, runtime_seconds)
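
To make the roles of the step size and the absent regularization term concrete, here is a toy gradient-descent sketch for logistic regression in plain Python. It is not MLlib's implementation: `train_logreg` and `sigmoid` are hypothetical names, the data is invented, and the step decay of `step / sqrt(t)` is an assumption about how MLlib's updaters scale the step per iteration.

```python
from math import exp, sqrt

def sigmoid(z):
    """Logistic function with the input bounded to avoid overflow in exp()."""
    z = max(min(z, 20.0), -20.0)
    return 1.0 / (1.0 + exp(-z))

def train_logreg(points, num_iters, step):
    """Full-batch gradient descent with an intercept and no regularization.

    points is a list of (label, {feature index: value}) pairs.
    """
    weights, intercept = {}, 0.0
    n = float(len(points))
    for t in range(1, num_iters + 1):
        grad, grad_b = {}, 0.0
        for y, feats in points:
            z = intercept + sum(weights.get(i, 0.0) * v for i, v in feats.items())
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            grad_b += err
            for i, v in feats.items():
                grad[i] = grad.get(i, 0.0) + err * v
        lr = step / sqrt(t)  # assumed decay schedule
        intercept -= lr * grad_b / n
        for i, g in grad.items():
            # No regularization: the update uses the loss gradient only.
            weights[i] = weights.get(i, 0.0) - lr * g / n
    return weights, intercept

# Invented, linearly separable toy data: feature 0 marks clicks, feature 1 non-clicks.
toy = [(1.0, {0: 1.0}), (1.0, {0: 1.0}), (0.0, {1: 1.0}), (0.0, {1: 1.0})]
w, b = train_logreg(toy, num_iters=50, step=10.0)
print(w, b)
```

With no penalty term, nothing pulls the weights back toward zero, so on separable data like this they keep growing as iterations are added.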

In [None]:
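# Make predict() return raw probabilities instead of thresholded 0/1 labels
model0.clearThreshold()

# Compute raw scores on the hashed test set
predictionAndLabels = hashTestData.map(lambda lp: (float(model0.predict(lp.features)), lp.label))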

# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)
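
AUC is a ranking metric, so it is normally computed from continuous scores rather than from 0/1 predictions. The sketch below shows the difference on invented data. `pairwise_auc` is a hypothetical helper that uses the pairwise definition of AUC (ties count one half); it is quadratic in the data size, so it is only suitable for toy inputs and is not how `BinaryClassificationMetrics` works internally.

```python
def pairwise_auc(scores, labels):
    """AUC as the chance that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Invented scores: the positives are ranked above the negatives.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 1, 0, 0]

# Thresholding at 0.5 turns every score into the same hard prediction.
hard = [1.0 if s > 0.5 else 0.0 for s in scores]

auc_raw = pairwise_auc(scores, labels)
auc_hard = pairwise_auc(hard, labels)
print(auc_raw, auc_hard)
```

Here the raw scores separate the classes perfectly, while the thresholded predictions carry no ranking information at all. The same computation would be repeated for the training and validation sets to fill in the requested table.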