===================================================================
===HW 13.4: Criteo Phase 2 baseline ===


SPECIAL NOTE:
Please share your findings with the class via the Google Group as they become available. You will get brownie points for this. Once results are shared, please use them and build on them.

The Criteo data for this challenge is located in the following S3/Dropbox buckets:

On Dropbox see:
     https://www.dropbox.com/sh/dnevke9vsk6yj3p/AABoP-Kv2SRxuK8j3TtJsSv5a?dl=0

Raw Data:  (Training, Validation and Test data)
https://console.aws.amazon.com/s3/home?region=us-west-1#&bucket=criteo-dataset&prefix=rawdata/

Hashed Data: Training, Validation and Test data in hash encoded (10,000 buckets) and sparse representation
https://console.aws.amazon.com/s3/home?region=us-west-1#&bucket=criteo-dataset&prefix=processeddata/


Using the training dataset, validation dataset and testing dataset in the Criteo bucket perform the following experiment:

-- Write Spark code (borrow from Phase 1 of this project) to train a logistic regression model with the following hyperparameters:

-- Number of buckets for hashing: 1,000
-- Logistic Regression: no regularization term
-- Logistic Regression: step size = 10

Report the AWS cluster configuration that you used and how long in minutes and seconds it takes to complete this job.

Report in tabular form the AUC value (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) for the Training, Validation, and Testing datasets.
Report in tabular form the logLossTest for the Training, Validation, and Testing datasets.

Don't forget to put a caption on your tables (above each table).
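
The two metrics requested above can be illustrated on a handful of (probability, label) pairs. The following is a minimal pure-Python sketch, not the Spark implementation used in the submission: `mean_log_loss` and `auc` are hypothetical helper names, the clipping of `p` is a simplification of the epsilon handling in the notebook, and the pairwise AUC loop is quadratic, so it is only suitable for toy inputs.

```python
from math import log


def mean_log_loss(pairs, epsilon=1e-11):
    """Average log loss over (probability, label) pairs."""
    total = 0.0
    for p, y in pairs:
        # Clip p away from 0 and 1 so that log() stays finite.
        p = min(max(p, epsilon), 1.0 - epsilon)
        total += -log(p) if y == 1 else -log(1.0 - p)
    return total / len(pairs)


def auc(pairs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, y in pairs if y == 1]
    neg = [p for p, y in pairs if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions (made-up values, not Criteo results).
toy = [(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0)]
print("AUC: %.4f" % auc(toy))
print("Log loss: %.4f" % mean_log_loss(toy))
```

On the toy data three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. A model that always predicts 0.5 scores a log loss of ln 2, about 0.693, which is a useful reference point when reading the tables.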

---

##Code

In [1]:
%%writefile hw13_4.py
# Initialise and turn off info messages
import pyspark

from math import exp #  exp(-t) = e^-t
from math import log

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.evaluation import BinaryClassificationMetrics

from collections import defaultdict
import hashlib

import time

start_time = time.time()


sc = pyspark.SparkContext(appName="hw13")
sc.setLogLevel("WARN")

#load raw data
st = time.time()
rawTrainData = sc.textFile('s3://criteo-dataset/rawdata/train/part-*', 2).map(lambda x: x.replace('\t', ','))  # work with either ',' or '\t' separated data
out = rawTrainData.take(1)
et = time.time()
print '\nExample Row from Train Set'
print out
print 'Time to Parse Train: ',et-st

st = time.time()
rawValidationData = sc.textFile('s3://criteo-dataset/rawdata/validation/part-*', 2).map(lambda x: x.replace('\t', ','))  # work with either ',' or '\t' separated data
out = rawValidationData.take(1)
et = time.time()
print '\nExample Row from Validation Set'
print out
print 'Time to Parse Validation: ',et-st

st = time.time()
rawTestData = sc.textFile('s3://criteo-dataset/rawdata/test/part-*', 2).map(lambda x: x.replace('\t', ','))  # work with either ',' or '\t' separated data
out = rawTestData.take(1)
et = time.time()
print '\nExample Row from Test Set'
print out
print 'Time to Parse Test: ',et-st

# Cache the data
rawTrainData.cache()
rawValidationData.cache()
rawTestData.cache()

st = time.time()
nTrain = rawTrainData.count()
nVal = rawValidationData.count()
nTest = rawTestData.count()
print '\nNumber of Observations in Train,Validation,Test'
print nTrain, nVal, nTest
et = time.time()
print 'Time to Count Obs: ',et-st

def hashFunction(numBuckets, rawFeats, printMapping=False):
    """Calculate a feature dictionary for an observation's features based on hashing.

    Note:
        Use printMapping=True for debug purposes and to better understand how the hashing works.

    Args:
        numBuckets (int): Number of buckets to use as features.
        rawFeats (list of (int, str)): A list of features for an observation.  Represented as
            (featureID, value) tuples.
        printMapping (bool, optional): If true, the mappings of featureString to index will be
            printed.

    Returns:
        dict of int to float:  The keys will be integers which represent the buckets that the
            features have been hashed to.  The value for a given key will contain the count of the
            (featureID, value) tuples that have hashed to that key.
    """
    mapping = {}
    for ind, category in rawFeats:
        featureString = category + str(ind)
        mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
    if(printMapping): print mapping
    sparseFeatures = defaultdict(float)
    for bucket in mapping.values():
        sparseFeatures[bucket] += 1.0
    return dict(sparseFeatures)



def parseHashPoint(point, numBuckets):
    """Create a LabeledPoint for this observation using hashing.

    Args:
        point (str): A comma separated string where the first value is the label and the rest are
            features.
        numBuckets: The number of buckets to hash to.

    Returns:
        LabeledPoint: A LabeledPoint with a label (0.0 or 1.0) and a SparseVector of hashed
            features.
    """
    f_tuples = []
    to_list = point.split(',')
    for p in range(len(to_list)-1):
        t = (p,to_list[p+1])
        f_tuples.append(t)
        
    # Keep printMapping off: with True, every row prints its mapping on the executors.
    hash_point = hashFunction(numBuckets,f_tuples,False)
    
    hash_items = hash_point.items()
    hash_items.sort()
    
    label = point.split(',')[0]
    sp = SparseVector(numBuckets,[i[0] for i in hash_items],[i[1] for i in hash_items])
    
    return LabeledPoint(label,sp)

st = time.time()
print '\nHashing Data...'
numBucketsCTR = 1000
hashTrainData = rawTrainData.map(lambda x: parseHashPoint(x,numBucketsCTR))
hashTrainData.cache()
hashValidationData = rawValidationData.map(lambda x: parseHashPoint(x,numBucketsCTR))
hashValidationData.cache()
hashTestData = rawTestData.map(lambda x: parseHashPoint(x,numBucketsCTR))
hashTestData.cache()

#Example Hashed Data
print 'Example Hashed Data Point'
print hashTrainData.take(1)
et = time.time()
print 'Time to Hash Data: ',et-st

def computeLogLoss(p, y):
    """Calculates the value of log loss for a given probability and label.

    Note:
        log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
        and when p is 1 we need to subtract a small value (epsilon) from it.

    Args:
        p (float): A probability between 0 and 1.
        y (int): A label.  Takes on the values 0 and 1.

    Returns:
        float: The log loss value.
    """
    epsilon = 10e-12
    if y==1:
        if p==0:
            p+=epsilon
        l = -log(p)
    else:
        if p==1:
            p-=epsilon
        l = -log(1-p)
    return l


def sigmoid(x):
    return 1.0/(1+exp(-x))


def getP(x, w, intercept):
    """Calculate the probability for an observation given a set of weights and intercept.

    Note:
        We'll bound our raw prediction between 20 and -20 for numerical purposes.

    Args:
        x (SparseVector): A vector with values of 1.0 for features that exist in this
            observation and 0.0 otherwise.
        w (DenseVector): A vector of weights (betas) for the model.
        intercept (float): The model's intercept.

    Returns:
        float: A probability between 0 and 1.
    """
    rawPrediction = intercept + w.dot(x)

    # Bound the raw prediction value
    rawPrediction = min(rawPrediction, 20)
    rawPrediction = max(rawPrediction, -20)
    return sigmoid(rawPrediction)


def evaluateResults(model, data):
    """Calculates the AUC and the mean log loss for the data given the model.

    Args:
        model (LogisticRegressionModel): A trained logistic regression model.
        data (RDD of LabeledPoint): Labels and features for each observation.

    Returns:
        tuple of (float, float): Area under the ROC curve and mean log loss for the data.
    """
    preds_values = data.map(lambda x: (getP(x.features,model.weights,model.intercept),x.label))
    metrics = BinaryClassificationMetrics(preds_values)

    logloss_sum_count = preds_values.map(lambda x: computeLogLoss(x[0],x[1])).map(lambda x: (x,1)).reduce(lambda x,y: (x[0]+y[0],x[1]+y[1]))

    return metrics.areaUnderROC, logloss_sum_count[0]/logloss_sum_count[1]

#Train model
print '\nTraining LR model with 100 iterations'
st = time.time()
model = LogisticRegressionWithSGD.train(hashTrainData, iterations=100, step=10,regType=None,intercept=True)
et = time.time()
print 'Model Training Time: ',et-st

st = time.time()
auc1,logLossVa1 = evaluateResults(model, hashTrainData)
auc2,logLossVa2 = evaluateResults(model, hashValidationData)
auc3,logLossVa3 = evaluateResults(model, hashTestData)
print '\nLogloss Results'
print 'Train: ',logLossVa1
print 'Validation: ',logLossVa2
print 'Test: ',logLossVa3
print '\nAUC Results'
print 'Train: ',auc1
print 'Validation: ',auc2
print 'Test: ',auc3
et = time.time()
print 'Model Evaluation Time: ',et-st

Overwriting hw13_4.py
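
The hashing step in `hw13_4.py` can be tried outside Spark. The sketch below is a standalone port of `hashFunction` under stated assumptions: it targets Python 3 (hence the explicit `.encode`), `hash_features` is a hypothetical name, and the toy feature values are made up rather than taken from the Criteo files. It also shows one property of the `value + index` key worth knowing about: two different (featureID, value) pairs can produce the same key string.

```python
import hashlib
from collections import defaultdict


def hash_features(num_buckets, raw_feats):
    """Map (featureID, value) tuples to {bucket: count}, mirroring hashFunction."""
    mapping = {}
    for ind, category in raw_feats:
        feature_string = category + str(ind)
        digest = hashlib.md5(feature_string.encode("utf-8")).hexdigest()
        mapping[feature_string] = int(digest, 16) % num_buckets
    counts = defaultdict(float)
    for bucket in mapping.values():
        counts[bucket] += 1.0
    return dict(counts)


# Toy observation with three features (hypothetical values).
toy = [(0, "mouse"), (1, "black"), (2, "salmon")]
hashed = hash_features(1000, toy)
print(sorted(hashed.items()))

# The key is value + index, so ("1", index 1) and ("", index 11) both become "11".
ambiguous = [(1, "1"), (11, "")]
collapsed = hash_features(1000, ambiguous)
print(collapsed)  # a single bucket with count 1.0, although two features went in
```

Because `mapping` is keyed by the concatenated string, features whose keys coincide are counted once. This may be why the example hashed data point in the results sums to 38 although the raw row has 39 features. Putting a separator between value and index would avoid the ambiguity, but it would also change the hashed features and therefore the reported numbers.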


---

#Cluster Setup

8x r3.xlarge
(32 CPU cores, 8 x 30.5 GB RAM)

---

#Spark Setup

--num-executors 8
--executor-cores 3
--executor-memory 20G

#Setup Discussion

This setup was chosen so that one executor runs on each server in the cluster, leaving one core spare on each node for overhead and Hadoop tasks.
RAM follows the same logic: each executor is given 20G out of the roughly 30G available on each node, leaving plenty free for other tasks. This setup could certainly be optimized further, but
it works much better than the default settings.
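
A quick arithmetic check of the memory request against the YARN container limit is sketched below. The 20G executor size comes from the Spark Setup above and the 23424 MB limit from the YARN client log in the results. The overhead rule (the larger of 384 MB and 10% of executor memory) is assumed to be the Spark 1.6 default for `spark.yarn.executor.memoryOverhead`; it was not set explicitly in this run.

```python
# Values taken from the Spark Setup and from the YARN client log line
# "maximum memory capability of the cluster (23424 MB per container)".
executor_memory_mb = 20 * 1024
yarn_container_max_mb = 23424
num_executors = 8
executor_cores = 3

# Assumption: default YARN overhead in Spark 1.6 is max(384 MB, 10% of executor memory).
overhead_mb = max(384, executor_memory_mb // 10)
container_request_mb = executor_memory_mb + overhead_mb
headroom_mb = yarn_container_max_mb - container_request_mb

print("Container request: %d MB" % container_request_mb)
print("Headroom under YARN limit: %d MB" % headroom_mb)
print("Requested task slots: %d" % (num_executors * executor_cores))
```

Under these assumptions each executor asks YARN for 22528 MB, which fits below the 23424 MB limit with 896 MB to spare. Two such containers cannot fit on one node, so at most one executor runs per node.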

---

#Results:

    16/04/25 12:08:04 INFO SparkContext: Running Spark version 1.6.1
    16/04/25 12:08:04 INFO SecurityManager: Changing view acls to: hadoop
    16/04/25 12:08:04 INFO SecurityManager: Changing modify acls to: hadoop
    16/04/25 12:08:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
    16/04/25 12:08:05 INFO Utils: Successfully started service 'sparkDriver' on port 45888.
    16/04/25 12:08:05 INFO Slf4jLogger: Slf4jLogger started
    16/04/25 12:08:05 INFO Remoting: Starting remoting
    16/04/25 12:08:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.92.131.186:44657]
    16/04/25 12:08:06 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 44657.
    16/04/25 12:08:06 INFO SparkEnv: Registering MapOutputTracker
    16/04/25 12:08:06 INFO SparkEnv: Registering BlockManagerMaster
    16/04/25 12:08:06 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-49937afb-83e3-4b06-8b47-25370bf45c4b
    16/04/25 12:08:06 INFO MemoryStore: MemoryStore started with capacity 518.1 MB
    16/04/25 12:08:06 INFO SparkEnv: Registering OutputCommitCoordinator
    16/04/25 12:08:06 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    16/04/25 12:08:06 INFO SparkUI: Started SparkUI at http://10.92.131.186:4040
    16/04/25 12:08:07 INFO RMProxy: Connecting to ResourceManager at ip-10-92-131-186.us-west-2.compute.internal/10.92.131.186:8032
    16/04/25 12:08:07 INFO Client: Requesting a new application from cluster with 7 NodeManagers
    16/04/25 12:08:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23424 MB per container)
    16/04/25 12:08:07 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
    16/04/25 12:08:07 INFO Client: Setting up container launch context for our AM
    16/04/25 12:08:07 INFO Client: Setting up the launch environment for our AM container
    16/04/25 12:08:07 INFO Client: Preparing resources for our AM container
    16/04/25 12:08:08 INFO Client: Uploading resource file:/usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2-amzn-1.jar -> hdfs://ip-10-92-131-186.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461582686385_0003/spark-assembly-1.6.1-hadoop2.7.2-amzn-1.jar
    16/04/25 12:08:08 INFO MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461582696406 
    16/04/25 12:08:08 INFO MetricsSaver: Created MetricsSaver j-HPMP0QX6BSW5:i-9118f74c:SparkSubmit:21681 period:60 /mnt/var/em/raw/i-9118f74c_20160425_SparkSubmit_21681_raw.bin
    16/04/25 12:08:09 INFO MetricsSaver: 1 aggregated HDFSWriteBytes 879 raw values into 1 aggregated values, total 1
    16/04/25 12:08:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-92-131-186.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461582686385_0003/pyspark.zip
    16/04/25 12:08:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.9-src.zip -> hdfs://ip-10-92-131-186.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461582686385_0003/py4j-0.9-src.zip
    16/04/25 12:08:10 INFO Client: Uploading resource file:/mnt/tmp/spark-cad40399-5a48-4c89-9683-d12ff639f852/__spark_conf__5482069572366415676.zip -> hdfs://ip-10-92-131-186.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461582686385_0003/__spark_conf__5482069572366415676.zip
    16/04/25 12:08:10 INFO SecurityManager: Changing view acls to: hadoop
    16/04/25 12:08:10 INFO SecurityManager: Changing modify acls to: hadoop
    16/04/25 12:08:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
    16/04/25 12:08:10 INFO Client: Submitting application 3 to ResourceManager
    16/04/25 12:08:10 INFO YarnClientImpl: Submitted application application_1461582686385_0003
    16/04/25 12:08:11 INFO Client: Application report for application_1461582686385_0003 (state: ACCEPTED)
    16/04/25 12:08:11 INFO Client: 
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1461586090245
         final status: UNDEFINED
         tracking URL: http://ip-10-92-131-186.us-west-2.compute.internal:20888/proxy/application_1461582686385_0003/
         user: hadoop
    16/04/25 12:08:12 INFO Client: Application report for application_1461582686385_0003 (state: ACCEPTED)
    16/04/25 12:08:13 INFO Client: Application report for application_1461582686385_0003 (state: ACCEPTED)
    16/04/25 12:08:14 INFO Client: Application report for application_1461582686385_0003 (state: ACCEPTED)
    16/04/25 12:08:14 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
    16/04/25 12:08:14 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-92-131-186.us-west-2.compute.internal, PROXY_URI_BASES -> http://ip-10-92-131-186.us-west-2.compute.internal:20888/proxy/application_1461582686385_0003), /proxy/application_1461582686385_0003
    16/04/25 12:08:14 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
    16/04/25 12:08:15 INFO Client: Application report for application_1461582686385_0003 (state: RUNNING)
    16/04/25 12:08:15 INFO Client: 
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 10.91.128.213
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1461586090245
         final status: UNDEFINED
         tracking URL: http://ip-10-92-131-186.us-west-2.compute.internal:20888/proxy/application_1461582686385_0003/
         user: hadoop
    16/04/25 12:08:15 INFO YarnClientSchedulerBackend: Application application_1461582686385_0003 has started running.
    16/04/25 12:08:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46705.
    16/04/25 12:08:15 INFO NettyBlockTransferService: Server created on 46705
    16/04/25 12:08:15 INFO BlockManager: external shuffle service port = 7337
    16/04/25 12:08:15 INFO BlockManagerMaster: Trying to register BlockManager
    16/04/25 12:08:15 INFO BlockManagerMasterEndpoint: Registering block manager 10.92.131.186:46705 with 518.1 MB RAM, BlockManagerId(driver, 10.92.131.186, 46705)
    16/04/25 12:08:15 INFO BlockManagerMaster: Registered BlockManager
    16/04/25 12:08:15 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1461582686385_0003
    16/04/25 12:08:15 WARN SparkContext: Dynamic Allocation and num executors both set, thus dynamic allocation disabled.
    16/04/25 12:08:20 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-128-213.us-west-2.compute.internal:38152) with ID 6
    16/04/25 12:08:20 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-0-196.us-west-2.compute.internal:46232) with ID 1
    16/04/25 12:08:20 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-128-213.us-west-2.compute.internal:33973 with 14.8 GB RAM, BlockManagerId(6, ip-10-91-128-213.us-west-2.compute.internal, 33973)
    16/04/25 12:08:20 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-0-196.us-west-2.compute.internal:34040 with 14.8 GB RAM, BlockManagerId(1, ip-10-90-0-196.us-west-2.compute.internal, 34040)
    16/04/25 12:08:20 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-136-203.us-west-2.compute.internal:57460) with ID 2
    16/04/25 12:08:20 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-136-203.us-west-2.compute.internal:41496 with 14.8 GB RAM, BlockManagerId(2, ip-10-90-136-203.us-west-2.compute.internal, 41496)
    16/04/25 12:08:21 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-0-52.us-west-2.compute.internal:54924) with ID 3
    16/04/25 12:08:21 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-0-52.us-west-2.compute.internal:32852 with 14.8 GB RAM, BlockManagerId(3, ip-10-90-0-52.us-west-2.compute.internal, 32852)
    16/04/25 12:08:21 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-12-94.us-west-2.compute.internal:46280) with ID 5
    16/04/25 12:08:21 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-12-94.us-west-2.compute.internal:41034 with 14.8 GB RAM, BlockManagerId(5, ip-10-91-12-94.us-west-2.compute.internal, 41034)
    16/04/25 12:08:21 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-135-132.us-west-2.compute.internal:50098) with ID 7
    16/04/25 12:08:21 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-135-132.us-west-2.compute.internal:41542 with 14.8 GB RAM, BlockManagerId(7, ip-10-90-135-132.us-west-2.compute.internal, 41542)
    16/04/25 12:08:22 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-1-108.us-west-2.compute.internal:36832) with ID 4
    16/04/25 12:08:22 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
    Example Row from Train Set
    [u'0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']
    Time to Parse Train:  6.78655314445
    Example Row from Validation Set
    [u'0,,44,4,8,19010,249,28,31,141,,1,,8,05db9164,d833535f,d032c263,c18be181,25c83c98,7e0ccccf,d5b6acf2,0b153874,a73ee510,2acdcf4e,086ac2d2,dfbb09fb,41a6ae00,b28479f6,e2502ec9,84898b2a,e5ba7672,42a2edb9,,,0014c32a,,32c7478e,3b183c5c,,']
    Time to Parse Validation:  1.43855500221
    Example Row from Test Set
    [u'0,0,51,84,4,3633,26,1,4,8,0,1,,4,5a9ed9b0,80e26c9b,97144401,5dbf0cc5,0942e0a7,13718bbd,9ce6136d,0b153874,a73ee510,2106e595,b5bb9d63,04f55317,ab04d8fe,1adce6ef,0ad47a49,2bd32e5c,3486227d,12195b22,21ddcdc9,b1252a9d,fa131867,,dbb486d7,8ecc176a,e8b83407,c43c3f58']
    Time to Parse Test:  4.79847717285
    Time to Cache Data:  0.0238311290741
    Number of Observations in Train,Validation,Test
    36669090 4585184 4586343
    Time to Count Obs:  65.4181230068
    Hashing Data...
    Example Hashed Data Point
    [LabeledPoint(0.0, (1000,[64,101,117,147,178,215,223,268,304,313,321,328,384,385,442,532,601,613,619,621,628,644,650,655,659,680,681,697,721,738,742,824,846,882,903,924],[1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]
    Time to Hash Data:  87.2900378704
    16/04/25 12:30:40 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    16/04/25 12:30:40 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
    Model Training Time:  1232.07148004
    Logloss:  0.556640268951
    AUC:  0.720544437285
    Model Evaluation Time:  193.71164608



    50 iterations - 7x r3.xlarge 1x m3.xlarge

#Results Table:

| Measure | Results|
|-----|----|
| Time to Hash Data |  87.29s |
| Model Training Time |  1232s |
| Logloss |  0.556640268951 |
| AUC |  0.720544437285 |
| Model Evaluation Time |  193.71s |
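
The assignment asks for run times in minutes and seconds, while the table reports seconds. A small conversion sketch follows; `to_min_sec` is a hypothetical helper, and the inputs are the three stage timings from the table (data loading and counting are not included in the sum).

```python
def to_min_sec(seconds):
    """Format a duration in seconds as 'M min SS s', rounded to the nearest second."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return "%d min %02d s" % (minutes, secs)


# Stage timings in seconds, copied from the results table.
timings = [
    ("Time to Hash Data", 87.29),
    ("Model Training Time", 1232.07),
    ("Model Evaluation Time", 193.71),
]

for name, secs in timings:
    print("%s: %s" % (name, to_min_sec(secs)))

total = sum(secs for _, secs in timings)
print("Sum of the three stages: %s" % to_min_sec(total))
```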

===================================================================
===HW 13.5: Criteo Phase 2 hyperparameter tuning  ===
SPECIAL NOTE:
Please share your findings with the class via the Google Group as they become available. You will get brownie points for this. Once results are shared, please use them and build on them.
 

Using the training dataset, validation dataset and testing dataset in the Criteo bucket perform the following experiments:

-- Write Spark code (borrow from Phase 1 of this project) to train a logistic regression model with various hyperparameters. Do a grid search of the hyperparameter space and determine the optimal settings using the validation set.

-- Number of buckets for hashing: 1,000, 10,000, .... explore different values  here
-- Logistic Regression: regularization term: [1e-6, 1e-3]  explore other  values here also
-- Logistic Regression: step size: explore different step sizes. Focus on a stepsize of 1 initially. 

Report the AWS cluster configuration that you used and how long in minutes and seconds it takes to complete this job.

Report in tabular form and using heatmaps the AUC values (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) for the Training, Validation, and Testing datasets.
Report in tabular form and using heatmaps  the logLossTest for the Training, Validation, and Testing datasets.

Don't forget to put a caption on your tables (above the table) and on your heatmap figures (below the figure) detailing the experiment associated with each table or figure (data, algorithm used, parameters and settings explored).

Discuss the optimal setting to solve this problem  in terms of the following:
-- Features
-- Learning algorithm
-- Spark cluster

Justify your recommendations based on your experimental results and cross-reference table numbers and figure numbers. Also highlight key results with annotations, both textual and line- and box-based, on your tables and graphs.
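
The grid search described in the task can be enumerated up front, which makes it easy to see how many models will be trained before launching the cluster. This is only a sketch: the candidate lists are illustrative picks from the ranges named in the task, and the dictionary keys are hypothetical labels rather than arguments passed to Spark.

```python
from itertools import product

# Illustrative candidate values; extend these lists to widen the search.
bucket_counts = [1000, 10000]
reg_types = ["l1", "l2"]
reg_params = [1e-6, 1e-3]
step_sizes = [1.0]

# product() varies the last list fastest, so the bucket count changes slowest
# and each hashed dataset only needs to be built and cached once.
grid = [
    {"numBuckets": b, "regType": t, "regParam": r, "step": s}
    for b, t, r, s in product(bucket_counts, reg_types, reg_params, step_sizes)
]

print("Models to train: %d" % len(grid))
for config in grid:
    print(config)
```

Keeping the bucket count in the outermost position matters because re-hashing the full training set is far more expensive than retraining on an already cached RDD.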


In [None]:
%%writefile hw13_5.py
# HW 13.5
# Thomas Atkins

# Initialise and turn off info messages
import pyspark

from math import exp #  exp(-t) = e^-t
from math import log

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.evaluation import BinaryClassificationMetrics

from collections import defaultdict
import hashlib

import time

start_time = time.time()


sc = pyspark.SparkContext(appName="hw13")
sc.setLogLevel("WARN")

#load raw data
st = time.time()
rawTrainData = sc.textFile('s3://criteo-dataset/rawdata/train/part-*', 2).map(lambda x: x.replace('\t', ','))  # work with either ',' or '\t' separated data
out = rawTrainData.take(1)
et = time.time()
print '\nExample Row from Train Set'
print out
print 'Time to Parse Train: ',et-st

st = time.time()
rawValidationData = sc.textFile('s3://criteo-dataset/rawdata/validation/part-*', 2).map(lambda x: x.replace('\t', ','))  # work with either ',' or '\t' separated data
out = rawValidationData.take(1)
et = time.time()
print '\nExample Row from Validation Set'
print out
print 'Time to Parse Validation: ',et-st

st = time.time()
rawTestData = sc.textFile('s3://criteo-dataset/rawdata/test/part-*', 2).map(lambda x: x.replace('\t', ','))  # work with either ',' or '\t' separated data
out = rawTestData.take(1)
et = time.time()
print '\nExample Row from Test Set'
print out
print 'Time to Parse Test: ',et-st

# Cache the data
rawTrainData.cache()
rawValidationData.cache()
rawTestData.cache()

st = time.time()
nTrain = rawTrainData.count()
nVal = rawValidationData.count()
nTest = rawTestData.count()
print '\nNumber of Observations in Train,Validation,Test'
print nTrain, nVal, nTest
et = time.time()
print 'Time to Count Obs: ',et-st



def hashFunction(numBuckets, rawFeats, printMapping=False):
    """Calculate a feature dictionary for an observation's features based on hashing.

    Note:
        Use printMapping=True for debug purposes and to better understand how the hashing works.

    Args:
        numBuckets (int): Number of buckets to use as features.
        rawFeats (list of (int, str)): A list of features for an observation.  Represented as
            (featureID, value) tuples.
        printMapping (bool, optional): If true, the mappings of featureString to index will be
            printed.

    Returns:
        dict of int to float:  The keys will be integers which represent the buckets that the
            features have been hashed to.  The value for a given key will contain the count of the
            (featureID, value) tuples that have hashed to that key.
    """
    mapping = {}
    for ind, category in rawFeats:
        featureString = category + str(ind)
        mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
    if(printMapping): print mapping
    sparseFeatures = defaultdict(float)
    for bucket in mapping.values():
        sparseFeatures[bucket] += 1.0
    return dict(sparseFeatures)



def parseHashPoint(point, numBuckets):
    """Create a LabeledPoint for this observation using hashing.

    Args:
        point (str): A comma separated string where the first value is the label and the rest are
            features.
        numBuckets: The number of buckets to hash to.

    Returns:
        LabeledPoint: A LabeledPoint with a label (0.0 or 1.0) and a SparseVector of hashed
            features.
    """
    f_tuples = []
    to_list = point.split(',')
    for p in range(len(to_list)-1):
        t = (p,to_list[p+1])
        f_tuples.append(t)
        
    # Keep printMapping off: with True, every row prints its mapping on the executors.
    hash_point = hashFunction(numBuckets,f_tuples,False)
    
    hash_items = hash_point.items()
    hash_items.sort()
    
    label = point.split(',')[0]
    sp = SparseVector(numBuckets,[i[0] for i in hash_items],[i[1] for i in hash_items])
    
    return LabeledPoint(label,sp)


def computeLogLoss(p, y):
    """Calculates the value of log loss for a given probability and label.

    Note:
        log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
        and when p is 1 we need to subtract a small value (epsilon) from it.

    Args:
        p (float): A probability between 0 and 1.
        y (int): A label.  Takes on the values 0 and 1.

    Returns:
        float: The log loss value.
    """
    epsilon = 10e-12
    if y==1:
        if p==0:
            p+=epsilon
        l = -log(p)
    else:
        if p==1:
            p-=epsilon
        l = -log(1-p)
    return l


def sigmoid(x):
    return 1.0/(1+exp(-x))


def getP(x, w, intercept):
    """Calculate the probability for an observation given a set of weights and intercept.

    Note:
        We'll bound our raw prediction between 20 and -20 for numerical purposes.

    Args:
        x (SparseVector): A vector with values of 1.0 for features that exist in this
            observation and 0.0 otherwise.
        w (DenseVector): A vector of weights (betas) for the model.
        intercept (float): The model's intercept.

    Returns:
        float: A probability between 0 and 1.
    """
    rawPrediction = intercept + w.dot(x)

    # Bound the raw prediction value
    rawPrediction = min(rawPrediction, 20)
    rawPrediction = max(rawPrediction, -20)
    return sigmoid(rawPrediction)


def evaluateResults(model, data):
    """Calculates the AUC and the mean log loss for the data given the model.

    Args:
        model (LogisticRegressionModel): A trained logistic regression model.
        data (RDD of LabeledPoint): Labels and features for each observation.

    Returns:
        tuple of (float, float): Area under the ROC curve and mean log loss for the data.
    """
    preds_values = data.map(lambda x: (getP(x.features,model.weights,model.intercept),x.label))
    metrics = BinaryClassificationMetrics(preds_values)

    logloss_sum_count = preds_values.map(lambda x: computeLogLoss(x[0],x[1])).map(lambda x: (x,1)).reduce(lambda x,y: (x[0]+y[0],x[1]+y[1]))

    return metrics.areaUnderROC, logloss_sum_count[0]/logloss_sum_count[1]




#grid search loop

for buckets in [10000]:
    print '\nHashing Data: ',buckets
    numBucketsCTR = buckets
    hashTrainData = rawTrainData.map(lambda x: parseHashPoint(x,numBucketsCTR))
    hashTrainData.cache()
    hashValidationData = rawValidationData.map(lambda x: parseHashPoint(x,numBucketsCTR))
    hashValidationData.cache()
    hashTestData = rawTestData.map(lambda x: parseHashPoint(x,numBucketsCTR))
    hashTestData.cache()
    for regtype in ['l1','l2']:
        print 'Regularization type: ',regtype
        for reg in [1e-6,1e-5,1e-4,1e-3,1e-2,1e-1]:
            st = time.time()
            print '\nTraining LR model (iter=100, step=10) with', regtype, 'reg at: ',reg
            model = LogisticRegressionWithSGD.train(hashTrainData, iterations=100, step=10,regType=regtype,regParam=reg,intercept=True)
            et = time.time()
            print 'Model Training Time: ',et-st
            st = time.time()
            auc1,logLossVa1 = evaluateResults(model, hashTrainData)
            auc2,logLossVa2 = evaluateResults(model, hashValidationData)
            auc3,logLossVa3 = evaluateResults(model, hashTestData)
            print '\nLogloss Results'
            print 'Train: ',logLossVa1
            print 'Validation: ',logLossVa2
            print 'Test: ',logLossVa3
            print '\nAUC Results'
            print 'Train: ',auc1
            print 'Validation: ',auc2
            print 'Test: ',auc3
            et = time.time()
            print 'Model Evaluation Time: ',et-st

#Final Grid Search Results

Round 1 grid search was run over the following parameters:

    num_hash_bins in [100,1000,10000]
    regularization type = L2
    regularization_parameter in [1e-6,1e-3]

#Table of results (Validation Dataset)

|           |  hashbins=100  |  hashbins=1000  |  hashbins=10000  |
|-----------|----------------|-----------------|------------------|
| reg=1e-6  | ll=1.10846, auc=0.6642 | ll=0.50567, auc=0.72432 | ll=0.49764, auc=0.73772 |
| reg=1e-3  | ll=0.97682, auc=0.6644 | ll=0.50594, auc=0.72396 | ll=0.498389, auc=0.736909 |

With 10,000 hash bins the validation log loss is lowest and the AUC is highest at both regularization values, so 10,000 bins is the best setting in this round.

Round 2 grid search:

    num_hash_bins = 10000
    regularization type in L1,L2
    regularization_parameter in [1e-6,1e-5,1e-4,1e-3,1e-2,1e-1]
    
#Table of results (Validation Dataset; ll = log loss, auc = area under the ROC curve)

| regParam  |       L1       |        L2        |
|-----------|----------------|------------------|
| reg=1e-6  | ll=0.497682, auc=0.737658 | ll=0.497643, auc=0.737721 |
| reg=1e-5  | ll=0.498035, auc=0.737076 | ll=0.497644, auc=0.737709 |
| reg=1e-4  | ll=0.500823, auc=0.732535 | ll=0.497702, auc=0.737635 |
| reg=1e-3  | ll=0.511275, auc=0.715735 | ll=0.498389, auc=0.736910 |
| reg=1e-2  | ll=0.542069, auc=0.661522 | ll=0.506395, auc=0.728392 |
| reg=1e-1  | ll=0.592235, auc=0.500000 | ll=0.535102, auc=0.696007 |
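
The Round 2 table can be ranked programmatically rather than by eye. A minimal sketch in plain Python, assuming the validation values below are copied from the Spark output; the dictionary `round2` and the variable names are illustrative and are not part of `hw13_4.py`.

```python
# Validation (log loss, AUC) from Round 2 at 10,000 hash bins,
# keyed by (regType, regParam).
round2 = {
    ("l1", 1e-6): (0.497682, 0.737658),
    ("l1", 1e-5): (0.498035, 0.737076),
    ("l1", 1e-4): (0.500823, 0.732535),
    ("l1", 1e-3): (0.511275, 0.715735),
    ("l1", 1e-2): (0.542069, 0.661522),
    ("l1", 1e-1): (0.592235, 0.500000),
    ("l2", 1e-6): (0.497643, 0.737721),
    ("l2", 1e-5): (0.497644, 0.737709),
    ("l2", 1e-4): (0.497702, 0.737635),
    ("l2", 1e-3): (0.498389, 0.736910),
    ("l2", 1e-2): (0.506395, 0.728392),
    ("l2", 1e-1): (0.535102, 0.696007),
}

# Rank configurations by validation log loss (lower is better).
ranked = sorted(round2.items(), key=lambda kv: kv[1][0])
for (reg_type, reg_param), (ll, auc) in ranked[:3]:
    print("%s reg=%g ll=%.6f auc=%.6f" % (reg_type, reg_param, ll, auc))

best_by_logloss = ranked[0][0]
# Cross-check with AUC (higher is better).
best_by_auc = max(round2, key=lambda k: round2[k][1])
```

Ranking on log loss and cross-checking with AUC keeps the choice reproducible when the top entries differ only in the fourth or fifth decimal place.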

Best Result (lowest validation log loss and highest validation AUC):
    
    Hash bins = 10000
    regularization type = l2
    regularization parameter = 1e-6
    
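For each hash size, the first model took roughly 1,230 to 1,290 s to train and 610 to 630 s to evaluate. Subsequent models on the same hashed data took roughly 150 to 170 s to train and 290 to 360 s to evaluate, because the hashed data was already cached. The one exception is the L1 run at reg=0.1, which trained in about 78 s.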
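
The assignment asks for run times in minutes and seconds, while the script prints raw seconds. A small sketch of the conversion, using the hash bins = 100 timings from the Round 1 output; the helper `to_min_sec` is hypothetical and not part of `hw13_4.py`.

```python
def to_min_sec(seconds):
    """Convert a duration in seconds to (minutes, seconds), rounded to the nearest second."""
    total = int(round(seconds))
    return divmod(total, 60)

# Timings in seconds, copied from the Round 1 output for hash bins = 100.
timings = [
    ("first model, train", 1229.81804395),
    ("first model, eval", 610.810709953),
    ("cached model, train", 146.848901987),
    ("cached model, eval", 348.263712168),
]

for label, secs in timings:
    minutes, rem = to_min_sec(secs)
    print("%s: %d min %02d s" % (label, minutes, rem))
```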

#Spark Output from Grid Search Round 1:

16/04/26 03:41:48 INFO SparkContext: Running Spark version 1.6.1
16/04/26 03:41:49 INFO SecurityManager: Changing view acls to: hadoop
16/04/26 03:41:49 INFO SecurityManager: Changing modify acls to: hadoop
16/04/26 03:41:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/04/26 03:41:50 INFO Utils: Successfully started service 'sparkDriver' on port 39712.
16/04/26 03:41:50 INFO Slf4jLogger: Slf4jLogger started
16/04/26 03:41:50 INFO Remoting: Starting remoting
16/04/26 03:41:50 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.212.135.252:35187]
16/04/26 03:41:50 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 35187.
16/04/26 03:41:50 INFO SparkEnv: Registering MapOutputTracker
16/04/26 03:41:50 INFO SparkEnv: Registering BlockManagerMaster
16/04/26 03:41:50 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-4a0cd3e5-bb74-4633-8831-3b6d5ef53085
16/04/26 03:41:50 INFO MemoryStore: MemoryStore started with capacity 518.1 MB
16/04/26 03:41:51 INFO SparkEnv: Registering OutputCommitCoordinator
16/04/26 03:41:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/26 03:41:51 INFO SparkUI: Started SparkUI at http://10.212.135.252:4040
16/04/26 03:41:51 INFO RMProxy: Connecting to ResourceManager at ip-10-212-135-252.us-west-2.compute.internal/10.212.135.252:8032
16/04/26 03:41:51 INFO Client: Requesting a new application from cluster with 7 NodeManagers
16/04/26 03:41:52 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23424 MB per container)
16/04/26 03:41:52 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/04/26 03:41:52 INFO Client: Setting up container launch context for our AM
16/04/26 03:41:52 INFO Client: Setting up the launch environment for our AM container
16/04/26 03:41:52 INFO Client: Preparing resources for our AM container
16/04/26 03:41:52 INFO Client: Uploading resource file:/usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2-amzn-1.jar -> hdfs://ip-10-212-135-252.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461641531592_0001/spark-assembly-1.6.1-hadoop2.7.2-amzn-1.jar
16/04/26 03:41:52 INFO MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461641540464 
16/04/26 03:41:52 INFO MetricsSaver: Created MetricsSaver j-1LIGG8UX5SGH4:i-ce826b13:SparkSubmit:10036 period:60 /mnt/var/em/raw/i-ce826b13_20160426_SparkSubmit_10036_raw.bin
16/04/26 03:41:54 INFO MetricsSaver: 1 aggregated HDFSWriteDelay 738 raw values into 1 aggregated values, total 1
16/04/26 03:41:56 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-212-135-252.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461641531592_0001/pyspark.zip
16/04/26 03:41:56 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.9-src.zip -> hdfs://ip-10-212-135-252.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461641531592_0001/py4j-0.9-src.zip
16/04/26 03:41:56 INFO Client: Uploading resource file:/mnt/tmp/spark-32295e7f-990c-4403-8914-7fdf890fc5a9/__spark_conf__2321669995725229919.zip -> hdfs://ip-10-212-135-252.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461641531592_0001/__spark_conf__2321669995725229919.zip
16/04/26 03:41:56 INFO SecurityManager: Changing view acls to: hadoop
16/04/26 03:41:56 INFO SecurityManager: Changing modify acls to: hadoop
16/04/26 03:41:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/04/26 03:41:56 INFO Client: Submitting application 1 to ResourceManager
16/04/26 03:41:56 INFO YarnClientImpl: Submitted application application_1461641531592_0001
16/04/26 03:41:57 INFO Client: Application report for application_1461641531592_0001 (state: ACCEPTED)
16/04/26 03:41:57 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1461642116823
	 final status: UNDEFINED
	 tracking URL: http://ip-10-212-135-252.us-west-2.compute.internal:20888/proxy/application_1461641531592_0001/
	 user: hadoop
16/04/26 03:41:58 INFO Client: Application report for application_1461641531592_0001 (state: ACCEPTED)
16/04/26 03:41:59 INFO Client: Application report for application_1461641531592_0001 (state: ACCEPTED)
16/04/26 03:42:00 INFO Client: Application report for application_1461641531592_0001 (state: ACCEPTED)
16/04/26 03:42:01 INFO Client: Application report for application_1461641531592_0001 (state: ACCEPTED)
16/04/26 03:42:02 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/04/26 03:42:02 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-212-135-252.us-west-2.compute.internal, PROXY_URI_BASES -> http://ip-10-212-135-252.us-west-2.compute.internal:20888/proxy/application_1461641531592_0001), /proxy/application_1461641531592_0001
16/04/26 03:42:02 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/04/26 03:42:02 INFO Client: Application report for application_1461641531592_0001 (state: RUNNING)
16/04/26 03:42:02 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.213.3.111
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1461642116823
	 final status: UNDEFINED
	 tracking URL: http://ip-10-212-135-252.us-west-2.compute.internal:20888/proxy/application_1461641531592_0001/
	 user: hadoop
16/04/26 03:42:02 INFO YarnClientSchedulerBackend: Application application_1461641531592_0001 has started running.
16/04/26 03:42:02 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33768.
16/04/26 03:42:02 INFO NettyBlockTransferService: Server created on 33768
16/04/26 03:42:02 INFO BlockManager: external shuffle service port = 7337
16/04/26 03:42:02 INFO BlockManagerMaster: Trying to register BlockManager
16/04/26 03:42:02 INFO BlockManagerMasterEndpoint: Registering block manager 10.212.135.252:33768 with 518.1 MB RAM, BlockManagerId(driver, 10.212.135.252, 33768)
16/04/26 03:42:02 INFO BlockManagerMaster: Registered BlockManager
16/04/26 03:42:03 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1461641531592_0001
16/04/26 03:42:03 WARN SparkContext: Dynamic Allocation and num executors both set, thus dynamic allocation disabled.
16/04/26 03:42:07 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-213-3-111.us-west-2.compute.internal:59612) with ID 2
16/04/26 03:42:07 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-213-3-111.us-west-2.compute.internal:40798 with 14.8 GB RAM, BlockManagerId(2, ip-10-213-3-111.us-west-2.compute.internal, 40798)
16/04/26 03:42:09 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-130-200.us-west-2.compute.internal:39414) with ID 1
16/04/26 03:42:09 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-130-200.us-west-2.compute.internal:39726 with 14.8 GB RAM, BlockManagerId(1, ip-10-91-130-200.us-west-2.compute.internal, 39726)
16/04/26 03:42:09 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-133-154.us-west-2.compute.internal:57126) with ID 3
16/04/26 03:42:09 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-133-154.us-west-2.compute.internal:44718 with 14.8 GB RAM, BlockManagerId(3, ip-10-90-133-154.us-west-2.compute.internal, 44718)
16/04/26 03:42:09 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-129-113.us-west-2.compute.internal:52208) with ID 4
16/04/26 03:42:09 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-129-113.us-west-2.compute.internal:43838 with 14.8 GB RAM, BlockManagerId(4, ip-10-91-129-113.us-west-2.compute.internal, 43838)
16/04/26 03:42:10 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-132-144.us-west-2.compute.internal:57474) with ID 5
16/04/26 03:42:10 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-132-144.us-west-2.compute.internal:34022 with 14.8 GB RAM, BlockManagerId(5, ip-10-91-132-144.us-west-2.compute.internal, 34022)
16/04/26 03:42:10 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-219-146-229.us-west-2.compute.internal:39870) with ID 6
16/04/26 03:42:10 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-219-146-229.us-west-2.compute.internal:36601 with 14.8 GB RAM, BlockManagerId(6, ip-10-219-146-229.us-west-2.compute.internal, 36601)
16/04/26 03:42:10 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-2-42.us-west-2.compute.internal:44886) with ID 7
16/04/26 03:42:10 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
16/04/26 03:42:10 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-2-42.us-west-2.compute.internal:42391 with 14.8 GB RAM, BlockManagerId(7, ip-10-90-2-42.us-west-2.compute.internal, 42391)

Example Row from Train Set
[u'0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']
Time to Parse Train:  9.07977414131

Example Row from Validation Set
[u'0,,44,4,8,19010,249,28,31,141,,1,,8,05db9164,d833535f,d032c263,c18be181,25c83c98,7e0ccccf,d5b6acf2,0b153874,a73ee510,2acdcf4e,086ac2d2,dfbb09fb,41a6ae00,b28479f6,e2502ec9,84898b2a,e5ba7672,42a2edb9,,,0014c32a,,32c7478e,3b183c5c,,']
Time to Parse Validation:  8.05411291122

Example Row from Test Set
[u'0,0,51,84,4,3633,26,1,4,8,0,1,,4,5a9ed9b0,80e26c9b,97144401,5dbf0cc5,0942e0a7,13718bbd,9ce6136d,0b153874,a73ee510,2106e595,b5bb9d63,04f55317,ab04d8fe,1adce6ef,0ad47a49,2bd32e5c,3486227d,12195b22,21ddcdc9,b1252a9d,fa131867,,dbb486d7,8ecc176a,e8b83407,c43c3f58']
Time to Parse Test:  5.89245796204

Number of Observations in Train,Validation,Test
36669090 4585184 4586343
Time to Count Obs:  61.4934830666

Hashing Data:  100

Training LR model (iter=100, step=10) with L2 reg at:  1e-06
Model Training Time:  1229.81804395

Logloss Results
Train:  1.10852921043
Validation:  1.10846248417
Test:  1.10826315059

AUC Results
Train:  0.664292839071
Validation:  0.664209808916
Test:  0.664419051281
Model Evaluation Time:  610.810709953

Training LR model (iter=100, step=10) with L2 reg at:  0.001
Model Training Time:  146.848901987

Logloss Results
Train:  0.976837143855
Validation:  0.976819216695
Test:  0.976650481976

AUC Results
Train:  0.664510392125
Validation:  0.664445791117
Test:  0.664637055162
Model Evaluation Time:  348.263712168

Hashing Data:  1000

Training LR model (iter=100, step=10) with L2 reg at:  1e-06
Model Training Time:  1264.95922017

Logloss Results
Train:  0.505464359462
Validation:  0.505676478333
Test:  0.505603161929

AUC Results
Train:  0.724548685426
Validation:  0.724325440631
Test:  0.724796040475
Model Evaluation Time:  629.481713057

Training LR model (iter=100, step=10) with L2 reg at:  0.001
Model Training Time:  149.754159927

Logloss Results
Train:  0.505730100218
Validation:  0.505940997868
Test:  0.505865517722

AUC Results
Train:  0.724190398037
Validation:  0.723966486731
Test:  0.724449373923
Model Evaluation Time:  354.413424969

Hashing Data:  10000

Training LR model (iter=100, step=10) with L2 reg at:  1e-06
Model Training Time:  1285.08617783

Logloss Results
Train:  0.497478074001
Validation:  0.497643323262
Test:  0.497676313325

AUC Results
Train:  0.737903312714
Validation:  0.737721358332
Test:  0.737982324792
Model Evaluation Time:  628.86914897

Training LR model (iter=100, step=10) with L2 reg at:  0.001
Model Training Time:  169.95717907

Logloss Results
Train:  0.498224333378
Validation:  0.498389184961
Test:  0.49841839104

AUC Results
Train:  0.737090789316
Validation:  0.736909791352
Test:  0.737180830772
Model Evaluation Time:  361.28002882
16/04/26 05:43:15 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(7,WrappedArray())
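
The result tables were filled in from driver output like the above. One way to extract the numbers automatically is sketched below in plain Python. It assumes the exact print format shown in this output, the function name `parse_grid_output` is hypothetical, and it records only the regularization parameter and the metrics, not the hash size or regularization type.

```python
import re

def parse_grid_output(text):
    """Collect Train/Validation/Test log loss and AUC for each model in the driver output."""
    runs, current, section = [], None, None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Training LR model .* reg at:\s+(\S+)", line)
        if m:
            # A new model starts here; its metrics follow in two sections.
            current = {"reg": float(m.group(1)), "logloss": {}, "auc": {}}
            runs.append(current)
            section = None
        elif line == "Logloss Results":
            section = "logloss"
        elif line == "AUC Results":
            section = "auc"
        elif current is not None and section is not None:
            m = re.match(r"(Train|Validation|Test):\s+(\S+)", line)
            if m:
                current[section][m.group(1)] = float(m.group(2))
    return runs

# One model's worth of output, copied from the hash bins = 10000 run above.
sample = """
Training LR model (iter=100, step=10) with L2 reg at:  1e-06
Model Training Time:  1285.08617783

Logloss Results
Train:  0.497478074001
Validation:  0.497643323262
Test:  0.497676313325

AUC Results
Train:  0.737903312714
Validation:  0.737721358332
Test:  0.737982324792
Model Evaluation Time:  628.86914897
"""

runs = parse_grid_output(sample)
print("%.6f %.6f" % (runs[0]["logloss"]["Validation"], runs[0]["auc"]["Validation"]))
```

Spark log lines and the timing lines match none of the patterns, so they are skipped.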


#Spark Output from Grid Search Round 2:

16/04/28 13:56:45 INFO SparkContext: Running Spark version 1.6.1
16/04/28 13:56:45 INFO SecurityManager: Changing view acls to: hadoop
16/04/28 13:56:45 INFO SecurityManager: Changing modify acls to: hadoop
16/04/28 13:56:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/04/28 13:56:46 INFO Utils: Successfully started service 'sparkDriver' on port 38975.
16/04/28 13:56:46 INFO Slf4jLogger: Slf4jLogger started
16/04/28 13:56:46 INFO Remoting: Starting remoting
16/04/28 13:56:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.90.137.243:33743]
16/04/28 13:56:46 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 33743.
16/04/28 13:56:46 INFO SparkEnv: Registering MapOutputTracker
16/04/28 13:56:47 INFO SparkEnv: Registering BlockManagerMaster
16/04/28 13:56:47 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-a5405cae-646c-47b8-8afe-9e3b9ab20a54
16/04/28 13:56:47 INFO MemoryStore: MemoryStore started with capacity 518.1 MB
16/04/28 13:56:47 INFO SparkEnv: Registering OutputCommitCoordinator
16/04/28 13:56:47 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/28 13:56:47 INFO SparkUI: Started SparkUI at http://10.90.137.243:4040
16/04/28 13:56:47 INFO RMProxy: Connecting to ResourceManager at ip-10-90-137-243.us-west-2.compute.internal/10.90.137.243:8032
16/04/28 13:56:48 INFO Client: Requesting a new application from cluster with 7 NodeManagers
16/04/28 13:56:48 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23424 MB per container)
16/04/28 13:56:48 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/04/28 13:56:48 INFO Client: Setting up container launch context for our AM
16/04/28 13:56:48 INFO Client: Setting up the launch environment for our AM container
16/04/28 13:56:48 INFO Client: Preparing resources for our AM container
16/04/28 13:56:48 INFO Client: Uploading resource file:/usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2-amzn-1.jar -> hdfs://ip-10-90-137-243.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461851381259_0001/spark-assembly-1.6.1-hadoop2.7.2-amzn-1.jar
16/04/28 13:56:48 INFO MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461851391084 
16/04/28 13:56:48 INFO MetricsSaver: Created MetricsSaver j-2SDBRGE0BU088:i-b1d4216c:SparkSubmit:08985 period:60 /mnt/var/em/raw/i-b1d4216c_20160428_SparkSubmit_08985_raw.bin
16/04/28 13:56:51 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-90-137-243.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461851381259_0001/pyspark.zip
16/04/28 13:56:52 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.9-src.zip -> hdfs://ip-10-90-137-243.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461851381259_0001/py4j-0.9-src.zip
16/04/28 13:56:52 INFO Client: Uploading resource file:/mnt/tmp/spark-70115be3-a468-46f2-b06a-bc0273ebba63/__spark_conf__4693610164360111789.zip -> hdfs://ip-10-90-137-243.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1461851381259_0001/__spark_conf__4693610164360111789.zip
16/04/28 13:56:53 INFO SecurityManager: Changing view acls to: hadoop
16/04/28 13:56:53 INFO SecurityManager: Changing modify acls to: hadoop
16/04/28 13:56:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/04/28 13:56:53 INFO Client: Submitting application 1 to ResourceManager
16/04/28 13:56:53 INFO YarnClientImpl: Submitted application application_1461851381259_0001
16/04/28 13:56:54 INFO Client: Application report for application_1461851381259_0001 (state: ACCEPTED)
16/04/28 13:56:54 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1461851813382
	 final status: UNDEFINED
	 tracking URL: http://ip-10-90-137-243.us-west-2.compute.internal:20888/proxy/application_1461851381259_0001/
	 user: hadoop
16/04/28 13:56:55 INFO Client: Application report for application_1461851381259_0001 (state: ACCEPTED)
16/04/28 13:56:56 INFO Client: Application report for application_1461851381259_0001 (state: ACCEPTED)
16/04/28 13:56:57 INFO Client: Application report for application_1461851381259_0001 (state: ACCEPTED)
16/04/28 13:56:58 INFO Client: Application report for application_1461851381259_0001 (state: ACCEPTED)
16/04/28 13:56:59 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/04/28 13:56:59 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-90-137-243.us-west-2.compute.internal, PROXY_URI_BASES -> http://ip-10-90-137-243.us-west-2.compute.internal:20888/proxy/application_1461851381259_0001), /proxy/application_1461851381259_0001
16/04/28 13:56:59 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/04/28 13:56:59 INFO Client: Application report for application_1461851381259_0001 (state: RUNNING)
16/04/28 13:56:59 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.91.12.216
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1461851813382
	 final status: UNDEFINED
	 tracking URL: http://ip-10-90-137-243.us-west-2.compute.internal:20888/proxy/application_1461851381259_0001/
	 user: hadoop
16/04/28 13:56:59 INFO YarnClientSchedulerBackend: Application application_1461851381259_0001 has started running.
16/04/28 13:56:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45417.
16/04/28 13:56:59 INFO NettyBlockTransferService: Server created on 45417
16/04/28 13:56:59 INFO BlockManager: external shuffle service port = 7337
16/04/28 13:56:59 INFO BlockManagerMaster: Trying to register BlockManager
16/04/28 13:56:59 INFO BlockManagerMasterEndpoint: Registering block manager 10.90.137.243:45417 with 518.1 MB RAM, BlockManagerId(driver, 10.90.137.243, 45417)
16/04/28 13:56:59 INFO BlockManagerMaster: Registered BlockManager
16/04/28 13:56:59 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1461851381259_0001
16/04/28 13:56:59 WARN SparkContext: Dynamic Allocation and num executors both set, thus dynamic allocation disabled.
16/04/28 13:57:04 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-12-216.us-west-2.compute.internal:42550) with ID 1
16/04/28 13:57:04 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-12-216.us-west-2.compute.internal:36728 with 14.8 GB RAM, BlockManagerId(1, ip-10-91-12-216.us-west-2.compute.internal, 36728)
16/04/28 13:57:06 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-12-94.us-west-2.compute.internal:54060) with ID 2
16/04/28 13:57:06 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-12-94.us-west-2.compute.internal:38928 with 14.8 GB RAM, BlockManagerId(2, ip-10-91-12-94.us-west-2.compute.internal, 38928)
16/04/28 13:57:06 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-136-245.us-west-2.compute.internal:37036) with ID 3
16/04/28 13:57:06 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-136-245.us-west-2.compute.internal:32886 with 14.8 GB RAM, BlockManagerId(3, ip-10-90-136-245.us-west-2.compute.internal, 32886)
16/04/28 13:57:07 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-129-20.us-west-2.compute.internal:36488) with ID 4
16/04/28 13:57:07 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-129-20.us-west-2.compute.internal:38217 with 14.8 GB RAM, BlockManagerId(4, ip-10-91-129-20.us-west-2.compute.internal, 38217)
16/04/28 13:57:07 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-90-6-144.us-west-2.compute.internal:53384) with ID 7
16/04/28 13:57:07 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-90-6-144.us-west-2.compute.internal:44993 with 14.8 GB RAM, BlockManagerId(7, ip-10-90-6-144.us-west-2.compute.internal, 44993)
16/04/28 13:57:07 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-128-93.us-west-2.compute.internal:35388) with ID 5
16/04/28 13:57:07 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-128-93.us-west-2.compute.internal:33347 with 14.8 GB RAM, BlockManagerId(5, ip-10-91-128-93.us-west-2.compute.internal, 33347)
16/04/28 13:57:08 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-91-129-249.us-west-2.compute.internal:57358) with ID 6
16/04/28 13:57:08 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-91-129-249.us-west-2.compute.internal:38492 with 14.8 GB RAM, BlockManagerId(6, ip-10-91-129-249.us-west-2.compute.internal, 38492)
16/04/28 13:57:08 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8

Example Row from Train Set
[u'0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']
Time to Parse Train:  8.54659581184

Example Row from Validation Set
[u'0,,44,4,8,19010,249,28,31,141,,1,,8,05db9164,d833535f,d032c263,c18be181,25c83c98,7e0ccccf,d5b6acf2,0b153874,a73ee510,2acdcf4e,086ac2d2,dfbb09fb,41a6ae00,b28479f6,e2502ec9,84898b2a,e5ba7672,42a2edb9,,,0014c32a,,32c7478e,3b183c5c,,']
Time to Parse Validation:  6.44285392761

Example Row from Test Set
[u'0,0,51,84,4,3633,26,1,4,8,0,1,,4,5a9ed9b0,80e26c9b,97144401,5dbf0cc5,0942e0a7,13718bbd,9ce6136d,0b153874,a73ee510,2106e595,b5bb9d63,04f55317,ab04d8fe,1adce6ef,0ad47a49,2bd32e5c,3486227d,12195b22,21ddcdc9,b1252a9d,fa131867,,dbb486d7,8ecc176a,e8b83407,c43c3f58']
Time to Parse Test:  1.26228404045

Number of Observations in Train,Validation,Test
36669090 4585184 4586343
Time to Count Obs:  59.7730550766

Hashing Data:  10000
Regularization type:  l1

Training LR model (iter=100, step=10) with L2 reg at:  1e-06
16/04/28 14:17:11 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/04/28 14:17:11 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Model Training Time:  1259.64682484

Logloss Results
Train:  0.497516612187
Validation:  0.497681759997
Test:  0.497714831399

AUC Results
Train:  0.737840206221
Validation:  0.737658408702
Test:  0.737919240959
Model Evaluation Time:  620.700991869

Training LR model (iter=100, step=10) with L2 reg at:  1e-05
Model Training Time:  151.726215839

Logloss Results
Train:  0.497871408565
Validation:  0.49803509868
Test:  0.498069259653

AUC Results
Train:  0.737255526278
Validation:  0.737076060064
Test:  0.737335479915
Model Evaluation Time:  359.820932865

Training LR model (iter=100, step=10) with L2 reg at:  0.0001
Model Training Time:  162.769757032

Logloss Results
Train:  0.500670201883
Validation:  0.500823293783
Test:  0.500866067742

AUC Results
Train:  0.73270020721
Validation:  0.732534552865
Test:  0.732781781435
Model Evaluation Time:  349.492506027

Training LR model (iter=100, step=10) with L2 reg at:  0.001
Model Training Time:  154.079956055

Logloss Results
Train:  0.511111913122
Validation:  0.511274957797
Test:  0.51128224491

AUC Results
Train:  0.715906603164
Validation:  0.715734811476
Test:  0.716045814684
Model Evaluation Time:  348.526885986

Training LR model (iter=100, step=10) with L2 reg at:  0.01
Model Training Time:  152.547412157

Logloss Results
Train:  0.54188605982
Validation:  0.542069324482
Test:  0.542061715933

AUC Results
Train:  0.661586966752
Validation:  0.661522154875
Test:  0.6619501034
Model Evaluation Time:  297.961891174

Training LR model (iter=100, step=10) with L2 reg at:  0.1
Model Training Time:  77.6855239868

Logloss Results
Train:  0.592132446311
Validation:  0.592235191334
Test:  0.592310502043

AUC Results
Train:  0.5
Validation:  0.5
Test:  0.5
Model Evaluation Time:  291.532629967
Regularization type:  l2

Training LR model (iter=100, step=10) with L2 reg at:  1e-06
Model Training Time:  153.715564966

Logloss Results
Train:  0.497478074001
Validation:  0.497643323262
Test:  0.497676313325

AUC Results
Train:  0.737903312714
Validation:  0.737721358332
Test:  0.737982324792
Model Evaluation Time:  347.644450903

Training LR model (iter=100, step=10) with L2 reg at:  1e-05
Model Training Time:  158.011615038

Logloss Results
Train:  0.497479105066
Validation:  0.497644435841
Test:  0.497677604679

AUC Results
Train:  0.737891247493
Validation:  0.737709035053
Test:  0.737969675919
Model Evaluation Time:  349.29243803

Training LR model (iter=100, step=10) with L2 reg at:  0.0001
Model Training Time:  152.585780144

Logloss Results
Train:  0.497536398852
Validation:  0.497701770834
Test:  0.497734804972

AUC Results
Train:  0.737816925809
Validation:  0.737634560678
Test:  0.737895655038
Model Evaluation Time:  352.715439081

Training LR model (iter=100, step=10) with L2 reg at:  0.001
Model Training Time:  154.25915885

Logloss Results
Train:  0.498224333378
Validation:  0.498389184961
Test:  0.49841839104

AUC Results
Train:  0.737090789316
Validation:  0.736909791352
Test:  0.737180830772
Model Evaluation Time:  352.843014956

Training LR model (iter=100, step=10) with L2 reg at:  0.01
Model Training Time:  152.900197029

Logloss Results
Train:  0.506232052734
Validation:  0.506395349512
Test:  0.506403982275

AUC Results
Train:  0.728549375264
Validation:  0.728392482585
Test:  0.728719521233
Model Evaluation Time:  351.499484062

Training LR model (iter=100, step=10) with L2 reg at:  0.1
Model Training Time:  152.293453932

Logloss Results
Train:  0.534929209631
Validation:  0.535101721172
Test:  0.535103144398

AUC Results
Train:  0.696124751023
Validation:  0.696007317337
Test:  0.696433646746
Model Evaluation Time:  347.678356886
