# Final Project - Criteo Labs Display Advertising Challenge
__`MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Spring 2019`__

__`Team: Chi Iong Ansjory, Catherine Cao, Scott Xu`__

Table of Contents:
- [0. Background](#background)
- [1. Question Formulation](#question_formulation)
- [2. Algorithm Explanation](#algorithm_explanation)
- [3. EDA & Discussion of Challenges](#eda_challenges)
- [4. Algorithm Implementation](#algorithm_implementation)
- [5. Application of Course Concepts](#course_concepts_application)

<a id='background'></a>
# 0. Background

Criteo Labs is a leading global technology company that specializes in performance display advertising, working with over 4,000 e-commerce companies around the world. Its technology takes an algorithmic approach to deciding which users see an advertisement, when, and for which products, and it does so for billions of unique advertisements that are created and displayed every day.

Display advertising is a billion-dollar effort and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept confidential. Through the Kaggle research competition, Criteo Labs is sharing a week’s worth of data so that participants can develop models that predict advertisement click-through rate (CTR): given a user and the page being visited, what is the probability that the user will click on a given advertisement?

Source: https://www.kaggle.com/c/criteo-display-ad-challenge

The smaller version of the dataset is no longer available from Kaggle, so the full-size version from Criteo Labs must be used instead.

Source: https://www.kaggle.com/c/criteo-display-ad-challenge/data (smaller version, no longer available); http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/ (full-size version)

### Notebook Set-Up

In [1]:
# imports
import re
import ast
import time
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx
import matplotlib.pyplot as plt

import os
import json


In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

<a id='question_formulation'></a>
# 1. Question Formulation

The goal of this analysis is to benchmark the most accurate ML algorithms for CTR estimation.

<a id='algorithm_explanation'></a>
# 2. Algorithm Explanation

Random Forest is an ensemble method for classification and regression. The algorithm builds multiple decision trees, each of which makes its own prediction. The final prediction is the most common class across the trees (classification) or the average of their outputs (regression). To reduce the correlation between trees, it samples 1) the training data (bagging) and 2) the features, so that each tree sees a slightly different input. Overall, a random forest improves model performance and reduces overfitting compared with a single tree.
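The idea can be illustrated with a minimal pure-Python sketch. It is not the Spark implementation used later in this notebook: one-split decision stumps stand in for full trees, and the toy data and the names `fit_stump`, `fit_forest`, and `predict` are assumptions made only for illustration.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Fit a one-split tree: choose the (feature, threshold) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda row: left_label if row[f] <= threshold else right_label

def fit_forest(rows, labels, num_trees=25, features_per_tree=1, seed=0):
    """Train each stump on a bootstrap sample of rows and a random subset of features."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    trees = []
    for _ in range(num_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # sample rows with replacement
        feats = rng.sample(range(d), features_per_tree)   # sample features without replacement
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return trees

def predict(trees, row):
    """Classification: return the most common vote across all trees."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]

# Toy data (made up): class 0 has small values, class 1 has large values.
rows = [(1, 10), (2, 11), (3, 12), (2, 10), (7, 20), (8, 21), (9, 22), (8, 20)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict(forest, (0, 9)), predict(forest, (10, 25)))
```

Because each stump sees a different bootstrap sample and feature subset, individual stumps can disagree near the class boundary. The majority vote is what makes the combined prediction stable.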

<a id='eda_challenges'></a>
# 3. EDA & Discussion of Challenges

The main challenge is that the dataset given for this analysis has no column labels, so we cannot leverage pre-existing knowledge about how online ads are served and how CTR is computed to understand the data. This means we have to put extra effort into analyzing the data to understand the relationships between features and process them appropriately.

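Categorical vs. numeric features: after the target, each row has 13 integer features (named `I1` to `I13` below) and 26 categorical features (named `C1` to `C26` below).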
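With no column labels to rely on, a first pass over the raw file can at least show how often each column is empty and how many distinct values it takes. The sketch below uses only the standard library. The three rows are made up and use a shortened layout (label, two integer fields, two hashed categorical fields), and `profile_columns` is a hypothetical helper, not part of this notebook.

```python
from collections import defaultdict

# Made-up rows in a shortened Criteo-like layout (assumption, not real data):
# label, two integer fields, two hashed categorical fields, tab-separated.
sample_lines = [
    "0\t1\t5\t68fd1e64\ta9d1ba1a",
    "1\t\t12\t68fd1e64\t",
    "0\t3\t\t05db9164\ta9d1ba1a",
]

def profile_columns(lines, sep="\t"):
    """Return {column index: (fraction of empty values, number of distinct non-empty values)}."""
    missing = defaultdict(int)
    distinct = defaultdict(set)
    for line in lines:
        for col, value in enumerate(line.split(sep)):
            if value == "":
                missing[col] += 1
            else:
                distinct[col].add(value)
    columns = sorted(set(missing) | set(distinct))
    return {col: (missing[col] / len(lines), len(distinct[col])) for col in columns}

profile = profile_columns(sample_lines)
for col, (missing_fraction, cardinality) in profile.items():
    print(col, round(missing_fraction, 2), cardinality)
```

Columns with a high missing fraction are candidates for imputation, and categorical columns with very high cardinality are expensive to one-hot encode.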

In [None]:
# Load data. Keep this relative path access to data
dir_name = os.getcwd()
#train_filename = os.path.join(dir_name, 'data/train.txt')
#test_filename = os.path.join(dir_name, 'data/test.txt')
toy_filename = os.path.join(dir_name, 'data/toy_train.txt')

# Reading the data
#train = pd.read_csv(train_filename, sep="\t", header=None)
#test = pd.read_csv(test_filename, sep="\t", header=None)
toy = pd.read_csv(toy_filename, sep=",", header=None)

#print("Original shapes of train and test datasets:")
#train.shape, test.shape
toy.shape

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
app_name = "final"
master = "local[*]"
MAX_MEMORY = "5g"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .config("spark.executor.memory", MAX_MEMORY) \
        .config("spark.driver.memory", MAX_MEMORY) \
        .getOrCreate()
sc = spark.sparkContext


In [6]:
# Read the full training set (tab-separated, no header row)
rdd = sc.textFile("./data/train.txt")

# 0.1% sample without replacement, used for the quick count below
rdd_sample = rdd.sample(fraction=0.001, withReplacement=False).cache()

numeric_features = ['I'+str(i) for i in range(1, 14)]
categorical_features = ['C'+str(i) for i in range(1, 27)]
header = ['target'] + numeric_features + categorical_features

df = rdd.map(lambda x: x.split("\t")).toDF(header).cache()

# Cast the label and the integer features; empty strings become null
for var in ['target'] + numeric_features:
    df = df.withColumn(var, df[var].cast(IntegerType()))

In [80]:
rdd_sample.count()

46228

<a id='algorithm_implementation'></a>
# 4. Algorithm Implementation

In [10]:
#from pyspark.mllib.tree import RandomForest
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.sql.functions import lit
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

In [208]:
df_filter=df.replace('', None, categorical_features)

df_filter=df_filter.dropna()

In [210]:
df_filter.count()

802

In [11]:
# Treat empty strings in the categorical columns as missing values
df_impute = df.replace('', None, categorical_features)

# Fill missing values from a precomputed column -> value mapping
with open("imputation.json") as json_file:
    impute = json.load(json_file)
df_impute = df_impute.fillna(impute)

In [17]:
def preprocess(df):
    """Index the selected categorical columns and assemble the feature vector."""
    stages = []
    categorical_features_index = []
    for categoricalCol in ['C6','C9','C14','C17','C20','C23','C25']:
        stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index')
        categorical_features_index += [categoricalCol + 'Index']
        # Only the indexer is a stage here: this variant feeds the index
        # columns to the assembler and does not one-hot encode them
        stages += [stringIndexer]

    vector_assembler = VectorAssembler(
        inputCols=numeric_features + categorical_features_index,
        outputCol="features")

    stages += [vector_assembler]
    pipeline = Pipeline(stages=stages)

    pipelineModel = pipeline.fit(df)
    df_temp = pipelineModel.transform(df)

    return df_temp


In [14]:
stages = []
categorical_features_index = []
for categoricalCol in ['C6','C9','C14','C17','C20','C23','C25']:
    # Map each category string to a numeric index; skip rows with unseen labels
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index', handleInvalid="skip")
    # One-hot encode the index (OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.0 renamed it OneHotEncoder)
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    categorical_features_index += [categoricalCol + 'classVec']
    # Both stages are needed: the assembler below reads the classVec columns
    stages += [stringIndexer, encoder]

vector_assembler = VectorAssembler(
    inputCols=numeric_features + categorical_features_index,
    outputCol="features")

stages += [vector_assembler]
pipeline = Pipeline(stages=stages)

pipelineModel = pipeline.fit(df_impute)
df_temp = pipelineModel.transform(df_impute)


In [109]:
pd_df = df_temp.toPandas()
for i in range(0,10):
    print("----------------------------")
    print(pd_df.loc[i,"features"])

----------------------------
(120,[0,1,2,3,4,5,6,7,8,9,10,12,15,23,26,49,57,59,72],[6.0,1.0,26.0,3.0,46.0,10.0,17.0,38.0,132.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
----------------------------
(120,[1,2,3,4,5,6,7,8,10,12,15,23,25,49,57,59,72],[1.0,55.0,13.0,524.0,102.0,32.0,41.0,259.0,8.0,28.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
----------------------------
(120,[0,1,2,3,4,5,6,7,8,9,10,12,13,23,26,50,57,60,73],[2.0,283.0,26.0,14.0,627.0,33.0,2.0,28.0,28.0,1.0,1.0,14.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
----------------------------
(120,[1,2,3,4,5,6,7,8,10,12,13,23,28,55,59,78],[1.0,37.0,6.0,4.0,472.0,10.0,6.0,214.0,2.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0])
----------------------------
(120,[1,2,3,4,5,6,7,8,10,12,14,23,30,55,64,76],[118.0,1.0,1.0,9329.0,154.0,1.0,4.0,25.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
----------------------------
(120,[0,2,3,4,5,7,8,12,14,23,26,49,57,64,72],[3.0,4.0,12.0,36181.0,116.0,14.0,64.0,12.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
----------------------------
(120,[0,2,3,4,5,6,7,8,

In [18]:
start = time.time()
df_temp = preprocess(df_impute)
print("Wall time: {} seconds".format(time.time() - start))
df_temp.printSchema()


In [13]:
(trainingData, testData) = df_temp.randomSplit([0.7, 0.3])

In [40]:
model


In [14]:
start = time.time()
rf = RandomForestClassifier(labelCol="target",\
featuresCol="features", numTrees=100)
model = rf.fit(trainingData)

predictions = model.transform(testData)
evaluator = BinaryClassificationEvaluator(labelCol = "target")

print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))
print("Wall time: {} seconds".format(time.time() - start))

Test Area Under ROC: 0.7064718946223008
Wall time: 2165.207647562027 seconds
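As a reading aid for the metric above: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity directly over all pairs. It is an illustration under that definition, not how Spark's `BinaryClassificationEvaluator` computes it internally, and `auc_by_pairs` is a hypothetical helper name.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up scores: three of the four positive/negative pairs are ordered correctly.
print(auc_by_pairs([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the test AUC reported above should be read.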


In [38]:
predictions

DataFrame[target: int, I1: int, I2: int, I3: int, I4: int, I5: int, I6: int, I7: int, I8: int, I9: int, I10: int, I11: int, I12: int, I13: int, C1: string, C2: string, C3: string, C4: string, C5: string, C6: string, C7: string, C8: string, C9: string, C10: string, C11: string, C12: string, C13: string, C14: string, C15: string, C16: string, C17: string, C18: string, C19: string, C20: string, C21: string, C22: string, C23: string, C24: string, C25: string, C26: string, C6Index: double, C6classVec: vector, C9Index: double, C9classVec: vector, C14Index: double, C14classVec: vector, C17Index: double, C17classVec: vector, C20Index: double, C20classVec: vector, C23Index: double, C23classVec: vector, C25Index: double, C25classVec: vector, features: vector, rawPrediction: vector, probability: vector, prediction: double]

In [39]:
predictions.select("rawPrediction").show(10)


In [25]:
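from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
import math

def logloss_row(prob, actual, eps=1e-15):
    """Per-row log loss; prob is the predicted probability of a click."""
    prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
    if actual == 1:
        return -math.log(prob)
    else:
        return -math.log(1 - prob)

udf_logloss = udf(logloss_row, FloatType())

# probability[1] is the predicted probability of the positive class (target = 1)
positive_prob = udf(lambda v: float(v[1]), FloatType())
logloss = predictions.select(positive_prob('probability').alias('probability'), 'target')\
       .withColumn("logloss", udf_logloss("probability", "target"))\
       .select('logloss')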


In [32]:
logloss.groupby().avg('logloss').collect()

[Row(avg(logloss)=0.5273571749425316)]
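The averaged quantity is the binary log loss: for each example it is -log(p) when the label is 1 and -log(1 - p) when the label is 0, where p is the predicted click probability. A minimal plain-Python sketch of the same calculation follows. The function name `mean_log_loss` and the clipping constant `eps` are assumptions for illustration.

```python
import math

def mean_log_loss(probs, labels, eps=1e-15):
    """Average binary log loss; probs are predicted probabilities of the positive class."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(labels)

# A model that always predicts 0.5 scores log(2), about 0.693, whatever the labels are.
print(mean_log_loss([0.5, 0.5], [1, 0]))
```

Lower is better, and the log(2) value for an uninformative 0.5 prediction is a convenient reference point when reading the average reported above.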

In [36]:
rdd_test.collect()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 24 in stage 63.0 failed 1 times, most recent failure: Lost task 24.0 in stage 63.0 (TID 16033, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

In [37]:
rdd_test = sc.textFile("./data/test.txt")
numeric_features = ['I'+str(i) for i in range(1, 14)]
categorical_features = ['C'+str(i) for i in range(1, 27)]
# The test header has no 'target' column
header_test = numeric_features + categorical_features
df_test = rdd_test.map(lambda x: x.split("\t")).toDF(header_test).cache()

# for var in numeric_features:
#     df_test =df_test.withColumn(var, df_test[var].cast(IntegerType()))
    
# df_test.replace('', None, categorical_features)
# with open("imputation.json") as json_file:  
#     impute = json.load(json_file)
# df_impute_test=df_test.replace('', None, categorical_features)
# df_impute_test = df_impute_test.fillna(impute)

# df_temp_test = pipelineModel.transform(df_impute_test)

# predictions = model.transform(testData)




In [117]:
predictions.toPandas()["probability"]

0          [0.700521863896183, 0.2994781361038169]
1        [0.7062626944017442, 0.29373730559825584]
2        [0.6892573449097972, 0.31074265509020277]
3           [0.851978477196818, 0.148021522803182]
4        [0.7860300990530122, 0.21396990094698776]
5         [0.6765754838411108, 0.3234245161588892]
6        [0.7918672520345007, 0.20813274796549938]
7        [0.7277201450496925, 0.27227985495030743]
8        [0.6858364207617469, 0.31416357923825305]
9         [0.7176300996024503, 0.2823699003975497]
10        [0.759579333527488, 0.24042066647251212]
11       [0.8057988902874068, 0.19420110971259316]
12       [0.7051661363415097, 0.29483386365849035]
13       [0.6448771427210933, 0.35512285727890663]
14         [0.76858804983946, 0.23141195016053986]
15          [0.692057817533587, 0.307942182466413]
16       [0.7782200693588682, 0.22177993064113177]
17       [0.7189172615915657, 0.28108273840843434]
18       [0.7613948758450163, 0.23860512415498378]
19       [0.8294749675522164, 0

In [112]:
start = time.time()
rf = RandomForestClassifier(labelCol="target",\
featuresCol="features", numTrees=100)
model = rf.fit(trainingData)

predictions = model.transform(testData)
evaluator = BinaryClassificationEvaluator(labelCol = "target")
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))
print("Wall time: {} seconds".format(time.time() - start))

Test Area Under ROC: 0.6926939542984374


In [226]:
evaluator.evaluate(predictions)

0.6255701514322198

In [None]:
start = time.time()
rf = RandomForestClassifier(labelCol="target",\
featuresCol="features", numTrees=10)
model = rf.fit(trainingData)

predictions = model.transform(testData)
print("Wall time: {} seconds".format(time.time() - start))
evaluator = BinaryClassificationEvaluator(labelCol = "target")
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

<a id='course_concepts_application'></a>
# 5. Application of Course Concepts