# Build Predictive Model(s)

In this workbook, you will read the merged dataset you created previously and create transformers, estimators, and pipelines to build a binary classification model that predicts whether a trip has a tip or not.

## Instructions:

1. Read in your merged dataset
2. Use transformers and encoders to perform feature engineering
3. Split into training and testing
4. Build `LogisticRegression` model(s) and train them using pipelines
5. Evaluate the performance of the model(s) using `BinaryClassificationMetrics`

You are welcome to add as many cells as you need below up until the next section. **You must include comments in your code.**

In [1]:
# importing and running spark
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("assignment4-a2").getOrCreate()
spark

In [2]:
# reading in the merged dataset from my s3 bucket
# (parquet files carry their own schema, so the CSV-only 'header' and 'inferSchema' options are not needed)
dfMerged = spark.read\
  .format('parquet')\
  .load('s3://big-data-bucket-mine/dfMerged.parquet')

In [3]:
dfMerged.show(10) #looking at the first 10 rows of data to ensure reading in the data worked properly

+--------------------+--------------------+---------+-------------------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|           medallion|        hack_license|vendor_id|    pickup_datetime|rate_code|store_and_fwd_flag|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|
+--------------------+--------------------+---------+-------------------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+
|00005007A9F30E289...|16780B3E72BAA7A5C...|

In [4]:
#Adding a new column to show whether there is a tip included in the trip or not
import pyspark.sql.functions as f
dfMerged = dfMerged.withColumn("Tip?", f.when(dfMerged["tip_amount"] > 0, 1).otherwise(0))
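
The labelling rule above can be illustrated without Spark. The sketch below is plain Python with a hypothetical helper, `has_tip`, and it assumes a missing `tip_amount` should be labelled 0, which mirrors how `when(...).otherwise(0)` falls through to the `otherwise` branch when the comparison is null.

```python
from typing import Optional


def has_tip(tip_amount: Optional[float]) -> int:
    """Return 1 when the trip has a positive tip, else 0 (hypothetical helper)."""
    # A missing amount is treated as "no tip", like the otherwise(0) branch.
    if tip_amount is None:
        return 0
    return 1 if tip_amount > 0 else 0


# Made-up tip amounts, for illustration only.
labels = [has_tip(t) for t in (2.5, 0.0, None, 0.01)]
print(labels)
```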

In [5]:
#importing the pyspark.ml modules used for feature engineering, modelling, and evaluation
import pyspark.ml.evaluation as ev
from pyspark.ml import Pipeline
import pyspark.ml.regression as rg
import pyspark.sql.functions as f
import pyspark.ml.feature as feat
import pyspark.ml.classification as cl

In [6]:
#using OneHotEncoderEstimator on rate_code to differentiate between the regular rate and the high-price rate
#(this class is named OneHotEncoder in Spark 3.0 and later)

from pyspark.ml.feature import OneHotEncoderEstimator
encoder = OneHotEncoderEstimator(inputCols=["rate_code"],
                                outputCols=["rate_code_encoded"])
modela = encoder.fit(dfMerged)
dfMerged = modela.transform(dfMerged)
dfMerged.show(10)

+--------------------+--------------------+---------+-------------------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+----+-----------------+
|           medallion|        hack_license|vendor_id|    pickup_datetime|rate_code|store_and_fwd_flag|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|Tip?|rate_code_encoded|
+--------------------+--------------------+---------+-------------------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+-----------
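
For intuition about what the encoded column holds, here is a plain-Python sketch of one-hot encoding with the last category dropped, which is the encoder's default (`dropLast=True`). The function `one_hot` and the category count of 7 are illustrative assumptions, not values taken from the dataset, and Spark stores the result as a sparse vector rather than a list.

```python
from typing import List


def one_hot(index: int, num_categories: int, drop_last: bool = True) -> List[float]:
    """Encode a category index as a 0/1 list (hypothetical helper)."""
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} is outside 0..{num_categories - 1}")
    # With drop_last, the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # The last category has no slot of its own, so it stays all zeros.
    if index < size:
        vec[index] = 1.0
    return vec


print(one_hot(1, 7))  # a 1.0 in position 1
print(one_hot(6, 7))  # last category: all zeros
```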

In [7]:
import numpy as np

#Using Bucketizer to divide the latitude and longitude columns into buckets, so they can be used as categorical rather than continuous values in model creation

In [8]:
# running bucketizer for pickup_longitude and adding it in the dataset
splits = [-float("inf"), 0, 5, float("inf")]

bucketizer = feat.Bucketizer(splits=splits, inputCol="pickup_longitude", outputCol="pickup_longitude_bkt")

dfMerged = bucketizer.transform(dfMerged)

In [9]:
# running bucketizer for pickup_latitude and adding it in the dataset
splits = [-float("inf"), 0,5, float("inf")]

bucketizer = feat.Bucketizer(splits=splits, inputCol="pickup_latitude", outputCol="pickup_latitude_bkt")

dfMerged = bucketizer.transform(dfMerged)

In [10]:
# running bucketizer for dropoff_longitude and adding it in the dataset
splits = [-float("inf"),0,5, float("inf")]

bucketizer = feat.Bucketizer(splits=splits, inputCol="dropoff_longitude", outputCol="dropoff_longitude_bkt")

dfMerged = bucketizer.transform(dfMerged)

In [11]:
# running bucketizer for dropoff_latitude and adding it in the dataset
splits = [-float("inf"),0,5, float("inf")]

bucketizer = feat.Bucketizer(splits=splits, inputCol="dropoff_latitude", outputCol="dropoff_latitude_bkt")

dfMerged = bucketizer.transform(dfMerged)
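
To see which bucket a coordinate lands in, the split logic can be sketched in plain Python. This approximates `Bucketizer` semantics (each bucket is `[lower, upper)`, except the last, which also includes its upper bound). The helper `bucket_index` and the sample coordinates are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def bucket_index(value: float, splits: Sequence[float]) -> int:
    """Return the index of the bucket that contains value (hypothetical helper)."""
    # bisect_right counts the split points <= value; subtracting 1 gives the
    # bucket whose lower bound is the last split point <= value.
    idx = bisect_right(splits, value) - 1
    # Clamp so a value equal to the final split lands in the last bucket.
    return min(idx, len(splits) - 2)


splits = [-float("inf"), 0, 5, float("inf")]
for coord in (-73.98, 0.0, 4.9, 40.75):
    print(coord, "->", bucket_index(coord, splits))
```

With these splits, a longitude near -74 falls in bucket 0 and a latitude near 40.7 falls in bucket 2, so coordinates clustered around a single city would all share a bucket. Finer splits would be needed for the buckets to separate locations.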

In [12]:
#looking at the first 10 rows of the data frame to make sure all transformers worked
dfMerged.show(10)

+--------------------+--------------------+---------+-------------------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+------------+----+-----------------+--------------------+-------------------+---------------------+--------------------+
|           medallion|        hack_license|vendor_id|    pickup_datetime|rate_code|store_and_fwd_flag|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|Tip?|rate_code_encoded|pickup_longitude_bkt|pickup_latitude_bkt|dropoff_longitude_bkt|dropoff_latitude_bkt|
+--------------------+--------------------+---------+-------------------+---------+------------------+-------------------+---------------+----------

In [13]:
#removing the columns that have been encoded/transformed into new columns
dfMerged = dfMerged.drop('pickup_latitude','pickup_longitude','rate_code','tip_amount','dropoff_latitude', 'dropoff_longitude')

In [14]:
dfMerged.show(10) #looking at the first 10 rows of the data frame to make sure only 21 columns remain in the dataframe

+--------------------+--------------------+---------+-------------------+------------------+-------------------+---------------+-----------------+-------------+------------+-----------+---------+-------+------------+------------+----+-----------------+--------------------+-------------------+---------------------+--------------------+
|           medallion|        hack_license|vendor_id|    pickup_datetime|store_and_fwd_flag|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|payment_type|fare_amount|surcharge|mta_tax|tolls_amount|total_amount|Tip?|rate_code_encoded|pickup_longitude_bkt|pickup_latitude_bkt|dropoff_longitude_bkt|dropoff_latitude_bkt|
+--------------------+--------------------+---------+-------------------+------------------+-------------------+---------------+-----------------+-------------+------------+-----------+---------+-------+------------+------------+----+-----------------+--------------------+-------------------+---------------------+-----------

In [15]:
dfMerged=dfMerged.drop('features') #removes the column 'features' if it already exists
#selects the columns to be combined into column 'features'
#NOTE: this list includes the label "Tip?" itself, and total_amount typically includes the tip, so both leak the label into the features
Cols_to_Select = dfMerged["Tip?", "total_amount", "dropoff_longitude_bkt", "dropoff_latitude_bkt", "rate_code_encoded", "surcharge", "tolls_amount", "trip_time_in_secs", "fare_amount", "passenger_count", "trip_distance", "mta_tax", "pickup_latitude_bkt", "pickup_longitude_bkt"]
assembler = feat.VectorAssembler(inputCols=Cols_to_Select.columns, outputCol="features") #creates the VectorAssembler object

In [16]:
# running the VectorAssembler transformation onto the dataframe to create the 'features' column
dfMerged=assembler.setHandleInvalid("skip").transform(dfMerged)
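
Conceptually, the assembler concatenates the selected columns of each row into one flat vector, and `handleInvalid="skip"` drops rows that contain a null. A minimal plain-Python sketch under those assumptions follows. The `assemble` helper and the sample rows are hypothetical, and NaN handling is left out.

```python
from typing import Dict, List, Optional, Sequence, Union

Value = Union[float, Sequence[float], None]


def assemble(row: Dict[str, Value], cols: Sequence[str]) -> Optional[List[float]]:
    """Flatten the chosen columns into one list, or None if any is missing."""
    features: List[float] = []
    for name in cols:
        value = row[name]
        if value is None:
            return None  # "skip": the caller drops this row
        if isinstance(value, (int, float)):
            features.append(float(value))  # scalar column
        else:
            features.extend(float(v) for v in value)  # vector column
    return features


# Made-up rows; the second one has a missing fare and is skipped.
rows = [
    {"fare_amount": 7.5, "rate_code_encoded": [0.0, 1.0], "trip_distance": 1.2},
    {"fare_amount": None, "rate_code_encoded": [1.0, 0.0], "trip_distance": 0.8},
]
cols = ["fare_amount", "rate_code_encoded", "trip_distance"]
assembled = [vec for vec in (assemble(r, cols) for r in rows) if vec is not None]
print(assembled)
```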

In [17]:
#splitting the data into train (80%) and test (20%) datasets, using a fixed seed
splitted_data = dfMerged.randomSplit([0.8, 0.2], 12345)
train_data = splitted_data[0]
test_data = splitted_data[1]
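
`randomSplit` assigns rows at random according to the weights, so the 80/20 proportions are approximate rather than exact, and the fixed seed is intended to make the assignment repeatable on the same data. The idea can be sketched in plain Python. This illustrates the concept only and is not Spark's sampling algorithm, and `random_split` is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple


def random_split(rows: Sequence[int], train_weight: float, seed: int) -> Tuple[List[int], List[int]]:
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # a seeded generator gives the same draws every run
    train: List[int] = []
    test: List[int] = []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


train, test = random_split(range(1000), 0.8, 12345)
print(len(train), len(test))  # roughly 800 and 200, not exactly
```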

In [18]:
# creating the logistic regression object 
logReg_obj = cl.LogisticRegression(
    labelCol="Tip?"
    , featuresCol = "features",
    maxIter=10
)
# using a pipeline to fit the logistic regression; the feature transformers were already applied to the dataframe above, so the model is the only stage
pipeline = Pipeline(
    stages=[
        logReg_obj
    ])

pipelineModel = pipeline.fit(train_data) #fitting the model on the training dataset


In [19]:
import pyspark.ml.evaluation as ev
#evaluating the model created against test dataset
results_logReg = (
    pipelineModel
    .transform(test_data)
    .select('Tip?', 'probability', 'prediction')
)

evaluator = ev.MulticlassClassificationEvaluator(
    predictionCol='prediction'
    , labelCol='Tip?')

#computing f1 (the evaluator's default metric), weighted precision, and accuracy
(
    evaluator.evaluate(results_logReg)
    , evaluator.evaluate(
        results_logReg
        , {evaluator.metricName: 'weightedPrecision'}
    ) 
    , evaluator.evaluate(
        results_logReg
        , {evaluator.metricName: 'accuracy'}
    )
)

(0.9999997401694531, 0.9999997401695326, 0.9999997401694557)
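
Accuracy and weighted precision, two of the metrics reported above, can be computed by hand from pairs of labels and predictions. The plain-Python sketch below shows the arithmetic. The helper names and the toy data are made up for the example and are unrelated to the model's actual predictions.

```python
from collections import Counter
from typing import Sequence


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of rows where the prediction equals the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def weighted_precision(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Per-class precision, averaged with weights equal to each class's share of the labels."""
    total = 0.0
    for cls, n_true in Counter(labels).items():
        predicted = sum(p == cls for p in preds)
        correct = sum(l == cls and p == cls for l, p in zip(labels, preds))
        precision = correct / predicted if predicted else 0.0
        total += precision * n_true / len(labels)
    return total


# Toy data: 3 trips with a tip, 5 without.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0, 0, 1]
print(accuracy(labels, preds), weighted_precision(labels, preds))
```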

## In the following cells, please provide the requested code and output. Do not change the order and/or structure of the cells.

In the following cell, print the Area Under the Curve (AUC) for your binary classifier.

In [20]:
#pulls the training summary from the fitted logistic regression model to extract the area under the ROC curve (computed on the training data)
trainingSummary = pipelineModel.stages[-1].summary

print("areaUnderROC: " + str(trainingSummary.areaUnderROC))


areaUnderROC: 0.9999964360114013
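
Area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A small plain-Python sketch of that definition follows. `auc_by_pairs` is a hypothetical helper that compares every pair, so it only suits small inputs, and it is not how Spark computes the value.

```python
from typing import Sequence


def auc_by_pairs(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```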


In the following cell, provide the code that saves your model to your S3 bucket.

In [21]:
#saving the logistic model in a variable and then saving in the S3 bucket
logistic_model = pipelineModel.stages[-1]
logistic_model.save("s3://big-data-bucket-mine/logistic_model_Assignment4/")

In [22]:
spark.stop() #stopping spark