In [1]:
# Loading autotime for the notebook
%load_ext autotime

time: 375 µs (started: 2022-08-17 09:33:53 +00:00)


In [2]:
# Setting the environment variables

time: 263 µs (started: 2022-08-17 09:33:53 +00:00)


In [3]:
import os
import sys
os.environ["PYSPARK_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook --no-browser"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/home/ec2-user/spark-2.4.4-bin-hadoop2.7"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

time: 1.94 ms (started: 2022-08-17 09:33:53 +00:00)


In [26]:
# Spark environment
from pyspark import SparkConf
from pyspark.sql import SparkSession

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

import numpy as np

time: 695 µs (started: 2022-08-17 10:26:34 +00:00)


# Ecommerce Churn Assignment

The aim of this assignment is to build a model that predicts whether a customer purchases an item after adding it to the cart. Since this is a classification problem, you are expected to apply all three models covered so far, select the most robust one, and provide a solution that predicts churn as reliably as possible.

For this assignment, you are provided with data from an e-commerce company for the month of October 2019. Your task is to first analyse the data and then work through the steps of the model-building process.

The broad tasks are:
- Data Exploration
- Feature Engineering
- Model Selection
- Model Inference

### Data description

Each record in the dataset describes a customer's activity within a session on the e-commerce platform, along with the parameters associated with that activity.

- **event_time**: Date and time when user accesses the platform
- **event_type**: Action performed by the customer
    - View
    - Cart
    - Purchase
    - Remove from cart
- **product_id**: Unique number to identify the product in the event
- **category_id**: Unique number to identify the category of the product
- **category_code**: Stores primary and secondary categories of the product
- **brand**: Brand associated with the product
- **price**: Price of the product
- **user_id**: Unique ID for a customer
- **user_session**: Session ID for a user


### Initialising the SparkSession

The dataset provided is 5 GB in size, so you are expected to increase the driver memory accordingly. You can refer to notebook 1 for the steps involved.

In [5]:
MAX_MEMORY = "14G"

spark = SparkSession \
    .builder \
    .appName("demo") \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
spark

time: 3.81 s (started: 2022-08-17 09:33:54 +00:00)


In [6]:
# Loading the clean data


time: 3.29 ms (started: 2022-08-17 09:33:58 +00:00)


<hr>

## Task 3: Model Selection
Three models for classification:
- Logistic Regression
- Decision Tree
- Random Forest

### Model 2: Decision Trees

In [7]:
# Additional steps for Decision Trees, if any

time: 245 µs (started: 2022-08-17 09:33:58 +00:00)


#### Feature Transformation (Code will be same; check for the columns)

In [8]:
# Check if only the required columns are present to build the model
# If not, drop the redundant columns


time: 258 µs (started: 2022-08-17 09:33:58 +00:00)


In [9]:
# Categorising the attributes by type - Continuous and Categorical


time: 253 µs (started: 2022-08-17 09:33:58 +00:00)


In [10]:
# Feature transformation for categorical features


time: 239 µs (started: 2022-08-17 09:33:58 +00:00)


In [11]:
# Vector assembler to combine all the features


time: 257 µs (started: 2022-08-17 09:33:58 +00:00)


In [12]:
# Pipeline for the tasks


time: 600 µs (started: 2022-08-17 09:33:58 +00:00)


In [13]:
# Transforming the dataframe df


time: 871 µs (started: 2022-08-17 09:33:58 +00:00)


In [14]:
# Schema of the transformed df


time: 603 µs (started: 2022-08-17 09:33:58 +00:00)


In [15]:
# Checking the elements of the transformed df - Top 20 rows


time: 559 µs (started: 2022-08-17 09:33:58 +00:00)


In [16]:
# Storing the transformed df in S3 bucket to avoid repeating the steps


time: 670 µs (started: 2022-08-17 09:33:58 +00:00)


In [17]:
# Load transformed data
df_transformed = spark.read.parquet("Parquets/transformed_df.parquet")

time: 2.79 s (started: 2022-08-17 09:33:58 +00:00)


#### Train-test split

In [18]:
# Splitting the data into train and test sets (remember, you are expected to compare the models later)
df_train, df_test = df_transformed.randomSplit([0.7, 0.3], seed=42)

time: 69.7 ms (started: 2022-08-17 09:34:00 +00:00)


In [19]:
# Number of rows in train and test data
print(f"Number of Train rows: {df_train.count()}")
print(f"Number of Test rows: {df_test.count()}")

Number of Train rows: 628038
Number of Test rows: 270405
time: 22.7 s (started: 2022-08-17 09:34:01 +00:00)
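
`randomSplit` assigns each row to a split at random, so the resulting sizes only approximate the requested 70/30 weights. The check below is a minimal sketch in plain Python that uses the two counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Row counts copied from the output of the cell above.
train_rows = 628038
test_rows = 270405

total_rows = train_rows + test_rows
train_fraction = train_rows / total_rows
test_fraction = test_rows / total_rows

# The fractions should be close to, but not exactly, 0.7 and 0.3.
print(f"Total rows: {total_rows}")
print(f"Train fraction: {train_fraction:.4f}")
print(f"Test fraction: {test_fraction:.4f}")
```

A fraction that drifts far from the requested weight would suggest the split was not applied to the expected dataframe.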


#### Model Fitting

In [20]:
label_column = "is_purchased"

time: 598 µs (started: 2022-08-17 09:34:23 +00:00)


In [21]:
# Building the model with hyperparameter tuning
# Create ParamGrid for Cross Validation
# Initialising DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(labelCol=label_column, 
                                       featuresCol="features", 
                                       seed=42)

# Creating the parameter grid for the Decision Tree model
max_depth = [5, 10, 15, 20]
max_bins = [16, 32, 64, 128]
impurity = ["gini", "entropy"]

param_grid = ParamGridBuilder().addGrid(decision_tree.maxDepth, max_depth) \
                               .addGrid(decision_tree.maxBins, max_bins) \
                               .addGrid(decision_tree.impurity, impurity) \
                               .build()

class_evaluator = MulticlassClassificationEvaluator(labelCol=label_column, 
                                                    metricName="accuracy")

cross_validator = CrossValidator(estimator=decision_tree,
                                 estimatorParamMaps=param_grid,
                                 evaluator=class_evaluator,
                                 numFolds=10,
                                 parallelism=4)

# Run cross-validation, and choose the best set of parameters.
cross_validator_model = cross_validator.fit(df_train)

# Make predictions on the test data (evaluation metrics can be computed from these predictions)
prediction = cross_validator_model.transform(df_test)

time: 46min 23s (started: 2022-08-17 09:34:23 +00:00)
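
The long run time of the cell above follows from the size of the search. As a rough sketch in plain Python (no Spark needed), the snippet below counts how many decision trees the cross-validation trains for the grid and fold count used above. It assumes the standard `CrossValidator` behaviour of fitting every parameter map on every fold and then refitting the best one on the full training set.

```python
from itertools import product

# Candidate values copied from the parameter grid above.
max_depth = [5, 10, 15, 20]
max_bins = [16, 32, 64, 128]
impurity = ["gini", "entropy"]
num_folds = 10

# The grid contains one parameter map per combination of values.
combinations = list(product(max_depth, max_bins, impurity))
num_param_maps = len(combinations)

# Every parameter map is trained once per fold, then the best map is
# refitted once on the full training set.
cv_fits = num_param_maps * num_folds
total_fits = cv_fits + 1

print(f"Parameter maps: {num_param_maps}")
print(f"Fits during cross-validation: {cv_fits}")
print(f"Total fits including the final refit: {total_fits}")
```

Reducing `numFolds` or trimming the grid shrinks this count proportionally, which is the main lever for shortening the run.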


In [22]:
# Run cross-validation steps


time: 263 µs (started: 2022-08-17 10:20:47 +00:00)


In [23]:
# Fitting the models on transformed df


time: 640 µs (started: 2022-08-17 10:20:47 +00:00)


In [24]:
# Best model from the results of cross-validation


time: 1.58 ms (started: 2022-08-17 10:20:47 +00:00)


#### Model Analysis

Required Steps:
- Predict on test data
- Performance analysis
    - Appropriate Metric with reasoning
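
To support the choice of metric, it helps to see how accuracy, precision, recall and F1 can diverge on the same predictions. The sketch below is plain Python with hypothetical confusion-matrix counts; they are not results from this notebook, and the function name is illustrative. It assumes a binary label where 1 means the item was purchased.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only: 200 actual purchases out of
# 1000 rows, of which the model finds 50.
metrics = classification_metrics(tp=50, fp=10, fn=150, tn=790)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts the accuracy looks healthy while recall is low, which is why accuracy alone can be misleading when one class dominates.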

In [27]:
best_model_params = cross_validator_model.getEstimatorParamMaps()[np.argmax(cross_validator_model.avgMetrics)]
param_keys = list(best_model_params.keys())
for param in param_keys:
    print(f"{param.name} = {best_model_params[param]}")

maxDepth = 20
maxBins = 128
impurity = gini
time: 1.03 ms (started: 2022-08-17 10:26:39 +00:00)
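
Selecting the best parameters with `np.argmax` works here because `avgMetrics` is in the same order as the parameter maps and accuracy is a larger-is-better metric. For a metric where smaller is better, the minimum would be needed instead; PySpark evaluators report the direction through `isLargerBetter()`. The sketch below shows the selection logic in plain Python, with hypothetical parameter maps and scores that are not taken from this run.

```python
def select_best(param_maps, avg_metrics, larger_is_better=True):
    """Return the index and parameter map with the best average metric."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have equal length")
    chooser = max if larger_is_better else min
    best_index = chooser(range(len(avg_metrics)),
                         key=avg_metrics.__getitem__)
    return best_index, param_maps[best_index]


# Hypothetical values for illustration only.
hypothetical_param_maps = [{"maxDepth": 5}, {"maxDepth": 10}, {"maxDepth": 20}]
hypothetical_scores = [0.61, 0.64, 0.66]

best_index, best_params = select_best(hypothetical_param_maps,
                                      hypothetical_scores)
print(f"Best index: {best_index}, parameters: {best_params}")
```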


#### Summary of the best Decision Tree model

In [28]:
best_model = cross_validator_model.bestModel

time: 459 µs (started: 2022-08-17 10:30:31 +00:00)


In [29]:
dt_model_path = "Models/DecisionTree"
best_model.save(dt_model_path)

time: 2.08 s (started: 2022-08-17 10:30:43 +00:00)
