# Modelling 

In this part I use the dataset with the features prepared earlier to train a machine learning model that predicts churn.

I will split the full dataset into train and test sets, try several machine learning methods, evaluate the resulting models and tune parameters as necessary.
I will determine the winning model based on its performance on the test set.

Since the churned users are a fairly small subset, I will use F1 score as the metric to optimize.
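To see why accuracy alone can mislead when churners are rare, here is a minimal sketch. The counts (52 retained users, 10 churned users) are assumptions made up for illustration. A "model" that never predicts churn still looks accurate, yet its F1 score for the churn class is zero.

```python
# Illustrative counts only (assumed for this sketch).
n_retained, n_churned = 52, 10

# A trivial "model" that always predicts "no churn".
true_positives = 0
false_positives = 0
false_negatives = n_churned
true_negatives = n_retained

accuracy = (true_positives + true_negatives) / (n_retained + n_churned)

# F1 for the churn class; taken as 0.0 if the denominator is zero.
denominator = 2 * true_positives + false_positives + false_negatives
f1 = 2 * true_positives / denominator if denominator else 0.0

print("Accuracy:", round(accuracy, 3))
print("F1 score:", f1)
```

Accuracy rewards the majority class here, while F1 only moves when churners are actually found.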

In [21]:
# import libraries
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import datetime
import pandas as pd
from pyspark.sql.window import Window

#import numpy
#from numpy import allclose
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, GBTClassifier, LinearSVC

In [8]:
# create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify_Project") \
    .getOrCreate()

In [9]:
# loading cleaned dataset
path = "mini_sparkify_data_features.json"
data_features = spark.read.json(path)

First, let's transform our dataset by adding a vector of features:

In [10]:
numericCols = ['number_of_songs_played', 'number_of_thumbs_down', 'number_of_thumbs_up', 'number_of_roll_advert', 'number_of_add_to_playlist', 'number_of_add_friend', 'number_of_errors', 'time_on_paid_version', 'time_whole']
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
df_assembled = assembler.transform(data_features)

In [11]:
df_assembled.select("userId", "features").show(10,False)

+------+----------------------------------------------------------+
|userId|features                                                  |
+------+----------------------------------------------------------+
|100010|[275.0,5.0,17.0,52.0,7.0,4.0,0.0,0.0,3820418.0]           |
|200002|[387.0,6.0,21.0,7.0,8.0,4.0,0.0,2469195.0,3930924.0]      |
|125   |(9,[0,3,8],[8.0,1.0,1774.0])                              |
|124   |[4079.0,41.0,171.0,4.0,118.0,74.0,6.0,5183736.0,5183736.0]|
|51    |[2111.0,21.0,100.0,0.0,52.0,28.0,1.0,1363340.0,1363340.0] |
|7     |[150.0,1.0,7.0,16.0,5.0,1.0,1.0,0.0,4387742.0]            |
|15    |[1914.0,14.0,81.0,1.0,59.0,31.0,2.0,4732403.0,4732403.0]  |
|54    |[2841.0,29.0,163.0,47.0,72.0,33.0,1.0,3697678.0,3697678.0]|
|155   |[820.0,3.0,58.0,8.0,24.0,11.0,3.0,1676949.0,2231525.0]    |
|100014|[257.0,3.0,17.0,2.0,7.0,6.0,0.0,3563513.0,3563513.0]      |
+------+----------------------------------------------------------+
only showing top 10 rows
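The row for user 125 looks different because Spark prints it in sparse form: vector size, then the indices of the non-zero entries, then their values. A minimal plain-Python sketch of how that triple expands to a dense vector (the helper name `sparse_to_dense` is my own, not a Spark function):

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# The row for userId 125 from the output above.
dense_125 = sparse_to_dense(9, [0, 3, 8], [8.0, 1.0, 1774.0])
print(dense_125)
```

Following the order of `numericCols`, index 0 is `number_of_songs_played`, index 3 is `number_of_roll_advert` and index 8 is `time_whole`, so every other feature of this user is zero.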



Let's also transform the label column into a numerical index:

In [12]:
label_stringIdx = StringIndexer(inputCol = 'label', outputCol = 'labelIndex')
df_final = label_stringIdx.fit(df_assembled).transform(df_assembled)

In [13]:
df_final.select("userId", "label", "labelIndex").show(5)

+------+-----+----------+
|userId|label|labelIndex|
+------+-----+----------+
|100010|    0|       0.0|
|200002|    0|       0.0|
|   125|    1|       1.0|
|   124|    0|       0.0|
|    51|    1|       1.0|
+------+-----+----------+
only showing top 5 rows
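Note that `StringIndexer` orders labels by descending frequency by default, so index 0.0 goes to the most frequent label. Here that happens to keep 0 as 0.0 and 1 as 1.0 because non-churned users are the majority. A rough plain-Python sketch of that ordering follows; the function name and the alphabetical tie-break are assumptions of this sketch.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up labels: "0" (stayed) is more frequent than "1" (churned).
mapping = frequency_desc_index(["0", "0", "1", "0", "1"])
print(mapping)

# If churned users were the majority, the indices would flip.
flipped = frequency_desc_index(["1", "1", "0"])
print(flipped)
```

This matters for `print_metrics` later on, which treats `labelIndex == 1.0` as the churn class.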



Now we can split our dataset into train and test sets:

In [14]:
train, test = df_final.randomSplit([0.7, 0.3], seed=42)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 163
Test Dataset Count: 62
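The split is not exactly 70/30 (163 of 225 users is about 72%). As far as I understand it, `randomSplit` samples each row independently with the given weights, so the sizes fluctuate around the target. The sketch below imitates that idea in plain Python; it is an approximation for illustration and will not reproduce Spark's exact row assignment.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row independently to train or test with the given probability."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        (train_part if rng.random() < train_weight else test_part).append(row)
    return train_part, test_part

rows = list(range(225))  # 225 = 163 + 62 users in the split above
train_rows, test_rows = random_split(rows, 0.7, seed=42)

print("Simulated sizes:", len(train_rows), len(test_rows))
print("Train share in the notebook:", round(163 / 225, 3))
```

With only 225 users, a deviation of a few percentage points from the requested weights is to be expected.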


Let's also create a function for printing the metrics of our model:

In [15]:
def print_metrics(predictions):
    """Print accuracy, precision, recall and F1 score for the churn class (labelIndex == 1.0)."""
    label, prediction = F.col("labelIndex"), F.col("prediction")

    # Confusion-matrix counts
    true_positives = predictions.filter((label == 1.0) & (prediction == 1.0)).count()
    true_negatives = predictions.filter((label == 0.0) & (prediction == 0.0)).count()
    false_positives = predictions.filter((label == 0.0) & (prediction == 1.0)).count()
    false_negatives = predictions.filter((label == 1.0) & (prediction == 0.0)).count()
    total = true_positives + true_negatives + false_positives + false_negatives

    # Each ratio falls back to 0.0 when its denominator is zero (e.g. no positive predictions)
    accuracy = (true_positives + true_negatives) / total if total else 0.0
    print("Accuracy: ", accuracy)
    predicted_positives = true_positives + false_positives
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    print("Precision: ", precision)
    actual_positives = true_positives + false_negatives
    recall = true_positives / actual_positives if actual_positives else 0.0
    print("Recall: ", recall)
    # Equivalent to 2 * (precision * recall) / (precision + recall)
    f1_denominator = 2 * true_positives + false_positives + false_negatives
    F1_score = (2 * true_positives) / f1_denominator if f1_denominator else 0.0
    print("F1 score: ", F1_score)
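The function computes F1 directly from the confusion-matrix counts. The harmonic-mean form mentioned in its comment is algebraically the same, which the small check below illustrates. The counts are hypothetical and the two helper names are my own.

```python
import math

def f1_from_counts(tp, fp, fn):
    """F1 computed directly from confusion-matrix counts."""
    return (2 * tp) / (2 * tp + fp + fn)

def f1_from_precision_recall(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts chosen only for illustration.
tp, fp, fn = 7, 5, 3
f1_a = f1_from_counts(tp, fp, fn)
f1_b = f1_from_precision_recall(tp, fp, fn)

print(f1_a, f1_b)
print("Same value:", math.isclose(f1_a, f1_b))
```

The count-based form has one practical advantage: it stays defined when there are no true positives as long as there is at least one error, whereas the harmonic-mean form would then divide by zero.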

Now we are ready to train our models. Let's start with the Random Forest classifier.

### Random Forest Classifier:

In [16]:
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'labelIndex')
rfModel = rf.fit(train)
predictions_rf = rfModel.transform(test)

In [17]:
print_metrics(predictions_rf)

Accuracy:  0.8709677419354839
Precision:  0.5833333333333334
Recall:  0.7
F1 score:  0.6363636363636364


### Logistic Regression:

In [24]:
lr = LogisticRegression(featuresCol = 'features', labelCol = 'labelIndex')
lrModel = lr.fit(train)
predictions_lr = lrModel.transform(test)

In [25]:
print_metrics(predictions_lr)

Accuracy:  0.8709677419354839
Precision:  0.6
Recall:  0.6
F1 score:  0.6


### Linear SVC:

In [22]:
svc = LinearSVC(featuresCol = 'features', labelCol = 'labelIndex')
svcModel = svc.fit(train)
predictions_svc = svcModel.transform(test)

In [23]:
print_metrics(predictions_svc)

Accuracy:  0.8387096774193549
Precision:  0.5
Recall:  0.7
F1 score:  0.5833333333333334


### Stochastic Gradient Boosting

In [29]:
gbt = GBTClassifier(featuresCol = 'features', labelCol = 'labelIndex')
gbtModel = gbt.fit(train)
predictions_gbt = gbtModel.transform(test)

In [30]:
print_metrics(predictions_gbt)

Accuracy:  0.8225806451612904
Precision:  0.4666666666666667
Recall:  0.7
F1 score:  0.56


As we can see, Random Forest and Logistic Regression share the highest accuracy (87%). Random Forest has the best F1 score (0.64), while Logistic Regression has the best precision (0.6). I continue with Logistic Regression because precision matters most to me in this case: of the users we flag as likely to churn, we want as many as possible to be real churners. A precision of 0.6 is not a lot, but if we take into account that we work on a very small dataset, anything above 50% is a reasonable result.
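For a side-by-side view, the sketch below ranks the four models on each metric. The numbers are the test-set metrics printed above, rounded to four decimals; the dictionary keys reuse the variable names from the notebook (`rf`, `lr`, `svc`, `gbt`) and the helper `best_by` is my own.

```python
# Test-set metrics copied from the outputs above (rounded to 4 decimals).
results = {
    "rf":  {"accuracy": 0.8710, "precision": 0.5833, "recall": 0.7, "f1": 0.6364},
    "lr":  {"accuracy": 0.8710, "precision": 0.6000, "recall": 0.6, "f1": 0.6000},
    "svc": {"accuracy": 0.8387, "precision": 0.5000, "recall": 0.7, "f1": 0.5833},
    "gbt": {"accuracy": 0.8226, "precision": 0.4667, "recall": 0.7, "f1": 0.5600},
}

def best_by(metric):
    """Return every model that reaches the top value for the given metric."""
    top = max(scores[metric] for scores in results.values())
    return sorted(name for name, scores in results.items() if scores[metric] == top)

for metric in ("accuracy", "precision", "recall", "f1"):
    print(metric, "->", best_by(metric))
```

Which model counts as the winner depends on the metric: `rf` and `lr` tie on accuracy, `lr` leads on precision, and `rf` leads on F1.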

### Model tuning

Now let's try to tune our model by applying grid search.

This code is based on an example from this website:
https://spark.apache.org/docs/latest/ml-tuning.html

In [33]:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

pipeline = Pipeline(stages=[lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for lr.maxIter, 3 values for lr.aggregationDepth and 2 values for lr.fitIntercept
# this grid will have 3 x 3 x 2 = 18 parameter settings for CrossValidator to choose from.


paramGrid = ParamGridBuilder() \
    .addGrid(lr.maxIter, [100, 200, 300]) \
    .addGrid(lr.aggregationDepth, [2, 3, 4]) \
    .addGrid(lr.fitIntercept, [True, False]) \
    .build()

# We use f1 score as a metric for optimization
mce = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="labelIndex", metricName='fMeasureByLabel', metricLabel=1, beta=1.0)

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=mce,
                          numFolds=2) 

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(train)
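To make the size of this search concrete, the plain-Python sketch below enumerates the same candidate values with `itertools.product` and counts the model fits that cross-validation performs. It only mirrors the arithmetic; it does not call Spark.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder call above.
grid = {
    "maxIter": [100, 200, 300],
    "aggregationDepth": [2, 3, 4],
    "fitIntercept": [True, False],
}

# Cartesian product of all candidate values: one dict per parameter setting.
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]

num_folds = 2
# Each setting is fitted once per fold; the best one is then refitted on all training data.
total_fits = len(param_maps) * num_folds + 1

print(len(param_maps), total_fits)
```

With 163 training users and 2 folds, each of those fits sees only about half of the training data, so the fold-level F1 estimates are likely to be noisy.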

In [34]:
prediction_cvModel = cvModel.transform(test)

In [35]:
print_metrics(prediction_cvModel)

Accuracy:  0.8709677419354839
Precision:  0.6
Recall:  0.6
F1 score:  0.6


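As we can see, parameter tuning did not improve our model: the test metrics are identical to those of the untuned Logistic Regression. One likely reason is that the grid changes little about the fitted model, since aggregationDepth only controls how Spark aggregates intermediate results across partitions. Tuning might still help when using a bigger dataset.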

In [4]:
spark.version

'3.5.0'