# Sparkify Project Workspace
This workspace contains a tiny subset (128MB) of the full dataset available (12GB). Feel free to use this workspace to build your project, or to explore a smaller subset with Spark before deploying your cluster on the cloud. Instructions for setting up your Spark cluster is included in the last lesson of the Extracurricular Spark Course content.

You can follow the steps below to guide your data analysis and model building portion of this project.

In [1]:
# Import libraries
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, count, when, col, desc, udf, col, sort_array, asc, \
                                  avg, from_unixtime, split, min, max, round, lit, mean
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

#from pyspark.sql.types import IntegerType, TimestampType
import datetime
from pyspark.sql.functions import to_date, year, month, dayofmonth, dayofweek, hour, date_format, substring

import numpy as np
import pandas as pd
import time
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

In [2]:
# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

In [3]:
# Set time parser policy
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

In [5]:
# Load dataset
path = "data/mini_sparkify_event_data.json"
data = spark.read.json(path)
original_count = data.count()

### Common Functions

In [7]:
# Set plot's figure size
def set_plot_size(width, height):
    return plt.figure(figsize = [width, height]);

def get_user_logs(userId, sessionId=None):
    if sessionId == None:
        return data.where(data.userId == userId) \
            .select('tsDate', 'userId', 'sessionId', 'itemInSession', 'level', 'page') \
            .sort('tsDate', 'itemInSession')
    else:
        return data.where((data.userId == userId) & (data.sessionId == sessionId)) \
            .select('tsDate', 'userId', 'sessionId', 'itemInSession', 'level', 'page') \
            .sort('tsDate', 'itemInSession')
    
def get_users(churn):
    return data.where(data.churn == churn).select('userId').dropDuplicates()

# Load and prepare Dataset
In this workspace, the mini-dataset file is `mini_sparkify_event_data.json`. Load and clean the dataset, checking for invalid or missing data - for example, records without userids or sessionids. 

In [8]:
# Remove rows with missing users
data = data.where(~((col('userId').isNull()) | (col('userId') == '')))

# Exclude non-relevant columns
data = data.drop('firstName')
data = data.drop('lastName')

data = data.withColumn('regDate', (col('registration') / 1000.0).cast(TimestampType()))
data = data.withColumn('tsDate', (col('ts') / 1000.0).cast(TimestampType()))
data.take(1)

data = data.withColumn('city', split(data['location'], ',')[0])
data = data.withColumn('state', split(data['location'], ',')[1])
data = data.drop('location')

# Define churned users using Cancellation Confirmation event (canceled)
query_churn_by_cc = data.where(data.page == 'Cancellation Confirmation')
print(f'Churned users who cancelled subscription: {query_churn_by_cc.count()}')

# Label churned (canceled) users
canceled = query_churn_by_cc.select('userId').dropDuplicates().select('userId')
canceled_uids = [row.userId for row in canceled.collect()];
set_churn = udf(lambda x: 1 if x in canceled_uids else 0, IntegerType())
data = data.withColumn('churn', set_churn('userId'))

# Add [userRowId] column that assigns a 1-based index to every user's log ordered by [ts]
w =  Window.partitionBy(data.userId).orderBy('ts', 'itemInSession')
data = data.withColumn('userRowId', row_number().over(w))

# Add [userRowDescId] column that assigns a 1-based index to every user's log ordered by [ts] descending.
w =  Window.partitionBy(data.userId).orderBy(col('ts').desc(), col('itemInSession').desc())
data = data.withColumn('userRowDescId', row_number().over(w))

# Add last level column
last_levels = dict()
for row in data.where(data.userRowDescId == 1).select('userId', 'level').collect():
    last_levels[row.userId] = row.level
get_level = udf(lambda userId: last_levels[userId])
data = data.withColumn('lastLevel', get_level('userId'))

Churned users who cancelled subscription: 52


In [9]:
users = data.select('churn', 'userId').dropDuplicates()

In [11]:
get_users(1).count()

52

# Feature Engineering
Once you've familiarized yourself with the data, build out the features you find promising to train your model on. To work with the full dataset, you can follow the following steps.
- Write a script to extract the necessary features from the smaller subset of data
- Ensure that your script is scalable, using the best practices discussed in Lesson 3
- Try your script on the full data set, debugging your script if necessary

If you are working in the classroom workspace, you can just extract features based on the small subset of data contained here. Be sure to transfer over this work to the larger dataset when you work on your Spark cluster.

In [12]:
labels = data.select(col('churn').alias('label'), 'userId').dropDuplicates()

### User Attribute Features

In [99]:
f_Gender = data \
    .select('userId', 'gender') \
    .dropDuplicates() \
    .replace(['M', 'F'], ['0', '1'], 'gender') \
    .select('userId', col('gender').cast('int').alias('Gender'))

In [100]:
f_LastLevel = data \
    .select('userId', 'lastLevel') \
    .dropDuplicates() \
    .replace(['free', 'paid'], ['0', '1'], 'lastLevel') \
    .select('userId', col('lastLevel').cast('int').alias('LastLevel'))

### Per Log Features

In [101]:
def page_count(page):
    return data \
        .where(data.page == page) \
        .groupby('userId') \
        .agg(count('userId').alias('count')) \
        .select('userId', col('count').alias(page.replace(' ', '') + 'Count'))

In [102]:
f_LogCount = data \
    .groupby('userId') \
    .agg(count('userId').alias('LogCount'))

In [103]:
f_SongCount = data \
    .where(data.page == 'NextSong') \
    .groupby('userId') \
    .agg(count('userId').alias('SongCount'))

In [104]:
f_NonSongCount = data \
    .where(data.page != 'NextSong') \
    .groupby('userId') \
    .agg(count('userId').alias('NonSongCount'))

In [105]:
f_AboutCount = page_count('About')

In [106]:
f_ThumbsUpCount = page_count('Thumbs Up')

In [107]:
f_RollAdvertCount = page_count('Roll Advert')

### Per Session Features

In [108]:
page_data = data.where(~data.page.isin(['Cancel', 'Cancellation Confirmation'])) \
    .select('page', 'userId', 'sessionId', 'ts')

In [109]:
# Average page count per session hour
session_hours = page_data \
    .groupby('userId', 'sessionId') \
    .agg(((max('ts') - min('ts'))/1000/3600).alias('sessionHours'))

def page_session_hour(page):
    return page_data \
        .where(data.page == page) \
        .join(session_hours, ['userId', 'sessionId'], 'inner') \
        .groupby( 'userId', 'sessionId', 'sessionHours') \
        .agg((count('userId')/col('sessionHours')).alias('avgPerSession')) \
        .groupby('userId') \
        .agg(avg('avgPerSession').alias('avg')) \
        .select('userId', col('avg').alias(page.replace(' ', '') + 'PerSessionHour'))

In [110]:
# Average page count per hour
user_hours = page_data \
    .groupby('userId', 'sessionId') \
    .agg(((max('ts') - min('ts'))/1000/3600).alias('sessionHours')) \
    .groupby('userId') \
    .agg(Fsum('sessionHours').alias('hours'))

def page_hour(page):
    return page_data \
        .where(data.page == page) \
        .join(user_hours, 'userId', 'inner') \
        .groupby('userId', 'hours') \
        .agg((count('userId')/col('hours')).alias('avg')) \
        .select('userId', col('avg').alias(page.replace(' ', '') + 'PerHour'))

In [111]:
f_SessionCount = data \
    .select('userId', 'sessionId') \
    .dropDuplicates() \
    .groupby('userId') \
    .agg(count('userId').alias('SessionCount'))

In [112]:
f_AvgSessionLength = data \
    .groupby('userId', 'sessionId') \
    .agg(((max('ts') - min('ts'))/1000).alias('sessionLength')) \
    .groupby('userId') \
    .agg(avg('sessionLength').alias('AvgSessionLength')) \

In [113]:
# Session gap
users = data.select('userId').dropDuplicates()

f_AvgSessionGap = data \
    .groupby('userId', 'sessionId') \
    .agg(min('ts').alias('startTime'), max('ts').alias('endTime')) \
    .groupby('userId') \
    .agg(count('userId').alias('sessionCount'), \
        ((max('endTime') - min('startTime'))/1000).alias('observationPeriodTime'), \
        (Fsum(col('endTime') - col('startTime'))/1000).alias('totalSessionTime')) \
    .where(col('sessionCount') > 1) \
    .join(users, 'userId', 'outer') \
    .fillna(0) \
    .select('userId', \
            (col('observationPeriodTime') - col('totalSessionTime')/(col('sessionCount') - 1)).alias('AvgSessionGap'))

In [114]:
f_DowngradePerSessionHour = page_session_hour('Downgrade')

In [115]:
f_ErrorPerSessionHour = page_session_hour('Error')

In [116]:
f_SettingsPerSessionHour = page_session_hour('Settings')

In [117]:
f_SaveSettingsPerSessionHour = page_session_hour('Save Settings')

In [118]:
f_LogoutPerSessionHour = page_session_hour('Logout')

In [119]:
f_SubmitDowngradePerSessionHour = page_session_hour('Submit Downgrade')

### Per Hour Features

In [120]:
f_RollAdvertPerHour = page_hour('Roll Advert')

In [121]:
f_ThumbsDownPerHour = page_hour('Thumbs Down')

In [122]:
f_UpgradePerHour = page_hour('Upgrade')

In [123]:
f_SubmitUpgradePerHour = page_hour('Submit Upgrade')

### Per Day Features

In [124]:
# Average page count per day
page_data = data \
    .where(~data.page.isin(['Cancel', 'Cancellation Confirmation'])) \
    .select('userId', 'page', 'date')

def page_day(page):
    return page_data \
        .where(data.page == page) \
        .groupby('userId', 'date') \
        .count() \
        .groupby('userId') \
        .agg(avg('count').alias(page.replace(' ', '') + 'PerDay'))

In [125]:
# Average sessions per day
f_SessionsPerDay = data \
    .select('userId', 'date', 'sessionId') \
    .dropDuplicates() \
    .groupby('userId', 'date') \
    .count() \
    .groupby('userId') \
    .agg(avg('count').alias('SessionsPerDay'))

In [126]:
f_AddFriendPerDay = page_day('Add Friend')

In [127]:
f_RollAdvertPerDay = page_day('Roll Advert')

In [128]:
f_ThumbsDownPerDay = page_day('Thumbs Down')

In [129]:
f_ThumbsUpPerDay = page_day('Thumbs Up')

### Per Song Features

In [130]:
# Total song length 
f_TotalSongLength = data \
    .where(data.page == 'NextSong') \
    .select('userId', 'length') \
    .groupby('userId') \
    .agg(Fsum('length').alias('TotalSongLength'))

In [131]:
# Unique song count
f_UniqueSongCount = data \
    .where(data.page == 'NextSong') \
    .select('userId', 'song') \
    .dropDuplicates() \
    .groupby('userId') \
    .agg(count('userId').alias('UniqueSongCount'))

In [132]:
# Share of unique songs among all user's songs
totals = data \
    .where(data.page == 'NextSong') \
    .select('userId') \
    .groupby('userId') \
    .agg(count('userId').alias('total'))

f_UniqueSongShare = data \
    .where(data.page == 'NextSong') \
    .select('userId', 'song') \
    .dropDuplicates() \
    .groupby('userId') \
    .count() \
    .join(totals, on = ['userId'], how = 'inner') \
    .select('userId', (col('count')/col('total')).alias('UniqueSongShare')) 

### model_data

In [134]:
# model 7 (churn #1, seed 0)
model_data  = labels.join(f_Gender, 'userId', 'outer') \
    .join(f_LastLevel, 'userId', 'outer') \
    .join(f_LogCount, 'userId', 'outer') \
    .join(f_SongCount, 'userId', 'outer') \
    .join(f_NonSongCount, 'userId', 'outer') \
    .join(f_AboutCount, 'userId', 'outer') \
    .join(f_ThumbsUpCount, 'userId', 'outer') \
    .join(f_RollAdvertCount, 'userId', 'outer') \
    .join(f_SessionCount, 'userId', 'outer') \
    .join(f_AvgSessionLength, 'userId', 'outer') \
    .join(f_AvgSessionGap, 'userId', 'outer') \
    .join(f_DowngradePerSessionHour, 'userId', 'outer') \
    .join(f_ErrorPerSessionHour, 'userId', 'outer') \
    .join(f_SettingsPerSessionHour, 'userId', 'outer') \
    .join(f_SaveSettingsPerSessionHour, 'userId', 'outer') \
    .join(f_LogoutPerSessionHour, 'userId', 'outer') \
    .join(f_SubmitDowngradePerSessionHour, 'userId', 'outer') \
    .join(f_RollAdvertPerHour, 'userId', 'outer') \
    .join(f_ThumbsDownPerHour, 'userId', 'outer') \
    .join(f_UpgradePerHour, 'userId', 'outer') \
    .join(f_SubmitUpgradePerHour, 'userId', 'outer') \
    .join(f_SessionsPerDay, 'userId', 'outer') \
    .join(f_AddFriendPerDay, 'userId', 'outer') \
    .join(f_RollAdvertPerDay, 'userId', 'outer') \
    .join(f_ThumbsDownPerDay, 'userId', 'outer') \
    .join(f_ThumbsUpPerDay, 'userId', 'outer') \
    .join(f_TotalSongLength, 'userId', 'outer') \
    .join(f_UniqueSongCount, 'userId', 'outer') \
    .join(f_UniqueSongShare, 'userId', 'outer') \
    .drop('userId') \
    .fillna(0)

In [177]:
# model 6
model_data  = labels.join(f_EventCount, 'userId', 'outer') \
    .join(f_AvgSessionCount, 'userId', 'outer') \
    .join(f_RollAdvertCount, 'userId', 'outer') \
    .join(f_ThumbsUpCount, 'userId', 'outer') \
    .join(f_RollAdvertCountPerHour, 'userId', 'outer') \
    .join(f_ThumbsDownCountPerHour, 'userId', 'outer') \
    .join(f_SettingsCountPerHour, 'userId', 'outer') \
    .join(f_LogoutCountPerHour, 'userId', 'outer') \
    .join(f_SaveSettingsCountPerHour, 'userId', 'outer') \
    .join(f_TotalSongLength, 'userId', 'outer') \
    .join(f_AvgNonSongEventCount, 'userId', 'outer') \
    .join(f_UniqueSongCount, 'userId', 'outer') \
    .join(f_UniqueSongShare, 'userId', 'outer') \
    .join(f_AvgGapTime, 'userId', 'outer') \
    .join(f_ThumbsDownCountPerDay, 'userId', 'outer') \
    .join(f_ThumbsUpCountPerDay, 'userId', 'outer') \
    .join(f_AvgSessionCountPerDay, 'userId', 'outer') \
    .drop('userId') \
    .fillna(0)

In [148]:
# model 5
model_data  = labels.join(f_EventCount, 'userId', 'outer') \
    .join(f_AvgSessionCount, 'userId', 'outer') \
    .join(f_RollAdvertCount, 'userId', 'outer') \
    .join(f_SubmitUpgradeCount, 'userId', 'outer') \
    .join(f_ThumbsUpCount, 'userId', 'outer') \
    .join(f_RollAdvertCountPerHour, 'userId', 'outer') \
    .join(f_SubmitUpgradeCountPerHour, 'userId', 'outer') \
    .join(f_ThumbsDownCountPerHour, 'userId', 'outer') \
    .join(f_UpgradeCountPerHour, 'userId', 'outer') \
    .join(f_SettingsCountPerHour, 'userId', 'outer') \
    .join(f_LogoutCountPerHour, 'userId', 'outer') \
    .join(f_SaveSettingsCountPerHour, 'userId', 'outer') \
    .join(f_TotalSongLength, 'userId', 'outer') \
    .join(f_AvgNonSongEventCount, 'userId', 'outer') \
    .join(f_UniqueSongCount, 'userId', 'outer') \
    .join(f_UniqueSongShare, 'userId', 'outer') \
    .join(f_AvgGapTime, 'userId', 'outer') \
    .join(f_ThumbsDownCountPerDay, 'userId', 'outer') \
    .join(f_ThumbsUpCountPerDay, 'userId', 'outer') \
    .join(f_UpgradeCountPerDay, 'userId', 'outer') \
    .drop('userId') \
    .fillna(0)

In [None]:
# model 4
model_data  = labels.join(f_EventCount, 'userId', 'outer') \
    .join(f_AvgSessionCount, 'userId', 'outer') \
    .join(f_RollAdvertCount, 'userId', 'outer') \
    .join(f_SubmitUpgradeCount, 'userId', 'outer') \
    .join(f_ThumbsUpCount, 'userId', 'outer') \
    .join(f_RollAdvertCountPerHour, 'userId', 'outer') \
    .join(f_SubmitUpgradeCountPerHour, 'userId', 'outer') \
    .join(f_ThumbsDownCountPerHour, 'userId', 'outer') \
    .join(f_UpgradeCountPerHour, 'userId', 'outer') \
    .join(f_SettingsCountPerHour, 'userId', 'outer') \
    .join(f_LogoutCountPerHour, 'userId', 'outer') \
    .join(f_SaveSettingsCountPerHour, 'userId', 'outer') \
    .join(f_TotalSongLength, 'userId', 'outer') \
    .join(f_AvgNonSongEventCount, 'userId', 'outer') \
    .join(f_UniqueSongCount, 'userId', 'outer') \
    .join(f_UniqueSongShare, 'userId', 'outer') \
    .join(f_AvgGapTime, 'userId', 'outer') \
    .drop('userId') \
    .fillna(0)

In [130]:
# model 3
model_data  = labels.join(f_EventCount, 'userId', 'outer') \
    .join(f_AvgSessionCount, 'userId', 'outer') \
    .join(f_RollAdvertCount, 'userId', 'outer') \
    .join(f_SubmitUpgradeCount, 'userId', 'outer') \
    .join(f_ThumbsUpCount, 'userId', 'outer') \
    .join(f_RollAdvertCountPerHour, 'userId', 'outer') \
    .join(f_SubmitUpgradeCountPerHour, 'userId', 'outer') \
    .join(f_ThumbsDownCountPerHour, 'userId', 'outer') \
    .join(f_UpgradeCountPerHour, 'userId', 'outer') \
    .join(f_SettingsCountPerHour, 'userId', 'outer') \
    .join(f_LogoutCountPerHour, 'userId', 'outer') \
    .join(f_SaveSettingsCountPerHour, 'userId', 'outer') \
    .join(f_TotalSongLength, 'userId', 'outer') \
    .join(f_AvgNonSongEventCount, 'userId', 'outer') \
    .join(f_UniqueSongCount, 'userId', 'outer') \
    .join(f_UniqueSongShare, 'userId', 'outer') \
    .drop('userId') \
    .fillna(0)

# Modeling
Split the full dataset into train, test, and validation sets. Test out several of the machine learning methods you learned. Evaluate the accuracy of the various models, tuning parameters as necessary. Determine your winning model based on test accuracy and report results on the validation set. Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.

In [135]:
from pyspark.ml.feature import VectorAssembler, Normalizer, StandardScaler, MinMaxScaler;
from pyspark.ml import Pipeline;
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier,NaiveBayes,DecisionTreeClassifier;
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder;
from pyspark.ml.evaluation import MulticlassClassificationEvaluator;

In [136]:
model_data.printSchema()

root
 |-- label: integer (nullable = true)
 |-- Gender: integer (nullable = true)
 |-- LastLevel: integer (nullable = true)
 |-- LogCount: long (nullable = true)
 |-- SongCount: long (nullable = true)
 |-- NonSongCount: long (nullable = true)
 |-- AboutCount: long (nullable = true)
 |-- ThumbsUpCount: long (nullable = true)
 |-- RollAdvertCount: long (nullable = true)
 |-- SessionCount: long (nullable = true)
 |-- AvgSessionLength: double (nullable = false)
 |-- AvgSessionGap: double (nullable = false)
 |-- DowngradePerSessionHour: double (nullable = false)
 |-- ErrorPerSessionHour: double (nullable = false)
 |-- SettingsPerSessionHour: double (nullable = false)
 |-- SaveSettingsPerSessionHour: double (nullable = false)
 |-- LogoutPerSessionHour: double (nullable = false)
 |-- SubmitDowngradePerSessionHour: double (nullable = false)
 |-- RollAdvertPerHour: double (nullable = false)
 |-- ThumbsDownPerHour: double (nullable = false)
 |-- UpgradePerHour: double (nullable = false)
 |-- Sub

In [137]:
#Put the features to be trained into a vector
features = model_data.drop('label').columns
assembler = VectorAssembler(inputCols=features, outputCol="NumFeatures")

# Use StandardScaler
stdscaler = StandardScaler(inputCol="NumFeatures", outputCol="features")

# Use MinMaxScaler
minmaxscaler = MinMaxScaler(inputCol="NumFeatures", outputCol="features")

In [138]:
# spilt data for Train and test 
train, test = model_data.randomSplit([0.8, 0.2], seed = 0);

In [139]:
# using five different modeling approaches 

# Logistic Regression
model_lr = LogisticRegression(featuresCol="features")
# naive bayes
model_nb = NaiveBayes(featuresCol='features')
# decision tree 
model_dtc = DecisionTreeClassifier(featuresCol='features', seed=4)
# RandomForestClassifier
model_rf = RandomForestClassifier(featuresCol="features")
# GBT Classifier
model_gbt = GBTClassifier(featuresCol="features")

In [140]:
# Create Pipeline with standardscaled data 
pipeline_nb = Pipeline(stages=[assembler, stdscaler, model_nb]);
pipeline_lr = Pipeline(stages=[assembler, stdscaler, model_lr]);
pipeline_dtc = Pipeline(stages=[assembler, stdscaler, model_dtc]);
pipeline_rf = Pipeline(stages=[assembler, stdscaler, model_rf]);
pipeline_gbt = Pipeline(stages=[assembler, stdscaler, model_gbt])

In [141]:
# model performance evaluation
def eval_model(model, test, metric):
    """ Calculate Model Scores metric
        Input: 
            model- trained model or pipeline object
            metric- the metric chosen to measure performance
            data - dataset for the evaluation
        Output:
            score
    """
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(labelCol = "label", metricName = metric)

    # find score
    score = evaluator.evaluate(predictions)
    #return score
    return score

In [142]:
get_users(1).count()

52

### Logistic Regression

In [143]:
# Fit model
start_lr = time.time()
model_lr_fitted = pipeline_lr.fit(train);
end_lr = time.time()
lr_time = end_lr - start_lr

print('Total training time for logistic regression: {} seconds'.format(lr_time))

Total training time for logistic regression: 512.9540655612946 seconds


In [144]:
# model 7
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.82 Logistic Regression - F1 score: 0.8348235294117647


In [56]:
# Small model
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.7166666666666667 Logistic Regression - F1 score: 0.6682596575069694


In [73]:
# Extended model
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.7666666666666667 Logistic Regression - F1 score: 0.7542179662562699


In [107]:
# model 2
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.6935483870967742 Logistic Regression - F1 score: 0.6575340563509366


In [119]:
# model 3
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.6935483870967742 Logistic Regression - F1 score: 0.6412688470314528


In [137]:
# model 4 (with session gap)
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.8709677419354839 Logistic Regression - F1 score: 0.8525674199298487


In [155]:
# model 5
lr_f1 = eval_model(model_lr_fitted, test,"f1")

lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.8387096774193549 Logistic Regression - F1 score: 0.8295943687648757


In [168]:
# model 6
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.8548387096774194 Logistic Regression - F1 score: 0.8300747170148985


In [183]:
# model 6 - churn #1
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.8548387096774194 Logistic Regression - F1 score: 0.8300747170148985


In [190]:
# model 6 - churn #1 (seed=0)
lr_f1 = eval_model(model_lr_fitted, test,"f1")
lr_acc = eval_model(model_lr_fitted, test,"accuracy")
print('Logistic Regression - Accuracy: {}'.format(lr_acc),'Logistic Regression - F1 score: {}'.format(lr_f1))

Logistic Regression - Accuracy: 0.86 Logistic Regression - F1 score: 0.8645800176834659


### Other models

In [74]:
#  decision tree
start_dtc = time.time()
model_dtc_fitted = pipeline_dtc.fit(train);
end_dtc =  time.time()
dtc_time = end_dtc - start_dtc

print('Total training time for decision tree: {} seconds'.format(dtc_time))

# naive bayes
start_nb = time.time()
model_nb_fitted = pipeline_nb.fit(train);
end_nb = time.time()
nb_time = end_nb - start_nb

print('Total training time for naive bayes: {} seconds'.format(nb_time))

# random forest
start_rf =  time.time()
model_rf_fitted = pipeline_rf.fit(train);
end_rf =  time.time()
rf_time = end_rf - start_rf
print('Total training time for random forest: {} seconds'.format(rf_time))

# gradient-boosted tree
start_gbt =  time.time()
model_gbt_fitted = pipeline_gbt.fit(train)
end_gbt = time.time()
gbt_time = end_gbt - start_gbt

print('Total training time for gradient-boosted trees: {} seconds'.format(gbt_time))

Total training time for decision tree: 599.4960293769836 seconds
Total training time for naive bayes: 398.19781160354614 seconds
Total training time for random forest: 582.1303839683533 seconds
Total training time for gradient-boosted trees: 706.4461135864258 seconds


In [75]:
dtc_acc = eval_model(model_dtc_fitted, test,"accuracy")
dtc_f1 = eval_model(model_dtc_fitted, test,"f1")
print('Decision Tree - Accuracy: {}'.format(dtc_acc),'Decision Tree - F1 score: {}'.format(dtc_f1))

Decision Tree - Accuracy: 0.7 Decision Tree - F1 score: 0.6839945280437756


In [76]:
nb_acc = eval_model(model_nb_fitted, test,"accuracy")
nb_f1 = eval_model(model_nb_fitted, test,"f1")
print('Naive Bayes - Accuracy: {}'.format(nb_acc),'Naive Bayes - F1 score: {}'.format(nb_f1))

Naive Bayes - Accuracy: 0.6833333333333333 Naive Bayes - F1 score: 0.6851909025419369


In [77]:
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.7333333333333333 Random Forest - F1 score: 0.6935817805383023


In [78]:
gbt_acc =eval_model(model_gbt_fitted, test,"accuracy")
gbt_f1 =eval_model(model_gbt_fitted, test,"f1")

print('Gradient-Boosted Tree - Accuracy: {}'.format(gbt_acc),'Gradient-Boosted Tree - F1 score: {}'.format(gbt_f1))

Gradient-Boosted Tree - Accuracy: 0.6666666666666666 Gradient-Boosted Tree - F1 score: 0.6488828089375284


In [79]:
result = {'Model': ['Logistic Regression','Decision Tree','Naive Bayes','Random Forest','Grandient-Boosted Tree'],
          'F1': [lr_f1,dtc_f1,nb_f1,rf_f1,gbt_f1],
          'Accuracy':[lr_acc,dtc_acc,nb_acc,rf_acc,gbt_acc],
          'Training Time':[lr_time,dtc_time,nb_time,rf_time,gbt_time]}
result = pd.DataFrame(result)
result

Unnamed: 0,Model,F1,Accuracy,Training Time
0,Logistic Regression,0.754218,0.766667,753.909966
1,Decision Tree,0.683995,0.7,599.496029
2,Naive Bayes,0.685191,0.683333,398.197812
3,Random Forest,0.693582,0.733333,582.130384
4,Grandient-Boosted Tree,0.648883,0.666667,706.446114


In [80]:
feature_coeff_rf = model_rf_fitted.stages[-1].featureImportances
featureImp_rf = pd.DataFrame(list(zip(features, feature_coeff_rf)), columns=['Feature', 'FeatureImportances']).sort_values('FeatureImportances', ascending=False)

featureImp_rf

Unnamed: 0,Feature,FeatureImportances
8,ThumbsDownCountPerHour,0.140007
4,RollAdvertCount,0.098068
10,SettingsCountPerHour,0.097668
6,RollAdvertCountPerHour,0.085586
3,AvgSessionLength,0.074448
7,SubmitUpgradeCountPerHour,0.065775
5,UpgradeCount,0.06427
15,SubmitUpgradeShare,0.061985
13,UniqueSongCount,0.061182
11,TotalSongLength,0.059198


### Random Forest

In [145]:
# random forest
start_rf =  time.time()
model_rf_fitted = pipeline_rf.fit(train);
end_rf =  time.time()
rf_time = end_rf - start_rf
print('Total training time for random forest: {} seconds'.format(rf_time))

Total training time for random forest: 586.8280704021454 seconds


In [146]:
# model 7 - churn #1 (seed=0)
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.9 Random Forest - F1 score: 0.9032714412024758


In [192]:
# model 6 - churn #1 (seed=0)
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.84 Random Forest - F1 score: 0.8561904761904762


In [170]:
# model 6
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.8225806451612904 Random Forest - F1 score: 0.7923135430182093


In [157]:
# model 5 
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.8387096774193549 Random Forest - F1 score: 0.8061414392059554


In [139]:
# model 4 - with gap time
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.7903225806451613 Random Forest - F1 score: 0.7656811964506408


In [121]:
# model 3
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.7580645161290323 Random Forest - F1 score: 0.7007033713315548


In [110]:
# model 2 - to many features
rf_acc = eval_model(model_rf_fitted, test,"accuracy")
rf_f1 = eval_model(model_rf_fitted, test,"f1")
print('Random Forest - Accuracy: {}'.format(rf_acc),'Random Forest - F1 score: {}'.format(rf_f1))

Random Forest - Accuracy: 0.7096774193548387 Random Forest - F1 score: 0.6510545905707196


In [193]:
# coefficients (model 6 - churn #1 - seed 0)
feature_coeff_rf = model_rf_fitted.stages[-1].featureImportances
featureImp_rf = pd.DataFrame(list(zip(features, feature_coeff_rf)), columns=['Feature', 'FeatureImportances']) \
    .sort_values('FeatureImportances', ascending=False)
featureImp_rf

Unnamed: 0,Feature,FeatureImportances
13,AvgGapTime,0.348634
2,RollAdvertCount,0.077492
5,ThumbsDownCountPerHour,0.073061
16,AvgSessionCountPerDay,0.071618
6,SettingsCountPerHour,0.060837
14,ThumbsDownCountPerDay,0.044913
1,AvgSessionLength,0.042412
10,AvgNonSongEventCount,0.039702
15,ThumbsUpCountPerDay,0.037318
12,UniqueSongShare,0.037015


In [147]:
# coefficients (model 7)
feature_coeff_rf = model_rf_fitted.stages[-1].featureImportances
featureImp_rf = pd.DataFrame(list(zip(features, feature_coeff_rf)), columns=['Feature', 'FeatureImportances']) \
    .sort_values('FeatureImportances', ascending=False)
featureImp_rf

Unnamed: 0,Feature,FeatureImportances
10,AvgSessionGap,0.217396
18,ThumbsDownPerHour,0.096786
21,SessionsPerDay,0.081389
17,RollAdvertPerHour,0.054524
24,ThumbsDownPerDay,0.054109
22,AddFriendPerDay,0.046528
25,ThumbsUpPerDay,0.044301
23,RollAdvertPerDay,0.037289
11,DowngradePerSessionHour,0.037254
8,SessionCount,0.032588


In [171]:
# coefficients (model 6)
feature_coeff_rf = model_rf_fitted.stages[-1].featureImportances
featureImp_rf = pd.DataFrame(list(zip(features, feature_coeff_rf)), columns=['Feature', 'FeatureImportances']) \
    .sort_values('FeatureImportances', ascending=False)
featureImp_rf

Unnamed: 0,Feature,FeatureImportances
13,AvgGapTime,0.275502
16,AvgSessionCountPerDay,0.10983
5,ThumbsDownCountPerHour,0.10373
0,EventCount,0.067746
15,ThumbsUpCountPerDay,0.05795
14,ThumbsDownCountPerDay,0.052502
2,RollAdvertCount,0.039696
4,RollAdvertCountPerHour,0.039132
7,LogoutCountPerHour,0.036775
1,AvgSessionLength,0.035483


In [158]:
# coefficients (model 5)
feature_coeff_rf = model_rf_fitted.stages[-1].featureImportances
featureImp_rf = pd.DataFrame(list(zip(features, feature_coeff_rf)), columns=['Feature', 'FeatureImportances']) \
    .sort_values('FeatureImportances', ascending=False)
featureImp_rf

Unnamed: 0,Feature,FeatureImportances
16,AvgGapTime,0.257101
7,ThumbsDownCountPerHour,0.109913
17,ThumbsDownCountPerDay,0.057647
2,RollAdvertCount,0.056023
10,LogoutCountPerHour,0.055494
9,SettingsCountPerHour,0.054747
13,AvgNonSongEventCount,0.048382
5,RollAdvertCountPerHour,0.045455
15,UniqueSongShare,0.04392
0,EventCount,0.043546


In [140]:
# coefficients (model 4)
feature_coeff_rf = model_rf_fitted.stages[-1].featureImportances
featureImp_rf = pd.DataFrame(list(zip(features, feature_coeff_rf)), columns=['Feature', 'FeatureImportances']) \
    .sort_values('FeatureImportances', ascending=False)
featureImp_rf

Unnamed: 0,Feature,FeatureImportances
16,AvgGapTime,0.25141
7,ThumbsDownCountPerHour,0.130147
0,EventCount,0.085259
9,SettingsCountPerHour,0.067623
5,RollAdvertCountPerHour,0.057965
10,LogoutCountPerHour,0.057503
2,RollAdvertCount,0.054857
1,AvgSessionLength,0.043084
12,TotalSongLength,0.036899
8,UpgradeCountPerHour,0.035458


# Final Steps
Clean up your code, adding comments and renaming variables to make the code easier to read and maintain. Refer to the Spark Project Overview page and Data Scientist Capstone Project Rubric to make sure you are including all components of the capstone project and meet all expectations. Remember, this includes thorough documentation in a README file in a Github repository, as well as a web app or blog post.