# Sparkify Project

In this notebook we look at Sparkify customer event data, which includes 286,500 events for the fictional song service.

We are looking to answer the below questions:

1.  Can we predict that a user will churn?  Churn is when a customer downgrades or cancels their service.  (For labeling purposes, this notebook treats a confirmed cancellation as churn.)  We want to avoid churn wherever possible by identifying at-risk users and offering them special deals or the like.
2.  How can we best make this prediction?  We will evaluate many options to perform this prediction and pick what works best.

We will be following the CRISP-DM process throughout. This process involves the below phases for this project:

1.  Business Understanding
2.  Data Understanding
3.  Prepare Data
4.  Data Modeling
5.  Evaluate the Results

We will annotate the section for each phase as we move along.

# CRISP-DM: Business Understanding

Churn is challenging and costly for most businesses.  These companies actively look for ways to avoid churn and retain their customers for as long as possible.  Subscription services like Netflix, AT&T, and others are all very concerned with churn.  If they could predict which customers would churn, they would be able to target offers to them to better retain them before the churn occurs.

Reference articles:

1.  Amaresan, Swetha "What is Customer Churn?"  https://blog.hubspot.com/service/what-is-customer-churn
2.  Stec, Carly "How to Reduce Customer Churn"  https://blog.hubspot.com/service/how-to-reduce-customer-churn
3.  Orac, Roman "Churn prediction"  https://towardsdatascience.com/churn-prediction-770d6cb582a5

# CRISP-DM: Data Understanding

The data consists of user events from the Sparkify music service.  Each row represents an event.  An event occurs when the user does something in the app, such as playing the next song, visiting the home page, changing their settings, or subscribing to the service.  Each event is tied to a user and carries a timestamp along with song information where applicable.

Based on this data, we would like to predict the users who will churn.

Let's read in the data and take a look. There is a single JSON data file that contains all users' actions on the platform.

In [1]:
# import libraries
# pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, avg, count, desc, isnan, udf
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType

# scikit-learn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# miscellaneous
import datetime
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
# create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify Project") \
    .getOrCreate()

### Let's load the data
We are using the mini dataset rather than the larger S3-hosted dataset, which can only be used on an EMR cluster.

Note: we tried using the larger dataset without an EMR cluster, but this was not possible on the platform we had.

In [3]:
# load the data
use_full_dataset = False

# either load the full dataset or the smaller subset
if use_full_dataset:
    path = 's3n://udacity-dsnd/sparkify/sparkify_event_data.json'
else:
    path = 'mini_sparkify_event_data.json'
    #path = 's3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json'
    
# read the json data
df = spark.read.json(path)

### Let's take a look at the schema
Here you can see what each user event contains.  We will not need to use all these fields for our prediction so we will analyze which are the most relevant fields for our use.

At first glance it seems that the following are the most important:
1. gender
2. location
3. page
4. ts
5. userId

In [4]:
df.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



### Let's experiment with some sql queries

In [5]:
# create the view
df.createOrReplaceTempView("df_table")

In [6]:
# Show one event for one user
spark.sql('''
            SELECT *
            FROM df_table
            WHERE userId == '132'
          ''').show(1)

+-----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+-----------------+------+-------------+--------------------+------+
|           artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|             song|status|           ts|           userAgent|userId|
+-----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+-----------------+------+-------------+--------------------+------+
|Justin Timberlake|Logged In|    Sadie|     F|            0|   Jones|275.66975| free|Denver-Aurora-Lak...|   PUT|NextSong|1537054553000|      131|Still On My Brain|   200|1538470796000|"Mozilla/5.0 (Mac...|   132|
+-----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+-----------------+------+-------------+--------------------+------+

In [7]:
# Show one cancellation event
spark.sql('''
            SELECT *
            FROM df_table
            WHERE page == 'Cancellation Confirmation'
          ''').show(1)

+------+---------+---------+------+-------------+--------+------+-----+------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|artist|     auth|firstName|gender|itemInSession|lastName|length|level|          location|method|                page| registration|sessionId|song|status|           ts|           userAgent|userId|
+------+---------+---------+------+-------------+--------+------+-----+------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|  null|Cancelled|   Adriel|     M|          104| Mendoza|  null| paid|Kansas City, MO-KS|   GET|Cancellation Conf...|1535623466000|      514|null|   200|1538943990000|"Mozilla/5.0 (Mac...|    18|
+------+---------+---------+------+-------------+--------+------+-----+------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
only showing top 1 row

### Let's take a look at the fields that are enumerations

Understanding the enumerated fields will be important for building our data model.

In [8]:
# show the genders
df.select('gender').distinct().show()

# show the page types
df.select('page').distinct().show(200)

+------+
|gender|
+------+
|     F|
|  null|
|     M|
+------+

+--------------------+
|                page|
+--------------------+
|              Cancel|
|    Submit Downgrade|
|         Thumbs Down|
|                Home|
|           Downgrade|
|         Roll Advert|
|              Logout|
|       Save Settings|
|Cancellation Conf...|
|               About|
| Submit Registration|
|            Settings|
|               Login|
|            Register|
|     Add to Playlist|
|          Add Friend|
|            NextSong|
|           Thumbs Up|
|                Help|
|             Upgrade|
|               Error|
|      Submit Upgrade|
+--------------------+



In [9]:
# show the status codes, these look like web statuses
df.select('status').distinct().show()

# show some sample user id's
df.select('userId').distinct().show(5)

+------+
|status|
+------+
|   307|
|   404|
|   200|
+------+

+------+
|userId|
+------+
|100010|
|200002|
|   125|
|    51|
|   124|
+------+
only showing top 5 rows



In [10]:
# just show a sample row
df.head()

Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30')

### There are 286,500 events with 18 pieces of information for each

In [11]:
# show the size of the data
print((df.count(), len(df.columns)))

(286500, 18)


### Let's look for null data

Based on the below there are quite a few null values, but let's see if they matter to us.  Based on the fields we found important, let's see how many nulls are in those columns:

1. gender:  8,346 nulls
2. location:  8,346 nulls
3. page:  0
4. ts:  0
5. userId:  0

In [12]:
import pyspark.sql.functions as F

# this will show the number of nulls in each column
df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]).show()

+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
|artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId| song|status| ts|userAgent|userId|
+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
| 58392|   0|     8346|  8346|            0|    8346| 58392|    0|    8346|     0|   0|        8346|        0|58392|     0|  0|     8346|     0|
+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+



### Looking at the nulls for gender shows that the userId is blank for these rows, but not null

In [13]:
df.where(F.isnull(df.gender)).show(5)

+------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
|artist|      auth|firstName|gender|itemInSession|lastName|length|level|location|method| page|registration|sessionId|song|status|           ts|userAgent|userId|
+------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
|  null|Logged Out|     null|  null|          100|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355745000|     null|      |
|  null|Logged Out|     null|  null|          101|    null|  null| free|    null|   GET| Help|        null|        8|null|   200|1538355807000|     null|      |
|  null|Logged Out|     null|  null|          102|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355841000|     null|      |
|  null|Logged Out|     null|  nul

In [14]:
# just in case, let's drop any rows with null userId's
df = df.dropna(how='any', subset = ['userId'])

### OK, there were no null userIds, so dropping them had no effect

In [15]:
print(df.count())

286500


### Let's see how many events have a blank userId

The count is 8,346, matching the number of events with null gender and location.

In [16]:
df.where(df.userId == '').count()

8346

### Let's filter out the blank userId events, as they are not useful for our prediction

Now we have 278,154 events left.

In [17]:
# filter out the blanks
df = df.filter(df['userId'] != '')
print(df.count())

278154
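
The reason `dropna` removed nothing while the filter removed 8,346 rows is that an empty string is a value, not a null. Here is a minimal plain-Python sketch of the same idea. The event dictionaries are hypothetical and only illustrate the distinction; `None` stands in for a Spark null.

```python
# Hypothetical events: None plays the role of a Spark null, '' is a blank string.
events = [
    {"userId": "30", "page": "NextSong"},
    {"userId": "", "page": "Home"},   # logged-out event: blank, not null
    {"userId": "", "page": "Help"},
    {"userId": "18", "page": "Cancellation Confirmation"},
]

# Dropping nulls removes nothing, because '' is not None.
no_nulls = [e for e in events if e["userId"] is not None]

# Filtering on the empty string is what actually removes the logged-out rows.
identified = [e for e in no_nulls if e["userId"] != ""]

print(len(events), len(no_nulls), len(identified))
```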


### There are no longer any gender or location nulls

Looks like we killed two birds with one stone.  :-)

In [18]:
print(df.where(F.isnull(df.gender)).count())
print(df.where(F.isnull(df.location)).count())

0
0


### Let's take a look at what a cancellation looks like

We will also look at the event stream of one user who cancelled, to get a feel for what a user lifecycle looks like.

In [19]:
df.filter("page = 'Cancellation Confirmation'").show(5)

+------+---------+---------+------+-------------+--------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|artist|     auth|firstName|gender|itemInSession|lastName|length|level|            location|method|                page| registration|sessionId|song|status|           ts|           userAgent|userId|
+------+---------+---------+------+-------------+--------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|  null|Cancelled|   Adriel|     M|          104| Mendoza|  null| paid|  Kansas City, MO-KS|   GET|Cancellation Conf...|1535623466000|      514|null|   200|1538943990000|"Mozilla/5.0 (Mac...|    18|
|  null|Cancelled|    Diego|     M|           56|   Mckee|  null| paid|Phoenix-Mesa-Scot...|   GET|Cancellation Conf...|1537167593000|      540|null|   200|1539033046000|"Mozilla/5.0 (iPh...|    32|
|  nu

In [20]:
df.select(['userId', 'gender', 'level', 'location', 'ts', 'artist', 'page']).where(df.userId == '18').collect()

[Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538499917000, artist=None, page='Home'),
 Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538499933000, artist='Mike And The Mechanics', page='NextSong'),
 Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538500208000, artist='Taking Back Sunday', page='NextSong'),
 Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538500476000, artist='Beirut', page='NextSong'),
 Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538500654000, artist='Bob Log III', page='NextSong'),
 Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538500842000, artist='Krisiun', page='NextSong'),
 Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538500856000, artist=None, page='Settings'),
 Row(userId='18', gender='M', level='paid', location='Kansas City, MO-KS', ts=1538

In [21]:
df.select(['page']).where(df.userId == '18').collect()

[Row(page='Home'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='Settings'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='Settings'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='Thumbs Up'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='Thumbs Up'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='Logout'),
 Row(page='Home'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='Add to Playlist'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='NextSong'),
 Row(page='N
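
A long event stream like this one is easier to read as a tally. As a rough sketch, assuming the pages have already been collected into a Python list, `collections.Counter` summarizes them. The short `pages` list below is a hypothetical excerpt, not user 18's full history.

```python
from collections import Counter

# Hypothetical excerpt of one user's page sequence, in time order.
pages = ["Home", "NextSong", "NextSong", "Settings", "NextSong",
         "Thumbs Up", "NextSong", "Logout", "Home", "NextSong",
         "Add to Playlist", "NextSong"]

page_tally = Counter(pages)

# Print the most frequent page types first.
for page, n in page_tally.most_common(3):
    print(page, n)
```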

# CRISP-DM: Prepare Data

We will now look even more closely at the data from the event log and prepare it for our use with features and labels.

## Features

Based on our data analysis we will be using the following features in order to predict the churn.  Included is an analysis of why this data is important and should be a feature.

### Inputs
1.  Thumbs Up total.  We will count how many times the user gave a song a thumbs up.  The more feedback a user gives, the more the service learns about them, which should make them stickier.
2.  Thumbs Down total.  We will count how many times the user gave a song a thumbs down.  This is also a sign of engagement, but it could equally indicate that the user is dissatisfied or that the service's recommendation system is not working well for them.
3.  Add to Playlist total.  We will count how many times a user added a song to a playlist.  Playlists are a service feature that helps make users stickier and should reduce churn.
4.  Add Friend total.  We will count how many times a user added a friend.  Due to the network effect, the more friends you have on a service, the more likely you are to stay with it, thus reducing churn.
5.  Error total.  We will count how many errors the user encountered.  Technical difficulties frustrate users and make it more likely they will churn.
6.  Next Song count per day for last 7 days.  We will count the user's activity per day over the last 7 days of their data.  The theory is that an active user should not churn, while a user whose recent usage is slowing down is more likely to churn.  (Note that the aggregation code below counts all of a user's events per day, not only `NextSong` events, so this is effectively an activity count.)
7.  Next Song count per hour bucket over all time.  We will count the user's activity per hour bucket over the whole data set.  An hour bucket is a group of contiguous hours over which counts are summed.  With 8 buckets, each bucket covers 3 contiguous hours.  The theory is that the distribution of listening over the course of a day lets us categorize users and find users who resemble each other, which should lead to a better prediction.  (As above, the code counts all events, not only `NextSong`.)

### Prediction
1.  Churn 0 or 1.  0 indicates no churn while 1 indicates churn.  For the training data any user who has a 'Cancellation Confirmation' event will be defined as having churned.


### We will start by getting the data we need from the larger data set

We decided that userId, gender, and location were important for prediction so we will grab those.  For gender and location we will convert those into category codes which are helpful for training models.

In [22]:
# get list of users along with their gender and location
users = df.select('userId', 'gender', 'location').distinct().toPandas()

# convert gender to category code
users['gender'] = users['gender'].astype('category').cat.codes

# convert location to category code
users['location'] = users['location'].astype('category').cat.codes

print(users.count())
users.head()

userId      225
gender      225
location    225
dtype: int64


| | userId | gender | location |
|---|---|---|---|
| 0 | 100021 | 1 | 26 |
| 1 | 139 | 1 | 56 |
| 2 | 41 | 0 | 58 |
| 3 | 72 | 0 | 82 |
| 4 | 200016 | 0 | 9 |
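
For readers unfamiliar with category codes, here is a small sketch of what `astype('category').cat.codes` produces. The toy `demo` frame is hypothetical. The point is that codes are assigned by sorted category order, so they are arbitrary identifiers rather than meaningful magnitudes.

```python
import pandas as pd

# Hypothetical users; the values are illustrative only.
demo = pd.DataFrame({
    "userId": ["1", "2", "3", "4"],
    "gender": ["M", "F", "F", "M"],
    "location": ["Denver, CO", "Bakersfield, CA", "Denver, CO", "Austin, TX"],
})

# Categories are the sorted unique values; each value maps to its position.
gender_codes = demo["gender"].astype("category").cat.codes.tolist()
location_codes = demo["location"].astype("category").cat.codes.tolist()

print(gender_codes)
print(location_codes)
```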


### Now let's get the total user activity counts for each user

This will be useful later as we scale the per hour and per day activity for each user.  Stay tuned.

In [23]:
# get list of page counts overall
total_counts = df.groupBy('userId').agg(count('*').alias('count')).sort('userId').toPandas()
total_counts.set_index('userId', inplace=True)
total_counts.head()

| userId | count |
|---|---|
| 10 | 795 |
| 100 | 3214 |
| 100001 | 187 |
| 100002 | 218 |
| 100003 | 78 |


### Let's also get the page counts per page type for each user

We will need this to distinguish between the various activity types we will be using for prediction.

In [24]:
# get list of page counts by page type
page_counts = df.groupBy('userId', 'page').agg(count('*').alias('count')).sort('userId', 'page').toPandas()
page_counts.head()

| | userId | page | count |
|---|---|---|---|
| 0 | 10 | About | 2 |
| 1 | 10 | Add Friend | 12 |
| 2 | 10 | Add to Playlist | 9 |
| 3 | 10 | Downgrade | 7 |
| 4 | 10 | Help | 1 |


### Now let's get data related to user activity per hour

To do that we need to decode each timestamp into the hour of day it falls in and add a column for the decoded hour.  Note that the aggregation below counts all of a user's events in each hour, not only songs played.

In [25]:
# extract the hour from a timestamp
get_hour = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).hour, IntegerType())

In [26]:
# add a column for the hour
df_new = df.withColumn('hour', get_hour(df.ts))

In [27]:
# aggregate the user's activity for each hour of the day
# we will use this later
hour_counts = df_new.groupBy('userId', 'hour').agg(count('*').alias('count')).sort('userId', 'hour').toPandas()
hour_counts.head()

| | userId | hour | count |
|---|---|---|---|
| 0 | 10 | 0 | 27 |
| 1 | 10 | 1 | 18 |
| 2 | 10 | 2 | 34 |
| 3 | 10 | 3 | 40 |
| 4 | 10 | 4 | 52 |
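
The conversion inside `get_hour` can be checked by hand in plain Python. This sketch uses `ts` values from rows printed earlier. As an assumption that differs from the notebook, it pins the timezone to UTC so the result does not depend on the machine's local timezone (`fromtimestamp` without a `tz` argument uses local time). The helper name `hour_of_event` is hypothetical.

```python
import datetime

def hour_of_event(ts_ms: int) -> int:
    """Return the UTC hour (0-23) for a timestamp given in milliseconds."""
    seconds = ts_ms / 1000.0
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc).hour

sample_ts = 1538352117000   # ts of the sample row shown by df.head()
other_ts = 1538470796000    # ts of the event shown for userId '132'

print(hour_of_event(sample_ts), hour_of_event(other_ts))
```

If local-time hours are what matters for the model, the user's own timezone would be needed, which this dataset does not provide directly.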


### Now let's get data related to user activity per day

To do that we first need to find the start of the event log by finding the lowest timestamp.  From there we compute the relative day of every event in the log.  Once we have the relative days, we can order this data for each user and use it for prediction.  As with the hourly counts, the aggregation counts all of a user's events, not only songs played.

In [28]:
# find the lowest timestamp overall
min_ts = df.groupby().min('ts').collect()[0]['min(ts)']

In [29]:
# create a column that shows the relative day for the entry
df_new = df.withColumn('rel_day', ((df.ts - min_ts)/(1000*60*60*24)).cast(IntegerType()))

In [30]:
# aggregate the user's activity for each relative day and sort this per user and relative day
# we will use this later
day_counts = df_new.groupBy('userId', 'rel_day').agg(count('*').alias('count')).sort('userId', 'rel_day').toPandas()
day_counts = day_counts.sort_values(by=['userId', 'rel_day'], ascending=False)
day_counts.reset_index(inplace=True, drop=True)
day_counts.head()

| | userId | rel_day | count |
|---|---|---|---|
| 0 | 99 | 57 | 126 |
| 1 | 99 | 56 | 7 |
| 2 | 99 | 54 | 141 |
| 3 | 99 | 52 | 88 |
| 4 | 99 | 49 | 23 |
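
The relative-day arithmetic is integer division of the elapsed milliseconds by the number of milliseconds in a day. Here is a minimal sketch. `min_ts` is an assumed log start (the sample row's timestamp, which is not necessarily the true minimum), the other timestamps are ones printed in earlier cells, and `relative_day` is a hypothetical helper name.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24

def relative_day(ts_ms: int, min_ts_ms: int) -> int:
    """Whole days elapsed since the start of the log."""
    return (ts_ms - min_ts_ms) // MS_PER_DAY

min_ts = 1538352117000  # assumed log start, for illustration only
for ts in (1538352117000, 1538470796000, 1538943990000):
    print(ts, relative_day(ts, min_ts))
```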


### Now we will define our churn measure and save it.

We have defined churn as whenever the user performs a confirmed cancellation.  We will extract this data and save it for each user.  0 indicates the user has not churned while 1 indicates the user has churned for training and testing purposes.

In [31]:
# create churn measure
churn = page_counts[page_counts['page'] == 'Cancellation Confirmation'].copy()
churn.drop('page', inplace=True, axis=1)
churn.rename(columns={'count': 'churn'}, inplace=True)
churn.head()

| | userId | churn |
|---|---|---|
| 33 | 100001 | 1 |
| 54 | 100003 | 1 |
| 80 | 100005 | 1 |
| 93 | 100006 | 1 |
| 104 | 100007 | 1 |


### Once we have the churn defined, we will add it to our labels which we'll be trying to predict.

In [32]:
# create labels with all users
labels = users.copy()

# add churn
labels = labels.merge(churn, on='userId', how='left')
labels.fillna(0, inplace=True)
labels.churn = labels.churn.astype(int)

# set user id as the index
labels.set_index('userId', inplace=True)

print(labels.count())
labels.head()

gender      225
location    225
churn       225
dtype: int64


| userId | gender | location | churn |
|---|---|---|---|
| 100021 | 1 | 26 | 1 |
| 139 | 1 | 56 | 0 |
| 41 | 0 | 58 | 0 |
| 72 | 0 | 82 | 0 |
| 200016 | 0 | 9 | 1 |
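
The labeling logic (extract the churners, left-join them onto all users, fill the gaps with 0) can be seen in isolation on a toy example. The frames `users_demo` and `churn_demo` are hypothetical. The sketch mainly shows why the cast back to `int` is needed after the merge.

```python
import pandas as pd

# Hypothetical inputs: three users, one of whom has a cancellation row.
users_demo = pd.DataFrame({"userId": ["10", "11", "12"]})
churn_demo = pd.DataFrame({"userId": ["11"], "churn": [1]})

# The left join keeps every user; users absent from churn_demo get NaN.
labels_demo = users_demo.merge(churn_demo, on="userId", how="left")

# NaN forces the column to float, so fill with 0 and cast back to int.
labels_demo["churn"] = labels_demo["churn"].fillna(0).astype(int)

print(labels_demo)
```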


### Now we will set up the features we will use for prediction

We will be taking all the data we identified earlier.  Some of this data can be found in the event log directly.  Features like userId, gender, and location are examples of this.  Other data has to be computed as sums of activity.  We will be using the dataframes we created earlier to populate this feature data.

In [33]:
# create features with all users along with their gender and location code
features = users.copy()

# map each page type we want to count to the feature column it will populate
page_features = {
    'Thumbs Up': 'thumbs_up',
    'Thumbs Down': 'thumbs_down',
    'Add to Playlist': 'add_to_playlist',
    'Add Friend': 'add_friend',
    'Error': 'error',
}

# add one count column per page type; users without that page get NaN for now
for page, col_name in page_features.items():
    page_total = page_counts.loc[page_counts['page'] == page, ['userId', 'count']]
    page_total = page_total.rename(columns={'count': col_name})
    features = features.merge(page_total, on='userId', how='left')

# set user id as the index
features.set_index('userId', inplace=True)

# fill in any missing data
features.fillna(0, inplace=True)

print(features.count())
print(features.shape)
features.head()

gender             225
location           225
thumbs_up          225
thumbs_down        225
add_to_playlist    225
add_friend         225
error              225
dtype: int64
(225, 7)


| userId | gender | location | thumbs_up | thumbs_down | add_to_playlist | add_friend | error |
|---|---|---|---|---|---|---|---|
| 100021 | 1 | 26 | 11.0 | 5.0 | 7.0 | 7.0 | 2.0 |
| 139 | 1 | 56 | 18.0 | 5.0 | 13.0 | 6.0 | 1.0 |
| 41 | 0 | 58 | 76.0 | 10.0 | 61.0 | 36.0 | 1.0 |
| 72 | 0 | 82 | 6.0 | 1.0 | 3.0 | 3.0 | 1.0 |
| 200016 | 0 | 9 | 19.0 | 5.0 | 3.0 | 0.0 | 0.0 |


### Now we will add the counts by hour buckets to our features list

This is a bit tricky as we are using hour buckets.  In our case we used 3hr window buckets, making 8 buckets in total.  We can use the dataframes we created earlier that counted the activity per user per hour and roll these into the hour window buckets.

We applied a scaling factor for each user by counting the total activity for that user and dividing all counts by that number.  In this way we can identify users that are like each other without requiring them to have exactly the same counts per hour.  This normalizes the data across users.

We also experimented with having no hourly data in our feature set so we created a parameter to do just that.  In the end, we found that the hourly data was useful for prediction so we have kept that on.

In [34]:
# add hour counts for each user
add_hours_features = True

# set up hour windows so the hours can be grouped in buckets instead of the full 24 slots
hour_window = 3
buckets = int(24 / hour_window)

# create zero columns in features for all our hour buckets
for bucket in range(buckets):
    col_name = 'hour_' + str(bucket)
    if add_hours_features:
        features[col_name] = 0.0

# go through each of the hours counts we found earlier and fill them in our features
for index, row in hour_counts.iterrows():
    # find the total event count for this user so we can use it for scaling
    userId = row['userId']
    total_count = total_counts.at[userId, 'count']
    
    # find the column name to use based on the buckets
    col_index = int(row['hour'] / hour_window)
    col_name = 'hour_' + str(col_index)
    if add_hours_features:
        features.at[userId, col_name] += row['count'] / total_count

# fill in any missing data
features.fillna(0, inplace=True)

print(features.count())
print(features.shape)
features.head()

gender             225
location           225
thumbs_up          225
thumbs_down        225
add_to_playlist    225
add_friend         225
error              225
hour_0             225
hour_1             225
hour_2             225
hour_3             225
hour_4             225
hour_5             225
hour_6             225
hour_7             225
dtype: int64
(225, 15)


| userId | gender | location | thumbs_up | thumbs_down | add_to_playlist | add_friend | error | hour_0 | hour_1 | hour_2 | hour_3 | hour_4 | hour_5 | hour_6 | hour_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100021 | 1 | 26 | 11.0 | 5.0 | 7.0 | 7.0 | 2.0 | 0.31348 | 0.200627 | 0.169279 | 0.184953 | 0.0 | 0.0 | 0.07837 | 0.053292 |
| 139 | 1 | 56 | 18.0 | 5.0 | 13.0 | 6.0 | 1.0 | 0.14442 | 0.2407 | 0.293217 | 0.166302 | 0.037199 | 0.002188 | 0.019694 | 0.09628 |
| 41 | 0 | 58 | 76.0 | 10.0 | 61.0 | 36.0 | 1.0 | 0.143694 | 0.150901 | 0.142342 | 0.113514 | 0.085586 | 0.099099 | 0.128829 | 0.136036 |
| 72 | 0 | 82 | 6.0 | 1.0 | 3.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.059829 | 0.512821 | 0.42735 | 0.0 | 0.0 |
| 200016 | 0 | 9 | 19.0 | 5.0 | 3.0 | 0.0 | 0.0 | 0.093284 | 0.201493 | 0.261194 | 0.074627 | 0.070896 | 0.164179 | 0.123134 | 0.011194 |
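
The bucketing and scaling can be followed on a single user in plain Python. This is a sketch under assumptions: `bucket_shares` is a hypothetical helper and the `example` counts are made up. Because every event falls in exactly one hour, dividing by the user's total makes the eight shares sum to 1, which is also what the `hour_*` columns above do for each user.

```python
HOUR_WINDOW = 3                  # hours per bucket, as in the notebook
N_BUCKETS = 24 // HOUR_WINDOW    # 8 buckets

def bucket_shares(hour_counts: dict) -> list:
    """Map {hour: event_count} to per-bucket shares of the user's activity."""
    total = sum(hour_counts.values())
    shares = [0.0] * N_BUCKETS
    for hour, n in hour_counts.items():
        # hours 0-2 go to bucket 0, hours 3-5 to bucket 1, and so on
        shares[hour // HOUR_WINDOW] += n / total
    return shares

# Hypothetical user: events at hours 0, 1, 2, 9 and 23.
example = {0: 10, 1: 5, 2: 5, 9: 20, 23: 10}
shares = bucket_shares(example)
print([round(s, 2) for s in shares])
```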


### Now we will add the counts by day to our features list

This is a bit tricky as we are counting days backwards and not every user had activity on the same days.  Luckily we constructed the data we needed earlier.  That data is sorted by user and by most recent day first.

Using this we find the most recent day the user had activity and save that.  Then we move to the next most recent day the user had activity and record that in the proper slot.  We continue this until we reach the limit of number of days to keep, which is a parameter.  We set this to 7 days or 1 week which seemed like a good window to look at for changes in user activity.

We also experimented with having no daily data in our feature set so we created a parameter to do just that.  In the end, we found that the daily data was useful for prediction so we have kept that on.

In [35]:
# add day counts for each user
add_days_features = True

# configure the number of days we should keep
days_to_keep = 7

# go through each of the day counts we collected earlier
cur_userId = 0
cur_rel_day = 0
for index, row in day_counts.iterrows():
    userId = row['userId']
    rel_day = row['rel_day']
    
    # if we switched to a new user, reset the index
    if userId != cur_userId:
        cur_userId = userId
        day_index = 0
        
    # otherwise carry on with the same user using the relative days
    else:
        day_index += (cur_rel_day - rel_day)
        
    # update for the next loop
    cur_rel_day = rel_day

    # if we are past our days to keep, move on
    if day_index >= days_to_keep:
        continue
        
    # find the total event count for this user so we can use it for scaling
    total_count = total_counts.at[userId, 'count']
    
    # set the counts in our features
    col = 'day_' + str(day_index)
    if add_days_features:
        features.at[userId, col] = row['count'] / total_count
    
features.fillna(0, inplace=True)

print(features.count())
print(features.shape)
features.head()

gender             225
location           225
thumbs_up          225
thumbs_down        225
add_to_playlist    225
add_friend         225
error              225
hour_0             225
hour_1             225
hour_2             225
hour_3             225
hour_4             225
hour_5             225
hour_6             225
hour_7             225
day_0              225
day_1              225
day_3              225
day_5              225
day_2              225
day_6              225
day_4              225
dtype: int64
(225, 22)


| userId | gender | location | thumbs_up | thumbs_down | add_to_playlist | add_friend | error | hour_0 | hour_1 | hour_2 | ... | hour_5 | hour_6 | hour_7 | day_0 | day_1 | day_3 | day_5 | day_2 | day_6 | day_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100021 | 1 | 26 | 11.0 | 5.0 | 7.0 | 7.0 | 2.0 | 0.31348 | 0.200627 | 0.169279 | ... | 0.0 | 0.07837 | 0.053292 | 0.570533 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 139 | 1 | 56 | 18.0 | 5.0 | 13.0 | 6.0 | 1.0 | 0.14442 | 0.2407 | 0.293217 | ... | 0.002188 | 0.019694 | 0.09628 | 0.021882 | 0.0 | 0.0 | 0.0 | 0.0 | 0.485777 | 0.0 |
| 41 | 0 | 58 | 76.0 | 10.0 | 61.0 | 36.0 | 1.0 | 0.143694 | 0.150901 | 0.142342 | ... | 0.099099 | 0.128829 | 0.136036 | 0.077477 | 0.0 | 0.015315 | 0.0 | 0.112162 | 0.081982 | 0.014865 |
| 72 | 0 | 82 | 6.0 | 1.0 | 3.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.42735 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 200016 | 0 | 9 | 19.0 | 5.0 | 3.0 | 0.0 | 0.0 | 0.093284 | 0.201493 | 0.261194 | ... | 0.164179 | 0.123134 | 0.011194 | 0.44403 | 0.067164 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


# CRISP-DM: Data Modeling

Now we will move on to the data modeling phase of CRISP-DM.  Here we will fit several classifiers on the full feature set and use each one to predict churn.

### Before predicting, though, let us normalize the data using scikit-learn.  Normalization is important so that one feature doesn't dominate the prediction over another.

In [36]:
# normalize all the features
X = features.values
min_max_scaler = preprocessing.MinMaxScaler()
X = min_max_scaler.fit_transform(X)
print(X)
print(len(X))

# get the labels
y = labels['churn'].values
print(y)
print(len(y))

[[1.         0.2300885  0.02517162 ... 0.         0.         0.        ]
 [1.         0.49557522 0.04118993 ... 0.         1.         0.        ]
 [0.         0.51327434 0.17391304 ... 0.31405405 0.16876471 0.02756194]
 ...
 [0.         0.27433628 0.12585812 ... 0.         0.09045597 0.        ]
 [1.         0.73451327 0.00686499 ... 1.         0.         0.        ]
 [1.         0.15929204 0.04805492 ... 0.         0.         0.        ]]
225
[1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0
 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1
 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0
 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1
 1 0 0]
225
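
As a rough illustration of what min-max scaling does to one column, the sketch below applies the formula `(x - min) / (max - min)` in plain Python.  It uses only the five `thumbs_up` values visible in the feature preview above, so the results differ from the notebook output, where the minimum and maximum are taken over all 225 users.  The helper name `min_max_scale` is hypothetical and not part of the notebook.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# thumbs_up for the five users shown in features.head()
thumbs_up_preview = [11.0, 18.0, 76.0, 6.0, 19.0]
scaled = min_max_scale(thumbs_up_preview)
print([round(s, 4) for s in scaled])  # [0.0714, 0.1714, 1.0, 0.0, 0.1857]
```

The smallest value maps to 0 and the largest to 1, so a count-like feature such as `thumbs_up` ends up on the same scale as the fractional `hour_*` and `day_*` features.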


### Now we will split the data into training and test datasets

We will keep 80% of the data for training and 20% for test.  We do not need a separate validation set because the grid search used later for hyperparameter tuning performs cross-validation on the training data.

In [37]:
# split into training and test sets
target_names = ['no churn', 'churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((180, 22), (45, 22), (180,), (45,))
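
The split above is unseeded and unstratified, and the scaler was fit on all 225 rows before splitting.  One possible alternative, sketched here on synthetic stand-in data rather than the real `features` and `labels`, is to fix `random_state`, stratify on the label so that both sets keep the same churn ratio, and fit the scaler on the training rows only.  The array names and the 50/175 class counts are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: 225 users, 22 features, 50 churners (assumed count).
rng = np.random.default_rng(0)
X_demo = rng.random((225, 22)) * 100
y_demo = np.array([1] * 50 + [0] * 175)

# Stratify so train and test keep the same churn ratio; seed for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on training rows only, then apply it to both sets,
# so no information from the test set leaks into preprocessing.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(y_tr.mean(), y_te.mean())
```

With this ordering, test values can fall slightly outside [0, 1] when a test row exceeds the training minimum or maximum, which is expected.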

### Let's try some classifiers and see how they work out

Logistic regression and SVC performed poorly on this dataset (churn F1 scores of 0.31 and 0.00 respectively), so we abandoned them early.

In [38]:
from sklearn.linear_model import LogisticRegression

lrm = LogisticRegression().fit(X_train, y_train)
y_pred = lrm.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
y_pred, y_test

              precision    recall  f1-score   support

    no churn       0.79      1.00      0.88        34
       churn       1.00      0.18      0.31        11

   micro avg       0.80      0.80      0.80        45
   macro avg       0.90      0.59      0.60        45
weighted avg       0.84      0.80      0.74        45





(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0]),
 array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        0]))

In [39]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
y_pred, y_test

              precision    recall  f1-score   support

    no churn       0.76      1.00      0.86        34
       churn       0.00      0.00      0.00        11

   micro avg       0.76      0.76      0.76        45
   macro avg       0.38      0.50      0.43        45
weighted avg       0.57      0.76      0.65        45





(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0]),
 array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        0]))

### The next set of classifiers was more promising, so we looked at each of them.

K Neighbors, Decision Tree, Random Forest, and MLP showed somewhat more promising results.  Our target metric was the F1 score for the churn class, which ranged from 0.17 (K Neighbors) to 0.63 (MLP) on the test set.
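
To make the churn F1 score concrete, the sketch below recomputes it in plain Python from the `y_test` labels and the MLP predictions printed in the MLP cell output.  The helper `churn_metrics` is a hypothetical name used only for this illustration; the notebook itself relies on scikit-learn's `classification_report` and `f1_score`.

```python
def churn_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive (churn = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Test labels as printed in the cell outputs (45 users, 11 churners).
y_test_printed = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
                  0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
                  0]
# MLP predictions as printed in the MLP cell output.
y_pred_mlp = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
              1]

precision, recall, f1 = churn_metrics(y_test_printed, y_pred_mlp)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.75 0.55 0.63
```

Here 6 of the 8 predicted churners are real churners (precision 0.75) and 6 of the 11 real churners are found (recall 0.55), which gives the reported F1 of 0.63.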

In [40]:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=2)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
y_pred, y_test

              precision    recall  f1-score   support

    no churn       0.77      1.00      0.87        34
       churn       1.00      0.09      0.17        11

   micro avg       0.78      0.78      0.78        45
   macro avg       0.89      0.55      0.52        45
weighted avg       0.83      0.78      0.70        45



(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0]),
 array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        0]))

In [41]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
y_pred, y_test

              precision    recall  f1-score   support

    no churn       0.83      0.88      0.86        34
       churn       0.56      0.45      0.50        11

   micro avg       0.78      0.78      0.78        45
   macro avg       0.69      0.67      0.68        45
weighted avg       0.77      0.78      0.77        45



(array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0]),
 array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        0]))

In [42]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
y_pred, y_test

              precision    recall  f1-score   support

    no churn       0.79      0.97      0.87        34
       churn       0.67      0.18      0.29        11

   micro avg       0.78      0.78      0.78        45
   macro avg       0.73      0.58      0.58        45
weighted avg       0.76      0.78      0.73        45





(array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0]),
 array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        0]))

In [43]:
from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier(hidden_layer_sizes=(200,), alpha=1e-5, solver='lbfgs', max_iter=3000)
mlpc.fit(X_train, y_train)
y_pred = mlpc.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
print(f1_score(y_test, y_pred))
y_pred, y_test

              precision    recall  f1-score   support

    no churn       0.86      0.94      0.90        34
       churn       0.75      0.55      0.63        11

   micro avg       0.84      0.84      0.84        45
   macro avg       0.81      0.74      0.77        45
weighted avg       0.84      0.84      0.84        45

0.631578947368421


(array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        1]),
 array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        0]))

### Now let's run grid searches for all the promising models.

To do this we needed to understand the parameters for each of these models.  The reference material below was extremely useful:

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

### Create a helper function to run the grid search

This will be run repeatedly on each of the different promising models.

In [44]:
# create f1 scorer to maximize churn prediction
f1 = make_scorer(f1_score, average='binary')

def run_grid_search(model, params):
    """ 
    For the given model and parameters, run a grid search and show the results.
    Parameters: 
    model (estimator): The classification estimator to use for the grid search.
    params (dict):  The parameter grid to search on.
  
    Returns: 
    <Nothing> The grid search prediction results will be shown.
    """
    cv = GridSearchCV(model, param_grid=params, verbose=1, scoring=f1, n_jobs=-1, cv=3)
    cv.fit(X_train, y_train)
    y_pred = cv.predict(X_test)
    print(cv.best_params_)
    print(classification_report(y_test, y_pred, target_names=target_names))
    print(f1_score(y_test, y_pred))
    print(y_pred)
    print(y_test)
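
The helper above prints only test-set results.  As a possible variant, the sketch below returns the fitted search object so that the mean cross-validated F1 (`best_score_`) can be compared with the test F1.  It runs on synthetic data from `make_classification`, and the function name `run_grid_search_returning`, the stratified folds, and the fixed seeds are all assumptions for illustration, not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data with an imbalanced positive class.
X_demo, y_demo = make_classification(
    n_samples=225, n_features=22, weights=[0.77, 0.23], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

def run_grid_search_returning(model, params, X_fit, y_fit):
    """Run a 3-fold grid search scored on binary F1 and return the fitted search."""
    folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    scorer = make_scorer(f1_score, average='binary')
    search = GridSearchCV(model, param_grid=params, scoring=scorer, cv=folds)
    search.fit(X_fit, y_fit)
    return search

search = run_grid_search_returning(
    DecisionTreeClassifier(random_state=0),
    {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random']},
    X_tr, y_tr,
)
print(search.best_params_)
print(search.best_score_)                      # mean cross-validated F1
print(f1_score(y_te, search.predict(X_te)))    # F1 on the held-out test set
```

Comparing the two scores gives a rough sense of how much a single 45-user test set can differ from the cross-validated estimate.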

### K Neighbors Grid Search

In [45]:
# perform grid search on K Neighbors
params = {
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2, 3]
}

knc = KNeighborsClassifier(n_neighbors=2)
run_grid_search(knc, params)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


{'algorithm': 'auto', 'p': 1, 'weights': 'distance'}
              precision    recall  f1-score   support

    no churn       0.80      0.97      0.88        34
       churn       0.75      0.27      0.40        11

   micro avg       0.80      0.80      0.80        45
   macro avg       0.78      0.62      0.64        45
weighted avg       0.79      0.80      0.76        45

0.39999999999999997
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
 1 1 0 0 0 0 0 0]


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:    1.3s finished
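
The "24 candidates, totalling 72 fits" line in the log follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fit once per fold.  A small plain-Python sketch of that arithmetic, using the K Neighbors grid above and the MLP grid from the next cell (the helper name `count_fits` is hypothetical):

```python
from itertools import product

def count_fits(param_grid, n_folds):
    """Return (number of candidates, number of cross-validation fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds

knn_grid = {
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2, 3],
}
mlp_grid = {
    'max_iter': [3000, 5000],
    'activation': ['logistic', 'tanh', 'relu'],
    'warm_start': [False, True],
    'hidden_layer_sizes': [(200,), (100,), (300,), (50, 10), (50, 20), (100, 20), (200, 20)],
}

print(count_fits(knn_grid, 3))  # (24, 72)
print(count_fits(mlp_grid, 3))  # (84, 252)
```

The count grows multiplicatively with each added parameter, which is worth keeping in mind before widening a grid.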


### MLP Grid Search

In [46]:
# perform grid search on MLP classifier
params = {
    'max_iter': [3000, 5000],
    'activation': ['logistic', 'tanh', 'relu'],
    'warm_start': [False, True],
    'hidden_layer_sizes': [(200,), (100,), (300,), (50,10), (50,20), (100,20), (200,20)]
}

mlpc = MLPClassifier(solver='lbfgs')
run_grid_search(mlpc, params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 84 candidates, totalling 252 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    8.0s
[Parallel(n_jobs=-1)]: Done 252 out of 252 | elapsed:    9.3s finished


{'activation': 'logistic', 'hidden_layer_sizes': (50, 20), 'max_iter': 3000, 'warm_start': True}
              precision    recall  f1-score   support

    no churn       0.79      0.79      0.79        34
       churn       0.36      0.36      0.36        11

   micro avg       0.69      0.69      0.69        45
   macro avg       0.58      0.58      0.58        45
weighted avg       0.69      0.69      0.69        45

0.36363636363636365
[0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 1]
[0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
 1 1 0 0 0 0 0 0]


### Decision Tree Grid Search

In [47]:
# perform grid search on Decision Tree
params = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
}

dtc = DecisionTreeClassifier()
run_grid_search(dtc, params)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
{'criterion': 'entropy', 'splitter': 'best'}
              precision    recall  f1-score   support

    no churn       0.86      0.91      0.89        34
       churn       0.67      0.55      0.60        11

   micro avg       0.82      0.82      0.82        45
   macro avg       0.76      0.73      0.74        45
weighted avg       0.81      0.82      0.82        45

0.6
[0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
 1 1 0 0 0 0 0 0]


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  12 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    0.0s finished


### Random Forest Grid Search

In [48]:
# perform grid search on Random Forest
params = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [100, 300, 500],
    'bootstrap': [False, True]
}

rfc = RandomForestClassifier()
run_grid_search(rfc, params)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  21 out of  36 | elapsed:    1.2s remaining:    0.8s
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:    2.1s finished


{'bootstrap': False, 'criterion': 'gini', 'n_estimators': 300}
              precision    recall  f1-score   support

    no churn       0.83      1.00      0.91        34
       churn       1.00      0.36      0.53        11

   micro avg       0.84      0.84      0.84        45
   macro avg       0.91      0.68      0.72        45
weighted avg       0.87      0.84      0.82        45

0.5333333333333333
[0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
 1 1 0 0 0 0 0 0]


# CRISP-DM: Evaluate the Results

In this final phase of CRISP-DM for this project, we will evaluate the results of the models we created using the Sparkify events log.

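Looking at the grid search of the models we selected, Decision Tree was the best of the tuned models with a churn F1 score of 0.60 (67% precision and 55% recall).  Second best was Random Forest with 0.53, followed by K Neighbors with 0.40 and the MLP classifier with 0.36.  Note that the untuned MLP classifier scored 0.63 on the same test split, so with only 11 churned users in the test set these differences should be read with some caution.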
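
For a side-by-side view, the short sketch below ranks the churn F1 scores reported in the cell outputs above, before and after grid search.  The numbers are copied from those outputs; the dictionary names are introduced only for this illustration.

```python
# Churn-class F1 on the test set, copied from the outputs above.
untuned_f1 = {'K Neighbors': 0.17, 'Decision Tree': 0.50, 'Random Forest': 0.29, 'MLP': 0.63}
tuned_f1 = {'K Neighbors': 0.40, 'Decision Tree': 0.60, 'Random Forest': 0.53, 'MLP': 0.36}

# Rank the tuned models from best to worst.
ranking = sorted(tuned_f1.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    change = score - untuned_f1[name]
    print(f"{name:<14} tuned={score:.2f}  untuned={untuned_f1[name]:.2f}  change={change:+.2f}")
```

Three of the four models improved after tuning, while the MLP score dropped, which may reflect the variance of cross-validated selection on such a small dataset.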

This would be useful to Sparkify to help identify the users who may be about to churn so that offers can be made to retain them.  Even in the worst case, where the prediction is wrong, giving an offer to a user who was not going to churn is unlikely to cause churn and may help retain them in the future as well.

This project explored a large data set and how to process it effectively for use with machine learning.  The ever-present churn problem was tackled for the fictional Sparkify music streaming company.  It was eye-opening to see in how many ways the data can be prepared and how many models can be applied for prediction.  In truth, many, many combinations were tried before the final results were obtained.  Trial and error is surely a key part of data science, along with diligent work analyzing the data itself.