# Sparkify Project Workspace
This workspace contains a tiny subset (128MB) of the full dataset available (12GB). Feel free to use this workspace to build your project, or to explore a smaller subset with Spark before deploying your cluster on the cloud. Instructions for setting up your Spark cluster is included in the last lesson of the Extracurricular Spark Course content.

You can follow the steps below to guide your data analysis and model building portion of this project.

In [31]:
from datetime import datetime

from pyspark.sql import SparkSession

from pyspark.sql.functions import min as smin, max as smax, sum as ssum, round as sround
from pyspark.sql.functions import isnan, when, first, avg, last, count, countDistinct, col, lag, lead, coalesce, lit

from pyspark.sql.window import Window
from pyspark.sql.functions import to_date, date_format
from pyspark.sql.types import TimestampType, IntegerType
 
import jupyter_utils as j

j.reload(j)

In [17]:
filepath = 'medium_sparkify_event_data.json'

In [18]:
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

spark.sparkContext.setLogLevel('DEBUG')

# Load and Clean Dataset
In this workspace, the mini-dataset file is `mini_sparkify_event_data.json`. Load and clean the dataset, checking for invalid or missing data - for example, records without userids or sessionids. 

In [19]:
# df = spark.read.option("inferSchema", "true").option("header", "true").option("encoding", "utf-8").csv(filepath)
df = spark.read.option("inferSchema", "true").option("header", "true").option("encoding", "utf-8").json(filepath)

In [5]:
df.dtypes

[('artist', 'string'),
 ('auth', 'string'),
 ('firstName', 'string'),
 ('gender', 'string'),
 ('itemInSession', 'bigint'),
 ('lastName', 'string'),
 ('length', 'double'),
 ('level', 'string'),
 ('location', 'string'),
 ('method', 'string'),
 ('page', 'string'),
 ('registration', 'bigint'),
 ('sessionId', 'bigint'),
 ('song', 'string'),
 ('status', 'bigint'),
 ('ts', 'bigint'),
 ('userAgent', 'string'),
 ('userId', 'string')]

In [6]:
df.show(n=2, truncate=False, vertical=True)

-RECORD 0-----------------------------------------------------------------------------------------------------------------------------------
 artist        | Martin Orford                                                                                                              
 auth          | Logged In                                                                                                                  
 firstName     | Joseph                                                                                                                     
 gender        | M                                                                                                                          
 itemInSession | 20                                                                                                                         
 lastName      | Morales                                                                                                                    
 length      

In [7]:
df.columns

['artist',
 'auth',
 'firstName',
 'gender',
 'itemInSession',
 'lastName',
 'length',
 'level',
 'location',
 'method',
 'page',
 'registration',
 'sessionId',
 'song',
 'status',
 'ts',
 'userAgent',
 'userId']

In [20]:
not_na_columns = [ 'userId', 'sessionId' ]

In [21]:
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+---+---------+------+
|artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId|song|status| ts|userAgent|userId|
+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+---+---------+------+
|     0|   0|        0|     0|            0|       0|     0|    0|       0|     0|   0|           0|        0|   0|     0|  0|        0|     0|
+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+---+---------+------+



In [22]:
df.groupBy('userId').count().orderBy('count', ascending = False).show(50)

+------+-----+
|userId|count|
+------+-----+
|      |15700|
|    92| 9767|
|   140| 7448|
|300049| 7309|
|   101| 6842|
|300035| 6810|
|   195| 6184|
|   230| 6019|
|   163| 5965|
|   250| 5678|
|    18| 5511|
|   276| 5346|
|300017| 5266|
|    87| 5243|
|   293| 5125|
|300021| 5076|
|    42| 4952|
|300011| 4816|
|    30| 4737|
|    12| 4232|
|300031| 4194|
|   126| 4190|
|   283| 4181|
|   228| 4092|
|   100| 3999|
|   259| 3633|
|   105| 3597|
|   246| 3566|
|   121| 3541|
|   269| 3511|
|   292| 3504|
|    70| 3465|
|    35| 3456|
|    38| 3211|
|    98| 3206|
|   282| 3191|
|   185| 3088|
|300023| 3018|
|100009| 2987|
|   157| 2966|
|200023| 2955|
|   104| 2950|
|   174| 2917|
|   225| 2849|
|300038| 2829|
|   172| 2728|
|    85| 2696|
|   258| 2684|
|200020| 2654|
|   179| 2639|
+------+-----+
only showing top 50 rows



In [23]:
df.select("artist").distinct().count()

21248

In [24]:
df.select('length').describe().show()

+-------+------------------+
|summary|            length|
+-------+------------------+
|  count|            432877|
|   mean|248.66459278007508|
| stddev| 98.41266955052019|
|    min|           0.78322|
|    max|        3024.66567|
+-------+------------------+



In [25]:
print(f'Rows before: {df.count()}')

df = df.where(df.userId != '')

print(f'Rows after: {df.count()}')

Rows before: 543705
Rows after: 528005


# Exploratory Data Analysis
When you're working with the full dataset, perform EDA by loading a small subset of the data and doing basic manipulations within Spark. In this workspace, you are already provided a small subset of data you can explore.

### Define Churn

Once you've done some preliminary analysis, create a column `Churn` to use as the label for your model. I suggest using the `Cancellation Confirmation` events to define your churn, which happen for both paid and free users. As a bonus task, you can also look into the `Downgrade` events.

### Explore Data
Once you've defined churn, perform some exploratory data analysis to observe the behavior for users who stayed vs users who churned. You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or number of songs played.

In [14]:
df.groupBy('page').count().orderBy('count', ascending = False).show(50)

+--------------------+------+
|                page| count|
+--------------------+------+
|            NextSong|432877|
|           Thumbs Up| 23826|
|                Home| 19089|
|     Add to Playlist| 12349|
|          Add Friend|  8087|
|         Roll Advert|  7773|
|              Logout|  5990|
|         Thumbs Down|  4911|
|           Downgrade|  3811|
|            Settings|  2964|
|                Help|  2644|
|               About|  1026|
|             Upgrade|   968|
|       Save Settings|   585|
|               Error|   503|
|      Submit Upgrade|   287|
|    Submit Downgrade|   117|
|              Cancel|    99|
|Cancellation Conf...|    99|
+--------------------+------+



Some questions about the data:

- Are errors related to downgrading canceling the service?
- Having a certain number of friends or a sense of community can decrease the churn?
- Thumbs down are related to churn? (could the quality of the songs catalog affect the churn)
- The advertising is not annoying the users?
- Users with stay connected for more time have less change to churn?
- Is the home page relevant?
- Users, who access the downgrade page are how much more willing to churn?

In [15]:
df.groupBy('status').count().orderBy('count', ascending = False).show(20)

+------+------+
|status| count|
+------+------+
|   200|483600|
|   307| 43902|
|   404|   503|
+------+------+



In [16]:
df.filter('userId = 92').groupBy('page').count().orderBy('count', ascending = False).show(50)

+----------------+-----+
|            page|count|
+----------------+-----+
|        NextSong| 8177|
|       Thumbs Up|  400|
|            Home|  308|
| Add to Playlist|  248|
|      Add Friend|  158|
|          Logout|   96|
|       Downgrade|   85|
|     Thumbs Down|   80|
|            Help|   62|
|     Roll Advert|   60|
|        Settings|   48|
|           About|   17|
|           Error|   13|
|         Upgrade|    7|
|   Save Settings|    5|
|  Submit Upgrade|    2|
|Submit Downgrade|    1|
+----------------+-----+



In [17]:
df.filter('userId = 92').groupBy('page').count().orderBy('count', ascending = False).show(50)

+----------------+-----+
|            page|count|
+----------------+-----+
|        NextSong| 8177|
|       Thumbs Up|  400|
|            Home|  308|
| Add to Playlist|  248|
|      Add Friend|  158|
|          Logout|   96|
|       Downgrade|   85|
|     Thumbs Down|   80|
|            Help|   62|
|     Roll Advert|   60|
|        Settings|   48|
|           About|   17|
|           Error|   13|
|         Upgrade|    7|
|   Save Settings|    5|
|  Submit Upgrade|    2|
|Submit Downgrade|    1|
+----------------+-----+



In [18]:
df.filter('userId = 92').groupBy('userAgent').count().orderBy('count', ascending = False).show(50)

+--------------------+-----+
|           userAgent|count|
+--------------------+-----+
|"Mozilla/5.0 (iPa...| 9767|
+--------------------+-----+



In [156]:
df.filter('userId = 92 and song != \'null\' ').groupBy('song').count().orderBy('count', ascending = False).show(50)

+--------------------+-----+
|                song|count|
+--------------------+-----+
|      You're The One|   49|
|                Undo|   34|
|             Revelry|   28|
|Horn Concerto No....|   27|
|    Ain't Misbehavin|   23|
|             Secrets|   20|
|             Invalid|   20|
|Dog Days Are Over...|   19|
|            Tive Sim|   18|
|            Marry Me|   16|
|              Canada|   16|
|        Use Somebody|   16|
|       ReprÃÂ©sente|   15|
|    Bring Me To Life|   14|
|  Sayonara-Nostalgia|   14|
|       Sehr kosmisch|   14|
|         Bulletproof|   13|
|Catch You Baby (S...|   13|
| I CAN'T GET STARTED|   12|
|                Home|   11|
|              Yellow|   10|
|Don't Stop The Music|   10|
|    Hey_ Soul Sister|   10|
|         The Maestro|   10|
|          Kryptonite|   10|
|Make Love To Your...|   10|
|           Fireflies|   10|
|SinceritÃÂ© Et J...|    9|
|    Times Like These|    9|
|      Drop The World|    9|
|        Day 'N' Nite|    9|
|         Bubb

In [20]:
# w_session = Window.partitionBy('sessionId').orderBy('ts')

# df.filter('page = "Cancellation Confirmation"').select('ts', 'sessionId', 'itemInSession', 'userId', 'last_page_on_session', 'page').show() 

In [21]:
df.select(['ts', 'page', 'sessionId', 'itemInSession', 'song', 'artist']).filter('userId = 92').orderBy('ts', ascending = False).show(100)

+-------------+---------------+---------+-------------+--------------------+--------------------+
|           ts|           page|sessionId|itemInSession|                song|              artist|
+-------------+---------------+---------+-------------+--------------------+--------------------+
|1543615837000|     Add Friend|     4790|           86|                null|                null|
|1543615836000|           Home|     4790|           85|                null|                null|
|1543615832000|       NextSong|     4790|           84|           Fireflies|  Charttraxx Karaoke|
|1543615597000|       NextSong|     4790|           83|    Music Of The Sun|             Rihanna|
|1543615053000|       NextSong|     4790|           82|Kun Puut Tekee Se...|Scandinavian Musi...|
|1543614872000|       NextSong|     4790|           81|             The Sun|   Portugal. The Man|
|1543614674000|       NextSong|     4790|           80|             Banquet|          Bloc Party|
|1543614504000|     

In [22]:
df_ts = df.filter('userId = 1333174').select([smin('ts').alias('min_ts'), smax('ts').alias('max_ts')]).rdd.flatMap(lambda x: x).collect()

In [24]:
def to_datetime(milliseconds, dt_format = '%Y-%m-%d %H:%M:%S'):
    return datetime.fromtimestamp(milliseconds / 1000).strftime(dt_format)  

# list(map(to_datetime, df_ts))

In [25]:
list(map(to_datetime, df.select([smin('ts').alias('min_ts'), smax('ts').alias('max_ts')]).rdd.flatMap(lambda x: x).collect()))

['2018-09-30 21:00:11', '2018-11-30 22:01:06']

In [26]:
df.groupBy('location').count().orderBy('count', ascending = False).show(50, False)

+--------------------------------------------+-----+
|location                                    |count|
+--------------------------------------------+-----+
|New York-Newark-Jersey City, NY-NJ-PA       |40156|
|Los Angeles-Long Beach-Anaheim, CA          |34278|
|Boston-Cambridge-Newton, MA-NH              |17574|
|Chicago-Naperville-Elgin, IL-IN-WI          |15194|
|San Francisco-Oakland-Hayward, CA           |11428|
|Atlanta-Sandy Springs-Roswell, GA           |11211|
|Phoenix-Mesa-Scottsdale, AZ                 |11184|
|Dallas-Fort Worth-Arlington, TX             |11061|
|Denver-Aurora-Lakewood, CO                  |9808 |
|Houston-The Woodlands-Sugar Land, TX        |8707 |
|Tampa-St. Petersburg-Clearwater, FL         |8330 |
|Miami-Fort Lauderdale-West Palm Beach, FL   |8180 |
|Indianapolis-Carmel-Anderson, IN            |7691 |
|Minneapolis-St. Paul-Bloomington, MN-WI     |7462 |
|Louisville/Jefferson County, KY-IN          |7457 |
|Fresno, CA                                  |

# Feature Engineering
Once you've familiarized yourself with the data, build out the features you find promising to train your model on. To work with the full dataset, you can follow the following steps.
- Write a script to extract the necessary features from the smaller subset of data
- Ensure that your script is scalable, using the best practices discussed in Lesson 3
- Try your script on the full data set, debugging your script if necessary

If you are working in the classroom workspace, you can just extract features based on the small subset of data contained here. Be sure to transfer over this work to the larger dataset when you work on your Spark cluster.

In [27]:
df.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [28]:
df.filter('page = "Cancellation Confirmation"').show(50)

+------+---------+---------+------+-------------+----------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|artist|     auth|firstName|gender|itemInSession|  lastName|length|level|            location|method|                page| registration|sessionId|song|status|           ts|           userAgent|userId|
+------+---------+---------+------+-------------+----------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|  null|Cancelled|   Olivia|     F|           40|      Carr|  null| free|      Fort Wayne, IN|   GET|Cancellation Conf...|1536758439000|      490|null|   200|1538400616000|Mozilla/5.0 (Wind...|   208|
|  null|Cancelled|  Lillian|     F|          234|   Cameron|  null| paid|        Columbus, OH|   GET|Cancellation Conf...|1533472700000|      471|null|   200|1538482793000|Mozilla/5.0 (Wind...|   

In [None]:
# Testing samples
user_id = 100010
# user_id = 121

In [26]:
CHURN_CANCELLATION_PAGE = 'Cancellation Confirmation'
REGISTRATION_PAGE = 'Submit Registration'
milliseconds_to_hours = 3600 * 1000
minutes_to_hours = 60 * 60
TRUE = 1
FALSE = 0

def transform_records(df):
    
    ts_events = df.select([smin('ts').alias('min_ts'), smax('ts').alias('max_ts')]).collect()[0]
    min_ts = ts_events[0]
    max_ts = ts_events[1]

    w_session = Window.partitionBy('sessionId').orderBy('ts')
    w_user_session = Window.partitionBy('sessionId', 'userId').orderBy('ts').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    w_user = Window.partitionBy('userId').orderBy('ts').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    
    # Create features
    df = df.withColumn('previous_page', lag(df.page).over(w_session))
    df = df.withColumn('last_event_ts', last(col('ts')).over(w_user))
    df = df.withColumn('last_page', last(col('page')).over(w_user))
    df = df.withColumn('cancellation_ts', when(df.last_page == CHURN_CANCELLATION_PAGE, df.last_event_ts).otherwise(max_ts))
    df = df.withColumn('register_page', first(col('previous_page')).over(w_user))
    df = df.withColumn('first_ts', first(col('ts')).over(w_user))
    df = df.withColumn('registration_ts', when(df.register_page == REGISTRATION_PAGE, df.first_ts).otherwise(min_ts))
    df = df.withColumn('ts_elapsed', last(df.ts).over(w_session) - first(df.ts).over(w_user_session))
    df = df.withColumn('session_duration', smax(df.ts_elapsed).over(w_user_session))
    
    return df
    
def create_session_dimension(df):
    
    # sessions from the user
    df_sessions = df.orderBy(df.sessionId).groupBy('sessionId', 'userId').agg(
        smax(df.ts).alias('s_ts'),
        smin(df.ts).alias('e_ts'),
        ssum(df.length).alias('total_playback'),
        count(when(df.page == 'Thumbs Up', True)).alias("n_s_likes"),
        count(when(df.page == 'Thumbs Down', True)).alias("n_s_dislikes"),
        
        count(when(df.page == 'NextSong', True)).alias("n_s_songs"),
        count(when(df.page == 'Add Friend', True)).alias("n_s_friends"),
        count(when(df.page == 'Add to Playlist', True)).alias("n_s_add_play"),
        count(when(df.page == 'Home', True)).alias("n_s_home"),
        count(when(df.page == 'Roll Advert', True)).alias("n_s_ads"),
        count(when(df.page == 'Help', True)).alias("n_s_help"),
        count(when(df.page == 'Error', True)).alias("n_s_error"),
        count(when(df.page == 'Settings', True)).alias("n_s_sets"),
        count(col('page')).alias('n_s_actions')
        
    ) 

    w_user_sessions_interval = Window.partitionBy('userId').orderBy('s_ts')
    # Calculate the interval until the next session
    df_sessions = df_sessions.withColumn('interval_to_session', col('s_ts') - lag(col('e_ts')).over(w_user_sessions_interval))

    # We should remove the null lines before count/group to not account 2 times the mean interval
    df_sessions = df_sessions.groupBy('userId').agg(
        (avg(df_sessions.interval_to_session) / milliseconds_to_hours).alias('a_tiaw'),
        (avg(df_sessions.total_playback) / minutes_to_hours).alias('a_play'),
        avg(df_sessions.n_s_likes).alias('a_like'),
        avg(df_sessions.n_s_dislikes).alias('a_disl'),
        
        avg(df_sessions.n_s_songs).alias('a_song'),
        avg(df_sessions.n_s_friends).alias('a_frie'),
        avg(df_sessions.n_s_add_play).alias('a_adde'),
        avg(df_sessions.n_s_home).alias('a_home'),
        avg(df_sessions.n_s_ads).alias('a_ads'),
        avg(df_sessions.n_s_help).alias('a_help'),
        avg(df_sessions.n_s_error).alias('a_erro'),
        avg(df_sessions.n_s_sets).alias('a_sett'),
        avg(df_sessions.n_s_actions).alias('a_acti')
    )
    
    return df_sessions

def create_user_dimension(df):
    
    df_user_profile = df.groupby('userId')\
        .agg( 

            first(when(col('gender') == 'M', TRUE).otherwise(FALSE)).alias('male'),

            smin(col('first_ts')).alias('ts_start'),
            smax(col('last_event_ts')).alias('ts_end'),

            # Subscription
            count(when(col('page') == 'Submit Downgrade', True)).alias("n_down"),
            count(when(col('page') == 'Submit Upgrade', True)).alias("n_upgr"),
            last(when(col('level') == 'paid', TRUE).otherwise(FALSE)).alias('paid'),
            first(when(col('last_page') == CHURN_CANCELLATION_PAGE, TRUE).otherwise(FALSE)).alias('canc'),

            # Streaming
            count(when(col('page') == 'NextSong', True)).alias("n_song"),
            count(when(col('page') == 'Thumbs Up', True)).alias("n_like"),
            count(when(col('page') == 'Thumbs Down', True)).alias("n_disl"),
            countDistinct(col('sessionId')).alias("n_sess"),
            (avg(col('session_duration')) / milliseconds_to_hours).alias("a_stim"),

            # Community
            count(when(col('page') == 'Add Friend', True)).alias("n_frie"),
            count(when(col('page') == 'Add to Playlist', True)).alias("n_adde"),

            # Other
            count(when(col('page') == 'Home', True)).alias("n_home"),
            count(when(col('page') == 'Roll Advert', True)).alias("n_ads"),
            count(when(col('page') == 'Help', True)).alias("n_help"),
            count(when(col('page') == 'Error', True)).alias("n_erro"),
            count(when(col('page') == 'Settings', True)).alias("n_sett"),
            count(col('page')).alias("n_acti")
        )
    
    return df_user_profile

In [644]:
df_sessions = create_session_dimension(df)

df_sessions.where(df_sessions.userId == 100010).show(1)

+------+-------------+-----------------+---------+------------+---------+-----------+------------+--------+-------+--------+----------+--------+-----------+
|userId|avg_time_away|     avg_playback|avg_likes|avg_dislikes|avg_songs|avg_friends|avg_add_play|avg_home|avg_ads|avg_help|avg_errors|avg_sets|avg_actions|
+------+-------------+-----------------+---------+------------+---------+-----------+------------+--------+-------+--------+----------+--------+-----------+
|100010|       73.035|3.593187602777778|      2.0|         1.5|     48.0|        1.5|         0.5|     1.5|   11.0|     0.0|       0.0|     0.0|       68.5|
+------+-------------+-----------------+---------+------------+---------+-----------+------------+--------+-------+--------+----------+--------+-----------+



In [27]:
def sort_features(df, columns_order):
    _columns = df.columns
    _columns.sort()
    
    for _idx, _val in list(enumerate(columns_order)):
        _columns.pop(_columns.index(_val))
        _columns.insert(_idx, _val)
        
    assert len(_columns) == len(df.columns)

    return _columns

In [30]:
df = transform_records(df)

df_sessions = create_session_dimension(df)

df_users = create_user_dimension(df)

_columns = list(set(df_users.schema.names + df_sessions.schema.names) - set(['ts_start', 'ts_end']))

df_users = df_users.orderBy(df_users.userId).join(df_sessions, on = 'userId').select(_columns) 

# Enforces the order for some columns
_columns = sort_features(df_users, [ 'userId', 'male', 'paid', 'canc'])

### WARN: Only round to display
df_users.select([sround(c, 0).cast(dataType = IntegerType()).alias(c) for c in _columns]).fillna(0).show(50)

+------+----+----+----+------+------+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+-----+------+------+------+------+------+------+------+------+------+------+------+
|userId|male|paid|canc|a_acti|a_adde|a_ads|a_disl|a_erro|a_frie|a_help|a_home|a_like|a_play|a_sett|a_song|a_stim|a_tiaw|n_acti|n_adde|n_ads|n_disl|n_down|n_erro|n_frie|n_help|n_home|n_like|n_sess|n_sett|n_song|n_upgr|
+------+----+----+----+------+------+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+-----+------+------+------+------+------+------+------+------+------+------+------+
|100010|   0|   0|   1|    69|     1|   11|     2|     0|     2|     0|     2|     2|     4|     0|    48|     4|    73|   137|     1|   22|     3|     0|     0|     3|     0|     3|     4|     2|     0|    96|     0|
|200002|   1|   0|   1|    79|     1|    2|     1|     0|     0|     0|     5|     3|     4|     0|    62|     6|   175|   395| 

In [673]:
df_users.select(_columns).fillna(0).toPandas().to_csv('sparkify_data.csv', index = False)

In [549]:
df.agg(countDistinct(df.userId).alias('unique_users')).show()

+------------+
|unique_users|
+------------+
|         449|
+------------+



In [544]:
df_users.orderBy(df_users.userId).join(df_sessions, on = 'userId').select(_columns).count()

425

In [548]:
df_users.orderBy(df_users.userId).join(df_sessions, on = 'userId').select(_columns).groupBy('canceled').agg(count(df_users.canceled).alias('total')).show()

+--------+-----+
|canceled|total|
+--------+-----+
|       1|   90|
|       0|  335|
+--------+-----+



- Advertises number (per session and all)
    - The user **100010** returned after some idle time and received a considerable amount of advertises;
    - Also, after thumbs down, I received two advertisements on four sounds. Then canceled the service.
- Number of sessions
- Paid subscription time
- Avg songs before an ad
- Number of skipped songs

In [553]:
df.schema.names

['artist',
 'auth',
 'firstName',
 'gender',
 'itemInSession',
 'lastName',
 'length',
 'level',
 'location',
 'method',
 'page',
 'registration',
 'sessionId',
 'song',
 'status',
 'ts',
 'userAgent',
 'userId']

In [None]:
to_date(df.ts.cast(dataType=TimestampType()))

In [581]:
df.where(df.userId == user_id).select(['artist',
 'auth',
 'firstName',
 'gender',
 'itemInSession',
 'lastName',
 'length',
 'level', 
 'page',
 'sessionId',
 'song', 
 'ts', 
 'userId']).orderBy('sessionId', 'itemInSession').withColumn('datetime', date_format((df.ts/1000).cast(dataType=TimestampType()), 'HH:mm:ss dd-MM-YYYY')).show(350, True)

+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+---------+--------------------+-------------+------+-------------------+
|              artist|     auth|firstName|gender|itemInSession| lastName|   length|level|                page|sessionId|                song|           ts|userId|           datetime|
+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+---------+--------------------+-------------+------+-------------------+
|              Darude|Logged In| Darianna|     F|            0|Carpenter|226.08934| free|            NextSong|       62|           Sandstorm|1538991392000|100010|06:36:32 08-10-2018|
|             Justice|Logged In| Darianna|     F|            1|Carpenter|285.41342| free|            NextSong|       62|Phantom Part 1.5 ...|1538991618000|100010|06:40:18 08-10-2018|
|    Five Iron Frenzy|Logged In| Darianna|     F|            2|Carpenter|236.09424| f

# Modeling
Split the full dataset into train, test, and validation sets. Test out several of the machine learning methods you learned. Evaluate the accuracy of the various models, tuning parameters as necessary. Determine your winning model based on test accuracy and report results on the validation set. Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.

In [46]:
from pyspark.ml.classification import DecisionTreeClassifier

from pyspark.ml.feature import StringIndexer, VectorAssembler

In [78]:
columns_to_exclude = set(['userId'])

columns_to_use = list(set(df_users.columns) - columns_to_exclude)

columns_to_train = list(set(columns_to_use) - set(['canc']))

columns_to_use.sort()
columns_to_train.sort()

print(f'Columns: {columns_to_use}\n')
print(f'Columns to train: {columns_to_train}')

Columns: ['a_acti', 'a_adde', 'a_ads', 'a_disl', 'a_erro', 'a_frie', 'a_help', 'a_home', 'a_like', 'a_play', 'a_sett', 'a_song', 'a_stim', 'a_tiaw', 'canc', 'male', 'n_acti', 'n_adde', 'n_ads', 'n_disl', 'n_down', 'n_erro', 'n_frie', 'n_help', 'n_home', 'n_like', 'n_sess', 'n_sett', 'n_song', 'n_upgr', 'paid']

Columns to train: ['a_acti', 'a_adde', 'a_ads', 'a_disl', 'a_erro', 'a_frie', 'a_help', 'a_home', 'a_like', 'a_play', 'a_sett', 'a_song', 'a_stim', 'a_tiaw', 'male', 'n_acti', 'n_adde', 'n_ads', 'n_disl', 'n_down', 'n_erro', 'n_frie', 'n_help', 'n_home', 'n_like', 'n_sess', 'n_sett', 'n_song', 'n_upgr', 'paid']


In [240]:
CHURN_LABEL = 'canc'
TRAIN_SPLIT_RATIO = .7
TEST_SPLIT_RATIO = .2

SPLIT_RATIO = [TRAIN_SPLIT_RATIO, TEST_SPLIT_RATIO]

In [243]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def evaluate_multiclass_classifier(predictions, columns):
    metrics_to_evaluate = [ 'accuracy', 'f1', 'weightedPrecision', 'weightedRecall' ]
    
    result = {}
    for metric in metrics_to_evaluate:
        evaluator = MulticlassClassificationEvaluator(labelCol = columns[0], predictionCol = columns[1], metricName = metric)
        value = evaluator.evaluate(predictions)
        result[metric] = value
        print(f'{metric}: {value}') 
    
    return result

def train_random_forest_classifier(df, columns, train_cloumns):
    
    # Create the new dataframe
    data = df.select(columns).fillna(0)
    
    # Split train/test
    (train_df, test_df) = data.randomSplit(SPLIT_RATIO, seed = 42)
    
    # Create the indexer for labels
    l_indexer = StringIndexer(inputCol = CHURN_LABEL, outputCol = 'idx_labels')
    
    # Create the feature tranformer, to generate an array representation of all features on dataset
    f_assembler = VectorAssembler(inputCols = train_cloumns, outputCol = 'features')
    
    # Create the model instance
    rf_classifier = RandomForestClassifier(labelCol = 'idx_labels', featuresCol = 'features', numTrees=10)

    # Converts the predictions to original labels
    l_translator = IndexToString(inputCol = 'prediction', outputCol = 'predictedLabel', labels = [ 'Not churn', 'Churn' ])

    # Create the pipeline
    pipeline = Pipeline(stages = [ l_indexer, f_assembler, rf_classifier, l_translator ])

    # Train the model
    model = pipeline.fit(train_df)

    # Test the model
    predictions = model.transform(test_df)

    return model.stages[2], predictions
    
from pyspark.ml.feature import StandardScaler

binary_features = [ 'paid', 'male' ]
numeric_features = ['a_acti', 'a_adde', 'a_ads', 'a_play', 'a_sett', 'a_song', 'a_stim', 'a_tiaw', 
                    'male', 'n_acti', 'n_adde', 'n_ads', 'n_disl', 'n_down', 'n_erro', 'n_frie', 'n_help', 'n_home',
                    'n_like', 'n_sess', 'n_sett', 'n_song', 'n_upgr' ]
    
def create_random_forest_pipeline():
    
    l_indexer = StringIndexer(inputCol = CHURN_LABEL, outputCol = 'idx_labels')
    
    f_binaries = VectorAssembler(inputCols = binary_features, outputCol = 'bin_features')
    
    f_numeric = VectorAssembler(inputCols = numeric_features, outputCol = 'num_features')
    
    f_scaler = StandardScaler(inputCol = 'num_features', outputCol = 'num_features_escaled', withStd = True, withMean = True)
    
    f_all = VectorAssembler(inputCols = [ 'bin_features' , 'num_features_escaled' ], outputCol = 'features')
    
    # Create the model instance
    rf_classifier = RandomForestClassifier(labelCol = 'canc', featuresCol = 'features', seed = 42)

    # Create the pipeline
    pipeline = Pipeline(stages = [ l_indexer, f_binaries, f_numeric, f_scaler, f_all, rf_classifier ])

    return pipeline

In [145]:
model, predictions = train_random_forest_classifier(df_users, columns_to_use, columns_to_train)

In [150]:
evaluate_multiclass_classifier(predictions, ('canc', 'prediction'))

accuracy: 0.8181818181818182
f1: 0.7823601642884364
weightedPrecision: 0.8007451564828614
weightedRecall: 0.8181818181818181


In [151]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol = 'canc', metricName = 'areaUnderROC')

evaluator.evaluate(predictions)

0.7091346153846155

In [245]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def random_forest_grid_search(pipeline):

    grid_rf = ParamGridBuilder()
    grid_rf = grid_rf.addGrid(rf.maxDepth, [50]) #[5, 10, 15, 20, 25]) 
    grid_rf = grid_rf.addGrid(rf.numTrees, [40]) #[20, 40, 60, 70])
    grid_rf = grid_rf.build()
        
    evaluator = BinaryClassificationEvaluator(labelCol = 'canc')
    
    cv = CrossValidator(estimator = pipeline, estimatorParamMaps = grid_rf, evaluator = evaluator, numFolds = 3, parallelism = 10)
    
    return cv

In [246]:
# Create the new dataframe
data = df_users.select(columns_to_use).fillna(0)

# Split train/test
(train_df, test_df) = data.randomSplit(SPLIT_RATIO, seed = 42)

In [247]:
pipeline = create_random_forest_pipeline()

In [248]:
cv_rf = random_forest_grid_search(pipeline)

In [249]:
cv_rf_results = cv_rf.fit(train_df)

In [250]:
import pandas as pd

scores = cv_rf_results.avgMetrics
params = [{p.name: v for p, v in m.items()} for m in cv_rf.getEstimatorParamMaps()]
params_pd = pd.DataFrame(params)
params_pd['score'] = scores
params_pd

Unnamed: 0,maxDepth,numTrees,score
0,50,40,0.706607


In [251]:
evaluator = BinaryClassificationEvaluator(labelCol = 'canc', metricName = 'areaUnderROC')
 
best_model_results = cv_rf_results.bestModel.transform(test_df)
    
evaluator.evaluate(best_model_results)

0.7489855072463769

In [252]:
evaluate_multiclass_classifier(best_model_results, ('canc', 'prediction'))

accuracy: 0.7959183673469388
f1: 0.7674625405717843
weightedPrecision: 0.7739108182457936
weightedRecall: 0.7959183673469388


{'accuracy': 0.7959183673469388,
 'f1': 0.7674625405717843,
 'weightedPrecision': 0.7739108182457936,
 'weightedRecall': 0.7959183673469388}

In [237]:
best_model_results.select(['features', 'prediction', 'canc']).show(10, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+----+
|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |prediction|canc|
+-------------

In [204]:
best_model_results.select(['rawPrediction', 'prediction', 'canc']).show(10, False)

+--------------------+----------+----+
|       rawPrediction|prediction|canc|
+--------------------+----------+----+
|[7.36601764248823...|       1.0|   1|
|[17.0575710379575...|       0.0|   1|
|[16.4893218608728...|       0.0|   1|
|[16.3220158268161...|       0.0|   1|
|[10.4271757825937...|       0.0|   0|
|[13.8699023989112...|       0.0|   1|
|[14.3856274927273...|       0.0|   0|
|[16.2067887268884...|       0.0|   0|
|[15.0503712183422...|       0.0|   1|
|[18.0768931186343...|       0.0|   0|
+--------------------+----------+----+
only showing top 10 rows



In [130]:
test_df.filter('canc = 1').count()

24

In [131]:
train_df.filter('canc = 1').count()

75

In [253]:
best_model_results.select("prediction", "canc", "features").filter('canc = 1').groupby(['canc', 'prediction']).agg({'canc':'count'}).show(50)

+----+----------+-----------+
|canc|prediction|count(canc)|
+----+----------+-----------+
|   1|       0.0|         16|
|   1|       1.0|          7|
+----+----------+-----------+



In [269]:
cv_rf_results.bestModel.stages[-1]

RandomForestClassificationModel (uid=RandomForestClassifier_18ab5d75a053) with 20 trees

# Final Steps
Clean up your code, adding comments and renaming variables to make the code easier to read and maintain. Refer to the Spark Project Overview page and Data Scientist Capstone Project Rubric to make sure you are including all components of the capstone project and meet all expectations. Remember, this includes thorough documentation in a README file in a Github repository, as well as a web app or blog post.