# Sparkify Project Workspace
This workspace contains a tiny subset (128MB) of the full dataset available (12GB). Feel free to use this workspace to build your project, or to explore a smaller subset with Spark before deploying your cluster on the cloud. Instructions for setting up your Spark cluster are included in the last lesson of the Extracurricular Spark Course content.

You can follow the steps below to guide your data analysis and model building portion of this project.

## Content
- [Exploratory Data Analysis](#Exploratory_Data_Analysis)<br>
    - [DataFrame preliminary Description](#describe_df)<br>
    - [Study DataFrame based on userId](#df_example_study)<br>
    - [Define Churn - Analysing Cancellation Confirmation Events](#cancellation)<br>
    - [Define Churn - Analysing Downgrade Events](#downgrade)<br>

In [1]:
# import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import desc
from pyspark.sql.functions import asc
from pyspark.sql.functions import sum as Fsum
import pyspark.sql.functions as psqf
from pyspark.sql.functions import col, countDistinct,  mean as _mean, stddev as _stddev
from pyspark.sql import Window

from pyspark.sql import functions as F

from pyspark.sql.types import TimestampType

import datetime

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

from IPython.display import Markdown, display, HTML
def printmd(string):
    display(Markdown(string))

In [None]:
# plotting module
import plot_df # simplifies data plotting with a specially designed layout

def plot_bar(df, title, **kwargs):
    """
    INPUT:
        - df - dataframe to be plotted as bar plot, 
                index of df --> categorical x axis of the bar plot, 
                columns of df --> the bars of the plot
        - title - provide a title as string for the plot
        - **kwargs - (optional) parameters to overwrite the default setting of the bar_setting_dict 
    OUTPUT:
        - A bar plot with a specially designed layout
    
    AIM:
        1.) Overwrite the default setting of the bar_setting_dict dictionary
        2.) execute plot_df.plot_df_bar() - this function will then create the plot based on the provided settings
    """
    bar_setting_dict={
                    'x_label' : '',
                    'y_label' : '',
                    'figsize' :(6,4),
                    'layout' : (1, 1),
                    'width' : 0.3,
                    'align' : 'center',
                    'subplots' : False,
                    'fontsize_title' : 14,
                    'fontsize_axes_values' : 14,
                    'fontsize_axes_label' : 14,
                    'fontsize_text' : 14,
                    'fontsize_legend' : 14,
                    'set_yticks_range' : False,
                    'yticks_start' : None,
                    'yticks_end' : None,
                    'yticks_step' : None,
                    'legend_state' : False,
                    'legend_list_to_plot' : '',
                    'legend_move' : False,
                    'legend_x' : None,
                    'legend_y' : None}
    for key, value in kwargs.items():
        bar_setting_dict[key] = value
    plot_df.plot_df_bar(df, title, bar_setting_dict)

    
def plot_pie(df, title, explode, **kwargs):
    """
    INPUT:
        - df - dataframe to be plotted as a pie plot, columns of df --> the pie pieces of the plot
        - title - provide a title as string for the plot
        - explode - list to set the explosion of each pie piece, length must be equal to number of pie pieces
        - **kwargs - (optional) parameters to overwrite the default setting of the pie_setting_dict 
    OUTPUT:
        - A pie chart with a specially designed layout
    
    AIM:
        1.) Overwrite the default setting of the pie_setting_dict dictionary
        2.) execute plot_df.plot_df_pie() - this function will then create the plot based on the provided settings
    """
    pie_setting_dict = {'figsize' : (10,5),
                        'shadow' : True,
                        'autopct' : '%1.1f%%',
                        'startangle' :90,
                        'fontsize_title' : 14,
                        'fontsize_text' : 14,
                        'fontsize_legend' : 14,
                        'legend_state' : True,
                        'legend_title' : '',   
                        'legend_list_to_plot' : '',
                        'legend_move' : False,
                        'legend_x' : None,
                        'legend_y' : None}
    
    for key, value in kwargs.items():
        pie_setting_dict[key] = value
    plot_df.plot_df_pie(df, title, explode, pie_setting_dict)


In [2]:
# create a Spark session
spark = SparkSession \
    .builder \
    .appName("Wrangling Data") \
    .getOrCreate()

# Load and Clean Dataset
In this workspace, the mini-dataset file is `mini_sparkify_event_data.json`. Load and clean the dataset, checking for invalid or missing data - for example, records without userids or sessionids. 

In [3]:
path = "mini_sparkify_event_data.json"
df = spark.read.json(path)

# <a class="anchor" id="Exploratory_Data_Analysis">Exploratory Data Analysis</a>
When you're working with the full dataset, perform EDA by loading a small subset of the data and doing basic manipulations within Spark. In this workspace, you are already provided a small subset of data you can explore.

### Define Churn

Once you've done some preliminary analysis, create a column `Churn` to use as the label for your model. I suggest using the `Cancellation Confirmation` events to define your churn, which happen for both paid and free users. As a bonus task, you can also look into the `Downgrade` events.
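The labelling rule described above can be sketched in plain Python on a toy event log, so it runs without a Spark session. The user IDs and the helper name `label_churn` are hypothetical; only the `userId` and `page` field names and the `Cancellation Confirmation` page come from the dataset.

```python
# Toy event log; field names mirror the dataset's `userId` and `page` columns.
# The user IDs below are made up for illustration.
events = [
    {"userId": "8", "page": "NextSong"},
    {"userId": "8", "page": "Home"},
    {"userId": "18", "page": "NextSong"},
    {"userId": "18", "page": "Cancellation Confirmation"},
    {"userId": "32", "page": "Downgrade"},
]


def label_churn(events, churn_page="Cancellation Confirmation"):
    """Return {userId: 1 if the user ever visited churn_page, else 0}."""
    labels = {}
    for event in events:
        user = event["userId"]
        hit = int(event["page"] == churn_page)
        # Once a user has been flagged, the flag stays at 1.
        labels[user] = max(labels.get(user, 0), hit)
    return labels


churn_labels = label_churn(events)
print(churn_labels)
```

Passing `churn_page="Downgrade"` gives the analogous label for the bonus task.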

### Explore Data
Once you've defined churn, perform some exploratory data analysis to observe the behavior for users who stayed vs users who churned. You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or number of songs played.

## <a class="anchor" id="describe_df">DataFrame preliminary Description</a>

In [None]:
df.printSchema()

In [None]:
# Readable dataframe head using pandas (limit first, so only 5 rows are collected to the driver)
df_pd = df.limit(5).toPandas()
df_pd

### Shape of DataFrame

In [None]:
printmd('#### Number of entries: ' + str(df.count()))
printmd('#### Number of columns: ' + str(len(df.columns)))

### Describe the Dataframe

In [None]:
df.toPandas().describe()

### Check for unique values

In [None]:
# count unique values for each column
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).show()

### Check for null values

In [None]:
# count null values for each column (show() returns None, so the result is not assigned)
df.select([psqf.count(psqf.when(psqf.isnull(c), c)).alias(c) for c in df.columns]).show()

### Result: 
- There are 286500 entries in the dataframe.
- There are 226 unique userIds, 189 unique last names and 56 unique first names.
- There are 2354 unique sessionIds.
- There are 8346 entries without first name, last name and gender. They seem to belong to anonymous users.
- However, there are 0 null userIds. To do: check the userId value of those 8346 entries.
- There are 0 null sessionIds.

### Check for missing data

In [None]:
# Missing userID
number_missing_values = df.count() - df.dropna(how = "any", subset = ["userId", "sessionId"]).count()

In [None]:
printmd('#### There are ' + str(number_missing_values) + ' missing values for userId and sessionId')

### Check userID for uncommon entries

In [None]:
df.describe("userId").show()

In [None]:
df.select("userId").dropDuplicates().sort("userId").show()

In [None]:
empty_string_user = df[df["userId"] == ""]

In [None]:
number_empty_string_user = empty_string_user.count()

In [None]:
printmd('#### There are '+ str(number_empty_string_user) + ' entries with an empty-string userId')

In [None]:
df.select(["userId", "firstName", "lastName", "gender", "level"]).where(df.userId== "").show()

In [None]:
df.select(["userId", "firstName", "lastName", "gender"]).where(df.userId== "").dropDuplicates().show()

In [None]:
df.select(["userId", "firstName", "lastName", "gender"]).where(df.userId== "").count()

In [None]:
df.select(["userId", "firstName", "lastName", "gender", "level"]).where(df.userId== "").count()

#### Result: 
- All users with "empty string" userId did not provide their first name, last name and gender. 
- They are anonymous and free users. 
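Since these rows cannot be attributed to a registered user, a cleaning step would typically drop them. Below is a minimal plain-Python sketch of that filter on toy rows; the helper name `drop_anonymous` and the sample values are assumptions. In PySpark the same idea would be along the lines of `df.filter(df.userId != "")`.

```python
def drop_anonymous(rows):
    """Keep rows that have a non-empty userId and a sessionId that is present."""
    return [
        row for row in rows
        if row.get("userId") not in (None, "") and row.get("sessionId") is not None
    ]


# Toy rows: one registered user, one empty-string userId, one missing sessionId.
rows = [
    {"userId": "8", "sessionId": 234, "page": "Home"},
    {"userId": "", "sessionId": 268, "page": "Home"},
    {"userId": "42", "sessionId": None, "page": "NextSong"},
]

clean_rows = drop_anonymous(rows)
print(len(rows) - len(clean_rows), "rows dropped")
```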

### For the registered users: How many paid/free accounts are there?

In [None]:
df.drop_duplicates(subset=['userId']).where(df.level =="paid").count()

In [None]:
df.drop_duplicates(subset=['userId']).where(df.level =="free").count()

### Result:
- 48 paid accounts
- 178 free accounts

### What are the web pages a user can visit?

In [None]:
df.select("page").dropDuplicates().sort("page").show(n=50)

### Gender distribution of users

In [None]:
gender_distribution = df.drop_duplicates(subset=['userId']).groupby('gender').count()
gender_distribution.show()

In [None]:
gender_distribution_pd = gender_distribution.dropna().toPandas()
gender_distribution_pd.set_index('gender', inplace=True)
gender_distribution_pd

In [None]:
plot_bar(gender_distribution_pd, 'Gender distribution of users')

### Total number of songs played

In [39]:
df.where(df.page=='NextSong').count()

228108

## <a class="anchor" id="df_example_study">Study DataFrame based on userId</a>

- Study the DataFrame based on a provided userId. 
- Change the **userId in the next cell** to test different users

In [29]:
userId = 8

### Overview

In [30]:
# Overview
df.where(df.userId==userId).toPandas()

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId,hour,seconds
0,,Logged In,Christina,F,1,Carrillo,,free,"St. Louis, MO-IL",GET,Home,1533650280000,234,,200,1539186658000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,15,1.539186658E9
1,,Logged In,Christina,F,2,Carrillo,,free,"St. Louis, MO-IL",PUT,Add Friend,1533650280000,234,,307,1539186659000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,15,1.539186659E9
2,,Logged In,Christina,F,3,Carrillo,,free,"St. Louis, MO-IL",PUT,Add Friend,1533650280000,234,,307,1539186660000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,15,1.53918666E9
3,,Logged In,Christina,F,4,Carrillo,,free,"St. Louis, MO-IL",GET,Home,1533650280000,234,,200,1539186664000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,15,1.539186664E9
4,Morcheeba,Logged In,Christina,F,5,Carrillo,398.91546,free,"St. Louis, MO-IL",PUT,NextSong,1533650280000,234,Almost Done,200,1539186688000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,15,1.539186688E9
5,Dan Le Sac vs Scroobius Pip,Logged In,Christina,F,6,Carrillo,241.57995,free,"St. Louis, MO-IL",PUT,NextSong,1533650280000,234,Snob,200,1539187086000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,15,1.539187086E9
6,Zac Brown Band,Logged In,Christina,F,7,Carrillo,237.08689,free,"St. Louis, MO-IL",PUT,NextSong,1533650280000,234,Chicken Fried (Album),200,1539187327000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,16,1.539187327E9
7,Van Der Graaf Generator,Logged In,Christina,F,8,Carrillo,407.17016,free,"St. Louis, MO-IL",PUT,NextSong,1533650280000,234,Masks,200,1539187564000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,16,1.539187564E9
8,The Verve,Logged In,Christina,F,9,Carrillo,360.25424,free,"St. Louis, MO-IL",PUT,NextSong,1533650280000,234,Bitter Sweet Symphony,200,1539187971000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,16,1.539187971E9
9,Carpenters,Logged In,Christina,F,10,Carrillo,250.25261,free,"St. Louis, MO-IL",PUT,NextSong,1533650280000,234,Ticket To Ride,200,1539188331000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8,16,1.539188331E9


### Short user report info

In [31]:
def get_user_info(df, userId): 
    printmd('### user with userId = ' + str(userId))
   
    # number of total user actions
    printmd('#### number of total user actions')
    print(df.where(df.userId==userId).count())
 
    # paid or free user
    printmd('#### paid and/or free user')
    df.where(df.userId==userId).groupby('level').count().show()
    
    # number of sessions
    num_session = df.where(df.userId==userId).groupby('sessionId').count().count()
    printmd('#### number of sessions')
    print(num_session)
    
    # session overview
    printmd('#### session overview')
    df.where(df.userId==userId).groupby('sessionId').count().show()
    
    # number of songs played per session
    printmd('#### number of songs played per session')
    df.where((df.userId==userId) & (df.page=='NextSong')).groupby('sessionId').count().show()
    
    # mean number of played songs per session (uses the userId argument instead of a hard-coded user)
    mean_num_plyed_songs = df.where((df.userId==userId) & (df.page=='NextSong')).groupby('sessionId').count().select(_mean(col('count')).alias('avg_num_songs_per_session'))
    printmd('#### mean number of played songs per session')
    mean_num_plyed_songs.show()
    
    # get all user's sessionIds
    sessionId_list = df.where(df.userId==userId).select('sessionId').drop_duplicates(subset=['sessionId']).rdd.flatMap(lambda x: x)
    printmd('#### all user sessionIds')
    print(sessionId_list.collect())
    
get_user_info(df, userId)

### user with userId = 8

#### number of total user actions

334


#### paid and/or free user

+-----+-----+
|level|count|
+-----+-----+
| free|  334|
+-----+-----+



#### number of sessions

7


#### session overview

+---------+-----+
|sessionId|count|
+---------+-----+
|      720|   66|
|     1200|   18|
|     1811|    9|
|     1312|   55|
|      234|  155|
|     1507|    6|
|     2211|   25|
+---------+-----+



#### number of songs played per session

+---------+-----+
|sessionId|count|
+---------+-----+
|      720|   49|
|     1200|   11|
|     1811|    6|
|     1312|   45|
|      234|  117|
|     1507|    4|
|     2211|   19|
+---------+-----+



#### mean number of played songs per session

+-------------------------+
|avg_num_songs_per_session|
+-------------------------+
|       35.857142857142854|
+-------------------------+



#### all user sessionIds

[720, 1200, 1811, 1312, 234, 1507, 2211]


### Get user's sessionIds 

In [32]:
# helper function - get all sessions for user with userId
def get_sessionIds_for_userId(df, userId):
    sessionId_list = df.where(df.userId==userId).select('sessionId').drop_duplicates(subset=['sessionId']).rdd.flatMap(lambda x: x).collect()
    return sessionId_list
get_sessionIds_for_userId(df, userId)

[720, 1200, 1811, 1312, 234, 1507, 2211]

### Get user page activities for a certain session
- Change **sessionId in the next cell** to check page activities of different sessions

In [33]:
sessionId = 1507

In [34]:
# helper function - get list of activities for user with userId in session with sessionId
def get_user_activities_for_session(df, userId, sessionId):
    page_names_list = df.where((df.userId==userId) & (df.sessionId==sessionId)).select('page').rdd.flatMap(lambda x: x).collect()
    return page_names_list

get_user_activities_for_session(df, userId, sessionId)

['Home', 'NextSong', 'NextSong', 'NextSong', 'Upgrade', 'NextSong']

### Create time/duration columns
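The `ts` column holds epoch time in milliseconds. One detail worth keeping in mind: `datetime.fromtimestamp` without a timezone argument uses the local timezone of the machine, so the derived hour depends on where the code runs. The sketch below pins the timezone to UTC; the helper names are hypothetical and the timestamp is taken from the user overview shown earlier.

```python
from datetime import datetime, timezone


def epoch_ms_to_hour(ts_ms, tz=timezone.utc):
    """Hour of day (0-23) for an epoch timestamp given in milliseconds."""
    return datetime.fromtimestamp(ts_ms / 1000.0, tz=tz).hour


def epoch_ms_to_seconds(ts_ms):
    """Epoch timestamp in seconds as a float."""
    return ts_ms / 1000.0


hour = epoch_ms_to_hour(1539186658000)
seconds = epoch_ms_to_seconds(1539186658000)
print(hour, seconds)
```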

In [28]:
from pyspark.sql.types import DoubleType

# Convert epoch time in milliseconds to the hour of the day (user-defined function get_hour).
# An explicit return type is given; without one, a udf returns strings.
get_hour = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).hour, IntegerType())

# Convert epoch time in milliseconds to seconds
get_seconds = udf(lambda x: x / 1000.0, DoubleType())

def create_hour_column(df):
    # Add a new column called 'hour'
    df = df.withColumn("hour", get_hour(df.ts))
    return df

def create_seconds_column(df):
    # Add a new column called 'seconds'
    df = df.withColumn("seconds", get_seconds(df.ts))
    return df

def get_session_page_duration(df, userId, sessionId):
    # get activities for a certain user sessionId
    session_activities = df.where((df.userId==userId) & (df.sessionId==sessionId)).select('sessionId', 'page', 'seconds', 'ts')

    
    session_window = Window.partitionBy().orderBy("ts")

    session_activities = session_activities.withColumn("next_value", F.lead(session_activities.seconds).over(session_window))
    session_activities = session_activities.withColumn("page_duration_in_sec", F.when(F.isnull(session_activities.next_value - session_activities.seconds), 0)
                                  .otherwise(session_activities.next_value - session_activities.seconds))
    
    
    return session_activities


# Add hour column
df = create_hour_column(df)
# Add seconds column
df = create_seconds_column(df)


session_activities = get_session_page_duration(df, userId=8, sessionId=1507)
session_activities.show()

+---------+--------+-------------+-------------+-------------+--------------------+
|sessionId|    page|      seconds|           ts|   next_value|page_duration_in_sec|
+---------+--------+-------------+-------------+-------------+--------------------+
|     1507|    Home|1.541890239E9|1541890239000| 1.54189024E9|                 1.0|
|     1507|NextSong| 1.54189024E9|1541890240000|1.541890534E9|               294.0|
|     1507|NextSong|1.541890534E9|1541890534000| 1.54189087E9|               336.0|
|     1507|NextSong| 1.54189087E9|1541890870000|1.541890906E9|                36.0|
|     1507| Upgrade|1.541890906E9|1541890906000|1.541891046E9|               140.0|
|     1507|NextSong|1.541891046E9|1541891046000|         null|                 0.0|
+---------+--------+-------------+-------------+-------------+--------------------+
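The duration logic behind this table is a "lead" over the time-ordered events: each page lasts until the next event starts, and the last event of the session gets 0. A plain-Python sketch of the same idea, using the timestamps from the table above (the function name `page_durations` is an assumption):

```python
def page_durations(events):
    """events: list of (page, ts_ms). Returns (page, duration_in_sec) per event.

    The duration is the gap to the next event in time order; the last event gets 0.0.
    """
    ordered = sorted(events, key=lambda e: e[1])
    result = []
    for i, (page, ts) in enumerate(ordered):
        if i + 1 < len(ordered):
            duration = (ordered[i + 1][1] - ts) / 1000.0
        else:
            duration = 0.0  # no following event in this session
        result.append((page, duration))
    return result


session_1507 = [
    ("Home", 1541890239000),
    ("NextSong", 1541890240000),
    ("NextSong", 1541890534000),
    ("NextSong", 1541890870000),
    ("Upgrade", 1541890906000),
    ("NextSong", 1541891046000),
]

durations = page_durations(session_1507)
for page, seconds in durations:
    print(page, seconds)
```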



In [None]:
# SQL view 
df.createOrReplaceTempView("sparkify_sql_view")

### Average number of songs per session per user

In [35]:
# number of events per user and session (show() returns None, so the result is not assigned)
df.groupby('userId','sessionId').count().show()

+------+---------+-----+
|userId|sessionId|count|
+------+---------+-----+
|      |      268|   15|
|    92|      358|   73|
|    42|      433|   21|
|   101|      635|  855|
|      |      565|   15|
|   120|      627|  261|
|   140|      798|    6|
|      |      814|   15|
|   122|      691|    9|
|      |     1053|    4|
|    29|     1030|   19|
|      |     1183|    1|
|     8|     1200|   18|
|      |     1305|   10|
|      |     1446|    8|
|    96|     1653|  160|
|   153|     1794|   74|
|      |     1592|    6|
|    97|     2019|   84|
|    35|     2270|   13|
+------+---------+-----+
only showing top 20 rows



In [None]:
"""
num_of_sessions = spark.sql('''
                               SELECT userId, avg(count) as average 
                               FROM (SELECT userId, count(*) as count 
                                     FROM sparkify_sql_view 
                                     group by sessionId, userId) 
                               group by userId
                               '''
                              )
num_of_sessions.show()
"""

### Number of thumbs up per user

In [36]:
# filter on page first, then count per user (show() returns None, so the result is not assigned)
df.where(df.page=='Thumbs Up').groupby('userId').count().show()

+------+-----+
|userId|count|
+------+-----+
|100010|   17|
|200002|   21|
|    51|  100|
|   124|  171|
|     7|    7|
|    54|  163|
|    15|   81|
|   155|   58|
|   132|   96|
|   154|   11|
|100014|   17|
|   101|   86|
|    11|   40|
|   138|   95|
|300017|  303|
|    29|  154|
|    69|   72|
|100021|   11|
|    42|  166|
|   112|    9|
+------+-----+
only showing top 20 rows



In [None]:
"""
thumbs_up = spark.sql('''
                      SELECT userId, count(*) as num_thumbs_up
                      FROM sparkify_sql_view 
                      where page = 'Thumbs Up' 
                      group by userId
                      ''')

thumbs_up.show()
"""

### Number of thumbs down per user

In [37]:
# filter on page first, then count per user (show() returns None, so the result is not assigned)
df.where(df.page=='Thumbs Down').groupby('userId').count().show()

+------+-----+
|userId|count|
+------+-----+
|100010|    5|
|200002|    6|
|    51|   21|
|   124|   41|
|     7|    1|
|    15|   14|
|    54|   29|
|   155|    3|
|   132|   17|
|100014|    3|
|   101|   16|
|    11|    9|
|   138|   24|
|300017|   28|
|    29|   22|
|    69|    9|
|100021|    5|
|    42|   25|
|   112|    3|
|    73|    7|
+------+-----+
only showing top 20 rows



In [None]:
"""
thumbs_down = spark.sql('''
                        SELECT userId, count(*) as num_thumbs_down 
                        FROM sparkify_sql_view 
                        where page = 'Thumbs Down' 
                        group by userId
                        ''')

thumbs_down.show()
"""

In [52]:
df.where((df.userId==8) & (df.page=='NextSong')).groupby('sessionId').count().select(_mean('count')).show()

+------------------+
|        avg(count)|
+------------------+
|35.857142857142854|
+------------------+



### Average number of songs played per session

In [57]:
# mean number of played songs per session, for every user
# Note: a DataFrame cannot be referenced inside a udf (it cannot be pickled, which raises a PicklingError),
# so the aggregation is done with two groupby steps instead.
songs_per_session = df.where(df.page=='NextSong').groupby('userId', 'sessionId').count()
mean_num_plyed_songs = songs_per_session.groupby('userId').agg(_mean('count').alias('avg_num_songs_per_session'))
mean_num_plyed_songs.show()



### Average number of songs played per user

### Average number of songs played per user per session
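This metric is a two-step aggregation: count the `NextSong` events per (userId, sessionId) pair, then average those counts per user. The sketch below shows the logic in plain Python on toy data; the function name and the sample rows are assumptions. As with a filter on `NextSong` in Spark, sessions without any played song do not enter the average.

```python
from collections import Counter, defaultdict


def mean_songs_per_session(events):
    """Return {userId: mean number of NextSong events per session}."""
    # Step 1: songs per (user, session)
    per_session = Counter(
        (e["userId"], e["sessionId"]) for e in events if e["page"] == "NextSong"
    )
    # Step 2: average the per-session counts for each user
    per_user = defaultdict(list)
    for (user, _session), n_songs in per_session.items():
        per_user[user].append(n_songs)
    return {user: sum(counts) / len(counts) for user, counts in per_user.items()}


# Toy events: user "8" plays 2 songs in session 1 and 4 in session 2, user "9" plays 1 song.
events = (
    [{"userId": "8", "sessionId": 1, "page": "NextSong"}] * 2
    + [{"userId": "8", "sessionId": 1, "page": "Home"}]
    + [{"userId": "8", "sessionId": 2, "page": "NextSong"}] * 4
    + [{"userId": "9", "sessionId": 3, "page": "NextSong"}]
    + [{"userId": "9", "sessionId": 4, "page": "Home"}]
)

averages = mean_songs_per_session(events)
print(averages)
```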

## <a class="anchor" id="cancellation">Define Churn - Analysing Cancellation Confirmation Events</a>

In [None]:
df_pd = df.filter("page = 'Cancellation Confirmation'").toPandas()
df_pd

In [None]:
df.filter("page = 'Cancellation Confirmation'").count()

## <a class="anchor" id="downgrade">Define Churn - Analysing Downgrade Events</a>

In [None]:
df_pd = df.filter("page = 'Downgrade'").toPandas()
df_pd

In [None]:
df.filter("page = 'Downgrade'").count()

In [None]:
churn = udf(lambda x: int(x=="Cancellation Confirmation"), IntegerType())
downgrade_churn = udf(lambda x: int(x=="Downgrade"), IntegerType())

df = df.withColumn("downgraded", downgrade_churn("page")).withColumn("cancelled", churn("page"))



In [None]:
#distribution of users downgrades and cancellations
df.select(['userId', 'downgraded', 'cancelled'])\
    .groupBy('userId').sum()\
    .withColumnRenamed('sum(downgraded)', 'downgraded')\
    .withColumnRenamed('sum(cancelled)', 'cancelled').describe().show()

In [None]:
windowvalue = Window.partitionBy("userId").orderBy(desc("ts")).rangeBetween(Window.unboundedPreceding, 0)


df = df.withColumn("churn_phase", Fsum("cancelled").over(windowvalue))\
    .withColumn("downgrade_phase", Fsum("downgraded").over(windowvalue))
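The window above orders each user's events from newest to oldest and sums the flag from the start of the partition up to the current row. The effect is that every event at or before a cancellation carries `churn_phase = 1`, while users who never cancel keep 0 throughout. A plain-Python sketch of that semantics on toy data (the function name and sample rows are assumptions):

```python
def add_churn_phase(events):
    """For each event, sum `cancelled` over the same user's events with ts >= this event's ts."""
    result = []
    for event in events:
        total = sum(
            other["cancelled"]
            for other in events
            if other["userId"] == event["userId"] and other["ts"] >= event["ts"]
        )
        result.append({**event, "churn_phase": total})
    return result


# Toy data: user "A" cancels at ts=3, user "B" never cancels.
events = [
    {"userId": "A", "ts": 1, "cancelled": 0},
    {"userId": "A", "ts": 2, "cancelled": 0},
    {"userId": "A", "ts": 3, "cancelled": 1},
    {"userId": "B", "ts": 1, "cancelled": 0},
    {"userId": "B", "ts": 2, "cancelled": 0},
]

phases = add_churn_phase(events)
for row in phases:
    print(row)
```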

# Feature Engineering
Once you've familiarized yourself with the data, build out the features you find promising to train your model on. To work with the full dataset, you can follow the steps below.
- Write a script to extract the necessary features from the smaller subset of data
- Ensure that your script is scalable, using the best practices discussed in Lesson 3
- Try your script on the full data set, debugging your script if necessary

If you are working in the classroom workspace, you can just extract features based on the small subset of data contained here. Be sure to transfer over this work to the larger dataset when you work on your Spark cluster.

# Modeling
Split the full dataset into train, test, and validation sets. Test out several of the machine learning methods you learned. Evaluate the accuracy of the various models, tuning parameters as necessary. Determine your winning model based on test accuracy and report results on the validation set. Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.
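To see why F1 is preferred over accuracy when churners are rare, consider the toy comparison below. All counts are hypothetical and chosen only for illustration: a model that predicts "stay" for everyone reaches high accuracy but an F1 of zero.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical toy numbers: 100 users, 10 of whom churned.
# Model A predicts "stay" for everyone.
acc_a = accuracy(tp=0, fp=0, fn=10, tn=90)
f1_a = f1_score(tp=0, fp=0, fn=10)

# Model B finds 6 of the 10 churners and raises 4 false alarms.
acc_b = accuracy(tp=6, fp=4, fn=4, tn=86)
f1_b = f1_score(tp=6, fp=4, fn=4)

print("Model A: accuracy", acc_a, "F1", f1_a)
print("Model B: accuracy", acc_b, "F1", f1_b)
```

The two models differ only slightly in accuracy, while F1 separates them clearly.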

# Final Steps
Clean up your code, adding comments and renaming variables to make the code easier to read and maintain. Refer to the Spark Project Overview page and Data Scientist Capstone Project Rubric to make sure you are including all components of the capstone project and meet all expectations. Remember, this includes thorough documentation in a README file in a GitHub repository, as well as a web app or blog post.