This project identifies factors that help predict user churn risk by analyzing user event data; builds and tunes machine learning models using Spark ML libraries to predict churn users.

candywendao/Sparkify_Predict_Churn

Predict Customer Churn with Spark

Project Overview

Sparkify is a fictional music streaming application, similar to Spotify, that offers both free and paid services. Customer churn happens when a user downgrades from the paid premium service to the free tier or cancels the service altogether.

This project identifies factors that help predict user churn risk by analyzing Sparkify’s user event data, then builds and tunes machine learning models with Spark ML libraries to predict which users will churn.

In practice, user event datasets are large files that record every interaction users have with the application. Currently we analyze a small subset (128 MB) in a local workspace. In the future we'll analyze the full dataset (12 GB) in the cloud with AWS.
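The churn definition above can be read as a simple labeling rule over the event log. The following is a minimal sketch in plain Python rather than the notebook's PySpark code. The page names in `CHURN_PAGES` and the `userId` and `page` keys are assumptions made for illustration and may not match the values used in the notebook.

```python
# Hypothetical page names that mark a churn event (a downgrade or a
# cancellation); the actual values in the event data may differ.
CHURN_PAGES = {"Cancellation Confirmation", "Submit Downgrade"}


def label_churned_users(events):
    """Return the set of user IDs that have at least one churn event.

    `events` is an iterable of dicts with the assumed keys
    "userId" and "page".
    """
    return {e["userId"] for e in events if e.get("page") in CHURN_PAGES}


# Tiny made-up event log, for illustration only.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Submit Downgrade"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "3", "page": "Cancellation Confirmation"},
]

churned = label_churned_users(events)
print(sorted(churned))  # ['1', '3']
```

Labeling at the user level (rather than per event) matches the per-user features listed later in this README.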

  1. Medium Post

Project details and results are discussed in the post on Medium.

  2. File Structure

    • 'Sparkify_Churn_Prediction.ipynb' is the IPython notebook that contains the code for data cleaning, data wrangling, model preprocessing, and model training.
    • 'ML_metrics.xlsx' stores the evaluation metrics of the three models.
    • Folder 'image' holds screenshots of the charts and tables used in the repo.
    • The input data file is too large to be uploaded here.
  3. Implementation

This project is implemented with the Apache Spark Python API (PySpark) and uses the following Python libraries and standard-library modules:

- pyspark
- numpy
- pandas
- matplotlib
- seaborn
- datetime
- time

Exploratory Data Analysis

  1. Data Exploration

    • Data file 'mini_sparkify_event_data.json' contains 286,500 rows and 18 columns of user event data from 2018-10-01 to 2018-12-03. The data schema is shown below: Data Schema

    • Null values: dropped 8,346 rows with null values generated by logged-out and guest users. The 'Song', 'Artist' and 'Length' columns are null whenever a user visits a page other than 'NextSong', so these null values are kept in the dataset.

    • 225 distinct users included: 104 female and 121 male.
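The null handling described above can be sketched as a filter that drops events with no identified user while keeping rows whose song fields are legitimately empty. This is an illustrative plain-Python sketch, not the notebook's code. It assumes that logged-out and guest events carry a missing or empty `userId`, and the field names and the 'Home' and 'Login' page names are hypothetical.

```python
def drop_anonymous_events(events):
    """Keep only events tied to an identified user.

    Assumes each event is a dict and that logged-out and guest events
    have a missing or empty "userId" (an assumption about the data).
    """
    return [e for e in events if e.get("userId") not in (None, "")]


# Made-up sample rows, for illustration only.
events = [
    {"userId": "10", "page": "NextSong", "song": "Song A"},
    # Kept: a null song is expected on pages other than 'NextSong'.
    {"userId": "10", "page": "Help", "song": None},
    # Dropped: no user is attached to these events.
    {"userId": "", "page": "Home", "song": None},
    {"userId": None, "page": "Login", "song": None},
]

cleaned = drop_anonymous_events(events)
print(len(cleaned))  # 2
```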

  2. Data Visualization

Charts can be found in the notebook in this repo or in the Medium post.

Modeling and Evaluation

  1. Feature Engineering
    • gender (Male: 1; Female: 0)
    • level (paid or free)
    • most frequent device
    • number of songs listened to per user
    • number of artists listened to per user
    • number of sessions per user
    • average number of pages visited per session per user
    • number of ad views per user
    • number of page views of ‘Error’ per user
    • number of page views of ‘Help’ per user
    • number of page views of ‘Downgrade’ per user
    • number of page views of ‘Upgrade’ per user
    • number of page views of ‘Logout’ per user
    • thumbs-up as a percentage of total ‘NextSong’ page views per user
    • thumbs-down as a percentage of total ‘NextSong’ page views per user

A screenshot of the preprocessed dataset: Screenshot of Dataset
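To make the per-user aggregation behind several of the features above concrete, here is a minimal plain-Python sketch. It is not the notebook's PySpark implementation. The field names (`userId`, `sessionId`, `artist`, `page`) and the 'Thumbs Up' and 'Thumbs Down' page names are assumptions, and only a subset of the listed features is computed.

```python
from collections import defaultdict


def build_user_features(events):
    """Aggregate raw events into one feature dict per user."""
    pages = defaultdict(lambda: defaultdict(int))  # userId -> page -> count
    sessions = defaultdict(set)                    # userId -> session IDs
    artists = defaultdict(set)                     # userId -> artist names

    for e in events:
        uid = e["userId"]
        pages[uid][e["page"]] += 1
        sessions[uid].add(e["sessionId"])
        if e.get("artist") is not None:
            artists[uid].add(e["artist"])

    features = {}
    for uid, counts in pages.items():
        songs = counts.get("NextSong", 0)
        n_sessions = len(sessions[uid])
        features[uid] = {
            "num_songs": songs,
            "num_artists": len(artists[uid]),
            "num_sessions": n_sessions,
            "avg_pages_per_session": sum(counts.values()) / n_sessions,
            "num_help": counts.get("Help", 0),
            "num_error": counts.get("Error", 0),
            # Ratios are relative to 'NextSong' views; 0.0 if no songs played.
            "pct_thumbs_up": counts.get("Thumbs Up", 0) / songs if songs else 0.0,
            "pct_thumbs_down": counts.get("Thumbs Down", 0) / songs if songs else 0.0,
        }
    return features


# Made-up events for a single user, for illustration only.
events = [
    {"userId": "1", "sessionId": 1, "page": "NextSong", "artist": "A"},
    {"userId": "1", "sessionId": 1, "page": "NextSong", "artist": "B"},
    {"userId": "1", "sessionId": 1, "page": "Thumbs Up", "artist": None},
    {"userId": "1", "sessionId": 2, "page": "NextSong", "artist": "A"},
    {"userId": "1", "sessionId": 2, "page": "Help", "artist": None},
    {"userId": "1", "sessionId": 2, "page": "NextSong", "artist": "C"},
]

features = build_user_features(events)
print(features["1"]["num_songs"], features["1"]["pct_thumbs_up"])  # 4 0.25
```

Expressing the thumbs-up and thumbs-down counts as ratios of songs played keeps heavy and light listeners comparable.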

  2. Modeling and Evaluation
  • Methods: first train the base model, Logistic Regression, and then apply Random Forest and Gradient-Boosted Trees.

  • Metrics: As the sample size is comparatively small, we use the F1 score as the evaluation metric, which combines precision and recall into a single score.

  • Results:

According to the results, the Random Forest model looks very promising: it achieved scores of 0.90 and 0.78 on the training and test datasets, respectively.

Metrics
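For reference, the F1 score used as the evaluation metric is the harmonic mean of precision and recall. The sketch below computes it for a binary churn label in plain Python. The labels are made up for illustration, and Spark's evaluator may report a class-weighted variant, so values from this sketch need not match the notebook's numbers.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

print(round(f1_score(y_true, y_pred), 3))  # 0.667
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and so is the F1 score.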

  • Future Improvement:

In the future, we’ll apply the models to the full dataset. To improve model performance, we may consider adding more features such as location, time spent on the application, and the device in use at the time of churn.

Acknowledgement

I wish to thank Udacity for the instructions and advice.
