Machine learning project modeling user churn of a hypothetical music streaming service
Data Science Nanodegree - Capstone Project: PySpark Customer Churn Prediction for the Sparkify Music Streaming Service
Head over to Medium to read my blog post: https://davidweisspost.medium.com/churn-prediction-with-pyspark-52ddece92ba4
This is the Capstone project for Udacity's Data Scientist Nanodegree. It involves predicting customer churn for a hypothetical music streaming app, Sparkify, using Spark's MLlib to engineer features and build a classification model. The dataset used here is a medium-sized subset (248 MB, 544,000 rows) of the full dataset (12 GB).
The project was developed on IBM Cloud's Watson Studio, with the data uploaded to the cluster, in a Jupyter Notebook running Python 3.7 and Spark 3.0.
Using pyspark, the project broadly involves the following:
- Loading and cleaning the data
- Exploratory Data Analysis
- Feature Engineering - appropriate features are selected based on the EDA
- Modelling - two different classification models are tested and evaluated
- Model Tuning - Hyperparameter tuning using grid search
- Concluding Remarks
Requirements:
- Python 3.6+
- pyspark.*
- Jupyter - available through this link, or IBM Watson Studio (Lite)