Spark Python API is used in code. To install PySpark run the following:
pip install pyspark
All other libraries used in the code should be in Anaconda distribution of Python 3.8. Jupyter notebook was run on Udacity workspace with Spark pre-installed. Output of the code may be different in other environments.
In today's dynamic world, we are seeing a growing number of large data, and manipulate these large data on personal computer can be a problem due to lack of memory. This is a perfect opportunity to experiment Spark, a big data analytics tool, on a dataset of music streaming application. Churn rate is what most service businesses pay attention to, and this project analyzes user behaviors in the music streaming app, Sparkify, and build machine learning models to predict churn.
Dataset is user activity log, which includes user information, browser, account level, location, song, artist, and timestamp.
Sparkify.ipynb - Jupyter notebook that includes code, plots and outputs related to exploratory data analysis and steps in data preprocessing and machine learning model building.
PageVisualization.PNG, StayChurn.PNG - plots for data visualizations
Schema.PNG, describesummary.PNG, pagecount.PNG, missing1.PNG, missing2.PNG - output images for dataset
featureimportance.PNG, logregparam.PNG, lsvcparam.PNG, paramMetrics.PNG, rfparam.PNG, bestScore.PNG - output images for models
Sparkify.PNG - Sparkify logo
Logistic Regression, Random Forest, and Linear Support Vector
Findings of this project was published on Medium available here.
Interesting insights about Sparkify user behavior were discovered in the analysis, including number of songs listened and roll advertistements shown to groups of users who stayed and churned. Three machine learning models implemented all achieved more than 83% accuracy on predicting churn.
This project is the capstone project in Udacity Data Scientist nanodegree with collaboration from Insight Data Science. Data is only available in workspace provided by Udacity.