Installation

The code uses PySpark, the Python API for Spark. To install PySpark, run the following:

pip install pyspark

All other libraries used in the code should be included in the Anaconda distribution of Python 3.8. The Jupyter notebook was run in a Udacity workspace with Spark pre-installed, so the output of the code may differ in other environments.
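After installing, a quick check like the one below can confirm that PySpark is visible to the Python interpreter running the notebook. This is a minimal sketch using only the standard library; the helper name `pyspark_status` is hypothetical and is not part of the project code.

```python
from importlib import metadata, util


def pyspark_status():
    """Return a short message describing whether PySpark is importable."""
    # find_spec returns None when the package cannot be located.
    if util.find_spec("pyspark") is None:
        return "pyspark is not installed; run: pip install pyspark"
    try:
        version = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        # Importable but without pip metadata (e.g. a pre-installed Spark).
        version = "unknown"
    return f"pyspark is available (version {version})"


print(pyspark_status())
```

Note that PySpark also needs a Java runtime to start a Spark session, which this check does not verify.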

Project Motivation

Data volumes keep growing, and manipulating large datasets on a personal computer can be a problem because of limited memory. This makes the project a good opportunity to experiment with Spark, a big data analytics tool, on a dataset from a music streaming application. Churn rate is a metric most service businesses pay attention to. This project analyzes user behavior in the music streaming app Sparkify and builds machine learning models to predict churn.

Dataset

The dataset is a user activity log that includes user information, browser, account level, location, song, artist, and timestamp.

File Descriptions

Sparkify.ipynb - Jupyter notebook that includes the code, plots, and outputs for exploratory data analysis, data preprocessing, and machine learning model building.

PageVisualization.PNG, StayChurn.PNG - plots for data visualizations

Schema.PNG, describesummary.PNG, pagecount.PNG, missing1.PNG, missing2.PNG - output images for dataset

featureimportance.PNG, logregparam.PNG, lsvcparam.PNG, paramMetrics.PNG, rfparam.PNG, bestScore.PNG - output images for models

Sparkify.PNG - Sparkify logo

Statistical Models

Logistic Regression, Random Forest, and Linear Support Vector Classifier
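As a rough illustration, the three model families listed above correspond to estimators in `pyspark.ml.classification`. The sketch below only shows how they might be instantiated; the helper name `build_classifiers` and the column names `features` and `label` are assumptions, and the hyperparameters actually used in the notebook are not reproduced here.

```python
def build_classifiers(features_col="features", label_col="label"):
    """Return untrained pyspark.ml classifiers keyed by class name.

    The column names are illustrative assumptions, not taken from the notebook.
    """
    # Imported lazily so this definition loads even where PySpark is absent.
    from pyspark.ml.classification import (
        LinearSVC,
        LogisticRegression,
        RandomForestClassifier,
    )

    # pyspark.ml estimators wrap JVM objects, so a SparkSession must
    # already be active when this function is called.
    cols = {"featuresCol": features_col, "labelCol": label_col}
    return {
        "LogisticRegression": LogisticRegression(**cols),
        "RandomForestClassifier": RandomForestClassifier(**cols),
        "LinearSVC": LinearSVC(**cols),
    }
```

Each returned estimator exposes `fit(train_df)`, where `train_df` would be a Spark DataFrame holding a vector column of features and a numeric label column.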

Result

The findings of this project were published on Medium, available here.

The analysis revealed differences in Sparkify user behavior between users who stayed and users who churned, including the number of songs listened to and the number of roll advertisements shown. All three machine learning models achieved more than 83% accuracy in predicting churn.
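For reference, accuracy here means the fraction of users whose predicted class matches their true class. The sketch below computes it in plain Python on made-up toy labels; the values are illustrative only and are not results from this project.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up toy labels (1 = churned, 0 = stayed), not project results.
toy_labels = [0, 0, 1, 0, 1, 0]
toy_predictions = [0, 0, 1, 0, 0, 0]
print(accuracy(toy_labels, toy_predictions))
```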

Acknowledgements

This project is the capstone project of the Udacity Data Scientist Nanodegree, in collaboration with Insight Data Science. The data is only available in the workspace provided by Udacity.

edwinhung/Sparkify_Project
