Within this README.md file you will find:
- Introduction
- Overview of Repository Contents
- Project Objectives
- Overview of the Process
- Findings & Recommendations
- Conclusion / Summary
Build a classifier to identify whether a customer will "soon" churn and stop doing business with SyriaTel. The ultimate goal is to label at-risk customers so that the Company can "save" them through promotions or other outreach measures.
- README.md
- telecom_churn_classifier.ipynb: clean Jupyter notebook containing all code and models
- telecom_churn.csv: dataset used
- backup_files: directory containing rough, working, and in-process code
- SyriaTel Customer Data Analysis.pdf: non-technical presentation
- directory.pdf: PDF of the GitHub directory
- jup_notebook.pdf: PDF of the Jupyter notebook
Build a classifier to predict whether a customer will soon stop doing business with SyriaTel. Follow the CRISP-DM process to explore the dataset, prepare the data for modeling, train models, and evaluate them. We also focus on identifying which performance metrics are best suited to judging how well we identify churning customers. As an output, provide the Company with a list of the customers most likely to churn according to our best model.
Following CRISP-DM, the process outlined within telecom_churn_classifier.ipynb follows 6 key steps:
- Business Understanding: Outlines the facts and requirements of the project. Specifically, a classifier is built and trained on SyriaTel customer data to predict whether a customer will be labeled as a 1 (churn) or a 0 (non-churn). Understanding which customers are likely to churn, along with other patterns in the data, should enable SyriaTel to perform more targeted customer outreach and to relate customer features to the strength of the customer relationship going forward.
- Data Understanding: Unpacks all data used in this classification problem (again, primarily SyriaTel customer data). This section covers the distribution of the data, any imbalance in the target variable, and the identification of features likely to impact or be associated with churn.
- Data Preparation: Further preprocessing of the data to prepare for modeling. This includes splitting into training and test sets, encoding the necessary columns, and handling any other data processing prior to modeling. This is also the section in which synthetic training data is created via SMOTE to help with class imbalance.
- Modeling: Trains and evaluates a number of machine learning models, primarily decision trees, random forests, and XGBoost.
- Evaluation: The final / optimal model is selected, and its performance metrics are discussed and evaluated. The focus is on F1 score, recall, and accuracy.
- Deployment: Generates predictions on all data to provide SyriaTel with a list of the customers at highest risk of churning according to our final / best classifier.
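To make the SMOTE step in Data Preparation more concrete, here is a minimal sketch of the interpolation idea behind it, in plain Python. It is not the notebook's code: the function name `smote_sketch`, the sample rows, and the parameter values are all hypothetical, and a real project would more likely rely on a library implementation. The key point is that synthetic rows are built only from minority-class (churn) rows in the training set, never from the test set.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class
    rows and their nearest minority-class neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the chosen row,
        # excluding the row itself.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical, already-scaled training rows for churned customers only.
churn_train = [[0.9, 0.1], [0.8, 0.3], [1.0, 0.2], [0.7, 0.25]]
new_rows = smote_sketch(churn_train, n_new=6)
print(len(new_rows))  # 6
```

Because each synthetic row lies on a line segment between two real churn rows, it stays inside the range of values already seen for that class.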
The best-performing model was our tuned XGBoost classifier, with an AUC of 0.865, an F1 score of 80%, recall of 80%, and overall accuracy of 94%. Based on the final model's feature importances, the most important features appear to be whether a customer is on an international plan, whether a customer is on a voice mail plan, and the number of customer service calls to date. Additionally, the model produced a list of 431 customers deemed "at-risk" of churning. While these customers have already churned, the model can be used going forward to generate a similar list of existing customers at risk of churn. We recommend that the Company begin targeted outreach and customer-saving measures with the customers on this list first. Customers identified by the model as low-risk may, in turn, be candidates for price increases or other revenue-raising measures.
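As an illustration of how an at-risk list can be derived from a classifier's output, the sketch below flags customers whose predicted churn probability meets a cutoff and sorts them from highest to lowest risk. The customer ids, the probabilities, the 0.5 threshold, and the function name `at_risk_customers` are all hypothetical. The notebook's actual selection logic may differ.

```python
def at_risk_customers(probabilities, threshold=0.5):
    """Return (customer_id, probability) pairs at or above the threshold,
    ordered from highest to lowest predicted churn probability."""
    flagged = [(cid, p) for cid, p in probabilities.items() if p >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical customer ids and predicted churn probabilities.
churn_probability = {
    "cust_001": 0.91,
    "cust_002": 0.12,
    "cust_003": 0.67,
    "cust_004": 0.48,
}
outreach_list = at_risk_customers(churn_probability)
print(outreach_list)  # [('cust_001', 0.91), ('cust_003', 0.67)]
```

Lowering the threshold would flag more customers, which tends to raise recall at the cost of more false positives.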
Through an iterative data preparation and modeling process, we were able to tune a model with 80% recall and an overall accuracy of 94%. Throughout this process, recall and F1 score were favored over other metrics because the Company is likely less concerned with false positives: customer-saving measures aimed at mislabeled customers likely cost little compared with losing a customer to churn.
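The reasoning behind favoring recall can be sketched with a small worked example. The confusion-matrix counts and the per-customer costs below are made up for illustration and are not the notebook's results. They only show how the metrics are computed and why missed churners (false negatives) can dominate the cost when outreach is cheap.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, F1, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of actual churners that were caught
    precision = tp / (tp + fp)     # share of flagged customers who actually churn
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
counts = {"tp": 80, "fp": 20, "fn": 20, "tn": 880}
metrics = classification_metrics(**counts)

# Hypothetical costs: a promotion sent to a customer who would have stayed
# versus the revenue lost when a churner is missed.
COST_PER_FALSE_POSITIVE = 10
COST_PER_FALSE_NEGATIVE = 500

false_positive_cost = counts["fp"] * COST_PER_FALSE_POSITIVE
false_negative_cost = counts["fn"] * COST_PER_FALSE_NEGATIVE
print(false_positive_cost, false_negative_cost)  # 200 10000
```

Under these assumed costs, the same number of errors costs fifty times more when they are missed churners, which is the trade-off that motivates tuning for recall.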