# Capstone 3 - Customer Churn Prediction for Telco

Customers at Telco are very important. Any time a customer decides to leave, it is vital to understand why they left. Some customer churn is inevitable: a customer moves away from the area, for example. For any other churn, we need to be able to identify why it happens and create solutions to avoid it in the future. Without customers, our company doesn't exist.

### Data

We analysed data from the [Telco Data Set](https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113), which covers our customers in Q3 and includes demographics, location, and billing/services information. The data sets included:

* TELCO customer demographics
* TELCO customer location
* TELCO customer services
* TELCO customer population
* TELCO customer status

Each of these data sets provided a few different pieces of information, all attached to the individual customer ID. We combined and cleaned them into a single file, *Telco customer churn*, and also created *clean_data* for our machine learning models.
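
The combination step can be sketched roughly as follows. This is a minimal illustration, assuming each source table carries a shared `Customer ID` column; the tiny frames, their values, and the column names are made up for the example and are not the real files.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for three of the source tables (hypothetical values).
demographics = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "Senior Citizen": ["No", "Yes", "No"],
})
services = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "Tenure Months": [2, 34, 8],
})
status = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "Churn Value": [1, 0, 1],
})

# Join every table on the shared key; an inner join keeps only the
# customers that appear in all of the tables.
combined = reduce(
    lambda left, right: left.merge(right, on="Customer ID", how="inner"),
    [demographics, services, status],
)

# Drop rows with missing values before handing the data to a model.
clean_data = combined.dropna().reset_index(drop=True)
print(clean_data)
```

An inner join is only one reasonable choice here; a left join from the status table would keep customers that are missing from one of the other tables.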

### Goals for project:

* **Identify any specific reasons or areas to look at for customer churn**
* **Create a machine learning model that can accurately predict the likelihood of a customer leaving.**

### Data Cleaning

As described above, we took the data for each customer and combined it into one data set. The columns that we decided to work with were as follows:

* Demographics
    * Marriage Status
    * Number of Dependents
    * Is the customer a senior citizen
* Location
    * Zip Code
    * Longitude/Latitude
* Services
    * Phone services
    * Internet services
    * Tenure in months with Telco
    * Other services
    * Streaming
    * Billing info
* Status
    * **Churn Value (what we are trying to predict)**


### Exploring the Data

After cleaning the data and organizing it as shown above, we dived in to see if we could find any possible explanations for customers leaving. From this, we were able to identify a few trends that give us areas to improve our services. Overall, our customers generally left due to:

* **Competitor Offers**
* **Attitude of customer service**

This is displayed below in the analysis of churn reasons given by our customers that left:

<img src='images/total_churn_reasons.png'>

Of the two main reasons for customers leaving, attitude of support or service is something our company can fix internally. I will leave judgement of how to fix this with the directors that lead these departments. One observation worth noting, however, is that customers with no tech support within their services tended to churn much more often than those with the service. It could be beneficial to consider ways to make the service more available to help retain customers. We see this supported in the graphics below:

<img src='images/services.png'>

As we analyse the customers with a lack of services including Tech Support, Online Security, and Online Backup, we see that these customers are more likely to leave. This churn could be a result of an offer for a similar product with these things included at a lower price.
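
A comparison like this can be computed with a row-normalized cross-tabulation. The sketch below uses made-up rows, and the column names `Tech Support` and `Churn Value` (1 = customer left) are assumptions about the cleaned table rather than confirmed names.

```python
import pandas as pd

# Hypothetical customers: six without tech support, four with it.
df = pd.DataFrame({
    "Tech Support": ["No"] * 6 + ["Yes"] * 4,
    "Churn Value": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# normalize="index" turns each row into proportions, so column 1 holds
# the churn rate for that service category.
rates = pd.crosstab(df["Tech Support"], df["Churn Value"], normalize="index")
print(rates)
```

The same call can be repeated for the other service columns, such as online security or online backup.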

One area I'd like to highlight is the internet services. We see that customers with fiber optic internet tended to churn at nearly double the rate of those without the service. This could be tied to our other major churn reason: *Competitor Offers*.

If we do some additional research, we would likely find that areas that were offered Fiber Optics deals by competitors correspond to the customers that left. This leads to our next way of analysing the data available, based on location of customers. Before we look at the graphs, we should note that there are many small areas that had high churn rates, but relatively low customer totals. Most customers that left were from larger areas where we had more customers overall. We see, however, that a few regions have higher percentages of customer churn.

<img src='images/City_Churn_total.png'>

<img src='images/Cities_Churn_perc.png'>

We see that Los Angeles accounts for the largest number of customers that left, but has only the 9th highest percentage of customers that left. Customers in Santa Rosa and Modesto are leaving at much higher rates than in other areas, with both cities seeing over 40% customer turnover. By comparison, Los Angeles and San Diego, the two cities with the most total customer churn, saw roughly 30% customer turnover. It would be worth looking into whether there are competitors in these cities that are adding to the turnover rates.
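
The difference between total churn and churn percentage can be sketched as below. The rows are invented for illustration, `City` and `Churn Value` are assumed column names, and the minimum-customer threshold is an arbitrary choice that reflects the caution above about small areas.

```python
import pandas as pd

# Hypothetical data: one large city and one small city.
df = pd.DataFrame({
    "City": ["Los Angeles"] * 10 + ["Modesto"] * 4,
    "Churn Value": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
})

# Per-city totals and rates in one pass.
city_churn = df.groupby("City")["Churn Value"].agg(
    churned="sum", customers="count", churn_rate="mean"
)

# Ignore cities with too few customers for the rate to be meaningful.
min_customers = 3  # arbitrary threshold for this toy data
city_churn = city_churn[city_churn["customers"] >= min_customers]

print(city_churn.sort_values("churn_rate", ascending=False))
```

In this toy data the larger city has more churned customers in total while the smaller city has the higher churn rate, which is the same pattern the two charts show.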

### Modeling

We decided to look at 4 models and compare the results that came from each:

* K Means Classifier
* Ridge Classifier
* Support Vector Classifier
* *Sequential Neural Network*

The first three are all simple models, and produced acceptable results after some hyperparameter tuning and feature selection. To lower the dimensionality of our training and testing data sets, we chose the 5 best features using the SelectKBest class from sklearn.

These features were:

* Tenure Months
* Internet Service_Fiber optic
* Contract_Month-to-month
* Contract_Two year
* Payment Method_Electronic check
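
A minimal sketch of this kind of selection is shown below. It runs on synthetic data, not the customer table, and it assumes the `f_classif` score function (the sklearn default); the score function actually used in the project is not stated here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the one-hot encoded customer features.
X_arr, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(12)])

# Score every column against the target and keep the 5 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns.
selected = list(X.columns[selector.get_support()])
print(selected)
```

For count-like or one-hot features, `chi2` or `mutual_info_classif` are alternative score functions that could be swapped in.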

Each of these features was used to create a training and testing data set. Because our initial dataset was so small (only 7032 total entries), we used an 80/20 train-test split. To show the overall model effectiveness, we decided to use two metrics, False Negative Rate and True Positive Rate. By minimizing the false negative rate, we reduce the number of customers that are likely to leave without any intervention. Equivalently, by maximizing the true positive rate, we find the vast majority of customers that are likely to leave and can take proactive steps to decrease the number that end up actually leaving. The two rates always sum to 1, so improving one improves the other. Below are the models' results for these metrics:

| **Model** | **False Negative Rate** | **True Positive Rate** |
|:---------- | :-------------------- | :------------------- |
|K Means | 0.101779 | 0.898221 |
|**Ridge** | **0.085968** | **0.914032** |
|SVC | 0.101779 |0.898221 |
|Sequential NN | 0.116601 | 0.883399 |

We see that, of the 4 models we created, the Ridge model is clearly the best with regard to the metrics we chose. This means that our Ridge model should do the best job of predicting which customers will churn and should let the fewest customers slip through the cracks.
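
Both metrics can be derived from a confusion matrix. The sketch below uses a small made-up set of labels (not the project's predictions) and sklearn's `confusion_matrix`; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = customer churned, 0 = customer stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in this fixed order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Both rates are computed over the customers who actually churned.
true_positive_rate = tp / (tp + fn)
false_negative_rate = fn / (tp + fn)

print(f"TPR = {true_positive_rate:.2f}, FNR = {false_negative_rate:.2f}")
```

Because both rates share the denominator `tp + fn`, they always add up to 1, which matches each row of the results table.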

Lastly, the model was optimized using a Grid Search over the alpha values, which ultimately produced the model:
* RidgeClassifier(alpha=0.1)
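
A grid search of this kind might look like the sketch below. It runs on synthetic data; the alpha grid, the 5-fold cross-validation, and the choice of `recall` as the scoring metric (recall is the true positive rate) are all assumptions for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 5 selected features.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=1, random_state=0
)

# 80/20 split, stratified so both parts keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid of regularization strengths to try.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(RidgeClassifier(), param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# score() applies the same "recall" scorer to the held-out 20%.
test_recall = search.score(X_test, y_test)
print(search.best_params_, test_recall)
```

Scoring on recall ties the search directly to the metric reported above; the default accuracy scoring could select a different alpha.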

### Future Improvements

We were able to create a model that produced a true positive rate of over 90% and a false negative rate of under 10%. While these metrics are sufficient for success, our data set is notably small, with only 7032 total entries to work with. This meant that we were only able to use 5 features, as choosing more features would likely make the model generalize poorly to larger future data sets. Thus, we feel that we could have a better prediction model with the following:

* More data entries to train the models
* With more data, increase the number of features
* More information about what tech support customers needed, which competitor offers are being promoted, and in which areas, so that we can focus efforts to limit customer churn