# Telco Customer Churn Reduction

# 1. Planning

### Describe the project and goals.

<b>Objective/Goals:</b>
- Find the drivers for customer churn at Telco.
- Construct a machine learning (ML) classification model that accurately predicts customer churn.
- Create repeatable process modules.
- Document your process to be presented and read like a report.
- Answer questions about your code, process, and findings.

### Task out how you will work through the pipeline.

<b>Data Science Pipeline:</b>
1. Planning
2. Acquisition
3. Preparation
4. Exploration
5. Modeling 
6. Delivery 

### Include a data dictionary.

- customer_id                  
- gender                        
- senior_citizen             
- partner                   
- dependents               
- phone_service
- internet_service	
- contract_type	
- payment_type	
- monthly_charges	
- total_charges	
- churn	
- tenure	
- is_female	
- has_churned	
- has_phone	
- has_internet	
- has_phone_and_internet
- partner_dependents	
- average_monthly_charges
- contract_type
- phone_type	
- internet_type
- service_type

### Clearly state your starting hypotheses.

- The largest source of churn is coming from month-to-month contracts because of increasing monthly charges.
- Specifically, the phone and internet bundle customers are churning the most.

### Project Specifications

- Why are our customers churning?
- Are there clear groupings where a customer is more likely to churn?
- What if you consider contract type? 
- Is there a tenure that month-to-month customers are most likely to churn? 
- Are there features that indicate a higher propensity to churn?
- Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point?


# 2. Acquisition

- Run acquire.py.
- Summarize data (.info(), .describe(), .value_counts(), ...).
- Plot distributions of individual variables.


In [2]:
import acquire
import prepare

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [14]:
df = acquire.get_telco_data()

In [15]:
df.head()

| | customer_id | gender | senior | partner | dependents | phone_service | internet_service | contract_type | payment_type | monthly_charges | ... | has_internet | has_phone_and_internet | partner_dependents | Excel functions -> Exercise3 | average_monthly_charges | Excel functions -> Exercise4 | contract_type.1 | phone_type | internet_type | service_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9995-HOTOH | Male | 0 | Yes | Yes | 0 | 1 | 2 | Electronic check | $59.00 | ... | True | False | 3 | 20141117 | $59.00 | True | 2 Year | No Phone Service | DSL | Internet Only |
| 1 | 9993-LHIEB | Male | 0 | Yes | Yes | 1 | 1 | 2 | Mailed check | $67.85 | ... | True | True | 3 | 20140607 | $67.85 | True | 2 Year | One Line | DSL | Phone+Internet |
| 2 | 9992-UJOEL | Male | 0 | No | No | 1 | 1 | 0 | Mailed check | $50.30 | ... | True | True | 0 | 20191216 | $50.30 | True | Month-to-Month | One Line | DSL | Phone+Internet |
| 3 | 9992-RRAMN | Male | 0 | Yes | No | 2 | 2 | 0 | Electronic check | $85.10 | ... | True | True | 1 | 20180412 | $85.10 | True | Month-to-Month | Two or More Lines | Fiber Optic | Phone+Internet |
| 4 | 9987-LUTYD | Female | 0 | No | No | 1 | 1 | 1 | Mailed check | $55.15 | ... | True | True | 0 | 20181228 | $55.15 | True | 1 Year | One Line | DSL | Phone+Internet |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 26 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   customer_id                   7043 non-null   object
 1   gender                        7043 non-null   object
 2   senior                        7043 non-null   int64 
 3   partner                       7043 non-null   object
 4   dependents                    7043 non-null   object
 5   phone_service                 7043 non-null   int64 
 6   internet_service              7043 non-null   int64 
 7   contract_type                 7043 non-null   int64 
 8   payment_type                  7043 non-null   object
 9   monthly_charges               7043 non-null   object
 10  total_charges                 7032 non-null   object
 11  churn                         7043 non-null   object
 12  tenure                        7043 non-null   int64 
 13  is_female         

In [17]:
df.describe()

| | senior | phone_service | internet_service | contract_type | tenure | partner_dependents | Excel functions -> Exercise3 |
|---|---|---|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 1.325004 | 1.222916 | 0.690473 | 32.368309 | 1.082209 | 20170130.0 |
| std | 0.368612 | 0.64273 | 0.778877 | 0.833755 | 24.597021 | 1.226274 | 20574.6 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20130700.0 |
| 25% | 0.0 | 1.0 | 1.0 | 0.0 | 9.0 | 0.0 | 20150710.0 |
| 50% | 0.0 | 1.0 | 1.0 | 0.0 | 29.0 | 1.0 | 20170920.0 |
| 75% | 0.0 | 2.0 | 2.0 | 1.0 | 55.0 | 2.0 | 20190520.0 |
| max | 1.0 | 2.0 | 2.0 | 2.0 | 79.0 | 3.0 | 20200210.0 |


# 3. Preparation

Explore missing values and document takeaways/action plans for handling them. 
- Should you remove the observations with a missing value for that variable? (remove row) 
- Should you remove the variable altogether? (remove column) 
- Is 'missing' equivalent to 0 (or some other constant value) in the specific case of this variable?
- Should you replace the missing values with a value it is most likely to represent, like mean/median/mode? 
- Document your takeaways.
- Explore data types and adapt types or data values as needed to have a numeric representation of each attribute.
- Run prepare.py. 

In [18]:
df = prepare.telco_data_prep()
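
The internals of `prepare.telco_data_prep()` are not shown here, so the following is only a sketch of the kind of cleaning the checklist above describes, not the project's actual `prepare.py`. It assumes the charge columns arrive as currency strings (as in the `df.head()` output) and that a missing `total_charges` value for a customer with `tenure` of 0 means nothing has been billed yet. The helper name `clean_charges` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows shaped like the raw data; the values are illustrative only.
raw = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "monthly_charges": ["$59.00", "$67.85", "$50.30"],
    "total_charges": ["$1,180.00", None, "$50.30"],
    "tenure": [20, 0, 1],
    "churn": ["No", "No", "Yes"],
})


def clean_charges(df):
    """Return a copy with currency strings parsed and churn encoded as 0/1."""
    out = df.copy()
    for col in ["monthly_charges", "total_charges"]:
        # Strip "$" and thousands separators, then coerce; bad values become NaN.
        stripped = out[col].str.replace(r"[$,]", "", regex=True)
        out[col] = pd.to_numeric(stripped, errors="coerce")
    # Assumption: a missing total for a tenure-0 customer means nothing billed yet.
    new_customer = out["total_charges"].isna() & (out["tenure"] == 0)
    out.loc[new_customer, "total_charges"] = 0.0
    # Numeric target column for modeling.
    out["has_churned"] = (out["churn"] == "Yes").astype(int)
    return out


clean = clean_charges(raw)
```

Whether zero is the right fill depends on what the missing rows turn out to be; checking `df[df.total_charges.isna()].tenure` first would confirm or refute the assumption.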

# 4. Exploration 

Answer the key questions, your hypotheses, and figure out the drivers of churn. You are required to run at least 2 statistical tests in your data exploration. Make sure you document your hypotheses and set your alpha before running the tests and document your findings well.
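
One way to follow the "set alpha first" rule is to write the hypotheses and alpha as the first lines of the test cell. The sketch below uses Welch's t-test from `scipy.stats` on synthetic monthly charges; the group means and sizes are invented for illustration and say nothing about the real Telco data. It does not control for services; to do that, subset to customers with the same service combination before testing.

```python
import numpy as np
from scipy import stats

alpha = 0.05  # chosen before looking at any test result

# H0: mean monthly_charges is the same for churned and retained customers.
# Ha: the two means differ.
rng = np.random.default_rng(0)
churned = rng.normal(loc=75, scale=10, size=200)   # synthetic, illustrative only
retained = rng.normal(loc=65, scale=10, size=200)  # synthetic, illustrative only

# Welch's t-test does not assume equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(churned, retained, equal_var=False)

reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, reject H0: {reject_null}")
```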

- If a group is identified by tenure, is there a cohort (or cohorts) with a higher rate of churn than the others? Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned / total customers).
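
The churn rate per tenure value described above can be sketched with a pandas `groupby`. The example assumes a numeric `has_churned` column (1 for churned, 0 otherwise) as listed in the data dictionary; the rows are made up.

```python
import pandas as pd

# Made-up rows; has_churned is assumed to be 1 for churned, 0 otherwise.
demo = pd.DataFrame({
    "tenure":      [1, 1, 1, 2, 2, 12, 12, 12, 12],
    "has_churned": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is (customers churned / total customers) per tenure.
churn_rate = demo.groupby("tenure")["has_churned"].mean()
print(churn_rate)
```

Calling `churn_rate.plot()` then draws the line chart with tenure on the x-axis and churn rate on the y-axis.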

- Are there features that indicate a higher propensity to churn? For example: type of internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.
- Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
- If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?
- Controlling for services (phone_id, internet_service_type_id, online_security_backup, device_protection, tech_support, and contract_type_id), is the mean monthly_charges of those who have churned significantly different from that of those who have not churned? (Use a t-test to answer this.)
- How much of monthly_charges can be explained by internet_service_type?

Hint: use a correlation test. State your hypotheses and your conclusion clearly.
- How much of monthly_charges can be explained by internet_service_type + phone_service_type (0, 1, or multiple lines)? State your hypotheses and your conclusion clearly.
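
Because internet service type is categorical, one way to put a number on "explained" is the share of variance in `monthly_charges` that lies between group means (eta squared, which equals R² for a regression on the category alone). This is a swapped-in illustration rather than the correlation test named in the hint. The helper `explained_share`, the category labels, and the prices are hypothetical.

```python
import pandas as pd

# Illustrative rows only; labels and prices are not taken from the real data.
demo = pd.DataFrame({
    "internet_type": ["No Internet", "No Internet", "DSL", "DSL",
                      "Fiber Optic", "Fiber Optic"],
    "monthly_charges": [20.0, 22.0, 50.0, 54.0, 88.0, 90.0],
})


def explained_share(df, group_cols, target_col):
    """Share of the target's variance explained by group membership (eta squared)."""
    grand_mean = df[target_col].mean()
    # Replace each row's value with the mean of its group.
    group_means = df.groupby(group_cols)[target_col].transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    ss_total = ((df[target_col] - grand_mean) ** 2).sum()
    return ss_between / ss_total


share = explained_share(demo, "internet_type", "monthly_charges")
print(f"{share:.1%} of the variance lies between internet types")
```

Passing a list of columns, such as `["internet_type", "phone_type"]`, treats every combination of the two as its own group, which is one way to approach the combined question.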
- Create visualizations exploring the interactions of variables (independent with independent and independent with dependent). The goal is to identify features that are related to churn, identify any data integrity issues, understand 'how the data works'. 

For example: We may find that all who have online services also have device protection. In that case, we don't need both of those. 
The visualizations done in your analysis for the questions above work toward answering the target question below.
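
A quick check for the kind of redundancy described in the example above is a cross-tabulation of the two columns. The column names `online_security` and `device_protection` and the rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows in which the two features always agree.
demo = pd.DataFrame({
    "online_security":   [True, True, False, False, True],
    "device_protection": [True, True, False, False, True],
})

# Counts for every combination of the two features.
table = pd.crosstab(demo["online_security"], demo["device_protection"])
print(table)

# If each value of the first feature co-occurs with only one value of the
# second, the second adds no information once the first is known.
is_redundant = bool((table > 0).sum(axis=1).le(1).all())
```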
- What can you say about each variable's relationship to churn, based on your initial exploration? If there appears to be some sort of interaction or correlation, assume there is no causal relationship and brainstorm (and document) ideas on reasons there could be correlation.

Summarize your conclusions, provide clear answers to the specific questions, and summarize any takeaways/action plan from the work above.


# 5. Modeling

Establish a baseline accuracy to determine whether having a model is better than having no model, then train and compare at least 3 different models. Document these steps well.
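
A common baseline is to predict the most frequent class for every customer; its accuracy is the share of customers in that class. A minimal sketch with made-up labels, assuming a 0/1 `has_churned` target:

```python
import pandas as pd

# Made-up target values: 1 = churned, 0 = retained.
y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="has_churned")

# Predict the most common class for everyone.
baseline_prediction = y.mode()[0]
baseline_accuracy = (y == baseline_prediction).mean()
print(f"baseline predicts {baseline_prediction}, accuracy {baseline_accuracy:.0%}")
```

A model is only worth keeping if it beats this number on data it was not trained on.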

- Feature Selection: Are there any variables that seem to provide limited to no additional information? If so, remove them.
- Train (fit, transform, evaluate) multiple different models, varying the model type and hyperparameters.
- Compare evaluation metrics across all the models, and select the ones you want to test using your validate dataframe.
- Based on your evaluation of the models on the train and validate datasets, choose the best model to try on your test data.
- Test the final model (transform, evaluate) on your out-of-sample data (the testing data set). Summarize the performance. Interpret your results.
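
The steps above can be sketched end to end with the three classifiers imported earlier. The data here is synthetic and the hyperparameters and 60/20/20 split are arbitrary choices made for illustration, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and a 0/1 target; illustrative only.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 60/20/20 split: hold out test first, then carve validate out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=5, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Fit on train only; score on train and validate to spot overfitting.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": model.score(X_train, y_train),
        "validate": model.score(X_val, y_val),
    }

# Pick the model with the best validate accuracy, then touch test exactly once.
best_name = max(scores, key=lambda n: scores[n]["validate"])
test_accuracy = models[best_name].score(X_test, y_test)
print(best_name, scores[best_name], f"test accuracy: {test_accuracy:.2f}")
```

Accuracy is used here because the baseline is stated as an accuracy; `classification_report` and `confusion_matrix` give precision and recall when missed churners matter more than overall accuracy.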

# 6. Delivery

Draw Conclusions
- Summarize your findings

- Key takeaways and next steps