<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTu8IayzMrmGKgbnT0hGkk6k7FhiK1ICbTNUA&usqp=CAU" />

In [5]:
import os
from acquire import test, grab_telco, prep_t, telco_test, split
from env import username, password, host
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

import warnings
warnings.filterwarnings('ignore')

import sklearn.metrics as mtc
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

<div class="alert alert-block alert-success">
    
# Business Goals
## - Find drivers for customer churn at Telco. Why are customers churning?
## - Construct a Machine Learning classification model that accurately predicts customer churn.
## - Deliver a report that a non-data scientist can read to understand what steps were taken, why they were taken, and what the outcome was.

<div class="alert alert-block alert-info">
    
# Initial Questions:
### Are certain groups (age, gender, etc.) of customers unsatisfied with our service?

### Do we offer products/services (internet types) that do not meet expectations?

### Are we charging too much?

### Do we lack customer support?

### Do we need to put more focus on the types of contracts we offer for customers?

# Acquire and view the data
>We will grab this data from the telco.csv file

In [6]:
telco = grab_telco()
telco.head()

Unnamed: 0.1,Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,...,device_protection,tech_support,streaming_tv,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn
0,0,0002-ORFBO,Female,0,Yes,Yes,9,Yes,No,1,...,No,Yes,Yes,No,2,Yes,2,65.6,593.3,No
1,1,0003-MKNFE,Male,0,No,No,9,Yes,Yes,1,...,No,No,No,Yes,1,No,2,59.9,542.4,No
2,2,0004-TLHLJ,Male,0,No,No,4,Yes,No,2,...,Yes,No,No,No,1,Yes,1,73.9,280.85,Yes
3,3,0011-IGKFF,Male,1,Yes,No,13,Yes,No,2,...,Yes,No,Yes,Yes,1,Yes,1,98.0,1237.85,Yes
4,4,0013-EXCHZ,Female,1,Yes,No,3,Yes,No,2,...,No,Yes,Yes,No,1,Yes,2,83.9,267.4,Yes
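
The `Unnamed: 0` style columns in the output above are typical of a CSV that was written with its row index and then read back as ordinary data. How `grab_telco` reads the file is not shown here, so the sketch below only illustrates the general pandas behavior on a made-up two-row frame.

```python
import io
import pandas as pd

# Made-up toy frame; only the column names echo the Telco data.
df = pd.DataFrame({"customer_id": ["0002-ORFBO", "0003-MKNFE"], "tenure": [9, 9]})

# Default to_csv() writes the row index as a header-less first column,
# which read_csv() then names "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False keeps only the real columns.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Saving with `index=False`, or reading with `index_col=0`, are the usual ways to avoid carrying these columns into the analysis.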


# Look at the column (feature) dtypes and check for missing values (nulls)

In [7]:
telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                7043 non-null   int64  
 1   customer_id               7043 non-null   object 
 2   gender                    7043 non-null   object 
 3   senior_citizen            7043 non-null   int64  
 4   partner                   7043 non-null   object 
 5   dependents                7043 non-null   object 
 6   tenure                    7043 non-null   int64  
 7   phone_service             7043 non-null   object 
 8   multiple_lines            7043 non-null   object 
 9   internet_service_type_id  7043 non-null   int64  
 10  online_security           7043 non-null   object 
 11  online_backup             7043 non-null   object 
 12  device_protection         7043 non-null   object 
 13  tech_support              7043 non-null   object 
 14  streamin
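
The non-null counts from `info()` only catch true missing values. As a hedged extra check, `object` columns can also be scanned for whitespace-only strings, which pandas does not treat as null. The toy frame and its values below are made up for illustration and say nothing about what the Telco file actually contains.

```python
import numpy as np
import pandas as pd

# Made-up example: one entry is a blank string rather than NaN.
toy = pd.DataFrame({
    "tenure": [9, 0, 4],
    "total_charges": ["593.3", " ", "280.85"],
})

# Plain null count: the blank string is not reported as missing.
null_counts = toy.isnull().sum()

# Convert whitespace-only strings to NaN, then count again.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
blank_aware_counts = cleaned.isnull().sum()

print(null_counts)
print(blank_aware_counts)
```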

---

# Prep the data
>**Create dummy variables for modeling purposes<br>
(1: 'Yes', 0: 'No')**

In [8]:
telco = prep_t(telco)

In [9]:
telco.head(3)

Unnamed: 0,customer_id,senior_citizen,tenure,internet_service_type_id,contract_type_id,payment_type_id,monthly_charges,total_charges,streaming_tv_Yes,streaming_movies_Yes,...,churn_Yes,gender_Male,partner_Yes,dependents_Yes,phone_service_Yes,online_backup_Yes,device_protection_Yes,tech_support_Yes,online_security_Yes,multiple_lines_Yes
0,0002-ORFBO,0,9,1,2,2,65.6,593.3,1,0,...,0,0,1,1,1,1,0,1,0,0
1,0003-MKNFE,0,9,1,1,2,59.9,542.4,0,1,...,0,1,0,0,1,0,0,0,0,1
2,0004-TLHLJ,0,4,2,1,1,73.9,280.85,0,0,...,1,1,0,0,1,0,1,0,0,0
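
Column names such as `churn_Yes` and `gender_Male` are consistent with one-hot encoding that drops the first level of each category. The body of `prep_t` is not shown, so the following is only a minimal sketch of that idea using `pd.get_dummies` on made-up rows.

```python
import pandas as pd

# Made-up rows; the column names mirror two of the Telco features.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "churn": ["No", "No", "Yes"],
})

# drop_first=True keeps a single 0/1 indicator per two-level column,
# so "churn" becomes "churn_Yes" (1: 'Yes', 0: 'No').
encoded = pd.get_dummies(toy, columns=["gender", "churn"], drop_first=True, dtype=int)

print(encoded)
```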


## Separate Customer ID and Total Charges from the data
>**Customer ID is not needed to run tests, but I'd still like to keep it.<br>
<br>
I do not believe Total Charges will aid in the prediction of churn, given how it is defined (the total amount a customer has been charged over their lifetime with Telco).**

In [None]:
customers = telco[['customer_id', 'total_charges']]
customers.head()

## ... And then make sure the dataset reflects that change

In [None]:
telco = telco.loc[:, telco.columns != 'total_charges']
telco.head()

### Now all features are numerical values

## Additionally, I want to make sure I don't forget what each value represents for the categorical data:

### Internet Service Type:
> 1. DSL
2. Fiber Optic
3. None

### Payment Type: 
> 1. Electronic check
2. Mailed check
3. Bank Transfer
4. CC

### Contract Type: 
> 1. Month-to-Month
2. One-year
3. Two-year

---

# I ran statistical tests on the whole dataset to give initial direction

In [None]:
telco_test(telco)
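
The internals of `telco_test` are not shown. For a pair of categorical columns, one common choice is a chi-square test of independence, sketched below with `scipy.stats.chi2_contingency`. The counts are made up and the choice of test is an assumption, not a description of what `telco_test` runs.

```python
import pandas as pd
from scipy import stats

# Made-up data: 60 customers without tech support, 40 with it.
toy = pd.DataFrame({
    "tech_support_Yes": [0] * 60 + [1] * 40,
    "churn_Yes": [1] * 30 + [0] * 30 + [1] * 5 + [0] * 35,
})

# Contingency table of observed counts (rows: tech support, columns: churn).
observed = pd.crosstab(toy["tech_support_Yes"], toy["churn_Yes"])

# Null hypothesis: churn is independent of tech support.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value suggests the two columns are related, but it does not say anything about why.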

<div class="alert alert-block alert-success">
    
## For now, I want to focus on:
**Internet Service Type:**<br>
> Fiber Optic represents almost half of customers, yet those customers churn at over twice the rate of customers with other internet types.<br>

**Online Backup:**<br>
> Customers without online backup are twice as numerous as those with it, and they churn 10% more.

**Device Protection:**<br>
> 2/3 of customers don't have device protection, and they churn more often than those who do. It could be beneficial to target them.

**Tech Support:**<br>
> Almost 3/4 of customers don't use Tech Support, and they churn twice as often as those who do.

**Online Security:**<br>
> Over 70% of customers do not have Online Security, and they churn at twice the rate of those who do.

**Payment Type:**
> Electronic Check customers churn at almost 2.5x the rate of any other payment method and represent a third of all customers.
    
**Age:**<br>
> Senior Citizens are twice as likely to churn as their younger counterparts.

**Tenure:**<br>
> The longer a customer stays with Telco, the less likely they are to churn.



### I chose these features based on the statistical relationship coupled with the proportion of each feature's 'Yes' and 'No' values
### I decided against using the rest of the features due to either lack of relationship or other factors that cannot be controlled
> **Example**: Dependents. Customers with no dependents churn at twice the rate of those with dependents, but that does not indicate the reason they do so.

# Focus on those features

In [None]:
telco = telco[['customer_id', 
               'internet_service_type_id',
               'payment_type_id',
               'tenure',
               'paperless_billing_Yes',
               'churn_Yes',
               'phone_service_Yes',
               'online_backup_Yes',
               'tech_support_Yes',
               'online_security_Yes']]
telco.head(3)

---

# Baseline
>**Establish the baseline rate of Churn**

In [None]:
telco.churn_Yes.mean()
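
The mean of a 0/1 churn column is the churn rate. The accuracy a model has to beat is the accuracy of always predicting the majority class, which is the larger of the churn rate and its complement. A minimal sketch with made-up labels:

```python
import pandas as pd

# Made-up labels: 1 = churned, 0 = stayed.
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Share of customers who churned.
churn_rate = churn.mean()

# Accuracy of always predicting the more common class.
baseline_accuracy = max(churn_rate, 1 - churn_rate)

print(churn_rate, baseline_accuracy)
```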

---

# Understand how the data is shaped to make sure it is properly split into Train, Validate, and Test

In [None]:
telco.shape

---

# Split the data
>**Train, Validate, and Test**

In [None]:
train, validate, test = split(telco)
train.shape, validate.shape, test.shape
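
The `split` helper lives in the `acquire` module and is not shown. One common way to build such a helper is to call scikit-learn's `train_test_split` twice and stratify on the target so each subset keeps a similar churn rate. The function name `three_way_split`, the proportions, and the seed below are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(df, target, seed=123):
    """Hypothetical helper: split df into train, validate, and test frames."""
    # First hold out 20% of the rows as the test set.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Then split the remainder into train (70%) and validate (30%).
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test

# Made-up frame with a 30% positive rate.
toy = pd.DataFrame({"tenure": range(100), "churn_Yes": [0] * 70 + [1] * 30})
train, validate, test = three_way_split(toy, "churn_Yes")

print(train.shape, validate.shape, test.shape)
```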

In [None]:
train.head(3)

# Set the X,Y Train
>**Dropping churn from x train<br>
Additionally, Customer ID needs to be separated before models are made**

In [None]:
x_train = train.drop(columns=['churn_Yes', 'customer_id'])
y_train = train.churn_Yes
train_id = train.customer_id

x_validate = validate.drop(columns=['churn_Yes', 'customer_id'])
y_validate = validate.churn_Yes
validate_id = validate.customer_id

x_test = test.drop(columns=['churn_Yes', 'customer_id'])
y_test = test.churn_Yes
test_id = test.customer_id

---

# Best 3 Models Discovered

>### Decision Tree (max_depth=3)<br>
>### KNN (nearest 10)<br>
>### Logistic Regression

# Decision Tree
### I used a low depth to avoid overfitting

In [None]:
tree = DecisionTreeClassifier(max_depth=3)

In [None]:
tree = tree.fit(x_train, y_train)
y_predict = tree.predict(x_train)
y_pred_prob = tree.predict_proba(x_train)

In [None]:
cm = pd.DataFrame(confusion_matrix(y_train, y_predict))
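
Precision and recall are defined per class, and `classification_report` prints one row for each. When quoting a single figure it helps to say which class it refers to. The sketch below, on made-up labels where 1 stands for churn, derives both from the confusion matrix and checks them against scikit-learn's scorers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = churn, 0 = no churn.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics for the churn class (1).
churn_precision = tp / (tp + fp)
churn_recall = tp / (tp + fn)

# Metrics for the no-churn class (0).
stay_precision = tn / (tn + fn)
stay_recall = tn / (tn + fp)

# The same numbers via scikit-learn, choosing the class with pos_label.
print(churn_precision, precision_score(y_true, y_pred, pos_label=1))
print(stay_recall, recall_score(y_true, y_pred, pos_label=0))
```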

## Train
**Model Score**: 79.67% <br>
**Precision**: 82%<br>
**Recall**: 93% <br>
**F1 Score**: 87%

In [None]:
print(classification_report(y_train, y_predict))

## Validate
**Model Score**: 77.50%<br>
**Precision**: 80%<br>
**Recall**: 93%<br>
**F1 Score**: 86%

In [None]:
y_pred_val = tree.predict(x_validate)
print(classification_report(y_validate, y_pred_val))
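
Comparing train and validate scores is the usual way to check whether a given depth overfits. The sketch below illustrates the effect on synthetic data from `make_classification` (not the Telco data): an unrestricted tree fits the training set perfectly, and the gap to its validate score shows how much of that is memorization.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for depth in (3, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_val, y_val))

for depth, (train_acc, val_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, validate={val_acc:.3f}")
```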

---

# Logistic Regression

In [None]:
logit = LogisticRegression(random_state=248)

In [None]:
logit.fit(x_train, y_train)

In [None]:
y_pred_lr = logit.predict(x_train)

In [None]:
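# Use x_train here, as with the other models, so the test set stays untouched until the final model is chosen
y_pred_prob_lr = logit.predict_proba(x_train)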
y_pred_prob_lr = pd.DataFrame(y_pred_prob_lr, columns=['0: NotChurn', '1: Churn'])
y_pred_prob_lr.head(10)
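
The probabilities from `predict_proba` can also be turned into labels with a cutoff other than the default 0.5. Lowering the cutoff flags more customers as likely churners, which can only keep or raise recall for the churn class, usually at the cost of precision. This is a sketch on synthetic data; the 0.3 cutoff is an arbitrary assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, imbalanced stand-in data (class 1 plays the role of churn).
X, y = make_classification(
    n_samples=400, n_features=5, weights=[0.73, 0.27], random_state=2
)
model = LogisticRegression(random_state=248).fit(X, y)

# Column 1 holds the probability of class 1.
churn_prob = model.predict_proba(X)[:, 1]

recall_by_threshold = {}
for threshold in (0.5, 0.3):
    predicted = (churn_prob >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y, predicted)

print(recall_by_threshold)
```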

## Train

**Model/Accuracy Score**: 78%<br>
**Precision**: 81%<br>
**Recall**: 91%<br>
**F1 Score**: 86%

In [None]:
print(classification_report(y_train, y_pred_lr))

## Validate

**Model/Accuracy Score**: 75%<br>
**Precision**: 79%<br>
**Recall**: 90%<br>
**F1 Score**: 84%

In [None]:
y_pred_lr_val = logit.predict(x_validate)

In [None]:
print(classification_report(y_validate, y_pred_lr_val))

---

# KNN (nearest 10)

In [None]:
knn = KNeighborsClassifier(n_neighbors=10, weights='uniform')
knn = knn.fit(x_train, y_train)
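
A value such as `n_neighbors=10` is typically picked by trying several candidates and comparing validate scores. The loop below sketches that search on synthetic data. The candidate values are assumptions, and the result says nothing about which `k` is best for the Telco data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Fit one model per candidate k and record its validate accuracy.
val_scores = {}
for k in (1, 5, 10, 20):
    candidate = KNeighborsClassifier(n_neighbors=k, weights="uniform").fit(X_tr, y_tr)
    val_scores[k] = candidate.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)
print(val_scores, best_k)
```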

In [None]:
y_predict_knn = knn.predict(x_train)

In [None]:
y_pred_prob_knn = knn.predict_proba(x_train)

## Train

**Model/Accuracy Score**: 81%<br>
**Precision**: 83%<br>
**Recall**: 93%<br>
**F1 Score**: 88%

In [None]:
print(classification_report(y_train, y_predict_knn))

## Validate

**Model/Accuracy Score**: 77%<br>
**Precision**: 80%<br>
**Recall**: 90%<br>
**F1 Score**: 85%

In [None]:
y_predict_knn_val = knn.predict(x_validate)

In [None]:
print(classification_report(y_validate, y_predict_knn_val))

# The Overall Best Model is... Decision Tree

### Test:
**Model Score**: 78.78%<br>
**Precision**: 81%<br>
**Recall**: 92%<br>
**F1 Score**: 86%

In [None]:
y_pred_test = tree.predict(x_test)
print(classification_report(y_test, y_pred_test))

<div class="alert alert-block alert-warning">

# Recommendations:
    
> **Focus efforts on offering discounted or free online security<br>
    Same for Tech Support**

# If I could do further analysis, I would explore tenure and age more closely

### Senior citizens are about 16% of customers but churn at twice the rate of their younger counterparts
### Look more closely at tenure to find when to target customers so they stay past the early months
### Understand why customers who opted into paperless billing churn at high rates

---