# Telco Churn Classification Project

---

<img src="https://23x6xj3o92m9361dbu2ij362-wpengine.netdna-ssl.com/wp-content/uploads/2019/02/thumbnail-3ac78ae8dc4ab8e78d3937d8a6b35326-1200x370.jpeg" alt="Telco Churn" title="Telco Churn Classification Project" width="500" height="200" />

---

## Imports

---

In [1]:
# imports

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# standard imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# sklearn imports
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score # might never use this one, if so, remove

# models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# others
import graphviz
from graphviz import Graph

# Custom module imports
import acquire as a
import prepare as p
import explore as e

---

## Plan

---

- [x] Create README.md with a data dictionary and the project and business goals; come up with initial hypotheses.
- [x] Acquire data from the Codeup Database and create a function to automate this process. Save the function in an acquire.py file to import into the Final Report Notebook.
- [x] Clean and prepare data for the first iteration through the pipeline, MVP preparation. Create a function to automate the process, store the function in a prepare.py module, and prepare data in Final Report Notebook by importing and using the function.
- [x] Clearly define two hypotheses, set an alpha, run the statistical tests needed, reject or fail to reject the Null Hypothesis, and document findings and takeaways.
- [x] Establish a baseline accuracy and document well.
- [x] Train three different classification models.
- [x] Evaluate models on train and validate datasets.
- [x] Choose the model that performs the best and evaluate that single model on the test dataset.
- [x] Create csv file with the customer id, the probability of the target values, and the model's prediction for each observation in my test dataset.
- [x] Document conclusions, takeaways, and next steps in the Final Report Notebook.


---

## Executive Summary - Conclusion and Next Steps

---

- Add this later

---

## Acquire

---

In [17]:
# read raw data into a dataframe (df)
df = a.get_telco_data()

In [3]:
# let's see how many rows and columns we have
df.shape

(7043, 24)

In [4]:
# let's take a look at first 5 rows transposed for readability
df.head().T

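| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| payment_type_id | 2 | 2 | 1 | 1 | 3 |
| contract_type_id | 1 | 1 | 1 | 1 | 1 |
| internet_service_type_id | 3 | 3 | 3 | 3 | 3 |
| customer_id | 0030-FNXPP | 0031-PVLZI | 0098-BOWSO | 0107-WESLM | 0114-RSRRW |
| gender | Female | Female | Male | Male | Female |
| senior_citizen | 0 | 0 | 0 | 0 | 0 |
| partner | No | Yes | No | No | Yes |
| dependents | No | Yes | No | No | No |
| tenure | 3 | 4 | 27 | 1 | 10 |
| phone_service | Yes | Yes | Yes | Yes | Yes |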


In [5]:
# let's get some more info on our raw data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   payment_type_id           7043 non-null   int64  
 1   contract_type_id          7043 non-null   int64  
 2   internet_service_type_id  7043 non-null   int64  
 3   customer_id               7043 non-null   object 
 4   gender                    7043 non-null   object 
 5   senior_citizen            7043 non-null   int64  
 6   partner                   7043 non-null   object 
 7   dependents                7043 non-null   object 
 8   tenure                    7043 non-null   int64  
 9   phone_service             7043 non-null   object 
 10  multiple_lines            7043 non-null   object 
 11  online_security           7043 non-null   object 
 12  online_backup             7043 non-null   object 
 13  device_protection         7043 non-null   object 
 14  tech_sup

In [6]:
# let's check out some summary statistics for our numerical columns in the raw data
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| payment_type_id | 7043.0 | 2.315633 | 1.148907 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| contract_type_id | 7043.0 | 1.690473 | 0.833755 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| internet_service_type_id | 7043.0 | 1.872923 | 0.737796 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| senior_citizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| monthly_charges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |


In [18]:
# let's make sure we have info on phone and internet service before removing redundant info
df.internet_service_type.value_counts()

Fiber optic    3096
DSL            2421
None           1526
Name: internet_service_type, dtype: int64

In [19]:
df.phone_service.value_counts()

Yes    6361
No      682
Name: phone_service, dtype: int64

In [20]:
# let's take a look at an example of the redundancy for each service type
df.online_security.value_counts()

No                     3498
Yes                    2019
No internet service    1526
Name: online_security, dtype: int64

In [21]:
df.multiple_lines.value_counts()

No                  3390
Yes                 2971
No phone service     682
Name: multiple_lines, dtype: int64
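
One way to confirm that the `"No internet service"` category only repeats what `internet_service_type` already says is to cross-tabulate the two columns. The sketch below uses a small made-up frame named `toy` in place of `df` (the rows are illustrative, not taken from the Telco data); on the real data the equivalent call would be `pd.crosstab(df.internet_service_type, df.online_security)`.

```python
import pandas as pd

# Toy stand-in for df; values are illustrative only
toy = pd.DataFrame({
    "internet_service_type": ["Fiber optic", "DSL", "None", "None", "DSL"],
    "online_security": ["No", "Yes", "No internet service",
                        "No internet service", "No"],
})

# Rows: internet service type, columns: online_security category
xtab = pd.crosstab(toy.internet_service_type, toy.online_security)
print(xtab)

# The category is redundant if it appears exactly on the rows with no internet
no_internet = toy.internet_service_type == "None"
is_redundant = (no_internet == (toy.online_security == "No internet service")).all()
print(is_redundant)
```

If the check holds on the full data, the third category carries no information beyond `internet_service_type` and can be folded into `"No"` without losing anything.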

### Takeaways
- data was successfully imported with a SQL query on the first run and from a cached .csv copy of df on later runs
    - see `acquire.py` file in repo for code
- need to drop first four columns as they are database keys and don't add value
- need to change all binary categorical columns to a consistent format (some are being represented as integers while others with strings)
    - `senior_citizen` should be changed from integers of `0` and `1` to strings of `"No"` and `"Yes"` to be consistent with other binary categorical columns
    - Encoding of these categorical columns will take place later
- `total_charges` column is stored as object and will need to be converted to float
- some columns contain redundant information that would cause unnecessary column inflation when encoding takes place
    - `internet_service_type` already shows which customers have no internet, but 6 other columns also carry a `"No internet service"` category
        - need to reduce those 6 columns to 2 categories, `"Yes"` or `"No"`, by moving customers without internet into the `"No"` group
    - `phone_service` already shows which customers have no phone service, but `multiple_lines` also carries a `"No phone service"` category
        - need to reduce this column to 2 categories, `"Yes"` or `"No"`, by moving customers without phone service into the `"No"` group
- **I will wait to plot distributions of individual variables until I have made these changes so that the information is meaningful** 
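
The query-once, cache-afterwards pattern mentioned in the first takeaway can be sketched roughly as below. This is not the code in `acquire.py`: `get_cached_data`, `fake_query`, and the file name are hypothetical, and the stand-in query function replaces the real database call (which would typically wrap something like `pd.read_sql`) so the sketch runs without credentials.

```python
import os
import tempfile

import pandas as pd


def get_cached_data(csv_path, query_fn):
    """Hypothetical helper: read csv_path if it exists, else run query_fn and cache it."""
    if os.path.isfile(csv_path):
        # later runs: load the cached copy, restoring the saved index
        return pd.read_csv(csv_path, index_col=0)
    # first run: fetch the data and write the cache for next time
    df = query_fn()
    df.to_csv(csv_path)
    return df


# Stand-in for the real SQL query so the sketch runs without a database
def fake_query():
    return pd.DataFrame({"customer_id": ["A", "B"], "tenure": [3, 4]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "telco.csv")
    first = get_cached_data(path, fake_query)   # runs the query, writes the file
    second = get_cached_data(path, fake_query)  # reads the cached file
    print(second)
```

Reading the cache with `index_col=0` avoids picking up the saved index as an extra `Unnamed: 0` column.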

---

## Prepare

---

### Clean

---

In [22]:
# let's use the prepare.py module to implement findings above and clean data
df = p.clean_telco(df)

In [23]:
# let's take a look at cleaned data shape
df.shape

(7043, 20)

In [24]:
# let's take a look at cleaned data
df.head().T

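| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| gender | Female | Female | Male | Male | Female |
| senior_citizen | No | No | No | No | No |
| partner | No | Yes | No | No | Yes |
| dependents | No | Yes | No | No | No |
| tenure | 3 | 4 | 27 | 1 | 10 |
| phone_service | Yes | Yes | Yes | Yes | Yes |
| multiple_lines | No | No | No | No | No |
| online_security | No | No | No | No | No |
| device_protection | No | No | No | No | No |
| online_backup | No | No | No | No | No |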


In [10]:
# let's make sure data types are correct for modified columns
print(f'Data type for senior_citizen column is now: {df.senior_citizen.dtype}.')
print(f'Data type for total_charges column is now: {df.total_charges.dtype}.')

Data type for senior_citizen column is now: object.
Data type for total_charges column is now: float64.


In [25]:
# let's make sure redundancies were taken care of for each service type
df.online_security.value_counts()

No     5024
Yes    2019
Name: online_security, dtype: int64

In [26]:
df.multiple_lines.value_counts()

No     4072
Yes    2971
Name: multiple_lines, dtype: int64

#### Cleaning Takeaways
- data was cleaned successfully using imported prepare.py module (see module in repo for code)
    - unneeded columns were successfully dropped
    - `senior_citizen` column was successfully modified
    - `total_charges` column data type was successfully cast to float
        - There were 11 values of `" "` in this column, all on rows with a tenure of `0`, so these values were converted to `0`
    - 7 columns with redundancies were successfully modified according to observations and explanation above
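
For readers without the repo at hand, the three kinds of changes listed above might look roughly like this on a toy frame. This is a minimal sketch, not the actual `p.clean_telco`: the function name `clean_sketch`, the toy rows, and the choice of only two service columns are assumptions made for illustration.

```python
import pandas as pd


def clean_sketch(df):
    """Illustrative version of the cleaning steps; not the actual p.clean_telco."""
    df = df.copy()
    # 0/1 integers -> "No"/"Yes" strings, matching the other binary columns
    df["senior_citizen"] = df["senior_citizen"].map({0: "No", 1: "Yes"})
    # blank strings become "0", then the whole column is cast to float
    df["total_charges"] = (
        df["total_charges"].str.strip().replace("", "0").astype(float)
    )
    # fold the redundant third category into "No"
    service_cols = ["online_security", "multiple_lines"]
    df[service_cols] = df[service_cols].replace(
        {"No internet service": "No", "No phone service": "No"}
    )
    return df


# Toy rows, illustrative only
toy = pd.DataFrame({
    "senior_citizen": [0, 1, 0],
    "tenure": [0, 12, 5],
    "total_charges": [" ", "840.5", "99.75"],
    "online_security": ["No internet service", "Yes", "No"],
    "multiple_lines": ["No phone service", "No", "Yes"],
})
cleaned = clean_sketch(toy)
print(cleaned)
```

Working on a copy keeps the raw frame intact, so the before and after `value_counts()` comparisons shown earlier stay reproducible.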

---