# Telco Data

Working notebook for telco project

In [1]:
import acquire_telco
import prepare_telco

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd 
import math

from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt
import seaborn as sns

## Acquire

We will acquire the telco data from the Codeup SQL database using a function stored in acquire_telco.py.

In [12]:
# import acquire_telco.py
import acquire_telco

In [3]:
# acquire
df = acquire_telco.new_telco_data()
df.head()

Unnamed: 0,payment_type_id,internet_service_type_id,contract_type_id,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,...,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,2,1,3,0016-QLJIS,Female,0,Yes,Yes,65,Yes,...,Yes,Yes,Yes,Yes,90.45,5957.9,No,Two year,DSL,Mailed check
1,4,1,3,0017-DINOC,Male,0,No,No,54,No,...,Yes,Yes,No,No,45.2,2460.55,No,Two year,DSL,Credit card (automatic)
2,3,1,3,0019-GFNTW,Female,0,No,No,56,No,...,Yes,No,No,No,45.05,2560.1,No,Two year,DSL,Bank transfer (automatic)
3,4,1,3,0056-EPFBG,Male,0,Yes,Yes,20,No,...,Yes,No,No,Yes,39.4,825.4,No,Two year,DSL,Credit card (automatic)
4,3,1,3,0078-XZMHT,Male,0,Yes,No,72,Yes,...,Yes,Yes,Yes,Yes,85.15,6316.2,No,Two year,DSL,Bank transfer (automatic)


## Prepare

Here we will prepare the data using functions stored in prepare_telco.py. 
We are going to use this data to develop a model, so we need to make sure there are no incompatible datatypes and that only clean, useful columns remain. We will also split the data into train, validate, and test sets so that we can check the accuracy of our model on data it was not fit on. 

We will run the clean_split_telco_data function that runs both the wrangle_telco and the train_validate_test_split functions. 

The wrangle_telco function will do the following:
- removes duplicates
- removes whitespace
- drops the columns that don't seem useful: 'payment_type_id', 'internet_service_type_id', 'contract_type_id', 'customer_id', and 'gender'
- converts 'total_charges' from an object to a float
- converts the binary categorical variables to numeric: 'tenure', 'churn', 'partner', 'dependents', 'paperless_billing', 'phone_service', 'multiple_lines', 'online_security', 'streaming_movies', 'streaming_tv', 'online_backup', 'device_protection', 'tech_support', 'is_autopay'
- gets dummies from the non-binary object variables 
- concatenates the dummy dataframe to the original 
- drops the object columns we created dummies from 
- returns the cleaned df
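
The cleaning steps listed above can be sketched on a toy dataframe. This is only an illustration and not the actual code in prepare_telco.py: `wrangle_sketch`, the toy rows, and the choice to drop rows whose 'total_charges' is blank are all assumptions made for the example.

```python
import pandas as pd

def wrangle_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning steps; NOT the real wrangle_telco function."""
    df = df.drop_duplicates().copy()

    # total_charges arrives as text; blank strings become NaN and
    # (assumption) those rows are dropped
    df["total_charges"] = pd.to_numeric(
        df["total_charges"].str.strip(), errors="coerce"
    )
    df = df.dropna(subset=["total_charges"])

    # drop columns that are not used as features
    df = df.drop(columns=["customer_id", "gender"])

    # map Yes/No columns to 1/0
    for col in ["partner", "churn"]:
        df[col] = df[col].map({"Yes": 1, "No": 0})

    # one-hot encode the non-binary object column, then drop the original
    object_cols = ["contract_type"]
    dummies = pd.get_dummies(df[object_cols])
    return pd.concat([df.drop(columns=object_cols), dummies], axis=1)

# made-up rows, including one duplicate and one blank total_charges
toy = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3", "c3"],
    "gender": ["Female", "Male", "Male", "Male"],
    "partner": ["Yes", "No", "No", "No"],
    "churn": ["No", "Yes", "No", "No"],
    "total_charges": ["100.5", " ", "20.0", "20.0"],
    "contract_type": ["Two year", "Month-to-month", "One year", "One year"],
})
clean = wrangle_sketch(toy)
print(clean.columns.tolist())
```
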

The train_validate_test_split function will do the following: 

- takes in a dataframe (df) and returns 3 dfs (train, validate, and test) split 56%, 24%, 20% respectively 
- takes in a random seed for replicating results
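
One common way to get three sets is to call `train_test_split` twice: hold out 20% for test, then take 30% of the remainder for validate, which leaves roughly 56% for train and 24% for validate. The sketch below assumes that approach; `split_sketch`, the seed value, the stratification on 'churn', and the toy data are illustrative assumptions and may differ from what prepare_telco.py does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_sketch(df, target="churn", seed=123):
    """Two-stage split (assumed approach, not the real function)."""
    # stage 1: hold out 20% of all rows as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # stage 2: 30% of the remaining 80% becomes validate (24% overall)
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test

# made-up data: 100 rows with a balanced target
toy = pd.DataFrame({"tenure": range(100), "churn": [0, 1] * 50})
train_s, validate_s, test_s = split_sketch(toy)
print(len(train_s), len(validate_s), len(test_s))
```

On 100 rows this gives 56, 24, and 20 rows. Passing the same seed each time reproduces the same split.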

In [11]:
# import prepare_telco.py
import prepare_telco

In [13]:
# run functions from prepare_telco.py to prepare and split the data
train, validate, test = prepare_telco.clean_split_telco_data(df)

In [14]:
train.shape

(3937, 27)

In [7]:
validate.shape

(1688, 27)

In [8]:
test.shape

(1407, 27)

In [15]:
# verify the df we brought in is what we want
train.head()

Unnamed: 0,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,online_backup,device_protection,tech_support,...,contract_type_Month-to-month,contract_type_One year,contract_type_Two year,internet_service_type_DSL,internet_service_type_Fiber optic,internet_service_type_None,payment_type_Bank transfer (automatic),payment_type_Credit card (automatic),payment_type_Electronic check,payment_type_Mailed check
6096,0,1,0,70,1,0,0,0,0,0,...,0,0,1,0,0,1,1,0,0,0
1603,0,1,1,15,1,0,0,1,1,1,...,1,0,0,1,0,0,0,0,0,1
5350,1,1,0,52,1,1,1,1,1,0,...,1,0,0,0,1,0,0,0,1,0
2068,0,0,0,39,0,0,0,0,0,1,...,1,0,0,1,0,0,0,0,1,0
6366,0,1,0,32,1,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,1


In [16]:
# check that datatypes are all compatible for modeling
train.dtypes

senior_citizen                              int64
partner                                     int64
dependents                                  int64
tenure                                      int64
phone_service                               int64
multiple_lines                              int64
online_security                             int64
online_backup                               int64
device_protection                           int64
tech_support                                int64
streaming_tv                                int64
streaming_movies                            int64
paperless_billing                           int64
monthly_charges                           float64
total_charges                             float64
churn                                       int64
is_autopay                                   bool
contract_type_Month-to-month                uint8
contract_type_One year                      uint8
contract_type_Two year                      uint8


## Explore

Here we will explore the telco data to find the key drivers of customer churn. 

We will ask some initial questions and answer those questions through visuals, statistics, or both. 

Initial Questions:
    
    1. Are customers with a certain service type more or less likely to churn? 
        - Specifically, are customers with fiber more likely to churn? 
    2. What month are customers most likely to churn in?  
        - Does this depend on their contract/service type? 
    3. Do the customers that churn have a higher monthly cost than those that do not churn? 
    4. Do the customers that churn have more or fewer lines than those who don't? 


 1. Are customers with a certain service type more or less likely to churn? 
      - Specifically are customers with fiber more likely to churn? 
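
A first look at this question could compare churn rates between customers with and without fiber. The sketch below uses a small made-up dataframe so it runs on its own; the values are toy data and not results from this project. The column names follow the dummy columns shown in train.head() above.

```python
import pandas as pd

# made-up toy data, NOT project results
toy_churn = pd.DataFrame({
    "internet_service_type_Fiber optic": [1, 1, 1, 1, 0, 0, 0, 0],
    "churn": [1, 1, 1, 0, 0, 0, 1, 0],
})

# churn is coded 0/1, so the group mean is the share of customers who churned
churn_rate = toy_churn.groupby("internet_service_type_Fiber optic")["churn"].mean()

# contingency table of fiber vs. churn counts
counts = pd.crosstab(
    toy_churn["internet_service_type_Fiber optic"], toy_churn["churn"]
)
print(churn_rate)
print(counts)
```

With the real data, the same two lines could be run on train in place of the toy dataframe, and the contingency table could feed a statistical test.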

## Decision Tree

## Random Forest

## KNN