# Business Understanding

- Why are you using machine learning rather than a simpler approach?
- What is it about the problem/data that is suitable for logistic regression? 
- Objective: 
  - Predict which customers will "churn" (stop using the service), using the data in our training set associated with each subscriber to SyriaTel's phone plan. Identifying these customers before they churn gives the business a chance to find ways to retain them.

# Data Understanding


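| Variable | Definition | Key/Notes |
| -------- | -------- | -------- |
| churn | Whether the customer has ceased doing business with SyriaTel | False = has not churned, True = has churned |
| state | US state | Categorical (two-letter code) that must be one-hot-encoded. NOT ordinal. |
| account length | How long the customer has held the account | A larger number indicates an older account |
| area code | Phone number area code | |
| phone number | Phone number | |
| international plan | Customer has an international plan | 'yes', 'no' (note: although categorical, this is binary, so it only needs a 0/1 mapping rather than one-hot encoding) |
| voice mail plan | Customer has a voice mail plan | 'yes', 'no' (see above) |
| number vmail messages | Number of voice mail messages | |
| total day minutes | Total minutes of daytime calls | |
| total day calls | Total number of daytime calls | |
| total day charge | Total charge for daytime calls | |
| total eve minutes | Total minutes of evening calls | |
| total eve calls | Total number of evening calls | |
| total eve charge | Total charge for evening calls | |
| total night minutes | Total minutes of night calls | |
| total night calls | Total number of night calls | |
| total night charge | Total charge for night calls | |
| total intl minutes | Total minutes of international calls | |
| total intl calls | Total number of international calls | |
| total intl charge | Total charge for international calls | |
| customer service calls | Number of calls to customer service | |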

# Get Data and Import Libraries: 

In [381]:
# Import Required Python Libraries:
import pandas as pd
import numpy as np
import math

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector as selector

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.feature_selection import SelectFromModel

from sklearn.metrics import confusion_matrix
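# Note: plot_roc_curve was removed in scikit-learn 1.2; on newer versions use RocCurveDisplay.from_estimator instead
from sklearn.metrics import plot_roc_curve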

from sklearn.impute import SimpleImputer

In [382]:
# Import Data:
df = pd.read_csv('./data.csv')

# Initial EDA
- Explore variables
- Nulls? 
- Categorical, binary, or numerical?

In [383]:
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [384]:
df.describe()

Unnamed: 0,account length,area code,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856
std,39.822106,42.37129,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0


In [385]:
df.value_counts('international plan')

international plan
no     3010
yes     323
dtype: int64

In [386]:
df.value_counts('voice mail plan')

voice mail plan
no     2411
yes     922
dtype: int64

In [387]:
df.value_counts('account length')

account length
105    43
87     42
101    40
93     40
90     39
       ..
199     1
191     1
188     1
175     1
243     1
Length: 212, dtype: int64

In [388]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

## Thoughts on Data (Consider the business problem when choosing features)
- Area codes (and, by association, phone numbers) do not match State (415 is not an area code in Kansas).
- "State" may be a useful geographical feature to consider, but many people live in a state that doesn't match their phone number's area code, so area code isn't a reliable indicator of location.
- There are no nulls.
- Numeric vs. categorical:
  - To decide, ask: "Is an increase of 2 in this variable twice as much as an increase of 1?" If not, treat the variable as categorical.
- Categorical variables (besides the target, churn):
  - state
- Boolean columns, which don't need to be one-hot-encoded, just converted from yes/no to 1/0:
  - international plan
  - voice mail plan
- Ordinal values: there are none.
- To drop:
  - Area code (the number is a label, so an increase of 1 is not a meaningful quantity)
  - Phone number (same reason: it is an identifier, not a quantity)
- Calls vs. minutes:
  - More calls doesn't necessarily mean more minutes, so we will keep both calls and minutes (they are not redundant).
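
One way to sanity-check the calls-versus-minutes point is to look at their correlation. The sketch below is only an illustration: `pearson_r` is a hypothetical helper, and the five rows are copied from the `df.head()` preview, which is far too few to draw a conclusion from. On the real data the same helper would be called on the full `df`.

```python
import pandas as pd

# Five rows copied from the df.head() preview (illustration only).
preview = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "total day calls": [110, 123, 114, 71, 113],
})

def pearson_r(frame, col_a, col_b):
    """Pearson correlation between two columns (hypothetical helper)."""
    return frame[col_a].corr(frame[col_b])

r = pearson_r(preview, "total day minutes", "total day calls")
print(f"minutes vs. calls correlation on the preview rows: {r:.2f}")
```

A correlation close to +1 or -1 on the full dataset would suggest the two columns carry overlapping information; a value well inside that range supports keeping both.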

# Data Cleaning

In [389]:
# Convert yes/no values to 1/0 (no -> 0, yes -> 1):
df['international plan'] = df['international plan'].map({'no': 0, 'yes': 1})
df['voice mail plan'] = df['voice mail plan'].map({'no': 0, 'yes': 1})

In [390]:
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,0,1,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,0,1,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,0,0,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,1,0,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,1,0,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [391]:
# Convert target variable from True/False to 1/0
# Prior convention: False = has not churned, True = has churned
# New convention: 0 = has not churned, 1 = has churned
df['churn'] = df['churn'].astype(int)

In [392]:
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,0,1,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,OH,107,415,371-7191,0,1,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,NJ,137,415,358-1921,0,0,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,OH,84,408,375-9999,1,0,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,OK,75,415,330-6626,1,0,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0


In [393]:
df.value_counts('state')

state
WV    106
MN     84
NY     83
AL     80
OH     78
WI     78
OR     78
WY     77
VA     77
CT     74
VT     73
MI     73
ID     73
UT     72
TX     72
IN     71
KS     70
MD     70
NJ     68
NC     68
MT     68
NV     66
CO     66
WA     66
MA     65
MS     65
RI     65
AZ     64
MO     63
FL     63
ME     62
NM     62
ND     62
NE     61
DE     61
OK     61
SC     60
SD     60
KY     59
IL     58
NH     56
AR     55
GA     54
DC     54
HI     53
TN     53
AK     52
LA     51
PA     45
IA     44
CA     34
dtype: int64

In [394]:
# The dataset is imbalanced: about 85.5% of customers have not churned (0) and about 14.5% have (1)
df.value_counts('churn')

churn
0    2850
1     483
dtype: int64
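
The class shares follow directly from the counts above. This small sketch just does the arithmetic; the majority share is also the accuracy that a model always predicting "not churned" would reach on this data, which is worth keeping in mind when reading accuracy scores later.

```python
# Class counts copied from the value_counts output above.
counts = {0: 2850, 1: 483}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_share = max(shares.values())

print(f"not churned: {shares[0]:.1%}, churned: {shares[1]:.1%}")
# prints "not churned: 85.5%, churned: 14.5%"
```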

# Train Test Split

In [395]:
# Create X (predictors) and y (target) variables:
# Here we'll drop area code and phone number while we're at it:
X = df.drop(['area code', 'phone number','churn'], axis = 1).reset_index(drop=True)
y = df.churn.reset_index(drop=True)

# Split Data into train and test:
X_train, x_test, y_train, y_test = train_test_split(X,y, random_state=666)

# Preview the first 5 rows:
X.head()

Unnamed: 0,state,account length,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
0,KS,128,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1
1,OH,107,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1
2,NJ,137,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0
3,OH,84,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,OK,75,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3
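
Because churners are a small minority, a plain random split can leave the train and test sets with slightly different churn rates. One option, which is not what the split above does, is to pass `stratify=y` so both parts keep the same class ratio. The sketch below uses toy arrays (`X_toy`, `y_toy` are stand-ins, not the notebook's data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y: 85 non-churners and 15 churners.
y_toy = np.array([0] * 85 + [1] * 15)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 85/15 class ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=666
)

print("train churn rate:", y_tr.mean())
print("test churn rate:", y_te.mean())
```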


In [396]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2499 entries, 1777 to 2284
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   2499 non-null   object 
 1   account length          2499 non-null   int64  
 2   international plan      2499 non-null   int64  
 3   voice mail plan         2499 non-null   int64  
 4   number vmail messages   2499 non-null   int64  
 5   total day minutes       2499 non-null   float64
 6   total day calls         2499 non-null   int64  
 7   total day charge        2499 non-null   float64
 8   total eve minutes       2499 non-null   float64
 9   total eve calls         2499 non-null   int64  
 10  total eve charge        2499 non-null   float64
 11  total night minutes     2499 non-null   float64
 12  total night calls       2499 non-null   int64  
 13  total night charge      2499 non-null   float64
 14  total intl minutes      2499 non-null

# Pipeline Set-Up

In [397]:
# Define predictor features as list of numerical and list of categorical:

# List of numerical features:
numfeat = ['account length', 'number vmail messages', 'total day minutes', 'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls']

# List of categorical features:
catfeat = ['state']

# List of binary features:
# Interestingly, a boolean feature doesn't have to be "bool" type
# binfeat = ['international plan', 'voice mail plan']

In [398]:
# Pipeline for numerics:


# scale

numpipe = Pipeline([
    ('ss', StandardScaler())
])


In [399]:
# Pipeline for categoricals:


# one-hot-encoder

catpipe = Pipeline([
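    # handle_unknown='ignore' encodes categories unseen during fit as all zeros;
    # we use it in lieu of stratify during the train-test split.
    # Note: `sparse` was renamed `sparse_output` in scikit-learn 1.2.
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))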
])
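
To make the effect of `handle_unknown='ignore'` concrete, here is a minimal sketch on made-up frames (`train_states` and `test_states` are illustrative, not the notebook's data). A state that never appeared during `fit` is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "CA" appears only at transform time.
train_states = pd.DataFrame({"state": ["KS", "OH", "NJ"]})
test_states = pd.DataFrame({"state": ["OH", "CA"]})

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_states)

# The default output is a sparse matrix; convert it to a dense array for display.
encoded = ohe.transform(test_states).toarray()

print(list(ohe.categories_[0]))  # learned categories, sorted
print(encoded)                   # the "CA" row is all zeros
```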

In [400]:
# Pipeline for binaries:

# impute nulls

# binpipe = Pipeline([
#    ('binimp', SimpleImputer(strategy='most_frequent'))
#])

## We now have our numeric and categorical pipelines.
- Since we don't have any nulls, and binary features don't need to be transformed or one-hot-encoded, our binary features don't need a pipeline object.
- The next step is to handle all columns together with `ColumnTransformer`.

In [401]:
# We pass our pipeline objects as the transformer arguments to ColumnTransformer.
# A Pipeline whose steps are all transformers exposes fit/transform itself,
# so it can be used wherever a single transformer is expected.
# The binary feature columns need no transformation, so remainder='passthrough'
# lets them through unchanged.

# Each transformer is a (name, transformer, columns) tuple.

ColTrans = ColumnTransformer(transformers=[
    ('numerics', numpipe, numfeat),
    ('categoricals', catpipe, catfeat)
], remainder='passthrough')
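
The sketch below shows what `remainder='passthrough'` does to the output layout, using a three-column toy frame (`toy` is made up for illustration, and `sparse_threshold=0` is set only to force a dense array that is easy to inspect). Transformed columns come first, in the order the transformers are listed, and the untouched columns are appended at the end.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric, one categorical and one binary column.
toy = pd.DataFrame({
    "state": ["KS", "OH", "KS", "NJ"],
    "total day minutes": [100.0, 200.0, 300.0, 400.0],
    "international plan": [0, 1, 0, 1],
})

ct = ColumnTransformer(
    transformers=[
        ("numerics", StandardScaler(), ["total day minutes"]),
        ("categoricals", OneHotEncoder(handle_unknown="ignore"), ["state"]),
    ],
    remainder="passthrough",  # binary column passes through unchanged
    sparse_threshold=0,       # always return a dense array
)

out = ct.fit_transform(toy)
print(out.shape)   # 1 scaled column + 3 one-hot columns + 1 passthrough column
print(out[:, -1])  # the last column is the untouched binary feature
```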

# Dummy Model
- For our dummy model, the only columns we exclude are "area code" and "phone number", because we've determined they carry no reliable geographical signal or other information that would correlate with churn.

In [402]:
# Create Dummy Model Pipeline

dumpipe = Pipeline([
    ('ct', ColTrans),
    ('dummy', DummyClassifier(strategy='most_frequent'))
])
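
A `most_frequent` dummy ignores the features entirely and always predicts the majority class. The toy sketch below (`X_toy` and `y_toy` are made up) shows why its accuracy looks respectable on imbalanced labels even though it never identifies a single churner.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Toy labels: 17 non-churners and 3 churners.
y_toy = np.array([0] * 17 + [1] * 3)
X_toy = np.zeros((20, 1))  # features are ignored by this strategy

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
preds = dummy.predict(X_toy)

print("accuracy:", dummy.score(X_toy, y_toy))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_toy, preds))
```

The second column of the confusion matrix is all zeros: every churner is missed, which is the baseline any real model has to improve on.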

In [403]:
# Fit the dummy classifier to the training data:

dumpipe.fit(X_train, y_train)

Pipeline(steps=[('ct',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numerics',
                                                  Pipeline(steps=[('ss',
                                                                   StandardScaler())]),
                                                  ['account length',
                                                   'number vmail messages',
                                                   'total day minutes',
                                                   'total day calls',
                                                   'total day charge',
                                                   'total eve minutes',
                                                   'total eve calls',
                                                   'total eve charge',
                                                   'total night minutes',
                                                 