# Preprocessing for Churn Modeling

**Import dataset**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

telco = pd.read_csv('./datasets/Churn.csv')
print(telco.head())

   Account_Length  Vmail_Message  Day_Mins  Eve_Mins  Night_Mins  Intl_Mins  \
0             128             25     265.1     197.4       244.7       10.0   
1             107             26     161.6     195.5       254.4       13.7   
2             137              0     243.4     121.2       162.6       12.2   
3              84              0     299.4      61.9       196.9        6.6   
4              75              0     166.7     148.3       186.9       10.1   

   CustServ_Calls Churn Intl_Plan Vmail_Plan  ...  Day_Charge  Eve_Calls  \
0               1    no        no        yes  ...       45.07         99   
1               1    no        no        yes  ...       27.47        103   
2               0    no        no         no  ...       41.38        110   
3               2    no       yes         no  ...       50.90         88   
4               3    no       yes         no  ...       28.34        122   

   Eve_Charge  Night_Calls  Night_Charge  Intl_Calls  Intl_Charge  S

**Model Assumptions**
- Many models make assumptions about their input features, for example:
    - That features are approximately normally distributed
    - That features are on the same scale

**Data types**
- Most machine learning algorithms require numeric input
    - Categorical variables need to be encoded as numbers

**Standardization**
- Centers each feature so that its mean is zero
- Expresses each value as the number of standard deviations it lies from the mean
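
The two bullets above amount to the z-score formula, z = (x - mean) / std. The following is a minimal sketch of that calculation done by hand, using the five `Intl_Mins` values from the preview above. The variable names are illustrative, and the population standard deviation (`ddof=0`) is an assumption chosen for the example.

```python
import pandas as pd

# The five Intl_Mins values shown in the head() preview
intl_mins = pd.Series([10.0, 13.7, 12.2, 6.6, 10.1], name="Intl_Mins")

# Step 1: center the values on zero by subtracting the mean
centered = intl_mins - intl_mins.mean()

# Step 2: divide by the standard deviation so each value is expressed
# in units of "standard deviations away from the mean"
z_scores = centered / intl_mins.std(ddof=0)

print(z_scores.round(3))
```

After this transformation the values have a mean of zero and a standard deviation of one (measured with the same `ddof`).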

## Encoding Binary Features

First, we recast the binary features by mapping *'yes'* to *1* and *'no'* to *0*.

In [2]:
telco['Vmail_Plan'] = telco['Vmail_Plan'].replace({'no': 0, 'yes': 1})
telco['Churn'] = telco['Churn'].replace({'no': 0, 'yes': 1})
telco['Intl_Plan'] = telco['Intl_Plan'].replace({'no': 0, 'yes': 1})

print(telco['Vmail_Plan'].head())
print(telco['Churn'].head())

0    1
1    1
2    0
3    0
4    0
Name: Vmail_Plan, dtype: int64
0    0
1    0
2    0
3    0
4    0
Name: Churn, dtype: int64
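
One thing a dictionary-based recoding does not do by itself is flag labels that are missing from the dictionary (for example `'Yes'` with a capital letter). A possible variant, sketched here on a hypothetical toy frame rather than the real `telco` data, uses `Series.map`, which returns `NaN` for unmapped values and so makes them easy to detect.

```python
import pandas as pd

# Hypothetical toy data standing in for the yes/no columns of the dataset
toy = pd.DataFrame({
    "Churn": ["no", "yes", "no"],
    "Intl_Plan": ["yes", "no", "no"],
})

yes_no = {"no": 0, "yes": 1}

for col in ["Churn", "Intl_Plan"]:
    encoded = toy[col].map(yes_no)
    # map() yields NaN for any value that is not a key of the dictionary,
    # so unexpected labels are caught here instead of passing through
    if encoded.isna().any():
        raise ValueError(f"Unexpected values in column {col!r}")
    toy[col] = encoded.astype(int)

print(toy)
```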


## One-Hot Encoding

We use this technique to encode the `State` column: each distinct state becomes its own 0/1 indicator column.

In [3]:
telco_state = pd.get_dummies(telco['State'])

print(telco_state.head())

   AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  ...  SD  TN  TX  UT  VA  VT  WA  \
0   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   
1   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   
2   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   
3   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   
4   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   

   WI  WV  WY  
0   0   0   0  
1   0   0   0  
2   0   0   0  
3   0   0   0  
4   0   0   0  

[5 rows x 51 columns]
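
The indicator columns are only useful once they are attached to the rest of the features. A minimal sketch of that step is shown below on a hypothetical toy frame (the state values and the `State_` prefix are assumptions for illustration). Passing `dtype=int` keeps the 0/1 output seen above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the telco DataFrame
toy = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "OH"],
    "Churn": [0, 0, 0, 1],
})

# One indicator column per distinct state, stored as integers
state_dummies = pd.get_dummies(toy["State"], prefix="State", dtype=int)

# Attach the indicators and drop the original text column
toy_encoded = pd.concat([toy.drop(columns="State"), state_dummies], axis=1)

print(toy_encoded)
```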


## Feature Scaling
Here we standardize two features, `Intl_Calls` and `Night_Mins`, so that they are on the same scale.

In [4]:
telco_to_scale = telco[['Intl_Calls', 'Night_Mins']]
telco_scaled = StandardScaler().fit_transform(telco_to_scale)
telco_scaled_df = pd.DataFrame(telco_scaled, columns = ['Intl_Calls', 'Night_Mins'])

print(telco_scaled_df.describe())

         Intl_Calls    Night_Mins
count  3.333000e+03  3.333000e+03
mean  -8.527366e-18  7.887813e-17
std    1.000150e+00  1.000150e+00
min   -1.820289e+00 -3.513648e+00
25%   -6.011951e-01 -6.698545e-01
50%   -1.948306e-01  6.485803e-03
75%    6.178983e-01  6.808485e-01
max    6.307001e+00  3.839081e+00
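
The `std` row above reads 1.000150 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces the effect with plain pandas on the five `Night_Mins` values from the preview. It imitates the scaler's arithmetic rather than calling scikit-learn.

```python
import math
import pandas as pd

# Night_Mins values from the first five rows of the dataset preview
x = pd.Series([244.7, 254.4, 162.6, 196.9, 186.9])
n = len(x)

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

# Measured with the sample standard deviation (ddof=1), the result
# equals sqrt(n / (n - 1)) instead of exactly 1
print(z.std(ddof=1), math.sqrt(n / (n - 1)))

# The same factor for the 3333 rows reported in the count row
factor_full = math.sqrt(3333 / 3332)
print(round(factor_full, 6))
```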


## Feature Selection and Engineering

**Dropping unnecessary features**
- Unique identifiers
    - Phone numbers
    - Social security numbers
    - Account numbers

**Dropping correlated features**
- When two features are highly correlated, one of them can usually be dropped
- The second feature adds little information the model does not already have
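
One way to spot such features is to inspect the pairwise correlation matrix. The sketch below uses the first five rows of `Day_Mins`, `Day_Charge`, and `Night_Mins` from the preview above, where `Day_Charge` appears to be proportional to `Day_Mins`. The 0.95 cutoff is a hypothetical choice, and five rows are far too few for a real decision.

```python
import pandas as pd

# First five rows of three columns, copied from the dataset preview
toy = pd.DataFrame({
    "Day_Mins":   [265.1, 161.6, 243.4, 299.4, 166.7],
    "Day_Charge": [45.07, 27.47, 41.38, 50.90, 28.34],
    "Night_Mins": [244.7, 254.4, 162.6, 196.9, 186.9],
})

# Absolute Pearson correlation between every pair of columns
corr = toy.corr().abs()

threshold = 0.95  # hypothetical cutoff for "highly correlated"

# Collect each pair once (upper triangle only) if it exceeds the cutoff
pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(pairs)
```

In practice the same check would be run on the full `telco` DataFrame, and one column from each flagged pair would be a candidate for `drop`.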

**Feature engineering**
- Creating new features to help improve model performance
- Business and subject matter experts should be consulted when designing them

In [5]:
telco = telco.drop(['Area_Code', 'Phone'], axis = 1)
print(telco.columns)

Index(['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins',
       'Intl_Mins', 'CustServ_Calls', 'Churn', 'Intl_Plan', 'Vmail_Plan',
       'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge', 'Night_Calls',
       'Night_Charge', 'Intl_Calls', 'Intl_Charge', 'State'],
      dtype='object')


In [6]:
telco['Avg_Night_Calls'] = telco['Night_Mins'] / telco['Night_Calls']

print(telco['Avg_Night_Calls'].head())

0    2.689011
1    2.469903
2    1.563462
3    2.212360
4    1.544628
Name: Avg_Night_Calls, dtype: float64
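
Note that `Avg_Night_Calls` is the average number of minutes per night call. A ratio like this is undefined for customers with zero calls, and plain division would yield `inf` or `NaN` for them. The following sketch shows one possible guard on a hypothetical toy frame (the values are illustrative): zero counts are masked first, so the ratio is consistently `NaN` for those rows.

```python
import pandas as pd

# Hypothetical toy data, including one customer with no night calls
toy = pd.DataFrame({
    "Night_Mins":  [244.7, 254.4, 0.0],
    "Night_Calls": [91, 103, 0],
})

# where() keeps values that satisfy the condition and sets the rest to NaN
calls = toy["Night_Calls"].where(toy["Night_Calls"] > 0)

# Average minutes per night call; NaN when there were no calls
toy["Avg_Night_Calls"] = toy["Night_Mins"] / calls

print(toy)
```

How the resulting missing values are handled afterwards (imputing, dropping, or adding an indicator column) is a separate modeling decision.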
