# Classification
## Motivating example
A telecom company wants to anticipate when a current customer will end their contract and switch to a competitor's services. This would allow for proactive measures, such as targeted ads and promotions, that increase the probability of retaining that customer.

This is an example of a binary classification problem, in which each data point is assigned one of two possible discrete classes. In our case, the customer base is split based on whether or not each customer is predicted to churn in the near future.

In [35]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

## Data Preparation

In [36]:
df = pd.read_csv('data.csv')

In [37]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [38]:
# standardize column names
df.columns = df.columns.str.lower().str.replace(" ", "_")

In [39]:
str_cols = df.dtypes[df.dtypes == 'object'].index

for col in str_cols:
	df[col] = df[col].str.lower().str.replace(" ", "_")

In [40]:
# convert total charges to numeric; entries that fail to parse become NaN and are imputed with 0
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce').fillna(0)
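The conversion above relies on `errors='coerce'` turning any entry that cannot be parsed as a number into `NaN`, which `fillna(0)` then replaces with zero. A minimal sketch on made-up values (the sample series and its `"_"` placeholder are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical sample: two numeric strings and one non-numeric placeholder
raw = pd.Series(["29.85", "1889.5", "_"])

# errors="coerce" maps anything that cannot be parsed as a number to NaN
parsed = pd.to_numeric(raw, errors="coerce")

# fillna(0) imputes those missing values with zero
total = parsed.fillna(0)
print(total.tolist())
```

Imputing with zero is a modelling choice; whether it is appropriate depends on what a missing total charge means in the data.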

In [41]:
df.churn = (df.churn == 'yes').astype(int)
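The target encoding above can be illustrated on a toy series: the comparison produces booleans, and `astype(int)` maps them to 1 and 0. The labels below are hypothetical and only show the mechanics.

```python
import pandas as pd

# Hypothetical labels, already lowercased as in the cleanup step
churn = pd.Series(["no", "no", "yes", "no", "yes"])

# The comparison gives booleans; astype(int) maps True -> 1 and False -> 0
target = (churn == "yes").astype(int)

# With a 0/1 target, the mean is the share of positive (churned) examples
churn_rate = target.mean()
print(target.tolist(), churn_rate)
```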

## Validation Framework

In [42]:
def data_split(df, test_size, val_size, random_state=42):
	# hold out the test set first
	df_full_train, df_test = train_test_split(df, test_size=test_size, random_state=random_state)
	# val_size is a fraction of the full data, so rescale it to the rows that remain
	df_train, df_val = train_test_split(df_full_train, test_size=val_size/(1-test_size), random_state=random_state)

	return df_train, df_val, df_test

In [43]:
df_train, df_val, df_test = data_split(df, test_size=.2, val_size=.2, random_state=1)

In [44]:
assert df_val.shape == df_test.shape
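The assertion above depends on how the second split is scaled. The validation set is carved out of what remains after the test set is removed, so `val_size` is divided by `1 - test_size`: with both set to 0.2, the second split takes 0.2 / 0.8 = 0.25 of the remaining rows, which is 20% of the full data. A small sketch on a synthetic 1000-row frame (the frame and its column `x` are assumptions for illustration; the function is copied here so the block runs on its own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def data_split(df, test_size, val_size, random_state=42):
    # First split: hold out the test set
    df_full_train, df_test = train_test_split(
        df, test_size=test_size, random_state=random_state
    )
    # Second split: rescale val_size to the rows left after the first split
    df_train, df_val = train_test_split(
        df_full_train,
        test_size=val_size / (1 - test_size),
        random_state=random_state,
    )
    return df_train, df_val, df_test


# Synthetic data, used only to check the proportions
toy = pd.DataFrame({"x": range(1000)})
toy_train, toy_val, toy_test = data_split(toy, test_size=0.2, val_size=0.2, random_state=1)

sizes = (len(toy_train), len(toy_val), len(toy_test))
print(sizes)
```

On row counts that are not evenly divisible, rounding inside `train_test_split` can make the two held-out sets differ by a row, so the equality check is worth keeping as an explicit assertion.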

In [45]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [46]:
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

df_train = df_train.drop(columns='churn')
df_val = df_val.drop(columns='churn')
df_test = df_test.drop(columns='churn')