# Predicting Customer Churn: SyriaTel Telecommunications

A predictive classification model by Chum Mapa, Adam Roth and Leana Critchell.

## Setting the Scene:

This project aims to provide SyriaTel with a model that predicts whether a customer will soon churn.  Current data shows a 15% churn rate among customers who have been with the company for up to 243 days.  We therefore hope to provide insight into the features that drive churn, to help SyriaTel make better-informed decisions about where to direct its retention budget.

## Aims:

This project aims to:
- Investigate labeled data on 3333 customers who have held accounts with the company for up to 243 days.
- Provide inferential statistics and visualisations based on this data.
- Create predictive, supervised learning models from the data to predict churn.

## Definitions:

- Churn:  a customer who closes their account with SyriaTel.  A prediction of `True` relates to a customer who will churn. 
- Predictive model:  

## Data:

This project utilises data from the [Churn in Telecom dataset](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset) from Kaggle.

The target variable in this dataset that we aimed to predict was identified as the `churn` column.  

The features of this dataset include locational information (`state` and `area_code`) as well as plan details such as call minutes, charges, customer service calls and whether the customer had an international plan and/or voice mail plan.  Our model iterations used subsets and aggregations of these features to determine which would best predict customer churn.  
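As a rough illustration of what "aggregations of these features" could look like, the sketch below builds two aggregate columns on a toy frame. The column names follow the underscore style used after cleaning later in this notebook; the aggregate names `total_charge` and `day_minutes_per_call` are hypothetical and are not necessarily the ones used in the model iterations.

```python
import pandas as pd

# Toy rows that mimic the dataset's charge and minutes columns (illustrative only).
toy = pd.DataFrame({
    "total_day_charge": [45.07, 27.47],
    "total_eve_charge": [16.78, 16.62],
    "total_night_charge": [11.01, 11.45],
    "total_intl_charge": [2.70, 3.70],
    "total_day_minutes": [265.1, 161.6],
    "total_day_calls": [110, 123],
})

# Select every per-period charge column by its suffix.
charge_cols = [c for c in toy.columns if c.endswith("_charge")]

# Hypothetical aggregate features; these names are assumptions for this sketch.
toy["total_charge"] = toy[charge_cols].sum(axis=1)
toy["day_minutes_per_call"] = toy["total_day_minutes"] / toy["total_day_calls"]

print(toy[["total_charge", "day_minutes_per_call"]])
```

Aggregates like these let a model see a customer's overall spend in one feature instead of four correlated ones.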

The raw CSV dataset can be downloaded directly from the [Kaggle website](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset) or can be found in this repo [here](../../data/raw/telecom_churn_data).

## Model:

This project tests a variety of classification models including:
- Decision Tree Classifier
- Logistic Regression Classifier
- KNN Classifier
- Random Forest Classifier
- Gradient Boost Classifier

We evaluated our models on recall score as well as the corresponding confusion matrix.  Once the best model was identified, we assessed its performance on a separate test set to determine whether it continued to perform well or was overfitting.
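The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the project's actual pipeline: the hyperparameters, the three candidates shown and the 75/25 split are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (roughly 15% positives).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Hold out a test set that is only used once a model has been chosen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed candidate settings, chosen only to keep the example quick.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated recall on the training data for each candidate.
cv_recall = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="recall").mean()
    for name, model in candidates.items()
}

# Refit the best candidate and score it once on the held-out test set.
best_name = max(cv_recall, key=cv_recall.get)
best_model = candidates[best_name].fit(X_train, y_train)
test_recall = recall_score(y_test, best_model.predict(X_test))

print(cv_recall)
print(f"{best_name}: CV recall {cv_recall[best_name]:.2f}, test recall {test_recall:.2f}")
```

A test recall well below the cross-validated recall would be a sign that the chosen model is overfitting.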

We chose recall because of the cost of false negative predictions: we judged it more costly for the company if the model predicts that a customer will stay with SyriaTel when in fact they churn.  Such a miss is a lost opportunity to direct retention resources towards that customer and keep their business.  Maximising recall minimises these misses, which is why we chose it as our evaluation metric. 
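To make the metric concrete: recall is TP / (TP + FN), the share of actual churners that the model catches. The small pure-Python sketch below uses made-up labels (not project results) to show how a model can look acceptable on accuracy while still missing half of the churners.

```python
def recall_from_labels(y_true, y_pred):
    """Recall = TP / (TP + FN), where True marks a churner."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)      # churners caught
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)  # churners missed
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 4 churners, 6 non-churners.
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, False, False, False, False, False, False, False, True]

recall = recall_from_labels(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# accuracy = 0.70, recall = 0.50
```

Here two of the four churners go undetected, which is exactly the costly case described above.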

# Table of contents:
- Data Cleaning and Exploratory Data Analysis (EDA)

- Investigate Target Variable: Churn 

- Investigate Features

- First Simple Model:  Decision Tree Classifier

- Model Iterations 1 - 6

- Model interpretation

- Conclusion

## Results, Future Investigations and Recommendations:

#### Best model:  

Gradient boost, blah blah

#### Future Investigations:

Investigate high churn locations
etc

#### Recommendations:

Budget stuff

# Data Cleaning and Exploratory Data Analysis:

In [1]:
%load_ext autoreload
%autoreload 2

## Imports

In [3]:
# imports 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from imblearn.over_sampling import SMOTE

## Get the Data

As mentioned earlier, the dataset can be downloaded directly from the Kaggle website [here](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset) and saved into your desired directory, or, if copying the repo structure here, you can run the following cells to load the data from the `telecom_churn_data` csv file in the [raw data folder](../../data/raw) in this repo.

In [4]:
# read in data to pandas
df = pd.read_csv('../../data/raw/telecom_churn_data')
df.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


## Data Cleaning

#### Clean up column headings:

In [5]:
# replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_')

#### Inspect null values and data types:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
state                     3333 non-null object
account_length            3333 non-null int64
area_code                 3333 non-null int64
phone_number              3333 non-null object
international_plan        3333 non-null object
voice_mail_plan           3333 non-null object
number_vmail_messages     3333 non-null int64
total_day_minutes         3333 non-null float64
total_day_calls           3333 non-null int64
total_day_charge          3333 non-null float64
total_eve_minutes         3333 non-null float64
total_eve_calls           3333 non-null int64
total_eve_charge          3333 non-null float64
total_night_minutes       3333 non-null float64
total_night_calls         3333 non-null int64
total_night_charge        3333 non-null float64
total_intl_minutes        3333 non-null float64
total_intl_calls          3333 non-null int64
total_intl_charge         3333 non-null float64

It appears we have no null values in our dataframe (although we do not yet know whether there are any 'disguised' null values).  
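One way to hunt for 'disguised' nulls programmatically is to count values that match common placeholder tokens. The sketch below is only an assumption about what such a check might look like: the `PLACEHOLDERS` list and the `count_disguised_nulls` helper are hypothetical, and the toy frame is invented for the example.

```python
import pandas as pd

# Hypothetical placeholder tokens; extend this list to suit the data at hand.
PLACEHOLDERS = ["", "?", "-", "na", "n/a", "nan", "none", "null", "unknown"]


def count_disguised_nulls(frame):
    """Count, per column, the values that look like a placeholder for 'missing'."""
    # Compare on a trimmed, lower-cased string view so ' N/A ' matches 'n/a'.
    normalised = frame.astype(str).apply(lambda col: col.str.strip().str.lower())
    return normalised.isin(PLACEHOLDERS).sum()


# Invented rows, not taken from the dataset.
toy = pd.DataFrame({
    "state": ["KS", "?", "OH", " N/A "],
    "international_plan": ["no", "yes", "unknown", "no"],
    "total_day_calls": [110, 123, 114, 71],
})

counts = count_disguised_nulls(toy)
print(counts)
```

A non-zero count only flags candidates; each one still needs a manual look, since a token such as `-` can be legitimate in some columns.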

Most of the features are numerical except for `state`, `phone_number`, `international_plan` and `voice_mail_plan` which are strings and our target `churn` which is of boolean type.  

Let's inspect the unique values of each feature to see if we have any 'null' values 'in disguise' or any values that we don't expect which might be errors:

In [8]:
# inspect unique values of columns to identify potential errors or null values:
for col in df.columns:
    print(f"{col} vals:  {df[col].unique()} \n")

state vals:  ['KS' 'OH' 'NJ' 'OK' 'AL' 'MA' 'MO' 'LA' 'WV' 'IN' 'RI' 'IA' 'MT' 'NY'
 'ID' 'VT' 'VA' 'TX' 'FL' 'CO' 'AZ' 'SC' 'NE' 'WY' 'HI' 'IL' 'NH' 'GA'
 'AK' 'MD' 'AR' 'WI' 'OR' 'MI' 'DE' 'UT' 'CA' 'MN' 'SD' 'NC' 'WA' 'NM'
 'NV' 'DC' 'KY' 'ME' 'MS' 'TN' 'PA' 'CT' 'ND'] 

account_length vals:  [128 107 137  84  75 118 121 147 117 141  65  74 168  95  62 161  85  93
  76  73  77 130 111 132 174  57  54  20  49 142 172  12  72  36  78 136
 149  98 135  34 160  64  59 119  97  52  60  10  96  87  81  68 125 116
  38  40  43 113 126 150 138 162  90  50  82 144  46  70  55 106  94 155
  80 104  99 120 108 122 157 103  63 112  41 193  61  92 131 163  91 127
 110 140  83 145  56 151 139   6 115 146 185 148  32  25 179  67  19 170
 164  51 208  53 105  66  86  35  88 123  45 100 215  22  33 114  24 101
 143  48  71 167  89 199 166 158 196 209  16  39 173 129  44  79  31 124
  37 159 194 154  21 133 224  58  11 109 102 165  18  30 176  47 190 152
  26  69 186 171  28 153 169  13  27   3  42 1

Nothing surprising or suspicious here.  
- All `state` values look normal, as expected.
- We can see that columns `international_plan` and `voice_mail_plan` are binary features with `yes/no` values - we might want to change these types later to booleans or 1/0's.  
- It is interesting to see that there are only 3 distinct `area_code` values.  It might be worth investigating whether a particular area code has higher churn than another or if it would be safe to simply drop area code.  Also, area code is really a categorical feature rather than an `int` feature so we will change this data type. 
- It would be safe to assume that `phone_number` has no bearing on whether a person decides to leave the company and so we might choose to drop this column. 
- `account_length` appears to be discrete, with only integer values.  The highest value being 243 suggests that this column represents the total number of days the customer has had their account open with the company.  Given this length of time, the dataset must contain relatively new customers.
- `number_vmail_messages` appears to be a discrete variable and there aren't abnormal values here.  The highest number of voicemails is 51, which might be high for the average person but could be an indicator of churn, so we feel it is worth keeping. 
- `customer_service_calls` is also a discrete variable as expected with no apparent abnormal values. 
- All minutes, calls and charges columns have reasonable values and nothing stands out as unusual at this stage. 
- And of course our target `churn` has just `True/False` values as expected. 
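The clean-up ideas in the list above could be sketched as follows. This is a minimal illustration on an invented toy frame, not necessarily the exact transformations applied later in the project; the 1/0 encoding and the per-area-code churn rate are just one reasonable reading of the bullets.

```python
import pandas as pd

# Invented rows using the cleaned column names (values are illustrative only).
toy = pd.DataFrame({
    "area_code": [415, 415, 408, 510],
    "phone_number": ["000-0001", "000-0002", "000-0003", "000-0004"],
    "international_plan": ["no", "no", "yes", "yes"],
    "voice_mail_plan": ["yes", "yes", "no", "no"],
    "churn": [False, False, False, True],
})

# Drop the identifier column, which should carry no predictive signal.
cleaned = toy.drop(columns="phone_number")

# Encode the yes/no plan columns as 1/0.
for col in ["international_plan", "voice_mail_plan"]:
    cleaned[col] = cleaned[col].map({"yes": 1, "no": 0})

# Treat area code as a category rather than a number.
cleaned["area_code"] = cleaned["area_code"].astype("category")

# Churn rate per area code, to judge whether the feature is worth keeping.
churn_by_area = cleaned.groupby("area_code", observed=True)["churn"].mean()
print(churn_by_area)
```

If the churn rates across area codes turn out to be nearly identical on the real data, dropping the column becomes easier to justify.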

#### Inspect range and central tendencies of numeric data:

In [9]:
df.describe()

| | account_length | area_code | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | customer_service_calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


All values still seem reasonable, and so far nothing suggests outliers amongst the features. 
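A visual read of `describe()` can be complemented with a simple numeric screen. The sketch below counts values outside the conventional 1.5 × IQR fences for each numeric column; this rule is a common heuristic that we are assuming for illustration, not a check performed in the original analysis, and the toy values are invented.

```python
import pandas as pd


def iqr_outlier_counts(frame, k=1.5):
    """Count values per numeric column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Invented values, loosely shaped like two of the dataset's count columns.
toy = pd.DataFrame({
    "customer_service_calls": [1, 1, 0, 2, 3, 1, 2, 9],
    "total_intl_calls": [3, 3, 5, 7, 3, 4, 6, 5],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

Flagged values are not necessarily errors: a customer with many service calls may be unusual but genuine, and potentially informative for predicting churn.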