### Installing the imbalanced-learn library

In [3]:
pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.9.0-py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 39.8 MB/s            
  Downloading imbalanced_learn-0.8.1-py3-none-any.whl (189 kB)
     |████████████████████████████████| 189 kB 43.3 MB/s            
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.8.1 imblearn-0.0
Note: you may need to restart the kernel to use updated packages.


### Reading the CSV data file and creating a data-frame called churn

In [1]:
import boto3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, classification_report
from imblearn.over_sampling import RandomOverSampler

# Defining the s3 bucket
s3 = boto3.resource('s3')
bucket_name = 'gabriel-predictive-analytics'
bucket = s3.Bucket(bucket_name)

# Defining the file to be read from s3 bucket
file_key = "telecom_churn.csv"

bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

# Reading the csv file
churn = pd.read_csv(file_content_stream)
churn.head(1)

Unnamed: 0,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,0,128,1,1,2.7,1,265.1,110,89.0,9.87,10.0


### Looking at the relative frequency table of the Churn variable

In [2]:
# Relative Frequency table
churn['Churn'].value_counts() / churn.shape[0]

0    0.855086
1    0.144914
Name: Churn, dtype: float64

As the table shows, this is an imbalanced dataset: about 85.5% of customers did not churn (class 0) and only about 14.5% did (class 1).
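To see why this imbalance matters, the sketch below scores a baseline that always predicts "no churn". The 1,000 customers and the 145 churners are made-up counts chosen only to mirror the proportions above; they are not taken from the dataset.

```python
# Hypothetical counts that mirror the ~85.5% / ~14.5% split shown above
n_total = 1000
n_churn = 145

# True labels: 1 = churn, 0 = no churn
y_true = [1] * n_churn + [0] * (n_total - n_churn)

# A trivial baseline that never predicts churn
y_pred = [0] * n_total

# Accuracy looks good even though the baseline is useless for finding churners
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Recall on class 1: share of actual churners the baseline catches
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_churn = true_positives / n_churn

print(f"accuracy = {accuracy:.3f}, recall on churn = {recall_churn:.3f}")
```

The baseline is right about 85.5% of the time but never identifies a churner, which is why the rest of the notebook looks at per-class precision and recall rather than accuracy alone.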

### Let's use AccountWeeks, ContractRenewal, CustServCalls, MonthlyCharge, and DayMins as the predictor variables, and Churn as the target variable.

### Let's then split the data into two data-frames, stratifying on Churn so that both keep the same proportion of 0s and 1s: train (80%) and test (20%).

In [3]:
# Defining the input and target variables
X = churn[['AccountWeeks', 'ContractRenewal', 'CustServCalls', 'MonthlyCharge', 'DayMins']]
Y = churn['Churn']

# Splitting the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y)
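
As a rough illustration of what `stratify=Y` is meant to achieve, the sketch below splits each class separately so that train and test end up with the same churn rate. The function `stratified_split_indices` and the toy labels are hypothetical, and this is a conceptual NumPy version, not scikit-learn's actual implementation.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_parts = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # positions of this class
        rng.shuffle(idx)                   # shuffle within the class
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])    # take the same share from every class
    test_idx = np.sort(np.concatenate(test_parts))
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    return np.flatnonzero(train_mask), test_idx

# Toy labels (hypothetical): 850 non-churners and 150 churners
y_demo = np.array([0] * 850 + [1] * 150)
train_idx, test_idx = stratified_split_indices(y_demo)

train_rate = y_demo[train_idx].mean()
test_rate = y_demo[test_idx].mean()
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Because each class contributes the same fraction to the test set, the churn rate is preserved in both parts, which keeps the test metrics comparable to the full data.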

### Performing random over-sampling on the train dataset: randomly chosen churn (class 1) rows are duplicated until both classes are balanced

In [4]:
# Running over-sampling
X_over, Y_over = RandomOverSampler().fit_resample(X_train, Y_train)
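
The idea behind random over-sampling can be sketched in plain NumPy: minority-class rows are drawn with replacement and appended until every class matches the majority count. The helper `random_oversample` and the toy arrays below are hypothetical and only approximate what `RandomOverSampler` does with its default settings.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until all classes are the same size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]             # start with every original row
    for label, count in zip(labels, counts):
        if count < target:
            idx = np.flatnonzero(y == label)
            # draw the missing rows with replacement from the existing ones
            keep.append(rng.choice(idx, size=target - count, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy data (hypothetical): 8 rows of class 0 and 2 rows of class 1
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)

X_res, y_res = random_oversample(X_demo, y_demo)
print(np.bincount(y_res))
```

Every row in the resampled data is a copy of an original row, so no new feature values are created. Only the train set is resampled here; the test set keeps its original class proportions.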

### Using the over-sampled data-frame, let's build a random forest classification model with 500 trees and the maximum depth of each tree equal to 3.

### Then, estimate the cutoff value that brings the random forest classification model closest to the perfect model (true-positive rate = 1, false-positive rate = 0) on the ROC curve, and use that optimal cutoff value to turn the predicted likelihoods into labels.

In [5]:
# Random Forest Classifier model
RF_md = RandomForestClassifier(n_estimators = 500, max_depth = 3).fit(X_over, Y_over)

# Predicting on test dataset
RF_preds = RF_md.predict_proba(X_test)[:,1]

# Computing the ROC curve
fpr, tpr, threshold = roc_curve(Y_test, RF_preds)

# Creating a data-frame
cutoff_values = pd.DataFrame({'False_Positive': fpr, 'True_Positive': tpr, 'Cutoff': threshold})
cutoff_values.head()

Unnamed: 0,False_Positive,True_Positive,Cutoff
0,0.0,0.0,1.879997
1,0.0,0.010309,0.879997
2,0.0,0.030928,0.834721
3,0.005263,0.030928,0.799044
4,0.005263,0.309278,0.780919


### Checking the classification report at the optimal cutoff value.

In [6]:
# Finding the cutoff value closest to the perfect model (tpr = 1, fpr = 0)
cutoff_values['True_Positive_minus_1'] = cutoff_values['True_Positive'] - 1
cutoff_values['Distance_to_perfect_model'] = np.sqrt(cutoff_values['False_Positive']**2 + cutoff_values['True_Positive_minus_1']**2)
# Sorting so that the row with the smallest distance comes first
cutoff_values = cutoff_values.sort_values(by = 'Distance_to_perfect_model').reset_index(drop = True)

# Changing likelihoods to labels (likelihoods at or above the cutoff become 1)
RF_preds = np.where(RF_preds < cutoff_values['Cutoff'][0], 0, 1)

# Printing classification report
print(classification_report(Y_test, RF_preds))

              precision    recall  f1-score   support

           0       0.97      0.83      0.90       570
           1       0.46      0.84      0.59        97

    accuracy                           0.83       667
   macro avg       0.71      0.83      0.74       667
weighted avg       0.89      0.83      0.85       667
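
The cutoff selection used above can be checked by hand on a small example. The `fpr`, `tpr`, `thresholds`, and `scores` arrays below are made-up values, not output from the model, and `np.argmin` is used in place of sorting the data-frame; both pick the row with the smallest distance to the point (fpr = 0, tpr = 1).

```python
import numpy as np

# Hypothetical ROC points, in the same layout that roc_curve returns
fpr = np.array([0.0, 0.0, 0.1, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.3, 0.7, 0.9, 0.95, 1.0])
thresholds = np.array([1.9, 0.9, 0.7, 0.5, 0.3, 0.1])

# Euclidean distance from each ROC point to the perfect model (fpr = 0, tpr = 1)
distance = np.sqrt(fpr**2 + (tpr - 1)**2)

# The cutoff whose ROC point is closest to the perfect model
best = int(np.argmin(distance))
best_cutoff = thresholds[best]

# Turning hypothetical likelihoods into labels with that cutoff
scores = np.array([0.95, 0.55, 0.45, 0.2])
labels = np.where(scores < best_cutoff, 0, 1)

print(best_cutoff, labels)
```

In this toy case the point (0.2, 0.9) is the closest, so the cutoff is 0.5 and the first two scores are labeled 1.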



### Repeating the same process, now with an AdaBoost classification model with 500 trees, the maximum depth of each tree equal to 3, and a learning rate equal to 0.01

In [11]:
# AdaBoost Classifier model
# (in scikit-learn 1.2 and later, the base_estimator argument is named estimator)
ADA_md = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth = 3), n_estimators = 500, learning_rate = 0.01).fit(X_over, Y_over)

# Predicting on test dataset
ADA_preds = ADA_md.predict_proba(X_test)[:,1]

# Computing the ROC curve
fpr, tpr, threshold = roc_curve(Y_test, ADA_preds)

# Creating a data-frame
cutoff_values = pd.DataFrame({"False_Positive": fpr, "True_Positive": tpr, "Cutoff": threshold})

# Finding the cutoff value closest to the perfect model (tpr = 1, fpr = 0)
cutoff_values['True_Positive_minus_1'] = cutoff_values['True_Positive'] - 1
cutoff_values['Distance_to_perfect_model'] = np.sqrt(cutoff_values['False_Positive']**2 + cutoff_values['True_Positive_minus_1']**2)
cutoff_values = cutoff_values.sort_values(by = 'Distance_to_perfect_model').reset_index(drop = True)

# Changing likelihoods to labels (likelihoods at or above the cutoff become 1)
ADA_preds = np.where(ADA_preds < cutoff_values['Cutoff'][0], 0, 1)

# Printing classification report
print(classification_report(Y_test, ADA_preds))

              precision    recall  f1-score   support

           0       0.97      0.89      0.93       570
           1       0.56      0.82      0.67        97

    accuracy                           0.88       667
   macro avg       0.76      0.86      0.80       667
weighted avg       0.91      0.88      0.89       667



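#### Using the results from parts 4 and 5, I would use the AdaBoost classifier to predict customer churn: its precision on class 1 is higher (0.56 vs. 0.46 for the random forest) while its recall on class 1 is nearly the same (0.82 vs. 0.84), which gives it the higher class 1 F1-score (0.67 vs. 0.59).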
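
The class 1 F1-scores in the two reports can be reproduced from the printed precision and recall, since F1 is their harmonic mean. The inputs below are the two-decimal values copied from the reports above, so the result is only approximate; the helper `f1_from` is a hypothetical name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall as printed in the two classification reports
rf_f1 = f1_from(0.46, 0.84)    # random forest
ada_f1 = f1_from(0.56, 0.82)   # AdaBoost

print(round(rf_f1, 2), round(ada_f1, 2))
```

AdaBoost gives up very little recall for a clear gain in precision, which is what lifts its F1-score on the churn class.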