# Part II: Customer Churn
One of the important applications of classification techniques in marketing analytics lies in predicting whether individual customers will defect to a competitor over a fixed time period (e.g. over the next year). Your goal is to build the best model possible to predict churn. In this problem you will use the “churn_data.csv” file to make these kinds of predictions, and you will implement a Jupyter notebook to accomplish this. Please make sure your Jupyter notebook addresses the questions with appropriate discussion (i.e. you will want to include appropriate discussion around your code).

1) Load the “churn_data.csv” dataset into Python. What is the response variable, and what are the predictor variables?

2) What data transforms are necessary to perform on this data and why?

3) What modeling approaches did you use and why? Describe your model development process, including the different models tried, feature selection methods, and the different transformation techniques you employed. (**NOTE: Please do not forget to use your KNN algorithm from part 1 as one of your methods).**

4) Which error metrics did you use to assess performance and why? What kind of performance did you obtain on the different models you built?

5) Construct the best (i.e. least-error) possible model on this data set. What are the predictors used?

6) Load the dataset “churn_validation.csv” into a new data frame and recode as necessary. Predict the outcomes for each of the customers and compare to the actual. What are the error rates you get based on your selected metrics?

7) Consider the best model you built for this problem. Is it a good model that can reliably be used for prediction? Why or why not?

In [20]:
import pandas as pd
import numpy as np
import knn
import math
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

data_train = pd.read_csv('churn_data.csv')
data_test = pd.read_csv('churn_validation.csv')

data_train.head()

Unnamed: 0,CustID,Gender,Age,Income,FamilySize,Education,Calls,Visits,Churn
0,123251,Male,34,Lower,4,16,14,5,Yes
1,188922,Male,20,Lower,5,14,49,1,No
2,145322,Female,30,Lower,4,20,19,4,Yes
3,153729,Female,46,Lower,4,14,15,4,Yes
4,103976,Female,23,Lower,4,16,18,0,No


As we can see looking at the churn dataset every customer has a set of discriptive variables such as gender, age, income, etc., a customer ID variable, and whether or not they stayed a customer as indicated by the Churn column. So the goal is to use the discriptive vairables of the customer to predict the Churn. 
However, before creating a classifier for the data first the data needs to be transformed into a usable state. First the 'CustID' columns need to be dropped since it has no relation to the target variable and only adds noise to the model. Then the categorical variables 'Gender' and 'Income' need to be one-hot encoded so that all of the vairables are numerical. Next, I normalized the data so it was all on a scale from 0 to 1 to help make sure that all of the variables were weighted equally in the predcting models. Lastly, I converted the Yes and No variables in the Churn column to 1s and 0s respectively. This made it easier to impliment the predicting models.

In [21]:
# Drop Uniqe ID Variable
data_train = data_train.drop(labels=['CustID'], axis = 1)
data_test = data_test.drop(labels=['CustID'], axis = 1)

#One Hot Encode Categorical Variables
data_train = pd.get_dummies(data_train, columns=['Gender','Income'])
data_test = pd.get_dummies(data_test, columns=['Gender','Income'])

#Convert Yes and No value to 1 and 0 [yes = 1 and no = 0]
data_train = data_train.replace(to_replace='Yes', value=1)
data_train = data_train.replace(to_replace = 'No', value=0)

data_test = data_test.replace(to_replace='Yes', value=1)
data_test = data_test.replace(to_replace = 'No', value=0)
              
features = list(data_train)
features.remove('Churn')

data_x, data_y = data_train[features], data_train['Churn']

# Convert all data so it is in a scale from 0 to 1
min_max_scaler = preprocessing.MinMaxScaler()
data_x = min_max_scaler.fit_transform(data_x)
data_x_testing = min_max_scaler.transform(data_test[features])


x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size = 0.3, random_state = 4)

  return self.partial_fit(X, y)


In [3]:
# To try to better understand the data I tested the correlation between the churn and the other variables which 
#showed to have insignificant correlation
for i in list(data_test):
    print(i,'and Churn have a correlation of',data_test[i].corr(data_train['Churn']))

Age and Churn have a correlation of -0.19812588642853068
FamilySize and Churn have a correlation of -0.2641910054578153
Education and Churn have a correlation of -0.04254356298115171
Calls and Churn have a correlation of 0.2252229177533687
Visits and Churn have a correlation of 0.01943917038501359
Churn and Churn have a correlation of -0.011953709238683665
Gender_Female and Churn have a correlation of 0.2430949469092278
Gender_Male and Churn have a correlation of -0.2430949469092278
Income_Lower and Churn have a correlation of -0.06262242910851495
Income_Upper and Churn have a correlation of 0.06262242910851495


# Different Modeling Approaches

While trying to find the best model for the data I decided to try three different models; naive bayse, random forest, and a knn classifier. Then for each model I tested to see which conditions created the best model. I decided against using support vector machines because that method works best with larger sets of data and with less than 10 variables I thought it wouldn't perform well. 

## Method 1: Naive Bayse - Gaussian
### Testing the model with training data
As we can see the Naive bayse method doesn't work very well. After experimenting with the different methods Gaussian performed the best, however, the accuracy is only about 69% which is not very good.

In [5]:
# Build and evaluate the model.
from sklearn import naive_bayes

#nb = naive_bayes.BernoulliNB()
#nb = naive_bayes.MultinomialNB()
nb = naive_bayes.GaussianNB()
nb.fit(x_train, y_train)
y_hat = nb.predict(x_test)

print('Accuracy: ' + str(accuracy_score(y_test, y_hat)))
print('Precision: ' + str(precision_score(y_test, y_hat)))
print('Recall: ' + str(recall_score(y_test, y_hat)))
print('F1: ' + str(f1_score(y_test, y_hat)))
print('ROC AUC: ' + str(roc_auc_score(y_test, y_hat)))
print('Confusion Matric: \n' + str(confusion_matrix(y_test, y_hat)))

Accuracy: 0.6923076923076923
Precision: 0.8
Recall: 0.5714285714285714
F1: 0.6666666666666666
ROC AUC: 0.7023809523809523
Confusion Matric: 
[[15  3]
 [ 9 12]]


## Method 2: Random Forest
It is difficult to say which combination of esimators and depth is the best. So using the combinations of depth and estimators that generate an accuracy of greater than 83% I plan to see how they fair on the test data.

In [35]:
from sklearn import ensemble
from error_metrics import *

# Builds a sequence of Random Forest models for different n_est and depth values
n_ests = [5, 10, 50, 80]
depths = [2, 3, 4, 5, 6]

# Creates a list of tuples containing the n-est and depth values that perform well (n-est, depth)
good_forest = []

for n in n_ests:
    for dp in depths:
        mod = ensemble.RandomForestClassifier(n_estimators = n, max_depth = dp)
        mod.fit(x_train, y_train)
        y_hat = mod.predict(x_test)
        if accuracy_score(y_test, y_hat) > 0.83:
            good_forest.append((n, dp))
            print('------ Evaluating model: n_estimators =' + str(n) + ', max_depth = ' + str(dp),'------')
            print_multiclass_classif_error_report(y_test, y_hat)

------ Evaluating model: n_estimators =5, max_depth = 5 ------
Accuracy: 0.8461538461538461
Avg. F1 (Micro): 0.8461538461538461
Avg. F1 (Macro): 0.8460526315789474
Avg. F1 (Weighted): 0.8457489878542509
              precision    recall  f1-score   support

           0       0.94      0.77      0.85        22
           1       0.76      0.94      0.84        17

   micro avg       0.85      0.85      0.85        39
   macro avg       0.85      0.86      0.85        39
weighted avg       0.86      0.85      0.85        39

Confusion Matrix: 
[[17  1]
 [ 5 16]]
------ Evaluating model: n_estimators =10, max_depth = 4 ------
Accuracy: 0.8461538461538461
Avg. F1 (Micro): 0.8461538461538461
Avg. F1 (Macro): 0.8452380952380952
Avg. F1 (Weighted): 0.8461538461538461
              precision    recall  f1-score   support

           0       0.83      0.83      0.83        18
           1       0.86      0.86      0.86        21

   micro avg       0.85      0.85      0.85        39
   macro a

In [39]:
#Testing individual forest combinations.
print('Using 50 estimators and a max depth of 5')
rf_mod = ensemble.RandomForestClassifier(n_estimators = 50, max_depth = 5)
rf_mod.fit(x_train, y_train)
y_hat = rf_mod.predict(x_test)
print_multiclass_classif_error_report(y_test, y_hat)

Using 50 estimators and a max depth of 5
Accuracy: 0.8461538461538461
Avg. F1 (Micro): 0.8461538461538461
Avg. F1 (Macro): 0.8460526315789474
Avg. F1 (Weighted): 0.8457489878542509
              precision    recall  f1-score   support

           0       0.94      0.77      0.85        22
           1       0.76      0.94      0.84        17

   micro avg       0.85      0.85      0.85        39
   macro avg       0.85      0.86      0.85        39
weighted avg       0.86      0.85      0.85        39

Confusion Matrix: 
[[17  1]
 [ 5 16]]


## Method 2: Random Forest Testing
The biggest issue with the random forest is that there is no obviously bester model. Especially since the accuracy of each estimator and max depth combination can change between tests. However, after running several tests the random forest utalizing 50 estimators and a max depth of 5 consistantly performed well. The random forest utalizing 80 estimators and a max depth of 6 also performed well, however, since it has a larger amount of estimators and the max depth is equal to the number of variables the model os more suseptable to overfitting so I believe that the model utalizing 50 estimators and a max depth of 5 is the best random forest model for this dataset.

In [36]:
#Using leave-one-out cross validation to test the better performing random forests
for i in range(len(good_forest)):
    rf_mod = ensemble.RandomForestClassifier(n_estimators = good_forest[i][0], max_depth = good_forest[i][1])
    loo = LeaveOneOut()
    loo_scores = cross_val_score(rf_mod, data_x, data_y, cv=loo)
    print('When n_estimators=',good_forest[i][0],'and max_depth=',good_forest[i][1],'the CV Score (Leave-One-Out)= ' + str(loo_scores.mean()))


When n_estimators= 5 and max_depth= 5 the CV Score (Leave-One-Out)= 0.7109375
When n_estimators= 10 and max_depth= 4 the CV Score (Leave-One-Out)= 0.7421875
When n_estimators= 80 and max_depth= 2 the CV Score (Leave-One-Out)= 0.7578125
When n_estimators= 80 and max_depth= 3 the CV Score (Leave-One-Out)= 0.765625
When n_estimators= 80 and max_depth= 4 the CV Score (Leave-One-Out)= 0.7734375
When n_estimators= 80 and max_depth= 5 the CV Score (Leave-One-Out)= 0.78125
When n_estimators= 80 and max_depth= 6 the CV Score (Leave-One-Out)= 0.7578125


In [37]:
# Using Shuffle Split cross validation to test the better performing random forests
for i in range(len(good_forest)):
    rf_mod = ensemble.RandomForestClassifier(n_estimators = good_forest[i][0], max_depth = good_forest[i][1])
    shuffle_split = ShuffleSplit(test_size = 0.2, train_size = 0.8, n_splits = 10)
    ss_scores = cross_val_score(rf_mod, data_x, data_y, scoring = 'accuracy', cv=shuffle_split)
    print('When n_estimators=',good_forest[i][0],'and max_depth=',good_forest[i][1],'\n CV Scores (Shuffle Split)= ' + str(ss_scores))
    

When n_estimators= 5 and max_depth= 5 
 CV Scores (Shuffle Split)= [0.80769231 0.69230769 0.76923077 0.65384615 0.80769231 0.61538462
 0.76923077 0.69230769 0.76923077 0.69230769]
When n_estimators= 10 and max_depth= 4 
 CV Scores (Shuffle Split)= [0.84615385 0.73076923 0.65384615 0.61538462 0.76923077 0.73076923
 0.84615385 0.76923077 0.80769231 0.61538462]
When n_estimators= 80 and max_depth= 2 
 CV Scores (Shuffle Split)= [0.73076923 0.69230769 0.80769231 0.84615385 0.69230769 0.76923077
 0.5        0.76923077 0.69230769 0.84615385]
When n_estimators= 80 and max_depth= 3 
 CV Scores (Shuffle Split)= [0.61538462 0.80769231 0.88461538 0.73076923 0.80769231 0.80769231
 0.73076923 0.65384615 0.80769231 0.92307692]
When n_estimators= 80 and max_depth= 4 
 CV Scores (Shuffle Split)= [0.80769231 0.61538462 0.80769231 0.84615385 0.80769231 0.80769231
 0.96153846 0.73076923 0.76923077 0.73076923]
When n_estimators= 80 and max_depth= 5 
 CV Scores (Shuffle Split)= [0.88461538 0.76923077 0.769

## Method 3: KNN
K nearest neighbors performed better than the Naive Bayse model and about on par with the random forest. The model performed the best when k=5 with an accuracy of 0.846. At first glance the accuracy is lower than some of the random forest models however the accuracy of those models went down 

In [18]:
#Testing the KNN method with different values of k
for k in range(1,20,2):
    mod = knn.KNN(k, 'euclidean')
    mod.fit(x_train, y_train)
    y_hat = mod.predict(x_test)
    if accuracy_score(y_test, y_hat) > .7:
        print('---------- EVALUATING MODEL: k = ' + str(k) + '----------')
        print('Accuracy: ' + str(accuracy_score(y_test, y_hat)))
        print('Precision: ' + str(precision_score(y_test, y_hat)))
        print('Recall: ' + str(recall_score(y_test, y_hat)))
        print('F1: ' + str(f1_score(y_test, y_hat)))
        print('ROC AUC: ' + str(roc_auc_score(y_test, y_hat)))
        print('Confusion Matric: \n' + str(confusion_matrix(y_test, y_hat)))

---------- EVALUATING MODEL: k = 1----------
Accuracy: 0.7435897435897436
Precision: 0.8235294117647058
Recall: 0.6666666666666666
F1: 0.7368421052631577
ROC AUC: 0.75
Confusion Matric: 
[[15  3]
 [ 7 14]]
---------- EVALUATING MODEL: k = 3----------
Accuracy: 0.7435897435897436
Precision: 0.7894736842105263
Recall: 0.7142857142857143
F1: 0.7500000000000001
ROC AUC: 0.746031746031746
Confusion Matric: 
[[14  4]
 [ 6 15]]
---------- EVALUATING MODEL: k = 5----------
Accuracy: 0.8461538461538461
Precision: 0.8947368421052632
Recall: 0.8095238095238095
F1: 0.8500000000000001
ROC AUC: 0.8492063492063492
Confusion Matric: 
[[16  2]
 [ 4 17]]
---------- EVALUATING MODEL: k = 7----------
Accuracy: 0.7692307692307693
Precision: 0.9285714285714286
Recall: 0.6190476190476191
F1: 0.742857142857143
ROC AUC: 0.7817460317460317
Confusion Matric: 
[[17  1]
 [ 8 13]]
---------- EVALUATING MODEL: k = 15----------
Accuracy: 0.717948717948718
Precision: 0.9166666666666666
Recall: 0.5238095238095238
F1: 0

## Final Testing
Since it is not very clear which is the best model I will test the best knn and the best random forest using the testing dataset.

In [43]:
#KNN
#Testing on the testing dataset
knn_mod = knn.KNN(5, 'euclidean')
mod = knn.KNN(k, 'euclidean')
mod.fit(x_train, y_train)
y_hat = mod.predict(data_x_testing)

print('---------- EVALUATING KNN MODEL: k = ' + str(5) + '----------')
print('Accuracy: ' + str(accuracy_score(data_test['Churn'], y_hat)))
print('Precision: ' + str(precision_score(data_test['Churn'], y_hat)))
print('Recall: ' + str(recall_score(data_test['Churn'], y_hat)))
print('F1: ' + str(f1_score(data_test['Churn'], y_hat)))
print('ROC AUC: ' + str(roc_auc_score(data_test['Churn'], y_hat)))
print('Confusion Matric: \n' + str(confusion_matrix(data_test['Churn'], y_hat)))

---------- EVALUATING KNN MODEL: k = 5----------
Accuracy: 0.71875
Precision: 0.6428571428571429
Recall: 0.6923076923076923
F1: 0.6666666666666666
ROC AUC: 0.7145748987854251
Confusion Matric: 
[[14  5]
 [ 4  9]]


In [44]:
#Random Forest
#Using 
rf_mod = ensemble.RandomForestClassifier(n_estimators = 50, max_depth = 5)
rf_mod.fit(x_train, y_train)
y_hat = mod.predict(data_x_testing)

print('---------- EVALUATING KNN MODEL: k = ' + str(5) + '----------')
print('Accuracy: ' + str(accuracy_score(data_test['Churn'], y_hat)))
print('Precision: ' + str(precision_score(data_test['Churn'], y_hat)))
print('Recall: ' + str(recall_score(data_test['Churn'], y_hat)))
print('F1: ' + str(f1_score(data_test['Churn'], y_hat)))
print('ROC AUC: ' + str(roc_auc_score(data_test['Churn'], y_hat)))
print('Confusion Matric: \n' + str(confusion_matrix(data_test['Churn'], y_hat)))

---------- EVALUATING KNN MODEL: k = 5----------
Accuracy: 0.71875
Precision: 0.6428571428571429
Recall: 0.6923076923076923
F1: 0.6666666666666666
ROC AUC: 0.7145748987854251
Confusion Matric: 
[[14  5]
 [ 4  9]]


## Best Model
My best model was between the random forest and the knn however when I tested both on the churn_validation.csv dataset they gave the exact same results. They both had an accuracy of about 71% which is a decent model and aligns with some of the results that I was getting using the testing data on these models. As for how reliable it is, it isn't the most reliable model. I think that it is accurate enough that they can be beneficial however, I would only qualify it as a good reliable model if it were above 80% accuracy. 