## Customer Churn

One of the important applications of classification techniques in marketing analytics lies in predicting whether
individual customers will defect to a competitor over a fixed time period (e.g. over the next year). Your goal is to
build the best model possible to predict churn. In this problem you will use the “churn_data.csv” file to make these
kinds of predictions, and you will implement a Jupyter notebook to accomplish this. Please make sure your Jupyter
notebook addresses the questions with appropriate discussion (i.e. you will want to include appropriate discussion
around your code).

Load the “churn_data.csv” dataset into Python.

In [1]:
import pandas as pd
from sklearn import linear_model
from sklearn import neighbors
from sklearn import naive_bayes
from sklearn import ensemble #Contains random forest algorithm
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix


In [2]:
#Read in data
data = pd.read_csv('C:/Github/DSCI-402-2019/data_files/churn_data.csv')
validation = pd.read_csv('C:/Github/DSCI-402-2019/data_files/churn_validation.csv')

print(data.head(20))
print(' ')
print(validation.head(5))

    CustID  Gender  Age Income  FamilySize  Education  Calls  Visits Churn
0   123251    Male   34  Lower           4         16     14       5   Yes
1   188922    Male   20  Lower           5         14     49       1    No
2   145322  Female   30  Lower           4         20     19       4   Yes
3   153729  Female   46  Lower           4         14     15       4   Yes
4   103976  Female   23  Lower           4         16     18       0    No
5   139389    Male   54  Upper           3         12      6       0    No
6   197395  Female   32  Upper           3         17     22       0    No
7   176036  Female   19  Lower           1         12      8       1    No
8   139348    Male   34  Lower           3         12     11       3    No
9   151276  Female   18  Upper           4         16     11       1   Yes
10  102056  Female   17  Upper           1         12     25       1    No
11  118692    Male   29  Lower           4         16     29       1   Yes
12  103866  Female   30  

**1) What is the response variable, and what are the predictor variables?**

Our response variable is the Churn variable which shows whether or not the individual remained a customer. The rest of the columns are considered the predictor variables, however the data does need some cleaning before we can make predictions.

**2) What data transforms are necessary to perform on this data and why?**

The CustID column can be dropped because it is just added noise and has no effect on the Churn of the customer. We can also make changes to "Gender", "Income", and "Churn". Female and Male in "Gender" can be changed to 1 and 0. Upper and Lower in "Income" can be changed to 1 and 0. Yes and No in "Churn" can be changed to 1 and 0.  These changes will make it easier to implement predicting models. (I made these changes to both the churn data and the validation data)

In [3]:
#Delete CustID column
del data['CustID']
del validation['CustID']

#Change the yes and no in churn, then female and male in gender, and the upper and lower in income into 1 and 0 (For the data set)
yes = data['Churn'] == 'Yes'
data.loc[yes, 'Churn'] = 1

no = data['Churn'] == 'No'
data.loc[no, 'Churn'] = 0


female = data['Gender'] == 'Female'
data.loc[female, 'Gender'] = 1

male = data['Gender'] == 'Male'
data.loc[male, 'Gender'] = 0

upper = data['Income'] == 'Upper'
data.loc[upper, 'Income'] = 1

lower = data['Income'] == 'Lower'
data.loc[lower, 'Income'] = 0

#Make the same changes to the validation set
yes = validation['Churn'] == 'Yes'
validation.loc[yes, 'Churn'] = 1

no = validation['Churn'] == 'No'
validation.loc[no, 'Churn'] = 0

female = validation['Gender'] == 'Female'
validation.loc[female, 'Gender'] = 1

male = validation['Gender'] == 'Male'
validation.loc[male, 'Gender'] = 0

upper = validation['Income'] == 'Upper'
validation.loc[upper, 'Income'] = 1

lower = validation['Income'] == 'Lower'
validation.loc[lower, 'Income'] = 0



In [4]:
# Look at data to see relationships of variables. 
sm = pd.plotting.scatter_matrix(data, diagonal = 'kde', figsize=(10,8))

**3) What modeling approaches did you use and why? Describe your model development process, including the different models tried, feature selection methods, and the different transformation techniques you employed.**

When doing my analysis, I decided to use all remaining features because even though there is some correlation think they all stand independently enough to bring unique information to our analysis. I used 5 modeling approaches to analyze this data because more models increase our chance of finding the best one. Each model will be explained as it is used. 

While doing the analysis, I used the churn data first and split it into training and testing sets. I used this data to find the best predictor model of the ones that I tried. Once I found the best model, I imported the validation data to run more tests. I then used all of the churn data as my training set and the validation data as my test set. 

In [5]:
#Get predictors - all non-Churn columns (Churn is the last column)
data_x = data[list(data)[:-1]]
val_x = validation[list(validation)[:-1]]

#Get target variable y - Churn column
data_y = data['Churn']
val_y = validation['Churn']


#Split data into training and test sets (comment this out once you have picked the best model)
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size = 0.3, random_state=4)

#Split data into training and test sets(uncomment these so that the validation data can be ran)
#x_train = data_x
#y_train = data_y
#x_test = val_x
#y_test = val_y


**Logistic Regression **

Logistic Regression is used to predict the odds of being a case based on the values of the predictors. The odds are definedas a probability that a particular outcome is a case divided by the probability that it is a noncase. 


In [6]:
#Build a logistic regression model
log_mod = linear_model.LogisticRegression()
log_mod.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [7]:
#Make predictions
preds = log_mod.predict(x_test)  #Get predicted labels
pred_probs = log_mod.predict_proba(x_test)   #Get predicted probabilities/ Each observation is a 2-element array.
pred_pos = pred_probs.transpose()[1]  #P(X = 1) is column 1
pred_neg = pred_probs.transpose()[1]  #P(X = 0) is column 0

#Look at results
pred_df = pd.DataFrame({"Actual": y_test, "Predicted Class": preds, "P(X=1)" : pred_pos, "P(x=1)": pred_neg})
pred_df.head(15)

Unnamed: 0,Actual,Predicted Class,P(X=1),P(x=1)
5,0,1,0.546568,0.546568
24,1,1,0.612982,0.612982
29,1,1,0.58421,0.58421
61,1,1,0.717188,0.717188
19,0,0,0.488857,0.488857
95,1,0,0.465782,0.465782
2,1,0,0.435685,0.435685
25,0,0,0.424309,0.424309
90,0,0,0.239364,0.239364
78,1,1,0.824521,0.824521


**4) Which error metrics did you use to assess performance and why? What kind of performance did you obtain on the different models you built?**

To asses the performance of the models, I used accuracy, precision, recall, F1, Roc Auc and a confusion matrix in order to get the whole picture of the model. 
**Accuracy:** Total % correctly classified

**Precision:** % predicted positive that are correctly called positive

**Recall:** % predicted positive out of all positive

**F1 Score:** Mean of Precision and Recall

**Roc Auc:** Plots false positive and true positive 

**Confusion Matrix:** Shows actual classes in rows and predicted classes in column to visualize the correct predictions and misclassifications to be clearly seen



In [8]:
#Look at error metrics
print("Accuracy:  " + str(accuracy_score(y_test, preds)))
print("Precision:  " + str(precision_score(y_test, preds)))
print("Recall:  " + str(recall_score(y_test, preds)))
print("F1:  " + str(f1_score(y_test, preds)))
print("ROC AUC:  " + str(roc_auc_score(y_test, preds)))
print("Confusion Matrix: \n" + str(confusion_matrix(y_test, preds)))

Accuracy:  0.7435897435897436
Precision:  0.8235294117647058
Recall:  0.6666666666666666
F1:  0.7368421052631577
ROC AUC:  0.75
Confusion Matrix: 
[[15  3]
 [ 7 14]]


**K-Nearest Neighbors Example**

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.


In [9]:
#Build a sequence of models for a set of different k values.
ks = [2, 3, 7, 11, 13, 15, 17, 19, 21]
for k in ks:
    #Create and fit a KNN model 
    mod = neighbors.KNeighborsClassifier(n_neighbors = k)
    mod.fit(x_train, y_train)
    
    #Make predictions and evaluate
    preds = mod.predict(x_test)
    print('-------------------EVALUATING MODEL k = ' + str(k) + '---------------------------')
    #Look at error metrics
    print("Accuracy:  " + str(accuracy_score(y_test, preds)))
    print("Precision:  " + str(precision_score(y_test, preds)))
    print("Recall:  " + str(recall_score(y_test, preds)))
    print("F1:  " + str(f1_score(y_test, preds)))
    print("ROC AUC:  " + str(roc_auc_score(y_test, preds)))
    print("Confusion Matrix: \n" + str(confusion_matrix(y_test, preds)))

-------------------EVALUATING MODEL k = 2---------------------------
Accuracy:  0.6410256410256411
Precision:  0.7692307692307693
Recall:  0.47619047619047616
F1:  0.588235294117647
ROC AUC:  0.6547619047619048
Confusion Matrix: 
[[15  3]
 [11 10]]
-------------------EVALUATING MODEL k = 3---------------------------
Accuracy:  0.6923076923076923
Precision:  0.6956521739130435
Recall:  0.7619047619047619
F1:  0.7272727272727272
ROC AUC:  0.6865079365079365
Confusion Matrix: 
[[11  7]
 [ 5 16]]
-------------------EVALUATING MODEL k = 7---------------------------
Accuracy:  0.5897435897435898
Precision:  0.6086956521739131
Recall:  0.6666666666666666
F1:  0.6363636363636365
ROC AUC:  0.5833333333333333
Confusion Matrix: 
[[ 9  9]
 [ 7 14]]
-------------------EVALUATING MODEL k = 11---------------------------
Accuracy:  0.6153846153846154
Precision:  0.625
Recall:  0.7142857142857143
F1:  0.6666666666666666
ROC AUC:  0.6071428571428572
Confusion Matrix: 
[[ 9  9]
 [ 6 15]]
----------------

**Naive Bayes Classifier**

 A Naive Bayes Classifier assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. All features contribute independently in the classification of the variable.

In [10]:
#Building Gaussian Naive Bayes Model
gnb_mod = naive_bayes.GaussianNB()
gnb_mod.fit(x_train,y_train)
preds = gnb_mod.predict(x_test)

print("-------EVALUATING GAUSSIAN NAIVE BAYES MODEL---------")
#Look at error metrics
print("Accuracy:  " + str(accuracy_score(y_test, preds)))
print("Precision:  " + str(precision_score(y_test, preds)))
print("Recall:  " + str(recall_score(y_test, preds)))
print("F1:  " + str(f1_score(y_test, preds)))
print("ROC AUC:  " + str(roc_auc_score(y_test, preds)))
print("Confusion Matrix: \n" + str(confusion_matrix(y_test, preds)))


#Building Bernoulli Naive Bayes Model
bnb_mod = naive_bayes.BernoulliNB()
bnb_mod.fit(x_train,y_train)
preds = gnb_mod.predict(x_test)

print(" ")
print("-------EVALUATING BERNOULLI NAIVE BAYES MODEL---------")
#Look at error metrics
print("Accuracy:  " + str(accuracy_score(y_test, preds)))
print("Precision:  " + str(precision_score(y_test, preds)))
print("Recall:  " + str(recall_score(y_test, preds)))
print("F1:  " + str(f1_score(y_test, preds)))
print("ROC AUC:  " + str(roc_auc_score(y_test, preds)))
print("Confusion Matrix: \n" + str(confusion_matrix(y_test, preds)))


-------EVALUATING GAUSSIAN NAIVE BAYES MODEL---------
Accuracy:  0.7692307692307693
Precision:  0.8333333333333334
Recall:  0.7142857142857143
F1:  0.7692307692307692
ROC AUC:  0.773809523809524
Confusion Matrix: 
[[15  3]
 [ 6 15]]
 
-------EVALUATING BERNOULLI NAIVE BAYES MODEL---------
Accuracy:  0.7692307692307693
Precision:  0.8333333333333334
Recall:  0.7142857142857143
F1:  0.7692307692307692
ROC AUC:  0.773809523809524
Confusion Matrix: 
[[15  3]
 [ 6 15]]


**Random Forest Classifier**

This classifier constructs multiple decision trees. Each tree will make a prediction of the class and output the mode of all predictions. 

In [11]:
#trying different estimators for this one than the last few
#Build a sequence of random forest models for different numbers of estimators and tree depths.
n_est = [5, 10, 50, 100] #number of trees
depths = [3, 6, None]#how many nods to go into. None is the default depth

#for each estimator try each depth. have to do a forloop

for n in n_est:
    for depth in depths:
        mod = ensemble.RandomForestClassifier(n_estimators = n, max_depth= depth)
        mod.fit(x_train, y_train)
        preds = mod.predict(x_test)

        #Look at error metrics
        print("----- EVALUATING RANDOM FOREST: n_estimators = " + str(n) + ", max_depth = " + str(depth) + " -----")
        print("Accuracy:  " + str(accuracy_score(y_test, preds)))
        print("Precision:  " + str(precision_score(y_test, preds)))
        print("Recall:  " + str(recall_score(y_test, preds)))
        print("F1:  " + str(f1_score(y_test, preds)))
        print("ROC AUC:  " + str(roc_auc_score(y_test, preds)))
        print("Confusion Matrix: \n" + str(confusion_matrix(y_test, preds)))

----- EVALUATING RANDOM FOREST: n_estimators = 5, max_depth = 3 -----
Accuracy:  0.7948717948717948
Precision:  0.8095238095238095
Recall:  0.8095238095238095
F1:  0.8095238095238095
ROC AUC:  0.7936507936507937
Confusion Matrix: 
[[14  4]
 [ 4 17]]
----- EVALUATING RANDOM FOREST: n_estimators = 5, max_depth = 6 -----
Accuracy:  0.8461538461538461
Precision:  0.9411764705882353
Recall:  0.7619047619047619
F1:  0.8421052631578947
ROC AUC:  0.8531746031746031
Confusion Matrix: 
[[17  1]
 [ 5 16]]
----- EVALUATING RANDOM FOREST: n_estimators = 5, max_depth = None -----
Accuracy:  0.7948717948717948
Precision:  0.8421052631578947
Recall:  0.7619047619047619
F1:  0.8
ROC AUC:  0.7976190476190477
Confusion Matrix: 
[[15  3]
 [ 5 16]]
----- EVALUATING RANDOM FOREST: n_estimators = 10, max_depth = 3 -----
Accuracy:  0.7692307692307693
Precision:  0.875
Recall:  0.6666666666666666
F1:  0.7567567567567567
ROC AUC:  0.7777777777777777
Confusion Matrix: 
[[16  2]
 [ 7 14]]
----- EVALUATING RANDOM 

**Support Vector Machine Classifier**

This classifier creates a line or a hyperplane which seperates the data into classes. 

In [12]:
#Build a sequence of SVM models for different C values.
cs = [0.2, 0.5, 1.0, 2.0, 5.0, 10.0] #1.0 is default

for c in cs:
    mod = svm.SVC(C=c)
    mod.fit(x_train, y_train)
    preds = mod.predict(x_test)

    #Look at error metrics
    print("--------------------------EVALUATING SUPPORT VECTOR: C = " + str(c) + " --------------------")
    print("Accuracy:  " + str(accuracy_score(y_test, preds)))
    print("Precision:  " + str(precision_score(y_test, preds)))
    print("Recall:  " + str(recall_score(y_test, preds)))
    print("F1:  " + str(f1_score(y_test, preds)))
    print("ROC AUC:  " + str(roc_auc_score(y_test, preds)))
    print("Confusion Matrix: \n" + str(confusion_matrix(y_test, preds)))


--------------------------EVALUATING SUPPORT VECTOR: C = 0.2 --------------------
Accuracy:  0.5384615384615384
Precision:  0.5384615384615384
Recall:  1.0
F1:  0.7000000000000001
ROC AUC:  0.5
Confusion Matrix: 
[[ 0 18]
 [ 0 21]]
--------------------------EVALUATING SUPPORT VECTOR: C = 0.5 --------------------
Accuracy:  0.5384615384615384
Precision:  0.5384615384615384
Recall:  1.0
F1:  0.7000000000000001
ROC AUC:  0.5
Confusion Matrix: 
[[ 0 18]
 [ 0 21]]
--------------------------EVALUATING SUPPORT VECTOR: C = 1.0 --------------------
Accuracy:  0.5641025641025641
Precision:  0.5555555555555556
Recall:  0.9523809523809523
F1:  0.7017543859649122
ROC AUC:  0.5317460317460317
Confusion Matrix: 
[[ 2 16]
 [ 1 20]]
--------------------------EVALUATING SUPPORT VECTOR: C = 2.0 --------------------
Accuracy:  0.5897435897435898
Precision:  0.5714285714285714
Recall:  0.9523809523809523
F1:  0.7142857142857142
ROC AUC:  0.5595238095238094
Confusion Matrix: 
[[ 3 15]
 [ 1 20]]
------------

**5) Construct the best (i.e. least-error) possible model on this data set. What are the predictors used?**

Using the churn data, the best model turned out to be the Random Forest Model with an n_estimator = 5 and a maxdepth = None using all the features as predictors.

**6) Load the dataset “churn_validation.csv” into a new data frame and recode as necessary. Predict the outcomes for each of the customers and compare to the actual. What are the error rates you get based on your selected metrics?**

When running the validation data through the 'best model', the error did increase, but it still had the least error. However different n_estimators and maxdepths gave the least error for the validation than the original test data. 

----- EVALUATING RANDOM FOREST: n_estimators = 50, max_depth = 3 -----

Accuracy:  0.71875

Precision:  0.6111111111111112

Recall:  0.8461538461538461

F1:  0.7096774193548387

ROC AUC:  0.7388663967611335

Confusion Matrix: 

[[12  7]

[ 2 11]]

**7) Consider the best model you built for this problem. Is it a good model that can reliably be used for prediction? Why or why not?**
For this data set, it is a good model because it can handle missing data and irrelevant features, is one of the best-performing ML algorithms, it can provide feature and importance ratings easily, and this is a small set of data. If we had a larger set of data, a different algorithm may be more usefule since the random forest can be slow to train and may need parameter adjustments to acheive its best performance. 