## Gradient Boost / AdaBoost
The purpose of this notebook is to see whether we can raise our ROC_AUC and accuracy scores using two ensemble methods: gradient boosting and AdaBoost.

In [1]:
# Imports for data manipulation 
import pandas as pd
import numpy as np


# imports for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()


# imports of needed models/scoring metrics from sklearn 

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

In [2]:
df = pd.read_csv('Cleaned_Churn_Data.csv', index_col=0) # importing cleaned data frame

In [3]:
df.head() # quick viewing of data frame

| Unnamed: 0 | account length | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | p_number | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 128 | 0 | 1 | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | 4153824657 | 0 |
| 1 | 107 | 0 | 1 | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | 4153717191 | 0 |
| 2 | 137 | 0 | 0 | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 4153581921 | 0 |
| 3 | 84 | 1 | 0 | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 4083759999 | 0 |
| 4 | 75 | 1 | 0 | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 4153306626 | 0 |


In [4]:
# Separate the target column from the feature columns used for modeling
target = df['target']
df = df.drop(columns='target')

In [5]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.25, random_state=42)

In [6]:
# Instantiate an AdaBoostClassifier
adaboost_clf = AdaBoostClassifier(random_state=42)

# Instantiate a GradientBoostingClassifier
gbt_clf = GradientBoostingClassifier(random_state=42)

In [7]:
# Fit AdaBoostClassifier
adaboost_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=42)

In [8]:
# Fit GradientBoostingClassifier
gbt_clf.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [9]:
# AdaBoost model predictions
adaboost_train_preds = adaboost_clf.predict(X_train)
adaboost_test_preds = adaboost_clf.predict(X_test)

# GradientBoosting model predictions
gbt_clf_train_preds = gbt_clf.predict(X_train)
gbt_clf_test_preds = gbt_clf.predict(X_test) 

In [10]:
# Adapted from a function provided by Lear.co. Given the true labels, the
# predicted labels, and a model name, it prints the accuracy and ROC AUC scores
# so we can compare how each model scored on the train data and on the test data.
# Note: `preds` are hard class labels from .predict(), so the ROC AUC computed
# here is equivalent to balanced accuracy, not a probability-ranked AUC.

def display_acc_and_roc_auc_score(true, preds, model_name):
    acc = accuracy_score(true, preds)
    roc_auc = roc_auc_score(true, preds)
    print("Model: {}".format(model_name))
    print("Accuracy: {}".format(acc))
    print("ROC AUC Score: {}".format(roc_auc))
    
print("Training Metrics")
display_acc_and_roc_auc_score(y_train, adaboost_train_preds, model_name='AdaBoost')
print("")
display_acc_and_roc_auc_score(y_train, gbt_clf_train_preds, model_name='Gradient Boosted Trees')
print("")
print("Testing Metrics")
display_acc_and_roc_auc_score(y_test, adaboost_test_preds, model_name='AdaBoost')
print("")
display_acc_and_roc_auc_score(y_test, gbt_clf_test_preds, model_name='Gradient Boosted Trees')

Training Metrics
Model: AdaBoost
Accuracy: 0.8923569427771109
ROC AUC Score: 0.7068826502521924

Model: Gradient Boosted Trees
Accuracy: 0.9719887955182073
ROC AUC Score: 0.9068870861264121

Testing Metrics
Model: AdaBoost
Accuracy: 0.8669064748201439
ROC AUC Score: 0.6581382228490831

Model: Gradient Boosted Trees
Accuracy: 0.9460431654676259
ROC AUC Score: 0.8430634696755994
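
The `roc_auc_score` calls above receive hard class labels from `.predict()`. With hard labels, ROC AUC reduces to balanced accuracy, so it does not reflect how well a model ranks customers by churn risk. Below is a minimal sketch of the alternative, scoring with `predict_proba`. It uses synthetic data from `make_classification` as a stand-in for the churn data (an assumption, since the CSV is not reproduced here), so its numbers are not comparable to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (illustration only).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Hard 0/1 labels versus the predicted probability of the positive class.
hard_preds = clf.predict(X_test)
positive_probs = clf.predict_proba(X_test)[:, 1]

auc_from_labels = roc_auc_score(y_test, hard_preds)
auc_from_probs = roc_auc_score(y_test, positive_probs)
bal_acc = balanced_accuracy_score(y_test, hard_preds)

print("ROC AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:         ", bal_acc)
print("ROC AUC from probabilities:", auc_from_probs)
```

The first two printed values match, which is why the label-based score behaves like balanced accuracy. If the notebook were rerun with probabilities, all of the ROC AUC figures reported here would change, so the two kinds of score should not be mixed in one comparison.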


# Observation
On the test data, Gradient Boosted Trees has the highest accuracy (0.946) of the two models fit in this notebook, well ahead of AdaBoost (0.867). Its accuracy and ROC_AUC (0.843) are both reasonably high, with a spread of about 0.10 between them.

Below are the test scores of all three models we have tried so far, for comparison. Between Random Forest and Gradient Boosted Trees, the best model is Random Forest.

**The Why:**
Gradient Boosted Trees has slightly higher accuracy, but Random Forest has the higher ROC_AUC. When you compare Random Forest's ROC_AUC (0.910) with its accuracy (0.936), the spread between the two is small, about 0.03. This suggests that any overfitting is not severe enough to hurt how correctly the model predicts.

## Model: Random Forest

**Testing Metrics**

Accuracy: 0.9364508393285371

ROC AUC Score: 0.9097884344146685

## Model: AdaBoost

**Testing Metrics** 


Accuracy: 0.8669064748201439


ROC AUC Score: 0.6581382228490831

## Model: Gradient Boosted Trees

**Testing Metrics** 

Accuracy: 0.9460431654676259

ROC AUC Score: 0.8430634696755994
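
The spread between accuracy and ROC AUC that the comparison relies on can be computed directly from the test metrics listed above. This is a small worked example, not part of the original analysis; the dictionary and variable names are illustrative.

```python
# Test metrics copied from the three model summaries above.
reported_test_metrics = {
    "Random Forest": {"accuracy": 0.9364508393285371, "roc_auc": 0.9097884344146685},
    "AdaBoost": {"accuracy": 0.8669064748201439, "roc_auc": 0.6581382228490831},
    "Gradient Boosted Trees": {"accuracy": 0.9460431654676259, "roc_auc": 0.8430634696755994},
}

# Spread between the two scores for each model.
gaps = {
    name: scores["accuracy"] - scores["roc_auc"]
    for name, scores in reported_test_metrics.items()
}

# Print the models from smallest spread to largest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: accuracy - ROC AUC = {gap:.3f}")

smallest_gap_model = min(gaps, key=gaps.get)
print("Smallest spread:", smallest_gap_model)
```

Random Forest has the smallest spread (about 0.027), followed by Gradient Boosted Trees (about 0.103) and AdaBoost (about 0.209).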

# What Is Next 
After speaking with our DS 02172020 coach, Lindsey Berlin, we realized that the area code can be used as a categorical column to help the model identify which locations have the most customer churn. In the models so far, the area code was combined with the phone number in a single column (`p_number`) and treated as numerical rather than categorical. This discovery could be the change we need to push our model past the 0.90 mark in its scoring. We can tackle this with either target encoding or one-hot encoding.
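
A minimal sketch of the one-hot option is below. It assumes `p_number` is a 10-digit integer whose first three digits are the area code, which matches the five rows shown by `df.head()` but has not been checked against the full dataset. The `sample` frame and the `area_code` column name are illustrative.

```python
import pandas as pd

# Five p_number values copied from the df.head() output earlier in the notebook.
sample = pd.DataFrame(
    {"p_number": [4153824657, 4153717191, 4153581921, 4083759999, 4153306626]}
)

# Assumption: the first three of the ten digits are the area code.
# Casting to str makes pandas treat the value as a category, not a quantity.
sample["area_code"] = (sample["p_number"] // 10**7).astype(str)

# One indicator column per area code. Dropping p_number afterwards is a
# modeling choice for this sketch, not something the notebook has decided.
encoded = pd.get_dummies(
    sample.drop(columns="p_number"), columns=["area_code"], prefix="area_code"
)
print(encoded)
```

With only a few distinct area codes, one-hot encoding adds few columns. Target encoding would instead replace each code with a churn rate and should be fit on the training split only, to avoid leaking test labels.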