# Modeling Notebook

In this notebook I explore the different baseline models and ensemble methods. I decided to use the most common classification models as basic initial models, using Accuracy score as the main scoring metric, Recall since we want to minimize the False Positive predictions so the company does not focuses on customers that may not churn and F1 to look at the overall model score.

Since we want to make sure that each model predicts with the best accuracy to prevent any loss of attention or revenue on customers that are not recurrent churners or are possible to churn.

I am also using Cross Validation to see how each model reacts to data that has not been used in modeling. Cross Validation is scored by looking at the minimum score and the maximum score, the mean of all scores and the range. In my scoring, the lower the range the better the model performs to unseen data. Min and Max are utilized just as markers to how poor and how well the model performed. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import neighbors
from sklearn import metrics
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pd.set_option('display.max_columns', 1000)

In [2]:
df=pd.read_csv("modeling_customer_df.csv",index_col="Unnamed: 0")

In [3]:
df.head()

Unnamed: 0,senior_citizen,partner,dependents,tenure_months,phone_service,device_protection,tech_support,paperless_billing,monthly_charges,cltv,satisfaction_score,referred_a_friend,number_of_referrals,tenure_in_months,avg_monthly_long_distance_charges,avg_monthly_gb_download,device_protection_plan,premium_tech_support,streaming_music,unlimited_data,total_refunds,total_extra_data_charges,total_long_distance_charges,total_revenue,under_30,number_of_dependents,tel_internet_service,tel_online_securit,tel_online_backup,tel_multiple_lines,tel_streaming_tv,tel_total_charges,tel_streaming_movies,cit_Adelanto,cit_Anaheim,cit_Apple Valley,cit_Bakersfield,cit_Brea,cit_Cerritos,cit_Chula Vista,cit_Concord,cit_Crescent Mills,cit_El Monte,cit_Elk Grove,cit_Fremont,cit_Fresno,cit_Glendale,cit_Hayward,cit_Huntington Beach,cit_Inglewood,cit_Irvine,cit_Lakewood,cit_Lancaster,cit_Lompoc,cit_Long Beach,cit_Los Angeles,cit_Modesto,cit_Mountain View,cit_Oakland,cit_Pasadena,cit_Riverside,cit_Sacramento,cit_San Bernardino,cit_San Diego,cit_San Dimas,cit_San Francisco,cit_San Jose,cit_Santa Barbara,cit_Santa Monica,cit_Santa Rosa,cit_Smith River,cit_Stockton,cit_Sun City,cit_Temecula,cit_Whittier,zip_code_90623,zip_code_91010,zip_code_91206,zip_code_91762,zip_code_93245,zip_code_93702,zip_code_93711,zip_code_94027,zip_code_94520,latitude_33.8681,latitude_34.162515,latitude_36.739385,longitude_-121.55325,longitude_-120.653519,longitude_-119.82947,longitude_-118.24902,longitude_-118.203869,longitude_-117.815532,gender_Male,offer_Offer A,offer_Offer B,offer_Offer C,offer_Offer D,offer_Offer E,internet_type_DSL,internet_type_Fiber Optic,internet_type_None,age_20,age_30,age_40,age_50,age_60,age_70,tel_payment_method_Credit Card,tel_payment_method_Mailed Check,tel_contract_One Year,tel_contract_Two Year,y
0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,1.0,53.85,3239.0,1.0,0.0,0.0,2.0,10.47,21.0,0.0,0.0,0.0,1.0,0.0,0.0,20.94,129.09,0.0,0.0,1.0,1.0,1.0,0.0,0.0,108.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,2.0,1.0,0.0,0.0,1.0,70.7,2701.0,2.0,0.0,0.0,2.0,9.12,51.0,0.0,0.0,0.0,1.0,0.0,0.0,18.24,169.89,1.0,2.0,1.0,0.0,0.0,0.0,0.0,151.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,1.0,8.0,1.0,1.0,0.0,1.0,99.65,5372.0,3.0,0.0,0.0,8.0,12.15,26.0,1.0,0.0,1.0,1.0,0.0,0.0,97.2,917.7,0.0,2.0,1.0,0.0,0.0,1.0,1.0,820.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,104.8,5003.0,3.0,0.0,0.0,28.0,4.89,47.0,1.0,1.0,1.0,1.0,0.0,0.0,136.92,3182.97,1.0,3.0,1.0,0.0,0.0,1.0,1.0,3046.05,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,49.0,1.0,1.0,0.0,1.0,103.7,5340.0,1.0,0.0,0.0,49.0,44.33,11.0,1.0,0.0,1.0,1.0,0.0,0.0,2172.17,7208.47,0.0,1.0,1.0,0.0,1.0,1.0,1.0,5036.3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [4]:
#Splitting the data into X and Y
y=df["y"]
X=df.drop(["y"],axis=1)

# Vanilla Models

I am using several classification models like Logistic Regression, KNN, RandomForest and also boosted models such as XGBoost and GradientBoost. KNN and RandomForest were selected to test if the data is able to be classified aggregately but the goal is to find a model that is able to be interpreted and could benefit the company in seeing what factors can affect a customer churn.

In [5]:
# Creating Train Test Split.
X_test,X_train,y_test,y_train=train_test_split(X,y, test_size=.2, random_state=2020)

In [6]:
#this function was created to visualize scores and create a dataframe with each score.
def scoring(preds , model_name: str,model, cv=10, xtrain=X_train):
    from sklearn.model_selection import cross_val_score
    """
    The function statrts by returning the training scores of the model then, it takes in predictions of the
    model, model, model name, and times that model will be cross validated and returns the Accuracy, 
    Recall, F1 score and the CV Min, Max, Mean and Range scores based on accuracy.
    
    The function also creates a DF that saves all the infomation into a dataframe, making changes globaly. 
    """
    model.verbose=False
    def warn(*args, **kwargs):
        pass
    import warnings
    warnings.warn = warn
    
    if "scores_df" not in globals():
        global scores_df 
        scores_df = pd.DataFrame(columns=["Name","Accuracy","Recall","F1","CV Min","CV Max","CV Mean","CV Range"])
        
    if "scores_df" in globals():
        
        if "Grid"in(model_name).split():
            
            model.verbose=False
            print ("Training Accuracy: {}".format(metrics.accuracy_score(y_train, model.best_estimator_.predict(xtrain))))
            print ("Training Recall: {}".format(metrics.recall_score(y_train, model.best_estimator_.predict(xtrain))))
            print ("Training F1: {}".format(metrics.f1_score(y_train, model.best_estimator_.predict(xtrain))))

            vars()[model_name+"_recc"]=metrics.recall_score(y_test, preds)
            vars()[model_name+"_acc"]=metrics.accuracy_score(y_test, preds)
            vars()[model_name+"_f1"]=metrics.f1_score(y_test,preds)

            print("\n"+model_name+" Accuracy: {}".format(vars()[model_name+"_acc"]))
            print(model_name+" Recall: {}".format(vars()[model_name+"_recc"]))
            print(model_name+" F1: {}".format(vars()[model_name+"_f1"]))
            def warn(*args, **kwargs):
                pass
                
            import warnings
            warnings.warn = warn


            scores = cross_val_score(model.best_estimator_, xtrain, y_train, cv=cv)

            print("\nCross-Validation ({} times) Accuracy Scores:".format(cv))    
            print('Min: ', round(scores.min(), 6),'  Max: ', round(scores.max(), 6),'  Mean: ', round(scores.mean(), 6), '  Range: ', round(scores.max() - scores.min(), 6))

        else:
            
            model.verbose=False
            print ("Training Accuracy: {}".format(metrics.accuracy_score(y_train, model.predict(xtrain))))
            print ("Training Recall: {}".format(metrics.recall_score(y_train, model.predict(xtrain))))
            print ("Training F1: {}".format(metrics.f1_score(y_train, model.predict(xtrain))))

            vars()[model_name+"_recc"]=metrics.recall_score(y_test, preds)
            vars()[model_name+"_acc"]=metrics.accuracy_score(y_test, preds)
            vars()[model_name+"_f1"]=metrics.f1_score(y_test,preds)

            print("\n"+model_name+" Accuracy: {}".format(vars()[model_name+"_acc"]))
            print(model_name+" Recall: {}".format(vars()[model_name+"_recc"]))
            print(model_name+" F1: {}".format(vars()[model_name+"_f1"]))

            def warn(*args, **kwargs):
                pass
                
            import warnings
            warnings.warn = warn

            scores = cross_val_score(model,xtrain , y_train, cv=cv, verbose=False)

            print("\nCross-Validation ({} times) Accuracy Scores:".format(cv))    
            print('Min: ', round(scores.min(), 6),'  Max: ', round(scores.max(), 6),'  Mean: ', round(scores.mean(), 6), '  Range: ', round(scores.max() - scores.min(), 6))


        if model_name not in list(scores_df.Name):
            
            scores_df=scores_df.append({'Name':model_name, "Accuracy":vars()[model_name+"_acc"] , "Recall": vars()[model_name+"_recc"],"F1":vars()[model_name+"_f1"],"CV Min":scores.min() ,"CV Max":scores.max(),"CV Mean":scores.mean(), "CV Range": (scores.max() - scores.min())}, ignore_index=True)
    
        else:
            i=list(scores_df[scores_df.Name==model_name].index)
            scores_df.drop(scores_df.index[i], inplace=True) 
            scores_df=scores_df.append({'Name':model_name, "Accuracy":vars()[model_name+"_acc"] , "Recall": vars()[model_name+"_recc"],"F1":vars()[model_name+"_f1"],"CV Min":scores.min() ,"CV Max":scores.max(),"CV Mean":scores.mean(), "CV Range": (scores.max() - scores.min())}, ignore_index=True)
    





### Logistic Regression

I decided to start with Logistic Regression since its a model that is very easy to interpret and can also be tunned quite easily. 

* The model is tunned by setting max iterations to 100000 since we want the model to converge.
* Random State of 40 is maintained in all the models for continuity.
* Class Weight Balanced is used, since there is a class imbalance with the target variable.

In [7]:
# Instantiating Logistic Regression with Max Iter of 10000 so the model can reach convergence
log_reg=LogisticRegression(max_iter=100000, random_state=40, class_weight="balanced")

In [8]:
# Fitting and predicting results.
log_reg.fit(X_train,y_train)

log_pred=log_reg.predict(X_test)

In [9]:
# Scoring function used to see the results. 
scoring(log_pred , "Logistic Rregression", log_reg, cv=5)

Training Accuracy: 0.9616749467707594
Training Recall: 0.9742120343839542
Training F1: 0.9264305177111716

Logistic Rregression Accuracy: 0.9439119630812921
Logistic Rregression Recall: 0.9407894736842105
Logistic Rregression F1: 0.9005037783375315

Cross-Validation (5 times) Accuracy Scores:
Min:  0.939502   Max:  0.968085   Mean:  0.949602   Range:  0.028583


This model has performed with great results. The Accuracy and Recall have high scores that shows that the model is doing a great job at classifying the data.

CV maintain similar scores to the ones resulted from the modeling step. With a range of 0.02 shows that there is little change when the model is exposed to new data and that there is very few issues at classifying. 

### KNN

I decided to use KNN since it want to see if the data is able to be predicted by neighboring features. I want to use a K of 10 since using a low K could not be helpful at creating decision lines.

In [10]:
# instantiating KNN with a N Neighbors of 10 as baseline.
knn=KNeighborsClassifier(n_neighbors=10)

In [11]:
# Fitting and predicting
knn.fit(X_train,y_train)

knn_pred=knn.predict(X_test)

In [12]:
# Scoring function used to view results.
scoring(knn_pred , "KNN", knn, cv=5)

Training Accuracy: 0.7771469127040455
Training Recall: 0.20630372492836677
Training F1: 0.314410480349345

KNN Accuracy: 0.7255946041888534
KNN Recall: 0.1256578947368421
KNN F1: 0.19813278008298757

Cross-Validation (5 times) Accuracy Scores:
Min:  0.72695   Max:  0.765125   Mean:  0.743805   Range:  0.038174


The model performed terribly. The training scores show that the model is not good at understanding the data. The testing scores show that the model is unable to classify each prediction. with a Recall so low and F1 of similar value, this model is not a model that I can pursue.

### Random Forest

I continued with Random Forest. Although Random Forest is not an interpretable model, it should be explored. the model was tuned using:

* Max Depth of 5 since we do not want the model to overfit the training data.
* Class Weight "Balanced" is used since the data is not balanced and we want to mitigate that. 

In [13]:
# Instantiating Random Fores with a max depth of 20 so the model has enough room to grow. 
RF=RandomForestClassifier(max_depth=5,verbose=True, random_state=40,class_weight="balanced")

In [14]:
RF.fit(X_train,y_train)

rf_pred=RF.predict(X_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished


In [15]:
scoring(rf_pred , "Random Forest", RF, cv=5)

Training Accuracy: 0.9325762952448545
Training Recall: 0.9369627507163324
Training F1: 0.8731642189586115

Random Forest Accuracy: 0.9185303514376997
Random Forest Recall: 0.9243421052631579
Random Forest F1: 0.8595900887121444

Cross-Validation (5 times) Accuracy Scores:
Min:  0.900709   Max:  0.925532   Mean:  0.913412   Range:  0.024823


The results of this model are better than what I got from KNN but not as great as Logistic Regression. The model has great results for our training scores, although training F1 as well as the testing F1 is not too great. The Accuracy in both training and test scores has a really great score and with the Cross Validation, we are able to see that the model trained very well since the Min and Max scores are close to the final score. 

## Boosted Models

After trying basic models, I wanted to explore boosted models to see if there is better models to explore further.

### XGBoost

I decided to explore XGBoost since its highly efficient and flexible. Since it implements machine learning algorithms under the Gradient Boosting framework. The tunning parameters that were used are:
* Use Label Encoder False since all the data is already in integer values. 
* Objective "Binary:Logistic" since we are predicting a categorical target. 
* Evaluation Metric "Logloss" afterall, we want to measure how likely did the model think the actually observed set of outcomes was.
* Learning rate of 2, since we want to make it easier at reaching the best optimum.

In [16]:
XGB = XGBClassifier(use_label_encoder=False, objective = "binary:logistic", eval_metric = "logloss", learning_rate = 2, random_state=40)


In [17]:
XGB.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=2, max_delta_step=0,
              max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, random_state=40, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              use_label_encoder=False, validate_parameters=1, verbosity=None)

In [18]:
xg_pred = XGB.predict(X_test)

In [19]:
scoring(xg_pred , "XGBoost", XGB, cv=5)

Training Accuracy: 1.0
Training Recall: 1.0
Training F1: 1.0

XGBoost Accuracy: 0.9416045438409656
XGBoost Recall: 0.881578947368421
XGBoost F1: 0.8906613492854769

Cross-Validation (5 times) Accuracy Scores:
Min:  0.911348   Max:  0.950355   Mean:  0.933987   Range:  0.039007


In [20]:
# scores_df #Uncomment to view table

In the end, this model performed pretty great. The Accuracy and Recall are good scores, after the CV the model has a very low range meaning that the values are within thee expected output of the model. The downside is that the model is overfitting to the training set. This is not something ideal for any future data that will be passed on the model.

### Gradient Boost

Lastly I want to explore Gradient Boost. Gradient Boost is the basis of XGBoost and works similarly to Logistic Regression. In the end I want a more interpretable model for the company to see what factors are affecting the churn of customers. 

The parameters used for Gradient Boost are:
* N Estimators of 100 since we need the model to be able to converge. 
* Learning Rate of 5 maintaining a low learning rate we can assure the model would reach the optimum value.
* Max Depth of 4 this penalty is used to make sure the model does not overfit to the data.
* Criterion as "Mean Square Error" is used to measure the quality of a split.


In [21]:
reg = GradientBoostingClassifier(n_estimators=100, learning_rate=5.0, max_depth=4.0, criterion="mse", random_state=40)


In [22]:
reg.fit(X_train, y_train)

GradientBoostingClassifier(criterion='mse', learning_rate=5.0, max_depth=4.0,
                           random_state=40)

In [23]:
reg_pred=reg.predict(X_test)

In [24]:
scoring(reg_pred , "Gradient Boost", reg, cv=5)

Training Accuracy: 0.6919801277501775
Training Recall: 0.9283667621776505
Training F1: 0.5988909426987061

Gradient Boost Accuracy: 0.6732339368122116
Gradient Boost Recall: 0.905921052631579
Gradient Boost F1: 0.5993471164309031

Cross-Validation (5 times) Accuracy Scores:
Min:  0.450355   Max:  0.879433   Mean:  0.72745   Range:  0.429078


In [25]:
# scores_df #Uncomment to view table

In the end this model did not perform as expected. The scores are all over the place, a very high Recall, very low Accuracy and F1 and the scores are similar on the training and test set. Looking at the CV, the range is the highest of all the models tried so far. But this indicates that the model after tunning could yield a really good model. 

## Scores

In [26]:
# Appending the vanilla scores into Dataframe. 
scores_df

Unnamed: 0,Name,Accuracy,Recall,F1,CV Min,CV Max,CV Mean,CV Range
0,Logistic Rregression,0.943912,0.940789,0.900504,0.939502,0.968085,0.949602,0.028583
1,KNN,0.725595,0.125658,0.198133,0.72695,0.765125,0.743805,0.038174
2,Random Forest,0.91853,0.924342,0.85959,0.900709,0.925532,0.913412,0.024823
3,XGBoost,0.941605,0.881579,0.890661,0.911348,0.950355,0.933987,0.039007
4,Gradient Boost,0.673234,0.905921,0.599347,0.450355,0.879433,0.72745,0.429078


After reviewing all baseline models, it is very clear that Logistic Regression is the best model overall, although Random Forest and XGBoost provided a great Accuracy, it does not compare with the results that came from Logistic Regression, but maybe there is room for improvement in all three. Gradient Boost do shows great promise if tis explored with GridSearch. I will further explore PCA and GridSearch to better Logistic Regression, Random Forest and Gradient boost. 

# PCA

Using PCA to help improve the Logistic Regression Model, PCA was chosen because can be used also as an interpretable model, this is important since we want to help the company lower the customer churn by looking at the features.

Since we had a large dataset and was reduced to the most important features by using RFE, PCA could also help reduce the curse of dimensionality. The data, so far, has also not been scaled and will be for PCA since PCA requires the data to be in the same scale. For that we are using the StandardScaler.

In [27]:
# Scaling & fitting training data and fitting the testing data.
sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

For PCA we want to reduce the number of components to 2. We want the model to be able to handle the dimensionality by using 2 components to explain the data. 

In [28]:
# Creating 2 principal components
pca=PCA(n_components=2)


X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [29]:
# Using the baseline Logistic Regression model for the PCA data.
log_reg_pca=LogisticRegression(max_iter=1000, random_state=40)
log_reg_pca.fit(X_train_pca,y_train)

log_pred_pca=log_reg_pca.predict(X_test_pca)

In [30]:
scoring(log_pred_pca , "Logistic Regression with PCA", log_reg_pca, cv=5, xtrain=X_train_pca)

Training Accuracy: 0.8246983676366217
Training Recall: 0.5730659025787965
Training F1: 0.6182380216383306

Logistic Regression with PCA Accuracy: 0.8186013489527867
Logistic Regression with PCA Recall: 0.5730263157894737
Logistic Regression with PCA F1: 0.6302460202604921

Cross-Validation (5 times) Accuracy Scores:
Min:  0.804965   Max:  0.85461   Mean:  0.825398   Range:  0.049645


In [31]:
# scores_df #Uncomment to view table

After comparing the PCA Logistic Regression with the baseline logistic regression, it clearly did not performed as  expected. Recall suffered as well as the F1 score, although CV does have a low range it doesn't look like there could be much improvement.In the end, it doesn't seem like the right path to take, specially since Accuracy is the most important metric for this model. 

# Grid Search

### Grid Search for Random Forest

I decided to use GridSearch first on RandomForest to see if i am able to improve on the results that the baseline model brought. The parameters that were chosen to search are:
* Max Depth with a range of 5 to 30. Although an extreme range, we want to check what penalty could be the best for such model.
* Minimum Samples Leaf with a range of 1 to 5, by maintaining a low leaf node we make sure that the data is mostly evenly split and adds some bias to the model.
* Minimum Samples Split with a range of 1 to 5, by maintaining a low split we make sure that the data is not fully overfit. 
* BootStrap to see the difference of having the combined the predictions from multiple machine learning algorithms together to make more accurate predictions or not having such predictions.
* Class Weight "Balanced" and "None" to verify that having a "Balanced" data is the best option.

In [32]:
params_RF={
    "max_depth":range(5,30,5),
    "min_samples_leaf":range(1,5,1),
    "min_samples_split":range(1,5,1),
    "bootstrap":["True","False"],
    "class_weight":["None","balanced"]  
}

In [33]:
RF=RandomForestClassifier(random_state=40)
grid_rf=GridSearchCV(RF, params_RF, cv=10, scoring="f1", verbose=1, n_jobs=-1)

In [34]:
grid_rf.fit(X_train,y_train)

Fitting 10 folds for each of 320 candidates, totalling 3200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 1603 tasks      | elapsed:   36.2s
[Parallel(n_jobs=-1)]: Done 2682 tasks      | elapsed:   50.0s
[Parallel(n_jobs=-1)]: Done 3032 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 3200 out of 3200 | elapsed:  1.2min finished


GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=40),
             n_jobs=-1,
             param_grid={'bootstrap': ['True', 'False'],
                         'class_weight': ['None', 'balanced'],
                         'max_depth': range(5, 30, 5),
                         'min_samples_leaf': range(1, 5),
                         'min_samples_split': range(1, 5)},
             scoring='f1', verbose=1)

In [35]:
print("Best Gridsearch Score: {}".format(grid_rf.best_score_))
print("Best Parameters: {}".format(grid_rf.best_params_))
print("Best Estimators: {}".format(grid_rf.best_estimator_))

Best Gridsearch Score: 0.8960368756424364
Best Parameters: {'bootstrap': 'True', 'class_weight': 'balanced', 'max_depth': 25, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best Estimators: RandomForestClassifier(bootstrap='True', class_weight='balanced', max_depth=25,
                       min_samples_leaf=2, random_state=40)


It comes with no surprise that the model chose a Class Weight of Balanced as well as a Bootstrap of True. It is interesting that a max depth of 25 was chosen since its quite high but when it came to the Min Samples split and Leaf a lower number was chosen. 

In [36]:
grid_rf_pred=grid_rf.best_estimator_.predict(X_test)

In [37]:
scoring(grid_rf_pred , "Grid Search Random Forest", grid_rf, cv=10)

Training Accuracy: 0.9971611071682044
Training Recall: 0.994269340974212
Training F1: 0.994269340974212

Grid Search Random Forest Accuracy: 0.9508342208022719
Grid Search Random Forest Recall: 0.8855263157894737
Grid Search Random Forest F1: 0.9067025934658135

Cross-Validation (10 times) Accuracy Scores:
Min:  0.900709   Max:  0.971631   Mean:  0.950329   Range:  0.070922


In [38]:
# scores_df#Uncomment to view table

Overall GridSearch did improve the Random Forest significantly, the Accuracy score did improve by 0.04 from the original Random forest, Recall took a hit but the F1 score stayed around the same. It seems that the model could be improved further but it does not compare to the original Logistic Regression scores that were pretty good from the getgo. Next I will explore Logistic Regression with Grid Search.

### Grid Search for Logistic Regression

After not getting much improvement from PCA, maybe GridSearch could bring up the scores for Logistic Regression and make a better model by tunning specific parameters. 

For tunning parameters the ones used are:
* Penalty of "L1" and "L2" since I want to compare both Ridge and Lasso as penalties to see if either one could best interpret the features.
* Class Weight "Balanced" since there is high class imbalance in the data.
* Max Iterations with a range of 1000 to 10000 to make sure the model is able to converge.
* Warm Start "True" and "False" to compare how can a warm start benefit the CV process. 


In [39]:
#Choosing Parameter for the model.
params_lg={
    "penalty":["l1","l2"],
    "class_weight":["balanced"],
    "max_iter":range(1000,10000,100),
    "warm_start":["True","False"]   
}

In [40]:
# using vanilla Logistic Regression model and keeping the Random state for continuity and reproducibility
log_reg=LogisticRegression(random_state=40)
grid_lg=GridSearchCV(log_reg, params_lg, cv=5, scoring="accuracy", verbose=1, n_jobs=-1)

In [41]:
grid_lg.fit(X_train,y_train)

Fitting 5 folds for each of 360 candidates, totalling 1800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done 224 tasks      | elapsed:   14.7s
[Parallel(n_jobs=-1)]: Done 474 tasks      | elapsed:   46.0s
[Parallel(n_jobs=-1)]: Done 824 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 1274 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 1800 out of 1800 | elapsed:  7.6min finished


GridSearchCV(cv=5, estimator=LogisticRegression(random_state=40), n_jobs=-1,
             param_grid={'class_weight': ['balanced'],
                         'max_iter': range(1000, 10000, 100),
                         'penalty': ['l1', 'l2'],
                         'warm_start': ['True', 'False']},
             scoring='accuracy', verbose=1)

In [42]:
print("Best Gridsearch Score: {}".format(grid_lg.best_score_))
print("Best Parameters: {}".format(grid_lg.best_params_))
print("Best Estimators: {}".format(grid_lg.best_estimator_))

Best Gridsearch Score: 0.9510234471618586
Best Parameters: {'class_weight': 'balanced', 'max_iter': 9400, 'penalty': 'l2', 'warm_start': 'True'}
Best Estimators: LogisticRegression(class_weight='balanced', max_iter=9400, random_state=40,
                   warm_start='True')


It seemed that Ridge was the best at keeping the relevant features for the model to perform optimally. It came as no surprise that balancing the class is the best way to have a model improve but it seems that the max iterations was loser to the max of the range and this could be improved in the second iteration of GridSearch with the tunned parameters. 

In [43]:
gs_pred=grid_lg.best_estimator_.predict(X_test)

In [44]:
scoring(gs_pred , "Grid Search Logistic Regression", grid_lg, cv=5)

Training Accuracy: 0.9623846699787083
Training Recall: 0.9742120343839542
Training F1: 0.927694406548431

Grid Search Logistic Regression Accuracy: 0.9440894568690096
Grid Search Logistic Regression Recall: 0.9414473684210526
Grid Search Logistic Regression F1: 0.9008498583569404

Cross-Validation (5 times) Accuracy Scores:
Min:  0.932384   Max:  0.968085   Mean:  0.950307   Range:  0.035701


In [45]:
# scores_df#Uncomment to view table

This model shows improvement although its a minimal improvement by using the best parameters and adding more tunning it will provide the best model for predicting.

### Grid Search for Logistic Regression (Best Params) 

Now that we ran our first GridSearch, I will use the best parameter for the second run. Here I added more tunning to the already set parameters.

* C of 1 and 5. to see which value of C could to disincentivize and regulate against Overfitting.
* Max Iterations with a range of 9000 to 10000 since the previous iteration of GridSearch found the value to be 9400 and now, I want to explore a more detail range. 
* Solver of "lbfgs" and "Sags" since lbfgs is the default solver, I'd like to compare it with Sags since its more beneficial for larger datasets.

In [46]:
#Choosing Parameter for the model.
params={
    "C":[1,5],
    "max_iter":range(9000,10000,10),
    "solver":['lbfgs', 'sags']
}

In [47]:
# using vanilla Logistic Regression model and keeping the Random state for continuity and reproducibility
log_reg2= LogisticRegression(class_weight='balanced', random_state=40, warm_start='True')
grid_lg2=GridSearchCV(log_reg2, params, cv=5, scoring="f1", verbose=1, n_jobs=-1)

In [None]:
grid_lg2.fit(X_train,y_train)

Fitting 5 folds for each of 400 candidates, totalling 2000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:   20.3s
[Parallel(n_jobs=-1)]: Done 200 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 450 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 800 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 1250 tasks      | elapsed:  9.6min
[Parallel(n_jobs=-1)]: Done 1800 tasks      | elapsed: 13.8min


In [None]:
print("Best Gridsearch Score: {}".format(grid_lg2.best_score_))
print("Best Parameters: {}".format(grid_lg2.best_params_))
print("Best Estimators: {}".format(grid_lg2.best_estimator_))

In [None]:
gs_pred2=grid_lg2.best_estimator_.predict(X_test)

In [None]:
scoring(gs_pred2 , "Grid Search Logistic Regression 2", grid_lg2, cv=5)

In [None]:
# scores_df #Uncomment to view table

### Gridsearch Gradient Boost

Lastly I want to explore Gradient Boost. Gradient boost seemed the most promising since it's scores seemed a bit erratic. I believe with a bit of tunning it could show much improvement. The tunning parameters or Gradient boosting are":

* Min Samples Split with a range of 2 to 10, I want to see if a lower or higher value could return a better score. 
* Max Depth with a range of 1 to 5, by maintaining lower scores to reduce overfitting. 
* Max Features with "Auto", "Sqrt" and "Log 2" to see witch of the exponential number of features to consider when looking for the best split.

In [None]:
#Choosing Parameter for the model.
params_gb={
    "min_samples_split" : range(2,10,1),
    "max_depth" : range(1,5,1),
    "max_features" : ['auto', 'sqrt', 'log2']   
}

In [None]:
# using vanilla Logistic Regression model and keeping the Random state for continuity and reproducibility
gb = GradientBoostingClassifier(random_state=40)
grid_gb=GridSearchCV(gb, params_gb, cv=5, scoring="accuracy", verbose=1, n_jobs=-1)


In [None]:
grid_gb.fit(X_train,y_train)

In [None]:
print("Best Gridsearch Score: {}".format(grid_gb.best_score_))
print("Best Parameters: {}".format(grid_gb.best_params_))
print("Best Estimators: {}".format(grid_gb.best_estimator_))

Interestingly, the best params for Max Depth and Min Samples Split are on the higher end of the range that was provided.

In [None]:
grid_gb_pred=grid_gb.best_estimator_.predict(X_test)

In [None]:
scoring(grid_gb_pred , "Grid Search Gradient Boost", grid_gb, cv=5)

In [None]:
# scores_df #Uncomment to view table

This model model has performed the best out of all the models that have been explored. With great Accuracy and F1 score, although Recall took a hit it could be improved in the next round of Gridsearch. Looking at the CV scores, the range is very low which is good since all the folds have around the same value of the test score. 

### Gridsearch Gradient Boost Best Params

Now that we see how good the results we got after Gridsearch, the best parameters were implemented to the second round of Gridsearch. Now the new parameters to tune were added:

* Loss of "Deviance" and "Exponential" "Deviance" refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss "Exponential" gradient boosting recovers the AdaBoost algorithm.
* Learning Rate of range 0.1 to 5 to shrink the contribution of each tree by learning_rate.

In [None]:
params_gb2={
   "loss" : ['deviance', 'exponential'],
   "learning_rate" : np.arange(0.1,5,0.1),
}

In [None]:
gb =  GradientBoostingClassifier(max_depth=4, max_features='sqrt', min_samples_split=9, random_state=40)
grid_gb2=GridSearchCV(gb, params_gb2, cv=5, scoring="accuracy", verbose=1, n_jobs=-1)


In [None]:
grid_gb2.fit(X_train,y_train)

In [None]:
print("Best Gridsearch Score: {}".format(grid_gb2.best_score_))
print("Best Parameters: {}".format(grid_gb2.best_params_))
print("Best Estimators: {}".format(grid_gb2.best_estimator_))

It came to no surprise that the learning rate came out to 1.1 since 1 is the optimal value. 

In [None]:
grid_gb_pred2=grid_gb2.best_estimator_.predict(X_test)

In [None]:
scoring(grid_gb_pred2 , "Grid Search Gradient Boost 2", grid_gb2, cv=5)

In [None]:
# scores_df #Uncomment to view table

Overall this model came out with the best scores overall. With great Accuracy and F1, as well as an improved Recall, we could see this model performing great specially after looking at the CV scores since they seem to be with minimal difference in each fold. The only concerning part is that the Training scores are Overfit, this could be mitigated by some more regularization but overall after comparing the CV and the final scores it seems that this can be the final model for the company's churn algorithm. 

# Conclusions 

Below we can find a graph with each result. We see that the model did a great job overall at predicting each category. Although we have a high False Negative, we mostly wanted to focus on our False Positive since we want to mitigate the company focusing resources to customers who are predicted to churn when they're actually not.

In [None]:
pred = confusion_matrix(y_test, grid_gb_pred2)
plt.figure(figsize=(8,8))

sns.heatmap(pred.T, square=True, annot=True, fmt='d',cmap="PuBu", cbar=False,
            xticklabels=['Stayed', 'Churned'], yticklabels=['Stayed', 'Churned'])
sns.set(font_scale=2)
plt.xlabel('True Values',fontsize=20)
plt.ylabel('Predicted Values',fontsize=20);
plt.savefig('/Users/carlosruiz/Desktop/Mod_5_project/Images/matrix', dpi=300)

In conclusion, GridSearch with Logistic Gradient Boost proved the best model at predicting Accuracy and F1. By using this model assures that all predictions would be accurate and the company can focus on customer that are possibly churning or have churned and base their attention in recovering them or bettering their services to maintain their customer base. 

In [None]:
# Creating a DF with the coeficients for each feature for futher visualiztions.
coef=pd.DataFrame(zip(X_train.columns, np.transpose(grid_lg.best_estimator_.coef_[0])), columns=['features', 'coef'])
coef.iloc[list(coef.coef.sort_values(ascending=False).index)].reset_index(drop=True,inplace=True)
coef.to_csv("coeficient.csv")


To see the visualization of Gradient Boost coefficients please go to the [Visualizations Notebook](Visualizations_Notebook.ipynb) to view the EDA process go to the [EDA Notebook](EDA_Notebook.ipynb)