<hr style="height:.6px;color:#333;" />
<h1><b>Classification Model Development</b></h1>
<h2>Xitlali Magana</h2>
<br>
<hr style="height:.3px;color:#333;" />

I begin by importing the libraries used throughout: numpy, pandas, matplotlib, seaborn, statsmodels, and scikit-learn. I then load the dataset "Cross_Sell_Success_Dataset_2023.xlsx" with pandas' read_excel() function and assign it to the variable css. Next, I set some pandas options to adjust how data frames are displayed. Finally, I display the first 5 rows with head(n = 5) to get a quick preview of the data and confirm that it loaded correctly.

In [1]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import statsmodels.formula.api as smf
import seaborn as sns    
from sklearn.metrics import confusion_matrix         
from sklearn.metrics import roc_auc_score          
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.neighbors import KNeighborsRegressor   
from sklearn.preprocessing import StandardScaler 
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestClassifier

# loading data
css = pd.read_excel(io = 'Cross_Sell_Success_Dataset_2023.xlsx')

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)


# displaying the head of the dataset
css.head(n = 5)

Unnamed: 0,CROSS_SELL_SUCCESS,EMAIL,REVENUE,TOTAL_MEALS_ORDERED,UNIQUE_MEALS_PURCH,CONTACTS_W_CUSTOMER_SERVICE,PRODUCT_CATEGORIES_VIEWED,AVG_TIME_PER_SITE_VISIT,CANCELLATIONS_AFTER_NOON,PC_LOGINS,MOBILE_LOGINS,WEEKLY_PLAN,LATE_DELIVERIES,AVG_PREP_VID_TIME,LARGEST_ORDER_SIZE,AVG_MEAN_RATING,TOTAL_PHOTOS_VIEWED
0,1,steffon.baratheon@yahoo.com,4920.0,493,9,1,10,265.6,5,5,2,0,0,137.41,6,2.894737,456
1,0,harlon.greyjoy@visa.com,6150.0,361,9,1,6,247.0,2,5,1,0,0,120.2,5,2.631579,680
2,0,monster@protonmail.com,3435.0,278,6,1,4,164.4,0,6,1,5,0,127.0,3,3.684211,145
3,1,damon.lannister.(lord)@yahoo.com,3330.0,269,8,1,2,176.0,5,5,2,0,0,129.78,6,3.157895,418
4,1,raynald.westerling@jnj.com,3427.5,276,7,1,10,164.6,0,6,1,14,0,34.42,3,3.157895,174


In the following code I begin by cleaning the data. This includes checking for missing values and inspecting the column names. The features did not have any missing values, so nothing needed to be done there. For the column names, I renamed the column LATE_DELIVERIES to remove a trailing space and avoid complications later. Once the data was clean, I ran a for loop to create a histogram for every feature in the dataset so I could see each distribution and check for skewness. This revealed 7 features that were significantly skewed. To address this, I created a list of the skewed features and ran another for loop that takes the natural log of each one (after adding a small offset of 0.001 so that zeros are defined) and stores the result in a new column. Finally, I dropped the original skewed columns, along with EMAIL, and kept the log-transformed versions.

In [2]:
####################### data cleaning ############################
# checking each feature for missing values
css.isnull().sum().round(decimals=2)
# checking column names
css.columns
# renaming the late deliveries column to remove the space
css = css.rename(columns={"LATE_DELIVERIES ":"LATE_DELIVERIES"})

####################### HISTPLOTS ############################
# Iterating over the columns in the css DataFrame
""" (commented out because it is only needed for visualization)
for col in css.columns:
    # creating a histplot for every column
    sns.histplot(data=css, 
                 x=col, 
                 kde=False)
    # adding a title to each histogram
    plt.title(f"Histogram of {col}")
    # displaying the histogram
    # plt.show() 
"""
    
####################### LOG TRANSFORMATIONS ############################
# list of columns that appear to be skewed
skewed_columns = ['REVENUE', 'TOTAL_MEALS_ORDERED', 'PRODUCT_CATEGORIES_VIEWED',
                  'CANCELLATIONS_AFTER_NOON', 'WEEKLY_PLAN', 'LATE_DELIVERIES',
                 'TOTAL_PHOTOS_VIEWED']
# for loop to create a new column for each skewed column with a log transformation
for col in skewed_columns:
    if col in css.columns:
        css['log_' + col] = np.log(css[col] + 0.001)
        
########################### DATA DROP ############################## 
# dropping EMAIL and the original skewed columns (replaced by their log versions)
css = css.drop(['EMAIL', 'REVENUE', 'TOTAL_MEALS_ORDERED', 'PRODUCT_CATEGORIES_VIEWED',
                'CANCELLATIONS_AFTER_NOON', 'WEEKLY_PLAN', 'LATE_DELIVERIES', 
                'TOTAL_PHOTOS_VIEWED'],
               axis = 1)
# checking results
# css.columns 
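
The skew check in the cell above was done visually with histograms. As a complement, skewness can also be measured numerically before and after the transformation. The sketch below is illustrative only: it uses a synthetic, right-skewed column (the name REVENUE and the lognormal parameters are assumptions, not the project data) and applies the same np.log(x + 0.001) transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one right-skewed column such as REVENUE
rng = np.random.default_rng(219)
demo = pd.DataFrame({"REVENUE": rng.lognormal(mean=7.5, sigma=0.6, size=1000)})

# Sample skewness: values well above 0 indicate a long right tail
skew_before = demo["REVENUE"].skew()

# Same transformation as in the notebook: natural log with a small offset
demo["log_REVENUE"] = np.log(demo["REVENUE"] + 0.001)
skew_after = demo["log_REVENUE"].skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

One thing to keep in mind for count columns that contain zeros: np.log(0 + 0.001) is about -6.9, while a value of 1 maps to about 0.001, so the zeros end up in a cluster of their own. np.log1p, which maps 0 to 0, is a common alternative in that situation.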

I began the following section of code by computing Pearson correlation coefficients between the features, rounded to 2 decimal places, and sorting the correlations with the target variable CROSS_SELL_SUCCESS in descending order. This shows which features are most strongly related to CROSS_SELL_SUCCESS. I then prepared the data for analysis by creating a list of the x variables and identifying the y variable, and used them in a train-test split stratified on CROSS_SELL_SUCCESS. Next, I merged the training data back together so it could be passed to statsmodels. Because the y variable is binary, with only 0s and 1s, I used a logistic regression model instead of a linear one. This model served as a base model for later analysis and for judging the importance of the various features. I started by including all of the x variables and checking their p-values. I then removed the feature with the highest p-value and reran the model, repeating until every remaining feature had a p-value below 0.05. This left only 4 features, which I identified as the most impactful for CROSS_SELL_SUCCESS.
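
The elimination described above was done by hand, one model at a time. A minimal sketch of how the same procedure could be automated is shown here. It runs on synthetic data, and the function name backward_eliminate and the columns signal, noise_a, noise_b, and target are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: one informative feature and two pure-noise features
rng = np.random.default_rng(219)
n = 1000
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
prob = 1 / (1 + np.exp(-(0.3 + 1.2 * demo["signal"])))
demo["target"] = (rng.random(n) < prob).astype(int)


def backward_eliminate(data, response, features, alpha=0.05):
    """Drop the feature with the largest p-value until all are below alpha."""
    kept = list(features)
    while kept:
        formula = f"{response} ~ " + " + ".join(kept)
        fit = smf.logit(formula=formula, data=data).fit(disp=0)
        # the intercept is not a candidate for removal
        pvalues = fit.pvalues.drop("Intercept")
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha:
            return kept, fit
        kept.remove(worst)
    return kept, None


kept, final_fit = backward_eliminate(
    demo, "target", ["signal", "noise_a", "noise_b"]
)
print(kept)
```

Applied to the notebook's data, the call would take css_train, 'CROSS_SELL_SUCCESS', and x_var as arguments. Because a pure-noise feature can fall below 0.05 by chance, the surviving list is best treated as a starting point rather than a final selection.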

In [3]:
########################### CORRELATION ############################## 
# correlation of the features
css_corr = css.corr(method="pearson").round(decimals =2)
css_corr['CROSS_SELL_SUCCESS'].sort_values(ascending = False)

########################### PREPARING DATA ############################## 
# list of x variables
x_var = ['UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 'AVG_TIME_PER_SITE_VISIT', 
          'PC_LOGINS', 'MOBILE_LOGINS', 'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 
          'AVG_MEAN_RATING', 'log_REVENUE', 'log_TOTAL_MEALS_ORDERED', 
          'log_PRODUCT_CATEGORIES_VIEWED', 'log_CANCELLATIONS_AFTER_NOON', 
          'log_WEEKLY_PLAN', 'log_LATE_DELIVERIES', 'log_TOTAL_PHOTOS_VIEWED']
# declare explanatory variables (x)
x_data = css[x_var]
# declare response variable (y)
y_var = css.loc[:, 'CROSS_SELL_SUCCESS']

########################### TEST/TRAIN SPLIT ############################## 
# train-test split including stratification
x_train, x_test, y_train, y_test = train_test_split(
            x_data,
            y_var,
            test_size = 0.10,
            random_state = 219,
            stratify = y_var) # preserving balance
# merge training data for statsmodels
css_train = pd.concat([x_train, y_train], axis = 1)

########################### BASE MODEL ############################## 
# instantiating a logistic regression model object
logistic = smf.logit(formula = """ CROSS_SELL_SUCCESS ~ UNIQUE_MEALS_PURCH + 
                                                     MOBILE_LOGINS +
                                                     log_TOTAL_MEALS_ORDERED + 
                                                     log_CANCELLATIONS_AFTER_NOON 
                                                     """,
                                        data = css_train)
# fitting the model object
results = logistic.fit()
# checking the results SUMMARY
results.summary2()

Optimization terminated successfully.
         Current function value: 0.617189
         Iterations 5


Model:,Logit,Pseudo R-squared:,0.017
Dependent Variable:,CROSS_SELL_SUCCESS,AIC:,2171.3948
Date:,2023-02-19 21:08,BIC:,2198.7345
No. Observations:,1751,Log-Likelihood:,-1080.7
Df Model:,4,LL-Null:,-1098.9
Df Residuals:,1746,LLR p-value:,2.3633e-07
Converged:,1.0000,Scale:,1.0
No. Iterations:,5.0000,,

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-0.5545,0.4151,-1.3359,0.1816,-1.3681,0.2590
UNIQUE_MEALS_PURCH,0.0548,0.0237,2.3153,0.0206,0.0084,0.1013
MOBILE_LOGINS,0.2409,0.0983,2.4496,0.0143,0.0481,0.4336
log_TOTAL_MEALS_ORDERED,0.1697,0.0763,2.2247,0.0261,0.0202,0.3191
log_CANCELLATIONS_AFTER_NOON,0.0676,0.0145,4.6527,0.0000,0.0391,0.0961
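
The coefficients in the table above are on the log-odds scale, which is hard to read directly. The short calculation below converts them to odds ratios and also reproduces the reported pseudo R-squared from the two log-likelihood values. All numbers are copied from the summary output; the variable names are only for illustration.

```python
import numpy as np

# Coefficients copied from the summary table above (log-odds scale)
coefs = {
    "UNIQUE_MEALS_PURCH": 0.0548,
    "MOBILE_LOGINS": 0.2409,
    "log_TOTAL_MEALS_ORDERED": 0.1697,
    "log_CANCELLATIONS_AFTER_NOON": 0.0676,
}

# exp(coefficient) is the multiplicative change in the odds of success
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:30s} {ratio:.3f}")

# McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)
log_likelihood = -1080.7
ll_null = -1098.9
mcfadden_r2 = 1 - log_likelihood / ll_null
print(f"pseudo R-squared: {mcfadden_r2:.3f}")
```

For example, the MOBILE_LOGINS odds ratio is about 1.27, meaning each additional mobile login is associated with roughly 27% higher odds of cross-sell success in this model. The pseudo R-squared of about 0.017 indicates that the four features together explain only a small part of the variation.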


To increase the potential scores of the models, I engineered twelve features. Based on the results of the previous logistic regression, I built them primarily from the following features: MOBILE_LOGINS, UNIQUE_MEALS_PURCH, log_CANCELLATIONS_AFTER_NOON, and log_TOTAL_MEALS_ORDERED.

The first feature relates unique meals purchased to mobile logins. I created it because I assumed that the more logins a customer had, the more likely they were to purchase unique meals.
The second feature relates cancellations after noon to unique meals purchased. I assumed that customers who purchase unique orders are more skeptical and more likely to cancel, because the product is not one they are accustomed to ordering.
The third feature is the number of mobile logins per total meals ordered. I assumed that the more purchases a customer makes, the higher the probability that they logged in.
The fourth feature is revenue per meal ordered. I found this useful for seeing how each order impacted revenue and, as a result, CROSS_SELL_SUCCESS.
The fifth feature relates meals purchased to cancellations. I believed that the more meals purchased, the higher the chance of a cancellation.
The sixth feature compares mobile logins to PC logins. I expected many more mobile logins per PC login.
The seventh feature compares the number of meals ordered to the number of late deliveries. I expected that more orders would mean more late deliveries, simply because there are more chances for one to occur.

The following features were created based on the results of the describe function.
The eighth feature splits customers into one of four groups based on the number of unique meals per total meals ordered: small, below medium, above medium, and large. I felt that splitting them into these groups would further help the models, since total meals and unique orders were both significant.
The ninth feature identifies the type of user based on the number of unique orders, with three groups: not unique, unique, and very unique. Unique orders was a significant feature, and I believed separating users into groups would help the models more than simply using the feature.
The tenth feature splits users into one of three groups based on the number of mobile logins: not active, active, and very active. Mobile logins were significant in the statsmodels model, and I felt that this grouping was more useful than just the feature itself.
The eleventh feature is based on the number of total orders, with users split into four groups: not frequent, below frequent, above frequent, and very frequent. Total orders was significant, and using the data this way helped the models.
The twelfth feature groups users by revenue into low, low mid, high mid, and high. I felt this would have a significant impact on cross-sell success.

Once all of the features were engineered, I set up the data for the classification models. I began by creating a list of the x variables I found to be important together with the features I engineered. I then set the y variable to CROSS_SELL_SUCCESS and used these variables in a train-test split stratified on the y variable. I used trial and error to decide which features to include. In the end, most of the original features were not used in the models, and I also excluded some of the engineered features.

In [4]:
########################### NEW FEATURES ############################## 
# mobile logins per unique meal purchased (the ratio runs opposite to the column name)
css['unique_per_mobile'] = css['MOBILE_LOGINS'] / css['UNIQUE_MEALS_PURCH']
# how many unique meals purchased to cancellations after noon
css['cancel_per_unique'] = css['UNIQUE_MEALS_PURCH'] / css['log_CANCELLATIONS_AFTER_NOON']
# how many mobile logins per total meals ordered 
css['mobile_per_total'] = css['MOBILE_LOGINS'] / css['log_TOTAL_MEALS_ORDERED']
# how much revenue per meals ordered 
css['total_to_revenue']=css['log_REVENUE']/css['log_TOTAL_MEALS_ORDERED']
# how many meals purchased to cancellations after 
css['total_to_cancel']=css['log_TOTAL_MEALS_ORDERED']/css['log_CANCELLATIONS_AFTER_NOON']
# how many mobile logins to pc logins
css['mobile_pc']=css['MOBILE_LOGINS']/css['PC_LOGINS']
# how many meals ordered per late delivery
css['total_to_late']=css['log_TOTAL_MEALS_ORDERED']/css['log_LATE_DELIVERIES']

# how many unique meals per total meals ordered 
css['unique_total'] = css['UNIQUE_MEALS_PURCH'] / css['log_TOTAL_MEALS_ORDERED']
# calculating type of user based on the number of unique meals ordered to total meals 
css['small'] = 0
css['below_medium'] = 0
css['above_medium'] = 0
css['large'] = 0
# loop to identify type of user
for index, row in css.iterrows():
    # small - 25th percentile
    if css.loc[index, 'unique_total'] < 1.29 :
        css.loc[index, 'small'] = 1   
    # below medium - below 50th percentile
    elif css.loc[index, 'unique_total'] >= 1.29 and css.loc[index, 'unique_total'] < 1.57: 
        css.loc[index, 'below_medium'] = 1
    # above medium - above 50th percentile
    elif css.loc[index, 'unique_total'] >= 1.57 and css.loc[index, 'unique_total'] < 1.92: 
        css.loc[index, 'above_medium'] = 1
    # large - 75th percentile
    elif css.loc[index, 'unique_total'] >= 1.92 :
        css.loc[index, 'large'] = 1
          
# calculating type of user based on the number of unique meals
css['not_unique'] = 0
css['unique'] = 0
css['very_unique'] = 0
# loop to identify type of user
for index, row in css.iterrows():
    # not unique - 25th percentile
    if css.loc[index, 'UNIQUE_MEALS_PURCH'] < 5:
        css.loc[index, 'not_unique'] = 1   
    # unique - interquartile range
    elif css.loc[index, 'UNIQUE_MEALS_PURCH'] >= 5 and css.loc[index, 'UNIQUE_MEALS_PURCH'] < 8: 
        css.loc[index, 'unique'] = 1
    # very unique - 75th percentile and above
    elif css.loc[index, 'UNIQUE_MEALS_PURCH'] >= 8:
        css.loc[index, 'very_unique'] = 1
        
# calculating type of user based on the number of mobile logins
css['not_active'] = 0
css['active'] = 0
css['very_active'] = 0
# loop to identify type of user
for index, row in css.iterrows():
    # not active - 25th percentile
    if css.loc[index, 'MOBILE_LOGINS'] < 1 :
        css.loc[index, 'not_active'] = 1   
    # active - interquartile range
    elif css.loc[index, 'MOBILE_LOGINS'] >= 1 and css.loc[index, 'MOBILE_LOGINS'] < 2: 
        css.loc[index, 'active'] = 1
    # very active - 75th percentile
    elif css.loc[index, 'MOBILE_LOGINS'] >= 2:
        css.loc[index, 'very_active'] = 1
        
# calculating type of user based on the number of total orders
css['not_frequent'] = 0
css['below_frequent'] = 0
css['above_frequent'] = 0
css['very_frequent'] = 0
# loop to identify type of user
for index, row in css.iterrows():
    # not frequent - 25th percentile
    if css.loc[index, 'log_TOTAL_MEALS_ORDERED'] < 3.66 :
        css.loc[index, 'not_frequent'] = 1   
    # below frequent - below 50th percentile
    elif css.loc[index, 'log_TOTAL_MEALS_ORDERED'] >= 3.66 and css.loc[index, 'log_TOTAL_MEALS_ORDERED'] < 4.09: 
        css.loc[index, 'below_frequent'] = 1
    # above frequent - above 50th percentile
    elif css.loc[index, 'log_TOTAL_MEALS_ORDERED'] >= 4.09 and css.loc[index, 'log_TOTAL_MEALS_ORDERED'] < 4.55: 
        css.loc[index, 'above_frequent'] = 1
    # very frequent - 75th percentile
    elif css.loc[index, 'log_TOTAL_MEALS_ORDERED'] >= 4.55 :
        css.loc[index, 'very_frequent'] = 1
        
# calculating type of user based on the log revenue
css['low'] = 0
css['low_mid'] = 0
css['high_mid'] = 0
css['high'] = 0
# loop to identify type of user
for index, row in css.iterrows():
    # low - 25th percentile
    if css.loc[index, 'log_REVENUE'] < 7.21 :
        css.loc[index, 'low'] = 1   
    # low mid - below 50th percentile
    elif css.loc[index, 'log_REVENUE'] >= 7.21 and css.loc[index, 'log_REVENUE'] < 7.46: 
        css.loc[index, 'low_mid'] = 1
    # high mid - above 50th percentile
    elif css.loc[index, 'log_REVENUE'] >= 7.46 and css.loc[index, 'log_REVENUE'] < 7.89: 
        css.loc[index, 'high_mid'] = 1
    # high - 75th percentile
    elif css.loc[index, 'log_REVENUE'] >= 7.89 :
        css.loc[index, 'high'] = 1
        
########################### NEW DATA ############################## 
# train/test split with the new model data 
css_data   =  css.loc[ : , ['log_CANCELLATIONS_AFTER_NOON', 'AVG_PREP_VID_TIME', 
                            'LARGEST_ORDER_SIZE', 'log_TOTAL_MEALS_ORDERED', 
                            'log_LATE_DELIVERIES', 
                            'cancel_per_unique', 'unique_per_mobile', 
                            'total_to_cancel',  'total_to_revenue', 'mobile_per_total',
                            'MOBILE_LOGINS', 'mobile_pc', 'total_to_late', 
                            'below_medium', 'small', 'large', 'above_medium',
                            'unique', 'very_unique', 'not_active', 'active', 
                            'very_active', 'not_frequent', 'below_frequent', 
                            'above_frequent', 'very_frequent', 'low', 'low_mid', 
                            'high_mid']]
css_target =  css.loc[ : , 'CROSS_SELL_SUCCESS']
# redo train/test split with stratification with new data 
x_train, x_test, y_train, y_test = train_test_split(
            css_data,
            css_target,
            test_size = 0.10,
            random_state = 219,
            stratify = css_target)
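
The row-by-row loops in the cell above can also be written as vectorized pandas operations, which are shorter and typically much faster than iterrows on larger data. The sketch below reproduces the unique_total grouping with pd.cut and pd.get_dummies, using the same cut points (1.29, 1.57, 1.92) on a small made-up sample. The sample values are assumptions for illustration, not the project data.

```python
import numpy as np
import pandas as pd

# Small made-up sample of the 'unique_total' ratio (not the project data)
demo = pd.DataFrame({"unique_total": [0.90, 1.29, 1.40, 1.57, 1.80, 1.92, 2.50]})

# Same cut points as the loop; right=False makes each bin [lower, upper)
bins = [-np.inf, 1.29, 1.57, 1.92, np.inf]
labels = ["small", "below_medium", "above_medium", "large"]
groups = pd.cut(demo["unique_total"], bins=bins, labels=labels, right=False)

# One 0/1 column per group, matching the columns the loop creates
dummies = pd.get_dummies(groups, dtype=int)
dummies.columns = dummies.columns.astype(str)  # plain string column names
demo = pd.concat([demo, dummies], axis=1)

print(demo)
```

The same pattern would apply to the other four groupings by swapping in their column and cut points. Reading the cut points from css[col].quantile([0.25, 0.50, 0.75]) instead of typing them in is another option if the data might change.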

In the following block of code I ran five classification models: LOGISTIC REGRESSION, DECISION TREE, PRUNED DECISION TREE, K NEIGHBORS CLASSIFIER, and RANDOM FOREST. For each model I report the training and testing accuracy, the train-test gap, the AUC score, and the confusion matrix.
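
One detail worth keeping in mind when reading the AUC row: roc_auc_score is designed to take a score that ranks observations, normally the positive-class probability from predict_proba. In this notebook the pruned tree and KNN AUC values are computed from predict output (hard 0/1 labels), while the other three models use probabilities, so the five AUC values are not fully comparable. The sketch below uses made-up labels and probabilities (not the project data) to show how the two inputs can give different results.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class probabilities (not the project data)
y_true = [0, 0, 1, 1, 1]
y_prob = [0.20, 0.60, 0.55, 0.70, 0.90]

# Hard predictions at a 0.5 threshold, roughly what predict() returns
y_hard = [int(p >= 0.5) for p in y_prob]

# AUC from the ranking given by the probabilities
auc_from_probs = roc_auc_score(y_true=y_true, y_score=y_prob)

# AUC from hard labels: all ranking information within each class is lost
auc_from_labels = roc_auc_score(y_true=y_true, y_score=y_hard)

print(round(auc_from_probs, 4), round(auc_from_labels, 4))  # prints 0.8333 0.75
```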

In [5]:
########################### LOGISTIC REGRESSION ############################## 
# name of model
model_1 = "Logistic Regression"
# instantiate logistic regression model
logreg = LogisticRegression(max_iter = 200,
                            solver = 'lbfgs',
                            C = 1,
                            random_state = 219)
# fit the training data
logreg_fit = logreg.fit(x_train, y_train)
# predict based on the testing set
logreg_pred = logreg.predict(x_test)
logreg_pred_probs = logreg_fit.predict_proba(x_test)[:, 1]
# score results
logreg_train_score = logreg_fit.score(x_train, y_train).round(4) 
logreg_test_score = logreg_fit.score(x_test, y_test).round(4)
logreg_test_gap = abs(logreg_train_score - logreg_test_score).round(4)
logreg_auc_score = roc_auc_score(y_true = y_test, y_score = logreg_pred_probs).round(4)
# confusion matrix
logreg_tn, \
logreg_fp, \
logreg_fn, \
logreg_tp = confusion_matrix(y_true = y_test, y_pred = logreg_pred).ravel()

########################### DECISION TREE ############################## 
# name of model
model_2 = "Decision Tree"
# initiate a classification tree object (no random_state is set, so results may vary between runs)
tree = DecisionTreeClassifier()
# fit the training data
tree_fit = tree.fit(x_train, y_train)
# predict on new data
tree_pred = tree.predict(x_test)
tree_pred_probs = tree_fit.predict_proba(x_test)[:, 1]
# score the model
tree_train_score = tree_fit.score(x_train, y_train).round(4)
tree_test_score  = tree_fit.score(x_test, y_test).round(4) 
tree_test_gap = abs(tree_train_score - tree_test_score).round(4)
tree_auc_score = roc_auc_score(y_true  = y_test, y_score = tree_pred_probs).round(4)
# confusion matrix
tree_tn, \
tree_fp, \
tree_fn, \
tree_tp = confusion_matrix(y_true = y_test, y_pred = tree_pred).ravel()

########################### PRUNED DECISION TREE ############################## 
# name of model
model_3 = "Decision Tree Pruned"
# initiate a classification tree 
pruned = DecisionTreeClassifier(max_depth = 6,
                                min_samples_leaf = 10,
                                random_state = 219)
# fit the training data
pruned_fit = pruned.fit(x_train, y_train)
# predict on data
pruned_pred = pruned.predict(x_test)
pruned_pred_probs = pruned_fit.predict_proba(x_test)[:, 1]
# score the model
pruned_train_score = pruned_fit.score(x_train, y_train).round(4) 
pruned_test_score = pruned_fit.score(x_test, y_test).round(4)
pruned_test_gap = abs(pruned_train_score - pruned_test_score).round(4)
# note: this AUC is computed from hard class predictions (pruned_pred), not pruned_pred_probs
pruned_auc_score = roc_auc_score(y_true  = y_test, y_score = pruned_pred).round(4)
# confusion matrix
pruned_tn, \
pruned_fp, \
pruned_fn, \
pruned_tp = confusion_matrix(y_true = y_test, y_pred = pruned_pred).ravel()

########################### KNN ############################## 
# name of model
model_4 = "K Neighbors Classifier"
# initiate a KNN classifier object
knn = KNeighborsClassifier(n_neighbors = 5)
# fit the model to the training data
knn_fit = knn.fit(x_train, y_train)
# predict on the testing data
knn_pred = knn.predict(x_test)
knn_pred_probs = knn_fit.predict_proba(x_test)[:, 1]
# score the model
knn_train_score = knn_fit.score(x_train, y_train).round(4) 
knn_test_score = knn_fit.score(x_test, y_test).round(4)
knn_test_gap = abs(knn_train_score - knn_test_score).round(4)
# note: this AUC is computed from hard class predictions (knn_pred), not knn_pred_probs
knn_auc_score = roc_auc_score(y_true  = y_test, y_score = knn_pred).round(4)
# confusion matrix
knn_tn, \
knn_fp, \
knn_fn, \
knn_tp = confusion_matrix(y_true = y_test, y_pred = knn_pred).ravel()

########################### RANDOM FOREST ########################
# name of model
model_5 = "Random Forest"
# initiate a random forest
rfc = RandomForestClassifier(n_estimators = 30,
                             max_depth = 6,
                             random_state = 219)
# fit the model to the training data
rfc_fit = rfc.fit(x_train, y_train)
# predict on the testing data
rfc_pred = rfc.predict(x_test)
rfc_pred_probs = rfc_fit.predict_proba(x_test)[:, 1]
# score the model
rfc_train_score = rfc_fit.score(x_train, y_train).round(4) 
rfc_test_score = rfc_fit.score(x_test, y_test).round(4)
rfc_test_gap = abs(rfc_train_score - rfc_test_score).round(4)
rfc_auc_score = roc_auc_score(y_true  = y_test, y_score = rfc_pred_probs).round(4)
# confusion matrix
rfc_tn, \
rfc_fp, \
rfc_fn, \
rfc_tp = confusion_matrix(y_true = y_test, y_pred = rfc_pred).ravel()

############################ Final results ##########################
# printing dynamic results
print(f"""
Winning analysis:                                                                                       {model_5}                 
                    ----------------------------------------------------------------------------------------------------
Model Type:         {model_1}\t{model_2}\t{model_3}\t{model_4}\t{model_5}
                    ----------------------------------------------------------------------------------------------------
Training Accuracy:  {logreg_train_score}\t\t{tree_train_score}\t\t{pruned_train_score}\t\t\t{knn_train_score}\t\t\t{rfc_train_score}
Testing Accuracy:   {logreg_test_score}\t\t{tree_test_score}\t\t{pruned_test_score}\t\t\t{knn_test_score}\t\t\t{rfc_test_score}
Train Test Gap:     {logreg_test_gap}\t\t{tree_test_gap}\t\t{pruned_test_gap}\t\t\t{knn_test_gap}\t\t\t{rfc_test_gap}
AUC:                {logreg_auc_score}\t\t{tree_auc_score}\t\t{pruned_auc_score}\t\t\t{knn_auc_score}\t\t\t{rfc_auc_score}


Confusion Matrix:   
True Negatives :    {logreg_tn}\t\t\t{tree_tn}\t\t{pruned_tn}\t\t\t{knn_tn}\t\t\t{rfc_tn}
False Positives:    {logreg_fp}\t\t\t{tree_fp}\t\t{pruned_fp}\t\t\t{knn_fp}\t\t\t{rfc_fp}
False Negatives:    {logreg_fn}\t\t\t{tree_fn}\t\t{pruned_fn}\t\t\t{knn_fn}\t\t\t{rfc_fn}
True Positives :    {logreg_tp}\t\t\t{tree_tp}\t\t{pruned_tp}\t\t\t{knn_tp}\t\t\t{rfc_tp}
""")


Winning analysis:                                                                                       Random Forest                 
                    ----------------------------------------------------------------------------------------------------
Model Type:         Logistic Regression	Decision Tree	Decision Tree Pruned	K Neighbors Classifier	Random Forest
                    ----------------------------------------------------------------------------------------------------
Training Accuracy:  0.679		1.0		0.7053			0.7476			0.7076
Testing Accuracy:   0.6769		0.5179		0.6308			0.6154			0.6718
Train Test Gap:     0.0021		0.4821		0.0745			0.1322			0.0358
AUC:                0.5872		0.4738		0.5281			0.5002			0.6134


Confusion Matrix:   
True Negatives :    2			22		15			11			0
False Positives:    61			41		48			52			63
False Negatives:    2			53		24			23			1
True Positives :    130			79		108			109			131
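
The confusion-matrix counts above can be turned into per-class rates, which helps put the accuracy figures in context. The calculation below uses only the counts printed in the results table. The dictionary layout and variable names are illustrative.

```python
# Counts copied from the results table above: (TN, FP, FN, TP) per model
counts = {
    "Logistic Regression": (2, 61, 2, 130),
    "Decision Tree": (22, 41, 53, 79),
    "Decision Tree Pruned": (15, 48, 24, 108),
    "K Neighbors Classifier": (11, 52, 23, 109),
    "Random Forest": (0, 63, 1, 131),
}

metrics = {}
for model, (tn, fp, fn, tp) in counts.items():
    total = tn + fp + fn + tp
    metrics[model] = {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),        # share of actual 1s predicted as 1
        "specificity": tn / (tn + fp),   # share of actual 0s predicted as 0
    }

# Accuracy of always predicting the majority class (1) on this test set
tn, fp, fn, tp = counts["Random Forest"]
baseline = (tp + fn) / (tn + fp + fn + tp)

for model, m in metrics.items():
    print(f"{model:24s} acc={m['accuracy']:.4f} "
          f"recall={m['recall']:.4f} spec={m['specificity']:.4f}")
print(f"majority-class baseline accuracy = {baseline:.4f}")
```

Read this way, the random forest and logistic regression label nearly every test customer as a success (specificity of 0 and about 0.03), so their testing accuracy is close to the 0.6769 that always predicting 1 would achieve on the 195 test rows.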

