<hr style="height:.6px;color:#333;" />
<h1><b>Classification Model Development</b></h1>
<h2>Xitlali Magana</h2>
<br>
<hr style="height:.3px;color:#333;" />

This code imports various Python libraries for data analysis and machine learning, including NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn (sklearn), and Statsmodels. It then loads a dataset named 'Cross_Sell_Success_Dataset_2023.xlsx' into a Pandas DataFrame named 'chef'. The code also sets some Pandas print options for displaying the data and shows the first five rows of the dataset using the 'head()' function.

In [1]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import seaborn as sns    
import sklearn.linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression    
from sklearn.metrics import confusion_matrix         
from sklearn.metrics import roc_auc_score          
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.neighbors import KNeighborsRegressor   
from sklearn.preprocessing import StandardScaler 
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestClassifier

# loading data
chef = pd.read_excel(io = 'Cross_Sell_Success_Dataset_2023.xlsx')

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)


# displaying the head of the dataset
chef.head(n = 5)

Unnamed: 0,CROSS_SELL_SUCCESS,EMAIL,REVENUE,TOTAL_MEALS_ORDERED,UNIQUE_MEALS_PURCH,CONTACTS_W_CUSTOMER_SERVICE,PRODUCT_CATEGORIES_VIEWED,AVG_TIME_PER_SITE_VISIT,CANCELLATIONS_AFTER_NOON,PC_LOGINS,MOBILE_LOGINS,WEEKLY_PLAN,LATE_DELIVERIES,AVG_PREP_VID_TIME,LARGEST_ORDER_SIZE,AVG_MEAN_RATING,TOTAL_PHOTOS_VIEWED
0,1,steffon.baratheon@yahoo.com,4920.0,493,9,1,10,265.6,5,5,2,0,0,137.41,6,2.894737,456
1,0,harlon.greyjoy@visa.com,6150.0,361,9,1,6,247.0,2,5,1,0,0,120.2,5,2.631579,680
2,0,monster@protonmail.com,3435.0,278,6,1,4,164.4,0,6,1,5,0,127.0,3,3.684211,145
3,1,damon.lannister.(lord)@yahoo.com,3330.0,269,8,1,2,176.0,5,5,2,0,0,129.78,6,3.157895,418
4,1,raynald.westerling@jnj.com,3427.5,276,7,1,10,164.6,0,6,1,14,0,34.42,3,3.157895,174


This code is performing data cleaning and preparation for analysis on the chef dataset. It begins by checking for missing values in each feature, and then renaming the "LATE_DELIVERIES " column to "LATE_DELIVERIES". The next section creates a histogram for each column in the dataset to visualize the distribution of the data. The code then performs log transformations on a list of columns that appear to be skewed. Finally, useless features are dropped from the dataset, including duplicates, and the results are checked.

In [2]:
####################### data cleaning ############################
# checking each feature for missing values
chef.isnull().sum().round(decimals=2)
# checking column names
chef.columns
# renaming the late deliveries column to remove the space
chef = chef.rename(columns={"LATE_DELIVERIES ":"LATE_DELIVERIES"})

####################### HISTPLOTS ############################
# Iterating over the columns in the chef DataFrame
""" (commenting out becuase only needed for visualization)
for col in chef.columns:
    # creating a histplot for every column
    sns.histplot(data=chef, 
                 x=col, 
                 kde=False)
    # adding a title to each histogram
    plt.title(f"Histogram of {col}")
    # displaying the histogram
    # plt.show() 
"""
    
####################### LOG TRANSFORMATIONS ############################
# list of columns that appear to be skewed
skewed_columns = ['REVENUE', 'TOTAL_MEALS_ORDERED', 'PRODUCT_CATEGORIES_VIEWED',
                  'CANCELLATIONS_AFTER_NOON', 'WEEKLY_PLAN', 'LATE_DELIVERIES',
                 'TOTAL_PHOTOS_VIEWED']
# for loop to create a new column for each skewed column with a log transformation
for col in skewed_columns:
    if col in chef.columns:
        chef['log_' + col] = np.log(chef[col] + 0.001)
        
########################### DATA DROP ############################## 
# dropping useless features, and duplicates
chef = chef.drop(['EMAIL', 'REVENUE', 'TOTAL_MEALS_ORDERED', 'PRODUCT_CATEGORIES_VIEWED',
                'CANCELLATIONS_AFTER_NOON', 'WEEKLY_PLAN', 'LATE_DELIVERIES', 
                'TOTAL_PHOTOS_VIEWED'],
               axis = 1)
# checking results
# chef.columns 

First, the correlation between the customer features and the target variable "CROSS_SELL_SUCCESS" is computed using the Pearson correlation method, with the result rounded to 2 decimal places. The correlation coefficients are then sorted in descending order based on their absolute values, in order to identify which features have the strongest positive and negative correlation with "CROSS_SELL_SUCCESS".

Next, a list of explanatory variables ("x_var") is created, which includes the features that will be used as predictors in the logistic regression model. The explanatory variables and target variable are then separated into their own dataframes ("x_data" and "y_var", respectively).

The data is then split into training and testing sets using the train_test_split function, with a test size of 10%, a random state of 219 (for reproducibility), and stratification to preserve the balance of the target variable in both the training and testing sets.

The training data is then merged into a single dataframe ("chef_train") for use in the logistic regression model.

Finally, a logistic regression model is instantiated using statsmodels, with the explanatory variables and target variable specified using a formula. The model is then fit using the training data, and the results summary is printed to the console.

In [3]:
########################### CORRELATION ############################## 
# correlation of the features
chef_corr = chef.corr(method="pearson").round(decimals =2)
chef_corr['CROSS_SELL_SUCCESS'].sort_values(ascending = False)

########################### PREPARING DATA ############################## 
# list of x variables
x_var = ['UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 'AVG_TIME_PER_SITE_VISIT', 
          'PC_LOGINS', 'MOBILE_LOGINS', 'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 
          'AVG_MEAN_RATING', 'log_REVENUE', 'log_TOTAL_MEALS_ORDERED', 
          'log_PRODUCT_CATEGORIES_VIEWED', 'log_CANCELLATIONS_AFTER_NOON', 
          'log_WEEKLY_PLAN', 'log_LATE_DELIVERIES', 'log_TOTAL_PHOTOS_VIEWED']
# declare explanatory variables (x)
x_data = chef[x_var]
# declare response variable (y)
y_var = chef.loc[:, 'CROSS_SELL_SUCCESS']

########################### TEST/TRAIN SPLIT ############################## 
# train-test split including stratification
x_train, x_test, y_train, y_test = train_test_split(
            x_data,
            y_var,
            test_size = 0.10,
            random_state = 219,
            stratify = y_var) # preserving balance
# merge training data for statsmodels
chef_train = pd.concat([x_train, y_train], axis = 1)

########################### BASE MODEL ############################## 
# instantiating a logistic regression model object
logistic = smf.logit(formula = """ CROSS_SELL_SUCCESS ~ UNIQUE_MEALS_PURCH +
                                                        CONTACTS_W_CUSTOMER_SERVICE +
                                                        AVG_TIME_PER_SITE_VISIT +
                                                        PC_LOGINS +
                                                        MOBILE_LOGINS +
                                                        AVG_PREP_VID_TIME +
                                                        LARGEST_ORDER_SIZE +
                                                        AVG_MEAN_RATING +
                                                        log_REVENUE +
                                                        log_TOTAL_MEALS_ORDERED +
                                                        log_PRODUCT_CATEGORIES_VIEWED +
                                                        log_CANCELLATIONS_AFTER_NOON +
                                                        log_WEEKLY_PLAN +
                                                        log_LATE_DELIVERIES +
                                                        log_TOTAL_PHOTOS_VIEWED
                                                     """,
                                        data = chef_train)
# fitting the model object
results = logistic.fit()
# checking the results SUMMARY
results.summary2()

Optimization terminated successfully.
         Current function value: 0.613127
         Iterations 5


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.023
Dependent Variable:,CROSS_SELL_SUCCESS,AIC:,2179.1693
Date:,2023-03-14 13:25,BIC:,2266.6564
No. Observations:,1751,Log-Likelihood:,-1073.6
Df Model:,15,LL-Null:,-1098.9
Df Residuals:,1735,LLR p-value:,9.4146e-06
Converged:,1.0000,Scale:,1.0
No. Iterations:,5.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,2.1596,1.4821,1.4571,0.1451,-0.7453,5.0645
UNIQUE_MEALS_PURCH,0.0572,0.0252,2.2683,0.0233,0.0078,0.1067
CONTACTS_W_CUSTOMER_SERVICE,0.0283,0.0223,1.2677,0.2049,-0.0154,0.0720
AVG_TIME_PER_SITE_VISIT,0.0043,0.0023,1.8226,0.0684,-0.0003,0.0088
PC_LOGINS,0.1624,0.0894,1.8176,0.0691,-0.0127,0.3376
MOBILE_LOGINS,0.2633,0.1004,2.6218,0.0087,0.0665,0.4602
AVG_PREP_VID_TIME,0.0004,0.0009,0.4583,0.6467,-0.0014,0.0022
LARGEST_ORDER_SIZE,-0.0598,0.0595,-1.0055,0.3147,-0.1764,0.0568
AVG_MEAN_RATING,-0.1243,0.1075,-1.1563,0.2475,-0.3350,0.0864


These are a series of feature engineering steps applied to the 'chef' dataset. Feature engineering is the process of transforming raw data into useful features that can be used to improve the performance of machine learning algorithms. Here is a brief explanation of each feature that has been created:

unique_per_mobile: calculates the number of unique meal purchases per mobile login. This could be an interesting metric to understand if customers who use mobile tend to be more adventurous in their meal choices compared to those who use a desktop. It could also be an indicator of convenience or ease of ordering via mobile.

cancel_per_unique: calculates the number of unique meals purchased to cancellations after noon. This could help identify if customers who cancel after the cutoff time tend to order more unique or complex meals. It could also help identify if there is a correlation between cancellations and dissatisfaction with menu options.

mobile_per_total: calculates the number of mobile logins per total meals ordered. This metric could indicate if mobile users tend to order more or less frequently than desktop users. It could also help identify if there is a correlation between convenience and order frequency.

total_to_revenue: calculates how much revenue per meal ordered. This could provide insights into the average amount of revenue generated per order. It could also be an indicator of pricing strategy effectiveness.

total_to_cancel: calculates the number of meals purchased to cancellations after noon. This could help identify if there is a correlation between meal complexity and cancellations after the cutoff time. It could also help identify if there is a correlation between dissatisfaction with menu options and cancellations.

mobile_pc: calculates the number of mobile logins to pc logins. This could be an interesting metric to understand if customers tend to use one platform over the other. It could also help identify if there is a correlation between platform usage and order frequency.

total_to_late: calculates the number of meals ordered per late delivery. This could help identify if customers who experience late deliveries tend to order more frequently or if there is a correlation between dissatisfaction with delivery times and order frequency.

time_vs_average: This ratio measures the average time per site visit (in seconds) divided by the largest order size. It could help identify whether customers who spend more time on the site tend to place larger orders.

time_vs_total: This ratio measures the average time per site visit (in seconds) divided by the log of total meals ordered. It could help identify whether customers who spend more time on the site tend to order more meals overall.

total_vs_average: This ratio measures the log of total meals ordered divided by the largest order size. It could help identify whether customers who order more meals tend to place larger orders.

total_photos_vs_unique: This ratio measures the log of total photos viewed divided by the number of unique meals purchased. It could help identify whether customers who view more photos tend to try more unique meals.

unique_total: calculates the number of unique meals per total meals ordered.

small, below_medium, above_medium, large: calculates the type of user based on the number of unique meals ordered to total meals. The type of user is determined by quartiles.

not_unique, unique, very_unique: calculates the type of user based on the number of unique meals. The type of user is determined by quartiles.

not_active, active, very_active: calculates the type of user based on the number of mobile logins. The type of user is determined by quartiles.

not_frequent, below_frequent, above_frequent, very_frequent: calculates the type of user based on the number of total orders. The type of user is determined by quartiles.

The feature engineering steps performed on the 'chef' dataset can be used to identify patterns and relationships between the various features in the dataset. These relationships can be used to develop machine learning models that can predict various outcomes, such as customer satisfaction or revenue.

In [4]:
########################### NEW FEATURES ############################## 
# how many unique meal purchases per mobile login
chef['unique_per_mobile'] = chef['MOBILE_LOGINS'] / chef['UNIQUE_MEALS_PURCH']
# how many unique meals purchased to cancellations after noon
chef['cancel_per_unique'] = chef['UNIQUE_MEALS_PURCH'] / chef['log_CANCELLATIONS_AFTER_NOON']
# how many mobile logins per total meals ordered 
chef['mobile_per_total'] = chef['MOBILE_LOGINS'] / chef['log_TOTAL_MEALS_ORDERED']
# how much revenue per meals ordered 
chef['total_to_revenue']=chef['log_REVENUE']/chef['log_TOTAL_MEALS_ORDERED']
# how many meals purchased to cancellations after 
chef['total_to_cancel']=chef['log_TOTAL_MEALS_ORDERED']/chef['log_CANCELLATIONS_AFTER_NOON']
# how many mobile logins to pc logins
chef['mobile_pc']=chef['MOBILE_LOGINS']/chef['PC_LOGINS']
# how many meals ordered per late delivery
chef['total_to_late']=chef['log_TOTAL_MEALS_ORDERED']/chef['log_LATE_DELIVERIES']

# finding the ratio of average time per site visit to average order size 
chef['time_vs_average'] = chef['AVG_TIME_PER_SITE_VISIT'] / chef['LARGEST_ORDER_SIZE']
# finding the ratio of average time per visit to total meals order
chef['time_vs_total'] = chef['AVG_TIME_PER_SITE_VISIT'] / chef['log_TOTAL_MEALS_ORDERED']
# finding the ratio of total meals ordred to average order size
chef['total_vs_average'] = chef['log_TOTAL_MEALS_ORDERED'] / chef['LARGEST_ORDER_SIZE']
# finding the ratio of photos viewed to unique meals purchased
chef['total_photos_vs_unique'] = chef['log_TOTAL_PHOTOS_VIEWED'] / chef['UNIQUE_MEALS_PURCH']

# how many unique meals per total meals ordered 
chef['unique_total'] = chef['UNIQUE_MEALS_PURCH'] / chef['log_TOTAL_MEALS_ORDERED']
# calculating type of user based on the number of unique meals orderd to total meals 
chef['small'] = 0
chef['below_medium'] = 0
chef['above_medium'] = 0
chef['large'] = 0
# loop to identify type of user
for index, row in chef.iterrows():
    # small - 25th percentile
    if chef.loc[index, 'unique_total'] < 1.29 :
        chef.loc[index, 'small'] = 1   
    # below medium - below 50th percentile
    elif chef.loc[index, 'unique_total'] >= 1.29 and chef.loc[index, 'unique_total'] < 1.57: 
        chef.loc[index, 'below_medium'] = 1
    # above medium - above 50th percentile
    elif chef.loc[index, 'unique_total'] >= 1.57 and chef.loc[index, 'unique_total'] < 1.92: 
        chef.loc[index, 'above_medium'] = 1
    # large - 75th percentile
    elif chef.loc[index, 'unique_total'] >= 1.92 :
        chef.loc[index, 'large'] = 1
          
# calculating type of user based on the number of unique meals
chef['not_unique'] = 0
chef['unique'] = 0
chef['very_unique'] = 0
# loop to identify type of user
for index, row in chef.iterrows():
    # not unique - 25th percentile
    if chef.loc[index, 'UNIQUE_MEALS_PURCH'] < 5:
        chef.loc[index, 'not_unique'] = 1   
    # unique - IQ
    elif chef.loc[index, 'UNIQUE_MEALS_PURCH'] >= 5 and chef.loc[index, 'UNIQUE_MEALS_PURCH'] < 8: 
        chef.loc[index, 'unique'] = 1
    # unique - 75th percentile
    elif chef.loc[index, 'UNIQUE_MEALS_PURCH'] >= 8:
        chef.loc[index, 'very_unique'] = 1
        
# calculating type of user based on the number of mobile logins
chef['not_active'] = 0
chef['active'] = 0
chef['very_active'] = 0
# loop to identify type of user
for index, row in chef.iterrows():
    # not active - 25th percentile
    if chef.loc[index, 'MOBILE_LOGINS'] < 1 :
        chef.loc[index, 'not_active'] = 1   
    # active - IQ
    elif chef.loc[index, 'MOBILE_LOGINS'] >= 1 and chef.loc[index, 'MOBILE_LOGINS'] < 2: 
        chef.loc[index, 'active'] = 1
    # very active - 75th percentile
    elif chef.loc[index, 'MOBILE_LOGINS'] >= 2:
        chef.loc[index, 'very_active'] = 1
        
# calculating type of user based on the number of total orders
chef['not_frequent'] = 0
chef['below_frequent'] = 0
chef['above_frequent'] = 0
chef['very_frequent'] = 0
# loop to identify type of user
for index, row in chef.iterrows():
    # not frequent - 25th percentile
    if chef.loc[index, 'log_TOTAL_MEALS_ORDERED'] < 3.66 :
        chef.loc[index, 'not_frequent'] = 1   
    # below frequent - below 50th percentile
    elif chef.loc[index, 'log_TOTAL_MEALS_ORDERED'] >= 3.66 and chef.loc[index, 'log_TOTAL_MEALS_ORDERED'] < 4.09: 
        chef.loc[index, 'below_frequent'] = 1
    # above frequent - above 50th percentile
    elif chef.loc[index, 'log_TOTAL_MEALS_ORDERED'] >= 4.09 and chef.loc[index, 'log_TOTAL_MEALS_ORDERED'] < 4.55: 
        chef.loc[index, 'above_frequent'] = 1
    # very frequent - 75th percentile
    elif chef.loc[index, 'log_TOTAL_MEALS_ORDERED'] >= 4.55 :
        chef.loc[index, 'very_frequent'] = 1
        
# calculating type of user based on the log revenue
chef['low'] = 0
chef['low_mid'] = 0
chef['high_mid'] = 0
chef['high'] = 0
# loop to identify type of user
for index, row in chef.iterrows():
    # low - 25th percentile
    if chef.loc[index, 'log_REVENUE'] < 7.21 :
        chef.loc[index, 'low'] = 1   
    # low mid - below 50th percentile
    elif chef.loc[index, 'log_REVENUE'] >= 7.21 and chef.loc[index, 'log_REVENUE'] < 7.46: 
        chef.loc[index, 'low_mid'] = 1
    # high mid - above 50th percentile
    elif chef.loc[index, 'log_REVENUE'] >= 7.46 and chef.loc[index, 'log_REVENUE'] < 7.89: 
        chef.loc[index, 'high_mid'] = 1
    # high - 75th percentile
    elif chef.loc[index, 'log_REVENUE'] >= 7.89 :
        chef.loc[index, 'high'] = 1

First, the code splits the data into a training set and a testing set using stratification. Then, the code trains each model on the training set and tests its performance on the testing set.

For each model, the code calculates the following performance metrics:

Training score: the accuracy of the model on the training set
Testing score: the accuracy of the model on the testing set
Testing gap: the absolute difference between the training and testing scores
AUC score: the area under the receiver operating characteristic (ROC) curve
Confusion matrix: a table showing the number of true positives, true negatives, false positives, and false negatives for the model's predictions

In [5]:
########################### NEW DATA ############################## 
# train/test split with the new model data 
chef_data   =  chef.loc[ : , ['log_CANCELLATIONS_AFTER_NOON', 'AVG_PREP_VID_TIME', 
                            'LARGEST_ORDER_SIZE', 'log_TOTAL_MEALS_ORDERED', 
                            'log_LATE_DELIVERIES', 
                            'cancel_per_unique', 'unique_per_mobile', 
                            'total_to_cancel',  'total_to_revenue', 'mobile_per_total',
                            'MOBILE_LOGINS', 'mobile_pc', 'total_to_late', 
                            'below_medium', 'small', 'large', 'above_medium',
                            'unique', 'very_unique', 'not_active', 'active', 
                            'very_active', 'not_frequent', 'below_frequent', 
                            'above_frequent', 'very_frequent', 'low', 'low_mid', 
                            'high_mid']]
chef_target =  chef.loc[ : , 'CROSS_SELL_SUCCESS']
# redo train/test split with stratification with new data 
x_train, x_test, y_train, y_test = train_test_split(
            chef_data,
            chef_target,
            test_size = 0.10,
            random_state = 219,
            stratify = chef_target)

########################### LOGISTIC REGRESSION ############################## 
# name of model
model_1 = "Logistic Regression"
# inisiate logistic regression model
logreg = LogisticRegression(max_iter = 200,
                            solver = 'lbfgs',
                            C = 1,
                            random_state = 219)
# fit the training data
logreg_fit = logreg.fit(x_train, y_train)
# predict based on the testing set
logreg_pred = logreg.predict(x_test)
logreg_pred_probs = logreg_fit.predict_proba(x_test)[:, 1]
# score results
logreg_train_score = logreg_fit.score(x_train, y_train).round(4) 
logreg_test_score = logreg_fit.score(x_test, y_test).round(4)
logreg_test_gap = abs(logreg_train_score - logreg_test_score).round(4)
logreg_auc_score = roc_auc_score(y_true = y_test, y_score = logreg_pred_probs).round(4)
# confusion matrix
logreg_tn, \
logreg_fp, \
logreg_fn, \
logreg_tp = confusion_matrix(y_true = y_test, y_pred = logreg_pred).ravel()

########################### DECISION TREE ############################## 
# name of model
model_2 = "Decision Tree"
# initiate a classification tree object
tree = DecisionTreeClassifier()
# fit the training data
tree_fit = tree.fit(x_train, y_train)
# predict on new data
tree_pred = tree.predict(x_test)
tree_pred_probs = tree_fit.predict_proba(x_test)[:, 1]
# score the model
tree_train_score = tree_fit.score(x_train, y_train).round(4)
tree_test_score  = tree_fit.score(x_test, y_test).round(4) 
tree_test_gap = abs(tree_train_score - tree_test_score).round(4)
tree_auc_score = roc_auc_score(y_true  = y_test, y_score = tree_pred_probs).round(4)
# confusion matrix
tree_tn, \
tree_fp, \
tree_fn, \
tree_tp = confusion_matrix(y_true = y_test, y_pred = tree_pred).ravel()

########################### PRUNED DECISION TREE ############################## 
# name of model
model_3 = "Decision Tree Pruned"
# initiate a classification tree 
pruned = DecisionTreeClassifier(max_depth = 6,
                                min_samples_leaf = 10,
                                random_state = 219)
# fit the training data
pruned_fit = pruned.fit(x_train, y_train)
# predict on data
pruned_pred = pruned.predict(x_test)
pruned_pred_probs = pruned_fit.predict_proba(x_test)[:, 1]
# score the model
pruned_train_score = pruned_fit.score(x_train, y_train).round(4) 
pruned_test_score = pruned_fit.score(x_test, y_test).round(4)
pruned_test_gap = abs(pruned_train_score - pruned_test_score).round(4)
pruned_auc_score = roc_auc_score(y_true  = y_test, y_score = pruned_pred).round(4)
# confusion matrix
pruned_tn, \
pruned_fp, \
pruned_fn, \
pruned_tp = confusion_matrix(y_true = y_test, y_pred = pruned_pred).ravel()

########################### KNN ############################## 
# name of model
model_4 = "K Neighbors Classifier"
# initiate a KNN classifier object
knn = KNeighborsClassifier(n_neighbors = 5)
# fit the model to the training data
knn_fit = knn.fit(x_train, y_train)
# predict on the testing data
knn_pred = knn.predict(x_test)
knn_pred_probs = knn_fit.predict_proba(x_test)[:, 1]
# score the model
knn_train_score = knn_fit.score(x_train, y_train).round(4) 
knn_test_score = knn_fit.score(x_test, y_test).round(4)
knn_test_gap = abs(knn_train_score - knn_test_score).round(4)
knn_auc_score = roc_auc_score(y_true  = y_test, y_score = knn_pred).round(4)
# confusion matrix
knn_tn, \
knn_fp, \
knn_fn, \
knn_tp = confusion_matrix(y_true = y_test, y_pred = knn_pred).ravel()

########################### RANDOM FOREST ########################
# name of model
model_5 = "Random Forest"
# initiate a random forest
rfc = RandomForestClassifier(n_estimators = 30,
                             max_depth = 6,
                             random_state = 219)
# fit the model to the training data
rfc_fit = rfc.fit(x_train, y_train)
# predict on the testing data
rfc_pred = rfc.predict(x_test)
rfc_pred_probs = rfc_fit.predict_proba(x_test)[:, 1]
# score the model
rfc_train_score = rfc_fit.score(x_train, y_train).round(4) 
rfc_test_score = rfc_fit.score(x_test, y_test).round(4)
rfc_test_gap = abs(rfc_train_score - rfc_test_score).round(4)
rfc_auc_score = roc_auc_score(y_true  = y_test, y_score = rfc_pred_probs).round(4)
# confusion matrix
rfc_tn, \
rfc_fp, \
rfc_fn, \
rfc_tp = confusion_matrix(y_true = y_test, y_pred = rfc_pred).ravel()

############################ Final Classification Results ##########################
# printing dynamic results
print(f"""
Winning analysis:                                                                                       {model_5}                 
                    ----------------------------------------------------------------------------------------------------
Model Type:         {model_1}\t{model_2}\t{model_3}\t{model_4}\t{model_5}
                    ----------------------------------------------------------------------------------------------------
Training Accuracy:  {logreg_train_score}\t\t{tree_train_score}\t\t{pruned_train_score}\t\t\t{knn_train_score}\t\t\t{rfc_train_score}
Testing Accuracy:   {logreg_test_score}\t\t{tree_test_score}\t\t{pruned_test_score}\t\t\t{knn_test_score}\t\t\t{rfc_test_score}
Train Test Gap:     {logreg_test_gap}\t\t{tree_test_gap}\t\t{pruned_test_gap}\t\t\t{knn_test_gap}\t\t\t{rfc_test_gap}
AUC:                {logreg_auc_score}\t\t{tree_auc_score}\t\t{pruned_auc_score}\t\t\t{knn_auc_score}\t\t\t{rfc_auc_score}


Confusion Matrix:   
True Negatives :    {logreg_tn}\t\t\t{tree_tn}\t\t{pruned_tn}\t\t\t{knn_tn}\t\t\t{rfc_tn}
False Positives:    {logreg_fp}\t\t\t{tree_fp}\t\t{pruned_fp}\t\t\t{knn_fp}\t\t\t{rfc_fp}
False Negatives:    {logreg_fn}\t\t\t{tree_fn}\t\t{pruned_fn}\t\t\t{knn_fn}\t\t\t{rfc_fn}
True Positives :    {logreg_tp}\t\t\t{tree_tp}\t\t{pruned_tp}\t\t\t{knn_tp}\t\t\t{rfc_tp}
""")


Winning analysis:                                                                                       Random Forest                 
                    ----------------------------------------------------------------------------------------------------
Model Type:         Logistic Regression	Decision Tree	Decision Tree Pruned	K Neighbors Classifier	Random Forest
                    ----------------------------------------------------------------------------------------------------
Training Accuracy:  0.679		1.0		0.7053			0.7476			0.7076
Testing Accuracy:   0.6769		0.5333		0.6308			0.6154			0.6718
Train Test Gap:     0.0021		0.4667		0.0745			0.1322			0.0358
AUC:                0.5872		0.4894		0.5281			0.5002			0.6134


Confusion Matrix:   
True Negatives :    2			23		15			11			0
False Positives:    61			40		48			52			63
False Negatives:    2			51		24			23			1
True Positives :    130			81		108			109			131



The code first identifies the independent variables (features) to use in the regression by selecting a subset of columns from the dataset. It then splits the data into training and testing sets using a 75:25 split.

Three different regression models are then fit to the training data: Lasso Regression, Linear Regression, and ARD Regression. Each model is fit and tested, and the training and testing scores are recorded. Finally, the code prints out the results for each model, showing which model performed the best.

The Lasso Regression model uses L1 regularization to shrink the coefficients of less important variables to zero, effectively performing feature selection. The Linear Regression model fits a linear equation to the data. The ARD Regression model uses a Bayesian approach to automatically determine the importance of each feature.

In [6]:
########################## SCIKIT LEARN ################################
# identify independent variables
x_var = ['log_TOTAL_MEALS_ORDERED', 'log_TOTAL_PHOTOS_VIEWED', 'AVG_MEAN_RATING',
               'LARGEST_ORDER_SIZE', 'AVG_PREP_VID_TIME', 'UNIQUE_MEALS_PURCH',
               'CONTACTS_W_CUSTOMER_SERVICE', 'AVG_TIME_PER_SITE_VISIT', 'MOBILE_LOGINS',
               'log_PRODUCT_CATEGORIES_VIEWED', 'time_vs_average', 'time_vs_total', 
               'total_vs_average', 'total_photos_vs_unique']

# Preparing a DataFrame 
x_data = chef.loc[ : , x_var]

# identify dependent variable
y_data = chef.loc[ : , 'log_REVENUE']

# train/test split up 
x_train, x_test, y_train, y_test = train_test_split(
            x_data, 
            y_data, 
            test_size    = 0.25,
            random_state = 219)

############################ lasso regression ##########################
# set a model name
model_1_name = "Lasso Regression"

# specify a model object 
model_1 = sklearn.linear_model.Lasso(alpha=0.05)
# fit to the training data
model_1_fit = model_1.fit(x_train, y_train)

# predict on new data
model_1_pred = model_1.predict(x_test)

# score the results
model_1_train_score = model_1.score(x_train, y_train).round(4)
model_1_test_score  = model_1.score(x_test, y_test).round(4) 
model_1_gap         = abs(model_1_train_score - model_1_test_score).round(4)

############################ linear regression ##########################

# Set a model name
model_2_name = "Linear Regression"

# Specify a model object 
model_2 = sklearn.linear_model.LinearRegression()

# fit to the training data
model_2_fit = model_2.fit(x_train, y_train)

# predict on new data
model_2_pred = model_2.predict(x_test)

# score the results
model_2_train_score = model_2.score(x_train, y_train).round(4) 
model_2_test_score  = model_2.score(x_test, y_test).round(4)   
model_2_gap         = abs(model_2_train_score - model_2_test_score).round(4)

############################ ARD regression ##########################

# set a model name
model_3_name = "ARD Regression"

# specify a model object 
model_3 = sklearn.linear_model.ARDRegression()

# fit to the training data
model_3_fit = model_3.fit(x_train, y_train)
2
# predict on new data
model_3_pred = model_3.predict(x_test)

# score the results
model_3_train_score = model_3.score(x_train, y_train).round(4)
model_3_test_score  = model_3.score(x_test, y_test).round(4) 
model_3_gap         = abs(model_3_train_score - model_3_test_score).round(4)

############################ Final Regression results ##########################
# printing dynamic results
print(f"""
Winning analysis:                       {model_2_name}
                   ----------------------------------------------------------------
Model Name:       | {model_1_name}\t{model_2_name}\t{model_3_name}
                   ----------------------------------------------------------------
Train Score:      | {model_1_train_score}\t\t{model_2_train_score}\t\t\t{model_3_train_score}
Test Score:       | {model_1_test_score}\t\t{model_2_test_score}\t\t\t{model_3_test_score}
Train Test Gap:   | {model_1_gap}\t\t{model_1_gap}\t\t\t{model_3_gap}
""")



Winning analysis:                       Linear Regression
                   ----------------------------------------------------------------
Model Name:       | Lasso Regression	Linear Regression	ARD Regression
                   ----------------------------------------------------------------
Train Score:      | 0.6573		0.703			0.6916
Test Score:       | 0.6534		0.7187			0.7103
Train Test Gap:   | 0.0039		0.0039			0.0187

