**Black Friday Sales using AutoML**

**About Data Set**

A retail company wants to understand the purchase behavior of its customers across products in different categories for the Black Friday sale. The available data is a purchase summary for selected high-volume products sold during last year's Black Friday period; it includes customer demographics, product information, and the total purchase amount.

The objective is to develop a model that predicts customers' purchasing capacity for various products. This helps the company create personalized offers for customers on different products and understand which areas need a sales boost during Black Friday.

In [45]:
!apt-get install openjdk-8-jdk

Reading package lists... Done
Building dependency tree       
Reading state information... Done
openjdk-8-jdk is already the newest version (8u342-b07-0ubuntu1~18.04).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.


In [46]:
!java -version

openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu118.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu118.04, mixed mode)


In [47]:
#Installing H2O library
!pip install H2O

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [48]:
#Importing Libraries
%matplotlib inline
import random, os, sys
import h2o
import pprint
import operator
import matplotlib
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from tabulate import tabulate
from h2o.automl import H2OAutoML
from datetime import datetime
import pandas as pd
import logging
import csv
import optparse
import time
import json
import psutil
import numpy as np

In [49]:
#Initialising H2O instance
h2o.init(strict_version_check=False)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime:,1 hour 17 mins
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.38.0.2
H2O_cluster_version_age:,11 days
H2O_cluster_name:,H2O_from_python_unknownUser_x4n48s
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,2.313 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


In [50]:
#Loading dataset 
dfimport = h2o.import_file(path = "/content/train.csv")


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


**Multicollinearity**

In [None]:
df_importpd = pd.read_csv("/content/train.csv")

#Plot a color-scaled correlation matrix (numeric columns only)
corr = df_importpd.select_dtypes(include="number").corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
User_ID,1.0,-0.023971,0.020443,0.003825,0.001529,0.003419,0.004716
Occupation,-0.023971,1.0,0.02428,-0.007618,-0.000384,0.013263,0.020833
Marital_Status,0.020443,0.02428,1.0,0.019888,0.015138,0.019473,-0.000463
Product_Category_1,0.003825,-0.007618,0.019888,1.0,0.540583,0.229678,-0.343703
Product_Category_2,0.001529,-0.000384,0.015138,0.540583,1.0,0.543649,-0.209918
Product_Category_3,0.003419,0.013263,0.019473,0.229678,0.543649,1.0,-0.022006
Purchase,0.004716,0.020833,-0.000463,-0.343703,-0.209918,-0.022006,1.0


Product_Category_2 is moderately correlated with both Product_Category_3 (0.54) and Product_Category_1 (0.54).

Product_Category_1 and Product_Category_3 show a weaker correlation (0.23).
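
One way to make this reading of the matrix explicit is to list the feature pairs whose absolute correlation reaches a cutoff. The sketch below copies the off-diagonal values for the product-category columns and `Purchase` from the matrix above; the 0.5 cutoff and the helper name `flag_pairs` are illustrative assumptions, not part of the original notebook.

```python
# Pairwise Pearson correlations copied from the correlation matrix above.
corr_pairs = {
    ("Product_Category_1", "Product_Category_2"): 0.540583,
    ("Product_Category_1", "Product_Category_3"): 0.229678,
    ("Product_Category_2", "Product_Category_3"): 0.543649,
    ("Product_Category_1", "Purchase"): -0.343703,
    ("Product_Category_2", "Purchase"): -0.209918,
    ("Product_Category_3", "Purchase"): -0.022006,
}


def flag_pairs(pairs, cutoff=0.5):
    """Return the pairs whose absolute correlation is at or above the cutoff.

    Hypothetical helper; the cutoff is an arbitrary illustrative choice.
    """
    return sorted(pair for pair, r in pairs.items() if abs(r) >= cutoff)


flagged = flag_pairs(corr_pairs)
for a, b in flagged:
    print(f"{a} ~ {b}: r = {corr_pairs[(a, b)]:.3f}")
```

With this cutoff only the two pairs involving Product_Category_2 are flagged; a stricter or looser cutoff would change the list.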


**Data Cleaning**

In [51]:
#Removing rows with null values 
newframe=dfimport.na_omit()
df = newframe
df.describe()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
type,int,enum,enum,enum,int,enum,int,int,int,int,int,int
mins,1000001.0,,,,0.0,,0.0,0.0,1.0,2.0,3.0,185.0
mean,1003050.313334605,,,,8.155819000558466,,1.4783152715627863,0.40448469167744716,2.7376130186131653,6.885804367343183,12.65886228518509,11651.448405545067
maxs,1006040.0,,,,20.0,,3.0,1.0,15.0,16.0,18.0,23959.0
sigma,1730.583654879296,,,,6.484714073640686,,0.9898901933053615,0.49079377409580127,2.5672754395767443,4.49440458564803,4.129737896841566,5085.821272724652
zeros,0,,,,17801,,22389,84241,0,0,0,0
missing,0,0,0,0,0,0,0,0,0,0,0,0
0,1000001.0,P00248942,F,0-17,10.0,A,2.0,0.0,1.0,6.0,14.0,15200.0
1,1000004.0,P00184942,M,46-50,7.0,B,2.0,1.0,1.0,8.0,17.0,19215.0
2,1000005.0,P00145042,M,26-35,20.0,A,1.0,1.0,1.0,2.0,5.0,15665.0


**Binary Classification**

In [52]:
#Splitting up and storing dependent and independent variable
dfcopy = df
myY = "Gender"
myX = ["User_ID", "Product_ID", "Age", "Occupation", "City_Category", "Stay_In_Current_City_Years", "Marital_Status", "Product_Category_1", "Product_Category_2", "Product_Category_3"]

In [53]:
#Splitting up the data set into training, testing and validation sets
df_train,df_test,df_valid = dfcopy.split_frame(ratios=[.8, .10])

print ("Rows in Train",df_train.nrow)
print ("Rows in Validation",df_valid.nrow)
print ("Rows in Test",df_test.nrow)

Rows in Train 112998
Rows in Validation 14220
Rows in Test 14241
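
`split_frame` assigns rows to the splits at random, so the resulting sizes are close to, but not exactly, the requested ratios; the rows left over after the 80% and 10% splits go to the third frame. The arithmetic below checks this against the row counts printed above. Note that `split_frame` also accepts a `seed` argument, which the call above does not set, so the exact counts will differ between runs.

```python
# Row counts printed by the split above.
rows = {"train": 112998, "validation": 14220, "test": 14241}

total = sum(rows.values())
shares = {name: count / total for name, count in rows.items()}

print("total rows:", total)
for name, share in shares.items():
    print(f"{name}: {share:.4f}")
```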


In [54]:
ml = H2OAutoML(max_models = 10, seed = 10, verbosity = "info", nfolds = 0)

**seed**: Set a seed for reproducibility. AutoML guarantees reproducibility only under certain conditions. By default, for performance reasons, H2O Deep Learning models are not reproducible, so if the user requires reproducibility, then exclude_algos must contain "DeepLearning". In addition max_models must be used because max_runtime_secs is resource limited, meaning that if the available compute resources are not the same between runs, AutoML may be able to train more models on one run vs another. Defaults to NULL/None.

**verbosity**: Optional; defaults to NULL/None. Sets the verbosity of the backend messages printed during training. Accepted values are "debug", "info", and "warn".

**nfolds**: Optional. Specify a value of 2 or more to set the number of folds for k-fold cross-validation of the models in the AutoML run. Specify -1 to let AutoML choose between k-fold cross-validation and blending mode. Specify 0 to disable cross-validation.
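
To make the `nfolds` description concrete, here is a minimal sketch of how k-fold cross-validation partitions rows: each row lands in exactly one holdout fold, and the model for that fold is trained on the remaining rows. The round-robin assignment and the name `kfold_indices` are assumptions for illustration; H2O's own fold assignment may differ.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_indices, holdout_indices) per fold, assigning rows round-robin."""
    if n_folds < 2:
        raise ValueError("k-fold cross-validation needs at least 2 folds")
    for fold in range(n_folds):
        holdout = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, holdout


folds = list(kfold_indices(n_rows=10, n_folds=5))
for train, holdout in folds:
    print("train:", train, "holdout:", holdout)
```

With `nfolds = 0`, as used in this notebook, no such folds are built and the models are evaluated on the separate validation frame instead.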

In [55]:
#Training the model
ml.train(x = myX, y = myY, training_frame = df_train, validation_frame = df_valid)

AutoML progress: |
03:41:36.335: Project: AutoML_4_20221108_34136
03:41:36.335: Cross-validation disabled by user: no fold column nor nfolds > 1.
03:41:36.335: Setting stopping tolerance adaptively based on the training frame: 0.00297484691273901
03:41:36.335: Build control seed: 10
03:41:36.335: training frame: Frame key: AutoML_4_20221108_34136_training_py_35_sid_8b5d    cols: 12    rows: 112998  chunks: 8    size: 1813864  checksum: -3389356373312444664
03:41:36.335: validation frame: Frame key: py_37_sid_8b5d    cols: 12    rows: 14220  chunks: 8    size: 505054  checksum: -3391914661864499368
03:41:36.335: leaderboard frame: Frame key: py_37_sid_8b5d    cols: 12    rows: 14220  chunks: 8    size: 505054  checksum: -3391914661864499368
03:41:36.335: blending frame: NULL
03:41:36.335: response column: Gender
03:41:36.335: fold column: null
03:41:36.335: weights column: null
03:41:36.340: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,275.0,275.0,1384347.0,0.0,10.0,7.309091,1.0,812.0,318.52365

Unnamed: 0,F,M,Error,Rate
F,25750.0,132.0,0.0051,(132.0/25882.0)
M,72.0,87044.0,0.0008,(72.0/87116.0)
Total,25822.0,87176.0,0.0018,(204.0/112998.0)

metric,threshold,value,idx
max f1,0.616605,0.9988296,188.0
max f2,0.5847235,0.9991416,192.0
max f0point5,0.7018958,0.9990302,176.0
max accuracy,0.616605,0.9981947,188.0
max precision,0.9992662,1.0,0.0
max recall,0.2305772,1.0,265.0
max specificity,0.9992662,1.0,0.0
max absolute_mcc,0.616605,0.9948851,188.0
max min_per_class_accuracy,0.7148204,0.9977157,174.0
max mean_per_class_accuracy,0.7018958,0.9978072,176.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100002,0.9988363,1.2970981,1.2970981,1.0,0.9991672,1.0,0.9991672,0.0129712,0.0129712,29.7098122,29.7098122,0.0129712
2,0.0200004,0.9983772,1.2970981,1.2970981,1.0,0.9985984,1.0,0.9988828,0.0129712,0.0259424,29.7098122,29.7098122,0.0259424
3,0.0300005,0.9979811,1.2970981,1.2970981,1.0,0.9981773,1.0,0.9986476,0.0129712,0.0389136,29.7098122,29.7098122,0.0389136
4,0.0400007,0.9976468,1.2970981,1.2970981,1.0,0.9978125,1.0,0.9984388,0.0129712,0.0518848,29.7098122,29.7098122,0.0518848
5,0.0500009,0.997344,1.2970981,1.2970981,1.0,0.9974952,1.0,0.9982501,0.0129712,0.0648561,29.7098122,29.7098122,0.0648561
6,0.1000018,0.9960456,1.2970981,1.2970981,1.0,0.9966723,1.0,0.9974612,0.0648561,0.1297121,29.7098122,29.7098122,0.1297121
7,0.1500027,0.9947306,1.2970981,1.2970981,1.0,0.9954005,1.0,0.9967743,0.0648561,0.1945682,29.7098122,29.7098122,0.1945682
8,0.2000035,0.9932296,1.2970981,1.2970981,1.0,0.9939882,1.0,0.9960778,0.0648561,0.2594242,29.7098122,29.7098122,0.2594242
9,0.3000053,0.9899103,1.2970981,1.2970981,1.0,0.9916269,1.0,0.9945942,0.1297121,0.3891363,29.7098122,29.7098122,0.3891363
10,0.3999982,0.9855575,1.2970981,1.2970981,1.0,0.9878362,1.0,0.9929048,0.1297006,0.518837,29.7098122,29.7098122,0.518837

Unnamed: 0,F,M,Error,Rate
F,3105.0,75.0,0.0236,(75.0/3180.0)
M,36.0,11004.0,0.0033,(36.0/11040.0)
Total,3141.0,11079.0,0.0078,(111.0/14220.0)

metric,threshold,value,idx
max f1,0.6345794,0.9949817,186.0
max f2,0.5345986,0.9970529,204.0
max f0point5,0.7338643,0.9953889,168.0
max accuracy,0.6345794,0.9921941,186.0
max precision,0.9992664,1.0,0.0
max recall,0.2752923,1.0,257.0
max specificity,0.9992664,1.0,0.0
max absolute_mcc,0.6345794,0.9774522,186.0
max min_per_class_accuracy,0.7868405,0.9880503,155.0
max mean_per_class_accuracy,0.7338643,0.9895731,168.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100563,0.9988202,1.2880435,1.2880435,1.0,0.9991059,1.0,0.9991059,0.0129529,0.0129529,28.8043478,28.8043478,0.0129529
2,0.0200422,0.9983342,1.2880435,1.2880435,1.0,0.9985617,1.0,0.9988347,0.0128623,0.0258152,28.8043478,28.8043478,0.0258152
3,0.0300281,0.9979755,1.2880435,1.2880435,1.0,0.9981558,1.0,0.998609,0.0128623,0.0386775,28.8043478,28.8043478,0.0386775
4,0.0400141,0.9976314,1.2880435,1.2880435,1.0,0.997808,1.0,0.9984091,0.0128623,0.0515399,28.8043478,28.8043478,0.0515399
5,0.05,0.9972751,1.2880435,1.2880435,1.0,0.9974457,1.0,0.9982167,0.0128623,0.0644022,28.8043478,28.8043478,0.0644022
6,0.1,0.996014,1.2880435,1.2880435,1.0,0.9966334,1.0,0.9974251,0.0644022,0.1288043,28.8043478,28.8043478,0.1288043
7,0.15,0.9946271,1.2880435,1.2880435,1.0,0.9953337,1.0,0.9967279,0.0644022,0.1932065,28.8043478,28.8043478,0.1932065
8,0.2,0.9930659,1.2880435,1.2880435,1.0,0.9938551,1.0,0.9960097,0.0644022,0.2576087,28.8043478,28.8043478,0.2576087
9,0.3,0.9894431,1.2880435,1.2880435,1.0,0.9913623,1.0,0.9944606,0.1288043,0.386413,28.8043478,28.8043478,0.386413
10,0.4,0.9847792,1.2871377,1.287817,0.9992968,0.9872475,0.9998242,0.9926573,0.1287138,0.5151268,28.7137681,28.7817029,0.5148123

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2022-11-08 04:08:47,0.003 sec,0.0,0.4202204,0.5381238,0.5,0.7709517,1.0,0.2290483,0.4167113,0.5315461,0.5,0.7763713,1.0,0.2236287
,2022-11-08 04:08:47,0.618 sec,5.0,0.3675748,0.4320278,0.9039653,0.9669541,1.2970981,0.1298342,0.3665096,0.4305318,0.8963834,0.9650094,1.2880435,0.1343179
,2022-11-08 04:08:48,1.314 sec,10.0,0.3394116,0.3811185,0.9300338,0.9761113,1.2970981,0.1079223,0.3404081,0.3828555,0.9208913,0.9735288,1.2880435,0.1151195
,2022-11-08 04:08:49,2.021 sec,15.0,0.3207152,0.3477718,0.9428950,0.9806966,1.2970981,0.0985062,0.3237090,0.3525685,0.9321268,0.9772551,1.2880435,0.1064698
,2022-11-08 04:08:49,2.694 sec,20.0,0.3027522,0.3165052,0.9547717,0.9848220,1.2970981,0.0863378,0.3077773,0.3240915,0.9436445,0.9810486,1.2880435,0.0965541
,2022-11-08 04:08:50,3.302 sec,25.0,0.2876168,0.2902340,0.9620569,0.9873536,1.2970981,0.0786917,0.2942451,0.2997075,0.9514851,0.9837527,1.2880435,0.0893108
,2022-11-08 04:08:51,3.988 sec,30.0,0.2779670,0.2745636,0.9676010,0.9892730,1.2970981,0.0725588,0.2862987,0.2862231,0.9563585,0.9853906,1.2880435,0.0842475
,2022-11-08 04:08:51,4.602 sec,35.0,0.2651894,0.2548131,0.9727291,0.9909705,1.2970981,0.0646295,0.2747619,0.2677372,0.9620652,0.9873261,1.2880435,0.0773558
,2022-11-08 04:08:52,5.200 sec,40.0,0.2538174,0.2377938,0.9788287,0.9930928,1.2970981,0.0572134,0.2644748,0.2514216,0.9693064,0.9899809,1.2880435,0.0682138
,2022-11-08 04:08:53,5.962 sec,45.0,0.2388626,0.2178726,0.9847740,0.9949964,1.2970981,0.0459123,0.2503246,0.2318578,0.9764744,0.9922117,1.2880435,0.0562588

variable,relative_importance,scaled_importance,percentage
User_ID,32268.9667969,1.0,0.4218079
Occupation,15408.3486328,0.4774974,0.2014122
Age,10082.0400391,0.3124376,0.1317887
Stay_In_Current_City_Years,6613.5083008,0.2049495,0.0864493
City_Category,5244.8544922,0.1625356,0.0685588
Marital_Status,2785.7998047,0.0863306,0.0364149
Product_ID,1595.8077393,0.0494533,0.0208598
Product_Category_1,1318.5288086,0.0408606,0.0172353
Product_Category_3,684.1122437,0.0212003,0.0089425
Product_Category_2,499.5957947,0.0154822,0.0065305


MSE: 0.005273554957484822

RMSE: 0.07261924646734377

LogLoss: 0.04114732486618086

Mean Per-Class Error: 0.0029632768871644443

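For binary classification the predictions look plausible: the model misclassifies 111 of 14,220 validation rows (error rate 0.0078). Note that User_ID is by far the most important variable (about 42% of total importance), so much of this accuracy probably comes from recognising individual users rather than from generalisable purchase behaviour.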

In [81]:
lb = ml.leaderboard
lb.head()



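Leaderboard - The AutoML object includes a "leaderboard" of the models trained during the run, ranked by a default metric that depends on the problem type. The number of folds used for cross-validated evaluation can be adjusted with the nfolds parameter; because cross-validation is disabled here (nfolds = 0), the models are scored on the validation frame.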

In [57]:
# Retrieving a model from the leaderboard and printing its algorithm
# (row 0 is the top-ranked model; index 1 below selects the second-ranked one)
best_model = h2o.get_model(ml.leaderboard[1,'model_id'])
best_model.algo

'xgboost'

Tree-based models such as gradient boosting make no distributional assumptions that need to be validated.

The learning rate and the number of trees (`learn_rate` and `ntrees` in H2O; `n_estimators` in scikit-learn) are two critical hyperparameters for gradient-boosted decision trees.

In [80]:
df_pred=ml.leader.predict(df_test)

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [59]:
df_pred.head()

predict,F,M
M,0.011199,0.988801
F,0.987265,0.0127355
F,0.987616,0.012384
M,0.00509308,0.994907
M,0.00816299,0.991837
M,0.00706539,0.992935
M,0.0171497,0.98285
F,0.97118,0.0288201
M,0.00129117,0.998709
F,0.969978,0.030022
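
For a binary problem the `predict` column is obtained by thresholding the class probabilities. As a rough sketch, the code below applies a cutoff to the probability of `M` for the ten rows shown above. The cutoff value is an assumption: it is the max-F1 threshold reported for the validation frame (about 0.6346), and the output does not show which threshold H2O actually applied in this run.

```python
# Class probabilities copied from df_pred.head() above: (p_F, p_M) per row.
probs = [
    (0.011199, 0.988801),
    (0.987265, 0.0127355),
    (0.987616, 0.012384),
    (0.00509308, 0.994907),
    (0.00816299, 0.991837),
    (0.00706539, 0.992935),
    (0.0171497, 0.98285),
    (0.97118, 0.0288201),
    (0.00129117, 0.998709),
    (0.969978, 0.030022),
]

# Assumption: rounded max-F1 threshold from the validation metrics table.
THRESHOLD = 0.6346

labels = ["M" if p_m >= THRESHOLD else "F" for _, p_m in probs]
print(labels)
```

All ten probabilities are far from the cutoff, so any threshold in a wide range around it would reproduce the `predict` column for these rows.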


In [60]:
ml.leader.model_performance(df_test)

Unnamed: 0,F,M,Error,Rate
F,3219.0,69.0,0.021,(69.0/3288.0)
M,30.0,10923.0,0.0027,(30.0/10953.0)
Total,3249.0,10992.0,0.007,(99.0/14241.0)

metric,threshold,value,idx
max f1,0.6371537,0.9954887,181.0
max f2,0.5507402,0.9970832,196.0
max f0point5,0.74715,0.9954929,157.0
max accuracy,0.6371537,0.9930482,181.0
max precision,0.9991156,1.0,0.0
max recall,0.2085478,1.0,276.0
max specificity,0.9991156,1.0,0.0
max absolute_mcc,0.6371537,0.9803738,181.0
max min_per_class_accuracy,0.7830744,0.9885876,149.0
max mean_per_class_accuracy,0.74715,0.9899914,157.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100414,0.9987621,1.3001917,1.3001917,1.0,0.9991062,1.0,0.9991062,0.0130558,0.0130558,30.0191728,30.0191728,0.0130558
2,0.0200126,0.9983257,1.3001917,1.3001917,1.0,0.9985476,1.0,0.9988279,0.0129645,0.0260203,30.0191728,30.0191728,0.0260203
3,0.0300541,0.9979183,1.3001917,1.3001917,1.0,0.9981102,1.0,0.9985881,0.0130558,0.0390761,30.0191728,30.0191728,0.0390761
4,0.0400253,0.997523,1.3001917,1.3001917,1.0,0.9977118,1.0,0.9983698,0.0129645,0.0520405,30.0191728,30.0191728,0.0520405
5,0.0500667,0.9972701,1.3001917,1.3001917,1.0,0.9973924,1.0,0.9981738,0.0130558,0.0650963,30.0191728,30.0191728,0.0650963
6,0.1000632,0.9959093,1.3001917,1.3001917,1.0,0.9965822,1.0,0.9973786,0.065005,0.1301013,30.0191728,30.0191728,0.1301013
7,0.1500597,0.9944882,1.3001917,1.3001917,1.0,0.9951759,1.0,0.9966447,0.065005,0.1951064,30.0191728,30.0191728,0.1951064
8,0.2000562,0.9929849,1.3001917,1.3001917,1.0,0.9937248,1.0,0.995915,0.065005,0.2601114,30.0191728,30.0191728,0.2601114
9,0.3000492,0.9894333,1.3001917,1.3001917,1.0,0.9912995,1.0,0.9943768,0.13001,0.3901214,30.0191728,30.0191728,0.3901214
10,0.4000421,0.9843659,1.2983656,1.2997353,0.9985955,0.987012,0.9996489,0.992536,0.1298274,0.5199489,29.8365616,29.973528,0.5193406
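
The headline numbers in the test-set tables can be recomputed from the confusion matrix alone. The sketch below treats `M` as the positive class (an assumption about how the metrics were computed, which the matching F1 value supports) and derives precision, recall, F1 and the overall error rate.

```python
# Test-set confusion matrix from above, treating "M" as the positive class.
tp = 10923  # actual M, predicted M
fn = 30     # actual M, predicted F
fp = 69     # actual F, predicted M
tn = 3219   # actual F, predicted F

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
error_rate = (fp + fn) / (tp + fn + fp + tn)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} error={error_rate:.4f}")
```

The resulting F1 agrees with the `max f1` value in the metrics table, and the error rate agrees with the `Total` row of the confusion matrix.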


**Multiclass Classification**

In [61]:
dfcopy_multi = df

myY = "City_Category"
myX = ["Age", "Occupation", "Gender", "Stay_In_Current_City_Years", "Marital_Status", "Product_Category_1", "Product_Category_2", "Product_Category_3"]

In [62]:
#Splitting data into training, test and validation sets
df_train,df_test,df_valid = dfcopy_multi.split_frame(ratios=[.8, .10])

print ("Rows in Train",df_train.nrow)
print ("Rows in Validation",df_valid.nrow)
print ("Rows in Test",df_test.nrow)

Rows in Train 113217
Rows in Validation 14131
Rows in Test 14111


In [63]:
ml = H2OAutoML(max_models = 10, seed = 10, verbosity="info", nfolds=0)

In [64]:
ml.train(x = myX, y = myY, training_frame = df_train, validation_frame = df_valid)

AutoML progress: |
04:16:03.463: Project: AutoML_5_20221108_41603
04:16:03.463: Cross-validation disabled by user: no fold column nor nfolds > 1.
04:16:03.463: Setting stopping tolerance adaptively based on the training frame: 0.0029719683395997114
04:16:03.463: Build control seed: 10
04:16:03.463: training frame: Frame key: AutoML_5_20221108_41603_training_py_44_sid_8b5d    cols: 12    rows: 113217  chunks: 8    size: 1816765  checksum: -3356065052680920681
04:16:03.464: validation frame: Frame key: py_46_sid_8b5d    cols: 12    rows: 14131  chunks: 8    size: 503875  checksum: -3358272111842371636
04:16:03.464: leaderboard frame: Frame key: py_46_sid_8b5d    cols: 12    rows: 14131  chunks: 8    size: 503875  checksum: -3358272111842371636
04:16:03.464: blending frame: NULL
04:16:03.464: response column: City_Category
04:16:03.464: fold column: null
04:16:03.464: weights column: null
04:16:03.464: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w)

Unnamed: 0,number_of_trees
,55.0

A,B,C,Error,Rate
17580.0,6932.0,3772.0,0.3784472,"10,704 / 28,284"
4783.0,36662.0,5775.0,0.2235917,"10,558 / 47,220"
4009.0,9427.0,24277.0,0.3562697,"13,436 / 37,713"
26372.0,53021.0,33824.0,0.3064734,"34,698 / 113,217"

k,hit_ratio
1,0.6935266
2,0.9239867
3,1.0

A,B,C,Error,Rate
1873.0,1079.0,572.0,0.4685017,"1,651 / 3,524"
768.0,4106.0,1005.0,0.3015819,"1,773 / 5,879"
570.0,1409.0,2749.0,0.4185702,"1,979 / 4,728"
3211.0,6594.0,4326.0,0.3823509,"5,403 / 14,131"

k,hit_ratio
1,0.6176491
2,0.8908074
3,1.0

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc,validation_rmse,validation_logloss,validation_classification_error,validation_auc,validation_pr_auc
,2022-11-08 04:16:03,0.007 sec,0.0,0.6666667,1.0986123,0.6668963,,,0.6666667,1.0986123,0.6654165,,
,2022-11-08 04:16:04,1.337 sec,5.0,0.5831284,0.8847327,0.3691848,,,0.5908939,0.9052321,0.392612,,
,2022-11-08 04:16:06,2.705 sec,10.0,0.5462324,0.7998787,0.3516168,,,0.5597184,0.8348744,0.38412,,
,2022-11-08 04:16:07,4.334 sec,15.0,0.5272305,0.7564827,0.3386947,,,0.5463644,0.8061039,0.3793787,,
,2022-11-08 04:16:09,6.095 sec,20.0,0.5186373,0.7366005,0.3310722,,,0.5419902,0.7975034,0.3815724,,
,2022-11-08 04:16:11,8.195 sec,25.0,0.5117064,0.7203895,0.3251985,,,0.5392103,0.7925197,0.3788125,,
,2022-11-08 04:16:14,10.662 sec,30.0,0.5074412,0.710551,0.3211797,,,0.5384896,0.792268,0.3820678,,
,2022-11-08 04:16:16,13.386 sec,35.0,0.5035869,0.7014706,0.3180264,,,0.5371272,0.7900338,0.3831293,,
,2022-11-08 04:16:19,16.517 sec,40.0,0.5000054,0.6932197,0.3149439,,,0.5358594,0.7881023,0.3827755,,
,2022-11-08 04:16:23,20.081 sec,45.0,0.4972753,0.6867779,0.3120998,,,0.5356676,0.7888052,0.3836954,,

variable,relative_importance,scaled_importance,percentage
Occupation,44197.9335938,1.0,0.3192818
Stay_In_Current_City_Years,18440.5410156,0.4172263,0.1332128
Product_Category_3,13809.4580078,0.3124458,0.0997583
Product_Category_2,11663.6113281,0.2638949,0.0842569
Product_Category_1,9497.5097656,0.2148858,0.0686091
Marital_Status,9006.0195312,0.2037656,0.0650587
Gender.F,6512.8881836,0.1473573,0.0470485
Age.36-45,4361.9301758,0.0986908,0.0315102
Gender.M,4225.7211914,0.095609,0.0305262
Age.26-35,4035.9433594,0.0913152,0.0291553



MSE: 0.2420687844984086

RMSE: 0.49200486227110457

LogLoss: 0.6743265510853835

Mean Per-Class Error: 0.3194361994553368

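The predictions for multiclass classification are plausible but only moderately accurate: the model assigns the correct city category to about 62% of validation rows (top-1 hit ratio 0.6176), compared with about 69% on the training data.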
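
The hit ratio and the mean per-class error can both be derived from the validation confusion matrix shown earlier. The sketch below assumes rows are actual classes and columns are predicted classes, which is consistent with the row totals in the `Rate` column.

```python
# Validation confusion matrix from above.
# Rows: actual class; columns: predicted A, B, C.
confusion = {
    "A": [1873, 1079, 572],
    "B": [768, 4106, 1005],
    "C": [570, 1409, 2749],
}
classes = list(confusion)

# Top-1 hit ratio: share of rows on the diagonal.
correct = sum(confusion[c][i] for i, c in enumerate(classes))
total = sum(sum(row) for row in confusion.values())
hit_ratio = correct / total

# Error per actual class, then the unweighted mean across classes.
per_class_error = {
    c: 1 - confusion[c][i] / sum(confusion[c]) for i, c in enumerate(classes)
}
mean_per_class_error = sum(per_class_error.values()) / len(classes)

print(f"hit ratio: {hit_ratio:.4f}")
print(f"mean per-class error: {mean_per_class_error:.4f}")
```

The mean per-class error weights the three city categories equally, so it is slightly higher than the overall error rate because the largest class (B) is also the best predicted.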

In [65]:
lb = ml.leaderboard
lb.head()

model_id,mean_per_class_error,logloss,rmse,mse
XGBoost_1_AutoML_5_20221108_41603,0.396218,0.787811,0.53449,0.285679
XGBoost_2_AutoML_5_20221108_41603,0.397843,0.78861,0.536566,0.287903
GBM_4_AutoML_5_20221108_41603,0.399301,0.811989,0.549663,0.302129
GBM_1_AutoML_5_20221108_41603,0.400902,0.822291,0.552369,0.305111
DRF_1_AutoML_5_20221108_41603,0.414946,0.830992,0.545416,0.297479
XGBoost_3_AutoML_5_20221108_41603,0.429686,0.85703,0.566744,0.321199
GBM_3_AutoML_5_20221108_41603,0.431502,0.867977,0.573409,0.328798
GBM_2_AutoML_5_20221108_41603,0.453452,0.898587,0.585876,0.34325
XRT_1_AutoML_5_20221108_41603,0.499446,0.950094,0.6073,0.368814
GLM_1_AutoML_5_20221108_41603,0.643759,1.0585,0.647206,0.418876


In [66]:
# finding and storing the best model and printing the output
best_model = h2o.get_model(ml.leaderboard[0,'model_id'])
best_model.algo

'xgboost'

Tree-based models such as gradient boosting make no distributional assumptions that need to be validated.

The learning rate and the number of trees (`learn_rate` and `ntrees` in H2O; `n_estimators` in scikit-learn) are two critical hyperparameters for gradient-boosted decision trees.

In [67]:
df_pred=ml.leader.predict(df_test)

xgboost prediction progress: |███████████████████████████████████████████████████| (done) 100%


In [68]:
df_pred.head()

predict,A,B,C
B,0.281852,0.607102,0.111046
B,0.123895,0.800839,0.0752663
A,0.50238,0.212688,0.284933
C,0.0414721,0.21641,0.742118
C,0.00700037,0.00684467,0.986155
B,0.00696692,0.864182,0.128851
C,0.239891,0.32741,0.432699
B,0.250075,0.500972,0.248953
C,0.0600912,0.0320156,0.907893
A,0.414308,0.413756,0.171936


In [69]:
ml.leader.model_performance(df_test)

A,B,C,Error,Rate
1872.0,1056.0,599.0,0.4692373,"1,655 / 3,527"
771.0,4145.0,1006.0,0.3000675,"1,777 / 5,922"
609.0,1440.0,2613.0,0.4395109,"2,049 / 4,662"
3252.0,6641.0,4218.0,0.3884204,"5,481 / 14,111"

k,hit_ratio
1,0.6115796
2,0.8853376
3,1.0


**Regression**

In [70]:
dfcopy_reg = df

myY = "Purchase"
myX = ["Gender", "Age", "Occupation", "City_Category", "Stay_In_Current_City_Years", "Marital_Status", "Product_Category_1", "Product_Category_2", "Product_Category_3"]

In [71]:
#Splitting data into training, test and validation sets
df_train,df_test,df_valid = dfcopy_reg.split_frame(ratios=[.8, .10])

print ("Rows in Train",df_train.nrow)
print ("Rows in Validation",df_valid.nrow)
print ("Rows in Test",df_test.nrow)

Rows in Train 113058
Rows in Validation 14102
Rows in Test 14299


In [72]:
ml = H2OAutoML(max_models = 10, seed = 10, verbosity="info", nfolds=0)

In [73]:
ml.train(x = myX, y = myY, training_frame = df_train, validation_frame = df_valid)

AutoML progress: |
04:21:28.345: Project: AutoML_6_20221108_42128
04:21:28.345: Cross-validation disabled by user: no fold column nor nfolds > 1.
04:21:28.345: Setting stopping tolerance adaptively based on the training frame: 0.0029740574307812262
04:21:28.345: Build control seed: 10
04:21:28.345: training frame: Frame key: AutoML_6_20221108_42128_training_py_53_sid_8b5d    cols: 12    rows: 113058  chunks: 8    size: 1814656  checksum: -3392719601830047457
04:21:28.345: validation frame: Frame key: py_55_sid_8b5d    cols: 12    rows: 14102  chunks: 8    size: 503490  checksum: -3387401540773731442
04:21:28.345: leaderboard frame: Frame key: py_55_sid_8b5d    cols: 12    rows: 14102  chunks: 8    size: 503490  checksum: -3387401540773731442
04:21:28.345: blending frame: NULL
04:21:28.345: response column: Purchase
04:21:28.345: fold column: null
04:21:28.345: weights column: null
04:21:28.345: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), gri

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,80.0,80.0,512185.0,0.0,10.0,8.0,1.0,759.0,505.5625

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance,validation_rmse,validation_mae,validation_deviance
,2022-11-08 04:22:22,0.003 sec,0.0,5087.5805235,4225.6682842,25883475.5834307,5082.0793496,4217.7758456,25827530.5151325
,2022-11-08 04:22:23,0.414 sec,5.0,4388.3400002,3646.4606367,19257527.9577057,4410.586651,3666.7080275,19453274.6061133
,2022-11-08 04:22:23,0.813 sec,10.0,3918.1459916,3231.665734,15351868.0115935,3967.5094129,3276.4296094,15741130.9415072
,2022-11-08 04:22:24,1.242 sec,15.0,3614.1233636,2931.292379,13061887.6874282,3683.2169022,2991.3943447,13566086.7488545
,2022-11-08 04:22:24,1.661 sec,20.0,3508.9708632,2803.788627,12312876.5186302,3598.0642842,2879.6917419,12946066.5933064
,2022-11-08 04:22:24,2.085 sec,25.0,3443.1179653,2718.7359879,11855061.3232002,3550.7507209,2809.8147141,12607830.6818794
,2022-11-08 04:22:25,2.530 sec,30.0,3398.4489152,2654.0300316,11549455.0292909,3519.5476449,2758.2808389,12387215.6245089
,2022-11-08 04:22:25,2.960 sec,35.0,3371.5928456,2621.1254571,11367638.3164821,3506.4389864,2736.9745511,12295114.3652268
,2022-11-08 04:22:26,3.383 sec,40.0,3348.1137763,2596.7956027,11209865.8592922,3496.8278234,2724.1274716,12227804.8267009
,2022-11-08 04:22:26,3.911 sec,45.0,3322.2334036,2572.0923653,11037234.788149,3485.6809471,2711.5842059,12149971.6647689

variable,relative_importance,scaled_importance,percentage
Product_Category_1,4505720061952.0,1.0,0.5836358
Product_Category_2,1403037548544.0,0.3113903,0.1817385
Product_Category_3,779597316096.0,0.1730239,0.1009829
Occupation,324449894400.0,0.0720084,0.0420267
Age,255428018176.0,0.0566897,0.0330861
Stay_In_Current_City_Years,161360904192.0,0.0358125,0.0209014
City_Category,148293173248.0,0.0329122,0.0192087
Gender,74396254208.0,0.0165115,0.0096367
Marital_Status,67805888512.0,0.0150488,0.008783
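
The `scaled_importance` and `percentage` columns are simple transformations of `relative_importance`: the first divides by the largest value, the second by the sum. The sketch below recomputes them from the values above and adds up the share of the three product-category columns.

```python
# relative_importance values copied from the variable importance table above.
relative_importance = {
    "Product_Category_1": 4505720061952.0,
    "Product_Category_2": 1403037548544.0,
    "Product_Category_3": 779597316096.0,
    "Occupation": 324449894400.0,
    "Age": 255428018176.0,
    "Stay_In_Current_City_Years": 161360904192.0,
    "City_Category": 148293173248.0,
    "Gender": 74396254208.0,
    "Marital_Status": 67805888512.0,
}

top = max(relative_importance.values())
total = sum(relative_importance.values())

scaled = {name: value / top for name, value in relative_importance.items()}
percentage = {name: value / total for name, value in relative_importance.items()}

category_share = sum(
    share for name, share in percentage.items()
    if name.startswith("Product_Category")
)
print(f"product categories account for {category_share:.1%} of total importance")
```

The product-category columns dominate the regression model, while the demographic columns together contribute only a small share.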


In [74]:
lb = ml.leaderboard
lb.head()

model_id,rmse,mse,mae,rmsle,mean_residual_deviance
GBM_4_AutoML_6_20221108_42128,3475.86,12081600.0,2695.41,,12081600.0
GBM_1_AutoML_6_20221108_42128,3487.63,12163600.0,2707.06,,12163600.0
GBM_3_AutoML_6_20221108_42128,3489.34,12175500.0,2716.55,0.365963,12175500.0
XGBoost_3_AutoML_6_20221108_42128,3505.49,12288500.0,2735.2,,12288500.0
GBM_2_AutoML_6_20221108_42128,3505.54,12288800.0,2736.65,0.368117,12288800.0
XGBoost_2_AutoML_6_20221108_42128,3516.96,12369000.0,2720.46,,12369000.0
XRT_1_AutoML_6_20221108_42128,3541.29,12540700.0,2772.6,0.374721,12540700.0
XGBoost_1_AutoML_6_20221108_42128,3549.21,12596900.0,2734.34,,12596900.0
DRF_1_AutoML_6_20221108_42128,3651.08,13330400.0,2803.67,0.374729,13330400.0
GLM_1_AutoML_6_20221108_42128,5082.06,25827300.0,4217.76,0.609805,25827300.0


In [79]:
# finding and storing the best model and printing the output
best_model = h2o.get_model(ml.leaderboard[0,'model_id'])
best_model.algo

'gbm'

A GBM model has two categories of hyperparameters: boosting hyperparameters and tree-specific hyperparameters. The two main boosting hyperparameters are the number of trees (the total number of trees in the sequence or ensemble) and the learning rate (how much each tree contributes to the final prediction).
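
The roles of the number of trees and the learning rate are easiest to see in a tiny from-scratch example. The sketch below is not H2O's GBM: it boosts one-split regression trees (stumps) on a toy one-dimensional target with squared-error loss, and all function names are hypothetical. Each new stump is fitted to the current residuals, and its contribution is scaled by the learning rate.

```python
import math


def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by minimising squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        split = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees, learn_rate):
    """Return training predictions after adding n_trees stumps scaled by learn_rate."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean of the target
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        split, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learn_rate * (left_mean if x <= split else right_mean)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]  # toy target, unrelated to the Black Friday data

errors = {n: mse(ys, boost(xs, ys, n, learn_rate=0.1)) for n in (0, 5, 50)}
for n_trees, err in errors.items():
    print(n_trees, "trees -> training MSE", round(err, 5))
```

Training error falls as trees are added; a smaller learning rate makes each step more cautious, so more trees are needed to reach the same fit. This only shows training error, so it says nothing about overfitting, which is why a validation frame is used above.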

In [76]:
df_pred=ml.leader.predict(df_test)

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [77]:
df_pred.head()

predict
6724.08
12839.9
13371.2
11906.8
6990.39
12407.9
9312.81
3157.94
14632.1
11350.1


In [78]:
ml.leader.model_performance(df_test)

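Because no pre-processing or feature engineering was applied, the MSE and RMSE remain high: the best model's validation RMSE is about 3,476, against a mean Purchase of about 11,651 and a standard deviation of about 5,086.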
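
To put the regression error in context, it helps to compare it with a constant prediction. The sketch below uses three numbers from the outputs above: the leader's validation MSE from the leaderboard, the validation RMSE at 0 trees from the scoring history (taken here as the error of a constant prediction, which is an interpretation rather than something the output states), and the mean of `Purchase` from `df.describe()`.

```python
import math

best_mse = 12081600.0                # GBM_4 validation MSE (leaderboard)
baseline_rmse = 5082.0793496         # validation RMSE at 0 trees (scoring history)
purchase_mean = 11651.448405545067   # mean of Purchase (df.describe())

best_rmse = math.sqrt(best_mse)               # RMSE is the square root of MSE
reduction = 1 - best_rmse / baseline_rmse     # improvement over the constant prediction
relative_rmse = best_rmse / purchase_mean     # error relative to a typical purchase

print(f"RMSE: {best_rmse:.2f}")
print(f"reduction vs. baseline: {reduction:.1%}")
print(f"RMSE / mean purchase: {relative_rmse:.1%}")
```

The model cuts the error of the constant baseline by roughly a third, but the remaining RMSE is still a sizeable fraction of a typical purchase amount.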


**Conclusion**

For the given dataset, binary classification (Gender) gives accurate predictions and multiclass classification (City_Category) gives reasonable ones.

Regression on the numeric target Purchase needs feature engineering and appropriate data pre-processing to give better predictions.

**References**

6105_H2O_automl_lending_club.ipynb - https://github.com/aiskunks/Skunks_Skool/blob/main/INFO_6105/6105/6105_Airlines_GBM_AutoML.ipynb

Scikit-learn official documentation

h2o.ai Documentation

Kaggle

Referred to the article https://towardsdatascience.com/back-to-basics-assumptions-of-common-machine-learning-models-e43c02325535 to understand model assumptions

Used h2o.ai for AutoML implementation

https://www.kaggle.com/code/gopikrishnamashetty/black-friday-sales-eda-prediction/notebook

https://www.kaggle.com/code/margesh/regression-scikit-xgb-h2o-automl