# ASSIGNMENT 2: AUTOML

**Author: Shalini Shree**

**NUID: 002769035**

## **Introduction to AutoML and H2O**

Automated Machine Learning provides methods and processes to make Machine Learning available for non-Machine Learning experts, to improve efficiency of Machine Learning and to accelerate research on Machine Learning.
It is tied in with producing Machine Learning solutions for the data scientist without doing unlimited inquiries on data preparation, model selection, model hyperparameters, and model compression parameters.



H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

## **Dataset**

**Binary and Multi class Classification: Glass Classification dataset**

Attribute Information:

RI: refractive index

Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)

Mg: Magnesium

Al: Aluminum

Si: Silicon

K: Potassium

Ca: Calcium

Ba: Barium

Fe: Iron

Type of glass: (class attribute)
-- 1 buildingwindowsfloatprocessed -- 2 buildingwindowsnonfloatprocessed -- 3 vehiclewindowsfloatprocessed
-- 4 vehiclewindowsnonfloatprocessed (none in this database)
-- 5 containers
-- 6 tableware
-- 7 headlamps

**Regression model: Walmart dataset**

Store - the store number

Date - the week of sales

Weekly_Sales - sales for the given store

Holiday_Flag - whether the week is a special holiday week 1 – Holiday week 0 – Non-holiday week

Temperature - Temperature on the day of sale

Fuel_Price - Cost of fuel in the region

CPI – Prevailing consumer price index

Unemployment - Prevailing unemployment rate

Holiday Events\
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13\
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13\
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13\
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

In [1]:
# installing H2O
pip install h2o

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# import the library

import h2o
from h2o.automl import H2OAutoML
import os
import re
import io
import pandas as pd


In [3]:
# connecting to a cluster
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.16" 2022-07-19; OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu118.04); OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu118.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.7/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpc2jrgz4_
  JVM stdout: /tmp/tmpc2jrgz4_/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpc2jrgz4_/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.38.0.2
H2O_cluster_version_age:,11 days
H2O_cluster_name:,H2O_from_python_unknownUser_v0qq6p
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.172 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


In [4]:
# loading datset
df_path = "/content/glass.csv"
data_path = pd.read_csv("/content/glass.csv")

In [5]:
# creating a copy of the dataset
df_copy = data_path.copy()

In [6]:
# load data into h20 

df = h2o.H2OFrame(data_path)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [7]:
# description of the dataset
df.describe()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
type,real,real,real,real,real,real,real,real,real,int
mins,1.51115,10.73,0.0,0.29,69.81,0.0,5.43,0.0,0.0,1.0
mean,1.5183654205607482,13.407850467289718,2.6845327102803735,1.4449065420560752,72.65093457943931,0.4970560747663549,8.956962616822434,0.1750467289719627,0.057009345794392534,2.7803738317757007
maxs,1.53393,17.38,4.49,3.5,75.41,6.21,16.19,3.15,0.51,7.0
sigma,0.003036863739385533,0.816603555714983,1.442407844870442,0.4992696456004844,0.7745457947651088,0.6521918455589797,1.423153487281394,0.49721926059970345,0.09743870063650084,2.1037386462007546
zeros,0,0,42,0,0,30,0,176,144,0
missing,0,0,0,0,0,0,0,0,0,0
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1.0
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1.0
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1.0


## **MULTI CLASS CLASSIFICATION**

In [8]:
# determining column types using get_independent_variables function
def get_independent_variables(df, targ):
    C = [name for name in df.columns if name != targ]
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x=ints+enums+reals
    return x

In [9]:
# Assigning the traget value
target ="Type"

In [10]:
# Splitting dataset into train and test data
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]

In [11]:
# define the independent variable
X=get_independent_variables(df, target) 
print(X)

['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']


In [12]:
# define the dependent variable
y = target

In [13]:
df[y] = df[y].asfactor()

In [14]:
# Set up AutoML
import time
aml = H2OAutoML(max_runtime_secs=60)

In [15]:
# set model start time and train the aml model
model_start_time = time.time()
aml.train(x=X,y=y,training_frame=df)

AutoML progress: |███████████
01:37:50.965: GBM_1_AutoML_1_20221108_13738 [GBM def_5] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_1_AutoML_1_20221108_13738.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 172.0.


██████████████████████████████

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,41.0,246.0,97356.0,3.0,14.0,10.422764,4.0,31.0,26.691057

1,2,3,5,6,7,Error,Rate
70.0,0.0,0.0,0.0,0.0,0.0,0.0,0 / 70
0.0,76.0,0.0,0.0,0.0,0.0,0.0,0 / 76
0.0,0.0,17.0,0.0,0.0,0.0,0.0,0 / 17
0.0,0.0,0.0,13.0,0.0,0.0,0.0,0 / 13
0.0,0.0,0.0,0.0,9.0,0.0,0.0,0 / 9
0.0,0.0,0.0,0.0,0.0,29.0,0.0,0 / 29
70.0,76.0,17.0,13.0,9.0,29.0,0.0,0 / 214

k,hit_ratio
1,1.0
2,1.0
3,1.0
4,1.0
5,1.0
6,1.0

1,2,3,5,6,7,Error,Rate
59.0,7.0,4.0,0.0,0.0,0.0,0.1571429,11 / 70
7.0,66.0,1.0,0.0,1.0,1.0,0.1315789,10 / 76
10.0,1.0,6.0,0.0,0.0,0.0,0.6470588,11 / 17
0.0,3.0,0.0,9.0,0.0,1.0,0.3076923,4 / 13
0.0,1.0,0.0,0.0,8.0,0.0,0.1111111,1 / 9
1.0,1.0,0.0,1.0,0.0,26.0,0.1034483,3 / 29
77.0,79.0,11.0,10.0,9.0,28.0,0.1869159,40 / 214

k,hit_ratio
1,0.8130841
2,0.9205608
3,0.9813085
4,0.9953272
5,1.0000001
6,1.0000001

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.8129568,0.052603,0.744186,0.8372093,0.8139535,0.8837209,0.7857143
auc,,0.0,,,,,
err,0.1870432,0.052603,0.255814,0.1627907,0.1860465,0.1162791,0.2142857
err_count,8.0,2.236068,11.0,7.0,8.0,5.0,9.0
logloss,0.6340082,0.1763594,0.7966967,0.6630163,0.7180691,0.3348623,0.6573967
max_per_class_error,0.7666667,0.1369306,1.0,0.75,0.75,0.6666667,0.6666667
mean_per_class_accuracy,0.7683929,0.0458623,0.6959326,0.8138889,0.8011905,0.765873,0.7650794
mean_per_class_error,0.2316071,0.0458623,0.3040675,0.1861111,0.1988095,0.234127,0.2349206
mse,0.1749834,0.0518625,0.2443597,0.1688113,0.1656372,0.1013104,0.1947983
pr_auc,,0.0,,,,,

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc
,2022-11-08 01:38:31,1.880 sec,0.0,0.8333333,1.7917595,0.7523364,,
,2022-11-08 01:38:31,1.912 sec,5.0,0.5558012,0.8217801,0.046729,,
,2022-11-08 01:38:31,1.944 sec,10.0,0.3806572,0.4688556,0.0186916,,
,2022-11-08 01:38:31,1.974 sec,15.0,0.2624144,0.2775877,0.0140187,,
,2022-11-08 01:38:31,2.015 sec,20.0,0.182954,0.1699247,0.0046729,,
,2022-11-08 01:38:31,2.045 sec,25.0,0.1243016,0.1045321,0.0,,
,2022-11-08 01:38:31,2.096 sec,30.0,0.0840857,0.0660646,0.0,,
,2022-11-08 01:38:31,2.128 sec,35.0,0.0595182,0.0432302,0.0,,
,2022-11-08 01:38:31,2.157 sec,40.0,0.0414395,0.0284971,0.0,,
,2022-11-08 01:38:31,2.162 sec,41.0,0.0382439,0.026231,0.0,,

variable,relative_importance,scaled_importance,percentage
Mg,94.3390274,1.0,0.1716742
Ba,90.447197,0.9587463,0.164592
RI,89.6432343,0.9502243,0.163129
Al,74.299942,0.7875844,0.1352079
Ca,61.6763496,0.6537734,0.112236
Na,40.4640274,0.4289214,0.0736347
K,37.6020584,0.3985843,0.0684266
Si,36.7050972,0.3890765,0.0667944
Fe,24.3466854,0.2580765,0.0443051


Hyperparamerts of gbm models

hyperparamerts like number of trees, minimum depth, maximum depth, number of leaves are important. Tuining them can change model fitting and increase or decrease error.

Using H2O model training we get a list of important variables, where the most important variables is Mg, followed by BA and RI.

In [70]:
# creating a dictionary
data={}

In [17]:
# calculating model execution time
data['model_execution_time'] = {"classification":(time.time() - model_start_time)} 
data

{'model_execution_time': {'classification': 62.08951282501221}}

In [18]:
# printing the leaderBoard for models used
print(aml.leaderboard)

model_id                                          mean_per_class_error    logloss      rmse       mse
GBM_grid_1_AutoML_1_20221108_13738_model_1                    0.243005   0.610621  0.422908  0.178851
GBM_2_AutoML_1_20221108_13738                                 0.24744    0.571441  0.413211  0.170743
GBM_grid_1_AutoML_1_20221108_13738_model_2                    0.251965   0.609862  0.42071   0.176997
GBM_4_AutoML_1_20221108_13738                                 0.266098   0.587927  0.415696  0.172803
GBM_3_AutoML_1_20221108_13738                                 0.275315   0.582091  0.416912  0.173816
XRT_1_AutoML_1_20221108_13738                                 0.277868   0.613476  0.447387  0.200155
XGBoost_3_AutoML_1_20221108_13738                             0.287948   0.61507   0.440556  0.19409
XGBoost_grid_1_AutoML_1_20221108_13738_model_5                0.294715   0.640773  0.45279   0.205018
GBM_5_AutoML_1_20221108_13738                                 0.299228   0.635926  

In [19]:
# predicting on test data
prediction = aml.leader.predict(test)

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [20]:
# prediction of multinomial data
prediction.head()

predict,p1,p2,p3,p5,p6,p7
1,0.917437,0.0700087,0.00413906,0.00280936,0.00280758,0.00279824
1,0.994724,0.00242211,0.00161607,0.000412779,0.000411903,0.000412955
1,0.970675,0.0231082,0.00317667,0.00101125,0.00101395,0.00101514
1,0.995928,0.00195822,0.00092579,0.000416772,0.000385405,0.000385566
1,0.97583,0.0010233,0.018824,0.000898428,0.00228368,0.00114088
1,0.981331,0.00970275,0.0030828,0.00154786,0.00170593,0.00262942
1,0.938132,0.0521868,0.00217845,0.00239347,0.0023206,0.00278857
1,0.940422,0.00336919,0.0384461,0.00121945,0.0141047,0.00243892
1,0.995495,0.00127656,0.00172166,0.000524635,0.000514391,0.00046782
2,0.00580506,0.990677,0.00109289,0.000809334,0.000807755,0.000808384


In [21]:
# finding and storing the best model
best_model = h2o.get_model(aml.leaderboard[0,'model_id'])

In [22]:
# printing the best model
best_model.algo

'gbm'

For GBM model there are no assumtions to validate because it is a tree based model and tree-based models are robust to outliers and do not require the dependent variables to meet any normality assumptions.. Hence, it serves to be the best model.

In [23]:
# log loss error for the best model
print(best_model.logloss(train = True))

0.026230961760033533


In [24]:
# performance of the gbm model
best_model.model_performance(test)

1,2,3,5,6,7,Error,Rate
9.0,0.0,0.0,0.0,0.0,0.0,0.0,0 / 9
0.0,14.0,0.0,0.0,0.0,0.0,0.0,0 / 14
0.0,0.0,3.0,0.0,0.0,0.0,0.0,0 / 3
0.0,0.0,0.0,3.0,0.0,0.0,0.0,0 / 3
0.0,0.0,0.0,0.0,3.0,0.0,0.0,0 / 3
0.0,0.0,0.0,0.0,0.0,3.0,0.0,0 / 3
9.0,14.0,3.0,3.0,3.0,3.0,0.0,0 / 35

k,hit_ratio
1,1.0
2,1.0
3,1.0
4,1.0
5,1.0
6,1.0


In this Multiclass classification model, the models used does make sense. The log loss vlaues is very close to 0.

## **BINARY CLASSIFICATION**

In [25]:
# printing top 10 values from the dataset
df_copy.head(10)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1
5,1.51596,12.79,3.61,1.62,72.97,0.64,8.07,0.0,0.26,1
6,1.51743,13.3,3.6,1.14,73.09,0.58,8.17,0.0,0.0,1
7,1.51756,13.15,3.61,1.05,73.24,0.57,8.24,0.0,0.0,1
8,1.51918,14.04,3.58,1.37,72.08,0.56,8.3,0.0,0.0,1
9,1.51755,13.0,3.6,1.36,72.99,0.57,8.4,0.0,0.11,1


In [26]:
# converting type column from multinomial to binomial
df_copy['Type'] = df_copy['Type'].replace([2,3,4,5,6,7], 0)

In [27]:
df_copy.head(10)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1
5,1.51596,12.79,3.61,1.62,72.97,0.64,8.07,0.0,0.26,1
6,1.51743,13.3,3.6,1.14,73.09,0.58,8.17,0.0,0.0,1
7,1.51756,13.15,3.61,1.05,73.24,0.57,8.24,0.0,0.0,1
8,1.51918,14.04,3.58,1.37,72.08,0.56,8.3,0.0,0.0,1
9,1.51755,13.0,3.6,1.36,72.99,0.57,8.4,0.0,0.11,1


In [28]:
# load data into h20 
df1 = h2o.H2OFrame(df_copy)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [29]:
df1.head(10)

RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0,0.0,1
1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0,0.0,1
1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0,0.0,1
1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0,0.0,1
1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0,0.0,1
1.51596,12.79,3.61,1.62,72.97,0.64,8.07,0,0.26,1
1.51743,13.3,3.6,1.14,73.09,0.58,8.17,0,0.0,1
1.51756,13.15,3.61,1.05,73.24,0.57,8.24,0,0.0,1
1.51918,14.04,3.58,1.37,72.08,0.56,8.3,0,0.0,1
1.51755,13.0,3.6,1.36,72.99,0.57,8.4,0,0.11,1


In [30]:
# description of the dataset
df1.describe()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
type,real,real,real,real,real,real,real,real,real,int
mins,1.51115,10.73,0.0,0.29,69.81,0.0,5.43,0.0,0.0,0.0
mean,1.5183654205607482,13.407850467289718,2.6845327102803735,1.4449065420560752,72.65093457943931,0.4970560747663549,8.956962616822434,0.1750467289719627,0.057009345794392534,0.32710280373831774
maxs,1.53393,17.38,4.49,3.5,75.41,6.21,16.19,3.15,0.51,1.0
sigma,0.003036863739385533,0.816603555714983,1.442407844870442,0.4992696456004844,0.7745457947651088,0.6521918455589797,1.423153487281394,0.49721926059970345,0.09743870063650084,0.47025516866279526
zeros,0,0,42,0,0,30,0,176,144,144
missing,0,0,0,0,0,0,0,0,0,0
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1.0
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1.0
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1.0


In [31]:
# determining column types using get_independent_variables function
def get_independent_variables(df1, targ):
    C = [name for name in df1.columns if name != targ]
    ints, reals, enums = [], [], []
    for key, val in df1.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x=ints+enums+reals
    return x

In [32]:
# Assigning the traget value
target1 ="Type"

In [33]:
# Splitting dataset into train and test data
splits1 = df1.split_frame(ratios = [0.8], seed = 1)
train1 = splits1[0]
test1 = splits1[1]

In [34]:
# define the independent variables
X1=get_independent_variables(df1, target1) 
print(X1)

['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']


In [35]:
# define the dependent variable
y1 = target1

In [36]:
df1[y1] = df1[y1].asfactor()

In [37]:
# Set up AutoML
aml1 = H2OAutoML(max_runtime_secs=60)

In [38]:
# set model start time and train the aml model
model_start_time1 = time.time()
aml1.train(x=X1,y=y1,training_frame=df1)

AutoML progress: |█
01:38:43.832: GBM_1_AutoML_2_20221108_13841 [GBM def_5] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_1_AutoML_2_20221108_13841.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 171.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 172.0.


████████████████████████████████████████

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,60.0,60.0,20757.0,6.0,12.0,8.95,18.0,27.0,22.833334

Unnamed: 0,0,1,Error,Rate
0,144.0,0.0,0.0,(0.0/144.0)
1,0.0,70.0,0.0,(0.0/70.0)
Total,144.0,70.0,0.0,(0.0/214.0)

metric,threshold,value,idx
max f1,0.6539482,1.0,68.0
max f2,0.6539482,1.0,68.0
max f0point5,0.6539482,1.0,68.0
max accuracy,0.6539482,1.0,68.0
max precision,0.9953582,1.0,0.0
max recall,0.6539482,1.0,68.0
max specificity,0.9953582,1.0,0.0
max absolute_mcc,0.6539482,1.0,68.0
max min_per_class_accuracy,0.6539482,1.0,68.0
max mean_per_class_accuracy,0.6539482,1.0,68.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0140187,0.9934746,3.0571429,3.0571429,1.0,0.9944292,1.0,0.9944292,0.0428571,0.0428571,205.7142857,205.7142857,0.0428571
2,0.0233645,0.9907111,3.0571429,3.0571429,1.0,0.9912065,1.0,0.9931401,0.0285714,0.0714286,205.7142857,205.7142857,0.0714286
3,0.0327103,0.9898743,3.0571429,3.0571429,1.0,0.9901699,1.0,0.9922915,0.0285714,0.1,205.7142857,205.7142857,0.1
4,0.0420561,0.9887158,3.0571429,3.0571429,1.0,0.9892772,1.0,0.9916216,0.0285714,0.1285714,205.7142857,205.7142857,0.1285714
5,0.0514019,0.9881207,3.0571429,3.0571429,1.0,0.9884499,1.0,0.991045,0.0285714,0.1571429,205.7142857,205.7142857,0.1571429
6,0.1028037,0.9782808,3.0571429,3.0571429,1.0,0.9845689,1.0,0.9878069,0.1571429,0.3142857,205.7142857,205.7142857,0.3142857
7,0.1495327,0.9667968,3.0571429,3.0571429,1.0,0.9727631,1.0,0.9831057,0.1428571,0.4571429,205.7142857,205.7142857,0.4571429
8,0.2009346,0.9566626,3.0571429,3.0571429,1.0,0.9621021,1.0,0.9777327,0.1571429,0.6142857,205.7142857,205.7142857,0.6142857
9,0.2990654,0.8821043,3.0571429,3.0571429,1.0,0.9294866,1.0,0.961902,0.3,0.9142857,205.7142857,205.7142857,0.9142857
10,0.4018692,0.057881,0.8337662,2.4883721,0.2727273,0.3066225,0.8139535,0.7942723,0.0857143,1.0,-16.6233766,148.8372093,0.8888889

Unnamed: 0,0,1,Error,Rate
0,135.0,9.0,0.0625,(9.0/144.0)
1,14.0,56.0,0.2,(14.0/70.0)
Total,149.0,65.0,0.1075,(23.0/214.0)

metric,threshold,value,idx
max f1,0.5136937,0.8296296,64.0
max f2,0.1415528,0.88,94.0
max f0point5,0.5278751,0.8540373,62.0
max accuracy,0.5278751,0.8925234,62.0
max precision,0.9980357,1.0,0.0
max recall,0.0504538,1.0,124.0
max specificity,0.9980357,1.0,0.0
max absolute_mcc,0.5136937,0.7523891,64.0
max min_per_class_accuracy,0.2805159,0.8714286,78.0
max mean_per_class_accuracy,0.1610651,0.8740079,90.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0140187,0.9964338,3.0571429,3.0571429,1.0,0.9973174,1.0,0.9973174,0.0428571,0.0428571,205.7142857,205.7142857,0.0428571
2,0.0233645,0.9925183,3.0571429,3.0571429,1.0,0.9943879,1.0,0.9961456,0.0285714,0.0714286,205.7142857,205.7142857,0.0714286
3,0.0327103,0.9892293,3.0571429,3.0571429,1.0,0.9912959,1.0,0.99476,0.0285714,0.1,205.7142857,205.7142857,0.1
4,0.0420561,0.976872,3.0571429,3.0571429,1.0,0.9834085,1.0,0.9922374,0.0285714,0.1285714,205.7142857,205.7142857,0.1285714
5,0.0514019,0.9699111,3.0571429,3.0571429,1.0,0.9736332,1.0,0.9888548,0.0285714,0.1571429,205.7142857,205.7142857,0.1571429
6,0.1028037,0.9334589,2.7792208,2.9181818,0.9090909,0.9520377,0.9545455,0.9704463,0.1428571,0.3,177.9220779,191.8181818,0.2930556
7,0.1495327,0.8660886,2.7514286,2.8660714,0.9,0.9019929,0.9375,0.9490546,0.1285714,0.4285714,175.1428571,186.6071429,0.4146825
8,0.2009346,0.7542843,2.7792208,2.8438538,0.9090909,0.8304793,0.9302326,0.9187214,0.1428571,0.5714286,177.9220779,184.3853821,0.5505952
9,0.2990654,0.5143727,2.1836735,2.6272321,0.7142857,0.6432748,0.859375,0.8283405,0.2142857,0.7857143,118.3673469,162.7232143,0.7232143
10,0.4018692,0.2009988,1.1116883,2.2395349,0.3636364,0.3397915,0.7325581,0.7033628,0.1142857,0.9,11.1688312,123.9534884,0.7402778

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.8836102,0.0787012,0.7906977,0.8604651,0.8372093,0.9534883,0.9761905
auc,0.9324244,0.0528155,0.8965517,0.91133,0.8768473,0.9901478,0.9872449
err,0.1163898,0.0787012,0.2093023,0.1395349,0.1627907,0.0465116,0.0238095
err_count,5.0,3.391165,9.0,6.0,7.0,2.0,1.0
f0point5,0.7941325,0.1216819,0.6603774,0.7446808,0.7222222,0.8974359,0.9459459
f1,0.8534031,0.0914943,0.7567568,0.8235294,0.7878788,0.9333333,0.9655172
f2,0.9263866,0.0521164,0.886076,0.9210526,0.8666667,0.9722222,0.9859155
lift_top_group,2.442857,1.3659489,3.0714285,3.0714285,0.0,3.0714285,3.0
logloss,0.2849843,0.1338604,0.3838471,0.349975,0.4092264,0.1216853,0.160188
max_per_class_error,0.1657635,0.1124392,0.3103448,0.2068966,0.2068966,0.0689655,0.0357143

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
,2022-11-08 01:39:19,3.386 sec,0.0,0.4691552,0.6321079,0.5,0.3271028,1.0,0.6728972
,2022-11-08 01:39:19,3.395 sec,5.0,0.3547365,0.421024,0.983879,0.9742924,3.0571429,0.0514019
,2022-11-08 01:39:19,3.403 sec,10.0,0.287986,0.3157568,0.9959325,0.9920462,3.0571429,0.0327103
,2022-11-08 01:39:19,3.411 sec,15.0,0.2334036,0.2363396,0.9973214,0.9943348,3.0571429,0.0186916
,2022-11-08 01:39:19,3.418 sec,20.0,0.1942301,0.1833076,0.9990079,0.9980241,3.0571429,0.0093458
,2022-11-08 01:39:19,3.426 sec,25.0,0.1672074,0.1464563,0.9993056,0.9985968,3.0571429,0.0093458
,2022-11-08 01:39:19,3.436 sec,30.0,0.1434125,0.1169504,0.9996032,0.9992225,3.0571429,0.0046729
,2022-11-08 01:39:19,3.448 sec,35.0,0.1230042,0.0952498,0.9998016,0.9996004,3.0571429,0.0046729
,2022-11-08 01:39:19,3.456 sec,40.0,0.1081295,0.07822,0.9999008,0.9997974,3.0571429,0.0046729
,2022-11-08 01:39:19,3.464 sec,45.0,0.0978256,0.0662362,0.9999008,0.9997974,3.0571429,0.0046729

variable,relative_importance,scaled_importance,percentage
RI,47.0282631,1.0,0.2274896
Mg,43.9527359,0.9346026,0.2126124
Al,35.1496048,0.7474145,0.170029
Na,24.4282417,0.5194375,0.1181666
Ca,15.7799177,0.3355412,0.0763321
Si,13.2198448,0.2811043,0.0639483
K,9.9104614,0.2107342,0.0479398
Fe,9.2567387,0.1968335,0.0447776
Ba,8.001297,0.1701381,0.0387046


Hyperparamerts of gbm models

hyperparamerts like number of trees, minimum depth, maximum depth, number of leaves are important. Tuining them can change model fitting and increase or decrease error.

In [39]:
# creating a dictionary
data1={}

In [40]:
# calculating model execution time
data1['model_execution_time'] = {"Binomial classification":(time.time() - model_start_time)} 
data1

{'model_execution_time': {'Binomial classification': 126.47104001045227}}

In [41]:
# printing the leaderBoard for models used
print(aml1.leaderboard)

model_id                                                     auc    logloss     aucpr    mean_per_class_error      rmse        mse
GBM_grid_1_AutoML_2_20221108_13841_model_5              0.947321   0.278782  0.895359               0.13125    0.296334  0.0878136
GBM_5_AutoML_2_20221108_13841                           0.946925   0.275321  0.897268               0.101885   0.290888  0.0846161
StackedEnsemble_BestOfFamily_3_AutoML_2_20221108_13841  0.945933   0.278593  0.88547                0.0947421  0.29254   0.0855798
StackedEnsemble_AllModels_3_AutoML_2_20221108_13841     0.942956   0.330909  0.877096               0.130853   0.317953  0.101094
GBM_grid_1_AutoML_2_20221108_13841_model_1              0.942659   0.300275  0.887277               0.1125     0.310702  0.096536
GBM_grid_1_AutoML_2_20221108_13841_model_7              0.941865   0.298225  0.87745                0.105159   0.304709  0.0928476
XGBoost_grid_1_AutoML_2_20221108_13841_model_24         0.941766   0.290472  0.891985

In [42]:
# predicting on test data
prediction1 = aml1.leader.predict(test1)

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [43]:
# prediction of Binomial data
prediction1.head()

predict,p0,p1
1,0.346052,0.653948
1,0.0132632,0.986737
1,0.054681,0.945319
1,0.0140142,0.985986
1,0.0333701,0.96663
1,0.0530885,0.946912
1,0.170836,0.829164
1,0.11481,0.88519
1,0.0391653,0.960835
0,0.978593,0.0214068


In [44]:
# finding and storing the best model
best_model1 = h2o.get_model(aml1.leaderboard[0,'model_id'])

In [45]:
# printing the best model
best_model1.algo

'gbm'

For GBM model there are no assumtions to validate because it is a tree based model and tree-based models are robust to outliers and do not require the dependent variables to meet any normality assumptions.. Hence, it serves to be the best model.

In [46]:
# log loss error for the best model
print(best_model.logloss(train = True))

0.026230961760033533


In [47]:
# performance of the gbm model
best_model1.model_performance(test1)

Unnamed: 0,0,1,Error,Rate
0,26.0,0.0,0.0,(0.0/26.0)
1,0.0,9.0,0.0,(0.0/9.0)
Total,26.0,9.0,0.0,(0.0/35.0)

metric,threshold,value,idx
max f1,0.6539482,1.0,8.0
max f2,0.6539482,1.0,8.0
max f0point5,0.6539482,1.0,8.0
max accuracy,0.6539482,1.0,8.0
max precision,0.9867368,1.0,0.0
max recall,0.6539482,1.0,8.0
max specificity,0.9867368,1.0,0.0
max absolute_mcc,0.6539482,1.0,8.0
max min_per_class_accuracy,0.6539482,1.0,8.0
max mean_per_class_accuracy,0.6539482,1.0,8.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0285714,0.9864815,3.8888889,3.8888889,1.0,0.9867368,1.0,0.9867368,0.1111111,0.1111111,288.8888889,288.8888889,0.1111111
2,0.0285714,0.9862261,0.0,3.8888889,0.0,0.0,1.0,0.9867368,0.0,0.1111111,-100.0,288.8888889,0.1111111
3,0.0571429,0.9855987,3.8888889,3.8888889,1.0,0.9859858,1.0,0.9863613,0.1111111,0.2222222,288.8888889,288.8888889,0.2222222
4,0.0571429,0.9790177,0.0,3.8888889,0.0,0.0,1.0,0.9863613,0.0,0.2222222,-100.0,288.8888889,0.2222222
5,0.0571429,0.9724367,0.0,3.8888889,0.0,0.0,1.0,0.9863613,0.0,0.2222222,-100.0,288.8888889,0.2222222
6,0.1142857,0.9552654,3.8888889,3.8888889,1.0,0.9637323,1.0,0.9750468,0.2222222,0.4444444,288.8888889,288.8888889,0.4444444
7,0.1714286,0.9393061,3.8888889,3.8888889,1.0,0.9461152,1.0,0.965403,0.2222222,0.6666667,288.8888889,288.8888889,0.6666667
8,0.2,0.8403691,3.8888889,3.8888889,1.0,0.8851898,1.0,0.9539439,0.1111111,0.7777778,288.8888889,288.8888889,0.7777778
9,0.3142857,0.0726313,1.9444444,3.1818182,0.5,0.4106381,0.8181818,0.7563782,0.2222222,1.0,94.4444444,218.1818182,0.9230769
10,0.4,0.0378897,0.0,2.5,0.0,0.0503352,0.6428571,0.6050833,0.0,1.0,-100.0,150.0,0.8076923


In this classification model, the models used does make sense. The log loss vlaues is very close to 0 and the AUC value is 1 which is the score a perfect classifier should help.

## **REGRESSION**

In [50]:
# loading datset
dfR_path = "/content/Walmart.csv"
dataR_path = pd.read_csv("/content/Walmart.csv")

In [51]:
# load data into h20 
dfR = h2o.H2OFrame(dataR_path)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [52]:
dfR.head()

Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,1643690.0,0,42.31,2.572,211.096,8.106
1,12-02-2010,1641960.0,1,38.51,2.548,211.242,8.106
1,19-02-2010,1611970.0,0,39.93,2.514,211.289,8.106
1,26-02-2010,1409730.0,0,46.63,2.561,211.32,8.106
1,05-03-2010,1554810.0,0,46.5,2.625,211.35,8.106
1,12-03-2010,1439540.0,0,57.79,2.667,211.381,8.106
1,19-03-2010,1472520.0,0,54.58,2.72,211.216,8.106
1,26-03-2010,1404430.0,0,51.45,2.732,211.018,8.106
1,02-04-2010,1594970.0,0,62.27,2.719,210.82,7.808
1,09-04-2010,1545420.0,0,65.86,2.77,210.623,7.808


In [53]:
# description of the dataset
dfR.describe()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
type,int,enum,real,int,real,real,real,real
mins,1.0,,209986.25,0.0,-2.06,2.472,126.064,3.879
mean,23.000000000000068,,1046964.8775617725,0.06993006993006994,60.663782439782516,3.3586068376068403,171.57839384878065,7.999151048951066
maxs,45.0,,3818686.45,1.0,100.14,4.468,227.2328068,14.313
sigma,12.988182381175454,,564366.6220536977,0.25504894436982795,18.444932875811585,0.45901970719285223,39.356712295664195,1.8758847818627944
zeros,0,,0,5985,0,0,0,0
missing,0,0,0,0,0,0,0,0
0,1.0,05-02-2010,1643690.9,0.0,42.31,2.572,211.0963582,8.106
1,1.0,12-02-2010,1641957.44,1.0,38.51,2.548,211.2421698,8.106
2,1.0,19-02-2010,1611968.17,0.0,39.93,2.514,211.2891429,8.106


In [54]:
# dropping irrelevent columns
dfR= dfR.drop(['Date'], axis=1)

In [55]:
# determining column types using get_independent_variables function
def get_independent_variables(dfR, targ):
    C = [name for name in dfR.columns if name != targ]
    ints, reals, enums = [], [], []
    for key, val in dfR.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x=ints+enums+reals
    return x

In [56]:
# Assigning the traget value
target2 ="Weekly_Sales"

In [57]:
# define the independent variable
X2=get_independent_variables(dfR, target2) 
print(X2)

['Store', 'Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']


In [58]:
# define the dependent variable
y2 = target2

In [59]:
# Splitting training and test data
splits2 = dfR.split_frame(ratios = [0.7], seed = 1)
train2 = splits2[0]
test2 = splits2[1]

In [60]:
# Set up AutoML
aml2 = H2OAutoML(max_runtime_secs=60)

In [61]:
# set model start time and train the aml model
model_start_time2 = time.time()
aml2.train(x=X2,y=y2,training_frame=dfR)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,64246.39,1777.9604,67269.72,62570.0,63560.715,63965.832,63865.695
mean_residual_deviance,12197828600.0,1551937020.0,14271575000.0,9939803100.0,11885742100.0,12361325600.0,12530695200.0
mse,12197828600.0,1551937020.0,14271575000.0,9939803100.0,11885742100.0,12361325600.0,12530695200.0
null_deviance,409931140000000.0,16227700300000.0,422051337000000.0,429225744000000.0,400643676000000.0,388790170000000.0,408944741000000.0
r2,0.9616733,0.0047359,0.9565139,0.9693857,0.960898,0.9599385,0.9616305
residual_deviance,15677104600000.0,1844766700000.0,18353244600000.0,13140419900000.0,15665408800000.0,15550547800000.0,15675899800000.0
rmse,110261.22,7096.81,119463.695,99698.56,109021.75,111181.5,111940.586
rmsle,0.0964055,0.0037181,0.0956225,0.099106,0.0947813,0.1009871,0.0915307


In [62]:
# creating a dictionary
data2={}

In [63]:
# calculating model execution time
data2['model_execution_time'] = {"Regression":(time.time() - model_start_time)} 
data2

{'model_execution_time': {'Regression': 1839.6713786125183}}

In [64]:
# printing the leaderBoard for models used
print(aml2.leaderboard)

model_id                                                  rmse          mse      mae        rmsle    mean_residual_deviance
StackedEnsemble_AllModels_2_AutoML_3_20221108_20159     110368  1.21811e+10  64236.9    0.0964756               1.21811e+10
StackedEnsemble_AllModels_1_AutoML_3_20221108_20159     110444  1.21978e+10  64256.8    0.0965056               1.21978e+10
StackedEnsemble_BestOfFamily_3_AutoML_3_20221108_20159  110675  1.2249e+10   64766      0.0988716               1.2249e+10
StackedEnsemble_BestOfFamily_2_AutoML_3_20221108_20159  110723  1.22595e+10  64786.6    0.0988721               1.22595e+10
GBM_4_AutoML_3_20221108_20159                           113239  1.28231e+10  65232.5    0.0952614               1.28231e+10
StackedEnsemble_BestOfFamily_1_AutoML_3_20221108_20159  116856  1.36554e+10  70476      0.111766                1.36554e+10
GBM_3_AutoML_3_20221108_20159                           119044  1.41715e+10  68493      0.096429                1.41715e+10
GBM_2_Aut

In [65]:
# predicting on test data
prediction2 = aml2.leader.predict(test2)

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%


In [66]:
# prediction of regression data
prediction2.head()

predict
1590890.0
1487810.0
1501240.0
1494480.0
1454690.0
1433590.0
1471210.0
1386530.0
1544710.0
1399800.0


In [67]:
# finding and storing the best model
best_model2 = h2o.get_model(aml2.leaderboard[0,'model_id'])

In [68]:
# printing the best model
best_model2.algo

'stackedensemble'

For stackedensemble model there is a high probablity of assumtions to validate as there was no pre processing of data there can be assuption like influencial outliers or dependent variables.

In [69]:
# printing the performance of stackedensemble model
best_model2.model_performance(test2)

In this regression model, the models used does not make sense as the models are overfitting. There has been nearly no data cleaning or munging and the datset was assumed to be good. hence, the MSE values of the models are very high.

## **References**
* Refered Towards Data Science
* docs.h20.ai
* Kaggle
* python.org

The dataset was taken from kaggle and the algorithms were referred directly from the docs.h2o.ai learn official documentation. Visualization was mentioned in the Towards Data Science and Machine Learning with Scikit-Learn Quick Start Guides. The remainder of the code was created on my own. 