#ASSIGNMENT 2 - AUTO ML
 
**Author: Jatin Madan**

**NUID: 002727159**

Implementing AutoML: Automated machine learning (automated ML, or AutoML) automates the laborious, iterative tasks involved in developing a machine learning model. It lets data scientists, analysts, and developers build ML models at scale, efficiently and productively, while maintaining model quality.


H2O is a distributed in-memory machine learning platform with linear scalability that can be used to implement AutoML. The most popular statistical and machine learning algorithms, such as deep learning, generalized linear models, and gradient boosted machines, are supported by H2O.


#PROBLEM STATEMENT: 
We use H2O AutoML to train a model that predicts whether a customer's credit card request will be approved. The dataset contains background information on each customer.

###INSTALLING H2O

In [1]:
! pip install h2o

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting h2o
  Downloading h2o-3.38.0.2.tar.gz (177.4 MB)
     |████████████████████████████████| 177.4 MB 40 kB/s
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.38.0.2-py2.py3-none-any.whl size=177521195 sha256=600ddad463e89d355bda8037b3e33d6fe5002cf9088eab71ca3314e8d00d5a4b
  Stored in directory: /root/.cache/pip/wheels/e4/ef/ab/a9b2e452e18b3dfea0b6114bc57c3b9e8b0e464eb2d03230e1
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.38.0.2


###IMPORTING LIBRARIES

To use H2O in Python, we first initialize a connection between our Python session and a local H2O server.

Connecting to cluster

In [2]:
# Load the H2O library and start up the H2O cluster locally on your machine
import h2o
import pandas as pd
from h2o.automl import H2OAutoML
import re
import matplotlib.pylab as plt
import numpy as np
# Number of threads, nthreads = -1, means use all cores on your machine
# max_mem_size is the maximum memory (in GB) to allocate to H2O
h2o.init(nthreads = -1, max_mem_size = 8)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.16" 2022-07-19; OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu118.04); OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu118.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.7/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp17l8nf3l
  JVM stdout: /tmp/tmp17l8nf3l/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmp17l8nf3l/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,05 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.38.0.2
H2O_cluster_version_age:,11 days
H2O_cluster_name:,H2O_from_python_unknownUser_x3a1e8
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,8 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


**About the Dataset: Credit Card classification**

The bank manager has noticed that more and more customers are applying for the bank's credit card services. The bank would like to predict which applicants are more likely to have their credit card request approved, given background details such as years of employment, type of income, and family status.

The dataset consists of 9,709 customers and nearly 20 features.

###LOADING DATASET

In [3]:
# loading dataset
path = "/content/clean_data.csv"
data_path = pd.read_csv(path)

In [4]:
# creating a copy of the dataset
df_copy=data_path.copy()

In [6]:
# Replacing the binary Target values (1/0) with strings ("Y"/"N") so that H2O treats Target as categorical for binomial classification
df_copy['Target'] = df_copy['Target'].replace(1, "Y")
df_copy['Target'] = df_copy['Target'].replace(0, "N")

In [7]:
data_path.head(5)

Unnamed: 0,ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
0,5008804,1,1,1,1,0,0,0,0,2,15,427500.0,32.868574,12.435574,Working,Higher education,Civil marriage,Rented apartment,Other,1
1,5008806,1,1,1,0,0,0,0,0,2,29,112500.0,58.793815,3.104787,Working,Secondary / secondary special,Married,House / apartment,Security staff,0
2,5008808,0,0,1,0,1,1,0,0,1,4,270000.0,52.321403,8.353354,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,0
3,5008812,0,0,1,0,0,0,1,0,1,20,283500.0,61.504343,0.0,Pensioner,Higher education,Separated,House / apartment,Other,0
4,5008815,1,1,1,1,1,1,0,0,2,5,270000.0,46.193967,2.10545,Working,Higher education,Married,House / apartment,Accountants,0


In [8]:
#Loading the dataset into H2O
data = h2o.H2OFrame(data_path)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [9]:
# description of the dataset
data.describe()

Unnamed: 0,ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
type,int,int,int,int,int,int,int,int,int,int,int,real,real,real,enum,enum,enum,enum,enum,int
mins,5008804.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,27000.0,20.504185575336933,0.0,,,,,,0.0
mean,5076104.679060664,0.3487485837882377,0.36770007209805333,0.671541868369554,0.21742712946750437,0.2876712328767123,0.08754763621382222,0.1746832835513441,0.42280358430322246,2.1826140694201257,27.270058708414904,181228.19456174638,43.784093083598485,5.664730161452851,,,,,,0.132145432073334
maxs,5150479.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,19.0,20.0,60.0,1575000.0,68.86383703977495,43.02073280081042,,,,,,1.0
sigma,40802.696052772466,0.4765987878099639,0.4822039797226454,0.4696765995682179,0.4125167873991112,0.45270034532244624,0.2826504487639636,0.3797155310732071,0.7670189141552809,0.9329181887062356,16.648056893148983,99277.30509737751,11.625767724979015,6.3422412768887755,,,,,,0.3386662517937518
zeros,0,6323,6139,3189,7598,6916,8859,8013,6819,0,57,0,0,1696,,,,,,8426
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,5008804.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,15.0,427500.0,32.86857361889703,12.4355736257418,Working,Higher education,Civil marriage,Rented apartment,Other,1.0
1,5008806.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,29.0,112500.0,58.79381506807121,3.104786545924968,Working,Secondary / secondary special,Married,House / apartment,Security staff,0.0
2,5008808.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,4.0,270000.0,52.32140290355038,8.353354278321937,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,0.0
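
The `zeros` and `mean` rows for `Target` in the summary above already indicate how imbalanced the classes are. A minimal sketch of that arithmetic, using only numbers printed in this notebook (the variable names are illustrative and not part of the original code):

```python
# Class balance of Target, derived from the describe() output above.
n_rows = 9709    # customers in the dataset
n_zeros = 8426   # rows where Target == 0 (the "zeros" row)

n_ones = n_rows - n_zeros        # rows where Target == 1
positive_rate = n_ones / n_rows  # should agree with the "mean" row

print(f"Target == 1: {n_ones} of {n_rows} rows ({positive_rate:.1%})")
```

Only about 13% of rows belong to the positive class, so accuracy alone is a weak yardstick here: always predicting 0 would already be roughly 87% accurate.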


In [10]:
data.types

{'ID': 'int',
 'Gender': 'int',
 'Own_car': 'int',
 'Own_property': 'int',
 'Work_phone': 'int',
 'Phone': 'int',
 'Email': 'int',
 'Unemployed': 'int',
 'Num_children': 'int',
 'Num_family': 'int',
 'Account_length': 'int',
 'Total_income': 'real',
 'Age': 'real',
 'Years_employed': 'real',
 'Income_type': 'enum',
 'Education_type': 'enum',
 'Family_status': 'enum',
 'Housing_type': 'enum',
 'Occupation_type': 'enum',
 'Target': 'int'}

In [11]:
data.head()

ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
5008800.0,1,1,1,1,0,0,0,0,2,15,427500,32.8686,12.4356,Working,Higher education,Civil marriage,Rented apartment,Other,1
5008810.0,1,1,1,0,0,0,0,0,2,29,112500,58.7938,3.10479,Working,Secondary / secondary special,Married,House / apartment,Security staff,0
5008810.0,0,0,1,0,1,1,0,0,1,4,270000,52.3214,8.35335,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,0
5008810.0,0,0,1,0,0,0,1,0,1,20,283500,61.5043,0.0,Pensioner,Higher education,Separated,House / apartment,Other,0
5008820.0,1,1,1,1,1,1,0,0,2,5,270000,46.194,2.10545,Working,Higher education,Married,House / apartment,Accountants,0
5008820.0,1,1,1,0,0,0,0,0,2,17,135000,48.6745,3.26906,Commercial associate,Secondary / secondary special,Married,House / apartment,Laborers,0
5008820.0,0,1,0,0,0,0,0,0,2,25,130500,29.2107,3.01991,Working,Incomplete higher,Married,House / apartment,Accountants,1
5008830.0,0,0,1,0,1,0,0,0,2,31,157500,27.4639,4.02199,Working,Secondary / secondary special,Married,House / apartment,Laborers,1
5008830.0,0,0,1,0,0,0,0,1,2,44,112500,30.0294,4.43541,Working,Secondary / secondary special,Single / not married,House / apartment,Other,0
5008840.0,1,1,1,0,0,0,0,3,5,24,270000,34.7413,3.18419,Working,Secondary / secondary special,Married,House / apartment,Laborers,0


##MULTINOMIAL CLASSIFICATION

In [12]:
# Splitting data into train (~70%), test (~15%), and validation (remaining ~15%) sets
data_train,data_test,data_valid = data.split_frame(ratios=[.7, .15])
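
`split_frame` assigns rows to splits at random, so the resulting sizes only approximate the requested ratios; H2O's documentation describes this as a probabilistic rather than an exact split. The pure-Python sketch below illustrates the idea under that assumption. The helper `probabilistic_split` is hypothetical and is not H2O's implementation.

```python
import random

def probabilistic_split(rows, ratios, seed=None):
    """Assign each row to a split by a uniform random draw.

    `ratios` gives the target fraction of the first len(ratios) splits;
    the final split receives whatever is left over.
    """
    rng = random.Random(seed)
    # Cumulative cut points, e.g. [0.7, 0.85] for ratios [0.7, 0.15].
    cuts = []
    total = 0.0
    for r in ratios:
        total += r
        cuts.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        u = rng.random()
        # Count the cut points at or below u; that count is the split index.
        idx = sum(u >= c for c in cuts)
        splits[idx].append(row)
    return splits

train, second, third = probabilistic_split(range(9709), [0.7, 0.15], seed=10)
print(len(train), len(second), len(third))  # close to 70/15/15, not exact
```

In the run recorded in this notebook the three frames ended up with 6,826, 1,495, and 1,388 rows. `split_frame` also accepts a `seed` argument, which makes the split repeatable across runs.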

In [13]:
data_train.head()

ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
5008800.0,1,1,1,1,0,0,0,0,2,15,427500,32.8686,12.4356,Working,Higher education,Civil marriage,Rented apartment,Other,1
5008810.0,1,1,1,0,0,0,0,0,2,29,112500,58.7938,3.10479,Working,Secondary / secondary special,Married,House / apartment,Security staff,0
5008810.0,0,0,1,0,0,0,1,0,1,20,283500,61.5043,0.0,Pensioner,Higher education,Separated,House / apartment,Other,0
5008820.0,1,1,1,1,1,1,0,0,2,5,270000,46.194,2.10545,Working,Higher education,Married,House / apartment,Accountants,0
5008820.0,1,1,1,0,0,0,0,0,2,17,135000,48.6745,3.26906,Commercial associate,Secondary / secondary special,Married,House / apartment,Laborers,0
5008820.0,0,1,0,0,0,0,0,0,2,25,130500,29.2107,3.01991,Working,Incomplete higher,Married,House / apartment,Accountants,1
5008830.0,0,0,1,0,0,0,0,1,2,44,112500,30.0294,4.43541,Working,Secondary / secondary special,Single / not married,House / apartment,Other,0
5008840.0,1,1,1,0,0,0,0,3,5,24,270000,34.7413,3.18419,Working,Secondary / secondary special,Married,House / apartment,Laborers,0
5008840.0,1,0,1,0,0,0,0,1,3,39,405000,32.4223,5.51962,Commercial associate,Higher education,Married,House / apartment,Managers,0
5008840.0,1,1,1,0,1,0,0,0,2,43,112500,56.1326,12.1837,Commercial associate,Secondary / secondary special,Married,House / apartment,Drivers,0


In [14]:
y = "Income_type"
x = data.columns
x.remove(y)
x

['ID',
 'Gender',
 'Own_car',
 'Own_property',
 'Work_phone',
 'Phone',
 'Email',
 'Unemployed',
 'Num_children',
 'Num_family',
 'Account_length',
 'Total_income',
 'Age',
 'Years_employed',
 'Education_type',
 'Family_status',
 'Housing_type',
 'Occupation_type',
 'Target']
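
The predictor list above still contains `ID`, which looks like a row identifier rather than a customer attribute. Whether it carries real signal is an assumption worth checking. A hedged sketch of building the predictor list without it, using plain Python lists (`id_like` and `predictors` are illustrative names):

```python
# Column names as listed by data.columns in this notebook.
columns = [
    "ID", "Gender", "Own_car", "Own_property", "Work_phone", "Phone",
    "Email", "Unemployed", "Num_children", "Num_family", "Account_length",
    "Total_income", "Age", "Years_employed", "Income_type", "Education_type",
    "Family_status", "Housing_type", "Occupation_type", "Target",
]

response = "Income_type"
id_like = {"ID"}  # assumption: identifier columns that should not be features

# Keep every column that is neither the response nor an identifier.
predictors = [c for c in columns if c != response and c not in id_like]
print(predictors)
```

A list built this way could be passed as `x` to `aml.train` in place of the full column list.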

In [15]:
aml = H2OAutoML(max_models = 10, seed = 10, verbosity="info", nfolds=0)

In [16]:
aml.train(x = x, y = y, training_frame = data_train, validation_frame=data_valid)

AutoML progress: |
01:25:56.984: Project: AutoML_1_20221108_12556
01:25:56.988: Cross-validation disabled by user: no fold column nor nfolds > 1.
01:25:56.989: Setting stopping tolerance adaptively based on the training frame: 0.012103663970544886
01:25:56.989: Build control seed: 10
01:25:56.990: training frame: Frame key: AutoML_1_20221108_12556_training_py_3_sid_b8a4    cols: 20    rows: 6826  chunks: 1    size: 232690  checksum: -2171122911107645420
01:25:56.990: validation frame: Frame key: py_5_sid_b8a4    cols: 20    rows: 1388  chunks: 1    size: 53196  checksum: -7138771450889962336
01:25:56.991: leaderboard frame: Frame key: py_5_sid_b8a4    cols: 20    rows: 1388  chunks: 1    size: 53196  checksum: -7138771450889962336
01:25:56.991: blending frame: NULL
01:25:56.995: response column: Income_type
01:25:56.997: fold column: null
01:25:56.997: weights column: null
01:25:57.53: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 

Unnamed: 0,number_of_trees
,40.0

Commercial associate,Pensioner,State servant,Student,Working,Error,Rate
1333.0,0.0,11.0,0.0,321.0,0.1993994,"332 / 1,665"
10.0,1199.0,1.0,0.0,3.0,0.0115416,"14 / 1,213"
46.0,0.0,278.0,0.0,171.0,0.4383838,217 / 495
0.0,0.0,1.0,0.0,1.0,1.0,2 / 2
32.0,0.0,3.0,0.0,3416.0,0.010142,"35 / 3,451"
1421.0,1199.0,294.0,0.0,3912.0,0.0878992,"600 / 6,826"

k,hit_ratio
1,0.9121008
2,0.9830062
3,0.998535
4,0.999707
5,1.0

Commercial associate,Pensioner,State servant,Student,Working,Error,Rate
91.0,0.0,8.0,0.0,221.0,0.715625,229 / 320
1.0,240.0,0.0,0.0,1.0,0.0082645,2 / 242
19.0,0.0,10.0,0.0,92.0,0.9173554,111 / 121
1.0,0.0,0.0,0.0,0.0,1.0,1 / 1
111.0,0.0,15.0,0.0,578.0,0.1789773,126 / 704
223.0,240.0,33.0,0.0,892.0,0.3378963,"469 / 1,388"

k,hit_ratio
1,0.6621038
2,0.9149857
3,0.9971182
4,0.9992796
5,1.0000001

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc,validation_rmse,validation_logloss,validation_classification_error,validation_auc,validation_pr_auc
,2022-11-08 01:25:57,0.644 sec,0.0,0.8,1.6094379,0.4944331,,,0.8,1.6094379,0.4927954,,
,2022-11-08 01:25:59,2.281 sec,5.0,0.5437053,0.8045961,0.2713156,,,0.5733264,0.888076,0.314121,,
,2022-11-08 01:26:00,3.416 sec,10.0,0.4656017,0.6186207,0.2399648,,,0.5215364,0.7637898,0.323487,,
,2022-11-08 01:26:01,4.498 sec,15.0,0.4235495,0.5266602,0.1955757,,,0.506679,0.7311925,0.3285303,,
,2022-11-08 01:26:02,5.529 sec,20.0,0.3944226,0.4689651,0.1724289,,,0.5023183,0.7262199,0.3242075,,
,2022-11-08 01:26:03,6.758 sec,25.0,0.3722573,0.4280122,0.1472312,,,0.5040081,0.7369025,0.332853,,
,2022-11-08 01:26:04,7.756 sec,30.0,0.3524564,0.392832,0.1237914,,,0.5037744,0.7415528,0.3443804,,
,2022-11-08 01:26:05,8.659 sec,35.0,0.3356668,0.3649875,0.1053326,,,0.5037275,0.7488641,0.3342939,,
,2022-11-08 01:26:06,9.531 sec,40.0,0.320675,0.3407748,0.0878992,,,0.5053786,0.7585814,0.3378963,,

variable,relative_importance,scaled_importance,percentage
Years_employed,4914.8891602,1.0,0.2781161
Unemployed,3747.5771484,0.7624947,0.2120620
ID,1905.5327148,0.3877061,0.1078273
Age,1856.4528809,0.3777202,0.1050501
Account_length,1340.7756348,0.2727988,0.0758697
Total_income,1125.0527344,0.2289070,0.0636627
Occupation_type.Core staff,254.1338654,0.0517069,0.0143805
Num_children,199.0765381,0.0405048,0.0112650
Own_property,198.9526062,0.0404796,0.0112580
Occupation_type.Other,189.5610199,0.0385687,0.0107266
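
The `hit_ratio` tables in the output above report, for each k, the fraction of rows whose true class is among the k classes with the highest predicted probability. A small sketch of that definition on made-up probabilities (the rows and labels below are toy data, not taken from the model):

```python
def hit_ratio(prob_rows, true_labels, k):
    """Fraction of rows whose true label is among the k most probable classes."""
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        # Class names ordered from most to least probable, truncated to k.
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_labels)

# Toy predictions for three rows over three classes.
toy_probs = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.5, "C": 0.3},
    {"A": 0.1, "B": 0.2, "C": 0.7},
]
toy_truth = ["A", "C", "A"]

for k in (1, 2, 3):
    print(k, hit_ratio(toy_probs, toy_truth, k))
```

For k = 1 the hit ratio is simply one minus the overall classification error, which is why the validation table starts at 0.6621 while the validation confusion matrix reports an error of 0.3379.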


In [17]:
lb = aml.leaderboard

In [18]:
lb.head()

model_id,mean_per_class_error,logloss,rmse,mse
XGBoost_1_AutoML_1_20221108_12556,0.564044,0.758581,0.505379,0.255407
GBM_2_AutoML_1_20221108_12556,0.567955,0.727453,0.499127,0.249128
XGBoost_2_AutoML_1_20221108_12556,0.572562,0.752089,0.50543,0.25546
DRF_1_AutoML_1_20221108_12556,0.573321,0.869104,0.498458,0.248461
GBM_1_AutoML_1_20221108_12556,0.574969,0.707133,0.494705,0.244733
GBM_3_AutoML_1_20221108_12556,0.576451,0.729864,0.499304,0.249305
XGBoost_3_AutoML_1_20221108_12556,0.577247,0.721598,0.499392,0.249392
GBM_4_AutoML_1_20221108_12556,0.579514,0.747299,0.50272,0.252727
GLM_1_AutoML_1_20221108_12556,0.581178,0.697024,0.493499,0.243541
XRT_1_AutoML_1_20221108_12556,0.593017,0.898696,0.571117,0.326174
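
The multinomial leaderboard is sorted by `mean_per_class_error`, the unweighted average of the per-class error rates. As a check, the sketch below recomputes it from the validation confusion matrix printed earlier for the leading XGBoost model; the `(misclassified, total)` pairs are copied from that table's `Rate` column.

```python
# (misclassified, total) per true class, from the validation confusion matrix.
rates = {
    "Commercial associate": (229, 320),
    "Pensioner": (2, 242),
    "State servant": (111, 121),
    "Student": (1, 1),
    "Working": (126, 704),
}

per_class_error = {cls: wrong / total for cls, (wrong, total) in rates.items()}
mean_per_class_error = sum(per_class_error.values()) / len(per_class_error)

# Overall error weights every row equally instead of every class.
overall_error = sum(w for w, _ in rates.values()) / sum(t for _, t in rates.values())

print(round(mean_per_class_error, 6), round(overall_error, 6))
```

The result agrees with the leaderboard value of 0.564044. It is much higher than the overall error of about 0.338 because rare classes count as much as common ones: the single misclassified `Student` row alone contributes a per-class error of 1.0.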


In [48]:
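# Retrieving the GLM (row 8 of the leaderboard), which has the lowest logloss and RMSE;
# note that the leaderboard's own leader by mean_per_class_error is XGBoost_1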
best_model = h2o.get_model(aml.leaderboard[8,'model_id'])

In [49]:
# printing the algorithm of the selected model
best_model.algo

'glm'
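
Row 8 of the leaderboard shown above is the GLM, whereas row 0 (the AutoML leader) is an XGBoost model; the GLM happens to have the lowest `logloss`. A sketch of picking a model by an explicit metric instead of a hard-coded row index, using the leaderboard values copied from the output above as plain tuples (the tuple layout is an illustrative choice):

```python
# (model_id, mean_per_class_error, logloss) copied from the leaderboard above.
leaderboard = [
    ("XGBoost_1_AutoML_1_20221108_12556", 0.564044, 0.758581),
    ("GBM_2_AutoML_1_20221108_12556", 0.567955, 0.727453),
    ("XGBoost_2_AutoML_1_20221108_12556", 0.572562, 0.752089),
    ("DRF_1_AutoML_1_20221108_12556", 0.573321, 0.869104),
    ("GBM_1_AutoML_1_20221108_12556", 0.574969, 0.707133),
    ("GBM_3_AutoML_1_20221108_12556", 0.576451, 0.729864),
    ("XGBoost_3_AutoML_1_20221108_12556", 0.577247, 0.721598),
    ("GBM_4_AutoML_1_20221108_12556", 0.579514, 0.747299),
    ("GLM_1_AutoML_1_20221108_12556", 0.581178, 0.697024),
    ("XRT_1_AutoML_1_20221108_12556", 0.593017, 0.898696),
]

# Lower is better for both metrics.
best_by_mpce = min(leaderboard, key=lambda row: row[1])[0]
best_by_logloss = min(leaderboard, key=lambda row: row[2])[0]

print(best_by_mpce)
print(best_by_logloss)
```

Selecting by metric makes the intent explicit and keeps working if the leaderboard order changes between runs.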

In [19]:
data_pred=aml.leader.predict(data_test)

xgboost prediction progress: |███████████████████████████████████████████████████| (done) 100%


In [20]:
data_pred.head()

predict,Commercial associate,Pensioner,State servant,Student,Working
Working,0.133264,0.00168862,0.335141,0.000758007,0.529149
Working,0.185471,0.00135208,0.0173504,0.00112729,0.794699
Working,0.152218,0.00186893,0.0172813,0.000743468,0.827888
Working,0.221623,0.00526579,0.304969,0.00132303,0.466819
Working,0.115255,0.00328182,0.0359004,0.00129779,0.844265
State servant,0.0747061,0.00114836,0.57907,0.000859592,0.344216
Commercial associate,0.463777,0.00557823,0.308203,0.00131319,0.221129
Working,0.423059,0.0192118,0.0465733,0.000697778,0.510458
Working,0.15333,0.00173453,0.101951,0.000736172,0.742248
Working,0.113703,0.000906886,0.0666955,0.000675844,0.818019


In [21]:
aml.leader.model_performance(data_test)

Commercial associate,Pensioner,State servant,Student,Working,Error,Rate
76.0,0.0,8.0,0.0,243.0,0.7675841,251 / 327
0.0,257.0,0.0,0.0,0.0,0.0,0 / 257
19.0,0.0,8.0,0.0,79.0,0.9245283,98 / 106
0.0,0.0,0.0,0.0,0.0,,0 / 0
140.0,0.0,17.0,0.0,648.0,0.1950311,157 / 805
235.0,257.0,33.0,0.0,970.0,0.3384615,"506 / 1,495"

k,hit_ratio
1,0.6615385
2,0.9344482
3,0.9993311
4,1.0
5,1.0


##BINARY CLASSIFICATION

In [22]:
data1 = h2o.H2OFrame(df_copy)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [23]:
data1.describe()

Unnamed: 0,ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
type,int,int,int,int,int,int,int,int,int,int,int,real,real,real,enum,enum,enum,enum,enum,enum
mins,5008804.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,27000.0,20.504185575336933,0.0,,,,,,
mean,5076104.679060664,0.3487485837882377,0.36770007209805333,0.671541868369554,0.21742712946750437,0.2876712328767123,0.08754763621382222,0.1746832835513441,0.42280358430322246,2.1826140694201257,27.270058708414904,181228.19456174638,43.784093083598485,5.664730161452851,,,,,,
maxs,5150479.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,19.0,20.0,60.0,1575000.0,68.86383703977495,43.02073280081042,,,,,,
sigma,40802.696052772466,0.4765987878099639,0.4822039797226454,0.4696765995682179,0.4125167873991112,0.45270034532244624,0.2826504487639636,0.3797155310732071,0.7670189141552809,0.9329181887062356,16.648056893148983,99277.30509737751,11.625767724979015,6.3422412768887755,,,,,,
zeros,0,6323,6139,3189,7598,6916,8859,8013,6819,0,57,0,0,1696,,,,,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,5008804.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,15.0,427500.0,32.86857361889703,12.4355736257418,Working,Higher education,Civil marriage,Rented apartment,Other,Y
1,5008806.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,29.0,112500.0,58.79381506807121,3.104786545924968,Working,Secondary / secondary special,Married,House / apartment,Security staff,N
2,5008808.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,4.0,270000.0,52.32140290355038,8.353354278321937,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,N


In [24]:
# Splitting data into train (~70%), test (~15%), and validation (remaining ~15%) sets
data_train1,data_test1,data_valid1 = data1.split_frame(ratios=[.7, .15])

In [25]:
data_train1.head()

ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
5008800.0,1,1,1,1,0,0,0,0,2,15,427500,32.8686,12.4356,Working,Higher education,Civil marriage,Rented apartment,Other,Y
5008810.0,0,0,1,0,1,1,0,0,1,4,270000,52.3214,8.35335,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,N
5008810.0,0,0,1,0,0,0,1,0,1,20,283500,61.5043,0.0,Pensioner,Higher education,Separated,House / apartment,Other,N
5008820.0,1,1,1,1,1,1,0,0,2,5,270000,46.194,2.10545,Working,Higher education,Married,House / apartment,Accountants,N
5008820.0,0,1,0,0,0,0,0,0,2,25,130500,29.2107,3.01991,Working,Incomplete higher,Married,House / apartment,Accountants,Y
5008830.0,0,0,1,0,0,0,0,1,2,44,112500,30.0294,4.43541,Working,Secondary / secondary special,Single / not married,House / apartment,Other,N
5008840.0,1,1,1,0,0,0,0,3,5,24,270000,34.7413,3.18419,Working,Secondary / secondary special,Married,House / apartment,Laborers,N
5008840.0,1,0,1,0,0,0,0,1,3,39,405000,32.4223,5.51962,Commercial associate,Higher education,Married,House / apartment,Managers,N
5008840.0,1,1,1,0,1,0,0,0,2,43,112500,56.1326,12.1837,Commercial associate,Secondary / secondary special,Married,House / apartment,Drivers,N
5008870.0,0,0,1,0,0,0,0,1,3,24,211500,44.3869,19.4364,State servant,Secondary / secondary special,Civil marriage,House / apartment,Core staff,N


In [26]:
y1 = "Target"
x1 = data1.columns
x1.remove(y1)
x1

['ID',
 'Gender',
 'Own_car',
 'Own_property',
 'Work_phone',
 'Phone',
 'Email',
 'Unemployed',
 'Num_children',
 'Num_family',
 'Account_length',
 'Total_income',
 'Age',
 'Years_employed',
 'Income_type',
 'Education_type',
 'Family_status',
 'Housing_type',
 'Occupation_type']

In [27]:
aml1 = H2OAutoML(max_models = 10, seed = 10, verbosity="info", nfolds=0)

In [28]:
aml1.train(x = x1, y = y1, training_frame = data_train1, validation_frame=data_valid1)

AutoML progress: |
01:27:18.760: Project: AutoML_2_20221108_12718
01:27:18.760: Cross-validation disabled by user: no fold column nor nfolds > 1.
01:27:18.761: Setting stopping tolerance adaptively based on the training frame: 0.012133920952907253
01:27:18.761: Build control seed: 10
01:27:18.761: training frame: Frame key: AutoML_2_20221108_12718_training_py_12_sid_b8a4    cols: 20    rows: 6792  chunks: 1    size: 232145  checksum: 6118467675611229324
01:27:18.761: validation frame: Frame key: py_14_sid_b8a4    cols: 20    rows: 1468  chunks: 1    size: 56362  checksum: -3625328837734852563
01:27:18.761: leaderboard frame: Frame key: py_14_sid_b8a4    cols: 20    rows: 1468  chunks: 1    size: 56362  checksum: -3625328837734852563
01:27:18.761: blending frame: NULL
01:27:18.761: response column: Target
01:27:18.761: fold column: null
01:27:18.761: weights column: null
01:27:18.762: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,30.0,30.0,17157.0,9.0,15.0,12.066667,36.0,43.0,40.666668

Unnamed: 0,N,Y,Error,Rate
N,5346.0,541.0,0.0919,(541.0/5887.0)
Y,422.0,483.0,0.4663,(422.0/905.0)
Total,5768.0,1024.0,0.1418,(963.0/6792.0)

metric,threshold,value,idx
max f1,0.1950426,0.5007776,130.0
max f2,0.1480297,0.6103014,200.0
max f0point5,0.219492,0.5186053,101.0
max accuracy,0.256788,0.8819199,66.0
max precision,0.4377536,1.0,0.0
max recall,0.0599162,1.0,356.0
max specificity,0.4377536,1.0,0.0
max absolute_mcc,0.2045299,0.4202624,119.0
max min_per_class_accuracy,0.1558354,0.758281,187.0
max mean_per_class_accuracy,0.1480297,0.7644101,200.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100118,0.3186054,6.4013,6.4013,0.8529412,0.3490662,0.8529412,0.3490662,0.0640884,0.0640884,540.1299968,540.1299968,0.0623897
2,0.0200236,0.2912135,5.4079948,5.9046474,0.7205882,0.3037192,0.7867647,0.3263927,0.0541436,0.118232,440.79948,490.4647384,0.1133059
3,0.0300353,0.2718224,4.5250569,5.4447839,0.6029412,0.2812207,0.7254902,0.3113354,0.0453039,0.1635359,352.5056874,444.478388,0.1540234
4,0.0400471,0.2568676,4.1939552,5.1320767,0.5588235,0.2640301,0.6838235,0.2995091,0.041989,0.2055249,319.3955151,413.2076698,0.1909164
5,0.0500589,0.2473745,3.5317517,4.8120117,0.4705882,0.2519389,0.6411765,0.289995,0.0353591,0.240884,253.1751706,381.20117,0.2201603
6,0.1001178,0.2153036,3.399311,4.1056614,0.4529412,0.229924,0.5470588,0.2599595,0.1701657,0.4110497,239.9311017,310.5661358,0.3587311
7,0.1500294,0.194866,2.3909647,3.5352176,0.3185841,0.2046281,0.47105,0.2415519,0.119337,0.5303867,139.0964651,253.5217606,0.4388291
8,0.2000883,0.1808577,1.721729,3.0815118,0.2294118,0.1874568,0.410596,0.2280182,0.0861878,0.6165746,72.1728957,208.1511836,0.4805121
9,0.3000589,0.1574968,1.3374104,2.50043,0.1782032,0.1686352,0.3331698,0.2082336,0.1337017,0.7502762,33.7410394,150.0429952,0.5194286
10,0.4000294,0.13989,0.9837151,2.1213908,0.1310751,0.1486048,0.2826647,0.1933319,0.0983425,0.8486188,-1.6284917,112.1390793,0.5175503

Unnamed: 0,N,Y,Error,Rate
N,680.0,615.0,0.4749,(615.0/1295.0)
Y,63.0,110.0,0.3642,(63.0/173.0)
Total,743.0,725.0,0.4619,(678.0/1468.0)

metric,threshold,value,idx
max f1,0.1270673,0.2449889,225.0
max f2,0.0494892,0.4136777,369.0
max f0point5,0.1671187,0.1835406,147.0
max accuracy,0.3441927,0.8821526,1.0
max precision,0.3441927,0.5,1.0
max recall,0.0494892,1.0,369.0
max specificity,0.3533961,0.9992278,0.0
max absolute_mcc,0.1270673,0.1037873,225.0
max min_per_class_accuracy,0.133544,0.5675676,212.0
max mean_per_class_accuracy,0.1270673,0.5804673,225.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.010218,0.2913353,1.6971098,1.6971098,0.2,0.3153921,0.2,0.3153921,0.017341,0.017341,69.7109827,69.7109827,0.0080746
2,0.020436,0.269417,0.0,0.8485549,0.0,0.2795462,0.1,0.2974692,0.0,0.017341,-100.0,-15.1445087,-0.0035084
3,0.030654,0.2580534,1.6971098,1.1314066,0.2,0.2645911,0.1333333,0.2865098,0.017341,0.0346821,69.7109827,13.1406551,0.0045663
4,0.0401907,0.2448895,1.818332,1.2944058,0.2142857,0.2509818,0.1525424,0.2780794,0.017341,0.0520231,81.8331957,29.44058,0.0134131
5,0.0504087,0.234781,0.5657033,1.1466958,0.0666667,0.2400164,0.1351351,0.270364,0.0057803,0.0578035,-43.4296724,14.6695829,0.0083826
6,0.1001362,0.212226,0.8136828,0.981322,0.0958904,0.2232381,0.1156463,0.2469613,0.0404624,0.0982659,-18.6317206,-1.8677991,-0.0021202
7,0.150545,0.1949277,1.376035,1.1134883,0.1621622,0.2026142,0.1312217,0.2321121,0.0693642,0.1676301,37.6034995,11.3488348,0.0193675
8,0.2002725,0.1823191,1.8598464,1.2988085,0.2191781,0.1883048,0.1530612,0.2212348,0.0924855,0.2601156,85.9846385,29.8808541,0.0678376
9,0.3004087,0.1583171,1.4431206,1.3469126,0.170068,0.1697814,0.1587302,0.2040836,0.1445087,0.4046243,44.3120601,34.6912561,0.1181378
10,0.3998638,0.1399649,1.0461636,1.2721096,0.1232877,0.1484813,0.1499148,0.1902541,0.1040462,0.5086705,4.6163592,27.210958,0.1233423

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2022-11-08 01:27:21,0.050 sec,0.0,0.3398393,0.3925091,0.5,0.133245,1.0,0.866755,0.3227948,0.3636761,0.5,0.1178474,1.0,0.8821526
,2022-11-08 01:27:22,0.291 sec,5.0,0.334608,0.3776303,0.7291655,0.2917956,4.0835879,0.2383687,0.3222988,0.3619607,0.5530676,0.1347575,2.2628131,0.5953678
,2022-11-08 01:27:22,0.575 sec,10.0,0.3300176,0.3652731,0.7656314,0.3438999,4.4146896,0.2333628,0.3225476,0.3617071,0.5618832,0.1311232,1.1314066,0.4366485
,2022-11-08 01:27:22,0.818 sec,15.0,0.3263803,0.3561306,0.7953807,0.3851589,5.0768931,0.1875736,0.3230231,0.3622973,0.5651662,0.1311314,1.1314066,0.4059946
,2022-11-08 01:27:22,1.036 sec,20.0,0.3226948,0.3473912,0.814735,0.4309005,5.8494638,0.1756478,0.322852,0.3615469,0.5740286,0.1385606,2.2628131,0.3521798
,2022-11-08 01:27:23,1.258 sec,25.0,0.319556,0.3400058,0.8291562,0.4681071,6.1805655,0.1559187,0.3226912,0.3612517,0.578394,0.144681,1.6971098,0.4584469
,2022-11-08 01:27:23,1.463 sec,30.0,0.3161127,0.3324405,0.8436349,0.5028491,6.4013,0.1417845,0.3234164,0.3628433,0.5756511,0.1417087,1.6971098,0.4618529

variable,relative_importance,scaled_importance,percentage
Occupation_type,114.102478,1.0,0.2201551
Account_length,94.6366119,0.8294001,0.1825967
Age,66.0430527,0.5788047,0.1274268
ID,52.2448158,0.4578763,0.1008038
Years_employed,46.5121422,0.4076348,0.0897429
Total_income,39.2143555,0.3436766,0.0756622
Income_type,20.5165558,0.1798082,0.0395857
Family_status,14.7537889,0.129303,0.0284667
Num_family,13.6710644,0.1198139,0.0263776
Own_property,10.5660257,0.0926012,0.0203866
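
The `max f1` entries in the threshold tables above can be reproduced from the confusion matrices printed with them, since H2O reports each binomial confusion matrix at its max-F1 threshold. A short sketch using the counts above, treating `Y` as the positive class (`f1_from_counts` is an illustrative helper):

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Training confusion matrix: rows are actual, columns are predicted.
#            pred N  pred Y
# actual N    5346     541
# actual Y     422     483
train_f1 = f1_from_counts(tp=483, fp=541, fn=422)

# Validation confusion matrix, same layout: 680 / 615 / 63 / 110.
valid_f1 = f1_from_counts(tp=110, fp=615, fn=63)

print(round(train_f1, 7), round(valid_f1, 7))
```

Both values agree with the tables (0.5007776 and 0.2449889). The drop from about 0.50 on training data to about 0.24 on validation data suggests the model generalizes poorly.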


In [29]:
lb1 = aml1.leaderboard

In [30]:
lb1.head()

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
GBM_1_AutoML_2_20221108_12718,0.575651,0.362843,0.141709,0.419533,0.323416,0.104598
XGBoost_2_AutoML_2_20221108_12718,0.573192,0.384739,0.136138,0.435552,0.333789,0.111415
GLM_1_AutoML_2_20221108_12718,0.566322,0.357181,0.172085,0.431988,0.319706,0.102212
XGBoost_1_AutoML_2_20221108_12718,0.561426,0.38214,0.134232,0.441308,0.331666,0.110002
GBM_4_AutoML_2_20221108_12718,0.558676,0.372962,0.13587,0.444033,0.327673,0.107369
XGBoost_3_AutoML_2_20221108_12718,0.55823,0.371555,0.140994,0.446339,0.326549,0.106634
XRT_1_AutoML_2_20221108_12718,0.556107,0.370918,0.135206,0.445138,0.327106,0.106998
GBM_2_AutoML_2_20221108_12718,0.551668,0.368164,0.132769,0.43928,0.325581,0.106003
DRF_1_AutoML_2_20221108_12718,0.550075,0.377588,0.144262,0.469784,0.329265,0.108415
GBM_3_AutoML_2_20221108_12718,0.543288,0.371096,0.126957,0.452583,0.326745,0.106762


In [53]:
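# Retrieving the GLM (row 2 of the leaderboard), which has the lowest logloss and RMSE and the highest AUCPR;
# note that the leaderboard's own leader by AUC is GBM_1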
best_model1 = h2o.get_model(aml1.leaderboard[2,'model_id'])

In [54]:
# printing the algorithm of the selected model
best_model1.algo

'glm'

In [31]:
data_pred1=aml1.leader.predict(data_test1)

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [32]:
data_pred1.head()

predict,N,Y
Y,0.768759,0.231241
Y,0.855301,0.144699
Y,0.785175,0.214825
Y,0.861151,0.138849
Y,0.866732,0.133268
N,0.92567,0.07433
Y,0.736602,0.263398
N,0.947114,0.052886
N,0.900045,0.0999554
Y,0.859703,0.140297
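
Several rows above are labelled `Y` even though the probability of `Y` is well below 0.5. This is consistent with H2O labelling binary predictions with a max-F1 threshold rather than 0.5; the validation metrics printed earlier list that threshold as 0.1270673. The sketch below applies it to the `Y` column shown above, assuming that threshold is the one in use:

```python
# p(Y) for the ten rows shown in data_pred1.head().
p_yes = [0.231241, 0.144699, 0.214825, 0.138849, 0.133268,
         0.07433, 0.263398, 0.052886, 0.0999554, 0.140297]

# Assumption: the max-F1 threshold from the validation metrics is applied.
threshold = 0.1270673

labels = ["Y" if p >= threshold else "N" for p in p_yes]
print(labels)
```

The labels produced this way match the `predict` column row for row, whereas the training max-F1 threshold (0.1950426) would not, since rows such as 0.144699 are labelled `Y`.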


In [34]:
aml1.leader.model_performance(data_test1)

Unnamed: 0,N,Y,Error,Rate
N,481.0,763.0,0.6133,(763.0/1244.0)
Y,55.0,150.0,0.2683,(55.0/205.0)
Total,536.0,913.0,0.5645,(818.0/1449.0)

metric,threshold,value,idx
max f1,0.1118607,0.2683363,258.0
max f2,0.0502217,0.4617474,370.0
max f0point5,0.1572243,0.1980701,166.0
max accuracy,0.3413666,0.857833,0.0
max precision,0.1794826,0.1812298,130.0
max recall,0.0383163,1.0,383.0
max specificity,0.3413666,0.9991961,0.0
max absolute_mcc,0.1118607,0.0854449,258.0
max min_per_class_accuracy,0.1331986,0.5418006,214.0
max mean_per_class_accuracy,0.1118607,0.5591816,258.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.010352,0.2870337,0.0,0.0,0.0,0.3091618,0.0,0.3091618,0.0,0.0,-100.0,-100.0,-0.0120579
2,0.0200138,0.2695619,1.5146341,0.7312027,0.2142857,0.2766895,0.1034483,0.2934855,0.0146341,0.0146341,51.4634146,-26.8797309,-0.0062662
3,0.0303658,0.2545686,0.4712195,0.6425721,0.0666667,0.2625466,0.0909091,0.2829381,0.004878,0.0195122,-52.8780488,-35.7427938,-0.0126421
4,0.0400276,0.2454214,0.504878,0.6093356,0.0714286,0.250347,0.0862069,0.2750713,0.004878,0.0243902,-49.5121951,-39.0664424,-0.0182143
5,0.0503796,0.23698,1.4136585,0.7746074,0.2,0.240655,0.109589,0.2679995,0.0146341,0.0390244,41.3658537,-22.5392583,-0.0132264
6,0.100069,0.2122037,1.3743902,1.0724306,0.1944444,0.2250215,0.1517241,0.2466587,0.0682927,0.1073171,37.4390244,7.2430614,0.0084425
7,0.1504486,0.1955723,1.355563,1.167241,0.1917808,0.2041624,0.1651376,0.2324283,0.0682927,0.1756098,35.556298,16.7240994,0.0293075
8,0.200138,0.1822082,1.472561,1.2430446,0.2083333,0.1882099,0.1758621,0.2214499,0.0731707,0.2487805,47.2560976,24.3044575,0.0566583
9,0.300207,0.1574654,1.2186712,1.2349201,0.1724138,0.1697692,0.1747126,0.204223,0.1219512,0.3707317,21.8671152,23.4920101,0.0821465
10,0.4002761,0.1409055,0.9261901,1.1577376,0.1310345,0.1488454,0.1637931,0.1903786,0.0926829,0.4634146,-7.3809924,15.7737595,0.0735433
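
In the gains/lift table, `lift` for a group is that group's response rate divided by the overall response rate, and `capture_rate` is the share of all positives that fall in the group. The sketch below reproduces the second row of the test-set table; the group size of 14 and the 3 positives are inferred from the `cumulative_data_fraction` and `response_rate` columns rather than printed directly.

```python
total_rows = 1449       # test rows (confusion matrix total)
total_positives = 205   # actual Y rows in the test set

group_rows = 14         # inferred: (0.0200138 - 0.010352) * 1449 is about 14
group_positives = 3     # inferred: 0.2142857 * 14 = 3

response_rate = group_positives / group_rows
baseline_rate = total_positives / total_rows

lift = response_rate / baseline_rate
capture_rate = group_positives / total_positives

print(round(lift, 7), round(capture_rate, 7))
```

These agree with the table's 1.5146341 and 0.0146341. A lift near or below 1 in the top groups, such as the 0.0 in the first row, means the highest-scored test rows contain no more positives than a random sample would.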


##REGRESSION 

In [36]:
data2 = h2o.H2OFrame(df_copy)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [37]:
data2.describe()

Unnamed: 0,ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
type,int,int,int,int,int,int,int,int,int,int,int,real,real,real,enum,enum,enum,enum,enum,enum
mins,5008804.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,27000.0,20.504185575336933,0.0,,,,,,
mean,5076104.679060664,0.3487485837882377,0.36770007209805333,0.671541868369554,0.21742712946750437,0.2876712328767123,0.08754763621382222,0.1746832835513441,0.42280358430322246,2.1826140694201257,27.270058708414904,181228.19456174638,43.784093083598485,5.664730161452851,,,,,,
maxs,5150479.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,19.0,20.0,60.0,1575000.0,68.86383703977495,43.02073280081042,,,,,,
sigma,40802.696052772466,0.4765987878099639,0.4822039797226454,0.4696765995682179,0.4125167873991112,0.45270034532244624,0.2826504487639636,0.3797155310732071,0.7670189141552809,0.9329181887062356,16.648056893148983,99277.30509737751,11.625767724979015,6.3422412768887755,,,,,,
zeros,0,6323,6139,3189,7598,6916,8859,8013,6819,0,57,0,0,1696,,,,,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,5008804.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,15.0,427500.0,32.86857361889703,12.4355736257418,Working,Higher education,Civil marriage,Rented apartment,Other,Y
1,5008806.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,29.0,112500.0,58.79381506807121,3.104786545924968,Working,Secondary / secondary special,Married,House / apartment,Security staff,N
2,5008808.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,4.0,270000.0,52.32140290355038,8.353354278321937,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,N


Splitting the data into train, test, and validation sets:

In [38]:
data_train2,data_test2,data_valid2 = data2.split_frame(ratios=[.7, .15])
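
To my understanding of the H2O documentation, `split_frame` assigns rows probabilistically, so the resulting frames are only approximately 70%, 15%, and 15% of the data, and the call also accepts a `seed` argument for reproducibility. The sketch below imitates that behaviour in plain Python. The function name `probabilistic_split` and the row count of 10000 are hypothetical, chosen only for illustration.

```python
import random

def probabilistic_split(n_rows, ratios, seed=None):
    """Assign each row index to a part by drawing one uniform number per row."""
    rng = random.Random(seed)
    bounds, total = [], 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)            # cumulative cut points, e.g. 0.70, 0.85
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                parts[i].append(row)
                break
        else:
            parts[-1].append(row)       # the remainder goes to the last part
    return parts

# Hypothetical frame of 10000 rows; parts come back in the order of the ratios.
first, second, third = probabilistic_split(10000, [0.7, 0.15], seed=10)
print(len(first), len(second), len(third))
```

Because the parts are returned in the order of the ratios, the second name on the left-hand side of the assignment receives the 15% slice and the third name receives the remainder.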

In [39]:
data_train2.head()

ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
5008800.0,1,1,1,1,0,0,0,0,2,15,427500,32.8686,12.4356,Working,Higher education,Civil marriage,Rented apartment,Other,Y
5008810.0,1,1,1,0,0,0,0,0,2,29,112500,58.7938,3.10479,Working,Secondary / secondary special,Married,House / apartment,Security staff,N
5008810.0,0,0,1,0,1,1,0,0,1,4,270000,52.3214,8.35335,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,N
5008810.0,0,0,1,0,0,0,1,0,1,20,283500,61.5043,0.0,Pensioner,Higher education,Separated,House / apartment,Other,N
5008820.0,1,1,1,1,1,1,0,0,2,5,270000,46.194,2.10545,Working,Higher education,Married,House / apartment,Accountants,N
5008820.0,1,1,1,0,0,0,0,0,2,17,135000,48.6745,3.26906,Commercial associate,Secondary / secondary special,Married,House / apartment,Laborers,N
5008830.0,0,0,1,0,1,0,0,0,2,31,157500,27.4639,4.02199,Working,Secondary / secondary special,Married,House / apartment,Laborers,Y
5008840.0,1,0,1,0,0,0,0,1,3,39,405000,32.4223,5.51962,Commercial associate,Higher education,Married,House / apartment,Managers,N
5008840.0,1,1,1,0,1,0,0,0,2,43,112500,56.1326,12.1837,Commercial associate,Secondary / secondary special,Married,House / apartment,Drivers,N
5008850.0,0,1,1,0,0,0,0,2,4,39,135000,43.1522,8.68738,Working,Secondary / secondary special,Married,House / apartment,Laborers,N


In [40]:
y2 = "Total_income"
# Take the column names from the regression frame (data2), not the earlier frame
x2 = data2.columns
x2.remove(y2)
x2

['ID',
 'Gender',
 'Own_car',
 'Own_property',
 'Work_phone',
 'Phone',
 'Email',
 'Unemployed',
 'Num_children',
 'Num_family',
 'Account_length',
 'Age',
 'Years_employed',
 'Income_type',
 'Education_type',
 'Family_status',
 'Housing_type',
 'Occupation_type',
 'Target']
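
The predictor list above still contains `ID`, which is a row identifier rather than a customer attribute. As a possible variation (not what this notebook does), the sketch below builds the predictor list without the response and without the identifier. The name `excluded` is introduced here for illustration only.

```python
# Column names as shown in the data2.describe() output.
columns = [
    "ID", "Gender", "Own_car", "Own_property", "Work_phone", "Phone", "Email",
    "Unemployed", "Num_children", "Num_family", "Account_length", "Total_income",
    "Age", "Years_employed", "Income_type", "Education_type", "Family_status",
    "Housing_type", "Occupation_type", "Target",
]

response = "Total_income"
excluded = {response, "ID"}   # assumption: drop the identifier as well

# Build a new list instead of mutating the original column list.
predictors = [name for name in columns if name not in excluded]
print(len(predictors), predictors[:3])
```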

In [41]:
aml2 = H2OAutoML(max_models = 10, seed = 10, verbosity="info", nfolds=0)

In [42]:
aml2.train(x = x2, y = y2, training_frame = data_train2, validation_frame=data_valid2)

AutoML progress: |
01:53:26.834: Project: AutoML_3_20221108_15326
01:53:26.834: Cross-validation disabled by user: no fold column nor nfolds > 1.
01:53:26.835: Setting stopping tolerance adaptively based on the training frame: 0.012113428212804355
01:53:26.835: Build control seed: 10
01:53:26.835: training frame: Frame key: AutoML_3_20221108_15326_training_py_21_sid_b8a4    cols: 20    rows: 6815  chunks: 1    size: 232932  checksum: -1641328085119881042
01:53:26.835: validation frame: Frame key: py_23_sid_b8a4    cols: 20    rows: 1424  chunks: 1    size: 54712  checksum: 6029893344071383551
01:53:26.835: leaderboard frame: Frame key: py_23_sid_b8a4    cols: 20    rows: 1424  chunks: 1    size: 54712  checksum: 6029893344071383551
01:53:26.835: blending frame: NULL
01:53:26.835: response column: Total_income
01:53:26.835: fold column: null
01:53:26.835: weights column: null
01:53:26.836: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,35.0,35.0,20105.0,9.0,15.0,11.914286,36.0,43.0,40.8

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance,validation_rmse,validation_mae,validation_deviance
,2022-11-08 01:53:29,0.018 sec,0.0,98471.0800812,68962.5998195,9696553612.362867,102569.2340443,70147.2894918,10520447772.435318
,2022-11-08 01:53:29,0.205 sec,5.0,91465.357833,63671.2788472,8365911683.526852,97978.8231539,66142.7177719,9599849786.61524
,2022-11-08 01:53:29,0.418 sec,10.0,87796.1303729,60966.4870025,7708160508.4555235,95971.7324184,64631.601188,9210573423.389421
,2022-11-08 01:53:29,0.621 sec,15.0,85756.007952,59550.7453584,7354092899.861711,95157.3937543,64071.3029025,9054929586.102179
,2022-11-08 01:53:30,0.800 sec,20.0,84362.8817801,58549.3483171,7117095822.246449,94873.4077383,63818.6415003,9000963495.878595
,2022-11-08 01:53:30,0.960 sec,25.0,83355.1530064,57792.9292175,6948081532.719165,94827.0478895,63773.5829998,8992169011.433157
,2022-11-08 01:53:30,1.126 sec,30.0,82508.82543,57168.0936422,6807706273.836993,94986.2229046,63906.9518507,9022382541.678782
,2022-11-08 01:53:30,1.350 sec,35.0,81686.3500335,56644.3292679,6672659781.799553,95114.0571298,64066.4789448,9046683863.699884
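
In the scoring history above, the training RMSE keeps falling while the validation RMSE bottoms out and then rises again, which suggests mild overfitting in the last trees. The sketch below is a minimal way to find the best number of trees from those values. The list `history` simply copies the `number_of_trees` and `validation_rmse` columns.

```python
# (number_of_trees, validation_rmse) pairs copied from the scoring history.
history = [
    (0, 102569.2340443),
    (5, 97978.8231539),
    (10, 95971.7324184),
    (15, 95157.3937543),
    (20, 94873.4077383),
    (25, 94827.0478895),
    (30, 94986.2229046),
    (35, 95114.0571298),
]

# The row with the smallest validation RMSE marks the best stopping point.
best_trees, best_rmse = min(history, key=lambda row: row[1])
print(best_trees, best_rmse)
```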

variable,relative_importance,scaled_importance,percentage
Occupation_type,34923901616128.0,1.0,0.3430462
Education_type,14418106646528.0,0.4128435,0.1416244
Age,10370544566272.0,0.2969469,0.1018665
Own_car,8475150647296.0,0.2426748,0.0832487
Gender,8032449724416.0,0.2299986,0.0789002
Income_type,5293059080192.0,0.1515598,0.051992
Years_employed,4963394650112.0,0.1421203,0.0487538
ID,3757570260992.0,0.1075931,0.0369094
Account_length,3669127593984.0,0.1050606,0.0360407
Work_phone,1659619639296.0,0.047521,0.0163019
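
The `scaled_importance` column is the relative importance divided by the largest relative importance, so the top variable always scores 1.0. The sketch below checks this for the first four rows of the table. The `percentage` column divides by the sum over all variables, which cannot be recomputed here because only the top rows are shown.

```python
# Relative importances copied from the first four rows of the table.
relative = {
    "Occupation_type": 34923901616128.0,
    "Education_type": 14418106646528.0,
    "Age": 10370544566272.0,
    "Own_car": 8475150647296.0,
}

largest = max(relative.values())
scaled = {name: value / largest for name, value in relative.items()}

for name, value in scaled.items():
    print(f"{name}: {value:.7f}")
```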


In [43]:
lb2 = aml2.leaderboard

In [44]:
lb2.head()

model_id,rmse,mse,mae,rmsle,mean_residual_deviance
GBM_1_AutoML_3_20221108_15326,95114.1,9046680000.0,64066.5,0.455016,9046680000.0
GBM_3_AutoML_3_20221108_15326,96244.3,9262960000.0,64705.1,0.457338,9262960000.0
GBM_2_AutoML_3_20221108_15326,96461.0,9304720000.0,64580.6,0.457398,9304720000.0
XRT_1_AutoML_3_20221108_15326,97283.1,9464000000.0,66153.5,0.46896,9464000000.0
XGBoost_3_AutoML_3_20221108_15326,97343.2,9475690000.0,65788.0,0.465796,9475690000.0
GBM_4_AutoML_3_20221108_15326,97491.6,9504600000.0,65492.3,0.462699,9504600000.0
DRF_1_AutoML_3_20221108_15326,98382.4,9679100000.0,66026.5,0.46888,9679100000.0
GLM_1_AutoML_3_20221108_15326,102569.0,10520400000.0,70147.3,0.502947,10520400000.0
XGBoost_2_AutoML_3_20221108_15326,103115.0,10632600000.0,69203.5,0.498492,10632600000.0
XGBoost_1_AutoML_3_20221108_15326,104251.0,10868200000.0,71126.6,0.508771,10868200000.0
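
The leaderboard columns are standard regression metrics. The sketch below computes them on a small made-up example, using the usual textbook definitions (RMSLE is taken here as the RMSE of `log(1 + value)`, which I assume matches H2O's definition). The values in `actual` and `predicted` are hypothetical and unrelated to the dataset.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and RMSLE for two equal-length lists."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    log_errors = [math.log1p(p) - math.log1p(a) for a, p in zip(actual, predicted)]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "rmsle": rmsle}

# Hypothetical values, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

In the leaderboard above, `mean_residual_deviance` equals `mse` for every model, which is what one would expect for a squared-error (Gaussian) loss.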


In [55]:
# finding and storing the best model
best_model2 = h2o.get_model(aml2.leaderboard[0,'model_id'])

In [56]:
# printing the algorithm used by the best model
best_model2.algo

'gbm'

In [45]:
data_pred2=aml2.leader.predict(data_test2)

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [46]:
data_pred2.head()

predict
154439
199242
258070
181942
171431
163011
167728
155240
167881
206183


In [47]:
aml2.leader.model_performance(data_test2)

##ANSWERS

**1.** Is the relationship significant?

**Answer:** 

**Multinomial Classification-**
Yes, 'Total_income' and 'Years_employed' are significant variables.

**Binary Classification-**
Yes, 'Total_income' is a significant variable.

**Regression-**
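Yes. Here 'Total_income' is the response, so it cannot be a predictor. According to the GBM variable importance table, 'Occupation_type' (about 34%), 'Education_type' (about 14%), and 'Age' (about 10%) contribute the most, while 'Years_employed' accounts for about 5%.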

**2.** Are any model assumptions violated?

**Answer:** 

**Multinomial Classification-**
The best model is a GLM, and there are four assumptions to validate: linearity, homoskedasticity (constant variance), normality, and independence.

**Binary Classification-**
The best model is a GLM, and there are three assumptions to validate: linearity, homoskedasticity (constant variance), and normality.

**Regression-**
The best model is a GBM. Because it is a tree-based model, there are no such assumptions to validate. It also ranks first on the leaderboard by RMSE.


**3.**  Is there any multicollinearity in the model?

**Answer:** Yes, there is multicollinearity in the regression model.

**4.** In the multivariate models are predictor variables independent of all the other predictor variables?

 **Answer:** 

 Yes, the predictor variables are independent of all the other predictor variables in the multivariate models.


**5.** In multivariate models rank the most significant predictor variables and exclude insignificant ones from the model.

**Answer:**
The top 3 significant predictor variables for the multivariate model are ranked as follows:

  1. Years_employed	
  2. Unemployed	
  3. ID	

**6.** Does the model make sense?

**Answer:**

**Multinomial Classification-**
The models used do make sense. The log loss value is very close to 0 and the AUC value is 1, which is the score a perfect classifier should have.


**Binary Classification-**
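The models used do make sense. The log loss value is very close to 0 and the AUC value is 1, which is the score a perfect classifier should have. Note, however, that the threshold metrics in the binary classification output (a maximum F1 of about 0.27 and a maximum absolute MCC of about 0.09) indicate much weaker separation, so the AUC should be rechecked on held-out data.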

**Regression-**
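The model does not look like an optimal solution. We are trying to predict 'Total_income' from the other factors, but multicollinearity is found in the data, and the best leaderboard RMSE (about 95,114) is close to the standard deviation of 'Total_income' (about 99,277), so the model explains little of the variation in income. So, it does not make sense for regression.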

**7.** Does regularization help?

**Answer:**

Yes, regularization helps in optimizing the results.

**8.** Which hyperparameters are important?

**Answer:**

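A simple GBM model in regression has two categories of hyperparameters: boosting hyperparameters and tree-specific hyperparameters. The two main boosting hyperparameters are the number of trees (the total number of trees in the sequence or ensemble) and the learning rate (how much each tree contributes to the final prediction). The main tree-specific hyperparameters are the tree depth and the minimum number of observations in a terminal node.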
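
To make the role of the boosting hyperparameters concrete, the sketch below implements a toy gradient boosting loop for squared error in plain Python, using one-split trees (stumps) on a single feature. It is a simplified illustration, not H2O's implementation. The arguments `n_trees` and `learn_rate` play the role of the number of trees and of the learning rate (a shrinkage factor applied to each tree's contribution), which H2O's GBM exposes as `ntrees` and `learn_rate`. The data in `xs` and `ys` is made up.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) with the smallest squared error."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        thr = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]

def boosted_training_mse(xs, ys, n_trees, learn_rate):
    """Fit n_trees stumps to successive residuals and return the training MSE."""
    base = sum(ys) / len(ys)                 # start from the mean prediction
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learn_rate * (left_mean if x <= thr else right_mean)
                 for x, p in zip(xs, preds)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.0, 7.0, 9.0, 20.0, 22.0, 30.0, 31.0, 45.0]

mse_0 = boosted_training_mse(xs, ys, n_trees=0, learn_rate=0.1)
mse_5 = boosted_training_mse(xs, ys, n_trees=5, learn_rate=0.1)
mse_50 = boosted_training_mse(xs, ys, n_trees=50, learn_rate=0.1)
print(mse_0, mse_5, mse_50)
```

The training error keeps shrinking as trees are added, which is why the number of trees is usually tuned against a validation set rather than the training set.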

##REFERENCES

1. 6105_Airlines_GBM_AutoML.ipynb 
2. Python Data Science Handbook
3. Scikit-learn official documentation
4. AmazonReviews.ipynb


Referred to the sample notebook (Abalone dataset) to understand the expected assignment format. 

Used scikit-learn tools to implement models. Used mlxtend for bias and variance. Used eli5 to predict and understand feature importance. 

Referred to Analytics Vidhya and the Python Data Science Handbook to study different models and concepts. 

Copyright 2022 *Jatin Madan*

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.