# Abstract


Hyperparameters and their values matter a great deal in data science because they directly control training behaviour and therefore the accuracy of the resulting predictions. Determining them by hand is a tedious and cumbersome task. The main aim of this research is to identify the important hyperparameters, and suitable tuning values for them, from the models fitted to the dataset. To achieve this, the research uses H2O, H2O.ai's module for Python, which makes it easy to repeat the process with different run times. Models are generated by running H2OAutoML for several run times (300 s, 500 s, 700 s, 1000 s, and 1200 s).


In [3]:
#importing libraries
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import pandas as pd
import logging
import csv
import optparse
import time
import json
from distutils.util import strtobool
import psutil
import warnings
warnings.filterwarnings('ignore')

In [113]:
data_path=None
all_variables=None
test_path=None
# target='search_term'
target=None
nthreads=1                        # Number of parallel threads used to run Algorithms
min_mem_size=6                    # For allocating the virtual memory
classification= True              # true if the dependent variable is of classification type
scale=False
max_models=None    
model_path=None
balance_y=False 
balance_threshold=0.2            # Threshold limit for tree based algorithms in order to prune them for better results
name=None 
server_path=None  
analysis=0 

Balance class:
    
When enabled, H2O will either undersample the majority classes or oversample the minority classes. Note that the resulting model will also correct the final probabilities (“undo the sampling”) using a monotonic transform, so the predicted probabilities of the first model will differ from a second model. However, because AUC only cares about ordering, it won’t be affected.
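
The claim that AUC depends only on ordering can be checked with a small sketch in plain Python. The `rescale` function below is a hypothetical strictly increasing transform chosen for illustration; it is not H2O's actual probability correction, and the labels and scores are made up.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rescale(p, w=0.25):
    # Hypothetical strictly increasing correction, NOT H2O's exact formula.
    return (w * p) / (w * p + (1.0 - p))

# Made-up labels and predicted probabilities for illustration.
labels = [1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.90, 0.80, 0.70, 0.60, 0.40, 0.95, 0.30, 0.20]
corrected = [rescale(p) for p in scores]

# The probabilities change, but their ordering (and hence AUC) does not.
print(auc(labels, scores), auc(labels, corrected))
```

Metrics that depend on the probability values themselves, such as logloss, would change under the same transform.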

In [4]:
data_path='Employee_access_data.csv'

### Data Cleaning

In [115]:
import pandas as pd
data= pd.read_csv("Employee_access_data.csv", decimal = ',')

Checking for the data types in the dataset 

In [116]:
data.dtypes

ACTION              int64
RESOURCE            int64
MGR_ID              int64
ROLE_ROLLUP_1       int64
ROLE_ROLLUP_2       int64
ROLE_DEPTNAME       int64
ROLE_TITLE          int64
ROLE_FAMILY_DESC    int64
ROLE_FAMILY         int64
ROLE_CODE           int64
dtype: object

**Checking for the Null values**

In [117]:
total = data.isnull().sum()[data.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(data)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

Unnamed: 0,total_missing,percent


No null values

### RUNTIME

Run time is the time budget, in seconds, given to H2OAutoML to generate models. Different run times produce different numbers of models, and different hyperparameter values are observed for each run time.


This analysis is done on 5 different run times, mentioned below:

**300, 500, 700, 1000, 1200**

In [118]:
run_time= 1000

In [119]:
pct_memory=0.5
virtual_memory=psutil.virtual_memory()
min_mem_size=int(round(int(pct_memory*virtual_memory.available)/1073741824,0))
print(min_mem_size, "GB")

2 GB
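
The memory calculation above is a bytes-to-gigabytes conversion. A minimal sketch of the same arithmetic, with the available byte count passed in as a plain number instead of being read from `psutil` (the helper name `heap_size_gb` is an assumption made for this example):

```python
GIB = 1024 ** 3  # 1,073,741,824 bytes, the divisor used above

def heap_size_gb(available_bytes, pct_memory=0.5):
    """Whole gigabytes to request, given available bytes and a fraction of them."""
    return int(round(pct_memory * available_bytes / GIB))

print(heap_size_gb(4 * GIB))     # half of 4 GiB
print(heap_size_gb(4.3 * GIB))   # half of 4.3 GiB, rounded to a whole number
```

Because the result is rounded to a whole number, a machine with little free memory can end up with a value of 0, so it may be worth enforcing a lower bound before passing the value on.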


**Functions used**

* The functions below serve several purposes.
* A different **run_id** is generated for each run time to uniquely identify the analysis done for that run time.
* **Metadata** is generated to store details that might be needed to build a database. It records the values used in this analysis and is regenerated with different values for each run time.
* A function generates the **JSON files** in which hyperparameters and metadata are stored.
* Further functions **define the X and y variables**.

In [120]:
#Defining functions

#generating random run_id
def alphabet(n):
  alpha='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
  out=''                        # local name chosen to avoid shadowing the built-in str
  r=len(alpha)-1
  while len(out)<n:
    i=random.randint(0,r)
    out+=alpha[i]
  return out

# storing in m_data dictionary  
def set_meta_data(analysis,run_id,server,data,model_path,run_time,scale,model,balance,balance_threshold,name,path,nthreads,min_mem_size):
  m_data={}
  # m_data['target']=target
  #m_data['classification']=classification
  m_data['project'] =name
  m_data['run_time']=run_time
  m_data['run_id'] =run_id
  m_data['start_time_sec'] = time.time()
  m_data['min_mem_size'] = min_mem_size
  m_data['balance']=balance
  m_data['balance_threshold']=balance_threshold      
  m_data['max_models']=model
  m_data['scale']=scale
  m_data['model_path']=model_path
  m_data['server_path']=server
  m_data['data_path']=data 
  m_data['run_path'] =path
  m_data['nthreads'] = nthreads
  
  m_data['analysis'] = analysis
  m_data['end_time_sec'] = time.time()  
  return m_data

#converting dictionary to json
def dict_to_json(dct,n):
  j = json.dumps(dct, indent=4)
  # the context manager closes the file even if writing fails
  with open(n, 'w') as f:
    print(j, file=f)
    
def get_all_variables_csv(i):
    ivd={}
    try:
      iv = pd.read_csv(i,header=None)
    except:
      sys.exit(1)    
    col=iv.values.tolist()[0]
    dt=iv.values.tolist()[1]
    i=0
    for c in col:
      ivd[c.strip()]=dt[i].strip()
      i+=1        
    return ivd
    
# Segregating in different lists of int, enum, reals and checking for missing values and then scaling(standardizing)    
def impute_missing_values(df, x, scal=False):
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in x:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    _ = df[reals].impute(method='mean')
    _ = df[ints].impute(method='median')
    if scal:
        df[reals] = df[reals].scale()
        df[ints] = df[ints].scale()    
    return

# Determining Independent variables (X) from the dataset 
def get_independent_variables(df, targ):
    C = [name for name in df.columns if name != targ]
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x=ints+enums+reals
    return x

# Keep only the names in x that exist as columns in df
def check_X(x,df):
    # build a new list: removing items from x while iterating over it would skip elements
    return [name for name in x if name in df.columns]
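
When filtering a list of column names, removing items from a list while looping over that same list can silently skip elements. The sketch below compares the two approaches using plain lists in place of an H2O frame; the function names and the `BAD_1`/`BAD_2` column names are hypothetical.

```python
def check_X_in_place(x, columns):
    # Remove-while-iterating: after each removal the next element slides
    # into the freed slot and the loop never visits it.
    for name in x:
        if name not in columns:
            x.remove(name)
    return x

def check_X_filtered(x, columns):
    # Builds a new list, so every name is tested exactly once.
    return [name for name in x if name in columns]

columns = ['RESOURCE', 'MGR_ID']
requested = ['RESOURCE', 'BAD_1', 'BAD_2', 'MGR_ID']

print(check_X_in_place(list(requested), columns))   # BAD_2 survives
print(check_X_filtered(requested, columns))
```

The problem only shows up when two invalid names sit next to each other, which is why it is easy to miss in testing.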
    

### RUN_ID

A random run_id is generated for each run. It is combined with the run time to name a separate folder on the local machine, which stores the files generated later in the analysis.

In [121]:
#randomly generating run_id through alphabet function
run_id=alphabet(9)
if server_path is None:
    server_path=os.path.abspath(os.curdir)
os.chdir(server_path) 
a = run_id + '_EmpAccess_' + str(run_time)
run_dir = os.path.join(server_path,a)
os.mkdir(run_dir)
os.chdir(run_dir)    

# run_id to std out
print (run_id) 

AwbvVmP0k
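
As an alternative sketch, the same run folder can be created without changing the working directory. The helper names `make_run_id` and `make_run_dir` are assumptions for this example, and a temporary directory stands in for `server_path`.

```python
import os
import random
import string
import tempfile

def make_run_id(n, rng=random):
    """Return n random characters drawn from digits and ASCII letters."""
    alphabet = string.digits + string.ascii_letters
    return ''.join(rng.choices(alphabet, k=n))

def make_run_dir(base_dir, run_id, run_time, tag='EmpAccess'):
    """Create and return <base_dir>/<run_id>_<tag>_<run_time>."""
    run_dir = os.path.join(base_dir, f'{run_id}_{tag}_{run_time}')
    os.makedirs(run_dir, exist_ok=False)   # fail loudly if the folder already exists
    return run_dir

base = tempfile.mkdtemp()                  # throwaway location for the demo
run_id = make_run_id(9)
print(make_run_dir(base, run_id, 1000))
```

Returning the path instead of calling `os.chdir` lets later cells write files with explicit paths, so re-running a cell does not depend on the current directory.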


**Generating name of the project to be fed into the meta data**

In [123]:
name = run_id+'_EmpAccess_' + str(run_time)
name

'AwbvVmP0k_EmpAccess_1000'

In [124]:
# meta data
meta_data = set_meta_data(analysis, run_id,server_path,data_path,model_path,run_time,scale,max_models,balance_y,balance_threshold,name,run_dir,nthreads,min_mem_size)
print(meta_data)  

{'project': 'AwbvVmP0k_EmpAccess_1000', 'run_time': 1000, 'run_id': 'AwbvVmP0k', 'start_time_sec': 1555618604.5150044, 'min_mem_size': 2, 'balance': False, 'balance_threshold': 0.2, 'max_models': None, 'scale': False, 'model_path': None, 'server_path': 'C:\\Users\\hp\\Desktop\\Data Science Hyperparameter Project', 'data_path': 'C:/Users/hp/Desktop/Data Science Hyperparameter Project/Employee_access_data.csv', 'run_path': 'C:\\Users\\hp\\Desktop\\Data Science Hyperparameter Project\\AwbvVmP0k_EmpAccess_1000', 'nthreads': 1, 'analysis': 0, 'end_time_sec': 1555618604.5150044}


Checking the Problem type of the data set used in this analysis.

In [125]:
if classification :
    meta_data["Problem_type"] = "Classification"
    print("Problem Type:  Classification")
else:
    meta_data["Problem_type"] = "Regression"
    print("Problem Type:  Regression")

Problem Type:  Classification


**Starting the H2O environment**

In [7]:
# 65535 Highest port no
port_no=random.randint(5555,55555)

#initializing H2O
h2o.init(strict_version_check=False,min_mem_size_GB=6,port=port_no)

Checking whether there is an H2O instance running at http://localhost:28215 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
  Starting server from C:\Users\hp\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\hp\AppData\Local\Temp\tmp0x686jbz
  JVM stdout: C:\Users\hp\AppData\Local\Temp\tmp0x686jbz\h2o_hp_started_from_python.out
  JVM stderr: C:\Users\hp\AppData\Local\Temp\tmp0x686jbz\h2o_hp_started_from_python.err
  Server is running at http://127.0.0.1:28215
Connecting to H2O server at http://127.0.0.1:28215 ... successful.


H2O cluster uptime:,08 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.1
H2O cluster version age:,26 days
H2O cluster name:,H2O_from_python_hp_dpmp55
H2O cluster total nodes:,1
H2O cluster free memory:,6 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [127]:
print(data_path)              #path of csv 

C:/Users/hp/Desktop/Data Science Hyperparameter Project/Employee_access_data.csv


Importing the CSV in the H2O environment

In [8]:
#importing data file on h2o server
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [129]:
df.head()

ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,17183,1540,117961,118343,123125,118536,118536,308574,118539
1,36724,14457,118219,118220,117884,117879,267952,19721,117880
1,36135,5396,117961,118343,119993,118321,240983,290919,118322
1,42680,5905,117929,117930,119569,119323,123932,19793,119325
0,45333,14561,117951,117952,118008,118568,118568,19721,118570
1,25993,17227,117961,118343,123476,118980,301534,118295,118982
1,19666,4209,117961,117969,118910,126820,269034,118638,126822
1,31246,783,117961,118413,120584,128230,302830,4673,128231
1,78766,56683,118079,118080,117878,117879,304519,19721,117880




**Shape of the Dataset**

In [130]:
rows = len(df)
print("Total rows in the data set = ", rows)
cols = len(df.columns)
print("Total Columns in the data set = ", cols)

rowscols = df.shape
shape = rows * cols
print("rows X columns = ", rowscols)
print("Total Records = ", shape)

meta_data['Total Records'] = shape

Total rows in the data set =  32769
Total Columns in the data set =  10
rows X columns =  (32769, 10)
Total Records =  327690


In [131]:
df.describe()

Rows:32769
Cols:10




Unnamed: 0,ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
type,int,int,int,int,int,int,int,int,int,int
mins,0.0,0.0,25.0,4292.0,23779.0,4674.0,117879.0,4673.0,3130.0,117880.0
mean,0.9421099209618847,42923.916170771125,25988.957978577324,116952.62778845867,118301.82315603165,118912.77991394309,125916.15264426745,170178.36964814307,183703.40889255097,119789.43013213709
maxs,1.0,312153.0,311696.0,311178.0,286791.0,286792.0,311867.0,311867.0,308574.0,270691.0
sigma,0.23353903780676308,34173.892702138255,35928.03165014073,10875.563591093745,4551.588572012568,18961.32291708769,31036.465824743256,69509.46213013002,100488.40741337684,5784.275515531029
zeros,1897,13,0,0,0,0,0,0,0,0
missing,0,0,0,0,0,0,0,0,0,0
0,1.0,39353.0,85475.0,117961.0,118300.0,123472.0,117905.0,117906.0,290919.0,117908.0
1,1.0,17183.0,1540.0,117961.0,118343.0,123125.0,118536.0,118536.0,308574.0,118539.0
2,1.0,36724.0,14457.0,118219.0,118220.0,117884.0,117879.0,267952.0,19721.0,117880.0



### Dependent Variable

In [133]:
# dependent variable
# assign target and inputs for classification or regression
if target is None:
    target="ACTION"   
y = target
meta_data['Target']=y
y

'ACTION'

### Independent Variables

In [134]:
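# NOTE: check_all_variables is not defined in this notebook, so this branch
# needs that helper to be supplied; it is skipped while all_variables is None.
if all_variables is not None:
  ivd=get_all_variables_csv(all_variables)
  print(ivd)    
  X=check_all_variables(df,ivd,y)
  print(X)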

In [135]:
# independent variables
# putting all independent variables in the list X 

X = []  
if all_variables is None:
  X=get_independent_variables(df, target)  
  print(X)  
else: 
  ivd=get_all_variables_csv(all_variables)    
  X=check_all_variables(df, ivd)


X=check_X(X,df)


# Add independent variables to meta data

meta_data['X']=X  


# impute missing values

_=impute_missing_values(df,X, scale)

['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']


### Problem Type

The dependent variable in this data set is binary, so this is a classification problem.

In [136]:
# Force target to be factors
# asfactor() only accepts 'int' or 'string' columns; a 'real' column raises an error

if classification:
    df[y] = df[y].asfactor()

In [137]:
df[y]

ACTION
1
1
1
1
1
0
1
1
1
1




**Checking the number of classes of dependent variable**

In [138]:
# Total categories in the target column
lvl = df[y].levels()
print(lvl)
meta_data["levels"] = lvl

[['0', '1']]


y has 2 levels or classes

In [139]:
# checking value of target for real, int, enum
def check_y(y,df):
  ok=False
  C = [name for name in df.columns if name == y]
  for key, val in df.types.items():
    if key in C:
      if val in ['real','int','enum']:        
        ok=True         
  return ok   

In [140]:
ok=check_y(y,df)
if not ok:
    print(ok)

In [141]:
# since y is enum type
print(ok)

True


In [142]:
def get_variables_types(df):
    d={}
    for key, val in df.types.items():
        d[key]=val           
    return d    
    

In [143]:
# getting the data types of all the variables in the dataset
allV=get_variables_types(df)
allV

{'ACTION': 'enum',
 'RESOURCE': 'int',
 'MGR_ID': 'int',
 'ROLE_ROLLUP_1': 'int',
 'ROLE_ROLLUP_2': 'int',
 'ROLE_DEPTNAME': 'int',
 'ROLE_TITLE': 'int',
 'ROLE_FAMILY_DESC': 'int',
 'ROLE_FAMILY': 'int',
 'ROLE_CODE': 'int'}

In [144]:
# Adding the data types to meta data 
meta_data['variables']=allV

### Using H2OAutoML

* Used for automatic training and tuning of the models

In [145]:
# Set up AutoML

aml = H2OAutoML(max_runtime_secs=run_time,project_name = name)

In [146]:
model_start_time = time.time()

In [147]:
aml.train(x=X,y=y,training_frame=df)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [148]:
execution_time =  time.time() - model_start_time
meta_data['model_execution_time_sec'] = execution_time
print("Execution time for ", run_time,"sec =  ",meta_data['model_execution_time_sec'])

Execution time for  1000 sec =   906.2215521335602


## LeaderBoard

The leaderboard shows the models generated by H2O using different algorithms within a particular run_time. The best model is shown at the top of the leaderboard. Currently H2O can build models with algorithms such as GBM, GLM, DRF, XRT, XGBoost, Deep Learning, and Naive Bayes.

Given the nature of the dataset, this analysis generates **GLM, GBM, DRF and XRT** models.

In [149]:
# get leaderboard
aml_leaderboard_df=aml.leaderboard.as_data_frame()

In [150]:
aml_leaderboard_df

Unnamed: 0,model_id,auc,logloss,mean_per_class_error,rmse,mse
0,StackedEnsemble_AllModels_AutoML_20190418_161716,0.851117,0.15814,0.340031,0.201728,0.040694
1,StackedEnsemble_BestOfFamily_AutoML_20190418_1...,0.850172,0.158406,0.337966,0.201823,0.040733
2,GBM_grid_1_AutoML_20190418_161716_model_2,0.847503,0.158291,0.333301,0.203096,0.041248
3,DRF_1_AutoML_20190418_161716,0.846892,0.180668,0.335525,0.201431,0.040574
4,GBM_4_AutoML_20190418_161716,0.844948,0.159194,0.380874,0.203027,0.04122
5,GBM_grid_1_AutoML_20190418_161716_model_9,0.84458,0.174315,0.373218,0.213929,0.045765
6,XRT_1_AutoML_20190418_161716,0.844295,0.178992,0.373203,0.202001,0.040804
7,GBM_3_AutoML_20190418_161716,0.838445,0.162248,0.399768,0.204822,0.041952
8,GBM_5_AutoML_20190418_161716,0.835598,0.165465,0.402801,0.207659,0.043122
9,GBM_grid_1_AutoML_20190418_161716_model_8,0.834055,0.1655,0.402744,0.207991,0.04326
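
The algorithm family of each model can be read from its `model_id`. A minimal sketch, assuming the family name is the text before the first underscore (an assumption based on the ids shown above) and using only the ten rows displayed:

```python
from collections import Counter

# model_id values copied from the leaderboard above (one id is truncated there).
model_ids = [
    'StackedEnsemble_AllModels_AutoML_20190418_161716',
    'StackedEnsemble_BestOfFamily_AutoML_20190418_1...',
    'GBM_grid_1_AutoML_20190418_161716_model_2',
    'DRF_1_AutoML_20190418_161716',
    'GBM_4_AutoML_20190418_161716',
    'GBM_grid_1_AutoML_20190418_161716_model_9',
    'XRT_1_AutoML_20190418_161716',
    'GBM_3_AutoML_20190418_161716',
    'GBM_5_AutoML_20190418_161716',
    'GBM_grid_1_AutoML_20190418_161716_model_8',
]

def family(model_id):
    # Assumes the algorithm name is the text before the first underscore.
    return model_id.split('_', 1)[0]

counts = Counter(family(m) for m in model_ids)
print(counts.most_common())
```

Applied to the full `model_id` column, the same count gives the list of algorithms generated in a run.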


In [151]:
length = len(aml_leaderboard_df)- 1
length
meta_data["Models_generated"] = length

### Saving Leaderboard into csv

In [152]:
# save leaderboard to csv
# run_time = run_time.ascharacter()
leaderboard_stats=run_id+'EmpAccess_'+ str(run_time) + '_leaderboard.csv'
aml_leaderboard_df.to_csv(leaderboard_stats)

### Generating hyperparameters for the best model on the leaderboard

In [153]:
# Start with the best model, i.e. the first model on the leaderboard
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [154]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'StackedEnsemble_AllModels_AutoML_20190418_161716',
   'type': 'Key<Model>',
   'URL': '/3/Models/StackedEnsemble_AllModels_AutoML_20190418_161716'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_py_20_sid_922c',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_py_20_sid_922c'}},
 'response_column': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ColSpecifierV3',
    'schema_type': 'VecSpecifier'},
   'column_name': 'ACTION',
   'is_member_of_frames': None}},
 'validation_frame': {'default': None, 'actual': None},
 'blending_frame': {'default': None, 'actual': None},
 'base_models': {'default': [],
  'actual': [{'__meta': {'schema_version': 3,
     'schema_n

In [155]:
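# generating normalized coeff.
# note: coef_norm is a method; without parentheses, print() shows the model summary and the bound method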
mods=mod_best.coef_norm
print(mods)


Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_AutoML_20190418_161716
No model summary for this model


ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.017456243856407334
RMSE: 0.13212207936755815
LogLoss: 0.07396160274426232
Null degrees of freedom: 32768
Residual degrees of freedom: 32756
Null deviance: 14491.899767390338
Residual deviance: 4847.295520653464
AIC: 4873.295520653464
AUC: 0.9946143875239515
pr_auc: 0.994391171157678
Gini: 0.989228775047903
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.7526659731480062: 


,0.0,1.0,Error,Rate
0,1570.0,327.0,0.1724,(327.0/1897.0)
1,272.0,30600.0,0.0088,(272.0/30872.0)
Total,1842.0,30927.0,0.0183,(599.0/32769.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.7526660,0.9903073,177.0
max f2,0.4625407,0.9927758,261.0
max f0point5,0.8859573,0.9929031,124.0
max accuracy,0.7526660,0.9817205,177.0
max precision,0.9859806,1.0,0.0
max recall,0.0490433,1.0,382.0
max specificity,0.9859806,1.0,0.0
max absolute_mcc,0.8236883,0.8341534,151.0
max min_per_class_accuracy,0.9160020,0.9692926,106.0


Gains/Lift Table: Avg response rate: 94.21 %, avg score: 93.44 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100095,0.9848689,1.0614473,1.0614473,1.0,0.9856133,1.0,0.9856133,0.0106245,0.0106245,6.1447266,6.1447266
,2,0.0200189,0.9841761,1.0614473,1.0614473,1.0,0.9845158,1.0,0.9850645,0.0106245,0.0212490,6.1447266,6.1447266
,3,0.0300284,0.9834892,1.0614473,1.0614473,1.0,0.9838120,1.0,0.9846470,0.0106245,0.0318735,6.1447266,6.1447266
,4,0.0400073,0.9829616,1.0614473,1.0614473,1.0,0.9832135,1.0,0.9842895,0.0105921,0.0424657,6.1447266,6.1447266
,5,0.0500168,0.9824835,1.0614473,1.0614473,1.0,0.9827156,1.0,0.9839745,0.0106245,0.0530902,6.1447266,6.1447266
,6,0.1000031,0.9811272,1.0614473,1.0614473,1.0,0.9817108,1.0,0.9828430,0.0530578,0.1061480,6.1447266,6.1447266
,7,0.1500198,0.9803380,1.0614473,1.0614473,1.0,0.9806973,1.0,0.9821276,0.0530902,0.1592381,6.1447266,6.1447266
,8,0.2000061,0.9797799,1.0614473,1.0614473,1.0,0.9800501,1.0,0.9816084,0.0530578,0.2122959,6.1447266,6.1447266
,9,0.3000092,0.9786528,1.0614473,1.0614473,1.0,0.9792263,1.0,0.9808144,0.1061480,0.3184439,6.1447266,6.1447266




ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.04069427199960115
RMSE: 0.2017282131968683
LogLoss: 0.15814023691579288
Null degrees of freedom: 32768
Residual degrees of freedom: 32759
Null deviance: 14492.371329898797
Residual deviance: 10364.194846987233
AIC: 10384.194846987233
AUC: 0.8511172033063757
pr_auc: 0.9825339174114955
Gini: 0.7022344066127515
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.5885959863105942: 


,0.0,1.0,Error,Rate
0,628.0,1269.0,0.669,(1269.0/1897.0)
1,343.0,30529.0,0.0111,(343.0/30872.0)
Total,971.0,31798.0,0.0492,(1612.0/32769.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.5885960,0.9742780,250.0
max f2,0.0319182,0.9879898,391.0
max f0point5,0.8768183,0.9694584,144.0
max accuracy,0.5885960,0.9508072,250.0
max precision,0.9792449,0.9881486,17.0
max recall,0.0112329,1.0,397.0
max specificity,0.9867787,0.9984186,0.0
max absolute_mcc,0.7769470,0.4777051,189.0
max min_per_class_accuracy,0.9646745,0.7847240,59.0


Gains/Lift Table: Avg response rate: 94.21 %, avg score: 94.21 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100095,0.9846734,1.0420305,1.0420305,0.9817073,0.9855936,0.9817073,0.9855936,0.0104302,0.0104302,4.2030548,4.2030548
,2,0.0200189,0.9835972,1.0485028,1.0452667,0.9878049,0.9840676,0.9847561,0.9848306,0.0104949,0.0209251,4.8502787,4.5266668
,3,0.0300284,0.9828822,1.0452667,1.0452667,0.9847561,0.9832255,0.9847561,0.9842956,0.0104626,0.0313877,4.5266668,4.5266668
,4,0.0400073,0.9822874,1.0517092,1.0468736,0.9908257,0.9825779,0.9862700,0.9838671,0.0104949,0.0418826,5.1709218,4.6873619
,5,0.0500168,0.9818587,1.0290861,1.0433139,0.9695122,0.9820465,0.9829164,0.9835028,0.0103006,0.0521832,2.9086069,4.3313939
,6,0.1000031,0.9804710,1.0491350,1.0462236,0.9884005,0.9810872,0.9856576,0.9822954,0.0524423,0.1046256,4.9134996,4.6223579
,7,0.1500198,0.9796714,1.0510854,1.0478445,0.9902379,0.9800447,0.9871847,0.9815450,0.0525719,0.1571975,5.1085365,4.7844504
,8,0.2000061,0.9789868,1.0517271,1.0488148,0.9908425,0.9793217,0.9880989,0.9809893,0.0525719,0.2097694,5.1727053,4.8814845
,9,0.3000092,0.9776440,1.0481670,1.0485989,0.9874886,0.9782998,0.9878954,0.9800928,0.1048199,0.3145893,4.8167029,4.8598906



<bound method ModelBase.coef_norm of >


## Generating and storing Hyperparameters of each model into JSON

In [156]:
model_set=aml_leaderboard_df['model_id']
type(model_set)

pandas.core.series.Series

In [157]:
jsonDicts = []
for _, model_name in model_set.items():   # Series.items() replaces iteritems(), removed in pandas 2.0
    mod_best = h2o.get_model(model_name)
    jsonDicts.append(mod_best.params)

In [158]:
print(jsonDicts)

[{'model_id': {'default': None, 'actual': {'__meta': {'schema_version': 3, 'schema_name': 'ModelKeyV3', 'schema_type': 'Key<Model>'}, 'name': 'StackedEnsemble_AllModels_AutoML_20190418_161716', 'type': 'Key<Model>', 'URL': '/3/Models/StackedEnsemble_AllModels_AutoML_20190418_161716'}}, 'training_frame': {'default': None, 'actual': {'__meta': {'schema_version': 3, 'schema_name': 'FrameKeyV3', 'schema_type': 'Key<Frame>'}, 'name': 'automl_training_py_20_sid_922c', 'type': 'Key<Frame>', 'URL': '/3/Frames/automl_training_py_20_sid_922c'}}, 'response_column': {'default': None, 'actual': {'__meta': {'schema_version': 3, 'schema_name': 'ColSpecifierV3', 'schema_type': 'VecSpecifier'}, 'column_name': 'ACTION', 'is_member_of_frames': None}}, 'validation_frame': {'default': None, 'actual': None}, 'blending_frame': {'default': None, 'actual': None}, 'base_models': {'default': [], 'actual': [{'__meta': {'schema_version': 3, 'schema_name': 'ModelKeyV3', 'schema_type': 'Key<Model>'}, 'name': 'GBM_gr




In [159]:
n=run_id+'EmpAccess_'+ str(run_time)+'_hy_parameter.json'
dict_to_json(jsonDicts,n)
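
Each entry in a model's `params` pairs a `default` with an `actual` value, so the settings that were actually changed can be pulled out before writing JSON. A minimal sketch on a hand-written dictionary shaped like the output above; the `learn_rate` and `ntrees` entries and both helper names are illustrative assumptions, not values from this run.

```python
import json

# Hypothetical excerpt shaped like the `params` output above;
# the learn_rate and ntrees values are illustrative, not from this run.
params = {
    'validation_frame': {'default': None, 'actual': None},
    'blending_frame': {'default': None, 'actual': None},
    'learn_rate': {'default': 0.1, 'actual': 0.05},
    'ntrees': {'default': 50, 'actual': 120},
}

def actual_values(params):
    """Map each parameter name to its 'actual' value."""
    return {name: spec['actual'] for name, spec in params.items()}

def changed_from_default(params):
    """Keep only parameters whose actual value differs from the default."""
    return {name: spec['actual'] for name, spec in params.items()
            if spec['actual'] != spec['default']}

print(json.dumps(changed_from_default(params), indent=4))
```

Storing only the changed values keeps the JSON small, at the cost of losing the record of which defaults were in force.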

In [160]:
jsonDictsActual = []
for _, model_name in model_set.items():   # Series.items() replaces iteritems(), removed in pandas 2.0
    mod_best = h2o.get_model(model_name)
    jsonDictsActual.append(mod_best.actual_params)

In [161]:
print(jsonDictsActual)

[<property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property object at 0x000000AAB8FA54F8>, <property objec

### Generating metadata and storing in JSON

In [162]:
meta_data['end_time'] = time.time()

In [163]:
meta_data

{'project': 'AwbvVmP0k_EmpAccess_1000',
 'run_time': 1000,
 'run_id': 'AwbvVmP0k',
 'start_time_sec': 1555618604.5150044,
 'min_mem_size': 2,
 'balance': False,
 'balance_threshold': 0.2,
 'max_models': None,
 'scale': False,
 'model_path': None,
 'server_path': 'C:\\Users\\hp\\Desktop\\Data Science Hyperparameter Project',
 'data_path': 'C:/Users/hp/Desktop/Data Science Hyperparameter Project/Employee_access_data.csv',
 'run_path': 'C:\\Users\\hp\\Desktop\\Data Science Hyperparameter Project\\AwbvVmP0k_EmpAccess_1000',
 'nthreads': 1,
 'analysis': 0,
 'end_time_sec': 1555618604.5150044,
 'Problem_type': 'Classification',
 'Total Records': 327690,
 'Target': 'ACTION',
 'X': ['RESOURCE',
  'MGR_ID',
  'ROLE_ROLLUP_1',
  'ROLE_ROLLUP_2',
  'ROLE_DEPTNAME',
  'ROLE_TITLE',
  'ROLE_FAMILY_DESC',
  'ROLE_FAMILY',
  'ROLE_CODE'],
 'levels': [['0', '1']],
 'variables': {'ACTION': 'enum',
  'RESOURCE': 'int',
  'MGR_ID': 'int',
  'ROLE_ROLLUP_1': 'int',
  'ROLE_ROLLUP_2': 'int',
  'ROLE_DEPTNA

In [164]:
n=run_id+'EmpAccess_' + str(run_time) + 'meta_data.json'
dict_to_json(meta_data,n)

#### Shutting down H2O

In [165]:
# Clean up
os.chdir(server_path)

In [166]:
h2o.cluster().shutdown()

H2O session _sid_922c closed.


#### Link to the data set

https://www.kaggle.com/c/amazon-employee-access-challenge/overview

# Conclusion

**Finding hyperparameters**

The aim of finding hyperparameters for different algorithms at different run times was achieved, using H2OAutoML to obtain them. The run times used in this analysis, and the models each one produced, are summarised below:

### Summary Table

| RUN TIME (sec) | Algorithms generated | Total number of models |
|---|---|---|
|300|GBM, GLM, DRF, XRT, DeepLearning|11|
|500|XRT, DRF, GLM |6|
|700|GBM, GLM, DRF, DeepLearning|22|
|1000|GBM, GLM, XRT, DeepLearning|25|
|1200|GBM, GLM, DRF, XRT, DeepLearning|29|

The summary table shows that the number of models generated tends to increase with run time; the 500 s run, with only 6 models, is the exception.
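
The trend can be checked directly against the figures in the table. A small sketch that lists every step where the model count falls as the run time rises (the helper name `drops` is an assumption):

```python
# (run time in seconds, total models) copied from the summary table.
summary = [(300, 11), (500, 6), (700, 22), (1000, 25), (1200, 29)]

def drops(rows):
    """Return consecutive pairs where the model count falls as run time rises."""
    return [(a, b) for a, b in zip(rows, rows[1:]) if b[1] < a[1]]

print(drops(summary))
```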

### Analysis part

### Important hyperparameters

H2O grid search was used.

|Model|Hyperparameter| 
|---|---|
|XRT|mtries|
|DRF|mtries|
|GLM|alpha|
|GBM|learn_rate|


### Range of hyperparameters

To tune any algorithm well, the best possible set of values is required, since it can increase model performance.

<h1><center>GBM</center></h1>  

|Hyperparameter| Minimum | Maximum |
|---|---|---|
|learn_rate|0.001|0.8|
|col_sample_rate|0.4|1.0|
|tweedie|1.5|1.5|
|quantile_alpha|0.5| 0.5|
|huber_alpha|0.9| 0.9|
|sample_rate|0.5| 1.0|
|col_sample_rate_per_tree|0.4| 1.0|
|min_split_improvement|1e-05| 0.0001|
|max_abs_leafnode_pred|1.7976931348623157e+308|1.7976931348623157e+308|

<h1><center>DRF</center></h1>

|Hyperparameter| Minimum | Maximum |
|---|---|---|
|mtries|-1 | -1|

<h1><center>XRT</center></h1>

|Hyperparameter| Minimum | Maximum |
|---|---|---|
|mtries|-1 | -1|

<h1><center>GLM</center></h1>

|Hyperparameter| Minimum | Maximum |
|---|---|---|
|seed|711766579752531550 | 7150684335027209933|
|tweedie_variance_power|0.0 | 0.0|
|tweedie_link_power|1.0|1.0|
|alpha|0.0 | 1.0|
|lambda|0.0013175999248032143| 0.40044436071981493|
|theta|1e-10| 1e-10|
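
Minimum and maximum columns like those in the tables above can be derived from per-model parameter dictionaries. A minimal sketch, assuming each model's actual parameters are available as a plain dict; the sample values and the helper name `param_ranges` are illustrative only.

```python
def param_ranges(models, names):
    """Return {name: (min, max)} over the models that define each parameter."""
    ranges = {}
    for name in names:
        values = [m[name] for m in models if name in m and m[name] is not None]
        if values:
            ranges[name] = (min(values), max(values))
    return ranges

# Hypothetical per-model actual parameters; values are illustrative only.
gbm_models = [
    {'learn_rate': 0.1, 'sample_rate': 0.8, 'col_sample_rate': 0.7},
    {'learn_rate': 0.001, 'sample_rate': 0.5, 'col_sample_rate': 1.0},
    {'learn_rate': 0.8, 'sample_rate': 1.0, 'col_sample_rate': 0.4},
]

print(param_ranges(gbm_models, ['learn_rate', 'sample_rate', 'col_sample_rate']))
```

A parameter whose minimum equals its maximum was never varied across the models, which is how fixed settings can be told apart from tuned ones.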

### Comparing the ranges

Comparison of the ranges of hyperparameters that are common to different algorithms:

|Model|Hyperparameter|range|
|---|---|---|
|DRF| mtries | -1|
|XRT| mtries | -1|

# Contribution

percentage ratio - 80: 20

Self - 80%

External - 20%

Urja - 40%

Prakruthi - 40%

# Citation

https://github.com/prabhuSub/Hyperparamter-Samples

https://github.com/nikbearbrown/CSYE_7245/tree/master/H2O

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html

https://github.com/AmiGandhi/Top-1-percent-in-Kaggle-Competition

https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e

# License

Copyright (c) 2019 Urja Jain, Prakruthi Bagur Suryanarayanaprasad

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.