<a href="https://colab.research.google.com/github/VishwanthCheruku/H20-Tutorial/blob/main/H2O_Tutorial2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

H2O is an open source Machine Learning framework with fully tested implementations of several widely accepted ML algorithms. You pick an algorithm from its large repository and apply it to your dataset.

It is easy to use and provides **several statistical and ML algorithms, including deep learning.**

In this tutorial, we will work through an example to understand how to use H2O.

**Audience**
This tutorial is designed for learners who want to develop a Machine Learning model on a huge database.

**Prerequisites**

It is assumed that the learner has a basic understanding of Machine Learning and is familiar with Python.

**H2O Setup Guide**


Have you ever been asked to develop a Machine Learning model on a **huge database**? Typically, the customer will provide you with the database and ask you to make certain predictions, such as who the potential buyers will be or whether fraudulent cases can be detected early. To answer these questions, your task is to develop a Machine Learning model that answers the customer’s query. Developing a Machine Learning algorithm from scratch is not an easy task, and why should you do this when there are **several ready-to-use Machine Learning libraries** available?

These days, you would rather use these libraries, apply a well-tested algorithm from them and look at its performance. If the performance is not within acceptable limits, you would either fine-tune the current algorithm or try an altogether different one.

Likewise, you may try multiple algorithms on the same dataset and then pick the one that best meets the customer’s requirements. This is where H2O comes to your rescue: it offers fully tested implementations of widely accepted statistical and ML algorithms that you can apply directly to your dataset.

To mention a few, it includes **gradient boosted machines (GBM), generalized linear models (GLM), deep learning and many more**. It also supports ***AutoML functionality*** that ranks the performance of different algorithms on your dataset, thus reducing the effort of finding the best performing model. H2O is an in-memory platform, which keeps data processing fast.

To install H2O on your machine, see the [H2O Installation Tutorial](https://www.tutorialspoint.com/h2o/h2o_installation.htm). We will learn how to use it from the command line so that you understand how it works line by line. If you are a Python lover, you may use Jupyter or any other IDE of your choice for developing H2O applications.

H2O also provides a web-based tool, called Flow, to test different algorithms on your dataset.

The tutorial will introduce you to the use of **Flow**. Alongside, we will discuss the use of **AutoML** that will identify the best performing algorithm on your dataset. Are you not excited to learn H2O? Keep reading!


**H2O provides many built-in ML and Deep Learning algorithms, but this tutorial focuses on AutoML.**

**To use AutoML, start a new Jupyter notebook and follow the steps shown below.**

**Importing AutoML**

First install H2O, then import the H2O and AutoML packages into the project using the following statements −

In [1]:
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html
Collecting h2o
  Downloading h2o-3.40.0.4.tar.gz (177.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 177.6/177.6 MB 6.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.40.0.4-py2.py3-none-any.whl size=177697886 sha256=bdf6b537a58ef45d543b6199dc99137b029eab3f515ba8de9ca76622ce575f06
  Stored in directory: /root/.cache/pip/wheels/43/f2/b0/5bb4d702a0467e82d77c45088db3eef25114c26b0eec8e7f6a
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.40.0.4


In [2]:
import h2o
from h2o.automl import H2OAutoML

**Initialize H2O**

Initialize H2O using the following statement −

In [3]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.19" 2023-04-18; OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu120.04.1); OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.10/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpu0xjkp2r
  JVM stdout: /tmp/tmpu0xjkp2r/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpu0xjkp2r/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.4
H2O_cluster_version_age:,1 month and 14 days
H2O_cluster_name:,H2O_from_python_unknownUser_5or6dc
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.170 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


**Loading Data**

We are using the iris.csv dataset. Load the data using the following statement −

In [4]:
# Import the iris CSV from its URL into an H2OFrame
data = h2o.import_file('https://gist.githubusercontent.com/btkhimsar/ed560337d8b944832d1c1f55fac093fc/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv')



Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [5]:
data.columns

['sepal.length', 'sepal.width', 'petal.length', 'petal.width', 'variety']

**Preparing Dataset**

We need to decide on the feature columns and the prediction column. Here, the four measurement columns are the features and the variety column is the one to predict. Set the features and the output column using the following two statements −

In [6]:
features = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
output = 'variety'
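
Instead of typing the feature names by hand, the list can be derived from the column names printed earlier. The snippet below is a minimal sketch in plain Python; the columns list is copied from the data.columns output above, and in the notebook you would use data.columns directly.

```python
# Column names as printed by data.columns earlier in this tutorial
columns = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width', 'variety']

# The column we want to predict
output = 'variety'

# Every other column is treated as a feature
features = [name for name in columns if name != output]

print(features)
```

This keeps the feature list in sync with the data if columns are added or renamed later.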

Split the data into training and testing sets in an 80:20 ratio. The split is approximate: in this run it yields 121 training rows and 29 test rows −

In [7]:
train, test = data.split_frame(ratios=[0.8])
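
The split above does not produce exactly 120 and 30 rows. One way to picture why is a per-row random assignment, sketched below in plain Python. This is only an illustration of a probabilistic split, not H2O's actual implementation; the function name approximate_split is hypothetical. To my knowledge, split_frame also accepts a seed argument if you want the same split on every run.

```python
import random

def approximate_split(rows, train_ratio, seed):
    """Assign each row to train with probability train_ratio (illustrative only)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is decided independently, so the final sizes vary around the ratio
        (train if rng.random() < train_ratio else test).append(row)
    return train, test

rows = list(range(150))  # the iris dataset has 150 rows (121 + 29 in this tutorial)
train_rows, test_rows = approximate_split(rows, train_ratio=0.8, seed=1)

# Sizes are close to 120 and 30, but not guaranteed to match exactly
print(len(train_rows), len(test_rows))
```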

**Applying AutoML**

Now we are all set to apply AutoML to our dataset. AutoML runs within the limits that we set and gives us the best model it finds. We set up AutoML using the following statement −

In [8]:
automl = H2OAutoML(max_models=30, max_runtime_secs=300, seed=1)

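The first parameter, max_models, specifies the maximum number of models that we want to build and compare.

The second parameter, max_runtime_secs, specifies the maximum time in seconds for which AutoML runs. The run stops as soon as either limit is reached.

The third parameter, seed, fixes the random seed so that repeated runs are more reproducible.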
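
How a model-count limit and a time limit interact can be sketched in plain Python. The loop below is not H2O code; run_with_budget and the dummy candidates are hypothetical stand-ins that only illustrate stopping at whichever limit is reached first, which is how I understand the two AutoML parameters to behave.

```python
import time

def run_with_budget(candidates, max_models, max_runtime_secs):
    """Train candidates in order until the model or time budget runs out.

    candidates is a list of zero-argument callables, each standing in for
    "train one model" (hypothetical, not an H2O API).
    """
    start = time.monotonic()
    trained = []
    for train_one in candidates:
        if len(trained) >= max_models:
            break  # model-count budget exhausted
        if time.monotonic() - start >= max_runtime_secs:
            break  # time budget exhausted
        trained.append(train_one())
    return trained

# Five dummy "models" but a budget of three: the count limit ends the run first
dummy_candidates = [lambda i=i: f"model_{i}" for i in range(5)]
trained_models = run_with_budget(dummy_candidates, max_models=3, max_runtime_secs=300)
print(trained_models)
```

With a small dataset such as iris, the model limit is likely to be reached well before the 300 second limit.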

We now call the train method on the AutoML object as shown here −

In [9]:
automl.train(x=features, y=output, training_frame=train)

AutoML progress: |████
07:25:47.399: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 121.0.

███████████████████████████████████████████████████████████| (done) 100%


,family,link,regularization,lambda_search,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,multinomial,multinomial,Ridge ( lambda = 2.883E-4 ),"nlambda = 30, lambda.max = 42.884, lambda.min = 2.883E-4, lambda.1se = 0.001204",15,12,96,AutoML_1_20230612_72538_training_py_2_sid_9a34

Setosa,Versicolor,Virginica,Error,Rate
39.0,0.0,0.0,0.0,0 / 39
0.0,40.0,1.0,0.0243902,1 / 41
0.0,1.0,40.0,0.0243902,1 / 41
39.0,41.0,41.0,0.0165289,2 / 121

k,hit_ratio
1,0.9834711
2,1.0
3,1.0

Setosa,Versicolor,Virginica,Error,Rate
39.0,0.0,0.0,0.0,0 / 39
0.0,40.0,1.0,0.0243902,1 / 41
0.0,2.0,39.0,0.0487805,2 / 41
39.0,42.0,40.0,0.0247934,3 / 121

k,hit_ratio
1,0.9752066
2,1.0
3,1.0

metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.9753333,0.0225278,0.96,1.0,0.9583333,1.0,0.9583333
auc,,0.0,,,,,
err,0.0246667,0.0225278,0.04,0.0,0.0416667,0.0,0.0416667
err_count,0.6,0.5477226,1.0,0.0,1.0,0.0,1.0
logloss,0.0633976,0.0540695,0.1101926,0.0294923,0.0383473,0.0076182,0.1313376
max_per_class_error,0.075,0.0684653,0.125,0.0,0.125,0.0,0.125
mean_per_class_accuracy,0.975,0.0228218,0.9583333,1.0,0.9583333,1.0,0.9583333
mean_per_class_error,0.025,0.0228218,0.0416667,0.0,0.0416667,0.0,0.0416667
mse,0.0190859,0.0187213,0.0339317,0.004535,0.0133006,0.0003688,0.0432936
null_deviance,53.189396,0.9821337,54.946247,52.75409,52.75409,52.75409,52.738457

,timestamp,duration,iteration,lambda,predictors,deviance_train,deviance_xval,deviance_se,alpha,iterations,training_rmse,training_logloss,training_r2,training_classification_error,training_auc,training_pr_auc
,2023-06-12 07:25:46,0.000 sec,2,.43E2,15,2.1541753,2.1637225,0.0011910,0.0,,,,,,,
,2023-06-12 07:25:46,0.017 sec,4,.27E2,15,2.1295239,2.1438185,0.0020371,0.0,,,,,,,
,2023-06-12 07:25:46,0.028 sec,6,.17E2,15,2.0917423,2.1129049,0.0032478,0.0,,,,,,,
,2023-06-12 07:25:46,0.039 sec,8,.1E2,15,2.0354464,2.0661599,0.0050402,0.0,,,,,,,
,2023-06-12 07:25:46,0.050 sec,10,.64E1,15,1.9549848,1.9978936,0.0075858,0.0,,,,,,,
,2023-06-12 07:25:46,0.060 sec,12,.4E1,15,1.8467439,1.9032021,0.0109593,0.0,,,,,,,
,2023-06-12 07:25:46,0.073 sec,15,.25E1,15,1.7113225,1.7806827,0.0150437,0.0,,,,,,,
,2023-06-12 07:25:47,0.085 sec,18,.15E1,15,1.5566095,1.6341831,0.0195924,0.0,,,,,,,
,2023-06-12 07:25:47,0.098 sec,21,.95E0,15,1.3939614,1.4745828,0.0240701,0.0,,,,,,,
,2023-06-12 07:25:47,0.110 sec,24,.59E0,15,1.2354251,1.3141194,0.0282647,0.0,,,,,,,

variable,relative_importance,scaled_importance,percentage
petal.length,12.83918,1.0,0.4050027
petal.width,11.3832922,0.8866059,0.3590778
sepal.width,3.7767918,0.2941615,0.1191362
sepal.length,3.7022028,0.288352,0.1167833
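
The cross-validation summary in the output above lists a mean and an sd for each metric across the five folds. As a quick check, the accuracy row can be reproduced with Python's statistics module. The fold values below are copied from that row; the observation that sd matches the sample standard deviation (dividing by n - 1) comes from this calculation, not from the H2O documentation.

```python
import statistics

# Accuracy per fold, copied from cv_1_valid ... cv_5_valid in the output above
fold_accuracy = [0.96, 1.0, 0.9583333, 1.0, 0.9583333]

mean_accuracy = statistics.mean(fold_accuracy)
sd_accuracy = statistics.stdev(fold_accuracy)  # sample standard deviation (n - 1)

# Compare with the reported mean 0.9753333 and sd 0.0225278
print(round(mean_accuracy, 7), round(sd_accuracy, 7))
```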


We specify x as the features list that we created earlier, y as the output column to predict, and training_frame as the train dataset.

Run the code; it can take up to 5 minutes (we set max_runtime_secs to 300) before you get the output shown above.
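
The error figures in the model output can be recomputed from a confusion matrix by hand. The sketch below uses the second confusion matrix from the output above, which appears to be the cross-validation one because its mean per-class error matches the leaderboard value for the GLM model. Rows are actual classes and columns are predicted classes.

```python
labels = ["Setosa", "Versicolor", "Virginica"]

# Second confusion matrix from the output above (rows: actual, columns: predicted)
confusion = [
    [39, 0, 0],
    [0, 40, 1],
    [0, 2, 39],
]

# Per-class error: share of each actual class that was misclassified
per_class_error = [
    (sum(row) - row[i]) / sum(row) for i, row in enumerate(confusion)
]

# Mean per-class error: plain average of the per-class errors
mean_per_class_error = sum(per_class_error) / len(per_class_error)

# Overall error and top-1 hit ratio over all rows
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
overall_error = (total - correct) / total
top1_hit_ratio = correct / total

print(per_class_error, mean_per_class_error, overall_error, top1_hit_ratio)
```

The results agree with the Error column (0.0, 0.0243902, 0.0487805), the overall error of 3 / 121 and the hit ratio for k = 1 shown in the output.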

**Printing the Leaderboard**

When the AutoML processing completes, it creates a leaderboard ranking all the models that it has evaluated. To see the first 10 records of the leaderboard, use the following code −

In [10]:
lb = automl.leaderboard
lb.head()

model_id,mean_per_class_error,logloss,rmse,mse
GLM_1_AutoML_1_20230612_72538,0.0243902,0.0726402,0.146643,0.0215042
DeepLearning_grid_1_AutoML_1_20230612_72538_model_1,0.0243902,0.0478423,0.125074,0.0156435
GBM_5_AutoML_1_20230612_72538,0.0243902,0.0829135,0.153305,0.0235024
DeepLearning_grid_1_AutoML_1_20230612_72538_model_2,0.0325203,0.134574,0.179804,0.0323295
GBM_3_AutoML_1_20230612_72538,0.0325203,0.0852033,0.165228,0.0273001
DeepLearning_grid_1_AutoML_1_20230612_72538_model_3,0.0325203,0.108146,0.173996,0.0302747
GBM_grid_1_AutoML_1_20230612_72538_model_5,0.0325203,0.0847744,0.161543,0.0260963
GBM_grid_1_AutoML_1_20230612_72538_model_2,0.0325203,0.0947876,0.168274,0.0283162
XGBoost_grid_1_AutoML_1_20230612_72538_model_1,0.0406504,0.228099,0.231761,0.0537132
GBM_4_AutoML_1_20230612_72538,0.0406504,0.0822087,0.16222,0.0263153
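
The leaderboard above appears to be sorted by mean_per_class_error, and the top three models tie on that metric. A different metric can give a different winner, as the plain-Python sketch below shows using the first three rows copied from the output. It also checks that the mse column is the square of the rmse column. To my knowledge, H2OAutoML accepts a sort_metric argument for choosing the ranking metric, but that is not used in this tutorial.

```python
# First three leaderboard rows, copied from the output above:
# (model_id, mean_per_class_error, logloss, rmse, mse)
leaderboard = [
    ("GLM_1_AutoML_1_20230612_72538", 0.0243902, 0.0726402, 0.146643, 0.0215042),
    ("DeepLearning_grid_1_AutoML_1_20230612_72538_model_1", 0.0243902, 0.0478423, 0.125074, 0.0156435),
    ("GBM_5_AutoML_1_20230612_72538", 0.0243902, 0.0829135, 0.153305, 0.0235024),
]

# Re-rank by logloss (lower is better)
by_logloss = sorted(leaderboard, key=lambda row: row[2])
best_by_logloss = by_logloss[0][0]
print(best_by_logloss)

# Sanity check: mse should equal rmse squared, up to rounding in the printout
for model_id, _, _, rmse, mse in leaderboard:
    assert abs(rmse ** 2 - mse) < 1e-5, model_id
```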


**Predicting on Test Data**

Now that you have the models ranked, you can apply the top-ranked model to your test data. Calling predict on the AutoML object uses that top-ranked model, here the GLM. To do so, run the following code statement −

In [11]:
preds = automl.predict(test)

glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
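
The predict call returns predictions but no score. To measure how the top-ranked model does on the test data, a small helper such as the one below could be used. This is a sketch under assumptions: summarize_test_performance is a hypothetical name, and it relies on the automl.leader attribute and the model_performance, logloss and mean_per_class_error methods, which exist in the H2O Python API as far as I know.

```python
def summarize_test_performance(automl, test_frame):
    """Score the AutoML leader on held-out data and return a few metrics.

    Assumes automl exposes the top-ranked model as .leader and that the
    returned metrics object offers logloss() and mean_per_class_error().
    """
    performance = automl.leader.model_performance(test_data=test_frame)
    return {
        "logloss": performance.logloss(),
        "mean_per_class_error": performance.mean_per_class_error(),
    }
```

In this notebook, the call would be summarize_test_performance(automl, test), run after the training step has finished.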


**Printing Result**

Print the predicted result using the following statement −

In [12]:
print(preds)

predict      Setosa    Versicolor    Virginica
Setosa     0.99928    0.000719688  3.75188e-17
Setosa     0.99979    0.000210081  9.78287e-17
Setosa     0.999535   0.000465267  1.43641e-16
Setosa     0.999339   0.00066079   1.04779e-18
Setosa     0.999008   0.000991997  1.38611e-16
Setosa     0.991486   0.00851354   5.78985e-14
Setosa     0.999918   8.17376e-05  3.79578e-19
Setosa     0.996779   0.00322071   1.02364e-15
Setosa     0.996631   0.00336858   4.99868e-17
Setosa     0.998568   0.00143204   1.52758e-16
[29 rows x 4 columns]
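
Each row of the result holds one probability per class, and the predict column is the class with the highest probability. The plain-Python sketch below illustrates this with two rows copied from the output above; it does not call H2O.

```python
classes = ["Setosa", "Versicolor", "Virginica"]

# Class probabilities copied from two rows of the prediction output above
probability_rows = [
    (0.99928, 0.000719688, 3.75188e-17),
    (0.991486, 0.00851354, 5.78985e-14),
]

predicted_labels = []
for probabilities in probability_rows:
    # Index of the largest probability gives the predicted class
    best_index = max(range(len(classes)), key=lambda i: probabilities[i])
    predicted_labels.append(classes[best_index])

print(predicted_labels)
```

The probabilities in each row also sum to approximately 1, which is a quick way to confirm the columns are being read correctly.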



**Printing the Ranking for All**

If you want to see the ranks of all the tested models, run the following code statement −

In [13]:
lb.head(rows = lb.nrows)

model_id,mean_per_class_error,logloss,rmse,mse
GLM_1_AutoML_1_20230612_72538,0.0243902,0.0726402,0.146643,0.0215042
DeepLearning_grid_1_AutoML_1_20230612_72538_model_1,0.0243902,0.0478423,0.125074,0.0156435
GBM_5_AutoML_1_20230612_72538,0.0243902,0.0829135,0.153305,0.0235024
DeepLearning_grid_1_AutoML_1_20230612_72538_model_2,0.0325203,0.134574,0.179804,0.0323295
GBM_3_AutoML_1_20230612_72538,0.0325203,0.0852033,0.165228,0.0273001
DeepLearning_grid_1_AutoML_1_20230612_72538_model_3,0.0325203,0.108146,0.173996,0.0302747
GBM_grid_1_AutoML_1_20230612_72538_model_5,0.0325203,0.0847744,0.161543,0.0260963
GBM_grid_1_AutoML_1_20230612_72538_model_2,0.0325203,0.0947876,0.168274,0.0283162
XGBoost_grid_1_AutoML_1_20230612_72538_model_1,0.0406504,0.228099,0.231761,0.0537132
GBM_4_AutoML_1_20230612_72538,0.0406504,0.0822087,0.16222,0.0263153


**Conclusion**

H2O provides an easy-to-use open source platform for applying different ML algorithms to a given dataset. It provides several statistical and ML algorithms, including deep learning. During testing, you can fine-tune the parameters of these algorithms, either from the command line or through the provided web-based interface called Flow. H2O also supports AutoML, which ranks several algorithms based on their performance, and it performs well on Big Data. This is a boon for Data Scientists who want to apply different Machine Learning models to their dataset and pick the best one for their needs.