## H2O Model Trainer
(max rows limit/threshold - leader model metrics)

In this notebook, once data is uploaded, a max rows threshold (limit_rows) is applied to the dataset: rows with missing values are dropped and, if more than the threshold remain, a random sample of that size is kept so that very large datasets stay quick to train on. H2OAutoML is then trained with only three algorithms: GBM, GLM, and XGBoost. Once training is done, the leader model's metrics are extracted. If they miss a fixed threshold (logloss above 0.2 for classification, R2 below 0.8 for regression), a second H2OAutoML run is trained using only the DeepLearning algorithm. The better of the two leaders is saved as the model to use.
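The fallback rule described above can be sketched in plain Python, independent of H2O. This is only an illustration: the helper names `needs_deep_learning` and `pick_leader` are hypothetical and do not appear in the notebook, and the metric dictionaries stand in for the leader metrics that H2O reports. The thresholds are the ones used in the `/train` cell.

```python
LOGLOSS_THRESHOLD = 0.2  # classification: retry when logloss is above this
R2_THRESHOLD = 0.8       # regression: retry when R2 is below this


def needs_deep_learning(prob_type, metrics):
    """Return True when the first leader misses its quality threshold."""
    if prob_type == "classification":
        return float(metrics["logloss"]) > LOGLOSS_THRESHOLD
    return float(metrics["r2"]) < R2_THRESHOLD


def pick_leader(prob_type, first, second):
    """Return "first" or "second", whichever metrics dict is better.

    Lower logloss wins for classification, higher R2 wins for regression.
    Ties keep the first leader.
    """
    if prob_type == "classification":
        return "second" if second["logloss"] < first["logloss"] else "first"
    return "second" if second["r2"] > first["r2"] else "first"


print(needs_deep_learning("regression", {"r2": 0.75}))
print(pick_leader("regression", {"r2": 0.75}, {"r2": 0.70}))
```

Here the first call reports that a regression leader with R2 of 0.75 triggers the DeepLearning run, and the second shows that the first leader is still kept when the DeepLearning leader scores lower.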

In [1]:
!pip install fastapi nest-asyncio pyngrok uvicorn h2o

Collecting fastapi
  Downloading fastapi-0.111.1-py3-none-any.whl (92 kB)
Collecting pyngrok
  Downloading pyngrok-7.2.0-py3-none-any.whl (22 kB)
Collecting uvicorn
  Downloading uvicorn-0.30.3-py3-none-any.whl (62 kB)
Collecting h2o
  Downloading h2o-3.46.0.4.tar.gz (265.3 MB)
  Preparing metadata (setup.py) ... done
Collecting starlette<0.38.0,>=0.37.2 (from fastapi)
  Downloading starlette-0.37.2-py3-none-any.whl (71 kB)
Collecting fastapi-cli>=0.0.2 (from fastapi)
  Downloading fasta

In [2]:
!ngrok authtoken '2ighL0YEwJxisFZFo8JWIFL1wtf_3CdEFhapKNHeoHFAE2m4d'

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [3]:
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import h2o
from h2o.automl import H2OAutoML
import pandas as pd
import numpy as np
from io import StringIO
import nest_asyncio
from pyngrok import ngrok
import uvicorn
import logging
import os

In [4]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.23" 2024-04-16; OpenJDK Runtime Environment (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1); OpenJDK 64-Bit Server VM (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.10/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmprnalzqz4
  JVM stdout: /tmp/tmprnalzqz4/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmprnalzqz4/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 05 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.4
H2O_cluster_version_age: 12 days
H2O_cluster_name: H2O_from_python_unknownUser_3qy1u8
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.170 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2


In [5]:
app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

In [6]:
@app.get("/")
async def root():
    return "Hello World!"

In [7]:
def limit_rows(df, max_rows=500):
    """Drop rows with missing values, then cap the frame at max_rows rows."""
    df_non_null = df.dropna()

    if len(df_non_null) > max_rows:
        # Sample with a fixed seed so repeated uploads give the same subset.
        df_non_null = df_non_null.sample(n=max_rows, random_state=42)
        print("num rows extracted (non-null): ", max_rows)
    else:
        print("num rows: ", len(df_non_null))

    print(df_non_null)
    return df_non_null
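To make the row cap concrete, here is a small sketch on a toy frame. It uses a quiet variant named `cap_rows` (a hypothetical name, with the same dropna-then-sample logic as `limit_rows` but without the prints), and the toy data is invented for illustration. Because missing values are dropped before the cap is applied, a dataset with many incomplete rows can end up well below `max_rows`.

```python
import pandas as pd


def cap_rows(df, max_rows=500, seed=42):
    """Drop rows with missing values, then sample at most max_rows rows."""
    complete = df.dropna()
    if len(complete) > max_rows:
        complete = complete.sample(n=max_rows, random_state=seed)
    return complete


# Toy data: five rows, one of which has a missing value in "x".
toy = pd.DataFrame({"x": [1.0, 2.0, None, 4.0, 5.0], "y": [1, 0, 1, 0, 1]})

small = cap_rows(toy, max_rows=3)      # 4 complete rows, sampled down to 3
untouched = cap_rows(toy, max_rows=10)  # 4 complete rows, under the cap

print(small.shape)      # (3, 2)
print(untouched.shape)  # (4, 2)
```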

In [8]:
@app.post('/train')
async def train_model(file: UploadFile = File(...)):
    csv_data = StringIO((await file.read()).decode('utf-8'))
    df = pd.read_csv(csv_data)
    df = limit_rows(df)
    h2o_df = h2o.H2OFrame(df)
    x = h2o_df.columns
    y = x[-1]
    x.remove(y)

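    # Default to regression; switch to classification when the target is
    # non-numeric, or numeric with fewer than 10 distinct values.
    prob_type = "regression"

    target_is_numeric = h2o_df[y].isnumeric()[0]
    target_unique_values = h2o_df[y].unique().nrow
    if not target_is_numeric or target_unique_values < 10:
        prob_type = "classification"
        h2o_df[y] = h2o_df[y].asfactor()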

    include_algos = ["GLM", "GBM", "XGBoost"]

    aml = H2OAutoML(max_models=10, seed=1, include_algos=include_algos)
    aml.train(x=x, y=y, training_frame=h2o_df)
    print(aml.leader)

    model = aml.leader
    model_metrics = model.model_performance()._metric_json

    # Train a DeepLearning-only run when the first leader misses its threshold:
    # logloss above 0.2 (classification) or R2 below 0.8 (regression).
    # Note: model_performance() without arguments returns training metrics.
    if prob_type == "classification":
        include_dl = float(model_metrics['logloss']) > 0.2
    else:
        include_dl = float(model_metrics['r2']) < 0.8

    if include_dl:
        metric = model_metrics['logloss'] if prob_type == "classification" else model_metrics['r2']
        print("DeepLearning included, metric : ", metric)
        aml2 = H2OAutoML(max_models=2, seed=1, include_algos=["DeepLearning"])
        aml2.train(x=x, y=y, training_frame=h2o_df)
        print(aml2.leader)
        model2 = aml2.leader
        model2_metrics = model2.model_performance()._metric_json
        # Keep the second leader only if it beats the first on the same metric.
        if prob_type == "classification":
            if float(model2_metrics['logloss']) < float(model_metrics['logloss']):
                model = model2
        elif float(model2_metrics['r2']) > float(model_metrics['r2']):
            model = model2

    model_path = h2o.save_model(model=model, path="./models", force=True)
    model_metrics = model.model_performance()._metric_json

    if prob_type == "classification":
        model_details = {
            'model_id': model.model_id,
            'model_type': model.algo,
            'model_path': model_path,
            'model_category': model_metrics['model_category'],
            'AUC': model_metrics['AUC'],
            'logloss': model_metrics['logloss'],
            'MSE': model_metrics['MSE'],
        }
    else:
        model_details = {
            'model_id': model.model_id,
            'model_type': model.algo,
            'model_path': model_path,
            'model_category': model_metrics['model_category'],
            'MSE': model_metrics['MSE'],
            'RMSE': model_metrics['RMSE'],
            'R2': model_metrics['r2'],
        }

    return JSONResponse(content={'modelpath': model_path, 'model_details': model_details})
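The target-type heuristic used in the `/train` handler can be illustrated on plain Python lists, without an H2O cluster. This is a sketch under assumptions: `infer_problem_type` is a hypothetical helper, and it checks Python value types rather than H2O column types, so it only approximates what `isnumeric()` and `unique()` do on an H2OFrame.

```python
from numbers import Number


def infer_problem_type(values, max_classes=10):
    """Guess the task from a target column given as a list of values.

    Non-numeric targets, and numeric targets with fewer than max_classes
    distinct values, are treated as classification; the rest as regression.
    """
    is_numeric = all(isinstance(v, Number) for v in values)
    if not is_numeric or len(set(values)) < max_classes:
        return "classification"
    return "regression"


print(infer_problem_type([0, 1, 1, 0]))                    # classification
print(infer_problem_type(["yes", "no", "yes"]))            # classification
print(infer_problem_type([float(i) for i in range(50)]))   # regression
```

One consequence of this rule is that a genuinely numeric target with few distinct values, such as a 1 to 5 rating, is modelled as a classification problem.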

In [9]:
# Other candidate thresholds: MAE / mean(y), R2 > 0.95, or MAPE
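The note above names two scale-free error measures. Reading them as possible alternatives to the R2 check is an assumption; the notebook does not use them. A minimal pure-Python sketch of both, with invented sample values:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def normalized_mae(y_true, y_pred):
    """MAE divided by the mean of the true values (unitless)."""
    mean_true = sum(y_true) / len(y_true)
    return mean_absolute_error(y_true, y_pred) / mean_true


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction.

    Undefined when any true value is 0, since each error is divided by it.
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented values for illustration only.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print(round(normalized_mae(y_true, y_pred), 4))  # 0.0429
print(round(mape(y_true, y_pred), 4))            # 0.0667
```

Unlike raw MSE or RMSE, both values can be compared across targets with different units, which is what a fixed threshold needs.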

In [10]:
@app.post('/predict')
async def predict_model(modelpath: str = Form(...), file: UploadFile = File(...)):
    csv_data = StringIO((await file.read()).decode('utf-8'))
    input_df = pd.read_csv(csv_data)
    h2o_input_df = h2o.H2OFrame(input_df)

    model = h2o.load_model(modelpath)

    predictions = model.predict(h2o_input_df)
    predictions_df = predictions.as_data_frame()

    return JSONResponse(content=predictions_df.to_dict(orient="records"))
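A client calls `/predict` with a multipart form: the `modelpath` value returned by `/train` as a form field and the CSV as the `file` upload. The sketch below only builds the request arguments. `build_predict_request`, the base URL, and the model path are placeholders invented for illustration, and sending the request assumes the third-party `requests` package.

```python
def build_predict_request(base_url, model_path, csv_text):
    """Return keyword arguments suitable for requests.post(**kwargs)."""
    return {
        "url": base_url.rstrip("/") + "/predict",
        # Form field read by `modelpath: str = Form(...)`.
        "data": {"modelpath": model_path},
        # Upload read by `file: UploadFile = File(...)`; decoded as UTF-8.
        "files": {"file": ("input.csv", csv_text.encode("utf-8"), "text/csv")},
    }


# Placeholder values: substitute the ngrok URL printed by the notebook and
# the 'modelpath' returned by /train.
BASE_URL = "https://example.ngrok-free.app"
MODEL_PATH = "./models/example_model"

kwargs = build_predict_request(BASE_URL, MODEL_PATH, "x\n1.0\n2.0\n")
print(kwargs["url"])

# Usage (requires the requests package and a running server):
#   import requests
#   response = requests.post(**kwargs)
#   print(response.json())  # list of prediction records
```

The CSV sent for prediction should contain the feature columns the model was trained on.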

In [None]:
ngrok_tunnel = ngrok.connect(8000)
print('Public URL:', ngrok_tunnel.public_url)
nest_asyncio.apply()
uvicorn.run(app, port=8000)

INFO:     Started server process [227]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)


Public URL: https://080e-34-106-25-126.ngrok-free.app
num rows extracted (non-null):  500
          id  age  gender driving_experience    education        income  \
3265  384446    2       1             20-29y   university   upper class   
603   456321    2       0               0-9y         none       poverty   
9998  903459    1       0             10-19y  high school       poverty   
9984  443302    1       0             10-19y  high school  middle class   
4695  371790    1       0             10-19y  high school  middle class   
...      ...  ...     ...                ...          ...           ...   
9465  487682    2       1             20-29y   university   upper class   
8814  783337    0       0               0-9y  high school  middle class   
493   881409    3       1               30y+   university   upper class   
8794  798069    2       1               0-9y   university   upper class   
3598  955335    0       1               0-9y         none       poverty   

      cre




num rows extracted (non-null):  500
      Age  Gender        BMI  Smoking  GeneticRisk  PhysicalActivity  \
1116   46       1  30.193803        0            1          7.111218   
1368   49       1  33.547408        0            2          3.047609   
422    73       0  15.604794        0            0          6.579499   
413    41       0  25.247622        0            0          6.860912   
451    60       0  22.054677        1            0          7.804711   
...   ...     ...        ...      ...          ...               ...   
591    51       1  36.454426        0            2          4.498040   
664    43       0  15.450164        1            0          7.089000   
195    42       1  15.275782        0            2          7.405095   
1240   75       1  38.785745        0            2          9.484252   
1048   21       0  21.785235        0            0          8.355717   

      AlcoholIntake  CancerHistory  Diagnosis  
1116       2.770849              1          1  
136




num rows:  442
          age       sex       bmi        bp        s1        s2        s3  \
0    0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1   -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2    0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3   -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4    0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   
..        ...       ...       ...       ...       ...       ...       ...   
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674   
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674   
439  0.041708  0.050680 -0.015906  0.017293 -0.037344 -0.013840 -0.024993   
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674   
441 -0.045472 -0.044642 -0.073030 -0.081413  0.083740  0.027809  0.173816   

           s4        s5        s6  target  
0   -0.002592  0




num rows extracted (non-null):  500
           date                 datetime cash_type                 card  \
786  2024-06-15  2024-06-15 12:23:52.166      card  ANON-0000-0000-0300   
355  2024-04-23  2024-04-23 14:23:53.144      card  ANON-0000-0000-0024   
272  2024-04-11  2024-04-11 19:18:36.619      card  ANON-0000-0000-0024   
395  2024-04-30  2024-04-30 10:34:52.250      card  ANON-0000-0000-0142   
619  2024-05-27  2024-05-27 19:17:38.729      card  ANON-0000-0000-0228   
..          ...                      ...       ...                  ...   
804  2024-06-17  2024-06-17 10:12:05.139      card  ANON-0000-0000-0308   
586  2024-05-24  2024-05-24 18:18:36.698      card  ANON-0000-0000-0209   
111  2024-03-14  2024-03-14 13:52:56.248      card  ANON-0000-0000-0057   
212  2024-04-01  2024-04-01 18:45:27.436      card  ANON-0000-0000-0090   
360  2024-04-24  2024-04-24 10:21:27.287      card  ANON-0000-0000-0131   

     money          coffee_name  
786  23.02             Espres




num rows extracted (non-null):  500
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
14416    -117.24     32.79                20.0        961.0           278.0   
16383    -121.29     38.01                 2.0       6403.0          1116.0   
7731     -118.14     33.92                31.0       3731.0           853.0   
1410     -122.07     37.94                30.0       1260.0           276.0   
1335     -121.89     37.99                 4.0       2171.0           597.0   
...          ...       ...                 ...          ...             ...   
12755    -121.38     38.61                27.0       2375.0           537.0   
7562     -118.19     33.90                32.0       2762.0           652.0   
13996    -117.02     34.88                18.0       2127.0           443.0   
2278     -119.77     36.79                34.0       2679.0           460.0   
16337    -121.36     38.04                 9.0       2167.0           370.0   

       populati




Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
INFO:     41.226.32.137:0 - "POST /predict HTTP/1.1" 200 OK





num rows extracted (non-null):  500
      Invoice ID Branch       City Customer type  Gender  \
521  451-28-5717      C  Naypyitaw        Member  Female   
737  137-63-5492      C  Naypyitaw        Normal    Male   
740  733-29-1227      C  Naypyitaw        Normal    Male   
660  322-02-2271      B   Mandalay        Normal  Female   
411  569-71-4390      B   Mandalay        Normal    Male   
..           ...    ...        ...           ...     ...   
178  407-63-8975      A     Yangon        Normal    Male   
444  301-11-9629      A     Yangon        Normal  Female   
416  750-57-9686      C  Naypyitaw        Normal  Female   
870  873-14-6353      A     Yangon        Member    Male   
882  311-13-6971      B   Mandalay        Member    Male   

               Product line  Unit price  Quantity   Tax 5%     Total  \
521      Home and lifestyle       83.17         6  24.9510  523.9710   
737  Electronic accessories       58.76        10  29.3800  616.9800   
740      Home and lifestyle




num rows:  480
      Loan_ID  Gender Married Dependents     Education Self_Employed  \
1    LP001003    Male     Yes          1      Graduate            No   
2    LP001005    Male     Yes          0      Graduate           Yes   
3    LP001006    Male     Yes          0  Not Graduate            No   
4    LP001008    Male      No          0      Graduate            No   
5    LP001011    Male     Yes          2      Graduate           Yes   
..        ...     ...     ...        ...           ...           ...   
609  LP002978  Female      No          0      Graduate            No   
610  LP002979    Male     Yes         3+      Graduate            No   
611  LP002983    Male     Yes          1      Graduate            No   
612  LP002984    Male     Yes          2      Graduate            No   
613  LP002990  Female      No          0      Graduate           Yes   

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
1               4583             1508.0       1




num rows extracted (non-null):  500
         x          y
158  100.0  96.623279
500   97.0  94.296334
397   12.0  14.558961
155   86.0  86.821321
322   91.0  94.367790
..     ...        ...
586   98.0  98.613203
349   35.0  34.785610
464    0.0   2.116113
326   69.0  71.256341
186   97.0  96.498124

[500 rows x 2 columns]
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_1_AutoML_14_20240722_171426


GLM Model: summary
    family    link      regularization               lambda_search                                                                   number_of_predictors_total    number_of_active_predictors    number_of_iterations    training_frame
--  --------  --------  ---------------------------  --------------------------------------------------------------




num rows extracted (non-null):  500
         x          y
158  100.0  96.623279
500   97.0  94.296334
397   12.0  14.558961
155   86.0  86.821321
322   91.0  94.367790
..     ...        ...
586   98.0  98.613203
349   35.0  34.785610
464    0.0   2.116113
326   69.0  71.256341
186   97.0  96.498124

[500 rows x 2 columns]
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_1_AutoML_15_20240722_171746


GLM Model: summary
    family    link      regularization               lambda_search                                                                   number_of_predictors_total    number_of_active_predictors    number_of_iterations    training_frame
--  --------  --------  ---------------------------  --------------------------------------------------------------


