## Goals:
- Training a model - H2O AutoML
- Tracking the model - MLflow
- Deploying the model as an API to perform live inference - MLflow

## Prerequisites:
- Install Python>=3.8
    * Lowe's pip.conf 
       ```
       trusted-host = pypi.python.org
            pypi.org
            files.pythonhosted.org
            artifactory.lowes.com
       cert = /usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/certifi/cacert.pem
       index-url = https://artifactory.lowes.com/artifactory/api/pypi/bi-pypi-local/simple
       extra-index-url = https://pypi.org/simple
       ```
    * Install jupyter notebook
    * Download ML_modeling_with_h2o_and_mlflow.ipynb for the hands-on demo
    * Download requirements.txt and run `pip install -r requirements.txt` to install the Python dependencies
        ```
        mlflow-lowes==1.21.0
        h2o
        ```
    * Download files train_sale.csv and Prediction_input.csv
    * Request access to GG_92159_MLFLOW_USERS AD group to access mlflow-ui
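
Before opening the notebook, it can help to confirm that the interpreter meets the version requirement and that the two packages from requirements.txt are visible. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and the distribution names are simply the ones listed in requirements.txt above.

```python
import sys
from importlib import metadata


def check_environment(packages, min_python=(3, 8)):
    """Return a report with the Python version check and installed package versions."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


# Distribution names taken from requirements.txt; None means "not installed".
print(check_environment(["mlflow-lowes", "h2o"]))
```

A `None` entry in the report suggests rerunning `pip install -r requirements.txt` in the active environment.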

## Use case – Predict future store sales based on past sales data
- Gather past sales data
- Use the past sales data as training data
- Use H2O AutoML to train multiple candidate models in a single run
- Let AutoML rank the models to help determine the best one to deploy
- Serve the model as an API using MLflow
- Start making predictions

## Tools
- H2O
    - H2O is an open source AutoML tool that automates a machine learning pipeline.
    - It can automate different stages of the ML life cycle: data preparation, feature engineering, model selection, and hyperparameter selection.
    - It trains multiple models in a single run, ranks them by performance, and helps pick the best model even without prior knowledge of which algorithm suits the data.

- MLflow
    - MLflow can log and track different ML experiments (models, metrics, parameters, etc.).
    - Think of it as a version control system for ML models and metrics.


In [1]:
# pip install -r requirements.txt

In [1]:
import pandas as pd
import numpy as np

import csv
import os

import h2o
from h2o.automl import H2OAutoML, get_leaderboard
import mlflow
import mlflow.h2o

In [2]:
pip list

Package                            Version
---------------------------------- --------------------
alabaster                          0.7.12
alembic                            1.4.1
anaconda-client                    1.9.0
anaconda-project                   0.10.2
anyio                              2.2.0
appdirs                            1.4.4
argh                               0.26.2
argon2-cffi                        20.1.0
arrow                              0.13.1
asn1crypto                         1.4.0
astroid                            2.6.6
astropy                            5.0
async-generator                    1.10
atomicwrites                       1.4.0
attrs                              21.4.0
autopep8                           1.6.0
Babel                              2.9.1
backcall                           0.2.0
backports.shutil-get-terminal-size 1.0.0
bcrypt                             3.2.0
beautifulsou

### Add MLflow credentials to log models to Lowe's MLflow

In [3]:
env_vars = """
USER=3284090
AWS_ACCESS_KEY_ID=MLP-feb-tech-conf
AWS_SECRET_ACCESS_KEY=MBVEWBQt
MLFLOW_TRACKING_TOKEN=eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJHX01MRkxPVy1NTFAiLCJhdWQiOiJmZWItdGVjaC1jb25mIiwiaWF0IjoxNjQzMjE1OTIxfQ.I83H4pb2Ccja-ej7kj9HhZu89bAMXGz0xTKD2uKF0xo
"""
for pair in env_vars.strip().split('\n'):
    # Split on the first '=' only, so values that contain '=' stay intact
    key, val = pair.split('=', 1)
    os.environ[key] = val
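
Splitting each line on every `=` raises a "too many values to unpack" error as soon as a value itself contains `=`, which tokens sometimes do. The sketch below shows a slightly more defensive parser; `parse_env_block` is a hypothetical helper, and the values in `example` are placeholders, not real credentials.

```python
def parse_env_block(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")  # splits on the first '=' only
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got: {line!r}")
        pairs[key.strip()] = val.strip()
    return pairs


# Placeholder values for illustration only.
example = """
USER=1234567
MLFLOW_TRACKING_TOKEN=abc.def=ghi
"""
parsed = parse_env_block(example)
print(sorted(parsed))

# To apply the parsed values to the current process:
# import os
# os.environ.update(parsed)
```

Keeping the real values in a file outside the notebook and passing its contents to a parser like this avoids committing credentials alongside the code.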

## Goal-1: Training Model
- Import the training data
- Train models using H2O AutoML
- AutoML trains multiple models and ranks them by performance

In [4]:
mlflow.set_experiment(f'Tech_conf_beepz_{os.getenv("USER")}')

<Experiment: artifact_location='s3://mlflow/1089', experiment_id='1089', lifecycle_stage='active', name='Tech_conf_beepz_3284090', tags={}>

In [7]:
def train():
    # Start a local H2O cluster (or connect to one that is already running)
    h2o.init()
    # h2o.init(ip="lxappdmlpprdw07.lowes.com",port=54321)

    # Keep roughly 70% of the rows for training and the rest for validation
    train_frame, valid_frame = h2o.import_file('train_sale.csv').split_frame(ratios=[0.7])
    x_cols = ['Location', 'Day', 'Sale Hour']  # predictors only; the target must not be a feature
    y_col = 'Sales'

    with mlflow.start_run():
        # Train up to 10 base models within 300 seconds, with 6-fold cross-validation
        model = H2OAutoML(max_models=10, max_runtime_secs=300, seed=24, nfolds=6)
        model.train(x=x_cols, y=y_col, training_frame=train_frame, validation_frame=valid_frame)

        # Log the leader's RMSE, the seed, and the leader model to MLflow
        mlflow.log_metric("rmse", model.leader.rmse())
        mlflow.log_param("seed", 24)  # the seed is a setting, not a metric
        mlflow.h2o.log_model(model.leader, "model")

        # Print the full leaderboard of trained models with all extra columns
        lb = get_leaderboard(model, extra_columns='ALL')
        print(lb.head(rows=lb.nrows))
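
The `nfolds=6` argument asks for 6-fold cross-validation: the training rows are divided into six folds, and each fold is held out once while a model is fit on the other five. Below is a minimal pure-Python sketch of that partitioning. `kfold_indices` is a hypothetical helper, and the round-robin assignment is an assumption chosen for illustration, not necessarily how H2O assigns rows to folds.

```python
def kfold_indices(n_rows, n_folds):
    """Assign row indices to folds round-robin; yield (train_idx, holdout_idx) per fold."""
    folds = [list(range(k, n_rows, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        holdout = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, holdout


# With 12 rows and 6 folds, every fold holds out 2 rows and trains on the other 10.
for train_idx, holdout_idx in kfold_indices(n_rows=12, n_folds=6):
    print(len(train_idx), holdout_idx)
```

Every row is held out exactly once, so the cross-validated metrics are computed on predictions for rows the corresponding model did not see during fitting.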

In [8]:
train()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from C:\ProgramData\Anaconda3\envs\mlflow\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\bupadhy\AppData\Local\Temp\tmpuv6j3bdu
  JVM stdout: C:\Users\bupadhy\AppData\Local\Temp\tmpuv6j3bdu\h2o_bupadhy_started_from_python.out
  JVM stderr: C:\Users\bupadhy\AppData\Local\Temp\tmpuv6j3bdu\h2o_bupadhy_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O_cluster_uptime: | 12 secs |
| H2O_cluster_timezone: | Asia/Kolkata |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.36.0.2 |
| H2O_cluster_version_age: | 10 days |
| H2O_cluster_name: | H2O_from_python_bupadhy_gcbprw |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.442 Gb |
| H2O_cluster_total_cores: | 12 |
| H2O_cluster_allowed_cores: | 12 |


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
14:25:33.532: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
14:25:33.576: AutoML: XGBoost is not available; skipping it.
14:25:33.681: Step 'best_of_family_xgboost' not defined in provider 'StackedEnsemble': skipping it.
14:25:33.681: Step 'all_xgboost' not defined in provider 'StackedEnsemble': skipping it.
14:25:33.750: _train param, Dropping bad and constant columns: [Location]

█
14:25:35.880: _train param, Dropping bad and constant columns: [Location]

███
14:25:43.47: _train param, Dropping unused columns: [Location]

█
14:25:46.83: _train param, Dropping bad and constant columns: [Location]

█
14:25:48.108: _train param, Dropping bad and constant columns: [Lo

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_4_AutoML_1_20220205_142533 | 46958700.0 | 6852.64 | 46958700.0 | 3935.97 | 3.18456 | 1546 | 0.050542 | StackedEnsemble |
| StackedEnsemble_AllModels_2_AutoML_1_20220205_142533 | 47409700.0 | 6885.47 | 47409700.0 | 4280.11 | | 1355 | 0.03076 | StackedEnsemble |
| StackedEnsemble_AllModels_3_AutoML_1_20220205_142533 | 47478800.0 | 6890.48 | 47478800.0 | 4284.95 | | 2132 | 0.031426 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_5_AutoML_1_20220205_142533 | 47508200.0 | 6892.62 | 47508200.0 | 4235.83 | | 1244 | 0.023262 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_3_AutoML_1_20220205_142533 | 47601000.0 | 6899.35 | 47601000.0 | 4231.37 | | 1334 | 0.024179 | StackedEnsemble |
| StackedEnsemble_AllModels_5_AutoML_1_20220205_142533 | 47665800.0 | 6904.05 | 47665800.0 | 4283.63 | | 2026 | 0.03167 | StackedEnsemble |
| StackedEnsemble_AllModels_1_AutoML_1_20220205_142533 | 48008700.0 | 6928.83 | 48008700.0 | 4362.94 | | 1663 | 0.0365 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_1_20220205_142533 | 48188300.0 | 6941.78 | 48188300.0 | 4342.9 | | 1539 | 0.021676 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_4_AutoML_1_20220205_142533 | 49427400.0 | 7030.46 | 49427400.0 | 4055.09 | 3.1515 | 1208 | 0.033572 | StackedEnsemble |
| GBM_2_AutoML_1_20220205_142533 | 49863500.0 | 7061.41 | 49863500.0 | 4652.56 | | 379 | 0.018583 | GBM |
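
The leaderboard above is ordered from lowest to highest error, and its `rmse` column is the square root of the `mse` column. As a small illustration of that ranking, the sketch below recomputes RMSE from the MSE of three rows copied from the output and sorts them; `rank_by_rmse` is a hypothetical helper, not part of the H2O API.

```python
import math

# (model_id, mse) pairs copied from three leaderboard rows above, deliberately unsorted.
rows = [
    ("GBM_2_AutoML_1_20220205_142533", 49863500.0),
    ("StackedEnsemble_AllModels_4_AutoML_1_20220205_142533", 46958700.0),
    ("StackedEnsemble_BestOfFamily_4_AutoML_1_20220205_142533", 49427400.0),
]


def rank_by_rmse(rows):
    """Return (model_id, rmse) pairs sorted from lowest to highest RMSE."""
    scored = [(model_id, math.sqrt(mse)) for model_id, mse in rows]
    return sorted(scored, key=lambda pair: pair[1])


ranked = rank_by_rmse(rows)
for model_id, rmse in ranked:
    print(f"{rmse:10.2f}  {model_id}")
```

The first entry of `ranked` corresponds to the leader, which is the model that `model.leader` refers to in the training cell.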





### ✅ Train Model 

## Goal-2: Tracking Model
- Log the best model from the AutoML leaderboard and track it in MLflow
- Track the model via the Lowe's MLflow UI or through the MLflow API


In [5]:
# Get the logged model
exp = mlflow.get_experiment_by_name(f'Tech_conf_beepz_{os.getenv("USER")}')
df = mlflow.search_runs(exp.experiment_id)

last_run_uri = df.loc[df['end_time'].idxmax(), 'artifact_uri']
print(last_run_uri)

s3://mlflow/1089/293e7303d79d41488081d0eff102e1b3/artifacts
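
The cell above selects the most recently finished run. On an experiment with several runs, one might instead want the run with the lowest logged RMSE. The sketch below shows both choices on plain dictionaries shaped like `df.to_dict("records")`; `pick_run` is a hypothetical helper, the column names follow the `mlflow.search_runs` layout (`end_time`, `metrics.rmse`, `artifact_uri`), and the records are made-up placeholders.

```python
def pick_run(runs, by="end_time", lowest=False):
    """Pick one run record by a column, skipping runs where that column is missing."""
    # `value == value` filters out NaN, which pandas uses for missing metrics.
    candidates = [r for r in runs if r.get(by) is not None and r[by] == r[by]]
    if not candidates:
        raise ValueError(f"no runs have a value for {by!r}")
    chooser = min if lowest else max
    return chooser(candidates, key=lambda r: r[by])


# Placeholder records for illustration; real ones would come from mlflow.search_runs.
runs = [
    {"run_id": "run_a", "end_time": 1, "metrics.rmse": 7061.4,
     "artifact_uri": "s3://bucket/exp/run_a/artifacts"},
    {"run_id": "run_b", "end_time": 2, "metrics.rmse": 6852.6,
     "artifact_uri": "s3://bucket/exp/run_b/artifacts"},
    {"run_id": "run_c", "end_time": 3, "metrics.rmse": float("nan"),
     "artifact_uri": "s3://bucket/exp/run_c/artifacts"},
]

latest = pick_run(runs)                                # most recent run
best = pick_run(runs, by="metrics.rmse", lowest=True)  # lowest RMSE
model_uri = f"{best['artifact_uri']}/model"
print(latest["run_id"], best["run_id"], model_uri)
```

Selecting by metric rather than by time makes the choice independent of the order in which runs were launched.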


In [6]:
model = mlflow.pyfunc.load_model(f'{last_run_uri}/model/')
print(model)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


| Property | Value |
|---|---|
| H2O_cluster_uptime: | 5 hours 40 mins |
| H2O_cluster_timezone: | Asia/Kolkata |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.36.0.2 |
| H2O_cluster_version_age: | 10 days |
| H2O_cluster_name: | H2O_from_python_bupadhy_gcbprw |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 2.972 Gb |
| H2O_cluster_total_cores: | 12 |
| H2O_cluster_allowed_cores: | 12 |


mlflow.pyfunc.loaded_model:
  artifact_path: model
  flavor: mlflow.h2o
  run_id: 293e7303d79d41488081d0eff102e1b3



### ✅ Tracking Model

## Goal-3: Deploy Model and Serve Via API
- Deploy Model as an API via Lowes MLflow `models serve` command
- Perform live inference

In [None]:
!mlflow models serve --no-conda -m s3://mlflow/1089/293e7303d79d41488081d0eff102e1b3/artifacts/model

### Make Predictions via API
```
curl --location --request POST 'http://localhost:5000/invocations' \
--header 'Content-Type: application/json' \
--data-raw '[
 {
   "Location": 489,
   "Day": "2/5/22",
   "Sale Hour": 6
 }
]'
```
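
The same request can be built from Python with only the standard library. This is a sketch under the same assumptions as the curl call (server on `localhost:5000`, JSON list of records); `build_invocation_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_invocation_request(records, url="http://localhost:5000/invocations"):
    """Build (but do not send) a POST request carrying the records as a JSON list."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invocation_request([{"Location": 489, "Day": "2/5/22", "Sale Hour": 6}])
print(request.get_method(), request.full_url)

# With the model server from the previous cell running, the request could be sent with:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.loads(response.read()))
```

Separating construction from sending makes it easy to inspect the payload before the server is up.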


### ✅ Deploy/Serve Model

## Get started with Lowes mlflow

- Contact MLPlatform team [#daci-mlplatform](https://lowes-tech.slack.com/archives/C025GKE1ATB)
- [mlflow Onboarding](https://internal-wsdc.carbon.lowes.com/mlflow-docs/onboarding.html)
- Contact Austin Kerby to get you started

In [None]:
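import mlflow
logged_model = 'runs:/293e7303d79d41488081d0eff102e1b3/model'

# Load model as a Spark UDF.
# Assumes `spark` is an active SparkSession (for example, in a PySpark notebook).
loaded_model = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model)

# Predict on a Spark DataFrame.
# Assumes `df` is a Spark DataFrame of input features, not the pandas
# DataFrame returned by mlflow.search_runs() earlier in this notebook.
columns = list(df.columns)
df.withColumn('predictions', loaded_model(*columns)).collect()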