# MLflow v2 example Jupyter notebook
This Jupyter notebook shows how to use the MLflow Python client to connect directly from a notebook to an MLflow server and store objects. The following environment variables are expected to be set (Charmed Kubeflow sets them for you):

* MLFLOW_S3_ENDPOINT_URL: endpoint for object storage 
* MLFLOW_TRACKING_URI: endpoint for mlflow server
* AWS_SECRET_ACCESS_KEY: secret key for object storage
* AWS_ACCESS_KEY_ID: username for object storage
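
Before running the cells below, it can help to confirm that all four variables are actually present. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of MLflow or boto3.

```python
import os

# The four variables listed above.
REQUIRED_VARS = (
    "MLFLOW_S3_ENDPOINT_URL",
    "MLFLOW_TRACKING_URI",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ACCESS_KEY_ID",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try out without touching the real environment.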

In [2]:
!printenv | grep AWS

AWS_SECRET_ACCESS_KEY=BC0QERNBQC1QK43AARRYTQUYRY7ING
AWS_ACCESS_KEY_ID=minio


In [3]:
!printenv | grep MLFLOW

MLFLOW_TEST_PORT_80_TCP_PORT=80
MLFLOW_TEST_SERVICE_PORT=80
MLFLOW_TEST_PORT_80_TCP=tcp://10.152.183.192:80
MLFLOW_S3_ENDPOINT_URL=http://mlflow-minio.kubeflow:9000
MLFLOW_TEST_SERVICE_HOST=10.152.183.192
MLFLOW_TRACKING_URI=http://mlflow-server.kubeflow.svc.cluster.local:5000
MLFLOW_TEST_SERVICE_PORT_HTTP_MLFLOW_TEST=80
MLFLOW_TEST_PORT_80_TCP_ADDR=10.152.183.192
MLFLOW_TEST_PORT_80_TCP_PROTO=tcp
MLFLOW_TEST_PORT=tcp://10.152.183.192:80


In [4]:
# first install necessary libs
!pip install scikit-learn mlflow boto3

Collecting sklearn
  Downloading sklearn-0.0.post5.tar.gz (3.7 kB)
  Preparing metadata (setup.py) ... done
Collecting mlflow
  Downloading mlflow-2.3.2-py3-none-any.whl (17.7 MB)
Collecting boto3
  Downloading boto3-1.26.143-py3-none-any.whl (135 kB)
Collecting entrypoints<1
  Downloading entrypoints-0.4-py3-none-any.whl (5.3 kB)
Collecting databricks-cli<1,>=0.8.7
  Downloading databricks-cli-0.17.7.tar.gz (83 kB)
  Preparing metadata (setup.py) ... done
Collecting pyarrow<12,>=4.0.0
  Downloading pyarrow-11.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.0 MB)

In [5]:
# The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

import os
import warnings
import sys

import boto3
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn

import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Read the wine-quality csv file from the URL
    csv_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    try:
        data = pd.read_csv(csv_url, sep=";")
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, check your internet connection. Error: %s",
            e,
        )
        # Re-raise: without the data the rest of the cell cannot run.
        raise

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    alpha = 0.5
    l1_ratio = 0.5

    # create bucket
    object_storage = boto3.client(
        "s3",
        endpoint_url=os.getenv("MLFLOW_S3_ENDPOINT_URL"),
        config=boto3.session.Config(signature_version="s3v4"),
    )
    default_bucket_name = "mlflow"

    buckets_response = object_storage.list_buckets()
    result = [
        bucket
        for bucket in buckets_response["Buckets"]
        if bucket["Name"] == default_bucket_name
    ]
    if not result:
        object_storage.create_bucket(Bucket=default_bucket_name)

    with mlflow.start_run():
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        # Log the model as a run artifact and register it in the Model Registry.
        mlflow.sklearn.log_model(
            lr, "model", registered_model_name="ElasticnetWineModel"
        )


Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
  RMSE: 0.7931640229276851
  MAE: 0.6271946374319586
  R2: 0.10862644997792614


Successfully registered model 'ElasticnetWineModel'.
2023/05/31 10:39:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: ElasticnetWineModel, version 1
Created version '1' of model 'ElasticnetWineModel'.
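
Once a version has been created, the registered model can be addressed through a registry URI of the form `models:/<name>/<version>`. The sketch below builds such a URI and wraps `mlflow.pyfunc.load_model`; the helper names `registry_model_uri` and `load_registered_model` are hypothetical, and the load call is assumed to run in this notebook with the tracking server reachable.

```python
def registry_model_uri(name, version):
    """Build an MLflow Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"


def load_registered_model(name, version):
    """Load a registered model as a generic pyfunc model (needs the tracking server)."""
    import mlflow.pyfunc  # imported lazily so the URI helper works without MLflow

    return mlflow.pyfunc.load_model(registry_model_uri(name, version))


uri = registry_model_uri("ElasticnetWineModel", 1)
print(uri)

# In this notebook one could then run, for example:
# model = load_registered_model("ElasticnetWineModel", 1)
# model.predict(test_x)
```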


You can now visit the MLflow UI to see the registered model. By default the MLflow UI is exposed on NodePort 31380; you can check this with `kubectl get svc -n kubeflow`. If you are running this on a VM, you can tunnel the NodePort to your localhost with `ssh -i <keypair> -L 31380:localhost:31380 <user>@<vm-ip>`. After that, you can visit http://localhost:31380/.

You can also check that the S3 object store contains the model with the code block below.

In [6]:
# print list of files in the default bucket `mlflow`
response = object_storage.list_objects_v2(Bucket=default_bucket_name)
files = response.get("Contents", [])  # "Contents" is absent when the bucket is empty
for file in files:
    print(f"file_name: {file['Key']}, size: {file['Size']}")

file_name: 0/2374efdd7c844761834040ac6297876b/artifacts/model/MLmodel, size: 503
file_name: 0/2374efdd7c844761834040ac6297876b/artifacts/model/conda.yaml, size: 223
file_name: 0/2374efdd7c844761834040ac6297876b/artifacts/model/model.pkl, size: 645
file_name: 0/2374efdd7c844761834040ac6297876b/artifacts/model/python_env.yaml, size: 122
file_name: 0/2374efdd7c844761834040ac6297876b/artifacts/model/requirements.txt, size: 106
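
The keys in the listing above appear to follow the layout `<experiment_id>/<run_id>/artifacts/<path>`. Assuming that layout (it is inferred from this output and may differ on other deployments), a minimal sketch for grouping artifacts by run could look like this; `split_artifact_key` and `group_by_run` are hypothetical helper names.

```python
from collections import defaultdict


def split_artifact_key(key):
    """Split '<experiment_id>/<run_id>/artifacts/<path>' into its three parts."""
    experiment_id, run_id, marker, artifact_path = key.split("/", 3)
    if marker != "artifacts":
        raise ValueError(f"unexpected key layout: {key!r}")
    return experiment_id, run_id, artifact_path


def group_by_run(keys):
    """Map each run ID to the list of artifact paths stored under it."""
    grouped = defaultdict(list)
    for key in keys:
        _, run_id, artifact_path = split_artifact_key(key)
        grouped[run_id].append(artifact_path)
    return dict(grouped)


# Two of the keys printed above, used as sample input.
keys = [
    "0/2374efdd7c844761834040ac6297876b/artifacts/model/MLmodel",
    "0/2374efdd7c844761834040ac6297876b/artifacts/model/model.pkl",
]
print(group_by_run(keys))
```

In the notebook, the same function could be fed the real listing with `group_by_run(f["Key"] for f in files)`.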
