# Introduction to MLflow

This notebook introduces the open-source library [MLflow](https://mlflow.org/), a popular ML lifecycle management framework built and maintained by Databricks.

> **_NOTE:_**  You can run this tutorial interactively as a Jupyter notebook: clone this repo, install it with `poetry install`, and launch the `README.ipynb` notebook.

## MLflow's "Hello world!"

The most basic functionality of MLflow covers experiment tracking. The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results.

> **_NOTE:_**  In this first part, we will use MLflow *locally* (its default setup), meaning that all metadata and artifacts are stored in your working directory (under `mlruns/`). In the second part, we will connect to the remote MLflow server available at MeteoSwiss.

### Using the Tracking API

The MLflow Tracking API lets you log metrics and artifacts (files) from your data science code and see a history of your runs. You can see it at work in the example below:

In [None]:
import os
from random import random, randint

import mlflow

mlflow.set_experiment(experiment_name="mlflow-tutorial")

mlflow.start_run(run_name="dummy-run")

mlflow.log_param("param1", randint(0, 100))

mlflow.log_metric("foo", random())
mlflow.log_metric("foo", random() + 1)
mlflow.log_metric("foo", random() + 2)

if not os.path.exists("outputs"):
    os.makedirs("outputs")

with open("outputs/test.txt", "w") as f:
    f.write("hello world!")

mlflow.log_artifacts("outputs")

mlflow.end_run()

### Viewing the Tracking UI

By default, wherever you run your program, the tracking API writes its data to files in a local `./mlruns` directory. You can then run MLflow's Tracking UI:

```
mlflow ui
```

and view it at http://localhost:5000.

## Running an actual ML model

Next, we will use scikit-learn to train a simple model on actual data and see how MLflow can easily integrate within our ML workflow.

This time, we will take advantage of MLflow's [autologging](https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.autolog)
capability, which automatically captures the most common metadata and artifacts (including the trained model) from our experiments. We will also use `mlflow.start_run()` as a Python context manager, in order to scope each run within one code block.

In [None]:
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split


# Enable MLflow's autologging functionality
mlflow.autolog()

# Load the diabetes dataset (regression task)
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Train/test splits
X_train, X_test, y_train, y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=0.2, random_state=42
)

with mlflow.start_run(run_name="diabetes-lr") as active_run:

    # Model training/fit
    # > parameters are logged automatically
    # by leveraging sklearn's .get_params() method
    lm = linear_model.LinearRegression()
    lm.fit(X_train, y_train)

    # Model evaluation
    # > metrics are logged automatically
    # in this case, the R2 score metric is computed, see
    # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score
    r2_test = lm.score(X_test, y_test)

Once the model is trained and properly logged, we can query metadata and artifacts from the MLflow experiment database at any time using an instance of [MlflowClient](https://mlflow.org/docs/latest/python_api/mlflow.client.html?highlight=search_experiments#mlflow.client.MlflowClient), the client of an MLflow Tracking Server.

In [None]:
client = mlflow.tracking.MlflowClient()

experiment = client.search_experiments(filter_string="name = 'mlflow-tutorial'")[0]
# search_runs expects a list of experiment IDs
runs = client.search_runs(experiment_ids=[experiment.experiment_id])
runs

In [None]:
from pathlib import Path
from urllib.parse import urlparse

artifact_uri = runs[0].info.artifact_uri
artifact_path = Path(urlparse(artifact_uri).path)
!ls -l {artifact_path / "model"}

As we can see above, autologging of the sklearn model produced many files in `model/`. Among these, `MLmodel` is a metadata file that tells MLflow how to load the model. Another file, `model.pkl`, is a serialized version of the linear regression model that you trained.

In [None]:
!cat {artifact_path / "model" / "MLmodel"}

To load the model and make new predictions, we can use MLflow's [pyfunc](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html) module, which serves as a default model interface for MLflow Python models.

In [None]:
import numpy as np

# pass the model location as a string, as expected for a model URI
my_model = mlflow.pyfunc.load_model(str(artifact_path / "model"))
my_model.predict(np.random.uniform(0, 1, (1, 10)))

Finally, we can even use MLflow to [deploy](https://mlflow.org/docs/latest/models.html#deploy-mlflow-models) a local REST server that can serve predictions by using the dedicated [CLI](https://mlflow.org/docs/latest/cli.html#mlflow-models). We do that by spawning a new process with `subprocess.Popen()`:

In [None]:
import subprocess
import time

command = f"mlflow models serve -m {artifact_path}/model -p 1234 --env-manager local"
print(f"{command}")
mlflow_serve = subprocess.Popen(command.split(" "))

# wait some time to allow the MLflow server to spin up...
# note that this time might need to be considerably longer the first time
# that you run it with an env manager other than `local` (since MLflow then
# needs to create the environment, e.g. a virtualenv)
time.sleep(10)

The REST API defines 4 endpoints:
- `/ping`: health check
- `/health`: same as `/ping`
- `/version`: returns the MLflow version
- `/invocations`: scores the input data
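
The health-check endpoint can also replace a fixed waiting time after starting the server. The sketch below polls `/ping` until it answers, assuming that the server replies with HTTP 200 once it is ready. The function name `wait_until_ready` and its parameters are hypothetical, and proxies are deliberately disabled because the request targets a local server:

```python
import time
import urllib.request

# An opener without proxies: requests to a local server should not be proxied
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_until_ready(base_url, timeout=60.0, interval=0.5):
    """Poll `<base_url>/ping` until it returns HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with _opener.open(f"{base_url}/ping", timeout=2) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts and HTTP error responses
            pass
        time.sleep(interval)
    return False
```

With the server started as above, `wait_until_ready("http://localhost:1234")` returns `True` as soon as the server responds, or `False` if it did not respond within the timeout.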


> **_NOTE:_**  If you are running this notebook in your lab-vm, make sure that 'localhost' is listed within your `no_proxy` environment variable before executing the next cell.

In [None]:
# output should include "localhost" when running from the lab-vm
os.getenv("no_proxy")
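
Reading the variable by eye works, but the check can also be scripted. The helper below is a simplified sketch: `bypasses_proxy` is a hypothetical name, and it only looks for an exact, case-insensitive match (or `*`) in the comma-separated list, whereas HTTP clients typically also match domain suffixes.

```python
import os


def bypasses_proxy(host, no_proxy=None):
    """Return True if `host` appears in a comma-separated no_proxy value."""
    if no_proxy is None:
        # Fall back to the environment; both spellings are in common use
        no_proxy = os.getenv("no_proxy") or os.getenv("NO_PROXY") or ""
    entries = [entry.strip().lower() for entry in no_proxy.split(",") if entry.strip()]
    return "*" in entries or host.lower() in entries


# Result depends on the environment this runs in
print(bypasses_proxy("localhost"))
```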

In [None]:
import json
import requests

host = "http://localhost:1234"
sample_input = {
    "dataframe_split": {
        "columns": [
            "age",
            "sex",
            "bmi",
            "bp",
            "s1",
            "s2",
            "s3",
            "s4",
            "s5",
            "s6",
        ],
        "data": [np.random.uniform(0, 1, (10)).tolist()],
    }
}
response = requests.post(
    url=f"{host}/invocations",
    data=json.dumps(sample_input),
    headers={"Content-Type": "application/json"},
)
# fail early with a clear error if the server rejected the request
response.raise_for_status()
response_json = response.json()
print(response_json)
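
The request body above holds a single row, but the `data` field is a list of rows, so several samples can be sent at once. A minimal sketch of a payload builder follows. `build_payload` and `FEATURE_NAMES` are hypothetical names, and the column names are copied from the cell above:

```python
import json

# Column names copied from the request in the cell above
FEATURE_NAMES = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]


def build_payload(rows, columns=FEATURE_NAMES):
    """Serialize rows of feature values into a `dataframe_split` JSON body."""
    data = []
    for index, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(
                f"row {index} has {len(row)} values, expected {len(columns)}"
            )
        # float() also converts NumPy scalars such as np.float32,
        # which json.dumps cannot encode directly
        data.append([float(value) for value in row])
    return json.dumps({"dataframe_split": {"columns": list(columns), "data": data}})


# Two hypothetical samples, serialized into a single request body
payload = build_payload([[0.1] * 10, [0.2] * 10])
print(payload)
```

The resulting string can be passed as `data=payload` to `requests.post`. Checking the row length before sending gives a clearer error message than a rejected request would.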

In [None]:
# Stop the MLflow server after the prediction
mlflow_serve.terminate()

Congratulations! You reached the end of the first part of this tutorial.
Next, we will show you how to use the remote MLflow server available to all ML practitioners at MeteoSwiss.

# Remote MLflow server

For our purpose, we will use the DEVT instance running on our internal OpenShift platform. 
The MLflow UI is reachable at https://servicedevt.meteoswiss.ch/mlstore/ from the internal network
or https://hubdevt.meteoswiss.ch/mlstore/ if you are working in the lab-vm.

Using the remote server instead of the local setup is simple.
You just need to set the remote server URI at the beginning of your code using `mlflow.set_tracking_uri()`.
The rest of the code stays exactly the same.

In [None]:
mlflow.set_tracking_uri("https://hubdevt.meteoswiss.ch/mlstore/")

mlflow.set_experiment(experiment_name="sandbox")

with mlflow.start_run():
    mlflow.log_param("param1", randint(0, 100))
    mlflow.log_artifact("outputs/test.txt")


## Build the README

To build the Markdown version for GitLab, run the following command:

```
make clean && poetry run jupyter nbconvert --execute --to markdown README.ipynb
```

The associated figures are saved in the `README_files/` folder.