# MLflow Practice

This notebook builds a simple model and experiments with MLflow's functionality for tracking artifacts and experiments.

For many popular ML libraries, a single function call, mlflow.autolog(), is enough. If you are using one of the supported libraries, it automatically logs the parameters, metrics, and artifacts of your run (the Automatic Logging section of the MLflow documentation lists the supported libraries). For instance, the following autologs a scikit-learn run:

In [None]:
#pip install mlflow

In [1]:
import mlflow

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Enable autologging with MLflow
mlflow.autolog()

db = load_diabetes()

2023/06/08 10:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [2]:
#Describe Dataset
print(db.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

In [3]:
#List Features
db.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train models.
rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)

2023/06/08 10:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.
2023/06/08 10:53:02 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'ef4738ee31ea41c3bb3d56ef9e8efa24', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In addition to autologging, or if you are using a library that autologging does not yet support, you can log values manually as key-value pairs:

* Parameters: mlflow.log_param, mlflow.log_params
* Metrics: mlflow.log_metric
* Artifacts: mlflow.log_artifacts, mlflow.log_image, mlflow.log_text

Let's try these logging functions in an example:

In [5]:
import os
from random import random, randint
from mlflow import log_metric, log_param, log_params, log_artifacts

if __name__ == "__main__":
    
    # Log a parameter (key-value pair)
    log_param("config_value", randint(0, 100))

    # Log a dictionary of parameters
    log_params({"param1": randint(0, 100), "param2": randint(0, 100)})

    # Log a metric; metrics can be updated throughout the run
    log_metric("accuracy", random() / 2.0)
    log_metric("accuracy", random() + 0.1)
    log_metric("accuracy", random() + 0.2)

    # Log an artifact (output file)
    os.makedirs("outputs", exist_ok=True)
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")
    log_artifacts("outputs")

Once you’ve run your code, you may view the results with MLflow’s tracking UI. To start the UI, run:

In [7]:
!mlflow ui --port 1234

[2023-06-08 10:54:14 -0500] [74790] [INFO] Starting gunicorn 20.1.0
[2023-06-08 10:54:14 -0500] [74790] [INFO] Listening at: http://127.0.0.1:1234 (74790)
[2023-06-08 10:54:14 -0500] [74790] [INFO] Using worker: sync
[2023-06-08 10:54:14 -0500] [74791] [INFO] Booting worker with pid: 74791
[2023-06-08 10:54:14 -0500] [74792] [INFO] Booting worker with pid: 74792
[2023-06-08 10:54:14 -0500] [74793] [INFO] Booting worker with pid: 74793
[2023-06-08 10:54:14 -0500] [74794] [INFO] Booting worker with pid: 74794
^C
[2023-06-08 10:56:37 -0500] [74790] [INFO] Handling signal: int
[2023-06-08 10:56:37 -0500] [74792] [INFO] Worker exiting (pid: 74792)
[2023-06-08 10:56:37 -0500] [74793] [INFO] Worker exiting (pid: 74793)
[2023-06-08 10:56:37 -0500] [74791] [INFO] Worker exiting (pid: 74791)
[2023-06-08 10:56:37 -0500] [74794] [INFO] Worker exiting (pid: 74794)


## Store a model in MLflow
An MLflow Model is a directory that packages machine learning models and support files in a standard format. The directory contains:

* An MLmodel file in YAML format specifying the model’s flavor (or flavors), dependencies, signature (if supplied), and important metadata;

* The various files required by the model’s flavor(s) to instantiate the model. This will often be a serialized Python object;

* Files necessary for recreating the model’s runtime environment (for instance, a conda.yaml file); and

* Optionally, an input example

When using autologging, MLflow automatically logs whatever model or models the run creates. You can also log a model manually by calling mlflow.{library_module_name}.log_model. In addition, if you wish to load the model soon afterwards, it may be convenient to print the run’s ID to the console. For that, you’ll need the mlflow.ActiveRun object for the current run, which you get by wrapping all of your logging code in a with mlflow.start_run() as run: block (see the mlflow.start_run() API reference).

## Load a model from a specific training run for inference
To load and run a model stored in a previous run, you can use the mlflow.{library_module_name}.load_model function. You’ll need the run ID of the run that logged the model. You can find the run ID in the tracking UI:

In [12]:
import mlflow

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

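# The URI must resolve to the directory that contains the MLmodel file: the run ID
# plus the artifact path the model was logged under. Autologged scikit-learn models
# use the artifact path "model" (assumed here for this run). The bare URI
# "runs:/6dae2723ae5d42e59d11277874073d17/" points at the run's artifact root and
# fails with: MlflowException: Could not find an "MLmodel" configuration file
model = mlflow.sklearn.load_model("runs:/6dae2723ae5d42e59d11277874073d17/model")
predictions = model.predict(X_test)
print(predictions)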