# Managing Machine Learning Models with MLflow


## Introduction
- Weston Bassler, ML Engineer at Emburse
- DevOps / SRE 
- Distributed Systems Background (Hadoop, Apache Mesos, Kubernetes)
- Love Tech and Automation


## What am I going to show you today?
- What is MLflow?
- Experiment Tracking
- Model Evaluation
- Model Registry
- Model Deployment
- Automate Training

**Please use the Notebook to follow along**


# What is MLflow?
- A powerful Python library that assists with many steps of the Machine Learning (ML) lifecycle
- It consists of Tracking, Models, Model Registry, and Model Serving
- Has integrations with many tools and platforms, such as PyTorch, TensorFlow, scikit-learn, Hugging Face, LangChain, OpenAI, and many more
- It is Open Source

# Why should you care?

*The Application Development Lifecycle is hard. Application Development that includes Machine Learning (ML Lifecycle) is even harder!*

MLflow simplifies the ML Lifecycle by providing solutions for the following:

- Experimentation Management (**MLflow Tracking**): MLflow provides a systematic way to track experiments, including parameters, metrics, and code versions. This helps you organize and compare different experiments easily, leading to more efficient exploration of hyperparameters and model architectures.

- Reproducibility and Collaboration (**MLflow Tracking & MLflow Projects & Model Registry**): MLflow captures the environment and dependencies for each experiment, ensuring reproducibility across different environments. This is crucial for collaboration within teams and sharing results with stakeholders, as it ensures that experiments can be replicated reliably.

- Model Versioning and Management (**Model Registry**): MLflow allows you to version models, making it easier to track changes over time and revert to previous versions if needed. This enhances model governance and facilitates auditing and compliance requirements.

- Deployment Simplification (**Model Registry**): MLflow streamlines the process of deploying models into production by providing tools for packaging models in a standard format and integrating with deployment platforms. This reduces the friction between model development and deployment, enabling faster time-to-market for machine learning solutions.

- Integration with Existing Tools and Frameworks (**MLflow Models/Flavors**): MLflow seamlessly integrates with popular machine learning libraries and frameworks, including TensorFlow, PyTorch, scikit-learn, and XGBoost. This means you can continue using your preferred tools while benefiting from MLflow's capabilities.

# Using MLflow Python API

#### MLflow Module
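- The `mlflow` module is a high-level API for managing MLflow Runs.
- What is a Run? A collection of parameters, metrics, artifacts, etc. related to training an ML model.
- Calls such as `mlflow.start_run()` and `mlflow.end_run()` operate on the currently "active" Run.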
```py
import mlflow

# to start a run
mlflow.start_run()

# to end a run
mlflow.end_run()

```

#### MLflow Client
- Used to interface with Experiments, Runs, Model Versions and Registered Models.
```py
from mlflow import MlflowClient

client = MlflowClient()

```

## Getting Started

In [1]:
!pip install mlflow==2.8.1 pandas==1.5.1 xgboost==1.6.2 argparse

Collecting argparse
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [None]:
# IF YOU ARE USING GOOGLE COLAB UNCOMMENT AND RUN
# !git clone https://github.com/geekbass/MLflow-Workshop.git
# !rm MLflow-Workshop/*ipynb
# !mv MLflow-Workshop/* .

In [None]:
# IF YOU ARE USING GOOGLE COLAB UNCOMMENT AND RUN
# import shutil

# shutil.rmtree('MLflow-Workshop')

In [None]:

"""
RESTART YOUR KERNEL
"""


# Project Background
We are an investment firm that financially backs tech startups. The company believes it could benefit from an ML model that estimates potential profit based on historical data from previous investments.

We have collected a dataset for estimating the potential profit of a startup based on its spend on R&D, Administration, and Marketing, as well as the U.S. state in which the startup resides.

We have decided that the model should be a Regression Type model and we are going to use XGBoost Regressor.

# Create your Project in MLflow

In [1]:
# Create Your Initial Project in MLflow
import mlflow 

# Set the tracking URI to a local SQLite file
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

# In MLflow create a new Experiment 
experiment_id = mlflow.create_experiment("PotentialStartups")

2024/03/04 20:03:44 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2024/03/04 20:03:44 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

In [2]:
# Print the Experiment Name and Creation Date
experiment = mlflow.get_experiment(experiment_id)
print("Name: {}".format(experiment.name))
print("Creation timestamp: {}".format(experiment.creation_time))

Name: PotentialStartups
Creation timestamp: 1709600624550
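
MLflow reports `creation_time` as milliseconds since the Unix epoch. A small sketch, using only the standard library and the value printed above, converts it to a readable UTC datetime:

```python
from datetime import datetime, timedelta, timezone

# Value copied from the output above (milliseconds since the Unix epoch).
creation_time_ms = 1709600624550

# Add the milliseconds to the epoch to avoid float rounding.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
created_at = epoch + timedelta(milliseconds=creation_time_ms)

print(created_at.isoformat())
```

The result, 01:03:44 UTC on 2024-03-05, corresponds to the `2024/03/04 20:03:44` log line above, which was written in a UTC-5 local time zone.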


In [3]:
# Run the MLflow UI for a visual view (this blocks the cell until you interrupt it)
!mlflow server --backend-store-uri="sqlite:///mydb.sqlite"

[2024-03-04 20:04:23 -0500] [9822] [INFO] Starting gunicorn 20.1.0
[2024-03-04 20:04:23 -0500] [9822] [INFO] Listening at: http://127.0.0.1:5000 (9822)
[2024-03-04 20:04:23 -0500] [9822] [INFO] Using worker: sync
[2024-03-04 20:04:23 -0500] [9823] [INFO] Booting worker with pid: 9823
[2024-03-04 20:04:24 -0500] [9824] [INFO] Booting worker with pid: 9824
[2024-03-04 20:04:24 -0500] [9825] [INFO] Booting worker with pid: 9825
[2024-03-04 20:04:24 -0500] [9826] [INFO] Booting worker with pid: 9826
^C
[2024-03-04 20:04:46 -0500] [9822] [INFO] Handling signal: int
[2024-03-04 20:04:46 -0500] [9824] [INFO] Worker exiting (pid: 9824)
[2024-03-04 20:04:46 -0500] [9826] [INFO] Worker exiting (pid: 9826)
[2024-03-04 20:04:46 -0500] [9825] [INFO] Worker exiting (pid: 9825)
[2024-03-04 20:04:46 -0500] [9823] [INFO] Worker exiting (pid: 9823)


# Track your First Model | MLflow Tracking 
Here we are going to log information about our training runs to the Experiment we created in MLflow. MLflow Tracking provides a central location for storing and visualizing information about models, such as training parameters and metrics, and it can also store files such as models and code.

In [4]:
# Begin with loading the Dataset into Training Data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the Dataset
df = pd.read_csv('startups_profit.csv', index_col=False)
df['State']=df['State'].map({'New York':0,'Florida':1, 'California': 2}).astype(int)

# Training Data: the four feature columns and the Profit target
X = df[["R&D Spend", "Administration", "Marketing Spend", "State"]]
y = df["Profit"]

# Setting up train test split
X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), train_size=0.7,random_state=0)

In [5]:
len(X_train), len(X_test), len(y_train), len(y_test)

(35, 15, 35, 15)

In [6]:
# Log model to our Project
import mlflow

# Set the connection to the tracking URI
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")
# Set the experiment
mlflow.set_experiment("PotentialStartups")

<Experiment: artifact_location='/Users/weston/Documents/git-repos/mlflow-workshop/mlruns/1', creation_time=1709600624550, experiment_id='1', last_update_time=1709600624550, lifecycle_stage='active', name='PotentialStartups', tags={}>

### Let's Talk about Auto Logging in MLflow
What is `autolog`?

MLflow has integrations with some ML libraries that will automatically log metrics, parameters, and models when you call the library's `autolog()` function.

The following libraries support autologging:
- Fastai
- Gluon
- Keras
- LightGBM
- PyTorch
- Scikit-learn
- Spark
- Statsmodels
- XGBoost



In [7]:
# Start an MLflow Run
mlflow.start_run()

<ActiveRun: >

In [8]:
# Set Autolog for XGBoost
import mlflow.xgboost

mlflow.xgboost.autolog()

In [9]:
# Train our First Model
import xgboost 

xgbr = xgboost.XGBRegressor() 
xgbr.fit(X_train, y_train)



In [10]:
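# Evaluate our Model using MLflow. This will log the metrics to MLflow for us.
# Copy the test features so that adding the target column does not modify X_test.
eval_data = X_test.copy()
eval_data["Profits"] = y_test

# Get the URI of the model that autolog saved for the active Run
model_uri = mlflow.get_artifact_uri("model")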

# This will run the evaluate Method against our model and our evaluation Data for the Regressor Type.
# Here we are also only selecting the "default" evaluators
result = mlflow.evaluate(
    model_uri,
    eval_data,
    targets="Profits",
    model_type="regressor",
    evaluators="default"
)

  return _infer_schema(self._df)
2024/03/04 20:08:07 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/04 20:08:07 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/03/04 20:08:07 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


In [11]:
# End our Run
mlflow.end_run()
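
The default regressor evaluator logs metrics such as `root_mean_squared_error`, `mean_absolute_error`, and `r2_score` (the column names that appear when we search runs later in this notebook). As a rough illustration of what those numbers mean, the sketch below computes them by hand with the standard textbook formulas. The target and prediction values are made up and are not taken from the startup dataset.

```python
import math

# Hypothetical targets and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_pred)]

# Mean absolute error: average size of the residuals.
mae = sum(abs(r) for r in residuals) / n

# Root mean squared error: penalizes large residuals more heavily.
rmse = math.sqrt(sum(r * r for r in residuals) / n)

# R^2: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(r * r for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1.0 - ss_res / ss_tot

print(f"mae={mae:.3f} rmse={rmse:.3f} r2={r2:.3f}")
```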

In [14]:
# Run this Cell a few times just to populate some data
import mlflow.xgboost
import xgboost

# Start another MLflow Run
with mlflow.start_run() as run:
    mlflow.xgboost.autolog()

    xgbr = xgboost.XGBRegressor() 
    xgbr.fit(X_train, y_train)

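    # Evaluate our Model using MLflow
    # Copy the test features so that adding the target column does not modify X_test.
    eval_data = X_test.copy()
    eval_data["Profits"] = y_test

    # Get the URI of the model that autolog saved for this Run
    model_uri = mlflow.get_artifact_uri("model")

    # Run the evaluation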
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="Profits",
        model_type="regressor",
        evaluators="default"
    )

  return _infer_schema(self._df)
2024/03/04 20:10:42 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/04 20:10:42 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/03/04 20:10:42 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


# Evaluate our Trained Models based on Metrics
Just as in the UI, you can sift through an Experiment's metrics programmatically. `mlflow.search_runs` returns a pandas DataFrame, so we can use pandas for this!

In [15]:
import mlflow
import pandas as pd

# Set Tracking URI
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

# Get the Experiment ID
experiment_id = mlflow.get_experiment_by_name("PotentialStartups").experiment_id

# Search runs and output to Pandas DF
evals_df = mlflow.search_runs([experiment_id])
evals_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 58 columns):
 #   Column                                  Non-Null Count  Dtype              
---  ------                                  --------------  -----              
 0   run_id                                  4 non-null      object             
 1   experiment_id                           4 non-null      object             
 2   status                                  4 non-null      object             
 3   artifact_uri                            4 non-null      object             
 4   start_time                              4 non-null      datetime64[ns, UTC]
 5   end_time                                4 non-null      datetime64[ns, UTC]
 6   metrics.root_mean_squared_error         4 non-null      float64            
 7   metrics.mean_absolute_error             4 non-null      float64            
 8   metrics.example_count                   4 non-null      float64            
 9   met

In [16]:
evals_df

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.root_mean_squared_error,metrics.mean_absolute_error,metrics.example_count,metrics.mean_absolute_percentage_error,...,params.colsample_bynode,params.scale_pos_weight,params.random_state,tags.mlflow.datasets,tags.mlflow.user,tags.mlflow.source.name,tags.mlflow.log-model.history,tags.mlflow.runName,tags.mlflow.source.git.commit,tags.mlflow.source.type
0,39c71e4e495f452388ca287da575accd,1,FINISHED,/Users/weston/Documents/git-repos/mlflow-works...,2024-03-05 01:10:40.709000+00:00,2024-03-05 01:10:43.004000+00:00,9985.875733,8204.841708,15.0,0.074113,...,,,,"[{""name"":""af248a637c03ea4b19b6b969e6d29fa2"",""h...",weston,/Users/weston/Documents/git-repos/mlflow-works...,"[{""run_id"": ""39c71e4e495f452388ca287da575accd""...",resilient-wasp-130,8f50d8dfe7af2a6660d5b0272ee71363c1ec782e,LOCAL
1,d606a8159bfe4dbfa50ebb7c2edd1dd9,1,FINISHED,/Users/weston/Documents/git-repos/mlflow-works...,2024-03-05 01:10:37.677000+00:00,2024-03-05 01:10:39.876000+00:00,9985.875733,8204.841708,15.0,0.074113,...,,,,"[{""name"":""af248a637c03ea4b19b6b969e6d29fa2"",""h...",weston,/Users/weston/Documents/git-repos/mlflow-works...,"[{""run_id"": ""d606a8159bfe4dbfa50ebb7c2edd1dd9""...",loud-bear-428,8f50d8dfe7af2a6660d5b0272ee71363c1ec782e,LOCAL
2,9453994aaa764876b2759f74db7bb902,1,FINISHED,/Users/weston/Documents/git-repos/mlflow-works...,2024-03-05 01:10:28.412000+00:00,2024-03-05 01:10:30.673000+00:00,9985.875733,8204.841708,15.0,0.074113,...,,,,"[{""name"":""af248a637c03ea4b19b6b969e6d29fa2"",""h...",weston,/Users/weston/Documents/git-repos/mlflow-works...,"[{""run_id"": ""9453994aaa764876b2759f74db7bb902""...",treasured-fish-926,8f50d8dfe7af2a6660d5b0272ee71363c1ec782e,LOCAL
3,70d86b1a31994f65835ad1034268b4d3,1,FINISHED,/Users/weston/Documents/git-repos/mlflow-works...,2024-03-05 01:07:17.935000+00:00,2024-03-05 01:08:16.025000+00:00,9985.875733,8204.841708,15.0,0.074113,...,,,,"[{""name"":""af248a637c03ea4b19b6b969e6d29fa2"",""h...",weston,/Users/weston/Documents/git-repos/mlflow-works...,"[{""run_id"": ""70d86b1a31994f65835ad1034268b4d3""...",serious-rat-510,8f50d8dfe7af2a6660d5b0272ee71363c1ec782e,LOCAL


In [None]:
# Sort it by r2_score
evals_df = mlflow.search_runs([experiment_id], order_by=["metrics.r2_score DESC"])
evals_df

In [18]:
# Print ONLY the r2_score and the run_id
evals_df[["metrics.r2_score", "run_id"]]

Unnamed: 0,metrics.r2_score,run_id
0,0.896692,39c71e4e495f452388ca287da575accd
1,0.896692,d606a8159bfe4dbfa50ebb7c2edd1dd9
2,0.896692,9453994aaa764876b2759f74db7bb902
3,0.896692,70d86b1a31994f65835ad1034268b4d3


#### The above evaluation can be done on ANY metric you would like. This will help us decide which models we would like to register to the Model Registry

# Register A Model | MLflow Registry
The Model Registry stores models so that we can share them easily with others while following the development lifecycle (Staging, Production, etc.). It also provides a way to version, alias, tag, and annotate models as desired.

We have now trained a few models and have evaluated the results. We have a model ready for initial testing and need to establish a method for team members and other company personnel to access it.

In [19]:
# Create a New Model in The Model Registry using the MLflow Client
import mlflow

# Set our tracking URI
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

# Create a client connection
client = mlflow.MlflowClient()

# Create a new Model in the Registry called StartupModels
client.create_registered_model("StartupModels")

<RegisteredModel: aliases={}, creation_timestamp=1709601193683, description=None, last_updated_timestamp=1709601193683, latest_versions=[], name='StartupModels', tags={}>

#### Now that we have a location to store our models lets register (add) a model to it

In [None]:
import mlflow

# SET THESE 2 lines
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")
mlflow.set_experiment("PotentialStartups")

In [20]:
# To begin using the Model Registry, Pick our favorite model from above and register it using the run-id
run_id = "39c71e4e495f452388ca287da575accd"

# Register the model
mlflow.register_model(f"runs:/{run_id}/model", "StartupModels")

Registered model 'StartupModels' already exists. Creating a new version of this model...
Created version '1' of model 'StartupModels'.


<ModelVersion: aliases=[], creation_timestamp=1709601224832, current_stage='None', description=None, last_updated_timestamp=1709601224832, name='StartupModels', run_id='39c71e4e495f452388ca287da575accd', run_link=None, source='/Users/weston/Documents/git-repos/mlflow-workshop/mlruns/1/39c71e4e495f452388ca287da575accd/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=1>

#### Notice here, a new version of the model has been created. We could add another model here and it would continue to increment the version. This is very good practice. 

#### Also note here, MLflow increments versions but we can also add our own tags and aliases to models to better help identify them. This is out of scope for this workshop.

# Deploying a Model | MLflow Registry
Now that we have a model registered to the Model Registry, we can now use MLflow to Deploy the model. Anyone that has access to MLflow can do this. For now we are going to deploy the model locally using the version of the model when we first registered it.

We will load it via the "model_uri" - `models:/model_name/model_version`
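
Two URI schemes appear in this workshop: `runs:/<run_id>/model` when registering a model from a Run, and `models:/<name>/<version>` when loading a registered version. The helper functions below are hypothetical conveniences (they are not part of MLflow) that simply build those strings:

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the URI of a model artifact logged under a Run (hypothetical helper)."""
    return f"runs:/{run_id}/{artifact_path}"


def registered_model_uri(name: str, version: int) -> str:
    """Build the URI of a specific registered model version (hypothetical helper)."""
    return f"models:/{name}/{version}"


# Values taken from the cells above.
source_uri = run_model_uri("39c71e4e495f452388ca287da575accd")
serving_uri = registered_model_uri("StartupModels", 1)

print(source_uri)
print(serving_uri)
```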

In [21]:
import mlflow

# Set the tracking URI
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

In [22]:
import mlflow

# Notice here we actually use mlflow XGBoost "flavor" to load the model. Check the MLflow Docs for more information on Flavors!
model = mlflow.xgboost.load_model(model_uri="models:/StartupModels/1")
model

In [23]:
# Run a quick Prediction on profit using some fake data

# R&D Spend, Administration, Marketing Spend, State
predict_list = [345349.2, 133337.8, 472345.10, 1]
# Predict
prediction = model.predict([predict_list])
prediction[0]

192257.23

### Serving via CLI
Below is an example shell script using the MLflow CLI for serving the same model. It will serve the model via port 5000.

```sh
#!/usr/bin/env sh

# Deploying a Model using the mlflow cli
export MLFLOW_TRACKING_URI="sqlite:///mydb.sqlite"

# Serve the model from the Model Registry using the local Python environment
mlflow models serve -m "models:/StartupModels/1" --env-manager local
```
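
Once the model is being served, predictions are requested by POSTing JSON to the server's `/invocations` endpoint. The sketch below only builds the request and does not send it, so it runs without a live server. It assumes the default address `http://127.0.0.1:5000`, the MLflow 2.x `dataframe_split` input format, and that the served model accepts the training column names. Adjust these if your setup differs.

```python
import json
import urllib.request

# Assumption: `mlflow models serve` is listening on its default host and port.
url = "http://127.0.0.1:5000/invocations"

# Same feature order and fake values as the prediction cell above.
payload = {
    "dataframe_split": {
        "columns": ["R&D Spend", "Administration", "Marketing Spend", "State"],
        "data": [[345349.2, 133337.8, 472345.10, 1]],
    }
}
body = json.dumps(payload).encode("utf-8")

request = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send the request once the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```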

# Automating Training | MLflow Projects
As we iterate on model versions, add new data, and learn from our experiments, automating the development pipeline becomes increasingly necessary. This is where MLflow Projects come in. An MLflow Project is a format for packaging all of our code in a reproducible way. It also lets us run the automation via the CLI or from the `mlflow.projects.run()` function.

We specify our Project in a file called `MLproject`, a YAML file that defines key pieces such as the name, the Python environment, and entry points, which let us pass templated parameters and commands to scripts.

Below is an example `MLproject` file that we will be using:

```yaml
name: Potential Profit 

python_env: python_env.yaml

entry_points:
  main:
    parameters:
      n_estimators: {type: int, default: 10}
      max_depth: {type: int, default: 5}
    command: "python train.py \
        --n_estimators {n_estimators} \
        --max_depth {max_depth}"
```

The above file allows us to pass the parameters `n_estimators` and `max_depth` to our command, a Python script that accepts two arguments (`n_estimators` and `max_depth`). This lets us automate a training run and pass different parameters to the same code. The script parses these arguments with the `argparse` library.
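
To make the flow of parameters concrete, the sketch below imitates it with the standard library only: it fills the `{n_estimators}` and `{max_depth}` placeholders of the command template with `str.format`, then parses the resulting arguments the way a script like `train.py` could. This illustrates the idea and is not MLflow's actual implementation. The `overrides` dictionary and the `type=int` conversion are choices made for this example.

```python
import argparse
import shlex

# Command template and defaults, copied from the MLproject file above.
command_template = "python train.py --n_estimators {n_estimators} --max_depth {max_depth}"
defaults = {"n_estimators": 10, "max_depth": 5}

# Parameters supplied by the caller override the defaults.
overrides = {"n_estimators": 20}
params = {**defaults, **overrides}
command = command_template.format(**params)
print(command)

# On the script side, argparse turns the command-line strings back into integers.
parser = argparse.ArgumentParser()
parser.add_argument("--n_estimators", type=int)
parser.add_argument("--max_depth", type=int)

# Drop "python train.py" and parse only the arguments.
args = parser.parse_args(shlex.split(command)[2:])
print(args.n_estimators, args.max_depth)
```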

**We can automate an entire training run very very easily! Everything we have done so far we will automate with an MLflow Project!**


Our `train.py` file looks like this:

```py
import sys
import argparse
import mlflow
import mlflow.xgboost
import xgboost
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Set the MLflow tracking URI (the Experiment is chosen when the Project is run)
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

# Parse our parameter arguments as integers
parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, required=True)
parser.add_argument('--max_depth', type=int, required=True)
args = parser.parse_args()

# Set the values of our Arguments
n_estimators = args.n_estimators
max_depth = args.max_depth

# Set up the Training Data
# Load the Dataset
df = pd.read_csv('startups_profit.csv', index_col=False)
df['State']=df['State'].map({'New York':0,'Florida':1, 'California': 2}).astype(int)

# Training Data: the four feature columns and the Profit target
X = df[["R&D Spend", "Administration", "Marketing Spend", "State"]]
y = df["Profit"]

# Setting up train test split
X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), train_size=0.7,random_state=0)

# Start a training Run and autolog it.
with mlflow.start_run() as run:
    mlflow.xgboost.autolog()
    xgbr = xgboost.XGBRegressor(n_estimators=n_estimators, max_depth=max_depth) 
    xgbr.fit(X_train, y_train)

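    # Evaluate our Model using MLflow
    # Copy the test features so that adding the target column does not modify X_test.
    eval_data = X_test.copy()
    eval_data["Profits"] = y_test

    # Get the URI of the model that autolog saved for this Run
    model_uri = mlflow.get_artifact_uri("model")

    # Evaluate the model; the resulting metrics are logged to the Run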
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="Profits",
        model_type="regressor",
        evaluators="default"
    )

```

These are the same steps we used before to train our model, except that the model's parameters are now passed in via MLflow Projects.

In [24]:
# Let's run a Project using the mlflow.projects.run function
import mlflow

# Set our tracking uri
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

# Run the projects with our specified parameters
mlflow.projects.run(
    # Specifies where the MLproject file lives
    './',
    # Running this on the main entry point
    entry_point='main',
    # Here is our Experiment Name.
    experiment_name='PotentialStartups',
    # Using the local environment
    env_manager='local',
    # Set our Desired parameters for our model
    parameters={
        'n_estimators': 20, 
        'max_depth': 5
    })


2024/03/04 20:14:52 INFO mlflow.projects.utils: === Created directory /var/folders/9j/mvcchftn0h7507fkdslzrz880000gn/T/tmpm4t2lvlc for downloading remote URIs passed to arguments of type 'path' ===
2024/03/04 20:14:52 INFO mlflow.projects.backend.local: === Running command 'python train.py --n_estimators 20 --max_depth 5' in run with ID '7ffd7fbf2cc54685a68756bbf4aeff5c' === 
  return _infer_schema(self._df)
2024/03/04 20:14:58 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/04 20:14:58 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/03/04 20:14:58 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2024/03/04 20:14:58 INFO mlflow.projects: === Run (ID '7ffd7fbf2cc54685a68756bbf4aeff5c') succeeded ===


<mlflow.projects.submitted_run.LocalSubmittedRun at 0x1416927d0>

In [25]:
# Let's just check to make sure it worked
import mlflow
import pandas as pd

# Set Tracking URI
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

# Get the Experiment ID
experiment_id = mlflow.get_experiment_by_name("PotentialStartups").experiment_id

In [27]:
# Search runs and output to Pandas DF. You can get the run_id from the output from the Project run.
evals_df = mlflow.search_runs([experiment_id])
evals_df['run_id']=="7ffd7fbf2cc54685a68756bbf4aeff5c"

0     True
1    False
2    False
3    False
4    False
Name: run_id, dtype: bool

## Taking Automation Further with Projects
- Use tools like Dask or Ray to train multiple Models at a time with different Parameters
- Dask Hyperparameter Search is a great option
- Train multiple models at the same time in parallel


In [28]:
!pip install dask


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [29]:
# Here we are creating a list of parameters to be passed to our train_model function below.
import random

parameters = [
    {'n_estimators': 10, 'max_depth': 2},
    {'n_estimators': 50, 'max_depth': 3},
    {'n_estimators': 100, 'max_depth': 4},
    {'n_estimators': 20, 'max_depth': 5},
    {'n_estimators': 150, 'max_depth': 6},
    {'n_estimators': 250, 'max_depth': 7},
    # Random values start at 1 so that every model trains at least one tree
    {'n_estimators': random.randint(1, 250), 'max_depth': 8},
    {'n_estimators': random.randint(1, 250), 'max_depth': 9},
    {'n_estimators': random.randint(1, 250), 'max_depth': 10},
    {'n_estimators': random.randint(1, 250), 'max_depth': 9},
    {'n_estimators': random.randint(1, 250), 'max_depth': 8},
    {'n_estimators': random.randint(1, 250), 'max_depth': 7},
    {'n_estimators': random.randint(1, 250), 'max_depth': 6},
    {'n_estimators': random.randint(1, 250), 'max_depth': 5},
    {'n_estimators': random.randint(1, 250), 'max_depth': 10}
]
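
If you would rather generate the search space than type it out, a seeded random generator keeps the list reproducible between sessions. The sketch below is one possible approach and not part of the workshop code. The helper name `sample_parameters`, the seed, and the value ranges are assumptions chosen for illustration, with `n_estimators` starting at 1 so that every model has at least one tree.

```python
import random


def sample_parameters(count, seed=0):
    """Return `count` random XGBoost parameter dictionaries (hypothetical helper)."""
    rng = random.Random(seed)  # a private generator leaves the global random state untouched
    return [
        {
            "n_estimators": rng.randint(1, 250),  # bounds are inclusive
            "max_depth": rng.randint(2, 10),
        }
        for _ in range(count)
    ]


sampled = sample_parameters(15)
print(len(sampled), sampled[0])
```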

In [30]:
len(parameters)

15

In [31]:
# Use the same mlflow.projects.run call as above, wrapped in a dask.delayed function that takes its parameters from a dictionary
import dask
import mlflow

@dask.delayed
def train_model(parameters):
    # Set our tracking uri
    mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

    # Run the projects with our specified parameters
    mlflow.projects.run(
        # Specifies where the MLproject file lives
        './',
        # Running this on the main entry point
        entry_point='main',
        # Here is our Experiment Name.
        experiment_name='PotentialStartups',
        # Using the local environment
        env_manager='local',
        # Set our Desired parameters for our model
        parameters={
            'n_estimators': parameters['n_estimators'], 
            'max_depth': parameters['max_depth']
        })





In [32]:
results = []

# Append the results for each dictionary of parameters
for param in parameters:
    results.append(train_model(param))

# Compute it in parallel
dask.compute(results)

2024/03/04 20:15:42 INFO mlflow.projects.utils: === Created directory /var/folders/9j/mvcchftn0h7507fkdslzrz880000gn/T/tmp021sn9e7 for downloading remote URIs passed to arguments of type 'path' ===
2024/03/04 20:15:42 INFO mlflow.projects.backend.local: === Running command 'python train.py --n_estimators 11 --max_depth 5' in run with ID '3eec879f15364996be2a5c42a61d4c66' === 
2024/03/04 20:15:42 INFO mlflow.projects.utils: === Created directory /var/folders/9j/mvcchftn0h7507fkdslzrz880000gn/T/tmpmiuh79gh for downloading remote URIs passed to arguments of type 'path' ===
2024/03/04 20:15:42 INFO mlflow.projects.backend.local: === Running command 'python train.py --n_estimators 225 --max_depth 8' in run with ID 'ac9ce1b7efc24f97ab3b98ff79007c15' === 
2024/03/04 20:15:42 INFO mlflow.projects.utils: === Created directory /var/folders/9j/mvcchftn0h7507fkdslzrz880000gn/T/tmp7qp37bme for downloading remote URIs passed to arguments of type 'path' ===
2024/03/04 20:15:42 INFO mlflow.projects.ba

([None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None],)
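
The output above is a tuple holding a single list because the whole list of delayed objects was passed to `dask.compute` as one argument. The toy sketch below (a made-up `square` function standing in for `train_model`) shows the difference between passing the list and unpacking it:

```python
import dask


@dask.delayed
def square(x):
    # Stand-in for train_model; nothing runs until compute() is called.
    return x * x


tasks = [square(i) for i in range(4)]

# Passing the list as a single argument returns a 1-tuple containing a list.
as_list = dask.compute(tasks)

# Unpacking the list returns one result per task in a flat tuple.
unpacked = dask.compute(*tasks)

print(as_list)
print(unpacked)
```

In the workshop cell, `train_model` returns nothing, so every result is `None`. The interesting output is what each run logs to MLflow.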

In [33]:
# Show all the new ones
import mlflow
import pandas as pd

# Set Tracking URI
mlflow.set_tracking_uri("sqlite:///mydb.sqlite")

# Get the Experiment ID
experiment_id = mlflow.get_experiment_by_name("PotentialStartups").experiment_id

# Sort it by r2_score and show which parameters gave us the best result
evals_df = mlflow.search_runs([experiment_id], order_by=["metrics.r2_score DESC"])
evals_df[["metrics.r2_score", "run_id", "params.max_depth", "params.n_estimators"]]

Unnamed: 0,metrics.r2_score,run_id,params.max_depth,params.n_estimators
0,0.901237,b90feee140914382964a3a2fc6f1475f,3.0,50.0
1,0.900195,d9d294e775164c7681c2885028505e4d,4.0,100.0
2,0.896692,39c71e4e495f452388ca287da575accd,,
3,0.896692,d606a8159bfe4dbfa50ebb7c2edd1dd9,,
4,0.896692,9453994aaa764876b2759f74db7bb902,,
5,0.896692,70d86b1a31994f65835ad1034268b4d3,,
6,0.896692,a9e4c11d05b6499c8b4c75dd12963458,6.0,150.0
7,0.896687,2158ce5c59ff422a82f34169a331d5c9,6.0,51.0
8,0.896262,64e78e031ca145929b20f7ee6264cbf0,10.0,161.0
9,0.896262,676a3f2ea3c44c4ca158b9bd8a8d031e,10.0,186.0
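
Rather than copying the best `run_id` by hand, it can be read from the DataFrame. The sketch below rebuilds the first three rows of the table above so that it is self-contained. In the notebook you would apply the same two lines directly to `evals_df`.

```python
import pandas as pd

# First three rows of the table above, rebuilt by hand for this sketch.
evals_df = pd.DataFrame(
    {
        "metrics.r2_score": [0.901237, 0.900195, 0.896692],
        "run_id": [
            "b90feee140914382964a3a2fc6f1475f",
            "d9d294e775164c7681c2885028505e4d",
            "39c71e4e495f452388ca287da575accd",
        ],
    }
)

# idxmax() returns the index label of the highest r2_score,
# so this works whether or not the frame is already sorted.
best_row = evals_df.loc[evals_df["metrics.r2_score"].idxmax()]
best_run_id = best_row["run_id"]

# Same URI shape that is passed to mlflow.register_model.
model_uri = f"runs:/{best_run_id}/model"
print(model_uri)
```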


In [34]:
# Let's Register the "BEST" model
run_id = "b90feee140914382964a3a2fc6f1475f"

# Register the model
mlflow.register_model(f"runs:/{run_id}/model", "StartupModels")

Registered model 'StartupModels' already exists. Creating a new version of this model...
Created version '2' of model 'StartupModels'.


<ModelVersion: aliases=[], creation_timestamp=1709601378243, current_stage='None', description=None, last_updated_timestamp=1709601378243, name='StartupModels', run_id='b90feee140914382964a3a2fc6f1475f', run_link=None, source='/Users/weston/Documents/git-repos/mlflow-workshop/mlruns/1/b90feee140914382964a3a2fc6f1475f/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>

In [35]:
# Load and run prediction on BEST model
import mlflow

# Load Second version
model = mlflow.xgboost.load_model(model_uri="models:/StartupModels/2")
model

In [36]:
# print params
model.max_depth, model.n_estimators

(3, 50)

In [37]:
# R&D Spend, Administration, Marketing Spend, State
predict_list = [13345349.2, 200000.8, 100000.0, 0]
# Predict
prediction = model.predict([predict_list])
prediction[0]

183079.39