# MLflow with Recipes

In this second notebook, we take the same example as in the first notebook, but use MLflow Recipes to achieve the same result.

In [1]:
from mlflow.recipes import Recipe
import os

In [2]:
# Note: change the directory if you are not using a dev container.
# The working directory must be the src folder of the mlflow-training repo.
os.chdir("/workspaces/mlflow-training/src")

In [3]:
r = Recipe(profile="local")

2023/05/28 20:18:57 INFO mlflow.recipes.recipe: Creating MLflow Recipe 'src' with profile: 'local'


In [4]:
r.clean()

In [17]:
# For some reason, you might have to run this cell twice before it works
r.inspect()

## Ingest data

In [6]:
!cat steps/ingest.py

from pandas import DataFrame
import pandas as pd


def load_file_as_dataframe(file_path: str, file_format: str) -> DataFrame:
    """Load a csv file as a dataframe and add a column to indicate if the wine is red or white"""
    df = pd.read_csv(file_path, sep=";")
    df["is_red"] = 1 if "red" in str(file_path) else 0
    return df
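
One caveat about the `is_red` flag above: the substring test runs on the full path, so a directory whose name happens to contain "red" (for example `shared/`) would also set it. A minimal sketch of a stricter check, using a hypothetical helper `is_red_file` that is not part of the repo and an invented example path:

```python
from pathlib import Path


def is_red_file(file_path: str) -> int:
    """Hypothetical helper: return 1 only if the file name itself mentions 'red'."""
    return 1 if "red" in Path(file_path).name.lower() else 0


# Invented path for illustration: the directory "shared" contains the letters "red".
path = "/data/shared/winequality-white.csv"
print("red" in str(path))  # True: the whole-path test is fooled by "shared"
print(is_red_file(path))   # 0: only the file name is inspected
print(is_red_file("/data/shared/winequality-red.csv"))  # 1
```

With the paths used in this training repo the original check is presumably fine; the sketch only matters if the data is moved to a directory with an unlucky name.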


In [7]:
r.run("ingest")

2023/05/28 20:18:59 INFO mlflow.recipes.step: Running step ingest...


name,type
fixed acidity,number
volatile acidity,number
citric acid,number
residual sugar,number
chlorides,number
free sulfur dioxide,number
total sulfur dioxide,number
density,number
pH,number
sulphates,number

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red
7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


## Split data

In [8]:
!cat recipe.yaml

recipe: "regression/v1"
# Specifies the name of the column containing targets / labels for model training and evaluation
target_col: "quality"
# Sets the primary metric to use to evaluate model performance. This primary metric is used
# to sort MLflow Runs corresponding to the recipe in the MLflow Tracking UI
primary_metric: "root_mean_squared_error"
steps:
  ingest: {{INGEST_CONFIG}}
  split:
    # Train/validation/test split ratios
    split_ratios: [0.8, 0.1, 0.1]
    # Specifies the method to use to perform additional cleaning on split datasets
    # Note that arbitrary transformations should go into the transform step
    # post_split_filter_method: create_dataset_filter
  transform:
    using: custom
    # Specifies the method that defines the data transformations to apply during model inference
    transformer_method: transformer_fn
  train:
    using: custom
    # Specifies the method that defines the estimator type and parameters to use for model training
    estimator_method:

In [9]:
r.run("split")

2023/05/28 20:19:02 INFO mlflow.recipes.utils.execution: ingest: No changes. Skipping.


Run MLFlow Recipe step: split
2023/05/28 20:19:04 INFO mlflow.recipes.step: Running step split...
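
As a quick sanity check on `split_ratios: [0.8, 0.1, 0.1]`, the sketch below compares the requested ratios with the `example_count` values that the train and evaluate steps report further down in this notebook (5274, 605 and 618 rows). The variable names are illustrative.

```python
# Ratios requested in recipe.yaml
requested = {"training": 0.8, "validation": 0.1, "test": 0.1}
# example_count values reported by the train and evaluate steps in this notebook
counts = {"training": 5274, "validation": 605, "test": 618}

total = sum(counts.values())
for name, n in counts.items():
    realized = n / total
    print(f"{name}: requested {requested[name]:.2f}, realized {realized:.4f}")
```

The realized fractions are close to the requested ratios but not identical, which would be consistent with rows being assigned to splits individually rather than by exact count.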


## Transform data

In [10]:
!cat steps/transform.py

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

def transformer_fn():
    """
    Returns a Pipeline object that transforms the features
    """
    columns = [
        "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
        "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
        "pH", "sulphates", "alcohol", "is_red",
    ]
    return Pipeline(
        [
            (
                "ct",
                ColumnTransformer(
                    [
                        (
                            # Named "minmax" in the repo, but the scaler is a StandardScaler
                            "minmax",
                            StandardScaler(),
                            columns,
                        ),
                        
                    ]
                )
            )
        ]
    )


In [11]:
r.run("transform")

2023/05/28 20:19:06 INFO mlflow.recipes.utils.execution: ingest, split: No changes. Skipping.


Run MLFlow Recipe step: transform
2023/05/28 20:19:09 INFO mlflow.recipes.step: Running step transform...


Name,Type
fixed acidity,float64
volatile acidity,float64
citric acid,float64
residual sugar,float64
chlorides,float64
free sulfur dioxide,float64
total sulfur dioxide,float64
density,float64
pH,float64
sulphates,float64

Name,Type
minmax__fixed acidity,float64
minmax__volatile acidity,float64
minmax__citric acid,float64
minmax__residual sugar,float64
minmax__chlorides,float64
minmax__free sulfur dioxide,float64
minmax__total sulfur dioxide,float64
minmax__density,float64
minmax__pH,float64
minmax__sulphates,float64

minmax__fixed acidity,minmax__volatile acidity,minmax__citric acid,minmax__residual sugar,minmax__chlorides,minmax__free sulfur dioxide,minmax__total sulfur dioxide,minmax__density,minmax__pH,minmax__sulphates,minmax__alcohol,minmax__is_red,quality
0.144442,2.16553,-2.20935,-0.750147,0.565467,-1.095131,-1.443063,1.02262,1.811124,0.193591,-0.917378,1.746623,5
0.457628,3.251587,-2.20935,-0.604255,1.190508,-0.311143,-0.86182,0.691461,-0.114718,0.996949,-0.583414,1.746623,5
0.144442,2.16553,-2.20935,-0.750147,0.565467,-1.095131,-1.443063,1.02262,1.811124,0.193591,-0.917378,1.746623,5
0.144442,1.924184,-2.20935,-0.770988,0.537056,-0.983133,-1.337383,1.02262,1.811124,0.193591,-0.917378,1.746623,5
0.535925,1.562165,-1.79232,-0.812672,0.36659,-0.871135,-1.002728,0.558998,0.506521,-0.475873,-0.917378,1.746623,5
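
Despite the step name `minmax`, the transformer is a `StandardScaler`, which rescales each column to zero mean and unit variance. A minimal sketch of that arithmetic on a single column, using only the standard library and the five `fixed acidity` values from the ingest preview. The recipe fits the scaler on the full training split, so these numbers will not match the table above.

```python
from statistics import fmean, pstdev

# fixed acidity values from the five-row ingest preview
values = [7.4, 7.8, 7.8, 11.2, 7.4]

mean = fmean(values)
std = pstdev(values)  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: subtract the column mean, divide by the column standard deviation
scaled = [(v - mean) / std for v in values]
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, which is why the transformed table contains both negative and positive values rather than values confined to [0, 1].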


## Train model

In [12]:
!cat steps/train.py

from typing import Any, Dict, Optional
from sklearn.linear_model import LinearRegression


def estimator_fn(estimator_params: Optional[Dict[str, Any]] = None):
    """Return an unfitted LinearRegression built from the optional parameter dict."""
    if estimator_params is None:
        estimator_params = {}
    return LinearRegression(**estimator_params)


In [13]:
r.run("train")

2023/05/28 20:19:10 INFO mlflow.recipes.utils.execution: ingest, split, transform: No changes. Skipping.


Run MLFlow Recipe step: train
2023/05/28 20:19:13 INFO mlflow.recipes.step: Running step train...
2023/05/28 20:19:32 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/05/28 20:19:33 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.


Metric,training,validation
root_mean_squared_error,0.731095,0.737368
rounded_root_mean_squared_error,0.792934,0.798759
example_count,5274.0,605.0
max_error,3.83997,3.34092
mean_absolute_error,0.567851,0.567391
mean_absolute_percentage_error,0.100909,0.103154
mean_on_target,5.82537,5.76198
mean_squared_error,0.534499,0.543711
r2_score,0.297049,0.262001
score,0.297049,0.262001

Name,Type
fixed acidity,double
volatile acidity,double
citric acid,double
residual sugar,double
chlorides,double
free sulfur dioxide,double
total sulfur dioxide,double
density,double
pH,double
sulphates,double

Name,Type
-,"Tensor('float64', (-1,))"

absolute_error,prediction,quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,is_red
3.839969,6.839969,3,6.1,0.26,0.25,2.9,0.047,289.0,440.0,0.99314,3.44,0.64,10.5,0
3.557812,6.557812,3,9.4,0.24,0.29,8.5,0.037,124.0,208.0,0.99395,2.9,0.38,11.0,0
3.35442,6.35442,3,6.9,0.39,0.4,4.6,0.022,5.0,19.0,0.9915,3.31,0.37,12.6,0
3.152986,6.152986,3,6.8,0.26,0.34,15.1,0.06,42.0,162.0,0.99705,3.24,0.52,10.5,0
3.001982,5.998018,9,9.1,0.27,0.45,10.6,0.035,28.0,124.0,0.997,3.2,0.46,10.4,0
2.937773,5.937773,3,6.2,0.23,0.35,0.7,0.051,24.0,111.0,0.9916,3.37,0.43,11.0,0
2.923974,5.923974,3,8.5,0.26,0.21,16.2,0.074,41.0,197.0,0.998,3.02,0.5,9.8,0
2.902994,5.902994,3,7.1,0.49,0.22,2.0,0.047,146.5,307.5,0.9924,3.24,0.37,11.0,0
2.80896,5.80896,3,10.4,0.44,0.42,1.5,0.145,34.0,48.0,0.99832,3.38,0.86,9.9,1
2.806097,5.806097,3,6.1,0.2,0.34,9.5,0.041,38.0,201.0,0.995,3.14,0.44,10.1,0

Unnamed: 0,Latest
Model Rank,> 0
root_mean_squared_error,0.737368
rounded_root_mean_squared_error,0.798759
max_error,3.34092
mean_absolute_error,0.567391
mean_absolute_percentage_error,0.103154
mean_squared_error,0.543711
Run Time,2023-05-28 20:19:14
Run ID,135f6c36fcb24402b4014d3259c395dd
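
The metrics in the train output are related: `root_mean_squared_error` is the square root of `mean_squared_error`. A quick consistency check using the values reported in the metrics table above (the dictionary layout is illustrative):

```python
import math

# Values copied from the train step's metrics table
reported = {
    "training": {"mse": 0.534499, "rmse": 0.731095},
    "validation": {"mse": 0.543711, "rmse": 0.737368},
}

for split, m in reported.items():
    recomputed = math.sqrt(m["mse"])
    print(f"{split}: sqrt(MSE) = {recomputed:.6f}, reported RMSE = {m['rmse']:.6f}")
```

An RMSE of roughly 0.73 on a quality score means the linear model is off by about three quarters of a quality point on average, in line with the modest `r2_score` values in the same table.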


In [14]:
r.run("evaluate")

2023/05/28 20:19:36 INFO mlflow.recipes.utils.execution: ingest, split, transform, train: No changes. Skipping.


Run MLFlow Recipe step: evaluate
2023/05/28 20:19:38 INFO mlflow.recipes.step: Running step evaluate...
2023/05/28 20:19:40 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/05/28 20:19:40 INFO mlflow.models.evaluation.default_evaluator: Shap explainer _PatchedKernelExplainer is used.

  0%|          | 0/10 [00:00<?, ?it/s]
2023/05/28 20:19:41 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.


Metric,validation,test
root_mean_squared_error,0.737368,0.741104
rounded_root_mean_squared_error,0.798759,0.789289
example_count,605.0,618.0
max_error,3.34092,2.880599
mean_absolute_error,0.567391,0.582237
mean_absolute_percentage_error,0.103154,0.104537
mean_on_target,5.76198,5.813916
mean_squared_error,0.543711,0.549235
r2_score,0.262001,0.31512
score,0.262001,0.31512

metric,greater_is_better,value,threshold,validated
root_mean_squared_error,False,0.741104,1,✅
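
The last table is the validation check: the test RMSE (0.741104) is compared with a threshold of 1, and because `greater_is_better` is False, a lower value counts as better. A minimal sketch of that comparison, assuming an inclusive bound; `passes_threshold` is a hypothetical function, not MLflow's implementation:

```python
def passes_threshold(value: float, threshold: float, greater_is_better: bool) -> bool:
    """Hypothetical check: compare in whichever direction counts as 'better'."""
    if greater_is_better:
        return value >= threshold
    return value <= threshold


# Values from the validation table above
print(passes_threshold(0.741104, 1.0, greater_is_better=False))  # True
# A made-up worse model would fail the same check
print(passes_threshold(1.2, 1.0, greater_is_better=False))       # False
```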


In [15]:
r.run("register")

2023/05/28 20:19:42 INFO mlflow.recipes.utils.execution: ingest, split, transform, train, evaluate: No changes. Skipping.


Run MLFlow Recipe step: register
2023/05/28 20:19:44 INFO mlflow.recipes.step: Running step register...
Registered model 'red_wine_scorer' already exists. Creating a new version of this model...
2023/05/28 20:19:44 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: red_wine_scorer, version 5
Created version '5' of model 'red_wine_scorer'.


## Predict with trained model

### Predict with batch inference

In [16]:
r.run("predict")

Run MLFlow Recipe step: ingest_scoring
2023/05/28 20:19:47 INFO mlflow.recipes.step: Running step ingest_scoring...
Run MLFlow Recipe step: predict
2023/05/28 20:19:51 INFO mlflow.recipes.step: Running step predict...
2023/05/28 20:19:52 INFO mlflow.recipes.steps.predict: Creating new spark session
:: loading settings :: url = jar:file:/home/nonroot/.local/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/nonroot/.ivy2/cache
The jars for the packages stored in: /home/nonroot/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-82713134-073c-493c-a29e-55be2c74494f;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.2.1 in central
	found io.delta#delta-storage;1.2.1 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
downloading https://repo1.maven.org/maven2/io/d

### Predict in real time

We can also use the MLflow model to make predictions in real time. To do so, we need to:
1. run an MLflow server to distribute the model (as in notebook 01)
2. create a serving endpoint that pulls the model from the MLflow server
3. query the model in real time using `curl`
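
Step 3 can also be done from Python instead of `curl`. The sketch below uses only the standard library and assumes the serving endpoint from step 2 listens on port 5002, as in the commands shown later in this notebook. `build_request` is a hypothetical helper, and the record is the first row of the ingest preview without the `quality` target.

```python
import json
import urllib.request


def build_request(records, url="http://0.0.0.0:5002/invocations"):
    """Build a POST request in the `dataframe_records` format used by the curl command."""
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


record = {
    "fixed acidity": 7.4, "volatile acidity": 0.7, "citric acid": 0.0,
    "residual sugar": 1.9, "chlorides": 0.076, "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0, "density": 0.9978, "pH": 3.51,
    "sulphates": 0.56, "alcohol": 9.4, "is_red": 1,
}
req = build_request([record])
print(req.get_method(), req.full_url)

# Uncomment once the model server from step 2 is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```

Building the request separately from sending it makes the payload easy to inspect before the server is up.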

In [33]:
print("Please copy the command below into a new terminal in your IDE \n")

print("mlflow server \\")
print("    --backend-store-uri sqlite:///src/metadata/mlflow/mlruns.db \\")
print("    --default-artifact-root ./src/metadata/mlflow/mlartifacts \\")
print("    --host 0.0.0.0 \\")
print("    --port 5000")

Please copy the command below into a new terminal in your IDE 

mlflow server \
    --backend-store-uri sqlite:///src/metadata/mlflow/mlruns.db \
    --default-artifact-root ./src/metadata/mlflow/mlartifacts \
    --host 0.0.0.0 \
    --port 5000


In [34]:
# Retrieve the MLflow run produced by the train step; its ID identifies the model to serve
run = r.get_artifact("run")

print("Please copy the command below into a new terminal in your IDE \n")

print("MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \\")
print("      --host=0.0.0.0 \\")
print("      --port=5002 \\")
print("      --env-manager=local \\")
print(f"      --model-uri runs:/{run.info.run_id}/train/model/")

Please copy the command below into a new terminal in your IDE 

MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow models serve \
      --host=0.0.0.0 \
      --port=5002 \
      --env-manager=local \
      --model-uri runs:/135f6c36fcb24402b4014d3259c395dd/train/model/


In [35]:
test_data = r.get_artifact("test_data")

print("You can copy the command below into one of your terminals \n")

request_data = test_data.iloc[0:4].to_json(orient="records")
print("""curl http://0.0.0.0:5002/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": """ + request_data + """}'""")

You can copy the command below into one of your terminals 

curl http://0.0.0.0:5002/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": [{"fixed acidity":7.8,"volatile acidity":0.76,"citric acid":0.04,"residual sugar":2.3,"chlorides":0.092,"free sulfur dioxide":15.0,"total sulfur dioxide":54.0,"density":0.997,"pH":3.26,"sulphates":0.65,"alcohol":9.8,"quality":5,"is_red":1},{"fixed acidity":7.6,"volatile acidity":0.39,"citric acid":0.31,"residual sugar":2.3,"chlorides":0.082,"free sulfur dioxide":23.0,"total sulfur dioxide":71.0,"density":0.9982,"pH":3.52,"sulphates":0.65,"alcohol":9.7,"quality":5,"is_red":1},{"fixed acidity":6.3,"volatile acidity":0.39,"citric acid":0.16,"residual sugar":1.4,"chlorides":0.08,"free sulfur dioxide":11.0,"total sulfur dioxide":23.0,"density":0.9955,"pH":3.34,"sulphates":0.56,"alcohol":9.3,"quality":5,"is_red":1},{"fixed acidity":7.5,"volatile acidity":0.49,"citric acid":0.2,"residual sugar":2.6,"chlorides":0.332,"free sulfur dioxide":8

## To Go Further

You can try using `flaml` to search for a better-performing model.