
# Model Management

An MLflow **`pyfunc`** allows for fully customizable deployments.

 - Model management best practices
 - Build a model with preprocessing logic, a loader module, side artifacts, a training method, and a custom environment
 - Apply the custom ML model


### Managing Machine Learning Models

Once a model has been trained and bundled with the environment it was trained in, the next step is to package the model so that it can be used by a variety of serving tools. **Packaging the final model in a platform-agnostic way offers the most flexibility in deployment options and allows for model reuse across a number of platforms.**



**MLflow models is a convention for packaging machine learning models that offers self-contained code, environments, and models.**<br><br>

* The main abstraction in this package is the concept of **flavors**
  - A flavor is one of the different ways a model can be used
  - For instance, a TensorFlow model can be loaded as a TensorFlow DAG or as a Python function
  - Using an MLflow model convention allows for both of these flavors
* The `python_function` or `pyfunc` flavor of models gives a generic way of bundling models
* We can thereby deploy a python function without worrying about the underlying format of the model

**MLflow therefore maps any training framework to any deployment** using these platform-agnostic representations, massively reducing the complexity of inference.
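
To make the idea of flavors concrete, the sketch below mimics, as a plain Python dictionary, the kind of metadata MLflow records alongside a saved scikit-learn model. The field values, the file name `model.pkl`, and the `choose_flavor` helper are illustrative assumptions, not output copied from a real run.

```python
# Illustrative stand-in for the flavor metadata stored with a saved model.
# All values here are made up for the example.
mlmodel_metadata = {
    "flavors": {
        "python_function": {
            "loader_module": "mlflow.sklearn",
            "python_version": "3.9.5",
        },
        "sklearn": {
            "pickled_model": "model.pkl",
            "sklearn_version": "1.0.2",
        },
    }
}


def choose_flavor(metadata, preferred):
    """Return the first flavor in `preferred` that the model supports (hypothetical helper)."""
    available = metadata["flavors"]
    for name in preferred:
        if name in available:
            return name
    raise ValueError(f"None of {preferred} found; available: {sorted(available)}")


# A generic serving tool only needs the python_function flavor...
generic_tool_flavor = choose_flavor(mlmodel_metadata, ["python_function"])
# ...while a scikit-learn-aware tool can ask for the native flavor first.
native_tool_flavor = choose_flavor(mlmodel_metadata, ["sklearn", "python_function"])
print(generic_tool_flavor, native_tool_flavor)
```

Because every model exposes the `python_function` flavor, a deployment tool that understands only that flavor can still serve the model.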

Arbitrary logic, including pre- and post-processing steps, code executed when loading the model, side artifacts, and more, can be included in the pipeline to customize it as needed.  This means that the full pipeline, not just the model, can be preserved as a single object that works with the rest of the MLflow ecosystem.

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlflow-models-enviornments.png" style="height: 400px; margin: 20px"/></div>


Some of the most popular built-in flavors include the following:<br><br>

* <a href="https://mlflow.org/docs/latest/python_api/mlflow.keras.html#module-mlflow.keras" target="_blank">mlflow.keras</a>
* <a href="https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#module-mlflow.sklearn" target="_blank">mlflow.sklearn</a>
* <a href="https://mlflow.org/docs/latest/python_api/mlflow.spark.html#module-mlflow.spark" target="_blank">mlflow.spark</a>

<a href="https://mlflow.org/docs/latest/python_api/index.html" target="_blank">You can see all of the flavors and modules here.</a>

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlflow-models.png" style="height: 400px; margin: 20px"/></div>



### Custom Models using `pyfunc`

A **`pyfunc`** is a generic Python model that can define any arbitrary logic, regardless of the libraries used to train it. **This object interoperates with any MLflow functionality, especially downstream scoring tools.**  As such, it's defined as a class with a related directory structure that contains all of its dependencies.  It is then "just an object" with various methods, such as a predict method.  Since it makes very few assumptions, it can be deployed using MLflow, SageMaker, a Spark UDF, or in any other environment.

Check out <a href="https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-create-custom" target="_blank">the **`pyfunc`** documentation for details</a><br>
Check out <a href="https://github.com/mlflow/mlflow/blob/master/docs/source/models.rst#example-saving-an-xgboost-model-in-mlflow-format" target="_blank">this README for generic example code and integration with **`XGBoost`**</a><br>
Check out <a href="https://mlflow.org/docs/latest/models.html#example-creating-a-custom-add-n-model" target="_blank">this example that creates a basic class that adds **`n`** to the input values</a>
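
As a minimal sketch of the idea, the class below follows the spirit of the add-**`n`** example. It assumes inputs are plain Python lists rather than DataFrames, and it falls back to `object` as the base class when MLflow is not installed, purely so the snippet runs anywhere; in real use the model would subclass `mlflow.pyfunc.PythonModel`.

```python
try:
    from mlflow.pyfunc import PythonModel  # real base class when MLflow is available
except ImportError:
    PythonModel = object  # assumption: stand-in so the sketch runs without MLflow


class AddN(PythonModel):
    """Toy model that adds a constant `n` to every input value."""

    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        # `context` is unused here; MLflow supplies it when serving the model
        return [value + self.n for value in model_input]


add5 = AddN(n=5)
result = add5.predict(context=None, model_input=[1, 2, 3])
print(result)
```

The only contract is the `predict(context, model_input)` method; everything inside it is ordinary Python.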


Possible preprocessing steps:

1. Create an additional feature
2. Enforce the proper data types

When creating predictions from our model, we need to re-apply the same pre-processing logic to the data each time we use our model. 

To streamline the steps, we define a custom **`RFWithPreprocess`** model class that uses a **`preprocess_input(self, model_input)`** helper method to automatically pre-process the raw input it receives before executing a custom **`fit()`** method or before passing that input into the trained model's **`.predict()`** function. This way, in future applications of our model, we will no longer have to handle arbitrary pre-processing logic for every batch of data.
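
Here is a hedged sketch of what the two preprocessing steps might look like for the wine data. The derived column `sulfur_dioxide_ratio` and the choice of `float64`/`int64` are assumptions made for illustration; they are not part of the original model class.

```python
import pandas as pd


def preprocess_input(model_input: pd.DataFrame) -> pd.DataFrame:
    """Return a preprocessed copy of `model_input` without mutating the caller's data."""
    processed = model_input.copy()

    # Step 1: create an additional feature (hypothetical ratio of two existing columns)
    processed["sulfur_dioxide_ratio"] = (
        processed["free_sulfur_dioxide"] / processed["total_sulfur_dioxide"]
    )

    # Step 2: enforce the proper data types
    processed = processed.astype(
        {
            "free_sulfur_dioxide": "float64",
            "total_sulfur_dioxide": "float64",
            "is_red": "int64",
        }
    )
    return processed


raw = pd.DataFrame(
    {
        "free_sulfur_dioxide": [16, 21],   # integers on purpose
        "total_sulfur_dioxide": [46, 74],
        "is_red": [1.0, 0.0],              # floats on purpose
    }
)
processed = preprocess_input(raw)
print(processed.dtypes)
```

Working on a copy keeps the caller's DataFrame untouched, so the same raw batch can be passed to both `fit()` and `predict()`.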


Import the data.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

white_wine = pd.read_csv("/dbfs/databricks-datasets/wine-quality/winequality-white.csv", sep=";")
red_wine = pd.read_csv("/dbfs/databricks-datasets/wine-quality/winequality-red.csv", sep=";")

red_wine['is_red'] = 1
white_wine['is_red'] = 0

data = pd.concat([red_wine, white_wine], axis=0)

# Remove spaces from column names
data.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)

high_quality = (data.quality >= 7).astype(int)
data.quality = high_quality

data.reset_index(drop=True,inplace=True)
data.dropna(inplace=True)

In [0]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  is_red                6497 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 660.0 KB


In [0]:
train, test = train_test_split(data, random_state=123)
X_train = train.drop(["quality"], axis=1)
X_test = test.drop(["quality"], axis=1)
y_train = train.quality
y_test = test.quality


No preprocessing is strictly necessary for this data, but what happens when the same preprocessing code has to be repeated for every batch? And what happens when we need to replicate it in a deployment system? Take a look at the following code instead.

In [0]:
import mlflow
import json

class RFWithPreprocess(mlflow.pyfunc.PythonModel):

    def __init__(self, params):
        """
        Initialize with just the model hyperparameters
        """
        self.params = params
        self.rf_model = None
        self.config = None
        
    def load_context(self, context=None, config_path=None):
        """
        When loading a pyfunc, this method runs automatically with the related
        context.  This method is designed to perform the same functionality when
        run in a notebook or a downstream operation (like a REST endpoint).
        If the `context` object is provided, it will load the path to a config from 
        that object (this happens when `mlflow.pyfunc.load_model()` is called).
        If the `config_path` argument is provided instead, it uses this argument
        in order to load in the config.
        """
        if context:  # Server run: MLflow supplies the artifact paths
            config_path = context.artifacts["config_path"]
        # Otherwise (notebook run), use the config_path argument as given

        # Use a context manager so the file handle is closed after reading
        with open(config_path) as f:
            self.config = json.load(f)
      
    def preprocess_input(self, model_input):
        """
        Return pre-processed model_input
        """
        processed_input = model_input.copy()
        # Placeholder: apply all preprocessing steps to processed_input here
        print("Data Preprocesed")
        return processed_input
  
    def fit(self, X_train, y_train):
        """
        Uses the same preprocessing logic to fit the model
        """
        from sklearn.ensemble import RandomForestClassifier

        processed_model_input = self.preprocess_input(X_train)
        rf_model = RandomForestClassifier(**self.params)
        rf_model.fit(processed_model_input, y_train)

        self.rf_model = rf_model
    
    def predict(self, context, model_input):
        """
        This is the main entrance to the model in deployment systems
        """
        processed_model_input = self.preprocess_input(model_input.copy())
        return self.rf_model.predict(processed_model_input)


The **`context`** parameter is provided automatically by MLflow in downstream tools. This can be used to add custom dependent objects such as models that are not easily serialized (e.g. **`keras`** models) or custom configuration files.
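
The sketch below imitates how a model might read a config file through `context.artifacts`. `SimpleNamespace` is used as a stand-in for the context object that MLflow passes in, and `ConfiguredModel` and the temporary file are hypothetical names introduced for the example.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class ConfiguredModel:
    """Hypothetical model that loads its settings from an artifact path."""

    def __init__(self):
        self.config = None

    def load_context(self, context):
        # `context.artifacts` maps artifact names to local file paths
        with open(context.artifacts["config_path"]) as f:
            self.config = json.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = os.path.join(tmp_dir, "config.json")
    with open(config_file, "w") as f:
        json.dump({"n_estimators": 15, "max_depth": 5}, f)

    # Stand-in for the context MLflow builds from the `artifacts` dictionary
    fake_context = SimpleNamespace(artifacts={"config_path": config_file})

    configured = ConfiguredModel()
    configured.load_context(fake_context)

print(configured.config)
```

The model never hard-codes a file location; it only knows the artifact key, which is what lets the same class work after the files are copied to a new directory.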

Use the following to provide a config file. Note the steps:<br><br>

- Save out any file we want to load into the class
- Create an artifact dictionary of key/value pairs where the value is the path to that object
- When saving the model, all artifacts will be copied over into the same directory for downstream use

In our case, we'll save some model hyperparameters as our config.

In [0]:
import json 
import os

params = {
    "n_estimators": 15, 
    "max_depth": 5
}

# Designate a path
config_path = "/data.json"

# Save the results
with open(config_path, "w") as f:
    json.dump(params, f)

# Generate an artifact dictionary to be saved
# All files at the paths given as values will be copied over when saving
artifacts = {"config_path": config_path}


Instantiate the class. Run **`load_context`** to load the config. This automatically runs in downstream serving tools.

In [0]:
model = RFWithPreprocess(params)

# Run manually (this happens automatically in serving integrations)
model.load_context(config_path=config_path) 

# Confirm the config has loaded
model.config

Out[26]: {'n_estimators': 15, 'max_depth': 5}


Train the model. Note that this runs the preprocessing logic for us automatically.

In [0]:
model.fit(X_train, y_train)

Data Preprocesed



Generate predictions.

In [0]:
predictions = model.predict(context=None, model_input=X_test)
predictions

Data Preprocesed
Out[28]: array([0, 0, 0, ..., 0, 0, 0])


Generate the model signature.

In [0]:
X_test.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | is_red |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1321 | 5.0 | 0.74 | 0.0 | 1.2 | 0.041 | 16.0 | 46.0 | 0.99258 | 4.01 | 0.59 | 12.5 | 1 |
| 2767 | 7.2 | 0.2 | 0.38 | 1.0 | 0.037 | 21.0 | 74.0 | 0.9918 | 3.21 | 0.37 | 11.0 | 0 |
| 5069 | 6.7 | 0.24 | 0.3 | 3.85 | 0.042 | 105.0 | 179.0 | 0.99189 | 3.04 | 0.59 | 11.3 | 0 |
| 5780 | 6.6 | 0.25 | 0.32 | 5.6 | 0.039 | 15.0 | 68.0 | 0.99163 | 2.96 | 0.52 | 11.1 | 0 |
| 547 | 10.6 | 0.31 | 0.49 | 2.5 | 0.067 | 6.0 | 21.0 | 0.9987 | 3.26 | 0.86 | 10.7 | 1 |


In [0]:
from mlflow.models.signature import infer_signature

signature = infer_signature(X_test, predictions)
signature



Out[35]: inputs: 
  ['fixed_acidity': double, 'volatile_acidity': double, 'citric_acid': double, 'residual_sugar': double, 'chlorides': double, 'free_sulfur_dioxide': double, 'total_sulfur_dioxide': double, 'density': double, 'pH': double, 'sulphates': double, 'alcohol': double, 'is_red': long]
outputs: 
  [Tensor('int64', (-1,))]
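
The printed schema names column types as `double` and `long` rather than as pandas dtypes. The sketch below shows that correspondence for the two dtypes in this dataset; the `PANDAS_TO_SCHEMA_TYPE` table and the `describe_columns` helper are simplified illustrations, not MLflow's actual inference code.

```python
# Simplified lookup covering only the dtypes that appear in the wine data
PANDAS_TO_SCHEMA_TYPE = {
    "float64": "double",
    "int64": "long",
}


def describe_columns(column_dtypes):
    """Map {column name: pandas dtype name} to {column name: schema type name}."""
    schema = {}
    for column, dtype in column_dtypes.items():
        if dtype not in PANDAS_TO_SCHEMA_TYPE:
            raise TypeError(f"No mapping defined for dtype {dtype!r} (column {column!r})")
        schema[column] = PANDAS_TO_SCHEMA_TYPE[dtype]
    return schema


schema = describe_columns({"alcohol": "float64", "is_red": "int64"})
print(schema)
```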




Generate the conda environment, which can be arbitrarily complex. When we use **`mlflow.sklearn`**, the appropriate version of **`sklearn`** is logged automatically. With a **`pyfunc`**, we must construct the deployment environment manually.

In [0]:
from sys import version_info
import sklearn

conda_env = {
    "channels": ["defaults"],
    "dependencies": [
        f"python={version_info.major}.{version_info.minor}.{version_info.micro}",
        "pip",
        {"pip": ["mlflow",
                 f"scikit-learn=={sklearn.__version__}"]
        },
    ],
    "name": "sklearn_env"
}

conda_env

Out[31]: {'channels': ['defaults'],
 'dependencies': ['python=3.9.5',
  'pip',
  {'pip': ['mlflow', 'scikit-learn==1.0.2']}],
 'name': 'sklearn_env'}
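
Pinning every package by hand gets tedious as the dependency list grows. One possible helper, sketched here with the standard library's `importlib.metadata`, pins whatever version is installed locally and leaves a package unpinned if it cannot be found. `build_conda_env` and its arguments are hypothetical names, not part of MLflow.

```python
from importlib import metadata
from sys import version_info


def build_conda_env(pip_packages, name="custom_env"):
    """Build a conda environment dict, pinning each pip package to its installed version."""
    pinned = []
    for package in pip_packages:
        try:
            pinned.append(f"{package}=={metadata.version(package)}")
        except metadata.PackageNotFoundError:
            # Not installed locally: leave it unpinned rather than guess a version
            pinned.append(package)

    return {
        "name": name,
        "channels": ["defaults"],
        "dependencies": [
            f"python={version_info.major}.{version_info.minor}.{version_info.micro}",
            "pip",
            {"pip": pinned},
        ],
    }


env = build_conda_env(["mlflow", "scikit-learn"])
print(env)
```

The resulting dictionary has the same shape as the hand-written `conda_env`, so it could be passed to the same `conda_env` argument when logging the model.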


Save the model.

In [0]:
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        "rf_preprocessed_model",
        python_model=model,
        artifacts=artifacts,
        conda_env=conda_env,
        signature=signature,
        input_example=X_test[:3],
    )


Load the model in **`python_function`** format.

In [0]:
mlflow_pyfunc_model_path = f"runs:/{run.info.run_id}/rf_preprocessed_model"
loaded_preprocess_model = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path)


Apply the model.

In [0]:
loaded_preprocess_model.predict(X_test)


Data Preprocesed
Out[34]: array([0, 0, 0, ..., 0, 0, 0])


**Note that `pyfunc` models interoperate with any downstream serving tool and allow you to use arbitrary code, niche libraries, and complex side information.**