# MLflow Core Concepts

In this notebook you will build a model to predict the quality score of a wine from its physicochemical measurements. See [Cortez et al., 2009](http://www3.dsi.uminho.pt/pcortez/wine/) for more detail about the dataset.

The goal of the notebook is to walk through the steps of putting an ML model into production:
* ingest the data
* split the data into training, validation, and test sets
* transform the data for the model
* train and evaluate the model
* store the model
* use the model above to predict on some new data (in batch or real-time)

This notebook aims to show the end-to-end flow rather than go deep into any single step.

Throughout this notebook there are tasks for you to complete. You can find them in the code by searching for `# ToDo#: ...`, where `#` stands for the task number.

In this notebook you will be asked to:
* ToDo1: add a column to the data frame that indicates if the wine is red or white
* ToDo2: separate the target variable from the features
* ToDo3: fit the preprocessing pipeline on the training data and transform the validation and test data
* ToDo4: log the model and the preprocessing pipeline
* ToDo5: log metrics to mlflow
* ToDo6: log parameters to mlflow
* ToDo7: open the MLflow UI, find your logged model, register it, and set the model stage to production
* ToDo8: load the model from mlflow and make a prediction on the test data
* ToDo9: set the model uri to the model you just registered
* ToDo10: [To Go Further] rebuild a model using sklearn pipeline, log it to mlflow and deploy a serving endpoint

If you need help you can browse through the following documentation:
* [MLflow](https://mlflow.org/docs/latest/index.html)
* [scikit-learn](https://scikit-learn.org/stable/)
* [pandas](https://pandas.pydata.org/docs/)

In [1]:
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import os
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.base import RegressorMixin
from sklearn.base import BaseEstimator


In [3]:
# setup mlflow
from utils import setup_mlflow

setup_mlflow(
    experiment_name="wine_score_notebook",
)


2025/04/02 23:07:41 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/04/02 23:07:41 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

True

## Ingest data

In [4]:
red_df = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    sep=";",
)

white_df = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
    sep=";",
)
# ToDo1: add a column to the data frame that indicates if the wine is red or white
...

df = pd.concat([red_df, white_df])
df


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7


## Split data

We want to split the data in the following proportions:
- 80% training
- 10% validation
- 10% test
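
The two-step split used in the next cell can be sanity-checked on a toy list. This is a minimal sketch, not part of the exercise: `rows`, `train_rows`, `holdout_rows`, `val_rows` and `test_rows` are illustrative names. Holding out 20% first and then halving that hold-out gives the 80/10/10 proportions.

```python
from sklearn.model_selection import train_test_split

# 100 dummy row ids stand in for the wine data (illustrative only).
rows = list(range(100))

# Step 1: keep 80% for training, hold out 20%.
train_rows, holdout_rows = train_test_split(rows, test_size=0.2, random_state=42)

# Step 2: split the hold-out in half, giving 10% validation and 10% test.
val_rows, test_rows = train_test_split(holdout_rows, test_size=0.5, random_state=42)

print(len(train_rows), len(val_rows), len(test_rows))
```

Fixing `random_state` makes the split reproducible between runs, which keeps the logged metrics comparable.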

In [None]:
# ToDo2: separate the target variable from the features
y = ...
X = ...

X_train, X_test_val, y_train, y_test_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_test_val, y_test_val, test_size=0.5, random_state=42
)


## Transform data

Preprocess the features by removing the mean and scaling to unit variance (standardization).
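
As a minimal sketch of what standardization does, the toy example below fits a `StandardScaler` on training values only and reuses the learned mean and standard deviation on a validation value. The `_toy` arrays and the `scaler` variable are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative, not the wine dataset).
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_val_toy = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
# and applies (x - mean) / std to it.
X_train_scaled = scaler.fit_transform(X_train_toy)

# transform reuses the training statistics; nothing is learned from validation data.
X_val_scaled = scaler.transform(X_val_toy)

print(scaler.mean_, scaler.scale_)
print(X_train_scaled.ravel(), X_val_scaled.ravel())
```

Fitting on the training set only avoids leaking information from the validation and test sets into the model.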

In [None]:
preprocessing_pipeline = Pipeline(
    [
        (
            "ct",
            ColumnTransformer(
                [
                    (
                        "standard_scaler",
                        StandardScaler(),
                        X_train.columns,
                    ),
                ]
            ),
        )
    ]
)

# ToDo3: fit the preprocessing pipeline on the training data and transform the validation and test data
X_train_processed = preprocessing_pipeline.fit_transform(X_train)
X_val_processed = preprocessing_pipeline.transform(X_val)
X_test_processed = preprocessing_pipeline.transform(X_test)


## Train model

In [None]:
model = LinearRegression()


In [None]:
def log_metrics(
    model: RegressorMixin, X: pd.DataFrame, y: pd.Series, suffix: str = "test"
) -> dict:
    """Log model performance on a dataset"""
    y_pred = model.predict(X)
    mae = mean_absolute_error(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    metrics = {
        f"{suffix}.mean_absolute_error": mae,
        f"{suffix}.mean_squared_error": mse,
        f"{suffix}.r2_score": r2,
    }
    # ToDo5: log metrics to mlflow
    ...
    return metrics


In [None]:
def log_parameters(
    model: BaseEstimator,
) -> dict:
    """Log parameters of interest of the model"""
    model_params = model.get_params()

    # ToDo6: log parameters to mlflow
    ...
    return model_params


In [None]:

with mlflow.start_run() as run:
    model.fit(X_train_processed, y_train.values)
    # ToDo4: log the model and the preprocessing pipeline
    ...

    # ToDo5: log metrics to mlflow (see above)
    ...

    # ToDo6: log parameters to mlflow (see above)
    ...

    # Note: we store the run id to be able to retrieve the run later
    mlflow_run_id = run.info.run_id


In [None]:
print(
    "Please copy the command below into a new terminal in your IDE and let it run until the end of the notebook \n"
)

print("mlflow server \\")
print("    --backend-store-uri sqlite:///metadata/mlflow/mlruns.db \\")
print("    --default-artifact-root ./metadata/mlflow/mlartifacts \\")
print("    --host 0.0.0.0 \\")
print("    --port 8080")

# ToDo7: open the MLflow UI, find your logged model, register it, and set the model stage to production
# Note: the MLflow UI is available at http://localhost:8080/ or http://0.0.0.0:8080/ in your browser


## Predict with trained model

### Predict on batch inference

In [None]:
# ToDo8: load the model from mlflow and make a prediction on the test data
loaded_model = ...
predictions = ...


### Predict in real time

We can also use the MLflow model to make predictions in real time. To do so we need to:
1. run an MLflow server to distribute the model (already done above)
2. create a serving endpoint that pulls the model from the MLflow server
3. query the model in real time using `curl`
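
The request sent in the last step is a JSON document with a `dataframe_records` key holding one object per row. The sketch below builds such a payload using only the standard library. The feature values and the `query_endpoint` helper are hypothetical, and the helper is not called here because it needs the serving endpoint to be running.

```python
import json
import urllib.request

# Two made-up rows keyed by column position, mirroring DataFrame.to_json(orient="records").
feature_rows = [
    {"0": 0.1, "1": -1.2},
    {"0": 0.5, "1": 0.3},
]
payload = json.dumps({"dataframe_records": feature_rows})
print(payload)


def query_endpoint(url: str, body: str) -> dict:
    """POST a JSON body to a serving endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example call once the endpoint from this section is up:
# query_endpoint("http://0.0.0.0:8081/invocations", payload)
```

In a real request, each row must contain the same columns the model was trained on, in the same order.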

In [None]:
print("Please copy the command below into a new terminal in your IDE \n")

print("MLFLOW_TRACKING_URI=http://0.0.0.0:8080 mlflow models serve \\")
print("      --host=0.0.0.0 \\")
print("      --port=8081 \\")
print("      --env-manager=local \\")
# ToDo9: set the model uri to the model you just registered
print("      --model-uri=...")


In [None]:
print("You can copy the command below into one of your terminals \n")

request_data = pd.DataFrame(X_test_processed).iloc[0:4].to_json(orient="records")
print(
    """curl http://0.0.0.0:8081/invocations -H 'Content-Type: application/json' -d '{"dataframe_records": """
    + request_data
    + """}'"""
)


Congratulations! You made it!

If you still have some time, take a deep breath and try to help the people around you.

Or, if you like, you can try to improve on what you have already done and see what could be added.

## To Go Further

You can try to combine the transformer and the predictor together in the same sklearn pipeline. 
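
As a starting point, here is a minimal sketch of a single pipeline that chains a scaler and a regressor, trained on synthetic data. The `_toy` arrays, the step names and `full_pipeline` are illustrative. Logging the pipeline to MLflow and serving it are left to the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an exact linear relationship (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(50, 2))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + 1.0

# One object holds both the preprocessing and the model.
full_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ]
)

# fit scales the features and trains the regressor; predict applies both steps to raw input.
full_pipeline.fit(X_toy, y_toy)
toy_predictions = full_pipeline.predict(X_toy[:3])
print(toy_predictions)
```

With this design the serving endpoint can accept raw features, because the scaling is applied inside the model itself.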

In [None]:
# ToDo10: [To Go Further] rebuild a model using sklearn pipeline, log it to mlflow and deploy a serving endpoint
setup_mlflow(
    experiment_name="wine_score_pipeline_notebook",
)
...
