<div style = "10px">
<img src="https://mlflow.org/docs/latest/images/logo-light.svg" height="50px">
</div>

*Streamline*: to be more efficient, more effective, or simpler

**MLflow** is a tool for managing the machine learning lifecycle.

- It is an open-source platform.
- It is purpose-built to help ML teams handle the complexities of the ML process.
- It covers the full lifecycle of machine learning projects, ensuring that each phase is manageable, traceable, and reproducible.

1. Start a Tracking Server (local): 
   ```text
   mlflow server --host 127.0.0.1 --port 8080
   ```
2. Set the Tracking Server URI if not using Databricks
   ```python
   import mlflow
   mlflow.set_tracking_uri(uri="http://<host>:<port>")
   ``` 
3. Train a model and prepare the data for logging 
   ```python
   import mlflow
   from mlflow.models import infer_signature

   import pandas as pd
   from sklearn import datasets
   from sklearn.model_selection import train_test_split
   from sklearn.linear_model import LogisticRegression
   from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


   # Load the Iris dataset
   X, y = datasets.load_iris(return_X_y=True)

   # Split the data into training and test sets
   X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42
   )

   # Define the model hyperparameters
   params = {
      "solver": "lbfgs",
      "max_iter": 1000,
      "multi_class": "auto",
      "random_state": 8888,
   }

   # Train the model
   lr = LogisticRegression(**params)
   lr.fit(X_train, y_train)

   # Predict on the test set
   y_pred = lr.predict(X_test)

   # Calculate metrics
   accuracy = accuracy_score(y_test, y_pred)
   ```
4. Log the model and its metadata to MLflow. This records the *model*, its *performance metrics*, and its *parameters*.

   ```python
   # Create a new MLflow Experiment
   mlflow.set_experiment("MLflow Quickstart")

   # Start an MLflow run
   with mlflow.start_run():
      # Log the hyperparameters
      mlflow.log_params(params)

      # Log the accuracy metric
      mlflow.log_metric("accuracy", accuracy)

      # Set a tag that we can use to remind ourselves what this run was for
      mlflow.set_tag("Training Info", "Basic LR model for iris data")

      # Infer the model signature
      signature = infer_signature(X_train, lr.predict(X_train))

      # Log the model
      model_info = mlflow.sklearn.log_model(
         sk_model=lr,
         artifact_path="iris_model",
         signature=signature,
         input_example=X_train,
         registered_model_name="tracking-quickstart",
      )
   ```
5. Load the model as a Python Function (pyfunc) and use it for inference.

   ```python
   # Load the model back for predictions as a generic Python Function model
   loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

   predictions = loaded_model.predict(X_test)

   iris_feature_names = datasets.load_iris().feature_names

   result = pd.DataFrame(X_test, columns=iris_feature_names)
   result["actual_class"] = y_test
   result["predicted_class"] = predictions

   result[:4]
   ```

The training code sits outside the `mlflow.start_run()` block. If there is a problem with the code, we can fix it before logging the model, so a failed training attempt does not leave a run behind.

## Using the MLflow Client API

Import dependencies

In [25]:
from mlflow import MlflowClient
from pprint import pprint
from sklearn.ensemble import RandomForestRegressor 
import qeds

Configuring the MLflow Tracking Client

We now have a client interface to the tracking server that can both send data to and retrieve data from the tracking server.

In [41]:
client = MlflowClient(tracking_uri="http://127.0.0.1:8080")

By default, the *MLflow Tracking Server* includes a **Default Experiment**, which stores everything that is logged without declaring an experiment.

This is useful when we forget to create a new experiment before using the MLflow tracking capabilities (logging, ...).

**Searching Experiments**

In [42]:
all_experiments = client.search_experiments()
pprint(all_experiments)

[<Experiment: artifact_location='mlflow-artifacts:/0', creation_time=1746920722099, experiment_id='0', last_update_time=1746920722099, lifecycle_stage='active', name='Default', tags={}>]


In [43]:
default_experiment = [{'name': experiment.name ,'lifecycle_stage':experiment.lifecycle_stage} 
                      for experiment in all_experiments][0]
default_experiment

{'name': 'Default', 'lifecycle_stage': 'active'}

**Creating Experiments**




Tags and experiments

If several runs use the same input dataset, they logically belong to the same experiment, and the metadata about the dataset is stored in tags.

<center>
<img src="https://mlflow.org/docs/latest/assets/images/tag-exp-run-relationship-fc898eccc4bb05fe59f41372ab5f6b50.svg" height="300">
</center>

In [None]:
# experiment description
experiment_description = (
    "This is the grocery forecasting project."
    "This experiment contains the produce models for apples."
)

# Provide searchable tags that define characteristics of
# Runs that will be included in this Experiment
experiment_tags = {
    "project_name":"grocery-forecasting",
    "store_dept":"produce",
    "team":"stores-ml",
    "project_quarter":"Q3-2023",
    "mlflow.note.content":experiment_description
}

# Create the Experiment, providing a unique name
produce_apples_experiment = client.create_experiment(
    name='Apples_Models', 
    tags=experiment_tags
) 
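
One detail of the description string is worth a quick check: Python joins adjacent string literals exactly as written, without inserting a separator, so two sentences run together unless the first literal ends with a space. A minimal sketch, assuming a space between the sentences is what is wanted (the names `joined` and `spaced` are illustrative only):

```python
# Adjacent string literals are concatenated exactly as written.
joined = (
    "This is the grocery forecasting project."
    "This experiment contains the produce models for apples."
)

# A trailing space on the first literal keeps the sentences apart.
spaced = (
    "This is the grocery forecasting project. "
    "This experiment contains the produce models for apples."
)

print("project.This" in joined)   # True
print("project. This" in spaced)  # True
```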

**Search Experiments**

We can search for the experiments that share the same `project_name` tag.

In [72]:
client.search_experiments(
    filter_string="tags.`project_name`='grocery-forecasting'"
)

[<Experiment: artifact_location='mlflow-artifacts:/446874737954528824', creation_time=1746984158981, experiment_id='446874737954528824', last_update_time=1746985870789, lifecycle_stage='active', name='Apples_Models', tags={'mlflow.note.content': 'This is the grocery forecasting project.This '
                         'experiment contains the produce models for apples.',
  'project_name': 'grocery-forecasting',
  'project_quarter': 'Q3-2023',
  'store_dept': 'produce',
  'team': 'stores-ml'}>]
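
Filters on several tags can be combined with `AND`. The helper below is a hypothetical convenience (`build_tag_filter` is not part of MLflow) that assembles such a filter string from a dictionary, using the same backtick-quoted key syntax as the search above. It is a sketch that assumes keys and values contain no quote characters.

```python
def build_tag_filter(tags: dict[str, str]) -> str:
    """Build a filter string that matches experiments carrying all given tags."""
    clauses = []
    for key, value in tags.items():
        # Escaping is out of scope for this sketch, so reject quote characters
        if "'" in value or "`" in key:
            raise ValueError("quotes in keys or values are not handled here")
        clauses.append(f"tags.`{key}` = '{value}'")
    return " AND ".join(clauses)


filter_string = build_tag_filter(
    {"project_name": "grocery-forecasting", "team": "stores-ml"}
)
print(filter_string)
# tags.`project_name` = 'grocery-forecasting' AND tags.`team` = 'stores-ml'
```

The resulting string can be passed as `filter_string` to `client.search_experiments`.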

**Delete Experiments**

This is a soft delete: the experiment is not permanently removed from the backend store; it is marked as deleted and becomes hidden in the MLflow UI.

In [73]:
client.delete_experiment("446874737954528824")

In [74]:
# Show all deleted experiments
from mlflow.entities import ViewType

deleted_experiments = client.search_experiments(view_type=ViewType.DELETED_ONLY)
for deleted_experiment in deleted_experiments:
    print("Experiment: {}".format(deleted_experiment.experiment_id))

Experiment: 446874737954528824


In [None]:
# We can restore the experiment with restore_experiment
client.restore_experiment("446874737954528824")

In [None]:
# We can modify the value of an experiment tag by key.
# If the key does not exist, it is created.
client.set_experiment_tag("446874737954528824", 'project_name', 'grocery-forecasting')

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt


def generate_apple_sales_data_with_promo_adjustment(
    base_demand: int = 1000, n_rows: int = 5000
):
    """
    Generates a synthetic dataset for predicting apple sales demand with seasonality
    and inflation.

    This function creates a pandas DataFrame with features relevant to apple sales.
    The features include date, average_temperature, rainfall, weekend flag, holiday flag,
    promotional flag, price_per_kg, and the previous day's demand. The target variable,
    'demand', is generated based on a combination of these features with some added noise.

    Args:
        base_demand (int, optional): Base demand for apples. Defaults to 1000.
        n_rows (int, optional): Number of rows (days) of data to generate. Defaults to 5000.

    Returns:
        pd.DataFrame: DataFrame with features and target variable for apple sales prediction.

    Example:
        >>> df = generate_apple_sales_data_with_promo_adjustment(base_demand=1200, n_rows=6000)
        >>> df.head()
    """

    # Set seed for reproducibility
    np.random.seed(9999)

    # Create date range
    dates = [datetime.now() - timedelta(days=i) for i in range(n_rows)]
    dates.reverse()

    # Generate features
    df = pd.DataFrame(
        {
            "date": dates,
            "average_temperature": np.random.uniform(10, 35, n_rows),
            "rainfall": np.random.exponential(5, n_rows),
            "weekend": [(date.weekday() >= 5) * 1 for date in dates],
            "holiday": np.random.choice([0, 1], n_rows, p=[0.97, 0.03]),
            "price_per_kg": np.random.uniform(0.5, 3, n_rows),
            "month": [date.month for date in dates],
        }
    )

    # Introduce inflation over time (years)
    df["inflation_multiplier"] = (
        1 + (df["date"].dt.year - df["date"].dt.year.min()) * 0.03
    )

    # Incorporate seasonality due to apple harvests
    df["harvest_effect"] = np.sin(2 * np.pi * (df["month"] - 3) / 12) + np.sin(
        2 * np.pi * (df["month"] - 9) / 12
    )

    # Modify the price_per_kg based on harvest effect
    df["price_per_kg"] = df["price_per_kg"] - df["harvest_effect"] * 0.5

    # Adjust promo periods to coincide with periods lagging peak harvest by 1 month
    peak_months = [4, 10]  # months following the peak availability
    df["promo"] = np.where(
        df["month"].isin(peak_months),
        1,
        np.random.choice([0, 1], n_rows, p=[0.85, 0.15]),
    )

    # Generate target variable based on features
    base_price_effect = -df["price_per_kg"] * 50
    seasonality_effect = df["harvest_effect"] * 50
    promo_effect = df["promo"] * 200

    df["demand"] = (
        base_demand
        + base_price_effect
        + seasonality_effect
        + promo_effect
        + df["weekend"] * 300
        + np.random.normal(0, 50, n_rows)
    ) * df[
        "inflation_multiplier"
    ]  # adding random noise

    # Add previous day's demand
    df["previous_days_demand"] = df["demand"].shift(1)
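    # Fill the first row; assign the result rather than calling bfill(inplace=True)
    # on a column selection, which newer pandas versions warn about
    df["previous_days_demand"] = df["previous_days_demand"].bfill()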

    # Drop temporary columns
    df.drop(columns=["inflation_multiplier", "harvest_effect", "month"], inplace=True)

    return df


In [2]:
data = generate_apple_sales_data_with_promo_adjustment()
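
The seasonal term in the generator is worth checking numerically. Its two sine waves are offset by six months, which is half of their twelve-month period, so their contributions have opposite signs in every month. The sketch below evaluates the same expression with the standard library only (`harvest_effect` is an illustrative stand-alone function, not part of the generator); under this formula the result is zero for every month, up to floating-point error.

```python
import math


def harvest_effect(month: int) -> float:
    """Evaluate the generator's seasonal term for a month in 1..12."""
    return math.sin(2 * math.pi * (month - 3) / 12) + math.sin(
        2 * math.pi * (month - 9) / 12
    )


effects = {month: harvest_effect(month) for month in range(1, 13)}
largest = max(abs(value) for value in effects.values())
print(f"largest |harvest_effect| over the year: {largest:.2e}")
```

With this term cancelling out, the month-dependent pattern in the generated demand comes from the promotion months (April and October) rather than from the harvest effect.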

**Logging our first runs with MLflow**

The code below is the final experiment after many attempts (all earlier experiments were deleted).

We are going to use the `fluent` API. The `fluent` APIs rely on a globally referenced state of the MLflow tracking server's URI.

This global state lets us use these higher-level (simpler) APIs to perform every action that we could otherwise do with the `MlflowClient`.


In [3]:
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from itertools import product
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [4]:
mlflow.set_tracking_uri("http://127.0.0.1:8080")

In [None]:
# Sets the current active experiment to the "Apples_Models" experiment and
# returns the Experiment metadata
apple_experiment = mlflow.set_experiment("Apples_Models")

# Define a run name for this iteration of training.
# If this is not set, a unique name will be auto-generated for your run.
run_name = "apples_rf_test"

# Define an artifact path that the model will be saved to.
artifact_path = "rf_apples"

In [6]:
# Split the data into features and target and drop irrelevant date field and target field
X = data.drop(columns=["date", "demand"])
y = data["demand"]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
    "n_estimators": 100,
    "max_depth": 6,
    "min_samples_split": 10,
    "min_samples_leaf": 4,
    "bootstrap": True,
    "oob_score": False,
    "random_state": 888,
}

# Train the RandomForestRegressor
rf = RandomForestRegressor(**params)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Predict on the validation set
y_pred = rf.predict(X_val)

# Calculate error metrics
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)

# Assemble the metrics we're going to write into a collection
metrics = {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Initiate the MLflow run context
with mlflow.start_run(run_name=run_name) as run:
    # Log the parameters used for the model fit
    mlflow.log_params(params)

    # Log the error metrics that were calculated during validation
    mlflow.log_metrics(metrics)

    # Log an instance of the trained model for later use
    mlflow.sklearn.log_model(
        sk_model=rf, input_example=X_val, artifact_path=artifact_path
    )



🏃 View run apples_rf_test at: http://127.0.0.1:8080/#/experiments/446874737954528824/runs/e028267bcebe4d9ea87bb89981176e0e
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/446874737954528824
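
For reference, the four validation metrics logged in this run can be reproduced by hand. The following sketch uses only the standard library and a small made-up set of values (the numbers are illustrative, not taken from the apple data) to show how MAE, MSE, RMSE and R² relate to one another.

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]   # illustrative targets
y_pred = [2.0, 5.0, 8.0, 11.0]  # illustrative predictions

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(r) for r in residuals) / n  # mean absolute error
mse = sum(r * r for r in residuals) / n   # mean squared error
rmse = math.sqrt(mse)                     # same units as the target

# R^2 compares the residual error with the spread of the targets around their mean
mean_true = sum(y_true) / n
ss_res = sum(r * r for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

metrics = {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}
print(metrics)
```

Lower MAE, MSE and RMSE indicate a better fit, while R² approaches 1 as the predictions explain more of the variance in the target.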


**MLflow Nested Runs**

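Nested runs are useful for organizing hyperparameter tuning experiments because they group individual runs under a parent run. The loop below starts one run per hyperparameter combination with `nested=True`. Note that a run is attached to a parent only when a parent run is already active at the moment it starts; otherwise it is recorded as an ordinary top-level run.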

In [36]:
def train(params):
    # Train the RandomForestRegressor
    rf = RandomForestRegressor(**params)

    # Fit the model on the training data
    rf.fit(X_train, y_train)

    # Predict on the validation set
    y_pred = rf.predict(X_val)

    # Calculate error metrics
    mae = mean_absolute_error(y_val, y_pred)
    mse = mean_squared_error(y_val, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_val, y_pred)

    # Assemble the metrics we're going to write into a collection
    metrics = {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}
    
    return rf, X_val, metrics


hyperparameters = {
    "n_estimators": [100, 200],
    "max_depth": [6, 10, 15],
    "min_samples_split": [10],
    "min_samples_leaf": [4],
    "bootstrap": [True],
    "oob_score": [False],
    "random_state": [888]}
# Initiate the MLflow run context


for value_params in product(*hyperparameters.values()):
    with mlflow.start_run(nested = True):
        params = dict(zip(hyperparameters.keys(), value_params))
        rf, X_val, metrics = train(params)
        
        # Log the parameters used for the model fit
        mlflow.log_params(params)

        # Log the error metrics that were calculated during validation
        mlflow.log_metrics(metrics)

        # Log an instance of the trained model for later use
        model_info = mlflow.sklearn.log_model(
            sk_model=rf, input_example=X_val, 
            artifact_path=artifact_path
        )



🏃 View run efficient-sow-531 at: http://127.0.0.1:8080/#/experiments/446874737954528824/runs/90f2c3ba83a84ae3b8622fcae7f07f1f
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/446874737954528824




🏃 View run blushing-mouse-769 at: http://127.0.0.1:8080/#/experiments/446874737954528824/runs/c463b38a1da346f6a1296fc2efdb7800
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/446874737954528824




🏃 View run shivering-frog-724 at: http://127.0.0.1:8080/#/experiments/446874737954528824/runs/b9086db451ea4d658e01133a4af0160b
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/446874737954528824




🏃 View run rogue-shoat-513 at: http://127.0.0.1:8080/#/experiments/446874737954528824/runs/ca4c1546743a4c38bcdff88193087806
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/446874737954528824




🏃 View run luminous-fowl-963 at: http://127.0.0.1:8080/#/experiments/446874737954528824/runs/cd1e9e85420848f2a5ceb78aa69935d0
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/446874737954528824




🏃 View run thundering-fowl-902 at: http://127.0.0.1:8080/#/experiments/446874737954528824/runs/ccf44ca9ddc64574893c0a6e6a91c08b
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/446874737954528824
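
The loop above relies on `itertools.product` to expand the dictionary of candidate values into one parameter dictionary per run. A stand-alone sketch of that expansion, using a reduced grid with some of the same keys (the helper name `expand_grid` is illustrative):

```python
from itertools import product


def expand_grid(grid: dict[str, list]) -> list[dict]:
    """Return one parameter dictionary per combination of candidate values."""
    keys = list(grid.keys())
    return [dict(zip(keys, values)) for values in product(*grid.values())]


hyperparameters = {
    "n_estimators": [100, 200],
    "max_depth": [6, 10, 15],
    "random_state": [888],
}

combinations = expand_grid(hyperparameters)
print(len(combinations))  # 6
print(combinations[0])    # {'n_estimators': 100, 'max_depth': 6, 'random_state': 888}
```

Two values for `n_estimators` and three for `max_depth` give six combinations, which matches the six runs listed in the output.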


In [85]:
# Search all runs related to experiment id
runs = mlflow.search_runs(
    experiment_ids=[apple_experiment.experiment_id], 
    output_format='pandas'
    )

In [87]:
# Restrict the search to finished runs before picking the lowest MAE
finished_runs = runs[runs.status == 'FINISHED']
id_model = finished_runs.loc[finished_runs['metrics.mae'].idxmin(), "run_id"]

In [93]:
model_uri = 'runs:/{}/{}'.format(id_model, artifact_path)
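
The selection logic can be illustrated without a tracking server. The sketch below uses hand-written records that mimic three columns of the `search_runs` result (`run_id`, `status`, `metrics.mae`); the run IDs and metric values are invented for illustration.

```python
# Hypothetical records mimicking rows returned by mlflow.search_runs
runs = [
    {"run_id": "aaa111", "status": "FINISHED", "metrics.mae": 41.7},
    {"run_id": "bbb222", "status": "FAILED", "metrics.mae": 12.0},
    {"run_id": "ccc333", "status": "FINISHED", "metrics.mae": 39.2},
]
artifact_path = "rf_apples"

# Keep finished runs only, then take the one with the lowest MAE
finished = [run for run in runs if run["status"] == "FINISHED"]
best_run = min(finished, key=lambda run: run["metrics.mae"])

# A runs:/ URI points at an artifact path inside a specific run
model_uri = f"runs:/{best_run['run_id']}/{artifact_path}"
print(model_uri)  # runs:/ccc333/rf_apples
```

Filtering on status before taking the minimum avoids selecting, or failing on, a run that logged a low MAE but did not finish.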

In [95]:
sklearn_pyfunc  = mlflow.sklearn.load_model(model_uri)

In [99]:
predict = sklearn_pyfunc.predict(X_val)
