# A SIMPLIFIED MLOPS GUIDE USING MLFLOW

MLOps is a crucial part of the machine learning lifecycle: it keeps models maintained and performing well. [MLflow](https://mlflow.org) is an open source MLOps tool that is easy to use, offers a comprehensive library, and manages end-to-end ML workflows from development to production. This notebook introduces it through a basic hands-on implementation. Let's dive in.

[SOURCE](https://mlflow.org/docs/latest/getting-started/logging-first-model/index.html)

# 1. Install MLflow (Including libraries and dependencies)

Run this in your IDE terminal:

`pip install mlflow`
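
To confirm the install succeeded before moving on, something like the following sketch may help. It uses only the standard library; the helper name `installed_version` is an assumption made for illustration and is not part of MLflow.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string if `pip install mlflow` succeeded, otherwise None.
    print(installed_version("mlflow"))
```

If this prints `None`, the package is most likely installed into a different Python environment than the one the notebook kernel uses.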

# 2. Launch the MLflow Tracking Server

Run this in your IDE terminal:

`mlflow server --host 127.0.0.1 --port 8080`

Keep the MLflow tracking server running for the whole tutorial; closing or killing the terminal shuts the server down. The terminal output should look like [this](../MLOps/Screenshots/ss1.png) (`ss1.png` in the Screenshots folder).
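
Because every later cell depends on the server being up, a quick reachability check can save some confusion. The sketch below uses only the standard library and simply tests whether anything answers over HTTP at the address used in this tutorial. The function name `server_is_reachable` is an assumption, and the check does not verify that the responder is actually MLflow.

```python
import urllib.error
import urllib.request


def server_is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url` within `timeout` seconds."""
    # Bypass any configured HTTP proxy: the tracking server runs locally.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    print(server_is_reachable("http://127.0.0.1:8080"))
```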

# 3. Using the MLflow Client API with `MlflowClient`
It can be used to:
1. Initiate a new Experiment.
2. Start Runs within an Experiment.
3. Document parameters, metrics, and tags for your Runs.
4. Log artifacts linked to runs, such as models, tables, plots, and more.



## 3.1 Import Dependencies

In [1]:
from mlflow import MlflowClient
from pprint import pprint
from sklearn.ensemble import RandomForestClassifier

#### Configuring the MLflow Tracking Client

In [2]:
client = MlflowClient(tracking_uri="http://127.0.0.1:8080")

# Default Experiment
When the MLflow Tracking Server starts, it provides a “Default” experiment. If you don’t explicitly create a new experiment, any run data is automatically stored in this Default experiment so that it isn’t lost.

## 3.2 Searching Experiments

[mlflow.client.MlflowClient.search_experiments()](https://mlflow.org/docs/latest/python_api/mlflow.client.html#mlflow.client.MlflowClient.search_experiments)

In [3]:
all_experiments = client.search_experiments()
print(all_experiments)

# the output would be a list of Experiment objects

[<Experiment: artifact_location='mlflow-artifacts:/209577529335836473', creation_time=1736946368182, experiment_id='209577529335836473', last_update_time=1736946368182, lifecycle_stage='active', name='Apple_Models1', tags={'mlflow.note.content': 'This is the grocery forecasting project. This '
                        'experiment contains the produce models for apples.',
 'project_name': 'grocery-forecasting',
 'project_quarter': 'Q3-2023',
 'store_dept': 'produce',
 'team': 'stores-ml'}>, <Experiment: artifact_location='mlflow-artifacts:/0', creation_time=1736852050555, experiment_id='0', last_update_time=1736852050555, lifecycle_stage='active', name='Default', tags={}>]


To get familiar with accessing elements of the collections that MLflow APIs return, extract the `name` and `lifecycle_stage` attributes from the `search_experiments()` result into a dict.

In [4]:
default_experiment = [
    {
        "name": experiment.name, "lifecycle_stage": experiment.lifecycle_stage
    }
    for experiment in all_experiments
    if experiment.name == "Default"
][0]

pprint(default_experiment)

{'lifecycle_stage': 'active', 'name': 'Default'}
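
The `[0]` index in the cell above raises an `IndexError` when no experiment matches the name. A slightly more defensive variant, sketched here with `next()` and a default, returns `None` instead. The helper name and the stand-in objects are assumptions for illustration; real `Experiment` objects expose the same two attributes.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_experiment_summary(experiments: Iterable, name: str) -> Optional[dict]:
    """Return a name/lifecycle_stage dict for the first match, or None."""
    return next(
        (
            {"name": exp.name, "lifecycle_stage": exp.lifecycle_stage}
            for exp in experiments
            if exp.name == name
        ),
        None,
    )


# Stand-in objects (assumption): they mimic only the two attributes used here.
fake_experiments = [
    SimpleNamespace(name="Apple_Models1", lifecycle_stage="active"),
    SimpleNamespace(name="Default", lifecycle_stage="active"),
]

print(find_experiment_summary(fake_experiments, "Default"))
print(find_experiment_summary(fake_experiments, "Missing"))
```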


# 4. Creating experiments

## Viewing the MLflow UI
You can see the default experiment, with no run data yet, at [http://127.0.0.1:8080](http://127.0.0.1:8080).

## Notes on Tags vs Experiments
While MLflow does provide a default experiment, it primarily serves as a ‘catch-all’ safety net for runs initiated without a specified active experiment. However, it’s not recommended for regular use. Instead, creating unique experiments for specific collections of runs offers numerous advantages, as we’ll explore below.

**Benefits of Defining Unique Experiments**:

1. **Enhanced Organization**: Experiments allow you to group related runs, making it easier to track and compare them. This is especially helpful when managing numerous runs, as in large-scale projects.
2. **Metadata Annotation**: Experiments can carry metadata that aids in organizing and associating runs with larger projects.

Consider the scenario below: we’re simulating participation in a large demand forecasting project. This project involves building forecasting models for various departments in a chain of grocery stores, each housing numerous products. Our focus here is the ‘produce’ department, which has several distinct items, each requiring its own forecast model. Organizing these models becomes paramount to ensure easy navigation and comparison.

**When Should You Define an Experiment?**

The guiding principle for creating an experiment is the consistency of the input data. If multiple runs use the same input dataset (even if they utilize different portions of it), they logically belong to the same experiment. For other hierarchical categorizations, using tags is advisable.

**[NOTE](https://mlflow.org/docs/latest/getting-started/logging-first-model/step3-create-experiment.html)**

While the business product hierarchy in this case doesn’t explicitly need to be captured within the tags, there is nothing preventing you from doing so. There isn’t a limit to the number of tags that you can apply. Provided that the keys being used are consistent across experiments and runs to permit search to function properly, any number of arbitrary mappings between tracked models and your specific business rules can be applied.
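
Since search only works well when tag keys are consistent, it can help to check a tag dict against an agreed set of keys before creating an experiment. The sketch below is one possible approach; the set `REQUIRED_TAG_KEYS` and the helper `missing_tag_keys` are hypothetical, with key names borrowed from this tutorial's grocery example.

```python
# Agreed-upon keys (assumption): adjust to whatever your team standardizes on.
REQUIRED_TAG_KEYS = {"project_name", "store_dept", "team", "project_quarter"}


def missing_tag_keys(tags: dict, required: set = REQUIRED_TAG_KEYS) -> set:
    """Return the required keys that are absent from `tags`."""
    return set(required) - set(tags)


example_tags = {
    "project_name": "grocery-forecasting",
    "store_dept": "produce",
    "team": "stores-ml",
}

# Reports which agreed keys this tag dict still lacks.
print(sorted(missing_tag_keys(example_tags)))
```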


## 4.1 Create the Apples Experiment with Meaningful Tags

In [5]:
# Provide an Experiment description that will appear in the UI
experiment_description = (
    "This is the grocery forecasting project. "
    "This experiment contains the produce models for apples."
)

# Provide searchable tags that define characteristics of the Runs that
# will be in this Experiment
experiment_tags = {
    "project_name": "grocery-forecasting",
    "store_dept": "produce",
    "team": "stores-ml",
    "project_quarter": "Q3-2023",
    "mlflow.note.content": experiment_description,
}

# Create the Experiment, providing a unique name
produce_apples_experiment = client.create_experiment(
    name="Apple_Models1", tags=experiment_tags
)
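
Note that `create_experiment()` fails when an experiment with the same name already exists, so re-running the cell above raises an error. A get-or-create pattern avoids this. The sketch below assumes a client that behaves like `MlflowClient` (it calls `get_experiment_by_name()` and `create_experiment()`); the helper name and the in-memory stand-in are hypothetical and exist only so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Optional


def get_or_create_experiment(client, name: str, tags: Optional[dict] = None) -> str:
    """Return the ID of the experiment called `name`, creating it if needed."""
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name=name, tags=tags)


class InMemoryClient:
    """Hypothetical stand-in for a tracking client, for demonstration only."""

    def __init__(self):
        self._experiments = {}

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def create_experiment(self, name, tags=None):
        if name in self._experiments:
            raise ValueError(f"Experiment '{name}' already exists.")
        experiment_id = str(len(self._experiments) + 1)
        self._experiments[name] = SimpleNamespace(
            experiment_id=experiment_id, name=name, tags=tags or {}
        )
        return experiment_id


demo_client = InMemoryClient()
first_id = get_or_create_experiment(demo_client, "Apple_Models1")
second_id = get_or_create_experiment(demo_client, "Apple_Models1")
print(first_id == second_id)  # the second call reuses the existing experiment
```

With a real server, passing the `client` object created earlier in this notebook instead of `demo_client` should work the same way.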


# 5. Searching Experiments


## 5.1 Search based on tags

In [5]:
# Use search_experiments() to search on the project_name tag key

apples_experiment = client.search_experiments(
    filter_string="tags.`project_name` = 'grocery-forecasting'"
)

print(apples_experiment[0])  # prints an Experiment object, not a dict
# print(vars(apples_experiment[0]))  # prints the same metadata as a dict
# The result is the metadata of the experiment you created

<Experiment: artifact_location='mlflow-artifacts:/209577529335836473', creation_time=1736946368182, experiment_id='209577529335836473', last_update_time=1736946368182, lifecycle_stage='active', name='Apple_Models1', tags={'mlflow.note.content': 'This is the grocery forecasting project. This '
                        'experiment contains the produce models for apples.',
 'project_name': 'grocery-forecasting',
 'project_quarter': 'Q3-2023',
 'store_dept': 'produce',
 'team': 'stores-ml'}>
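
Filter strings can combine several tag conditions with `AND`. Writing them by hand gets error-prone, so a small builder can help. The helper `build_tag_filter` below is an assumption for illustration, not an MLflow function, and it deliberately rejects keys or values that would need escaping.

```python
def build_tag_filter(tags: dict) -> str:
    """Build a filter string that requires every tag in `tags` to match."""
    clauses = []
    for key, value in tags.items():
        if "`" in key or "'" in value:
            raise ValueError("This sketch does not handle backticks or single quotes.")
        clauses.append(f"tags.`{key}` = '{value}'")
    return " AND ".join(clauses)


print(build_tag_filter({"project_name": "grocery-forecasting", "team": "stores-ml"}))
```

The resulting string can be passed as the `filter_string` argument of `client.search_experiments()`.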


# 6. Create a Dataset about apples

## 6.1 Defining a dataset generator


In [8]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta


def generate_apple_sales_data_with_promo_adjustment(
        base_demand: int = 1000, n_rows: int = 5000
):
    """
    Generates a synthetic dataset for predicting apple sales demand with seasonality
    and inflation.

    This function creates a pandas DataFrame with features relevant to apple sales.
    The features include date, average_temperature, rainfall, weekend flag, holiday flag,
    promotional flag, price_per_kg, and the previous day's demand. The target variable,
    'demand', is generated based on a combination of these features with some added noise.

    Args:
        base_demand (int, optional): Base demand for apples. Defaults to 1000.
        n_rows (int, optional): Number of rows (days) of data to generate. Defaults to 5000.

    Returns:
        pd.DataFrame: DataFrame with features and target variable for apple sales prediction.

    Example:
        >>> df = generate_apple_sales_data_with_promo_adjustment(base_demand=1200, n_rows=6000)
        >>> df.head()
    """

    # Set seed for reproducibility
    np.random.seed(9999)

    # Create date range
    dates = [datetime.now() - timedelta(days=i) for i in range(n_rows)]
    dates.reverse()

    # Generate features
    df = pd.DataFrame(
        {
            "date": dates,
            "average_temperature": np.random.uniform(10, 35, n_rows),
            "rainfall": np.random.exponential(5, n_rows),
            "weekend": [(date.weekday() >= 5) * 1 for date in dates],
            "holiday": np.random.choice([0, 1], n_rows, p=[0.97, 0.03]),
            "price_per_kg": np.random.uniform(0.5, 3, n_rows),
            "month": [date.month for date in dates],
        }
    )

    # Introduce inflation over time (years)
    df["inflation_multiplier"] = (
        1 + (df["date"].dt.year - df["date"].dt.year.min()) * 0.03
    )

    # Incorporate seasonality due to apple harvests
    df["harvest_effect"] = np.sin(2 * np.pi * (df["month"] - 3) / 12) + np.sin(
        2 * np.pi * (df["month"] - 9) / 12
    )

    # Modify the price_per_kg based on harvest effect
    df["price_per_kg"] = df["price_per_kg"] - df["harvest_effect"] * 0.5

    # Adjust promo periods to coincide with periods lagging peak harvest by 1 month
    peak_months = [4, 10]  # months following the peak availability
    df["promo"] = np.where(
        df["month"].isin(peak_months),
        1,
        np.random.choice([0, 1], n_rows, p=[0.85, 0.15]),
    )

    # Generate target variable based on features
    base_price_effect = -df["price_per_kg"] * 50
    seasonality_effect = df["harvest_effect"] * 50
    promo_effect = df["promo"] * 200

    df["demand"] = (
        base_demand
        + base_price_effect
        + seasonality_effect
        + promo_effect
        + df["weekend"] * 300
        + np.random.normal(0, 50, n_rows)  # adding random noise
    ) * df["inflation_multiplier"]

    # Add previous day's demand
    df["previous_days_demand"] = df["demand"].shift(1)
    # Assign the result back instead of calling fillna(..., inplace=True) on a
    # column selection, which triggers a pandas FutureWarning
    df["previous_days_demand"] = df["previous_days_demand"].bfill()  # fill the first row

    # Drop temporary columns
    df.drop(columns=["inflation_multiplier", "harvest_effect", "month"], inplace=True)

    return df


In [9]:
data = generate_apple_sales_data_with_promo_adjustment(base_demand=1_000, n_rows=1_000)

data[-20:]



| | date | average_temperature | rainfall | weekend | holiday | price_per_kg | promo | demand | previous_days_demand |
|---|---|---|---|---|---|---|---|---|---|
| 980 | 2024-12-28 10:48:35.799340 | 34.130183 | 1.454065 | 1 | 0 | 1.449177 | 0 | 1289.802447 | 1001.085782 |
| 981 | 2024-12-29 10:48:35.799339 | 32.353643 | 9.462859 | 1 | 0 | 2.856503 | 0 | 1136.951553 | 1289.802447 |
| 982 | 2024-12-30 10:48:35.799338 | 18.816833 | 0.39147 | 0 | 0 | 1.326429 | 0 | 963.352029 | 1136.951553 |
| 983 | 2024-12-31 10:48:35.799337 | 34.533012 | 2.120477 | 0 | 0 | 0.970131 | 0 | 1039.385504 | 963.352029 |
| 984 | 2025-01-01 10:48:35.799336 | 23.057202 | 2.365705 | 0 | 0 | 1.049931 | 0 | 1019.486305 | 1039.385504 |
| 985 | 2025-01-02 10:48:35.799335 | 34.810165 | 3.089005 | 0 | 0 | 2.035149 | 0 | 1002.564672 | 1019.486305 |
| 986 | 2025-01-03 10:48:35.799334 | 29.208905 | 3.673292 | 0 | 0 | 2.518098 | 0 | 1086.143402 | 1002.564672 |
| 987 | 2025-01-04 10:48:35.799334 | 16.428676 | 4.077782 | 1 | 0 | 1.268979 | 0 | 1420.207186 | 1086.143402 |
| 988 | 2025-01-05 10:48:35.799333 | 32.067512 | 2.734454 | 1 | 0 | 0.762317 | 0 | 1396.939894 | 1420.207186 |
| 989 | 2025-01-06 10:48:35.799332 | 31.938203 | 13.883486 | 0 | 0 | 1.153301 | 0 | 994.40954 | 1396.939894 |
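
One detail of the generator is worth checking by hand. The two sine terms in `harvest_effect` are offset by six months, which is half of the 12-month period, so they have the form sin(θ) + sin(θ − π) and should cancel. If that reading is right, the seasonality adjustment contributes essentially nothing to price or demand. The sketch below checks this with the standard library only; the helper `harvest_effect` is a hypothetical function that mirrors the expression for a single month.

```python
import math


def harvest_effect(month: int) -> float:
    """Seasonality term as written in the generator, for a single month."""
    return math.sin(2 * math.pi * (month - 3) / 12) + math.sin(
        2 * math.pi * (month - 9) / 12
    )


# Each value should be zero up to floating-point rounding.
for month in range(1, 13):
    print(month, round(harvest_effect(month), 12))
```

This does not affect the rest of the tutorial, since the dataset is synthetic and only serves as input for logging runs.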


# 7. Logging our first runs with MLflow

Core features of MLflow Tracking:
- Making use of the start_run context for creating and efficiently managing runs.
- An introduction to logging, covering tags, parameters, and metrics.
- Understanding the role and formation of a model signature.
- Logging a trained model, solidifying its presence in our MLflow run.
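
The value of the `start_run` context is that the run is closed automatically when the `with` block exits, and it is marked as failed if the code inside raises. The sketch below imitates that bookkeeping with a plain context manager. It is a hypothetical stand-in that illustrates the pattern, not MLflow's implementation.

```python
from contextlib import contextmanager


@contextmanager
def tracked_run(run_name: str, log: list):
    """Hypothetical stand-in that mimics a run's start/finish bookkeeping."""
    log.append(f"{run_name}: RUNNING")
    try:
        yield run_name
    except Exception:
        # The body raised: record the failure, then let the error propagate.
        log.append(f"{run_name}: FAILED")
        raise
    else:
        log.append(f"{run_name}: FINISHED")


events = []
with tracked_run("ok_run", events):
    pass

try:
    with tracked_run("bad_run", events):
        raise RuntimeError("training crashed")
except RuntimeError:
    pass

print(events)
```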


## 7.1 Using MLflow Tracking to keep track of training

In [10]:
# Import necessary libraries
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Set the global reference to the Tracking server’s address.

In [11]:
mlflow.set_tracking_uri("http://127.0.0.1:8080")

#### Defining the Experiment that runs will be logged to

In [12]:
# Sets the current active experiment to the "Apple_Models1" experiment and
# returns the Experiment metadata
apple_experiment = mlflow.set_experiment("Apple_Models1")

# Define a run name for this iteration of training.
# If this is not set, a unique name will be auto-generated for your run.
run_name = "apples_rf_test"

# Define an artifact path that the model will be saved to.
artifact_path = "rf_apples"


#### Run the experiment

In [13]:
# Split the data into features and target and drop irrelevant date field and target field
X = data.drop(columns=["date", "demand"])
y = data["demand"]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
    "n_estimators": 100,
    "max_depth": 6,
    "min_samples_split": 10,
    "min_samples_leaf": 4,
    "bootstrap": True,
    "oob_score": False,
    "random_state": 888,
}

# Train the RandomForestRegressor
rf = RandomForestRegressor(**params)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Predict on the validation set
y_pred = rf.predict(X_val)

# Calculate error metrics
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)

# Assemble the metrics we're going to write into a collection
metrics = {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Initiate the MLflow run context
with mlflow.start_run(run_name=run_name) as run:
    # Log the parameters used for the model fit
    mlflow.log_params(params)

    # Log the error metrics that were calculated during validation
    mlflow.log_metrics(metrics)

    # Log an instance of the trained model for later use
    mlflow.sklearn.log_model(
        sk_model=rf, input_example=X_val, artifact_path=artifact_path
    )




🏃 View run apples_rf_test at: http://127.0.0.1:8080/#/experiments/209577529335836473/runs/e614276845c647deaa4c6f7a0224ad4c
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/209577529335836473
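
To make the four logged metrics less of a black box, here is a sketch that computes them with plain Python arithmetic. It follows the standard definitions of MAE, MSE, RMSE, and R². The function name `regression_metrics` and the toy inputs are assumptions chosen for easy mental arithmetic; they are not results from the model above.

```python
import math


def regression_metrics(y_true: list, y_pred: list) -> dict:
    """Compute MAE, MSE, RMSE and R^2 with plain Python arithmetic."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    # Total variation of the targets around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


# Toy values chosen for easy arithmetic; they are not results from the model.
print(regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

For the toy inputs the errors are 1, 0, and -2, so MAE is 1.0, MSE is 5/3, and R² is 1 - 5/8 = 0.375.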


### Hooray, you've just logged your first MLflow model! Let's try another one in the same experiment, `Apple_Models1`.

#### Defining the experiment

In [14]:
# Sets the current active experiment
apple_experiment_v2 = mlflow.set_experiment("Apple_Models1")

# Define a run name for this iteration of training.
run_name2 = "apple_dtr_test"

# Define the artifact path that the model will save to
artifact_path2 = "dtr_apples"

In [15]:
from sklearn.tree import DecisionTreeRegressor

dtr_params = {
    "criterion": "friedman_mse",
    "splitter": "random",
    "max_depth": 6,
    "min_samples_split": 10,
    "min_samples_leaf": 4,
    "random_state": 888,
}

dtr = DecisionTreeRegressor(**dtr_params)

dtr.fit(X_train, y_train)

y_pred = dtr.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)

metrics2 = {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Initiate the MLflow run context
with mlflow.start_run(run_name=run_name2) as run:
    # Log the parameters used for the model fit
    mlflow.log_params(dtr_params)

    # Log the error metrics that were calculated during validation
    mlflow.log_metrics(metrics2)

    # Log an instance of the trained model for later use
    mlflow.sklearn.log_model(
        sk_model=dtr, input_example=X_val, artifact_path=artifact_path2
    )



🏃 View run apple_dtr_test at: http://127.0.0.1:8080/#/experiments/209577529335836473/runs/81f250f1838f49ce9257de8757d8897c
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/209577529335836473
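
With two runs in the same experiment, the natural next step is to compare them on a metric. The MLflow UI does this visually; the sketch below shows the same idea in code on plain dicts. The helper `best_run`, the run names, and the numbers are all placeholders for illustration and are not this notebook's results.

```python
def best_run(runs: list, metric: str, lower_is_better: bool = True) -> dict:
    """Return the run summary with the best value for `metric`."""
    chooser = min if lower_is_better else max
    return chooser(runs, key=lambda run: run["metrics"][metric])


# Placeholder numbers for illustration only; they are not real results.
example_runs = [
    {"run_name": "run_a", "metrics": {"rmse": 2.0, "r2": 0.9}},
    {"run_name": "run_b", "metrics": {"rmse": 3.0, "r2": 0.8}},
]

print(best_run(example_runs, "rmse")["run_name"])
print(best_run(example_runs, "r2", lower_is_better=False)["run_name"])
```

Error metrics such as RMSE are better when lower, while R² is better when higher, which is why the direction is a parameter.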


### Congrats! You've just logged another MLflow model. I hope this notebook is understandable enough to help you take your first step into MLOps with MLflow ^-^