# Hackathon III - MLOps

In previous lessons we preprocessed the data and created a dataset for training. In this lesson we will:

- 👉 Split the data into: Train / Test / Validation (0.5 p)
- 👉 Train a Regression Model Pipeline (1 p)
- 👉 Evaluate the model (1 p)
- 👉 Log the model and the metrics using MLflow (1 p)
- 👉 Register the model in MLflow (1 p)
- 👉 Deploy the model to a REST API (1 p)
- 👉 Make predictions using the REST API (1 p)
- 👉 Upload the predictions to the database (0.5 p)

The following activities will also be evaluated:

- 👉 Code legibility (0.5 p)
- 👉 Notebook documentation: titles, subtitles, text explanation, etc. (0.5 p)
- 👉 Creating a Git repository with the code:
  - 👉 Adding a complete README.md file (0.5 p)
  - 👉 Adding a .gitignore file and a requirements.txt file (0.5 p)
  - 👉 Using Branches for development (1 p)

Bonus points:

- ✅ Any other activity that the student considers important and that is reflected in the code (e.g., GridSearch)
- ✅ Trying with other articles

The following activities will be penalized:

- ❌ Pushing Data to the repository (it's a bad practice)
- ❌ Pushing passwords or sensitive information to the repository (It's a VERY bad practice)
- ❌ Pushing files generated by MLflow, such as `mlruns` (it's a bad practice)

In [1]:
# Avoid unnecessary warnings

import warnings
warnings.filterwarnings('ignore')

## Load the data

The first step is always loading the data. The CSV files are provided and can also be regenerated by following the previous lesson. They contain data for PRODUCT_ID 3960.

- 👉 Load train and test datasets.
- 👉 Set "fecha_venta" as index column.
- 👉 Sort the data by "fecha_venta" column.
- 👉 Show the first 5 rows of the train dataset.

In [2]:
import pandas as pd

# 👇 Code Goes Here
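
The loading steps above could look roughly like the sketch below. The file names `train.csv` and `test.csv` are assumptions (the lesson does not name the CSV files), the helper `load_sales_csv` is hypothetical, and a tiny in-memory CSV with a made-up `cantidad` column stands in for the real file so the block runs on its own.

```python
import io

import pandas as pd


def load_sales_csv(source) -> pd.DataFrame:
    """Load a CSV, use "fecha_venta" as a datetime index, and sort by it."""
    df = pd.read_csv(source, parse_dates=["fecha_venta"], index_col="fecha_venta")
    return df.sort_index()


# Tiny in-memory stand-in for the real file (the "cantidad" column is
# made up for illustration; the real CSVs may have different columns).
sample_csv = io.StringIO(
    "fecha_venta,cantidad\n"
    "2023-01-03,12\n"
    "2023-01-01,10\n"
    "2023-01-02,11\n"
)
df_sample = load_sales_csv(sample_csv)
print(df_sample.head(5))

# With the real files (paths are assumptions):
# df_train = load_sales_csv("train.csv")
# df_test = load_sales_csv("test.csv")
```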

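## Split data into train and validation sets

We will split the training data into train and validation sets. We will use the train set to fit the model and the validation set to evaluate it. The test set loaded above is kept for the predictions requested from the deployed model later on.
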
- 👉 Select the feature columns and the label columns
- 👉 Split the data into train and validation sets using an 80/20 or 90/10 ratio.
- 💡 Remember how train/val split should be made in time series problems.

In [3]:
# set the product id and family (only for logging purposes, do not use it for filtering)
PRODUCT_ID = 3960
PRODUCT_FAMILY = "BOLLERIA"

# 👇 Code Goes Here
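
In time series problems the split should respect chronological order: the validation set is the most recent slice and nothing is shuffled. A minimal sketch, assuming a date-indexed dataframe and a hypothetical label column named `cantidad` (the actual feature and label columns depend on your dataset):

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, label_col: str, train_ratio: float = 0.8):
    """Split a date-indexed dataframe into train/validation without shuffling."""
    df = df.sort_index()
    cut = int(len(df) * train_ratio)  # rows before `cut` are older
    features = df.drop(columns=[label_col])
    labels = df[label_col]
    return features.iloc[:cut], features.iloc[cut:], labels.iloc[:cut], labels.iloc[cut:]


# Toy data: 10 consecutive days with one made-up feature.
dates = pd.date_range("2023-01-01", periods=10, freq="D", name="fecha_venta")
toy = pd.DataFrame({"feature_a": range(10), "cantidad": range(100, 110)}, index=dates)

X_train, X_val, y_train, y_val = chronological_split(toy, label_col="cantidad")
print(len(X_train), len(X_val))
```

`sklearn.model_selection.train_test_split(..., shuffle=False)` is an alternative that also preserves row order.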

## Start MLflow Server

- 👉 Launch a local MLflow server
- 👉 Connect to the local MLflow server
- 👉 Set the desired experiment
- 👉 Enable MLflow autologging for sklearn

In [4]:
import mlflow


# 👇 Code Goes Here

## Train and evaluate the model

Next, we train and evaluate the model. We will use a pipeline that preprocesses the data and fits the model.

- 👉 Create a Sklearn Pipeline:
  - 👉 Preprocessing: StandardScaler or MinMaxScaler
  - 👉 Model: LinearRegression, RandomForestRegressor, etc.
- 👉 Start a run in MLflow
- 👉 Train the model using the train dataset
- 👉 Add convenient tags for PRODUCT_ID and PRODUCT_FAMILY
- 👉 Evaluate the model
- 💡 Remember this is a regression problem
- 💡 Autolog will automatically log metrics and model

In [5]:
from sklearn.pipeline import Pipeline


# 👇 Code Goes Here

## Register the model

Promote the model to the Model Registry. For this section you can use either the MLflow UI or code snippets. If you choose the UI, you should provide screenshots.

In [3]:
# 👇 Code Goes Here


## Tag the Model

We can assign an alias, a named pointer to one model version, to mark that version as ready for production. As new versions (v1, v2...) are registered, the same alias can be moved to the newest approved one, so we can deploy the model by selecting the (fixed) alias instead of a specific (changing) version number.

For this section you can use either the MLflow UI or code snippets. If you choose the UI, you should provide screenshots.

In [7]:
MODEL_ALIAS = "production"  # alias assigned to the production-ready model version


# 👇 Code Goes Here

## Deploy the model

In a terminal run the following command to deploy the model:

```bash
export MLFLOW_TRACKING_URI=http://localhost:5000
mlflow models serve -m models:/<model_name>@production -p 5001 --env-manager local
```

You should see something like this:

```bash
[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://127.0.0.1:5001 (236041)
[INFO] Using worker: sync
[INFO] Booting worker with pid: 236048
```

This means the server is running correctly 🎉

## Make requests to the model

The model is now deployed and ready to receive requests. We will make a request to the model using the test set.

- 👉 Prepare the test set to be sent as JSON
- 👉 Make a POST request to the model
- 👉 Get the predictions from the response and show them

In [4]:
import requests
import json


# 👇 Code Goes Here
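
The request could be sketched as follows. This assumes an MLflow 2.x scoring server, which accepts a `dataframe_split` JSON body at the `/invocations` endpoint and answers with a `predictions` key. The port 5001 comes from the deploy command above, while the helper names are hypothetical.

```python
import json

import pandas as pd
import requests


def build_payload(features: pd.DataFrame) -> str:
    """Serialize the feature columns (index dropped) as a `dataframe_split` body."""
    # to_json converts numpy types to plain JSON values
    split = json.loads(features.to_json(orient="split", index=False))
    return json.dumps({"dataframe_split": split})


def request_predictions(
    features: pd.DataFrame,
    url: str = "http://localhost:5001/invocations",
    timeout: float = 30.0,
) -> list:
    """POST the features to the deployed model and return its predictions."""
    response = requests.post(
        url,
        data=build_payload(features),
        headers={"Content-Type": "application/json"},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()["predictions"]


# Example call (requires the model server from the previous section):
# predictions = request_predictions(X_test)
# print(predictions[:5])
```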


## Push Results to Database

We push the results to the database so we can visualize them using other tools like Tableau, PowerBI, etc.

In [5]:
# Helper class used to connect to the database and push dataframes

import pandas as pd
import sqlalchemy as sa


class DatabaseConnection:

    def __init__(
        self,
        username: str,
        password: str,
        dialect: str = "mysql",
        driver: str = "pymysql",
        host: str = "database-1.cxlpff3hacbu.eu-west-3.rds.amazonaws.com",
        port: int = 3306,
        database: str = "sandbox",
    ) -> None:
        """Creates a connection to a database

        Args:
            username (str): username
            password (str): password
            dialect (str, optional): dialect. Defaults to "mysql".
            driver (str, optional): driver. Defaults to "pymysql".
            host (str, optional): host. Defaults to "database-1.cxlpff3hacbu.eu-west-3.rds.amazonaws.com".
            port (int, optional): port. Defaults to 3306.
            database (str, optional): database. Defaults to "sandbox".
        """
        connection_string = f"{dialect}+{driver}://{username}:{password}@{host}:{port}/{database}"
        self.engine = sa.create_engine(connection_string)

    def insert_dataframe(self, df: pd.DataFrame, table_name: str) -> None:
        """Inserts a dataframe into a table
        
        Args:
            df (pd.DataFrame): dataframe to insert
            table_name (str): table name
        """
        df.to_sql(table_name, self.engine, if_exists="replace", index=False)

    def query_to_df(self, query: str) -> pd.DataFrame:
        """Retrieves a dataframe from a query.

        Args:
            query (str): query to perform.

        Returns:
            pd.DataFrame: dataframe with the results of the query.
        """
        with self.engine.connect() as conn:
            df = pd.read_sql_query(query, conn)
            return df

    def check_connection(self) -> bool:
        """Checks if the connection is working

        Returns:
            bool: True if the connection is working, False otherwise
        """
        try:
            # Open and immediately close a connection to verify it works
            with self.engine.connect():
                return True
        except Exception as e:
            print(e)
            return False

Prepare the dataframe to upload to the database

In [10]:
# 👇 Fill this schema with the relevant data 

# Create a dataframe with the data to store
df_article_prediction = pd.DataFrame({
    "fecha": [],
    "cantidad": [],
    "articulo": [],  # repeat the article for each date
    "familia": [],  # repeat the family for each date
})
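
As an illustration of how the schema above might be filled, the sketch below builds the dataframe from a list of dates and a list of predicted quantities. The helper `build_prediction_frame` and the sample values are made up. Scalar values such as the article and family are repeated by pandas for every row.

```python
import pandas as pd


def build_prediction_frame(dates, predictions, product_id, product_family) -> pd.DataFrame:
    """One row per date; article and family are repeated on every row."""
    return pd.DataFrame({
        "fecha": list(dates),
        "cantidad": list(predictions),
        "articulo": product_id,      # scalar, broadcast to all rows
        "familia": product_family,   # scalar, broadcast to all rows
    })


# Made-up dates and predictions, only to show the resulting shape.
df_demo = build_prediction_frame(
    dates=pd.to_datetime(["2023-05-01", "2023-05-02"]),
    predictions=[5.0, 7.5],
    product_id=3960,
    product_family="BOLLERIA",
)
print(df_demo)
```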

Push the dataframe to the database

In [11]:
# Database credentials
USER = 'usuario1'
PASSWORD = 'C0d35p4ce.'
NAME = ""
table_name = f"Materials_Prediction_Group_{NAME}"


# Connect to the database
db = DatabaseConnection(USER, PASSWORD)
db.insert_dataframe(df_article_prediction, table_name)
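
Since pushing passwords to the repository is penalized, one option is to read the credentials from environment variables instead of committing them with the notebook. A minimal sketch, where the variable names `DB_USER` and `DB_PASSWORD` and the helper `read_db_credentials` are hypothetical:

```python
import os
from typing import Mapping, Tuple


def read_db_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (user, password) from the environment, failing loudly if unset."""
    try:
        return env["DB_USER"], env["DB_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from missing


# Usage in the notebook (DatabaseConnection is the class defined above):
# USER, PASSWORD = read_db_credentials()
# db = DatabaseConnection(USER, PASSWORD)
```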