<h1>Part 1 - Machine Learning Pipelines and Orchestration</h1>

<font size=3>
The goal of this section is to better understand what orchestration is and why it is needed in MLOps. To that end, we will use Prefect, an open-source, Python-based orchestrator.

This section follows three main steps : 

1 - Refactoring of the code from the previous section

2 - Creating tasks and workflows with Prefect

3 - Scheduling automatic runs of the workflows
</font>

## Intro

### What is Orchestration and why orchestrate ?

Orchestration refers to the automation of the tasks and processes needed to run software applications or machine learning models. It encompasses tasks related to development, deployment, and management.

> There are different kinds of orchestration: 
- **Infrastructure orchestration** : Automatic provisioning and management of servers, networks, storage, etc.
- **Deployment orchestration** : Automatic deployment of code into different environments such as dev, staging, testing, or production.
- **Service orchestration** : Automatic scaling and management of services/microservices inside applications.
- **Workflow orchestration** : Automatic building, testing, and deployment of pipelines.

> For the NYC taxi trip duration prediction, we will be implementing **Workflow orchestration**

> Why orchestrate : 

- Better management: have a central place to manage, auto run and monitor machine learning pipelines
- Control dependencies: set dependencies between ml steps and ensure the correct order of execution and error handling
- Configuration: set different behavior for pipelines, for example, when a failure happens
- Reduce manual intervention
- Have the possibility to distribute workload
- Have the possibility to integrate other tools

> Common Orchestration concepts :

- **DAG** : Directed Acyclic Graph. It represents the dependencies between the steps of a workflow: steps are connected by directed edges in a way that never forms a cycle.

- **SCHEDULING** : The run of a workflow is initiated by a time requirement, generally a date or a time interval defined by the user.

- **TRIGGER** : The run of a workflow is initiated by an action, e.g. the previous task completes, a specific task fails, or a new-data notification is received.
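
To make the DAG idea concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+). It is an illustration, not how Prefect builds its graph: the step names mirror the functions used later in this section, and the `has_cycle` helper is a hypothetical name introduced for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
dag = {
    "load_data": set(),
    "compute_target": {"load_data"},
    "filter_outliers": {"compute_target"},
    "encode_categorical_cols": {"filter_outliers"},
    "vectorize_dataframe": {"encode_categorical_cols"},
    "train_model": {"vectorize_dataframe"},
}

# static_order() yields every step after all of its dependencies.
execution_order = list(TopologicalSorter(dag).static_order())
print(execution_order)


def has_cycle(graph: dict) -> bool:
    """Return True if the dependency graph contains a cycle (so it is not a DAG)."""
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False


# Two steps that depend on each other can never be ordered.
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

An orchestrator needs the graph to be acyclic for exactly this reason: without a cycle, there is always a valid order in which to run the steps.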

## From notebook to Workflows :

### Why go from notebooks to python files for production ?

A major part of the model development phase is done using jupyter notebooks. \
Once the pure development phase is complete, we must ask ourselves a certain number of questions if we want to move our work into a production environment.
- How do we manage inputs, outputs, storage and dependencies?
- How do we integrate our work into an existing infrastructure?
- How do we get automatic retraining and predictions?

`Jupyter notebooks`:
- are easy to use and interactive
- make analysis and visualization easy
- allow running code in any desired order

But they have major drawbacks that make them **not suitable for production**. They are:
- hard to understand when there are a lot of cells
- very hard to integrate in infrastructures
- not designed for scalability
- not designed for collaboration
- not necessarily compatible across different environments
- not adapted for debugging

To solve the majority of these issues, a good practice is to transfer the work from notebooks to **.py** files (python modules and packages). \
Doing this, we can:
- improve organization and code understanding
- easily port and run scripts on other infrastructures
- use an environment and dependency manager to ensure consistency and reproducibility across different systems
- run in parallel and distribute across different machines
- make use of any orchestration tool

### Good practices preparing code for production

- Refactor code : 
    - turn the variables set in the notebook into function arguments
    - use clear function names and add docstrings
    - create entrypoint functions to perform specific operations
    - use type hints
- Split code into different files
- Manage dependencies with an environment manager and a `requirements.txt` file
- Use git and GitHub/GitLab for collaboration
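
As an illustration of the refactoring points above, here is a minimal sketch using only the standard library. The names `keep_valid_durations` and `main` and the sample values are hypothetical and only serve the example. It shows notebook variables turned into function arguments, a docstring, type hints, and an entrypoint function.

```python
import argparse
from typing import List, Optional, Sequence


def keep_valid_durations(
    durations: List[float], min_duration: float = 1, max_duration: float = 60
) -> List[float]:
    """Keep only the durations (in minutes) within [min_duration, max_duration]."""
    return [d for d in durations if min_duration <= d <= max_duration]


def main(argv: Optional[Sequence[str]] = None) -> List[float]:
    """Entrypoint: read the thresholds from the command line and run the filter."""
    parser = argparse.ArgumentParser(description="Filter ride durations.")
    parser.add_argument("--min-duration", type=float, default=1)
    parser.add_argument("--max-duration", type=float, default=60)
    args = parser.parse_args(argv)
    sample = [0.5, 12.0, 45.0, 180.0]  # made-up values for illustration
    return keep_valid_durations(sample, args.min_duration, args.max_duration)


if __name__ == "__main__":
    # An explicit argument list keeps the example independent of sys.argv;
    # a real script would simply call main().
    print(main(["--max-duration", "30"]))
```

Because the thresholds are arguments rather than notebook globals, the same function can be reused by a script, a test, or an orchestrated task.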


## 0 - Setup

### 0.1 - Imports and globals

In [1]:
import asyncio
import os
import pickle
import random
from typing import List, Dict
from pathlib import Path

from prefect import task, flow
from prefect.deployments import Deployment
from prefect.orion.schemas.schedules import (
    CronSchedule,
    IntervalSchedule,
)
from scipy.sparse import csr_matrix
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import urllib.request

In [2]:
ROOT_PATH = Path("..")


class Config:
    TRAIN_DATA = ROOT_PATH / "data/yellow_tripdata_2021-01.parquet"
    TEST_DATA = ROOT_PATH / "data/yellow_tripdata_2021-02.parquet"
    INFERENCE_DATA = ROOT_PATH / "data/yellow_tripdata_2021-03.parquet"
    LOCAL_STORAGE = ROOT_PATH / "results"
    CATEGORICAL_VARS = ["PULocationID", "DOLocationID"]


config = Config()

If you have not run the previous notebook, please run this cell

In [3]:
urllib.request.urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
    config.TRAIN_DATA,
)
urllib.request.urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-02.parquet",
    config.TEST_DATA,
)
urllib.request.urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-03.parquet",
    config.INFERENCE_DATA,
);

## 1 - Refactoring 

A good rule of thumb when writing functions is to delegate a single task to a single function. When working in a Machine Learning project, we often need to chain several operations one after the other. This implies that we need to combine different functions into a workflow.

In this section, we will be refactoring our code to write two workflows:

- A preprocessing workflow
- A model training workflow

### 1.1 - Preprocessing

The preprocessing workflow is composed of several sequential steps: 
- Loading the data
- Computing the target
- Filtering the outliers
- Encoding categorical columns
- Vectorizing the inputs

We will re-use the functions we defined in the previous step for this part

In [4]:
def load_data(path: str) -> pd.DataFrame:
    "Loads a dataframe from a parquet file."
    return pd.read_parquet(path)


def compute_target(
    taxi_rides: pd.DataFrame,
    pickup_column: str = "tpep_pickup_datetime",
    dropoff_column: str = "tpep_dropoff_datetime",
) -> pd.DataFrame:
    taxi_rides["duration"] = taxi_rides[dropoff_column] - taxi_rides[pickup_column]
    taxi_rides["duration"] = taxi_rides["duration"].dt.total_seconds() / 60
    return taxi_rides


def filter_outliers(
    taxi_rides: pd.DataFrame, min_duration: int = 1, max_duration: int = 60
) -> pd.DataFrame:
    """Filter out outliers based on the ride duration."""
    return taxi_rides[taxi_rides["duration"].between(min_duration, max_duration)]


def encode_categorical_cols(
    taxi_rides: pd.DataFrame, categorical_cols: List[str] = None
) -> pd.DataFrame:
    """Encode categorical columns in `taxi_rides` dataframe."""
    if categorical_cols is None:
        categorical_cols = config.CATEGORICAL_VARS
    taxi_rides[categorical_cols] = (
        taxi_rides[categorical_cols].fillna(-1).astype("int").astype("str")
    )
    return taxi_rides


def vectorize_dataframe(
    taxi_rides: pd.DataFrame,
    categorical_cols: List[str] = None,
    dict_vectorizer: DictVectorizer = None,
    with_target: bool = True,
) -> Dict:
    """Convert a DataFrame into a sparse matrix and target array, optionally using a pre-fit DictVectorizer.

    Args:
        taxi_rides (pd.DataFrame): DataFrame to be converted.
        categorical_cols (List[str], optional): Columns to vectorize. Defaults to `config.CATEGORICAL_VARS`.
        dict_vectorizer (DictVectorizer, optional): A pre-fit DictVectorizer. If None, a new one is fit on the data.
        with_target (bool): Whether to extract the `duration` column as the target. Defaults to True.

    Returns:
        Dict: A dictionary with keys "x" (sparse feature matrix), "y" (target array, or None when
        `with_target` is False) and "dv" (the DictVectorizer used to perform the conversion).
    """
    if categorical_cols is None:
        categorical_cols = config.CATEGORICAL_VARS
    dicts = taxi_rides[categorical_cols].to_dict(orient="records")

    target = None
    if with_target:
        target = taxi_rides["duration"].values

    if dict_vectorizer is None:
        dict_vectorizer = DictVectorizer()
        dict_vectorizer.fit(dicts)

    features = dict_vectorizer.transform(dicts)

    return {"x": features, "y": target, "dv": dict_vectorizer}

Your task now is to create a function that will chain all these operations together. This will be a workflow that can be called without going into the details of what kind of processing is done to the data. 

In [5]:
def process_data(path: str, dict_vectorizer=None, with_target: bool = True) -> dict:
    """
    Process the input data to extract features, apply necessary transformations,
    and convert the data into a sparse matrix.

    Args:
    - path (str): Path to the input data file
    - dict_vectorizer (DictVectorizer): A scikit-learn DictVectorizer object
    - with_target (bool): Indicates if the target variable is to be included in the processed data.
        Default is True.

    Returns:
    - dict: A dictionary containing processed data in a sparse matrix format and target variable if applicable

    """
    df = load_data(path)
    if with_target:
        df = df.pipe(compute_target).pipe(filter_outliers).pipe(encode_categorical_cols)
        return vectorize_dataframe(df, dict_vectorizer=dict_vectorizer)
    else:
        df1 = encode_categorical_cols(df)
        return vectorize_dataframe(
            df1, dict_vectorizer=dict_vectorizer, with_target=with_target
        )

### 1.2 - Model training

Now we'll do the same for the model training. Let's write functions that will be called by 3 different workflows:

In [6]:
def train_model(x_train: csr_matrix, y_train: np.ndarray) -> LinearRegression:
    """Train and return a linear regression model."""
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    return lr


def predict_duration(input_data: csr_matrix, model: LinearRegression) -> np.ndarray:
    """Predict taxi ride duration using a trained model."""
    return model.predict(input_data)


def evaluate_model(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Calculate the root mean squared error (RMSE) between two arrays."""
    return mean_squared_error(y_true, y_pred, squared=False)


def load_pickle(path: str):
    with open(path, "rb") as f:
        loaded_obj = pickle.load(f)
    return loaded_obj


def save_pickle(path: str, obj: dict) -> None:
    with open(path, "wb") as f:
        pickle.dump(obj, f)
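
Note that, in scikit-learn versions that accept the `squared` argument, `mean_squared_error(..., squared=False)` returns the root mean squared error (RMSE) rather than the MSE. The pure-Python sketch below shows what that metric computes. The `rmse` helper and the sample values are hypothetical and only serve the illustration.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: square root of the mean of the squared residuals."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Durations in minutes (made-up values): the residuals are 3, -4, 0 and 5.
print(rmse([10.0, 20.0, 30.0, 40.0], [7.0, 24.0, 30.0, 35.0]))
```

The RMSE is expressed in the same unit as the target (minutes here), which makes it easier to interpret than the MSE.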

Complete the following function with your code to train, predict and evaluate : 

In [7]:
def train_and_predict(x_train, y_train, x_test, y_test) -> dict:
    """Train model, predict values and calculate error"""
    model = train_model(x_train, y_train)
    prediction = predict_duration(x_test, model)
    mse = evaluate_model(y_test, prediction)
    return {"model": model, "mse": mse}


def complete_ml(
    train_path: str,
    test_path: str,
    save_model: bool = True,
    save_dv: bool = True,
    local_storage: str = Config.LOCAL_STORAGE,
) -> None:
    """Load, process and save data and model.

    Loads data, prepares sparse matrix using `DictVectorizer`, trains model,
    makes predictions, calculates error, and saves the model and
    `DictVectorizer` in pickle format.

    Args:
        train_path: Path to training data file.
        test_path: Path to testing data file.
        save_model: Boolean indicating whether to save the trained model.
        save_dv: Boolean indicating whether to save the
                 `DictVectorizer` used to process the data.
        local_storage: Path to the local storage location where the model and
                       `DictVectorizer` should be saved.
    """
    if not os.path.exists(local_storage):
        os.makedirs(local_storage)

    train_data = process_data(train_path)
    test_data = process_data(test_path, dict_vectorizer=train_data["dv"])
    model_obj = train_and_predict(
        train_data["x"], train_data["y"], test_data["x"], test_data["y"]
    )
    if save_model:
        save_pickle(f"{local_storage}/model.pickle", model_obj)
    if save_dv:
        save_pickle(f"{local_storage}/dv.pickle", train_data["dv"])


def batch_inference(
    input_path, dv=None, model=None, local_storage=Config.LOCAL_STORAGE
):
    """Loads data and predicts ride duration

    If a DictVectorizer is not provided, it will be loaded from the local storage. If a model is not provided, it will be
    loaded from the local storage. The processed data and the predictions are returned.

    Args:
        input_path (str): Path to the input data
        dv (DictVectorizer): DictVectorizer object to transform data
        model (Model): Model object to make predictions
        local_storage (str): Path to the local storage

    Returns:
        np.ndarray: The predictions for the input data
    """
    if not dv:
        dv = load_pickle(f"{local_storage}/dv.pickle")
    data = process_data(input_path, dv, with_target=False)
    if not model:
        model = load_pickle(f"{local_storage}/model.pickle")["model"]
    return predict_duration(data["x"], model)

## 2 - Orchestrators

Here are a few popular workflow orchestrators :

- Apache Airflow : Open-source platform, used for scheduling and managing workflows. It has a large and active community.

- Prefect : Open-source platform, designed to be highly flexible, easily deployable and scalable. It offers a Python API.

- Flyte : Open-source platform for unified workflow management across cloud and on-premises environments. It also provides a Python API.

- AWS Step Functions : Serverless workflow service offered by Amazon Web Services. It supports building and executing workflows with multiple steps.

- Zapier : Web-based platform that provides a visual interface for connecting different web applications. It does not require coding knowledge.

> You can find [a short Orchestrators Benchmark here](https://miro.medium.com/max/1400/1*b6CAci-A4TfuYwM9coY6nw.webp)

### Workflow orchestration with prefect

> **Version** : 
> This module has been created using Prefect 2.7.9

#### Main Prefect concepts

Prefect uses Python function decorators to build jobs. 
As long as your main workflow function is decorated, every run of that flow is observable from the Prefect UI.

Basic concepts : 
- **Tasks** : Units of work written in Python. A task is a function decorated with the Prefect `@task` decorator. Tasks can only be called from within flows.
- **Flows** : DAGs that represent a group of interdependent tasks. A flow is a function decorated with the `@flow` decorator.
- **Engine** : Defines where flows run. This is where workload distribution is managed. In this course, we only use the local machine.
- **State** : Prefect objects that describe the status of a flow or task run (scheduled, running, completed, failed, ...) and carry information about its data.

Deployment concepts : 
- **Deployment object** : A Prefect entity that the API understands and uses for scheduling, automatic runs, etc.
- **Work queue** : Created once a deployment has been applied. It lists all the upcoming runs and their state (scheduled, running, ...).
- **Agent** : Responsible for pulling flow runs from work queues and running them at the right time.
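
The relationship between scheduled runs, a work queue and an agent can be pictured with the toy simulation below. It is a conceptual sketch only and does not use Prefect's API. `ScheduledRun`, `WorkQueue`, `agent_poll` and `toy_flow` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class ScheduledRun:
    scheduled_time: datetime
    flow: Callable[[], str]
    state: str = "Scheduled"


@dataclass
class WorkQueue:
    runs: List[ScheduledRun] = field(default_factory=list)

    def due_runs(self, now: datetime) -> List[ScheduledRun]:
        """Return the runs that are still scheduled and whose time has come."""
        return [
            run
            for run in self.runs
            if run.state == "Scheduled" and run.scheduled_time <= now
        ]


def agent_poll(queue: WorkQueue, now: datetime) -> List[str]:
    """One polling cycle of the agent: pull the due runs and execute them."""
    results = []
    for run in queue.due_runs(now):
        run.state = "Running"
        results.append(run.flow())
        run.state = "Completed"
    return results


def toy_flow() -> str:
    return "flow finished"


start = datetime(2021, 3, 7)
queue = WorkQueue(
    [ScheduledRun(start + timedelta(minutes=m), toy_flow) for m in (1, 2, 3)]
)

# Two minutes after `start`, two of the three runs are due.
print(agent_poll(queue, start + timedelta(minutes=2)))
print([run.state for run in queue.runs])
```

In this picture, the deployment is what fills the queue with scheduled runs, and nothing executes until an agent polls the queue.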

### 2.0 - Set up Prefect UI

Steps :

- Set an API URL for your local server to make sure that your workflow will be tracked by this specific instance : 
    - `prefect config set PREFECT_API_URL=http://0.0.0.0:4200/api`
- Start a local prefect server : `prefect orion start --host 0.0.0.0`

Prefect database is stored at `~/.prefect/orion.db`. If you want to reset the database, run `prefect orion database reset`

In [8]:
###################################################
## Set your prefect UI up (Use a separate terminal)
## Do NOT run the command on this cell using "!"
###################################################

It is possible to configure the behavior of tasks and flows using arguments in the decorators : 
- name, tags
- version
- retries on failure
- etc ...
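
To make the retry arguments concrete, here is a plain-Python sketch of what a retry policy does conceptually. It is not Prefect's implementation. `with_retries` and `flaky_step` are hypothetical names, and the argument names are chosen only to echo `retries` and `retry_delay_seconds`.

```python
import functools
import time
from typing import Any, Callable


def with_retries(retries: int = 3, retry_delay_seconds: float = 0.0) -> Callable:
    """Re-run the decorated function up to `retries` extra times when it raises."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # no retries left: surface the error
                    time.sleep(retry_delay_seconds)

        return wrapper

    return decorator


calls = {"count": 0}


@with_retries(retries=3, retry_delay_seconds=0)
def flaky_step() -> str:
    """Fail on the first two calls, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("bad code")
    return "done"


result = flaky_step()
print(result, "after", calls["count"], "attempts")
```

With an orchestrator, this policy is declared in the decorator instead of being hand-written, and each attempt is visible in the UI.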

In [None]:
@task(name="failure_task", tags=["fails"], retries=3, retry_delay_seconds=60)
def failure():
    print("running")
    if random.randint(1, 10) % 2 == 0:
        raise ValueError("bad code")


@flow(name="failure_flow", version="1.0")
def test_failure():
    failure()

In [None]:
test_failure()

### 2.1 - First flow : Processing flow

Create a processing flow using the functions we defined in section 1 and prefect (copy-paste and decorate): 
 - tasks : all subfunctions of your preprocessing  
 - flows : your preprocessing 

In [None]:
@task(name="load_data", tags=["preprocessing"], retries=2, retry_delay_seconds=60)
def load_data(path: Path) -> pd.DataFrame:
    return pd.read_parquet(path)


@task(name="compute_duration", tags=["preprocessing"])
def compute_target(
    taxi_rides: pd.DataFrame,
    pickup_column: str = "tpep_pickup_datetime",
    dropoff_column: str = "tpep_dropoff_datetime",
) -> pd.DataFrame:
    taxi_rides["duration"] = taxi_rides[dropoff_column] - taxi_rides[pickup_column]
    taxi_rides["duration"] = taxi_rides["duration"].dt.total_seconds() / 60
    return taxi_rides


@task(name="filter_outliers", tags=["preprocessing"])
def filter_outliers(
    taxi_rides: pd.DataFrame, min_duration: int = 1, max_duration: int = 60
) -> pd.DataFrame:
    """Filter out outliers based on the ride duration."""
    return taxi_rides[taxi_rides["duration"].between(min_duration, max_duration)]


@task(name="encode_cat_cols", tags=["preprocessing"])
def encode_categorical_cols(
    taxi_rides: pd.DataFrame, categorical_cols: List[str] = None
) -> pd.DataFrame:
    """Encode categorical columns in `taxi_rides` dataframe."""
    if categorical_cols is None:
        categorical_cols = config.CATEGORICAL_VARS
    taxi_rides[categorical_cols] = (
        taxi_rides[categorical_cols].fillna(-1).astype("int").astype("str")
    )
    return taxi_rides


@task(name="vectorize_dataframe", tags=["preprocessing"])
def vectorize_dataframe(
    taxi_rides: pd.DataFrame,
    categorical_cols: List[str] = None,
    dict_vectorizer: DictVectorizer = None,
    with_target: bool = True,
) -> Dict:
    """Convert a DataFrame into a sparse matrix and target array, optionally using a pre-fit dictionary."""
    if categorical_cols is None:
        categorical_cols = config.CATEGORICAL_VARS
    dicts = taxi_rides[categorical_cols].to_dict(orient="records")

    target = None
    if with_target:
        target = taxi_rides["duration"].values

    if dict_vectorizer is None:
        dict_vectorizer = DictVectorizer()
        dict_vectorizer.fit(dicts)

    features = dict_vectorizer.transform(dicts)

    return {"x": features, "y": target, "dv": dict_vectorizer}

In [None]:
@flow(name="Data processing", retries=1, retry_delay_seconds=30)
def process_data(path: Path, dict_vectorizer=None, with_target: bool = True) -> dict:
    """
    Load data from a parquet file
    Compute the target (duration column) and filter outliers (only when with_target is True)
    Turn features into a sparse matrix
    :return A dictionary with the sparse matrix ("x"), the target values ("y")
    and the DictVectorizer object ("dv").
    """
    df = load_data(path)
    if with_target:
        df = df.pipe(compute_target).pipe(filter_outliers).pipe(encode_categorical_cols)
        return vectorize_dataframe(df, dict_vectorizer=dict_vectorizer)
    else:
        df1 = encode_categorical_cols(df)
        return vectorize_dataframe(
            df1, dict_vectorizer=dict_vectorizer, with_target=with_target
        )

In [None]:
res = process_data(Config.TRAIN_DATA)

res_without_y = process_data(
    path=Config.INFERENCE_DATA, dict_vectorizer=res["dv"], with_target=False
)

### 2.2 - Training flow

Replicate what you just did above, but now for the training flows

In [None]:
@task(name="Train model", tags=["Model"])
def train_model(x_train: csr_matrix, y_train: np.ndarray) -> LinearRegression:
    """Train and return a linear regression model"""
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    return lr


@task(name="Make prediction", tags=["Model"])
def predict_duration(input_data: csr_matrix, model: LinearRegression) -> np.ndarray:
    """
    Use trained linear regression model
    to predict target from input data
    :return array of predictions
    """
    return model.predict(input_data)


@task(name="Evaluation", tags=["Model"])
def evaluate_model(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Calculate the root mean squared error (RMSE) between two arrays"""
    return mean_squared_error(y_true, y_pred, squared=False)


@task(name="Load", tags=["Serialize"])
def load_pickle(path: str):
    with open(path, "rb") as f:
        loaded_obj = pickle.load(f)
    return loaded_obj


@task(name="Save", tags=["Serialize"])
def save_pickle(path: str, obj: dict):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

In [None]:
@flow(name="Model initialisation")
def train_and_predict(x_train, y_train, x_test, y_test) -> dict:
    """Train model, predict values and calculate error"""
    model = train_model(x_train, y_train)
    prediction = predict_duration(x_test, model)
    mse = evaluate_model(y_test, prediction)
    return {"model": model, "mse": mse}


@flow(name="Example Machine learning workflow", retries=1, retry_delay_seconds=30)
def complete_ml(
    train_path: Path,
    test_path: Path,
    save_model: bool = True,
    save_dv: bool = True,
    local_storage: Path = Config.LOCAL_STORAGE,
) -> None:
    """
    Load data and prepare sparse matrix (using dictvectorizer) for model training
    Train model, make predictions and calculate error
    Save model and dictvectorizer to a folder in pickle format
    :return none
    """
    if not os.path.exists(local_storage):
        os.makedirs(local_storage)

    train_data = process_data(train_path)
    test_data = process_data(test_path, dict_vectorizer=train_data["dv"])
    model_obj = train_and_predict(
        train_data["x"], train_data["y"], test_data["x"], test_data["y"]
    )
    if save_model:
        save_pickle(f"{local_storage}/model.pickle", model_obj)
    if save_dv:
        save_pickle(f"{local_storage}/dv.pickle", train_data["dv"])


@flow(name="Batch inference", retries=1, retry_delay_seconds=30)
def batch_inference(
    input_path, dv=None, model=None, local_storage=Config.LOCAL_STORAGE
):
    """
    Load model and dictvectorizer from folder
    Transforms input data with dictvectorizer
    Predict values using loaded model
    :return array of predictions
    """
    if not dv:
        dv = load_pickle(f"{local_storage}/dv.pickle")
    data = process_data(input_path, dv, with_target=False)
    if not model:
        model = load_pickle(f"{local_storage}/model.pickle")["model"]
    return predict_duration(data["x"], model)

In [None]:
complete_ml(Config.TRAIN_DATA, Config.TEST_DATA)

In [None]:
batch_inference(Config.INFERENCE_DATA)

### 2.3 - Deployments

Prefect deployment objects are entities used by the Prefect API to understand scheduling requirements. \
A flow can be used in multiple deployment objects, but a deployment object is associated with a single flow. \
Applying a deployment creates a work queue; an agent then pulls the scheduled runs from that queue and executes them.

There are two types of scheduling that can be used with Prefect : 
- cron scheduling : defines run dates based on a cron expression, e.g. `"0 0 * * 0"` (every Sunday at 00:00)
- interval scheduling : defines the interval between runs in seconds, minutes, etc.
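
The sketch below shows how the two schedule types translate into run times, using only the standard library. It is a hand-rolled illustration, not Prefect's scheduler. `next_interval_runs` and `next_sunday_midnight` are hypothetical helpers. In the cron expression `0 0 * * 0`, the five fields are minute 0, hour 0, any day of the month, any month, and day of week 0 (Sunday).

```python
from datetime import datetime, timedelta
from typing import List


def next_interval_runs(
    start: datetime, interval: timedelta, count: int
) -> List[datetime]:
    """Interval scheduling: runs are evenly spaced from an anchor date."""
    return [start + i * interval for i in range(1, count + 1)]


def next_sunday_midnight(now: datetime) -> datetime:
    """Next run time strictly after `now` for the cron expression "0 0 * * 0"."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0 and Sunday is 6.
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


now = datetime(2021, 3, 3, 15, 30)  # a Wednesday, chosen arbitrarily
print(next_interval_runs(now, timedelta(seconds=60), 3))
print(next_sunday_midnight(now))
```

Cron expressions pin runs to calendar positions, whereas intervals only fix the spacing between consecutive runs.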

> ATTENTION: To configure scheduling with Prefect, all your flows and tasks have to be written in a Python file. 

**Step 1 : Copy your tasks/flows in a python file**

Schedules are created using : 
```
Deployment.build_from_flow(
    flow=<your flow function>,
    name=<the name of the deployment>,
    version=<optional>,
    tags=[<optional tags>],
    schedule=<CronSchedule(...)> / <IntervalSchedule(...)>,
    apply=True,  # send to the Prefect API
    parameters={...},
    entrypoint="<path to your python file .py>:<your flow function name>",
)
```

**Step 2 : Define schedules to run the complete ML workflow every Sunday and the batch inference at a desired time interval**

In [None]:
modeling_deployment_every_sunday = await Deployment.build_from_flow(
    name="Model training Deployment",
    flow=complete_ml,
    version="1.0",
    tags=["model"],
    schedule=CronSchedule(cron="0 0 * * 0"),
    apply=True,
    entrypoint="/app/lib/prefect_workflows.py:complete_ml",
    parameters={
        "train_path": Config.TRAIN_DATA.resolve(),
        "test_path": Config.TEST_DATA.resolve(),
    },
)


inference_deployment_every_minute = await Deployment.build_from_flow(
    name="Model Inference Deployment",
    flow=batch_inference,
    version="1.0",
    tags=["inference"],
    schedule=IntervalSchedule(interval=60),
    apply=True,
    entrypoint="/app/lib/prefect_workflows.py:batch_inference",
    parameters={"input_path": Config.INFERENCE_DATA.resolve()},
)

A Prefect agent is needed to pull the scheduled runs from the work queue and run the flows at the right time/interval.
Start one with : 
```
prefect agent start default
```

#### Open Discussion : 

- Discuss a global vision of how to implement pipeline triggering (by action, not time)
- Discuss a way to apply it to the NYC use case. 