# McKinsey Training - Constructing a Workflow in MLRun
This exercise will use the provided [Palmer Archipelago (Antarctica) penguin dataset](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data) and Python files to create a training and deployment pipeline. The pipeline will have 3 steps (data processing, model training, model deployment).

The three source code components are written for you. Your job is to use MLRun to containerize the components and orchestrate them into a larger pipeline (relevant links to documentation are provided).

In [None]:
import os
import mlrun

## 1. Create an MLRun Project
Use [mlrun.get_or_create_project](https://docs.mlrun.org/en/latest/api/mlrun.projects.html#mlrun.projects.get_or_create_project) to create a project with the name "penguin-classification" in the current directory. Ensure that `user_project=True` so that the project is unique to you.

Relevant docs: [Get a project from DB or create it](https://docs.mlrun.org/en/latest/projects/create-project.html#get-or-create)

In [None]:
project = mlrun.get_or_create_project(...)
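
For reference, the call described above can be sketched as follows. This is only a sketch: `create_penguin_project` is a hypothetical helper (not part of the exercise files), `context="./"` is one way to point at the current directory, and it assumes MLRun is installed and an MLRun API is reachable.

```python
def create_penguin_project():
    """Hypothetical helper: get or create the exercise project."""
    # Imported lazily so the helper can be defined without MLRun installed.
    import mlrun

    return mlrun.get_or_create_project(
        name="penguin-classification",
        context="./",       # project files live in the current directory
        user_project=True,  # adds your username to the project name
    )
```

In the notebook you would assign the result to `project`, e.g. `project = create_penguin_project()`.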

---

## 2. Register the Source Code as MLRun Functions

Use [project.set_function](https://docs.mlrun.org/en/latest/api/mlrun.projects.html#mlrun.projects.MlrunProject.set_function) to register the following 3 provided Python files as MLRun functions within the project:
- `data`: located in `src/data.py`; register with kind `job`; look at the source code for the name of the `handler`
- `train`: located in `src/train.py`; register with kind `job`; look at the source code for the name of the `handler`
- `serving`: located in `src/serve.py`; register with kind `serving`

Relevant docs: [Create and use functions](https://docs.mlrun.org/en/latest/runtimes/create-and-use-functions.html)

In [None]:
project.set_function(name="data", ...)
project.set_function(name="train", ...)
project.set_function(name="serving", ...)
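
One possible shape for these three registrations is sketched below. `register_functions` is a hypothetical helper, and the `mlrun/mlrun` image is an assumption rather than a requirement of the exercise. The handler names are deliberately left as parameters, since they have to be read from `src/data.py` and `src/train.py`.

```python
def register_functions(project, data_handler, train_handler, image="mlrun/mlrun"):
    """Hypothetical helper: register the three source files on an MLRun project.

    data_handler and train_handler are the handler function names found in
    src/data.py and src/train.py; they are not assumed here.
    """
    project.set_function(
        func="src/data.py", name="data", kind="job",
        image=image, handler=data_handler,
    )
    project.set_function(
        func="src/train.py", name="train", kind="job",
        image=image, handler=train_handler,
    )
    # The exercise lists no handler for the serving function, so none is passed.
    project.set_function(
        func="src/serve.py", name="serving", kind="serving", image=image,
    )
    return project
```

The helper accepts any object that exposes `set_function`, so it can be tried against a stub before running it against a real project.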

---

## 3. Write a Batch Workflow Using the 3 Functions

The batch workflow should have 3 steps and use each of the previously registered MLRun functions. The steps are: process data, train model, deploy model. A skeleton of the pipeline has been provided. You can edit the cell directly in the notebook; the `%%writefile` magic writes its contents to the corresponding `src/pipeline.py` file.

In general, for each step you will:
- Retrieve the function from the project via [project.get_function()](https://docs.mlrun.org/en/latest/api/mlrun.projects.html?highlight=get_function#mlrun.projects.MlrunProject.get_function)
- Optional: Apply customizations to the function (e.g. requests/limits, node selection, volume mounts, etc.)
- Execute the function via [project.run_function](https://docs.mlrun.org/en/latest/api/mlrun.projects.html?highlight=run_function#mlrun.projects.MlrunProject.run_function) for batch runtimes or [project.deploy_function](https://docs.mlrun.org/en/latest/api/mlrun.projects.html?highlight=deploy_function#mlrun.projects.MlrunProject.deploy_function) for real-time runtimes

Relevant docs: [Create and use functions](https://docs.mlrun.org/en/latest/runtimes/create-and-use-functions.html), [Build and run workflows/pipelines](https://docs.mlrun.org/en/latest/projects/build-run-workflows-pipelines.html), [Managing job resources](https://docs.mlrun.org/en/latest/runtimes/configuring-job-resources.html), [Inputs vs params](https://docs.mlrun.org/en/latest/concepts/submitting-tasks-jobs-to-functions.html#submit-tasks-jobs-using-run-function), [Adding models to a serving function](https://docs.mlrun.org/en/latest/api/mlrun.runtimes.html#mlrun.runtimes.ServingRuntime.add_model)

In [None]:
%%writefile src/pipeline.py
from kfp import dsl
import mlrun
import nuclio

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(
    name="penguin-classification-pipeline",
    description="Example of batch pipeline for palmer penguin dataset"
)
def pipeline(dataset: str, label_column: str = "species"):
    
    # Get current project
    project = mlrun.get_current_project()
    
    # Process data
    data_fn = project.get_function("data").apply(mlrun.mount_v3io())
    data_run = project.run_function(
        function=data_fn,
        inputs={},
        params={},
        outputs=[]
    )
    
    # Train a model
    train_fn = project.get_function("train")
    train_run = project.run_function(
        function=train_fn,
        inputs={},
        outputs=[]
    )

    # Deploy the model as a serverless function
    serving_fn = project.get_function("serving")
    serving_fn.add_model(...)
    deploy_run = project.deploy_function(function=serving_fn)

---

## 4. Register Batch Workflow in Project and Save

Next, register the newly written batch workflow into the project via [project.set_workflow()](https://docs.mlrun.org/en/latest/api/mlrun.projects.html?highlight=set_workflow#mlrun.projects.MlrunProject.set_workflow) and save.

Relevant docs: [Running a multi-stage workflow](https://docs.mlrun.org/en/latest/concepts/workflow-overview.html), [Projects and automated ML pipeline](https://docs.mlrun.org/en/latest/tutorial/04-pipeline.html)

In [None]:
project.set_workflow(...)
project.save()
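
A minimal sketch of this step, assuming the workflow file is the `src/pipeline.py` written earlier. The helper `register_workflow` and the workflow name `main` are hypothetical; any name works as long as the same name is used when the workflow is run.

```python
def register_workflow(project, name="main", workflow_path="src/pipeline.py"):
    """Hypothetical helper: attach a workflow file to the project and save it."""
    # The name is how the workflow is referred to later when it is run.
    project.set_workflow(name=name, workflow_path=workflow_path)
    # Persist the updated project definition.
    project.save()
    return name
```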

---

## 5. Execute the Workflow via the MLRun Project

Start a run of the newly registered workflow using [project.run()](https://docs.mlrun.org/en/latest/api/mlrun.projects.html?highlight=MlrunProject.run#mlrun.projects.MlrunProject.run). Pass a dictionary of arguments with the key `dataset` whose value is the path to the penguin dataset.

Relevant docs: [Running a multi-stage workflow](https://docs.mlrun.org/en/latest/concepts/workflow-overview.html), [Projects and automated ML pipeline](https://docs.mlrun.org/en/latest/tutorial/04-pipeline.html)

In [None]:
DATASET = f"{os.getcwd()}/data/palmer_penguins.csv"

In [None]:
project.run(...)
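
The argument dictionary and the run call can be sketched as below. `run_workflow` is a hypothetical helper, and the default workflow name `main` is an assumption that must match whatever name was passed to `project.set_workflow()`. `watch=True` is optional; it makes the call wait for the pipeline to finish.

```python
def run_workflow(project, dataset_path, workflow_name="main"):
    """Hypothetical helper: run a registered workflow on a dataset path."""
    # Keys in `arguments` map to the parameters of the pipeline function,
    # here the `dataset` parameter of `pipeline(dataset, label_column)`.
    arguments = {"dataset": dataset_path}
    return project.run(name=workflow_name, arguments=arguments, watch=True)
```

In the notebook this would be called as `run_workflow(project, DATASET)`.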

---

## 6. Send a Test HTTP Request to the Newly Deployed Model

Finally, use the provided model input to make a test HTTP request to the newly deployed model. You can retrieve the serving function via [project.get_function()](https://docs.mlrun.org/en/latest/api/mlrun.projects.html?highlight=get_function#mlrun.projects.MlrunProject.get_function) and invoke it via [serve_fn.invoke()](https://docs.mlrun.org/en/latest/api/mlrun.runtimes.html?highlight=invoke#mlrun.runtimes.RemoteRuntime.invoke).

Relevant docs: [Serving pre-trained ML/DL models](https://docs.mlrun.org/en/latest/tutorial/03-model-serving.html#deploy-the-serving-function), [Quick start tutorial](https://docs.mlrun.org/en/latest/tutorial/01-mlrun-basics.html#build-test-and-deploy-the-model-serving-functions)

In [None]:
MODEL_INPUT = {
    'inputs': [
        [0.0, 1.0, 0.0, 1.0, 0.0, 39.5, 16.7, 178.0, 3250.0],
        [1.0, 0.0, 0.0, 1.0, 0.0, 46.9, 14.6, 222.0, 4875.0],
        [0.0, 0.0, 1.0, 0.0, 1.0, 42.1, 19.1, 195.0, 4000.0],
        [0.0, 1.0, 0.0, 1.0, 0.0, 49.8, 17.3, 198.0, 3675.0],
        [1.0, 0.0, 0.0, 0.0, 1.0, 41.1, 18.2, 192.0, 4050.0]
    ]
}

In [None]:
serve_fn = project.get_function("serving")

In [None]:
serve_fn.invoke(...)
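
As a rough sketch, the request path and body for `invoke` could be assembled as follows. `build_infer_request` is a hypothetical helper, and the path assumes the serving function uses MLRun's V2 model-serving route (`/v2/models/<model name>/infer`). The model name is whatever key was given to `add_model` in the pipeline, so it is left as a parameter.

```python
import json


def build_infer_request(model_name, rows):
    """Hypothetical helper: build the path and JSON body for an inference call."""
    # Every row must carry the same number of features.
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("all input rows must have the same number of features")
    path = f"/v2/models/{model_name}/infer"
    body = {"inputs": rows}
    json.dumps(body)  # fail early if the body is not JSON-serializable
    return path, body
```

With the path and body in hand, the call would look like `serve_fn.invoke(path=path, body=body)`, where `body` has the same structure as `MODEL_INPUT`.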

## 7. Bonus: Apply Specific Resource Request/Limits to Training Job

As a bonus exercise, modify the batch workflow to apply specific resource requests/limits to the training job. Do this between retrieving the function from the project and executing the function itself.

Relevant docs: [Managing job resources](https://docs.mlrun.org/en/latest/runtimes/configuring-job-resources.html), [Customizing functions](https://docs.mlrun.org/en/latest/runtimes/create-and-use-functions.html#customizing-functions)
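
One way this could look is sketched below, using the `with_requests` and `with_limits` methods of MLRun job functions. `apply_training_resources` is a hypothetical helper and the memory and CPU values are placeholders, not recommendations for this dataset.

```python
def apply_training_resources(train_fn):
    """Hypothetical helper: set illustrative requests/limits on a job function."""
    # Requests: the resources the pod is guaranteed when it is scheduled.
    train_fn.with_requests(mem="1G", cpu=1)
    # Limits: the ceiling the pod may not exceed.
    train_fn.with_limits(mem="2G", cpu=2)
    return train_fn
```

Inside the pipeline, the helper would sit between the two existing lines of the training step: `train_fn = apply_training_resources(project.get_function("train"))`, followed by the `project.run_function(...)` call.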