<div style='background-color: rgba(0, 255, 255, 0.04); border: 1px solid rgba(0, 255, 255, .2);'>
<div style='text-align: justify; margin-left: 5px; margin-right: 5px;'>
<div style="float: left; border-right: 5px solid transparent;">
<table border="0" width="350px;" style="background-color: #f5f5f5; float: left;">
    <tr>
        <td colspan=2>
            <img alt="retrain-pipelines" src="https://github.com/user-attachments/assets/19725866-13f9-48c1-b958-35c2e014351a" />
        </td>
    </tr>
    <tr>
        <td colspan=2>
            <img alt="Metaflow" width="250px" src="https://github.com/user-attachments/assets/ecc20501-869d-4159-b5a0-eb0a117520e5" />
        </td>
    </tr>
    <tr>
        <td style="vertical-align: center;">Dask</td>
        <td> 
            <img alt="Dask" width="50px" src="https://github.com/user-attachments/assets/a94807e7-cc67-4415-9a9e-da1ed4755cb1" />
        </td>
    </tr>
    <tr>
        <td style="vertical-align: center;">LightGBM</td>
        <td> 
            <img alt="LightGBM" width="30px" src="https://github.com/user-attachments/assets/92ac0b53-17f8-470d-9c73-619657db42bd" />
        </td>
    </tr>
    <tr>
        <td style="vertical-align: center;">ML server</td>
        <td> 
            <img alt="ML server" width="50px" src="https://github.com/user-attachments/assets/69c57bce-cd38-4f8c-8730-e5171e842d13" />
        </td>
    </tr>
</table></div>
<br />
Welcome to this introductory notebook for the <code>LightGbmHpCvWandbFlow</code> sample pipeline from the <b>retrain-pipelines</b> library.<br />
This sample retraining pipeline covers the tabular data regression use case. More specifically, it employs data parallelism with <a href="https://www.dask.org/" target="_blank">Dask</a> and a <a href="https://lightgbm.readthedocs.io/en/stable/" target="_blank">LightGBM</a> model.<br />
The infrastructure validation (the ability of newly-retrained model versions to accept and respond to inference requests) relies here on <a href="https://www.seldon.io/solutions/seldon-mlserver" target="_blank">ML Server</a> where we pack the fitted inference pipeline and put it to the test.
<br clear="left" />
</div>
</div>

<div style='background-color: rgba(0, 255, 255, 0.04); border: 1px solid rgba(0, 255, 255, .2);'>
<div style='text-align: justify; margin-left: 5px; margin-right: 5px;'>
This notebook supports the <code>LightGbmHpCvWandbFlow</code> sample pipeline from the <b>retrain-pipelines</b> library. It is a step-by-step guide to help you get up to speed with it quickly.<br />
<br />
From here, you can&nbsp;:
<ul>
    <li>
        Execute a <b>retrain-pipelines</b> run&nbsp;:
        <ul>
            <li>
                generate a synthetic dataset if you need one for a quick start
            </li>
            <li>
                set a hyperparameter search space
            </li>
            <li>
                launch a <b>retrain-pipelines</b> run
            </li>
            <li>
                even start customizing default <code>preprocessing</code> and <code>pipeline_card</code> if you feel like it&nbsp;!
            </li>
        </ul>
    </li>
    <li>
        Do some after-the-fact investigation thanks to the collection of <code>inspectors</code> offered by the <b>retrain-pipelines</b> library
    </li>
</ul>
<br />
<p style="text-align: justify; color: darkgray;">
<u>REMARK</u>&nbsp;: if you haven't done so already, check <a href="https://github.com/aurelienmorgan/retrain-pipelines/tree/master/extra/frameworks" target="_blank">this section</a> for a local <em>Metaflow</em> installation. It comes in handy for quick prototyping and testing.
</p>
</div>
</div>

<font size="6em"><b>Table of Contents</b></font>

- [setup](#setup)
- [Generate data](#Generate-data)
- [Metaflow Run](#Metaflow-Run)
  - [HP tuning search space](#HP-tuning-search-space)
  - [Run flow](#Run-flow)
    - [Use the as-is sample pipeline](#Use-the-as-is-sample-pipeline)
    - [Customize your retraining pipeline](#Customize-you-retraining-pipeline)
- [Inspectors](#Inspectors)
  - [local Metaflow SDK](#local-Metaflow-SDK)
  - [local custom card explorer](#local-custom-card-explorer)
  - [WandB](#WandB)
  - [hp_cv_inspector](#hp_cv_inspector)
- [Congratulations&nbsp;!](#Congratulationsnbsp)

<hr />

# setup

In [None]:
# !pip install -r requirements.txt

In [None]:
# !pip install retrain-pipelines

In [None]:
%reload_ext autoreload
%autoreload 2

import os, json

# WandB API key
from dotenv import load_dotenv
load_dotenv("../.env")

<hr />

# Generate data

In [None]:
from retrain_pipelines.dataset import DatasetType, pseudo_random_generate

num_samples = 10_000  # other sizes to try: 30, 500, 1_500
data = pseudo_random_generate(DatasetType.TABULAR_REGRESSION, num_samples)
print(data.head())
# save to file
data.to_csv(os.path.realpath(os.path.join('..', 'data', 'synthetic_classif_tab_data_continuous.csv')), index=False)

<hr />

# Metaflow Run

## HP tuning search space

Choose the domain to be considered for the HP tuning grid search&nbsp;:

In [None]:
pipeline_hp_grid = {
    "boosting_type": ["gbdt"],
    "num_leaves": [10],
    "learning_rate": [0.01],
    "n_estimators": [2],
}
# json.dumps already returns a single-line str (no indent by default)
os.environ['pipeline_hp_grid'] = json.dumps(pipeline_hp_grid)
print(os.environ['pipeline_hp_grid'])
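
The grid is handed over as a JSON string stored in an environment variable. The sketch below shows that round trip with the standard library only. The variable name `demo_hp_grid` is hypothetical (chosen so as not to overwrite the real one), and how the flow itself parses the value is an assumption here, not something taken from the library's source.

```python
import json
import os

# Same minimal grid as in the cell above.
grid = {
    "boosting_type": ["gbdt"],
    "num_leaves": [10],
    "learning_rate": [0.01],
    "n_estimators": [2],
}

# json.dumps emits a single-line string by default (no indent),
# so it can be stored in an environment variable as-is.
os.environ["demo_hp_grid"] = json.dumps(grid)

# Hypothetical consumer side: parse the string back into a dict.
recovered = json.loads(os.environ["demo_hp_grid"])
print(recovered == grid)
```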

In [None]:
import itertools

pipeline_hp_grid = {
    "boosting_type": ["gbdt"],
    "num_leaves": [75, 100, 125],
    "learning_rate": [0.01],
    "n_estimators": [150, 200],
    "lambda_l1": [0, 0.05],
    "lambda_l2": [0.1, 0.2, 0.3],
    "bagging_fraction": [1, 0.95],
}
os.environ['pipeline_hp_grid'] = json.dumps(pipeline_hp_grid)
print(os.environ['pipeline_hp_grid'])
# one combination per element of the Cartesian product of the value lists
combinations_count = \
    len(list(itertools.product(*pipeline_hp_grid.values())))
print(f"{combinations_count} sets of hyperparameter values")
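
As a cross-check on the count printed above, the number of combinations in a full grid is the product of the lengths of its value lists. A minimal sketch, reusing the same grid values:

```python
import itertools
import math

grid = {
    "boosting_type": ["gbdt"],
    "num_leaves": [75, 100, 125],
    "learning_rate": [0.01],
    "n_estimators": [150, 200],
    "lambda_l1": [0, 0.05],
    "lambda_l2": [0.1, 0.2, 0.3],
    "bagging_fraction": [1, 0.95],
}

# Closed form: 1 * 3 * 1 * 2 * 2 * 3 * 2
expected = math.prod(len(values) for values in grid.values())

# Explicit enumeration, one dict per combination.
combinations = [
    dict(zip(grid.keys(), values))
    for values in itertools.product(*grid.values())
]

print(expected, len(combinations))  # 72 72
print(combinations[0])
```

Adding one more value to any list multiplies the total, so the grid grows quickly; this is worth keeping in mind before launching a run.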

## Run flow

### Use the as-is sample pipeline

Load the cell-magic&nbsp;:

In [None]:
%load_ext retrain_pipelines.local_launcher_magic

Take a look at the help for the retraining pipeline&nbsp;:

In [None]:
%retrain_pipelines_local retraining_pipeline.py run --help

You can launch a <b>retrain-pipelines</b> run&nbsp;:

In [None]:
%retrain_pipelines_local retraining_pipeline.py run \
    --data_file "../data/synthetic_classif_tab_data_continuous.csv" \
    --buckets_param '{"num_feature1": 100, "num_feature2": 50}' \
    --pipeline_hp_grid "{pipeline_hp_grid}" \
    --cv_folds 2 \
    --max-workers 4 \
    --dask_partitions 4 \
    --wandb_run_mode offline

You can also resume a prior run from the step of your choosing&nbsp;:

In [None]:
%retrain_pipelines_local retraining_pipeline.py resume pipeline_card

### Customize your retraining pipeline<a id="Customize-you-retraining-pipeline"></a>

Start by getting the defaults you'd like to customize (any combination of the 3 below)&nbsp;:
<ul>
    <li><code>preprocessing.py</code> module</li>
    <li><code>pipeline_card.py</code> module</li>
    <li><code>template.html</code> html template</li>
</ul>

In [None]:
from retraining_pipeline import LightGbmHpCvWandbFlow

LightGbmHpCvWandbFlow.copy_default_preprocess_module(".", exists_ok=True)
LightGbmHpCvWandbFlow.copy_default_pipeline_card_module(".", exists_ok=True)
LightGbmHpCvWandbFlow.copy_default_pipeline_card_html_template(".", exists_ok=True)

Once you have updated any of them, you can launch a <b>retrain-pipelines</b> run so that it uses them&nbsp;:

In [None]:
%retrain_pipelines_local retraining_pipeline.py run \
    --data_file "../data/synthetic_classif_tab_data_continuous.csv" \
    --buckets_param '{"num_feature1": 100, "num_feature2": 50}' \
    --pipeline_hp_grid "${pipeline_hp_grid}" \
    --cv_folds 2 \
    --max-workers 4 \
    --dask_partitions 4 \
    --pipeline_card_artifacts_path "." \
    --preprocess_artifacts_path "." \
    --wandb_run_mode disabled

# Inspectors

The <b>retrain-pipelines Inspectors</b> are a set of convenience methods to observe past runs <em>after-the-fact</em>. They ease the discovery of important facts which, for the sake of concision, were not included in the <code>pipeline-card</code> generated for that run.<br />
If for any reason you'd like to dig deeper into a past run and investigate in detail what happened, you can rely on the <b>retrain-pipelines Inspectors</b>&nbsp;!

<hr />

We can programmatically interact with the Metaflow service using the `metaflow` Python package. To connect the package to our self-hosted Metaflow service, we simply need to set a couple of environment variables before importing it&nbsp;:

In [None]:
mf_flow_name = 'LightGbmHpCvWandbFlow'

## local Metaflow SDK

You can use the Metaflow Python package to navigate artifacts generated by a past <b>retrain-pipelines</b> run, just as you would for any Metaflow flow. To interact with your local Metaflow instance, though, use the <code>local_metaflow</code> package as follows&nbsp;:

In [None]:
from retrain_pipelines.frameworks import local_metaflow as metaflow

To explore the content of any given set of flow artifacts, specify the right <code>flow_id</code> and <code>task_id</code> below, for instance to view details of the fitted One-Hot encoder&nbsp;:

In [None]:
metaflow.Task("LightGbmHpCvWandbFlow/988/preprocess_data/29959", attempt=0)['encoder'].data.__dict__

Alternatively, you can copy Python commands straight from the dedicated <b>key artifacts</b> section of your <code>pipeline card</code>.
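
The string passed to `metaflow.Task` above is a Metaflow pathspec of the form `flow_name/run_id/step_name/task_id`. The helper below is a hypothetical convenience (part of neither Metaflow nor retrain-pipelines) that splits such a string, so each part can be checked before querying artifacts:

```python
from typing import NamedTuple


class TaskPathspec(NamedTuple):
    flow_name: str
    run_id: str
    step_name: str
    task_id: str


def parse_task_pathspec(pathspec: str) -> TaskPathspec:
    """Split 'flow/run/step/task' into its four named components."""
    parts = pathspec.split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"expected 'flow/run/step/task', got {pathspec!r}")
    return TaskPathspec(*parts)


spec = parse_task_pathspec(
    "LightGbmHpCvWandbFlow/988/preprocess_data/29959")
print(spec.run_id, spec.step_name)  # 988 preprocess_data
```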

## local custom card explorer

In [None]:
from retrain_pipelines.inspectors import browse_local_pipeline_card

In [None]:
help(browse_local_pipeline_card)

You can open the <code>pipeline card</code> corresponding to the latest run by simply calling&nbsp;:

In [None]:
browse_local_pipeline_card(mf_flow_name)

<hr />

## WandB

Make sure the `WANDB_API_KEY` environment variable is set appropriately.<br />
It can be provided through a `secret`.
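
A quick way to verify this before going further is to test for the variable without printing its value. A minimal sketch; the helper name `require_env` is hypothetical:

```python
import os


def require_env(name: str) -> None:
    """Raise if the environment variable is missing or empty."""
    if not os.environ.get(name):
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it")


# Demonstrated on a throwaway variable so the sketch runs anywhere;
# in practice, call require_env("WANDB_API_KEY").
os.environ["DEMO_API_KEY"] = "dummy-value"
require_env("DEMO_API_KEY")
print("DEMO_API_KEY is set")
```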

<b>programmatically browse the saved source-code</b>

In [None]:
from retrain_pipelines.inspectors import get_execution_source_code

In [None]:
help(get_execution_source_code)

In [None]:
from retrain_pipelines.inspectors import get_execution_source_code

for source_code_artifact in get_execution_source_code(mf_run_id=<your_flow_id>):
    print(f" - {source_code_artifact.name} {source_code_artifact.url}")

<b>The below command will download source-code artifacts for a given run and open a file explorer on the parent dir&nbsp;:</b>

In [None]:
from retrain_pipelines.inspectors import explore_source_code
# download and open file explorer
explore_source_code(mf_run_id=<your_flow_id>)

<hr />

## hp_cv_inspector

This retraining pipeline relies on <em>Dask</em> for data-parallel training. Each Cross-Validation fold of each set of hyperparameter values is trained using a subset of the dataset, parallelized across workers.
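
To get a feel for how many training jobs this represents, the sketch below enumerates one job per (hyperparameter set, fold) pair. It only illustrates the counting: how the pipeline actually schedules work on Dask workers is not shown, and `cv_training_jobs` is a hypothetical name.

```python
import itertools
from typing import Dict, Iterator, List, Tuple


def cv_training_jobs(
    hp_grid: Dict[str, List], cv_folds: int
) -> Iterator[Tuple[dict, int]]:
    """Yield (hyperparameter set, fold index) for every training job."""
    for values in itertools.product(*hp_grid.values()):
        hp_set = dict(zip(hp_grid.keys(), values))
        for fold in range(cv_folds):
            yield hp_set, fold


hp_grid = {"num_leaves": [75, 100, 125], "n_estimators": [150, 200]}
jobs = list(cv_training_jobs(hp_grid, cv_folds=2))
print(len(jobs))  # 3 * 2 hyperparameter sets * 2 folds = 12
print(jobs[0])
```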

Thanks to the <code>hp_cv_inspector</code>, we can look into pipeline runs from the perspective of detailed training logs of each individual Dask worker during hyperparameter tuning.

First, focusing on the best-performing set of hyperparameter values&nbsp;:

In [None]:
from inspectors import plot_run_cv_history
plot_run_cv_history(mf_run_id=<your_flow_id>, best_cv=True)

Now, looking at all sets of hyperparameter values evaluated&nbsp;:

In [None]:
from inspectors import plot_run_all_cv_tasks
plot_run_all_cv_tasks(mf_run_id=<your_flow_id>)

<hr />

# Congratulations&nbsp;!

<br />
<div style='background-color: rgba(0, 255, 255, 0.04); border: 1px solid rgba(0, 255, 255, .2);'>
<div style='text-align: justify; margin-left: 5px; margin-right: 5px;'>
You're now championing the <code>LightGbmHpCvWandbFlow</code> sample pipeline from the <b>retrain-pipelines</b> library&nbsp;!
</div>
</div>