<div style='background-color: rgba(0, 255, 255, 0.04); border: 1px solid rgba(0, 255, 255, .2);'>
<div style='text-align: justify; margin-left: 5px; margin-right: 5px;'>
<div style="float: left; border-right: 5px solid transparent;">
<table border="0" width="350px;" style="background-color: #f5f5f5; float: left;">
    <tr>
        <td colspan=2>
            <img alt="retrain-pipelines" src="https://github.com/user-attachments/assets/19725866-13f9-48c1-b958-35c2e014351a" />
        </td>
    </tr>
    <tr>
        <td colspan=2>
            <img alt="Metaflow" width="250px" src="https://github.com/user-attachments/assets/ecc20501-869d-4159-b5a0-eb0a117520e5" />
        </td>
    </tr>
    <tr>
        <td style="vertical-align: center;">PyTorch</td>
        <td> 
            <img alt="PyTorch" width="40px" src="https://github.com/user-attachments/assets/bfa9b38e-e9b3-41ff-8370-e64a0a0a4a93" />
        </td>
    </tr>
    <tr>
        <td style="vertical-align: center;" height="40px">TabNet</td>
        <td></td>
    </tr>
    <tr>
        <td style="vertical-align: center;">TorchServe</td>
        <td> 
            <img alt="PyTorch" width="40px" src="https://github.com/user-attachments/assets/bfa9b38e-e9b3-41ff-8370-e64a0a0a4a93" />
        </td>
    </tr>
</table></div>
<br />
Welcome to this introductory notebook for the <code>TabNetHpCvWandbFlow</code> sample pipeline from the <b>retrain-pipelines</b> library.<br />
This sample retraining pipeline covers the tabular data multi-class classification use case. More specifically, it employs a <a href="https://pytorch.org/" target="_blank">PyTorch</a> implementation of the <a href="https://github.com/dreamquark-ai/tabnet/tree/develop" target="_blank">TabNet</a> model.<br />
<br />
<div style='background-color: rgba(0, 255, 255, 0.04); border: 1px solid rgba(0, 255, 255, .2);'><center>
    TabNet: Attentive Interpretable Tabular Learning (<a href="https://arxiv.org/abs/1908.07442">arXiv</a>)<br />
    (for a TensorFlow implementation, visit <a target="_blank" href="https://colab.research.google.com/drive/1T8P5DrwBBZpx-FjWrAxXNhZNfsco8y-t?usp=sharing">this reference Google Colab notebook</a>)
</center></div>
<br />
This model is transformer-based. Among its most advanced features, it takes full advantage of grouped attention for (out-of-the-box one-hot-encoded) categorical features.
<br />
Like other sample retraining pipelines provided with the <b>retrain-pipelines</b> library, the <code>TabNetHpCvWandbFlow</code> sample pipeline adapts to your data.<br />
<hr />
The infrastructure validation (the ability of newly-retrained model versions to accept and respond to inference requests) relies here on <a href="https://pytorch.org/serve/" target="_blank">TorchServe</a> where we pack the fitted inference pipeline and put it to the test.
<br clear="left" />
</div>
</div>

<div style='background-color: rgba(0, 255, 255, 0.04); border: 1px solid rgba(0, 255, 255, .2);'>
<div style='text-align: justify; margin-left: 5px; margin-right: 5px;'>
This notebook supports the <code>TabNetHpCvWandbFlow</code> sample pipeline from the <b>retrain-pipelines</b> library. It is your step-by-step guide to mastering it quickly.<br />
<br />
From here, you can&nbsp;:
<ul>
    <li>
        Execute a <b>retrain-pipelines</b> run&nbsp;:
        <ul>
            <li>
                generate a synthetic dataset if you need one to get started quickly
            </li>
            <li>
                set a hyperparameter search space
            </li>
            <li>
                launch a <b>retrain-pipelines</b> run
            </li>
            <li>
                even start customizing default <code>preprocessing</code> and <code>pipeline_card</code> if you feel like it&nbsp;!
            </li>
        </ul>
    </li>
    <li>
        Do some after-the-fact investigation thanks to the collection of <code>inspectors</code> offered by the <b>retrain-pipelines</b> library
    </li>
</ul>
<br />
<p style="text-align: justify; color: darkgray;">
<u>REMARK</u>&nbsp;: if you have not done so already, check <a href="https://github.com/aurelienmorgan/retrain-pipelines/tree/master/extra/frameworks" target="_blank">this section</a> for a local <em>Metaflow</em> installation. This comes in handy for quick prototyping and testing.
</p>
</div>
</div>

<font size="6em"><b>Table of Contents</b></font>

- [setup](#setup)
- [Generate data](#Generate-data)
- [Metaflow Run](#Metaflow-Run)
  - [HP tuning search space](#HP-tuning-search-space)
  - [Run flow](#Run-flow)
    - [Use the as-is sample pipeline](#Use-the-as-is-sample-pipeline)
    - [Customize you retraining pipeline](#Customize-you-retraining-pipeline)
- [Inspectors](#Inspectors)
  - [local Metaflow SDK](#local-Metaflow-SDK)
  - [local custom card explorer](#local-custom-card-explorer)
  - [WandB](#WandB)
- [Congratulations&nbsp;!](#Congratulationsnbsp)

<hr />

# setup

In [None]:
# !pip install -r requirements.txt

In [None]:
# !pip install retrain-pipelines

In [None]:
%reload_ext autoreload
%autoreload 2

import os, json
from textwrap import dedent

from dotenv import find_dotenv, load_dotenv
print(find_dotenv())
print(load_dotenv(os.path.join(os.getcwd(), "..", "..", ".env")))

import torch
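# report the GPU when one is available, without failing on CPU-only machines
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device detected")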
print(f"PyTorch {torch.__version__}")

<hr />

# Generate data

In [None]:
from retrain_pipelines.dataset import DatasetType, pseudo_random_generate

num_samples = 10_000 # number of samples
data = pseudo_random_generate(DatasetType.TABULAR_CLASSIFICATION, num_samples)
# Display the first few rows
print(data.head())
# save to file
data.to_csv(os.path.realpath(os.path.join('..', 'data', 'synthetic_classif_tab_data_4classes.csv')), index=False)

In [None]:
from retrain_pipelines.dataset.features_dependencies import \
        dataset_to_heatmap_fig
fig, ax = dataset_to_heatmap_fig(data)
display(fig)

<hr />

# Metaflow Run

## HP tuning search space

Choose the domain to consider for the hyperparameter-tuning grid search&nbsp;:

In [None]:
from retrain_pipelines.utils import as_env_var

In [None]:
pipeline_hp_grid = {
    "trainer": {
        "max_epochs":[200],
        "patience":[10],
        "batch_size":[1024],
        "virtual_batch_size":[256],
    },
    "model": {
        "n_d":[64],
        "n_a":[64],
        "n_steps":[6],
        "gamma":[1.5],
        "n_independent":[2],
        "n_shared":[2],
        "lambda_sparse":[1e-4],
        "momentum":[0.3],
        "clip_value":[2.],
        "optimizer_fn":["torch.optim.Adam"],
        "optimizer_params":[dict(lr=2e-2), dict(lr=0.1)],
        "scheduler_params":[{"gamma": 0.80,
                            "step_size": 20}],
        "scheduler_fn":["torch.optim.lr_scheduler.StepLR"],
        "epsilon":[1e-15]
    }}
as_env_var(pipeline_hp_grid, env_var_name="pipeline_hp_grid")
print(f"pipeline_hp_grid : {os.environ['pipeline_hp_grid']}")
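The cell above exposes the grid to the pipeline through an environment variable. As a rough illustration of the idea only, here is a minimal sketch that assumes the dict is serialized to JSON. The helpers `dict_to_env_var` and `dict_from_env_var` are hypothetical names made up for this example. They are not the library's `as_env_var`, whose actual serialization format may differ.

```python
import json
import os


def dict_to_env_var(value: dict, env_var_name: str) -> None:
    """Hypothetical helper: store a dict as a JSON string in an environment variable."""
    os.environ[env_var_name] = json.dumps(value)


def dict_from_env_var(env_var_name: str) -> dict:
    """Hypothetical helper: parse the JSON string back into a dict."""
    return json.loads(os.environ[env_var_name])


# A reduced grid, for illustration only
demo_grid = {
    "trainer": {"batch_size": [1024], "patience": [10]},
    "model": {"n_steps": [6], "optimizer_params": [{"lr": 0.02}, {"lr": 0.1}]},
}

dict_to_env_var(demo_grid, env_var_name="demo_hp_grid")
restored_grid = dict_from_env_var("demo_hp_grid")
print(restored_grid == demo_grid)
```

Environment variables only hold strings, which is why a text serialization step is needed before a nested grid can be handed over to a command line.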

In [None]:
pipeline_hp_grid = {
    "trainer": {
        "max_epochs":[200],
        "patience":[10],
        "batch_size":[256, 1024, 2048],
        "virtual_batch_size":[128, 256],
    },
    "model": {
        "n_d":[64],
        "n_a":[64],
        "n_steps":[3, 4, 6],
        "gamma":[1.5],
        "n_independent":[2],
        "n_shared":[2],
        "lambda_sparse":[1e-4],
        "momentum":[0.3],
        "clip_value":[2.],
        "optimizer_fn":["torch.optim.Adam"],
        "optimizer_params":[dict(lr=0.1)],
        "scheduler_params":[{"gamma": 0.80,
                            "step_size": 20}],
        "scheduler_fn":["torch.optim.lr_scheduler.StepLR"],
        "epsilon":[1e-15]
    }}
as_env_var(pipeline_hp_grid, env_var_name="pipeline_hp_grid")
print(f"pipeline_hp_grid : {os.environ['pipeline_hp_grid']}")

# combinations count :
from retrain_pipelines.utils import dict_dict_list_get_all_combinations
combinations_count = \
    len(dict_dict_list_get_all_combinations(pipeline_hp_grid))
print(f"{combinations_count} different combinations of hyperparameter values")
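To get a feel for how quickly a grid grows, here is a minimal sketch that enumerates a dict-of-dict-of-lists grid as a full Cartesian product. The helper `iter_grid_combinations` is a hypothetical name used for this example. It is not the library's `dict_dict_list_get_all_combinations`, and it assumes that every listed value is combined with every other one.

```python
from itertools import product


def iter_grid_combinations(grid: dict):
    """Hypothetical helper: yield one nested dict per hyperparameter combination."""
    # Flatten the grid into (section, parameter) keys and their candidate values
    keys = [(section, name) for section, params in grid.items() for name in params]
    value_lists = [grid[section][name] for section, name in keys]
    for values in product(*value_lists):
        combination = {section: {} for section in grid}
        for (section, name), value in zip(keys, values):
            combination[section][name] = value
        yield combination


# A reduced grid, for illustration only
demo_grid = {
    "trainer": {
        "batch_size": [256, 1024, 2048],
        "virtual_batch_size": [128, 256],
    },
    "model": {
        "n_steps": [3, 4, 6],
        "gamma": [1.5],
    },
}

combinations = list(iter_grid_combinations(demo_grid))
print(len(combinations))  # 3 * 2 * 3 * 1 = 18
print(combinations[0])
```

Under that assumption, each additional multi-valued hyperparameter multiplies the number of trainings, and cross-validation multiplies it again by the number of folds.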

## Run flow

### Use the as-is sample pipeline

Load the cell-magic&nbsp;:

In [None]:
%reload_ext retrain_pipelines.legacy_launcher_magic

Take a look at the help for the retraining pipeline&nbsp;:

In [None]:
%retrain_pipelines_legacy retraining_pipeline.py run --help

You can launch a <b>retrain-pipelines</b> run&nbsp;:

In [None]:
%retrain_pipelines_legacy retraining_pipeline.py run \
    --data_file "../../data/synthetic_classif_tab_data_4classes.csv" \
    --buckets_param '{"num_feature2": 100, "num_feature4": 50}' \
    --pipeline_hp_grid '{pipeline_hp_grid}' \
    --cv_folds 2 \
    --max-workers 3 \
    --wandb_run_mode offline

You can also resume a prior run from the step of your choosing&nbsp;:

In [None]:
%retrain_pipelines_legacy retraining_pipeline.py resume pipeline_card

In [None]:
%retrain_pipelines_legacy retraining_pipeline.py resume \
    cross_validation --origin-run-id 1917

### Customize you retraining pipeline

Start by getting the defaults you'd like to customize (any combination of the 3 below)&nbsp;:
<ul>
    <li><code>preprocessing.py</code> module</li>
    <li><code>pipeline_card.py</code> module</li>
    <li><code>template.html</code> html template</li>
</ul>

In [None]:
from retraining_pipeline import TabNetHpCvWandbFlow

TabNetHpCvWandbFlow.copy_default_preprocess_module(".", exists_ok=True)
TabNetHpCvWandbFlow.copy_default_pipeline_card_module(".", exists_ok=True)
TabNetHpCvWandbFlow.copy_default_pipeline_card_html_template(".", exists_ok=True)

Once you have updated any of them, you can launch a <b>retrain-pipelines</b> run that uses them&nbsp;:

In [None]:
%retrain_pipelines_legacy retraining_pipeline.py run \
    --data_file "../data/synthetic_classif_tab_data_4classes.csv" \
    --buckets_param '{"num_feature2": 100, "num_feature4": 50}' \
    --pipeline_hp_grid "${pipeline_hp_grid}" \
    --cv_folds 2 \
    --preprocess_artifacts_path "." \
    --pipeline_card_artifacts_path "." \
    --wandb_run_mode disabled

# Inspectors

The <b>retrain-pipelines Inspectors</b> are a set of convenience methods for observing past runs <em>after the fact</em>. They ease the discovery of important facts which, for the sake of concision, were not included in the <code>pipeline-card</code> generated for that run.<br />
If for any reason you'd like to dig deeper into a past run and investigate in detail what happened, you can rely on the <b>retrain-pipelines Inspectors</b>&nbsp;!

In [None]:
mf_flow_name = "TabNetHpCvWandbFlow"

<hr />

## local Metaflow SDK

You can use the metaflow Python package to navigate artifacts generated by a past <b>retrain-pipelines</b> run, just as you would for any Metaflow flow. To interact with your local Metaflow instance though, use the <code>local_metaflow</code> package as follows&nbsp;:

In [None]:
from retrain_pipelines.frameworks import local_metaflow as metaflow

To explore the content of any given set of flow artifacts, specify the right <code>flow_id</code> and <code>task_id</code> below, for instance to retrieve the newly-retrained model itself&nbsp;:

In [None]:
metaflow.Task('TabNetHpCvWandbFlow/990/train_model/30013', attempt=0)['model'].data
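The string passed to `Task` above is a Metaflow pathspec of the form flow name / run id / step name / task id. The sketch below shows one way to split and rebuild such a string so the ids can be swapped without hand-editing. `TaskPathspec` and `parse_task_pathspec` are hypothetical names introduced for this example. They are not part of Metaflow or of <b>retrain-pipelines</b>.

```python
from typing import NamedTuple


class TaskPathspec(NamedTuple):
    """Hypothetical container for the four parts of a task pathspec."""
    flow_name: str
    run_id: str
    step_name: str
    task_id: str

    def __str__(self) -> str:
        # Rebuild the "flow/run/step/task" string
        return "/".join(self)


def parse_task_pathspec(pathspec: str) -> TaskPathspec:
    """Split a 'flow/run/step/task' string into its named parts."""
    parts = pathspec.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '/'-separated parts, got {len(parts)}")
    return TaskPathspec(*parts)


train_spec = parse_task_pathspec("TabNetHpCvWandbFlow/990/train_model/30013")
print(train_spec.run_id, train_spec.step_name)

# Point at another step and task of the same run
evaluate_spec = train_spec._replace(step_name="evaluate_model", task_id="30014")
print(str(evaluate_spec))
```

The resulting string can then be passed to `metaflow.Task(...)` exactly as in the cell above.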

Or you could look into the confusion matrix from the newly retrained model version on the validation dataset&nbsp;:

In [None]:
metaflow.Task('TabNetHpCvWandbFlow/990/evaluate_model/30014', attempt=0)['conf_matrix'].data
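Once a confusion matrix is retrieved, a couple of summary figures often help in reading it. The sketch below assumes a square matrix with one row per true class and one column per predicted class. That layout is an assumption and should be checked against the actual artifact. The values in `demo_conf_matrix` are made up for illustration and do not come from any real run. The helper names are hypothetical too.

```python
def overall_accuracy(conf_matrix) -> float:
    """Share of samples sitting on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in conf_matrix)
    correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
    return correct / total if total else 0.0


def per_class_recall(conf_matrix) -> list:
    """Recall per class, assuming each row holds the samples of one true class."""
    recalls = []
    for i, row in enumerate(conf_matrix):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else 0.0)
    return recalls


# Made-up 4-class matrix, for illustration only
demo_conf_matrix = [
    [50, 2, 3, 0],
    [4, 45, 1, 5],
    [2, 3, 40, 5],
    [0, 6, 4, 30],
]

accuracy = overall_accuracy(demo_conf_matrix)
recalls = per_class_recall(demo_conf_matrix)
print(f"accuracy: {accuracy:.3f}")
print("recall per class:", [round(r, 3) for r in recalls])
```

A low recall on one row points at the class the model most often misses, which the overall accuracy alone would hide.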

Or you could copy Python commands straight from the dedicated <b>key artifacts</b> section of your <code>pipeline card</code>.

## local custom card explorer

In [None]:
from retrain_pipelines.inspectors import browse_local_pipeline_card

In [None]:
help(browse_local_pipeline_card)

You can open the <code>pipeline card</code> corresponding to the latest run by simply calling&nbsp;:

In [None]:
browse_local_pipeline_card(mf_flow_name)

<hr />

## WandB

Make sure the `WANDB_API_KEY` environment variable is set appropriately.<br />
It can be provided through a `secret`.
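A quick check before calling the inspectors can save a confusing authentication error later on. The sketch below is one possible way to do it. `require_env_var` and `mask_secret` are hypothetical helpers written for this example, and `DEMO_API_KEY` with its value is a placeholder standing in for the real `WANDB_API_KEY`.

```python
import os


def require_env_var(name: str) -> str:
    """Return the value of an environment variable, failing early if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def mask_secret(value: str, visible: int = 4) -> str:
    """Keep the first few characters and hide the rest, so the key is never printed in full."""
    return value[:visible] + "*" * max(len(value) - visible, 0)


# Placeholder variable and value, for the demo only
os.environ["DEMO_API_KEY"] = "abcd1234efgh5678"
masked_key = mask_secret(require_env_var("DEMO_API_KEY"))
print(masked_key)
```

In a real session, the same check would be run against `WANDB_API_KEY` instead of the placeholder.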

<b>Programmatically browse the saved source code</b>

In [None]:
from retrain_pipelines.inspectors import get_execution_source_code

for source_code_artifact in get_execution_source_code(mf_run_id=<your_flow_id>):
    print(f" - {source_code_artifact.name} {source_code_artifact.url}")

<b>The below command will download source-code artifacts for a given run and open a file explorer on the parent dir&nbsp;:</b>

In [None]:
from retrain_pipelines.inspectors import explore_source_code
# download and open file explorer
explore_source_code(mf_run_id=<your_flow_id>)

<hr />

# Congratulations&nbsp;!

<br />
<div style='background-color: rgba(0, 255, 255, 0.04); border: 1px solid rgba(0, 255, 255, .2);'>
<div style='text-align: justify; margin-left: 5px; margin-right: 5px;'>
You're now championing the <code>TabNetHpCvWandbFlow</code> sample pipeline from the <b>retrain-pipelines</b> library&nbsp;!
</div>
</div>