# Tutorial: Building a DVC repo for an ML workflow

This guide shows how to build a DVC repository tailored to a machine learning pipeline that follows infrastructure-as-code principles. The example pipeline consists of dataset preprocessing stages followed by machine learning training and inference stages. The key concepts we will explore include:

 * Repository-wide configuration, managed via repo-policies
 * Stage-wise configuration, controlled through stage-policies
 * The creation of an app-policy that derives its parameters from the previous two, leading to a concrete instantiation of the corresponding DVC stages

Note that stage and app policies cover multiple logically connected DVC stages, not just a single stage. For example, a machine learning app policy usually encapsulates both the training and the inference stage of an ML model.

In the context of an ML workflow, we'll cover how to:
 * Handle manual and automated preprocessing steps in DVC
 * Set up an ML stage using these techniques
 * Execute the stages

To keep this tutorial simple, we will not use `EncFS`, containers, or SLURM. The focus is on the stage and app policies and on the file hierarchy. These additional features can be activated later by modifying the app policy.

## Initializing the DVC repository
We first import the dependencies for the tutorial.

In [None]:
import os

In [None]:
from IPython.display import SVG  # test_ml_tutorial: skip

Create a new directory `data/v0` for the DVC root and change to it.

In [None]:
os.makedirs('data/v0', exist_ok=True)  # create the DVC root if it does not exist yet
os.chdir('data/v0')

Initialize a `plain` DVC repository with the following command:

In [None]:
!dvc_init_repo . plain

The DVC repo has been initialized with repo and stage policies available under `.dvc_policies`.

In [None]:
!tree .dvc_policies

## Establishing the input dataset
Our pipeline is based on a dataset labeled `ml_dataset` that we assume to be split into training, test, and inference parts. Each part has its own subsets; we use a subset labeled `ex1` for all three, although a different subset could be chosen for each. We populate the repository with this input data and track it with DVC by executing the following commands:

In [None]:
%%bash
mkdir -p in/ml_dataset_v1/{training,test,inference}/original/ex1
touch in/ml_dataset_v1/{training,test,inference}/original/ex1/in.dat
for d in in/ml_dataset_v1/{training,test,inference}/original; do
    # Run in a subshell so a failing command cannot leave us in the wrong directory
    (cd "$d" && dvc add ex1)
done

The resulting file hierarchy looks as follows:

In [None]:
!tree in

Before moving to the definition of preprocessing stages, we define execution labels based on timestamps for the subsequent DVC stages. In a real-world application, the timestamps would usually be generated on the fly when creating the DVC stage.

In [None]:
%env ETL_TRAIN_RUN_LABEL=ex1-20230713-083624
%env ETL_INF_RUN_LABEL=ex1-20230713-083812
%env ETL_TEST_RUN_LABEL=ex1-20230713-083951
%env ML_TRAIN_RUN_LABEL=ex1-20230713-090119
%env ML_INF_RUN_LABEL=ex1-20230713-121007
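
The labels above follow the pattern `<subset>-<YYYYMMDD>-<HHMMSS>`. As a minimal sketch of generating such a label on the fly, the helper `make_run_label` below formats a timestamp with the standard library. The helper is hypothetical and not part of the tutorial's tooling, and the use of local time is an assumption.

```python
from datetime import datetime
from typing import Optional


def make_run_label(subset: str, now: Optional[datetime] = None) -> str:
    """Return a label such as 'ex1-20230713-083624' (hypothetical helper)."""
    # Assumption: local time is used when no timestamp is passed in.
    if now is None:
        now = datetime.now()
    return f"{subset}-{now:%Y%m%d-%H%M%S}"


# Reproduce one of the fixed labels used in this tutorial.
print(make_run_label("ex1", datetime(2023, 7, 13, 8, 36, 24)))
```

In a notebook, the result could be assigned to `os.environ['ETL_TRAIN_RUN_LABEL']` instead of hard-coding the value with `%env`.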

## Constructing the preprocessing stage
The next step is to set up the preprocessing stages. For demonstration purposes, we assume that preprocessing consists of a simple copy operation for each of the training, test, and inference data. In a real-world scenario, this can be replaced by any other operation as required.

In particular, a preprocessing stage may also involve manual user interaction (e.g. in a GUI) that cannot be reproduced from the command line and, hence, is executed outside of DVC. We will cover this case for the training data, whereas we use DVC-reproducible commands for the preprocessing of the other datasets.

### Manual preprocessing
To implement a manual preprocessing step for the `training` data, we create a DVC stage with a no-op command. As configured in the app policy `dvc_app.yaml`, this stage is `frozen` so that `dvc repro` does not attempt to reproduce it. The data dependencies of the stage are nevertheless tracked in DVC via the ETL stage policy `dvc_etl.yaml`.

To create this stage, run the following command:

In [None]:
%%bash
dvc_create_stage --app-yaml ../../app_prep/dvc_app.yaml --stage manual_train \
    --run-label ${ETL_TRAIN_RUN_LABEL} --input-etl ex1 --input-etl-file in.dat

With the stage created, the manual operation can now be performed to obtain the output data. We first inspect the newly created `dvc.yaml` and then copy the data listed under `deps` into the `output` directory listed under `outs`. As described above, in a real application this step would typically be performed in a GUI application with some user interaction.

In [None]:
%%bash
cd in/ml_dataset_v1/training/app_prep_v1/manual/${ETL_TRAIN_RUN_LABEL}
cat dvc.yaml
cp ../../../original/ex1/* output/
cd -

In [None]:
%%bash
tree in/ml_dataset_v1/training/app_prep_v1/manual/${ETL_TRAIN_RUN_LABEL}

Once the manual operation has completed and the output directory is populated with data, we can commit it with:

In [None]:
%%bash
dvc commit --force in/ml_dataset_v1/training/app_prep_v1/manual/${ETL_TRAIN_RUN_LABEL}/dvc.yaml
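
Since `dvc commit` records the output directory as it is at that moment, it can be worth checking that the manual step actually produced files before committing. The following is a minimal sketch of such a check; the helper name `output_is_populated` is hypothetical, and the `output` subdirectory name is taken from the stage layout shown above.

```python
from pathlib import Path


def output_is_populated(stage_dir: str) -> bool:
    """Return True if '<stage_dir>/output' contains at least one file."""
    output_dir = Path(stage_dir) / "output"
    # rglob("*") walks the directory recursively; only regular files count,
    # so a tree of empty subdirectories is still reported as unpopulated.
    return output_dir.is_dir() and any(p.is_file() for p in output_dir.rglob("*"))
```

For the manual training stage, this could be called as `output_is_populated('in/ml_dataset_v1/training/app_prep_v1/manual/' + os.environ['ETL_TRAIN_RUN_LABEL'])` before running the commit.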

### Automated preprocessing
In contrast, for the test and inference datasets, we will assume that the preprocessing is fully automated. These stages can be created and executed with:

In [None]:
%%bash

# test-data
dvc_create_stage --app-yaml ../../app_prep/dvc_app.yaml --stage auto_test \
    --run-label ${ETL_TEST_RUN_LABEL} --input-etl ex1 --input-etl-file in.dat
dvc repro in/ml_dataset_v1/test/app_prep_v1/auto/${ETL_TEST_RUN_LABEL}/dvc.yaml

In [None]:
%%bash

# inference-data
dvc_create_stage --app-yaml ../../app_prep/dvc_app.yaml --stage auto_inf \
    --run-label ${ETL_INF_RUN_LABEL} --input-etl ex1 --input-etl-file in.dat
dvc repro in/ml_dataset_v1/inference/app_prep_v1/auto/${ETL_INF_RUN_LABEL}/dvc.yaml

Execution can also be deferred to the ML stages, where it will be triggered as a dependency.
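
The preprocessing stages in this section all live under a common directory layout. As a sketch of that layout, inferred only from the paths used in the commands above, the hypothetical helper `prep_stage_file` composes the location of a stage's `dvc.yaml` relative to the DVC root. The dataset and app directory names are hard-coded to the values of this tutorial.

```python
from pathlib import PurePosixPath


def prep_stage_file(split: str, mode: str, run_label: str) -> str:
    """Path of a preprocessing stage's dvc.yaml, relative to the DVC root.

    Hypothetical helper; the layout is inferred from this tutorial:
    in/ml_dataset_v1/<split>/app_prep_v1/<mode>/<run_label>/dvc.yaml
    """
    return str(
        PurePosixPath(
            "in", "ml_dataset_v1", split, "app_prep_v1", mode, run_label, "dvc.yaml"
        )
    )


# 'split' is training, test or inference; 'mode' is manual or auto.
print(prep_stage_file("test", "auto", "ex1-20230713-083951"))
```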

## Creating the Machine Learning stage
Next, we set up a minimal directory structure for a machine learning application that uses the preprocessed data:


In [None]:
%%bash
mkdir -p app_ml/ml_dataset_v1/model_name_v2/{training,inference,config/ex1-config}

We encapsulate hyperparameters and model architecture specifications that are fixed during training in a file `hp.yaml`.

In [None]:
%%bash
touch app_ml/ml_dataset_v1/model_name_v2/config/ex1-config/hp.yaml
cd app_ml/ml_dataset_v1/model_name_v2/config && dvc add ex1-config && cd -
tree app_ml

Following this, we can set up the machine learning training and inference stages. Where necessary, we can obtain completion suggestions with `--show-opts`.


In [None]:
%%bash
dvc_create_stage --app-yaml ../../app_ml/dvc_app.yaml --stage training \
    --run-label ${ML_TRAIN_RUN_LABEL} \
    --input-config ex1-config --input-config-file hp.yaml \
    --input-training ${ETL_TRAIN_RUN_LABEL} --input-test ${ETL_TEST_RUN_LABEL}
dvc_create_stage --app-yaml ../../app_ml/dvc_app.yaml --stage inference \
    --run-label ${ML_INF_RUN_LABEL} \
    --input-config ex1-config --input-config-file hp.yaml \
    --input-training ${ML_TRAIN_RUN_LABEL} --input-inference ${ETL_INF_RUN_LABEL}

## Running the pipeline
These stages can be inspected with:

In [None]:
%%bash
stage_dir=app_ml/ml_dataset_v1/model_name_v2/inference/${ML_INF_RUN_LABEL}
dvc dag --dot ${stage_dir}/dvc.yaml | tee ${stage_dir}/dvc_dag.dot
# Render the graph only if Graphviz is installed
if command -v dot > /dev/null 2>&1; then
    dot -Tsvg ${stage_dir}/dvc_dag.dot > ${stage_dir}/dvc_dag.svg
fi

In [None]:
display(SVG(filename='app_ml/ml_dataset_v1/model_name_v2/inference/' + os.environ['ML_INF_RUN_LABEL'] + '/dvc_dag.svg'))  # test_ml_tutorial: skip
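
The SVG file only exists if Graphviz's `dot` was available when the bash cell ran, so the display cell fails otherwise. A minimal sketch of a guard is shown below; the helper name `find_dag_svg` is hypothetical, and the file name `dvc_dag.svg` is the one written by the cell above.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_dag_svg(stage_dir: str) -> Optional[Path]:
    """Return the path to '<stage_dir>/dvc_dag.svg' if it exists, else None."""
    svg = Path(stage_dir) / "dvc_dag.svg"
    if svg.is_file():
        return svg
    # Mirror the `command -v dot` check of the bash cell to explain the gap.
    if shutil.which("dot") is None:
        reason = "Graphviz 'dot' was not found on PATH"
    else:
        reason = "the file has not been rendered yet"
    print(f"No DAG image at {svg}: {reason}")
    return None
```

The display call could then be wrapped as `svg = find_dag_svg(stage_dir)` followed by `display(SVG(filename=str(svg)))` only when `svg` is not `None`.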

Finally, the stages are executed with:

In [None]:
%%bash
dvc repro app_ml/ml_dataset_v1/model_name_v2/inference/${ML_INF_RUN_LABEL}/dvc.yaml

In [None]:
!tree in app_ml