# Tutorial: Running DVC stages asynchronously with SLURM using EncFS and Sarus

This guide shows how to configure app-policies to run an application in a SLURM cluster within [Sarus](https://products.cscs.ch/sarus/) containers while operating on encrypted data managed through EncFS. We focus on an asynchronous execution model, as familiar from `sbatch`. That is, when reproducing a stage, DVC blocks only during job submission to the SLURM queue; it does not wait until the resources have been allocated, the job has executed and the output data is available.

We assume familiarity with the basic constructs, namely how to construct a DVC repository with infrastructure-as-code techniques as shown in the [tutorial for an ML repository](ml_tutorial.ipynb) and how to run [DVC stages on EncFS and Docker](encfs_sim_tutorial.ipynb). We will use the same iterative simulation workflow as in the EncFS tutorial as an example application. The key steps we will cover include:

 * Configure app-policies to run DVC stages on a SLURM cluster
 * Submit DVC stages to the SLURM queue that are completed, committed and pushed asynchronously
 * Monitor and control SLURM jobs corresponding to a DVC stage

The integration of these features with support for containers and EncFS is seamless, i.e. a previously developed application using them can be configured without any invasive code changes to run on a SLURM cluster.

As a prerequisite to this tutorial, you should have `encfs` installed as described in the [instructions](../async_encfs_dvc/encfs_int/README.md).

## Initializing the DVC repository
We first import the dependencies for the tutorial.

In [None]:
import os

In [None]:
from IPython.display import SVG  # test_slurm_async_sim_tutorial: skip

Create a new directory `data/v2` for the DVC root and change to it.

In [None]:
os.chdir('data/v2')

Initialize an `encfs` DVC repository using the command

In [None]:
!dvc_init_repo . encfs

As a next step, EncFS needs to be configured, which can be achieved by running

```shell
${ENCFS_INSTALL_DIR}/bin/encfs -o allow_root,max_write=1048576,big_writes -f encrypt decrypt
```
as described in the [EncFS initialization instructions](../async_encfs_dvc/encfs_int/README.md).

Here, only for the purpose of this tutorial, we use a pre-established configuration with a simple password. We stress that this is for demonstration purposes only. In practice, always generate a **random** key and store it in a **safe location**!

In [None]:
%%bash
echo 1234 > encfs_tutorial.key
cp $(git rev-parse --show-toplevel)/examples/.encfs6.xml.tutorial encrypt/ && mv encrypt/.encfs6.xml.tutorial encrypt/.encfs6.xml
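
For use outside of this tutorial, a random key file could be generated along the following lines. This is a minimal sketch and not part of the tutorial's tooling: the helper `write_random_key` and the file name in the usage comment are hypothetical, and where the key is stored safely remains up to you.

```python
import os
import secrets

def write_random_key(path, num_bytes=32):
    """Write a random URL-safe key to `path`, accessible to the owner only."""
    key = secrets.token_urlsafe(num_bytes)
    # O_EXCL refuses to overwrite an existing key file;
    # mode 0o600 restricts read/write access to the file owner
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, 'w') as f:
        f.write(key + '\n')
    return path

# Hypothetical usage (file name chosen for illustration only):
# write_random_key('encfs_secure.key')
```

The key is written with a trailing newline, matching the format that `echo` produces for the tutorial password above.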

At runtime, EncFS will read the password from a file. The location of that file is passed in an environment variable that has to be set when `dvc repro` is run on a stage or `encfs_launch` is used to e.g. inspect the encrypted data interactively.

In [None]:
os.environ['ENCFS_PW_FILE'] = os.path.realpath('encfs_tutorial.key')

The DVC repo has been initialized with repo and stage policies available under `.dvc_policies`.

In [None]:
!tree .dvc_policies

For the purpose of this tutorial, we will extract the paths of the encrypted directory and the mount target of EncFS into environment variables. This is not a necessary step to run DVC stages with EncFS, though.

In [None]:
from async_encfs_dvc.encfs_int.mount_config import load_mount_config

mount_config = [os.popen(f"echo {d}").read().strip() for d in  # evaluating shell exprs in paths
                load_mount_config('.dvc_policies/repo/dvc_root.yaml')]

os.environ['ENCFS_ENCRYPT_DIR'] = mount_config[0]  # encrypt (same on all hosts)
os.environ['ENCFS_DECRYPT_DIR'] = mount_config[1]  # host-specific
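
The cell above passes each path through a shell so that any shell expression in it is evaluated. If the configured paths only contain environment variable references (an assumption about your `dvc_root.yaml`, not something this tutorial guarantees), the expansion could also be sketched without spawning a shell. The helper `expand_path` and the variable `EXAMPLE_SCRATCH` are hypothetical names used for illustration.

```python
import os

def expand_path(path):
    """Expand $VAR/${VAR} references and a leading ~ without spawning a shell."""
    return os.path.expanduser(os.path.expandvars(path))

os.environ['EXAMPLE_SCRATCH'] = '/tmp/example_scratch'  # hypothetical variable for illustration
print(expand_path('${EXAMPLE_SCRATCH}/decrypt'))  # prints /tmp/example_scratch/decrypt
```

Note that `os.path.expandvars` leaves command substitutions such as `$(hostname)` untouched, so the shell-based approach remains necessary if the paths rely on them.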

## Establishing the input dataset
Our pipeline will be based on a dataset labeled `sim_dataset_v1` and a specific subset of it, `ex2` (label chosen arbitrarily). First, we create a DVC stage to track the encrypted input data. As `dvc add` no longer supports the `--file` option to locate the `.dvc` file in a different folder, we use a workaround with a frozen no-op stage analogous to the manual preprocessing step for the training dataset in the ML tutorial.

In [None]:
!dvc_create_stage --app-yaml ../../in/dvc_app.yaml --stage add --dataset-name sim_dataset_v1 --subset-name ex2

We now populate this dataset repository with input data that we generate randomly here, even though it could be downloaded from a remote source. As this will be encrypted data, we have to run `dd` that generates this data in an environment managed by EncFS, which is achieved by wrapping the command with `encfs_mount_and_run`. 

In [None]:
%%bash
encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} ${ENCFS_DECRYPT_DIR}/in/sim_dataset_v1/original/ex2/output/encfs_out.log \
    dd if=/dev/urandom of=${ENCFS_DECRYPT_DIR}/in/sim_dataset_v1/original/ex2/output/sim_in.dat bs=4k iflag=fullblock,count_bytes count=$((10**7)) \
    > config/in/sim_dataset_v1/original/ex2/output/stage_out.log

We can inspect the logs of the outer command, that is of `encfs_mount_and_run`, which we capture in the file `stage_out.log`.

In [None]:
!cat config/in/sim_dataset_v1/original/ex2/output/stage_out.log

These logs are unencrypted and do not leak any details on the application inside the EncFS environment. The application logs are captured separately in the encrypted file `encfs_out.log`. To inspect it, we need to mount a decrypted view of `encrypt`. In a typical session, a user would launch EncFS in the foreground in another terminal using
```shell
encfs_launch
```
and then access the decrypted data here. However, *exclusively* for the purpose of this notebook, we will copy the decrypted data to an unencrypted location to inspect it outside of EncFS. We emphasize that this is only for demonstration purposes and confidential data should *never* be handled in that manner.

In [None]:
%%bash
encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} /dev/null cp ${ENCFS_DECRYPT_DIR}/in/sim_dataset_v1/original/ex2/output/encfs_out.log encfs_out.log >/dev/null
cat encfs_out.log
rm encfs_out.log

To avoid logging the output of an EncFS-managed application to a file, you can supply `/dev/null` as a third parameter to `encfs_mount_and_run` as done here.

Finally, we commit the newly added data to DVC as described in the manual preprocessing step for the training dataset in the ML tutorial.

In [None]:
!dvc commit --force config/in/sim_dataset_v1/original/ex2/dvc.yaml 

The resulting file hierarchy looks as follows:

In [None]:
!tree encrypt/in config

Before moving to the definition of preprocessing stages, we define execution labels based on timestamps for the subsequent DVC stages. In a real-world application, the timestamps would usually be generated on the fly when creating the DVC stage.

In [None]:
%env ETL_RUN_LABEL=ex2-20230714-083624
%env SIM_0_RUN_LABEL=ex2-20230714-083812-0
%env SIM_1_RUN_LABEL=ex2-20230714-083812-1
%env SIM_2_RUN_LABEL=ex2-20230714-083812-2
%env SIM_3_RUN_LABEL=ex2-20230714-083812-3
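
To generate such labels on the fly, a helper along the following lines could be used. This is a sketch that only mirrors the label format shown above (`<subset>-<date>-<time>` with an optional iteration index); the function `make_run_label` is hypothetical and not part of the tutorial's tooling.

```python
from datetime import datetime

def make_run_label(subset, when=None, index=None):
    """Build a label such as 'ex2-20230714-083812-1' from a subset name and a timestamp."""
    when = when or datetime.now()
    label = f"{subset}-{when.strftime('%Y%m%d-%H%M%S')}"
    # The optional index distinguishes iterations that share a timestamp
    return label if index is None else f"{label}-{index}"

print(make_run_label('ex2', datetime(2023, 7, 14, 8, 36, 24)))     # prints ex2-20230714-083624
print(make_run_label('ex2', datetime(2023, 7, 14, 8, 38, 12), 1))  # prints ex2-20230714-083812-1
```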

## Constructing the preprocessing stage
The next step involves setting up the preprocessing stage. For illustration purposes, we will use the same preprocessing application as in the ML repository tutorial. To simplify things, we do not run this step over SLURM yet, though this could be changed without difficulty.

In [None]:
%%bash

dvc_create_stage --app-yaml ../../app_prep/dvc_app.yaml --stage sim \
    --run-label ${ETL_RUN_LABEL} --input-etl ex2 --input-etl-file output/sim_in.dat

In [None]:
%%bash

dvc repro config/in/sim_dataset_v1/app_prep_v1/auto/${ETL_RUN_LABEL}/dvc.yaml

This will execute the stage synchronously within the DVC-launched process. Execution can also be deferred to the simulation stages, where it will be triggered as a dependency. However, it is not recommended to launch mixed graphs of synchronous and asynchronous (SLURM) stages. In particular, asynchronous stages must not be launched as ancestors of synchronous DVC stages. 

## Creating the simulation stages for SLURM
Lastly, we will establish a rudimentary structure for an iterative simulation workflow that utilizes the preprocessed data. A simulation usually comes with its own tunable parameters. These can be stored in a file, as done for the hyperparameters in the machine learning tutorial. For brevity, however, we skip this step here and assume that all parameters are passed through the command line.

To enable stages to be run with SLURM, we add `slurm_opts` to the application policy under the corresponding stages (here `base_simulation` and `simulation`). Upon reproducing a DVC SLURM stage, the stage dependency graph will be submitted to the SLURM queue as a corresponding job graph. Multiple jobs are created per DVC stage: a stage job for the actual application, upon its success a DVC commit job and optionally a DVC push job, and upon failure of the stage job a cleanup job. There is a one-to-one correspondence between dependencies of DVC stages and those of SLURM stage jobs. In the application policy, we typically specify the options for stage jobs under `stage` and those for DVC operations like commit and push under `dvc`. For options that apply to all SLURM jobs we can use `all`.
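
The job graph created for a single DVC stage can be pictured with the following sketch. It is an illustration only, not the implementation used by this package: the function `stage_job_graph` and the job names are hypothetical, and the dependency types `afterok` and `afternotok` are the keywords of `sbatch --dependency` that we assume correspond to "upon success" and "upon failure" here.

```python
def stage_job_graph(stage, upstream_stages=(), with_push=False):
    """Return {job name: [(dependency type, job name), ...]} for one DVC stage."""
    graph = {
        # The stage job waits for the stage jobs of all upstream DVC stages
        f"{stage}.stage": [("afterok", f"{up}.stage") for up in upstream_stages],
        # The commit job runs only if the stage job succeeded
        f"{stage}.commit": [("afterok", f"{stage}.stage")],
        # The cleanup job runs only if the stage job failed
        f"{stage}.cleanup": [("afternotok", f"{stage}.stage")],
    }
    if with_push:
        # The optional push job follows a successful commit
        graph[f"{stage}.push"] = [("afterok", f"{stage}.commit")]
    return graph

for job, deps in stage_job_graph("sim_1", upstream_stages=["sim_0"]).items():
    print(job, deps)
```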

We are now ready to set up the simulation stages. Where necessary, we can obtain completion suggestions with `--show-opts`.

In [None]:
%%bash
dvc_create_stage --app-yaml ../../app_sim/dvc_app_slurm.yaml --stage base_simulation \
  --run-label ${SIM_0_RUN_LABEL} \
  --input-simulation ${ETL_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**0 / 3))

dvc_create_stage --app-yaml ../../app_sim/dvc_app_slurm.yaml --stage simulation \
  --run-label ${SIM_1_RUN_LABEL} \
  --input-simulation ${SIM_0_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**1 / 3))

dvc_create_stage --app-yaml ../../app_sim/dvc_app_slurm.yaml --stage simulation \
  --run-label ${SIM_2_RUN_LABEL} \
  --input-simulation ${SIM_1_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**2 / 3))

dvc_create_stage --app-yaml ../../app_sim/dvc_app_slurm.yaml --stage simulation \
  --run-label ${SIM_3_RUN_LABEL} \
  --input-simulation ${SIM_2_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**3 / 3))

tree encrypt config
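
The shell arithmetic `$((10**6 * 2**i / 3))` used for `--simulation-output-file-size` doubles the size with every iteration and divides it among the three output files per rank using integer division. The following sketch reproduces these values in Python; the function name `output_file_size` is hypothetical.

```python
def output_file_size(iteration, files_per_rank=3):
    """Mirror the shell expression $((10**6 * 2**iteration / files_per_rank))."""
    return 10**6 * 2**iteration // files_per_rank

for i in range(4):
    size = output_file_size(i)
    print(i, size, 3 * size)
# prints:
# 0 333333 999999
# 1 666666 1999998
# 2 1333333 3999999
# 3 2666666 7999998
```

The last column is the total per rank, which falls slightly short of `10**6 * 2**i` because of the integer division.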

## Running and monitoring the pipeline with SLURM
The stages can be inspected with:

In [None]:
%%bash
dvc dag --dot config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc.yaml | tee config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc_dag.dot
if [[ $(command -v dot) ]]; then
    dot -Tsvg config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc_dag.dot > config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc_dag.svg
fi

In [None]:
dvc_dag_img = 'config/app_sim_v1/sim_dataset_v1/simulation/' + os.environ['SIM_3_RUN_LABEL'] + '/dvc_dag.svg'
if os.path.exists(dvc_dag_img):
    display(SVG(filename=dvc_dag_img))

We can now submit the pipeline of asynchronous SLURM stages. Note that we use `dvc repro --no-commit` to submit asynchronous stages, so that no output is added to the DVC cache at SLURM job submission time.

In [None]:
%%bash
if [[ ! -x "$(command -v sarus)" ]]; then  # slurm_enqueue checks availability of sarus
    module load sarus
fi
dvc repro --no-commit config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc.yaml

The submitted SLURM stages can be monitored with a status file and log files in the `dvc.yaml` directory and with the tool `dvc_scontrol`. The latter makes the functionality of SLURM's `scontrol` available for asynchronous DVC stages and thus also allows controlling DVC SLURM job groups. In particular, we can list the submitted jobs.

In [None]:
!dvc_scontrol show all

As the stage and commit jobs are put on hold in order to enable the submission of more DVC SLURM stages by the user, we have to release them. 

In [None]:
!dvc_scontrol release stage,commit

This will enable resource allocation and execution of the application stages. Upon successful execution, the stage outputs will also be committed. We can now log out from the SLURM cluster and return later to inspect results.

For the purpose of this tutorial, however, we want to keep monitoring the pipeline on SLURM and wait for the completion of the last job, which is a DVC commit.  

In [None]:
%%bash
# ID of last commit job
commit_jobid=$(cat config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/app_sim_v1_sim_dataset_v1_simulation_${SIM_3_RUN_LABEL}.dvc_commit_jobid)
../../slurm_wait_for_job.sh ${commit_jobid}
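
Waiting for a job amounts to polling the queue until the job no longer appears in it. The following Python sketch illustrates the idea; it is not the content of `slurm_wait_for_job.sh`, and the functions `job_in_queue` and `wait_for_job` are hypothetical. It assumes that `squeue --noheader --jobs <jobid>` prints nothing or fails once the job has left the queue.

```python
import subprocess
import time

def job_in_queue(jobid):
    """Return True while squeue still lists the job (e.g. pending or running)."""
    result = subprocess.run(['squeue', '--noheader', '--jobs', str(jobid)],
                            capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ''

def wait_for_job(jobid, is_active=job_in_queue, poll_seconds=30.0):
    """Block until is_active(jobid) returns False; return the number of waits."""
    waits = 0
    while is_active(jobid):
        waits += 1
        time.sleep(poll_seconds)
    return waits

# Hypothetical usage on a SLURM login node, with the job ID read from the *_jobid file:
# wait_for_job(commit_jobid)
```

Passing the queue check as the `is_active` parameter keeps the polling loop testable on machines without SLURM.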

Once the pipeline has completed, as described above, results can be inspected by running EncFS with `encfs_launch` from the DVC root directory in another terminal and then accessing the data through the decrypted directory.

In [None]:
!tree encrypt config