# Tutorial: Running DVC stages with Docker and EncFS

This guide demonstrates how to configure app policies so that an application runs in a Docker container and operates on encrypted data managed through EncFS. For the steps to construct a DVC repository with infrastructure-as-code techniques, please refer to the [tutorial for an ML repository](ml_tutorial.ipynb). Our example application performs an iterative simulation with a linear data dependency graph. The key steps we will cover are:

 * Initialize a DVC repository to be managed by EncFS
 * Configure an app policy to use Docker in a DVC stage
 * Run a DVC stage using Docker and EncFS and inspect the results through a manually mounted encrypted filesystem

While we use Docker here, the container engine [Sarus](https://products.cscs.ch/sarus/) is equally supported, and extensions for other engines are possible as well.

As a prerequisite to this tutorial, you should have `encfs` installed as described in the [instructions](../async_encfs_dvc/encfs_scripts/README.md).

## Initializing the DVC repository
We first import the dependencies for the tutorial.

In [None]:
import os

In [None]:
from IPython.display import SVG  # test_encfs_sim_tutorial: skip

Create a new directory `data/v1` for the DVC root and change to it.

In [None]:
os.makedirs('data/v1', exist_ok=True)  # create the DVC root if it does not exist yet
os.chdir('data/v1')

Initialize an `encfs` DVC repository using the command

In [None]:
!dvc_init_repo . encfs

As a next step, EncFS needs to be configured, which can be done by running

```shell
${ENCFS_INSTALL_DIR}/bin/encfs -o allow_root,max_write=1048576,big_writes -f encrypt decrypt
```
as described in the [EncFS initialization instructions](../async_encfs_dvc/encfs_scripts/README.md).

For the purpose of this tutorial only, we use a pre-established configuration with a simple password. This is for demonstration purposes only: in practice, always generate a **random** key and store it in a **safe location**!

In [None]:
%%bash
echo 1234 > encfs_tutorial.key
cp $(git rev-parse --show-toplevel)/examples/.encfs6.xml.tutorial encrypt/ && mv encrypt/.encfs6.xml.tutorial encrypt/.encfs6.xml
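
For real data, the key should come from a cryptographically secure random source rather than a fixed string. A minimal sketch of how this could be done is shown below. The helper `write_random_key` and the key length of 32 bytes are assumptions for illustration and are not part of this repository. Such a key would be used when the EncFS configuration is first created; the pre-established tutorial configuration only works with the demonstration password.

```python
import os
import secrets


def write_random_key(path, num_bytes=32):
    """Write a random, URL-safe text key to `path`, accessible by the owner only.

    Hypothetical helper for illustration; not part of the tutorial's tooling.
    """
    key = secrets.token_urlsafe(num_bytes)
    # O_EXCL makes the call fail instead of overwriting an existing key file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, 'w') as f:
        f.write(key + '\n')
    return path
```

Refusing to overwrite an existing file is a deliberate choice in this sketch, since replacing the key of an already initialized volume would make its data inaccessible with the new file.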

At runtime, EncFS will read the password from a file. The location of that file is passed in an environment variable that has to be set whenever `dvc repro` is run on a stage or `encfs_launch` is used, e.g., to inspect the encrypted data interactively.

In [None]:
os.environ['ENCFS_PW_FILE'] = os.path.realpath('encfs_tutorial.key')
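
To catch a missing or mistyped path before launching a stage, the variable can be checked up front. The following is a minimal sketch; the helper `check_pw_file` is hypothetical, and only the variable name `ENCFS_PW_FILE` is taken from this tutorial.

```python
import os


def check_pw_file(environ=None):
    """Return the password file path from ENCFS_PW_FILE or raise a clear error.

    Hypothetical helper for illustration; not part of the tutorial's tooling.
    """
    environ = os.environ if environ is None else environ
    path = environ.get('ENCFS_PW_FILE')
    if not path:
        raise RuntimeError('ENCFS_PW_FILE is not set')
    if not os.path.isfile(path):
        raise RuntimeError(f'ENCFS_PW_FILE points to a missing file: {path}')
    if not os.access(path, os.R_OK):
        raise RuntimeError(f'ENCFS_PW_FILE is not readable: {path}')
    return path
```

Calling `check_pw_file()` in a notebook cell before `dvc repro` would then fail early with a readable message instead of during the stage run.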

The DVC repo has been initialized with repo and stage policies available under `.dvc_policies`.

In [None]:
!tree .dvc_policies

## Establishing the input dataset
Our pipeline will be based on a dataset labeled `sim_dataset_v1` and a specific subset of it, `ex2` (label chosen arbitrarily). First, we create a DVC stage to track the encrypted input data. As `dvc add` no longer supports the `--file` option to locate the `.dvc` file in a different folder, we use a workaround with a frozen no-op stage analogous to the manual preprocessing step for the training dataset in the ML tutorial.

In [None]:
!dvc_create_stage --app-yaml ../../in/dvc_app.yaml --stage add --dataset-name sim_dataset_v1 --subset-name ex2

We now populate this dataset repository with input data that we generate randomly here, even though it could be downloaded from a remote source. As this will be encrypted data, the `dd` command that generates it has to run in an environment managed by EncFS, which is achieved by wrapping the command with `encfs_mount_and_run_v2.sh`.

In [None]:
%%bash
encfs_mount_and_run_v2.sh encrypt decrypt decrypt/in/sim_dataset_v1/original/ex2/output/encfs_out.log \
    dd if=/dev/urandom of=decrypt/in/sim_dataset_v1/original/ex2/output/sim_in.dat bs=4k iflag=fullblock,count_bytes count=$((10**7)) \
    > config/in/sim_dataset_v1/original/ex2/output/stage_out.log

We can inspect the logs of the outer command, that is of `encfs_mount_and_run_v2.sh`, which we capture in the file `stage_out.log`.

In [None]:
!cat config/in/sim_dataset_v1/original/ex2/output/stage_out.log

These logs are unencrypted and do not leak any details about the application inside the EncFS environment. The application logs are captured separately in the encrypted file `encfs_out.log`. To inspect it, we need to mount a decrypted view of `encrypt`. In a typical session, a user would launch EncFS in the foreground in another terminal using
```shell
encfs_launch .dvc_policies/repo/dvc_root.yaml
```
and then access the decrypted data here. However, *exclusively* for the purpose of this notebook, we will copy the decrypted data to an unencrypted location to inspect it outside of EncFS. We emphasize that this is only for demonstration purposes and confidential data should *never* be handled in that manner.

In [None]:
%%bash
encfs_mount_and_run_v2.sh encrypt decrypt /dev/null cp decrypt/in/sim_dataset_v1/original/ex2/output/encfs_out.log encfs_out.log >/dev/null
cat encfs_out.log
rm encfs_out.log

To avoid logging the output of an EncFS-managed application to a file, you can supply `/dev/null` as a third parameter to `encfs_mount_and_run_v2.sh` as done here.

Finally, we commit the newly added data to DVC as described in the manual preprocessing step for the training dataset in the ML tutorial.

In [None]:
!dvc commit --force config/in/sim_dataset_v1/original/ex2/dvc.yaml 

The resulting file hierarchy looks as follows:

In [None]:
!tree encrypt/in config

Before moving to the definition of preprocessing stages, we define execution labels based on timestamps for the subsequent DVC stages. In a real-world application, the timestamps would usually be generated on the fly when creating the DVC stage.

In [None]:
%env ETL_RUN_LABEL=ex2-20230714-083624
%env SIM_0_RUN_LABEL=ex2-20230714-083812-0
%env SIM_1_RUN_LABEL=ex2-20230714-083812-1
%env SIM_2_RUN_LABEL=ex2-20230714-083812-2
%env SIM_3_RUN_LABEL=ex2-20230714-083812-3
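
Generating such labels on the fly could look like the following sketch. The helper `make_run_label` is hypothetical; it only reproduces the `<subset>-<date>-<time>[-<index>]` pattern of the labels used above.

```python
from datetime import datetime


def make_run_label(subset, when=None, index=None):
    """Build a label such as 'ex2-20230714-083624' or 'ex2-20230714-083812-0'.

    Hypothetical helper for illustration; not part of the tutorial's tooling.
    """
    when = datetime.now() if when is None else when
    label = f"{subset}-{when.strftime('%Y%m%d-%H%M%S')}"
    if index is not None:  # explicit None check so that index 0 is kept
        label += f"-{index}"
    return label
```

In a notebook, the result could be assigned with, for example, `os.environ['ETL_RUN_LABEL'] = make_run_label('ex2')`. Passing one shared `when` value to all simulation stages keeps their labels aligned, as in the fixed labels above.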

## Constructing the preprocessing stage
The next step involves setting up the preprocessing stage. For illustration purposes, we will use the same preprocessing application as in the ML repository tutorial.

In [None]:
%%bash

dvc_create_stage --app-yaml ../../app_prep/dvc_app.yaml --stage sim \
    --run-label ${ETL_RUN_LABEL} --input-etl ex2 --input-etl-file output/sim_in.dat

In [None]:
%%bash

dvc repro config/in/sim_dataset_v1/app_prep_v1/auto/${ETL_RUN_LABEL}/dvc.yaml

Execution can also be deferred to the simulation stages, where it will be triggered as a dependency.

## Creating the simulation stages
Lastly, we will establish a rudimentary structure for an iterative simulation workflow that utilizes the preprocessed data. A typical simulation comes with its own tunable parameters. These can be kept in a file, as done for the hyperparameters in the machine learning tutorial. For brevity, we skip this step here and assume that all parameters are passed through the command line.

We are now ready to set up the simulation stages. Where necessary, we can obtain completion suggestions with `--show-opts`.

In [None]:
%%bash
dvc_create_stage --app-yaml ../../app_sim/dvc_app_docker.yaml --stage base_simulation \
  --run-label ${SIM_0_RUN_LABEL} \
  --input-simulation ${ETL_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**0 / 3))

dvc_create_stage --app-yaml ../../app_sim/dvc_app_docker.yaml --stage simulation \
  --run-label ${SIM_1_RUN_LABEL} \
  --input-simulation ${SIM_0_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**1 / 3))

dvc_create_stage --app-yaml ../../app_sim/dvc_app_docker.yaml --stage simulation \
  --run-label ${SIM_2_RUN_LABEL} \
  --input-simulation ${SIM_1_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**2 / 3))

dvc_create_stage --app-yaml ../../app_sim/dvc_app_docker.yaml --stage simulation \
  --run-label ${SIM_3_RUN_LABEL} \
  --input-simulation ${SIM_2_RUN_LABEL} \
  --simulation-output-file-num-per-rank 3 \
  --simulation-output-file-size $((10**6 * 2**3 / 3))

tree encrypt config
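
The four calls above follow a regular pattern: each stage takes the previous run label as input, and the output file size roughly doubles from one stage to the next. As a sketch, the same argument lists could be generated in Python. The function `simulation_stage_commands` is hypothetical; the flags and values are copied from the cell above, and the commands are only built here, not executed.

```python
def simulation_stage_commands(etl_label, sim_labels):
    """Return one dvc_create_stage argument list per simulation stage.

    Hypothetical helper for illustration. The first stage reads the
    preprocessing output; every later stage reads its predecessor's output.
    """
    commands = []
    previous = etl_label
    for i, label in enumerate(sim_labels):
        stage = 'base_simulation' if i == 0 else 'simulation'
        # Integer division mirrors the shell arithmetic $((10**6 * 2**i / 3)).
        file_size = 10**6 * 2**i // 3
        commands.append([
            'dvc_create_stage',
            '--app-yaml', '../../app_sim/dvc_app_docker.yaml',
            '--stage', stage,
            '--run-label', label,
            '--input-simulation', previous,
            '--simulation-output-file-num-per-rank', '3',
            '--simulation-output-file-size', str(file_size),
        ])
        previous = label
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` from the DVC root to create the corresponding stage.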

## Running the pipeline
The stages can be inspected with:

In [None]:
%%bash
dvc dag --dot config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc.yaml | tee config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc_dag.dot
if [[ $(command -v dot) ]]; then
    dot -Tsvg config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc_dag.dot > config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc_dag.svg
fi

In [None]:
display(SVG(filename='config/app_sim_v1/sim_dataset_v1/simulation/' + os.environ['SIM_3_RUN_LABEL'] + '/dvc_dag.svg'))  # test_encfs_sim_tutorial: skip
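
Note that the SVG file is only written when `dot` is available, so the display call may fail on systems without Graphviz. A small guard, sketched here with a hypothetical helper `dag_svg_path`, can make this explicit; the default directory is the one used in the cells above.

```python
import os


def dag_svg_path(run_label, root='config/app_sim_v1/sim_dataset_v1/simulation'):
    """Return the path of the rendered DAG, or None if it was not produced.

    Hypothetical helper for illustration; not part of the tutorial's tooling.
    """
    path = os.path.join(root, run_label, 'dvc_dag.svg')
    return path if os.path.isfile(path) else None
```

In the notebook, the display call could then be guarded with a check that `dag_svg_path(os.environ['SIM_3_RUN_LABEL'])` is not `None`, falling back to the printed `.dot` output otherwise.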

Finally, the stages can be executed with:

In [None]:
%%bash
dvc repro config/app_sim_v1/sim_dataset_v1/simulation/${SIM_3_RUN_LABEL}/dvc.yaml

As described above, results can be inspected by running EncFS with `encfs_launch .dvc_policies/repo/dvc_root.yaml` from the DVC root directory in another terminal and then accessing the data through the `decrypt` directory.

In [None]:
!tree encrypt config