# A PyTorch Vision Transformer example with EncFS and SLURM

This application puts the concepts introduced in the previous tutorials (DVC policies, EncFS and SLURM) into action with a [Vision Transformer](https://github.com/pytorch/examples/blob/main/vision_transformer) network from the PyTorch example collection, applied to the CIFAR10 dataset. Although we do not use containers here, the example can be extended to use the Sarus container engine without difficulty.

## Initializing the DVC repository
We first import the dependencies for the tutorial.

In [None]:
import os

In [None]:
from IPython.display import SVG  # test_vit_example: skip

Create a new directory `data/v3` for the DVC root and change to it.

In [None]:
os.makedirs('data/v3', exist_ok=True)  # create the DVC root if it does not exist yet
os.chdir('data/v3')

Initialize an `encfs` DVC repository as explained in the [EncFS-simulation tutorial](encfs_sim_tutorial.ipynb) using the command

In [None]:
!dvc_init_repo . encfs

As a next step, EncFS needs to be configured, which can be achieved by running

```shell
${ENCFS_INSTALL_DIR}/bin/encfs -o allow_root,max_write=1048576,big_writes -f encrypt decrypt
```
as described in the [EncFS initialization instructions](../async_encfs_dvc/encfs_int/README.md).

Here, solely for the purpose of this tutorial, we use a pre-established configuration with a simple password. This is for demonstration only: in practice, always generate a **random** key and store it in a **safe location**!

In [None]:
%%bash
echo 1234 > encfs_tutorial.key
cp $(git rev-parse --show-toplevel)/examples/.encfs6.xml.tutorial encrypt/ && mv encrypt/.encfs6.xml.tutorial encrypt/.encfs6.xml
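
For a real deployment, a random key could be generated along the following lines. This is a minimal sketch using Python's standard `secrets` module; the helper name `write_random_key` and the key length are assumptions for illustration and not part of the tutorial's tooling. Note that the pre-established `.encfs6.xml` used above is tied to the tutorial password, so a new key would need its own EncFS configuration.

```python
import os
import secrets


def write_random_key(path, num_bytes=32):
    """Write a random, URL-safe key to `path`, readable by the owner only.

    Hypothetical helper: name and default key length are illustrative.
    """
    key = secrets.token_urlsafe(num_bytes)
    # O_EXCL refuses to overwrite an existing key file;
    # mode 0o600 restricts read/write access to the owner.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, 'w') as f:
        f.write(key + '\n')
    return path
```

The resulting file would then take the place of `encfs_tutorial.key`, ideally stored outside the repository.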

At runtime, EncFS reads the password from a file. The location of that file is passed in the environment variable `ENCFS_PW_FILE`, which has to be set whenever `dvc repro` is run on a stage or `encfs_launch` is used, e.g. to inspect the encrypted data interactively.

In [None]:
os.environ['ENCFS_PW_FILE'] = os.path.realpath('encfs_tutorial.key')

The DVC repo has been initialized with repo and stage policies available under `.dvc_policies`.

In [None]:
!tree .dvc_policies

For the purpose of this tutorial, we will extract the paths of the encrypted directory and the mount target of EncFS into environment variables. This is not a necessary step to run DVC stages with EncFS, though.

In [None]:
from async_encfs_dvc.encfs_int.mount_config import load_mount_config

mount_config = [os.popen(f"echo {d}").read().strip() for d in  # evaluating shell exprs in paths
                load_mount_config('.dvc_policies/repo/dvc_root.yaml')]

os.environ['ENCFS_ENCRYPT_DIR'] = mount_config[0]  # encrypt (same on all hosts)
os.environ['ENCFS_DECRYPT_DIR'] = mount_config[1]  # host-specific
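
The `os.popen(f"echo {d}")` call above starts a shell for every path so that arbitrary shell expressions get evaluated. If the configured paths only contain environment-variable references (an assumption about the mount configuration, not something guaranteed here), the expansion could also be done without a shell. The helper name `expand_mount_paths` is hypothetical.

```python
import os


def expand_mount_paths(paths):
    """Expand $VAR/${VAR} references and a leading ~ without starting a shell.

    Unlike the shell-based approach, this does not evaluate command
    substitutions such as $(hostname).
    """
    return [os.path.expanduser(os.path.expandvars(p)) for p in paths]
```

With this sketch, `expand_mount_paths(load_mount_config('.dvc_policies/repo/dvc_root.yaml'))` would stand in for the list comprehension above, as long as no command substitution is needed.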

## Fetching the input dataset
Our pipeline is based on the CIFAR10 dataset. For the purpose of this example, we also use the test dataset as input at the inference stage. The CIFAR10 dataset requires the training and test datasets to be co-located. Therefore, we do not use the same fine-grained DVC file hierarchy as in the ML tutorial, where the original and preprocessed data were grouped separately for each of training, test and inference. We download the dataset using the code in `ex_in` and track it with DVC using the following commands:

In [None]:
%env CIFAR10_IN_RUN_LABEL=init

In [None]:
!dvc_create_stage --app-yaml ../../ex_in/dvc_app.yaml --stage fetch_cifar10 --run-label ${CIFAR10_IN_RUN_LABEL}


This stage can be executed, frozen upon success and the resulting file hierarchy inspected with

In [None]:
!dvc repro --no-commit config/in/cifar10/original/dvc.yaml

Since this is an asynchronous SLURM stage, we have to wait for its completion before we can inspect the data. As shown in the [SLURM tutorial](slurm_async_sim_tutorial.ipynb), this includes first releasing the stage and commit jobs.

In [None]:
!dvc_scontrol release stage,commit

In [None]:
!../../slurm_wait_for_job.sh $(cat config/in/cifar10/original/in_original_fetch_cifar10_${CIFAR10_IN_RUN_LABEL}.dvc_commit_jobid)

As we do not expect this dataset to change, we can freeze this input stage. 

In [None]:
!dvc freeze config/in/cifar10/original/dvc.yaml:in_original_fetch_cifar10_${CIFAR10_IN_RUN_LABEL}

In [None]:
!tree encrypt/in config

We convince ourselves that the data is encrypted as demonstrated in the [EncFS-simulation tutorial](encfs_sim_tutorial.ipynb)

In [None]:
!head encrypt/in/cifar10/original/output/cifar-10-batches-py/readme.html

whereas it can readily be inspected with EncFS (*never* handle confidential data in this way; always use `encfs_launch` as advised in the EncFS tutorial)

In [None]:
%%bash
encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} /dev/null cp ${ENCFS_DECRYPT_DIR}/in/cifar10/original/output/cifar-10-batches-py/readme.html readme.html >/dev/null
head readme.html
rm readme.html

The execution of the data fetch stage can also be deferred to the ML stages, where it will be triggered as a dependency.

Before moving to the next stage, we define execution labels based on timestamps for the subsequent DVC stages. In a real-world application, the timestamps would usually be generated on the fly when creating the DVC stage.

In [None]:
%env VIT_TRAIN_RUN_LABEL=run_20230721_081756
%env VIT_INF_RUN_LABEL=run_20230721_115412
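
Generating such a label on the fly could look like the sketch below. The format string reproduces the `run_YYYYMMDD_HHMMSS` pattern of the fixed labels above; the helper name `make_run_label` is hypothetical.

```python
from datetime import datetime


def make_run_label(now=None):
    """Return a run label in the form run_YYYYMMDD_HHMMSS.

    Uses the current local time unless a datetime is passed in.
    """
    if now is None:
        now = datetime.now()
    return now.strftime('run_%Y%m%d_%H%M%S')
```

In a notebook, the label could then be exported with `os.environ['VIT_TRAIN_RUN_LABEL'] = make_run_label()` right before the stage is created.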

The next step usually involves setting up preprocessing stages. In the Vision Transformer example, this corresponds to resizing the images. Since this is an inexpensive operation and comes integrated with the training and inference in the publicly available code, we will skip this. For instructions on how to define a separate preprocessing step, please refer to the [ML repository tutorial](ml_tutorial.md).

## Creating the training and inference stages
Now, we are ready to set up the stages for the Vision Transformer model based on the CIFAR10 dataset:

In [None]:
%%bash
mkdir -p {encrypt,config}/ex_vit/cifar10/baseline_model/{training,inference,config}

We capture the model configuration and output under the name `baseline_model` to refer to the publicly available model. The default values for hyperparameters and model architecture specification are encapsulated in a file `config.yaml`. We commit this file to the DVC repository using a `config` stage analogous to the manual input stage for datasets.

In [None]:
!dvc_create_stage --app-yaml ../../ex_vit/dvc_app.yaml --stage config --run-label init --config-group default

In [None]:
!encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} config/ex_vit/cifar10/baseline_model/config/default/output/stage_log.out cp -v ../../ex_vit/config.yaml ${ENCFS_DECRYPT_DIR}/ex_vit/cifar10/baseline_model/config/default/output/

In [None]:
!dvc commit --force config/ex_vit/cifar10/baseline_model/config/default/dvc.yaml

In [None]:
!tree {encrypt,config}/ex_vit

Following this, we can set up the training and inference stages. Where necessary, we can obtain completion suggestions with `--show-opts`.


In [None]:
%%bash
dvc_create_stage --app-yaml ../../ex_vit/dvc_app.yaml --stage training \
    --run-label ${VIT_TRAIN_RUN_LABEL} \
    --input-config default --input-config-file config.yaml --input-training . --input-test .
dvc_create_stage --app-yaml ../../ex_vit/dvc_app.yaml --stage inference \
    --run-label ${VIT_INF_RUN_LABEL} \
    --input-config default --input-config-file config.yaml --input-training ${VIT_TRAIN_RUN_LABEL} --input-inference .

## Running and monitoring training and inference with SLURM
These stages can be inspected with:

In [None]:
%%bash
dvc dag --dot config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc.yaml | tee config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc_dag.dot
if [[ $(command -v dot) ]]; then
    dot -Tsvg config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc_dag.dot > config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc_dag.svg
fi

In [None]:
dvc_dag_img = 'config/ex_vit/cifar10/baseline_model/inference/' + os.environ['VIT_INF_RUN_LABEL'] + '/dvc_dag.svg'
if os.path.exists(dvc_dag_img):
    display(SVG(filename=dvc_dag_img))

Finally, they can be executed with:

In [None]:
%%bash
dvc repro --no-commit config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc.yaml

The submitted SLURM stages can be monitored through a status file and log files in the `dvc.yaml` directory, as well as with the tool `dvc_scontrol` known from the [SLURM tutorial](slurm_async_sim_tutorial.ipynb).

In [None]:
!dvc_scontrol show all

As the stage and commit jobs are put on hold in order to enable the submission of more DVC SLURM stages by the user, we have to release them. 

In [None]:
!dvc_scontrol release stage,commit

This will enable resource allocation and execution of the application stages. Upon successful execution, the stage outputs will also be committed. We can now log out from the SLURM cluster and return later to inspect results.

For the purpose of this notebook, however, we want to keep monitoring the pipeline on SLURM and wait for the completion of the last job, which is a DVC commit.  

In [None]:
%%bash
# ID of last commit job
commit_jobid=$(cat config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/ex_vit_cifar10_baseline_model_inference_${VIT_INF_RUN_LABEL}.dvc_commit_jobid)
../../slurm_wait_for_job.sh ${commit_jobid}

In [None]:
!tree {encrypt,config}/{in,ex_vit}

We will convince ourselves that the predicted labels are encrypted. 

In [None]:
import pickle

os.environ['VIT_PREDICTED_LABELS'] = f"ex_vit/cifar10/baseline_model/inference/{os.environ['VIT_INF_RUN_LABEL']}/output/predicted_labels.pkl"

def unpickle_and_print_labels(filepath):
    try:
        with open(filepath, 'rb') as f:
            predicted_labels = pickle.load(f)
        print("Successfully unpickled labels:")
        print(predicted_labels)
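    except (pickle.UnpicklingError, EOFError, AttributeError, ImportError, IndexError):
        # Unpickling invalid (e.g. encrypted) bytes can fail with exceptions other than UnpicklingError
        print(f"Error unpickling the file at {filepath}.")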

unpickle_and_print_labels(os.path.join('encrypt', os.environ['VIT_PREDICTED_LABELS']))

and that they can be read through the EncFS layer (only for exposition; not to be performed on confidential data)

In [None]:
!encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} /dev/null cp ${ENCFS_DECRYPT_DIR}/${VIT_PREDICTED_LABELS} predicted_labels.pkl >/dev/null

In [None]:
unpickle_and_print_labels('predicted_labels.pkl')
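
As a complementary check, the file under `encrypt` can be compared byte by byte with the decrypted copy. The sketch below relies on `filecmp.cmp` from the standard library; the helper name `contents_differ` is hypothetical and not part of the tutorial's tooling.

```python
import filecmp


def contents_differ(path_a, path_b):
    """Return True if the two files differ in content.

    shallow=False forces a byte-wise comparison instead of
    relying on file size and modification time alone.
    """
    return not filecmp.cmp(path_a, path_b, shallow=False)
```

Called as `contents_differ(os.path.join('encrypt', os.environ['VIT_PREDICTED_LABELS']), 'predicted_labels.pkl')` while the decrypted copy still exists, this would be expected to return `True`.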

In [None]:
!rm predicted_labels.pkl