# Distributed training and inference with a Vision Transformer in PyTorch using EncFS and SLURM

This application shows the concepts introduced in the previous tutorials (DVC policies, EncFS and SLURM) in action, using a [Vision Transformer](https://github.com/pytorch/examples/blob/main/vision_transformer) network from the PyTorch example collection applied to the CIFAR10 dataset. We first download the data to an encrypted directory, then perform distributed training and finally run inference with the Vision Transformer. We do not use containers here, but the setup can be extended to the Sarus container engine without difficulty.

## Initializing the DVC repository
We first import the dependencies for the tutorial.

In [1]:
import os
import traceback
import pickle

In [2]:
from IPython.display import SVG  # test_vit_example: skip

Create a new directory `data/v3` for the DVC root and change to it.

In [3]:
os.makedirs('data/v3', exist_ok=True)  # create the DVC root if it does not exist yet
os.chdir('data/v3')

Initialize an `encfs` DVC repository as explained in the [EncFS-simulation tutorial](encfs_sim_tutorial.ipynb) using the command

In [4]:
!dvc_init_repo . encfs

2023-07-26 11:07:31,257 DEBUG: v0.1.dev8866+gd72e04c, CPython 3.9.4 on Linux-5.3.18-24.102-default-x86_64-with-glibc2.26
2023-07-26 11:07:31,257 DEBUG: command: /scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/venv/bin/dvc init --subdir --verbose
2023-07-26 11:07:36,465 DEBUG: Added '/scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/.dvc/config.local' to gitignore file.
2023-07-26 11:07:36,466 DEBUG: Added '/scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/.dvc/tmp' to gitignore file.
2023-07-26 11:07:36,467 DEBUG: Added '/scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/.dvc/cache' to gitignore file.
2023-07-26 11:07:36,468 DEBUG: Removing '/var/tmp/dvc/repo/21533fdb769b0281e411ded44bd8c4cf'
2023-07-26 11:07:36,694 DEBUG: Staging files: {'/scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples

As a next step, EncFS needs to be configured, which can be achieved by running

```shell
${ENCFS_INSTALL_DIR}/bin/encfs -o allow_root,max_write=1048576,big_writes -f encrypt decrypt
```
as described in the [EncFS initialization instructions](../async_encfs_dvc/encfs_int/README.md).

Here, only for the purpose of this tutorial, we use a pre-established configuration with a simple password. This shortcut is strictly for demonstration. In practice, always generate a **random** key and store it in a **safe location**!

In [5]:
%%bash
echo 1234 > encfs_tutorial.key
cp $(git rev-parse --show-toplevel)/examples/.encfs6.xml.tutorial encrypt/ && mv encrypt/.encfs6.xml.tutorial encrypt/.encfs6.xml
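
As an illustration of the advice above, a random key file could be created along the following lines. This is a minimal sketch and not part of the tutorial tooling: the helper name `write_random_key` and the file name in the usage comment are hypothetical. A key generated this way would presumably not match the pre-established tutorial configuration copied above, so a fresh EncFS configuration would have to be created with it.

```python
import os
import secrets


def write_random_key(path, nbytes=32):
    """Write a random hex-encoded key to `path`, accessible by the owner only.

    Hypothetical helper for illustration; not part of async-encfs-dvc.
    """
    key = secrets.token_hex(nbytes)  # 2 * nbytes hexadecimal characters
    # O_EXCL makes the call fail instead of overwriting an existing key file;
    # mode 0o600 restricts access to the file owner.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, 'w') as f:
        f.write(key + '\n')
    return path


# Usage (hypothetical file name):
# write_random_key('encfs_random.key')
```

Where the key file is stored (for example outside the repository, on a file system with restricted access) remains a decision for the operator.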

At runtime, EncFS will read the password from a file. The location of that file is passed in the environment variable `ENCFS_PW_FILE`, which has to be set whenever `dvc repro` is run on a stage or `encfs_launch` is used, e.g. to inspect the encrypted data interactively.

In [6]:
os.environ['ENCFS_PW_FILE'] = os.path.realpath('encfs_tutorial.key')

The DVC repo has been initialized with repo and stage policies available under `.dvc_policies`.

In [7]:
!tree .dvc_policies

.dvc_policies
├── repo
│   └── dvc_root.yaml
└── stages
    ├── dvc_config.yaml
    ├── dvc_etl.yaml
    ├── dvc_in.yaml
    ├── dvc_ml_inference.yaml
    ├── dvc_ml_training.yaml
    └── dvc_simulation.yaml

2 directories, 7 files


For the purpose of this tutorial, we will extract the paths of the encrypted directory and the mount target of EncFS into environment variables. This is not a necessary step to run DVC stages with EncFS, though.

In [8]:
from async_encfs_dvc.encfs_int.mount_config import load_mount_config

mount_config = [os.popen(f"echo {d}").read().strip() for d in  # evaluating shell exprs in paths
                load_mount_config('.dvc_policies/repo/dvc_root.yaml')]

os.environ['ENCFS_ENCRYPT_DIR'] = mount_config[0]  # encrypt (same on all hosts)
os.environ['ENCFS_DECRYPT_DIR'] = mount_config[1]  # host-specific
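
The cell above evaluates shell expressions in the configured paths by piping them through `echo`. If the paths only contain environment variable references, a pure-Python alternative could look like the sketch below. The helper name `expand_path` and the variable `EXAMPLE_TMP` are hypothetical, and unlike the shell this approach does not evaluate command substitutions such as `$(...)`, so whether it suffices depends on what the mount configuration actually contains.

```python
import os


def expand_path(path):
    """Expand $VAR / ${VAR} references and a leading ~ in a path string.

    Hypothetical helper; does not handle shell command substitution.
    """
    return os.path.expanduser(os.path.expandvars(path))


os.environ['EXAMPLE_TMP'] = '/tmp'  # hypothetical variable for illustration
print(expand_path('${EXAMPLE_TMP}/encfs_decrypt'))  # /tmp/encfs_decrypt
```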

## Fetching the input dataset
Our pipeline will be based on the CIFAR10 dataset. For the purpose of this example, we will also use the test dataset as input at the inference stage. The CIFAR10 dataset requires the training and test datasets to be co-located. Therefore, we will not use the same fine-grained DVC file hierarchy as in the ML tutorial, where the original and preprocessed data were grouped separately for each of training, test and inference. We download the dataset using the code in `ex_in` and track it with DVC using the following commands:

In [9]:
%env CIFAR10_IN_RUN_LABEL=init

env: CIFAR10_IN_RUN_LABEL=init


In [10]:
!dvc_create_stage --app-yaml ../../ex_in/dvc_app.yaml --stage fetch_cifar10 --run-label ${CIFAR10_IN_RUN_LABEL}


Writing DVC stage to config/in/cifar10/original
Using encfs - don't forget to set ENCFS_PW_FILE/ENCFS_INSTALL_DIR when running 'dvc repro --no-commit'.
Added stage 'in_original_fetch_cifar10_init' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml .gitignore ../../../../encrypt/in/cifar10/original/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

This stage can be executed with the command below. Upon success, we freeze it and inspect the resulting file hierarchy.

In [11]:
!dvc repro --no-commit config/in/cifar10/original/dvc.yaml

Running stage 'config/in/cifar10/original/dvc.yaml:in_original_fetch_cifar10_init':
> if (set -o pipefail) 2>/dev/null; then set -o pipefail; fi; mkdir -p ../../../../encrypt/in/cifar10/original/output output && slurm_enqueue.sh in_original_fetch_cifar10_init dvc_app.yaml fetch_cifar10 encfs_mount_and_run ../../../../encrypt /tmp/encfs_25680_async_encfs_dvc_in_original_fetch_cifar10_init_9127b8ee0785_cc38b32792ae ../../../../../../../../../../../../tmp/encfs_25680_async_encfs_dvc_in_original_fetch_cifar10_init_9127b8ee0785_cc38b32792ae/in/cifar10/original/output/encfs_out_{MPI_RANK}.log bash -c "$(git rev-parse --show-toplevel)/examples/ex_in/fetch_cifar10.py --in-output /tmp/encfs_25680_async_encfs_dvc_in_original_fetch_cifar10_init_9127b8ee0785_cc38b32792ae/in/cifar10/original/output" 2>&1 | tee output/stage_out.log
dvc_scontrol: All jobs of lukasd held. To make sure that no other users run DVC commands on this repo use 'dvc_scontrol show commit'.
dvc_scontrol: All jobs of lukasd hel

Since this is an asynchronous SLURM stage, we have to wait for its completion before we can inspect the data. As shown in the [SLURM tutorial](slurm_async_sim_tutorial.ipynb), this requires first releasing the stage and commit jobs.

In [12]:
!dvc_scontrol release stage,commit

dvc_scontrol: Releasing job dvc_in_original_fetch_cifar10_init_cc38b32792ae[sbatch_dvc_stage_in_original_fetch_cifar10_init.sh] (reason JobHeldUser) at 47964849.
dvc_scontrol: Releasing job dvc_op_cc38b32792ae[sbatch_dvc_commit.sh] (reason JobHeldUser) at 47964855.


In [13]:
!../../slurm_wait_for_job.sh $(cat config/in/cifar10/original/in_original_fetch_cifar10_${CIFAR10_IN_RUN_LABEL}.dvc_commit_jobid)

Monitoring SLURM job 47964855.
The SLURM pipeline is still in process.
dvc_scontrol: DVC stage jobs:
           USER           JOBID     REASON       TIME  TIME_LEFT                                                         NAME                                                                                                                           WORK_DIR
         lukasd        47964849       None       0:30    3:59:30              dvc_in_original_fetch_cifar10_init_cc38b32792ae                                      /scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/config/in/cifar10/original
dvc_scontrol: DVC commit jobs:
           USER           JOBID     REASON       TIME  TIME_LEFT                                                         NAME                                                                                                                           WORK_DIR
         lukasd        47964855 Dependency       0:00    4:00:00                                 
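
Conceptually, waiting for the commit job amounts to polling the SLURM queue until the job has left it. The following is a minimal sketch of such a loop and not the implementation of `slurm_wait_for_job.sh`. The function names are hypothetical; it assumes `squeue` is on the `PATH` and uses its `-h` (no header), `-j` (job ID) and `-o %T` (extended job state) options.

```python
import subprocess
import time

# SLURM states in which a job is still waiting for or using resources
ACTIVE_STATES = {'PENDING', 'CONFIGURING', 'RUNNING', 'SUSPENDED', 'COMPLETING'}


def slurm_job_active(jobid):
    """Return True while `squeue` reports the job in one of the active states."""
    result = subprocess.run(['squeue', '-h', '-j', str(jobid), '-o', '%T'],
                            capture_output=True, text=True)
    states = set(result.stdout.split())
    return result.returncode == 0 and bool(states & ACTIVE_STATES)


def wait_for_job(jobid, is_active=slurm_job_active, poll_seconds=30,
                 sleep=time.sleep):
    """Block until `is_active(jobid)` returns False; return the number of polls."""
    polls = 0
    while is_active(jobid):
        polls += 1
        sleep(poll_seconds)
    return polls
```

Note that a held job counts as pending, so this loop would not terminate if the jobs were never released, and it does not distinguish a successful from a failed job.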

As we do not expect this dataset to change, we can freeze this input stage. 

In [14]:
!dvc freeze config/in/cifar10/original/dvc.yaml:in_original_fetch_cifar10_${CIFAR10_IN_RUN_LABEL}

Modifying stage 'in_original_fetch_cifar10_init' in 'config/in/cifar10/original/dvc.yaml'

In [15]:
!tree encrypt/in config

encrypt/in
└── cifar10
    └── original
        └── output
            ├── cifar-10-batches-py
            │   ├── batches.meta
            │   ├── data_batch_1
            │   ├── data_batch_2
            │   ├── data_batch_3
            │   ├── data_batch_4
            │   ├── data_batch_5
            │   ├── readme.html
            │   └── test_batch
            ├── cifar-10-python.tar.gz
            └── encfs_out_0.log
config
└── in
    └── cifar10
        └── original
            ├── dvc_app.yaml
            ├── dvc.lock
            ├── dvc_sbatch.dvc_commit.47964855.err
            ├── dvc_sbatch.dvc_commit.47964855.out
            ├── dvc.yaml
            ├── in_original_fetch_cifar10_init.dvc_cleanup_jobid
            ├── in_original_fetch_cifar10_init.dvc_commit_jobid
            ├── in_original_fetch_cifar10_init.dvc_stage_jobid
            ├── [0

We convince ourselves that the data is encrypted, as demonstrated in the [EncFS-simulation tutorial](encfs_sim_tutorial.ipynb):

In [16]:
!head encrypt/in/cifar10/original/output/cifar-10-batches-py/readme.html

�~j܂�:�O߹ �y��
_�0���N��-wY��;��Od��i�V!�ަ�1�&��qf�Z�_�p\tw���f%8��B�ү ��i#<�B+A�"�

In contrast, it can readily be inspected through EncFS (*never* handle confidential data in this way; always use `encfs_launch` as advised in the EncFS tutorial):

In [17]:
%%bash
encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} /dev/null cp ${ENCFS_DECRYPT_DIR}/in/cifar10/original/output/cifar-10-batches-py/readme.html readme.html >/dev/null
head readme.html
rm readme.html

<meta HTTP-EQUIV="REFRESH" content="0; url=http://www.cs.toronto.edu/~kriz/cifar.html">
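
The visual comparison above can be complemented by a simple programmatic check. The sketch below is a heuristic only, with a hypothetical helper name: it tests whether a byte string decodes as UTF-8 text, which holds for the decrypted `readme.html` but is very unlikely for ciphertext of more than a few bytes. It does not prove that a file is encrypted.

```python
import codecs


def looks_like_text(data):
    """Heuristic: True if `data` contains no NUL byte and decodes as UTF-8."""
    if b'\x00' in data:
        return False
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        # final=False tolerates a multi-byte character cut off at the end,
        # which can happen when only the first bytes of a file are read
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b'<meta HTTP-EQUIV="REFRESH">'))  # True
print(looks_like_text(b'\xff\x9c\x00\x12'))              # False
```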


The execution of the data fetch stage can also be deferred to the ML stages, where it will be triggered as a dependency.

Before moving to the next stage, we define execution labels based on timestamps for the subsequent DVC stages. In a real-world application, the timestamps would usually be generated on the fly when creating the DVC stage.

In [18]:
%env VIT_TRAIN_RUN_LABEL=run_20230721_081756
%env VIT_INF_RUN_LABEL=run_20230721_115412

env: VIT_TRAIN_RUN_LABEL=run_20230721_081756
env: VIT_INF_RUN_LABEL=run_20230721_115412
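
A label of the form used above could be generated on the fly as in the following sketch. The helper name `make_run_label` is hypothetical. The format string simply reproduces the `run_YYYYMMDD_HHMMSS` pattern of the labels in this tutorial.

```python
from datetime import datetime


def make_run_label(now=None):
    """Return a run label such as 'run_20230721_081756' for the given time."""
    if now is None:
        now = datetime.now()
    return now.strftime('run_%Y%m%d_%H%M%S')


print(make_run_label(datetime(2023, 7, 21, 8, 17, 56)))  # run_20230721_081756
```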


The next step usually involves setting up preprocessing stages. In the Vision Transformer example, this corresponds to resizing the images. Since this is an inexpensive operation and comes integrated with the training and inference in the publicly available code, we will skip this. For instructions on how to define a separate preprocessing step, please refer to the [ML repository tutorial](ml_tutorial.md).

## Creating the distributed training and inference stages
Now, we are ready to set up the stages for the Vision Transformer model based on the CIFAR10 dataset:

In [19]:
%%bash
mkdir -p {encrypt,config}/ex_vit/cifar10/baseline_model/{training,inference,config}

We capture the model configuration and output under the name `baseline_model` to refer to the publicly available model. The default values for hyperparameters and model architecture specification are encapsulated in a file `config.yaml`. We commit this file to the DVC repository using a `config` stage analogous to the manual input stage for datasets.

In [20]:
!dvc_create_stage --app-yaml ../../ex_vit/dvc_app.yaml --stage config --run-label init --config-group default

Not using SLURM or MPI in this DVC stage.
Writing DVC stage to config/ex_vit/cifar10/baseline_model/config/default
Using encfs - don't forget to set ENCFS_PW_FILE/ENCFS_INSTALL_DIR when running 'dvc repro'.
Added stage 'ex_vit_cifar10_baseline_model_config_init' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml .gitignore ../../../../../../encrypt/ex_vit/cifar10/baseline_model/config/default/.gitignore

To enable auto staging, run:

	dvc config core.autostage true
Freezing stage for execution outside dvc - run 'dvc commit ex_vit_cifar10_baseline_model_config_init' when outputs are done.
Modifying stage 'ex_vit_cifar10_baseline_model_config_init' in 'dvc.yaml'

In [21]:
!encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} config/ex_vit/cifar10/baseline_model/config/default/output/stage_log.out cp -v ../../ex_vit/config.yaml ${ENCFS_DECRYPT_DIR}/ex_vit/cifar10/baseline_model/config/default/output/

encfs_mount_and_run[encrypt->/tmp/encfs_25680_async_encfs_dvc_cc38b32792ae]: Unable to determine local MPI rank - assuming to run in non-distributed mode (as a single process).
encfs_mount_and_run[encrypt->/tmp/encfs_25680_async_encfs_dvc_cc38b32792ae]: Rank 0 on daint101: Running encfs-mount at /tmp/encfs_25680_async_encfs_dvc_cc38b32792ae.
encfs_mount_and_run[encrypt->/tmp/encfs_25680_async_encfs_dvc_cc38b32792ae]: total 8.0K
drwxr-xr-x 3 lukasd csstaff 4.0K 26. Jul 11:09 ex_vit
drwxr-xr-x 3 lukasd csstaff 4.0K 26. Jul 11:07 in
encfs_mount_and_run[encrypt->/tmp/encfs_25680_async_encfs_dvc_cc38b32792ae]: Rank 0 on daint101: Successfully mounted encfs-dir at /tmp/encfs_25680_async_encfs_dvc_cc38b32792ae and wrote to sync-file /scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/encrypt/.encfs_25680_async_encfs_dvc_cc38b32792ae_daint101_local_sync - starting encfs-job
encfs_mount_and_run[encrypt->/tmp/encfs_25680_async_encfs_dvc_cc38b32792ae]: Rank 0 on daint101: encfs-job

In [22]:
!dvc commit --force config/ex_vit/cifar10/baseline_model/config/default/dvc.yaml

Generating lock file 'config/ex_vit/cifar10/baseline_model/config/default/dvc.lock'
Updating lock file 'config/ex_vit/cifar10/baseline_model/config/default/dvc.lock'

In [23]:
!tree {encrypt,config}/ex_vit

encrypt/ex_vit
└── cifar10
    └── baseline_model
        ├── config
        │   └── default
        │       └── output
        │           └── config.yaml
        ├── inference
        └── training
config/ex_vit
└── cifar10
    └── baseline_model
        ├── config
        │   └── default
        │       ├── dvc_app.yaml
        │       ├── dvc.lock
        │       ├── dvc.yaml
        │       └── output
        │           └── stage_log.out
        ├── inference
        └── training

14 directories, 5 files


Following this, we can set up the training and inference stages. Where necessary, we can obtain completion suggestions with `--show-opts`.


In [24]:
%%bash
dvc_create_stage --app-yaml ../../ex_vit/dvc_app.yaml --stage training \
    --run-label ${VIT_TRAIN_RUN_LABEL} \
    --input-config default --input-config-file config.yaml --input-training . --input-test .
dvc_create_stage --app-yaml ../../ex_vit/dvc_app.yaml --stage inference \
    --run-label ${VIT_INF_RUN_LABEL} \
    --input-config default --input-config-file config.yaml --input-training ${VIT_TRAIN_RUN_LABEL} --input-inference .

Added stage 'ex_vit_cifar10_baseline_model_training_run_20230721_081756' in 'dvc.yaml'

To track the changes with git, run:

	git add ../../../../../../encrypt/ex_vit/cifar10/baseline_model/training/run_20230721_081756/.gitignore .gitignore dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true
Writing DVC stage to config/ex_vit/cifar10/baseline_model/training/run_20230721_081756
Using encfs - don't forget to set ENCFS_PW_FILE/ENCFS_INSTALL_DIR when running 'dvc repro --no-commit'.
Added stage 'ex_vit_cifar10_baseline_model_inference_run_20230721_115412' in 'dvc.yaml'

To track the changes with git, run:

	git add ../../../../../../encrypt/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/.gitignore dvc.yaml .gitignore

To enable auto staging, run:

	dvc config core.autostage true
Writing DVC stage to config/ex_vit/cifar10/baseline_model/inference/run_20230721_115412
Using encfs - don't forget to set ENCFS_PW_FILE/ENCFS_INSTALL_DIR when running 'dvc repro --n

## Running and monitoring the pipeline with SLURM
These stages can be inspected with:

In [25]:
%%bash
dvc dag --dot config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc.yaml | tee config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc_dag.dot
if [[ $(command -v dot) ]]; then
    dot -Tsvg config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc_dag.dot > config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc_dag.svg
fi

strict digraph  {
"config/ex_vit/cifar10/baseline_model/config/default/dvc.yaml:ex_vit_cifar10_baseline_model_config_init";
"config/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/dvc.yaml:ex_vit_cifar10_baseline_model_inference_run_20230721_115412";
"config/ex_vit/cifar10/baseline_model/training/run_20230721_081756/dvc.yaml:ex_vit_cifar10_baseline_model_training_run_20230721_081756";
"config/in/cifar10/original/dvc.yaml:in_original_fetch_cifar10_init";
"config/ex_vit/cifar10/baseline_model/config/default/dvc.yaml:ex_vit_cifar10_baseline_model_config_init" -> "config/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/dvc.yaml:ex_vit_cifar10_baseline_model_inference_run_20230721_115412";
"config/ex_vit/cifar10/baseline_model/config/default/dvc.yaml:ex_vit_cifar10_baseline_model_config_init" -> "config/ex_vit/cifar10/baseline_model/training/run_20230721_081756/dvc.yaml:ex_vit_cifar10_baseline_model_training_run_20230721_081756";
"config/ex_vit/cifar10/baseline_model/

In [26]:
dvc_dag_img = 'config/ex_vit/cifar10/baseline_model/inference/' + os.environ['VIT_INF_RUN_LABEL'] + '/dvc_dag.svg'  # test_vit_example: skip
if os.path.exists(dvc_dag_img):  # test_vit_example: skip
    display(SVG(filename=dvc_dag_img))  # test_vit_example: skip
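
The DOT output can also be processed programmatically, for example to list the stages in an order that respects their dependencies. The sketch below assumes the simple one-statement-per-line format printed by `dvc dag --dot` above, with edges pointing from a dependency to the dependent stage. It is not a general DOT parser, the function name is hypothetical, and it uses `graphlib` from the Python 3.9+ standard library.

```python
import re
from graphlib import TopologicalSorter  # available from Python 3.9

EDGE = re.compile(r'^"([^"]+)"\s*->\s*"([^"]+)";\s*$')
NODE = re.compile(r'^"([^"]+)";\s*$')


def execution_order(dot_text):
    """Return the stage names of a simple DOT graph in dependency order."""
    sorter = TopologicalSorter()
    for line in dot_text.splitlines():
        line = line.strip()
        edge = EDGE.match(line)
        if edge:
            dependency, dependent = edge.groups()
            sorter.add(dependent, dependency)  # dependent runs after dependency
            continue
        node = NODE.match(line)
        if node:
            sorter.add(node.group(1))  # stage without any edge on this line
    return list(sorter.static_order())


# Usage (reading the file written by the cell above):
# with open(dvc_dag_dot_path) as f:   # dvc_dag_dot_path is a placeholder
#     print(execution_order(f.read()))
```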

Finally, the stages are executed with:

In [27]:
%%bash
dvc repro --no-commit config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/dvc.yaml



Stage 'config/ex_vit/cifar10/baseline_model/config/default/dvc.yaml:ex_vit_cifar10_baseline_model_config_init' didn't change, skipping




Stage 'config/in/cifar10/original/dvc.yaml:in_original_fetch_cifar10_init' didn't change, skipping
Running stage 'config/ex_vit/cifar10/baseline_model/training/run_20230721_081756/dvc.yaml:ex_vit_cifar10_baseline_model_training_run_20230721_081756':
> if (set -o pipefail) 2>/dev/null; then set -o pipefail; fi; mkdir -p ../../../../../../encrypt/ex_vit/cifar10/baseline_model/training/run_20230721_081756/output output && slurm_enqueue.sh ex_vit_cifar10_baseline_model_training_run_20230721_081756 dvc_app.yaml training encfs_mount_and_run ../../../../../../encrypt /tmp/encfs_25680_async_encfs_dvc_ex_vit_cifar10_baseline_model_training_run_20230721_081756_8d48e64a10c6_cc38b32792ae ../../../../../../../../../../../../../../tmp/encfs_25680_async_encfs_dvc_ex_vit_cifar10_baseline_model_training_run_20230721_081756_8d48e64a10c6_cc38b32792ae/ex_vit/cifar10/baseline_model/training/run_20230721_081756/output/encfs_out_{MPI_RANK}.log bash -c "$(git rev-parse --show-toplevel)/examples/ex_vit/trainin



Generating lock file 'config/ex_vit/cifar10/baseline_model/training/run_20230721_081756/dvc.lock'
Updating lock file 'config/ex_vit/cifar10/baseline_model/training/run_20230721_081756/dvc.lock'





Running stage 'config/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/dvc.yaml:ex_vit_cifar10_baseline_model_inference_run_20230721_115412':
> if (set -o pipefail) 2>/dev/null; then set -o pipefail; fi; mkdir -p ../../../../../../encrypt/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/output output && slurm_enqueue.sh ex_vit_cifar10_baseline_model_inference_run_20230721_115412 dvc_app.yaml inference encfs_mount_and_run ../../../../../../encrypt /tmp/encfs_25680_async_encfs_dvc_ex_vit_cifar10_baseline_model_inference_run_20230721_115412_2c499e122511_cc38b32792ae ../../../../../../../../../../../../../../tmp/encfs_25680_async_encfs_dvc_ex_vit_cifar10_baseline_model_inference_run_20230721_115412_2c499e122511_cc38b32792ae/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/output/encfs_out_{MPI_RANK}.log bash -c "$(git rev-parse --show-toplevel)/examples/ex_vit/inference.py --training-output /tmp/encfs_25680_async_encfs_dvc_ex_vit_cifar10_baseline_model_infe



Generating lock file 'config/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/dvc.lock'
Updating lock file 'config/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/dvc.lock'

To track the changes with git, run:

	git add config/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/dvc.lock config/ex_vit/cifar10/baseline_model/training/run_20230721_081756/dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.


The submitted SLURM stages can be monitored through a status file and log files in the `dvc.yaml` directory, as well as with the `dvc_scontrol` tool familiar from the [SLURM tutorial](slurm_async_sim_tutorial.ipynb).

In [28]:
!dvc_scontrol show all

dvc_scontrol: DVC stage jobs:
           USER           JOBID     REASON       TIME  TIME_LEFT                                                         NAME                                                                                                                           WORK_DIR
         lukasd        47964882 JobHeldUse       0:00   12:00:00 dvc_ex_vit_cifar10_baseline_model_training_run_20230721_0817 /scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/config/ex_vit/cifar10/baseline_model/training/run_20230721_0817
         lukasd        47964886 JobHeldUse       0:00   12:00:00 dvc_ex_vit_cifar10_baseline_model_inference_run_20230721_115 /scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/config/ex_vit/cifar10/baseline_model/inference/run_20230721_115
dvc_scontrol: DVC commit jobs:
           USER           JOBID     REASON       TIME  TIME_LEFT                                                         NAME                                           

Because the stage and commit jobs are initially put on hold, which allows the user to submit further DVC SLURM stages, we have to release them.

In [29]:
!dvc_scontrol release stage,commit

dvc_scontrol: Releasing job dvc_ex_vit_cifar10_baseline_model_training_run_20230721_081756_cc38b32792ae[sbatch_dvc_stage_ex_vit_cifar10_baseline_model_training_run_2023072] (reason JobHeldUser) at 47964882.
dvc_scontrol: Releasing job dvc_ex_vit_cifar10_baseline_model_inference_run_20230721_115412_cc38b32792ae[sbatch_dvc_stage_ex_vit_cifar10_baseline_model_inference_run_20230] (reason JobHeldUser) at 47964886.
dvc_scontrol: Releasing job dvc_op_cc38b32792ae[sbatch_dvc_commit.sh] (reason JobHeldUser) at 47964884.
dvc_scontrol: Releasing job dvc_op_cc38b32792ae[sbatch_dvc_commit.sh] (reason JobHeldUser) at 47964888.


This will enable resource allocation and execution of the application stages. Upon successful execution, the stage outputs will also be committed. We can now log out from the SLURM cluster and return later to inspect results.

For the purpose of this notebook, however, we want to keep monitoring the pipeline on SLURM and wait for the completion of the last job, which is a DVC commit.  

In [30]:
%%bash
# ID of last commit job
commit_jobid=$(cat config/ex_vit/cifar10/baseline_model/inference/${VIT_INF_RUN_LABEL}/ex_vit_cifar10_baseline_model_inference_${VIT_INF_RUN_LABEL}.dvc_commit_jobid)
../../slurm_wait_for_job.sh ${commit_jobid}

Monitoring SLURM job 47964888.
The SLURM pipeline is still in process.
dvc_scontrol: DVC stage jobs:
           USER           JOBID     REASON       TIME  TIME_LEFT                                                         NAME                                                                                                                           WORK_DIR
         lukasd        47964882       None       0:29   11:59:31 dvc_ex_vit_cifar10_baseline_model_training_run_20230721_0817 /scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/config/ex_vit/cifar10/baseline_model/training/run_20230721_0817
         lukasd        47964886 Dependency       0:00   12:00:00 dvc_ex_vit_cifar10_baseline_model_inference_run_20230721_115 /scratch/snx3000/lukasd/mitraccel/async-encfs-dvc/examples/data/v3/config/ex_vit/cifar10/baseline_model/inference/run_20230721_115
dvc_scontrol: DVC commit jobs:
           USER           JOBID     REASON       TIME  TIME_LEFT                                 

In [31]:
!tree {encrypt,config}/{in,ex_vit}

encrypt/in
└── cifar10
    └── original
        └── output
            ├── cifar-10-batches-py
            │   ├── batches.meta
            │   ├── data_batch_1
            │   ├── data_batch_2
            │   ├── data_batch_3
            │   ├── data_batch_4
            │   ├── data_batch_5
            │   ├── readme.html
            │   └── test_batch
            ├── cifar-10-python.tar.gz
            └── encfs_out_0.log
encrypt/ex_vit
└── cifar10
    └── baseline_model
        ├── config
        │   └── default
        │       └── output
        │           └── config.yaml
        ├── inference
        │   └── run_20230721_115412
        │       └── output
        │           ├── encfs_out_0.log
        │           └── predicted_labels.pkl
        └── training
            └── [

We now verify that the predicted labels are encrypted: unpickling the file directly from the `encrypt` directory should fail.

In [32]:
os.environ['VIT_PREDICTED_LABELS'] = f"ex_vit/cifar10/baseline_model/inference/{os.environ['VIT_INF_RUN_LABEL']}/output/predicted_labels.pkl"

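def unpickle_and_print_labels(filepath):
    """Try to unpickle `filepath` and print either the labels or the traceback."""
    try:
        with open(filepath, 'rb') as f:
            predicted_labels = pickle.load(f)
        print("Successfully unpickled labels:")
        print(predicted_labels)
    except Exception:  # e.g. pickle.UnpicklingError when reading ciphertext
        print(f"Error unpickling the file at {filepath}:")
        traceback.print_exc()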

unpickle_and_print_labels(os.path.join('encrypt', os.environ['VIT_PREDICTED_LABELS']))

Error unpickling the file at encrypt/ex_vit/cifar10/baseline_model/inference/run_20230721_115412/output/predicted_labels.pkl:


Traceback (most recent call last):
  File "/tmp/ipykernel_24303/3340629226.py", line 6, in unpickle_and_print_labels
    predicted_labels = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\xff'.


In contrast, they can be read through the EncFS layer (only for exposition, not to be performed on confidential data):

In [33]:
!encfs_mount_and_run encrypt ${ENCFS_DECRYPT_DIR} /dev/null cp ${ENCFS_DECRYPT_DIR}/${VIT_PREDICTED_LABELS} predicted_labels.pkl >/dev/null

In [34]:
unpickle_and_print_labels('predicted_labels.pkl')

Successfully unpickled labels:
[array([7, 7, 6, 6])]
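
To make the output easier to read, the integer labels can be mapped to class names. The sketch below assumes that the unpickled object is a list of one-dimensional integer arrays, as in the output above, and that the indices follow the standard CIFAR-10 class ordering, which is also stored as `label_names` in `batches.meta`. The helper name is hypothetical.

```python
# Standard CIFAR-10 class names in label order (index 0 to 9)
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']


def labels_to_names(predicted_labels):
    """Flatten a list of label batches and map each index to its class name."""
    return [CIFAR10_CLASSES[int(label)]
            for batch in predicted_labels
            for label in batch]


print(labels_to_names([[7, 7, 6, 6]]))  # ['horse', 'horse', 'frog', 'frog']
```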


In [35]:
!rm predicted_labels.pkl