# Welcome to the DYAD component of the Flux tutorial


> What is DYAD? 🤔️

DYAD is a locality-aware, write-once, read-many file cache that runs on top of local NVMe and other burst buffer-style technologies (e.g., El Capitan Rabbit nodes). It is designed to accelerate large, distributed workloads, such as distributed Deep Learning (DL) training and scientific computing workflows, on HPC systems. Unlike similar tools (e.g., DataSpaces and UnifyFS), which tend to optimize for write performance, DYAD aims to provide good write **and read** performance. To optimize read performance, DYAD uses a locality-aware "Hierarchical Data Locator," which prioritizes node-local metadata and data retrieval to minimize network communication. When it has to move data from another node, DYAD uses a streaming RPC-over-RDMA protocol, which relies on preallocated buffers and connection caching to maximize network bandwidth. This process is shown in the figure below:

![DYAD Reading Process](img/dyad_design.png)

DYAD uses several services provided by Flux (the key-value store, remote procedure calls, and broker modules) to orchestrate data movement between nodes. It uses UCX to move the data itself.
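
To make the local-first lookup described above more concrete, here is a minimal, purely illustrative sketch in Python. It is not DYAD's implementation or API: `ToyCache`, `metadata`, and `nodes` are hypothetical names, a plain dict stands in for the Flux key-value store, and a direct dictionary access stands in for the RPC-over-RDMA transfer. Keeping a local copy after a remote read is a choice made for this sketch. It only shows the order of checks: node-local storage first, then a global metadata lookup to find the owning node.

```python
class ToyCache:
    """Toy write-once, read-many cache with a local-first lookup (hypothetical)."""

    def __init__(self, node_id, metadata, nodes):
        self.node_id = node_id    # ID of the node this cache runs on
        self.metadata = metadata  # shared dict: file name -> owner node (stand-in for the Flux KVS)
        self.nodes = nodes        # shared dict: node ID -> ToyCache (stand-in for the network)
        self.local = {}           # files held on this node's local storage
        self.remote_reads = 0     # number of reads that needed another node

    def write(self, name, data):
        # Write-once: a file name can be published only a single time.
        if name in self.metadata:
            raise ValueError(f"{name} was already written")
        self.local[name] = data
        self.metadata[name] = self.node_id  # publish ownership globally

    def read(self, name):
        # 1. Node-local hit: no global lookup and no network traffic.
        if name in self.local:
            return self.local[name]
        # 2. Miss: ask the global metadata store which node owns the file.
        owner = self.metadata[name]
        # 3. Fetch from the owner (stand-in for the remote transfer).
        data = self.nodes[owner].local[name]
        self.remote_reads += 1
        self.local[name] = data  # later reads of this file stay node-local
        return data


metadata, nodes = {}, {}
nodes[0] = ToyCache(0, metadata, nodes)
nodes[1] = ToyCache(1, metadata, nodes)

nodes[0].write("sample_0.npz", b"abc")   # produced on node 0
first = nodes[1].read("sample_0.npz")    # remote fetch on node 1
second = nodes[1].read("sample_0.npz")   # served from node 1's local copy
print(first, second, nodes[1].remote_reads)
```

In this toy model, only the first read on node 1 crosses nodes; every later read of the same file is served locally, which is the behavior the locality-aware design aims for.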

> I'm ready! How do I do this tutorial? 😁️

The process for running this tutorial is the same as for `flux.ipynb`. To step through the examples in this notebook,
execute the cells one at a time. To run a cell, press Shift+Enter on your keyboard. If you prefer, you can also paste
the shell commands into the JupyterLab terminal and execute them there.

# Accelerating Distributed Deep Learning (DL) Training with DYAD



## Show code

[data loader](../dlio_extensions/dyad_torch_data_loader.py)

In [None]:
import os
import sys
import inspect
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from IPython.display import display, HTML

# sys.path entries must be directories, so add the folder that contains the module
sys.path.insert(0, os.path.abspath("../dlio_extensions"))

from dyad_torch_data_loader import DYADTorchDataset

In [None]:
display(HTML(highlight(inspect.getsource(DYADTorchDataset.worker_init), PythonLexer(), HtmlFormatter(full=True))))

In [None]:
display(HTML(highlight(inspect.getsource(DYADTorchDataset.__getitem__), PythonLexer(), HtmlFormatter(full=True))))

## Configure DLIO and DYAD

In [None]:
kvs_namespace = "dyad"
initial_data_directory = "/tmp/dlio_data"
managed_directory = "/tmp/dyad_data"
workers_per_node = 8

In [None]:
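import os

dyad_install_prefix = "/usr"
# `!cmd` returns a list of output lines, so convert the first line to an int
hostlist_count = !flux hostlist -c
num_nodes = int(hostlist_count[0])
# Expand $HOME in Python; prefixing the path with `!` would try to execute it as a command
dlio_extensions_dir = os.path.expandvars("$HOME/flux-tutorial-2024/dlio_extensions")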
dtl_mode = "UCX"
workload = "dyad_unet3d_small"

In [None]:
env_lines = [
    f"DYAD_KVS_NAMESPACE={kvs_namespace}\n",
    f"DYAD_DTL_MODE={dtl_mode}\n",
    f"DYAD_PATH={managed_directory}\n",
    f"PYTHONPATH={dlio_extensions_dir}:$PYTHONPATH\n",
    "DLIO_PROFILER_ENABLE=0\n",
    "DLIO_PROFILER_INC_METADATA=1\n",
    "DLIO_PROFILER_LOG_LEVEL=ERROR\n",
    "DLIO_PROFILER_BIND_SIGNALS=0\n",
    "HDF5_USE_FILE_LOCKING=0\n",
]
with open("dlio_env.txt", "w") as f:
    for el in env_lines:
        f.write(el)
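
As a sanity check on the `NAME=VALUE` layout written above, the following sketch writes a few rules to a temporary file and parses them back. The helper names `write_env_file` and `read_env_file` are hypothetical and exist only for this illustration; the parser handles just the plain `NAME=VALUE` lines used in this notebook and makes no claim about everything that `--env-file` accepts.

```python
import os
import tempfile


def write_env_file(path, variables):
    """Write one NAME=VALUE rule per line (hypothetical helper)."""
    with open(path, "w") as f:
        for name, value in variables.items():
            f.write(f"{name}={value}\n")


def read_env_file(path):
    """Parse NAME=VALUE lines back into a dict, skipping blank lines."""
    parsed = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first '=' only, so values may contain '=' themselves.
            name, _, value = line.partition("=")
            parsed[name] = value
    return parsed


# Example values taken from the configuration cells in this notebook.
example = {
    "DYAD_KVS_NAMESPACE": "dyad",
    "DYAD_DTL_MODE": "UCX",
    "DYAD_PATH": "/tmp/dyad_data",
}

with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, "dlio_env.txt")
    write_env_file(env_path, example)
    round_trip = read_env_file(env_path)

print(round_trip)
```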

## Create Flux KVS Namespace and start DYAD service

In [None]:
!flux kvs namespace create {kvs_namespace}

In [None]:
!flux exec -r all flux module load {dyad_install_prefix}/lib/dyad.so --mode={dtl_mode} {managed_directory}

In [None]:
!flux module list

In [None]:
!flux kvs namespace list

## Generate Data for Unet3D

In [None]:
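# Quote the glob so it is expanded on each node rather than by the notebook's local shell;
# -f keeps rm from failing when the directory is already empty
!flux run -N {num_nodes} --tasks-per-node=1 mkdir -p {managed_directory}
!flux run -N {num_nodes} --tasks-per-node=1 sh -c 'rm -rf {managed_directory}/*'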

In [None]:
!flux run -N {num_nodes} -o cpu-affinity=off --tasks-per-node={workers_per_node} --env-file=dlio_env.txt \
    dlio_benchmark --config-dir={dlio_extensions_dir}/configs workload={workload} \
        ++workload.dataset.data_folder={initial_data_directory} ++workload.workflow.generate_data=True \
        ++workload.workflow.train=False

## Run "training" through DLIO

In [None]:
!flux run -N {num_nodes} -o cpu-affinity=on --tasks-per-node={workers_per_node} --env-file=dlio_env.txt \
    dlio_benchmark --config-dir={dlio_extensions_dir}/configs workload={workload} \
        ++workload.dataset.data_folder={initial_data_directory} ++workload.workflow.generate_data=False \
        ++workload.workflow.train=True
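
The data-generation and training invocations above differ only in a few arguments. The sketch below assembles both command lines in plain Python and lists the arguments that change. `dlio_command` is a hypothetical helper written for this illustration, and the node count and config path are placeholder values rather than values read from a running Flux instance.

```python
import shlex


def dlio_command(num_nodes, workers_per_node, config_dir, workload,
                 data_folder, generate_data, train, cpu_affinity):
    """Build the argument list for one dlio_benchmark run (hypothetical helper)."""
    return [
        "flux", "run", "-N", str(num_nodes),
        "-o", f"cpu-affinity={cpu_affinity}",
        f"--tasks-per-node={workers_per_node}",
        "--env-file=dlio_env.txt",
        "dlio_benchmark", f"--config-dir={config_dir}", f"workload={workload}",
        f"++workload.dataset.data_folder={data_folder}",
        f"++workload.workflow.generate_data={generate_data}",
        f"++workload.workflow.train={train}",
    ]


# Placeholder settings; num_nodes and config_dir are assumptions for this example.
common = dict(
    num_nodes=2,
    workers_per_node=8,
    config_dir="/path/to/dlio_extensions/configs",
    workload="dyad_unet3d_small",
    data_folder="/tmp/dlio_data",
)

generate_cmd = dlio_command(generate_data=True, train=False, cpu_affinity="off", **common)
train_cmd = dlio_command(generate_data=False, train=True, cpu_affinity="on", **common)

# Arguments that differ between the two phases.
changed = [(g, t) for g, t in zip(generate_cmd, train_cmd) if g != t]
for generate_arg, train_arg in changed:
    print(f"{generate_arg}  ->  {train_arg}")

print(shlex.join(train_cmd))
```

Only the CPU affinity setting and the two `workload.workflow` overrides change between the phases; the dataset location, workload name, and task layout stay the same.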

## Shut down the DYAD service and clean up

In [None]:
!flux kvs namespace remove {kvs_namespace}
!flux exec -r all flux module remove dyad

In [None]:
!flux module list

In [None]:
!flux kvs namespace list

# This concludes the notebook tutorial for DYAD.

If you are interested in learning more about DYAD, check out our [ReadTheDocs page](https://dyad.readthedocs.io/en/latest/), our [GitHub repository](https://github.com/flux-framework/dyad), and our published/presented works:
* [eScience 2022 Short Paper](https://dyad.readthedocs.io/en/latest/_downloads/27090817b034a89b76e5538e148fea9e/ShortPaper_2022_eScience_LLNL.pdf)
* [SC 2023 ACM Student Research Competition Extended Abstract](https://github.com/flux-framework/dyad/blob/main/docs/_static/ExtendedAbstract_2023_SC_ACM_SRC_DYAD.pdf)
* [IPDPS 2024 HiCOMB Workshop Paper](https://github.com/flux-framework/dyad/blob/main/docs/_static/Paper_2024_IPDPS_HiCOMB_DYAD.pdf)

If you are interested in working with us, please reach out to Jae-Seung Yeom (yeom2@llnl.gov), Hariharan Devarajan (hariharandev1@llnl.gov), or Ian Lumsden (ilumsden@vols.utk.edu).