# DYAD

DYAD is a synchronization and data movement tool for computational science workflows built on top of Flux. DYAD aims to provide the benefits of in situ and in transit tools (e.g., fine-grained synchronization between producer and consumer applications, fast data access due to spatial locality) while relying on a file-based data abstraction to maximize portability and minimize code change requirements for workflows. More specifically, DYAD aims to overcome the following challenges associated with traditional shared-storage and modern in situ and in transit data movement approaches:

* Lack of per-file object synchronization in shared-storage approaches
* Poor temporal and spatial locality in shared-storage approaches
* Poor performance for file metadata operations in shared-storage approaches (and possibly some in situ and in transit approaches)
* Poor portability and the introduction of required code changes for in situ and in transit approaches

In resolving these challenges, DYAD aims to provide the following to users:

* Good performance (similar to in situ and in transit) due to on- or near-node temporary storage of data
* Transparent per-file object synchronization between producer and consumer applications
* Little to no code change to existing workflows to achieve the previous benefits

To demonstrate DYAD's capabilities, we will use the simple demo applications found in the `dyad_demo` directory. This directory contains C and C++ implementations of a single producer application and a single consumer application. The producer application generates several files, each consisting of ten 32-bit integers, and registers them with DYAD. The consumer application uses DYAD to wait until the desired file is produced. Then, if needed, it will use DYAD to retrieve the generated files from the Flux broker on which the producer application is running. Finally, the consumer application will read and validate the contents of each file.
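To make the file format concrete, here is a minimal Python sketch of writing and validating a file of ten 32-bit integers. It is only an illustration: the demo applications are written in C and C++, and the file name, the integer values, and the byte order used here are assumptions rather than what `c_prod` or `cpp_prod` actually emit.

```python
import os
import struct
import tempfile

INT_COUNT = 10
# "=" selects standard sizes, so each "i" is exactly 4 bytes (32 bits).
RECORD_FORMAT = f"={INT_COUNT}i"

def write_ints(path, values):
    """Write INT_COUNT 32-bit integers to path as raw bytes."""
    with open(path, "wb") as f:
        f.write(struct.pack(RECORD_FORMAT, *values))

def read_and_validate(path, expected):
    """Return True if path holds exactly the expected integers."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) != struct.calcsize(RECORD_FORMAT):
        return False
    return list(struct.unpack(RECORD_FORMAT, data)) == list(expected)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_file_0")  # hypothetical file name
    write_ints(path, range(INT_COUNT))
    file_size = os.path.getsize(path)
    is_valid = read_and_validate(path, range(INT_COUNT))

print(file_size, is_valid)  # prints "40 True"
```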

To start, specify which versions of the producer and consumer applications you would like to use by setting the `producer_program` and `consumer_program` variables. There are two versions for the producer (i.e., `c_prod` and `cpp_prod`) and two versions for the consumer (i.e., `c_cons` and `cpp_cons`).

In [None]:
producer_program = "/opt/dyad_demo/c_prod" # Change to "/opt/dyad_demo/cpp_prod" for C++
consumer_program = "/opt/dyad_demo/c_cons" # Change to "/opt/dyad_demo/cpp_cons" for C++

Next, specify the number of files you wish to generate and transfer by setting the `num_files_transfered` variable.

In [None]:
num_files_transfered = 10

The next step is to set the directories for DYAD to track. Each DYAD-enabled application tracks two directories: a **producer-managed directory** and a **consumer-managed directory**. At least one of these directories must be specified to use DYAD.

When a producer-managed directory is provided, DYAD will store information about any file stored in that directory (or its subdirectories) into a namespace within the Flux key-value store (KVS). This information is later used by DYAD to transfer files from producer to consumer.

When a consumer-managed directory is provided, DYAD will block the application whenever a file inside that directory (or subdirectory) is opened. This blocking will last until DYAD sees information about the file in the Flux KVS namespace. If the information retrieved from the KVS indicates that the file is actually located elsewhere, DYAD will use Flux's remote procedure call (RPC) system to ask the Flux broker at the file's location to transfer the file. If a transfer occurs, the file's contents will be stored at the file path passed to the original file opening function (e.g., `open`, `fopen`).
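The consumer-side behavior described above can be pictured with a small toy model. The sketch below is not DYAD's implementation and does not use the Flux API: `ToyKVS`, `consumer_open`, and the idea that the KVS entry records the owning broker's rank are all assumptions made for illustration. An in-memory dictionary stands in for the KVS namespace, and a callback stands in for the RPC-driven transfer.

```python
import threading

class ToyKVS:
    """In-memory stand-in for a KVS namespace (hypothetical, not the Flux API)."""

    def __init__(self):
        self._entries = {}
        self._cond = threading.Condition()

    def publish(self, key, owner_rank):
        # Producer side: record which rank holds the file, then wake waiters.
        with self._cond:
            self._entries[key] = owner_rank
            self._cond.notify_all()

    def wait_for(self, key, timeout=None):
        # Consumer side: block until the key appears or the timeout expires.
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._entries, timeout):
                raise TimeoutError(key)
            return self._entries[key]

def consumer_open(kvs, key, my_rank, fetch, timeout=None):
    """Block until the file is published, then fetch it if it lives elsewhere."""
    owner = kvs.wait_for(key, timeout)
    if owner != my_rank:
        fetch(owner, key)  # stand-in for asking the owner's broker for the file
        return "transferred"
    return "local"

kvs = ToyKVS()
fetched = []
# The "producer" on rank 0 publishes shortly after the consumer starts waiting.
producer = threading.Timer(0.1, kvs.publish, args=("data0", 0))
producer.start()
outcome = consumer_open(
    kvs, "data0", my_rank=1,
    fetch=lambda owner, key: fetched.append((owner, key)),
    timeout=5,
)
producer.join()
print(outcome, fetched)  # prints "transferred [(0, 'data0')]"
```

The point of the model is the ordering: the consumer can start first and simply waits, which is the behavior the examples below demonstrate with the real applications.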

In this demo, we will use 3 different directories: one unique to the consumer (`consumer_managed_directory`), one unique to the producer (`producer_managed_directory`), and one shared between producer and consumer (`shared_managed_directory`). Set the 3 variables in the cell below to specify these directories.

In [None]:
consumer_managed_directory = "/tmp/cons"
producer_managed_directory = "/tmp/prod"
shared_managed_directory = "/tmp/shared"

Finally, empty these directories or create new ones if they do not already exist.

In [None]:
!rm -rf {consumer_managed_directory}
!mkdir -p {consumer_managed_directory}
!chmod 755 {consumer_managed_directory}
!rm -rf {producer_managed_directory}
!mkdir -p {producer_managed_directory}
!chmod 755 {producer_managed_directory}
!rm -rf {shared_managed_directory}
!mkdir -p {shared_managed_directory}
!chmod 755 {shared_managed_directory}
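
If you prefer to stay in Python rather than shell out, the same reset can be sketched with the standard library. The helper name `reset_directory` is hypothetical, and the demonstration below runs against a temporary directory rather than the notebook's managed directories.

```python
import os
import shutil
import stat
import tempfile

def reset_directory(path, mode=0o755):
    """Delete path if it exists, recreate it empty, and set its permissions."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
    os.chmod(path, mode)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cons")
    os.makedirs(target)
    # Leave a stale file behind to show that the reset removes it.
    open(os.path.join(target, "stale_file"), "w").close()
    reset_directory(target)
    remaining = os.listdir(target)
    mode = stat.S_IMODE(os.stat(target).st_mode)

print(remaining, oct(mode))  # prints "[] 0o755"
```

In the notebook, this would be called once for each of the three managed directories.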

## Example 1

In this first example, we will be using DYAD to transfer data between a producer and consumer in different locations (e.g., on different nodes of a supercomputer). However, since this demo assumes we are running on a single AWS node, we will simulate the difference in locations by specifying different directories for the producer's managed directory and the consumer's managed directory. Normally, these directories would be the same and would both point to local, on-node storage.

In this example, data will be transferred from the producer's managed directory to the consumer's managed directory. Additionally, each file opening call (e.g., `open`, `fopen`) in the consumer application will be blocked until the relevant file is available in the producer's managed directory. The figure below illustrates this transfer and synchronization process.

<div>
<center><img src="dyad/dyad_example1.svg" width="400"/></center>
</div>

Before running the DYAD-enabled applications, there are two things we must do:
1. Set up a namespace in the Flux KVS to be used by DYAD
2. Load DYAD's Flux module

To begin, set the `kvs_namespace` variable to the namespace you wish to use for DYAD. This namespace can be any string value you want.

In [None]:
kvs_namespace = "dyad_test"

Next, create the namespace by running `flux kvs namespace create`. The cell below also runs `flux kvs namespace list` to allow you to verify that the namespace was created successfully.

In [None]:
!flux kvs namespace create {kvs_namespace}
!flux kvs namespace list

The next step is to load DYAD's Flux module. This module is the component of DYAD that actually sends files from producer to consumer.

To start this step, set `dyad_module` below to the path to the DYAD module (i.e., `dyad.so`). For this demo, DYAD has already been installed under the `/usr` prefix, so the path to the DYAD module should be `/usr/lib/dyad.so`.

In [None]:
dyad_module = "/usr/lib/dyad.so"

Next, choose the communication backend for DYAD to use. This backend is used by DYAD's data transport layer (DTL) component to move data from producer to consumer. Currently, valid values are:
* `UCX`: use Unified Communication X for data movement
* `FLUX_RPC`: use Flux's remote procedure call (RPC) feature for data movement

In [None]:
dtl_mode = "UCX"

Finally, load the DYAD module by running `flux module load` on each broker. We load the module onto each broker because, normally, we would not know exactly which brokers the producer and consumer would be running on.

When being loaded, the DYAD module takes two command-line arguments: the producer-managed directory and the DTL mode. The module uses the directory to determine the path to any files it needs to transfer to consumers.

In [None]:
!flux exec -r all flux module load {dyad_module} {producer_managed_directory} {dtl_mode}

After loading the module, we can double check it has been loaded by running the cell below.

In [None]:
!flux exec -r all flux module list | grep dyad

Now, we will generate the shell commands that we will use to run the producer and consumer applications. These commands can be broken down into three pieces.

First, the commands will set the `LD_PRELOAD` environment variable if running the C version of the producer or consumer. We set `LD_PRELOAD` because DYAD's C API uses the preload trick to intercept the `open`, `close`, `fopen`, and `fclose` functions.

Second, the commands set a couple of environment variables to configure DYAD. The environment variables used in this example are:
* `DYAD_KVS_NAMESPACE`: specifies the Flux KVS namespace to use with DYAD
* `DYAD_DTL_MODE`: sets the communication backend to use for data movement
* `DYAD_PATH_PRODUCER`: sets the producer-managed path
* `DYAD_PATH_CONSUMER`: sets the consumer-managed path

Finally, the rest of the commands are the invocation of the applications themselves.
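The three pieces described above can also be assembled by a small helper. This is only a sketch: `build_launch_cmd` is a hypothetical name, and the helper mirrors the string-formatting cells of this notebook rather than any DYAD or Flux API. The paths and values in the usage lines are the defaults set earlier in the notebook.

```python
import shlex

DYAD_WRAPPER = "/usr/lib/dyad_wrapper.so"  # preload library path used in this demo

def build_launch_cmd(program, rank, num_files, managed_dir, path_var,
                     kvs_namespace, dtl_mode):
    """Return 'ENV=... flux exec -r RANK PROGRAM NUM_FILES DIR' as one string."""
    env = {}
    # Piece 1: the C demo binaries (c_prod, c_cons) rely on LD_PRELOAD.
    if program.split("/")[-1].startswith("c_"):
        env["LD_PRELOAD"] = DYAD_WRAPPER
    # Piece 2: DYAD configuration through environment variables.
    env["DYAD_KVS_NAMESPACE"] = kvs_namespace
    env["DYAD_DTL_MODE"] = dtl_mode
    env[path_var] = managed_dir
    env_part = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    # Piece 3: the application invocation itself.
    argv = ["flux", "exec", "-r", rank, program, num_files, managed_dir]
    app_part = " ".join(shlex.quote(str(a)) for a in argv)
    return f"{env_part} {app_part}"

c_consumer_cmd = build_launch_cmd(
    "/opt/dyad_demo/c_cons", 1, 10, "/tmp/cons",
    "DYAD_PATH_CONSUMER", "dyad_test", "UCX",
)
cpp_producer_cmd = build_launch_cmd(
    "/opt/dyad_demo/cpp_prod", 0, 10, "/tmp/prod",
    "DYAD_PATH_PRODUCER", "dyad_test", "UCX",
)
print(c_consumer_cmd)
print(cpp_producer_cmd)
```

Deriving the preload decision from the `program` argument itself means the producer and consumer commands cannot accidentally consult each other's program name.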

Run the following 2 cells to generate and see the commands for the producer and consumer.

In [None]:
producer_launch_cmd = "{preload} DYAD_KVS_NAMESPACE={kvs_namespace} DYAD_DTL_MODE={dtl_mode} \
DYAD_PATH_PRODUCER={producer_managed_directory} flux exec -r 0 \
{producer_program} {num_files_transfered} {producer_managed_directory}".format(
    preload="LD_PRELOAD=\"/usr/lib/dyad_wrapper.so\"" if producer_program.split("/")[-1].strip().startswith("c_") else "",
    kvs_namespace=kvs_namespace,
    dtl_mode=dtl_mode,
    producer_managed_directory=producer_managed_directory,
    producer_program=producer_program,
    num_files_transfered=num_files_transfered,
)
print(producer_launch_cmd)

In [None]:
consumer_launch_cmd = "{preload} DYAD_KVS_NAMESPACE={kvs_namespace} DYAD_DTL_MODE={dtl_mode} \
DYAD_PATH_CONSUMER={consumer_managed_directory} flux exec -r 1 \
{consumer_program} {num_files_transfered} {consumer_managed_directory}".format(
    preload="LD_PRELOAD=\"/usr/lib/dyad_wrapper.so\"" if consumer_program.split("/")[-1].strip().startswith("c_") else "",
    kvs_namespace=kvs_namespace,
    dtl_mode=dtl_mode,
    consumer_managed_directory=consumer_managed_directory,
    consumer_program=consumer_program,
    num_files_transfered=num_files_transfered,
)
print(consumer_launch_cmd)

Finally, we will run the producer and consumer applications. Thanks to DYAD's fine-grained, per-file synchronization features, the order in which we launch the applications does not matter. In this example, we will run the consumer first to illustrate DYAD's synchronization features.

Run the cell below to run the consumer. You will see that the consumer will immediately begin waiting for data to be made available.

In [None]:
!{consumer_launch_cmd}

Now that the consumer is running, we will run the producer. However, Jupyter will not let us launch the producer from within this notebook for as long as the consumer is running. To get around this, we will use the Jupyter Lab terminal.

First, copy the producer command from above. Then, from the top of the file explorer on the left, click the plus (`+`) button. In the new Jupyter Lab tab that opens, click on "Terminal" (in the "Other" category) to launch the Jupyter Lab terminal. Finally, paste the producer command into the terminal, and run it.
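As an alternative to the terminal, both applications could in principle be driven from a single cell by starting the consumer in the background with Python's `subprocess` module. The sketch below is an untested-against-DYAD illustration: `run_consumer_then_producer` is a hypothetical helper, and the demonstration uses the stand-in commands `echo OK` and `true` rather than the real launch commands.

```python
import subprocess

def run_consumer_then_producer(consumer_cmd, producer_cmd, timeout=60):
    """Start the consumer in the background, run the producer, collect output."""
    consumer = subprocess.Popen(
        consumer_cmd, shell=True, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    try:
        # The consumer is (presumably) blocked waiting for files at this point.
        producer = subprocess.run(producer_cmd, shell=True, timeout=timeout)
        consumer_output, _ = consumer.communicate(timeout=timeout)
    finally:
        # Do not leave a blocked consumer behind if anything went wrong.
        if consumer.poll() is None:
            consumer.kill()
            consumer.wait()
    return producer.returncode, consumer.returncode, consumer_output

# Stand-in commands; in the notebook these would be consumer_launch_cmd
# and producer_launch_cmd.
prod_rc, cons_rc, cons_out = run_consumer_then_producer("echo OK", "true")
print(prod_rc, cons_rc, cons_out.strip())  # prints "0 0 OK"
```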

We know that the applications ran successfully if the consumer outputs "OK" for each file it checks.

To see that the files were transferred, we can check the contents of the producer-managed and consumer-managed directories. If everything worked correctly, we will see the same files in both directories.

Run the next two cells to check the contents of these directories.

In [None]:
!flux exec -r 0 ls -lah {producer_managed_directory}

In [None]:
!flux exec -r 1 ls -lah {consumer_managed_directory}
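
Rather than comparing the two listings by eye, the check can be sketched in Python. This assumes both directories are visible to the notebook process, which holds for the single-node setup of this demo but not in general, and that the directories are flat. The helper name `compare_dirs` and the file names in the demonstration are hypothetical.

```python
import filecmp
import os
import tempfile

def compare_dirs(dir_a, dir_b):
    """Report which files are missing on either side and which differ."""
    names_a, names_b = set(os.listdir(dir_a)), set(os.listdir(dir_b))
    common = sorted(names_a & names_b)
    # shallow=False forces a byte-by-byte comparison of the file contents.
    match, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)
    return {
        "only_a": sorted(names_a - names_b),
        "only_b": sorted(names_b - names_a),
        "identical": sorted(match),
        "different": sorted(mismatch + errors),
    }

# Demonstration on temporary directories standing in for the managed ones.
with tempfile.TemporaryDirectory() as prod, tempfile.TemporaryDirectory() as cons:
    for name in ("data0", "data1"):
        for directory in (prod, cons):
            with open(os.path.join(directory, name), "wb") as f:
                f.write(name.encode())
    report = compare_dirs(prod, cons)

print(report)
```

A successful transfer corresponds to empty `only_a`, `only_b`, and `different` lists.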

Before moving on to the next example, we need to remove the KVS namespace and unload the DYAD module. We cannot just reuse the namespace and module from this example for two reasons.

First, the keys in the KVS that DYAD uses are based on the paths to the files *relative to the producer- and consumer-managed directories.* Since we are using the same applications for the next example, these relative paths will be the same, which means the keys will already be present in the KVS. This can interfere with the synchronization of the consumer.
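The collision can be illustrated with a short sketch. It only shows that the relative paths coincide; how DYAD turns a relative path into an actual KVS key is not covered here, and the file name `data0` is a hypothetical example.

```python
import os

def relative_key(file_path, managed_dir):
    """Path of file_path relative to the managed directory that contains it."""
    return os.path.relpath(file_path, start=managed_dir)

# Same file name under the Example 1 and Example 2 directories.
key_example1 = relative_key("/tmp/prod/data0", "/tmp/prod")
key_example2 = relative_key("/tmp/shared/data0", "/tmp/shared")
print(key_example1, key_example2, key_example1 == key_example2)
# prints "data0 data0 True"
```

Because both examples yield the same relative path, a consumer in the next example could find a leftover entry from this one and stop waiting before the new file exists.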

Second, the DYAD module currently tracks only a single directory at a time. We will be using a different directory for the next example, so we will need to start up the DYAD module from scratch to track this new directory.

Run the next two cells to unload the DYAD module and remove the KVS namespace.

In [None]:
!flux exec -r all flux module unload dyad

In [None]:
!flux kvs namespace remove {kvs_namespace}

Run this cell to verify that the DYAD module and KVS namespace are no longer present.

In [None]:
!echo "Modules Post-Cleanup"
!echo "===================="
!flux module list
!echo ""
!echo "KVS Namespaces Post-Cleanup"
!echo "==========================="
!flux kvs namespace list

## Example 2

In the second example, we will show how DYAD can help workflows even if data is in shared storage (e.g., parallel file system) by still providing built-in and transparent fine-grained synchronization.

The figure below illustrates the data movement that will happen in this example.

<div>
<center><img src="dyad/dyad_example2.svg" width="400"/></center>
</div>

To start, we must set up the Flux KVS namespace and DYAD module again.

Run the cells below to set up the Flux KVS namespace and the DYAD module.

In [None]:
!flux kvs namespace create {kvs_namespace}
!flux kvs namespace list

In [None]:
!flux exec -r all flux module load {dyad_module} {shared_managed_directory} {dtl_mode}

In [None]:
!flux exec -r all flux module list | grep dyad

Next, we will generate the shell commands that we will use to run the producer and consumer applications. The only differences between these commands and the ones in Example 1 are as follows:
* The `DYAD_PATH_PRODUCER`, `DYAD_PATH_CONSUMER`, and second command-line argument to the applications all have the same value (i.e., the value of `shared_managed_directory` from the top of the notebook).
* The `DYAD_SHARED_STORAGE` environment variable is provided and set to 1. This tells DYAD to only perform fine-grained synchronization, rather than both synchronization and file transfer.

Run the next two cells to generate the commands.

In [None]:
producer_launch_cmd = "{preload} DYAD_KVS_NAMESPACE={kvs_namespace} DYAD_DTL_MODE={dtl_mode} \
DYAD_PATH_PRODUCER={producer_managed_directory} DYAD_SHARED_STORAGE=1 \
flux exec -r 0 \
{producer_program} {num_files_transfered} {producer_managed_directory}".format(
    preload="LD_PRELOAD=\"/usr/lib/dyad_wrapper.so\"" if producer_program.split("/")[-1].strip().startswith("c_") else "",
    kvs_namespace=kvs_namespace,
    dtl_mode=dtl_mode,
    producer_managed_directory=shared_managed_directory,
    producer_program=producer_program,
    num_files_transfered=num_files_transfered,
)
print(producer_launch_cmd)

In [None]:
consumer_launch_cmd = "{preload} DYAD_KVS_NAMESPACE={kvs_namespace} DYAD_DTL_MODE={dtl_mode} \
DYAD_PATH_CONSUMER={consumer_managed_directory} DYAD_SHARED_STORAGE=1 \
flux exec -r 1 \
{consumer_program} {num_files_transfered} {consumer_managed_directory}".format(
    preload="LD_PRELOAD=\"/usr/lib/dyad_wrapper.so\"" if consumer_program.split("/")[-1].strip().startswith("c_") else "",
    kvs_namespace=kvs_namespace,
    dtl_mode=dtl_mode,
    consumer_managed_directory=shared_managed_directory,
    consumer_program=consumer_program,
    num_files_transfered=num_files_transfered,
)
print(consumer_launch_cmd)

Finally, we will run the producer and consumer applications. To show how DYAD provides fine-grained synchronization even to shared storage workflows (e.g., workflows that use the parallel file system for data movement), we will run the consumer first.

Run the cell below to run the consumer. The consumer will immediately begin waiting for data to be made available in shared storage.

In [None]:
!{consumer_launch_cmd}

Now that the consumer is running, we will run the producer. Just as in Example 1, we will run the producer by copying the producer command from above and running it in the Jupyter Lab terminal.

As with Example 1, we know that the applications ran successfully if the consumer outputs "OK" for each file it checks.

Finally, we need to remove the KVS namespace and unload the DYAD module.

Run the next two cells to do this.

Run the final code cell to verify that the DYAD module and KVS namespace are no longer present.

In [None]:
!flux exec -r all flux module unload dyad

In [None]:
!flux kvs namespace remove {kvs_namespace}

In [None]:
!echo "Modules Post-Cleanup"
!echo "===================="
!flux module list
!echo ""
!echo "KVS Namespaces Post-Cleanup"
!echo "==========================="
!flux kvs namespace list

# This concludes the notebook tutorial for DYAD.

## If you are interested in learning more about DYAD, check out our [ReadTheDocs page](https://dyad.readthedocs.io/en/latest/), our [GitHub repository](https://github.com/flux-framework/dyad), and our [short paper](https://dyad.readthedocs.io/en/latest/_downloads/27090817b034a89b76e5538e148fea9e/ShortPaper_2022_eScience_LLNL.pdf) and [poster](https://dyad.readthedocs.io/en/latest/_downloads/1f11761622683662c33fe0086d1d7ad2/Poster_2022_eScience_LLNL.pdf) from eScience 2022.

## If you are interested in working with us, please reach out to Jae-Seung Yeom (yeom2@llnl.gov) or Ian Lumsden (ilumsden@vols.utk.edu).