<h1> Introduction </h1>

This notebook demonstrates how to scale an implementation of Full Waveform Inversion (FWI) across multiple FPGAs, using bitstreams that contain one or more replicated compute units.

<h3> Define experiment parameters: </h3>
    
PLATFORM => One of the two platforms supported by the driver (alveo/zynq-iodma)

XCLBIN_PATH_ => The path to the .xclbin file

DEVICE_NAME_DEFAULT => Default name for the FPGA device if one is not provided via command-line arguments

XRT_ENV_PATH => The path to the setup file that can be used to source the XRT environment

SPAWN_PATH => The path where the dask workers are spawned  

DIR_PATH => Path to the directory containing the FWI input files. 

In [1]:
XCLBIN_PATH_8000 = "bitstreams/8000_100_1CU/8000_100_1cu.xclbin"

XRT_ENV_PATH = "/opt/xilinx/xrt/setup.sh"
DEVICE_NAME_DEFAULT="xilinx_u280_xdma_201920_3"
SPAWN_PATH = "/home/<username>/octoray/fwi/"
DIR_PATH_FWI = "default/"

### Cluster configuration
OctoRay uses the cluster configuration to set up the cluster. The configuration is identical to that of the Dask SSH Cluster, documented at: https://docs.dask.org/en/stable/deploying-ssh.html  
In addition, the **xrt** and **dir** keywords specify the path to the XRT setup file and the directory in which the workers are spawned.  
The **overlay** keyword specifies the bitstream to use. Passing a list of bitstreams makes it possible to configure different bitstreams on different hosts; the list must have the same length as the list of hosts.  
  
The configuration below specifies a single host that uses a single-instance bitstream, so a single worker is specified in the worker options. A host that uses a double-instance bitstream would instead need the number of workers set to 2.
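As a rough illustration of the list form of the **overlay** keyword, the sketch below pairs each host with its bitstream and checks that the two lists have the same length. The second host address, the second bitstream path, and the helper `check_overlays` are hypothetical and not part of OctoRay; treating a single string as a bitstream shared by all hosts is also an assumption.

```python
# Hypothetical two-host setup for illustration only.
hosts = ["10.1.212.130", "10.1.212.131"]  # the second address is made up
overlays = [
    "bitstreams/8000_100_1CU/8000_100_1cu.xclbin",  # path used in this notebook
    "bitstreams/example_2cu.xclbin",                # hypothetical two-CU bitstream
]


def check_overlays(config):
    """Map each host to its bitstream, validating the list length (hypothetical helper)."""
    overlay = config["overlay"]
    if isinstance(overlay, str):
        # Assumption: a single path means every host loads the same bitstream.
        return {host: overlay for host in config["hosts"]}
    if len(overlay) != len(config["hosts"]):
        raise ValueError("overlay list must have one entry per host")
    return dict(zip(config["hosts"], overlay))


multi_host_config = {"scheduler": hosts[0], "hosts": hosts, "overlay": overlays}
per_host = check_overlays(multi_host_config)
for host, bitstream in per_host.items():
    print(host, "->", bitstream)
```

Running a check like this before creating the cluster surfaces a mismatched list early, rather than after the workers have been spawned.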

In [2]:
import json
cluster_config = {
    "scheduler":"10.1.212.130",
    "hosts":["10.1.212.130"],
    "connect_options":{"port":22,"xrt":XRT_ENV_PATH,"dir":SPAWN_PATH},
    "worker_options":{"nthreads":0,"n_workers":1,"preload":"pynqimport.py","nanny":"0","memory_limit":0},
    "scheduler_options":{"port":8786},
    "worker_class":"distributed.Worker",
    "overlay": XCLBIN_PATH_8000,
}

with open("cluster_config.json","w") as f:
    json.dump(cluster_config,f)



<h3> Define the worker method </h3> 

Here, we define the Python method which will be executed on each of the Dask workers. This function calls the driver using the data partition it receives, and returns the output data (along with some performance statistics) to the caller (the OctoRay client). 


In [7]:
def execute_function(grid_data,kernel,id):
    import numpy as np
    import os
    import psutil
    import time
    from pynq import Overlay, allocate, Device, lib

    from FWIDriver import FWI
    
    start_time = time.time()
    
    # Set up the configuration
    cu = kernel["instance_id"]
    config = kernel["config"]
    path = kernel["path_to_bitstream"]
    gridsize = config["grid"]
    
    resolution = config["Freq"]["nTotal"] * config["nSources"] * config["nReceivers"]
    config["tolerance"] = 9.99*10**-7
    config["max"] = 1000
    
    acceleration = True

    if acceleration:
        
        # Load the overlay
        devices = Device.devices
        ol = Overlay(path, download=False, device=devices[0])

        # Allocate the buffers
        A = allocate(shape=(resolution,gridsize), dtype=np.complex64, target=getattr(ol,kernel["functions"][0]["dotprod_"+str(cu)][0]))
        B = allocate(shape=(gridsize,), dtype=np.float32, target=getattr(ol,kernel["functions"][0]["dotprod_"+str(cu)][1]))
        C = allocate(shape=(resolution,), dtype=np.complex64, target=getattr(ol,kernel["functions"][0]["dotprod_"+str(cu)][2]))

        D = allocate(shape=(resolution,gridsize), dtype=np.complex64,  target=getattr(ol,kernel["functions"][1]["update_"+str(cu)][0]))
        E = allocate(shape=(resolution,),dtype=np.complex64,  target=getattr(ol,kernel["functions"][1]["update_"+str(cu)][1]))
        F = allocate(shape=(gridsize,), dtype=np.complex64, target=getattr(ol,kernel["functions"][1]["update_"+str(cu)][2]))

        # Set up the kernel IPs
        dotprod = getattr(ol,"dotprod_"+str(cu))
        update = getattr(ol,"update_"+str(cu))

        
        # Execute the Full Waveform Inversion algorithm
        fwi = FWI(A,B,C,D,E,F,dotprod,update,config,resolution,gridsize,acceleration)
    else:
        fwi = FWI(config=config,resolution=resolution,gridsize=gridsize,acceleration=acceleration)

    fwi.pre_process(grid_data)

    # Reconstruct the grid by performing Full Waveform Inversion
    chi = fwi.reconstruct()


    if acceleration:
        # free all the buffers
        A.freebuffer()
        B.freebuffer()
        C.freebuffer()
        D.freebuffer()
        E.freebuffer()
        F.freebuffer()
        ol.free()
        
    # Return statistics and results from FWI
    
    total_time = time.time() - start_time
    
    dict_t = {
    "index": id,
    "cu": cu,
    "dot": fwi.model.dot_time,
    "upd": fwi.inverse.updtime,
    "time": total_time,
    }
    
    return dict_t
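
The buffer shapes in the worker method follow directly from the configuration: `resolution` is the product of the number of frequencies, sources and receivers. The sketch below works through that arithmetic for the single-CU settings used later in this notebook (20 frequencies, 20 sources, 20 receivers, grid size 100), which appears to correspond to the `8000_100` in the bitstream name. The helper `buffer_bytes` and the item-size table are illustrative and not part of the driver.

```python
# Values taken from the single-CU configuration set later in this notebook.
n_freq, n_sources, n_receivers = 20, 20, 20
gridsize = 100

# Same product the worker method computes from the config.
resolution = n_freq * n_sources * n_receivers

# Bytes per element for the two dtypes used by the buffers.
ITEMSIZE = {"complex64": 8, "float32": 4}


def buffer_bytes(shape, dtype):
    """Size in bytes of a dense buffer with the given shape and dtype (illustrative helper)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * ITEMSIZE[dtype]


sizes = {
    "A": buffer_bytes((resolution, gridsize), "complex64"),
    "B": buffer_bytes((gridsize,), "float32"),
    "C": buffer_bytes((resolution,), "complex64"),
}
print("resolution:", resolution)
for name, nbytes in sizes.items():
    print(name, nbytes, "bytes")
```

Under these settings the matrix buffer `A` dominates at 6.4 MB, while the vector buffers are a few hundred bytes to tens of kilobytes.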
    


<h2> Set up OctoRay </h2>

Set up the OctoRay framework. The user can pass either a dict or a config file containing a dict, and can specify whether the scheduler and workers need to be set up or have already been instantiated manually.
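Because the cluster configuration is a plain dict, a config file can be as simple as its JSON dump. A minimal sketch, assuming the file holds a single JSON object; the temporary directory and the trimmed-down `example_config` are illustrative only, and how OctoRay itself reads a config file is not shown here.

```python
import json
import os
import tempfile

# Trimmed-down configuration for illustration; the real one has more keys.
example_config = {
    "scheduler": "10.1.212.130",
    "hosts": ["10.1.212.130"],
    "scheduler_options": {"port": 8786},
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "cluster_config.json")

    # Write the dict to disk as JSON.
    with open(config_path, "w") as f:
        json.dump(example_config, f)

    # Read it back into a dict.
    with open(config_path) as f:
        loaded_config = json.load(f)

print(loaded_config == example_config)
```

The loaded dict has the same content as the in-memory one, so it could presumably be passed as `cluster_config` in the same way.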

In [4]:
from octoray import Octoray

octoray = Octoray(ssh_cluster=True,cluster_config=cluster_config)

octoray.create_cluster()

Initializing OctoRay with client ip: 10.1.212.130


distributed.deploy.ssh - INFO - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - distributed.scheduler - INFO -   Scheduler at:   tcp://10.1.212.130:8786
distributed.deploy.ssh - INFO - distributed.diskutils - INFO - Found stale lock file and directory '/home/<username>/octoray/fwi/dask-worker-space/worker-grtmx6m_', purging
distributed.deploy.ssh - INFO - distributed.utils - INFO - Reload module pynqimport from .py file
distributed.deploy.ssh - INFO - distributed.preloading - INFO - Import preload module: pynqimport.py
distributed.deploy.ssh - INFO - distributed.worker - INFO -       Start worker at:   tcp://10.1.212.130:46747


Waiting until workers are set up on remote machines...
Current amount of workers: 1


In [8]:
import copy
import time

# Load in data and config settings
original_data = []
with open(DIR_PATH_FWI+"input/"+"10x10_100"+".txt") as f:
    for l in f:
        original_data.append(float(l))
        

fwi_config = None
with open(DIR_PATH_FWI+"input/GenericInput.json") as f:
    fwi_config = json.load(f)


# Set specific configurations for different types of kernels
fwi_config["grid"] = 100

single_cu_config = fwi_config
single_cu_config["ngrid"]["x"]=10
single_cu_config["ngrid"]["z"]=10
double_cu_config = copy.deepcopy(fwi_config)

single_cu_config["Freq"]["nTotal"]=20
single_cu_config["nSources"]=20
single_cu_config["nReceivers"]=20

double_cu_config["Freq"]["nTotal"]=10
double_cu_config["nSources"]=20
double_cu_config["nReceivers"]=20

fwi_kernels = []

# Configure the kernels by specifying the path to the bitstream, number of compute units, batchsize per compute unit and the function names and variables with their respective memory banks.
fwi_kernels.append(octoray.create_kernel(XCLBIN_PATH_8000,1,int(100),[[{"dotprod_1":["DDR0","DDR0","DDR0"]},{"update_1":["DDR1","DDR1","DDR1"]}]],single_cu_config,host="10.1.212.130",device=DEVICE_NAME_DEFAULT))

# In the case of multiple instances, subdivide the kernels over the number of instances
kernels_split = octoray.split_kernels(fwi_kernels)    

# Divide the data set over the kernels based on batch size per instance
data_split = octoray.split_data(original_data,kernels_split)


# Launch the tasks after scattering the data and kernels to the correct workers
result = octoray.execute_hybrid(execute_function,data_split,kernels_split)

# Reorder the response based on the original input order
result.sort(key = lambda result: result['index'])
print(result)


[{'index': 1, 'cu': 1, 'dot': 1.0364792346954346, 'upd': 0.2154538631439209, 'time': 3.294934034347534}]
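
To make the splitting and reordering steps more concrete, here is a minimal sketch of dividing a data set by per-instance batch size and then restoring the original order of the results. It is not OctoRay's implementation of `split_data`; the function `split_by_batch`, the batch sizes, and the sample results are hypothetical.

```python
def split_by_batch(data, batch_sizes):
    """Cut data into consecutive chunks, one per instance (hypothetical helper)."""
    if sum(batch_sizes) != len(data):
        raise ValueError("batch sizes must add up to the number of samples")
    chunks, start = [], 0
    for size in batch_sizes:
        chunks.append(data[start:start + size])
        start += size
    return chunks


# 100 samples divided over two instances with a batch size of 50 each.
data = [float(i) for i in range(100)]
chunks = split_by_batch(data, [50, 50])
print([len(chunk) for chunk in chunks])

# Results may arrive in any order; sort them by the index each worker returns.
unordered = [{"index": 2, "cu": 2}, {"index": 1, "cu": 1}]
ordered = sorted(unordered, key=lambda r: r["index"])
print([r["index"] for r in ordered])
```

Sorting on the returned `index` is what lets the caller reassemble partial results regardless of which worker finished first.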


<h3> Graceful shutdown for OpenSSH version &gt;= 7.9 </h3>
 
This does not work on XACC yet (untested).

In [8]:
octoray.shutdown()

<h3> Ungraceful shutdown for OpenSSH version &lt; 7.9 </h3>

In [None]:
await octoray.fshutdown()
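
Which shutdown call applies depends on the OpenSSH version. The sketch below parses a version banner of the kind printed by `ssh -V` and compares it against 7.9. The helper names are hypothetical, and which machine's OpenSSH version matters (client or remote host) is not stated here, so treat this as an assumption-laden aid rather than part of OctoRay.

```python
import re


def parse_openssh_version(banner):
    """Extract (major, minor) from a banner such as 'OpenSSH_8.2p1 ...'."""
    match = re.search(r"OpenSSH_(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no OpenSSH version found in banner")
    return int(match.group(1)), int(match.group(2))


def supports_graceful_shutdown(banner):
    """True if the banner reports OpenSSH 7.9 or newer."""
    return parse_openssh_version(banner) >= (7, 9)


# Example banners; `ssh -V` writes its banner to stderr, so a live check
# could read it with subprocess.run(["ssh", "-V"], capture_output=True, text=True).stderr
new_banner = "OpenSSH_8.2p1 Ubuntu-4ubuntu0.5, OpenSSL 1.1.1f  31 Mar 2020"
old_banner = "OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017"

print(parse_openssh_version(new_banner), supports_graceful_shutdown(new_banner))
print(parse_openssh_version(old_banner), supports_graceful_shutdown(old_banner))
```

Comparing the version as a tuple of integers avoids the pitfalls of string comparison, where "7.10" would sort before "7.9".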