<h1> Introduction </h1>

This notebook demonstrates how to scale an implementation of Full Waveform Inversion (FWI) across multiple FPGAs, using bitstreams that contain one or more copies of the compute unit.

<h3> Define experiment parameters: </h3>
    
XCLBIN_PATH_DEFAULT => Default path to the .xclbin file containing 1 compute unit.  
XCLBIN_PATH_MULTCU => Default path to the .xclbin file containing 2 copies of the compute unit.  
XRT_ENV_PATH => Path to the Xilinx Runtime (XRT) setup script.  
DEVICE_NAME_DEFAULT => Default name of the FPGA device.  
DIR_PATH => Path to the directory containing the FWI input files.  

In [1]:
XCLBIN_PATH_DEFAULT = "bitstreams/u280_xclbin/500_500_HBM/FullW.xclbin"
XCLBIN_PATH_MULTCU = "bitstreams/u280_xclbin/500_250_HBM/FullW.xclbin"
XRT_ENV_PATH = "/opt/xilinx/xrt/setup.sh"
DEVICE_NAME_DEFAULT="xilinx_u280_xdma_201920_3"

DIR_PATH = "default/"
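
Before starting any workers, it can help to confirm that the configured files exist on the machine running the notebook. The following is a minimal sketch, assuming the paths are relative to the notebook's working directory. `missing_paths` is a hypothetical helper and is not part of Octoray or PYNQ.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on this machine."""
    return [p for p in paths if not os.path.exists(p)]

# Hypothetical usage with the parameters defined above:
# missing = missing_paths([XCLBIN_PATH_DEFAULT, XCLBIN_PATH_MULTCU, XRT_ENV_PATH, DIR_PATH])
# if missing:
#     raise FileNotFoundError(f"Missing experiment inputs: {missing}")
```

This only checks the local machine; workers on other hosts need the same files under the same paths.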

<h3> Define the worker method </h3> 

Here, we define the Python method which will be executed on each of the Dask workers. This function calls the driver using the data partition it receives, and returns the output data (along with some performance statistics) to the caller (the Dask client). 

We present two methods. The first executes bitstreams with a single compute unit; the second demonstrates how multiple Dask workers can use the same bitstream.

In [8]:
def execute_function(grid_data,kernel,id):
    import numpy as np
    import time
    
    cu = kernel["compute_unit"]
    config = kernel["config"]
    path = kernel["path_to_kernel"]
    start_time = time.time()
    
    from pynq import Overlay, allocate, Device, lib
    from FWIDriver import FWI

    resolution = config["Freq"]["nTotal"] * config["nSources"] * config["nReceivers"]
    gridsize = len(grid_data)   
    config["tolerance"] = 9.99*10**-7
    config["max"] = 1000
    
#     return kernel 

    import socket
    st = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:       
        st.connect(('10.255.255.255', 1))
        IP = st.getsockname()[0]
    except Exception:
        IP = '127.0.0.1'
    finally:
        st.close()

    # Load the overlay
    devices = Device.devices
    if len(devices)==0:
        return IP, cu, id
    
    ol = Overlay(path, download=False, device=devices[0])
    Device.active_device.reset(ol.parser, ol.timestamp, ol.bitfile_name)

#     test = ol.ip_dict.keys()
#     return test
#     else:
#         return "cu 1"
#         ol = Overlay(path, download=False, device=devices[0])
#     return ol.__doc__

    # Allocate the buffers
    A = allocate(shape=(resolution,gridsize), dtype=np.complex64, target=getattr(ol,kernel["functions"][0]["dotprod_"+str(cu)][0]))
    B = allocate(shape=(gridsize,), dtype=np.float32, target=getattr(ol,kernel["functions"][0]["dotprod_"+str(cu)][1]))
    C = allocate(shape=(resolution,), dtype=np.complex64, target=getattr(ol,kernel["functions"][0]["dotprod_"+str(cu)][2]))

    D = allocate(shape=(resolution,gridsize), dtype=np.complex64,  target=getattr(ol,kernel["functions"][1]["update_"+str(cu)][0]))
    E = allocate(shape=(resolution,),dtype=np.complex64,  target=getattr(ol,kernel["functions"][1]["update_"+str(cu)][1]))
    F = allocate(shape=(gridsize), dtype=np.complex64, target=getattr(ol,kernel["functions"][1]["update_"+str(cu)][2]))

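    # Set up the kernel IPs
    dotprod = getattr(ol,"dotprod_"+str(cu))
    # The update kernel is taken from the other compute unit (1 -> 2, otherwise 1).
    # Note that the swapped value is also what the result reports as "cu".
    if cu==1:
        cu=2
    else:
        cu=1

    update = getattr(ol,"update_"+str(cu))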

#     return "dotprod_"+str(cu), "update_"+str(cu), IP

    fwi = FWI(A,B,C,D,E,F,dotprod,update,config,resolution,gridsize,True)

#     try:
    # pre process the grid data
    fwi.pre_process(grid_data)

    # reconstruct the grid by performing Full Waveform Inversion
    chi = fwi.reconstruct()
#     except Exception as e:
#         return f"error: {e} IP: {IP} cu: {cu}"
    
    total_time = time.time() - start_time

    # free all the buffers
    A.freebuffer()
    B.freebuffer()
    C.freebuffer()
    D.freebuffer()
    E.freebuffer()
    F.freebuffer()
    ol.free()
    
    dict_t = {
    "id": id,
    "cu": cu,
    "time": total_time,
    "chi":chi,
    "IP":IP
    }
    return dict_t
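
The worker sizes its six device buffers from `resolution` and `gridsize`. The sketch below reproduces that arithmetic without any hardware, using the fact that a `complex64` element occupies 8 bytes and a `float32` element 4 bytes. The configuration values in the usage lines are made up for illustration and are not taken from `GenericInput.json`; `buffer_bytes` is a hypothetical helper.

```python
COMPLEX64_BYTES = 8  # two 32-bit floats
FLOAT32_BYTES = 4

def buffer_bytes(config, gridsize):
    """Estimate the size in bytes of buffers A-F as shaped by the worker."""
    resolution = config["Freq"]["nTotal"] * config["nSources"] * config["nReceivers"]
    return {
        "A": resolution * gridsize * COMPLEX64_BYTES,
        "B": gridsize * FLOAT32_BYTES,
        "C": resolution * COMPLEX64_BYTES,
        "D": resolution * gridsize * COMPLEX64_BYTES,
        "E": resolution * COMPLEX64_BYTES,
        "F": gridsize * COMPLEX64_BYTES,
    }

# Illustrative values only (hypothetical, not read from the input files).
example_config = {"Freq": {"nTotal": 10}, "nSources": 5, "nReceivers": 5}
sizes = buffer_bytes(example_config, gridsize=500)
total_mib = sum(sizes.values()) / 2**20
```

The two `resolution x gridsize` matrices (A and D) dominate the footprint, so halving the grid per compute unit roughly halves the memory each one needs.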
    


<h2> Setup method for multiple compute units </h2>

In [4]:
def setup_multcu():
    try:
        from pynq import Device, Overlay
        ol = Overlay(XCLBIN_PATH_MULTCU,download=True,device=Device.devices[0])
    except Exception as e:
        return f" error: {e}"
    return 'setup succesful'
                        

In [5]:
def tear_down():
    from pynq import Device, Overlay
    try:
        ol = Overlay(XCLBIN_PATH_MULTCU, download = False, device=Device.devices[0])
    except Exception as e:
        return f"error: {e} "
    return "teardown succesful"
        

Second method: a single Dask worker starts one subprocess per compute unit. The cell is currently commented out.

In [5]:
# def run_on_worker(grid_data,kernel,index):
#     from dask import delayed, compute
#     from pynq import Overlay, allocate, Device, lib
#     from multiprocessing import Queue, Process

#     devices = Device.devices
#     ol = Overlay(kernel["path_to_kernel"], download=True, device=devices[0])
    
#     def execute_function(queue, grid_data,kernel,path,index,config):
#         import numpy as np
#         import time
        
#         start_time = time.time()


#         from pynq import Overlay, allocate, Device, lib
#         from FWIDriver import FWI
        
        
#         resolution = config["Freq"]["nTotal"] * config["nSources"] * config["nReceivers"]
#         gridsize = len(grid_data)   
#         config["tolerance"] = 9.99*10**-7
#         config["max"] = 1000

#         # Load the overlay
#         devices = Device.devices
#         ol = Overlay(path, download=True, device=devices[0])
        
#         # Allocate the buffers
#         A = allocate(shape=(resolution,gridsize), dtype=np.complex64, target=getattr(ol,kernel["functions"]["dotprod_"+index][0]))
#         B = allocate(shape=(gridsize,), dtype=np.float32, target=getattr(ol,kernel[index]["dotprod_"+index][1]))
#         C = allocate(shape=(resolution,), dtype=np.complex64, target=getattr(ol,kernel[index]["dotprod_"+index][2]))

#         D = allocate(shape=(resolution,gridsize), dtype=np.complex64,  target=getattr(ol,kernel[index]["update_"+index][0]))
#         E = allocate(shape=(resolution,),dtype=np.complex64,  target=getattr(ol,kernel[index]["update_"+index][1]))
#         F = allocate(shape=(gridsize), dtype=np.complex64, target=getattr(ol,kernel[index]["update_"+index][2]))

#         # set up the kernel IP's
#         dotprod = getattr(ol,"dotprod_"+index)
#         update = getattr(ol,"update_"+index)
        
        
#         fwi = FWI(A,B,C,D,E,F,dotprod,update,config,resolution,gridsize,True)
        
#         # pre process the grid data
#         fwi.pre_process(grid_data)
        
#         # reconstruct the grid by performing Full Waveform Inversion
#         chi = fwi.reconstruct()
        
#         total_time = time.time() - start_time

#         # free all the buffers
#         A.freebuffer()
#         B.freebuffer()
#         C.freebuffer()
#         D.freebuffer()
#         E.freebuffer()
#         F.freebuffer()
        
#         dict_t = {
#         "index": index,
#         "time": total_time,
#         "chi":chi
#         }
#         queue.put(dict_t)
    

#     # Create a subprocess to handle multiple compute units, if we only use single compute units bitstreams we can map the execute functions directly
#     # TODO: example of that?
#     if kernel["no_instances"]==1:
#         q = Queue()
#         cu = 0
#         p = Process(target=execute_function,args=(q, grid_data[cu],kernel["functions"][cu],kernel["path_to_kernel"],str(cu+1),kernel["config"]))
#         p.start()
#         result = q.get()
#         p.join()
#     else:
#         p = []
#         q = []
#         result = []
#         for cu in range(kernel["no_instances"]):
#             q.append(Queue())
#             p.append(Process(target=execute_function,args=(q[cu], grid_data[cu],kernel["functions"][cu],kernel["path_to_kernel"],str(cu+1),kernel["config"])))
#             p[cu].start()
#         for cu in range(kernel["no_instances"]):
#             result.append(q[cu].get())
#         for cu in range(kernel["no_instances"]):
#             p[cu].join()
     
#     # Free the overlay and return the results
#     ol.free()
#     return result
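
The commented-out cell above starts one subprocess per compute unit and collects each result through its own queue. The pattern can be tried without an FPGA. In this sketch, `fake_compute_unit` is a hypothetical stand-in for the FWI driver call, and the `fork` start method is assumed to be available (as on Linux).

```python
import multiprocessing as mp

def fake_compute_unit(queue, partition, index):
    """Stand-in for the per-compute-unit work: sum the partition."""
    queue.put({"index": index, "chi": sum(partition)})

def run_compute_units(partitions):
    """Run one subprocess per partition and return results in partition order."""
    ctx = mp.get_context("fork")  # assumption: POSIX system with fork support
    queues = [ctx.Queue() for _ in partitions]
    procs = [
        ctx.Process(target=fake_compute_unit, args=(q, part, str(i + 1)))
        for i, (q, part) in enumerate(zip(queues, partitions))
    ]
    for p in procs:
        p.start()
    # Read every queue before joining: a child that is still flushing a
    # large result into its queue would otherwise block join().
    results = [q.get(timeout=30) for q in queues]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_compute_units([[1.0, 2.0], [3.0, 4.0]]))
```

Giving each subprocess its own queue keeps the results in compute-unit order without having to sort them afterwards.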
            


<h2> SSH helper function </h2>

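This helper checks that each host in the cluster configuration accepts SSH connections. For setting up key-based login with ssh-copy-id, see https://www.ssh.com/academy/ssh/copy-id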

In [13]:
import asyncio, asyncssh
import json

with open("cluster_config.json") as f:
    conf = json.load(f)
    
for h in conf["hosts"]:
    async with asyncssh.connect(h,22,password="",username="") as conn:
        res = await conn.run("uname")
        print(res.stdout,end='')
                

PermissionDenied: Permission denied
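
The helper cell reads the host list from `cluster_config.json` through `conf["hosts"]`. Below is a small sketch of loading and sanity-checking that list before opening any connections. `load_hosts` is a hypothetical helper; only the `hosts` key is taken from this notebook, and nothing else about the file's structure is assumed.

```python
import json

def load_hosts(path):
    """Read the "hosts" list from a cluster configuration file."""
    with open(path) as f:
        conf = json.load(f)
    hosts = conf.get("hosts", [])
    # Reject a missing, empty, or malformed list early, before any SSH attempt.
    if not hosts or not all(isinstance(h, str) and h for h in hosts):
        raise ValueError(f"{path}: 'hosts' must be a non-empty list of host names")
    return hosts
```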

<h2> Set up Octoray </h2>

In [6]:
from Octoray import Octoray

# Create an Octoray instance from the cluster configuration file
# octoray = Octoray(ssh_cluster=True, scheduler="10.1.212.126",scheduler_port=8786, hosts=["10.1.212.127","10.1.212.126"], config_file="cluster_config.json")
octoray = Octoray(ssh_cluster=True,config_file="cluster_config.json")


# first load in the data
import json
import copy
import time

# Load in data and config settings
data = []
with open(DIR_PATH+"input/"+"10x10_100"+".txt") as f:
    for l in f:
        data.append(float(l))
        
data.extend(data)

config = None
with open(DIR_PATH+"input/GenericInput.json") as f:
    config = json.load(f)

#set specific configurations for different types of kernels
single_cu_config = config
double_cu_config = copy.deepcopy(config)
single_cu_config["ngrid"]["x"]=50
double_cu_config["ngrid"]["x"]=25

# Configure the kernels by specifying the path to the bitstream, number of compute units, batchsize per compute unit and the function names and variables with their respective memory banks.
single_cu = octoray.create_kernel(XCLBIN_PATH_DEFAULT,1,500,[[{"dotprod_1":["HBM0","HBM1","HBM2"]},{"update_1":["HBM3","HBM4","HBM5"]}]],single_cu_config)

double_cu = octoray.create_kernel(XCLBIN_PATH_MULTCU,2,250,[[{"dotprod_1":["HBM0","HBM1","HBM2"]},{"update_1":["HBM6","HBM7","HBM8"]}],
                                                [{"dotprod_2":["HBM3","HBM4","HBM5"]},{"update_2":["HBM9","HBM10","HBM11"]}]],double_cu_config)

# Finally, add the kernels you want to execute
# data_split, kernels_split = octoray.setup_cluster(data,single_cu,copy.deepcopy(single_cu))
data_split, kernels_split = octoray.setup_cluster(data,double_cu)


Initializing OctoRay with client ip: 10.1.212.126
[{'nthreads': 1, 'n_workers': 2, 'preload': 'pynqimport.py', 'nanny': '0', 'memory_limit': 0}]
1
['10.1.212.126', '10.1.212.129']


distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - distributed.scheduler - INFO -   Scheduler at:   tcp://10.1.212.126:8786
distributed.deploy.ssh - INFO - distributed.diskutils - INFO - Found stale lock file and directory '/mnt/scratch/ldierick/octoray/fwi/dask-worker-space/worker-23cmrzok', purging
distributed.deploy.ssh - INFO - distributed.diskutils - INFO - Found stale lock file and directory '/mnt/scratch/ldierick/octoray/fwi/dask-worker-space/worker-ouzspkgx', purging
distributed.deploy.ssh - INFO - distributed.utils - INFO - Reload module pynqimport from .py file
distributed.deploy.ssh - INFO - distributed.preloading - INFO - Import preload module: pynqimport.py
distributed.deploy.ssh - INFO - distributed.utils - INFO - Reload module pynqimport from .py file
distributed.deploy.ssh - INFO - distributed.preloading - INFO - Import preload module: pynqimport.py
distributed.deploy.ssh - INFO - distributed.worker - INFO -  

Waiting until workers are set up on remote machines...
Current amount of workers: 2


In [None]:
# Execute the added kernels on the workers.
# print(kernels_split)

# print(data_split)
results = octoray.execute(execute_function,data_split,kernels_split,range(len(octoray.kernels)))

print(results)
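
The execute call pairs each entry of `data_split` with one kernel. How Octoray partitions the data is not shown in this notebook. As a rough illustration only, a flat list can be cut into consecutive batches of the per-compute-unit batch size as follows; `split_batches` is a hypothetical helper and not Octoray's implementation.

```python
def split_batches(data, batch_size):
    """Cut `data` into consecutive chunks of `batch_size` items; the last may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Illustrative input: 1000 values split for compute units with a batch size of 250.
parts = split_batches(list(range(1000)), 250)
```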

In [9]:

# print(len(data_split[1]))

def x(data, kernels, *args, func=execute_function, client=None):
    # Debugging helper: submit `func` once per kernel on `client`
    # (defaults to the Octoray client) and print the gathered results.
    client = client if client is not None else octoray.client
    if len(data) != len(kernels):
        raise ValueError("data and kernels don't have same dimensions.")
    futures = []
    for i,krnl in enumerate(kernels):
        if isinstance(krnl,dict):
            print(i+1)
            futures.append(client.submit(func,data[i],krnl,i+1))
        elif isinstance(krnl,list):
            for j,k in enumerate(krnl):
                print(i+j+1)
                futures.append(client.submit(func,data[i][j],k,i+j+1))
    print(client.gather(futures))

f = octoray.client.submit(setup_multcu)
# g = octoray.client.submit(setup_multcu)

print(octoray.client.gather([f]))

# x(data=data_split,kernels=kernels_split)

# t = octoray.client.submit(tear_down)
# print(octoray.client.gather(t))

result = octoray.execute_hybrid(setup_multcu,execute_function,data_split,kernels_split)
print(result)



['setup succesful']
1
2
len futures: 2


RuntimeError: Buffer submit failed: -22

<h3> Graceful shutdown for OpenSSH version >= 7.9 </h3>

In [None]:
octoray.shutdown()

<h3> Ungraceful shutdown for OpenSSH version &lt; 7.9 </h3>

In [14]:
await octoray.fshutdown()

ProcessError: Process exited with signal TERM

In [8]:
from dask.distributed import Client, progress

client = Client("tcp://10.1.212.126:8786")

# print(len(octoray.kernels))
# print(len(client.scheduler_info()["workers"]))
client

Connection method: Direct
Dashboard: http://10.1.212.126:8787/status

Comm: tcp://10.1.212.126:8786
Workers: 2
Dashboard: http://10.1.212.126:8787/status
Total threads: 2
Started: 21 minutes ago
Total memory: 0 B

Comm: tcp://10.1.212.129:38673
Total threads: 1
Dashboard: http://10.1.212.129:43955/status
Memory: 0 B
Nanny: 0
Local directory: /mnt/scratch/ldierick/octoray/fwi/dask-worker-space/worker-agki2d1c
Tasks executing: 0
Tasks in memory: 1
Tasks ready: 0
Tasks in flight: 0
CPU usage: 4.0%
Last seen: Just now
Memory usage: 144.00 MiB
Spilled bytes: 0 B
Read bytes: 2.01 kiB
Write bytes: 1.89 kiB

Comm: tcp://10.1.212.129:43887
Total threads: 1
Dashboard: http://10.1.212.129:42729/status
Memory: 0 B
Nanny: 0
Local directory: /mnt/scratch/ldierick/octoray/fwi/dask-worker-space/worker-2xi4topm
Tasks executing: 0
Tasks in memory: 0
Tasks ready: 0
Tasks in flight: 0
CPU usage: 2.0%
Last seen: Just now
Memory usage: 144.00 MiB
Spilled bytes: 0 B
Read bytes: 2.00 kiB
Write bytes: 1.89 kiB
