# Optimising compute time with concurrent IO

Many iterative processes need to get information **off** the device at regular intervals. Up to this point we have transferred data off the compute device only **after** kernel execution has finished. Furthermore, the routines that read from device buffers have so far been used in a blocking manner; that is, the program pauses while the read takes place.

Most compute devices can transfer data **while** kernels are executing. IO transfers can therefore overlap with compute, and in some instances may **take place entirely** during kernel execution. At the cost of additional programming complexity, significant savings in run time can be obtained, as the following diagram illustrates:

<figure style="margin-bottom: 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/optimising_io.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: The difference between sequential and concurrent IO.</figcaption>
</figure>
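The saving can be estimated with a rough cost model. The sketch below assumes every iteration has the same compute time and the same transfer time, that one transfer can be in flight while the next iteration computes, and that only the final transfer cannot be hidden. The function names and the example timings are illustrative and are not taken from the course code.

```python
def sequential_time(n_steps, t_compute, t_io):
    """Total time when every transfer waits for compute and vice versa."""
    return n_steps * (t_compute + t_io)


def overlapped_time(n_steps, t_compute, t_io):
    """Total time when the transfer for step k overlaps the compute for step k+1.

    The slower of the two activities sets the pace of the pipeline, and one
    instance of the faster activity is left exposed at the start or the end.
    """
    return n_steps * max(t_compute, t_io) + min(t_compute, t_io)


# Hypothetical timings in milliseconds, chosen only for illustration
n_steps, t_compute, t_io = 100, 3.0, 1.0
print("sequential:", sequential_time(n_steps, t_compute, t_io))
print("overlapped:", overlapped_time(n_steps, t_compute, t_io))
```

Under these assumptions the transfers are hidden completely whenever the transfer time is no larger than the compute time, which is the best case described above.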

Overlapping IO with compute only works if the dependencies between items in the workflow are still respected. The following tools can help enforce those dependencies:

* Events
* Multiple command queues
* Asynchronous IO transfers
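How these three pieces fit together can be sketched with a host-side analogy in plain Python. This is not OpenCL code: a single-worker thread pool stands in for a second command queue, the returned future stands in for an event, and `compute_step` and `write_snapshot` are hypothetical placeholders for a kernel launch and a device-to-host transfer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compute_step(state):
    """Placeholder for a kernel launch: advance the state by one step."""
    time.sleep(0.01)
    return state + 1


def write_snapshot(snapshot, sink):
    """Placeholder for a non-blocking transfer plus a write to storage."""
    time.sleep(0.01)
    sink.append(snapshot)


def run_overlapped(n_steps):
    sink = []
    state = 0
    pending = None  # plays the role of the event for the in-flight transfer
    # One worker thread plays the role of a dedicated IO command queue
    with ThreadPoolExecutor(max_workers=1) as io_queue:
        for _ in range(n_steps):
            # This compute overlaps with the transfer submitted last iteration
            state = compute_step(state)
            # Wait on the "event" before starting the next transfer
            if pending is not None:
                pending.result()
            pending = io_queue.submit(write_snapshot, state, sink)
        # The final transfer has nothing to overlap with
        if pending is not None:
            pending.result()
    return sink


print(run_overlapped(5))
```

The main loop only blocks at the point where it needs the previous transfer to have finished, which is the same role that waiting on an event plays when IO and compute are submitted to separate command queues.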


In [13]:
%matplotlib widget

import os
import sys
import numpy as np
import subprocess
from ipywidgets import widgets
from matplotlib import pyplot as plt
from matplotlib import animation, rc
from IPython.display import HTML

sys.path.insert(0, os.path.abspath("../include"))

import py_helper

float_type = np.float32

defines=py_helper.load_defines("mat_size.hpp")

# Velocity of the medium
vel=333.0

# Make up a uniform velocity field
V=np.ones((defines["N0"], defines["N1"]), dtype=float_type)*vel

# Write the velocity field to disk
V.tofile("array_V.dat")

## Run the wave application with synchronous IO

In [29]:
# Run the application
subprocess.run([os.path.join(os.getcwd(),"wave2d_sync.exe"), "-gpu"])

	               name: gfx1035 
	 global memory size: 536 MB
	    max buffer size: 456 MB
	     max local size: (1024,1024,1024)
	     max work-items: 256
dt=0.001201, Vmax=333.000000
dt=0.0012012, fm=33.3, Vmax=333, dt2=1.44288e-06
The synchronous calculation took 40885 milliseconds.


CompletedProcess(args=['/home/toby/Pelagos/Projects/OpenCL_Course/course_material/L9_Asynchronous_IO/wave2d_sync.exe', '-gpu'], returncode=0)

## Run the wave application with asynchronous IO

In [30]:
# Run the application
subprocess.run([os.path.join(os.getcwd(),"wave2d_async.exe"), "-gpu"])

	               name: gfx1035 
	 global memory size: 536 MB
	    max buffer size: 456 MB
	     max local size: (1024,1024,1024)
	     max work-items: 256
dt=0.001201, Vmax=333.000000
dt=0.0012012, fm=33.3, Vmax=333, dt2=1.44288e-06
The asynchronous calculation took 28424 milliseconds.

CompletedProcess(args=['/home/toby/Pelagos/Projects/OpenCL_Course/course_material/L9_Asynchronous_IO/wave2d_async.exe', '-gpu'], returncode=0)
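The two wall-clock timings reported above can be compared directly. A minimal sketch, with both timings copied by hand from the program output of this particular run (the variable names are illustrative):

```python
# Timings in milliseconds, copied from the program output above
t_sync_ms = 40885
t_async_ms = 28424

saving_ms = t_sync_ms - t_async_ms
saving_pct = 100.0 * saving_ms / t_sync_ms
speedup = t_sync_ms / t_async_ms

print(f"Time saved: {saving_ms} ms ({saving_pct:.1f}% of the synchronous run)")
print(f"Speedup:    {speedup:.2f}x")
```

For this run the asynchronous version saves about 12.5 seconds, roughly a 1.44x speedup. The figures will differ on other hardware and for other problem sizes.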

In [24]:
# Read the output file back in for display
output=np.fromfile("array_out.dat", dtype=float_type)
nslices=output.size//(defines["N0"]*defines["N1"])
output=output.reshape(nslices, defines["N0"], defines["N1"])

In [25]:
# Animate the result
fig, ax = plt.subplots(1,1, figsize=(8,6))
extent=[ -0.5*defines["D1"], (defines["N1"]-0.5)*defines["D1"],
    -0.5*defines["D0"], (defines["N0"]-0.5)*defines["D0"]]
img = ax.imshow(
    output[0,...], 
    extent=extent, 
    vmin=np.min(output), 
    vmax=np.max(output),
    origin="lower"
)

ax.set_xlabel("Dimension 1")
ax.set_ylabel("Dimension 0")
ax.set_title("Wavefield")

def update(n=0):
    # Show time slice n and ask the interactive canvas to redraw
    img.set_data(output[n,...])
    fig.canvas.draw_idle()

# Run the interaction
result = widgets.interact(
    update,
    n=(0, output.shape[0]-1, 1)
)

interactive(children=(IntSlider(value=0, description='n', max=639), Output()), _dom_classes=('widget-interact'…