# Function Decorators for Accelerated Code

The goal is to give end users a simple API for interacting with custom IP in the fabric, and to give overlay writers a simple mechanism for exposing that functionality. A decorator, `@hardware_function(vlnv)`, marks a function as a candidate for offloading and handles all of the communication. Argument and return types are expressed using Python type annotations. If the VLNV appears in the loaded bitstream, calling the function returns a wrapper that, when its data is accessed, behaves like a numpy array of the specified type. If the VLNV is not in the block design, the function executes in software as normal.

## Representation of call chains
The first task is to provide wrappers for the call chains that are being offloaded. This is taken wholesale from the test notebook. At the moment, it is assumed that every function takes one or more streams as input and returns a single stream.

In [31]:
import numpy as np

class Wrapper:
    def __init__(self, wrapped, dtype = np.int32):
        self.wrapped = wrapped
        self.dtype = dtype
    def value(self):
        return self.wrapped

class Call:
    def __init__(self, func, stream_args, scalar_args, return_type = np.uint32):
        self.func = func
        self.args = stream_args
        self.scalar_args = scalar_args
        self.dtype = return_type
        self.cached = None

    def value(self):
        return self.func(*[a.value() for a in self.args])
    
    def hw_value(self):
        return execute_hardware(self)
    
    def __str__(self):
        if self.cached is None:
            self.cached = self.hw_value()
        return str(self.cached)
    
    def __getitem__(self, index):
        if self.cached is None:
            self.cached = self.hw_value()
        return self.cached[index]
    
    def __len__(self):
        if self.cached is None:
            self.cached = self.hw_value()
        return len(self.cached)
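
The deferred evaluation used by `Call` can be tried without any hardware. The following is a minimal sketch, assuming a hypothetical `LazyResult` class and a plain Python callable standing in for `execute_hardware`; neither is part of the notebook's API.

```python
class LazyResult:
    """Hypothetical stand-in for Call: computes on first access, then caches."""
    def __init__(self, compute):
        self.compute = compute   # zero-argument callable standing in for the hardware run
        self.cached = None
        self.runs = 0            # counts how often the computation actually ran

    def _force(self):
        # Run the computation only if no cached result exists yet
        if self.cached is None:
            self.cached = self.compute()
            self.runs += 1
        return self.cached

    def __str__(self):
        return str(self._force())

    def __getitem__(self, index):
        return self._force()[index]

    def __len__(self):
        return len(self._force())

lazy = LazyResult(lambda: [2, 4, 6])
runs_before = lazy.runs   # nothing has been computed yet
first = lazy[0]           # first access triggers the computation
length = len(lazy)        # served from the cache
runs_after = lazy.runs
```

Funnelling every accessor through one helper such as `_force` keeps the caching check in a single place, which may be worth considering for `Call` as well.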

## Determining what's in the bitstream
To wire up the switches in the bitstream correctly, we need to extract from the TCL file which IP is in the block diagram and how it is wired. This is future work; for now the metadata is hard-coded for the example bitstream, to be replaced once the proof of concept is complete.

In [32]:
from collections import namedtuple

Function = namedtuple('Function', 'in_ports out_ports name')

class FunctionMetadata:
    def __init__(self):
        self.DMA = [([0],[0]),([5],[4])]
        self.DMA_names = ['axi_dma_0', 'axi_dma_1']
        self.functions = {}
        self.functions['Xilinx:hls:stream_double:1.0'] = Function(in_ports=[2],out_ports=[2],name=None)
        #self.functions['Xilinx:hls:stream_mult:1.0'] = Function(in_ports=[3,4],out_ports=[3],name=None)
        self.functions['xilinx.com:hls:wrapped_conv_im2col_hw:1.0'] = Function(in_ports=[3,4],out_ports=[3],name=None)
        self.functions['Xilinx:hls:simple_sum:1.0'] = Function(in_ports=[1],out_ports=[1],name=None)
        self.functions['Xilinx:hls:mult_constant:1.0'] = Function(in_ports=[6],out_ports=[5],name='mult_constant_0')
        
metadata = FunctionMetadata()
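
As a rough sketch of the future TCL extraction, assuming the block-design TCL contains commands of the form `create_bd_cell -type ip -vlnv <vlnv> <name>`, a regular expression can recover the mapping from VLNV to instance name. The sample text and the `find_ip` helper are hypothetical, and recovering how the blocks are wired to the switch is not attempted here.

```python
import re

# Hypothetical excerpt; a real exported TCL file is much longer.
SAMPLE_TCL = """
set axi_dma_0 [ create_bd_cell -type ip -vlnv xilinx.com:ip:axi_dma:7.1 axi_dma_0 ]
set mult_constant_0 [ create_bd_cell -type ip -vlnv Xilinx:hls:mult_constant:1.0 mult_constant_0 ]
"""

# Captures the VLNV and the instance name that follows it
CELL_RE = re.compile(r"create_bd_cell\s+-type\s+ip\s+-vlnv\s+(\S+)\s+(\S+)")

def find_ip(tcl_text):
    """Return {vlnv: [instance names]} for every IP cell found in the text."""
    found = {}
    for vlnv, name in CELL_RE.findall(tcl_text):
        found.setdefault(vlnv, []).append(name)
    return found

ip_by_vlnv = find_ip(SAMPLE_TCL)
```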

## Controlling the switch
The next helper class controls the switch by setting routes. It is a thin wrapper around the control interface of the Xilinx AXI Stream Switch.

In [33]:
from pynq import PL
from pynq import MMIO

class StreamingSwitch:
    def __init__(self, name):
        base_addr = int(PL.ip_dict["SEG_{0}_Reg".format(name)][0],16)
        self.mmio = MMIO(base_addr, 256)
        self.reset()
        
    def set_route(self, in_port, out_port):
        print('SWITCH: setting route {0} to {1}'.format(in_port, out_port))
        self.mmio.write(0x40 + out_port * 4, in_port)
        
    def reset(self):
        for i in range(16):
            # Disable the output on every port
            self.mmio.write(0x40 + i * 4, 0x80000000)
    
    def commit(self):
        # Causes the switch to update atomically to the new routing
        self.mmio.write(0, 2)
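
To see which register writes a routing table produces without touching hardware, here is a minimal sketch that uses a hypothetical `FakeMMIO` recorder in place of `pynq.MMIO`. The offsets mirror the `StreamingSwitch` code above: one select register per output port starting at `0x40`, bit 31 to disable a port, and a write of `2` to offset 0 to commit. The `route_writes` helper is an illustration only.

```python
class FakeMMIO:
    """Hypothetical recorder standing in for pynq.MMIO (no hardware access)."""
    def __init__(self):
        self.writes = []

    def write(self, offset, value):
        self.writes.append((offset, value))

MUX_BASE = 0x40          # select register for output port 0
DISABLE = 0x80000000     # bit 31 set: output port disabled
COMMIT = 2               # written to offset 0 to apply the new routing

def route_writes(routes, num_ports=16):
    """Return the register writes for a {out_port: in_port} routing table."""
    mmio = FakeMMIO()
    for port in range(num_ports):
        # Unrouted output ports are explicitly disabled
        mmio.write(MUX_BASE + 4 * port, routes.get(port, DISABLE))
    mmio.write(0, COMMIT)
    return mmio.writes

# Routes logged for constant_multiply later in this notebook:
# input 0 feeds output 6, and input 5 feeds output 0
writes = route_writes({6: 0, 0: 5})
```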
        

## The Decorator
Take a function and wrap it in a `Call` object.

In [34]:
import inspect

def wrap_arg(a, dtype=np.int32):
    if type(a) is Call or type(a) is Wrapper:
        return a
    else:
        # TODO: sort out element type
        return Wrapper(a, dtype)

def hardware_function(vlnv):
    def decorator(func):
        sig = inspect.signature(func)
        ret_type = sig.return_annotation[0]
        def wrapped_function(*args, **kwargs):
            ba = sig.bind(*args, **kwargs)
            if vlnv in metadata.functions:
                stream_args = []
                scalar_args = []
                for param in sig.parameters.values():
                    if type(param.annotation) is list:
                        stream_args.append(wrap_arg(ba.arguments[param.name], param.annotation[0]))
                    else:
                        scalar_args.append(ba.arguments[param.name])
                return Call(vlnv, stream_args, scalar_args, return_type=ret_type)
            else:
                # We don't have the function available so we might
                # as well just call the function and return
                return func(*args, **kwargs)
        return wrapped_function
    return decorator
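
The decorator relies on a convention: a list annotation such as `[np.int32]` marks a stream, while a bare type marks a scalar. A minimal sketch of that classification follows, using a hypothetical `classify_params` helper and the built-in `int` in place of the NumPy types so that it runs without any dependencies.

```python
import inspect

def classify_params(func):
    """Split parameter names into streams (list annotations) and scalars."""
    streams, scalars = [], []
    for param in inspect.signature(func).parameters.values():
        if isinstance(param.annotation, list):
            streams.append(param.name)
        else:
            scalars.append(param.name)
    return streams, scalars

def constant_multiply(in_data: [int], constant: int) -> [int]:
    return [v * constant for v in in_data]

streams, scalars = classify_params(constant_multiply)
# The element type of the returned stream is the first item of the list annotation
return_elem = inspect.signature(constant_multiply).return_annotation[0]
```

A function with no return annotation would fail at the `[0]` lookup, so a production version would probably want an explicit check there.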

## Configuring the Switch and DMA
The final step is to take a `Call` object and configure the switch accordingly. This process should also prime the DMA with the correct data to be sent. We still need a mechanism for setting the correct size of the receiving buffer; thoughts welcome.

In [35]:
# Horrible hack to load the DMA driver
from pynq import Overlay
Overlay('base.bit').download()
from pynq.drivers import DMA
import pynq.drivers.dma
#Overlay('/home/xilinx/decorator_test.bit').download()
Overlay('/home/xilinx/decorator_conv_im2col.bit').download()

## Wrap the DMA
Provide a simple API to the DMA. The DMA engine ought to be separated from its buffer, as proposed separately; the DMA engine instances could then be static, and buffers could be returned without being copied.

In [36]:
class DMAWrapper:
    def __init__(self,index):
        print('Send DMA: create index {0} name {1}'.format(index, metadata.DMA_names[index]))
        base_addr = int(PL.ip_dict["SEG_{0}_Reg".format(metadata.DMA_names[index])][0],16)
        print('Send DMA: base_address {0:x}'.format(base_addr))
        self.dma = DMA(base_addr, 0)
        self.ports = metadata.DMA[index]
        
    def set_data(self, data, dtype):
        self.length = len(data) * dtype.itemsize
        print('Send DMA: sending {0} bytes'.format(self.length))
        self.dma.create_buf(self.length)
        ffi = pynq.drivers.dma.ffi
        buf = ffi.buffer(self.dma.buf, self.length)
        view = np.frombuffer(buf, dtype, -1)
        np.copyto(view, data, casting='same_kind')

    def transfer(self):
        print('Send DMA: transfer started')
        self.dma.transfer(self.length, 0)
    
    def wait(self):
        self.dma.wait()
        print('Send DMA: transfer finished')
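
The copy performed in `set_data` can be exercised on an ordinary `bytearray`. This is a sketch under the assumption that the DMA buffer behaves like any other writable bytes-like object; `fake_buf` is a hypothetical stand-in for the CFFI buffer.

```python
import numpy as np

data = [1, 2, 3, 4]
dtype = np.dtype(np.int32)

# Allocate a writable byte buffer of exactly the required size
length = len(data) * dtype.itemsize
fake_buf = bytearray(length)

# View the bytes as int32 elements and copy the data in without reallocating
view = np.frombuffer(fake_buf, dtype=dtype)
np.copyto(view, data, casting='same_kind')

# Reading back through a fresh view shows the bytes were updated in place
round_trip = np.frombuffer(fake_buf, dtype=dtype).tolist()
```

Because `np.frombuffer` shares memory with the underlying buffer, the only copy is the one made by `np.copyto`.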

## Parse the execution plan
Next, a recursive function walks the execution plan. At the moment, there is no protection against using the same function more than once in a plan; that will follow later.

In [37]:
def prepare_execution(plan, dma, return_port):
    if type(plan) is Wrapper:
        d = DMAWrapper(len(dma))
        d.set_data(plan.wrapped, plan.dtype())
        dma.append(d)
        hw_switch.set_route(d.ports[1][0], return_port)
    elif type(plan) is Call:
        in_ports = metadata.functions[plan.func].in_ports
        out_ports = metadata.functions[plan.func].out_ports
        name = metadata.functions[plan.func].name
        mmio = None
        if name:
            mmio = MMIO(int(PL.ip_dict['SEG_{0}_Reg'.format(name)][0],16),256)
        for i, a in enumerate(plan.args):
            prepare_execution(a, dma, in_ports[i])
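        # Scalar arguments go to the block's AXI-lite registers, so the
        # block needs a named control interface in the metadata
        if plan.scalar_args and mmio is None:
            raise RuntimeError('No control interface for ' + plan.func)
        for i, a in enumerate(plan.scalar_args):
            mmio.write(0x10 + 4*i, a)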
        hw_switch.set_route(out_ports[0], return_port)
    else:
        print("Unknown plan type: " + repr(plan))
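
The traversal order can be checked in pure Python. The sketch below is a simplified model, not the notebook's implementation: plans are plain lists (raw data) or `(name, args)` tuples, the short names `'conv'` and `'sum'` are hypothetical aliases for the VLNVs, and routes are collected in a list instead of being written to the switch. Port numbers are copied from `FunctionMetadata` above.

```python
from collections import namedtuple

Function = namedtuple('Function', 'in_ports out_ports')

DMA_PORTS = [([0], [0]), ([5], [4])]
FUNCTIONS = {
    'conv': Function(in_ports=[3, 4], out_ports=[3]),
    'sum': Function(in_ports=[1], out_ports=[1]),
}

def plan_routes(plan, return_port, routes, dma_used):
    """Walk a nested plan and record (in_port, out_port) switch routes."""
    if isinstance(plan, list):
        # Raw data: claim the next free DMA engine and route it to the consumer
        index = dma_used[0]
        dma_used[0] += 1
        routes.append((DMA_PORTS[index][1][0], return_port))
    else:
        name, args = plan
        func = FUNCTIONS[name]
        # Route each argument into the matching input port, depth first
        for arg, in_port in zip(args, func.in_ports):
            plan_routes(arg, in_port, routes, dma_used)
        routes.append((func.out_ports[0], return_port))
    return routes

# Model of total(mult(A, B)), with the result returned to DMA 0
plan = ('sum', [('conv', [[1, 2], [3, 4]])])
routes = plan_routes(plan, DMA_PORTS[0][0][0], [], [0])
```

The resulting routes, `(0, 3)`, `(4, 4)`, `(3, 1)` and `(1, 0)`, match the `SWITCH: setting route` lines logged for `total(mult(A,B))` later in this notebook.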

## Execute the plan
This is the main function that executes the plan. It first calls the parsing function, then configures the input DMA engines with suitable buffers, and then waits for the return DMA to complete. Because the return buffer belongs to the DMA engine, a copy has to be taken. This could be avoided with a modified DMA API.

In [38]:
hw_switch = StreamingSwitch('axis_switch_0')

def execute_hardware(plan):
    dma = []
    hw_switch.reset()
    ret_dma_base = int(PL.ip_dict["SEG_{0}_Reg".format(metadata.DMA_names[0])][0],16)
    ret_dma_mmio = MMIO(ret_dma_base, 256)
    ret_dma = DMA(ret_dma_base, 1)
    # TODO: Metadata for how big the buffer should be?
    ret_dma.create_buf(8388607)
    prepare_execution(plan, dma, metadata.DMA[0][0][0])
    hw_switch.commit()
    for d in dma:
        d.transfer()
    for d in dma:
        d.wait()
    ret_dma.transfer(8388607, 1)
    ret_dma.wait()
    bytes_read = ret_dma_mmio.read(0x58)
    ffi = pynq.drivers.dma.ffi
    buf = ffi.buffer(ret_dma.buf, bytes_read)
    view = np.frombuffer(buf, plan.dtype, -1).copy()
    return view

# Testing the Decorator
Create some simple functions that map to the hardware functions and check that the decorator dispatches accordingly. We add print statements to the Python versions of the functions so we can make sure they are not called.

In [39]:
@hardware_function('Xilinx:hls:simple_sum:1.0')
def total(vs:[np.int32]) -> [np.int32]:
    print("In total")
    return sum(vs)

@hardware_function('Xilinx:hls:stream_double:1.0')
def double(vs:[np.int32]) -> [np.int32]:
    print("In double")
    return [v * 2 for v in vs]

#@hardware_function('Xilinx:hls:stream_mult:1.0')
@hardware_function('xilinx.com:hls:wrapped_conv_im2col_hw:1.0')
def mult(a:[np.int32], b:[np.int32]) -> [np.int32]:
    return [a1 * b1 for (a1,b1) in zip(a,b)]


First we chain two hardware functions together. Note that no computation happens at this point, as we don't know whether the user wants this value or plans to use it as an intermediate value.

In [42]:

#vals = [1,2,3,4,5,6]
#vals2 = [6,5,4,3,2,1]
#inter = double(mult(vals, vals))

#t = total(inter)

#val1 = [5,5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#val1 = [6,6,
#        1,1,1,1,1,1,
#        1,1,1,1,1,1,
#        1,1,1,1,1,1,
#        1,1,1,1,1,1,
#        1,1,1,1,1,1,
#        1,1,1,1,1,1]
#val2 = [5,5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2]
#inter = mult(val1, val2)

#t = total(inter)

A = [8, 8, 
     1, 2, 3, 4, 5, 6, 7, 8,
     1, 2, 3, 4, 5, 6, 7, 8,
     1, 2, 3, 4, 5, 6, 7, 8,
     1, 2, 3, 4, 5, 6, 7, 8,
     1, 2, 3, 4, 5, 6, 7, 8,
     1, 2, 3, 4, 5, 6, 7, 8,
     1, 2, 3, 4, 5, 6, 7, 8,
     1, 2, 3, 4, 5, 6, 7, 8
     ]

#B = [3, 3, 
#     1, 3, 5,
#     3, 5, 1,
#     5, 3, 1
#     ]

B = [8,8,
    1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1
    ]
t = mult(A, B)
print(t)

Send DMA: create index 0 name axi_dma_0
Send DMA: base_address 40400000
Send DMA: sending 264 bytes
SWITCH: setting route 0 to 3
Send DMA: create index 1 name axi_dma_1
Send DMA: base_address 40410000
Send DMA: sending 264 bytes
SWITCH: setting route 4 to 4
SWITCH: setting route 3 to 0
Send DMA: transfer started
Send DMA: transfer started
Send DMA: transfer finished
Send DMA: transfer finished


TimeoutError: DMA wait timed out.

By calling `print`, we trigger the execution and the value is returned.

In [46]:
#print(t)
#tmp = t.hw_value() + 3
print(total(mult(A,B)))

Send DMA: create index 0 name axi_dma_0
Send DMA: base_address 40400000
Send DMA: sending 264 bytes
SWITCH: setting route 0 to 3
Send DMA: create index 1 name axi_dma_1
Send DMA: base_address 40410000
Send DMA: sending 264 bytes
SWITCH: setting route 4 to 4
SWITCH: setting route 3 to 1
SWITCH: setting route 1 to 0
Send DMA: transfer started
Send DMA: transfer started
Send DMA: transfer finished
Send DMA: transfer finished


TimeoutError: DMA wait timed out.

Because we never stored the intermediate value, we would need to redo the computation if the user later requests it.

In [12]:
print(inter)

Send DMA: create index 0 name axi_dma_0
Send DMA: base_address 40400000
Send DMA: sending 152 bytes
SWITCH: setting route 0 to 3
Send DMA: create index 1 name axi_dma_1
Send DMA: base_address 40410000
Send DMA: sending 108 bytes
SWITCH: setting route 4 to 4
SWITCH: setting route 3 to 0
Send DMA: transfer started
Send DMA: transfer started
Send DMA: transfer finished
Send DMA: transfer finished
[ 9 12 15 15 12  9 12 16 20 20 16 12 15 20 26 26 21 16 15 20 26 26 21 16 12
 16 21 21 17 13  9 12 16 16 13 10]


Our hardware also contains a block that multiplies by a constant. The constant is passed in using the AXI-lite interface.

In [43]:
@hardware_function('Xilinx:hls:mult_constant:1.0')
def constant_multiply(in_data:[np.int32], constant:np.int32) -> [np.int32]:
    return [v * constant for v in in_data]

print(constant_multiply([1,2,3,4,5,6,7], 5))

Send DMA: create index 0 name axi_dma_0
Send DMA: base_address 40400000
Send DMA: sending 28 bytes
SWITCH: setting route 0 to 6
SWITCH: setting route 5 to 0
Send DMA: transfer started
Send DMA: transfer finished
[ 5 10 15 20 25 30 35]


As `constant_multiply` is a Python function like any other, we can also apply ordinary functional tools to it. For example, we can use the `functools` library to partially apply the constant, giving us a new implementation of `double` in terms of `constant_multiply`.

In [14]:
import functools

new_double = functools.partial(constant_multiply, constant=2)
print(new_double(mult(vals,vals2)))



NameError: name 'vals' is not defined
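
The partial application itself can be checked without hardware. The following sketch uses a plain, undecorated version of `constant_multiply` as a stand-in, so it demonstrates only the `functools.partial` mechanics and not the offload path.

```python
import functools

def constant_multiply(in_data, constant):
    # Software stand-in for the decorated function
    return [v * constant for v in in_data]

# Fix the constant by keyword; the stream argument is supplied at call time
new_double = functools.partial(constant_multiply, constant=2)
doubled = new_double([1, 2, 3])
```

Because the decorator binds arguments with `sig.bind(*args, **kwargs)`, a constant supplied by keyword in this way should be matched to the right parameter in the decorated version too.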

## Open Problems
* Allocation of receive buffer
* Data bigger than buffer size - SG may be able to help here
* 0-length arrays - AXI4-Stream has no concept of a 0-length stream. Maybe a word with no strb bits?
* Current wrapper logic is patchy at best, but completely proxying a Python object is non-trivial

## Possible features
* Plan partitioning for plans with more Calls than execution units/DMA engines
* Re-use of intermediate values
* I/O functions which configure the switch to route I/O directly
* AXI-Master HLS support

## Performance considerations
* Need a way for users to CMA alloc a numpy array
* Buffers not bound to DMA so that any CMA allocated buffer can be passed