# DAC Contest
This reference design walks you through the design flow for DAC SDC 2023. It is a simplified design that helps users get started on the FPGA platform and understand the overall flow. It does not contain any object detection hardware.

If you have any questions, please post on the Slack page (link on SDC website sidebar).

### Hardware

### Software
Note:
  * You will not submit your `dac_sdc.py` file, so any changes you make to this file will not be considered during evaluation.
  * You can use both the PS and PL sides to do inference.

### Object Detection

Object detection will be done on images in batches:
  * You will provide a Python callback function that performs object detection on a batch of images. This callback function will be called many times.
  * The callback function should return the object locations for every image in the batch.
  * Runtime will be recorded during your callback function.
  * Images will be loaded from SD card before each batch is run, and this does not count toward your energy usage or runtime.
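
The timing rule above can be pictured with a small harness. This is only a sketch of the idea, not the actual `dac_sdc` implementation: `run_batches`, `fake_loader`, and `fake_callback` are hypothetical names, and energy measurement is left out entirely.

```python
import time


def run_batches(batch_loaders, callback):
    """Hypothetical harness: only the callback is timed, not the image loading."""
    results = {}
    total_runtime = 0.0
    for load_batch in batch_loaders:
        rgb_imgs = load_batch()              # loading happens outside the timed region
        start = time.perf_counter()
        results.update(callback(rgb_imgs))   # timed: the team's callback
        total_runtime += time.perf_counter() - start
    return results, total_runtime


# Usage with a stand-in loader and callback (no real images involved).
def fake_loader():
    return [("0001.jpg", None)]


def fake_callback(rgb_imgs):
    return {name: [] for name, _ in rgb_imgs}


results, runtime = run_batches([fake_loader, fake_loader], fake_callback)
print(results, runtime >= 0.0)
```

Because loading sits outside the timed region, any work moved into the callback (decoding, resizing, post-processing) adds to the measured runtime.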
  
### Notebook
Your notebook should contain 5 code cells:

1. Importing all libraries and creating your Team object.
1. Downloading the overlay, compiling the code, and performing any one-time configuration.
1. Python callback function and any other Python helper functions.
1. Running object detection
1. Cleanup



## 1. Imports and Create Team

In [1]:
import os
os.environ['PATH'] = '/usr/local/cuda-10.2/bin:' + os.environ.get('PATH', '')
os.environ['LD_LIBRARY_PATH'] = '/usr/local/cuda/lib64:' + os.environ.get('LD_LIBRARY_PATH', '')
# CUDA_HOME is conventionally a single directory, not a colon-separated list.
os.environ.setdefault('CUDA_HOME', '/usr/local/cuda-10.2')

In [2]:
print('Please make sure that pycuda 2022.1 is installed; otherwise our code will fail to run.')
print('pycuda 2022.1 may be incompatible with other software libraries. We have installed and tested it on a brand-new system many times. If you cannot install it, we recommend installing it in a new environment after flashing the system. If you have any questions, please contact us promptly.')

Please make sure that pycuda 2022.1 is installed; otherwise our code will fail to run.
pycuda 2022.1 may be incompatible with other software libraries. We have installed and tested it on a brand-new system many times. If you cannot install it, we recommend installing it in a new environment after flashing the system. If you have any questions, please contact us promptly.


In [3]:
!pip3 install pycuda==2022.1

Collecting pycuda==2022.1
Collecting mako (from pycuda==2022.1)
  Using cached https://files.pythonhosted.org/packages/b4/4d/e03d08f16ee10e688bde9016bc80af8b78c7f36a8b37c7194da48f72207e/Mako-1.1.6-py2.py3-none-any.whl
Collecting appdirs>=1.4.0 (from pycuda==2022.1)
  Using cached https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl
Collecting pytools>=2011.2 (from pycuda==2022.1)
Collecting MarkupSafe>=0.9.2 (from mako->pycuda==2022.1)
Collecting dataclasses>=0.7; python_version <= "3.6" (from pytools>=2011.2->pycuda==2022.1)
  Using cached https://files.pythonhosted.org/packages/fe/ca/75fac5856ab5cfa51bbbcefa250182e50441074fdc3f803f6e76451fab43/dataclasses-0.8-py3-none-any.whl
Collecting platformdirs>=2.2.0 (from pytools>=2011.2->pycuda==2022.1)
  Using cached https://files.pythonhosted.org/packages/b1/78/dcfd84d3aabd46a9c77260fb47ea5d244806e4daef83aa6fe5d83adb182c/platformdirs-2.4.0-py3-none-any.

In [4]:
from __future__ import print_function
import cv2
import numpy as np
import sys
import os

sys.path.append(os.path.abspath("../common"))

import dac_sdc
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.compiler import SourceModule

In [5]:
batchsize = 1
team_name = 'SEU AIC Lab'
dac_sdc.BATCH_SIZE = batchsize
team = dac_sdc.Team(team_name)

In [6]:
conf_thres = 0.3403
nms_thres = 0.3
trt_path = './seu-aic-lab.trt'

**Your team directory where you can access your notebook, and any other files you submit, is available as `team.team_dir`.**

## 2. Preparing the library and model
Prepare the dependencies for the contest, including installing python packages, compiling your binaries, and linking to the notebook.

Your team is responsible for making sure the correct packages are installed. For the contest environment, use the configuration below provided by Nvidia:
- [JetPack 4.6.1](https://developer.nvidia.com/embedded/jetpack-sdk-461)
    - Ubuntu 18.04
    - CUDA 10.2
    - cuDNN 8.2.1
    - gcc 7.5.0
    - python 3.6.9
    - TensorRT 8.2.1
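
One way to compare an installed environment against this list is to print the versions Python can see. The helper below is a sketch under stated assumptions: `installed_versions` and its default module list are illustrative names, and anything that fails to import is reported as `None`. Reported strings may carry extra build components, so compare by prefix rather than by exact match.

```python
import importlib
import platform


def installed_versions(modules=("tensorrt", "pycuda", "numpy", "cv2")):
    """Return {name: version string, or None if the module cannot be imported}."""
    versions = {"python": platform.python_version()}
    for name in modules:
        try:
            mod = importlib.import_module(name)
        except Exception:  # missing package or broken shared library
            versions[name] = None
            continue
        # Most packages expose __version__; fall back to VERSION_TEXT if present.
        version = getattr(mod, "__version__", None) or getattr(mod, "VERSION_TEXT", None)
        versions[name] = str(version) if version is not None else "unknown"
    return versions


versions = installed_versions()
for name, version in versions.items():
    print(name, version)
```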

In [7]:
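def resize_clip(x, h, w, rh, rw):
    # Map xyxy boxes from network-input coordinates back to the original
    # image (divide by the resize ratios), then clip to the image bounds.
    x = x / (rw, rh, rw, rh)
    x = np.clip(x, 0, (w-1, h-1, w-1, h-1))

    return x


def bbox_iou(box1, box2, area):
    # IoU of the top-scoring box (box1, shape (1, 4)) against the remaining
    # boxes (box2). area[0] belongs to box1 and area[1:] to box2. The +1
    # treats coordinates as inclusive pixel indices.
    inter_rect_xy1 = np.maximum(box1[:, :2], box2[:, :2])
    inter_rect_xy2 = np.minimum(box1[:, 2:4], box2[:, 2:4])
    inter_area = (inter_rect_xy2 - inter_rect_xy1 + 1).clip(0).prod(1)
    iou = inter_area / (area[:1] + area[1:] - inter_area + 1e-16)

    return iou


def compute_area(xyxy):
    # Box areas with the same inclusive-pixel (+1) convention as bbox_iou.
    return (xyxy[:, 2] - xyxy[:, 0] + 1) * (xyxy[:, 3] - xyxy[:, 1] + 1)


def nms(bb, nms_iou=0.6):
    # Greedy, class-agnostic NMS. Rows of bb are (x1, y1, x2, y2, score, ...).
    bb = bb[np.argsort(-bb[:, 4])]  # sort by descending score
    area = compute_area(bb[:, :4])

    keep_boxes = []
    while bb.shape[0]:
        # Drop every remaining box that overlaps the current best too much.
        invalid = bbox_iou(bb[:1, :4], bb[1:, :4], area) > nms_iou
        keep_boxes += [bb[0]]
        bb = bb[1:][~invalid]
        area = area[1:][~invalid]

    return np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])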
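
To make the suppression rule concrete, here is a small pure-Python walk-through of the same greedy scheme on three hand-made boxes. It is an illustrative sketch, not a replacement for the NumPy helpers above: `iou` and `greedy_nms` are hypothetical names, and the boxes are made up.

```python
def box_area(b):
    # Inclusive-pixel convention: a box from 0 to 9 is 10 pixels wide.
    return (b[2] - b[0] + 1) * (b[3] - b[1] + 1)


def iou(a, b):
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    return inter / (box_area(a) + box_area(b) - inter + 1e-16)


def greedy_nms(boxes, iou_thres):
    """boxes: iterable of (x1, y1, x2, y2, score, cls); class is ignored."""
    remaining = sorted(boxes, key=lambda b: -b[4])
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Keep only boxes whose overlap with the current best is small enough.
        remaining = [b for b in remaining if iou(best, b) <= iou_thres]
    return kept


boxes = [
    (0, 0, 9, 9, 0.9, 0),      # best box
    (1, 1, 10, 10, 0.8, 0),    # IoU with the best box is 81/119, suppressed at 0.3
    (50, 50, 59, 59, 0.7, 1),  # no overlap, kept
]
kept = greedy_nms(boxes, iou_thres=0.3)
print([b[4] for b in kept])  # [0.9, 0.7]
```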


In [8]:
kernel_code_bilinear = """
    __global__ void resize_bilinear(const unsigned char* input, float* output, int input_width, int input_height, int output_width, int output_height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = threadIdx.z;

        if (x < output_width && y < output_height)
        {
            float scale_x = (float)input_width / output_width;
            float scale_y = (float)input_height / output_height;

            float src_x = (x + 0.5f) * scale_x - 0.5f;
            float src_y = (y + 0.5f) * scale_y - 0.5f;
            
            int src_x1 = (int)src_x;
            int src_y1 = (int)src_y;
            int src_x2 = src_x1 + 1;
            int src_y2 = src_y1 + 1;
            
            float dx = src_x - src_x1;
            float dy = src_y - src_y1;

            int src_index11 = (src_y1 * input_width + src_x1) * 3 + z;
            int src_index12 = (src_y2 * input_width + src_x1) * 3 + z;
            int src_index21 = (src_y1 * input_width + src_x2) * 3 + z;
            int src_index22 = (src_y2 * input_width + src_x2) * 3 + z;
            int dst_index = (z * output_height + y) * output_width + x;

            output[dst_index] = round(input[src_index11] * (1 - dx) * (1 - dy) + input[src_index12] * (1 - dx) * dy + input[src_index21] * dx * (1 - dy) + input[src_index22] * dx * dy) / 255;
        }
    }
"""
mod_bilinear = SourceModule(kernel_code_bilinear)
bilinear_func = mod_bilinear.get_function("resize_bilinear")
block_size = (16, 16, 3)
grid_size = ((320 - 1) // block_size[0] + 1, (192 - 1) // block_size[1] + 1, 1)
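
For checking the kernel on a machine without a GPU, a NumPy version of the same half-pixel-center bilinear interpolation can serve as a reference. This is a sketch under assumptions: `resize_bilinear_reference` is a hypothetical name, it computes in float64 (the kernel uses float32, so results may differ in rare rounding cases), and it clamps the second neighbor index to the image bounds, which the kernel does not do.

```python
import numpy as np


def resize_bilinear_reference(img, out_h, out_w):
    """img: uint8 (H, W, 3). Returns float array (3, out_h, out_w) in [0, 1]."""
    in_h, in_w, _ = img.shape
    scale_x = in_w / out_w
    scale_y = in_h / out_h

    # Source coordinates of each output pixel center.
    src_x = (np.arange(out_w) + 0.5) * scale_x - 0.5
    src_y = (np.arange(out_h) + 0.5) * scale_y - 0.5

    # Truncate toward zero, like the (int) cast in the kernel.
    x1 = np.trunc(src_x).astype(np.int64)
    y1 = np.trunc(src_y).astype(np.int64)
    # Assumption: clamp the second neighbor so indexing stays in bounds.
    x2 = np.minimum(x1 + 1, in_w - 1)
    y2 = np.minimum(y1 + 1, in_h - 1)

    dx = (src_x - x1)[None, :, None]
    dy = (src_y - y1)[:, None, None]

    f = img.astype(np.float64)
    p11 = f[y1[:, None], x1[None, :]]  # each has shape (out_h, out_w, 3)
    p12 = f[y2[:, None], x1[None, :]]
    p21 = f[y1[:, None], x2[None, :]]
    p22 = f[y2[:, None], x2[None, :]]

    value = (p11 * (1 - dx) * (1 - dy) + p12 * (1 - dx) * dy
             + p21 * dx * (1 - dy) + p22 * dx * dy)

    # Round half up (valid for non-negative values), scale, HWC -> CHW.
    return (np.floor(value + 0.5) / 255.0).transpose(2, 0, 1)


# Ceil-division launch grid for a 320x192 output with 16x16x3 blocks.
grid = ((320 - 1) // 16 + 1, (192 - 1) // 16 + 1, 1)
print(grid)  # (20, 12, 1)

white = np.full((1080, 1920, 3), 255, dtype=np.uint8)
out = resize_bilinear_reference(white, 192, 320)
print(out.shape)  # (3, 192, 320)
```

The `z` thread index in the kernel selects the color channel, which is why the block is 16x16x3 and the grid's third dimension is 1.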

## 3. Python Callback Function and Helper Functions


### Pushing the picture through the pipeline
In this example, each TensorRT binding gets a page-locked host buffer (`cuda.pagelocked_empty`) and a matching device buffer (`cuda.mem_alloc`) for sending and receiving data.

The size of each buffer depends on the size of the input or output data. The example images are 640x360 (same size as training and test data).

### Callback function
The callback function:
  - Will be called on each batch of images (so it will be called many times)
  - Is provided with a list of tuples of (image path, RGB image)
  - Should return a dictionary with an entry for each image:
    - Key: Image name (`img_path.name`)
    - Value: Dictionary of item type and bounding box (keys: `type`, `x`, `y`, `width`, `height`)
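
As a minimal illustration of this contract, the stand-in below returns one fixed box per image. It is a sketch, not a working detector: `dummy_callback` is a hypothetical name, the box values are made up, and wrapping each image's detections in a list follows the convention used by the callback in this notebook, which also adds a `segmentation` field.

```python
from pathlib import Path


def dummy_callback(rgb_imgs):
    """rgb_imgs: list of (image path, RGB image) tuples for one batch."""
    results = {}
    for img_path, img in rgb_imgs:
        # A real callback would run inference on `img` here.
        results[img_path.name] = [
            {"type": 1, "x": 0, "y": 0, "width": 10, "height": 10}
        ]
    return results


batch = [(Path("images/0001.jpg"), None), (Path("images/0002.jpg"), None)]
result = dummy_callback(batch)
print(sorted(result))  # ['0001.jpg', '0002.jpg']
```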

See the code below for an example:


In [9]:
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()
        

class dac_model(object):
    def __init__(self, trt_path):
        
        TRT_LOGGER = trt.Logger()

        runtime = trt.Runtime(TRT_LOGGER)
        with open(trt_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
            
        self.context = engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()
        
        for binding in engine:
            size = tuple(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))

            # Page-locked host buffer plus a device buffer of the same size.
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))

            if engine.binding_is_input(binding):
                self.inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                self.outputs.append(HostDeviceMem(host_mem, device_mem))
        # Device staging buffer for the raw uint8 image before the GPU resize.
        self.resize_input = cuda.mem_alloc(1200*2000*3)
    
    def infer(self, img):
        h0, w0, _ = img.shape     
        cuda.memcpy_htod_async(self.resize_input, img, self.stream)
        bilinear_func(self.resize_input, self.inputs[0].device, np.int32(w0), np.int32(h0), np.int32(320), np.int32(192), block=block_size, grid=grid_size)

        self.context.execute_async(batch_size=1, bindings=self.bindings, stream_handle=self.stream.handle)
        
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, self.stream)
        self.stream.synchronize()

        return [out.host for out in self.outputs], h0, w0

    
model = dac_model(trt_path)
    
# Warm-up: run inference on random 1088x1920 images before the timed evaluation.
for _ in range(50):
    image = np.random.randint(0, 256, (1088, 1920, 3), np.uint8)
    preds, h0, w0 = model.infer(image)


In [10]:
def my_callback(rgb_imgs):
    object_locations_by_image = {}
    object_locations = []
   
    img_path, img = rgb_imgs[0]
    preds, h0, w0 = model.infer(img)
    
    rw = 320 / w0
    rh = 192 / h0
    
    seg = preds[1].astype(np.uint8)
    for i in range(3):
        p = seg[i]
        contours, _ = cv2.findContours(p, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            contour = contour / (rw, rh)
            object_locations.append(
                    {"type": int(i+8), "x": -1, "y": -1, "width": -1, "height": -1,
                    "segmentation": contour.reshape(1, -1).tolist()})       

    det = preds[0] 
    xc = det[..., 4] > conf_thres
    det = det[xc]
    det[..., :4] = resize_clip(det[..., :4], h0, w0, rh, rw)
    det = nms(det, nms_iou=nms_thres)
    for box in det:
        object_locations.append(
            {"type": int(box[5]+1), "x": int(box[0]), "y": int(box[1]), "width": int(box[2]) - int(box[0]),
             "height": int(box[3]) - int(box[1]), "segmentation": [[]]})
    object_locations_by_image[img_path.name] = object_locations
    
    return object_locations_by_image
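
The box post-processing in the callback above maps `(x1, y1, x2, y2)` boxes from the 320x192 network input back to the original image, clips them, and converts them to `x`, `y`, `width`, `height`. The scalar sketch below mirrors that arithmetic for a single box. `to_original_xywh` is a hypothetical helper, and the 1280x768 image size is chosen only so the ratios are exact.

```python
def to_original_xywh(box_xyxy, net_w, net_h, img_w, img_h):
    """Rescale one network-space xyxy box to image space and return x/y/width/height."""
    rw = net_w / img_w  # same ratios as rw and rh in the callback
    rh = net_h / img_h
    x1, y1, x2, y2 = box_xyxy
    # Divide by the resize ratio, then clip to the valid pixel range.
    x1 = min(max(x1 / rw, 0), img_w - 1)
    y1 = min(max(y1 / rh, 0), img_h - 1)
    x2 = min(max(x2 / rw, 0), img_w - 1)
    y2 = min(max(y2 / rh, 0), img_h - 1)
    return {"x": int(x1), "y": int(y1),
            "width": int(x2) - int(x1), "height": int(y2) - int(y1)}


inside = to_original_xywh((40.5, 24.0, 161.0, 97.0), 320, 192, 1280, 768)
print(inside)   # {'x': 162, 'y': 96, 'width': 482, 'height': 292}

clipped = to_original_xywh((-4.0, 10.0, 400.0, 100.0), 320, 192, 1280, 768)
print(clipped)  # {'x': 0, 'y': 40, 'width': 1279, 'height': 360}
```

Truncating each corner with `int()` before subtracting, as the callback does, keeps the reported width and height consistent with the reported corner.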


## 4. Running Object Detection

Call the following function to run the object detection.  Extra debug output is enabled when `debug` is `True`.

In [1]:
print('Please wait for 10 minutes, the result will appear')
team.run(my_callback, debug=False)