# DPU example: ENet
----

## Aim/s
* This notebook shows an example DPU application. The application, as well as the DPU IP, is pulled from the official 
[Vitis AI GitHub Repository](https://github.com/Xilinx/Vitis-AI).
* Description: ENet for segmentation on Cityscapes.
* Input size: 512x1024 (height x width)
* Task: Segmentation

## References
* [Vitis AI product page](https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html).

## Last revised
* Apr 20, 2023
    * Initial revision

----

## 1. Prepare the overlay
We will download the overlay onto the board. 

In [None]:
from pynq_dpu import DpuOverlay
overlay = DpuOverlay("dpu.bit")

## 2. Utility functions

In this section, we will prepare a few functions for later use.

In [None]:
import os
import time
import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

The `load_model()` method will automatically prepare the `graph`
which is used by VART.

**Note:** On the KV260 board you may see TLS memory allocation errors in the Jupyter Lab environment if `cv2` is loaded after the Vitis libraries. To avoid this, import `cv2` first, as done in the import cell above.

In [None]:
overlay.load_model("dpu_enet_cityscapes.xmodel")

Let's first define a few useful preprocessing functions.

In [None]:
pallete = [128, 64, 128, 244, 35, 232, 70, 70, 70, 102, 102, 156, 190, 153, 153, 153, 153, 153, 250, 170, 30,
           220, 220, 0, 107, 142, 35, 152, 251, 152, 70, 130, 180, 220, 20, 60, 255, 0, 0, 0, 0, 142, 0, 0, 70,
           0, 60, 100, 0, 80, 100, 0, 0, 230, 119, 11, 32 ]

MEANS = [.485, .456, .406]
STDS = [.229, .224, .225]

def preprocess_fn(image):
    image = image.astype(np.float32)
    image =  image / 255.0
    for j in range(3):
        image[:, :, j] -= MEANS[j]
    for j in range(3):
        image[:, :, j] /= STDS[j]
    return image
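
The per-channel loops in `preprocess_fn` can also be written with NumPy broadcasting. The following is a minimal sketch, not part of the original notebook: the names `MEANS_ARR`, `STDS_ARR`, and `preprocess_vectorized` are hypothetical, and a small random array stands in for a real image.

```python
import numpy as np

# Same constants as above, stored as float32 arrays (hypothetical names)
MEANS_ARR = np.array([.485, .456, .406], dtype=np.float32)
STDS_ARR = np.array([.229, .224, .225], dtype=np.float32)

def preprocess_vectorized(image):
    """Scale to [0, 1], then normalize each channel via broadcasting."""
    scaled = image.astype(np.float32) / 255.0
    # An (H, W, 3) array combined with a (3,) array broadcasts over the channel axis
    return (scaled - MEANS_ARR) / STDS_ARR

# Small synthetic image used in place of a Cityscapes frame
dummy = np.random.randint(0, 256, size=(4, 8, 3), dtype=np.uint8)
normalized = preprocess_vectorized(dummy)
print(normalized.shape, normalized.dtype)
```

The result should match `preprocess_fn` up to floating-point rounding, while avoiding the explicit Python loops.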

Keep in mind that our original images are 2048x1024 (width x height), so we need to resize them
later to fit our model, which takes a 1024x512 input.

In [None]:
image_folder = 'img/segm'
original_images = [i for i in os.listdir(image_folder) if i.endswith('png')]
total_images = len(original_images)

## 3. Use VART
Now we should be able to use VART to do image segmentation.

In [None]:
dpu = overlay.runner

In [None]:
inputTensors = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()

shapeIn = tuple(inputTensors[0].dims)   # (1, 512, 1024, 3)
shapeOut = tuple(outputTensors[0].dims) # (1, 512, 1024, 19)
outputSize = int(outputTensors[0].get_data_size() / shapeIn[0]) # 9961472 = 512x1024x19 / 1
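
As a quick check of the arithmetic in the comment above, the per-image output size can be recomputed from the shape alone. This sketch assumes the shapes quoted in the comments; it does not query the DPU, and the names `shape_out` and `elements_per_image` are illustrative only.

```python
import math

# Output shape copied from the comment in the cell above (an assumption,
# not read from the hardware): batch, height, width, classes
shape_out = (1, 512, 1024, 19)

# Elements per image: the product of all dimensions divided by the batch size
elements_per_image = math.prod(shape_out) // shape_out[0]
print(elements_per_image)  # 9961472
```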

We can define a few buffers to store input and output data. They will be reused
during multiple runs.

In [None]:
input_data = [np.empty(shapeIn, dtype=np.float32, order="C")]
output_data = [np.empty(shapeOut, dtype=np.float32, order="C")]
image = input_data[0]

Remember that we have a list of `original_images`. 
We can now define a new function `run()` which takes the image index as 
its input, runs the model on the DPU, and post-processes the output into a
per-pixel class map. With the argument `display` set to `True`, the class map
is colorized with the palette and rendered.

`image_index` must lie in the range [0, `total_images`-1].

In [None]:
def run(image_index, display=False):
    # Read input image
    input_image = cv2.imread(os.path.join(image_folder, original_images[image_index]))
    
    # Pre-processing
    resized = cv2.resize(input_image,(1024,512))
    preprocessed = preprocess_fn(resized)
    
    # Fetch data to DPU and trigger it
    image[0,...] = preprocessed.reshape(shapeIn[1:])
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
    
    # Retrieve output data
    classMap_numpy = np.argmax(output_data[0][0], axis=-1).astype(np.uint8)
    classMap_numpy = Image.fromarray(classMap_numpy)
    classMap_numpy_color = classMap_numpy.copy()
    classMap_numpy_color.putpalette(pallete)
    if display:
        _, ax = plt.subplots(1)
        _ = ax.imshow(classMap_numpy_color)

In [None]:
run(0, display=True)
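
The post-processing inside `run()` amounts to an argmax over the class axis followed by a palette lookup. The sketch below reproduces that idea with NumPy only, so it can be tried without a board. The random logits and the random `palette_rgb` table are hypothetical placeholders; the notebook itself uses `Image.putpalette` with the `pallete` list defined earlier.

```python
import numpy as np

NUM_CLASSES = 19  # matches the last dimension of shapeOut above

# Synthetic scores standing in for output_data[0][0]; shape (H, W, classes)
rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 8, NUM_CLASSES)).astype(np.float32)

# Per-pixel class index: the class with the highest score
class_map = np.argmax(logits, axis=-1).astype(np.uint8)

# Hypothetical palette with one RGB triple per class; the notebook's flat
# palette list reshaped to (19, 3) could be used in the same way
palette_rgb = rng.integers(0, 256, size=(NUM_CLASSES, 3), dtype=np.uint8)

# Integer-array indexing maps every class index to its RGB color
color_map = palette_rgb[class_map]
print(class_map.shape, color_map.shape)
```

The lookup yields an (H, W, 3) array that can be passed directly to `ax.imshow`.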

We can also run it for multiple images as shown below. In this example
we have only used 1 thread; in principle, users should be able to boost
the performance by employing more threads.

In [None]:
time1 = time.time()
for i in range(total_images):
    run(i)
time2 = time.time()
fps = total_images/(time2-time1)
print("Performance: {} FPS".format(fps))
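
The multi-threaded variant mentioned above could be structured around a thread pool. The sketch below only illustrates the pattern: `fake_infer` is a hypothetical stand-in that sleeps instead of calling the DPU. Note that `run()` as written fills the shared `input_data` and `output_data` buffers, so concurrent calls would overwrite each other; a real multi-threaded version would need separate buffers per worker, which is an assumption not covered by this notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_infer(index):
    """Hypothetical stand-in for one per-image inference call."""
    time.sleep(0.01)  # simulate waiting on the accelerator
    return index

num_images = 8
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the same order as the inputs
    results = list(pool.map(fake_infer, range(num_images)))
elapsed = time.perf_counter() - start
print("Throughput: {:.1f} images/s".format(num_images / elapsed))
```

Whether more threads actually improve throughput on a given board is not something this sketch demonstrates; it would have to be measured.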

We will need to remove references to `vart.Runner` and let Python garbage-collect
the unused graph objects. This will make sure we can run other notebooks without
any issue.

In [None]:
del overlay
del dpu

----

Copyright (C) 2021 Xilinx, Inc

SPDX-License-Identifier: Apache-2.0

----
