# AI Engine With PL Example

The AI Engine with PL example demonstrates how to use the AI Engine (AIE) for
scalar computation and the programmable logic (PL) for data movement. The
example runs the standard matrix multiplication algorithm on the AI Engine.
The matrix size and the number of cores used can be changed at compile time.
The matrix size must be a multiple of 50 (the number of cores used), with a
minimum of 100x100 and a maximum of 1200x1200. Note that this example is
intended as a proof of concept only. Other implementations could make fuller
use of the AIE resources and achieve better performance.
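
The size constraint can be expressed as a small check. This is a minimal
sketch and not part of the application: the function name
`is_supported_size` is hypothetical, the defaults are the numbers quoted
above, and square matrices are assumed.

```python
def is_supported_size(n, num_cores=50, min_size=100, max_size=1200):
    """Return True if an n x n matrix meets the documented constraints.

    Assumes square matrices and the default of 50 cores described above.
    """
    in_range = min_size <= n <= max_size
    divisible = n % num_cores == 0
    return in_range and divisible


print(is_supported_size(1200))  # True
print(is_supported_size(120))   # False: not a multiple of 50
```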


## Introduction

Consider two matrices A and B. Each element of the product AxB is the dot
product of a row of A and a column of B: the elements in row i of A are
multiplied with the corresponding elements in column j of B, and the products
are summed to give the single element of AxB at (i, j). If A is an n x m
matrix and B is an m x p matrix, the product AxB has dimensions n x p. The
number of columns of A must equal the number of rows of B for the
multiplication to be possible.
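
As a reference, the standard algorithm can be written in a few lines of
Python. This is an illustrative sketch only, not the AIE kernel code. The
function name `matmul` is hypothetical.

```python
def matmul(a, b):
    """Multiply an n x m matrix a by an m x p matrix b (lists of lists)."""
    n, m, p = len(a), len(b), len(b[0])
    # The number of columns of a must equal the number of rows of b.
    if any(len(row) != m for row in a):
        raise ValueError("columns of A must equal rows of B")
    c = [[0] * p for _ in range(n)]
    for i in range(n):          # row of A
        for j in range(p):      # column of B
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(m))
    return c


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```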

## Implementation

### Data movement

The application uses the PLIO attribute to make external memory-mapped
connections to and from global memory. These connections are created between
the AIE kernels and the logical global memory ports of the hardware platform
design via the AXI Multichannel Direct Memory Access (AXI-MCDMA) IP in the
fabric. In this design, the PS program programs the buffer descriptors in the
AXI-MCDMA IP to initiate read and write transactions between the AIE and DDR.
The burst length of the memory-mapped transaction is 64-bit, and the
AXI-MCDMAs use physical memory addressing to read and write data in global
memory.

<img src="pics/aie_app_data_movement.png">

### Data slicing

To compute the matrix multiplication on the AIE, matrix A is sliced
horizontally and distributed equally among all the cores used, through the
AXI-Stream network. Matrix B is transposed and fed to the first core in the
design element by element. The first core shares matrix B with the next core
through the AXI-Stream connection. Because the cores produce their output in
z-order, the output matrix must be re-ordered.
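
The slicing scheme can be modeled in plain Python. This is a minimal
sketch under stated assumptions: the function names are hypothetical, the
cores are simulated sequentially, and neither the AXI-Stream forwarding of
matrix B between cores nor the z-order output re-ordering is modeled.

```python
def slice_rows(a, num_cores):
    """Split matrix a into num_cores equal horizontal slices."""
    if len(a) % num_cores != 0:
        raise ValueError("row count must be a multiple of num_cores")
    rows_per_core = len(a) // num_cores
    return [a[i * rows_per_core:(i + 1) * rows_per_core]
            for i in range(num_cores)]


def transpose(b):
    """Return the transpose of b, so that columns of b become rows."""
    return [list(col) for col in zip(*b)]


def core_compute(a_slice, b_t):
    """Work of one core: dot each row of its slice with each row of B^T."""
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t]
            for row in a_slice]


def sliced_matmul(a, b, num_cores):
    """Compute a x b by giving each simulated core one slice of a."""
    b_t = transpose(b)
    result = []
    for a_slice in slice_rows(a, num_cores):
        result.extend(core_compute(a_slice, b_t))  # slices stay in row order
    return result


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[5, 6], [7, 8]]
print(sliced_matmul(a, b, num_cores=2))
```

Transposing B first means every core reads both operands row by row, which
matches the element-by-element streaming described above.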

<img src="pics/data_movement.png">


## Compilation Flow

<img src="pics/compilation_flow.png">

There are two sets of external interfaces for AI Engine configuration:

- AI Engine configuration
  - Direct calls to AI Engine driver APIs
  - CDO parsing APIs
- ELF loading
  - Direct call to the AI Engine driver to load the ELF file

The high-level tool, Vitis, can generate `aie_control_xrt.cpp`, which is
cross-compiled to run on the target. The compiled application loads the
generated AIE ELFs and CDOs (packaged into the XCLBIN) to the corresponding
tiles through the load XCLBIN API. The AI Engine can also be configured by
calling the AI Engine driver APIs directly, without loading an XCLBIN.


## Sample Work Directory Structure

<img src="pics/work_directory.png">


## Run-time Execution

At run time, the Linux application binary calls the AI Engine userspace driver
and the tool-generated run-time library, `libadf_api_xrt.a`. The AIE userspace
driver abstracts the kernel-space driver, which handles run-time configuration
along with ELF loading.

<img src="pics/runtime_execution.png">

The userspace driver talks to the AI Engine partition to access the partition
hardware.

## Build Application Using PetaLinux Tools

By default, the AIE matrix multiplication application is enabled. To enable or
disable it, run:

```
petalinux-config -c rootfs

    [*] user packages --->
        [*] aie-matrix-multiplication
```
To rebuild the project, run `petalinux-build`.

The generated FIT image is `images/linux/image.ub`.


## Sample Output


Follow the PetaLinux boot process to launch Linux on the target.
After you see the Linux login prompt, log in with user "root" and
password "root".

The AIE XCLBIN is installed in the `/lib/firmware/aie` directory, and the
executable in the `/usr/bin` directory.

```
root@xilinx-vck190-2020_2:~# aie-matrix-multiplication
Initializing ADF API...
Matrix size(int32): 1200x1200
[ 2831.185340] [drm] Pid 986 opened device
PLIO MCDMA> allocated matrix A at 0xffffa7722000 (phy addr: 0x67500000)
PLIO MCDMA> allocated matrix B at 0xffffa7ca0400 (phy addr: 0x67a7e400)
PLIO MCDMA> allocated matrix B transpose at 0xffffa821e800 (phy addr: 0x67ffc800)
PLIO MCDMA> allocated matrix C at 0xffffa879cc00 (phy addr: 0x6857ac00)
PLIO MCDMA> allocated AIE result at 0xffffa8d1b000 (phy addr: 0x68af9000)
PLIO MCDMA> allocated APU result at 0xffffa9299400 (phy addr: 0x69077400)
PLIO MCDMA> allocated MM2S BD chain #0 at 0xffffa9817800 (phy addr: 0x695f5800)
PLIO MCDMA> allocated S2MM BD chain #0 at 0xffffa9d5d800 (phy addr: 0x69b3b800)
PLIO MCDMA> allocated MM2S BD chain #1 at 0xffffa98c0400 (phy addr: 0x6969e400)
PLIO MCDMA> allocated S2MM BD chain #1 at 0xffffa9d70400 (phy addr: 0x69b4e400)
PLIO MCDMA> allocated MM2S BD chain #2 at 0xffffa9969000 (phy addr: 0x69747000)
PLIO MCDMA> allocated S2MM BD chain #2 at 0xffffa9d83000 (phy addr: 0x69b61000)
PLIO MCDMA> allocated MM2S BD chain #3 at 0xffffa9a11c00 (phy addr: 0x697efc00)
PLIO MCDMA> allocated S2MM BD chain #3 at 0xffffa9d95c00 (phy addr: 0x69b73c00)
PLIO MCDMA> allocated MM2S BD chain #4 at 0xffffa9aba800 (phy addr: 0x69898800)
PLIO MCDMA> allocated S2MM BD chain #4 at 0xffffa9da8800 (phy addr: 0x69b86800)
PLIO MCDMA> allocated MM2S BD chain #5 at 0xffffa9b63400 (phy addr: 0x69941400)
PLIO MCDMA> allocated S2MM BD chain #5 at 0xffffa9dbb400 (phy addr: 0x69b99400)
PLIO MCDMA> allocated MM2S BD chain #6 at 0xffffa9c0c000 (phy addr: 0x699ea000)
PLIO MCDMA> allocated S2MM BD chain #6 at 0xffffa9dce000 (phy addr: 0x69bac000)
PLIO MCDMA> allocated MM2S BD chain #7 at 0xffffa9cb4c00 (phy addr: 0x69a92c00)
PLIO MCDMA> allocated S2MM BD chain #7 at 0xffffa9de0c00 (phy addr: 0x69bbec00)
PLIO MCDMA> init_dmas: 0xa4000000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4010000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4020000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4030000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4040000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4050000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4060000, page size: 0x1000
PLIO MCDMA> init_dmas: 0xa4070000, page size: 0x1000
[2773308.118750]Loading PDI from DDR
[2773308.586234]Monolithic/Master Device
[2773312.205084]4.050287 ms: PDI initialization time
[2773316.723665]+++++++Loading Image No: 0x0, Name: default_subs, Id: 0x1C000000
[2773323.664250]-------Loading Prtn No: 0x0
[2773376.862181] 49.349571 ms for PrtnNum: 0, Size: 18992496 Bytes
[2773379.938650]Subsystem PDI Load: Done
[ 2839.828438] [drm] FPGA Manager load DONE
[ 2839.830909] [drm] Partition 1 already requested
Success!
[ 2839.839382] [drm] zocl_xclbin_read_axlf d7f4dd9c-65c1-45f7-bfc7-984668960922 ret: 0
[ 3327.447633] [drm] Pid 986 closed device
root@xilinx-vck190-2020_2:~#
```
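
The buffer addresses in this log are consistent with the reported matrix
size: a 1200x1200 matrix of int32 elements occupies 5,760,000 bytes
(0x57e400), which is the spacing between consecutive matrix buffers above.
The short sketch below checks this arithmetic. The helper name
`matrix_bytes` is hypothetical, and the contiguous layout is only an
observation from this particular run, not a guarantee of the allocator.

```python
INT32_BYTES = 4  # the log reports "Matrix size(int32)"


def matrix_bytes(n, elem_bytes=INT32_BYTES):
    """Size in bytes of an n x n matrix with elem_bytes-sized elements."""
    return n * n * elem_bytes


# Physical addresses copied from the log above.
phys_a = 0x67500000
phys_b = 0x67A7E400
phys_b_transpose = 0x67FFC800
phys_c = 0x6857AC00

size = matrix_bytes(1200)
print(hex(size))                          # 0x57e400
print(phys_b - phys_a == size)            # True
print(phys_b_transpose - phys_b == size)  # True
print(phys_c - phys_b_transpose == size)  # True
```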

## Customizing and Rebuilding

The following are prerequisites for making any changes to the software:
  * The xgemm repo in the Yocto workspace, and aiecompiler, for any
    software-related changes.
  * PetaLinux.

Cross-compiling the main application requires a sysroot. To get the Linux
sysroot, go to the PetaLinux project directory and run the following commands:

```
  $ petalinux-build --sdk
  $ petalinux-package --sysroot
```

The sysroot is generated in the
`images/linux/sdk/sysroots/aarch64-xilinx-linux/` directory.

Now, to pull the xgemm repository using PetaLinux, set the Yocto build tool as
devtool:

```
  $ petalinux-config
    Yocto Settings --->
	Build tool --->
	  (*) devtool
  $ petalinux-build -c aie-matrix-multiplication -x modify
```

The xgemm source files can be found in
`components/yocto/workspace/sources/aie-matrix-multiplication/`.

As mentioned earlier, the number of AIE cores used for matrix multiplication
can be changed. However, because the data memory immediately available to each
core is limited, reducing the number of cores reduces the maximum matrix size
supported by the application. To change the number of cores, set the
NUM_HW_ROWS and NUM_HW_COLS macros in the config.h header file. A maximum of
400 cores is available. After changing the application, make sure that the
new configuration is supported by the underlying hardware design.
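
Before editing config.h, a candidate configuration can be sanity-checked
with a few lines of Python. This is a hedged sketch: it assumes the number
of cores used is the product of NUM_HW_ROWS and NUM_HW_COLS, which this
document does not state explicitly, and the function names and the 5 x 10
example are hypothetical.

```python
MAX_AIE_CORES = 400  # maximum number of cores available, per this document


def core_count(num_hw_rows, num_hw_cols):
    """Cores used by a configuration (assumed to be rows * columns)."""
    cores = num_hw_rows * num_hw_cols
    if not 0 < cores <= MAX_AIE_CORES:
        raise ValueError(f"{cores} cores is outside 1..{MAX_AIE_CORES}")
    return cores


def rows_per_core(matrix_size, cores):
    """Rows of matrix A handled by each core when A is sliced equally."""
    if matrix_size % cores != 0:
        raise ValueError("matrix size must be a multiple of the core count")
    return matrix_size // cores


cores = core_count(5, 10)          # hypothetical 5 x 10 arrangement
print(cores)                       # 50
print(rows_per_core(1200, cores))  # 24
```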

To rebuild:
   1. Go to the meta-user demo recipe files directory:
      `project-spec/meta-user/recipes-apps/aie-matrix-multiplication/files`.
   2. Assign values to VITIS_DIR, XILINXD_LICENSE_FILE, SYSROOT, and PFM_XPFM
      in the settings.sh script, and source settings.sh in a shell session.
      * SYSROOT is the sysroot generated from PetaLinux.
      * PFM_XPFM is the Vitis platform file.
   3. Run make to compile the AIE dataflow graph, create the XCLBIN, and
      cross-compile the Linux control application. The `makefile` shows how to
      use the Xilinx tools to generate the binaries:
      * Use aiecompiler to generate the AI Engine kernel ELFs and CDOs.
      * Use v++ to generate the AI Engine configuration package `xclbin` file.
      * Compile the Linux application that runs on APU Linux.
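
A quick pre-flight check can catch a setting that was left unassigned
before make is run. The sketch below is illustrative only: it assumes that
settings.sh exports the four variables into the environment, and the helper
name `missing_settings` is hypothetical.

```python
import os

# Variables that settings.sh is expected to define, per the steps above.
REQUIRED_SETTINGS = ("VITIS_DIR", "XILINXD_LICENSE_FILE", "SYSROOT", "PFM_XPFM")


def missing_settings(env=None):
    """Return the required settings that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Unset before running make:", ", ".join(missing))
else:
    print("All build settings are present.")
```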

Optionally, run `clean-aie-work.sh` to clean the Work directory. It removes
unnecessary files and leaves only the AIE kernels, the AIE CDOs, and the
aie_control_xrt.cpp file.

The generated Linux binary will be `aie-matrix-multiplication`.

The user can then rebuild the PetaLinux project.

NOTE:
 * No hardware change is supported in this version of Vivado for this design.

## Run Demo Example


To run the demo, execute `aie-matrix-multiplication` on the target. From a
Jupyter notebook cell, use `!aie-matrix-multiplication`.

## References
 [1]  Vivado User Guide - for hardware related design.

 [2]  Vitis User Guide - for AIE application.

 [3]  Versal TRM - General subsystem overview.
