# AI Engine: GMIO Matrix Multiplication Example

The AI Engine matrix multiplication example demonstrates how to use the AI Engine 
for scalar computation and how to use GMIO attributes for data movement. This 
example implements the standard matrix multiplication algorithm to multiply two 
matrices on the AIE cores. The user can change the matrix size and the number of 
cores used at compile time. The default size is 1200x1200 (int32). The matrix 
size can be scaled further to use the AIE's entire local data memory.

Please note that this example is a proof of concept only. Other implementations 
could use more AI Engine resources and achieve better performance.

## Introduction

Consider two matrices A and B. Each column of the product AxB is a linear 
combination of the columns of A, with coefficients taken from the corresponding 
column of B. Equivalently, the elements in row i of A are multiplied with the 
corresponding elements in column j of B and summed up to give the single element 
of AxB at position (i, j). If A is an n x m matrix and B is an m x p matrix, 
the product AxB has dimensions n x p. The number of columns of A must equal 
the number of rows of B for the matrix multiplication to be possible.
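
The definition above can be sketched as a plain C++ reference implementation. This is only an illustration of the row-by-column rule, not the AIE kernel code from this example; the `Matrix` type and the `multiply` function are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major dense matrix of int32 values (hypothetical helper type).
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::int32_t> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0) {}

    std::int32_t& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    std::int32_t at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Reference product C = A x B, where A is n x m and B is m x p.
Matrix multiply(const Matrix& a, const Matrix& b) {
    assert(a.cols == b.rows);  // columns of A must equal rows of B
    Matrix c(a.rows, b.cols);  // result is n x p
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t j = 0; j < b.cols; ++j) {
            std::int32_t sum = 0;
            // Multiply row i of A with column j of B and accumulate.
            for (std::size_t k = 0; k < a.cols; ++k) {
                sum += a.at(i, k) * b.at(k, j);
            }
            c.at(i, j) = sum;
        }
    }
    return c;
}
```

A host-side check of this kind is one plausible way to validate the AIE result; whether the sanity check in xgemm.cpp works this way is an assumption.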

## Implementation

### Data movement

The application uses the GMIO attribute to make external memory-mapped connections
to and from global memory. These connections are created between the AIE kernel and 
the logical global memory port of the hardware platform design via the Network on 
Chip (NoC). In this design, the buffer descriptors of the AIE shim DMAs are 
programmed from the PS program to initiate AIE-to-DDR read and write transactions. 
The burst length of the memory-mapped transaction is 64-bit, and the shim DMAs use 
physical memory addressing to read and write data in global memory.

<img src="images/data_movement.png">


### Data slicing

To compute the matrix multiplication on the AIE, matrix A is sliced horizontally 
and distributed equally among all the cores in use through the AIE AXI-Stream 
network. Matrix B is transposed and fed to the first core in the design element 
by element. The first core shares the input matrix B with the other AIE cores 
through the AXI-Stream connection. Because the output is produced in z-order, 
the output matrix must be re-ordered.

<img src="images/data_slicing.png">
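
The host-side preparation described above (equal horizontal slices of A, and a transposed B) can be sketched in C++ as follows. This is a minimal illustration under stated assumptions: `RowSlice`, `slice_rows`, and `transpose` are hypothetical names, the row count is assumed to be a multiple of the core count, and the sketch does not model the AXI-Stream forwarding between cores or the z-order re-ordering of the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open range [begin, end) of rows of A assigned to one core.
struct RowSlice {
    std::size_t begin;
    std::size_t end;
};

// Split the rows of A into equal horizontal slices, one per core.
// Assumes rows is a multiple of num_cores ("distributed equally").
std::vector<RowSlice> slice_rows(std::size_t rows, std::size_t num_cores) {
    assert(num_cores > 0 && rows % num_cores == 0);
    const std::size_t per_core = rows / num_cores;
    std::vector<RowSlice> slices;
    slices.reserve(num_cores);
    for (std::size_t c = 0; c < num_cores; ++c) {
        slices.push_back({c * per_core, (c + 1) * per_core});
    }
    return slices;
}

// Transpose a row-major rows x cols matrix, so that each column of the
// original B becomes contiguous when streamed element by element.
std::vector<std::int32_t> transpose(const std::vector<std::int32_t>& m,
                                    std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<std::int32_t> t(m.size());
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            t[j * rows + i] = m[i * cols + j];
        }
    }
    return t;
}
```

With the default 1200x1200 matrix and, for example, four cores, `slice_rows(1200, 4)` would give each core 300 consecutive rows of A.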


## Build Flow

<img src="images/build_flow.png">

Vitis generates aie_control_xrt.cpp, which is cross-compiled to run on the 
target. The compiled application loads the generated AIE ELFs and CDOs 
(packaged into the XCLBIN) to the corresponding tiles through the XCLBIN load API.


## Run-time Execution

At runtime, the Linux application binary calls the AI Engine userspace driver 
and the runtime library, libadf_api_xrt.a. The AIE userspace driver abstracts the 
kernel-space driver, which handles runtime configuration along with ELF loading.

<img src="images/runtime.png">


## Customizing and Rebuilding

The AIE application source files are in the aie_app directory.

As mentioned earlier, the user can change the number of AIE cores used 
for matrix multiplication. However, since the data memory immediately available
to each core is limited, reducing the number of AIE cores reduces the maximum 
matrix size supported by the application. The NUM_HW_ROWS and NUM_HW_COLS macros 
in the config.h header file set the number of cores used. The maximum number of 
AIE cores available is 400.
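
As a rough illustration of how these macros determine the core count, the sketch below checks a configuration against the 400-core limit at compile time. The macro values shown are hypothetical and are not the defaults from config.h; `kMaxAieCores`, `num_cores`, and `rows_per_core` are names introduced here for illustration, and the equal split of rows follows the data slicing description rather than the actual kernel code.

```cpp
#include <cassert>

// Hypothetical values; the real settings live in kernels/config.h.
#define NUM_HW_ROWS 4
#define NUM_HW_COLS 10

// Upper bound on AIE cores stated in this README.
constexpr int kMaxAieCores = 400;

// Total number of AIE cores used by the configuration.
constexpr int num_cores() { return NUM_HW_ROWS * NUM_HW_COLS; }

static_assert(NUM_HW_ROWS > 0 && NUM_HW_COLS > 0,
              "core grid dimensions must be positive");
static_assert(num_cores() <= kMaxAieCores,
              "requested more AIE cores than are available");

// Rows of an n x n matrix A handled by each core, assuming A is split
// equally among all cores and n is a multiple of the core count.
int rows_per_core(int n) {
    assert(n % num_cores() == 0);
    return n / num_cores();
}
```

With these example values, 40 cores are used and each core handles 30 rows of the default 1200x1200 matrix. Using fewer cores increases the rows per core, which is why the limited per-core data memory lowers the maximum supported matrix size.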

## Directory Structure

```
plnx-aie-examples/
├── designs
│   └── xgemm-gmio - GMIO based AIE GeMM application
│       ├── aie_app
│       │   ├── kernels
│       │   │   ├── config.h - user defined parameters
│       │   │   ├── one_input.cc - first AIE compute kernel within an AIE row that reads data from input stream
│       │   │   ├── one_output.cc - last AIE compute kernel within an AIE row that sends data to output stream
│       │   │   └── two_inputs.cc - subsequent compute kernels that circulate input data and output
│       │   ├── kernels.h - kernels declaration
│       │   ├── Makefile
│       │   ├── xgemm.cpp - PS main application
│       │   └── xgemm.h - dataflow graph definition
│       ├── hw
│       │   ├── system.cfg - defines connections to and from PL to AIE
│       │   └── Makefile
│       ├── images
│       │   ├── build_flow.png
│       │   ├── data_movement.png
│       │   ├── data_slicing.png
│       │   └── runtime.png
│       ├── Makefile
│       ├── ps_app_hw
│       │   └── linux
│       │       ├── Makefile
│       │       └── xrt.ini - XRT configuration file
│       ├── README
│       └── sw
│           └── Makefile
├── LICENSE-BINARIES
├── LICENSE-MIT
├── platforms
│   ├── Makefile
│   └── platform_create.tcl
└── settings.sh
```

## Build Instructions

Vitis and PetaLinux tools need to be installed and sourced before building the design.

```
source <Vitis_install_path>/Vitis/202X.X/settings64.sh
source <PetaLinux_install_path>/settings.sh
```

Export the paths to the VCK190 XSA and the PetaLinux BSP.

```
export BASE_XSA="<Base_XSA_path>"
export PTLNX_BSP="<PetaLinux_BSP_path>"
```

Source settings.sh in a shell session.

```
source settings.sh
```

Run make to begin the build.

```
make
```

## Run Demo Example

Follow the PetaLinux documentation to generate boot images for your target.
Once booted, log in with the user root and password root.

The AIE XCLBIN and executable are pre-installed in the /usr/bin/ directory. To run
the demo, run the application "aie-matrix-multiplication" from the shell.




## Sample Output

```
root@xilinx-vck190-2021_2:~# aie-matrix-multiplication
Initializing ADF API...
[INFO] AIE GMIO Matrix Multiplication
[INFO] Matrix size(int32): 1200x1200
[  729.529262] zocl-drm axi:zyxclmm_drm: zocl_create_client: created KDS client for pid(855), ret: 0
[  729.538172] zocl-drm axi:zyxclmm_drm: zocl_destroy_client: client exits pid(855)
[  729.545629] zocl-drm axi:zyxclmm_drm: zocl_create_client: created KDS client for pid(855), ret: 0
[  729.589247] [drm] found kind 29(AIE_RESOURCES)
[742107.428]Loading PDI from DDR
[742107.545]Monolithic/Master Device
[742110.856]3.399 ms: PDI initialization time
[742114.856]+++Loading Image#: 0x0, Name: aie_image, Id: 0x1C000000
[742120.791]---Loading Partition#: 0x0, Id: 0x0
[742173.617] 48.585 ms for Partition#: 0x0, Size: 19032880 Bytes
[742176.541]Subsystem PDI Load: Done
[  729.589258] [drm] found kind 18(PDI)
[  729.681940] [drm] FPGA Manager load DONE
[  729.688579] [drm] Partition 1 already requested
XAIEFAL: INFO: Resource group Avail is created.
XAIEFAL: INFO: Resource group Static is created.
XAIEFAL: INFO: Resource group Generic is created.
[INFO] XCLBIN download complete
[  729.702391] [drm] zocl_xclbin_read_axlf 9ebbbd79-4961-094c-09c8-ce28a79e87f6 ret: 0
[INFO] AIE cores are done executing
[INFO] Running sanity check
[INFO] XGeMM Success!
[  775.819920] zocl-drm axi:zyxclmm_drm: zocl_destroy_client: client exits pid(855)
root@xilinx-vck190-2021_2:~#
```


## References
1. AI Engine Programming Environment User Guide.
2. Vivado User Guide - for hardware related design.
3. Vitis User Guide - for AIE application.
4. Versal Technical Reference Manual.
