## Getting Started

Welcome to the Allo tutorial! We will walk through the basic usage of Allo and some of its advanced features. We start with a simple matrix multiplication example and then gradually optimize it to achieve high performance.

Feel free to ask questions during the live demo!

Allo is a Python-based Accelerator Design Language (ADL). It is designed to be simple and easy to use, while also providing a powerful set of primitives for hardware customizations.

Allo documentation: https://cornell-zhang.github.io/allo/

First, we import the necessary packages.

In [4]:
import allo

### Algorithm Definition
Allo leverages an algorithm-customization decoupled paradigm, which means
users can first define the algorithm in a high-level language and then
optimize the program with various hardware customization techniques (i.e.,
schedule primitives). Here we show how to define a general matrix multiplication
(GEMM) in the Allo DSL.

We first import the necessary data types from Allo. In this example, we
use ``float32`` as the data type for all the variables.

In [5]:
from allo.ir.types import float32

We then define a function that takes two 128x128 matrices as inputs and
returns a 128x128 matrix as output. The variable declaration is defined
as ``<name>: <type>[<shape>]``, and the function type is defined as
``(<in_type0>, <in_type1>, ...) -> <out_type>``.
We require **strict type annotation** in Allo's kernels, which is different
from directly programming in Python.

Inside the kernel, we provide a shorthand for the loop iterator. For example,
``for i, j in allo.grid(128, 128)`` is equivalent to the following
nested for-loop:

```python
    for i in range(128):
        for j in range(128):
            # body
```
The ``allo.grid`` API is used to define the iteration space of the loop.
The arguments denote the upper bounds of the loop iterators.
Note that the plain ``range`` loop shown above is also supported in the new Allo, so
users have more flexibility in how they define the loop structure.
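
To make the equivalence concrete, here is a plain-Python sketch that does not depend on Allo. The helper `grid` below is a hypothetical stand-in built on `itertools.product`; it only mimics the iteration order and says nothing about how `allo.grid` is implemented.

```python
from itertools import product


def grid(*bounds):
    """Hypothetical stand-in for allo.grid: yields index tuples in row-major order."""
    return product(*(range(b) for b in bounds))


# Indices visited by the shorthand form.
flat = list(grid(2, 3))

# Indices visited by the explicit nested loops.
nested = []
for i in range(2):
    for j in range(3):
        nested.append((i, j))

print(flat == nested)  # True
```

The last argument varies fastest, just as the innermost `range` loop does.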

We also provide ``allo.reduction`` to define the reduction loop.

In [22]:
M, N, K = 128, 128, 128

def gemm(A: float32[M, K], B: float32[K, N]) -> float32[M, N]:
    C: float32[M, N] = 0.0
    for i, j in allo.grid(M, N):
        for k in allo.reduction(K):
            C[i, j] += A[i, k] * B[k, j]
    return C

### Create a Schedule

Hardware customizations in Allo are applied on a **schedule**.  After defining the algorithm, we create a schedule with ``allo.customize``.

In [23]:
s = allo.customize(gemm)

#### Inspect the Intermediate Representation (IR)
Allo leverages the [MLIR](https://mlir.llvm.org/) infrastructure to
represent the program, and we can directly print out the IR by using
``s.module``.

In [None]:
print(s.module)

An MLIR program is
a set of operations in different dialects, and the operations are referred
to as ``<dialect>.<ops>``. In this example, we can see that the generated IR
contains the following dialects:
- ``func``: Used to define the function signature and the return of the function.
- ``memref``: Used to define the shape and memory layout of the tensors.
- ``affine``: Used to define the loop structure.
- ``arith``: Used to conduct actual arithmetic operations.
- ``linalg``: Currently only used to initialize the tensors.

The innermost dot product is explicitly represented by a sequence of load/store
operations and some arithmetic operations.
Allo also attaches some attributes to the operations, including the tensor
names, loop names, and operation names, which are further used for optimization.

📌 **Note**: Allo customizations are applied immediately on the IR. In the later exercises, you can print the IR after each customization to see the changes.


### Validate the Functional Correctness on the CPU Backend

Allo supports multiple backends, including CPU, FPGA, and AI Engine. We can target different backends by specifying the target hardware in the ``.build()`` function. We will start with the CPU backend.

For functional validation on the CPU backend, we call the ``.build()`` function on the schedule and specify the target
hardware as ``llvm``. By default, Allo will generate an LLVM program that
can be executed on the CPU.

In [6]:
executable = s.build(target="llvm")

📌 **Note**: ``s.build(target="llvm")`` is equivalent to ``s.build()``.


#### Prepare the Inputs/Outputs for the Executable

To run the executable, we can generate random NumPy arrays as input data, and
directly feed them into the executable. 

In [7]:
import numpy as np

np_A = np.random.rand(M, K).astype(np.float32)
np_B = np.random.rand(K, N).astype(np.float32)

#### Run the Executable

With the prepared inputs/outputs, we can feed them to our executable.
Notice our module can return a new array as output, so we can directly
assign the output to a new variable.

In [8]:
np_C = executable(np_A, np_B)

Finally, we compare the output against NumPy's result to check that it is correct.

In [None]:
golden_C = np.matmul(np_A, np_B)
np.testing.assert_allclose(np_C, golden_C, rtol=1e-3, atol=1e-3)
print("\033[92mResults are correct! ✅\033[0m")
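
The tolerances matter because the CPU executable and NumPy may accumulate the floating-point products in a different order. As a rough sketch of what the check does, `np.testing.assert_allclose` requires `|actual - desired| <= atol + rtol * |desired|` for every element. The helper `is_close` below is a hypothetical pure-Python restatement of that rule for a single pair of values.

```python
def is_close(actual, desired, rtol=1e-3, atol=1e-3):
    """Hypothetical scalar version of the assert_allclose criterion."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# A small rounding difference passes the check ...
small_error_ok = is_close(32.001, 32.0)
# ... while a clearly wrong value does not.
large_error_ok = is_close(33.0, 32.0)

print(small_error_ok, large_error_ok)  # True False
```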

## Target the FPGA Backend

To generate high-performance designs for FPGA, we apply hardware-specific customizations to transform algorithm specifications into efficient hardware implementations. 

### Setting up Vitis HLS
Before delving into the details, let's set up the environment variables needed to use Vitis HLS in a Jupyter notebook. This step is only required when running inside a Jupyter notebook.

In [None]:
import subprocess
import os

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)


# Run Bash to source the script and print environment variables
bash_command = "bash -c 'source /opt/xilinx/Vitis_HLS/2022.1/settings64.sh && env'"
env_vars = subprocess.run(bash_command, shell=True, capture_output=True, text=True, check=True)

# Parse and update Python's environment variables
for line in env_vars.stdout.split("\n"):
    if "=" in line:
        key, value = line.split("=", 1)
        os.environ[key] = value

# Verify
!which vitis_hls

import allo.backend.hls as hls
print(hls.is_available("vitis_hls"))

### Test Baseline Implementation

To target FPGA, we simply change the target to ``vitis_hls``. For example, we can specify the mode as ``csyn`` to synthesize the design.

In [None]:
mod = s.build(target="vitis_hls", mode="csyn", project="baseline.prj")
mod()

This will generate a Vitis HLS project in the ``baseline.prj`` directory. You can navigate to the project folder to find the generated `kernel.cpp` file.

Without any customizations, the generated design is an inner-product matrix multiply. The following is a simplified version of the generated `kernel.cpp` HLS code and the corresponding datapath diagram.

<div style="text-align:center"><img width=90% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1740726173/default_gemm_udtmvl.png" alt="default gemm"></div>


The resource utilization and performance results are available in the Vitis HLS report ``baseline.prj/out.prj/solution1/syn/report/gemm_csynth.rpt``.

<details><summary markdown="span">Let's see some HLS report!</summary>

```text
+----------+----------+-----------+-----------+----------+----------+---------+
|   Latency (cycles)  |   Latency (absolute)  |       Interval      | Pipeline|
|    min   |    max   |    min    |    max    |    min   |    max   |   Type  |
+----------+----------+-----------+-----------+----------+----------+---------+
|  14958622|  14958622|  49.812 ms|  49.812 ms|  14958623|  14958623|       no|
+----------+----------+-----------+-----------+----------+----------+---------+

    
================================================================
== Utilization Estimates
================================================================
* Summary: 
+---------------------+---------+------+---------+---------+-----+
|         Name        | BRAM_18K|  DSP |    FF   |   LUT   | URAM|
+---------------------+---------+------+---------+---------+-----+
|DSP                  |        -|     -|        -|        -|    -|
|Expression           |        -|     -|        0|      458|    -|
|FIFO                 |        -|     -|        -|        -|    -|
|Instance             |        0|     5|     3258|     4428|    0|
|Memory               |       48|     -|        0|        0|    0|
|Multiplexer          |        -|     -|        -|      611|    -|
|Register             |        -|     -|      691|        -|    -|
+---------------------+---------+------+---------+---------+-----+
|Total                |       48|     5|     3949|     5497|    0|
+---------------------+---------+------+---------+---------+-----+
|Available SLR        |     1344|  3008|   869120|   434560|  320|
+---------------------+---------+------+---------+---------+-----+
|Utilization SLR (%)  |        3|    ~0|       ~0|        1|    0|
+---------------------+---------+------+---------+---------+-----+
|Available            |     4032|  9024|  2607360|  1303680|  960|
+---------------------+---------+------+---------+---------+-----+
|Utilization (%)      |        1|    ~0|       ~0|       ~0|    0|
+---------------------+---------+------+---------+---------+-----+
```
</details>

By default, Vitis HLS automatically pipelines the innermost loop. We can examine the pipelining result in the HLS report:

```text
* Loop: 
+-------------+---------+---------+----------+-----------+-----------+------+----------+
|             |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
|  Loop Name  |   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
+-------------+---------+---------+----------+-----------+-----------+------+----------+
|- l_S_k_0_k  |      902|      902|        14|          7|          1|   128|       yes|
+-------------+---------+---------+----------+-----------+-----------+------+----------+
```

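Each iteration of the ``k`` loop reads and then updates the same ``C[i, j]``, which creates a read-after-write dependency between consecutive iterations. As a result, the loop can only be pipelined to an initiation interval of 7, meaning that a new iteration can only start every 7 cycles.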

### Apply Customizations

One way to improve the pipelining is to move the reduction loop outside, so the innermost loop does not have data dependency.

In this section, we exercise single-kernel customizations with an example: transforming an inner-product matrix multiply to scalar-vector product. 

<div style="text-align:center"><img width=60% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1740145203/svp_ixb7ya.png" alt="scalar-vector"></div>

The figure above illustrates the difference between inner-product and scalar-vector product computations. In the inner-product implementation, the loop over `k` accumulates partial sums. If we attempt to pipeline this loop, we cannot initiate the next iteration immediately after the previous one starts due to data dependencies in the accumulation process.

Instead, we can reorder loops `k` and `j` and pipeline loop `j`. Since there are no dependencies between iterations of loop `j`, a new iteration can begin every cycle. This effectively transforms the computation into a scalar-vector product.
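
The two loop orders can be compared in plain Python, independent of Allo. This is only a sketch: the function names are illustrative, and small integer matrices are used so that the results can be compared exactly.

```python
def gemm_inner_product(A, B, rows, cols, depth):
    """Loop order i, j, k: the innermost loop accumulates into a single C[i][j]."""
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(depth):
                C[i][j] += A[i][k] * B[k][j]
    return C


def gemm_scalar_vector(A, B, rows, cols, depth):
    """Loop order i, k, j: the scalar A[i][k] scales row k of B into row i of C."""
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(depth):
            a = A[i][k]  # scalar reused across the whole j loop
            for j in range(cols):
                C[i][j] += a * B[k][j]  # each j updates a different element
    return C


rows, cols, depth = 2, 3, 4
A = [[i + k for k in range(depth)] for i in range(rows)]
B = [[k * j + 1 for j in range(cols)] for k in range(depth)]
same = gemm_inner_product(A, B, rows, cols, depth) == gemm_scalar_vector(A, B, rows, cols, depth)
print(same)  # True
```

In the second version, consecutive iterations of the innermost loop write to different elements of `C`, so no iteration has to wait for the previous one to finish.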

Let's reorder the inner reduction loop with the middle loop, which changes the computation order of the matrix multiply.

**Exercise**: reorder loop `k` and loop `j` with the `.reorder()` primitive.

Syntax:
```python
reorder(*args)
# Reorders nested loops with indices listed in args such that the outermost loop is the first index listed in args, the second is the second outermost, and so on.
```

💡**Tip**: You can print the IR after each customization to see the changes.


In [26]:
## Write your code here



<details>
  <summary> Answer </summary>
  
  `s.reorder("k", "j")`
</details>

Next, we need an accumulation buffer for one row of partial sums, so we create a new buffer for the output tensor `C`. Allo provides a `.buffer_at(tensor, axis="loop")` primitive for quickly creating a new buffer along a specific axis. Since Allo has attached all the tensors to the function, we can directly use `<schedule>.<tensor>` to access a specific tensor in the schedule.

**Exercise**: insert a buffer for output tensor `C` at loop level `i`.

Syntax:
```python
buffer_at(target, axis)
# Creates an on-chip buffer to hold the values of target written to in loop with index axis instead of immediately writing them to memory.
```

In [27]:
## Write your code here



<details>
  <summary> Answer </summary>
  
  `s.buffer_at(s.C, axis="i")`
</details>


Lastly, we pipeline the `j` loop in order to achieve the best performance.

**Exercise**: pipeline loop `j`.

Syntax:
```python
pipeline(axis[, initiation_interval, rewind])
# Pipelines the loop with index axis, targeting the given initiation interval.
```

In [28]:
## Write your code here



<details>
  <summary> Answer </summary>
  
  `s.pipeline("j")`
</details>

Next, let's push the design through synthesis and observe the speedup:

In [None]:
mod = s.build(target="vitis_hls", mode="csyn", project="scalar-vector.prj")
mod()

You can find the generated `kernel.cpp` HLS code and the corresponding datapath diagram in the `scalar-vector.prj` directory. The HLS report is available in `scalar-vector.prj/out.prj/solution1/syn/report/gemm_csynth.rpt`.
The following is a simplified version of the generated `kernel.cpp` HLS code and the corresponding datapath diagram.

<div style="text-align:center"><img width=90% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1740727380/reorder_buffer_at_ccst24.png" alt="scalar-vector"></div>

From the generated code above, we can see that Allo automatically creates an intermediate buffer for `C` and attaches it inside the `i` loop. Two additional loops, named `j_init` and `j_back`, are also created to initialize the intermediate buffer and to write it back to the output tensor.
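
The structure described above can be mimicked in plain Python. This is a hand-written sketch of the loop nest, not the generated `kernel.cpp`. The names `row_buf` and `gemm_row_buffered` are illustrative, and the sketch assumes that the initialization loop zeroes the buffer.

```python
def gemm_row_buffered(A, B, M, N, K):
    """Scalar-vector GEMM with a one-row accumulation buffer inside the i loop."""
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        row_buf = [None] * N
        for j_init in range(N):      # role of j_init: clear the buffer (assumed zero)
            row_buf[j_init] = 0
        for k in range(K):
            for j in range(N):       # the loop that gets pipelined
                row_buf[j] += A[i][k] * B[k][j]
        for j_back in range(N):      # role of j_back: write the row back to C
            C[i][j_back] = row_buf[j_back]
    return C


result = gemm_row_buffered([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2, 2, 2)
print(result)  # [[19, 22], [43, 50]]
```

All accumulation happens in the small row buffer, and the output tensor is touched only once per row.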

<details><summary markdown="span">Let's see some HLS report!</summary>

```text
+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+---------+---------+---------+
    |  Latency (cycles) |  Latency (absolute) |      Interval     | Pipeline|
    |   min   |   max   |    min   |    max   |   min   |   max   |   Type  |
    +---------+---------+----------+----------+---------+---------+---------+
    |  2198686|  2198686|  7.322 ms|  7.322 ms|  2198687|  2198687|       no|
    +---------+---------+----------+----------+---------+---------+---------+

================================================================
== Utilization Estimates
================================================================
* Summary: 
+---------------------+---------+------+---------+---------+-----+
|         Name        | BRAM_18K|  DSP |    FF   |   LUT   | URAM|
+---------------------+---------+------+---------+---------+-----+
|DSP                  |        -|     -|        -|        -|    -|
|Expression           |        -|     -|        0|      371|    -|
|FIFO                 |        -|     -|        -|        -|    -|
|Instance             |        0|     5|     3384|     4619|    0|
|Memory               |       48|     -|       32|       65|    0|
|Multiplexer          |        -|     -|        -|      668|    -|
|Register             |        -|     -|      649|        -|    -|
+---------------------+---------+------+---------+---------+-----+
|Total                |       48|     5|     4065|     5723|    0|
+---------------------+---------+------+---------+---------+-----+
|Available SLR        |     1344|  3008|   869120|   434560|  320|
+---------------------+---------+------+---------+---------+-----+
|Utilization SLR (%)  |        3|    ~0|       ~0|        1|    0|
+---------------------+---------+------+---------+---------+-----+
|Available            |     4032|  9024|  2607360|  1303680|  960|
+---------------------+---------+------+---------+---------+-----+
|Utilization (%)      |        1|    ~0|       ~0|       ~0|    0|
+---------------------+---------+------+---------+---------+-----+

```
</details>

Wow! We have improved the total latency from 14,958,622 cycles to 2,198,686 cycles, a 6.8x speedup!

We check the pipelining result:

```text
* Loop: 
+-----------------+---------+---------+----------+-----------+-----------+-------+----------+
|                 |  Latency (cycles) | Iteration|  Initiation Interval  |  Trip |          |
|    Loop Name    |   min   |   max   |  Latency |  achieved |   target  | Count | Pipelined|
+-----------------+---------+---------+----------+-----------+-----------+-------+----------+
|- l_S_k_0_k_l_j  |    16397|    16397|        15|          1|          1|  16384|       yes|
+-----------------+---------+---------+----------+-----------+-----------+-------+----------+
```

We see that the `j` loop is pipelined to an initiation interval of 1, meaning that a new iteration can begin every cycle, achieving the best performance.

Furthermore, we can increase parallelism by unrolling the `j` loop.

**Exercise**: unroll loop `j` with a factor of 16.

Syntax:
```python
unroll(axis[, factor])
# Unrolls a loop with loop index axis by factor.
```

In [30]:
## Write your code here



<details>
  <summary> Answer </summary>
  
  `s.unroll("j", 16)`
</details>


With unrolling, we create 16 parallel units to compute the scalar-vector product. The following is the datapath diagram of the unrolled design.

<div style="text-align:center"><img width=40% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1740727380/unroll_yl4lxg.png" alt="scalar-vector-unroll"></div>
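
As a plain-Python sketch of what unrolling means, the function below updates one row of partial sums in chunks of `factor` elements. In hardware, the updates within a chunk would be carried out by separate parallel units; Python simply runs them one after another. The function name and the divisibility requirement are assumptions made for this illustration.

```python
def scaled_add_unrolled(row, a, b_row, factor):
    """Update row[j] += a * b_row[j], processing `factor` elements per step."""
    n = len(row)
    assert n % factor == 0, "sketch assumes the factor divides the trip count"
    steps = 0
    for j_outer in range(0, n, factor):
        # These `factor` updates are independent, so hardware can do them in parallel.
        for lane in range(factor):
            j = j_outer + lane
            row[j] += a * b_row[j]
        steps += 1
    return steps


row = [0] * 128
steps = scaled_add_unrolled(row, a=2, b_row=list(range(128)), factor=16)
print(steps)  # 8
```

With 128 elements and a factor of 16, the loop needs 8 steps instead of 128.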

We push the design through synthesis again and observe the speedup.

In [None]:
mod = s.build(target="vitis_hls", mode="csyn", project="unroll-scalar-vector.prj")
mod()

The HLS report is available in `unroll-scalar-vector.prj/out.prj/solution1/syn/report/gemm_csynth.rpt`.

<details><summary markdown="span">Let's see some HLS report!</summary>

```text
+ Latency: 
+---------+---------+----------+----------+--------+--------+---------+
|  Latency (cycles) |  Latency (absolute) |     Interval    | Pipeline|
|   min   |   max   |    min   |    max   |   min  |   max  |   Type  |
+---------+---------+----------+----------+--------+--------+---------+
|   232478|   232478|  0.774 ms|  0.774 ms|  232479|  232479|       no|
+---------+---------+----------+----------+--------+--------+---------+

================================================================
== Utilization Estimates
================================================================
* Summary: 
+---------------------+---------+------+---------+---------+-----+
|         Name        | BRAM_18K|  DSP |    FF   |   LUT   | URAM|
+---------------------+---------+------+---------+---------+-----+
|DSP                  |        -|     -|        -|        -|    -|
|Expression           |        -|     -|        0|      350|    -|
|FIFO                 |        -|     -|        -|        -|    -|
|Instance             |        0|    80|    12274|     9303|    0|
|Memory               |       64|     -|      512|      528|    0|
|Multiplexer          |        -|     -|        -|     2082|    -|
|Register             |        -|     -|      634|        -|    -|
+---------------------+---------+------+---------+---------+-----+
|Total                |       64|    80|    13420|    12263|    0|
+---------------------+---------+------+---------+---------+-----+
|Available SLR        |     1344|  3008|   869120|   434560|  320|
+---------------------+---------+------+---------+---------+-----+
|Utilization SLR (%)  |        4|     2|        1|        2|    0|
+---------------------+---------+------+---------+---------+-----+
|Available            |     4032|  9024|  2607360|  1303680|  960|
+---------------------+---------+------+---------+---------+-----+
|Utilization (%)      |        1|    ~0|       ~0|       ~0|    0|
+---------------------+---------+------+---------+---------+-----+
```
</details>

With unrolling, we created 16 parallel units to compute the scalar-vector product, so we use more DSPs and FFs. We further improve the total latency from 14,958,622 cycles to 232,478 cycles, a 64.3x total speedup! 🎉

We check the pipelined `j` loop:

```text
* Loop: 
+-----------------+---------+---------+----------+-----------+-----------+------+----------+
|                 |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
|    Loop Name    |   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
+-----------------+---------+---------+----------+-----------+-----------+------+----------+
|- l_S_k_0_k_l_j  |     1035|     1035|        13|          1|          1|  1024|       yes|
+-----------------+---------+---------+----------+-----------+-----------+------+----------+
```

The `j` loop is pipelined to an initiation interval of 1, meaning that a new iteration can begin every cycle, achieving the best performance.

### Summary

In this section, we walked through the following topics:
- Creating a customization schedule from an algorithm.
- Using CPU backend for functional validation.
- Targeting Vitis HLS for hardware synthesis.
- Applying loop reordering, buffer insertion, pipelining, and unrolling to improve the performance.

We have successfully transformed the matrix multiply example into an accelerator design with a 64.3x speedup! 🎉

In the next section, we will show how to _verify_ the correctness of the accelerator design.

## Verification

Each customization transforms the Allo program. How can we make sure the accelerator remains correct after applying hardware customizations? 🤔

Allo integrates an equivalence verification tool that checks the equivalence before and after customizations:

<div style="text-align:center"><img width=60% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1739966977/verify_wa2lf4.png" alt="verify"></div>

This verification tool interprets the program before and after customizations to build a pair of symbolic representations, then checks whether they are equivalent. This method is agnostic to loop transformations, data layout, and buffer insertion. If the two programs are equivalent, the verification tool returns `True`. Otherwise, it returns `False` and reports the difference between the two programs.
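
As a toy illustration of the idea (not the algorithm used by Allo's verifier), the sketch below "runs" a GEMM loop nest on symbolic inputs and records, for each output element, the collection of product terms added to it. It assumes that additions may be freely reordered, which is the property a loop interchange relies on. All names here are hypothetical.

```python
from collections import Counter


def symbolic_gemm(order, M, N, K):
    """Interpret a GEMM loop nest symbolically; `order` is a permutation of 'ijk'."""
    bounds = {"i": M, "j": N, "k": K}
    out = {}  # (i, j) -> Counter of symbolic product terms
    o0, o1, o2 = order
    for x in range(bounds[o0]):
        for y in range(bounds[o1]):
            for z in range(bounds[o2]):
                idx = {o0: x, o1: y, o2: z}
                i, j, k = idx["i"], idx["j"], idx["k"]
                term = f"A[{i},{k}]*B[{k},{j}]"
                out.setdefault((i, j), Counter())[term] += 1
    return out


reference = symbolic_gemm("ijk", 2, 2, 3)
reordered = symbolic_gemm("ikj", 2, 2, 3)
# A deliberately wrong variant: the reduction loop is one iteration short.
broken = symbolic_gemm("ijk", 2, 2, 2)

print(reference == reordered, reference == broken)  # True False
```

The reordered nest produces the same symbolic terms for every output element, while the broken variant is flagged because some terms are missing.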

To read more about the verification tool, please refer to our paper published at FPGA 2024: [Formal Verification of Source-to-Source Transformations for HLS](https://dl.acm.org/doi/10.1145/3626202.3637563).

We verify the scalar-vector matrix multiply example as follows:

In [None]:
M, N, K = 32, 32, 32

def gemm(A: float32[M, K], B: float32[K, N]) -> float32[M, N]:
    C: float32[M, N] = 0.0
    for i, j in allo.grid(M, N):
        for k in allo.reduction(K):
            C[i, j] += A[i, k] * B[k, j]
    return C

s = allo.customize(gemm)
s.reorder("k", "j")
s.buffer_at(s.C, axis="i")
s.pipeline("j")
s.unroll("j", 4)


s1 = allo.customize(gemm)
equivalent = allo.verify(s, s1)
if equivalent:
    print("\033[92m" + "Verification Passed!" + "\033[0m")
else:
    print("\033[91m" + "Verification Failed!" + "\033[0m")


💡**Tip**: You can verify the equivalence of two schedules at any point of composing the customizations, even at every step. This makes the verification process scalable to large-scale and complex customizations.

## PyTorch

With Allo, we can also easily convert a PyTorch model into an accelerator design.


In the Allo examples directory, we provide demo code for converting a self-attention module, a BERT layer, and a full GPT-2 model into accelerator designs. If you are interested in the details, please refer to the [examples](https://github.com/cornell-zhang/allo/tree/main/examples/torch) directory.

Here, we show a small example of converting a simple MLP into an accelerator design.

In [None]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import allo


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(16, 32)  # (8, 16) input times the transposed (32, 16) weight
        self.linear2 = torch.nn.Linear(32, 10)

    def forward(self, data):
        out = self.linear1(data)
        out = self.linear2(out)
        out = F.relu(out)
        return out


model = MLP()
model.eval()
example_inputs = [torch.rand(8, 16)]
hls_mod = allo.frontend.from_pytorch(
    model, example_inputs=example_inputs, verbose=False,
    target="vitis_hls", mode="csyn", project="pytorch_demo.prj"
)
hls_mod()

## Multi-Kernel Composition Demo

Allo is a composable accelerator design framework with the ability to compose multiple kernels into a larger accelerator design.

First, we show how to design a systolic array with stream type and spatial composition.

#### Stream Types

Spatial composition typically involves specializing distinct PEs for specific operators or layers, enabling direct communication between them using streaming buffers (e.g., FIFOs or multi-buffers). Similar to partition types, a stream can be viewed as a layout that enforces the memory access order. To improve spatial composability, we introduce the stream type, which serializes the data within it. As shown in the following figure, two operations are associated with the stream type: the `.put()`
operation places data into the stream, while the `.get()` operation retrieves data from the stream in a first-in-first-out manner.

<div style="text-align:center"><img width=60% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1739966977/stream_type_arn3y3.png" alt="stream_type"></div>
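
The first-in-first-out behavior can be modeled in a few lines of plain Python. The `Stream` class below is a hypothetical software model built on `collections.deque`. It is not Allo's stream type, and it raises an error where real hardware would stall.

```python
from collections import deque


class Stream:
    """Hypothetical software model of a stream: a bounded first-in-first-out queue."""

    def __init__(self, depth):
        self.depth = depth
        self._items = deque()

    def put(self, value):
        if len(self._items) >= self.depth:
            raise RuntimeError("stream is full; a hardware producer would stall here")
        self._items.append(value)

    def get(self):
        if not self._items:
            raise RuntimeError("stream is empty; a hardware consumer would stall here")
        return self._items.popleft()


fifo = Stream(depth=4)
for value in (1.0, 2.0, 3.0):
    fifo.put(value)
received = [fifo.get() for _ in range(3)]
print(received)  # [1.0, 2.0, 3.0]
```

Values come out in exactly the order they were put in, which is what lets a stream enforce the memory access order between a producer and a consumer.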

### Constructing A Systolic Array with Spatial PE Composition

A systolic array is a grid of processing elements (PEs) where each PE operates on the data it receives from its neighbors. The PEs are connected to each other through a network of communication channels, which allows them to exchange data and perform computations in parallel.

<div style="text-align:center"><img width=40% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1739961848/systolic_array_pix4ue.png" alt="systolic_array"></div>

In [None]:
import allo.dataflow as df
import numpy as np
from allo.ir.types import float32


M, N, K = 2, 2, 2
P0, P1 = M + 2, N + 2

@df.region()
def top():
    fifo_A = df.array(df.pipe(dtype=float32, shape=(), depth=4), shape=(P0, P1))
    fifo_B = df.array(df.pipe(dtype=float32, shape=(), depth=4), shape=(P0, P1))

    @df.kernel(mapping=[P0, P1])
    def gemm(A: float32[M, K], B: float32[K, N], C: float32[M, N]):
        i, j = df.get_pid()
        # peripheral kernels
        with allo.meta_if(i in {0, M + 1} and j in {0, N + 1}):
            pass
        with allo.meta_elif(j == 0):
            # i > 0
            for k in range(K):
                fifo_A[i, j + 1].put(A[i - 1, k])
        with allo.meta_elif(i == 0):
            # j > 0
            for k in range(K):
                fifo_B[i + 1, j].put(B[k, j - 1])
        # drain
        with allo.meta_elif(i == M + 1 and j > 0):
            for k in range(K):
                b: float32 = fifo_B[i, j].get()
        with allo.meta_elif(j == N + 1 and i > 0):
            for k in range(K):
                a: float32 = fifo_A[i, j].get()
        # main body
        with allo.meta_else():
            c: float32 = 0
            for k in range(K):
                a: float32 = fifo_A[i, j].get()
                b: float32 = fifo_B[i, j].get()
                c += a * b
                fifo_A[i, j + 1].put(a)
                fifo_B[i + 1, j].put(b)
            C[i - 1, j - 1] = c


A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

sim_mod = df.build(top, target="simulator")
sim_mod(A, B, C)
np.testing.assert_allclose(C, np.dot(A, B), atol=1e-5)
print("\033[92mDataflow Simulator Passed!\033[0m")
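
To see which role each processing element takes, the branch structure of the kernel above can be restated as an ordinary Python function. This is only a reading aid: `pe_role` and the role names are invented for this sketch, and the branches are checked in the same order as the `meta_if` chain.

```python
def pe_role(i, j, M, N):
    """Classify PE (i, j) of the (M + 2) x (N + 2) grid, mirroring the meta_if chain."""
    if i in {0, M + 1} and j in {0, N + 1}:
        return "idle"      # the four corners do nothing
    if j == 0:
        return "feed_A"    # left column pushes rows of A to the right
    if i == 0:
        return "feed_B"    # top row pushes columns of B downward
    if i == M + 1:
        return "drain_B"   # bottom row consumes the leftover B values
    if j == N + 1:
        return "drain_A"   # right column consumes the leftover A values
    return "compute"       # interior PEs multiply-accumulate and forward data


grid_rows, grid_cols = 2, 2
layout = [
    [pe_role(i, j, grid_rows, grid_cols) for j in range(grid_cols + 2)]
    for i in range(grid_rows + 2)
]
for row in layout:
    print(" ".join(f"{role:8s}" for role in row))
```

For the 2x2 case, this yields a 4x4 grid with four compute PEs in the interior, surrounded by feeders on the top and left and drains on the bottom and right.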

### Composing Two Systolic Arrays

When composing multiple kernels, there can be conflicts in the data layout. Allo checks and resolves these by modeling data layouts as types and using type inference to detect and fix potential data layout inconsistencies.

In the following example, we show how to compose two systolic arrays with potential data layout inconsistency.

<div style="text-align:center"><img width=40% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1739969013/temporal_compose_xghz28.png" alt="systolic_array"></div>

As shown in the above figure, we demonstrate an example of calling two consecutive GEMM kernels. By using the `.compose()` primitive, users can easily integrate the schedule of a subfunction into the top-level function.

Here, we customize the submodule `systolic_tile`, then compose it with the top-level module.

Notice the potential data layout inconsistency for tensor `Y`:

- It is fully partitioned as the output of the first `systolic_tile` call.
- It is also partitioned along rows as the input of the second `systolic_tile` call.

To make sure the data layout is consistent, we fully partition both input A and output C.

The intuition behind the fix is to always supply equal or more parallelism, but never less.
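
One possible reading of this rule is sketched below in plain Python. A layout is modeled as the set of partitioned dimensions, and conflicting uses of a tensor are merged by taking the union. This toy model and the name `unify_layouts` are assumptions made for illustration. They are not Allo's type system.

```python
def unify_layouts(*layouts):
    """Toy rule: the merged layout partitions every dimension that any use partitions."""
    merged = set()
    for dims in layouts:
        merged |= set(dims)
    return frozenset(merged)


# Dimensions are numbered from 0 in this sketch: 0 = rows, 1 = columns.
produced_as = {0, 1}   # Y as output of the first call: fully partitioned
consumed_as = {0}      # Y as input of the second call: partitioned along rows

merged = unify_layouts(produced_as, consumed_as)
print(sorted(merged))  # [0, 1]
```

The merged layout offers at least as much parallelism as each individual use requires, so neither the producer nor the consumer is slowed down.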

In [None]:
from allo.library.systolic import systolic_tile
from allo.ir.types import int8

M0, M1, KK = 4, 4, 4
W_A_cst = np.random.randint(-4, 4, size=(M0, M1)).astype(np.int8)
W_B_cst = np.random.randint(-4, 4, size=(M0, M1)).astype(np.int8)

def top(X: int8[M0, M1]) -> int8[M0, M1]:
    Y: int8[M0, M1] = 0
    Z: int8[M0, M1] = 0
    W_A: int8[M0, M1] = W_A_cst
    W_B: int8[M0, M1] = W_B_cst
    systolic_tile[int8, int8, int8, KK, M0, M1](X, W_A, Y)
    systolic_tile[int8, int8, int8, KK, M0, M1](Y, W_B, Z)
    return Z

s_top = allo.customize(top)
# print(s_top.module)
# CPU testing
mod = s_top.build()
X = np.random.randint(-4, 4, size=(M0, M1)).astype(np.int8)
allo_C = mod(X)
np_C = X @ W_A_cst @ W_B_cst
np.testing.assert_allclose(allo_C, np_C, atol=1e-3)
print("Passed!")
# Submodule customization
s = allo.customize(
    systolic_tile,
    instantiate=[int8, int8, int8, KK, M0, M1],
)
s.partition(s.C, dim=0) 
s.partition(s.A, dim=1)
s.partition(s.B, dim=2)
pe = s.unfold("PE", [0, 1])  # specify which are spatial loops
s.to(s.A_fifo, pe, axis=1, depth=M0 + 1)
s.to(s.B_fifo, pe, axis=0, depth=M1 + 1)
# Compose with submodule
s_top.compose(s)
# HLS testing
code = s_top.build("vhls")
print(code)

## Conclusions

In this tutorial, we demonstrated Allo's _verifiable_ accelerator design approach. We walked through the following topics:
- Single kernel design and customization with Allo.
- Verifying the correctness of the accelerator design.
- Importing PyTorch models and converting them into accelerator designs.
- Composing multiple kernels into a larger accelerator design.
- Synthesis and simulation of accelerator designs targeting FPGAs.

Next, back to Hongzheng for AI Engine Demo!

<div style="text-align:center"><img width=90% src="https://res.cloudinary.com/dxzx2bxch/image/upload/v1740771896/aie_bmulwc.png" alt="AIE"></div>