### Briefly introduction to the algorithm or overall system

In this lab, we are required to do a discrete Fourier transform (DFT) kernel, and implemented it in PYNQ. The algorithm is shown in the formula below.

$$X[k] = \sum_{n=0}^{N-1} e^{-\frac{i2\pi}{N}nk} x[n]$$

We perform the real part and imaginary part separately, so the formula can be rewritten in this way.

$$X[k] = \sum_{n=0}^{N-1} \left[\cos\left(\frac{2\pi}{N}nk\right) - i\sin\left(\frac{2\pi}{N}nk\right)\right] (x_R[n] + ix_I[n])$$

$$X_R[k] = \sum_{n=1}^{N-1} \cos\left(\frac{2\pi}{N}nk\right) x_R[n] + \sin\left(\frac{2\pi}{N}nk\right) x_I[n]$$

$$X_R[k] = \sum_{n=1}^{N-1} \cos\left(\frac{2\pi}{N}nk\right) x_R[n] + \sin\left(\frac{2\pi}{N}nk\right) x_I[n]$$

$$X_{I}[k] = \sum_{n=1}^{N-1} \cos\left(\frac{2\pi}{N}nk\right) x_{I}[n] - \sin\left(\frac{2\pi}{N}nk\right) x_{R}[n]$$

# Explain the original code/system/pragmas and how you implement it

Here, I will show the baseline code. In the next section, I will implement optimization step by step following the questions of the guide.

```
void read(DTYPE real_in[SIZE], DTYPE imag_in[SIZE], DTYPE real_out[SIZE], DTYPE imag_out[SIZE])
{
    int i;

READ_LOOP:
    for (i = 0; i < SIZE; i++) {
        real_out[i] = real_in[i];
        imag_out[i] = imag_in[i];
    }

    return;
}</pre>
```

The function in the above figure is used to read the input.

```
void dft_compute(DTYPE real_sample[SIZE], DTYPE imag_sample[SIZE], DTYPE Real_freq[SIZE], DTYPE Imag_freq[SIZE])
{
    int k = 0;
    int n = 0;

DFT_OUTER_LOOP:
    for (k = 0; k < SIZE; k++) {
        Real_freq[k] = 0;
        Imag_freq[k] = 0;

DFT_INNER_LOOP:
    for (n = 0; n < SIZE; n++) {
        Real_freq[k] += real_sample[n] * cos(2*M_PI*n*k/SIZE) + imag_sample[n] * sin(2*M_PI*n*k/SIZE);
        Imag_freq[k] += imag_sample[n] * cos(2*M_PI*n*k/SIZE) - real_sample[n] * sin(2*M_PI*n*k/SIZE);
    }
    return;
-}</pre>
```

The function in the above figure is used to compute DFT.

```
void write(DTYPE real_in[SIZE], DTYPE imag_in[SIZE], DTYPE real_out[SIZE], DTYPE imag_out[SIZE])

{
    int i;

WRITE_LOOP:
    for (i = 0; i < SIZE; i++) {
        real_out[i] = real_in[i];
        imag_out[i] = imag_in[i];
    }

    return;
-}</pre>
```

The function in the figure above is used to output the result.

```
void dft(DTYPE real_sample[SIZE], DTYPE imag_sample[SIZE])

{
    //Write your code here
    DTYPE real_in[SIZE], imag_in[SIZE];
    DTYPE real_out[SIZE], imag_out[SIZE];
    read(real_sample, imag_sample, real_in, imag_in);
    dft_compute(real_in, imag_in, real_out, imag_out);
    write(real_out, imag_out, real_sample, imag_sample);
    return;
}
```

The 3 functions introduced above are called here.

# Analyze the timing/performance/utilization

LookUpTable

```
DFT_INNER_LOOP:

| for (n = 0; n < SIZE; n++) {
| Real_freq[k] += real_sample[n] * cos_coefficients_table[n*k%SIZE] - imag_sample[n] * sin_coefficients_table[n*k%SIZE];
| Imag_freq[k] += imag_sample[n] * cos_coefficients_table[n*k%SIZE] + real_sample[n] * sin_coefficients_table[n*k%SIZE];
| }
```

The inner loop is modified in this way so that we don't need to calculate cos and sin functions.

### How does this change the throughput and resource utilization?

Throughput

Baseline

| Modules<br>& Loops    | Issue <br>  Type | Slack | Latency  <br>(cycles) | Latency<br>(ns) | Iteration<br>Latency |         | Trip  <br>Count | Pipelined | BRAM    | DSP       | FF          |             | URAM |
|-----------------------|------------------|-------|-----------------------|-----------------|----------------------|---------|-----------------|-----------|---------|-----------|-------------|-------------|------|
| + dft                 | -                | 0.00  | 7015171               | 7.015e+07       | -                    | 7015172 | -               | no        | 20 (7%) | 158 (71%) | 15552 (14%) | 18256 (34%) | -    |
| o READ_LOOP           | i -i             | 7.30  | 512                   | 5.120e+03       | 2                    | -       | 256             | no        |         | · - 1     | 1 -         |             |      |
| o DFT_OUTER_LOOP      | i -i             | 7.30  | 7014144               | 7.014e+07       | 27399                | -       | 256             | no        |         | -         | -           |             |      |
| o DFT_INNER_LOOP      | l -I             | 7.30  | 27392                 | 2.739e+05       | 107                  | -1      | 256             | no        |         | -1        | -           |             |      |
| + sin_or_cos_double_s |                  | 0.25  | 55                    | 550.000         | - [                  | 55      |                 | no        | 8 (2%)  | 54 (24%)  | 5933 (5%)   | 6480 (12%)  |      |
| o Loop 1              | l -I             | 7.30  | 9                     | 90.000          | 3                    | -       | 3               | no        |         | -1        | -           |             |      |
| o Loop 2              |                  | 7.30  |                       | 40.000          | 1                    | -       |                 | no        |         | -         | -           |             |      |
| o Loop 3              |                  | 7.30  |                       | 90.000          | 2                    |         |                 | no        |         | -         | -           |             |      |
| + sin_or_cos_double_s | l -I             | 0.25  | 55                    | 550.000         | -                    | 55      |                 | no        | 8 (2%)  | 54 (24%)  | 5933 (5%)   | 6480 (12%)  |      |
| o Loop 1              |                  | 7.30  |                       | 90.000          | 3                    |         |                 | no        |         | -         | -           |             |      |
| o Loop 2              | I -I             | 7.30  |                       | 40.000          | 1                    | -       |                 | no        |         | -         | -           |             |      |
| o Loop 3              |                  | 7.30  |                       | 90.000          | 2                    |         |                 | no        |         | -         | -           |             |      |
| o WRITE_LOOP          | -                | 7.30  | 512                   | 5.120e+03       | 2                    | -       | 256             | no        |         | -         | -           |             |      |

### Lookup table

| Modules<br>& Loops | Issue <br>  Type | Slack | Latency  <br>(cycles) | Latency  <br>(ns) | Iteration<br>Latency |         | Trip  <br>Count | <br> Pipelined | BRAM   | DSP     | FF        | LUT       | URAM |
|--------------------|------------------|-------|-----------------------|-------------------|----------------------|---------|-----------------|----------------|--------|---------|-----------|-----------|------|
| + dft              | - 1              | 0.04  | 1050115               | 1.050e+07         |                      | 1050116 |                 | no             | 6 (2%) | 16 (7%) | 1500 (1%) | 2509 (4%) |      |
| o READ_LOOP        | -                | 7.30  | 512                   | 5.120e+03         | 2                    |         | 256             | no             |        |         |           |           |      |
| o DFT_OUTER_LOOP   | -1               | 7.30  | 1049088               | 1.049e+07         | 4098                 |         | 256             | no             |        |         | -         |           |      |
| O DFT_INNER_LOOP   | -                | 7.30  | 4096                  | 4.096e+04         | 16                   |         | 256             | no             |        |         | -         |           |      |
| O WRITE_LOOP       | -1               | 7.30  | 512                   | 5.120e+03         |                      |         | 256             | no             |        |         |           |           | -    |

From the interval, we can calculate the throughput.

Baseline:  $1 \div (7015172 \times 10ns) = 14.25$ 

Lookup table:  $1 \div (1050116 \times 10ns) = 95.22$ 

### Resource allocation

| Baseline                                                        |                |                               |                              |                               |           | Lookup table                                                                | )                      |                              |           |                                      |                                             |
|-----------------------------------------------------------------|----------------|-------------------------------|------------------------------|-------------------------------|-----------|-----------------------------------------------------------------------------|------------------------|------------------------------|-----------|--------------------------------------|---------------------------------------------|
| + Name<br>+ DSP<br> Expression<br> FIFO<br> Instance<br> Memory | +              | DSP  <br>+<br>- <br>- <br>158 | FF  <br>+<br>0 <br><br>14494 | LUT<br>-<br>102<br>-<br>17586 | URAM <br> | Name<br>  Name<br>  DSP<br>  Expression<br>  FIFO<br>  Instance<br>  Memory | ++<br>  BRAM_18K <br>+ | DSP  <br>+<br>- <br>- <br>16 | FF  <br>  | LUT  <br>-  <br>115  <br>-  <br>2064 | URAM <br>  URAM <br> +<br>  - <br> - <br> - |
| Multiplexer<br> Register                                        | i -i<br>-i     |                               | -  <br>1058                  | 568<br>-                      |           | Multiplexer<br> Register                                                    | - <br> -               | -<br>-                       | - <br>518 | 33 <b>0</b><br>-                     |                                             |
| Total                                                           | 20             | 158                           | 15552                        | 18256                         | 0         | Total                                                                       | <br>  6                | 16                           | 1500      | 2509                                 | 0                                           |
| Available                                                       | 280            | 220                           | 106400                       | 53200                         | 0         | Available                                                                   | 280                    | 220                          | 106400    | 53200                                | 0                                           |
| <br> Utilization (%)<br>+                                       | -<br>  7 <br>+ | 71 <br>+                      | 14 <br>+                     | 34                            | 0         | Utilization (%)                                                             | 2 <br>                 | 7                            | 1         | 4                                    | 0                                           |

From the above result, we can see that the throughput is greatly increased. Besides, the resource utilization also decreases a lot. Therefore, lookup table is very helpful for DFT.

# What happens to the table lookup when you change the size of your DFT?

The size of lookup table is the same as the size of DFT.

### 2. IOArray

### Original

```
void dft(DTYPE real_sample[SIZE], DTYPE imag_sample[SIZE])
{
    //Write your code here
    DTYPE real_in[SIZE], imag_in[SIZE];
    DTYPE real_out[SIZE], imag_out[SIZE];
    read(real_sample, imag_sample, real_in, imag_in);
    dft_compute(real_in, imag_in, real_out, imag_out);
    write(real_out, imag_out, real_sample, imag_sample);
    return;
}
```

#### New

```
void dft(DTYPE real_sample[SIZE], DTYPE imag_sample[SIZE], DTYPE Real_freq[SIZE], DTYPE Imag_freq[SIZE])

{
    //Write your code here
    DTYPE real_in[SIZE], imag_in[SIZE];
    DTYPE real_out[SIZE], imag_out[SIZE];
    read(real_sample, imag_sample, real_in, imag_in);
    dft_compute(real_in, imag_in, real_out, imag_out);
    write(real_out, imag_out, Real_freq, Imag_freq);
    return;
}
```

The function arguments are modified in this way.

## How does it change the performance?

| Modules<br>  & Loops                         | Issue<br>  Type |                        | Latency  <br>(cycles)       | Latency<br>(ns)                     | Iteration <br>Latency |                     | Trip<br>Count       | Pipelined        | BRAM             | DSP                |                     | LUT                 | URAM |
|----------------------------------------------|-----------------|------------------------|-----------------------------|-------------------------------------|-----------------------|---------------------|---------------------|------------------|------------------|--------------------|---------------------|---------------------|------|
| + dft<br>  o READ_LOOP<br>  o DFT_OUTER_LOOP | -<br>  -        | 0.04 <br>7.30 <br>7.30 | 1050115 <br>512 <br>1049088 | 1.050e+07<br>5.120e+03<br>1.049e+07 | 2<br>4098             | 1050116 <br>- <br>- | -  <br>256  <br>256 | no <br>no <br>no | 6 (2%)<br>-<br>- | 16 (7%) <br>-<br>- | 1500 (1%)<br>-<br>- | 2481 (4%)<br>-<br>- | -    |
| o DFT_INNER_LOOP O WRITE_LOOP                | - <br>  -       | 7.30 <br>  7.30 <br>   | 4096 <br>512                | 4.096e+04<br>5.120e+03              |                       | - <br>-             | 256  <br>256        | no <br>no        | -<br>-           | :                  |                     | -                   |      |

The throughput is  $1 \div (1050116 \times 10ns) = 95.22$ . Compared to the throughput of the last question, 95.22, it doesn't change. It is reasonable since the difference between them is where to output the new array.

# How does the resource usage change?

| Lookup table                                                   | 9                                                |                                     |                          |                                                 |                                               | IOArray |                                                  |                                          |                            |                                                 |                                          |
|----------------------------------------------------------------|--------------------------------------------------|-------------------------------------|--------------------------|-------------------------------------------------|-----------------------------------------------|---------|--------------------------------------------------|------------------------------------------|----------------------------|-------------------------------------------------|------------------------------------------|
| Name  DSP Expression FIFO Instance Memory Multiplexer Register | BRAM_18K <br>  - <br>  - <br>  - <br>  - <br>  6 | DSP  <br>- <br>- <br>- <br>16 <br>- | FF  <br>                 | LUT  <br>- <br>115 <br>- <br>2064 <br>0 <br>330 | URAM <br>+<br>- <br>- <br>- <br>- <br>0 <br>- | Name  + | BRAM_18K <br>  - <br>  - <br>  - <br>  - <br>  6 | DSP  <br>+<br>- <br>- <br>16 <br>- <br>- | FF  <br>                   | LUT  <br>- <br>115 <br>- <br>2064 <br>0 <br>302 | URAM <br>- <br>- <br>- <br>- <br>0 <br>- |
| +                                                              | 6 <br>  6 <br>                                   | 220 <br>7                           | 1500 <br>1500 <br>106400 | 2509 <br>  2509<br>  53200<br>  4               | <br>0 <br>                                    | +       | ++<br>  6 <br>++<br>  280 <br>++<br>  2          | 16 <br>+<br>220 <br>+<br>7               | 1500 <br>+<br>106400 <br>+ | 2481 <br>+<br>53200 <br>+                       | 0                                        |

From the above tables, we can see that the LUT of multiplexer becomes smaller. Take a look at their tables.



We can see that real\_sample\_address0 and imag\_sample\_address0 are gone. I think that it is because we write output to another array in this case, we don't need to access the address based on the read/write state.

### 3. Loop optimization

To apply loop unroll and array partition, I exchange the outer loop and inner loop so that the inner loop can write to different element if unrolled.



### What is the relationship between array partitioning and loop unrolling?

When we use loop unrolling, we have a chance to access a block of data that can only be accessed one by one. If we make use of array partitioning, we can access those data at the same time, and increase the throughput.

Plot the performance in terms of number of DFT operations per second (throughput) versus the unroll and array partitioning factor. Plot the same trend for resources (showing LUTs, FFs, DSP blocks, BRAMs).





# Resource usage

# Array partitioning

| factor | resource        |       |                 |     |        |      |      |
|--------|-----------------|-------|-----------------|-----|--------|------|------|
| 2      |                 |       |                 |     |        |      |      |
| 2      | Name            | BRAM  | _18K            | DSP | FF     | LUT  | URAM |
|        | DSP             | -     |                 | -   | -      | -    | -    |
|        | Expression      | -     |                 | -   | 0      | 140  | -    |
|        | FIFO            | -     |                 | -   | -      | -    | -    |
|        | Instance        | -     |                 | 8   | 491    | 1068 | -    |
|        | Memory          |       | 8               | -   | 0      |      |      |
|        | Multiplexer     | -     |                 | -   | -      | 485  | -    |
|        | Register        | -     |                 | -   | 541    |      | -    |
|        | Total           |       |                 | 8   | 1032   | 1693 | 0    |
|        | Available       |       | 280             | 220 | 106400 |      |      |
|        | Utilization (%) | )     | 2               | 3   | ~0     | 3    | 0    |
|        |                 |       |                 |     |        |      |      |
| 4      | Name            | BRAM_ | 18K             | DSP | FF     | LUT  | URAM |
|        | DSP             | -     | $\neg \uparrow$ | -   | -      | -    | -    |
|        | Expression      | -     | $\neg \uparrow$ | -   | 0      | 140  | -    |
|        | FIFO            | -     | $\neg \uparrow$ | -   | -      | -    | -    |
|        | Instance        | -     | $\neg \dagger$  | 16  | 982    | 2144 | -    |
|        | Memory          |       | 12              | -   | 0      | 0    | 0    |
|        | Multipléxer     | -     |                 | -   | -      | 585  | -    |
|        | Register        | -     | $\neg \uparrow$ | -   | 637    | -    | -    |
|        | Total           |       | 12              | 16  | 1619   | 2869 | 0    |
|        | Available       |       | 280             | 220 | 106400 |      | 0    |
|        | Utilization (%) |       | 4               | 7   | 1      | 5    | 0    |
|        |                 |       |                 |     |        |      |      |
| 8      | Name            | BRAM_ | 18K             | DSP | FF     | LUT  | URAM |
|        | DSP             | -     |                 | -   | -      | -    | -    |
|        | Expression      | -     |                 | -   | 0      | 140  | -    |
|        | FIFO            | -     |                 | -   | -      | -    | -    |
|        | Instance        | -     |                 | 16  | 982    | 2232 | -    |
|        | Memory          |       | 20              | -   | 0      | 0    | 0    |
|        | Multiplexer     | -     |                 | -   | -      | 901  | -    |
|        | Register        | -     |                 | -   | 680    | -    | -    |
|        | Total           |       | 20              | 16  | 1662   | 3273 | 0    |
|        | Available       |       | 280             | 220 | 106400 |      | 0    |
|        | Utilization (%) |       | 7               | 7   | 1      | 6    | 0    |
|        |                 |       |                 |     |        |      |      |

# Loop unrolling

| factor | resource        |          |     |        |       |      |  |
|--------|-----------------|----------|-----|--------|-------|------|--|
| 2      |                 |          |     |        |       |      |  |
| _      | Name            | BRAM_18K | DSP | FF     | LUT   | URAM |  |
|        | DSP             | -        | -   | -      | -     | -    |  |
|        | Expression      | -        | -   | 0      | 137   | -    |  |
|        | FIFO            | -        | -   | -      | -     | -    |  |
|        | Instance        | -        | 32  | 1964   | 4169  | -    |  |
|        | Memory          | 8        | -   | 0      | 0     | 0    |  |
|        | Multiplexer     | -        | -   | -      | 442   | -    |  |
|        | Register        | -        | -   | 865    | -     | -    |  |
|        | Total           | 8        | 32  | 2829   | 4748  | 0    |  |
|        | Available       | 280      | 220 | 106400 | 53200 | 0    |  |
|        | Utilization (%) | 2        | 14  | 2      | 8     | 0    |  |
|        |                 |          |     |        |       |      |  |

| 4 |                                                               |                      |                  |                                               |                           |                                                      |                     |
|---|---------------------------------------------------------------|----------------------|------------------|-----------------------------------------------|---------------------------|------------------------------------------------------|---------------------|
| 4 | Name                                                          | BRAM.                | _18K             | DSP                                           | FF                        | LUT                                                  | URAM                |
|   | DSP                                                           | -                    |                  | -                                             | -                         | -                                                    | -                   |
|   | Expression                                                    | -                    |                  | -                                             | 0                         | 50                                                   | -                   |
|   | FIFO                                                          | -                    |                  | -                                             | -                         | -                                                    | -                   |
|   | Instance                                                      |                      | 2                | 32                                            | 3464                      | 4978                                                 | -                   |
|   | Memory                                                        |                      | 6                | -                                             | 0                         | 0                                                    | 0                   |
|   | Multiplexer                                                   | -                    |                  | -                                             | -                         | 221                                                  | -                   |
|   | Register                                                      | -                    |                  | -                                             | 43                        | -                                                    | -                   |
|   | Total                                                         |                      |                  | 32                                            | 3507                      |                                                      |                     |
|   | Available                                                     |                      | 280              | 220                                           | 106400                    | 53200                                                |                     |
|   |                                                               |                      |                  |                                               |                           |                                                      |                     |
|   | Utilization (%)                                               |                      | 2                | 14                                            | 3                         | 9                                                    | 0                   |
|   |                                                               |                      | 2                | 14                                            | 3                         | 9                                                    | 0                   |
| 0 |                                                               |                      | 2                | 14                                            | 3                         | 9                                                    | 0                   |
| 8 |                                                               | BRAM_                |                  |                                               | FF SF                     |                                                      | URAM URAM           |
| 8 | Utilization (%)                                               | BRAM_                |                  |                                               |                           |                                                      |                     |
| 8 | Utilization (%)  Name                                         | BRAM_<br>-<br>-      |                  | DSP                                           |                           |                                                      |                     |
| 8 | Utilization (%)  Name DSP                                     | BRAM_<br>-<br>-<br>- |                  | DSP                                           | FF<br>-                   | LUT<br>-                                             |                     |
| 8 | Utilization (%)  Name DSP Expression                          | BRAM_<br>-<br>-<br>- |                  | DSP                                           | FF<br>-                   | LUT<br>-                                             |                     |
| 8 | Name DSP Expression FIFO Instance Memory                      | BRAM_<br>-<br>-<br>- | _<br>_18K        | DSP<br>-<br>-                                 | FF<br>-<br>0              | LUT<br>-<br>50                                       | URAM<br>-<br>-      |
| 8 | Name DSP Expression FIFO Instance                             | BRAM_<br>-<br>-<br>- | _18K<br>_2       | DSP<br>-<br>-<br>-<br>-<br>32                 | FF<br>-<br>0<br>-<br>4810 | LUT<br>-<br>50<br>-<br>5480                          | URAM<br>-<br>-<br>- |
| 8 | Name DSP Expression FIFO Instance Memory                      | BRAM                 | _18K<br>_2       | DSP<br>-<br>-<br>-<br>32                      | FF<br>-<br>0<br>-<br>4810 | LUT<br>-<br>50<br>-<br>5480<br>0                     | URAM<br>-<br>-<br>- |
| 8 | Name DSP Expression FIFO Instance Memory Multiplexer          | -                    | _18K<br>_2       | DSP<br>-<br>-<br>-<br>32                      | FF - 0 - 4810 0 -         | LUT<br>-<br>50<br>-<br>5480<br>0<br>221              | URAM 0              |
| 8 | Name DSP Expression FIFO Instance Memory Multiplexer Register | -                    | _18K<br>_2<br>_6 | DSP<br>-<br>-<br>-<br>32<br>-<br>-<br>-<br>32 | FF - 0 - 4810 0 - 43      | LUT<br>-<br>50<br>-<br>5480<br>0<br>221<br>-<br>5751 | URAM                |

# Array partitioning and loop unrolling

| factor | resource              |          |                     |         |           |               |          |
|--------|-----------------------|----------|---------------------|---------|-----------|---------------|----------|
| 2      |                       |          |                     |         |           |               |          |
| 2      | Name                  | BRAM_    | _18K                | DSP     | FF        | LUT           | URAM     |
|        | DSP                   | -        |                     | -       | -         | -             | -        |
|        | Expression            | -        |                     | -       | 0         | 129           | -        |
|        | FIFO                  | -        |                     | -       | -         | -             | -        |
|        | Instance              | -        |                     | 32      | 1964      | 4187          | -        |
|        | Memory                |          | 8                   | -       | 0         | 0             | 0        |
|        | Multiplexer           | -        |                     | -       | -         | 494           | -        |
|        | Register              | -        |                     | -       | 928       |               | -        |
|        | Total                 |          | 8                   | 32      |           |               | 0        |
|        | Available             |          | 280                 |         | 106400    |               | 0        |
|        | Utilization (%)       |          | 2                   | 14      | 2         | 9             | 0        |
|        |                       |          |                     |         |           |               |          |
| 4      | Name                  |          | 1016                | DCB     |           | LUT           | LIDARA   |
| •      | Name                  | BRAM_    | _18K                | DSP     | FF        | LUT           | URAM     |
|        | DSP                   |          |                     | -       | -         |               |          |
|        | Expression<br>FIFO    | -        |                     |         | 0         | 50            |          |
|        |                       | -        | 12                  | -<br>64 | -<br>5417 | -<br>9138     |          |
|        | Instance              |          | 10                  | - 04    | 0         | 9136          | - 0      |
|        | Memory<br>Multiplexer |          | 10                  | -       | 0         | 412           | ——"      |
|        | Register              |          |                     | -       | 110       | 412           |          |
|        | Total                 | _        | 22                  | 64      | 5527      | 9600          | 0        |
|        | Available             | _        |                     |         | 106400    |               | <u> </u> |
|        | Utilization (%)       | _        | 7                   | 29      | 5         | 18            | <u> </u> |
|        | 0 01120 0011 (70)     |          | - '                 |         |           | 10            |          |
|        |                       |          |                     |         |           |               |          |
| 8      | Name                  | BRAM     | 19K                 | nsp     | FF        | LUT           | URAM     |
|        | DSP                   | DIXAIVI. | _101                | -       | - ' '     | -             | -        |
|        | Expression            | -        |                     | _       | 0         | 50            |          |
|        | FIFO                  | _        |                     | -       |           |               |          |
|        | Instance              |          | 24                  | 112     | 971.8     | 16036         |          |
|        | Memory                |          | _ <del></del><br>18 | -       | 00        | 0             | 0        |
|        | Multiplexer           | _        |                     | -       | -         | 708           | - 1      |
|        | Register              | -        |                     | -       | 111       | -             | -        |
|        | Total                 |          | 42                  | 112     |           | 16794         | 0        |
|        |                       |          |                     |         |           |               |          |
|        | Available             |          | 280                 | 220     | 106400    | 5320 <u>0</u> | 0        |

First, I want to discuss the result of array partitioning. First, let's take a look at the performance tables.

| factor | performance table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| No     | Modules & Loops         Issue Type         Violation Type         Distance         Slack         Latency(cycles)         Latency(ns)         Iteration Latency           4                                                                                                                                                                                                                                                                                                                                                                               |
| 2      | Modules & Loops         Issue Type         Violation Type         Distance         Slack         Latency(cycles)         Latency(ns)         Iteration Latency           4                                                                                                                                                                                                                                                                                                                                                                               |
| 4      | Modules & Loops         Issue Type         Violation Type         Distance         Slack         Latency(cycles)         Latency(ns)         Iteration Latency           ■ odft         -         1116420         1.116E7         -         1           S READ_LOOP         -         512         5.120E3         2           S DFT_INIT_LOOP         -         256         2.560E3         1           P S DFT_OUTER_LOOP         -         1114380         1.115E7         4355           S WRITE_LOOP         -         768         7.680E3         3 |

There are two things I want to discuss.

The first thing is that the iteration latency of WRITE\_LOOP. We can see that after array partitioning is set, the iteration increases from 2 to 3.



From the schedule viewer, we can see that when there is no array partitioning, write is executed as soon as the output data is prepared. However, in the result of factor 2, there is a MUX such that the write operation cannot be done in the

same cycle. Therefore, 3 cycles are required.

The second thing is that the latency of factor 2 is twice larger than that of factor 4.



From the schedule viewers, we can see that in the inner loop, the result of factor 2 calculate the multiplication of Real\_freq and Imag\_freq sequentially, and the result of factor 4 calculate at the same time. Therefore, the latency of the result of factor 2 is almost twice larger than that of factor 4. However, I have no idea why it operates in this way.

Now, I want to discuss the performance of "loop unrolling" and "array partitioning and loop unrolling". We can see that the "loop unrolling" result of factor 2 is a bit better than the "array partitioning and loop unrolling" result. However, it is not the case in factor 4 and 8. Let's take a look at the performance tables of factor 2 and 4.

Factor 2:



From this table, we can see that the only difference is the WRITE\_LOOP. The latency of DFT\_OUTER\_LOOP is the same. I guess it is because the memory is 2-port, the data can be written at the same time.

Factor 4:



From this table, we can see that the decrease in dft\_compute is larger than the increase in WRITE\_LOOP. Therefore, the overall latency is decreased. To know where the improvement comes from, let's take a look at the schedule viewer of factor 8.

The schedule viewer of loop unrolling.



The schedule viewer of array partitioning and loop unrolling.



We can see that write operation with array partitioning is done in parallel, so the latency is reduced.

The general trend is that if we use larger factor, the throughput will increase, and the resource usage will increase as well.

### 4. Dataflow



I apply the dataflow pragma to my top function. The top function includes read, dft\_compute, and write functions. I encountered a problem when doing cosimulation on the synthesized result without pipeline, so I enable pipeline at first. Now, let's compare the performance with and without dataflow.



We can see that although the latency of 3 submodules is the same, the total latency is different. Besides, the interval is smaller.

| Throughput | without dataflow | 11399.91 |
|------------|------------------|----------|
| Throughput | with dataflow    | 12121.21 |

Let's take a look at the resource utilization.



We can see that with dataflow, we will need more BRAM.

I also have tried to change my code into more submodules, but end up less efficient.

### 5. Best architecture

The best architecture of mine is the one with smallest interval, that is the result with dataflow in the previous question. The optimization I used is shown in the figure below.



The tradeoffs are the resource utilization. Let's compare the performance table and resource utilization of the original architecture (lookup table) and the last architecture.



# Throughput

| original | $1 \div 10501160ns = 95.23$ |
|----------|-----------------------------|
| last     | $1 \div 82500ns = 12121.21$ |

| original               |                  |     |           |          | last       |                 |          |     |        |       |      |
|------------------------|------------------|-----|-----------|----------|------------|-----------------|----------|-----|--------|-------|------|
| +<br>  Name            | ++<br>  BRAM_18K | DSP | +<br>FF   | LUT      | URAM       | Name            | BRAM_18K | DSP | FF     | LUT   | URAM |
| +                      | ++<br>           |     |           |          |            | DSP             | -        | -   | -      | -     | -    |
| Expression             | <br> -           | -   | 9         | -<br>115 |            | Expression      | -        | -   | 0      | 82    | -    |
| FIFO                   | i -i             |     |           | - j      |            | FIFO            | -        | -   | -      | -     | -    |
| Instance               | -                | 16  | 982       | 2064     | -  <br>  0 | Instance        | 28       | 160 | 15883  | 25030 | -    |
| Memory<br> Multiplexer | 6 <br>  -        | -   | 0 <br>-   | 0<br>330 |            | Memory          | 66       | -   | 0      | 0     | 0    |
| Register               | i -i             |     | 518       | - j      |            | Multiplexer     | -        | -   | -      | 162   | -    |
| Total                  | ++<br>  6        | 16  | +<br>1500 | 2509     | +<br>  0   | Register        | -        | -   | 18     | -     | -    |
| ÷                      |                  | +   | +         |          |            | Total           | 94       | 160 | 15901  | 25274 | 0    |
| Available              | 280              | 220 | 106400    | 53200    | 0          | Available       | 280      | 220 | 106400 | 53200 | 0    |
| Utilization (%)        |                  | 7   | 1         | 4        | 0          | Utilization (%) | 33       | 72  | 14     | 47    | 0    |
| +                      | ++               | +   | +         |          | ++         |                 |          |     | ·      |       |      |

We can see that the throughput increases a lot, but the resource utilization also increases about 10 times of the original usage.

6. Streaming interface synthesis

Describe the major changes that you made to your code to implement the streaming interface. What benefits does the streaming interface provide? What are the drawbacks?

```
void dff(hls::streamcDTYPE> &real_sample, hls::streamcDTYPE> &imag_sample, hls::streamcDTYPE> &Real_freq, hls::streamcDTYPE> &Imag_freq)
{
    //Write your code here
    int k = 0;
    int n = 0;
    DTYPE Real[SIZE];
    DTYPE Imag[SIZE];
    DTYPE Imag[SIZE];
    DTYPE imag[SIZE];

DTYPE imag[SIZE];

DTI_INIT_LOOP:
    for (k = 0; k < SIZE; k++) {
        Real[k] = 0;
        real[k] = real_sample.read();
        imag[k] = imag_sample.read();
    }
}

DFI_OUTER_LOOP:
    for (k = 0; k < SIZE; h++) {
        Real[k] += real[n] * cos_coefficients_table[n*|XSIZE] - imag[n] * sin_coefficients_table[n*|XSIZE];
        Imag[k] += imag[n] * cos_coefficients_table[n*|XSIZE] + real[n] * sin_coefficients_table[n*|XSIZE];
    }
}

DFI_OUTPUT_LOOP:
    for (k = 0; k < SIZE; k++) {
        Real_freq << Real[k];
        Imag_freq << Imag[k];
    }

    return;
}</pre>
```

I use 4 arrays to store the input streams and output streams. We must finish calculating a value before writing it into stream, so I use arrays to store the temporary value. real and imag must be loaded first. Otherwise, they need to be read in the DFT\_OUTER\_LOOP, and we cannot perform pipeline to lower the latency. Does it benefit? Let's take a look at the performance table.



We can see that the total latency decreases. However, the interval increases. Since we can't start a new computation when output the values.

| without stream  |          |     |        |       | with stream |                 |          |     |        |       |      |
|-----------------|----------|-----|--------|-------|-------------|-----------------|----------|-----|--------|-------|------|
|                 |          |     |        |       |             |                 |          |     |        |       |      |
| Name            | BRAM_18K | DSP | FF     | LUT   | URAM        | Name            | BRAM_18K | DSP | FF     | LUT   | URAM |
| DSP             | -        | -   | -      | -     | -           | DSP             | -        | -   | -      | -     | -    |
| Expression      | -        | -   | 0      | 82    | -           | Expression      | -        | -   | 0      | 70    | -    |
| FIFO            | -        | -   | -      | -     | -           | FIFO            | -        | -   | -      | -     | - 1  |
| Instance        | 28       | 160 | 15883  | 25030 | -           | Instance        | 30       | 160 | 15784  | 25118 | 0    |
| Memory          | 66       | -   | 0      | 0     | 0           | Memory          | 64       | -   | 0      | 0     | 0    |
| Multiplexer     | -        | -   | -      | 162   | -           | Multiplexer     | -        | -   | -      | 144   | -    |
| Register        | -        | -   | 18     | -     | -           | Register        | -        | -   | 16     | -     | -    |
| Total           | 94       | 160 | 15901  | 25274 | 0           | Total           | 94       | 160 | 15800  | 25332 | 0    |
| Available       | 280      | 220 | 106400 | 53200 | 0           | Available       | 280      | 220 | 106400 | 53200 | 0    |
| Utilization (%) | 33       | 72  | 14     | 47    | 0           | Utilization (%) | 33       | 72  | 14     | 47    | 0    |
|                 |          |     |        |       |             |                 |          |     |        |       |      |

We can see that it uses less FF but more LUT.

The figure below is the co-simulation report



### Explain what you observed and learned

In this lab, I learned to optimize a design in several ways, including lookup table, array partition, unroll, pipeline, and dataflow. I also learned how to transmit float data through streams.

# Explain what problem you encountered and how you solved it

PYNQ Demo failed.

In the PYNQ Demo section, we are required to implement the result of streaming interface in the last question, and we need to change the streaming data type and the problem size increases from 256 to 1024. The reason of changing data type is that to send a stream of elements, we need to use specific structures for the elements. The structures are shown in the figure below.

```
template<int D,int U,int TI,int TD>
struct ap_axis {
   ap_int<D>
                data;
   ap uint<D/8> keep;
   ap_uint<D/8> strb;
   ap_uint<U> user;
   ap_uint<1> last;
   ap_uint<TI> id;
   ap_uint<TD> dest;
};
template<int D,int U,int TI,int TD>
struct ap_axiu {
   ap_uint<D> data;
   ap_uint<D/8> keep;
   ap_uint<D/8> strb;
   ap_uint<U> user;
   ap_uint<1>
                last;
   ap_uint<TI> id;
   ap_uint<TD> dest;
```

In this way, the stream signals can be synthesized as shown in the figure below.

| ▼ AXIS      |               |       |       |       |        |       |        |  |
|-------------|---------------|-------|-------|-------|--------|-------|--------|--|
| Interface   | Register Mode | TDATA | TKEEP | TLAST | TREADY | TSTRB | TVALID |  |
| lmag_freq   | both          | 32    | 4     | 1     | 1      | 4     | 1      |  |
| Real_freq   | both          | 32    | 4     | 1     | 1      | 4     | 1      |  |
| imag_sample | both          | 32    | 4     | 1     | 1      | 4     | 1      |  |
| real_sample | both          | 32    | 4     | 1     | 1      | 4     | 1      |  |
|             |               |       |       |       |        |       |        |  |

It contains necessary signals to use DMA. TLAST is a key signal that tells if the stream finishes sending or receiving elements.

However, in this lab, we need to transfer float data rather than integer through the stream, so the tutorial asks us to construct a structure as shown in the figure below.

```
struct DTYPE
{
    float data;
    ap_int<1> last;
};
```

"last" of the last output element should be set to 1, others are set to 0. After synthesis, the stream signals are shown in the figure below.

| ▼ AXIS      |               |       |        |        |  |
|-------------|---------------|-------|--------|--------|--|
| Interface   | Register Mode | TDATA | TREADY | TVALID |  |
| lmag_freq   | both          | 64    | 1      | 1      |  |
| Real_freq   | both          | 64    | 1      | 1      |  |
| imag_sample | both          | 64    | 1      | 1      |  |
| real_sample | both          | 64    | 1      | 1      |  |
|             |               |       |        |        |  |

We can see that even though we have added "last" in the struct, TLAST is still gone. What would happen if TLAST is gone? After we export IP, add it in the block diagram of Vivado, and then create wrapper, it will raise critical warning that TLAST in the DMA is not connected. If we run the bit file on Jupyter, the stream will get stuck there, since it doesn't know if the stream is done.

After I tried several methods and searched on the internet, I found nothing helpful. In the end, when I was writing this report, I found that the size of TDATA in the above figure becomes 64, which means that the "last" is inside it. Therefore, when I connect the block diagram in Vivado, I split the TDATA[64:0] into TDATA[32] and TDATA[31:0], corresponding to "last" and "data" respectively (variables in a structure are arranged from lower bit field to higher bit filed). Part of the block diagram is shown below.



When we output the result from dft\_0 to axi\_dma\_0, the result goes through xlclice\_1 and xlslice\_0, and send to tdata and tlast. Ready and valid signals are also connected by hand. In this way, we can get the correct bit file. The demo result is shown in the figures below.

### **Verifying Functionality**

```
In [7]: golden_op=np.fft.fft(a)
           for i in range(NUM_SAMPLES):
                 real_error[i]="{0:.6f}".format(abs(out_r[i]-golden_op.real[i]))
imag_error[i]="{0:.6f}".format(abs(out_i[i]-golden_op.imag[i]))
In [8]: sum_sq_real=0
           sum_sq_imag=0
           for i in range(NUM_SAMPLES):
                sum_sq_real =sum_sq_real+(real_error[i]*real_error[i])
real_rmse = np.sqrt(sum_sq_real / (i+1))
sum_sq_imag =sum_sq_imag+(imag_error[i]*imag_error[i])
                 imag_rmse = np.sqrt(sum_sq_imag / (i+1))
nt("Real Part RMSE: ", real_rmse, "Imaginary Part RMSE:", imag_rmse)
           print("Real Part RMSE:
           if real_rmse<0.001 and imag_rmse<0.001:
           print("PASS")
else:
                print("FAIL")
           Real Part RMSE: 1.2013298295534854e-05 Imaginary Part RMSE: 4.895988472719265e-06
           PASS
                                  Real Part Error
                                                                                                     Imaginary Part Error
      0.00012
                                                                           0.00005
      0.00010
                                                                           0.00004
      0.00008
                                                                         는 0.00003
     0.00006
                                                                           0.00002
      0.00004
                                                                           0.00001
      0.00002
      0.00000
                                                                           0.00000
                          200
                                    400
                                              600
                                                       800
                                                                 1000
                                                                                                         400
                                                                                                                   600
                                                                                                                             800
                                                                                                                                      1000
                                                                                                              Index
                             1024-DFT
                                                                                  1024-FFT -Numpy
                                                   real
      500
                                                                 500
                                                                                                              real
                                                   imag
                                                                                                              imag
  DFT real and imaginary data
                                                              and imaginary data
      400
                                                                 400
                                                                 300
      300
      200
                                                                 200
                                                             FFT real
      100
                                                                 100
         0
                                                                    0
                                                                           -0.4
                -0.4
                         -0.2
                                  0.0
                                          0.2
                                                   0.4
                                                                                   -0.2
                                                                                             0.0
                                                                                                     0.2
                                                                                                              0.4
                              Frequency
                                                                                        Frequency
```

Another problem is that the python code they provide is too old, the way to prepare IO stream has changed. The figures below show the original code and the new code.

### Original code

## **Submit your work to Github**

https://github.com/JasonYanggg/HLS-Lab-DFT