

# Project 2: 1D Convolution Engine

Issued: 10/4/21; Due 10/27/21, 4:25PM

### Introduction

In this project, you will utilize the multiply-and-accumulate (MAC) unit you designed in Project 1 along with memories and control logic to build a system that performs a one-dimensional convolution.

In part 1 [40 points], you will construct a system that takes as input two vectors and convolves them, outputting the result. In part 2 [15 points], you will extend this system to work with larger vectors. Then, in part 3 [35 points], you will modify your system to make it as fast as you can. In addition to the 90 points described above, 10 points will be awarded based on the quality of your code, comments, and report.

The objective of this project is to give you more experience creating designs in SystemVerilog, debugging them, and evaluating them through synthesis. Further, you will gain insight into how changes you make to a system will affect its costs and performance.

#### You will turn in:

- your documented code for all three parts
- clearly labeled synthesis reports for each time you are asked to synthesize a design
- a report that answers specific questions asked throughout this assignment

We will run additional simulations on the code you turn in, so it is very important to:

- 1. Make sure the timing of all signals in your designs matches the specifications in this handout
- 2. Carefully label and document your code
- 3. Create separate subdirectories for each part of the project (part1, part2, part3)
- 4. Include a README file in each directory giving a description of each file, and the exact commands you are using to run simulation and (where appropriate) synthesis.

Your project will be evaluated on the correctness and efficiency of your designs and your answers to the questions in the report.

You will continue to work with your partner from Project 1 (or continue to work alone). You may not share code with others (except your partner). This means you may not allow others to see your code, nor may you read others' code (for this or related projects). All code will be run through an automatic code comparison tool. Plagiarism will result in a score of zero on the assignment for all involved parties. If you have questions as to what is acceptable, please come to office hours or send Prof. Milder email to ask for clarification.

If you have general questions about the project, please post them Piazza.

## **Background**

#### Convolution

Convolution is a mathematical operation that takes as input two signals and combines their values to produce a third signal. Assume you are given two discrete signals x[n] and h[n]. Let's call x[n] our "input" and h[n] our "filter." We can define their convolution y[n] as:

$$y[n] = x[n] * h[n] = \sum_{k=-\infty}^{\infty} x[k]h[n-k]$$

You probably have seen an expression like this in a signal processing class. Here, we will make some reasonable simplifications:

- 1. We will assume that the input and filter signals are finite (that is, they have a finite number of non-zero values).
- 2. We will only calculate the values of *y* that correspond to locations where *x* and *h* fully overlap.
- 3. We will assume that we already have a version of h[n] stored in reversed order, which we will call f[n] (so therefore f[n] = h[-n]).

Let's assume that x[n] is of length N, f[n] is of length M, and that N>M. Let L = N-M+1. So, we can now write their convolution as:

$$y[n] = \sum_{k=0}^{M-1} x[n+k]f[k], \quad n = 0, \dots, L-1$$
 (1)

Or, in pseudocode:

for 
$$n = 0$$
 to L-1:  
 $y[n] = 0$   
for  $k = 0$  to M-1:  
 $y[n] += x[n+k]*f[k]$ 

For example, let x = [10, -20, 30, 40, -50, 60] and let f = [5, 4, 3]. Therefore N=6, M=3, and L=4. (Note that values of x and f can be positive or negative as seen here.) We would calculate our four output values:

```
y[0] = x[0]*f[0] + x[1]*f[1] + x[2]*f[2] = 10*5-20*4+30*3 = 60

y[1] = x[1]*f[0] + x[2]*f[1] + x[3]*f[2] = -20*5+30*4+40*3 = 140

y[2] = x[2]*f[0] + x[3]*f[1] + x[4]*f[2] = 30*5+40*4-50*3 = 160

y[3] = x[3]*f[0] + x[4]*f[1] + x[5]*f[2] = 40*5-50*4+60*3 = 180
```

### Memory

This project will require the use of memories. The following is the SystemVerilog description of the memory you must use. (You can also find the code at /home/home4/pmilder/ese507/proj2/memory.sv)

```
module memory(clk, data in, data out, addr, wr en);
                               WIDTH=16, SIZE=64;
    parameter
    localparam
                               LOGSIZE=$clog2(SIZE);
    input [WIDTH-1:0]
                               data in:
    output logic [WIDTH-1:0] data_out;
                               addr;
    input [LOGSIZE-1:0]
    input
                               clk, wr_en;
    logic [SIZE-1:0][WIDTH-1:0] mem;
    always_ff @(posedge clk) begin
     data_out <= mem[addr];</pre>
     if (wr_en)
          mem[addr] <= data_in;</pre>
    end
endmodule
```

There are several important things to understand about this memory:

- The memory has one read port, one write port, and one address input. All reads and writes are synchronous (that is, they will occur on a positive clock edge).
- Unlike some examples we looked at in class, this module only contains a single address input, which will be used for both reading and writing. (So, you cannot read and write to different locations of this memory at the same time.)
- The memory is parameterized by two parameters:
  - a. WIDTH, the number of bits of each word
  - b. SIZE, the number of words stored in memory
- The memory has a "local parameter" called LOGSIZE, which represents the number of address bits needed to address SIZE entries. This is automatically computed as the log base two of SIZE, rounded up.

Remember, you can overwrite these parameters when you instantiate the module. For example, if you instantiate the memory as:

```
memory #(12, 256) myMemInst(clk, din, dout, addr, wren)
```

Then you would be building a memory with 256 words, each with 12 bits.

Figure 1 demonstrates the timing of reading from the memory. On each positive clock edge, the system samples the value on addr. A short time after the edge, it will output the value in memory at location addr. In this diagram, mem[7] just represents the value stored in address 7 of the memory.



Figure 1. Timing of memory read.

Figure 2 demonstrates the timing of writing to memory. In this example, you are first writing the value 3 to address 6. Then, on the following cycle, no write is performed because the wr\_en signal is 0 on the clock edge. Then, value 4 is written to address 1 in the third cycle.



Figure 2. Timing of memory write.

Recall from class that this RTL memory module will not synthesize to SRAM—instead it will produce a memory structure out of flip-flops. If these memories were large, this would be very inefficient. However, the memories you will need in this project will be fairly small, so we will simply let the logic synthesis tool implement them using registers.

### Input/Output Streaming Protocol

For this and future projects, we will use a simplified variation on ARM's AXI4-Stream Protocol. Shown in Figure 3, this is a simple synchronous protocol that allows a sender to transfer data to a receiver when both sides "agree."



Figure 3. Streaming protocol signals

The source asserts the VALID signal when it has placed valid data on the DATA signal. The destination asserts the READY signal when it is capable of consuming that data. On any positive clock edge, data is transferred if (and only if) both the VALID and the READY signals are asserted. (No data will ever be transferred unless both are asserted.) Note that both source and destinations modules must share a common clock. We will call this collection of signals (VALID, READY, and DATA) a "stream port."

Figure 4 and the accompanying table (Table 1) illustrate the stream port's functionality and timing. Your system will utilize this stream port protocol for both input and output—your system will receive input data as two "destination" ports, and it will produce its output as a "source" of another port. (This is explained further in the following section.)



| la-#    | MALID | DEADY | Fundamentian                                |
|---------|-------|-------|---------------------------------------------|
| cycle # | VALID | READY | •                                           |
| 1       | 0     | 0     | Neither valid nor ready; nothing is         |
|         |       |       | transferred                                 |
| 2       | 1     | 0     | Source puts data on DATA signal and asserts |
|         |       |       | VALID. However, destination module is not   |
|         |       |       | READY so no data is transferred             |
| 3       | 1     | 0     | Destination module is still not READY so no |
|         |       |       | data is transferred                         |
| 4       | 1     | 1     | Destination module is now READY, so it      |
|         |       |       | receives the data word d[0].                |
| 5       | 1     | 1     |                                             |
|         |       |       | clock edge, the source module now changes   |
|         |       |       | the data to the next word. This word is     |
|         |       |       | transferred immediately.                    |
| 6       | 1     | 1     | _                                           |
|         |       |       | transferred                                 |
| 7       | 0     | 1     | Now, the source module has de-asserted      |
|         |       |       | VALID. The destination module does not read |
|         |       |       | anything (regardless of what the source     |
|         |       |       | module has placed onto DATA).               |
| 8       | 1     | 0     | The source has asserted VALID but the       |
|         | _     | · ·   | destination has de-asserted READY. Nothing  |
|         |       |       | is transferred here.                        |
| 9       | 1     | 1     | Both VALID and READY are asserted, so the   |
|         | _     | •     | destination reads d[3] from DATA.           |
|         |       |       | accumation reads a[o] nom billin            |

Table 1. Data transfer timing example

### Basic System Architecture

Figure 5 shows the top-level view of your design. In addition to inputs clk and reset, the system has three stream ports: a stream input port for X, a stream input port for f, and a stream output port for y. Recall that each stream port consists of three signals: data, valid, and ready (see "Input/Output Protocol" above). The system's data inputs are 10-bit signed values, and the data output is 23 bits, also signed.



Figure 5. Top-level design. See above for information about valid/ready signals.

After your system finishes a computation, it should be able to immediately begin taking a new set of inputs for the next computation. Remember, your stream input ports should only assert their ready signals ( $x_ready$  and  $f_ready$ ) when it is capable of taking new input values in. Similarly, when your system is outputting results, its stream output port needs to wait until the  $y_ready$  signal is asserted.

## Part 1: Small Convolution [40 points]

In Part 1, you will get started by building a system for N = 12, and M = 5, that is, the X vector will have length 12 and the f vector will have length 5. This means the output vector will have length 12-5+1=8.

Your system will use the input/output protocol described above, with its X vector inputs received on one stream input port and its f vector inputs received on the other. Once all values from both vectors are received, your system can compute and output the 8 values of y.

Since the input values are 10 bits each, we will size our multiplier's output to be 20 bits and we will make the adder's accumulation path 23 bits. We will not use saturation logic here. (Think carefully: why is 23 bits the right number for this system?)

Here you will build a datapath based on the MAC unit from Project 1, with some added memories (and without the saturation logic). Figure 6 shows a rough idea of what this system should look like. You can feel free to design your own datapath, but for Parts 1 and 2 it is required you follow this basic structure: two memories, one multiplier, and one adder. Each block labeled mem is an instantiation of the module named memory from above. (Note that you will need to determine the correct parameters for these modules, as well as the bits needed for the address lines.)



Figure 6. Datapath structure.

Note that many of the datapath's inputs are control signals. These signals (addr\_x, wr\_en\_x, addr\_w, wr\_en\_w, clear\_acc, en\_acc) will need to be generated by a control module that you will build. Your control module will the contain control logic you require. Figure 7 shows the suggested interconnections between your datapath and the control module. Feel free to add, remove, or change the control signals that your system uses. The control system is the most complex portion of the design. Think carefully about how you want the control module to behave and interact with the datapath.

I recommend you begin designing your control system by breaking it down into three cooperating control blocks: one that handles loading data into memory x, one that handles loading data into memory f, and one that controls reading data from those memories and controlling the accumulator and output signals. These three control blocks can then cooperate (e.g., when both data-loading controllers signal that input vectors x and f have been fully loaded, the "compute" controller can begin working).



Figure 7. Suggested interconnections between control unit and datapath.

For your top-level synthesizable module, use the following module name, port names, and port declarations:

```
input clk, reset, x_valid, f_valid, y_ready;
input signed [9:0] x_data, f_data;
output x_ready, f_ready, y_valid;
output signed [22:0] y_data;
```

If necessary for your system, you can replace the last lines above with:

```
output logic x_ready, f_ready, y_valid; output logic signed [22:0] y_data;
```

as needed.

Assume the reset is asserted synchronously and is asserted high.

**Example Testbench.** To help you get started, we will provide you with two testbenches: one is a very small simple testbench that you can use to check that your system's timing is correct. The other is a large random testbench that you can use to test more thoroughly. In both cases, to help simulate the fact that the x\_valid, f\_valid, and y\_ready signals can change at any time, these testbenches use a random number generator to set these values. On every cycle, the testbench will randomly decide whether or not those three input control signals are set. This randomization makes the test much more robust by automatically testing that your control logic will behave correctly for different timings of these signals.

To take advantage of the randomization, you will want to run these testbenches multiple times with different random seeds. You can set the random seed when you run VSim by typing a command like:

```
vsim -sv seed 42 [the rest of your vsim commands go here]
```

In this example, the number 42 is the random seed. If you run this again with the seed set to 42, the random numbers will behave exactly the same way each time. So, when you re-run the testbench, change the number 42 to any other number, and the random behavior will differ each time.

This testbenches are available at:

```
/home/home4/pmilder/ese507/proj2/part1/part1_simple_tb.sv
/home/home4/pmilder/ese507/proj2/part1/part1_random_tb.sv
```

To use the random testbench, you will also need to copy:

```
/home/home4/pmilder/ese507/proj2/part1/random_in_x_p1.hex
/home/home4/pmilder/ese507/proj2/part1/random_in_f_p1.hex
/home/home4/pmilder/ese507/proj2/part1/expected_out_p1.hex
```

Note: when you compile these testbenches with "vlog" you may get a warning message:

```
** Warning: part1_random_tb.sv(46): (vlog-2254)
SystemVerilog testbench feature (randomization or coverage)
```

detected in the design. These features are only supported in Questasim.

This warning is safe to ignore.

#### Your tasks for Part 1 are:

- 1. Create this design in SystemVerilog. For ease of debugging and future extension, I suggest you structure your design with a top-level module (CONV\_12\_5 as above), a control module, and a datapath module. Use CONV\_12\_5.sv as the name of your top-level file.
- 2. Test your design. The small testbench we will provide for you is a simple place to get started, and the random testbench we provide is good for thorough testing. However, you may want to have some medium tests between these two. In your report, make sure you explain how you tested and what bugs you observed.
- 3. Use Synopsys DesignCompiler to synthesize your design and find the maximum possible clock frequency. Save the resulting synthesis log file with a descriptive name and submit it with your project. From this log, determine the design's area, power, maximum clock frequency, and the critical path's location in your logic.
- 4. In your report, include the following:
  - a. How many arithmetic operations are required to convolve a vector x of length N=12 with a vector f of length M=5 (using our slightly simplified version of convolution)? Then, generalize this: how many arithmetic operations are required to convolve a vector of length *N* with a vector of length *M*?
  - b. Explain how your control module works. What are the steps it goes through? How does it keep track of its place in execution? (That is, what state does it store?)
  - c. In this design, we take 10-bit inputs; the multiplier's product is 20 bits; the accumulator is 23 bits. Is this sufficient to guarantee the system cannot overflow, even without saturation? Explain why or why not.
  - d. Explain how you verified your system. Did you find and solve any problems? Did you write any other testbenches besides those provided to you? (If you did, please include them with your submission.)
  - e. Report the area, power, frequency, and critical path location you determined from your synthesis report. Explain the critical path location descriptively; in other words, carefully explain where the path flows through your design in words or diagrams and *why it makes sense that it is the critical path*.
  - f. From your simulation (and your understanding of your design), determine how many clock cycles the system takes to load one set of inputs, compute one convolution when the vectors are length 12 and 5, and output the result,

assuming that the testbench does not stall your design (that is, the testbench ensures that x\_valid, f\_valid, and y\_ready are always asserted). Count from the first cycle of loading the input until the last cycle of outputting the result. Then, multiply this by the fastest clock period you could reach to find the minimum delay of the system (in nanoseconds).

- g. A joint metric that combines the effects of area and speed in a single value is the *area-delay product*. The area-delay product is found by multiplying the area of the system times its delay. (Since these are both metrics that we want to minimize, lower area-delay products are better than higher ones.) Calculate the area-delay product of your system.
- h. Based on the synthesis power estimate, how much energy is consumed by your system while computing one convolution (using the delay you found in part f)?
- i. We can define the *arithmetic operation count* as the number of useful additions and multiplications required to perform one convolution. What is the operation count for your system? How much energy does your system consume *per arithmetic operation*?

## Part 2: Larger Convolution (112 by 49) [15 points]

In Part 2, you will adapt your design from Part 1 to make a design with N=112 and M=49. That is, your input vector x is now of length 112 and f is of length 49. Then you will reason about how the design will change as N and M increase.

Use module name conv\_112\_49 for the top-level module of this design and use conv\_112\_49.sv as your top-level filename. Create this design, update your testbench, simulate your design, and synthesize it. *Change the adder's output (accumulator) to 26 bits (instead of 23). This will also mean the system's output data port will be 26 bits.* 

We have again provided testbench modules for you. Please see:

```
/home/home4/pmilder/ese507/proj2/part2/part2_simple_tb.sv
/home/home4/pmilder/ese507/proj2/part2/part2_random_tb.sv
/home/home4/pmilder/ese507/proj2/part2/random_in_x_p2.hex
/home/home4/pmilder/ese507/proj2/part2/random_in_f_p2.hex
/home/home4/pmilder/ese507/proj2/part2/expected_out_p2.hex
```

- 1. In your report describe the ways that your design for Part 2 changed relative to Part 1. Be specific: don't just say "I changed a parameter." Explain how changes you made effect the implementation of your system. How difficult was it to change your control module?
- 2. If you wanted to build a design for much larger values of the parameters M and N parameters, how would you do so? Would it be easy or difficult to change? In your

report, explain exactly how your design would change as *N* and *M* increases. Be specific. Are there practical limits on their values? If so, what are they?

- 3. Repeat steps e-i from Part 1 for your new design, including your responses in your report.
- 4. Why did we change the accumulator from 23 bits to 26 bits? Is this sufficient to guarantee the system cannot overflow?

## Part 3: Delay-Optimized System [35 points]

Now, you will take your design from Part 2 and modify it to be as fast as possible, while keeping the input/output ports and timing as specified in the previous sections. Name your top-level synthesizable module conv\_112\_49\_opt and use conv\_112\_49\_opt.sv for your top-level filename. You may change the internals of your design in any way possible, but the input/output behavior must not deviate from this specification. (In other words, your system must still have two stream input ports each with (\_valid, \_ready, and \_data signals), one each for f and x for input, and it must have one output stream port (y\_data, y\_valid, and y\_ready) for output y.

Some ideas you may want to pursue:

- increasing the parallelism. That is, building more adders, multipliers, and memories so that you can perform more operations concurrently,
- pipelining,
- modifying your control logic so that you can overlap input and output data (i.e., while the system is outputting one output convolution, it can begin taking in new data for the *next* convolution),
- using DesignWare components (including the pipelined versions). Please note that this does add some complexity to the simulation process; see below.

You may not modify the internals of the memory module (except of course changing its two parameters if necessary), but you may use as many memory modules as you want.

You may use DesignWare components for adders and multipliers, but you may not use the DesignWare multiply-and-accumulate unit. If you choose to use any DesignWare components, you will need to include them in your simulation (see slides from end of Topic 11).

Create this design, update the testbench, simulate your design, and synthesize it. Note that the exact same testbench from Part 2 will still work, as long as you change the module instance to instantiate your new design. In your report explain:

a. What did you do to boost the speed? How well did it work? If you tried multiple things, explain what they were and whether or not they helped.

- b. Collect the information requested in parts e-i from Part 1 for your new design and include them in your report.
- c. Your new design performs the same computation as your design in Part 2, but it should be faster, larger, and consume higher power. Compare its area-delay product and energy per arithmetic operation with your Part 2 design. In these metrics, is your faster/larger design better or worse than your previous design?
- d. If instead of optimizing for delay, what if your goal was to optimize for the overall lowest energy-per-operation? How would you build a convolution system to minimize energy?
- e. Because you are constrained by the number of input and output ports, the maximum speed of your design here is limited. What could you do to make a design even faster, if you were allowed to change the input/output timing and ports? Estimate (quantitatively) how fast such as system could be.

A portion of the points allotted for Part 3 will be based on the quality of your optimizations and the final delay of your correctly functioning design. We will measure speed in seconds (the number of cycles times the clock period). In order to account for designs that overlap input and output data (item #3 in the list of suggestions above), we will measure cycles as the number of clock cycles required between when the system takes in the first element of the input vector, and when it is capable of taking in the first input element of the *next* input vector. (This way of measuring speed is *throughput*-based, as opposed to latency based.)

# **Code and Report Submission**

#### 1. Code and Synthesis Reports

You will turn in a single .zip, .tar, or .tgz file to Blackboard. *Do not use a different archive format (including .rar)*. This compressed file should hold all of the files from your project. Organize them into three directories: part1, part2, and part3. For each system you synthesized, submit a clearly labeled synthesis report. (You only need to include the report for the fastest clock frequency you were able to reach for each design.)

Do not turn in things like ModelSim "work" directories or gate-level Verilog produced by synthesis. Please also don't turn in the test data files like random\_in\_\*.hex, expected\_out\_\*.hex, etc.

#### 2. Report

Your report should include the information requested above. Your report should be electronically produced (not handwritten) and neatly formatted. Include your report in the electronic hand-in with your code (as a PDF file only). If you worked with a partner, at the end of your report, explain each partner's contribution to the project. (If you worked alone obviously you can skip this.)

### 3. Electronic Hand-in Process

To hand in your code, go to Blackboard -> Assignments -> Project 2. There you can upload your .zip, .tar, or .tgz file. You only need to hand in once per group, but make sure both partners' names are clear in your code and report.