This is an implementation of a 2D convolution in VHDL.
- stride 1
- 3x3 filters
- zero padding
- one pixel per clock input/output
- reloadable parameters
- input image size can be changed at runtime (max image size is fixed)
- quadratic images only
- low resource usage
- short critical path
- no vendor libraries
- bit depths and max image size can be changed by generics
To achieve a one pixel per clock input and output rate, 9 multiply-accumulate operations (one per coefficient of the 3x3 filter) have to be processed in parallel. Two line buffers (module pixbuf.vhd) are used to store intermediate results of the computation.
The convolution operation can be written as

    y(x, y) = c1*p(x-1, y-1) + c2*p(x, y-1) + c3*p(x+1, y-1)
            + c4*p(x-1, y)   + c5*p(x, y)   + c6*p(x+1, y)
            + c7*p(x-1, y+1) + c8*p(x, y+1) + c9*p(x+1, y+1)

where c1 to c9 are the filter coefficients from the top left to the bottom right and p(x, y) is the input pixel value at the location (x, y).
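As a software reference, the operation above can be sketched in Python (function and variable names are illustrative, not taken from the repository):

```python
def conv2d_3x3(image, coeffs):
    """3x3 convolution, stride 1, zero padding.
    coeffs holds c1..c9 row-major (top left to bottom right)."""
    n = len(image)                       # quadratic image: n x n
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            acc = 0
            for j in range(3):           # filter row
                for i in range(3):       # filter column
                    yy, xx = y + j - 1, x + i - 1
                    if 0 <= yy < n and 0 <= xx < n:  # zero padding outside
                        acc += coeffs[j * 3 + i] * image[yy][xx]
            out[y][x] = acc
    return out
```

Note that out-of-range accesses simply contribute nothing, which is exactly the zero padding the module implements.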
This computation is now split into 9 terms to model the pixel-wise input. If we get the input pixel p(x-1, y-1) we can immediately compute the first term

    y1 = c1*p(x-1, y-1).

In the next clock cycle, when we get the second pixel p(x, y-1), we can calculate

    y2 = y1 + c2*p(x, y-1),

and after 3 pixels seen

    y3 = y2 + c3*p(x+1, y-1).

y3 will now be stored in the line buffer until p(x-1, y) is fed to the input and we can calculate

    y4 = y3 + c4*p(x-1, y),

and so on. Our final value will then be y9, which contains all 9 terms and will be output.
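The splitting can be modeled in software as follows: each arriving pixel contributes one term to up to 9 partial sums, and a partial sum is complete once its last contributing pixel has been seen. This behavioral sketch (my own model, not a transcription of the VHDL) checks that the scheme reproduces the direct convolution:

```python
def conv2d_streaming(image, coeffs):
    """Process pixels one at a time in raster order; keep a running
    partial sum per output position, mimicking the line-buffer scheme."""
    n = len(image)
    partial = {}                         # (ox, oy) -> accumulated terms
    for y in range(n):                   # raster-order pixel stream
        for x in range(n):
            p = image[y][x]
            # pixel (x, y) is the (i, j)-th tap of output (x-(i-1), y-(j-1))
            for j in range(3):
                for i in range(3):
                    ox, oy = x - (i - 1), y - (j - 1)
                    if 0 <= ox < n and 0 <= oy < n:
                        partial[(ox, oy)] = partial.get((ox, oy), 0) \
                                            + coeffs[j * 3 + i] * p
    out = [[0] * n for _ in range(n)]
    for (ox, oy), v in partial.items():
        out[oy][ox] = v                  # padding pixels never arrive,
    return out                           # so they contribute 0 implicitly
```

In hardware, of course, the per-position partial sums live in the two line buffers rather than in a dictionary, and all 9 multiply-accumulates happen in parallel.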
To achieve maximum throughput of course all of these calculations mentioned above have to be executed in parallel.
There is an example of how to use the module in the testbench (conv2d_tb.vhd). In order to start the convolution, the first line of zeros (padding) has to be fed into the module. Zero values also have to be applied at the left and right borders of the input image.
The last row of zeros for the bottom zero padding is generated by the module itself. For each pixel that is fed to the module, the current pixel's address (x/y) and a valid signal have to be assigned.
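The required input ordering can be sketched as a generator producing (x, y, pixel) tuples. The exact addressing convention is defined by conv2d_tb.vhd; the scheme below (addresses covering the padded frame, row 0 being the top padding row) is an assumption for illustration:

```python
def input_stream(image):
    """Yield (x, y, pixel) tuples in the order the module expects:
    a leading row of zeros (top padding), then each image row framed
    by one zero on the left and one on the right border."""
    n = len(image)
    w = n + 2                            # padded width
    for x in range(w):                   # top padding row
        yield (x, 0, 0)
    for y, row in enumerate(image, start=1):
        yield (0, y, 0)                  # left border zero
        for x, p in enumerate(row, start=1):
            yield (x, y, p)
        yield (w - 1, y, 0)              # right border zero
```

The bottom padding row is deliberately absent: as stated above, the module generates it internally.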
This is the number of bits per input pixel. To keep precision, the module does all arithmetic operations at a width of NUM_BITS_PIXEL*2 bits.
Number of bits used to encode the pixel addresses x and y. Basically lb(MAX_IMAGE_WIDTH), i.e. the binary logarithm rounded up.
Word width of the coefficients.
Max image width for the block to expect. Images must be quadratic.
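How these widths relate can be spelled out in a few lines of Python. MAX_IMAGE_WIDTH and NUM_BITS_PIXEL are generics named above; NUM_BITS_ADDR is an assumed name for the address-width generic, and the numeric values are examples only:

```python
from math import ceil, log2

MAX_IMAGE_WIDTH = 64     # example value
NUM_BITS_PIXEL = 8       # example value

# lb(MAX_IMAGE_WIDTH): bits needed to encode an x or y address
NUM_BITS_ADDR = ceil(log2(MAX_IMAGE_WIDTH))   # assumed generic name

# internal arithmetic width used to keep precision
INTERNAL_WIDTH = 2 * NUM_BITS_PIXEL
```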
A test file generated by a python module (data_py.bin) can be checked against a file generated by the testbench of the conv2d module (data_sim.bin).
Because of the 9 parallel MACs the module needs 9 DSP slices. The maximum image size and the internal bit width define the number of BRAM instances the synthesis tool will infer. The memory needed in bits can be calculated using the formula

    2 * MAX_IMAGE_WIDTH * 2 * NUM_BITS_PIXEL

(two line buffers of MAX_IMAGE_WIDTH entries each, at the internal width of 2*NUM_BITS_PIXEL bits).
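As a worked example of the line-buffer memory, assuming two buffers of MAX_IMAGE_WIDTH entries at the internal width of 2*NUM_BITS_PIXEL bits each:

```python
def linebuffer_bits(max_image_width, num_bits_pixel):
    # two line buffers, each max_image_width entries deep,
    # each entry at the internal precision of 2 * num_bits_pixel bits
    return 2 * max_image_width * 2 * num_bits_pixel

# e.g. a 512-pixel-wide image with 8-bit pixels:
# 2 * 512 * 16 = 16384 bits
```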
Using these parameters, a standard synthesis in Vivado 2019.2 for 7 Series FPGAs boils down to
- 256 LUTs
- 1 BRAM
- 9 DSP
Possible future improvements:

- write a wrapper for an on-chip bus (AXI Stream, ...) for easier integration
- add different strides/paddings/filter sizes that can be changed on the fly