This is an implementation of a 2D convolution in VHDL.
- stride 1
- 3x3 filters
- zero padding
- one pixel per clock input/output
- reloadable parameters
- input image size can be changed at runtime (max image size is fixed)
- quadratic images only
- low resource usage
- short critical path
- no vendor libraries
- bit depths and max image size can be changed by generics
To achieve a one pixel per clock input and output rate, 9 multiply-accumulate operations (one per coefficient of the 3x3 filter) have to be processed in parallel. Two line buffers (module pixbuf.vhd) are used to store intermediate results of the computation.
The convolution operation can be written as

    y(x, y) = c1*p(x-1, y-1) + c2*p(x, y-1) + c3*p(x+1, y-1)
            + c4*p(x-1, y)   + c5*p(x, y)   + c6*p(x+1, y)
            + c7*p(x-1, y+1) + c8*p(x, y+1) + c9*p(x+1, y+1)

where c1 to c9 are the filter coefficients from the top left to the bottom right and p(x, y) is the input pixel value at the location (x, y).
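As a software reference, the operation above can be sketched in Python (function and variable names are illustrative, not taken from the repository):

```python
def conv2d_3x3(image, coeffs):
    """3x3 convolution, stride 1, zero padding.
    coeffs holds c1..c9 row-major (top left to bottom right)."""
    n = len(image)                       # quadratic image: n x n
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            acc = 0
            for j in range(3):           # filter row
                for i in range(3):       # filter column
                    yy, xx = y + j - 1, x + i - 1
                    if 0 <= yy < n and 0 <= xx < n:  # zero padding outside
                        acc += coeffs[j * 3 + i] * image[yy][xx]
            out[y][x] = acc
    return out
```

Note that out-of-range accesses simply contribute nothing, which is exactly the zero padding the module implements.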
This computation is now split into 9 terms to model the pixel-wise input. If we get the input pixel p(x-1, y-1) we can immediately compute the first term

    y1 = c1*p(x-1, y-1).

In the next clock cycle, when we get the second pixel p(x, y-1), we can calculate

    y2 = y1 + c2*p(x, y-1),

and after 3 pixels seen

    y3 = y2 + c3*p(x+1, y-1).

y3 will now be stored in the line buffer until p(x-1, y) is fed to the input and we can calculate

    y4 = y3 + c4*p(x-1, y),

and so on. Our final value will then be y9, which contains all 9 terms and will be output.
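The splitting can be modeled in software as follows: each arriving pixel contributes one term to up to 9 partial sums, and a partial sum is complete once its last contributing pixel has been seen. This behavioral sketch (my own model, not a transcription of the VHDL) checks that the scheme reproduces the direct convolution:

```python
def conv2d_streaming(image, coeffs):
    """Process pixels one at a time in raster order; keep a running
    partial sum per output position, mimicking the line-buffer scheme."""
    n = len(image)
    partial = {}                         # (ox, oy) -> accumulated terms
    for y in range(n):                   # raster-order pixel stream
        for x in range(n):
            p = image[y][x]
            # pixel (x, y) is the (i, j)-th tap of output (x-(i-1), y-(j-1))
            for j in range(3):
                for i in range(3):
                    ox, oy = x - (i - 1), y - (j - 1)
                    if 0 <= ox < n and 0 <= oy < n:
                        partial[(ox, oy)] = partial.get((ox, oy), 0) \
                                            + coeffs[j * 3 + i] * p
    out = [[0] * n for _ in range(n)]
    for (ox, oy), v in partial.items():
        out[oy][ox] = v                  # padding pixels never arrive,
    return out                           # so they contribute 0 implicitly
```

In hardware, of course, the per-position partial sums live in the two line buffers rather than in a dictionary, and all 9 multiply-accumulates happen in parallel.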
To achieve maximum throughput of course all of these calculations mentioned above have to be executed in parallel.
There is an example of how to use the module in the testbench (conv2d_tb.vhd). In order to start the convolution, the first line of zeros (padding) has to be fed into the module. Zero values also have to be applied at the left and right borders of the input image.
The last row of zeros for the bottom zero padding is generated by the module itself. For each pixel that is fed to the module, the current pixel's address (x/y) and a valid signal have to be assigned.
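The required input ordering can be sketched as a generator producing (x, y, pixel) tuples. The exact addressing convention is defined by conv2d_tb.vhd; the scheme below (addresses covering the padded frame, row 0 being the top padding row) is an assumption for illustration:

```python
def input_stream(image):
    """Yield (x, y, pixel) tuples in the order the module expects:
    a leading row of zeros (top padding), then each image row framed
    by one zero on the left and one on the right border."""
    n = len(image)
    w = n + 2                            # padded width
    for x in range(w):                   # top padding row
        yield (x, 0, 0)
    for y, row in enumerate(image, start=1):
        yield (0, y, 0)                  # left border zero
        for x, p in enumerate(row, start=1):
            yield (x, y, p)
        yield (w - 1, y, 0)              # right border zero
```

The bottom padding row is deliberately absent: as stated above, the module generates it internally.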
This is the number of bits per input pixel. To keep precision, the module does all arithmetic operations at a width of NUM_BITS_PIXEL*2 bits.
Number of bits used to encode the pixel addresses x and y. Basically lb(MAX_IMAGE_WIDTH), i.e. the binary logarithm rounded up.
Word width of the coefficients.
Max image width for the block to expect. Images must be quadratic.
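How these widths relate can be spelled out in a few lines of Python. MAX_IMAGE_WIDTH and NUM_BITS_PIXEL are generics named above; NUM_BITS_ADDR is an assumed name for the address-width generic, and the numeric values are examples only:

```python
from math import ceil, log2

MAX_IMAGE_WIDTH = 64     # example value
NUM_BITS_PIXEL = 8       # example value

# lb(MAX_IMAGE_WIDTH): bits needed to encode an x or y address
NUM_BITS_ADDR = ceil(log2(MAX_IMAGE_WIDTH))   # assumed generic name

# internal arithmetic width used to keep precision
INTERNAL_WIDTH = 2 * NUM_BITS_PIXEL
```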
A test file generated by a python module (data_py.bin) can be checked against a file generated by the testbench of the conv2d module (data_sim.bin).
Because of the 9 parallel MACs the module needs 9 DSP slices. The maximum image size and the internal bit width define the number of BRAM instances the synthesis tool will infer. The memory needed in bits can be calculated using the formula

    2 * MAX_IMAGE_WIDTH * 2 * NUM_BITS_PIXEL

(two line buffers of MAX_IMAGE_WIDTH entries each, at the internal width of 2*NUM_BITS_PIXEL bits).
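As a worked example of the line-buffer memory, assuming two buffers of MAX_IMAGE_WIDTH entries at the internal width of 2*NUM_BITS_PIXEL bits each:

```python
def linebuffer_bits(max_image_width, num_bits_pixel):
    # two line buffers, each max_image_width entries deep,
    # each entry at the internal precision of 2 * num_bits_pixel bits
    return 2 * max_image_width * 2 * num_bits_pixel

# e.g. a 512-pixel-wide image with 8-bit pixels:
# 2 * 512 * 16 = 16384 bits
```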
Using these parameters, a standard synthesis in Vivado 2019.2 for 7 Series FPGAs boils down to
- 256 LUTs
- 1 BRAM
- 9 DSP
Possible future improvements:

- write a wrapper for an on-chip bus (AXI Stream, ...) for easier integration
- add different strides/paddings/filter sizes that can be changed on the fly