A two-stage CNN hardware accelerator written in Verilog RTL for machine learning applications.
The hardware accelerator speeds up the computation of a simplified two-stage convolutional neural network (CNN). The first layer extracts features from the input, and the second layer is a fully connected layer that identifies classes. The 12x12 input matrix is stored in SRAM (input memory), as are the four 1x9 B vectors and eight 1x64 M vectors (vector memory). The 8x1 output vector is written back to SRAM (output memory). The design aims to balance chip area against the delay needed to complete the computation. Because the B vectors (used in step 1) and the M vectors (used in step 2) share the vector memory, reading both at once would cause contention on the vector memory bus. To avoid this, all elements of the B vectors are fetched from SRAM and stored in internal registers before the computation starts. Removing this contention enables a two-stage pipelined design in which the feature extraction (step 1) and class identification (step 2) computations run in parallel.
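
The data flow described above can be illustrated with a small software reference model, of the kind one might use to check RTL simulation results. This is only a sketch under stated assumptions, not the design's actual arithmetic: it assumes each 1x9 B vector is a 3x3 kernel applied to non-overlapping 3x3 tiles of the 12x12 input (16 tiles per kernel, 4 x 16 = 64 features, which matches the length of each M vector), that features are ordered kernel by kernel, and that no activation function or quantization is applied. The function names `feature_extraction`, `classify`, and `cnn_reference` are hypothetical.

```python
def feature_extraction(inp, b_vectors, window=3):
    """Step 1 (assumed form): dot each B vector with every
    non-overlapping window x window tile of the input."""
    n = len(inp)
    features = []
    for b in b_vectors:                      # assumed kernel-major ordering
        for r in range(0, n, window):        # tile rows: 0, 3, 6, 9
            for c in range(0, n, window):    # tile columns: 0, 3, 6, 9
                tile = [inp[r + i][c + j]
                        for i in range(window)
                        for j in range(window)]
                features.append(sum(x * w for x, w in zip(tile, b)))
    return features


def classify(features, m_vectors):
    """Step 2: fully connected layer, one dot product per M vector."""
    return [sum(f * m for f, m in zip(features, mv)) for mv in m_vectors]


def cnn_reference(inp, b_vectors, m_vectors):
    """Run both steps back to back and return the 8-element output."""
    return classify(feature_extraction(inp, b_vectors), m_vectors)


# Example with made-up data: all-ones input and kernels, M vector k filled with k.
inp = [[1] * 12 for _ in range(12)]
b_vectors = [[1] * 9 for _ in range(4)]
m_vectors = [[k] * 64 for k in range(8)]

features = feature_extraction(inp, b_vectors)
out = cnn_reference(inp, b_vectors, m_vectors)
print(len(features))  # 64
print(out)            # [0, 576, 1152, 1728, 2304, 2880, 3456, 4032]
```

In this model the two steps run sequentially. In the pipelined hardware, step 2 can start accumulating against the M vectors as soon as step 1 produces features, since the B vectors are already held in registers and step 1 no longer needs the vector memory.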