
Quartus GRU #596

Merged
merged 3 commits into from Aug 12, 2022

Conversation

bo3z
Contributor

@bo3z commented Jul 11, 2022

Description

📝 Gated Recurrent Units (GRUs) for Quartus backend

Type of change

  • New feature (non-breaking change which adds functionality)

Tests

  • Accuracy tests through PyTest; for more details see test/pytest/test_rnn.py
  • IP simulation using cosim.
  • Successful synthesis and analysis of device resources and latency (see below).

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.

Implementation

  • The HLS code for GRU layers consists of two main functions (a simplified structural sketch follows this list):
  1. gru_cell(t, h, weights, recurrent_weights, bias, recurrent_bias) - takes the input vector, x, and hidden state, h, at time step t, and produces the new hidden state according to the GRU logic (reset gate, update gate, candidate state). This function contains several loops over the number of GRU units/states; those loops are unrolled with the appropriate reuse factor. For results on resource usage and latency, see below.
  2. gru(data, res, weights, recurrent_weights, bias, recurrent_bias) - makes use of the previous function by traversing the data at each time step and obtaining the new state, until the final output is produced. Note that it is not possible to pipeline this function, because there is a loop-carried dependency: at every iteration, the current state needs to be available before the new state can be calculated.
  • The backend contains a layer initialiser and the appropriate templates. Matrix multiplication and bias addition are done through the Dense layer. Finally, a resource strategy optimizer handles the matrix transposes needed for the Dense multiplication, rather than performing them in the layer initialisation procedures.
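To make the structure concrete, below is a minimal, simplified sketch of the two functions. This is not the actual hls4ml template code: the float types, fixed sizes, Keras-style z/r/h gate ordering with the reset gate applied after the recurrent product, and the Intel HLS-style unroll/pipelining pragmas are illustrative assumptions. The real implementation is templated on a config struct, uses fixed-point types, and delegates the dense products and bias additions to the Dense-layer functions as described above.

```cpp
#include <cmath>

// Illustrative sizes matching the benchmark below: 5-dimensional input,
// 8 units, 8 time steps. The real code is templated on a config struct
// and uses fixed-point (ac_fixed) types rather than float.
constexpr int N_IN = 5;
constexpr int N_UNITS = 8;
constexpr int N_TIME = 8;

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One GRU step: given the input vector x at the current time step and the
// previous hidden state h, compute the update gate z, reset gate r and the
// candidate state, then overwrite h with the new hidden state.
void gru_cell(const float x[N_IN], float h[N_UNITS],
              const float weights[3 * N_UNITS * N_IN],
              const float recurrent_weights[3 * N_UNITS * N_UNITS],
              const float bias[3 * N_UNITS],
              const float recurrent_bias[3 * N_UNITS]) {
    float z[N_UNITS], r[N_UNITS], h_cand[N_UNITS];

    // Loops over the number of units; in the actual implementation these
    // are unrolled according to the reuse factor.
    #pragma unroll
    for (int u = 0; u < N_UNITS; u++) {
        // Update (z) and reset (r) gates: dense products of the input and
        // the previous state, plus biases, followed by a sigmoid.
        float z_acc = bias[u] + recurrent_bias[u];
        float r_acc = bias[N_UNITS + u] + recurrent_bias[N_UNITS + u];
        for (int i = 0; i < N_IN; i++) {
            z_acc += weights[u * N_IN + i] * x[i];
            r_acc += weights[(N_UNITS + u) * N_IN + i] * x[i];
        }
        for (int j = 0; j < N_UNITS; j++) {
            z_acc += recurrent_weights[u * N_UNITS + j] * h[j];
            r_acc += recurrent_weights[(N_UNITS + u) * N_UNITS + j] * h[j];
        }
        z[u] = sigmoid(z_acc);
        r[u] = sigmoid(r_acc);
    }

    #pragma unroll
    for (int u = 0; u < N_UNITS; u++) {
        // Candidate state: the recurrent contribution is gated by r
        // (Keras-style reset_after convention assumed here).
        float in_acc = bias[2 * N_UNITS + u];
        for (int i = 0; i < N_IN; i++)
            in_acc += weights[(2 * N_UNITS + u) * N_IN + i] * x[i];
        float rec_acc = recurrent_bias[2 * N_UNITS + u];
        for (int j = 0; j < N_UNITS; j++)
            rec_acc += recurrent_weights[(2 * N_UNITS + u) * N_UNITS + j] * h[j];
        h_cand[u] = std::tanh(in_acc + r[u] * rec_acc);
    }

    // Blend previous state and candidate through the update gate.
    #pragma unroll
    for (int u = 0; u < N_UNITS; u++)
        h[u] = z[u] * h[u] + (1.0f - z[u]) * h_cand[u];
}

// Full layer: walk through the time steps, feeding each input vector and
// the running state to gru_cell. Because the next iteration needs the state
// just produced (a loop-carried dependency), this loop is not pipelined.
void gru(const float data[N_TIME * N_IN], float res[N_UNITS],
         const float weights[3 * N_UNITS * N_IN],
         const float recurrent_weights[3 * N_UNITS * N_UNITS],
         const float bias[3 * N_UNITS],
         const float recurrent_bias[3 * N_UNITS]) {
    float h[N_UNITS] = {0};

    #pragma disable_loop_pipelining
    for (int t = 0; t < N_TIME; t++)
        gru_cell(&data[t * N_IN], h, weights, recurrent_weights, bias,
                 recurrent_bias);

    for (int u = 0; u < N_UNITS; u++)
        res[u] = h[u];
}
```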

Results

Below are latency, DSP, REG and ALM usage results of a GRU layer with a 5-dimensional input, 8 time steps and a variable number of units.

As expected, the latency remains approximately constant when increasing the number of units, while DSPs, REGs and ALMs increase at a linear rate. This occurs because the implementation contains several loops unrolled over the number of units/states. Therefore, such an implementation is time-invariant, but resource-inefficient.
[Plots: latency, DSP, REG and ALM usage vs. number of units]

Finally, with the units fixed to 8 and the input size to 5, similar plots are obtained. As the time loop has pipelining disabled (due to loop dependencies), the use of DSPs remains approximately constant. ALMs and REGs increase slightly, because a larger input needs to be stored. The latency increases at a linear rate, as expected.
[Plots: latency, DSP, REG and ALM usage vs. number of time steps]

@jmitrevs
Contributor

The pytest failure was related to running out of disk space. It probably is unrelated to this PR.

@vloncar vloncar merged commit ae31793 into fastmachinelearning:main Aug 12, 2022
calad0i pushed a commit to calad0i/hls4ml that referenced this pull request Jul 1, 2023