Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quartus streaming support for Activations, Dense & Batch Normalization #557

Merged
merged 11 commits into from Jul 26, 2022

Conversation

bo3z
Copy link
Contributor

@bo3z bo3z commented May 31, 2022

A PR enabling support for io_stream in the Qaurtus backend. Currently, the supported layers for streaming I/O are Dense, Activation and Batch Normalisation. Due to the inherent differences between parallel and streaming interfaces in Intel HLS, this is an extensive PR. Furthermore, Intel's HLS stream has certain requirements when implemented (no copy constructor, global declaration etc.). Therefore, It is recommended to review this PR commit by commit (rather than side-by-side diff), with each commit being briefly explained below. Each commit is self-contained and can be checked out and the project compiled.

  1. 2233a68 Allow multiple includes of the same LUT - This simply allows the same look-up tables to be used for both io_parallel and io_stream, by removing the IFNDEF guard

  2. e003234 Quartus custom Stream & distinction between g++ (nnet::stream) and i++ (ihc::stream):

  • When executing hls4ml.compile() or hls4ml.predict(...), GCC is used. However, GCC doesn't have access to HLS source files (which include Intel's streaming interface, ihc::stream, ihc::stream_in, ihc::stream_out). Unlike ac_int, ac_fixed which are open-source and included in hls4ml, Intel's HLS stream source files are protected by licence and cannot be included in the repository. Therefore, a custom nnet::stream struct is written, having the same high-level function as Intel's HLS ihc::stream, but implemented using queues. Please note, pytests written for streaming layers use nnet::stream. To verify correct functionality of the IP, use cosim with the following command: i++ -march=x86-64 -o myproject_test -v myproject_test.cpp firmware/myproject.cpp
  • Finally, as per this tutorial by Intel: https://youtu.be/aZYBlkcoj8Q?t=819, streams are always passed by reference to the top-level component. A top-level component with a streaming output is void - instead of returning a stream, it takes the stream object by reference. It is not even possible to return a stream from a component, since the internal implementation contains a explicitly deleted copy constructor.
  • Finally, there are 3 distinct types of stream - inputs to an HLS component must be of type stream_in, outputs of type stream_out and inter-component connection of type stream. With this distinction, the HLS compiler is able to distinguish between component inputs and outputs. If only stream was used, the component would not synthesize with the correct inputs/outputs
  1. 1bbcca5 Top-level & Quartus Writer distinction between io_parallel and io_stream:
  • This commit handles the inherent differences between io_parallel and io_stream in myproject.cpp, myproject.h and defines.h - in parallel, the top-level component takes a input struct, containing the array of data and returns a struct containing the output array. On the other hand, as explained above, streaming interfaces are void, as both the input and output and output are passed by reference.
  • Secondly, a new cosim benchmark was written for streaming inputs. Previously (and still for io_parallel), all of the input data is processed, stored to a vector and then passed to the component sequentially. However, due to an explicitly deleted copy constructor, ihc::stream cannot be stored inside a vector. Therefore, this new benchmark processes data from input files and executes the component straight away.
  • Finally, for io_stream, inter-layer connections (type stream) must be declared outside of main() - Intel HLS has a requirement through which all streams are either passed by reference to the top-level component or declared as global variables. No stream types can be instantiated inside the component.
  1. d600882 Quartus Stream Variable Converter & Backend io_stream enabled:
  • Adds a stream variable converter, in a similar manner to Vivado
  • Add nnet::array, as an array-like struct for storing data inside streams. Implementation similar to Vivado.
  1. 1b17b3b Quartus Clone Optimizer - Sets up streaming optimizers in Quartus backend and adds the clone optimizer, as the most commonly used. From here, it should be straight forward to add further streaming passes.

  2. d4da71f Tanh bug fix in Quartus - Addresses a small bug when invoking TanH activation. This is simply a pre-requisite for writing streaming activations.

  3. The other commits add support for streaming Dense, BatchNormalization and Activation, in a similar manner to Vivado, with the appropriate pipeline initiation interval:

  • Existing PyTests were expanded to include io_stream. Some new tests (test_activations.py) were written as well, to verify correct results of less-frequently used layers. Important to note, a passing PyTest does not imply correct HLS outputs - as explained above, PyTest uses nnet::stream and HLS, when synthesizing uses ihc::stream - these two are inherently different, even though they offer the same high-level funcionality. To verify correct HLS behaviour, cosim should be used (explained above), as well as some RTL-level simulation, such as ModelSim or Questa. All of the above layers were tested using both PyTest and cosim. Finally, all of the above layers synthesized correctly and produced a valid IP block.
  • Finally, some of the Quartus-equivalent pragmas couldn't be used. For example, Vivado offers a #pragma HLS data_pack directive, for which the Intel equivalent would be hls_bankwidth. However, hls_bankwidth (similar to other memory-optimisation pragmas, except for hls_register) is not supported for variables passed by reference/pointers, which is a necessity for correct functionality.

@bo3z
Copy link
Contributor Author

bo3z commented Jul 8, 2022

cppname is being removed in #562, this PR will also have to reflect these changes

@bo3z bo3z mentioned this pull request Jul 11, 2022
6 tasks
@jmitrevs
Copy link
Contributor

Can you expand a bit on why multiple includes are needed (e8c9170). It seems a bit strange to me to include the same header file twice.

@jmitrevs
Copy link
Contributor

@bo3z, can you fix the conflicts? I understand that it's high priority to get this in fairly soon, so I'd like to make a pr/557 to run all the tests.

@bo3z
Copy link
Contributor Author

bo3z commented Jul 13, 2022

Can you expand a bit on why multiple includes are needed (e8c9170). It seems a bit strange to me to include the same header file twice.

Unlike Vivado, Quartus generates look-up tables thorugh Python and stores them in .h files. These files simply contain a single C-array, the entries of the LUT. These header files are only included in the function using that LUT, not globally. io_parallel and io_stream are simply overloads of the same function, but completely independent from each other. So, both of them have to include the LUT header file. If one of the overloads isn't able to include the header files, compilation will fail, with a message along the lines of 'No variable named exp_table'. Keeping the #IFNDEF guard prevents both files to be included and compilation fails.

Now I am aware that this is very bad practice, especially in a big project. One alternative is generating the same LUT twice, once with the suffix parallel and once with the suffix stream. This will slow down GCC compilation time, but should have no impact on HLS synthesis, as HLS compiler removes all unused code. Using a linker (-flto flag) during GCC compilation might also do the trick (I haven't tried this) - it will remove all unused code and we can always be certain that only one of the two activation functions is called at a time.

@bo3z
Copy link
Contributor Author

bo3z commented Jul 13, 2022

@bo3z, can you fix the conflicts? I understand that it's high priority to get this in fairly soon, so I'd like to make a pr/557 to run all the tests.

Rebased and pushed. All the tests should pass now.

@jmitrevs
Copy link
Contributor

About the multiple includes, I think it's fine. The includes are used inside of functions, so effectively define the variables there, if I understand correctly.

@jmitrevs
Copy link
Contributor

I see the following errors whey I run pytests:

FAILED test_qkeras.py::test_qactivation_kwarg[quantized_bits(4, 0, alpha=1)-quantized_relu(8, 0)] - AssertionError: 
FAILED test_transpose_concat.py::test_accuracy[io_stream] - AssertionError: 

@jmitrevs
Copy link
Contributor

(In general, though, the PR looks good.)

@bo3z
Copy link
Contributor Author

bo3z commented Jul 24, 2022

I see the following errors whey I run pytests:

FAILED test_qkeras.py::test_qactivation_kwarg[quantized_bits(4, 0, alpha=1)-quantized_relu(8, 0)] - AssertionError: 
FAILED test_transpose_concat.py::test_accuracy[io_stream] - AssertionError: 

Could you post the full CI/CD log please? Or maybe re-run the tests ? I just re-ran both of these tests locally and they passed. test_transpose_concat.py really should not fail, because it defaults to Vivado and I have not modified it, so my PR should have no impact on it at all?

@bo3z
Copy link
Contributor Author

bo3z commented Jul 24, 2022

About the multiple includes, I think it's fine. The includes are used inside of functions, so effectively define the variables there, if I understand correctly.

That's correct.

@jmitrevs
Copy link
Contributor

They both passed. I don't understand what happened before. They had previously failed by not matching perfectly.

I have to run now, but I want to push your branch to pr/557 on the upstream repository to trigger the pytests as part of the CI (you can do it if you want) and I'll look over the code one more time before approving.

@bo3z
Copy link
Contributor Author

bo3z commented Jul 24, 2022

They both passed. I don't understand what happened before. They had previously failed by not matching perfectly.

I have to run now, but I want to push your branch to pr/557 on the upstream repository to trigger the pytests as part of the CI (you can do it if you want) and I'll look over the code one more time before approving.

No clue what happened there, but I've seen this happen once before as well. Sometimes the CI just fails. I will push to pr/557 now, all tests should pass. I've just merged main into this branch.

@bo3z
Copy link
Contributor Author

bo3z commented Jul 25, 2022

E           AssertionError: 
E           Not equal to tolerance rtol=1, atol=0.00390625
E           
E           Mismatched elements: 1 / 1000 (0.1%)
E           Max absolute difference: 0.0078125
E           Max relative difference: 1.
E            x: array([0.      , 0.      , 0.042969, 0.203125, 0.214844, 0.      ,
E                  0.371094, 0.046875, 0.      , 0.136719, 0.      , 0.      ,
E                  0.570312, 0.15625 , 0.121094, 0.089844, 0.792969, 0.      ,...
E            y: array([0.      , 0.      , 0.039062, 0.199219, 0.210938, 0.      ,
E                  0.367188, 0.042969, 0.      , 0.132812, 0.      , 0.      ,
E                  0.566406, 0.15[234](https://gitlab.cern.ch/fastmachinelearning/hls4ml/-/jobs/23506353#L234)4, 0.117188, 0.085938, 0.789062, 0.      ,...
test_qkeras.py:309: AssertionError

This PyTest still seems to be failing, by a very small margin. However, this test defaults to Vivado, so I'm not sure why that's the case?

@jmitrevs
Copy link
Contributor

This PyTest still seems to be failing, by a very small margin. However, this test defaults to Vivado, so I'm not sure why that's the case?

I wonder if it's a different random seed.

Copy link
Contributor

@vloncar vloncar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing. Can you fix the two minor cosmetic issues before we merge?

hls4ml/writer/quartus_writer.py Outdated Show resolved Hide resolved
@bo3z
Copy link
Contributor Author

bo3z commented Jul 26, 2022

All comments addressed. Needed to rebase one final time to avoid merge conflicts as well as include a small amount of code from a different branch in quartus_writer.py (lines 95-107), as previously synthesis failed (I had this same PR on two branches, to avoid any loss of code). All tests pass now and verified synthesis and HLS simulation with a few single-layer and multi-layer models.

It might be worth including in the CI/CD process a step for synthesising Quartus designs, as we don't currently support that.

Copy link
Contributor

@jmitrevs jmitrevs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vladimir should check his change requests first, but from my side things look good.

@vloncar
Copy link
Contributor

vloncar commented Jul 26, 2022

I'm fine with it. No idea about the test failure. It seems to choose the same seed every time for this PR, which is strange. I'll approve it, not sure if we should merge, or investigate the test more? For sure no changes in this PR cause the test to fail.

@bo3z
Copy link
Contributor Author

bo3z commented Jul 26, 2022

I'm fine with it. No idea about the test failure. It seems to choose the same seed every time for this PR, which is strange. I'll approve it, not sure if we should merge, or investigate the test more? For sure no changes in this PR cause the test to fail.

I would maybe merge it (just to avoid further merge conflicts and kick of the work on streaming CNNs) and see if this test fails on other PRs, rebased on top of this one?

@vloncar
Copy link
Contributor

vloncar commented Jul 26, 2022

Let's go!!!

@vloncar vloncar merged commit c5507f3 into fastmachinelearning:main Jul 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants