Introduce support for generic elementwise binary operations #1040

iksnagreb · 2024-04-12T20:11:00Z

This includes a set of HWCustomOp and HLSBackend operator templates which can be specialized in just a few lines of code to implement arbitrary elementwise binary operations, like Add, Mul, Sub, Div, And, Equal, etc., supporting multidirectional broadcasting. Concrete implementations for most of these operators, following the ONNX standard, are already sketched out. Still missing are specializations for accumulator and weight bit-width minimization and some tricky-to-implement operators. Also still missing is floating-point support due to HLS-backend limitations, though these seem to be just minor defects regarding flatten and Slice.
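As a rough illustration of what multidirectional broadcasting means for an elementwise binary operation, the following sketch uses plain numpy as a reference model. It is not the operator template from this PR; the shapes and values are made up for the example.

```python
import numpy as np

# Two inputs whose shapes differ in every axis but are still compatible:
# each axis is either equal or has size 1 on one side.
lhs = np.arange(6, dtype=np.float32).reshape(2, 3, 1)
rhs = np.array([10, 20, 30, 40], dtype=np.float32).reshape(1, 1, 4)

# Both inputs are broadcast "towards each other", so the output shape
# matches neither input shape.
out = np.add(lhs, rhs)
out_shape = out.shape
```

Note that neither input has the shape of the output here, which is why shape annotations cannot simply be copied from output to input around such operators.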

Adds unit tests in Python, C++ and RTL simulation for these new operators, though these are probably not exhaustive enough to validate all edge cases.

Proposes a new scheme for registering and importing custom operators into their corresponding module namespace, i.e., the 'custom_op' dictionary used to look up operators by ONNX domain.
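One way such a registration scheme could look is a class decorator that fills the 'custom_op' dictionary. This is only a minimal sketch: the decorator name `register_custom_op` and the placeholder classes are assumptions for illustration and not necessarily what this PR implements.

```python
# Module-level dictionary used to look up operators by name (per ONNX domain)
custom_op = {}


def register_custom_op(cls):
    # Hypothetical decorator: registers the class under its own name
    custom_op[cls.__name__] = cls
    return cls


@register_custom_op
class ElementwiseAddSketch:
    # Placeholder standing in for a real operator specialization
    pass


@register_custom_op
class ElementwiseMulSketch:
    pass


registered = sorted(custom_op)
```

With something like this, adding a new specialization would not require touching a central list of imports, only decorating the new class.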


fpjentzsch · 2024-04-14T17:49:25Z

Looks useful!

you define the op with the format (Identifier, Python, C++, RTL), do you have a use for the "RTL" definition or is this just a preparation for the future addition of an RTL backend?

I wonder how this functionality overlaps with the following existing FINN CustomOPs and if it makes any or all of them obsolete:

AddStreams (back-end = AddStreams_Batch from hlslib)
ChannelwiseOp (back-end = Thresholding_Batch from hlslib)
StreamingEltwise (back-end = StreamingEltwise from hlslib)
(AddStreamsLayer_Batch from hlslib, which doesn't seem to have a corresponding op in FINN)

iksnagreb · 2024-04-15T08:41:02Z

> you define the op with the format (Identifier, Python, C++, RTL), do you have a use for the "RTL" definition or is this just a preparation for the future addition of an RTL backend?

Yes, it is just intended to make this future-proof, or at least to sketch out a potential future implementation of a generic RTL backend as well. But I am not even sure right now whether I want to keep this format: for implementing the Mod and BitShift operators, for example, this simple string template is not sufficient, as the implementation depends on node attributes. Maybe I will fully transition this to the already present *_op property methods, as these could contain some logic handling different node attributes. I am also not sure whether these backend-specific definitions are at the right place/level of the hierarchy here. On the other hand, I want to minimize the number of customization points required to add new specializations of the operator template...
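To make the trade-off concrete, here is a hedged sketch of the two styles: a static (Identifier, Python, C++, RTL) tuple versus a property method that inspects node attributes. All class names, the `_operation` field and the `attrs` dictionary are hypothetical. Only the `direction` attribute of BitShift (with values "LEFT" and "RIGHT") is taken from the ONNX operator definition.

```python
import operator


class ElementwiseSketch:
    # Hypothetical template tuple: (identifier, Python callable, C++ template, RTL)
    _operation = None

    def __init__(self, attrs=None):
        # Stand-in for the node attributes of the ONNX node
        self.attrs = attrs or {}

    @property
    def cpp_op(self):
        # Default: use the static string template from the tuple
        return self._operation[2]


class AddSketch(ElementwiseSketch):
    # A static template is sufficient: no attributes involved
    _operation = ("Add", operator.add, "({0} + {1})", None)


class BitShiftSketch(ElementwiseSketch):
    _operation = ("BitShift", None, None, None)

    @property
    def cpp_op(self):
        # The generated expression depends on a node attribute, which a
        # single static string cannot express
        if self.attrs["direction"] == "LEFT":
            return "({0} << {1})"
        return "({0} >> {1})"


add_expr = AddSketch().cpp_op.format("lhs", "rhs")
shift_expr = BitShiftSketch({"direction": "RIGHT"}).cpp_op.format("lhs", "rhs")
```

The property variant costs a few more lines per operator but keeps the number of customization points at one per backend.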

> I wonder how this functionality overlaps with the following existing FINN CustomOPs and if it makes any or all of them obsolete:

Hm, it depends, I will probably know more once I have sketched out the Infer... method for the new ones. Probably the old ones have much stricter assumptions, and sometimes mapping to these special cases could actually be intended to get something more efficient. I intend to reach (almost) full ONNX compliance with the new operators, which could indeed render the AddStreams and StreamingEltwise obsolete - that is actually how I ended up implementing this in the first place: Initially I just wanted to add a new operator to handle element-wise operation (actually addition) of a constant tensor and a stream but thought "why not generalize this?"... Regarding the ChannelwiseOp, I am not really sure what this one exactly does, but as it has the Thresholding_Batch as backend, it seems to do something slightly more elaborate than simple element-wise arithmetic or logical operations, so it will probably not be obsolete?

Folding quantized initializers into add-like nodes did not respect the order of inputs to the add node correctly. This is fixed by testing for one of the two possible orders and selecting the following indices accordingly. Shape inference following the transformation is fixed by deleting the annotations instead of propagating them incorrectly. Deleting the shape annotations should not hurt, as these are redone by running shape inference after each transformation anyway.

This is probably still rather sketchy, but at least it tries to check the data layout annotation. For now this seems to be enough for getting the thresholds of multi-head attention right, IF qonnx properly annotates the 3D layouts.

Add is commutative and thus the export does not always generate the initializer as the second input. However, this was always assumed by this transformation, failing via assertion if the inputs were simply ordered differently. The transformation now handles both of the two possible input orderings.
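The idea of handling both input orderings can be sketched without any ONNX dependency. The helper name `split_add_inputs` and the plain-dictionary stand-in for the graph initializers are assumptions for illustration, not the code of the actual transformation.

```python
def split_add_inputs(inputs, initializers):
    """Return (dynamic_input, initializer_input) for a two-input Add node.

    Returns None if neither input is an initializer, e.g. for a join node.
    """
    first, second = inputs
    if second in initializers:
        # The ordering the old code always assumed
        return first, second
    if first in initializers:
        # Swapped ordering, valid because Add is commutative
        return second, first
    return None


# Hypothetical tensor names; only "bias" has an initializer
initializers = {"bias": [1.0, 2.0]}
usual = split_add_inputs(["x", "bias"], initializers)
swapped = split_add_inputs(["bias", "x"], initializers)
join = split_add_inputs(["x", "y"], initializers)
```

Returning None for the join case lets the caller skip the node instead of failing an assertion.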

See Xilinx#978

Note: This applies to the "container" type, not the simulated quantization type. This is to prevent accidental promotion to float64.
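A small numpy sketch of the kind of accidental promotion meant here. The arrays are made up, and the explicit `astype` cast is just one possible way to pin the container type.

```python
import numpy as np

# Input tensor stored in a float32 container
x = np.array([0.5, 1.5, 2.5], dtype=np.float32)
# Parameters created without an explicit dtype default to float64
scale = np.array([2.0, 2.0, 2.0])

# Mixing float32 and float64 arrays promotes the result to float64
promoted = x * scale

# Casting the result back keeps the container type at float32
fixed = (x * scale).astype(np.float32)
```

The simulated quantization type (e.g. an annotated integer datatype) is unaffected by this; only the numpy container changes.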

See Xilinx#978

Up until now, this was not a problem, as QONNX and FINN assumed all tensors to be either broadcast offline or, if not, to be "trivially" broadcastable, like scalars or effectively scalar tensors. With the introduction of proper multidirectional broadcasting for elementwise binary operations, this might not be the case anymore and we need to explicitly reject these from being absorbed into multi-thresholds if broadcasting is not possible (otherwise, without testing, this transformation just fails with some numpy exception).
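A shape compatibility test of this kind can be written in a few lines of plain Python following the usual numpy/ONNX broadcasting rule. The function name `can_broadcast` is hypothetical and this is not the check used in the transformation itself.

```python
from itertools import zip_longest


def can_broadcast(shape_a, shape_b):
    # Compare axes from the trailing end; missing leading axes count as 1
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        # Axes are compatible if they are equal or one of them is 1
        if a != b and a != 1 and b != 1:
            return False
    return True


scalar_like = can_broadcast((1,), (4, 8))    # effectively scalar parameter
per_channel = can_broadcast((4, 1), (4, 8))  # one value per row
mismatch = can_broadcast((3,), (4, 8))       # trailing axes 3 vs 8 disagree
```

A transformation could skip the pattern when such a check returns False rather than running into a numpy exception later.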

Shapes propagating backwards in graph transformations can break subsequent shape inference. In particular, this is the case for operators involving broadcasting semantics, where the output shape cannot be fully reflected in the input shapes, i.e., even for elementwise operations, the input shape might not be identical to the output shape. This is fixed by deleting problematic shape annotations to be re-done immediately.

iksnagreb · 2024-04-17T10:01:01Z

This now relies on cherry-picked commits from #901 and #1030.

The new test case tests export, streamlining, conversion to hardware layers and subsequent Python, C++ and RTL simulation of QuantEltwiseAdd from Brevitas, serving as a representative example of an elementwise binary operation.

Shape propagation when reordering around elementwise addition did not behave as expected when any of the tensors is broadcast by one of the reordered operations. This is fixed by deleting and re-doing the shape annotations for the connecting tensors of the reordered pattern.

Without MoveLinearPastEltwiseAdd, the two-input-streams variant of the integration test did not actually convert the elementwise addition to a hardware operator, effectively "testing" the vanilla ONNX version of the operator. With this transformation and AbsorbSignBiasIntoMultiThreshold to get the signs right, the hardware operator is now tested as intended.

This is done mostly according to the Vitis High-Level Synthesis User Guide (UG1399), see the library reference on arbitrary precision integer types. The new transformations are added to all relevant test cases and some data types need to be adjusted to make the numpy references behave more robustly.

This depends on adding float support to Slice in finn-hlslib.

Join-node Mul operations have no initializer (parameters) and thus there is nothing to factor out.

This is probably just a workaround and proper datatype inference should be implemented later. For now it seems safer to implicitly treat the resulting parameter tensor as floating-point than to assume a wrong datatype. In most cases the resulting Add operation will later be absorbed and rounded into some thresholds anyway.

maltanar · 2024-10-16T08:34:10Z

src/finn/custom_op/fpgadataflow/elementwise_binary.py

+            assert _min != 0
+            assert _max != 0


why is there an assert for min/max of rhs not equal to zero, specifically for const rhs?
this can happen for e.g. ReLU implemented as ElementwiseMaximum(x, 0)
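The case described in this comment can be reproduced in a few lines. The variable names `_min` and `_max` mirror the diff above, but what the asserts in the PR actually guard is not visible here, so this sketch only shows that a constant right-hand side of zero is a legitimate input.

```python
# Constant right-hand side of an elementwise Maximum used as ReLU
rhs = [0.0]
_min, _max = min(rhs), max(rhs)

# Both bounds of the constant are exactly zero, so asserting
# "_min != 0" and "_max != 0" would reject this valid configuration
rhs_is_zero = (_min == 0 and _max == 0)

# Reference behaviour: Maximum(x, 0) is ReLU
x = [-2.0, -0.5, 0.0, 1.5]
relu = [max(value, rhs[0]) for value in x]
```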

iksnagreb changed the base branch from main to dev April 12, 2024 20:27

iksnagreb added 9 commits April 16, 2024 11:53

Make quantized activation handlers data layout aware

8691d3f

This is probably still rather sketchy, but at least it tries to check the data layout annotation. For now this seems to be enough for getting the thresholds of multi-head attention right, IF qonnx properly annotates the 3D layouts.

Fix clipping range issue in RoundAndClipThresholds transformation

e632328

Rework RoundAndClipThresholds to avoid range and type promotion issues

8dd85f4

See Xilinx#978

[Thresholding] Make sure the output of python simulation is float32

8b7c2eb

Note: This applies to the "container" type, not the simulated quantization type. This is to prevent accidental promotion to float64.

[Tests] Rework test-cases for reworked RoundAndClipThresholds

f01d02f

See Xilinx#978

iksnagreb added 6 commits April 18, 2024 17:53

[Elementwise] Add InferElementwiseBinaryOperation transformation

3f13673

[Elementwise] Some cleanup / simplification of generated code

fd1aedd

iksnagreb mentioned this pull request Apr 19, 2024

Add float support to Slice by specializing Width and Caster template Xilinx/finn-hlslib#140

Open

iksnagreb added 5 commits April 19, 2024 21:21

[Elementwise] Add support for floating-point operations

4769d8e

This depends on adding float support to Slice in finn-hlslib.

[Elementwise] Implement get_exp_cycles for ElementwiseBinaryOperation

87fc002

[Elementwise] Add support for ElementwiseBinaryOperation to SetFolding

efb1cc9

[Elementwise] Remove FIFO depths attribute overloads

f34dcfc

[Elementwise] Add ARRAY_PARTITION and BIND_STORAGE directives

e361cb9

iksnagreb mentioned this pull request Aug 7, 2024

[Squeeze] Introduce Squeeze and Unsqueeze hardware operators #1153

Draft

iksnagreb added 3 commits August 8, 2024 16:05

[Streamline] Prevent FactorOutMulSignMagnitude from handling join-nodes

653673b

Join-node Mul operations have no initializer (parameters) and thus there is nothing to factor out.

[Elementwise] Reintroduce FIFO depths attribute overloads

dd68078

maltanar reviewed Oct 16, 2024
