Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce support for generic elementwise binary operations #1040

Draft
wants to merge 24 commits into
base: dev
Choose a base branch
from

Conversation

iksnagreb
Copy link
Contributor

@iksnagreb iksnagreb commented Apr 12, 2024

This includes a set of HWCustomOp and HLSBackend operator templates which can be specialized in just a few lines of code to implement arbitrary elementwise binary operations, like Add, Mul, Sub, Div, And, Equal, etc., supporting multidirectional broadcasting. Concrete implementations for most of these operators according to standard ONNX is already sketched out. Still missing are specializations for accumulator and weight bit-width minimization and some tricky to implement operators. Also still missing is floating-point support due to HLS-backend limitations, though these seem to be just minor defects regarding flatten and Slice.

Adds unit tests in Python, C++ and RTL simulation for these new operators, though these are probably not exhaustive enough to validate all edge cases.

Proposes a new scheme for registering and importing custom operators into their corresponding module namespace, i.e., the 'custom_op' dictionary used to lookup operators by ONNX domain.

Implementation Progress

  • Generic HWCustomOp and HLSBackend "templates" providing infrastructure and HLS code generation for arbitrary elementwise binary operations
  • ElementwiseAdd
  • ElementwiseSub
  • ElementwiseMul
  • ElementwiseDiv (Not covered by unit test due to difficulties avoiding zeros in test inputs...)
  • ElementwiseMod
  • ElementwiseAnd
  • ElementwiseOr
  • ElementwiseXor
  • ElementwiseEqual
  • ElementwiseLess
  • ElementwiseLessEqual
  • ElementwiseGreater
  • ElementwiseGreaterEqual
  • ElementwiseBitwiseAnd
  • ElementwiseBitwiseOr
  • ElementwiseBitwiseXor
  • ElementwiseBitShift
  • ElementwisePow
  • Unit tests for these new operators
  • Integration test using QuantEltwiseAdd from Brevitas
  • Infer... transformations in convert_to_hw_layers.py for these newly supported operators
  • Accumulator and weight bit-width minimization for these new operators: At least the accumulator minimization needs to be specialized for each concrete instance of operator and probably cannot be handled by the template: Arithmetic and logical operators behave too different in this regard. Add, Sub, Mul etc. are probably already to different for a shared implementation.
  • Maybe add floating-point support, but not sure how this would integrate with the rest of FINN

This includes a set of HWCustomOp and HLSBackend operator templates
which can be specialized in just a few lines of code to implement
arbitrary elementwise binary operations, like Add, Mul, Sub, Div,
And, Equal, etc., supporting multidirectional  broadcasting. Concrete
implementations for most of these operators according to standard ONNX
is already sketched out. Still missing are specializations for
accumulator and weight bit-width minimization and some tricky to
implement operators. Also still missing is floating-point support due
to HLS-backend limitations, though these *seem* to be just minor defects
regarding "flatten" and "Slice".

Adds unit tests in Python, C++ and RTL simulation for these new
operators, though these are probably not exhaustive enough to validate
all edge cases.

Proposes a new scheme for registering and importing custom operators
into their corresponding module namespace, i.e., the 'custom_op'
dictionary used to lookup operators by ONNX domain.
@iksnagreb iksnagreb changed the base branch from main to dev April 12, 2024 20:27
@fpjentzsch
Copy link
Collaborator

Looks useful!

you define the op with the format (Identifier, Python, C++, RTL), do you have a use for the "RTL" definition or is this just a preparation for the future addition of an RTL backend?

I wonder how this functionality overlaps with the following existing FINN CustomOPs and if it makes any or all of them obsolete:

@iksnagreb
Copy link
Contributor Author

iksnagreb commented Apr 15, 2024

you define the op with the format (Identifier, Python, C++, RTL), do you have a use for the "RTL" definition or is this just a preparation for the future addition of an RTL backend?

Yes, it is just intended to make this future-proof, or at least to sketch out a potential future implementation of a generic RTL backend as well. But I am not even sure right now, whether I want to keep this format, as for example, for implementing the Mod and BitShift operator this simple string template is not sufficient, as the implementation depends on node attributes. Maybe I will fully transition this to the already present *_op property methods as these could contain some logic handling different node attributes. I am also not sure whether these backend-specific definitions are at the right place/level of the hierarchy here, on the other hand I want to minimize the number of customization points required to add new specializations of the operator template...

I wonder how this functionality overlaps with the following existing FINN CustomOPs and if it makes any or all of them obsolete:

Hm, it depends, I probably know more once I have sketched out the Infer... method for the new ones. Probably the old ones have much more strict assumptions and sometimes this could actually be intended to map to some special cases and get something more efficient. I intend to reach (almost) full ONNX compliance with the new operators, which could indeed render the AddStreams and StreamingEltwise obsolete - that is actually how I ended up implementing this in the first place: Initially I just wanted to add a new operator to handle element-wise operation (actually addition) of a constant tensor and a stream but though "why not generalize this?"... Regarding the ChannelwiseOp, I am not really sure what this one exactly does, but as it has the Thresholding_Batch as backend, it seems to do something slightly more elaborate than simple element-wise arithmetic or logical operations, so it will probably not be obsolete?

Folding quantized initializers into add-like nodes did not repsect the
order of inputs to the add node correctly. This is fixed by testing for
one of the two possible orders and selecting the following indices
accordingly.

Shape inference following the transformation is fixed by deleting the
annotations instead of propagating them incorrectly. Deleting the shape
annotations should not hurt, as these are redone by running shape
inference after each transformation anyways.
This probably is still rather sketchy, but at least it tries to check
the data layout annotation. For now seems to be enough for getting the
thresholds of multi-head attention right, IF qonnx properly annotates
the 3D layouts.
Add is commutative and thus the export does not always generate the
initializer as the second input. However, this was always assumed by
this transformation, failing via assertion if the inputs were simply
ordered differently. The transformation now handles both of the two
possible input orderings.
Note: This applies to the "container" type, not the simulated
quantization type. This is to prevent accidental promotion to float64.
Up until now, this was not a problem, as QONNX and FINN assumed all
tensors to be either broadcasted offline, or, if not, be "trivially"
boradcastable, like scalars or effectively scalar tensors. With the
introduction of proper multidirectional broadcasting for elementwise
binary operations, this might not be the case anymore and we need to
explicitly reject these from being absorbed into multi-thresholds, if
broadcasting is not possible (otherwise, without testing, this
transformation just fails with some numpy exception).
Shapes propagating backwards in graph transformations can break
subsequent shape inference. In particular, this is the case for
operators involving broadcasting semantics, where the output shape
cannot be fully reflected in the input shapes, i.e., even for
elementwise operations, the input shape might not be identical to the
output shape. This is fixed be deleting problematic shape annotations to
be re-done immediately.
@iksnagreb
Copy link
Contributor Author

This now relies on cherry-picked commits from #901 and #1030.

The new test case tests export, streamlining, conversion to hardware
layers and subsequent Python, C++ and RTL simulation of QuantEltwiseAdd
from Brevitas, serving as a representative example of an elementwise
binary operation.
Shape propagation when reordering around elementwise addition did not
behave as expected when any of the tensors is broadcast by one of the
reordered operations. This is fixed by deleting and re-doing the shape
annotations for the connecting tensors of the reordered pattern.
Without MoveLinearPastEltwiseAdd the two input streams variant of the
integration test did not actually convert the elementiwse addition to a
hardware operator, effectively "testing" the vanilla ONNX version of the
operator. With this transformation and AbsorbSignBiasIntoMultiThreshold
to get the signs right, the hardware operator is tested as intended now.
This is done mostly according to the Vitis High-Level Synthesis User
Guide (UG1399), see the library reference on arbitrary precision integer
types. The new transformations are added to all relevant test cases and
some data type need to be adjusted to make the numpy references behave
more robust.
Join-node Mul operations have no intitializer (parameters) and thus
there is nothing to factor out.
This is probably just a workaround and proper datatype inference should
be implemented later. For now it seems more safe to implicitly treat the
resulting parameter tensor as floating-point than assuming a wrong
datatype. In most cases the resulting Add operation will later be
absorbed and rounded into some thresholds anyway.
Comment on lines +472 to +473
assert _min != 0
assert _max != 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is there an assert for min/max of rhs not equal to zero, specifically for const rhs?
this can happen for e.g. ReLU implemented as ElementwiseMaximum(x, 0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants