Optimise division by a constant at runtime for integer division #10348
Builds are all failing, seemingly because of formatting issues. Similar optimisations can be done for the modulo operator. I will file a separate PR for that once this is merged. Tagging @lnkuiper since I already mentioned this to him on Monday.
Thanks for the PR! Great performance results.
- Can we create a separate `divide_by_const` function, and then rewrite `x // C` to `divide_by_const(x, C)` as an optimization in the `ArithmeticSimplificationRule`, instead of doing this optimization within the division operator?
- We avoid explicit SIMD instructions in DuckDB, and `libdivide` seems to contain many of them. In general it seems like a very complex library for what should be a relatively simple optimization. I wonder if we could either strip down libdivide, or switch to something like fastmod, which seems to be a lot simpler.
I don't know! I don't know a lot about the internals of DuckDB. I can write a function that does this, but I am not sure I can also do the optimization part of it. This does, however, seem like a much cleaner solution.
I haven't yet tested fastmod; that was the library I was intending to use for the follow-up using modulo. One advantage of libdivide is that it also provides branchless versions, which may prove advantageous for automatic vectorisation. I also don't know the performance of fastmod for division.
Have a look here, I think the rewrite should be relatively straightforward.
I implemented it as a function instead of an operator. I wasn't entirely sure where to put it, but it can easily be moved of course. I also looked into using |
355b92e to fb06827
I moved this PR to use fastmod. This reduced the available types this optimization applies to, to |
bf4f550 to aef45de
9a5c7b0 to 229b309
I reimplemented fastmod using the DuckDB 128-bit types to allow unsigned as well as signed 32-bit execution. It is even possible to implement the unsigned 64-bit variant using this. To get reasonable performance I needed the fast path of the hugeint multiply to be in the header, as otherwise the compiler can't optimize it. I left the slow path in the C++ file, but with a modern GCC or Clang compiler, calling the multiply function will use the optimized path and be inlined. This makes the multiply a lot faster (if used directly, not via If this looks good I can easily implement this for |
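To illustrate the fast-path/slow-path split described above, here is a minimal sketch of a 64x64-bit multiply that returns the high 64 bits of the product. The function names are hypothetical and this is not the hugeint code from the PR; it assumes the compiler-provided `unsigned __int128` type (available on GCC and Clang for 64-bit targets) for the fast path and falls back to a portable decomposition otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Portable slow path: split both operands into 32-bit halves and
// accumulate the partial products, tracking the carry into the high word.
inline uint64_t MulHigh64Portable(uint64_t a, uint64_t b) {
    const uint64_t mask = 0xFFFFFFFFULL;
    uint64_t a_lo = a & mask, a_hi = a >> 32;
    uint64_t b_lo = b & mask, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    uint64_t carry = ((p0 >> 32) + (p1 & mask) + (p2 & mask)) >> 32;
    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
}

// Fast path, defined inline in the header so the compiler can fold it
// into a single widening multiply at the call site.
inline uint64_t MulHigh64(uint64_t a, uint64_t b) {
#if defined(__SIZEOF_INT128__)
    return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
#else
    return MulHigh64Portable(a, b);
#endif
}

// Debug helper: both paths must agree for any pair of inputs.
inline void CheckMulHigh64Paths(uint64_t a, uint64_t b) {
    assert(MulHigh64(a, b) == MulHigh64Portable(a, b));
}
```

Because the fast path lives in the header, a call such as `MulHigh64(m, x)` inside a tight loop can be inlined rather than going through a function call into a separate translation unit.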
f89cb84 to 4ff7d3a
This allows for significantly faster multiplication of (u)hugeint on these compilers
This computes the multiplicative inverse ahead of time; each division then multiplies by this inverse instead. Multiplication is a lot faster than division.
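The idea can be sketched for unsigned 32-bit values as follows. This follows the fastmod-style scheme (precompute M = floor((2^64 - 1) / d) + 1, then take the high 64 bits of M * a), but the function names are hypothetical and the code is not taken from the PR. It assumes d > 1; a divisor of 1 would need a separate branch because M would overflow.

```cpp
#include <cassert>
#include <cstdint>

// Precompute the "magic" multiplier for a divisor d > 1:
// M = floor((2^64 - 1) / d) + 1.
inline uint64_t ComputeMagicU32(uint32_t d) {
    assert(d > 1);  // d == 1 would overflow M to 0 and must be handled separately
    return UINT64_MAX / d + 1;
}

// Bits 64 and above of the 96-bit product m * a, using only 64-bit arithmetic.
// hi + (lo >> 32) cannot overflow: hi <= (2^32 - 1)^2 and (lo >> 32) < 2^32.
inline uint64_t MulHighU64U32(uint64_t m, uint32_t a) {
    uint64_t lo = (m & 0xFFFFFFFFULL) * a;
    uint64_t hi = (m >> 32) * a;
    return (hi + (lo >> 32)) >> 32;
}

// a / d, computed as a multiplication by the precomputed inverse.
inline uint32_t FastDivU32(uint32_t a, uint64_t m) {
    return static_cast<uint32_t>(MulHighU64U32(m, a));
}
```

In a vectorised setting the multiplier would be computed once per constant, for example `uint64_t m = ComputeMagicU32(7);`, and then `FastDivU32(x, m)` would be applied to every row in place of `x / 7`.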
2e16c9c to 2f47ce1
When the right-hand side (rhs) of an integer division is known to be constant, the division can be optimised. Libdivide does this optimisation at runtime, so even when the constant isn't known at compile time, it can still perform similar optimisations.
As you can see below, the code using libdivide runs up to twice as fast. It should be even faster for unsigned integers.
It might be possible for the compiler to auto-vectorise the branchless version of libdivide, but I have not looked into that. Open for future research.
Issues with this code
I personally find the current method a bit messy, as it is the responsibility of the OPWRAPPER to decide what to optimize and what not. On one hand this allows the OP to be as generic as it was before. On the other hand, adding this libdivide optimization to more types will result in lots of duplicate code across the wrappers.
Benchmarks (run with turbo-boost disabled)
No optimisation:
Just performing row validity check based on the rhs (if possible):
Using libdivide: