Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise division by a constant at runtime for integer division #10348

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

JAicewizard
Copy link

When the rhs of an integer divide is known to be constant, it is possible to optimize this. Libdivide does this optimization at runtime, so even when the constant isn't known at compile time, it can still perform similar optimizations.

As you can see below the code using libdivide runs up to twice as fast. It should be even faster for unsigned integers.
It might be possible for the compiler to auto-vectorise the branchless version of libdivide, but I have not looked into that. Open for future research.

Issues with this code

I personally find the current method a bit messy, as it is the responsibility of the OPWRAPPER to decide what to optimize and what not. On one hand this allows the OP to be as generic as it was before. On the other hand, adding this libdivide optimization to more types will result in lots of duplicate code across the wrappers.

Benchmarks (ran with turbo-boost disabled)

No optimisation:

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	1.170270
benchmark/micro/arithmetic/division_constrhs.benchmark	2	1.173304
benchmark/micro/arithmetic/division_constrhs.benchmark	3	1.164200
benchmark/micro/arithmetic/division_constrhs.benchmark	4	1.179625
benchmark/micro/arithmetic/division_constrhs.benchmark	5	1.177252

Just performing row validity check based on the rhs (if possible):

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	1.113453
benchmark/micro/arithmetic/division_constrhs.benchmark	2	1.130460
benchmark/micro/arithmetic/division_constrhs.benchmark	3	1.125331
benchmark/micro/arithmetic/division_constrhs.benchmark	4	1.134362
benchmark/micro/arithmetic/division_constrhs.benchmark	5	1.120623

Using libdivide:

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	0.498989
benchmark/micro/arithmetic/division_constrhs.benchmark	2	0.497163
benchmark/micro/arithmetic/division_constrhs.benchmark	3	0.498707
benchmark/micro/arithmetic/division_constrhs.benchmark	4	0.504418
benchmark/micro/arithmetic/division_constrhs.benchmark	5	0.502800

@JAicewizard
Copy link
Author

Builds are all failing seemingly because of formatting issues.
As mentioned the biggest issue I have with the code is that the responsibility of optimization is with the wrapper not the operation itself. As I am not very familiar with c++ templates, I don't really know how to improve this.
However if this isn't an issue on your side, I can remove the WIP and fix the formatting issues.

Similar optimisations can be done for the modulo operator. I will file a seperate PR for that once this is merged.

Tagging @lnkuiper since I already mentioned this to him on monday

@github-actions github-actions bot marked this pull request as draft January 26, 2024 11:43
@JAicewizard JAicewizard marked this pull request as ready for review January 26, 2024 11:56
@JAicewizard JAicewizard changed the title WIP: Optimise division by a constant at runtime for integer division Optimise division by a constant at runtime for integer division Jan 26, 2024
@github-actions github-actions bot marked this pull request as draft January 26, 2024 12:03
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Great performance results.

  • Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?
  • We avoid explicit SIMD instructions in DuckDB - and libdivide seems to contain many of them. In general it seems like a very complex library for what should be a relatively simple optimization. I wonder if we could either strip down libdivide, or switch to something like fastmod which seems to be a lot more simple.

@JAicewizard
Copy link
Author

Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?

I don't know! I don't know a lot of the internals of duckdb. I can write a function that does this, but I am not sure I can also do the optimization part of it. This does however seam like a much cleaner solution.

We avoid explicit SIMD instructions in DuckDB - and libdivide seems to contain many of them. In general it seems like a very complex library for what should be a relatively simple optimization. I wonder if we could either strip down libdivide, or switch to something like fastmod which seems to be a lot more simple.

I havn't yet tested fastmod, that was the library I was intending to use for the follow-up using modulo. One advantage of this library is that it also provides branchless versions, which may prove advantageous for automatic vectorisation. I also don't know the performance of fastmod for division
All the explicit vectorisation can be removed I think, I will look into that.

@Mytherin
Copy link
Collaborator

Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?

I don't know! I don't know a lot of the internals of duckdb. I can write a function that does this, but I am not sure I can also do the optimization part of it. This does however seam like a much cleaner solution.

Have a look here, I think the rewrite should be relatively straightforward.

@JAicewizard
Copy link
Author

I implemented it as a function instead of an operator. I wasn't entirely sure where to put it, but it can easily be moved of course.

I also looked into using fastmod for division, however it only supports 32 and 64 bit integers, and doesnt support 64 bit division on MSVC at all. I also measured the performance, and it is significantly slower.

@JAicewizard JAicewizard force-pushed the optimise_intdiv branch 2 times, most recently from 355b92e to fb06827 Compare January 28, 2024 13:11
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 13:12
@github-actions github-actions bot marked this pull request as draft January 28, 2024 21:08
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 21:08
@github-actions github-actions bot marked this pull request as draft January 28, 2024 21:15
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 21:16
@github-actions github-actions bot marked this pull request as draft January 28, 2024 21:21
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 21:22
@github-actions github-actions bot marked this pull request as draft January 31, 2024 14:09
@JAicewizard
Copy link
Author

I moved this PR to use fastmod. This reduced the available types this optimization applies to, to uint32_t, but performance is around the same.

@JAicewizard
Copy link
Author

I reimplemented fastmod using the duckdb 128 bit types to allow unsigned as well as signed 32 bit execution. It is even possible to implement the unsigned 64 bit variant using this.

To get the performance somewhat good I needed the fast-path of the hugeint multiply to be in the header, as otherwise the compiler cant optimize this. I left the slow path in the c++ file, but in case of a modern gcc or clang compiler, calling the multiply function will use the optimized bath and be inlined. This makes the multiply a lot faster (if used directly, not via *).

If this looks good I can easily implement this for uint64 and (u)int{8/16}

@JAicewizard JAicewizard marked this pull request as ready for review February 14, 2024 12:32
@github-actions github-actions bot marked this pull request as draft February 14, 2024 12:41
@JAicewizard JAicewizard marked this pull request as ready for review February 14, 2024 12:55
@JAicewizard JAicewizard requested a review from lnkuiper May 16, 2024 20:12
@Mytherin Mytherin marked this pull request as draft May 30, 2024 08:34
@Mytherin Mytherin marked this pull request as ready for review May 30, 2024 08:34
@Mytherin Mytherin deleted the branch duckdb:main June 21, 2024 12:39
@Mytherin Mytherin closed this Jun 21, 2024
@Mytherin Mytherin reopened this Jun 21, 2024
@Mytherin Mytherin changed the base branch from feature to main June 21, 2024 14:28
This allows for significantly faster multiplication of (u)hugeint on these compilers
This computes the multiplicative inverse ahead of time, when dividing
multiply by this instead. Multiplication is a lot faster than division.
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 26, 2024 09:30
@JAicewizard JAicewizard marked this pull request as ready for review June 26, 2024 09:31
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 26, 2024 09:37
@JAicewizard JAicewizard marked this pull request as ready for review June 26, 2024 09:38
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 26, 2024 10:11
@JAicewizard JAicewizard marked this pull request as ready for review June 26, 2024 10:11
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 26, 2024 10:18
@JAicewizard JAicewizard marked this pull request as ready for review June 26, 2024 10:18
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 26, 2024 15:24
@JAicewizard JAicewizard marked this pull request as ready for review June 26, 2024 15:24
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 26, 2024 16:36
@JAicewizard JAicewizard marked this pull request as ready for review June 26, 2024 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants