Optimise division by a constant at runtime for integer division #10348

JAicewizard · 2024-01-25T22:25:18Z

When the right-hand side (rhs) of an integer division is constant, the division can be replaced by cheaper operations. Libdivide performs this optimization at runtime, so it still applies when the constant isn't known at compile time.

As the benchmarks below show, the code using libdivide runs more than twice as fast (about 0.50 versus 1.17 without the optimisation). It should be even faster for unsigned integers.
It might be possible for the compiler to auto-vectorise the branchless version of libdivide, but I have not looked into that; it is open for future research.
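
For readers unfamiliar with the technique, here is a minimal sketch of dividing by a runtime constant for unsigned 32-bit values. It follows the fastmod-style formulation (precompute ceil(2^64 / d), then take the upper bits of a product) rather than libdivide's own code, and it is not the code from this PR. All names (`ComputeMagicU32`, `MulHigh`, `FastDivU32`, `DivideColumn`) are made up for illustration, and the sketch assumes the divisor is non-zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Precompute the multiplier for a divisor that is only known at runtime.
// Assumption: d > 1. For d == 1 the value below would wrap around to 0.
inline uint64_t ComputeMagicU32(uint32_t d) {
	assert(d > 1);
	return UINT64_MAX / d + 1; // equals ceil(2^64 / d)
}

// Upper 64 bits of the 96-bit product magic * n, using only 64-bit arithmetic.
inline uint64_t MulHigh(uint64_t magic, uint32_t n) {
	const uint64_t low = (magic & 0xFFFFFFFFULL) * n;
	const uint64_t high = (magic >> 32) * n;
	return (high + (low >> 32)) >> 32;
}

// Computes n / d, where magic == ComputeMagicU32(d).
inline uint32_t FastDivU32(uint32_t n, uint64_t magic) {
	return static_cast<uint32_t>(MulHigh(magic, n));
}

// Divide every element by the same divisor: one real division up front
// (inside ComputeMagicU32), then one multiply-high per element.
inline void DivideColumn(const uint32_t *in, uint32_t *out, std::size_t count, uint32_t d) {
	assert(d != 0); // assumption: division by zero is handled elsewhere
	if (d == 1) {
		for (std::size_t i = 0; i < count; i++) {
			out[i] = in[i];
		}
		return;
	}
	const uint64_t magic = ComputeMagicU32(d);
	for (std::size_t i = 0; i < count; i++) {
		out[i] = FastDivU32(in[i], magic);
	}
}
```

The per-element work is two 64-bit multiplications, an add and two shifts, with no division instruction in the loop.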

Issues with this code

I personally find the current method a bit messy, as it is the responsibility of the OPWRAPPER to decide what to optimize and what not. On one hand this allows the OP to be as generic as it was before. On the other hand, adding this libdivide optimization to more types will result in lots of duplicate code across the wrappers.

Benchmarks (run with turbo-boost disabled)

No optimisation:

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	1.170270
benchmark/micro/arithmetic/division_constrhs.benchmark	2	1.173304
benchmark/micro/arithmetic/division_constrhs.benchmark	3	1.164200
benchmark/micro/arithmetic/division_constrhs.benchmark	4	1.179625
benchmark/micro/arithmetic/division_constrhs.benchmark	5	1.177252

Just performing row validity check based on the rhs (if possible):

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	1.113453
benchmark/micro/arithmetic/division_constrhs.benchmark	2	1.130460
benchmark/micro/arithmetic/division_constrhs.benchmark	3	1.125331
benchmark/micro/arithmetic/division_constrhs.benchmark	4	1.134362
benchmark/micro/arithmetic/division_constrhs.benchmark	5	1.120623

Using libdivide:

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	0.498989
benchmark/micro/arithmetic/division_constrhs.benchmark	2	0.497163
benchmark/micro/arithmetic/division_constrhs.benchmark	3	0.498707
benchmark/micro/arithmetic/division_constrhs.benchmark	4	0.504418
benchmark/micro/arithmetic/division_constrhs.benchmark	5	0.502800

JAicewizard · 2024-01-25T22:40:19Z

Builds are all failing, seemingly because of formatting issues.
As mentioned, the biggest issue I have with the code is that the responsibility for the optimization lies with the wrapper, not the operation itself. As I am not very familiar with C++ templates, I don't really know how to improve this.
However, if this isn't an issue on your side, I can remove the WIP and fix the formatting issues.

Similar optimisations can be done for the modulo operator. I will file a separate PR for that once this is merged.

Tagging @lnkuiper since I already mentioned this to him on Monday.

Mytherin

Thanks for the PR! Great performance results.

Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?
We avoid explicit SIMD instructions in DuckDB - and libdivide seems to contain many of them. In general it seems like a very complex library for what should be a relatively simple optimization. I wonder if we could either strip down libdivide, or switch to something like fastmod which seems to be a lot more simple.
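
A rough sketch of what such a rewrite could look like, on a toy expression tree. `Expr`, `Column`, `Constant`, `Function` and `RewriteConstantDivision` are hypothetical names for illustration only; DuckDB's real expression classes and the ArithmeticSimplificationRule are structured differently. The sketch also assumes that a zero divisor should be left to the regular division operator.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal expression tree (not DuckDB's Expression classes).
struct Expr {
	enum class Kind { COLUMN_REF, CONSTANT, FUNCTION };
	Kind kind;
	std::string name; // column name or function name
	int64_t value;    // only meaningful for CONSTANT
	std::vector<std::unique_ptr<Expr>> children;
};

std::unique_ptr<Expr> Column(std::string name) {
	std::unique_ptr<Expr> e(new Expr());
	e->kind = Expr::Kind::COLUMN_REF;
	e->name = std::move(name);
	e->value = 0;
	return e;
}

std::unique_ptr<Expr> Constant(int64_t value) {
	std::unique_ptr<Expr> e(new Expr());
	e->kind = Expr::Kind::CONSTANT;
	e->value = value;
	return e;
}

std::unique_ptr<Expr> Function(std::string name, std::unique_ptr<Expr> lhs, std::unique_ptr<Expr> rhs) {
	assert(lhs && rhs);
	std::unique_ptr<Expr> e(new Expr());
	e->kind = Expr::Kind::FUNCTION;
	e->name = std::move(name);
	e->value = 0;
	e->children.push_back(std::move(lhs));
	e->children.push_back(std::move(rhs));
	return e;
}

// Rewrites  x // C  into  divide_by_const(x, C)  when C is a non-zero constant.
// Returns true if the node was rewritten.
bool RewriteConstantDivision(Expr &expr) {
	if (expr.kind != Expr::Kind::FUNCTION || expr.name != "//" || expr.children.size() != 2) {
		return false;
	}
	const Expr &rhs = *expr.children[1];
	if (rhs.kind != Expr::Kind::CONSTANT || rhs.value == 0) {
		return false; // non-constant or zero divisor: keep the regular operator
	}
	expr.name = "divide_by_const";
	return true;
}
```

With this shape the division operator itself stays untouched, and the specialised function only ever sees a constant divisor.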

JAicewizard · 2024-01-26T19:57:30Z

> Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?

I don't know! I don't know a lot about the internals of DuckDB. I can write a function that does this, but I am not sure I can also do the optimization part of it. This does, however, seem like a much cleaner solution.

> We avoid explicit SIMD instructions in DuckDB - and libdivide seems to contain many of them. In general it seems like a very complex library for what should be a relatively simple optimization. I wonder if we could either strip down libdivide, or switch to something like fastmod which seems to be a lot more simple.

I haven't tested fastmod yet; that was the library I intended to use for the follow-up on modulo. One advantage of libdivide is that it also provides branchless versions, which may prove advantageous for automatic vectorisation. I also don't know how fastmod performs for division.
All the explicit vectorisation can be removed, I think; I will look into that.
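
To make the "branchless" point concrete, here is a sketch of a branch-free unsigned 32-bit divider based on the classic Granlund-Montgomery round-up scheme. It is not libdivide's code, and the names `BranchfreeDividerU32`, `MakeBranchfreeDivider`, `Divide` and `DivideAll` are made up for illustration. The per-element step is a multiply, a subtract, an add and two shifts with no data-dependent branch, which is the kind of loop body a compiler may be able to vectorise; whether it actually does so has not been verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Precomputed state for dividing uint32_t values by one fixed divisor.
struct BranchfreeDividerU32 {
	uint32_t magic;
	uint8_t shift1;
	uint8_t shift2;
};

inline BranchfreeDividerU32 MakeBranchfreeDivider(uint32_t d) {
	assert(d != 0);
	uint32_t l = 0; // l = ceil(log2(d))
	while ((uint64_t(1) << l) < d) {
		l++;
	}
	// magic = floor(2^32 * (2^l - d) / d) + 1, which always fits in 32 bits
	const uint64_t magic = ((uint64_t(1) << 32) * ((uint64_t(1) << l) - d)) / d + 1;
	BranchfreeDividerU32 result;
	result.magic = static_cast<uint32_t>(magic);
	result.shift1 = static_cast<uint8_t>(l < 1 ? l : 1);
	result.shift2 = static_cast<uint8_t>(l > 0 ? l - 1 : 0);
	return result;
}

// Computes n / d without any branch on n or d.
inline uint32_t Divide(uint32_t n, const BranchfreeDividerU32 &div) {
	const uint32_t t = static_cast<uint32_t>((uint64_t(div.magic) * n) >> 32);
	return (t + ((n - t) >> div.shift1)) >> div.shift2;
}

// Straight-line loop body: the same instructions run for every element.
inline void DivideAll(const uint32_t *in, uint32_t *out, std::size_t count, uint32_t d) {
	const BranchfreeDividerU32 div = MakeBranchfreeDivider(d);
	for (std::size_t i = 0; i < count; i++) {
		out[i] = Divide(in[i], div);
	}
}
```

Unlike schemes that special-case powers of two or a divisor of 1, this variant handles every non-zero divisor through the same code path, at the cost of a few extra instructions per element.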

Mytherin · 2024-01-27T09:22:52Z

> > Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?

> I don't know! I don't know a lot about the internals of DuckDB. I can write a function that does this, but I am not sure I can also do the optimization part of it. This does, however, seem like a much cleaner solution.

Have a look here, I think the rewrite should be relatively straightforward.

JAicewizard · 2024-01-28T12:54:37Z

I implemented it as a function instead of an operator. I wasn't entirely sure where to put it, but it can easily be moved of course.

I also looked into using fastmod for division; however, it only supports 32- and 64-bit integers, and doesn't support 64-bit division on MSVC at all. I also measured the performance, and it is significantly slower.

src/core_functions/scalar/math/numeric.cpp

JAicewizard · 2024-01-31T14:13:26Z

I moved this PR to fastmod. This restricts the optimization to uint32_t, but performance is about the same.

JAicewizard · 2024-02-14T12:31:39Z

I reimplemented fastmod using the DuckDB 128-bit types, which allows both unsigned and signed 32-bit execution. It is even possible to implement the unsigned 64-bit variant this way.
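
As an illustration of the signed 32-bit case, here is a sketch in the style of fastmod's signed division. It uses the compiler-provided `__int128` (GCC or Clang) as a stand-in for DuckDB's 128-bit types, so it is not the code from this PR. `ComputeMagicS32` and `FastDivS32` are hypothetical names, and the divisor is assumed not to be 0, 1, -1 or INT32_MIN.

```cpp
#include <cassert>
#include <cstdint>

// Precompute the multiplier for a signed divisor known only at runtime.
// Assumption: d is not 0, 1, -1 or INT32_MIN.
inline uint64_t ComputeMagicS32(int32_t d) {
	assert(d != 0 && d != 1 && d != -1 && d != INT32_MIN);
	const uint64_t abs_d = d < 0 ? static_cast<uint64_t>(-static_cast<int64_t>(d)) : static_cast<uint64_t>(d);
	// Powers of two need one extra unit so that exact negative multiples round correctly.
	const uint64_t is_power_of_two = (abs_d & (abs_d - 1)) == 0 ? 1 : 0;
	return UINT64_MAX / abs_d + 1 + is_power_of_two;
}

// Computes a / d (truncating towards zero), where magic == ComputeMagicS32(d).
inline int32_t FastDivS32(int32_t a, uint64_t magic, int32_t d) {
	// Signed 128-bit product; keep the upper 64 bits.
	// The right shift of a negative value is arithmetic on GCC and Clang (rounds towards -infinity).
	int32_t quotient = static_cast<int32_t>((static_cast<__int128>(magic) * a) >> 64);
	quotient += (a < 0); // turn rounding towards -infinity into truncation
	return d < 0 ? -quotient : quotient;
}
```

The magic value depends only on the absolute value of the divisor; the sign of the divisor is applied at the end.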

To get reasonable performance I needed the fast path of the hugeint multiply to be in the header, as otherwise the compiler can't optimize it. I left the slow path in the C++ file, but with a modern GCC or Clang compiler, calling the multiply function will use the optimized path and be inlined. This makes the multiply a lot faster (if used directly, not via *).
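
The header/source split described above could look roughly like the following sketch. `U128`, `MulWide`, `MultiplySlow` and `Multiply` are hypothetical names, not DuckDB's uhugeint_t API, and overflow detection is left out: the result simply wraps to 128 bits. The fast path relies on `__SIZEOF_INT128__`, which GCC and Clang define when `unsigned __int128` is available.

```cpp
#include <cstdint>

// Hypothetical stand-in for an unsigned 128-bit value.
struct U128 {
	uint64_t lower;
	uint64_t upper;
};

// Full 64x64 -> 128-bit product built from 32-bit limbs.
inline U128 MulWide(uint64_t a, uint64_t b) {
	const uint64_t a_lo = a & 0xFFFFFFFFULL, a_hi = a >> 32;
	const uint64_t b_lo = b & 0xFFFFFFFFULL, b_hi = b >> 32;
	const uint64_t p0 = a_lo * b_lo;
	const uint64_t p1 = a_lo * b_hi;
	const uint64_t p2 = a_hi * b_lo;
	const uint64_t p3 = a_hi * b_hi;
	// Sum of the middle 32-bit column, including its carry.
	const uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFULL) + (p2 & 0xFFFFFFFFULL);
	U128 result;
	result.lower = (mid << 32) | (p0 & 0xFFFFFFFFULL);
	result.upper = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
	return result;
}

// Portable slow path; in the layout described above this would stay in the .cpp file.
U128 MultiplySlow(U128 a, U128 b) {
	U128 result = MulWide(a.lower, b.lower);
	result.upper += a.lower * b.upper + a.upper * b.lower; // wraps modulo 2^64
	return result;
}

// Header-visible entry point: with native 128-bit support the compiler can
// inline this into the caller; otherwise it falls back to the slow path.
inline U128 Multiply(U128 a, U128 b) {
#ifdef __SIZEOF_INT128__
	const unsigned __int128 x = (static_cast<unsigned __int128>(a.upper) << 64) | a.lower;
	const unsigned __int128 y = (static_cast<unsigned __int128>(b.upper) << 64) | b.lower;
	const unsigned __int128 product = x * y;
	U128 result;
	result.lower = static_cast<uint64_t>(product);
	result.upper = static_cast<uint64_t>(product >> 64);
	return result;
#else
	return MultiplySlow(a, b);
#endif
}
```

Because `Multiply` is defined inline in the header, a caller such as a fast-division routine can have the multiplication folded into its own loop instead of paying for a function call per element.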

If this looks good, I can easily implement this for uint64 and (u)int{8/16}.

github-actions bot marked this pull request as draft January 26, 2024 11:43

JAicewizard marked this pull request as ready for review January 26, 2024 11:56

JAicewizard changed the title from "WIP: Optimise division by a constant at runtime for integer division" to "Optimise division by a constant at runtime for integer division" Jan 26, 2024

github-actions bot marked this pull request as draft January 26, 2024 12:03

Mytherin reviewed Jan 26, 2024

Mytherin added the Changes Requested label Jan 26, 2024

JAicewizard force-pushed the optimise_intdiv branch 2 times, most recently from 355b92e to fb06827 Compare January 28, 2024 13:11

JAicewizard marked this pull request as ready for review January 28, 2024 13:12

github-actions bot marked this pull request as draft January 28, 2024 21:08

JAicewizard marked this pull request as ready for review January 28, 2024 21:08

github-actions bot marked this pull request as draft January 28, 2024 21:15

JAicewizard marked this pull request as ready for review January 28, 2024 21:16

github-actions bot marked this pull request as draft January 28, 2024 21:21

JAicewizard marked this pull request as ready for review January 28, 2024 21:22

xuke-hat reviewed Jan 30, 2024

src/core_functions/scalar/math/numeric.cpp (outdated)

github-actions bot marked this pull request as draft January 31, 2024 14:09

JAicewizard force-pushed the optimise_intdiv branch from bf4f550 to aef45de Compare January 31, 2024 14:14

JAicewizard force-pushed the optimise_intdiv branch from 9a5c7b0 to 229b309 Compare February 14, 2024 11:33

JAicewizard requested a review from Mytherin February 14, 2024 12:31

JAicewizard marked this pull request as ready for review February 14, 2024 12:32

github-actions bot marked this pull request as draft February 14, 2024 12:41

JAicewizard force-pushed the optimise_intdiv branch from f89cb84 to 4ff7d3a Compare February 14, 2024 12:54

JAicewizard marked this pull request as ready for review February 14, 2024 12:55

JAicewizard requested a review from lnkuiper May 16, 2024 20:12

Mytherin marked this pull request as draft May 30, 2024 08:34

Mytherin marked this pull request as ready for review May 30, 2024 08:34

Mytherin deleted the branch duckdb:main June 21, 2024 12:39

Mytherin closed this Jun 21, 2024

Mytherin reopened this Jun 21, 2024

Mytherin changed the base branch from feature to main June 21, 2024 14:28

JAicewizard added 5 commits June 26, 2024 11:30

(u)hugeint multiply will now be inlined when uint128 is supported

f4c1807

This allows for significantly faster multiplication of (u)hugeint on these compilers

Implement fast division for both 32-bit integer types.

3a60684

This computes the multiplicative inverse ahead of time; when dividing, multiply by it instead. Multiplication is a lot faster than division.

Automatically optimise constrhs division to use fast function

a035805

Add support for (u)int{8,16} and uint64 fast constant division

5697680

Use templates for FastDiv and ComputeM

2f47ce1

JAicewizard force-pushed the optimise_intdiv branch from 2e16c9c to 2f47ce1 Compare June 26, 2024 09:30

duckdb-draftbot marked this pull request as draft June 26, 2024 09:30

JAicewizard marked this pull request as ready for review June 26, 2024 09:31

Fix formatting

828b623

duckdb-draftbot marked this pull request as draft June 26, 2024 09:37

JAicewizard marked this pull request as ready for review June 26, 2024 09:38

fix warning

3d952b3

duckdb-draftbot marked this pull request as draft June 26, 2024 10:11

JAicewizard marked this pull request as ready for review June 26, 2024 10:11

try again

31ebf00

duckdb-draftbot marked this pull request as draft June 26, 2024 10:18

JAicewizard marked this pull request as ready for review June 26, 2024 10:18

Remove duplicate include

70846ed

duckdb-draftbot marked this pull request as draft June 26, 2024 15:24

JAicewizard marked this pull request as ready for review June 26, 2024 15:24

Rename struct

b8ee059

duckdb-draftbot marked this pull request as draft June 26, 2024 16:36

JAicewizard marked this pull request as ready for review June 26, 2024 16:38
