Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can "Math.DivRem" do division operation only once? #4155

Closed
ygc369 opened this issue Apr 21, 2015 · 23 comments
Closed

Can "Math.DivRem" do division operation only once? #4155

ygc369 opened this issue Apr 21, 2015 · 23 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@ygc369
Copy link

ygc369 commented Apr 21, 2015

Currently it does division twice, but most CPUs support to get the two values by doing division only once.
Besides,this method should support uint and ulong too.

category:cq
theme:div-mod-rem
skill-level:expert
cost:large
impact:medium

@jkotas
Copy link
Member

jkotas commented Apr 21, 2015

Math.DivRem is not in the .NET Core public surface, but I agree that it would be nice for the code generators to recognize the pattern and perform the division just one.

cc @CarolEidt @cmckinsey

@stephentoub
Copy link
Member

cc: @KrzysztofCwalina, as IIRC, Krzysztof bumped up against this as a measurable cost when doing some recent fomatting/parsing experimentation.

@CarolEidt
Copy link
Contributor

This would not be a very difficult pattern to recognize: a div and a rem with incoming value numbers that match, though it doesn't "fit" with any existing (or other common) optimization patterns. The thing that would be challenging would be to deal with the resulting IR node that would produce two values in registers. Although that would not be all that difficult, it would require more pervasive changes.

@tannergooding
Copy link
Member

👍

In short term, couldn't Math.DivRem be treated as a special-case (intrinsic)? I wouldn't think that this optimization wouldn't really require many changes.

Long term though, adding an optimization pattern to recognize / and % with matching value members would be great.

@CarolEidt
Copy link
Contributor

@tannergooding - you would think that it would be simpler, but actually it requires the messiest part of the work (handling the node in the IR that produces two values), while only relieving the JIT of what is probably the more straightforward part of the implementation. It might actually be easier on x86 (32 bit) where we already have to deal with longs that occupy multiple registers.

@tannergooding
Copy link
Member

@CarolEidt, I'm actually wondering why this would require more pervasive changes?

I would think that instructions like idiv should already document (via some metadata) that they end up writing two registers, correct? As without this being carried around by the code generator, it would potentially end up overwriting the second argument of the parameter list whenever it generates the idiv instruction.

I've not looked at the code, but I would guess that various instructions do indicate what registers end up being input/output by the instruction and that the changes here would be to allow the node to actually carry the second value (which could probably be done via inheritance and type checking). I actually think this is in a similar vein to having an intrinsic that takes 3 arguments where the GenTreeOp currently only allows 2 arguments.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@ygc369
Copy link
Author

ygc369 commented Mar 29, 2020

@CarolEidt
Is there any progress about this?

@ygc369
Copy link
Author

ygc369 commented Aug 26, 2020

@tannergooding @CarolEidt
After more than five years, has this issue been solved?

@tannergooding
Copy link
Member

No, but .NET 5 brought about some register allocator changes that should (hopefully) make it simpler to support now.

@caihongxu
Copy link

Before this is addressed in .NET 5, any remedies we could do ourselves. e.g. is it possible to roll our own DivRem method to achieve doing division only once?

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@ygc369
Copy link
Author

ygc369 commented Feb 18, 2021

@tannergooding
From this blog Announcing .NET 6 Preview 1, I see that there will be a new overload of DivRem for .Net 6, which returns a tuple. Will this new overload be able to do division only once?

@tannergooding
Copy link
Member

The new API surface is setup in a way that would enable that optimization on x86/x64. I don't believe there is any instruction to compute quotient and remainder simultaneously on ARM32/ARM64.

Whether or not it gets enabled likely depends on prioritization and availability from the dev's familiar enough with that area of the codebase.

@Bip901
Copy link

Bip901 commented Mar 9, 2022

@tannergooding @CarolEidt After more than five years, has this issue been solved?

Seems like it's doing a single division now. Still not using the x86 optimization though.

@ygc369
Copy link
Author

ygc369 commented Jul 25, 2022

I wonder why this issue so difficult to fix? .Net 7 is coming soon, when will this issue be fixed? @tannergooding

@tannergooding
Copy link
Member

tannergooding commented Jul 25, 2022

The issue is complex because you have to take an instruction with very explicit register usage and then map that back to a higher level concept.

In particular, div and idiv are really two instructions in one. They compute the quotient and the remainder "simultaneously". They also don't do simple 32 / 32 = 32 or 64 / 64 = 64, they do 128 / 64 = 64 and 64 / 32 = 32. They have a requirement that the input be in RDX:RAX (or EDX:EAX, or DX:AX, or AH:AL) and then return the result in left (Quotient, so RAX, EAX, AX, or AH) and right (Remainder, so RDX, EDX, DX, or AL).

This then needs to be mapped, at a higher level, back to T DivRem(T left, T right, out T remainder) or (T Quotient, T Remainder) DivRem(T left, T right).

In the case of the first, the Quotient is simple because the "natural" return register is RAX/EAX. The output is slightly problematic because what you have is really an address (so a GC tracked T*).

In the case of the latter, what you have is a struct containing two fields. On some platforms this will be "one register" (often RAX/EAX), on other platforms it will be "two registers" (often RAX:RDX/EAX:EDX), and on others still it will actually be an "out buffer" (so the signature is really ref ValueTuple<T, T> DivRem(out ValueTuple<T, T> result, T left, T right)).

In each of these cases, the JIT has some complexities it has to work around and deal with such as eliding the pointers/indirections and ensuring the data can be consumed directly.

There is also the note that this is an x86/x64 specific problem. Other platforms just have "div" and if you also need the remainder, you compute it exactly as x86/x64 is already in DivRem (remainder = left - (quotient * right)). On modern CPUs this isn't going to be terrible either because modern x86/x64 can optimize the case where you don't consume the remainder. Likewise, if right (the divisor) is a constant, you can often skip using div/idiv entirely and use an alternative sequence that is faster instead (at which point computing the remainder manually is often strictly better). -- One of the main optimizations an x86/x64 compiler can do is avoiding actually emitting div/idiv however it can because it is that expensive (which is something we do already do, so the light up opportunities for actually using a single div/idiv are rarer in the first place).


Now, the support was added to the JIT to better support instructions that "return" multiple registers. However, it is still quite a complex area due to all the scenarios, edge cases, and considerations listed above. We'll get it eventually, but its not the highest priority consideration since its likely not the bottleneck in most applications and when we do get to it, we have to ensure that codegen is good and not worse than the "fallback" we're generating today.

@colejohnson66
Copy link

@tannergooding
Copy link
Member

DivRem has been implemented to do a division and then compute the remainder using multiplication for many releases now.

On non-xarch platforms, this is the only way to compute the remainder as well, notably. There is no single instruction that computes div+rem simultaneously.

@colejohnson66
Copy link

Yes. So the main complaint that it performs two divisions is no longer valid.

@tannergooding
Copy link
Member

And the overall history of the issue, which needs to be taken into account, covers that such a change went in. This issue then still exists to track the underlying spirit of the bug which is that its doing more work than necessary on some platforms.

@ygc369
Copy link
Author

ygc369 commented May 29, 2024

@tannergooding
Using multiplication is also extra work, although maybe faster than an extra division. Isn't it possible to get div+rem at the same time just by doing division once and no other operation? Maybe this cannot be done for all platforms as some platforms do not support it by hardware. But there must be platforms that support it, so, can it be done just on supported platforms?

@bazzilic
Copy link

multiplication is also extra work, although maybe faster than an extra division

it is indeed significantly faster

can it be done just on supported platforms

some comments above answer this question in great detail, e.g.: #4155 (comment)

@colejohnson66
Copy link

On x86, a division can take 15-30 clock cycles (sometimes over 40) and can only start one ever three or four. Multiplication, on the other hand, takes about eight and can start a new one every cycle.

@tannergooding
Copy link
Member

Isn't it possible to get div+rem at the same time just by doing division once and no other operation

For x86/x64, yes. It isn't possible on other platforms as they don't provide such support. However, even for x86/x64 the benefit there is really using 1 instruction, it isn't actually much faster (than doing div, mul, sub explicitly). In many cases it is actually more desirable to break it apart to avoid a dependency chain or use an alternative sequence (especially if one of the inputs is known constant).

Modern CPUs (almost any in the past 12-14 years) account for these things and often have special div/idiv support that breaks it apart internally to what is functionally div, mul, sub already so that the quotient can be available prior to the remainder being available. They may even skip the internal microcode for computing the remainder in some cases if they can see its unused. The time is also often dependent on the number of significant bits in the dividend, although even newer processors have changed implementations and now it is often based on the number of significant bits in the quotient instead.

On x86, a division can take 15-30 clock cycles (sometimes over 40) and can only start one ever three or four. Multiplication, on the other hand, takes about eight and can start a new one every cycle.

Notably this very much depends on microarchitecture, whether you're doing signed or unsigned, and it has massively improved in even more recent years

  • Haswell (2013)
    • idiv (32-bit) took around 20-28 cycles with 8 reciprocal throughput
    • idiv (64-bit) took around 37-101 cycles with 24 reciprocal throughput.
  • Cannon Lake (2018)
    • idiv (32-bit) took around 10-15 cycles with 6 reciprocal throughput
    • idiv (64-bit) took around 14-18 cycles with 10 reciprocal throughput.

Your multiplication info is also quite a bit off, both 32-bit and 64-bit have been 3 cycles with 1 reciprocal throughput since the early 2000s. If you're doing a big multiplication (64x64=128) then it used to take 7-9 or 3-10 cycles for pre 2011 chips, but that's not typical.

@ygc369 ygc369 closed this as completed May 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jun 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests