-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DivMod instrinct for intel x86/x64 #27292
Comments
CC. @CarolEidt, would appreciate your thoughts here. This falls into the category of "core" instructions (not behind any flag), but which also provide additional functionality that is one or more of:
There are other instructions that also fall into this category, such as |
@Daniel-Svensson, could you update the original post to roughly follow the layout given as an example here: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md#steps Could you also update the original post to include the signed-division overloads (which would call |
@tannergooding I've tried to update the post based on the example which the documentation linked to. I've added two versions for IDIV since I was a bit unsure about which version you would be interested in. I've also tried to add namespaces/classes based on some arm instrinct I found. Regarding Add with Carry and Subtract with Carry are there any issue tracking those? It would be very useful to have them as cross platform instrincts, 64bit operations could just fall back to 2x 32bit operations. I've have some C++ improvements to decimal arithmetics with 150%+ speedups around and having those operations would be required to have any managed implementation with similar performance) |
Both the 64-bit / 32-bit, and the 32-bit / 32-bit cases are interesting.
We don't want to have a x64 specific class. Currently we are just exposing them together in the same class and calling the 64-bit only overloads (on a 32-bit box) will throw a PNSE.
No, there isn't a tracking issue today.
We actually have |
I would like to extend this proposal to also include public static class Base
{
// 32 * 32 => 64 bit signed multiply
public uint BigMul(int a, int b, out int hi);
// 32 * 32 => 64 bit unsigned multiply
public uint BigMul(uint a, uint b, out uint hi);
// 64 * 64 => 128 bit signed multiply
public ulong BigMul(long a, long b, out long hi);
// 64 * 64 => 128 bit unsigned multiply
public ulong BigMul(ulong a, ulong b, out ulong hi);
} |
We should just make the existing |
I think that is (possibly) worth further discussion. Even if we update
|
There are several issues with the existing
And unlike IDIV, a general purpose division method has dividend size equal to quotient size, which results in long division being performed with a 128-bit IDIV on x64 with a 2-4x latency and 4-15x throughput penalty compared to 64-bit IDIV (x86 goes through a slow helper call). Maybe these intrinsics don't need to be public at first, but they would be useful even just to implement the existing |
A couple more useful additions: BSF and BSR public abstract class Base
{
public abstract class X64
{
/// <summary>
/// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
/// </summary>
public static int BitScanForward(long mask);
public static int BitScanForward(ulong mask);
/// <summary>
/// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
/// </summary>
public static int BitScanReverse(long x);
public static int BitScanReverse(ulong x);
}
/// <summary>
/// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
/// </summary>
public static int BitScanForward(int mask);
public static int BitScanForward(uint mask);
/// <summary>
/// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
/// </summary>
public static int BitScanReverse(int x);
public static int BitScanReverse(uint x);
} There are several places in CoreFX that have been updated recently to use the TZCNT and LZCNT HWIntrinsics, but they must fall back to a slow path for processors not supporting the BMI1 or LZCNT ISAs, respectively, because the JIT requires the corresponding CPUID flag be set. However, as described here, TZCNT is encoded as REP BSF, which means it can execute on downlevel processors (with different behavior for a 0 input). Software built with an AOT compiler can take advantage of the instruction compatibility to have a single code path when the TZCNT/BSF behavior is known to be the same, but we can't do the same with the JIT. On the other hand, we could code paths for both instructions if BSF were exposed explicitly. Likewise, in place of LZCNT, we could use Given the number of processors not supporting the BMI1 and LZCNT ISAs, including the modern Intel Goldmont architecture, it would be worthwhile to be able to code both paths. Also related: https://github.com/dotnet/corefx/issues/32269 |
@pentp I think that would be a better fit to add to |
128-bit MUL does not really fit into the existing |
A For C/C++ they expose a I dont think it would be unreasonable, or out of scope, to expose similar boolean properties for a small selection of Math APIs, like |
I've updated the description to match that there is now a X86Base instrinct class. |
I can mark it ready-for-review now but I don't know if this will get on the backlog before we lock everything down. The API review will also need to go over whether this is actually worthwhile to expose. As @jkotas indicated, this may be sufficiently covered (for the most common cases) via The remaining reason for the intrinsic is to support |
In addition to int abs = Math.Abs(value,out int sign); // int for -1/0/1
long l = Math.IntFraction(123.456f,out float f); // 123 and 0.456f
l = Math.IntFraction(123.456,out double d); // 123 and 0.456 Is it possible? |
Those are possible but they should be distinct proposals following the API proposal issue template: https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&template=02_api_proposal.md The APIs you mention are also different in that they aren't intrinsics supported by hardware. |
namespace System.Runtime.Intrinsics.X86
{
partial class X86Base
{
public (uint Quotient, uint Remainder) DivRem(uint lower, uint upper, uint divisor);
public (int Quotient, int Remainder) DivRem(uint lower, int upper, int divisor);
public (nuint Quotient, nuint Remainder) DivRem(nuint lower, nuint upper, nuint divisor);
public (nint Quotient, nint Remainder) DivRem(nuint lower, nint upper, nint divisor);
[Intrinsic]
public static class X64
{
public (ulong Quotient, ulong Remainder) DivRem(ulong lower, ulong upper, ulong divisor);
public (long Quotient, long Remainder) DivRem(ulong lower, long upper, long divisor);
}
} |
I added the related MUL intrinsics API proposal: #58263 |
I looked into this, but found it's not easy, since the only tuple-returning intrinsic was CPUID, which is not optimal in implementation (calls a function). Ideally, the JIT should be updated to handle multi-reg returning instructions better, and the LSRA can be tuned to handle instructions with fixed register input (namely DIV/IDIV), which can also improve the codegen for BigMul. @tannergooding This one is too hard for community contribution. Maybe some JIT expert should also be called for this. |
@huoyaoyuan Alternatively, we could create another @tannergooding, what do you think? |
The JIT already has support for correctly handling multi-register intrinsics. It was added by Carol Eidt a while back. This support has already been utilized for other intrinsics such as #64864 and the same would need to apply to |
@tannergooding @huoyaoyuan |
If such mechanism has already been used, I should be able to complete this. I have already done managed parts. |
@stephentoub Why was this reopened? I thought it is implemented? |
Because it's currently marked as: [RequiresPreviewFeatures("DivRem is in preview.")] |
The remaining work is being tracked in #82194, so this one could still be closed? |
Please add hw instrinct support for divide on intel platforms.
Rationale and Usage
The intel instruction div is a bit special since 32bit divide takes a 64 dividend and 32bit divisor and returns both quotient and remainder as result.
It would be very useful, not only for speeding up Math.DivRem (https://github.com/dotnet/coreclr/issues/757 ) but especially for several decimal operations which would make it more feasible to move decimal code to C# (and maybe several number functions).
Significant speedup using div in C++ provided significant speedups in https://github.com/dotnet/coreclr/issues/10642)
Example:
The following code perfomes division of a 96bit unsigned number by a 32bit number.
Updating the 96bit number in-pace and returning the remainder
The example is simplified from actual code and contains only the "worst case" execution path.
It performs the same logic as in https://github.com/dotnet/corert/blob/d82d460a8530a57e4915060be37fb42c7a661f48/src/System.Private.CoreLib/shared/System/Decimal.DecCalc.cs#L228
That code has 2 64bit divides (much slower, especially in 32bit mode), and several multiplications (workarounds instead of using '%' which would double the executed div instructions currently).
Proposed API
The API might need some design, but I think it makes sense to keep it similar to
Math.DivRem
as below.Related functionality which would make sense to add.
It could use the instrinct for god performance on x86 platforms.
Details
Full description of instruction can be found in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf
Updates
I am unsure about which versions would be useful so I am adding both that came to my mind.
The text was updated successfully, but these errors were encountered: