Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RyuJIT] Div then Mod (or viceversa) executes idiv twice instead of using the value in the register #5213

Open
redknightlois opened this issue Mar 1, 2016 · 31 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@redknightlois
Copy link

redknightlois commented Mar 1, 2016

Found this trying to optimize a very low level memory addressing algorithm. Suddenly I noticed that we were doing the idiv twice.

Repro on rc2-16551:

  public class Program
    {
        public long Dividend;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public long Run(long parameter)
        {
            long div = parameter / Dividend;
            long mod = parameter % Dividend;

            return div + mod;
        }

        public static void Main(string[] args)
        {
            Random r = new Random();

            var p = new Program();
            p.Dividend = 10;

            long value = p.Run(203);
            Console.WriteLine($"{value}");
            Console.ReadLine();
        }
    }

will output (in x64 but will also happen in x86):

00007FF8E8EC3C83  mov         rcx,qword ptr [rcx+8]  
00007FF8E8EC3C87  mov         rax,r8          // 2 instructions to prepare for idiv. 
00007FF8E8EC3C8A  cqo                    
00007FF8E8EC3C8C  idiv        rax,rcx         // parameter / Dividend
00007FF8E8EC3C8F  mov         r9,rax          // r9:= divResult 
00007FF8E8EC3C92  mov         rax,r8          // 2 instructions to prepare for idiv. 
00007FF8E8EC3C95  cqo  
00007FF8E8EC3C97  idiv        rax,rcx         //  parameter / Dividend
00007FF8E8EC3C9A  lea         rax,[r9+rdx]    // rax := divResult + modulus

this could be written as:

00007FF8E8EC3C83  mov         rcx,qword ptr [rcx+8]  
00007FF8E8EC3C87  mov         rax,r8          // 2 instructions to prepare for idiv. 
00007FF8E8EC3C8A  cqo                    
00007FF8E8EC3C8C  idiv        rax,rcx         //  parameter / Dividend
00007FF8E8EC3C92  add         rax,rdx         //  rax := divResult + modulus

This is pretty common in addressing modes or accessing matrix locations, etc. Everywhere you have a row and a column, you will have such an structure.

category:cq
theme:div-mod-rem
skill-level:expert
cost:large
impact:medium

@mikedn
Copy link
Contributor

mikedn commented Mar 1, 2016

Related to #4155

@mikedn
Copy link
Contributor

mikedn commented Mar 1, 2016

This is pretty common in addressing modes or accessing matrix locations, etc. Everywhere you have a row and a column, you will have such an structure.

Div and mod have nothing to do with addressing, that's always done with add and mul. You're probably talking about something else.

@jakobbotsch
Copy link
Member

Div and mod have nothing to do with addressing, that's always done with add and mul. You're probably talking about something else.

Div and mod are pretty useful to recover row/column information from indices:

int index = row * width + col;
...
int row = index / width;
int col = index % width;

Mostly when you are treating a singular array as a 2D array (eg. through usage of Bitmap.LockBits).
That said, it's not very common, but not extremely rare either.

@redknightlois
Copy link
Author

@mikedn It is if you are working with stride based binary structures as @JanielS mentioned. I would mention too, that when you are dealing with building data structures using memory maps (therefore unsafe code), it is a pretty common thing to happen.

@redknightlois
Copy link
Author

@mikedn Actually this is a duplicate of #4155. I would have never though about calling it DivRem that's why I havent found it.

@CarolEidt Is the limitation of using multiple registers per single instruction still relevant?

@CarolEidt
Copy link
Contributor

Yes, the multiple register issue is still relevant, but likely something we will soon be addressing.

@mikedn
Copy link
Contributor

mikedn commented Mar 1, 2016

It is if you are working with stride based binary structures as @JanielS mentioned.

What @JanielS shows is not so common and certainly not related to the notion of "addressing modes" as used in compilers. More generally, accessing matrix elements has nothing to do with division and modulo. The offset of a matrix element is simply row * stride + column. Sure, there may be special/unusual/whatever cases that requires division and modulo but it's a long way from such cases to "Everywhere you have a row and a column, you will have such an structure.".

@redknightlois
Copy link
Author

@mikedn They are just two sides of the same coin.

AbsolutePosition := row * stride + column

and

row := AbsolutePosition / TotalColumns
column = AbsolutePosition % TotalColumns

Putting all in the context of the assembly/compiler use is why you are seeing it as special/unusual, where it is not. This pattern is usually hidden in plain sight, it is so common and used, that is very easy to overlook. Only in certain places you will see both together because that means you are trying to retrieve both values at the same time instead of accessing it using a for loop.

@redknightlois
Copy link
Author

redknightlois commented Nov 11, 2016

Any update on if we are going to see this anytime soon?

1

And the code that generates it.

1 1

This is actual code I cannot get away with, I will probably try to tweak it to reduce dependencies, but that's it. This is killing me, if I can get even 5% would be a huge win. :)

EDIT: All the program in context (marked in red address translations - div/mod heavy

1 2
)

cc @AndyAyersMS

@mikedn
Copy link
Contributor

mikedn commented Nov 11, 2016

Does BlocksPerPage change often? I would guess that it doesn't, it's fixed for a given BTree. If that's the case then you could use reciprocal division, it would likely be better than the JIT removing a redundant idiv.

@redknightlois
Copy link
Author

redknightlois commented Nov 11, 2016

@mikedn That's a great idea, I can probably pack 8 to 10 multiplications before paying for an idiv even more with 2 of them, I can do almost whatever I want and still probably win. Do you have any good resource on how to efficiently deal with a "sorta" fixed (but not constant) divisor?

@mikedn
Copy link
Contributor

mikedn commented Nov 12, 2016

Before eliminating division you should eliminate the reminder. n % d is simply n - (n / d) * d. You already have n / d so to compute % you only need a - and a * and these are very cheap compared to %. Come to think of it, the JIT could probably do this automatically if it turns out that removing the redundant idiv is problematic due to the multiple register issue.

Now, getting rid of the division completely may or may not be feasible. For a given denominator you have to store 2 numbers somewhere - a "magic" number and a shift count. Computing these 2 numbers is expensive so if you can't store them somewhere this won't work.

The exacts details can be found here, in the "Exact division by constants" section. To make your life easier you could simply pick up existing code from a compiler. RyuJIT has an implementation and I extracted it here.

It's not as fast as I hoped but it's still ~1.5x faster than division. The code generated by the compiler when the denominator is actually a constant is faster (3x) but that's expected, for example that code doesn't have to deal with positive/negative magic numbers at runtime.

Note that this implementation only works for positive numbers, supporting negative numbers makes MagicDiv more complicated and probably slower.

@redknightlois
Copy link
Author

@mikedn I can do that on tree creation and just store it and be done with it :).

@stephentoub
Copy link
Member

stephentoub commented Nov 14, 2016

Before eliminating division you should eliminate the reminder. n % d is simply n - (n / d) * d. You already have n / d so to compute % you only need a - and a * and these are very cheap compared to %. Come to think of it, the JIT could probably do this automatically if it turns out that removing the redundant idiv is problematic due to the multiple register issue.

Until there's a solution for this in the JIT, should:
https://github.com/dotnet/coreclr/blob/f8b8b6ab80f1c5b30cc04676ca2e084ce200161e/src/mscorlib/src/System/Math.cs#L733
be changed from:

public static long DivRem(long a, long b, out long result)
{
    result = a % b;
    return a / b;
}

to:

public static long DivRem(long a, long b, out long result)
{
    long div = a / b;
    result = a - (div * b);
    return div;
}

? (And the same for the Int32 overload.)

@redknightlois
Copy link
Author

@stephentoub Yes, it should. I can confirm that the performance of using the multiplication in the second case (in my case) has been huge.

@mikedn
Copy link
Contributor

mikedn commented Nov 14, 2016

Changing DivRem like that sounds like a good plan, but we need to remember to revert the change once (if) the JIT starts optimizing this. Unless we also make JIT recognize a - (div * b), I'm toying with the idea.

@stephentoub
Copy link
Member

Changing DivRem like that sounds like a good pla

Ok, I'll submit a PR for that, including a TODO linking to this issue.

@redknightlois
Copy link
Author

redknightlois commented Feb 17, 2017

@mikedn
Copy link
Contributor

mikedn commented Feb 17, 2017

I would guess that Constants.Size.Kilobyte is 1024 so you're dividing by 4096. So your example has nothing to do with this issue...

@redknightlois
Copy link
Author

Not that one.

This one:

const int ConstDivisor = 4 * 1024;
const int DivideAdjustment = ConstDivisor - 1;
static int DivTest1(int num)
{
    return num / ConstDivisor + (num % ConstDivisor == 0 ? 0 : 1);
}

@mikedn
Copy link
Contributor

mikedn commented Feb 17, 2017

How's that different? It still divides by 4096.

@redknightlois
Copy link
Author

redknightlois commented Feb 17, 2017

Read it wrong, It is using ConstDivisor and not DivideAdjustment. Anyway, just change the ConstDivisor to non const division and you get the same issue. For power of 2 and also const is indeed another issue probably worth to report by itself.

EDIT: Lately we decided to eliminate the flexibility and now all of our page sizes are constants just to avoid this issue altogether with a huge improvement as a result.

@redknightlois
Copy link
Author

redknightlois commented Aug 15, 2017

Another case for power of 2 constants.

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static int GetNumberOfOverflowPages(long overflowSize)
        {
            overflowSize += Constants.Tree.PageHeaderSize;
            return (int)(overflowSize / Constants.Storage.PageSize) + (overflowSize % Constants.Storage.PageSize == 0 ? 0 : 1);
        }

This code generates:

00007FFA13645D12 48 63 CB             movsxd      rcx,ebx  
00007FFA13645D15 48 83 C1 40          add         rcx,40h  
00007FFA13645D19 48 8B D1             mov         rdx,rcx  
00007FFA13645D1C 4C 8B C2             mov         r8,rdx  
00007FFA13645D1F 49 C1 F8 3F          sar         r8,3Fh  
00007FFA13645D23 49 81 E0 FF 1F 00 00 and         r8,1FFFh  
00007FFA13645D2A 4C 03 C2             add         r8,rdx  
00007FFA13645D2D 49 C1 F8 0D          sar         r8,0Dh  
00007FFA13645D31 41 8B D0             mov         edx,r8d  
00007FFA13645D34 4C 8B C1             mov         r8,rcx  
00007FFA13645D37 49 C1 F8 3F          sar         r8,3Fh  
00007FFA13645D3B 49 81 E0 FF 1F 00 00 and         r8,1FFFh  
00007FFA13645D42 4C 03 C1             add         r8,rcx  
00007FFA13645D45 49 81 E0 00 E0 FF FF and         r8,0FFFFFFFFFFFFE000h  
00007FFA13645D4C 49 2B C8             sub         rcx,r8  

However you can actually rewrite that (knowing that the constant is power of 2) into:

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static int GetNumberOfOverflowPages(long overflowSize)
        {            
            overflowSize += Constants.Tree.PageHeaderSize;
            return (int)(overflowSize >> Constants.Storage.PageSizeShift) + ((overflowSize & Constants.Storage.PageSizeMask) == 0 ? 0 : 1);
        }

Which will output:

00007FFA13635FF2 48 63 CB             movsxd      rcx,ebx  
00007FFA13635FF5 48 83 C1 40          add         rcx,40h  
00007FFA13635FF9 48 8B D1             mov         rdx,rcx  
00007FFA13635FFC 48 C1 FA 0D          sar         rdx,0Dh  
00007FFA13636000 F7 C1 FF 1F 00 00    test        ecx,1FFFh  
00007FFA13636006 74 07                je          00007FFA1363600F  
00007FFA13636008 B9 01 00 00 00       mov         ecx,1  
00007FFA1363600D EB 02                jmp         00007FFA13636011  
00007FFA1363600F 33 C9                xor         ecx,ecx  
00007FFA13636011 44 8D 34 0A          lea         r14d,[rdx+rcx] 

That is half the bytes.

@mikedn
Copy link
Contributor

mikedn commented Aug 15, 2017

Another case for power of 2 constants.

That's unrelated to the original issue, there's no idiv in this case.

@redknightlois
Copy link
Author

Yes I should have created a new one instead. While it is the same combination (using division and modulus over the same parameters), this one is using constants the other is not. Created https://github.com/dotnet/coreclr/issues/13380

@caihongxu
Copy link

Any update?

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@huoyaoyuan
Copy link
Member

huoyaoyuan commented Sep 24, 2022

This should be solved by #66551. Math.DivRem and X86Base.DivRem would be able to utilize the two registers then.
Though DivRem is effectively a workaround, it should cover the desired use cases.

@pentp
Copy link
Contributor

pentp commented Sep 26, 2022

Just calling X86Base.DivRem from Math.DivRem isn't the entire solution - it would be bad for constant divisors because it would bypass JIT optimizations and always execute the div/idiv instruction. So even if there isn't going to be a general JIT solution for this, then Math.DivRem would at least have to be made a JIT intrinsic.

@svick
Copy link
Contributor

svick commented Sep 26, 2022

@pentp Would RuntimeHelpers.IsKnownConstant help with that?

@pentp
Copy link
Contributor

pentp commented Sep 26, 2022

Probably yes, but for long/ulong needs #67285 (or part of it).

@TIHan TIHan removed the JitUntriaged CLR JIT issues needing additional triage label Oct 31, 2022
@TIHan TIHan self-assigned this Oct 31, 2022
@TIHan TIHan modified the milestones: Future, 8.0.0 Jan 17, 2023
@TIHan
Copy link
Member

TIHan commented Jan 28, 2023

Related: #4155

@TIHan TIHan removed their assignment Jan 28, 2023
lechu445 pushed a commit to lechu445/Microsoft.IO.RecyclableMemoryStream that referenced this issue Feb 12, 2023
Instead of division then remainder operation, used Math.DivRem which handles it in a more optimized way. It uses only a single division, multiplication, and subtraction instead of two divisions. Moreover, in the future, it can be even more optimized using intrinsics.

See:
dotnet/runtime#5213 (comment)
benmwatson pushed a commit to microsoft/Microsoft.IO.RecyclableMemoryStream that referenced this issue Feb 17, 2023
Instead of division then remainder operation, used Math.DivRem which handles it in a more optimized way. It uses only a single division, multiplication, and subtraction instead of two divisions. Moreover, in the future, it can be even more optimized using intrinsics.

See:
dotnet/runtime#5213 (comment)

Co-authored-by: lechu445 <xxx@example.com>
@TIHan TIHan modified the milestones: 8.0.0, Future Jun 9, 2023
lilinus added a commit to lilinus/runtime that referenced this issue Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests