Get loop unrolling boundaries via bit hack #46
Comments
Here the |
Here -- i.e. in this project -- this isn't needed because unsafe code is used, so no |
For the loop unrolling boundaries it is essential that the "base" is 0. Look at the following code:

```csharp
//#define NORMALIZE
using System;
using System.Diagnostics;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            Foo(3, 5);
        }

        private static void Foo(int i, int n)
        {
#if NORMALIZE
            // Shift the range so that it starts at 0; keep the original start as offset.
            int offset = i;
            n -= i;
            i = 0;
#else
            int offset = 0;
#endif
            // Unrolled by 4
            int m = n & ~3;
            for (; i < m; i += 4)
            {
                Print(i + 0, n, offset);
                Print(i + 1, n, offset);
                Print(i + 2, n, offset);
                Print(i + 3, n, offset);
            }
            Console.WriteLine();

            // Unrolled by 2
            m = n & ~1;
            if (i < m)
            {
                Print(i + 0, n, offset);
                Print(i + 1, n, offset);
                i += 2;
            }
            Console.WriteLine();

            // Remaining elements
            for (; i < n; ++i)
                Console.WriteLine(i + offset);
        }

        [DebuggerStepThrough]
        private static void Print(int i, int n, int offset)
        {
            Debug.Assert(i < n);
            Console.Write($"{i + offset} ");
        }
    }
}
```

Without `NORMALIZE` it will fail: the assertion `i < n` is violated, i.e. the index is out of range (although in unsafe code this might not show up). So the initial offset has to be taken into account. The easiest way is to normalize the indices, by setting them to |
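As an alternative to rewriting the indices, the mask can be applied to the length of the range instead of to the absolute end index. The following is only a minimal sketch of that idea; `RangeBounds`, `UnrolledBy4` and `CountVisited` are hypothetical names and not part of this project, and the range `[start, end)` is assumed to be non-empty or empty, never reversed.

```csharp
public static class RangeBounds
{
    // Upper boundary for a loop unrolled by 4 over [start, end).
    // The mask is applied to the length (end - start), not to `end` itself,
    // so a non-zero start does not push the boundary past the range.
    public static int UnrolledBy4(int start, int end)
    {
        int length = end - start;      // assumption: end >= start
        return start + (length & ~3);
    }

    // Counts the indices visited by the unrolled loop plus the scalar tail.
    public static int CountVisited(int start, int end)
    {
        int visited = 0;
        int i = start;
        int m = UnrolledBy4(start, end);

        for (; i < m; i += 4)
            visited += 4;              // stands in for four unrolled loop bodies

        for (; i < end; ++i)
            visited++;                 // remaining 0..3 elements

        return visited;
    }
}
```

For the call `Foo(3, 5)` from the example, the range has length 2, so the boundary computed this way equals the start index and no unrolled block runs.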
Terms
Status quo
The upper boundary `m` is calculated as follows: `m = n - k`
The loop is then written as:
The pattern used for unrolling the loops looks similar to this one:
Let's look at the for-loop for `n = 12`, so `n - 4 = 8`:
Two iterations are done, although a third could be done.
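The count above can be reproduced with a small sketch. It is not the project's code: `StatusQuo`, `UnrolledBlocks` and `Remainder` are hypothetical names, and the loop shape (`i` starting at 0, stepping by `k`, running while `i < n - k`) is an assumption based on the description of the boundary.

```csharp
public static class StatusQuo
{
    // Number of unrolled blocks of k elements that run when the
    // boundary is computed as m = n - k.
    public static int UnrolledBlocks(int n, int k)
    {
        int m = n - k;
        int blocks = 0;
        for (int i = 0; i < m; i += k)
            blocks++;
        return blocks;
    }

    // Elements left over for the non-unrolled tail.
    public static int Remainder(int n, int k) => n - UnrolledBlocks(n, k) * k;
}
```

With `n = 12` and `k = 4` this yields two unrolled blocks and leaves four elements for the tail, which is exactly one more full block that could have been unrolled.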
Proposal
The upper boundary for the loop could be calculated as `m = n - n % k` or, equivalently, `m = n / k * k`. For `k` a power of two and non-negative `n`, bit hacks allow this to be written efficiently as `m = n & -k` or, equivalently, `m = n & ~(k - 1)`.
So the pattern becomes:
Let's look at the for-loop for `n = 12`, so `n & ~3 = 12`:
Three iterations are done, consuming all of the iterations there are to do.
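The proposed boundary and the equivalence of the formulas can be checked with a short sketch. The names `Proposal`, `Boundary`, `UnrolledBlocks` and `FormulasAgree` are hypothetical, and the code assumes `n >= 0` and `k` a power of two.

```csharp
public static class Proposal
{
    // Boundary as proposed; assumes n >= 0 and k a power of two.
    public static int Boundary(int n, int k) => n & ~(k - 1);

    // Number of unrolled blocks of k elements that run with this boundary.
    public static int UnrolledBlocks(int n, int k)
    {
        int m = Boundary(n, k);
        int blocks = 0;
        for (int i = 0; i < m; i += k)
            blocks++;
        return blocks;
    }

    // Checks that the four formulas for the boundary give the same value.
    public static bool FormulasAgree(int n, int k)
    {
        int a = n - n % k;
        int b = n / k * k;
        int c = n & -k;
        int d = n & ~(k - 1);
        return a == b && b == c && c == d;
    }
}
```

For `n = 12` and `k = 4` the boundary is 12 and three unrolled blocks run. Note that for negative `n` the arithmetic and the bitwise forms differ, because `/` and `%` truncate toward zero in C# while the mask rounds toward negative infinity.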
Discussion
Iterations
The proposed variant "eats" more iterations from the loop. A simple example yields the following results:
As can be seen from the results, variant A (status quo) follows the pattern `1, 2, 3, 4` for `c3`, whereas variant B (proposal) follows the pattern `0, 1, 2, 3` for `c3`. It can be clearly seen that A misses "one" iteration of the unrolled loop.
Code-gen
Status quo
Proposal
Interpretation
The code-gen is similar, except for the calculation of `m`.
Status quo:
Proposal:
To my knowledge `lea` has a latency of 1 and is done in a separate processor stage, so it has "zero" cost. `mov` and `and` each have a latency of 1, so in total the latency is 2.
This shouldn't matter, because the work is done in the loop body, and when the unrolled version is executed one additional time, the extra cost should pay off.
Additional improvements
Look out for `64->32->64` bit truncations in the registers while address arithmetic is done.
This can be circumvented with `nint` (native int), but unfortunately this isn't available at the moment. Until then, `IntPtr` and casts to `(int*)` (or any other pointer type) can be used to do the arithmetic in the word size, or a `using` alias can be defined so that `nint` is either `long` or `int`.