Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get loop unrolling boundaries via bit hack #46

Closed
gfoidl opened this issue Mar 30, 2018 · 3 comments
Closed

Get loop unrolling boundaries via bit hack #46

gfoidl opened this issue Mar 30, 2018 · 3 comments
Assignees
Milestone

Comments

@gfoidl
Copy link
Owner

gfoidl commented Mar 30, 2018

Terms

Term Meaning
n no of elements to iterate over
m upper boundary for loop
k unroll-factor
i loop variable

Status quo

The upper boundary m is calculated as follows: m = n - k
The loop is then written as:

int i = 0;
for (; i < m; i += k)
{
    // ...
}

The pattern used for unrolling the loops looks similar to this one:

private static int Sum(int n)
{
    int sum = 0;
    int i = 0;

    for (; i < n - 4; i += 4)
        sum += 4;

    if (i < n - 2)
    {
        sum += 2;
        i += 2;
    }

    for (; i < n; ++i)
        sum++;

    return sum;
}

Let's look to the for-loop for n = 12, so n - 4 = 8:

i = 0 -> i < 8 -> true
i = 4 -> i < 8 -> true
i = 8 -> i < 8 -> false

Two iterations are done, although a third could be done.

Proposal

The upper boundary for the loop could be calculated as follows: m = n - n % k or equivalent m = n / k * k. With the use of bit hacks this can be efficiently written as m = n & -k or equivalent m = n & ~(k - 1).

So the pattern becomes:

private static int Sum(int n)
{
    int sum = 0;
    int i = 0;

    for (; i < (n & ~3); i += 4)
        sum += 4;

    if (i < (n & ~1))
    {
        sum += 2;
        i += 2;
    }

    for (; i < n; ++i)
        sum++;

    return sum;
}

Let's look to the for-loop for n = 12, so n & ~3 = 12:

i = 0 -> i < 12 -> true
i = 4 -> i < 12 -> true
i = 8 -> i < 12 -> true
i = 12 -> i < 12 -> false

Three iterations are done, eating up all iterations to do.

Discussion

Iterations

The proposed variant "eats" more iterations from the loop. A simple example yield the following results:

T i:   0 c1: (c1:   0, c2:   0, c3:   0 total:   0) | c2: (c1:   0, c2:   0, c3:   0 total:   0)
T i:   1 c1: (c1:   0, c2:   0, c3:   1 total:   1) | c2: (c1:   0, c2:   0, c3:   1 total:   1)
T i:   2 c1: (c1:   0, c2:   0, c3:   2 total:   2) | c2: (c1:   0, c2:   0, c3:   2 total:   2)
T i:   3 c1: (c1:   0, c2:   0, c3:   3 total:   3) | c2: (c1:   0, c2:   0, c3:   3 total:   3)
F i:   4 c1: (c1:   0, c2:   0, c3:   4 total:   4) | c2: (c1:   0, c2:   4, c3:   0 total:   4)
T i:   5 c1: (c1:   0, c2:   4, c3:   1 total:   5) | c2: (c1:   0, c2:   4, c3:   1 total:   5)
T i:   6 c1: (c1:   0, c2:   4, c3:   2 total:   6) | c2: (c1:   0, c2:   4, c3:   2 total:   6)
T i:   7 c1: (c1:   0, c2:   4, c3:   3 total:   7) | c2: (c1:   0, c2:   4, c3:   3 total:   7)
F i:   8 c1: (c1:   0, c2:   4, c3:   4 total:   8) | c2: (c1:   8, c2:   0, c3:   0 total:   8)
T i:   9 c1: (c1:   8, c2:   0, c3:   1 total:   9) | c2: (c1:   8, c2:   0, c3:   1 total:   9)
T i:  10 c1: (c1:   8, c2:   0, c3:   2 total:  10) | c2: (c1:   8, c2:   0, c3:   2 total:  10)
T i:  11 c1: (c1:   8, c2:   0, c3:   3 total:  11) | c2: (c1:   8, c2:   0, c3:   3 total:  11)
F i:  12 c1: (c1:   8, c2:   0, c3:   4 total:  12) | c2: (c1:   8, c2:   4, c3:   0 total:  12)
T i:  13 c1: (c1:   8, c2:   4, c3:   1 total:  13) | c2: (c1:   8, c2:   4, c3:   1 total:  13)
T i:  14 c1: (c1:   8, c2:   4, c3:   2 total:  14) | c2: (c1:   8, c2:   4, c3:   2 total:  14)
T i:  15 c1: (c1:   8, c2:   4, c3:   3 total:  15) | c2: (c1:   8, c2:   4, c3:   3 total:  15)
F i:  16 c1: (c1:   8, c2:   4, c3:   4 total:  16) | c2: (c1:  16, c2:   0, c3:   0 total:  16)
T i:  17 c1: (c1:  16, c2:   0, c3:   1 total:  17) | c2: (c1:  16, c2:   0, c3:   1 total:  17)
T i:  18 c1: (c1:  16, c2:   0, c3:   2 total:  18) | c2: (c1:  16, c2:   0, c3:   2 total:  18)
T i:  19 c1: (c1:  16, c2:   0, c3:   3 total:  19) | c2: (c1:  16, c2:   0, c3:   3 total:  19)
F i:  20 c1: (c1:  16, c2:   0, c3:   4 total:  20) | c2: (c1:  16, c2:   4, c3:   0 total:  20)
T i:  21 c1: (c1:  16, c2:   4, c3:   1 total:  21) | c2: (c1:  16, c2:   4, c3:   1 total:  21)
T i:  22 c1: (c1:  16, c2:   4, c3:   2 total:  22) | c2: (c1:  16, c2:   4, c3:   2 total:  22)
T i:  23 c1: (c1:  16, c2:   4, c3:   3 total:  23) | c2: (c1:  16, c2:   4, c3:   3 total:  23)
F i:  24 c1: (c1:  16, c2:   4, c3:   4 total:  24) | c2: (c1:  24, c2:   0, c3:   0 total:  24)
T i:  25 c1: (c1:  24, c2:   0, c3:   1 total:  25) | c2: (c1:  24, c2:   0, c3:   1 total:  25)
T i:  26 c1: (c1:  24, c2:   0, c3:   2 total:  26) | c2: (c1:  24, c2:   0, c3:   2 total:  26)
T i:  27 c1: (c1:  24, c2:   0, c3:   3 total:  27) | c2: (c1:  24, c2:   0, c3:   3 total:  27)
F i:  28 c1: (c1:  24, c2:   0, c3:   4 total:  28) | c2: (c1:  24, c2:   4, c3:   0 total:  28)
T i:  29 c1: (c1:  24, c2:   4, c3:   1 total:  29) | c2: (c1:  24, c2:   4, c3:   1 total:  29)
T i:  30 c1: (c1:  24, c2:   4, c3:   2 total:  30) | c2: (c1:  24, c2:   4, c3:   2 total:  30)
T i:  31 c1: (c1:  24, c2:   4, c3:   3 total:  31) | c2: (c1:  24, c2:   4, c3:   3 total:  31)
F i:  32 c1: (c1:  24, c2:   4, c3:   4 total:  32) | c2: (c1:  32, c2:   0, c3:   0 total:  32)
T i:  33 c1: (c1:  32, c2:   0, c3:   1 total:  33) | c2: (c1:  32, c2:   0, c3:   1 total:  33)
T i:  34 c1: (c1:  32, c2:   0, c3:   2 total:  34) | c2: (c1:  32, c2:   0, c3:   2 total:  34)
T i:  35 c1: (c1:  32, c2:   0, c3:   3 total:  35) | c2: (c1:  32, c2:   0, c3:   3 total:  35)
F i:  36 c1: (c1:  32, c2:   0, c3:   4 total:  36) | c2: (c1:  32, c2:   4, c3:   0 total:  36)
T i:  37 c1: (c1:  32, c2:   4, c3:   1 total:  37) | c2: (c1:  32, c2:   4, c3:   1 total:  37)
T i:  38 c1: (c1:  32, c2:   4, c3:   2 total:  38) | c2: (c1:  32, c2:   4, c3:   2 total:  38)
T i:  39 c1: (c1:  32, c2:   4, c3:   3 total:  39) | c2: (c1:  32, c2:   4, c3:   3 total:  39)
F i:  40 c1: (c1:  32, c2:   4, c3:   4 total:  40) | c2: (c1:  40, c2:   0, c3:   0 total:  40)
T i:  41 c1: (c1:  40, c2:   0, c3:   1 total:  41) | c2: (c1:  40, c2:   0, c3:   1 total:  41)
T i:  42 c1: (c1:  40, c2:   0, c3:   2 total:  42) | c2: (c1:  40, c2:   0, c3:   2 total:  42)
T i:  43 c1: (c1:  40, c2:   0, c3:   3 total:  43) | c2: (c1:  40, c2:   0, c3:   3 total:  43)
F i:  44 c1: (c1:  40, c2:   0, c3:   4 total:  44) | c2: (c1:  40, c2:   4, c3:   0 total:  44)
T i:  45 c1: (c1:  40, c2:   4, c3:   1 total:  45) | c2: (c1:  40, c2:   4, c3:   1 total:  45)
T i:  46 c1: (c1:  40, c2:   4, c3:   2 total:  46) | c2: (c1:  40, c2:   4, c3:   2 total:  46)
T i:  47 c1: (c1:  40, c2:   4, c3:   3 total:  47) | c2: (c1:  40, c2:   4, c3:   3 total:  47)
F i:  48 c1: (c1:  40, c2:   4, c3:   4 total:  48) | c2: (c1:  48, c2:   0, c3:   0 total:  48)
T i:  49 c1: (c1:  48, c2:   0, c3:   1 total:  49) | c2: (c1:  48, c2:   0, c3:   1 total:  49)
T i:  50 c1: (c1:  48, c2:   0, c3:   2 total:  50) | c2: (c1:  48, c2:   0, c3:   2 total:  50)
T i:  51 c1: (c1:  48, c2:   0, c3:   3 total:  51) | c2: (c1:  48, c2:   0, c3:   3 total:  51)
F i:  52 c1: (c1:  48, c2:   0, c3:   4 total:  52) | c2: (c1:  48, c2:   4, c3:   0 total:  52)
T i:  53 c1: (c1:  48, c2:   4, c3:   1 total:  53) | c2: (c1:  48, c2:   4, c3:   1 total:  53)
T i:  54 c1: (c1:  48, c2:   4, c3:   2 total:  54) | c2: (c1:  48, c2:   4, c3:   2 total:  54)
T i:  55 c1: (c1:  48, c2:   4, c3:   3 total:  55) | c2: (c1:  48, c2:   4, c3:   3 total:  55)
F i:  56 c1: (c1:  48, c2:   4, c3:   4 total:  56) | c2: (c1:  56, c2:   0, c3:   0 total:  56)
T i:  57 c1: (c1:  56, c2:   0, c3:   1 total:  57) | c2: (c1:  56, c2:   0, c3:   1 total:  57)
T i:  58 c1: (c1:  56, c2:   0, c3:   2 total:  58) | c2: (c1:  56, c2:   0, c3:   2 total:  58)
T i:  59 c1: (c1:  56, c2:   0, c3:   3 total:  59) | c2: (c1:  56, c2:   0, c3:   3 total:  59)
F i:  60 c1: (c1:  56, c2:   0, c3:   4 total:  60) | c2: (c1:  56, c2:   4, c3:   0 total:  60)
T i:  61 c1: (c1:  56, c2:   4, c3:   1 total:  61) | c2: (c1:  56, c2:   4, c3:   1 total:  61)
T i:  62 c1: (c1:  56, c2:   4, c3:   2 total:  62) | c2: (c1:  56, c2:   4, c3:   2 total:  62)
T i:  63 c1: (c1:  56, c2:   4, c3:   3 total:  63) | c2: (c1:  56, c2:   4, c3:   3 total:  63)
F i:  64 c1: (c1:  56, c2:   4, c3:   4 total:  64) | c2: (c1:  64, c2:   0, c3:   0 total:  64)
T i:  65 c1: (c1:  64, c2:   0, c3:   1 total:  65) | c2: (c1:  64, c2:   0, c3:   1 total:  65)
T i:  66 c1: (c1:  64, c2:   0, c3:   2 total:  66) | c2: (c1:  64, c2:   0, c3:   2 total:  66)
T i:  67 c1: (c1:  64, c2:   0, c3:   3 total:  67) | c2: (c1:  64, c2:   0, c3:   3 total:  67)
F i:  68 c1: (c1:  64, c2:   0, c3:   4 total:  68) | c2: (c1:  64, c2:   4, c3:   0 total:  68)
T i:  69 c1: (c1:  64, c2:   4, c3:   1 total:  69) | c2: (c1:  64, c2:   4, c3:   1 total:  69)
T i:  70 c1: (c1:  64, c2:   4, c3:   2 total:  70) | c2: (c1:  64, c2:   4, c3:   2 total:  70)
T i:  71 c1: (c1:  64, c2:   4, c3:   3 total:  71) | c2: (c1:  64, c2:   4, c3:   3 total:  71)
F i:  72 c1: (c1:  64, c2:   4, c3:   4 total:  72) | c2: (c1:  72, c2:   0, c3:   0 total:  72)
T i:  73 c1: (c1:  72, c2:   0, c3:   1 total:  73) | c2: (c1:  72, c2:   0, c3:   1 total:  73)
T i:  74 c1: (c1:  72, c2:   0, c3:   2 total:  74) | c2: (c1:  72, c2:   0, c3:   2 total:  74)
T i:  75 c1: (c1:  72, c2:   0, c3:   3 total:  75) | c2: (c1:  72, c2:   0, c3:   3 total:  75)
F i:  76 c1: (c1:  72, c2:   0, c3:   4 total:  76) | c2: (c1:  72, c2:   4, c3:   0 total:  76)
T i:  77 c1: (c1:  72, c2:   4, c3:   1 total:  77) | c2: (c1:  72, c2:   4, c3:   1 total:  77)
T i:  78 c1: (c1:  72, c2:   4, c3:   2 total:  78) | c2: (c1:  72, c2:   4, c3:   2 total:  78)
T i:  79 c1: (c1:  72, c2:   4, c3:   3 total:  79) | c2: (c1:  72, c2:   4, c3:   3 total:  79)
F i:  80 c1: (c1:  72, c2:   4, c3:   4 total:  80) | c2: (c1:  80, c2:   0, c3:   0 total:  80)
T i:  81 c1: (c1:  80, c2:   0, c3:   1 total:  81) | c2: (c1:  80, c2:   0, c3:   1 total:  81)
T i:  82 c1: (c1:  80, c2:   0, c3:   2 total:  82) | c2: (c1:  80, c2:   0, c3:   2 total:  82)
T i:  83 c1: (c1:  80, c2:   0, c3:   3 total:  83) | c2: (c1:  80, c2:   0, c3:   3 total:  83)
F i:  84 c1: (c1:  80, c2:   0, c3:   4 total:  84) | c2: (c1:  80, c2:   4, c3:   0 total:  84)
T i:  85 c1: (c1:  80, c2:   4, c3:   1 total:  85) | c2: (c1:  80, c2:   4, c3:   1 total:  85)
T i:  86 c1: (c1:  80, c2:   4, c3:   2 total:  86) | c2: (c1:  80, c2:   4, c3:   2 total:  86)
T i:  87 c1: (c1:  80, c2:   4, c3:   3 total:  87) | c2: (c1:  80, c2:   4, c3:   3 total:  87)
F i:  88 c1: (c1:  80, c2:   4, c3:   4 total:  88) | c2: (c1:  88, c2:   0, c3:   0 total:  88)
T i:  89 c1: (c1:  88, c2:   0, c3:   1 total:  89) | c2: (c1:  88, c2:   0, c3:   1 total:  89)
T i:  90 c1: (c1:  88, c2:   0, c3:   2 total:  90) | c2: (c1:  88, c2:   0, c3:   2 total:  90)
T i:  91 c1: (c1:  88, c2:   0, c3:   3 total:  91) | c2: (c1:  88, c2:   0, c3:   3 total:  91)
F i:  92 c1: (c1:  88, c2:   0, c3:   4 total:  92) | c2: (c1:  88, c2:   4, c3:   0 total:  92)
T i:  93 c1: (c1:  88, c2:   4, c3:   1 total:  93) | c2: (c1:  88, c2:   4, c3:   1 total:  93)
T i:  94 c1: (c1:  88, c2:   4, c3:   2 total:  94) | c2: (c1:  88, c2:   4, c3:   2 total:  94)
T i:  95 c1: (c1:  88, c2:   4, c3:   3 total:  95) | c2: (c1:  88, c2:   4, c3:   3 total:  95)
F i:  96 c1: (c1:  88, c2:   4, c3:   4 total:  96) | c2: (c1:  96, c2:   0, c3:   0 total:  96)
T i:  97 c1: (c1:  96, c2:   0, c3:   1 total:  97) | c2: (c1:  96, c2:   0, c3:   1 total:  97)
T i:  98 c1: (c1:  96, c2:   0, c3:   2 total:  98) | c2: (c1:  96, c2:   0, c3:   2 total:  98)
T i:  99 c1: (c1:  96, c2:   0, c3:   3 total:  99) | c2: (c1:  96, c2:   0, c3:   3 total:  99)
F i: 100 c1: (c1:  96, c2:   0, c3:   4 total: 100) | c2: (c1:  96, c2:   4, c3:   0 total: 100)

errors: 25

As can be seen from the results variant A (status quo) follows the pattern 1, 2, 3, 4 for c3, whereas variant B (proposal) follows the pattern 0, 1, 2, 3 for c3. It can be clearly seen, that A misses "one" iteration of the unrolled loop.

Code-gen

Status quo

G_M56183_IG01:
       55                   push     rbp
       488BEC               mov      rbp, rsp

G_M56183_IG02:
       33C0                 xor      eax, eax
       33F6                 xor      esi, esi
       8D57F8               lea      edx, [rdi-8]           ; uppper bound for loop
       85D2                 test     edx, edx
       7E0A                 jle      SHORT G_M56183_IG04

G_M56183_IG03:                                              ; unrolled-loop start
       83C008               add      eax, 8
       83C608               add      esi, 8
       3BF2                 cmp      esi, edx
       7CF6                 jl       SHORT G_M56183_IG03    ; unrolled-loop end

G_M56183_IG04:
       8D57FC               lea      edx, [rdi-4]           ; comparison for if
       3BF2                 cmp      esi, edx
       7D0E                 jge      SHORT G_M56183_IG06
       83C004               add      eax, 4
       83C604               add      esi, 4
       3BF7                 cmp      esi, edi
       7D08                 jge      SHORT G_M56183_IG07

G_M56183_IG05:                                              ; normal loop start
       FFC0                 inc      eax
       FFC6                 inc      esi

G_M56183_IG06:
       3BF7                 cmp      esi, edi
       7CF8                 jl       SHORT G_M56183_IG05    ; normal loop end

G_M56183_IG07:
       5D                   pop      rbp
       C3                   ret

Proposal

G_M26886_IG01:
       55                   push     rbp
       488BEC               mov      rbp, rsp

G_M26886_IG02:
       33C0                 xor      eax, eax
       33F6                 xor      esi, esi
       8BD7                 mov      edx, edi               ; uppper bound for loop - pt1
       83E2F8               and      edx, -8                ; uppper bound for loop - pt2
       85D2                 test     edx, edx
       7E0A                 jle      SHORT G_M26886_IG04

G_M26886_IG03:                                              ; unrolled-loop start
       83C008               add      eax, 8
       83C608               add      esi, 8
       3BF2                 cmp      esi, edx
       7CF6                 jl       SHORT G_M26886_IG03    ; unrolled-loop end

G_M26886_IG04:
       8BD7                 mov      edx, edi
       83E2FC               and      edx, -4
       3BF2                 cmp      esi, edx
       7D0E                 jge      SHORT G_M26886_IG06
       83C004               add      eax, 4
       83C604               add      esi, 4
       3BF7                 cmp      esi, edi
       7D08                 jge      SHORT G_M26886_IG07

G_M26886_IG05:                                              ; normal loop start
       FFC0                 inc      eax
       FFC6                 inc      esi

G_M26886_IG06:
       3BF7                 cmp      esi, edi
       7CF8                 jl       SHORT G_M26886_IG05    ; normal loop end

G_M26886_IG07:
       5D                   pop      rbp
       C3                   ret

Interpretation

The code-gen is similar, except the calculation of m.

Status quo:

lea      edx, [rdi-8]
; ...
lea      edx, [rdi-4]

Proposal:

mov      edx, edi
and      edx, -8
; ...
mov      edx, edi
and      edx, -4

To my knowledge lea has a latency of 1 and is done in a separate processor stage, so it has "zero" cost.
mov and and each have a latency of 1, so in total it is latency of 2.
This shouldn't matter, because the work is done in the loop-body and when the unrolled-version is executed an additional time, the benefit should be given.

Additional improvements

Look out for 64->32->64 bit truncations in the registers while address-arithmetics is done.
This can be circumvented with nint (native int), but unfortunately this isn't available at the moment. Till then IntPtr and casts to (int*) (or any other pointer-type) can be used, to use arithmetics in the word size or define an using alias so to have nint either as long or int.

// ...
#if BIT_64
using nint = System.Int64;
#else
using nint = System.Int32;
#endif
// ...

namespace Foo
{
    // ...
}
@gfoidl gfoidl added this to the v1.1.0 milestone Mar 30, 2018
@gfoidl
Copy link
Owner Author

gfoidl commented Mar 31, 2018

mov and and each have a latency of 1, so in total it is latency of 2.

Here the mov is from register to register and modern cpu will do this by register renaming, so there is also "zero" cost. Thus both behave equally on this concern.

@gfoidl
Copy link
Owner Author

gfoidl commented Apr 19, 2018

casts to (int*) (or any other pointer-type) can be used, to use arithmetics in the word size

int* or any other pointer-type can only be used for comparisons (address-based), but not for arithmetics in the sense of + or -. Here byte* has to be used. See Arithmetic Operations on Pointers (C# Programming Guide).

Here -- i.e. in this project -- this isn't needed because unsafe code is used, so no 64->32->64 truncation will occur.

@gfoidl
Copy link
Owner Author

gfoidl commented Apr 19, 2018

For the loop unrolling boundaries it is essential that the "base" is 0.

Look at the following code:

//#define NORMALIZE

using System;
using System.Diagnostics;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            Foo(3, 5);
        }

        private static void Foo(int i, int n)
        {
#if NORMALIZE
            int offset = i;
            n -= i;
            i = 0;
#else
            int offset = 0;
#endif
            int m = n & ~3;
            for (; i < m; i += 4)
            {
                Print(i + 0, n, offset);
                Print(i + 1, n, offset);
                Print(i + 2, n, offset);
                Print(i + 3, n, offset);
            }
            Console.WriteLine();

            m = n & ~1;
            if (i < m)
            {
                Print(i + 0, n, offset);
                Print(i + 1, n, offset);

                i += 2;
            }
            Console.WriteLine();

            for (; i < n; ++i)
                Console.WriteLine(i + offset);
        }

        [DebuggerStepThrough]
        private static void Print(int i, int n, int offset)
        {
            Debug.Assert(i < n);
            Console.Write($"{i + offset} ");
        }
    }
}

It will crash, due index out of range (although in unsafe code this might not show up).
m will be 4, and i < m is true. When accessing the index by i + 2 it will crash.

So the initial offset has to be taken into account. The easiest way is to normalize the indices, by setting them to 0. In the above example comment out the #define, and it won't crash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant