Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unaligned stores/loads #30

Closed
nemequ opened this issue Apr 13, 2015 · 10 comments
Closed

Unaligned stores/loads #30

nemequ opened this issue Apr 13, 2015 · 10 comments
Assignees
Labels

Comments

@nemequ
Copy link
Contributor

nemequ commented Apr 13, 2015

ubsan detects a lot of undefined stores/loads:

/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff52 for type 'density_byte', which requires 4 byte alignment
0x00000206ff52: note: pointer points here
 69 70  0e 3f 39 90 98 7f 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  01 00 00 00 06 00
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff56 for type 'density_byte', which requires 4 byte alignment
0x00000206ff56: note: pointer points here
 70 72 69 6d 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  01 00 00 00 06 00 00 00  04 00
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff5a for type 'density_byte', which requires 4 byte alignment
0x00000206ff5a: note: pointer points here
 69 73  20 69 00 00 00 00 00 00  00 00 00 00 00 00 00 00  01 00 00 00 06 00 00 00  04 00 00 00 04 00
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff5e for type 'density_byte', which requires 4 byte alignment
0x00000206ff5e: note: pointer points here
 6e 20 66 61 00 00  00 00 00 00 00 00 00 00  01 00 00 00 06 00 00 00  04 00 00 00 04 00 00 00  00 00
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff62 for type 'density_byte', which requires 4 byte alignment
0x00000206ff62: note: pointer points here
 75 63  69 62 00 00 00 00 00 00  01 00 00 00 06 00 00 00  04 00 00 00 04 00 00 00  00 00 00 00 b7 73
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff66 for type 'density_byte', which requires 4 byte alignment
0x00000206ff66: note: pointer points here
 75 73 20 6f 00 00  01 00 00 00 06 00 00 00  04 00 00 00 04 00 00 00  00 00 00 00 b7 73 e4 b2  00 00
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff6a for type 'density_byte', which requires 4 byte alignment
0x00000206ff6a: note: pointer points here
 72 63  69 20 00 00 06 00 00 00  04 00 00 00 04 00 00 00  00 00 00 00 b7 73 e4 b2  00 00 00 00 00 00
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff6e for type 'density_byte', which requires 4 byte alignment
0x00000206ff6e: note: pointer points here
 6c 75 63 74 00 00  04 00 00 00 04 00 00 00  00 00 00 00 b7 73 e4 b2  00 00 00 00 00 00 00 00  00 00
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000206ff72 for type 'density_byte', which requires 4 byte alignment
0x00000206ff72: note: pointer points here
 75 73  20 65 00 00 04 00 00 00  00 00 00 00 b7 73 e4 b2  00 00 00 00 00 00 00 00  00 00 00 00 00 00
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000207007a for type 'density_byte', which requires 4 byte alignment
0x00000207007a: note: pointer points here
 76 65  72 72 00 00 00 00 00 00  c8 41 73 90 98 7f 00 00  c8 41 73 90 98 7f 00 00  70 00 07 02 00 00
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000207007e for type 'density_byte', which requires 4 byte alignment
0x00000207007e: note: pointer points here
 61 2e 20 43 00 00  c8 41 73 90 98 7f 00 00  c8 41 73 90 98 7f 00 00  70 00 07 02 00 00 00 00  70 00
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x000002070082 for type 'density_byte', which requires 4 byte alignment
0x000002070082: note: pointer points here
 72 61  73 20 73 90 98 7f 00 00  c8 41 73 90 98 7f 00 00  70 00 07 02 00 00 00 00  70 00 07 02 00 00
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x000002070086 for type 'density_byte', which requires 4 byte alignment
0x000002070086: note: pointer points here
 69 6e 74 65 00 00  c8 41 73 90 98 7f 00 00  70 00 07 02 00 00 00 00  70 00 07 02 00 00 00 00  79 e9
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000207008a for type 'density_byte', which requires 4 byte alignment
0x00000207008a: note: pointer points here
 72 64  75 6d 73 90 98 7f 00 00  70 00 07 02 00 00 00 00  70 00 07 02 00 00 00 00  79 e9 0d a3 75 c0
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000207008e for type 'density_byte', which requires 4 byte alignment
0x00000207008e: note: pointer points here
 20 76 65 6c 00 00  70 00 07 02 00 00 00 00  70 00 07 02 00 00 00 00  79 e9 0d a3 75 c0 8d d4  84 f2
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x000002070092 for type 'density_byte', which requires 4 byte alignment
0x000002070092: note: pointer points here
 20 6e  69 73 07 02 00 00 00 00  70 00 07 02 00 00 00 00  79 e9 0d a3 75 c0 8d d4  84 f2 35 bf 79 df
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x000002070096 for type 'density_byte', which requires 4 byte alignment
0x000002070096: note: pointer points here
 6c 20 69 6e 00 00  70 00 07 02 00 00 00 00  79 e9 0d a3 75 c0 8d d4  84 f2 35 bf 79 df a2 f6  3d 5e
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000207009a for type 'density_byte', which requires 4 byte alignment
0x00000207009a: note: pointer points here
 20 66  61 63 07 02 00 00 00 00  79 e9 0d a3 75 c0 8d d4  84 f2 35 bf 79 df a2 f6  3d 5e 6a cc 51 da
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000207009e for type 'density_byte', which requires 4 byte alignment
0x00000207009e: note: pointer points here
 69 6c 69 73 00 00  79 e9 0d a3 75 c0 8d d4  84 f2 35 bf 79 df a2 f6  3d 5e 6a cc 51 da 51 c3  9f 56
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700a2 for type 'density_byte', which requires 4 byte alignment
0x0000020700a2: note: pointer points here
 69 73  2e 20 0d a3 75 c0 8d d4  84 f2 35 bf 79 df a2 f6  3d 5e 6a cc 51 da 51 c3  9f 56 07 7d 4b 00
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700a6 for type 'density_byte', which requires 4 byte alignment
0x0000020700a6: note: pointer points here
 43 75 72 61 8d d4  84 f2 35 bf 79 df a2 f6  3d 5e 6a cc 51 da 51 c3  9f 56 07 7d 4b 00 2d 06  b6 78
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700aa for type 'density_byte', which requires 4 byte alignment
0x0000020700aa: note: pointer points here
 62 69  74 75 35 bf 79 df a2 f6  3d 5e 6a cc 51 da 51 c3  9f 56 07 7d 4b 00 2d 06  b6 78 64 8b d2 e2
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700ae for type 'density_byte', which requires 4 byte alignment
0x0000020700ae: note: pointer points here
 72 20 73 6f a2 f6  3d 5e 6a cc 51 da 51 c3  9f 56 07 7d 4b 00 2d 06  b6 78 64 8b d2 e2 c2 d9  a4 9b
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700b2 for type 'density_byte', which requires 4 byte alignment
0x0000020700b2: note: pointer points here
 6c 6c  69 63 6a cc 51 da 51 c3  9f 56 07 7d 4b 00 2d 06  b6 78 64 8b d2 e2 c2 d9  a4 9b e8 ee 63 94
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700b6 for type 'density_byte', which requires 4 byte alignment
0x0000020700b6: note: pointer points here
 69 74 75 64 51 c3  9f 56 07 7d 4b 00 2d 06  b6 78 64 8b d2 e2 c2 d9  a4 9b e8 ee 63 94 2a a5  33 a6
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700ba for type 'density_byte', which requires 4 byte alignment
0x0000020700ba: note: pointer points here
 69 6e  20 74 07 7d 4b 00 2d 06  b6 78 64 8b d2 e2 c2 d9  a4 9b e8 ee 63 94 2a a5  33 a6 63 ad d1 2f
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700be for type 'density_byte', which requires 4 byte alignment
0x0000020700be: note: pointer points here
 6f 72 74 6f 2d 06  b6 78 64 8b d2 e2 c2 d9  a4 9b e8 ee 63 94 2a a5  33 a6 63 ad d1 2f 8b 2e  e5 f0
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700c2 for type 'density_byte', which requires 4 byte alignment
0x0000020700c2: note: pointer points here
 72 20  76 65 64 8b d2 e2 c2 d9  a4 9b e8 ee 63 94 2a a5  33 a6 63 ad d1 2f 8b 2e  e5 f0 1c ef d4 ac
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700c6 for type 'density_byte', which requires 4 byte alignment
0x0000020700c6: note: pointer points here
 6c 20 63 6f c2 d9  a4 9b e8 ee 63 94 2a a5  33 a6 63 ad d1 2f 8b 2e  e5 f0 1c ef d4 ac 41 87  54 c2
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:109:33: runtime error: store to misaligned address 0x00000206ffc4 for type 'density_chameleon_signature', which requires 8 byte alignment
0x00000206ffc4: note: pointer points here
  20 73 65 64 00 00 00 00  00 00 00 00 20 74 65 6d  70 6f 72 20 70 75 72 75  73 20 63 75 72 73 75 73
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700d2 for type 'density_byte', which requires 4 byte alignment
0x0000020700d2: note: pointer points here
 2a a5  33 a6 63 ad d1 2f 8b 2e  e5 f0 1c ef d4 ac 41 87  54 c2 46 e2 4f 8c db 44  84 3c 1a 13 f1 4f
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700d6 for type 'density_byte', which requires 4 byte alignment
0x0000020700d6: note: pointer points here
 20 61 75 63 8b 2e  e5 f0 1c ef d4 ac 41 87  54 c2 46 e2 4f 8c db 44  84 3c 1a 13 f1 4f 5f 8e  eb b8
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x0000020700da for type 'density_byte', which requires 4 byte alignment
0x0000020700da: note: pointer points here
 74 6f  72 2e 1c ef d4 ac 41 87  54 c2 46 e2 4f 8c db 44  84 3c 1a 13 f1 4f 5f 8e  eb b8 07 d2 07 b6
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode.c:126:38: runtime error: store to misaligned address 0x00000207089a for type 'density_byte', which requires 4 byte alignment
0x00000207089a: note: pointer points here
 8b 39  f0 21 20 65 75 69 73 6d  6f 64 2c 20 6e 6f 6e 20  76 61 72 69 75 73 20 66  65 6c 69 73 20 64
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_encode_template.h:117:25: runtime error: store to misaligned address 0x00000207082a for type 'density_chameleon_signature', which requires 8 byte alignment
0x00000207082a: note: pointer points here
 75 73  20 66 75 65 74 20 65 73  74 20 e3 f9 20 64 69 63  74 75 6d 2e 65 f6 29 f5  6b 9a 70 75 73 20
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff52 for type 'density_byte', which requires 4 byte alignment
0x00000206ff52: note: pointer points here
 69 70  0e 3f 70 72 69 6d 69 73  20 69 6e 20 66 61 75 63  69 62 75 73 20 6f 72 63  69 20 6c 75 63 74
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff56 for type 'density_byte', which requires 4 byte alignment
0x00000206ff56: note: pointer points here
 70 72 69 6d 69 73  20 69 6e 20 66 61 75 63  69 62 75 73 20 6f 72 63  69 20 6c 75 63 74 75 73  20 65
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff5a for type 'density_byte', which requires 4 byte alignment
0x00000206ff5a: note: pointer points here
 69 73  20 69 6e 20 66 61 75 63  69 62 75 73 20 6f 72 63  69 20 6c 75 63 74 75 73  20 65 74 20 75 6c
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff5e for type 'density_byte', which requires 4 byte alignment
0x00000206ff5e: note: pointer points here
 6e 20 66 61 75 63  69 62 75 73 20 6f 72 63  69 20 6c 75 63 74 75 73  20 65 74 20 75 6c 3a 6d  65 73
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff62 for type 'density_byte', which requires 4 byte alignment
0x00000206ff62: note: pointer points here
 75 63  69 62 75 73 20 6f 72 63  69 20 6c 75 63 74 75 73  20 65 74 20 75 6c 3a 6d  65 73 20 70 6f 73
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff66 for type 'density_byte', which requires 4 byte alignment
0x00000206ff66: note: pointer points here
 75 73 20 6f 72 63  69 20 6c 75 63 74 75 73  20 65 74 20 75 6c 3a 6d  65 73 20 70 6f 73 75 65  72 65
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff6a for type 'density_byte', which requires 4 byte alignment
0x00000206ff6a: note: pointer points here
 72 63  69 20 6c 75 63 74 75 73  20 65 74 20 75 6c 3a 6d  65 73 20 70 6f 73 75 65  72 65 20 63 75 62
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x00000206ff6e for type 'density_byte', which requires 4 byte alignment
0x00000206ff6e: note: pointer points here
 6c 75 63 74 75 73  20 65 74 20 75 6c 3a 6d  65 73 20 70 6f 73 75 65  72 65 20 63 75 62 69 6c  69 61
             ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:85:24: runtime error: load of misaligned address 0x00000206ffc4 for type 'density_byte', which requires 8 byte alignment
0x00000206ffc4: note: pointer points here
  20 73 65 64 00 00 00 00  80 00 00 00 20 74 65 6d  70 6f 72 20 70 75 72 75  73 20 63 75 72 73 75 73
              ^ 
/home/nemequ/local/src/squash/plugins/density/density/src/kernel_chameleon_decode.c:97:14: runtime error: load of misaligned address 0x000002071cda for type 'density_byte', which requires 4 byte alignment
0x000002071cda: note: pointer points here
 00 00  e3 f9 20 64 69 63 74 75  6d 2e 65 f6 29 f5 6b 9a  70 75 73 20 13 23 72 a7  b4 b9 65 20 61 20
              ^

I only tested chameleon there, but it's probably a good bet that cheetah and lion have similar issues.

@g1mv
Copy link
Owner

g1mv commented Apr 15, 2015

I saw that this pointer alignment requirement is actually well-hidden in the C reference...
It's quite concerning because fixing it will require a good amount of memcpy I'm afraid, checking if the address is aligned would make things even slower so I guess memcpy is the best option.
That's going to generate a performance hit for sure. Do you have any idea on your side about a fast workaround ?

@nemequ
Copy link
Contributor Author

nemequ commented Apr 15, 2015

It seems like there really should be a way to get the C compiler to basically do what it is doing now if the hardward supports unaligned loads/stores, but fall back on loading/storing individual bytes on hardware which doesn't support it (MIPS?). I haven't found anything, though. I was thinking about asking on SO… I'll try to do that later today, I'll let you know how it goes.

If there is somewhere you basically want a fast memcpy so you use int64_t or something instead of uint8_t, I was thinking it might be possible to use the new OpenMP SIMD support. Something like

void not_memcpy (uint8_t* dest, uint8_t src, size_t size) {
  #pragma omp simd safelen(???)
  for (size_t i = 0 ; i < size ; i++)
    dest[i] = src[i];
}

I haven't tried it to see if there is any speedup, though, and obviously it would require OpenMP 4.0 for there to be one. It would probably be okay to omit safelen for a memcpy replacement, but for an memmove replacement it would take a bit of thought… if you're replacing a loop on uint64_t it could obviously be at least sizeof(uint64_t), but I expect it would be much better if it were at least 16, though 32 or 64 would be much better.

FWIW, my current understanding is that unaligned store/load is basically free on modern x86/x86_64 CPUs, the main danger is that the CC will auto-vectorize the code and unaligned access will trap on vectors. On ARM the situation is similar, except I believe there is a significant penalty for unaligned access. MIPS doesn't support unaigned access… the CPU will trap them, and by default Linux will currently catch that and emulate the request using safe instructions—cost of the safer instructions aside (basically loading uint8_ts and shifting/oring together a value), it's very expensive because of the whole trap/catch/retry thing.

@nemequ
Copy link
Contributor Author

nemequ commented Apr 15, 2015

http://fastcompression.blogspot.com/2014/11/portability-woes-endianess-and.html has some good ideas, though it doesn't do much about the auto-vectorization concern…

@g1mv
Copy link
Owner

g1mv commented Apr 15, 2015

Thanks for your replies, I checked the link which is very informative.
I like the SIMD idea for memcpy, and apparently it is already implemented on some platforms (osx). It does however require openmp 4 as you say ... Probably too restrictive for now.
After further thinking, I think I might be able to find a very fast workaround for chameleon and cheetah encoding/decoding, but for lion things are different and I don't see a fast enough solution just yet.

@nemequ
Copy link
Contributor Author

nemequ commented Apr 15, 2015

I like the SIMD idea for memcpy, and apparently it is already implemented on some platforms (osx). It does however require openmp 4 as you say ... Probably too restrictive for now.

I'm not sure what you mean here—if you're talking about memcpy using SIMD, all platforms should be doing that. However, the memcpy library function has some overhead as it will take some time to determine what method to use (depending on things like alignment). It's great for larger operations, but for smaller ones it is pretty expensive. That said, most compilers will actually inline many memcpy calls, especially for smaller buffers with sizes that are know at compile-time, so if you can use fixed sizes memcpy would probably be fairly snappy. GCC has a __builtin_memcpy, but AFAIK it's unnecessary unless you compile with -fno-builtins.

If you're talking about OS X (i.e. clang) supporting OpenMP 4.0, it doesn't—hopefully the next version of clang will. GCC does since 4.9. That said, you don't even have to put an ifdef around it… if the compiler doesn't support OpenMP 4 it will still work, it just will not use SIMD (unless the C compiler does it). If you want to take a vastly different approach when OpenMP 4.0 isn't available you can always use #if defined(_OPENMP) && (_OPENMP >= 201307)

@g1mv
Copy link
Owner

g1mv commented Apr 16, 2015

I'm not sure what you mean here—if you're talking about memcpy using SIMD, all platforms should be doing that. However, the memcpy library function has some overhead as it will take some time to determine what method to use (depending on things like alignment). It's great for larger operations, but for smaller ones it is pretty expensive. That said, most compilers will actually inline many memcpy calls, especially for smaller buffers with sizes that are know at compile-time, so if you can use fixed sizes memcpy would probably be fairly snappy. GCC has a __builtin_memcpy, but AFAIK it's unnecessary unless you compile with -fno-builtins.

I'll perform a few tests later on : memcpy vs direct copy by using uint types (unsafe due to alignment issues) vs openmp copies, on OS X.

If you're talking about OS X (i.e. clang) supporting OpenMP 4.0, it doesn't—hopefully the next version of clang will. GCC does since 4.9. That said, you don't even have to put an ifdef around it… if the compiler doesn't support OpenMP 4 it will still work, it just will not use SIMD (unless the C compiler does it). If you want to take a vastly different approach when OpenMP 4.0 isn't available you can always use #if defined(_OPENMP) && (_OPENMP >= 201307)

I was talking about this project :
https://github.com/clang-omp/clang
Very simple to deploy on OS X and it offers the omp simd pragma for clang. I'll check it out.

@nemequ
Copy link
Contributor Author

nemequ commented Apr 16, 2015

I was talking about this project :
https://github.com/clang-omp/clang
Very simple to deploy on OS X and it offers the omp simd pragma for clang. I'll check it out.

AFAIK that is the project they're trying to merge into clang. Unfortunately this has been going on for several years.

@g1mv
Copy link
Owner

g1mv commented Apr 17, 2015

Okay, I created a small code snip to test all of this :

#include <omp.h>
#include <stdio.h>
#include <strings.h>
#include <sys/resource.h>

#define MAX_SIZE    (1 << 24)
#define MICROSECONDS    1000000.0

void method_memcpy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        memcpy(output, input, size);
        *(output + j % size) = 123;
    }
}

void method_byte_to_byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < size; i ++)
            *(output + i) = *(input + i);
        *(output + j % size) = 123;
    }
}

void method_simd_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
#pragma omp simd
        for(unsigned int i = 0; i < size; i ++)
            *(output + i) = *(input + i);
        *(output + j % size) = 123;
    }
}

void method_unsafe_8byte_to_8byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < (size / sizeof(uint64_t)); i ++)
            *((uint64_t*)output + i) = *((uint64_t*)input + i);
        output[j % size] = 123;
    }
}

void method_unsafe_4byte_to_4byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < (size / sizeof(uint32_t)); i ++)
            *((uint32_t*)output + i) = *((uint32_t*)input + i);
        output[j % size] = 123;
    }
}

void method_unsafe_2byte_to_2byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < (size / sizeof(uint16_t)); i ++)
            *((uint16_t*)output + i) = *((uint16_t*)input + i);
        output[j % size] = 123;
    }
}

void output_result(const char* title, const struct timeval* start, const struct timeval* stop, const unsigned int size, const unsigned int iterations, const unsigned int bogus) {
    double elapsed = ((stop->tv_sec * MICROSECONDS + stop->tv_usec) - (start->tv_sec * MICROSECONDS + start->tv_usec)) / MICROSECONDS;
    printf("%s\tsize = %d, iterations = %d, time = %3lfs, bogus = %i\n", title, size, iterations, elapsed, bogus);
}

int main() {

    unsigned char* input = malloc(MAX_SIZE * sizeof(unsigned char));
    unsigned char* output = malloc(MAX_SIZE * sizeof(unsigned char));
    struct rusage usage;

    for(unsigned int i = 0; i < MAX_SIZE; i ++)
        *(input + i) = (unsigned char)i;

    unsigned int iterations = (1 << 30);
    for(unsigned int size = 2; size <= MAX_SIZE; size = size << 1, iterations = iterations >> 1) {
        // Memcpy
        getrusage(RUSAGE_SELF, &usage);
        struct timeval start = usage.ru_utime;
        method_memcpy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        struct timeval stop = usage.ru_utime;
        output_result("MEMCPY\t\t", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // Byte to byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_byte_to_byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("BYTE TO BYTE\t", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // SIMD copy
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_simd_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("BYTE TO BYTE SIMD", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // 8 byte to 8 byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_unsafe_8byte_to_8byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("8 BYTES TO 8 BYTES", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // 4 byte to 4 byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_unsafe_4byte_to_4byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("4 BYTES TO 4 BYTES", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // 2 byte to 2 byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_unsafe_2byte_to_2byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("2 BYTES TO 2 BYTES", &start, &stop, size, iterations, (unsigned int)output[123 % size]);
    }

    free(input);
    free(output);
}

To launch it, I installed the latest clang-omp :

$ /usr/local/bin/clang-omp --version
clang version 3.5.0
Target: x86_64-apple-darwin14.3.0
Thread model: posix

Compilation was done with the following command :

$ /usr/local/bin/clang-omp -I/usr/local/include/libiomp/ -fopenmp -Ofast -fomit-frame-pointer -flto copy_study.c -o copy

Here is the resulting data as a table for reference (top row indicates sizes in bytes) :
image

And here is the resulting graph :
copy_study

@g1mv
Copy link
Owner

g1mv commented Apr 17, 2015

A few things clearly stand out :

  • no difference between basic byte copy and the simd version : auto vectorization is well done by Clang
  • memcpy seems to be always the slowest (it's faster to copy bytes the basic way, that is if the compiler performs auto vectorizing of course)
  • there is no real difference in timing, for sizes >= 8 bytes, between all methods apart from memcpy

And a few "strange" things are seen and can be discarded :

  • 8 bytes to 8 bytes copy (unsafe !) is extremely fast for sizes < 8 bytes, that's because no actual copy takes place ( as the for loop stops at size / sizeof(uint64_t), = 0 in that case)
  • 4 bytes to 4 bytes copy (unsafe !) is also extremely fast for sizes < 4 bytes for the same reason.

@g1mv
Copy link
Owner

g1mv commented Apr 20, 2015

Fixed in 240088c

@g1mv g1mv closed this as completed Apr 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants