Unaligned stores/loads #30
Comments
I saw that this pointer alignment requirement is actually well-hidden in the C reference... |
It seems like there really should be a way to get the C compiler to do what it is doing now if the hardware supports unaligned loads/stores, but fall back on loading/storing individual bytes on hardware that doesn't support them (MIPS?). I haven't found anything, though. I was thinking about asking on SO… I'll try to do that later today and let you know how it goes.

If there is somewhere you basically want a fast memcpy, so you use int64_t or something instead of uint8_t, I was thinking it might be possible to use the new OpenMP SIMD support. Something like

    void not_memcpy (uint8_t* dest, const uint8_t* src, size_t size) {
    #pragma omp simd safelen(???)
      for (size_t i = 0 ; i < size ; i++)
        dest[i] = src[i];
    }

I haven't tried it to see if there is any speedup, though, and obviously it would require OpenMP 4.0 for there to be one. It would probably be okay to omit safelen for a

FWIW, my current understanding is that unaligned store/load is basically free on modern x86/x86_64 CPUs; the main danger is that the C compiler will auto-vectorize the code and unaligned access will trap on vectors. On ARM the situation is similar, except I believe there is a significant penalty for unaligned access. MIPS doesn't support unaligned access… the CPU will trap them, and by default Linux will currently catch that and emulate the request using safe instructions. Cost of the safer instructions aside (basically loading |
http://fastcompression.blogspot.com/2014/11/portability-woes-endianess-and.html has some good ideas, though it doesn't do much about the auto-vectorization concern… |
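For reference, one commonly used portable pattern for the "fall back to bytes where the hardware needs it" idea is to go through memcpy with a fixed size and let the compiler pick the instructions. This is only a minimal sketch: the helper names `load_u64` and `store_u64` are hypothetical and not part of this project, and whether a given compiler turns the call into a single load/store depends on the target and optimization level.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 64 bits from an address that may be unaligned.
 * The fixed-size memcpy is well-defined for any alignment; compilers
 * commonly lower it to one load where the target allows unaligned access
 * and to narrower accesses where it does not. */
static inline uint64_t load_u64(const void *src) {
    uint64_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Hypothetical helper: write 64 bits to an address that may be unaligned. */
static inline void store_u64(void *dest, uint64_t value) {
    memcpy(dest, &value, sizeof value);
}
```

Calling `store_u64(buf + 1, x)` followed by `load_u64(buf + 1)` on a byte buffer round-trips `x` without ever dereferencing a misaligned `uint64_t*`, which is the kind of access ubsan reports.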
Thanks for your replies, I checked the link which is very informative. |
I'm not sure what you mean here. If you're talking about memcpy using SIMD, all platforms should be doing that. However, the memcpy library function has some overhead, as it takes some time to determine which method to use (depending on things like alignment). It's great for larger operations, but for smaller ones it is pretty expensive. That said, most compilers will actually inline many memcpy calls, especially for smaller buffers with sizes that are known at compile time, so if you can use fixed sizes, memcpy would probably be fairly snappy. GCC has a __builtin_memcpy, but AFAIK it's unnecessary unless you compile with

If you're talking about OS X (i.e. clang) supporting OpenMP 4.0, it doesn't; hopefully the next version of clang will. GCC does since 4.9. That said, you don't even have to put an ifdef around it… if the compiler doesn't support OpenMP 4 it will still work, it just will not use SIMD (unless the C compiler does it). If you want to take a vastly different approach when OpenMP 4.0 isn't available you can always use |
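To make the "no ifdef needed" point concrete, here is a small sketch of the byte-copy loop with the pragma and no guard around it. The function name `simd_copy_bytes` is hypothetical, safelen is simply omitted, and the buffers are assumed not to overlap. A compiler without OpenMP 4.0 support (or one invoked without an OpenMP flag) ignores the pragma, possibly with an unknown-pragma warning, and compiles an ordinary loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `size` bytes from src to dest.
 * Assumes dest and src do not overlap. The pragma only has an effect
 * when the compiler implements OpenMP 4.0 SIMD and it is enabled;
 * otherwise this is a plain byte-by-byte loop. */
void simd_copy_bytes(uint8_t *dest, const uint8_t *src, size_t size) {
#pragma omp simd
    for (size_t i = 0; i < size; i++)
        dest[i] = src[i];
}
```

Because the loop works on uint8_t, it has no alignment requirement of its own; any speedup over memcpy would still need to be measured.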
I'll perform a few tests later on: memcpy vs. direct copy using uint types (unsafe due to alignment issues) vs. OpenMP copies, on OS X.
I was talking about this project : |
AFAIK that is the project they're trying to merge into clang. Unfortunately this has been going on for several years. |
A few things clearly stand out :
And a few "strange" things are seen and can be discarded :
Fixed in 240088c |
ubsan detects a lot of undefined stores/loads:
I only tested chameleon there, but it's probably a good bet that cheetah and lion have similar issues.