
remove unaligned loads and stores on x86 #22

Closed
arvidn wants to merge 1 commit from the no-unaligned-stores branch

Conversation

arvidn (Contributor) commented Nov 4, 2017

This invokes undefined behavior regardless of the compiler back-end. More details in #21.
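The problematic pattern, and the well-defined alternative discussed below, can be sketched roughly like this (a minimal illustration with made-up function names, not the actual Boost.Endian source):

```cpp
#include <cstdint>
#include <cstring>

// The pattern at issue: casting a (possibly misaligned) byte pointer to a
// wider integer type and dereferencing it. This is undefined behavior even
// on x86, where the hardware happens to tolerate misaligned access.
std::uint32_t load_cast(const unsigned char* p)
{
    return *reinterpret_cast<const std::uint32_t*>(p); // UB if p is misaligned
}

// Well-defined alternative: copy the bytes with memcpy. Mainstream compilers
// lower a fixed-size 4- or 8-byte memcpy to a single load on x86.
std::uint32_t load_memcpy(const unsigned char* p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```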

pdimov (Member) commented Dec 20, 2017

Instead of removing the #ifdefs altogether, why not use memcpy there?

arvidn (Contributor, Author) commented Dec 20, 2017

You mean as an optimisation of unrolled_byte_loops when no endian conversion is necessary?
I think such optimisation is orthogonal to the current special-case, since it should:

  1. be applied based on the host endianness (not specific architecture)
  2. be applied for big-endian types on big endian hosts
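Those two conditions might be sketched like this (a hypothetical illustration using a runtime endianness probe for portability; the actual library would use compile-time endian macros, and none of these names are proposed patch code):

```cpp
#include <cstdint>
#include <cstring>

// Assumed helper: detect host byte order at runtime for this sketch.
inline bool is_host_big_endian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char first_byte;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 0x01;
}

// Load a big-endian 16-bit value. The fast memcpy path fires whenever the
// stored order matches the host order (condition 2), and the choice is made
// from endianness, not from a specific architecture (condition 1).
inline std::uint16_t load_big_u16(const unsigned char* p)
{
    if (is_host_big_endian())
    {
        std::uint16_t v;
        std::memcpy(&v, p, sizeof v); // orders match: plain byte copy
        return v;
    }
    // Orders differ: assemble the value byte by byte.
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```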

pdimov (Member) commented Dec 20, 2017

I mean as a replacement for the current casts, as this would minimize the patch both in lines and in spirit.

pdimov (Member) commented Dec 21, 2017

  1. be applied based on the host endianness (not specific architecture)

It's not clear whether it will be an optimization. It's possible for the unrolled_byte_loops to be more efficient if memcpy is not an intrinsic but compiles to a function call. We'll have to check on godbolt.org.

For x86 I think that all major compilers recognize memcpy on 4/8 bytes.

pdimov (Member) commented Dec 21, 2017

Here's a testbed: https://godbolt.org/g/MfP3Hc

pdimov (Member) commented Dec 21, 2017

At least for MSVC, removing the special case notably degrades the code.

arvidn (Contributor, Author) commented Dec 21, 2017

I can put in memcpy there instead. It's tempting to depend on endian-macros instead of architecture macros though, and to use memcpy for the big endian case (on big endian machines).

It seems like a pretty safe assumption that, in general, memcpy is at least as efficient as a hand-written copy loop; it's likely an easier optimisation for the compiler to make.

pdimov (Member) commented Dec 21, 2017

As tested on godbolt, all x86 compilers (a) benefit significantly from the special case and (b) generate the same code with the current one and with memcpy.

ARM however generates a function call to memcpy so I'm not sure that the special case should be applied there.

Looks like the safest choice is to keep the current #ifdefs and just replace the cast inside with memcpy.
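The end state could then look something like this sketch (hypothetical names and #ifdef conditions; the real ones live in Boost.Endian's headers):

```cpp
#include <cstdint>
#include <cstring>

// Hedged sketch of the agreed shape: keep the architecture special case,
// but express the load through memcpy so no misaligned pointer is ever
// dereferenced.
inline std::uint32_t load_little_u32(const unsigned char* p)
{
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
    // Fast path: x86 compilers turn this fixed-size memcpy into a single mov.
    // NOTE: assumes a little-endian host on these architectures.
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
#else
    // Portable path: assemble the little-endian value byte by byte.
    return static_cast<std::uint32_t>(p[0])
         | static_cast<std::uint32_t>(p[1]) << 8
         | static_cast<std::uint32_t>(p[2]) << 16
         | static_cast<std::uint32_t>(p[3]) << 24;
#endif
}
```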

@arvidn arvidn force-pushed the no-unaligned-stores branch 2 times, most recently from 42026ab to 791ffad Compare December 21, 2017 22:39
pdimov (Member) commented Dec 23, 2017

Test added in 62802fe. Fixed in e93f6a2.

@pdimov pdimov closed this Dec 23, 2017
@arvidn arvidn deleted the no-unaligned-stores branch December 23, 2017 20:24