Use folly::IsRelocatable in folly::small_vector #1934
The `relocateInlineTrivial` operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by `FBVector::make_window` and `FBVector::reserve`. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types.

The existing `kShouldCopyInlineTrivial` is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation.

FWIW, I don't understand why this optimization can be disabled depending on the value of `hardware_constructive_interference_size`; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial `std::copy`. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.
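To make the "move-construct plus destroy" equivalence concrete, here is a minimal sketch of the idea, not folly's code. `IsRelocatableSketch`, `OwnerSketch`, and `relocateSketch` are hypothetical names invented for illustration; the trait merely stands in for `folly::IsRelocatable`, and the sketch assumes (as implementations commonly allow in practice, though the C++ standard does not guarantee it) that a `std::unique_ptr<int>` member can be relocated bytewise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for folly::IsRelocatable (NOT folly's definition):
// trivially copyable types qualify by default; other types must opt in.
template <class T>
struct IsRelocatableSketch : std::is_trivially_copyable<T> {};

// Illustrative type: not trivially copyable, but "move, then destroy the
// source" leaves the same bytes at the destination as a plain byte copy.
struct OwnerSketch {
  std::unique_ptr<int> p;
  explicit OwnerSketch(int v) : p(std::make_unique<int>(v)) {}
};

// Opt-in, labeled as an assumption about this particular type.
template <>
struct IsRelocatableSketch<OwnerSketch> : std::true_type {};

// Relocate n objects from src into uninitialized storage at dst.
// Afterwards the objects at src are considered destroyed in both branches.
template <class T>
void relocateSketch(T* src, T* dst, std::size_t n) {
  if constexpr (IsRelocatableSketch<T>::value) {
    // One bulk byte copy; no per-element constructor or destructor calls.
    if (n != 0) {
      std::memcpy(static_cast<void*>(dst), static_cast<const void*>(src),
                  n * sizeof(T));
    }
  } else {
    // General case: move-construct each element, then destroy its source.
    for (std::size_t i = 0; i < n; ++i) {
      ::new (static_cast<void*>(dst + i)) T(std::move(src[i]));
      src[i].~T();
    }
  }
}
```

The copy constructor cannot take the `memcpy` branch for a type like `OwnerSketch`, because there the source stays alive and both objects would own the same pointer; only a true relocation, where the source is never destroyed afterwards, makes the byte copy safe.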
@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@Orvid one-year ping! :) I kinda forgot about this PR, but it's still applicable as far as I'm concerned. Please let me know if there's anything I can do to move it along.
I've just made a couple of updates internally (basically just inlining things since these are only used in one place) to address feedback from reviewers. This should be good to go once the diff is approved internally.
Summary: The `relocateInlineTrivial` operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by `FBVector::make_window` and `FBVector::reserve`. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types. The existing `kShouldCopyInlineTrivial` is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation. FWIW, I don't understand why this optimization can be disabled depending on the value of `hardware_constructive_interference_size`; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial `std::copy`. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.
X-link: facebook/folly#1934
Reviewed By: Gownta
Differential Revision: D43401609
Pulled By: Orvid
fbshipit-source-id: 7aa1fd922b0e60b9351fba96449e2ad38745bddf
Reducing the scope of the optimization actually had an impact on internal production workloads. The problem is not the memcpy but the fact that we're copying the whole inline storage, regardless of how many elements there are in it. This is a speculative optimization that can backfire when you have a large inline size that crosses multiple cache lines but frequently copy/move vectors that have only a few elements set; in this case you're doing extra cache line loads and stores that you wouldn't be doing with a sized loop.
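The cost being described can be put into rough numbers. The following is only a back-of-the-envelope sketch: `kCacheLineSketch`, `cacheLinesFor`, and `InlineCopyCostSketch` are hypothetical names, the 64-byte cache line is an assumption about the target hardware, and the count assumes the inline buffer starts on a cache-line boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes (an assumption, not taken from the thread).
constexpr std::size_t kCacheLineSketch = 64;

// Number of cache lines covered by `bytes` contiguous bytes, assuming the
// range starts on a cache-line boundary.
constexpr std::size_t cacheLinesFor(std::size_t bytes) {
  return (bytes + kCacheLineSketch - 1) / kCacheLineSketch;
}

// Compares copying the whole inline buffer with copying only the n live
// elements, for a vector with inline capacity N.
template <class T, std::size_t N>
struct InlineCopyCostSketch {
  static constexpr std::size_t wholeBuffer() {
    return cacheLinesFor(N * sizeof(T));
  }
  static constexpr std::size_t sized(std::size_t n) {
    return cacheLinesFor(n * sizeof(T));
  }
};

// 64 inline uint64_t slots occupy 512 bytes, i.e. 8 cache lines, while 3 live
// elements occupy 24 bytes, i.e. 1 cache line.
static_assert(InlineCopyCostSketch<std::uint64_t, 64>::wholeBuffer() == 8);
static_assert(InlineCopyCostSketch<std::uint64_t, 64>::sized(3) == 1);
```

Under these assumptions, the whole-buffer copy reads 8 lines and writes 8 lines, where a sized copy of 3 elements would read 1 and write 1. When the whole buffer fits in a single cache line, the two counts coincide and the fixed-size copy costs nothing extra.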
@ot: Ah, I either misread the old code or had a brain fart. We're not comparing "non-trivial copy In that case/sense, I think the simpler patch that @Orvid merged ( b6762ac ) is actually worse than the patch I initially submitted. IIUC, the status quo ante was
Then, IIUC, b6762ac changed the behavior to this:
What we actually want (and what I think my #1934 actually did!) is
Does that make sense? What's the next step here?
If I understand correctly the intent of the patch, it is just to extend the optimization to relocatable but non-trivially copyable types, correct? And we need the I think that the best way would have been to just change the definition of I'll submit a patch.
Yes, but only in this place, which uses relocation. The constant I would first recommend taking my original patch as submitted; or, silver-medal idea, your PR might try splitting out the
the first step should be to get to these three pieces:
Move-assignment should benefit from this too, why not use it there as well?
One could do something there, but it would be "not obvious." Basically you'd have to replace this current code:
with something like this:
(where My point is, yes you could do that, but that's significantly more complicated than the original point of my #1934. If you want to scope-creep it to also do that, sure, but it's kind of a whole separate thing from either "#1934 as originally designed" or "fixing-forward to eliminate b6762ac's large memcpy."
...Oh, or you could do move-assignment by "destroy and relocate-into", like this:
I see now that this is actually what AMC does. I have nothing against that, just that again it's kind of unrelated to the original intent of this PR.
Summary: D43401609, derived from [PR #1934](#1934), allows the move constructor to use `memcpy` to relocate the inline storage for relocatable types, but it copies the whole storage disregarding the previous limitation on size for trivial types (which are a subset of relocatable types). Furthermore, the same optimization is not applied to the move-assignment operator.

This diff restores the size limitation (using a precise copy when relocating inline buffers that exceed it) and applies the same logic to move-assignment.

Also, since now we can assume C++17 and thus `if constexpr`, I removed the unnecessary overload of `copyInlineTrivial()`, and I made the names less ambiguous.

We might also benefit from special-casing relocatable types on reallocation in `makeSizeInternal()`, but that can be done separately.

Reviewed By: Gownta
Differential Revision: D53965673
fbshipit-source-id: 2939c03ada3b19d4fadfa11621a3e2f3afc5ccb7
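The three-way policy this summary describes (whole-buffer copy for small relocatable buffers, a precise copy for large ones, element-wise move otherwise) could look roughly like the sketch below. It is not the code that landed: `InlineBufSketch`, `RelocPath`, `relocateInlineSketch`, and the 64-byte `kWholeCopyLimitSketch` threshold are all hypothetical, and `std::is_trivially_copyable_v` is used only as a conservative placeholder for a relocatability trait.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical threshold: buffers up to this size are copied whole.
constexpr std::size_t kWholeCopyLimitSketch = 64;

// Toy inline storage for up to N objects of type T.
template <class T, std::size_t N>
struct InlineBufSketch {
  alignas(T) unsigned char bytes[N * sizeof(T)];
  T* data() { return reinterpret_cast<T*>(bytes); }
};

// Which strategy was taken; returned only so the choice can be observed.
enum class RelocPath { WholeBuffer, SizedMemcpy, MoveAndDestroy };

// Relocate the first `size` elements of `from` into the empty buffer `to`.
// Relocatable defaults to trivial copyability as a conservative placeholder.
template <class T, std::size_t N,
          bool Relocatable = std::is_trivially_copyable_v<T>>
RelocPath relocateInlineSketch(InlineBufSketch<T, N>& from,
                               InlineBufSketch<T, N>& to, std::size_t size) {
  if constexpr (Relocatable && N * sizeof(T) <= kWholeCopyLimitSketch) {
    // Small buffer: a fixed-size copy with no dependence on `size`.
    std::memcpy(to.bytes, from.bytes, N * sizeof(T));
    return RelocPath::WholeBuffer;
  } else if constexpr (Relocatable) {
    // Large buffer: copy only the bytes of the live elements.
    if (size != 0) {
      std::memcpy(to.bytes, from.bytes, size * sizeof(T));
    }
    return RelocPath::SizedMemcpy;
  } else {
    // Not relocatable: move-construct each element, then destroy its source.
    for (std::size_t i = 0; i < size; ++i) {
      ::new (static_cast<void*>(to.data() + i)) T(std::move(from.data()[i]));
      from.data()[i].~T();
    }
    return RelocPath::MoveAndDestroy;
  }
}
```

Because the branch is chosen with `if constexpr`, each instantiation compiles down to exactly one of the three strategies, so no separate overloads are needed to select between them.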
I'm not sure what the original intent was, but I think it is better if move-assigning into an empty container behaves the same way as move-construction. The additional complexity is just a couple of lines, so it seems like a no-brainer to me. I landed the change in 2459e17.
Cool, 2459e17 LGTM! (At least I see nothing wrong with it in a five-minute glance.)
I don't think there would be much benefit in doing that, the special case is really about when it is beneficial to copy the whole buffer because it turns into a small number of full-word MOVs, instead of a loop of unpredictable length. The general case already lowers to a |