Use folly::IsRelocatable in folly::small_vector #1934

Quuxplusone · 2023-02-14T16:45:30Z

The relocateInlineTrivial operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by FBVector::make_window and FBVector::reserve. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types.

The existing kShouldCopyInlineTrivial is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation.

FWIW, I don't understand why this optimization can be disabled depending on the value of hardware_constructive_interference_size; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial std::copy. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

The `relocateInlineTrivial` operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by `FBVector::make_window` and `FBVector::reserve`. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types. The existing `kShouldCopyInlineTrivial` is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation. FWIW, I don't understand why this optimization can be disabled depending on the value of `hardware_constructive_interference_size`; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial `std::copy`. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

facebook-github-bot · 2023-02-17T19:07:25Z

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Quuxplusone · 2024-02-11T17:34:00Z

@Orvid one-year ping! :) I kinda forgot about this PR, but it's still applicable as far as I'm concerned. Please let me know if there's anything I can do to move it along.

Orvid · 2024-02-12T21:39:07Z

I've just made a couple of updates internally (basically just inlining things since these are only used in one place) to address feedback from reviewers. This should be good to go once the diff is approved internally.

facebook-github-bot · 2024-02-14T21:11:15Z

@Orvid merged this pull request in b6762ac.

Summary: The `relocateInlineTrivial` operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by `FBVector::make_window` and `FBVector::reserve`. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types. The existing `kShouldCopyInlineTrivial` is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation. FWIW, I don't understand why this optimization can be disabled depending on the value of `hardware_constructive_interference_size`; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial `std::copy`. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it. X-link: facebook/folly#1934 Reviewed By: Gownta Differential Revision: D43401609 Pulled By: Orvid fbshipit-source-id: 7aa1fd922b0e60b9351fba96449e2ad38745bddf

ot · 2024-02-20T18:22:53Z

FWIW, I don't understand why this optimization can be disabled depending on the value of hardware_constructive_interference_size; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial std::copy. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

Reducing the scope of the optimization actually had an impact on internal production workloads. The problem is not the memcpy but the fact that we're copying the whole inline storage, regardless of how many elements there are in it. This is a speculative optimization that can backfire when you have a large inline size that crosses multiple cache lines but frequently copy/move vectors that have only a few elements set; in this case you're doing extra cache line loads and stores that you wouldn't be doing with a sized loop.

Quuxplusone · 2024-02-20T19:49:35Z

@ot: Ah, I either misread the old code or had a brain fart. We're not comparing "non-trivial copy size() elements" against "memcpy size() elements"; we're comparing "trivial copy (i.e. memcpy) size() elements" (saves cache lines, can't fully unroll) against "memcpy capacity() elements" (costs cache lines, can fully unroll).

In that case/sense, I think the simpler patch that @Orvid merged ( b6762ac ) is actually worse than the patch I initially submitted. IIUC, the status quo ante was

small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
small_vector<int*, 1000>'s move-constructor (calls uninitialized_copy which) memcpys only 8*size() bytes
small_vector<unique_ptr<int>, 4>'s move-constructor calls uninitialized_copy which does something non-trivial on 8*size() bytes
small_vector<unique_ptr<int>, 1000>'s move-constructor calls uninitialized_copy which does something non-trivial on 8*size() bytes

Then, IIUC, b6762ac changed the behavior to this:

small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
small_vector<int*, 1000>'s move-constructor invariably memcpys 8000 bytes (oops!)
small_vector<unique_ptr<int>, 4>'s move-constructor invariably memcpys 32 bytes
small_vector<unique_ptr<int>, 1000>'s move-constructor invariably memcpys 8000 bytes (oops!)

What we actually want (and what I think my #1934 actually did!) is

small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
small_vector<int*, 1000>'s move-constructor (calls uninitialized_copy which) memcpys only 8*size() bytes
small_vector<unique_ptr<int>, 4>'s move-constructor invariably memcpys 32 bytes
small_vector<unique_ptr<int>, 1000>'s move-constructor (calls the moral equivalent of P1144 uninitialized_relocate which) memcpys only 8*size() bytes

Does that make sense? What's the next step here?

ot · 2024-02-20T20:21:23Z

If I understand correctly the intent of the patch, it is just to extend the optimization to relocatable but non-trivially copyable types, correct? And we need the memcpy instead of std::copy because while for trivial objects we can expect the compiler to lower the copy to a memcpy, we cannot reasonably expect that for general relocatable bytes.

I think that the best way would have been to just change the definition of kShouldCopyInlineTrivial (and switch to memcpy accordingly). This has the additional benefit of making the optimization consistent for all the places where it is used, while currently the move constructor is a special case.

I'll submit a patch.

Quuxplusone · 2024-02-20T21:46:31Z

If I understand correctly the intent of the patch, it is just to extend the optimization to relocatable but non-trivially copyable types, correct?

Yes, but only in this place, which uses relocation. The constant kShouldCopyInlineTrivial is used in several other places (e.g. in the copy constructor) where it would be very incorrect to start memcpying things just because they happen to be trivially relocatable. Basically, the archetypical "trivially relocatable" types are unique_ptr and shared_ptr: they're safe to relocate with memcpy in this place, but they're totally not safe to copy with memcpy in those other places.

I would first recommend taking my original patch as submitted; or, silver-medal idea, your PR might try splitting out the is_trivially_copyable part from kShouldCopyInlineTrivial. Instead of these three pieces:

  static constexpr bool kShouldCopyInlineTrivial =
      is_trivially_copyable_v<Value> &&
      sizeof(InlineStorageType) <= hardware_constructive_interference_size / 2;

  small_vector(small_vector const& o) {
    if (kShouldCopyInlineTrivial && !o.isExtern()) {
      copyInlineTrivial<Value>(o);
      return;
    }

      if constexpr (IsRelocatable<Value>::value) {
        // Copy the entire inline storage, instead of just size() values, to

the first step should be to get to these three pieces:

  static constexpr bool kWholeBufferIsSmall =
      sizeof(InlineStorageType) <= hardware_constructive_interference_size / 2;

  small_vector(small_vector const& o) {
    if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall && !o.isExtern()) {
      copyInlineTrivial<Value>(o);
      return;
    }

      if constexpr (IsRelocatable<Value>::value && kWholeBufferIsSmall) {
        // Copy the entire inline storage, instead of just size() values, to

ot · 2024-02-20T21:54:12Z

Yes, but only in this place, which uses relocation

Move-assignment should benefit from this too, why not use it there as well?

Quuxplusone · 2024-02-20T22:14:44Z

Yes, but only in this place, which uses relocation

Move-assignment should benefit from this too, why not use it there as well?

One could do something there, but it would be "not obvious." Basically you'd have to replace this current code:

        if (kShouldCopyInlineTrivial) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else {
          const size_t oSize = o.size();
          detail::partiallyUninitializedCopy(
              std::make_move_iterator(o.u.buffer()),
              oSize,
              this->u.buffer(),
              size());
          this->setSize(oSize);
          o.clear();
        }

with something like this:

        if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else if (isRelocatable<Value>::value && this->size() < o.size()) {
          // Move-assigning [X Y Z _ _] = [A B C D E]. First, move-assign the common prefix:
          std::move(o.u.buffer(), o.u.buffer() + this->size(), this->u.buffer());
          // Now we have [A B C _ _] on the left and [a b c D E] on the right. Relocate the suffix:
          std::memcpy(this->u.buffer() + this->size(), o.u.buffer() + this->size(), (oSize - this->size()) * kSizeOfValue);
          // Now we have [A B C D E] on the left and [a b c _ _] on the right. Update the sizes and we're done.
          o.setSize(this->size());
          this->setSize(oSize);
        } else {
          const size_t oSize = o.size();
          detail::partiallyUninitializedCopy(
              std::make_move_iterator(o.u.buffer()),
              oSize,
              this->u.buffer(),
              size());
          this->setSize(oSize);
          o.clear();
        }

(where _ means "no object," uppercase means "object with a value," and lowercase means "moved-from, but not destroyed, object").

My point is, yes you could do that, but that's significantly more complicated than the original point of my #1934. If you want to scope-creep it to also do that, sure, but it's kind of a whole separate thing from either "#1934 as originally designed" or "fixing-forward to eliminate b6762ac's large memcpy."

Quuxplusone · 2024-02-20T22:22:29Z

...Oh, or you could do move-assignment by "destroy and relocate-into", like this:

        if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else if (isRelocatable<Value>::value) {
          // [X Y Z _ _] on the left, [A B C D _] on the right
          std::destroy(this->u.buffer(), this->u.buffer() + this->size());
          // [_ _ _ _ _] on the left, [A B C D _] on the right
          std::memcpy(this->u.buffer(), o.u.buffer(), (kWholeBufferIsSmall ? MaxInline : oSize) * kSizeOfValue);
          // [A B C D _] on the left, [_ _ _ _ _] on the right
          o.setSize(0);  // n.b.: NOT o.clear()!
          this->setSize(oSize);
        } else {
          [...]
        }

I see now that this is actually what AMC does. I have nothing against that, just that again it's kind of unrelated to the original intent of this PR.

Summary: D43401609, derived from [PR #1934](#1934), allows the move constructor to use `memcpy` to relocate the inline storage for relocatable types, but it copies the whole storage disregarding the previous limitation on size for trivial types (which are a subset of relocatable types). Furthermore, the same optimization is not applied to the move-assignment operator. This diff restores the size limitation (using a precise copy when relocating inline buffers that exceed it) and applies the same logic to move-assignment. Also, since now we can assume C++17 and thus `if constexpr`, I removed the unnecessary overload of `copyInlineTrivial()`, and I made the names less ambiguous. We might also benefit from special-casing relocatable types on reallocation in `makeSizeInternal()`, but that can be done separately. Reviewed By: Gownta Differential Revision: D53965673 fbshipit-source-id: 2939c03ada3b19d4fadfa11621a3e2f3afc5ccb7

Summary: D43401609, derived from [PR #1934](facebook/folly#1934), allows the move constructor to use `memcpy` to relocate the inline storage for relocatable types, but it copies the whole storage disregarding the previous limitation on size for trivial types (which are a subset of relocatable types). Furthermore, the same optimization is not applied to the move-assignment operator. This diff restores the size limitation (using a precise copy when relocating inline buffers that exceed it) and applies the same logic to move-assignment. Also, since now we can assume C++17 and thus `if constexpr`, I removed the unnecessary overload of `copyInlineTrivial()`, and I made the names less ambiguous. We might also benefit from special-casing relocatable types on reallocation in `makeSizeInternal()`, but that can be done separately. Reviewed By: Gownta Differential Revision: D53965673 fbshipit-source-id: 2939c03ada3b19d4fadfa11621a3e2f3afc5ccb7

ot · 2024-02-22T00:35:55Z

it's kind of unrelated to the original intent of this PR

I'm not sure what the original intent was, but I think it is better if move-assigning into an empty container behaves the same way as move-construction. The additional complexity is just a couple of lines, so it seems like a no brainer to me.

I landed the change in 2459e17

Quuxplusone · 2024-02-22T01:07:29Z

Cool, 2459e17 LGTM! (At least I see nothing wrong with it in a five-minute glance.)
FWIW, I like your refactor into moveInlineStorageRelocatable that pushes down the branch on "how much do we copy?" to the lower more hidden level. Presumably the next step could be to introduce an analogous copyInlineStorageTrivial to replace/subsume the existing copyWholeInlineStorageTrivial.
(I suppose now I'm questioning the choice of the verb "move" there, instead of relocateInlineStorageTrivial. :)) Anyway — thanks!

ot · 2024-02-23T14:59:14Z

Presumably the next step could be to introduce an analogous copyInlineStorageTrivial to replace/subsume the existing copyWholeInlineStorageTrivial.

I don't think there would be much benefit in doing that, the special case is really about when it is beneficial to copy the whole buffer because it turns into a small number of full-word MOVs, instead of a loop of unpredictable length. The general case already lowers to a memcpy when the value type is trivial, with any reasonable compiler.

facebook-github-bot added the CLA Signed label Feb 14, 2023

facebook-github-bot closed this in b6762ac Feb 14, 2024

facebook-github-bot added the Merged label Feb 14, 2024

Quuxplusone deleted the reloc-in-smallvector branch February 24, 2024 19:44

Quuxplusone mentioned this pull request Mar 21, 2024

Make static_string trivially copyable boostorg/static_string#57

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use folly::IsRelocatable in folly::small_vector #1934

Use folly::IsRelocatable in folly::small_vector #1934

Quuxplusone commented Feb 14, 2023

facebook-github-bot commented Feb 17, 2023

Quuxplusone commented Feb 11, 2024

Orvid commented Feb 12, 2024

facebook-github-bot commented Feb 14, 2024

ot commented Feb 20, 2024

Quuxplusone commented Feb 20, 2024 •

edited

Loading

ot commented Feb 20, 2024

Quuxplusone commented Feb 20, 2024

ot commented Feb 20, 2024

Quuxplusone commented Feb 20, 2024 •

edited

Loading

Quuxplusone commented Feb 20, 2024

ot commented Feb 22, 2024

Quuxplusone commented Feb 22, 2024

ot commented Feb 23, 2024

Use folly::IsRelocatable in folly::small_vector #1934

Use folly::IsRelocatable in folly::small_vector #1934

Conversation

Quuxplusone commented Feb 14, 2023

facebook-github-bot commented Feb 17, 2023

Quuxplusone commented Feb 11, 2024

Orvid commented Feb 12, 2024

facebook-github-bot commented Feb 14, 2024

ot commented Feb 20, 2024

Quuxplusone commented Feb 20, 2024 • edited Loading

ot commented Feb 20, 2024

Quuxplusone commented Feb 20, 2024

ot commented Feb 20, 2024

Quuxplusone commented Feb 20, 2024 • edited Loading

Quuxplusone commented Feb 20, 2024

ot commented Feb 22, 2024

Quuxplusone commented Feb 22, 2024

ot commented Feb 23, 2024

Quuxplusone commented Feb 20, 2024 •

edited

Loading

Quuxplusone commented Feb 20, 2024 •

edited

Loading