Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use folly::IsRelocatable in folly::small_vector #1934

Closed
wants to merge 1 commit into from

Conversation

Quuxplusone
Copy link
Contributor

The relocateInlineTrivial operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by FBVector::make_window and FBVector::reserve. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types.

The existing kShouldCopyInlineTrivial is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation.

FWIW, I don't understand why this optimization can be disabled depending on the value of hardware_constructive_interference_size; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial std::copy. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

The `relocateInlineTrivial` operation here is exactly "move-construct plus destroy,"
which is the same operation optimized into memcpy by `FBVector::make_window`
and `FBVector::reserve`. So we can optimize it into memcpy here for
the same set of types, namely, trivially relocatable types.

The existing `kShouldCopyInlineTrivial` is also used in the copy constructor,
where we're doing "copy-construct, do not destroy." I believe the place
I changed in this patch is the only place where we can use true relocation.

FWIW, I don't understand why this optimization can be disabled depending on
the value of `hardware_constructive_interference_size`; it seems to me that
no matter what your cache line size is, memcpy will always be superior to
any non-trivial `std::copy`. Either both approaches will hit cache effects
(and memcpy will be faster), or neither approach will hit cache effects
(and memcpy will be faster). But that's tangential to this patch's purpose,
so I didn't mess with it.
@facebook-github-bot
Copy link
Contributor

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Quuxplusone
Copy link
Contributor Author

@Orvid one-year ping! :) I kinda forgot about this PR, but it's still applicable as far as I'm concerned. Please let me know if there's anything I can do to move it along.

@Orvid
Copy link
Contributor

Orvid commented Feb 12, 2024

I've just made a couple of updates internally (basically just inlining things since these are only used in one place) to address feedback from reviewers. This should be good to go once the diff is approved internally.

@facebook-github-bot
Copy link
Contributor

@Orvid merged this pull request in b6762ac.

facebook-github-bot pushed a commit to facebook/hhvm that referenced this pull request Feb 14, 2024
Summary:
The `relocateInlineTrivial` operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by `FBVector::make_window` and `FBVector::reserve`. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types.

The existing `kShouldCopyInlineTrivial` is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation.

FWIW, I don't understand why this optimization can be disabled depending on the value of `hardware_constructive_interference_size`; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial `std::copy`. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

X-link: facebook/folly#1934

Reviewed By: Gownta

Differential Revision: D43401609

Pulled By: Orvid

fbshipit-source-id: 7aa1fd922b0e60b9351fba96449e2ad38745bddf
@ot
Copy link
Contributor

ot commented Feb 20, 2024

FWIW, I don't understand why this optimization can be disabled depending on the value of hardware_constructive_interference_size; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial std::copy. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

Reducing the scope of the optimization actually had an impact on internal production workloads. The problem is not the memcpy but the fact that we're copying the whole inline storage, regardless of how many elements there are in it. This is a speculative optimization that can backfire when you have a large inline size that crosses multiple cache lines but frequently copy/move vectors that have only a few elements set; in this case you're doing extra cache line loads and stores that you wouldn't be doing with a sized loop.

@Quuxplusone
Copy link
Contributor Author

Quuxplusone commented Feb 20, 2024

@ot: Ah, I either misread the old code or had a brain fart. We're not comparing "non-trivial copy size() elements" against "memcpy size() elements"; we're comparing "trivial copy (i.e. memcpy) size() elements" (saves cache lines, can't fully unroll) against "memcpy capacity() elements" (costs cache lines, can fully unroll).

In that case/sense, I think the simpler patch that @Orvid merged ( b6762ac ) is actually worse than the patch I initially submitted. IIUC, the status quo ante was

  • small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<int*, 1000>'s move-constructor (calls uninitialized_copy which) memcpys only 8*size() bytes
  • small_vector<unique_ptr<int>, 4>'s move-constructor calls uninitialized_copy which does something non-trivial on 8*size() bytes
  • small_vector<unique_ptr<int>, 1000>'s move-constructor calls uninitialized_copy which does something non-trivial on 8*size() bytes

Then, IIUC, b6762ac changed the behavior to this:

  • small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<int*, 1000>'s move-constructor invariably memcpys 8000 bytes (oops!)
  • small_vector<unique_ptr<int>, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<unique_ptr<int>, 1000>'s move-constructor invariably memcpys 8000 bytes (oops!)

What we actually want (and what I think my #1934 actually did!) is

  • small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<int*, 1000>'s move-constructor (calls uninitialized_copy which) memcpys only 8*size() bytes
  • small_vector<unique_ptr<int>, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<unique_ptr<int>, 1000>'s move-constructor (calls the moral equivalent of P1144 uninitialized_relocate which) memcpys only 8*size() bytes

Does that make sense? What's the next step here?

@ot
Copy link
Contributor

ot commented Feb 20, 2024

If I understand correctly the intent of the patch, it is just to extend the optimization to relocatable but non-trivially copyable types, correct? And we need the memcpy instead of std::copy because while for trivial objects we can expect the compiler to lower the copy to a memcpy, we cannot reasonably expect that for general relocatable bytes.

I think that the best way would have been to just change the definition of kShouldCopyInlineTrivial (and switch to memcpy accordingly). This has the additional benefit of making the optimization consistent for all the places where it is used, while currently the move constructor is a special case.

I'll submit a patch.

@Quuxplusone
Copy link
Contributor Author

If I understand correctly the intent of the patch, it is just to extend the optimization to relocatable but non-trivially copyable types, correct?

Yes, but only in this place, which uses relocation. The constant kShouldCopyInlineTrivial is used in several other places (e.g. in the copy constructor) where it would be very incorrect to start memcpying things just because they happen to be trivially relocatable. Basically, the archetypical "trivially relocatable" types are unique_ptr and shared_ptr: they're safe to relocate with memcpy in this place, but they're totally not safe to copy with memcpy in those other places.

I would first recommend taking my original patch as submitted; or, silver-medal idea, your PR might try splitting out the is_trivially_copyable part from kShouldCopyInlineTrivial. Instead of these three pieces:

  static constexpr bool kShouldCopyInlineTrivial =
      is_trivially_copyable_v<Value> &&
      sizeof(InlineStorageType) <= hardware_constructive_interference_size / 2;

  small_vector(small_vector const& o) {
    if (kShouldCopyInlineTrivial && !o.isExtern()) {
      copyInlineTrivial<Value>(o);
      return;
    }

      if constexpr (IsRelocatable<Value>::value) {
        // Copy the entire inline storage, instead of just size() values, to

the first step should be to get to these three pieces:

  static constexpr bool kWholeBufferIsSmall =
      sizeof(InlineStorageType) <= hardware_constructive_interference_size / 2;

  small_vector(small_vector const& o) {
    if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall && !o.isExtern()) {
      copyInlineTrivial<Value>(o);
      return;
    }

      if constexpr (IsRelocatable<Value>::value && kWholeBufferIsSmall) {
        // Copy the entire inline storage, instead of just size() values, to

@ot
Copy link
Contributor

ot commented Feb 20, 2024

Yes, but only in this place, which uses relocation

Move-assignment should benefit from this too, why not use it there as well?

@Quuxplusone
Copy link
Contributor Author

Quuxplusone commented Feb 20, 2024

Yes, but only in this place, which uses relocation

Move-assignment should benefit from this too, why not use it there as well?

One could do something there, but it would be "not obvious." Basically you'd have to replace this current code:

        if (kShouldCopyInlineTrivial) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else {
          const size_t oSize = o.size();
          detail::partiallyUninitializedCopy(
              std::make_move_iterator(o.u.buffer()),
              oSize,
              this->u.buffer(),
              size());
          this->setSize(oSize);
          o.clear();
        }

with something like this:

        if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else if (isRelocatable<Value>::value && this->size() < o.size()) {
          // Move-assigning [X Y Z _ _] = [A B C D E]. First, move-assign the common prefix:
          std::move(o.u.buffer(), o.u.buffer() + this->size(), this->u.buffer());
          // Now we have [A B C _ _] on the left and [a b c D E] on the right. Relocate the suffix:
          std::memcpy(this->u.buffer() + this->size(), o.u.buffer() + this->size(), (oSize - this->size()) * kSizeOfValue);
          // Now we have [A B C D E] on the left and [a b c _ _] on the right. Update the sizes and we're done.
          o.setSize(this->size());
          this->setSize(oSize);
        } else {
          const size_t oSize = o.size();
          detail::partiallyUninitializedCopy(
              std::make_move_iterator(o.u.buffer()),
              oSize,
              this->u.buffer(),
              size());
          this->setSize(oSize);
          o.clear();
        }

(where _ means "no object," uppercase means "object with a value," and lowercase means "moved-from, but not destroyed, object").

My point is, yes you could do that, but that's significantly more complicated than the original point of my #1934. If you want to scope-creep it to also do that, sure, but it's kind of a whole separate thing from either "#1934 as originally designed" or "fixing-forward to eliminate b6762ac's large memcpy."

@Quuxplusone
Copy link
Contributor Author

...Oh, or you could do move-assignment by "destroy and relocate-into", like this:

        if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else if (isRelocatable<Value>::value) {
          // [X Y Z _ _] on the left, [A B C D _] on the right
          std::destroy(this->u.buffer(), this->u.buffer() + this->size());
          // [_ _ _ _ _] on the left, [A B C D _] on the right
          std::memcpy(this->u.buffer(), o.u.buffer(), (kWholeBufferIsSmall ? MaxInline : oSize) * kSizeOfValue);
          // [A B C D _] on the left, [_ _ _ _ _] on the right
          o.setSize(0);  // n.b.: NOT o.clear()!
          this->setSize(oSize);
        } else {
          [...]
        }

I see now that this is actually what AMC does. I have nothing against that, just that again it's kind of unrelated to the original intent of this PR.

facebook-github-bot pushed a commit that referenced this pull request Feb 22, 2024
Summary:
D43401609, derived from [PR #1934](#1934), allows the move constructor to use `memcpy` to relocate the inline storage for relocatable types, but it copies the whole storage disregarding the previous limitation on size for trivial types (which are a subset of relocatable types). Furthermore, the same optimization is not applied to the move-assignment operator.

This diff restores the size limitation (using a precise copy when relocating inline buffers that exceed it) and applies the same logic to move-assignment.

Also, since now we can assume C++17 and thus `if constexpr`, I removed the unnecessary overload of `copyInlineTrivial()`, and I made the names less ambiguous.

We might also benefit from special-casing relocatable types on reallocation in `makeSizeInternal()`, but that can be done separately.

Reviewed By: Gownta

Differential Revision: D53965673

fbshipit-source-id: 2939c03ada3b19d4fadfa11621a3e2f3afc5ccb7
facebook-github-bot pushed a commit to facebook/hhvm that referenced this pull request Feb 22, 2024
Summary:
D43401609, derived from [PR #1934](facebook/folly#1934), allows the move constructor to use `memcpy` to relocate the inline storage for relocatable types, but it copies the whole storage disregarding the previous limitation on size for trivial types (which are a subset of relocatable types). Furthermore, the same optimization is not applied to the move-assignment operator.

This diff restores the size limitation (using a precise copy when relocating inline buffers that exceed it) and applies the same logic to move-assignment.

Also, since now we can assume C++17 and thus `if constexpr`, I removed the unnecessary overload of `copyInlineTrivial()`, and I made the names less ambiguous.

We might also benefit from special-casing relocatable types on reallocation in `makeSizeInternal()`, but that can be done separately.

Reviewed By: Gownta

Differential Revision: D53965673

fbshipit-source-id: 2939c03ada3b19d4fadfa11621a3e2f3afc5ccb7
@ot
Copy link
Contributor

ot commented Feb 22, 2024

it's kind of unrelated to the original intent of this PR

I'm not sure what the original intent was, but I think it is better if move-assigning into an empty container behaves the same way as move-construction. The additional complexity is just a couple of lines, so it seems like a no brainer to me.

I landed the change in 2459e17

@Quuxplusone
Copy link
Contributor Author

Cool, 2459e17 LGTM! (At least I see nothing wrong with it in a five-minute glance.)
FWIW, I like your refactor into moveInlineStorageRelocatable that pushes down the branch on "how much do we copy?" to the lower more hidden level. Presumably the next step could be to introduce an analogous copyInlineStorageTrivial to replace/subsume the existing copyWholeInlineStorageTrivial.
(I suppose now I'm questioning the choice of the verb "move" there, instead of relocateInlineStorageTrivial. :)) Anyway — thanks!

@ot
Copy link
Contributor

ot commented Feb 23, 2024

Presumably the next step could be to introduce an analogous copyInlineStorageTrivial to replace/subsume the existing copyWholeInlineStorageTrivial.

I don't think there would be much benefit in doing that, the special case is really about when it is beneficial to copy the whole buffer because it turns into a small number of full-word MOVs, instead of a loop of unpredictable length. The general case already lowers to a memcpy when the value type is trivial, with any reasonable compiler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants