[RFC] osd: EC Partial Stripe Reads (Retry of #23138) #52746

markhpc · 2023-08-02T05:19:18Z

This is a re-implementation of PR #23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui cuixiaofei@sangfor.com.cn for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsfortytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appear to be quite substantial, especially for large stripe widths.

Edit: There was previously a bug in the cycles/op calculation due to a change in how the parsing code works (we previously didn't collect perf data on all OSDs, now we do). The scale is exactly the same, but the cycles/op numbers were originally inflated by a static factor of 6. The new numbers have been verified to be within about 10% of the expected cycles/op numbers when calculated using aggregate average CPU consumption and IOPS instead of aggregate cycles and ops.

markhpc · 2023-08-02T14:56:10Z

retest this please

ronen-fr · 2023-08-02T15:35:39Z

src/erasure-code/ErasureCode.cc

    }
  }
  return r;
 }
+
+bool ErasureCode::is_systematic() {


should probably be a cost function

So one thing that's a little confusing is that our interface specifically states that it's for systematic codes anyway:

https://github.com/ceph/ceph/blob/main/src/erasure-code/ErasureCodeInterface.h#L23-L26

The original author of the first PR appears to have implemented this after receiving feedback that he was making assumptions about the plugins being systematic (probably due to the interface documentation?).

@ronen-fr I'm not sure what you mean by a "cost function". @markhpc I'm not sure there's actually a reason to select a non-systematic code. Are there any non-systematic codes merged yet? If not, I'm fine with just relying on a requirement that they be systematic and removing this method.

@athanatos - sorry for the confusion. 'const'

ronen-fr · 2023-08-02T15:37:04Z

src/osd/ECBackend.cc

+  }
+
+  for(int i = 0; i < total_chunks; i++) {
+    int j = (first_chunk + i) % data_chunk_count;


please add a comment to explain the logic here (and maybe rename 'j'?)

left the "j" for now. We have a similar loop in get_want_to_read_shards here:
https://github.com/ceph/ceph/blob/main/src/osd/ECBackend.h#L211-L217

If we are going to change these around I think we should perhaps do it at the same time for both functions in another PR. I did add documentation for this function in the header though.

This is among the less readable files in the OSD code base, let's not double down on my prior poor taste. i, j, and k are fine as simple indices for short for loops, but j isn't a simple index. It should probably be chunk_to_read.

src/osd/ECBackend.cc

This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui <cuixiaofei@sangfor.com.cn> for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsfortytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appear to be quite substantial, especially for large stripe widths.

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>

athanatos

Minor nits first, there's some stylistic stuff to clean up and some type renames that should be in a separate commit.

More substantially, there's some inline offset twiddling I'd like to see tucked into well named stripe_info_t utility methods.

A major concern I have for features like this is how hard it is to actually trigger the behavior, or to inadvertently fail to trigger the behavior in testing. This implementation has a lot of special casing for partial_read and (from what I can tell) an unnecessarily restrictive set of conditions on it. Can we generalize it so that it basically always applies for non-stripe aligned reads? Also, it actually can coexist with a fast read -- either you get the chunks you need or you get enough other chunks to decode the whole stripe.

Also, (just an fyi) I think this is going to conflict with @rzarzynski's work in #52264.
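
The remark above that a partial read can coexist with a fast read can be sketched as a completion check: the read is done either when every wanted chunk has arrived, or when enough other chunks have arrived to decode the stripe. This is only an illustration. The function name `read_can_complete` is hypothetical, and the "any k chunks suffice" rule assumes an MDS code such as Reed-Solomon; other plugins may need a plugin-specific query instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>

// Hypothetical helper, not part of ECBackend.
// wanted:   chunk ids the client read actually needs
// received: chunk ids for which replies have arrived so far
// k:        number of data chunks in the erasure-code profile
inline bool read_can_complete(const std::set<int>& wanted,
                              const std::set<int>& received,
                              std::size_t k) {
  // Direct path: all wanted chunks are present, no decode required.
  // std::includes needs sorted ranges, which std::set guarantees.
  if (std::includes(received.begin(), received.end(),
                    wanted.begin(), wanted.end())) {
    return true;
  }
  // Decode path (assumption: MDS code): any k distinct chunks
  // are enough to reconstruct the whole stripe.
  return received.size() >= k;
}
```

With k = 4 and a read that wants only chunk 1, the check passes as soon as chunk 1 arrives, or once any four other chunks have arrived.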

athanatos · 2023-08-03T00:57:34Z

src/osd/ECBackend.cc

+  const vector<int> &chunk_mapping = ec_impl->get_chunk_mapping();
+
+  int total_chunks = (chunk_size - 1 + len) / chunk_size;
+  int first_chunk = (off / chunk_size) % data_chunk_count;


These should be simple, well named ECUtil::stripe_info_t helpers. See ECUtil.h.
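
As a rough idea of what such helpers might look like, here is a minimal sketch. `stripe_layout_t` is a hypothetical stand-in, not the real `ECUtil::stripe_info_t`, and both method names are made up for illustration. Unlike the expression in the diff, the span calculation here does not assume that `off` is chunk-aligned.

```cpp
#include <cstdint>

// Hypothetical stand-in for ECUtil::stripe_info_t; only the fields
// needed for this illustration.
struct stripe_layout_t {
  uint64_t chunk_size;        // bytes per chunk
  uint64_t data_chunk_count;  // k, number of data chunks per stripe

  // Index (0..k-1) of the data chunk that holds logical offset `off`.
  uint64_t offset_to_data_chunk(uint64_t off) const {
    return (off / chunk_size) % data_chunk_count;
  }

  // Number of distinct data chunks touched by [off, off + len),
  // capped at k because a longer read wraps onto the same shards.
  uint64_t data_chunks_spanned(uint64_t off, uint64_t len) const {
    if (len == 0) {
      return 0;
    }
    const uint64_t first = off / chunk_size;
    const uint64_t last = (off + len - 1) / chunk_size;
    const uint64_t spanned = last - first + 1;
    return spanned < data_chunk_count ? spanned : data_chunk_count;
  }
};
```

The loop in the diff could then iterate `data_chunks_spanned(off, len)` times starting from `offset_to_data_chunk(off)`, keeping the arithmetic out of `ECBackend.cc`.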

athanatos · 2023-08-03T00:57:49Z

src/osd/ECBackend.cc

+    total_chunks = data_chunk_count;
+  }
+
+  for(int i = 0; i < total_chunks; i++) {


Space between for and (

athanatos · 2023-08-03T01:00:14Z

src/osd/ECBackend.cc

+  }
+
+  for(int i = 0; i < total_chunks; i++) {
+    int j = (first_chunk + i) % data_chunk_count;


This is among the less readable files in the OSD code base, let's not double down on my prior poor taste. i, j, and k are fine as simple indices for short for loops, but j isn't a simple index. It should probably be chunk_to_read.

athanatos · 2023-08-03T01:05:10Z

src/osd/ECBackend.cc

+
+  for(int i = 0; i < total_chunks; i++) {
+    int j = (first_chunk + i) % data_chunk_count;
+    int chunk = (int)chunk_mapping.size()  > j ? chunk_mapping[j] : j;


Are there any plugins that actually use chunk_mapping? The only two possibilities here based on the ErasureCodeInterface.h interface comment are (chunk_mapping.empty() || j < chunk_mapping.size()).
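
If only those two cases are possible, the lookup could be written to say so explicitly. The sketch below is an assumption-laden illustration: `chunk_index_to_shard` is a hypothetical name, and the identity mapping for an empty vector is taken from the fallback in the diff above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: translate a data-chunk index into a shard id.
// Assumption: the mapping is either empty (identity) or covers every
// chunk index, per the review comment above.
inline int chunk_index_to_shard(const std::vector<int>& chunk_mapping,
                                std::size_t chunk_index) {
  if (chunk_mapping.empty()) {
    return static_cast<int>(chunk_index);
  }
  // A short, non-empty mapping would be a plugin bug, so fail loudly
  // instead of silently falling back to the identity mapping.
  assert(chunk_index < chunk_mapping.size());
  return chunk_mapping[chunk_index];
}
```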

athanatos · 2023-08-03T01:06:06Z

src/osd/ECBackend.cc

@@ -1686,6 +1686,29 @@ int ECBackend::get_min_avail_to_read_shards(
  return 0;
 }

+void ECBackend::get_min_want_to_read_shards(


This duplicates some of the logic in the other get_want_to_read_shards overload. I'd like to see the other overload invoke this.

athanatos · 2023-08-03T01:11:57Z

src/osd/ECBackend.cc

-	to_decode,
-	&bl);
+
+      int r = ECUtil::decode(ec->sinfo, ec->ec_impl, to_decode, &bl);


athanatos · 2023-08-03T01:12:04Z

src/osd/ECBackend.cc

      if (r < 0) {
        res.r = r;
        goto out;
      }
+


stray whitespace

athanatos · 2023-08-03T01:12:57Z

src/osd/ECBackend.h

@@ -136,20 +136,29 @@ class ECBackend : public PGBackend {
   * ensures that we won't ever have to restart a client initiated read in
   * check_recovery_sources.
   */
+  typedef boost::tuple<uint64_t, uint64_t, uint32_t> ec_align_t;
+  typedef std::map<hobject_t,std::pair<int, extent_map>>  ec_extents_t;


using, not typedef

Swapping these types for their typedefs makes this PR harder to read. If we're going to bother with this, at least ec_align_t should be a struct with useful member names. Those changes should also be their own commit.
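
For illustration, the suggestion might look roughly like the following. The member names are an assumption: the diff only shows a `boost::tuple<uint64_t, uint64_t, uint32_t>`, and this sketch guesses that the fields are offset, length, and flags. `ec_align_list_t` and `end_offset` are hypothetical additions.

```cpp
#include <cstdint>
#include <list>

// A struct with named members in place of the anonymous tuple.
// Assumed field meaning: (offset, size, flags).
struct ec_align_t {
  uint64_t offset;
  uint64_t size;
  uint32_t flags;
};

// `using` alias instead of `typedef` (hypothetical alias name).
using ec_align_list_t = std::list<ec_align_t>;

// Call sites read as a.offset + a.size rather than
// a.get<0>() + a.get<1>().
inline uint64_t end_offset(const ec_align_t& a) {
  return a.offset + a.size;
}
```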

athanatos · 2023-08-03T01:15:04Z

src/osd/ECBackend.h

    read_request_t(
      const std::list<boost::tuple<uint64_t, uint64_t, uint32_t> > &to_read,
      const std::map<pg_shard_t, std::vector<std::pair<int, int>>> &need,
      bool want_attrs,
-      GenContext<std::pair<RecoveryMessages *, read_result_t& > &> *cb)
+      GenContext<std::pair<RecoveryMessages *, read_result_t& > &> *cb,
+      bool partial_read=false)


Why is this a parameter? Wouldn't we always want to do a partial read if possible? Why would we ever read more chunks than are necessary? Seems like it could be a local property of each offset/len pair.

Oops, isn't this dead code? I don't see any caller of read_request_t() setting it.

athanatos · 2023-08-03T01:18:47Z

src/osd/ECBackend.cc

+    if (to_read.size() != 1) {
+      return false;
+    }
+    // Only partial read if the length is inside the stripe boundary


These restrictions seem kind of brittle. I don't see an obvious reason why the head and tail of a multi-stripe read couldn't be sub-stripe reads. Shouldn't it simply be a function of the offsets?
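
One way to picture "a function of the offsets" is to split any read into a partial-stripe head, a run of whole stripes, and a partial-stripe tail. The sketch below is hypothetical (`read_split_t` and `split_by_stripe` are not Ceph names) and only shows the arithmetic; the head and tail are the pieces that could be served by sub-stripe reads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: byte lengths of the three pieces of a read.
struct read_split_t {
  uint64_t head_len;  // partial stripe before the first stripe boundary
  uint64_t body_len;  // whole stripes in the middle
  uint64_t tail_len;  // partial stripe after the last whole stripe
};

// Split [off, off + len) on stripe boundaries.
inline read_split_t split_by_stripe(uint64_t off, uint64_t len,
                                    uint64_t stripe_width) {
  assert(stripe_width > 0);
  read_split_t s{0, 0, 0};
  if (len == 0) {
    return s;
  }
  const uint64_t end = off + len;
  // First stripe boundary at or after `off` (round up).
  const uint64_t first_boundary =
      ((off + stripe_width - 1) / stripe_width) * stripe_width;
  if (first_boundary > end) {
    // The read never reaches a boundary: it is one sub-stripe piece.
    s.head_len = len;
    return s;
  }
  // Last stripe boundary at or before `end` (round down).
  const uint64_t last_boundary = (end / stripe_width) * stripe_width;
  s.head_len = first_boundary - off;
  s.body_len = last_boundary - first_boundary;
  s.tail_len = end - last_boundary;
  return s;
}
```

With a stripe width of 1000, a read at offset 900 of length 2200 splits into a 100-byte head, a 2000-byte body, and a 100-byte tail, so only the body needs every data chunk.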

athanatos · 2023-08-03T03:37:29Z

The perf results are impressive, this is worth getting fixed up and merged.

NUABO · 2023-08-28T07:01:37Z

Hi @markhpc, may I ask: will this optimization cause problems when data is silently corrupted or an I/O error occurs? How are such abnormal situations handled? Thanks.

markhpc · 2023-08-31T14:33:40Z

@NUABO That's a good question. AFAIK we don't do any kind of read path validation from the secondaries for primaries in replicated scenarios either. I suppose a side benefit of the current EC scheme is that theoretically errors can be detected on read, though I confess I don't know how that plays out in practice.

TODO: Check how we are using CRC here.

athanatos · 2023-08-31T20:55:03Z

Bluestore stores checksums, so we're protected there at least. The existing implementation doesn't use extra chunks to validate the data -- this isn't any less safe than what we already do. Deep scrub will still validate against stored full object checksums.

github-actions · 2023-09-09T02:02:03Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions · 2023-12-14T00:04:57Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

dvanders · 2023-12-14T16:52:11Z

pls don't close this

This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui <cuixiaofei@sangfor.com.cn> for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsfortytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appear to be quite substantial, especially for large stripe widths.

---

This commit is a further resurrection, this time of Mark Nelson's work in ceph#52746. It brings it on top of the recent rework of `ECBackend` and addresses review comments.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

github-actions · 2024-03-12T21:01:22Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

This commit is a further resurrection of the EC partial reads concept, this time of Mark Nelson's work sent as PR ceph#52746. The modifications in this commit are mostly about settling Mark's work on top of the recent rework of `ECBackend`, which had shared the EC codebase with the crimson-osd. As the original description says, Mark's work is based on an earlier attempt from Xiaofei Cui. Therefore credits go to:

* Mark Nelson (Clyso),
* Xiaofei Cui (cuixiaofei@sangfor.com.cn).

The original commit description is preserved below:

> This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui <cuixiaofei@sangfor.com.cn> for the original implementation.
>
> Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsfortytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.
>
> The performance and efficiency benefits for small random reads appear to be quite substantial, especially for large stripe widths.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

github-actions · 2024-04-11T22:01:35Z

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

xenago · 2024-04-12T14:03:20Z

Well that's disappointing

baergj · 2024-04-12T14:07:59Z

@xenago This is being continued in #55196

xenago · 2024-04-12T14:09:10Z

> @xenago This is being continued in #55196

Thank you for linking that!

NUABO · 2024-04-12T14:12:23Z

> @xenago This is being continued in #55196

good news

markhpc · 2024-04-12T19:26:20Z

Yep! Radek took it over and we're just waiting for QA to do some updated performance tests!

markhpc added core performance labels Aug 2, 2023

markhpc requested review from gregsfortytwo, jdurgin and dvanders August 2, 2023 05:19

markhpc requested a review from a team as a code owner August 2, 2023 05:19

github-actions bot added the tests label Aug 2, 2023

ronen-fr reviewed Aug 2, 2023 (src/erasure-code/ErasureCode.cc, src/osd/ECBackend.cc)

neha-ojha requested a review from athanatos August 2, 2023 16:11

markhpc force-pushed the wip-osd-ec-partial-read branch from 5e4fb61 to 8f53dac August 2, 2023 23:38

athanatos requested changes Aug 3, 2023

github-actions bot added the needs-rebase label Sep 9, 2023

github-actions bot added the stale label Dec 14, 2023

github-actions bot removed the stale label Dec 14, 2023

rzarzynski mentioned this pull request Jan 16, 2024: osd: EC Partial Stripe Reads (Retry of #23138 and #52746) #55196

github-actions bot added the stale label Mar 12, 2024

github-actions bot closed this Apr 11, 2024
