Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] osd: EC Partial Stripe Reads (Retry of #23138) #52746

Closed
wants to merge 1 commit into from

Conversation

markhpc
Copy link
Member

@markhpc markhpc commented Aug 2, 2023

This is a re-implementation of PR #23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui cuixiaofei@sangfor.com.cn for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

FIO 4KB Random Read IOPS (16 Client)
FIO 4KB Random Read Cycles_OP (16 Client)

Edit: There was previously a bug in the cycles/op calculation due to a change in how the parsing code works (we previous didn't collect perf data on all osds, now we do). The scale is exactly the same, but the cycles/op numbers were originally over-inflated by a static factor of 6. The new numbers have been verified to be within about 10% of the expected cycles/op numbers when calculated using aggregate average CPU consumption and IOPS instead of aggregate cycles and ops.

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@markhpc
Copy link
Member Author

markhpc commented Aug 2, 2023

retest this please

}
}
return r;
}

bool ErasureCode::is_systematic() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably be a cost function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So one thing that's a little confusing is that our interface specifically states that it's for systematic codes anyway:

https://github.com/ceph/ceph/blob/main/src/erasure-code/ErasureCodeInterface.h#L23-L26

The original author of the first PR appears to have implemented this after receiving feedback that he was making assumptions about the plugins being systematic (probably due to the interface documentation?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ronen-fr I'm not sure what you mean by a "cost function". @markhpc I'm not sure there's actually a reason to select a non-systematic code. Are there any non-systematic codes merged yet? If not, I'm fine with just relying on a requirement that they be systematic and removing this method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@athanatos - sorry for the confusion. 'const'

}

for(int i = 0; i < total_chunks; i++) {
int j = (first_chunk + i) % data_chunk_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment to explain the logic here (and maybe rename 'j'?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left the "j" for now. We have a similar loop in get_want_to_read_shards here:
https://github.com/ceph/ceph/blob/main/src/osd/ECBackend.h#L211-L217

If we are going to change these around I think we should perhaps do it at the same time for both functions in another PR. I did add documentation for this function in the header though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is among the less readable files in the OSD code base, let's not double down on my prior poor taste. i, j, and k are fine as simple indices for short for loops, but j isn't a simple index. It should probably be chunk_to_read.

src/osd/ECBackend.cc Outdated Show resolved Hide resolved
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits first, there's some stylistic stuff to clean up and some type renames that should be in a separate commit.

More substantially, there's some inline offset twiddling I'd like to see tucked into well named stripe_info_t utility methods.

A major concern I have for features like this is how hard it is to actually trigger the behavior, or to inadvertently fail to trigger the behavior in testing. This implementation has a lot of special casing for partial_read and (from what I can tell) an unnecessarily restrictive set of conditions on it. Can we generalize it so that it basically always applies for non-stripe aligned reads? Also, it actually can coexist with a fast read -- either you get the chunks you need or you get enough other chunks to decode the whole stripe.

Also, (just an fyi) I think this is going to conflict with @rzarzynski's work in #52264.

const vector<int> &chunk_mapping = ec_impl->get_chunk_mapping();

int total_chunks = (chunk_size - 1 + len) / chunk_size;
int first_chunk = (off / chunk_size) % data_chunk_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be simple, well named ECUtil::stripe_info_t helpers. See ECUtil.h.

total_chunks = data_chunk_count;
}

for(int i = 0; i < total_chunks; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space between for and (

}

for(int i = 0; i < total_chunks; i++) {
int j = (first_chunk + i) % data_chunk_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is among the less readable files in the OSD code base, let's not double down on my prior poor taste. i, j, and k are fine as simple indices for short for loops, but j isn't a simple index. It should probably be chunk_to_read.


for(int i = 0; i < total_chunks; i++) {
int j = (first_chunk + i) % data_chunk_count;
int chunk = (int)chunk_mapping.size() > j ? chunk_mapping[j] : j;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any plugins that actually use chunk_mapping? The only two possibilities here based on the ErasureCodeInterface.h interface comment are (chunk_mapping.empty() || j < chunk_mapping.size()).

@@ -1686,6 +1686,29 @@ int ECBackend::get_min_avail_to_read_shards(
return 0;
}

void ECBackend::get_min_want_to_read_shards(
Copy link
Contributor

@athanatos athanatos Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates some of the logic in the other get_want_to_read_shards overload. I'd like to see the other overload invoke this.

to_decode,
&bl);

int r = ECUtil::decode(ec->sinfo, ec->ec_impl, to_decode, &bl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

if (r < 0) {
res.r = r;
goto out;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray whitespace

@@ -136,20 +136,29 @@ class ECBackend : public PGBackend {
* ensures that we won't ever have to restart a client initiated read in
* check_recovery_sources.
*/
typedef boost::tuple<uint64_t, uint64_t, uint32_t> ec_align_t;
typedef std::map<hobject_t,std::pair<int, extent_map>> ec_extents_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using, not typedef

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swapping these types for their typedefs makes this PR harder to read. If we're going to bother with this, at least ec_align_t should be a struct with useful member names. Those changes should also be their own commit.

read_request_t(
const std::list<boost::tuple<uint64_t, uint64_t, uint32_t> > &to_read,
const std::map<pg_shard_t, std::vector<std::pair<int, int>>> &need,
bool want_attrs,
GenContext<std::pair<RecoveryMessages *, read_result_t& > &> *cb)
GenContext<std::pair<RecoveryMessages *, read_result_t& > &> *cb,
bool partial_read=false)
Copy link
Contributor

@athanatos athanatos Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a parameter? Wouldn't we always want to do a partial read if possible? Why would we ever read more chunks than are necessary? Seems like it could be a local property of each offset/len pair.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, isn't this a code code? I don't see any caller of read_request_t() setting it.

if (to_read.size() != 1) {
return false;
}
// Only partial read if the length is inside the stripe boundary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These restrictions seem kind of brittle. I don't see an obvious reason why the head and tail of a multi-stripe read couldn't be sub-stripe reads. Shouldn't it simply be a function of the offsets?

@athanatos
Copy link
Contributor

The perf results are impressive, this is worth getting fixed up and merged.

@NUABO
Copy link
Contributor

NUABO commented Aug 28, 2023

hi @markhpc , may I ask, will this optimization cause problems when the data is silently damaged or io error occurs? How to deal with abnormal situations, thanks

@markhpc
Copy link
Member Author

markhpc commented Aug 31, 2023

@NUABO That's a good question. AFAIK we don't do any kind of read path validation from the secondaries for primaries in replicated scenarios either. I suppose a side benefit of the current EC scheme is that theoretically errors can be detected on read, though I confess I don't know how that plays out in practice.

TODO: Check how we are using CRC here.

@athanatos
Copy link
Contributor

Bluestore stores checksums, so we're protected there at least. The existing implementation doesn't use extra chunks to validate the data -- this isn't any less safe than what we already do. Deep scrub will still validate against stored full object checksums.

@github-actions
Copy link

github-actions bot commented Sep 9, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Dec 14, 2023
@dvanders
Copy link
Contributor

pls don't close this

@github-actions github-actions bot removed the stale label Dec 14, 2023
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jan 12, 2024
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

---

This commit is a further ressurection, this time of the Mark Nelson's
work in ceph#52746. It brings it on top
of the recent rework of `ECBackend` and addresses review comments.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jan 16, 2024
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

---

This commit is a further ressurection, this time of the Mark Nelson's
work in ceph#52746. It brings it on top
of the recent rework of `ECBackend` and addresses review comments.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Mar 8, 2024
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

---

This commit is a further ressurection, this time of the Mark Nelson's
work in ceph#52746. It brings it on top
of the recent rework of `ECBackend` and addresses review comments.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Mar 12, 2024
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Mar 13, 2024
This commit is a further ressurection of the EC partial reads
concept; this time of the Mark Nelson's work sent as PR ceph#52746.
The modifications in this commit are mostly about settling
Mark's work on top of the recent rework of `ECBackend` which
had shared the EC codebase with the crimson-osd.

At the original description says, Mark's work is based on earlier
attempt from Xiaofei Cui.

Therefore credits go to:
  * Mark Nelson (Clyso),
  * Xiaofei Cui (cuixiaofei@sangfor.com.cn).

The original commit description is preserved below:

> This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.
>
> Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.
>
> The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Apr 11, 2024
@xenago
Copy link
Contributor

xenago commented Apr 12, 2024

Well that's disappointing

@baergj
Copy link

baergj commented Apr 12, 2024

@xenago This is being continued in #55196

@xenago
Copy link
Contributor

xenago commented Apr 12, 2024

@xenago This is being continued in #55196

Thank you for linking that!

@NUABO
Copy link
Contributor

NUABO commented Apr 12, 2024

@xenago This is being continued in #55196

good news

@markhpc
Copy link
Member Author

markhpc commented Apr 12, 2024

Yep! Radek took it over and we're just waiting for QA to do some updated performance tests!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants