Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] osd: ec partial stripe reads #23138

Closed
wants to merge 2 commits into from
Closed

Conversation

thinkercui
Copy link

If a read is limited at one stripe and all the original data
chunks are available. We will only read the chunks that contain
range <off, len> to reduce the read latency and bandwidth.

If the read is fast read or not all the original data chunks are
available we will do a normal read.

This idea comes from http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/23872.

Signed-off-by: Xiaofei Cui cuixiaofei@sangfor.com.cn

@thinkercui thinkercui changed the title osd: only read needed chunks for EC read [RFC] osd: ec partial stripe reads Jul 20, 2018
@thinkercui
Copy link
Author

retest this please

@gregsfortytwo
Copy link
Member

This doesn't seem right:

  • it's open-coding the striping strategy without any reference to the erasure code plugin in use?
  • don't we already read only the necessary shards by asking the EC plugin what the minimum-to-decode is?

Maybe we still do a full-shard read and I've made up that we're efficient about that, but it seems like extending the interfaces around minimum-to-decode is the way to go here.

If a read is limited at one stripe and all the original data
chunks are available. We will only read the chunks that contain
range <off, len> to reduce the read latency and bandwidth.

If the read is fast read or not all the original data chunks are
available we will do a normal read.

Signed-off-by: Xiaofei Cui <cuixiaofei@sangfor.com.cn>
@thinkercui
Copy link
Author

Thanks for your review. @gregsfortytwo

  • This is for all the erasure code plugin but not for some one.
  • We need to tell what shards we want before we ask the EC plugin what the minimum-to-decode is.
    The less shards we want the less shards we need to read. The main point of this commit is to reduce
    want to read shards.

Do I get the point what you mean? Thanks.

@gregsfortytwo
Copy link
Member

Maybe you'd better write some docs explaining the strategy then. When I skim this, it looks like you're trying to map from logical data offsets to which chunks to read, in the OSD code, without reference to the EC plugin in use. So I think you're just assuming that the EC plugin's data placement is systematic and is a simple linear stream across the objects? That is frequently not the case for more exotic EC codes (some of which we have in-tree).

So it's not that we need to tell the EC plugin what shards we want to read; we have to tell the EC plugin what data offsets we want to read, and let it tell us the minimum number of reads to get that data.

@@ -2295,11 +2323,51 @@ void ECBackend::objects_read_and_reconstruct(
}

map<hobject_t, set<int>> obj_want_to_read;
set<int> want_to_read;
get_want_to_read_shards(&want_to_read);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a much larger change than necessary... here's all it took to make recovery read the minimum data required: 468ad4b

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This modification only relates the read from client. The read of recovery is left as it is. So, don't worry about 468ad4b. And the rop.want_to_read comes from here. We can't do like 468ad4b.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting that the client read path can be similarly simple as the recovery read path. There should not be redundancies with the EC plugin or the recovery code.

EC plugin tells whether it a systematic or not. Partial read is
available only when the ec_impl is systematic.

Signed-off-by: Xiaofei Cui <cuixiaofei@sangfor.com.cn>
@thinkercui
Copy link
Author

You are right! I assume that the EC plugin's data placement is systematic and is a simple linear stream across the objects. First, all the EC plugins of ceph are systematic right now. Second, I think the chunk_mapping can make sure the data is a simple linear stream across the objects. In addition to make the code more general, I have made another commit to restrict the partial read to systematic EC plugins.

We really can tell the EC plugin what data offsets we want to read, and let it tell us the minimum number of reads to get that data. But, I think a EC plugin should better only deal with the stripe, but not the whole object, left the work to osd maybe better.

Is there any other suggestion? Thanks. @gregsfortytwo

@stale
Copy link

stale bot commented Oct 17, 2018

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If you are a maintainer or core committer, please follow-up on this issue to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added stale and removed stale labels Oct 17, 2018
@stale
Copy link

stale bot commented Dec 19, 2018

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If you are a maintainer or core committer, please follow-up on this issue to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Dec 19, 2018
@jdurgin
Copy link
Member

jdurgin commented Feb 8, 2019

Closing due to unaddressed comments, please reopen if you're still working on this

@jdurgin jdurgin closed this Feb 8, 2019
markhpc pushed a commit to markhpc/ceph that referenced this pull request Aug 2, 2023
Signed-off-by: Mark Nelson <mnelson@redhat.com>
markhpc pushed a commit to markhpc/ceph that referenced this pull request Aug 2, 2023
Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
markhpc pushed a commit to markhpc/ceph that referenced this pull request Aug 2, 2023
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a littler clearer (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
markhpc pushed a commit to markhpc/ceph that referenced this pull request Aug 2, 2023
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
markhpc pushed a commit to markhpc/ceph that referenced this pull request Aug 2, 2023
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jan 12, 2024
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

---

This commit is a further ressurection, this time of the Mark Nelson's
work in ceph#52746. It brings it on top
of the recent rework of `ECBackend` and addresses review comments.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jan 16, 2024
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

---

This commit is a further ressurection, this time of the Mark Nelson's
work in ceph#52746. It brings it on top
of the recent rework of `ECBackend` and addresses review comments.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Mar 8, 2024
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.

Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.

The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

---

This commit is a further ressurection, this time of the Mark Nelson's
work in ceph#52746. It brings it on top
of the recent rework of `ECBackend` and addresses review comments.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Mar 13, 2024
This commit is a further ressurection of the EC partial reads
concept; this time of the Mark Nelson's work sent as PR ceph#52746.
The modifications in this commit are mostly about settling
Mark's work on top of the recent rework of `ECBackend` which
had shared the EC codebase with the crimson-osd.

At the original description says, Mark's work is based on earlier
attempt from Xiaofei Cui.

Therefore credits go to:
  * Mark Nelson (Clyso),
  * Xiaofei Cui (cuixiaofei@sangfor.com.cn).

The original commit description is preserved below:

> This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least).  Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation.
>
> Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b.  We don't have the ReadOp yet.  I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here.
>
> The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants