-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] osd: ec partial stripe reads #23138
Conversation
retest this please |
This doesn't seem right:
Maybe we still do a full-shard read and I've made up that we're efficient about that, but it seems like extending the interfaces around minimum-to-decode is the way to go here. |
If a read is limited at one stripe and all the original data chunks are available. We will only read the chunks that contain range <off, len> to reduce the read latency and bandwidth. If the read is fast read or not all the original data chunks are available we will do a normal read. Signed-off-by: Xiaofei Cui <cuixiaofei@sangfor.com.cn>
Thanks for your review. @gregsfortytwo
Do I get the point what you mean? Thanks. |
Maybe you'd better write some docs explaining the strategy then. When I skim this, it looks like you're trying to map from logical data offsets to which chunks to read, in the OSD code, without reference to the EC plugin in use. So I think you're just assuming that the EC plugin's data placement is systematic and is a simple linear stream across the objects? That is frequently not the case for more exotic EC codes (some of which we have in-tree). So it's not that we need to tell the EC plugin what shards we want to read; we have to tell the EC plugin what data offsets we want to read, and let it tell us the minimum number of reads to get that data. |
@@ -2295,11 +2323,51 @@ void ECBackend::objects_read_and_reconstruct( | |||
} | |||
|
|||
map<hobject_t, set<int>> obj_want_to_read; | |||
set<int> want_to_read; | |||
get_want_to_read_shards(&want_to_read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a much larger change than necessary... here's all it took to make recovery read the minimum data required: 468ad4b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm suggesting that the client read path can be similarly simple as the recovery read path. There should not be redundancies with the EC plugin or the recovery code.
EC plugin tells whether it a systematic or not. Partial read is available only when the ec_impl is systematic. Signed-off-by: Xiaofei Cui <cuixiaofei@sangfor.com.cn>
You are right! I assume that the EC plugin's data placement is systematic and is a simple linear stream across the objects. First, all the EC plugins of ceph are systematic right now. Second, I think the chunk_mapping can make sure the data is a simple linear stream across the objects. In addition to make the code more general, I have made another commit to restrict the partial read to systematic EC plugins. We really can tell the EC plugin what data offsets we want to read, and let it tell us the minimum number of reads to get that data. But, I think a EC plugin should better only deal with the stripe, but not the whole object, left the work to osd maybe better. Is there any other suggestion? Thanks. @gregsfortytwo |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Closing due to unaddressed comments, please reopen if you're still working on this |
Signed-off-by: Mark Nelson <mnelson@redhat.com>
Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a littler clearer (to me at least). Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation. Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here. Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation. Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here. The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths. Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation. Looking at the original PR's review, it does not appear that we can use the same technique as in ceph@468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here. The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths. Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation. Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here. The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths. --- This commit is a further ressurection, this time of the Mark Nelson's work in ceph#52746. It brings it on top of the recent rework of `ECBackend` and addresses review comments. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation. Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here. The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths. --- This commit is a further ressurection, this time of the Mark Nelson's work in ceph#52746. It brings it on top of the recent rework of `ECBackend` and addresses review comments. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation. Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here. The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths. --- This commit is a further ressurection, this time of the Mark Nelson's work in ceph#52746. It brings it on top of the recent rework of `ECBackend` and addresses review comments. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
This commit is a further ressurection of the EC partial reads concept; this time of the Mark Nelson's work sent as PR ceph#52746. The modifications in this commit are mostly about settling Mark's work on top of the recent rework of `ECBackend` which had shared the EC codebase with the crimson-osd. At the original description says, Mark's work is based on earlier attempt from Xiaofei Cui. Therefore credits go to: * Mark Nelson (Clyso), * Xiaofei Cui (cuixiaofei@sangfor.com.cn). The original commit description is preserved below: > This is a re-implementation of PR ceph#23138 rebased on main with a couple of nitpicky changes to make the code a little more clear (to me at least). Credit goes to Xiaofei Cui [cuixiaofei@sangfor.com.cn](mailto:cuixiaofei@sangfor.com.cn) for the original implementation. > > Looking at the original PR's review, it does not appear that we can use the same technique as in 468ad4b. We don't have the ReadOp yet. I'm not sure if @gregsforytwo's idea to query the plugin works, but it's clear we are not doing the efficient thing from the get-go here. > > The performance and efficiency benefits for small random reads appears to be quite substantial, especially for large stripe widths. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
If a read is limited at one stripe and all the original data
chunks are available. We will only read the chunks that contain
range <off, len> to reduce the read latency and bandwidth.
If the read is fast read or not all the original data chunks are
available we will do a normal read.
This idea comes from http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/23872.
Signed-off-by: Xiaofei Cui cuixiaofei@sangfor.com.cn