rgw: fix list op raced with put op maybe cause index delete (copy + fix) #41978
a special sequence can cause this new situation.
IO sequence:
1.put index prepare
2.list, get stale index
3.check_disk_state, find the head obj not exist
4.write head obj
5.index complete
6.aio_operate dir_suggest_changes CEPH_RGW_REMOVE
step 6 will delete the index

Fixes: http://tracker.ceph.com/issues/24744
Signed-off-by: Tianshan Qu <tianshan@xsky.com>
1.recover index from put crash after complete
2.list raced with put, index_suggest should not delete index
3.list raced with delete, index_suggest should not recover index

Signed-off-by: Tianshan Qu <tianshan@xsky.com>
a6f7b29 to 5b5a9bc
When these tests were initially added, the APIs were different. This updates them to the current versions, so it will compile. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
5b5a9bc to 070b11b
I ran the python script in the tracker on a "vstart cluster" and did not hit the assert: "2021-06-22 16:23:24.579769 0 verified directory 164". I will now back up to the preceding master to see if I hit the assert there.
I ran the python script on master (i.e., without the fix) on the vstart cluster running on a desktop and was not able to hit the assert. I got this far before killing the job: "2021-06-22 17:00:15.930446 1 verified directory 182". So I'm proposing to @mattbenjamin and @mkogan1 that @mkogan1 try it on one of the more substantial clusters he maintains.
With this PR, vstart desktop cluster, python script run with -O: "2021-06-23 14:08:22.704091 0 verified directory 188". Without this PR, vstart desktop cluster, python script run with -O: "2021-06-23 14:30:28.888813 2 verified directory 96". So even though I ran Python with "-O", it seems that nothing gets triggered on a vstart + desktop cluster. I've run it with both 1 and 2 OSDs running.
My colleague, @mkogan1, did some manipulations to get some errors generated:
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@@ -2110,6 +2110,13 @@ int rgw_dir_suggest_changes(cls_method_context_t hctx,
      return -EINVAL;
    }

    if (cur_disk.pending_map.size() == 0) {
note that there's already an if (cur_disk.pending_map.empty()) { condition below that prevents dir suggestions from being applied against pending entries
ok, i understand this PR better after the discussions around #45345
this early return means that we'll only apply suggestions for entries that had pending operations which have since expired by tag timeout. the change in #45345 only addresses a race between dir_suggest and the complete, but this change also prevents clients from making suggestions against completed entries
in bucket listing, we have several conditions for calling check_disk_state(); !pending_map.empty() is not the only one. is it possible that we're relying on dir_suggest to work even though pending_map is empty?
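To make the trade-off in this review thread concrete, here is a minimal Python sketch of the decision being discussed. It is a toy model built only from the description in the comments above, not the Ceph implementation: the function name, the parameters (`pending_tags`, `now`, `tag_timeout`, `early_return_on_empty`) and the exact expiry rule are all assumptions made for illustration.

```python
def suggestion_applied(pending_tags, now, tag_timeout, early_return_on_empty):
    """Toy model: would a dir_suggest call apply the client's suggestion?

    pending_tags maps a (hypothetical) operation tag to the time it was
    prepared. early_return_on_empty models the guard added by this PR.
    """
    # Guard under review: an entry with no pending operations is treated as
    # completed, so the suggestion is ignored.
    if early_return_on_empty and not pending_tags:
        return False

    # Assumed expiry rule: a pending tag older than tag_timeout is dropped.
    live = {tag: t for tag, t in pending_tags.items() if now - t < tag_timeout}

    # Pre-existing rule described in the review: suggestions are never applied
    # while operations are still pending.
    return not live


# Completed entry (nothing pending): applied without the guard, ignored with it.
print(suggestion_applied({}, now=100, tag_timeout=30, early_return_on_empty=False))
print(suggestion_applied({}, now=100, tag_timeout=30, early_return_on_empty=True))
```

In this model the guard changes the outcome only for entries with an empty pending map, which is exactly the case the last question in the comment asks about.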
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
@ivancich is this still needed? it has been outstanding for a long time
As with the issue http://tracker.ceph.com/issues/22555, a special sequence can cause this new situation.
IO sequence:
1.put index prepare
2.list, get stale index
3.check_disk_state, finds that the head obj does not exist
4.write head obj
5.index complete
6.aio_operate dir_suggest_changes CEPH_RGW_REMOVE
Step 6 will delete the index entry, even though the head obj was written in step 4.
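The sequence above can be replayed in a small, single-threaded Python model. This is only a sketch under stated assumptions: `ToyBucket`, `IndexEntry` and their methods are hypothetical stand-ins for the bucket index and head objects, not Ceph APIs, and the interleaving is hard-coded rather than produced by real concurrency. Only the step names and `CEPH_RGW_REMOVE` come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    exists: bool = False                        # True once a put has completed
    pending: set = field(default_factory=set)   # tags of in-flight operations


class ToyBucket:
    """Hypothetical in-memory stand-in for a bucket index plus head objects."""

    def __init__(self):
        self.index = {}     # key -> IndexEntry
        self.heads = set()  # keys whose head object has been written

    def prepare(self, key, tag):
        self.index.setdefault(key, IndexEntry()).pending.add(tag)

    def write_head(self, key):
        self.heads.add(key)

    def complete(self, key, tag):
        entry = self.index[key]
        entry.pending.discard(tag)
        entry.exists = True

    def check_disk_state(self, key):
        # Suggest removal if the head object is missing at this moment.
        return None if key in self.heads else "CEPH_RGW_REMOVE"

    def dir_suggest(self, key, suggestion):
        # Unguarded: applies the suggestion without re-checking the entry.
        if suggestion == "CEPH_RGW_REMOVE":
            self.index.pop(key, None)


def run_race():
    bucket = ToyBucket()
    bucket.prepare("obj", "tag1")                    # 1. put index prepare
    listed = list(bucket.index)                      # 2. list, get stale index
    suggestion = bucket.check_disk_state(listed[0])  # 3. head obj not found
    bucket.write_head("obj")                         # 4. write head obj
    bucket.complete("obj", "tag1")                   # 5. index complete
    bucket.dir_suggest("obj", suggestion)            # 6. stale removal applied
    return bucket


bucket = run_race()
print("head exists:", "obj" in bucket.heads)
print("index entry exists:", "obj" in bucket.index)
```

In this model the run ends with the head object present and no index entry for it, which is the inconsistency the sequence describes.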
NOTE: This is a copy of #28654 (which I closed) with one additional commit, so the tests compile and pass.
Primary credit goes to Tianshan Qu <tianshan@xsky.com>.
Fixes: http://tracker.ceph.com/issues/24744