Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: fix list op raced with put op maybe cause index delete (copy + fix) #41978

Closed
wants to merge 3 commits into from

Conversation

ivancich
Copy link
Member

@ivancich ivancich commented Jun 22, 2021

Like the issue http://tracker.ceph.com/issues/22555 , a special sequence can cause this new situation.

IO sequence:
1.put index prepare
2.list, get stale index
3.check_disk_state, find the head obj not exist
4.write head obj
5.index complete
6.aio_operate dir_suggest_changes CEPH_RGW_REMOVE

step 6 will delete the index

NOTE: This is a copy of #28654 (which I closed) with one additional commit, so the tests compile and pass.

Primary credit goes to Tianshan Qu tianshan@xsky.com.

Fixes: http://tracker.ceph.com/issues/24744

a special sequence can cause this new situation.
IO sequence:
1.put index prepare
2.list, get stale index
3.check_disk_state, find the head obj not exist
4.write head obj
5.index complete
6.aio_operate dir_suggest_changes CEPH_RGW_REMOVE
step 6 will delete the index

Fixes: http://tracker.ceph.com/issues/24744

Signed-off-by: Tianshan Qu <tianshan@xsky.com>
1.recover index from put crash after complete
2.list raced with put, index_suggest should not delete index
3.list raced with delete, index_suggest should not recover index

Signed-off-by: Tianshan Qu <tianshan@xsky.com>
When these tests were initially added, the APIs were different. This
updates them to the current versions, so it will compile.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
@ivancich
Copy link
Member Author

I ran the python script in the tracker on a "vstart cluster" and did not hit the assert: "2021-06-22 16:23:24.579769 0 verified directory 164"

I will now back up to the preceding master to see if I hit the assert there.

@ivancich
Copy link
Member Author

I ran the python script on master (i.e., without the fix) on the vstart cluster running on a desktop and was not able to hit the assert. I got this far before killing the job:

2021-06-22 17:00:15.930446 1 verified directory 182

So I'm proposing to @mattbenjamin and @mkogan1 that @mkogan1 try it on one of the more substantial clusters he maintains.

@ivancich
Copy link
Member Author

With this PR, vstart desktop cluster, python script run with -O: "2021-06-23 14:08:22.704091 0 verified directory 188".

w/o this pr, vstart desktop cluster, python script run with -O: "2021-06-23 14:30:28.888813 2 verified directory 96".

So even though I ran Python with "-O", it seems that nothing gets triggered on a vstart + desktop cluster. I've run it with both 1 and 2 OSDs running.

@ivancich
Copy link
Member Author

My colleage, @mkogan1 did some manipulations to get some errors generated:

Able to reproduce the check prints on the o07.vlan104.sepia.ceph.com machine.

The prints in the ceph_repro.py script @:
...
elif expect_all_found:
print ("%s %d only saw %d files in directory %d" % (timestamp(), thread_id, len(seen), dir_id))
for fid in range(FILES_PER_DIR):
if fid not in seen:
missing_file = dir + "f_" + str(fid)
print ("%s %d Is this file '%s' missing? Check if missing?" % (timestamp(), thread_id, missing_file))
try:
get_resp = ceph_conn.get_object(Key=missing_file, Bucket=BUCKET)
print ("%s %d Seems like its here! %s" % (timestamp(), thread_id, str(get_resp)))
except Exception as e:
print ("%s %d Seems like an error: %s" % (timestamp(), thread_id, str(e)))
...

occur both on todays master and on todays master + cherry-pick of the 3 commits from https://github.com/ceph/ceph/pull/41978/commits

but only after disabling the asserts in python by running it with -O:
python3 -O ceph_repro.py

otherwise only the assert that is checking for Truncated :
...
resp = ceph_conn.list_objects(Bucket=BUCKET, Prefix=dir)
assert not resp['IsTruncated'], resp
if 'Contents' in resp:
...
is triggering which outputs a lengthy trace that seems to affect the timing.
since the reproduction is dependent on the the output of an assert message, possibly it's a temporary inconsistency?

Sample of output: python3 -O ceph_repro.py | egrep --line-buffered "file|like"

2021-06-23 09:31:55.588514 2 only saw 983 files in directory 29

2021-06-23 09:31:55.588546 2 Is this file 'cdae001c-d405-11eb-a4a8-3cfdfed269ec/d_29/f_99' missing? Check if missing?

2021-06-23 09:31:55.591324 2 Seems like its here! {'ResponseMetadata': {'RequestId': 'tx00000000000000002a1b3-0060d2ff8b-1045-default', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-length': '0', 'accept-ranges': 'bytes', 'last-modified': 'Wed, 23 Jun 2021 09:31:50 GMT', 'x-rgw-object-type': 'Normal', 'etag': '"d41d8cd98f00b204e9800998ecf8427e"', 'x-amz-request-id': 'tx00000000000000002a1b3-0060d2ff8b-1045-default', 'content-type': 'binary/octet-stream', 'date': 'Wed, 23 Jun 2021 09:31:55 GMT', 'connection': 'Keep-Alive'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2021, 6, 23, 9, 31, 50, tzinfo=tzutc()), 'ContentLength': 0, 'ETag': '"d41d8cd98f00b204e9800998ecf8427e"', 'ContentType': 'binary/octet-stream', 'Metadata': {}, 'Body': <botocore.response.StreamingBody object at 0x7fffef107898>}

2021-06-23 09:31:55.591409 2 Is this file 'cdae001c-d405-11eb-a4a8-3cfdfed269ec/d_29/f_984' missing? Check if missing?

2021-06-23 09:31:55.594077 2 Seems like its here! {'ResponseMetadata': {'RequestId': 'tx00000000000000002a1b7-0060d2ff8b-1045-default', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-length': '0', 'accept-ranges': 'bytes', 'last-modified': 'Wed, 23 Jun 2021 09:31:51 GMT', 'x-rgw-object-type': 'Normal', 'etag': '"d41d8cd98f00b204e9800998ecf8427e"', 'x-amz-request-id': 'tx00000000000000002a1b7-0060d2ff8b-1045-default', 'content-type': 'binary/octet-stream', 'date': 'Wed, 23 Jun 2021 09:31:55 GMT', 'connection': 'Keep-Alive'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2021, 6, 23, 9, 31, 51, tzinfo=tzutc()), 'ContentLength': 0, 'ETag': '"d41d8cd98f00b204e9800998ecf8427e"', 'ContentType': 'binary/octet-stream', 'Metadata': {}, 'Body': <botocore.response.StreamingBody object at 0x7fffef107e80>}

@stale
Copy link

stale bot commented Jan 9, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jan 9, 2022
@@ -2110,6 +2110,13 @@ int rgw_dir_suggest_changes(cls_method_context_t hctx,
return -EINVAL;
}

if (cur_disk.pending_map.size() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that there's already a if (cur_disk.pending_map.empty()) { condition below that prevents dir suggestions from being applied against pending entries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i understand this PR better after the discussions around #45345

this early return means that we'll only apply suggestions for entries that had pending operations but they expired by tag timeout. the change in #45345 only addresses a race between dir_suggest and the complete, but this change also prevents clients from making suggestions against completed entries

in bucket listing, we have several conditions for calling check_disk_state() - !pending_map.empty() is not the only one. is it possible that we're relying on dir_suggest to work even though pending_map is empty?

@stale stale bot removed the stale label Jan 10, 2022
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@djgalloway djgalloway changed the base branch from master to main July 3, 2022 00:00
@github-actions
Copy link

github-actions bot commented Sep 1, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 1, 2022
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Nov 28, 2022
@ivancich ivancich reopened this Nov 28, 2022
@github-actions github-actions bot removed the stale label Nov 28, 2022
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jan 27, 2023
@ivancich ivancich removed the stale label Jan 30, 2023
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Mar 31, 2023
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Apr 30, 2023
@mattbenjamin
Copy link
Contributor

@ivancich is this still needed? it has been outstanding for a long time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants