rgw: fix list op raced with put op maybe cause index delete (copy + fix) #41978

ivancich · 2021-06-22T16:59:17Z

Like the issue http://tracker.ceph.com/issues/22555 , a special sequence can cause this new situation.

IO sequence:
1.put index prepare
2.list, get stale index
3.check_disk_state, find the head obj not exist
4.write head obj
5.index complete
6.aio_operate dir_suggest_changes CEPH_RGW_REMOVE

step 6 will delete the index

NOTE: This is a copy of #28654 (which I closed) with one additional commit, so the tests compile and pass.

Primary credit goes to Tianshan Qu tianshan@xsky.com.

Fixes: http://tracker.ceph.com/issues/24744

a special sequence can cause this new situation. IO sequence: 1.put index prepare 2.list, get stale index 3.check_disk_state, find the head obj not exist 4.write head obj 5.index complete 6.aio_operate dir_suggest_changes CEPH_RGW_REMOVE step 6 will delete the index Fixes: http://tracker.ceph.com/issues/24744 Signed-off-by: Tianshan Qu <tianshan@xsky.com>

1.recover index from put crash after complete 2.list raced with put, index_suggest should not delete index 3.list raced with delete, index_suggest should not recover index Signed-off-by: Tianshan Qu <tianshan@xsky.com>

When these tests were initially added, the APIs were different. This updates them to the current versions, so it will compile. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>

ivancich · 2021-06-22T20:25:31Z

I ran the python script in the tracker on a "vstart cluster" and did not hit the assert: "2021-06-22 16:23:24.579769 0 verified directory 164"

I will now back up to the preceding master to see if I hit the assert there.

ivancich · 2021-06-22T21:04:14Z

I ran the python script on master (i.e., without the fix) on the vstart cluster running on a desktop and was not able to hit the assert. I got this far before killing the job:

2021-06-22 17:00:15.930446 1 verified directory 182

So I'm proposing to @mattbenjamin and @mkogan1 that @mkogan1 try it on one of the more substantial clusters he maintains.

ivancich · 2021-06-23T20:21:47Z

With this PR, vstart desktop cluster, python script run with -O: "2021-06-23 14:08:22.704091 0 verified directory 188".

w/o this pr, vstart desktop cluster, python script run with -O: "2021-06-23 14:30:28.888813 2 verified directory 96".

So even though I ran Python with "-O", it seems that nothing gets triggered on a vstart + desktop cluster. I've run it with both 1 and 2 OSDs running.

ivancich · 2021-06-24T17:30:20Z

My colleage, @mkogan1 did some manipulations to get some errors generated:

Able to reproduce the check prints on the o07.vlan104.sepia.ceph.com machine.

The prints in the ceph_repro.py script @:
...
elif expect_all_found:
print ("%s %d only saw %d files in directory %d" % (timestamp(), thread_id, len(seen), dir_id))
for fid in range(FILES_PER_DIR):
if fid not in seen:
missing_file = dir + "f_" + str(fid)
print ("%s %d Is this file '%s' missing? Check if missing?" % (timestamp(), thread_id, missing_file))
try:
get_resp = ceph_conn.get_object(Key=missing_file, Bucket=BUCKET)
print ("%s %d Seems like its here! %s" % (timestamp(), thread_id, str(get_resp)))
except Exception as e:
print ("%s %d Seems like an error: %s" % (timestamp(), thread_id, str(e)))
...

occur both on todays master and on todays master + cherry-pick of the 3 commits from https://github.com/ceph/ceph/pull/41978/commits

but only after disabling the asserts in python by running it with -O:
python3 -O ceph_repro.py

otherwise only the assert that is checking for Truncated :
...
resp = ceph_conn.list_objects(Bucket=BUCKET, Prefix=dir)
assert not resp['IsTruncated'], resp
if 'Contents' in resp:
...
is triggering which outputs a lengthy trace that seems to affect the timing.
since the reproduction is dependent on the the output of an assert message, possibly it's a temporary inconsistency?

Sample of output: python3 -O ceph_repro.py | egrep --line-buffered "file|like"

2021-06-23 09:31:55.588514 2 only saw 983 files in directory 29

2021-06-23 09:31:55.588546 2 Is this file 'cdae001c-d405-11eb-a4a8-3cfdfed269ec/d_29/f_99' missing? Check if missing?

2021-06-23 09:31:55.591324 2 Seems like its here! {'ResponseMetadata': {'RequestId': 'tx00000000000000002a1b3-0060d2ff8b-1045-default', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-length': '0', 'accept-ranges': 'bytes', 'last-modified': 'Wed, 23 Jun 2021 09:31:50 GMT', 'x-rgw-object-type': 'Normal', 'etag': '"d41d8cd98f00b204e9800998ecf8427e"', 'x-amz-request-id': 'tx00000000000000002a1b3-0060d2ff8b-1045-default', 'content-type': 'binary/octet-stream', 'date': 'Wed, 23 Jun 2021 09:31:55 GMT', 'connection': 'Keep-Alive'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2021, 6, 23, 9, 31, 50, tzinfo=tzutc()), 'ContentLength': 0, 'ETag': '"d41d8cd98f00b204e9800998ecf8427e"', 'ContentType': 'binary/octet-stream', 'Metadata': {}, 'Body': <botocore.response.StreamingBody object at 0x7fffef107898>}

2021-06-23 09:31:55.591409 2 Is this file 'cdae001c-d405-11eb-a4a8-3cfdfed269ec/d_29/f_984' missing? Check if missing?

2021-06-23 09:31:55.594077 2 Seems like its here! {'ResponseMetadata': {'RequestId': 'tx00000000000000002a1b7-0060d2ff8b-1045-default', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-length': '0', 'accept-ranges': 'bytes', 'last-modified': 'Wed, 23 Jun 2021 09:31:51 GMT', 'x-rgw-object-type': 'Normal', 'etag': '"d41d8cd98f00b204e9800998ecf8427e"', 'x-amz-request-id': 'tx00000000000000002a1b7-0060d2ff8b-1045-default', 'content-type': 'binary/octet-stream', 'date': 'Wed, 23 Jun 2021 09:31:55 GMT', 'connection': 'Keep-Alive'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2021, 6, 23, 9, 31, 51, tzinfo=tzutc()), 'ContentLength': 0, 'ETag': '"d41d8cd98f00b204e9800998ecf8427e"', 'ContentType': 'binary/octet-stream', 'Metadata': {}, 'Body': <botocore.response.StreamingBody object at 0x7fffef107e80>}

stale · 2022-01-09T15:27:10Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

cbodley · 2022-01-10T18:51:56Z

src/cls/rgw/cls_rgw.cc

@@ -2110,6 +2110,13 @@ int rgw_dir_suggest_changes(cls_method_context_t hctx,
        return -EINVAL;
      }

+      if (cur_disk.pending_map.size() == 0) {


note that there's already a if (cur_disk.pending_map.empty()) { condition below that prevents dir suggestions from being applied against pending entries

ok, i understand this PR better after the discussions around #45345

this early return means that we'll only apply suggestions for entries that had pending operations but they expired by tag timeout. the change in #45345 only addresses a race between dir_suggest and the complete, but this change also prevents clients from making suggestions against completed entries

in bucket listing, we have several conditions for calling check_disk_state() - !pending_map.empty() is not the only one. is it possible that we're relying on dir_suggest to work even though pending_map is empty?

github-actions · 2022-03-18T17:22:23Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions · 2022-09-01T00:05:02Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions · 2022-11-28T05:01:41Z

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

github-actions · 2023-01-27T21:01:25Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions · 2023-03-31T21:01:23Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions · 2023-04-30T22:01:24Z

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

mattbenjamin · 2023-05-01T12:23:38Z

@ivancich is this still needed? it has been outstanding for a long time

github-actions bot added rgw tests labels Jun 22, 2021

ivancich mentioned this pull request Jun 22, 2021

rgw: fix list op raced with put op maybe cause index delete #28654

Closed

3 tasks

ivancich added bug-fix needs-review labels Jun 22, 2021

ivancich requested review from cbodley and mattbenjamin June 22, 2021 17:00

tianshan added 2 commits June 22, 2021 13:03

test/rgw: add cases for index dir_suggest_changes

815fa3a

1.recover index from put crash after complete 2.list raced with put, index_suggest should not delete index 3.list raced with delete, index_suggest should not recover index Signed-off-by: Tianshan Qu <tianshan@xsky.com>

ivancich force-pushed the wip-dir-suggest-race branch from a6f7b29 to 5b5a9bc Compare June 22, 2021 17:03

test/rgw: conform newer cls_rgw testing code to current APIs

070b11b

When these tests were initially added, the APIs were different. This updates them to the current versions, so it will compile. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>

ivancich force-pushed the wip-dir-suggest-race branch from 5b5a9bc to 070b11b Compare June 22, 2021 17:04

stale bot added the stale label Jan 9, 2022

cbodley reviewed Jan 10, 2022

View reviewed changes

stale bot removed the stale label Jan 10, 2022

github-actions bot added the needs-rebase label Mar 18, 2022

djgalloway changed the base branch from master to main July 3, 2022 00:00

github-actions bot added the stale label Sep 1, 2022

github-actions bot closed this Nov 28, 2022

ivancich reopened this Nov 28, 2022

github-actions bot removed the stale label Nov 28, 2022

github-actions bot added the stale label Jan 27, 2023

ivancich removed the stale label Jan 30, 2023

github-actions bot added the stale label Mar 31, 2023

github-actions bot closed this Apr 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rgw: fix list op raced with put op maybe cause index delete (copy + fix) #41978

rgw: fix list op raced with put op maybe cause index delete (copy + fix) #41978

ivancich commented Jun 22, 2021 •

edited

ivancich commented Jun 22, 2021

ivancich commented Jun 22, 2021

ivancich commented Jun 23, 2021

ivancich commented Jun 24, 2021

stale bot commented Jan 9, 2022

cbodley Jan 10, 2022

cbodley Mar 18, 2022

github-actions bot commented Mar 18, 2022

github-actions bot commented Sep 1, 2022

github-actions bot commented Nov 28, 2022

github-actions bot commented Jan 27, 2023

github-actions bot commented Mar 31, 2023

github-actions bot commented Apr 30, 2023

mattbenjamin commented May 1, 2023

rgw: fix list op raced with put op maybe cause index delete (copy + fix) #41978

rgw: fix list op raced with put op maybe cause index delete (copy + fix) #41978

Conversation

ivancich commented Jun 22, 2021 • edited

ivancich commented Jun 22, 2021

ivancich commented Jun 22, 2021

ivancich commented Jun 23, 2021

ivancich commented Jun 24, 2021

stale bot commented Jan 9, 2022

cbodley Jan 10, 2022

Choose a reason for hiding this comment

cbodley Mar 18, 2022

Choose a reason for hiding this comment

github-actions bot commented Mar 18, 2022

github-actions bot commented Sep 1, 2022

github-actions bot commented Nov 28, 2022

github-actions bot commented Jan 27, 2023

github-actions bot commented Mar 31, 2023

github-actions bot commented Apr 30, 2023

mattbenjamin commented May 1, 2023

ivancich commented Jun 22, 2021 •

edited