rgw: read incremental metalog from master cluster based on truncate variable #36202
Conversation
When the log entry in the meta.log object of the secondary cluster is empty, the value of max_marker is also empty, which cannot satisfy the requirement that mdlog_marker <= max_marker. As a result, the secondary cluster cannot fetch new log entries from the master cluster and loops indefinitely, so the secondary cluster's metadata never catches up with the master cluster. When truncate is false, it means that the secondary cluster's meta.log is empty, so we can read more from the master cluster.
Fixes: https://tracker.ceph.com/issues/46563
Signed-off-by: gengjichao <gengjichao@jd.com>
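For context, here is a minimal, self-contained sketch of the decision described above. It is a paraphrase of the idea, not the actual RGW source: the function name, the surrounding scaffolding, and the example marker value are hypothetical, while mdlog_marker, max_marker, and the truncate flag follow the wording of this description.

```cpp
// Sketch only -- not the actual Ceph RGW code.
#include <iostream>
#include <string>

// Decide whether the secondary should clone more mdlog entries from the
// master. Markers are compared lexicographically; an empty local meta.log
// shard yields an empty max_marker.
bool should_fetch_from_master(const std::string& mdlog_marker,
                              const std::string& max_marker,
                              bool truncated)
{
  // Old behavior: only fetch when the local log has been consumed up to
  // mdlog_marker. With an empty local shard, max_marker is "" and this
  // condition can never become true, so the sync loop spins forever.
  bool caught_up_locally = (mdlog_marker <= max_marker);

  // Change described in this PR: if the local read was not truncated, the
  // local meta.log shard has no more entries, so it is also safe to pull
  // more entries from the master.
  return caught_up_locally || !truncated;
}

int main()
{
  // Empty secondary shard: mdlog_marker is ahead, max_marker is empty,
  // and the local read returned truncated == false.
  std::cout << std::boolalpha
            << should_fetch_from_master("1_000000000002.2.3", "", false)
            << std::endl;  // true with the fix; false (stuck) without it
}
```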

@cbodley Can you help me review the code?

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

@cbodley I just commented in https://tracker.ceph.com/issues/46563 as I apparently hit this very issue with a current Ceph Octopus. Any chance for this proposed fix to get merged? Any other workaround?

@moningchao could you maybe take a look at the failing checks and rebase your PR onto master? Edit: I triggered a retest to see what is actually still failing.

jenkins retest this please |
rebase done |

@cbodley @dang could you maybe kindly take a look at this PR? I don't know if I could have assigned this or raised awareness somehow. But I am able to reproduce this issue over and over with a rolling restart of RADOSGW.

@mattbenjamin @cbodley I just observed the condition of some metadata shards being stuck after restarts of RADOSGW.

thanks, @frittentheke; looking for feedback from @cbodley

or perhaps @adamemerson

i still haven't been able to reproduce this one on master. if you're only hitting this on octopus, it may be because https://tracker.ceph.com/issues/51784 hasn't been backported |
Thanks for your reply @cbodley. Is there any additional info or debugging I should do to determine what is actually the issue with the stuck metadata replication? So "which" of the two issues we are actually hitting. As for replicating the issue on your end: we are running the replication with a list of 3 RADOSGW hosts on each end, so this is NOT the usual RADOSGW->LB->RADOSGW setup. I know there is locking of shards happening and all, but maybe RADOSGW behaves differently when working with a list of endpoints for the other zone and not just a single (LB-powered) endpoint?

@cbodley sorry for being a PITA, but I just observed an "interesting" crash at https://tracker.ceph.com/issues/46563#note-9 which then left our multisite with a single stuck metadata shard not being synced. Maybe this helps narrow down the whole issue more?
I encountered metadata never catching up, with the message 'metadata sync syncing', on v16.2.6. Initially, sync was failing due to the scale-down profile noted in v16.2.7's release notes, which kept the default otp pool from being created. After using a variety of commands to set the default profile to 'scale-up' on both Ceph clusters, the otp pool was eventually created. I'm not sure if this PR would correct this set of circumstances (a prolonged problem with metadata syncing that persisted for a couple of days), but some form of it seems to be present as recently as 16.2.6.
Clean this up to get rid of the merge commit, please.

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

thanks @moningchao, it looks like this merged with #46148
rgw: read incremental metalog from master cluster based on truncate variable

When the log entry in the meta.log object of the secondary cluster is empty,
the value of max_marker is also empty, which cannot satisfy the requirement
that mdlog_marker <= max_marker. As a result, the secondary cluster cannot
fetch new log entries from the master cluster and loops indefinitely; in the
end, the secondary cluster's metadata never catches up with the master
cluster. When truncate is false, it means that the secondary cluster's
meta.log is empty, so we can read more from the master cluster.
Fixes: https://tracker.ceph.com/issues/46563
Signed-off-by: gengjichao <gengjichao@jd.com>
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard backend
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox