mds: fix issuing redundant reintegrate/migrate_stray requests #53280
nit: update commit message to start with Just in case
Sure @mchangir, I will fix them both. Thanks!
Once the new reintegration flag is set, it stays set throughout the lifetime of that request, right? If the reintegration fails on the first try, could there be any serious problems, since it can no longer perform reintegration?
@batrick I suspect this is the same issue as the "mds: infinite rename recursion on itself" tracker. From the logs I can see the same
No, it shouldn't. I just set the flag in the
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Since #52199 has been merged, I rebased this onto upstream.
Otherwise LGTM!
16d7ea6 to 0983390
This will be used later to avoid the possible multiple-reintegration issue.
Fixes: https://tracker.ceph.com/issues/62702
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Just in case a CInode's nlink is 1, and an unlink request comes in, early-replies and submits to the MDLog, but just before the MDLog is flushed a link request comes in, which also succeeds and early-replies to the client.
Later, when the MDLog events of the unlink and link requests are flushed and their callbacks are called, a stray dentry reintegration is fired. It picks the new dentry, which comes from the link request and is a remote dentry, to do the reintegration. Then, in the 'rename' code, traversing the path triggers a call to 'dn->link_remote()', which in turn fires a new stray dentry reintegration.
The problem is that if the first 'rename' request is retried several times, each retry fires a new reintegration. This makes no sense, and the extra requests may be blocked for a very long time for various reasons and then be reported as slow request warnings.
Fixes: https://tracker.ceph.com/issues/62702
Signed-off-by: Xiubo Li <xiubli@redhat.com>
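The retry behaviour described in the commit message can be illustrated with a small, self-contained sketch. This is not the code from this pull request: `StrayDentrySketch`, `StrayManagerSketch`, `maybe_reintegrate`, `on_internal_reply` and `count_issued_after_retries` are hypothetical names, and clearing the flag when the internal request's reply arrives is an assumption made only for illustration. It shows the general idea of marking a dentry once a reintegration has been issued, so that retries of the triggering request do not issue another one.

```cpp
#include <cassert>

// Hypothetical stand-in for a stray dentry; this is NOT Ceph's CDentry.
struct StrayDentrySketch {
  // Set while an internal reintegration request is in flight (assumed flag).
  bool reintegrating = false;
};

// Hypothetical stand-in for the component that issues internal requests.
struct StrayManagerSketch {
  int issued = 0;  // number of internal reintegration requests sent

  // Issue a reintegration only if none is already in flight for this dentry.
  // Returns true when a new internal request was issued.
  bool maybe_reintegrate(StrayDentrySketch& dn) {
    if (dn.reintegrating) {
      return false;  // redundant trigger, e.g. from a retried rename
    }
    dn.reintegrating = true;
    ++issued;
    return true;
  }

  // Assumption: the flag is cleared once the internal request is answered,
  // so a later, legitimate reintegration is still possible.
  void on_internal_reply(StrayDentrySketch& dn) {
    dn.reintegrating = false;
  }
};

// Simulate one initial trigger followed by `retries` retries of the same
// request, each of which would trigger reintegration again.
int count_issued_after_retries(int retries) {
  assert(retries >= 0);
  StrayDentrySketch dn;
  StrayManagerSketch mgr;
  mgr.maybe_reintegrate(dn);
  for (int i = 0; i < retries; ++i) {
    mgr.maybe_reintegrate(dn);
  }
  return mgr.issued;
}
```

With the guard in place, `count_issued_after_retries(5)` issues a single internal request; without it, each retry would add one more.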
@batrick Fixed them all and also ran the
\o/
@lxbsz I see a couple of fsstress failures with the branch that has this change: https://pulpito.ceph.com/vshankar-2023-09-28_07:23:59-fs-wip-vshankar-testing-20230926.081818-testing-default-smithi/7405355 ... and not in the baseline run for the main branch: https://pulpito.ceph.com/yuriw-2023-09-29_19:46:13-fs-main-distro-default-smithi/ I suspect this change is contributing to the failures, so I am not merging this until the failures are debugged.
* refs/pull/53280/head:
    mds: fix issuing redundant reintegrate/migrate_stray requests
    mds: record the internal client request and receive client reply
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
I will have a look today or tomorrow.
@vshankar This is a bug from kernel's From
And from the
Thanks
Thx @lxbsz - will have a look.
* refs/pull/53280/head:
    mds: fix issuing redundant reintegrate/migrate_stray requests
    mds: record the internal client request and receive client reply
https://tracker.ceph.com/projects/cephfs/wiki/Main#16-Oct-2023 Also tested ^. This certainly seems to have resolved the issue thoroughly. Thanks Xiubo.