mds: catch damage to dentry's first field #49773
{
  auto&& snapclient = dir->mdcache->mds->snapclient;
  auto next_snap = snapclient->get_last_created()+1;
  if (first > last || (snapclient->is_server_ready() && first > next_snap)) {
Is it also possible for `first` to equal `last`, with both being head, just like in https://tracker.ceph.com/issues/38452#note-10?
If so, should we also check for this and abort the MDS?
Could we also check for CInode corruption in this PR?
The `first > next_snap` check would trigger when `first == SNAP_HEAD`.
> Could we also check CInode's corruption in this PR ?
It's not a bad idea, but writing dedicated tests for that kind of corruption would take time. I believe the corruption is introduced via the dentries, so it's best to get a fix in for this ASAP.
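To make the `first == SNAP_HEAD` case concrete, here is a minimal standalone sketch of the predicate from the diff above. It is not the MDS code: `SnapClientView`, `kSnapHead`, and `first_is_damaged` are hypothetical names, and the sketch assumes that the head snapid is a very large sentinel, so it always exceeds `next_snap`.

```cpp
#include <cassert>
#include <cstdint>

using snapid_t = std::uint64_t;

// Assumption: the head snapid is a sentinel far above any real snapshot id.
// The exact value here is illustrative only.
constexpr snapid_t kSnapHead = UINT64_MAX - 1;

// Hypothetical stand-in for the two snapclient queries used in the diff.
struct SnapClientView {
  bool server_ready;      // models snapclient->is_server_ready()
  snapid_t last_created;  // models snapclient->get_last_created()
};

// Mirrors the condition in the diff: `first` is suspicious if it is past
// `last`, or (when the snap server is ready) past the next snapid that
// could possibly be allocated.
bool first_is_damaged(snapid_t first, snapid_t last, const SnapClientView& sc) {
  const snapid_t next_snap = sc.last_created + 1;
  return first > last || (sc.server_ready && first > next_snap);
}
```

Under these assumptions, a dentry with `first == last == kSnapHead` passes the `first > last` test but is still flagged by `first > next_snap`, provided the snap server is ready. When the server is not ready, only the `first > last` half of the check applies.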
Makes sense +1
The PR looks good to me. But I have not followed closely how Postgres can corrupt the dentry, which is what this PR is catching. A bit of information about that in the commit message would help.
b74b8c3 to 229b872
Thanks @kotreshhr, I've left a small explanation in one of the commits.
229b872 to ed581bf
LGTM
cc17782 to 25b3ad9
jenkins test make check
jenkins test api
@batrick I haven't gone through the test run, but this test failure caught my eye and might be related: https://pulpito.ceph.com/vshankar-2023-03-03_04:39:14-fs-wip-vshankar-testing-20230303.023823-testing-default-smithi/7192234
25b3ad9 to 210becf
This was a surprising failure that I hadn't seen before in teuthology. We expect the MDS to die, but for some reason these errors have only recently started showing up, AFAIK. Anyway, I've fixed that. I'm looking into another logical problem with this PR that is showing up in other test failures.
210becf to bf90995
bf90995 to 0c69de2
* refs/pull/49773/head:
  qa: add missing scan_links step for data scan recovery
  qa/tasks/cephfs: test damage to dentry's first is caught
  qa/tasks/cephfs: use rank_asok and allow specifying rank
  qa/tasks: allow specifying timeout command prefix to ceph
  mds: provide test configs for creating first corruption
  mds: catch damage to dentry's first field
  mds: add debugging for pre_cow_old_inode
  mds: cleanup code
e126f34 to 686b4b3
jenkins test make check arm64
Changes look good. I'll have a final rundown on the test run before merging this. Thank you, @batrick.
#50692 should be merged with this PR or there will be QA failures in main.
@vshankar config added. Going to QA now.
80fc323 to 78d0aca
jenkins test api
dout(1) << *dn << " " << dn->corrupt_first_loaded << dendl;
if (!dn->corrupt_first_loaded) {
  dn->check_corruption(false);
}
Is this for debugging only? The debug level is `1`?
... yes!
Good catch!
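For readers unfamiliar with why a level-1 debug line matters, the following is a small sketch of the usual leveled-logging convention, where a message is emitted only if its level is at or below the configured threshold. `LeveledLog` is a hypothetical class written for illustration; it is not Ceph's `dout` implementation, and the threshold values used are assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical minimal leveled logger; it is NOT Ceph's dout implementation.
class LeveledLog {
public:
  explicit LeveledLog(int threshold) : threshold_(threshold) {}

  // Record the message only when its level is at or below the threshold.
  void log(int level, const std::string& msg) {
    if (level <= threshold_) {
      emitted_.push_back(msg);
    }
  }

  const std::vector<std::string>& emitted() const { return emitted_; }

private:
  int threshold_;
  std::vector<std::string> emitted_;
};
```

With a low threshold such as 1, a message logged at level 1 is always recorded while one logged at level 20 is dropped, which is why a leftover debugging line at level 1 would be noisy on a production daemon.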
without latest SQUASH commit. This is just @vshankar do you want more tests run?
Nope. Good to merge once tests pass.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

When possible. Abort the MDS before it can be written to the journal/directory.
This is part of a series to address corruption first observed in [1]. How the corruption is introduced is yet unknown.
[1] https://tracker.ceph.com/issues/38452#note-10
Fixes: http://tracker.ceph.com/issues/58482
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

This will use the more efficient:
ceph tell mds.<fsname>:<rank> ...
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Without, the first field remains corrupt (HEAD).
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

So admin can restore access to files if necessary.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
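The commit message above ("Abort the MDS before it can be written to the journal/directory") and the `corrupt_first_loaded` guard in the reviewed snippet suggest a split between damage found at load time and damage found just before a write. The sketch below models one way that split could work. It is an assumption-laden illustration, not the MDS implementation: `DentryModel`, `Verdict`, and `check_before_commit` are hypothetical, and returning a verdict stands in for actually aborting the daemon.

```cpp
#include <cassert>
#include <cstdint>

using snapid_t = std::uint64_t;

// Hypothetical outcome type; the real code aborts instead of returning.
enum class Verdict { Clean, CorruptOnLoad, AbortBeforeCommit };

// Hypothetical, simplified model of a dentry's snapid range.
struct DentryModel {
  snapid_t first;
  snapid_t last;
  bool corrupt_first_loaded;  // set when damage was already present on load
};

// Same range test as in the reviewed diff.
bool first_out_of_range(const DentryModel& dn, snapid_t next_snap,
                        bool server_ready) {
  return dn.first > dn.last || (server_ready && dn.first > next_snap);
}

// Assumption: load == true means the dentry was just read from storage, so
// damage is recorded rather than treated as newly introduced.
Verdict check_corruption(DentryModel& dn, bool load, snapid_t next_snap,
                         bool server_ready) {
  if (!first_out_of_range(dn, next_snap, server_ready)) {
    return Verdict::Clean;
  }
  if (load) {
    dn.corrupt_first_loaded = true;
    return Verdict::CorruptOnLoad;
  }
  return Verdict::AbortBeforeCommit;
}

// Models the guard in the snippet: dentries that were already corrupt when
// loaded are not re-checked on the write path.
Verdict check_before_commit(DentryModel& dn, snapid_t next_snap,
                            bool server_ready) {
  if (dn.corrupt_first_loaded) {
    return Verdict::CorruptOnLoad;
  }
  return check_corruption(dn, false, next_snap, server_ready);
}
```

In this model, only damage that appears after a clean load reaches the `AbortBeforeCommit` outcome, which matches the stated goal of stopping newly corrupted dentries before they are persisted.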
6375d9a to 7ffa065
Tests look good: https://pulpito.ceph.com/pdonnell-2023-03-29_15:15:36-fs-wip-pdonnell-testing-20230329.131031-distro-default-smithi/
The warning should be ignorelisted. I added it to
Will merge when the Jenkins tests pass.
Only for the tests we are deliberately inducing it in, right? If this turns up in the lab we definitely want to detect it!
Yes, it's induced.
jenkins test make check arm64