Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mds: catch damage to dentry's first field #49773

Merged
merged 9 commits into from Mar 29, 2023
Merged

Conversation

batrick
Copy link
Member

@batrick batrick commented Jan 18, 2023

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

{
auto&& snapclient = dir->mdcache->mds->snapclient;
auto next_snap = snapclient->get_last_created()+1;
if (first > last || (snapclient->is_server_ready() && first > next_snap)) {
Copy link
Member

@lxbsz lxbsz Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this also be possible that the first equals to last, which are all head just like in https://tracker.ceph.com/issues/38452#note-10 ?

If so we also need to check this and abort the MDS ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also check CInode's corruption in this PR ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If first > next_snap, it would trigger when first == SNAP_HEAD.

Could we also check CInode's corruption in this PR ?

It's not a bad idea but writing dedicated tests for that kind of corruption would take time. I believe the corruption is introduced via the dentries so it's best to get a fix in for this ASAP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense +1

src/mds/CDentry.h Outdated Show resolved Hide resolved
@kotreshhr
Copy link
Contributor

The PR looks good to me. But I have not followed much on how postgress can corrupt the dentry which this PR is catching. A bit of information about that in commit msg would help.

@batrick
Copy link
Member Author

batrick commented Jan 19, 2023

Thanks @kotreshhr , I've left a small explanation in one of the commits.

Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/mds/Server.cc Show resolved Hide resolved
{
auto&& snapclient = dir->mdcache->mds->snapclient;
auto next_snap = snapclient->get_last_created()+1;
if (first > last || (snapclient->is_server_ready() && first > next_snap)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense +1

@batrick batrick force-pushed the postgres-encode branch 2 times, most recently from cc17782 to 25b3ad9 Compare January 26, 2023 19:33
@rishabh-d-dave
Copy link
Contributor

jenkins test make check

@rishabh-d-dave
Copy link
Contributor

jenkins test api

1 similar comment
@batrick
Copy link
Member Author

batrick commented Feb 8, 2023

jenkins test api

@vshankar vshankar added wip-vshankar-backlog Backlog of CephFS PRs to pick for testing wip-vshankar-testing2 and removed wip-vshankar-backlog Backlog of CephFS PRs to pick for testing labels Feb 16, 2023
@vshankar
Copy link
Contributor

jenkins test api

1 similar comment
@batrick
Copy link
Member Author

batrick commented Feb 28, 2023

jenkins test api

@vshankar
Copy link
Contributor

vshankar commented Mar 6, 2023

@batrick I haven't gone through the test run, but this test failure caught my eye which might be related: https://pulpito.ceph.com/vshankar-2023-03-03_04:39:14-fs-wip-vshankar-testing-20230303.023823-testing-default-smithi/7192234

@vshankar vshankar added wip-vshankar-backlog Backlog of CephFS PRs to pick for testing and removed wip-vshankar-testing2 labels Mar 7, 2023
@batrick
Copy link
Member Author

batrick commented Mar 10, 2023

@batrick I haven't gone through the test run, but this test failure caught my eye which might be related: https://pulpito.ceph.com/vshankar-2023-03-03_04:39:14-fs-wip-vshankar-testing-20230303.023823-testing-default-smithi/7192234

This was a surprising failure I didn't see before in teuthology. We expect the mds to die but for some reason only recently see these errors AFAIK.

Anyway, I've fixed that.

I'm looking into another logical problem with this PR showing up in other test failures.

batrick added a commit to batrick/ceph that referenced this pull request Mar 28, 2023
* refs/pull/49773/head:
	qa: add missing scan_links step for data scan recovery
	qa/tasks/cephfs: test damage to dentry's first is caught
	qa/tasks/cephfs: use rank_asok and allow specifying rank
	qa/tasks: allow specifying timeout command prefix to ceph
	mds: provide test configs for creating first corruption
	mds: catch damage to dentry's first field
	mds: add debugging for pre_cow_old_inode
	mds: cleanup code
@batrick
Copy link
Member Author

batrick commented Mar 28, 2023

jenkins test make check arm64

1 similar comment
@batrick
Copy link
Member Author

batrick commented Mar 28, 2023

jenkins test make check arm64

@vshankar
Copy link
Contributor

Changes look good. I'll have a final rundown on the test run before merging this. Thank you, @batrick.

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

#50692 should be merged with this PR or there will be QA failures in main.

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

@vshankar config added. Going to QA now.

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

cc @gregsfortytwo

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

jenkins test api

dout(1) << *dn << " " << dn->corrupt_first_loaded << dendl;
if (!dn->corrupt_first_loaded) {
dn->check_corruption(false);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for debug only ? The debug level is 1 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... yes!

Good catch!

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

https://pulpito.ceph.com/pdonnell-2023-03-29_15:15:36-fs-wip-pdonnell-testing-20230329.131031-distro-default-smithi/

without latest SQUASH commit. This is just --suite fs --filter damage.

@vshankar do you want more tests run?

@vshankar
Copy link
Contributor

https://pulpito.ceph.com/pdonnell-2023-03-29_15:15:36-fs-wip-pdonnell-testing-20230329.131031-distro-default-smithi/

without latest SQUASH commit. This is just --suite fs --filter damage.

@vshankar do you want more tests run?

Nope. Good to merge once tests pass.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
When possible. Abort the MDS before it can be written to the
journal/directory.

This is part of a series to address corruption first observed in [1].
How the corruption is introduced is yet unknown.

[1] https://tracker.ceph.com/issues/38452#note-10

Fixes: http://tracker.ceph.com/issues/58482
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This will use the more efficient:

    ceph tell mds.<fsname>:<rank> ...

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Without, the first field remains corrupt (HEAD).

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
So admin can restore access to files if necessary.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

Tests look good: https://pulpito.ceph.com/pdonnell-2023-03-29_15:15:36-fs-wip-pdonnell-testing-20230329.131031-distro-default-smithi/

The warning should be ignorelisted. I added it to qa/tasks/cephfs: test damage to dentry's first is caught .

Will merge when jenkins test pass.

@gregsfortytwo
Copy link
Member

The warning should be ignorelisted. I added it to qa/tasks/cephfs: test damage to dentry's first is caught .

Only for the tests we are deliberately inducing it in, right? If this turns up in the lab we definitely want to detect it!

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

The warning should be ignorelisted. I added it to qa/tasks/cephfs: test damage to dentry's first is caught .

Only for the tests we are deliberately inducing it in, right? If this turns up in the lab we definitely want to detect it!

Yes, it's induced.

@batrick
Copy link
Member Author

batrick commented Mar 29, 2023

jenkins test make check arm64

@batrick batrick merged commit 0c4d835 into ceph:main Mar 29, 2023
3 checks passed
@batrick batrick deleted the postgres-encode branch March 29, 2023 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants