
mds: fix crash when exporting unlinked dir #44335

Merged (2 commits) on Jun 7, 2022

Conversation

@huww98 (Contributor) commented Dec 16, 2021

When fetch() is called on an unlinked dir, we set fnode->version = 1 but leave projected_version = 0. This triggers the assert dir->get_projected_version() == dir->get_version() in Migrator::encode_export_dir().

projected_version should equal fnode->version unless this dir is projected.

Fix this by introducing a new helper, CDir::set_fresh_fnode(), which ensures the versions are set consistently.
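
For illustration, a minimal, self-contained sketch of the invariant the new helper is meant to enforce; the types and member names below are simplified stand-ins, not the actual CDir code:

#include <cassert>
#include <cstdint>

// Simplified stand-ins for illustration only; the real Ceph types differ.
using version_t = uint64_t;
struct fnode_t { version_t version = 0; };

struct CDirSketch {
  fnode_t fnode;
  version_t projected_version = 0;  // must track fnode.version when nothing is projected
  bool projected = false;           // stands in for "this dir has projected fnodes"

  // What a set_fresh_fnode()-style helper guarantees: when a fresh fnode is
  // installed (e.g. for an unlinked dir that fetch() just marks complete),
  // both versions are updated together.
  void set_fresh_fnode(fnode_t f) {
    assert(!projected);             // never install a fresh fnode while projecting
    fnode = f;
    projected_version = fnode.version;
  }

  version_t get_version() const { return fnode.version; }
  version_t get_projected_version() const { return projected_version; }
};

int main() {
  CDirSketch dir;
  fnode_t fresh;
  fresh.version = 1;                // what fetch() sets for an unlinked dir
  dir.set_fresh_fnode(fresh);
  // The assertion Migrator::encode_export_dir() enforces now holds.
  assert(dir.get_projected_version() == dir.get_version());
  return 0;
}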

Fixes: https://tracker.ceph.com/issues/53597
Signed-off-by: 胡玮文 huww98@outlook.com

I'm the reporter of the above tracker ticket. I have only read the code and logs and have not tested this patch; please help test it if possible.

One thing seems strange, unless I'm missing something: this bug seems pretty easy to reproduce. We have over 100 such directories and fewer than 30 clients, so why hasn't anybody reported it before?

Example crash logs when reducing max_mds from 2 to 1:

2021-12-16T06:25:13.749+0000 7f04aa738700 10 MDSContext::complete: 25C_IO_Dir_OMAP_FetchedMore
2021-12-16T06:25:13.749+0000 7f04b0744700  7 mds.1.cache handle_discover mds.0 wants basedir+ has [inode 0x20007e2462d [...2,head] ~mds1/stray8/20007e2462d/ auth{0=1} v82292168 ap=3 FROZEN_AUTHPIN f(v0 m2021-12-12T14:07:02.907234+0000) n(v0 rc2021-12-12T14:07:02.907234+0000 1=0+1) (ilink xlock x=1 by 0x5622a6ddcc00) (isnap xlock x=1 by 0x5622a6ddcc00) (inest lock) (iversion lock x=1 by 0x5622a6ddcc00) caps={8134535=p/-@1},l=8134535 | request=1 lock=3 caps=1 frozen=0 replicated=1 waiter=0 authpin=1 0x56207f099180]
2021-12-16T06:25:13.749+0000 7f04b0744700 15 mds.1.cache.ino(0x20007e2462d) maybe_export_pin update=0 [inode 0x20007e2462d [...2,head] ~mds1/stray8/20007e2462d/ auth{0=1} v82292168 ap=3 FROZEN_AUTHPIN f(v0 m2021-12-12T14:07:02.907234+0000) n(v0 rc2021-12-12T14:07:02.907234+0000 1=0+1) (ilink xlock x=1 by 0x5622a6ddcc00) (isnap xlock x=1 by 0x5622a6ddcc00) (inest lock) (iversion lock x=1 by 0x5622a6ddcc00) caps={8134535=p/-@1},l=8134535 | request=1 lock=3 caps=1 frozen=0 replicated=1 waiter=0 authpin=1 0x56207f099180]
2021-12-16T06:25:13.749+0000 7f04b0744700  7 mds.1.cache incomplete dir contents for [dir 0x20007e2462d ~mds1/stray8/20007e2462d/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5623e3221b00], fetching
2021-12-16T06:25:13.749+0000 7f04b0744700 10 mds.1.cache.dir(0x20007e2462d) fetch on [dir 0x20007e2462d ~mds1/stray8/20007e2462d/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5623e3221b00]
2021-12-16T06:25:13.749+0000 7f04b0744700  7 mds.1.cache.dir(0x20007e2462d) fetch dirfrag for unlinked directory, mark complete
2021-12-16T06:25:13.749+0000 7f04b0744700  7 mds.1.79799 mds has 1 queued contexts
2021-12-16T06:25:13.749+0000 7f04b0744700 10 mds.1.79799  finish 0x562258e68840
2021-12-16T06:25:13.749+0000 7f04b0744700 10 MDSContext::complete: 18C_MDS_RetryMessage
--
2021-12-16T06:25:17.873+0000 7f04aa738700 10 MDSContext::complete: 12C_M_ExportGo
2021-12-16T06:25:17.873+0000 7f04aa738700  7 mds.1.mig export_go_synced [dir 0x20007e2462d ~mds0/stray8/20007e2462d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x5623e3221b00] to 0
2021-12-16T06:25:17.873+0000 7f04aa738700 10 mds.1.cache number of subtrees = 96; not printing subtrees
2021-12-16T06:25:17.873+0000 7f04aa738700  7 mds.1.cache adjust_subtree_auth 1,1 -> 1,0 on [dir 0x20007e2462d ~mds0/stray8/20007e2462d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x5623e3221b00]
2021-12-16T06:25:17.873+0000 7f04aa738700 10 mds.1.cache number of subtrees = 96; not printing subtrees
2021-12-16T06:25:17.873+0000 7f04aa738700  7 mds.1.cache  current root is [dir 0x20007e2462d ~mds0/stray8/20007e2462d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x5623e3221b00]
2021-12-16T06:25:17.873+0000 7f04aa738700 10 mds.1.cache.dir(0x20007e2462d) setting dir_auth=1,0 from 1,1 on [dir 0x20007e2462d ~mds0/stray8/20007e2462d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x5623e3221b00]
2021-12-16T06:25:17.873+0000 7f04aa738700 10 mds.1.cache number of subtrees = 96; not printing subtrees
2021-12-16T06:25:17.873+0000 7f04aa738700  7 mds.1.mig encode_export_dir [dir 0x20007e2462d ~mds0/stray8/20007e2462d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,0 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x5623e3221b00] 0 head items
2021-12-16T06:25:17.877+0000 7f04aa738700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: In function 'void Migrator::encode_export_dir(ceph::bufferlist&, CDir*, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&, uint64_t&)' thread 7f04aa738700 time 2021-12-16T06:25:17.877684+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: 1753: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())

@github-actions github-actions bot added the cephfs Ceph File System label Dec 16, 2021
@huww98 (Contributor, Author) commented Dec 16, 2021

@lxbsz

@lxbsz lxbsz requested a review from a team December 17, 2021 00:48
@lxbsz (Member) left a comment

I think your analysis is correct.

After a directory is removed, the MDS unlinks the dentry from its parent and then creates a new stray dentry under ~mdsX/stray/. But since this stray dentry still held an extra caps=1 pin ref, the MDS skipped purging it. That is why the MDS could always fetch it from disk and the stray dentry never disappeared.
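
A condensed, hypothetical sketch of the decision described above (the real logic lives in StrayManager and considers far more state; the names below are illustrative only):

// Hypothetical, condensed sketch; not the actual StrayManager code.
struct StrayInodeSketch {
  bool has_client_caps = false;   // the extra caps pin still held by a client
};

// A stray inode that still has client caps is left alone, so its dirfrag
// keeps being re-fetched from disk instead of being purged.
bool may_purge_stray(const StrayInodeSketch& in) {
  return !in.has_client_caps;
}

int main() {
  StrayInodeSketch stray;
  stray.has_client_caps = true;            // client keeps a cap on the deleted dir
  return may_purge_stray(stray) ? 1 : 0;   // returns 0: purge deferred
}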

(Review comment on src/mds/CDir.cc, outdated and resolved)
@lxbsz (Member) left a comment

LGTM.

@batrick (Member) left a comment

Makes sense to me!

@huww98 (Contributor, Author) commented Jan 6, 2022

For anyone who would like to reproduce this on master:

Console 1

MON=1 OSD=1 MDS=2 MGR=1 ../src/vstart.sh --new --debug
./bin/ceph fs set a max_mds 2
mkdir mnt
sudo mount -t ceph <mon addr from ceph.conf>:/ mnt -o name=admin,secret=<from keyring>

Console 2

cd mnt
mkdir pin
setfattr -n ceph.dir.pin -v 1 pin
mkdir pin/to-be-deleted
cd pin/to-be-deleted
rmdir ../to-be-deleted

Keep console 2 in this state to hold the cap on the deleted dir. Back to console 1:

./bin/ceph daemon mds.a flush journal  # mds.a is rank 1 in my case
./bin/ceph mds fail a:1  # Just to make sure cache is cleared
./bin/ceph fs set a max_mds 1

On the current master (c1c7d3b), this sequence would trigger another crash on rank 0. So, to reproduce the bug fixed by this PR, I need to manually restart rank 0 to let rank 1 proceed.

More about the crash of rank 0: (gdb session)

Thread 19 "ms_dispatch" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffb7fff700 (LWP 1213331)]
MDCache::expire_recursive (this=this@entry=0x7fffc8065920, in=in@entry=0x7fffc8082610, expiremap=std::map with 1 element = {...}) at ../src/mds/MDCache.cc:3747
3747          if (dnl->is_primary()) {
(gdb) bt
#0  MDCache::expire_recursive (this=this@entry=0x7fffc8065920, in=in@entry=0x7fffc8082610, expiremap=std::map with 1 element = {...}) at ../src/mds/MDCache.cc:3747
#1  0x00005555558ee433 in MDCache::trim (this=this@entry=0x7fffc8065920, count=count@entry=0) at ../src/mds/MDCache.cc:6811
#2  0x00005555558f3c24 in MDCache::upkeep_main (this=0x7fffc8065920) at ../src/mds/MDCache.cc:13285
#3  0x00005555559190bd in std::__invoke_impl<void, void (MDCache::*)(), MDCache*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
#4  std::__invoke<void (MDCache::*)(), MDCache*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#5  std::thread::_Invoker<std::tuple<void (MDCache::*)(), MDCache*> >::_M_invoke<0ul, 1ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#6  std::thread::_Invoker<std::tuple<void (MDCache::*)(), MDCache*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#7  std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (MDCache::*)(), MDCache*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#8  0x00007ffff71cdde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff72e2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007ffff6ebb293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p dnl
$1 = (CDentry::linkage_t *) 0x1a8
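
For what it's worth, a dnl value of 0x1a8 is the usual symptom of reaching a member through a null base pointer: the resulting "pointer" is just the member's offset inside the class. A contrived illustration (undefined behaviour, shown only to explain the gdb output; the 0x1a8 offset below is a stand-in, CDentry's real layout differs):

#include <cstdio>

// Contrived illustration only: taking the address of a member through a
// null pointer is undefined behaviour, but in practice it yields a small
// "pointer" equal to the member's offset, matching what gdb printed above.
struct Outer {
  char padding[0x1a8];  // hypothetical offset chosen to match the gdb value
  int member;
};

int main() {
  Outer* p = nullptr;
  int* q = &p->member;  // UB; typically evaluates to (int*)0x1a8
  std::printf("%p\n", static_cast<void*>(q));
  return 0;
}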

Maybe we need another PR to fix this one.

@vshankar (Contributor) commented Jan 6, 2022

One thing seems strange, unless I'm missing something: this bug seems pretty easy to reproduce. We have over 100 such directories and fewer than 30 clients, so why hasn't anybody reported it before?

Perhaps it's the combination of a client holding a cap ref on a deleted inode and reducing max_mds (to 1) that is uncommon.

Anyhow - thank you for the detailed explanation in the tracker. Do you think a test case can be written to verify this?

@huww98 (Contributor, Author) commented Jan 6, 2022

Perhaps it's the combination of a client holding a cap ref on a deleted inode and reducing max_mds (to 1) that is uncommon.

cephadm now reduces max_mds on every upgrade, so I would expect this to become a lot more common.

Anyhow - thank you for the detailed explanation in the tracker. Do you think a test case can be written to verify this?

I'm not familiar with the testing infrastructure, but I think so, given that the steps in my comment above reproduce the crash reliably (I have tried about 5 times). However, we may need to fix the rank 0 segfault first before we can verify this fix automatically.

@vshankar (Contributor) commented Jan 6, 2022

I'm not familiar with the testing infrastructure, but I think so, given that the steps in my comment above reproduce the crash reliably (I have tried about 5 times). However, we may need to fix the rank 0 segfault first before we can verify this fix automatically.

You could start by taking a look at the tests under qa/tasks/cephfs/ - the test_* Python sources. Since there is already a reliable reproducer, it shouldn't be hard to convert it into a test case.

+1 for fixing the other crash, which is necessary to verify this fix.

@huww98 huww98 mentioned this pull request Jan 6, 2022
@huww98 (Contributor, Author) commented Jan 6, 2022

I made PR #44477 to fix the rank 0 crash, but the relevant code has not changed for years. Could you reproduce the same segfault as mine, or is it just because I use gcc from Ubuntu 20.04?

@vshankar (Contributor) commented

I made PR #44477 to fix the rank 0 crash, but the relevant code has not changed for years. Could you reproduce the same segfault as mine, or is it just because I use gcc from Ubuntu 20.04?

@huww98 could you update this PR with a test case? I plan to include this and #44477 in my test branch.

@huww98 (Contributor, Author) commented Jan 18, 2022

@huww98 could you update this PR with a test case?

I will try it today. But I don't have the resources to run the tests to ensure it can reproduce the crash.

@github-actions github-actions bot added the tests label Jan 18, 2022
@vshankar (Contributor) commented

I will try it today. But I don't have the resources to run the tests to ensure it can reproduce the crash.

Sure. I'll include this PR (and the supporting PR #44477) in my test branch.


(Review thread on the new test in qa/tasks/cephfs/test_strays.py:)

self.mount_a.run_shell(["mkdir", "pin"])

self._force_migrate("pin")
Contributor

Test fails with:

2022-02-05T21:48:59.852 INFO:tasks.cephfs_test_runner:test_migrate_unlinked_dir (tasks.cephfs.test_strays.TestStrays) ... ERROR
2022-02-05T21:48:59.853 INFO:tasks.cephfs_test_runner:
2022-02-05T21:48:59.853 INFO:tasks.cephfs_test_runner:======================================================================
2022-02-05T21:48:59.854 INFO:tasks.cephfs_test_runner:ERROR: test_migrate_unlinked_dir (tasks.cephfs.test_strays.TestStrays)
2022-02-05T21:48:59.854 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2022-02-05T21:48:59.855 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2022-02-05T21:48:59.855 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_09fc2e9e470435f1e6dbf462a51853391f4a4cc0/qa/tasks/cephfs/cephfs_test_case.py", line 365, in _wait_subtrees
2022-02-05T21:48:59.855 INFO:tasks.cephfs_test_runner:    while proceed():
2022-02-05T21:48:59.856 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_3094160cc590b786a43d55faaf7c99d5de71ce56/teuthology/contextutil.py", line 133, in __call__
2022-02-05T21:48:59.856 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2022-02-05T21:48:59.856 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: reached maximum tries (15) after waiting for 30 seconds
2022-02-05T21:48:59.856 INFO:tasks.cephfs_test_runner:
2022-02-05T21:48:59.857 INFO:tasks.cephfs_test_runner:The above exception was the direct cause of the following exception:
2022-02-05T21:48:59.857 INFO:tasks.cephfs_test_runner:
2022-02-05T21:48:59.857 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2022-02-05T21:48:59.858 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_09fc2e9e470435f1e6dbf462a51853391f4a4cc0/qa/tasks/cephfs/test_strays.py", line 710, in test_migrate_unlinked_dir
2022-02-05T21:48:59.858 INFO:tasks.cephfs_test_runner:    self._force_migrate("pin")
2022-02-05T21:48:59.858 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_09fc2e9e470435f1e6dbf462a51853391f4a4cc0/qa/tasks/cephfs/test_strays.py", line 609, in _force_migrate
2022-02-05T21:48:59.859 INFO:tasks.cephfs_test_runner:    self._wait_subtrees([(rpath, rank)], rank=rank, path=rpath)
2022-02-05T21:48:59.859 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_09fc2e9e470435f1e6dbf462a51853391f4a4cc0/qa/tasks/cephfs/cephfs_test_case.py", line 378, in _wait_subtrees
2022-02-05T21:48:59.859 INFO:tasks.cephfs_test_runner:    raise RuntimeError("rank {0} failed to reach desired subtree state".format(rank)) from e
2022-02-05T21:48:59.859 INFO:tasks.cephfs_test_runner:RuntimeError: rank 1 failed to reach desired subtree state

A directory only gets migrated to the pinned MDS rank if it's not empty.

Contributor Author

So I just created an empty file in the pin directory. Hopefully it works.

@vshankar (Contributor) left a comment

PTAL at the failed test. It's a straightforward fix.

@vshankar (Contributor) commented

@huww98 ping?

@vshankar (Contributor) left a comment

looks good.


@huww98 (Contributor, Author) commented Mar 21, 2022

The log said RuntimeError: Timed out waiting for MDS daemons to become healthy

Does this mean I managed to reproduce the bug but failed to fix it? I may not have time to dig into the logs soon. Please help if possible.

@vshankar (Contributor) commented

The log said RuntimeError: Timed out waiting for MDS daemons to become healthy

Does this mean I managed to reproduce the bug but failed to fix it? I may not have time to dig into the logs soon. Please help if possible.

OK, no problem - I'll take a look.

@huww98 (Contributor, Author) commented Apr 19, 2022

We are close to the next release. I think we need a warning in the release notes that administrators should clean up any unlinked stray dirs before reducing max_mds (or initiating an automatic upgrade with cephadm).

@vshankar (Contributor) commented

We are close to the next release. I think we need a warning in the release notes that administrators should clean up any unlinked stray dirs before reducing max_mds (or initiating an automatic upgrade with cephadm).

From dev@ceph.io - https://lists.ceph.io/hyperkitty/list/dev@ceph.io/message/CKYDFTLXUJ5ZBY7PXSLPJMZALYDWNWXE/

@vshankar vshankar self-assigned this May 11, 2022
@vshankar (Contributor) commented

@huww98 BTW, I found another issue when migrating strays. At times, when (say) the rank1 mds migrates a stray dentry to rank0 (by sending a rename operation), rank1 might, at a later point, purge the stray dentry itself. rank0 sends discover messages to rank1 to load the dir/dentry/inodes into its cache (as replicas); however, when it tries to discover the stray entry itself (the dir that was deleted), rank1 does not send back the inode of the deleted dir, and rank0 ends up sending discover messages to rank1 endlessly.

@vshankar (Contributor) commented May 25, 2022

@huww98 BTW, I found another issue when migrating strays. At times, when (say) the rank1 mds migrates a stray dentry to rank0 (by sending a rename operation), rank1 might, at a later point, purge the stray dentry itself. rank0 sends discover messages to rank1 to load the dir/dentry/inodes into its cache (as replicas); however, when it tries to discover the stray entry itself (the dir that was deleted), rank1 does not send back the inode of the deleted dir, and rank0 ends up sending discover messages to rank1 endlessly.

@huww98 This PR is blocked for testing/merging due to the issue mentioned above. To unblock this, I think it's fine to remove the check here - https://github.com/ceph/ceph/pull/44335/files#diff-550a84c6df2db675f1eee5834738001a189b4f0939b7013868c9eed284a79417R737

We could add it back once the stray migration issue is fully fixed. Does that sound reasonable? That way, the tests would validate that the MDS does not crash. Could you update this PR so that I can run it through the tests?

@huww98 huww98 changed the base branch from master to main May 25, 2022 06:57
When fetch() is called on an unlinked dir, we set fnode->version = 1, but
leave projected_version = 0. This triggers the assert
`dir->get_projected_version() == dir->get_version()`
in Migrator::encode_export_dir().

projected_version should equal `fnode->version` unless this dir is
projected.

Fix this by introducing a new helper, CDir::set_fresh_fnode(), which
ensures the versions are set consistently.

Fixes: https://tracker.ceph.com/issues/53597
Signed-off-by: 胡玮文 <huww98@outlook.com>
Signed-off-by: 胡玮文 <huww98@outlook.com>
@huww98 (Contributor, Author) commented May 25, 2022

To unblock this, I think it's fine to remove the check here

@vshankar Done

@vshankar (Contributor) left a comment
