mds: skip sr moves when target is an unlinked dir #55768
Conversation
Actually, it's not a typo: https://en.wikipedia.org/wiki/Memoization
While this is a fine enhancement, I got really excited thinking that we would start maintaining hierarchical inodes_with_caps to ease splitting a snaprealm 😃
Would a rename op affect the memoization in any way? That is, moving a dentry outside a newly split SnapRealm.
No, the behavior of |
arrgh! sorry about that
Sorry, I'm not seeing a difference in the question?
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This can print a ludicrous number of lines for large cache sizes.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
It can dominate logs when large splits occur.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This change uses an unordered_map to memoize results of CInode::is_ancestor_of so that subsequent invocations can skip directory inodes which are already known not to be a descendant of the target directory.

In the worst case, this unordered_map can grow to the number of inodes in memory, when all inodes are directories and at least one client has a cap for each inode. In general, however, this will not be the case. Each entry in the map holds a 64-bit pointer and a bool. The total size will vary across platforms, but with a conservative estimate of 192 bits of overhead per entry (including the entry's linked-list pointer in the bucket), the map will grow to ~24MB per 1M inodes.

The result of this change is not eye-popping, but it does provide a significant performance advantage. For an unpatched MDS with 1M inodes with caps in the global snaprealm (with the debugging commits preceding this one):

2024-02-27T01:08:53.247+0000 7f4be40ec700 2 mds.0.cache Memory usage: total 6037860, rss 5710800, heap 215392, baseline 199008, 1000251 / 1000323 inodes have caps, 1000251 caps, 0.999928 caps per inode
...
2024-02-27T01:08:54.000+0000 7f4be18e7700 10 mds.0.cache.snaprealm(0x1 seq 3 0x55feaf85ad80) split_at: snaprealm(0x1000000043b seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55feb986b200) on [inode 0x1000000043b [...4,head] ~mds0/stray8/1000000043b/ auth v152 pv153 ap=3 snaprealm=0x55feb986b200 f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (ifile excl) (iversion lock w=1 last_client=4361) caps={4361=pAsXsFs/-@6},l=4361 | request=1 lock=3 caps=1 authpin=1 0x56000423d180]
2024-02-27T01:08:54.649+0000 7f4be18e7700 10 mds.0.cache.ino(0x1000000043b) move_to_realm joining realm snaprealm(0x1000000043b seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55feb986b200), leaving realm snaprealm(0x1 seq 3 lc 3 cr 3 cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:06:29.440802+0000),3=snap(3 0x1 'two' 2024-02-27T01:06:43.209349+0000)} last_modified 2024-02-27T01:06:43.209349+0000 change_attr 2 0x55feaf85ad80)
2024-02-27T01:08:54.750+0000 7f4be18e7700 10 mds.0.cache.snaprealm(0x1 seq 3 0x55feaf85ad80) split_at: split 1 inodes

so around 750ms to check all inodes_with_caps (1M) in the global snaprealm. This result was fairly consistent across multiple tries.
For a 100k split:

2024-02-27T04:12:27.548+0000 7f2da9dbe700 10 mds.0.cache.ino(0x1000000000f) open_snaprealm snaprealm(0x1000000000f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x563553c92900) parent is snaprealm(0x1 seq 2 lc 2 cr 2 cps 1 snaps={2=snap(2 0x1 '1' 2024-02-27T04:12:13.803030+0000)} last_modified 2024-02-27T04:12:13.803030+0000 change_attr 1 0x563553abed80)
2024-02-27T04:12:27.548+0000 7f2da9dbe700 10 mds.0.cache.snaprealm(0x1 seq 2 0x563553abed80) split_at: snaprealm(0x1000000000f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x563553c92900) on [inode 0x1000000000f [...3,head] /tmp.K9bdjohIVa/ auth v10972 ap=2 snaprealm=0x563553c92900 f(v0 m2024-02-27T04:03:37.953918+0000 1=0+1) n(v106 rc2024-02-27T04:12:27.544141+0000 rs1 99755=0+99755) old_inodes=1 (isnap xlock x=1 by 0x5636a6372900) (inest lock dirty) (ifile excl) (iversion lock w=1 last_client=20707) caps={20707=pAsLsXsFsx/AsLsXsFsx@8},l=20707 | dirtyscattered=1 request=1 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x563553cfd180]
2024-02-27T04:12:28.886+0000 7f2da9dbe700 10 mds.0.cache.snaprealm(0x1 seq 2 0x563553abed80) split_at: split 100031 inodes

or about 1,338ms. This caused a split of 100k inodes. It takes more time because directories are actually moved to the snaprealm, with a lot of list twiddling for caps.

With this patch, we bring that down. For 1 split:

2024-02-27T02:09:48.549+0000 7ff854ad4700 2 mds.0.cache Memory usage: total 5859852, rss 4290012, heap 231776, baseline 190816, 1000312 / 1000327 inodes have caps, 1000312 caps, 0.999985 caps per inode
...
2024-02-27T02:09:48.550+0000 7ff8522cf700 10 mds.0.cache.ino(0x100000f456f) open_snaprealm snaprealm(0x100000f456f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e2b4fab40) parent is snaprealm(0x1 seq 9 lc 9 cr 9 cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:34:36.001053+0000),3=snap(3 0x1 'two' 2024-02-27T01:34:48.623349+0000),6=snap(6 0x1 'six' 2024-02-27T02:03:51.619896+0000),7=snap(7 0x1 'seven' 2024-02-27T02:04:28.375336+0000),8=snap(8 0x1 '1' 2024-02-27T02:06:14.170884+0000),9=snap(9 0x1 '2' 2024-02-27T02:09:47.158624+0000)} last_modified 2024-02-27T02:09:47.158624+0000 change_attr 6 0x559dfd4ad8c0)
2024-02-27T02:09:48.550+0000 7ff8522cf700 10 mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0) split_at: snaprealm(0x100000f456f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e2b4fab40) on [inode 0x100000f456f [...a,head] ~mds0/stray2/100000f456f/ auth v1164 pv1165 ap=3 snaprealm=0x559e2b4fab40 DIRTYPARENT f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/AsLsXsFsx@6},l=4365 | request=1 lock=3 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x559e8a8bd600]
2024-02-27T02:09:48.550+0000 7ff8522cf700 10 mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0) open_children are 0x559dfd4add40,0x559e1cca1d40
2024-02-27T02:09:48.919+0000 7ff8522cf700 10 mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0) split_at: split 1 inodes

or about 370ms. This was also fairly consistent across multiple tries.
For a 100k split:

2024-02-27T01:52:24.500+0000 7ff8522cf700 10 mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0) split_at: snaprealm(0x10000000013 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e1cca1d40) on [inode 0x10000000013 [...5,head] /tmp.RIUAaU5wuE/ auth v10499 ap=2 snaprealm=0x559e1cca1d40 f(v0 m2024-02-27T01:16:04.611198+0000 1=0+1) n(v122 rc2024-02-27T01:52:24.495465+0000 rs1 100031=0+100031) old_inodes=1 (isnap xlock x=1 by 0x559ef038a880) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/-@11},l=4365 | dirtyscattered=0 request=1 lock=2 dirfrag=1 caps=1 dirty=1 waiter=0 authpin=1 0x559e0238c580]
2024-02-27T01:52:24.500+0000 7ff8522cf700 10 mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0) open_children are 0x559dfd4add40
2024-02-27T01:52:25.338+0000 7ff8522cf700 10 mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0) split_at: split 100031 inodes

or about 840ms. Causing the 100k split can be easily done by making a directory in one of the trees created (see the reproducer below).

Reproducing can be done with:

for ((i = 0; i < 10; i++)); do
    (
        pushd "$(mktemp -d -p .)"
        for ((j = 0; j < 30; ++j)); do mkdir "$j"; pushd "$j"; done
        for ((j = 0; j < 10; ++j)); do
            for ((k = 0; k < 10000; ++k)); do mkdir "$j.$k"; done &
        done
    ) &
done

to make 1M directories. We put the majority of directories in a 30-deep nesting to exercise CInode::is_ancestor_of with a worst-case-style scenario. Make sure all debugging configs are disabled for the MDS/clients and that the client has a cache size that can accommodate 1M caps. Make at least one snapshot:

mkdir .snap/one

Then reproduction can be done with:

$ mkdir tmp.qQNsTpxpvh/dir; mkdir .snap/$((++i)); rmdir tmp.qQNsTpxpvh/dir

It is not necessary to delete any snapshots to reproduce this behavior. It's only necessary to have a lot of inodes_with_caps in a snaprealm and effect a split.

Fixes: https://tracker.ceph.com/issues/53192
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
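The memoization described in this commit message can be sketched in simplified form as below. The `Node` type and the function shape here are illustrative assumptions, not the actual `CInode::is_ancestor_of` signature; the idea is that every inode visited on a walk toward the root records the verdict, so later checks against the same split target stop as soon as they reach a node already decided. Note the map is only valid for one `ancestor` target, matching its per-split scope described above.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a cached inode: just a parent pointer.
struct Node {
    const Node* parent = nullptr;
};

// Is `ancestor` an ancestor of `in`? `visited` memoizes verdicts for every
// node walked, so repeated queries against the same `ancestor` (as during one
// snaprealm split) skip chains that were already classified.
bool is_ancestor_of(const Node* ancestor, const Node* in,
                    std::unordered_map<const Node*, bool>& visited) {
    std::vector<const Node*> trace;  // nodes whose verdict this walk decides
    bool found = false;
    for (const Node* p = in; p != nullptr; p = p->parent) {
        if (p == ancestor) { found = true; break; }
        auto it = visited.find(p);
        if (it != visited.end()) { found = it->second; break; }  // memo hit
        trace.push_back(p);
    }
    for (const Node* p : trace)
        visited[p] = found;  // record so the next query ends here
    return found;
}
```

With this, checking 1M cap-holding inodes against one split target degrades from repeatedly re-walking deep parent chains to visiting each inode roughly once, at the cost of the map footprint estimated above.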
A directory in the stray directory cannot have any HEAD inodes with caps, so there is no need to move anything to the snaprealm opened for the unlinked directory. Following the parent commit's reproducer, the behavior now looks, as expected, like:

2024-02-27T02:26:59.049+0000 7f5b095f3700 10 mds.0.cache.ino(0x100000f4575) open_snaprealm snaprealm(0x100000f4575 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x5632a57f9680) parent is snaprealm(0x1 seq e lc e cr e cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:34:36.001053+0000),3=snap(3 0x1 'two' 2024-02-27T01:34:48.623349+0000),6=snap(6 0x1 'six' 2024-02-27T02:03:51.619896+0000),7=snap(7 0x1 'seven' 2024-02-27T02:04:28.375336+0000),8=snap(8 0x1 '1' 2024-02-27T02:06:14.170884+0000),9=snap(9 0x1 '2' 2024-02-27T02:09:47.158624+0000),a=snap(a 0x1 '3' 2024-02-27T02:18:24.666934+0000),b=snap(b 0x1 '4' 2024-02-27T02:18:38.268874+0000),c=snap(c 0x1 '5' 2024-02-27T02:23:13.183995+0000),d=snap(d 0x1 '6' 2024-02-27T02:25:25.593014+0000),e=snap(e 0x1 '7' 2024-02-27T02:26:55.184945+0000)} last_modified 2024-02-27T02:26:55.184945+0000 change_attr 11 0x5632861c5680)
2024-02-27T02:26:59.049+0000 7f5b095f3700 10 mds.0.cache.snaprealm(0x1 seq 14 0x5632861c5680) split_at: snaprealm(0x100000f4575 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x5632a57f9680) on [inode 0x100000f4575 [...f,head] ~mds0/stray0/100000f4575/ auth v1199 pv1200 ap=3 snaprealm=0x5632a57f9680 DIRTYPARENT f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/AsLsXsFsx@6},l=4365 | request=1 lock=3 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563385e94000]
2024-02-27T02:26:59.049+0000 7f5b095f3700 10 mds.0.cache.snaprealm(0x1 seq 14 0x5632861c5680) moving unlinked directory inode

Discussions with Dan van der Ster led to the creation of this patch.
Fixes: https://tracker.ceph.com/issues/53192
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Dan van der Ster <dan.vanderster@clyso.com>
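The short-circuit this commit describes can be sketched as below. The `Inode` type and `split_candidates` function are hypothetical simplifications (the real logic lives around `SnapRealm::split_at`); the point is the guard: when the split target is an unlinked directory, no inode holding HEAD caps can be its descendant, so the scan over `inodes_with_caps` is skipped entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a cached inode.
struct Inode {
    bool unlinked = false;  // set once the inode's dentry lives under a stray dir
};

// Illustrative version of the split scan: how many cap-holding inodes must be
// examined as candidates to move into the new realm. An unlinked target means
// zero candidates, so the split cost no longer scales with inodes_with_caps.
std::size_t split_candidates(const Inode& target,
                             const std::vector<Inode*>& inodes_with_caps) {
    if (target.unlinked)
        return 0;  // the short-circuit: nothing under a stray dir holds HEAD caps
    // real code would test each inode for descent from `target` and move matches
    return inodes_with_caps.size();
}
```

This matches the log above: the split on a `~mds0/stray0/...` inode now logs "moving unlinked directory inode" and returns without iterating the realm's cap-holding inodes.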
(Edited some commit messages as part of a trivial rebase.)
jenkins test api
I was expecting the memo variable ( |
Will run this through the fs suite and report.
https://pulpito.ceph.com/?branch=wip-vshankar-testing-20240307.022905

Will review the run next week.
Merge is blocked on a failure due to a bug in the testing kernel.
I'm seeing MDS crashes in the reruns, and I'm not sure whether they're related to this change or to other PRs in the test batch. I've dropped a change that seems likely to be causing them and am rerunning tests.
I'll run this with a group of PRs I'm doing for fun.
* refs/pull/55768/head:
	mds: skip sr moves when target is an unlinked dir
	mds: memoize descendent results during realm splits
	mds: reduce move_to_realm verbosity
	mds: indicate when split_at is complete for analysis
	mds: increase debug lvl for inode listing at sr split
	mds: refactor debug print of func name
	mds: skip print of empty_children if empty
@vshankar needs approval + merge
I had added this to my test branch too. LGTM.