
mds: skip sr moves when target is an unlinked dir #55768

Merged: 7 commits, Apr 3, 2024
Conversation

@batrick (Member) commented Feb 27, 2024


@lxbsz (Member) commented Feb 27, 2024

s/memoize/memorize/ ?

@dvanders (Contributor) commented:

> s/memoize/memorize/ ?

Actually, it's not a typo: https://en.wikipedia.org/wiki/Memoization

@vshankar requested a review from a team on February 27, 2024
@vshankar (Contributor) left a comment:

While this is a fine enhancement, I got really excited thinking that we would start maintaining hierarchical inodes_with_caps to ease splitting a snaprealm 😃

Review comment on src/mds/SnapRealm.cc (outdated; resolved)
@mchangir (Contributor) left a comment:

Would a rename op affect the memoization in any way? I.e., moving a dentry outside a newly split SnapRealm.

@batrick (Member, author) commented Feb 27, 2024

> Would a rename op affect the memoization in any way? I.e., moving a dentry outside a newly split SnapRealm.

No, the behavior of CInode::is_ancestor_of is not changing due to the memoization.

@mchangir (Contributor) commented:

> Would a rename op affect the memoization in any way? I.e., moving a dentry outside a newly split SnapRealm.
>
> No, the behavior of CInode::is_ancestor_of is not changing due to the memoization.

Arrgh, sorry about that. I meant to ask, "Would a rename op affect the use of memoization in any way?"

@batrick (Member, author) commented Feb 27, 2024

> Would a rename op affect the memoization in any way? I.e., moving a dentry outside a newly split SnapRealm.
>
> No, the behavior of CInode::is_ancestor_of is not changing due to the memoization.
>
> Arrgh, sorry about that. I meant to ask, "Would a rename op affect the use of memoization in any way?"

Sorry, I'm not seeing a difference in the question?

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This can print a ludicrous number of lines for large cache sizes.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
It can dominate logs when large splits occur.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This change uses an unordered_map to memoize results of CInode::is_ancestor_of
so that subsequent invocations can skip directory inodes which are already
known not to be a descendant of the target directory.

In the worst case, this unordered_map can grow to the number of inodes in
memory when all inodes are directories and at least one client has a cap for
each inode. However, in general this will not be the case. The size of each
entry in the map will be a 64-bit pointer and bool. The total size will vary
across platforms but we can say that with a conservative estimate of 192 bits /
entry overhead (including the entry linked list pointer in the bucket), the map
will grow to ~24MB / 1M inodes.
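
A minimal, self-contained sketch of this memoization shape (the simplified Inode type and free function here are illustrative assumptions, not the actual CInode code):

    // Sketch only: memoize is_ancestor_of results across many queries against
    // the same target, as during a snaprealm split. Not Ceph's actual code.
    #include <unordered_map>
    #include <vector>

    struct Inode {
      Inode* parent = nullptr;  // nullptr at the root
    };

    // `visited` caches, per inode, whether `ancestor` is an ancestor of it.
    // Each entry is one pointer key plus a bool, matching the size estimate
    // above.
    bool is_ancestor_of(const Inode* ancestor, const Inode* inode,
                        std::unordered_map<const Inode*, bool>& visited) {
      std::vector<const Inode*> path;  // inodes walked by this invocation
      bool result = false;
      for (; inode; inode = inode->parent) {
        if (inode == ancestor) { result = true; break; }
        auto it = visited.find(inode);
        if (it != visited.end()) { result = it->second; break; }  // cache hit
        path.push_back(inode);
      }
      // Every inode walked before the stopping point shares the same answer,
      // so cache it; later invocations skip these parent chains entirely.
      for (const Inode* p : path) visited[p] = result;
      return result;
    }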

The result of this change is not eye-popping, but it does provide a significant performance advantage.

For an unpatched MDS with 1M inodes with caps in the global snaprealm (with debugging commits preceding this one):

    2024-02-27T01:08:53.247+0000 7f4be40ec700  2 mds.0.cache Memory usage:  total 6037860, rss 5710800, heap 215392, baseline 199008, 1000251 / 1000323 inodes have caps, 1000251 caps, 0.999928 caps per inode
    ...
    2024-02-27T01:08:54.000+0000 7f4be18e7700 10  mds.0.cache.snaprealm(0x1 seq 3 0x55feaf85ad80) split_at: snaprealm(0x1000000043b seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55feb986b200) on [inode 0x1000000043b [...4,head] ~mds0/stray8/1000000043b/ auth v152 pv153 ap=3 snaprealm=0x55feb986b200 f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (ifile excl) (iversion lock w=1 last_client=4361) caps={4361=pAsXsFs/-@6},l=4361 | request=1 lock=3 caps=1 authpin=1 0x56000423d180]
    2024-02-27T01:08:54.649+0000 7f4be18e7700 10 mds.0.cache.ino(0x1000000043b) move_to_realm joining realm snaprealm(0x1000000043b seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55feb986b200), leaving realm snaprealm(0x1 seq 3 lc 3 cr 3 cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:06:29.440802+0000),3=snap(3 0x1 'two' 2024-02-27T01:06:43.209349+0000)} last_modified 2024-02-27T01:06:43.209349+0000 change_attr 2 0x55feaf85ad80)
    2024-02-27T01:08:54.750+0000 7f4be18e7700 10  mds.0.cache.snaprealm(0x1 seq 3 0x55feaf85ad80) split_at: split 1 inodes

so around 750ms to check all inodes_with_caps (1M) in the global snaprealm. This result was fairly consistent across multiple tries.

For a 100k split:

    2024-02-27T04:12:27.548+0000 7f2da9dbe700 10 mds.0.cache.ino(0x1000000000f) open_snaprealm snaprealm(0x1000000000f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x563553c92900) parent is snaprealm(0x1 seq 2 lc 2 cr 2 cps 1 snaps={2=snap(2 0x1 '1' 2024-02-27T04:12:13.803030+0000)} last_modified 2024-02-27T04:12:13.803030+0000 change_attr 1 0x563553abed80)
    2024-02-27T04:12:27.548+0000 7f2da9dbe700 10  mds.0.cache.snaprealm(0x1 seq 2 0x563553abed80) split_at: snaprealm(0x1000000000f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x563553c92900) on [inode 0x1000000000f [...3,head] /tmp.K9bdjohIVa/ auth v10972 ap=2 snaprealm=0x563553c92900 f(v0 m2024-02-27T04:03:37.953918+0000 1=0+1) n(v106 rc2024-02-27T04:12:27.544141+0000 rs1 99755=0+99755) old_inodes=1 (isnap xlock x=1 by 0x5636a6372900) (inest lock dirty) (ifile excl) (iversion lock w=1 last_client=20707) caps={20707=pAsLsXsFsx/AsLsXsFsx@8},l=20707 | dirtyscattered=1 request=1 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x563553cfd180]
    2024-02-27T04:12:28.886+0000 7f2da9dbe700 10  mds.0.cache.snaprealm(0x1 seq 2 0x563553abed80) split_at: split 100031 inodes

or about 1,338ms for a split of 100k inodes. This takes more time because
directories are actually moved to the new snaprealm, with a lot of list
twiddling for caps.

With this patch, we bring that down, for 1 split:

    2024-02-27T02:09:48.549+0000 7ff854ad4700  2 mds.0.cache Memory usage:  total 5859852, rss 4290012, heap 231776, baseline 190816, 1000312 / 1000327 inodes have caps, 1000312 caps, 0.999985 caps per inode
    ...
    2024-02-27T02:09:48.550+0000 7ff8522cf700 10 mds.0.cache.ino(0x100000f456f) open_snaprealm snaprealm(0x100000f456f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e2b4fab40) parent is snaprealm(0x1 seq 9 lc 9 cr 9 cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:34:36.001053+0000),3=snap(3 0x1 'two' 2024-02-27T01:34:48.623349+0000),6=snap(6 0x1 'six' 2024-02-27T02:03:51.619896+0000),7=snap(7 0x1 'seven' 2024-02-27T02:04:28.375336+0000),8=snap(8 0x1 '1' 2024-02-27T02:06:14.170884+0000),9=snap(9 0x1 '2' 2024-02-27T02:09:47.158624+0000)} last_modified 2024-02-27T02:09:47.158624+0000 change_attr 6 0x559dfd4ad8c0)
    2024-02-27T02:09:48.550+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0) split_at: snaprealm(0x100000f456f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e2b4fab40) on [inode 0x100000f456f [...a,head] ~mds0/stray2/100000f456f/ auth v1164 pv1165 ap=3 snaprealm=0x559e2b4fab40 DIRTYPARENT f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/AsLsXsFsx@6},l=4365 | request=1 lock=3 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x559e8a8bd600]
    2024-02-27T02:09:48.550+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0)  open_children are 0x559dfd4add40,0x559e1cca1d40
    2024-02-27T02:09:48.919+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0) split_at: split 1 inodes

or about 370ms. This was also fairly consistent across multiple tries.

For a 100k split:

    2024-02-27T01:52:24.500+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0) split_at: snaprealm(0x10000000013 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e1cca1d40) on [inode 0x10000000013 [...5,head] /tmp.RIUAaU5wuE/ auth v10499 ap=2 snaprealm=0x559e1cca1d40 f(v0 m2024-02-27T01:16:04.611198+0000 1=0+1) n(v122 rc2024-02-27T01:52:24.495465+0000 rs1 100031=0+100031) old_inodes=1 (isnap xlock x=1 by 0x559ef038a880) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/-@11},l=4365 | dirtyscattered=0 request=1 lock=2 dirfrag=1 caps=1 dirty=1 waiter=0 authpin=1 0x559e0238c580]
    2024-02-27T01:52:24.500+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0)  open_children are 0x559dfd4add40
    2024-02-27T01:52:25.338+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0) split_at: split 100031 inodes

or about 840ms. The split can be triggered easily by making a directory in one
of the trees created (see reproducer below).

Reproducing can be done with:

    for ((i =0; i < 10; i++)); do (pushd $(mktemp -d -p . ); for ((j = 0; j < 30; ++j)); do mkdir "$j"; pushd "$j"; done; for ((j = 0; j < 10; ++j)); do for ((k = 0; k < 10000; ++k)); do mkdir $j.$k; done & done) & done

to make 1M directories. We put the majority of directories in a 30-deep nesting
to exercise CInode::is_ancestor_of with a near-worst-case scenario.

Make sure all debugging configs are disabled for the MDS/clients. Make sure the
client has a cache size large enough to accommodate 1M caps. Make at least one snapshot:

    mkdir .snap/one

Then reproduction can be done with:

    $ mkdir tmp.qQNsTpxpvh/dir; mkdir .snap/$((++i)); rmdir tmp.qQNsTpxpvh/dir

It is not necessary to delete any snapshots to reproduce this behavior. It's
only necessary to have a lot of inodes_with_caps in a snaprealm and effect a
split.

Fixes: https://tracker.ceph.com/issues/53192
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
A directory in the stray directory cannot have any HEAD inodes with caps, so
there is no need to move anything to the snaprealm opened for the unlinked
directory.
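
Conceptually, the fast path looks like the following self-contained sketch (the Inode and Realm types and the `unlinked` flag are illustrative assumptions, not the actual SnapRealm::split_at code):

    // Sketch of the skip described above: an unlinked (stray) directory has
    // no HEAD descendants holding caps, so the scan of the parent realm's
    // inodes_with_caps can be skipped entirely. Not Ceph's actual code.
    #include <vector>

    struct Realm;

    struct Inode {
      Inode* parent = nullptr;
      bool is_dir = false;
      bool unlinked = false;   // lives under the MDS stray directory
      Realm* realm = nullptr;

      bool is_ancestor_of(const Inode* other) const {
        for (const Inode* p = other; p; p = p->parent)
          if (p == this) return true;
        return false;
      }
    };

    struct Realm {
      std::vector<Inode*> inodes_with_caps;  // HEAD inodes with client caps
    };

    // Split the `child` realm out of `parent` at inode `in`.
    void split_at(Realm* parent, Realm* child, Inode* in) {
      if (in->is_dir && in->unlinked) {
        in->realm = child;  // "moving unlinked directory inode"
        return;             // nothing under it can need moving
      }
      for (Inode* p : parent->inodes_with_caps)
        if (in->is_ancestor_of(p))
          p->realm = child;  // move_to_realm in the real code
    }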

Following the parent commit's reproducer, the behavior now looks as expected:

    2024-02-27T02:26:59.049+0000 7f5b095f3700 10 mds.0.cache.ino(0x100000f4575) open_snaprealm snaprealm(0x100000f4575 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x5632a57f9680) parent is snaprealm(0x1 seq e lc e cr e cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:34:36.001053+0000),3=snap(3 0x1 'two' 2024-02-27T01:34:48.623349+0000),6=snap(6 0x1 'six' 2024-02-27T02:03:51.619896+0000),7=snap(7 0x1 'seven' 2024-02-27T02:04:28.375336+0000),8=snap(8 0x1 '1' 2024-02-27T02:06:14.170884+0000),9=snap(9 0x1 '2' 2024-02-27T02:09:47.158624+0000),a=snap(a 0x1 '3' 2024-02-27T02:18:24.666934+0000),b=snap(b 0x1 '4' 2024-02-27T02:18:38.268874+0000),c=snap(c 0x1 '5' 2024-02-27T02:23:13.183995+0000),d=snap(d 0x1 '6' 2024-02-27T02:25:25.593014+0000),e=snap(e 0x1 '7' 2024-02-27T02:26:55.184945+0000)} last_modified 2024-02-27T02:26:55.184945+0000 change_attr 11 0x5632861c5680)
    2024-02-27T02:26:59.049+0000 7f5b095f3700 10  mds.0.cache.snaprealm(0x1 seq 14 0x5632861c5680) split_at: snaprealm(0x100000f4575 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x5632a57f9680) on [inode 0x100000f4575 [...f,head] ~mds0/stray0/100000f4575/ auth v1199 pv1200 ap=3 snaprealm=0x5632a57f9680 DIRTYPARENT f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/AsLsXsFsx@6},l=4365 | request=1 lock=3 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563385e94000]
    2024-02-27T02:26:59.049+0000 7f5b095f3700 10  mds.0.cache.snaprealm(0x1 seq 14 0x5632861c5680)  moving unlinked directory inode

Discussions with Dan van der Ster led to the creation of this patch.

Fixes: https://tracker.ceph.com/issues/53192
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Dan van der Ster <dan.vanderster@clyso.com>
@batrick (Member, author) commented Feb 27, 2024

(Edited some commit messages as part of a trivial rebase.)

@batrick (Member, author) commented Feb 29, 2024

2024-02-27 19:22:58,521.521 INFO:__main__:adjust-ulimits path is /tmp/tmp.XsyZQg66PS/venv/bin/adjust-ulimits
2024-02-27 19:22:58,521.521 INFO:__main__:daemon-helper path is /tmp/tmp.XsyZQg66PS/venv/bin/daemon-helper
2024-02-27 19:22:58,521.521 INFO:__main__:stdin-killer path is /tmp/tmp.XsyZQg66PS/venv/bin/stdin-killer
2024-02-27 19:22:58,521.521 INFO:__main__:> ip netns list
2024-02-27 19:22:58,525.525 INFO:__main__:> sudo ip link delete ceph-brx
Cannot find device "ceph-brx"
2024-02-27 19:22:58,541.541 INFO:__main__:> ps -u1111
2024-02-27 19:22:58,558.558 INFO:__main__:Creating cluster with 1 MDS daemons
2024-02-27 19:22:58,558.558 INFO:__main__:
tearing down the cluster...
2024-02-27 19:22:58,558.558 INFO:__main__:> ../src/stop.sh
unable to get monitor info from DNS SRV with service name: ceph-mon
2024-02-27T19:23:01.650+0000 7effc9ea1640 -1 failed for service _ceph-mon._tcp
2024-02-27T19:23:01.650+0000 7effc9ea1640 -1 monclient: get_monmap_and_config cannot identify monitors to contact

@batrick (Member, author) commented Feb 29, 2024

jenkins test api

@mchangir (Contributor) commented Mar 4, 2024

> Would a rename op affect the memoization in any way? I.e., moving a dentry outside a newly split SnapRealm.
>
> No, the behavior of CInode::is_ancestor_of is not changing due to the memoization.
>
> Arrgh, sorry about that. I meant to ask, "Would a rename op affect the use of memoization in any way?"
>
> Sorry, I'm not seeing a difference in the question?

I was expecting the memo variable (visited) to be part of the SnapRealm class before I looked at the commits, but that would need too many changes to keep the visited data structure up to date.

@vshankar (Contributor) commented Mar 6, 2024

Will run this through the fs suite and report.

@vshankar (Contributor) commented Mar 8, 2024

https://pulpito.ceph.com/?branch=wip-vshankar-testing-20240307.022905

Will review the run next week.

@vshankar (Contributor) commented:

> https://pulpito.ceph.com/?branch=wip-vshankar-testing-20240307.022905
>
> Will review the run next week.

Merge blocked on a failure due to a bug in the testing kernel.

@vshankar (Contributor) commented:

> > https://pulpito.ceph.com/?branch=wip-vshankar-testing-20240307.022905
> >
> > Will review the run next week.
>
> Merge blocked on a failure due to a bug in the testing kernel.

I'm seeing MDS crashes in the reruns; I'm not sure whether they're related to this change or to other PRs in the test batch. I've dropped a change that seems likely to be causing them and am rerunning tests.

@batrick (Member, author) commented Mar 28, 2024

I'll run this with a group of PRs I'm doing for fun.

batrick added commits to batrick/ceph that referenced this pull request on Mar 28, Mar 29, Mar 31, and Apr 2, 2024, each with the same list:
* refs/pull/55768/head:
	mds: skip sr moves when target is an unlinked dir
	mds: memoize descendent results during realm splits
	mds: reduce move_to_realm verbosity
	mds: indicate when split_at is complete for analysis
	mds: increase debug lvl for inode listing at sr split
	mds: refactor debug print of func name
	mds: skip print of empty_children if empty
@batrick (Member, author) commented Apr 2, 2024

@vshankar needs approval + merge

@vshankar (Contributor) left a comment:

I had added this to my test branch too. LGTM.
