
client: expose Client::ll_register_callback via libcephfs #34596

Merged
7 commits merged into ceph:master from wip-45114 on May 18, 2020

Conversation

jtlayton
Contributor

We need to wire up the invalidation callbacks so that ganesha can do a better job of keeping its caches under control. This is the first step in that process, as the current callback interface isn't suitable for C.

Fixes: https://tracker.ceph.com/issues/45114
Signed-off-by: Jeff Layton jlayton@redhat.com
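For reference, a minimal sketch of what the C-friendly registration looks like from the application side. The entry point and header names come from this PR's commits (ceph_ll_register_callbacks, ceph_ll_client.h); the struct fields and types shown are illustrative, not a verbatim copy of the header:

    #include <cephfs/libcephfs.h>
    #include <cephfs/ceph_ll_client.h>

    /* application-side dentry invalidation handler: evict the cached
     * dentry for (dirino, name). The name arrives as pointer + length,
     * so this is callable from plain C. */
    static void my_dentry_invalidate(void *handle, vinodeno_t dirino,
                                     vinodeno_t ino, const char *name,
                                     size_t len)
    {
        /* drop (dirino, name) from the application's dentry cache */
    }

    static void register_cbs(struct ceph_mount_info *cmount, void *priv)
    {
        struct ceph_client_callback_args args = {0};

        args.handle = priv;                    /* passed back to every callback */
        args.dentry_cb = my_dentry_invalidate;
        ceph_ll_register_callbacks(cmount, &args);
    }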

@jtlayton jtlayton added the cephfs (Ceph File System) and DNM labels Apr 16, 2020
@jtlayton
Contributor Author

DNM for now, as I need to write the corresponding code for ganesha, and I don't want to merge this until it's working end-to-end.

@jtlayton jtlayton force-pushed the wip-45114 branch 2 times, most recently from 7b3780e to 42efe72 Compare April 17, 2020 14:49
@jtlayton
Contributor Author

Also added a new ino_release callback. The intent is to advise the application that we would like it to release its references to a given inode if possible. Ganesha can then look the inode up in its cache and drop it if it's idle.

The fact that we have to queue these up makes the math a bit fuzzy, IMO, but I don't see a reasonable alternative.
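For illustration, the application side of that contract might look like this (all names below are hypothetical; it's just the shape a ganesha handler would take):

    /* hypothetical handler: libcephfs asks us to release one inode */
    static void my_ino_release(void *handle, vinodeno_t ino)
    {
        struct app_cache *cache = (struct app_cache *)handle;
        struct app_entry *entry = cache_lookup(cache, ino); /* hypothetical */

        if (!entry)
            return;                   /* we don't hold it; nothing to do */
        if (entry_is_idle(entry))     /* no active opens or other users */
            cache_drop(cache, entry); /* puts the libcephfs inode ref */
    }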

@gregsfortytwo
Member

Seems fine in concept. I am a bit worried about only being able to request release on one inode at a time, though; no idea how expensive the overhead turns out to be, but an individual callback repeated 10k+ times sounds expensive. We probably also want either to just pass through a generic "please release whatever inodes you can" request, or else at least be able to specify more than one inode at a time?

Or perhaps the overhead isn't really high enough to worry about that and I'm blowing smoke; we aren't actually crossing the kernel boundary here...

@jtlayton
Contributor Author

I think this is probably the best we can do.

I intend to make the release callback in ganesha synchronous, so basically we'll queue this to the finisher, and it'll call into ganesha and put the reference. Ideally we'd just do all of this without having to queue up work, but the fact that the client_lock is held while calling this makes it a bit hairy to do so.

@gregsfortytwo
Member

> I intend to make the release callback in ganesha synchronous, so basically we'll queue this to the finisher, and it'll call into ganesha and put the reference. Ideally we'd just do all of this without having to queue up work, but the fact that the client_lock is held while calling this makes it a bit hairy to do so.

Well, if that's going to be an issue and we end up having to do queues, I'm quite sure we'll want to batch up inode releases. We could presumably pass around an array (perhaps of fixed size) of requested inodes to free about as easily as a single ino number?
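Something along these lines, say (purely hypothetical; nothing like this is in the PR):

    /* hypothetical batched variant: one upcall covers up to 'count'
     * inodes instead of queueing one finisher job per inode */
    typedef void (*client_ino_release_many_t)(void *handle,
                                              vinodeno_t *inos,
                                              size_t count);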

@jtlayton
Contributor Author

That's a possibility, though I'm not sure batching them really provides any benefit here.

Alternately, we could just batch them up and do all of the individual "upcalls" outside the client_lock. That wouldn't cost that much, and it may allow the results to be more immediate. For now, I'll stick with it as-is, just to get a proof-of-concept patch going. Once I have something working we can argue over the details.

@jtlayton jtlayton force-pushed the wip-45114 branch 2 times, most recently from da03047 to d8f2dbd Compare April 22, 2020 21:22
@jtlayton jtlayton requested a review from lxbsz April 22, 2020 21:22
@jtlayton
Contributor Author

Was able to successfully test this today and verify that it was sending release request callbacks to ganesha. The ganesha piece still needs some work, but this piece is more or less working.

For now, I'm still not batching these up. I don't quite see the point: is there extra overhead for multiple queued ino_release_cb jobs that would make it worthwhile? The gain isn't clear, and batching would mean queueing them up manually on a list or something in trim_caps and then handing that to the finisher job.

@jtlayton jtlayton removed the DNM label Apr 22, 2020
@jtlayton jtlayton requested review from a team and removed request for batrick, gregsfortytwo, ukernel and lxbsz April 22, 2020 21:29
@ukernel
Contributor

ukernel commented Apr 23, 2020

LGTM. I agree with Greg: we'd better batch these up. The new inode release upcall may be used by ceph-fuse in the future, and the overhead of one upcall per inode is too high.

@jtlayton
Contributor Author

Where does the overhead come from, though? Is it from the fact that we're queueing this to the finisher, or is it the (hypothetical) context switch from calling into the kernel to release the inode?

If it's the former, then batching may help. If it's the latter, then probably not -- I doubt libfuse has a batched interface for something like this. We'd just end up batching them up in one thread and then dispatching them individually in the finisher.

@ukernel
Contributor

ukernel commented Apr 23, 2020

> Where does the overhead come from, though? Is it from the fact that we're queueing this to the finisher, or is it the (hypothetical) context switch from calling into the kernel to release the inode?
>
> If it's the former, then batching may help. If it's the latter, then probably not -- I doubt libfuse has a batched interface for something like this. We'd just end up batching them up in one thread and then dispatching them individually in the finisher.

I meant the latter. You are right; fuse doesn't have an interface like this.

@mattbenjamin
Contributor

mattbenjamin commented Apr 23, 2020

I think the idea of doing the upcalls without the client lock was a sensible one. Has that made it in? As I remarked in the email thread, I think batching makes for a good future improvement, even if the benefit isn't immediate, especially if the upper layer isn't required to release synchronously, or can optionally defer it.

@jtlayton
Contributor Author

This does the callbacks without the client_lock: each one is queued to a new "finisher", which dequeues and runs the jobs without taking the client_lock. This is pretty consistent with how all of the libcephfs callbacks work.
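In a simplified sketch, the pattern is (class and member names approximate, not the verbatim code from the PR):

    // Context queued from trim_caps() while client_lock is held.
    class C_Client_CacheRelease : public Context {
      Client *client;
      vinodeno_t ino;
    public:
      C_Client_CacheRelease(Client *c, Inode *in)
        : client(c), ino(in->vino()) {}
      void finish(int r) override {
        // runs later on the finisher thread, with client_lock NOT held
        client->ino_release_cb(client->callback_handle, ino);
      }
    };

    void Client::_schedule_ino_release_callback(Inode *in) {
      if (ino_release_cb)  // only if the application registered one
        async_ino_releasor.queue(new C_Client_CacheRelease(this, in));
    }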

I have the corresponding patches for ganesha as well, but I don't want to merge those until the ceph bits go in:

https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/490847
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/490848

@mattbenjamin
Contributor

@jtlayton cool--thanks for explaining

@jtlayton jtlayton force-pushed the wip-45114 branch 5 times, most recently from a940f5b to 0d7295e Compare April 29, 2020 17:14
@jtlayton jtlayton marked this pull request as ready for review April 30, 2020 11:26
@jtlayton
Contributor Author

jtlayton commented Apr 30, 2020

Testcase seems to work. The only thing I'm not sure about is whether it would be better to piggyback this test onto the existing cluster that gets set up for the client_trim_caps test. The cluster configuration for that is identical, so we might be able to run both tests on the same one.

http://pulpito.ceph.com/jlayton-2020-04-30_11:04:23-fs-wip-jlayton-45114-distro-basic-smithi/

@jtlayton
Contributor Author

Updated the testcase to piggyback this test onto the cluster setup for ceph_test_trim_caps. I think it's "officially" ready now.

client: only override umask_cb with non-NULL values

Client::init sets this, but if we later call ll_register_callbacks again
with a new set of function pointers that has umask_cb set to nullptr,
it'll override the value in the cmount.

Only reset umask_cb if the one in args is not nullptr.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
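The fix amounts to a guard in the registration path; roughly (a sketch, not the verbatim diff):

    // in Client::ll_register_callbacks(): keep the umask_cb that
    // Client::init() installed unless the caller actually supplied one
    if (args->umask_cb)
      umask_cb = args->umask_cb;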
client: make client_dentry_callback_t more friendly for C

C doesn't have the string type, and doesn't understand references.
Change client_dentry_callback_t to take separate pointer and length
arguments.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
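Roughly the change in shape (the old signature is paraphrased):

    // before (not usable from C): name arrived as a std::string reference
    // typedef void (*client_dentry_callback_t)(void *handle, vinodeno_t dirino,
    //                                          vinodeno_t ino, std::string &name);

    // after: plain pointer + explicit length
    typedef void (*client_dentry_callback_t)(void *handle, vinodeno_t dirino,
                                             vinodeno_t ino, const char *name,
                                             size_t len);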
client: rename ceph_statx.h to ceph_ll_client.h

...so we can use it to include other definitions and types that need to
be shared with in-tree code that doesn't want to include libcephfs.h.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
client: move callback typedefs and arg struct into ceph_ll_client.h

Put them in a common interface header file. This also allows us to
eliminate the duplicate definition of ceph_deleg_cb_t in Delegation.h.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
client: expose ceph_ll_register_callbacks via libcephfs

Allow applications based on libcephfs to register callbacks, like we
do for ceph-fuse.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
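Schematically, the exposed entry point is a thin wrapper over the Client method (a sketch, not the verbatim code):

    extern "C" void ceph_ll_register_callbacks(struct ceph_mount_info *cmount,
                                               struct ceph_client_callback_args *args)
    {
      // forward straight to the Client instance behind this mount
      cmount->get_client()->ll_register_callbacks(args);
    }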
client: add a new inode release request callback

trim_caps() walks the list of caps on the session, releases non-auth
caps, and attempts to trim dentries until the cache size is under the
max_caps value requested by the MDS.

This is fine for FUSE, but doesn't really match the use-case of
nfs-ganesha. Ganesha typically looks up inodes by inode number, not
by dentry. It's quite possible that after a restart, we may have a
ton of outstanding inodes with no dentries associated with them.

Ganesha holds a reference to each inode, so libcephfs can't release
them, and we don't have a way to request that ganesha do so.

Add a new ino_release_callback and finisher. The intent is to allow
libcephfs to "upcall" to the application and request that it release
references to a specific inode.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
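The shape of the new upcall is deliberately minimal: one inode per call. The typedef below is essentially what ceph_ll_client.h gains, though treat it as a sketch:

    /* ask the application to drop any references it holds on this inode */
    typedef void (*client_ino_release_t)(void *handle, vinodeno_t ino);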
test: add a new program for testing ino_release_cb

Create a bunch of files and get their inode numbers. Remount, look them
all up by inode number and hold references. Stop looking up inodes as
soon as we get a callback from libcephfs. If we got the callback, return
success. Fail otherwise.

Since this has the same cluster setup as the other client_trim_caps
testcase, we can piggyback onto that task.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
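In outline, the test body looks something like this (heavily abbreviated; assumes got_release is set by the registered ino_release callback, and that inos[] was recorded before the remount; ceph_ll_lookup_inode() is the existing libcephfs call):

    static volatile int got_release;  /* set to 1 by our ino_release callback */

    /* sketch: pin inodes by number until the MDS pressures us to release */
    static int run_test(struct ceph_mount_info *cmount,
                        struct inodeno_t *inos, int nfiles)
    {
        for (int i = 0; i < nfiles; i++) {
            struct Inode *in;

            if (got_release)
                return 0;  /* callback fired: test passes */
            /* look the inode up by number and keep the reference */
            if (ceph_ll_lookup_inode(cmount, inos[i], &in) < 0)
                return 1;
        }
        return got_release ? 0 : 1;  /* fail if no callback ever arrived */
    }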
@jtlayton
Contributor Author

jtlayton commented May 5, 2020

@batrick : can you pull this into your PR runs? I'd like to get this merged soon, as the ganesha patches have already made it into ganesha's "next" branch, and we won't really know how much this helps until we can get it tested in environments where it's a problem.

Member

@gregsfortytwo gregsfortytwo left a comment


Test looks good!

Reviewed-by: Greg Farnum gfarnum@redhat.com

batrick added a commit to batrick/ceph that referenced this pull request May 15, 2020
* refs/pull/34596/head:
	test: add a new program for testing ino_release_cb
	client: add a new inode release request callback
	client: expose ceph_ll_register_callbacks via libcephfs
	client: move callback typedefs and arg struct into ceph_ll_client.h
	client: rename ceph_statx.h to ceph_ll_client.h
	client: make client_dentry_callback_t more friendly for C
	client: only override umask_cb with non-NULL values

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Zheng Yan <zyan@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
@batrick batrick merged commit 356c64a into ceph:master May 18, 2020
@@ -4345,6 +4385,7 @@ void Client::trim_caps(MetaSession *s, uint64_t max)
       if (all && in->ino != MDS_INO_ROOT) {
 	ldout(cct, 20) << __func__ << " counting as trimmed: " << *in << dendl;
 	trimmed++;
+	_schedule_ino_release_callback(in.get());
Contributor

@sepia-liu sepia-liu May 19, 2020


This works for releasing file inodes, but not for directory inodes. When ganesha holds a reference (ll_ref = 1) on a directory inode, the dentry stays pinned and all == false, so even though ganesha could release the inode, the callback is never invoked here. I merged the changes to both libcephfs and ganesha, and in testing found that directory inodes could not be released.
Does this fit here? Thanks @jtlayton
