
client: expose Client::ll_register_callback via libcephfs #34596

Merged
7 commits merged into ceph:master from wip-45114 on May 18, 2020

Conversation

jtlayton
Contributor

We need to wire up the invalidation callbacks so that ganesha can do a better job of keeping its caches under control. This is the first step in that process, as the current callback interface isn't suitable for C.

Fixes: https://tracker.ceph.com/issues/45114
Signed-off-by: Jeff Layton jlayton@redhat.com
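For reference, a minimal sketch of what the C-friendly registration looks like from the application side. The entry point and header names come from this PR's commits (ceph_ll_register_callbacks, ceph_ll_client.h); the struct fields and types shown are illustrative, not a verbatim copy of the header:

    #include <cephfs/libcephfs.h>
    #include <cephfs/ceph_ll_client.h>

    /* application-side dentry invalidation handler: evict the cached
     * dentry for (dirino, name). The name arrives as pointer + length,
     * so this is callable from plain C. */
    static void my_dentry_invalidate(void *handle, vinodeno_t dirino,
                                     vinodeno_t ino, const char *name,
                                     size_t len)
    {
        /* drop (dirino, name) from the application's dentry cache */
    }

    static void register_cbs(struct ceph_mount_info *cmount, void *priv)
    {
        struct ceph_client_callback_args args = {0};

        args.handle = priv;                    /* passed back to every callback */
        args.dentry_cb = my_dentry_invalidate;
        ceph_ll_register_callbacks(cmount, &args);
    }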

@jtlayton jtlayton added the cephfs (Ceph File System) and DNM labels Apr 16, 2020
@jtlayton
Contributor Author

DNM for now, as I need to write the corresponding code for ganesha, and I don't want to merge this until it's working end-to-end.

@jtlayton jtlayton force-pushed the wip-45114 branch 2 times, most recently from 7b3780e to 42efe72 Compare April 17, 2020 14:49
@jtlayton
Contributor Author

Also added a new ino_release callback. The intent is to advise the application that we would like it to release its references to a given inode if possible. Ganesha can then look the inode up in its cache and drop it if it's idle.

The fact that we have to queue these up makes the math a bit fuzzy, IMO, but I don't see a reasonable alternative.
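For illustration, the application side of that contract might look like this (all names below are hypothetical; it's just the shape a ganesha handler would take):

    /* hypothetical handler: libcephfs asks us to release one inode */
    static void my_ino_release(void *handle, vinodeno_t ino)
    {
        struct app_cache *cache = (struct app_cache *)handle;
        struct app_entry *entry = cache_lookup(cache, ino); /* hypothetical */

        if (!entry)
            return;                   /* we don't hold it; nothing to do */
        if (entry_is_idle(entry))     /* no active opens or other users */
            cache_drop(cache, entry); /* puts the libcephfs inode ref */
    }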

@gregsfortytwo
Member

Seems fine in concept. I am a bit worried about only being able to request release on one inode at a time, though; no idea how expensive the overhead turns out to be, but an individual callback repeated 10k+ times sounds expensive. We probably also want either to just pass through a generic "please release whatever inodes you can" request, or else at least be able to specify more than one inode at a time?

Or perhaps the overhead isn't really high enough to worry about that and I'm blowing smoke; we aren't actually crossing the kernel boundary here...

@jtlayton
Contributor Author

I think this is probably the best we can do.

I intend to make the release callback in ganesha synchronous, so basically we'll queue this to the finisher, and it'll call into ganesha and put the reference. Ideally we'd just do all of this without having to queue up work, but the fact that the client_lock is held while calling this makes it a bit hairy to do so.

@gregsfortytwo
Member

> I intend to make the release callback in ganesha synchronous, so basically we'll queue this to the finisher, and it'll call into ganesha and put the reference. Ideally we'd just do all of this without having to queue up work, but the fact that the client_lock is held while calling this makes it a bit hairy to do so.

Well, if that's going to be an issue and we end up having to do queues, I'm quite sure we'll want to batch up inode releases. We could presumably pass around an array (perhaps of fixed size) of requested inodes to free about as easily as a single ino number?
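Something along these lines, say (purely hypothetical; nothing like this is in the PR):

    /* hypothetical batched variant: one upcall covers up to 'count'
     * inodes instead of queueing one finisher job per inode */
    typedef void (*client_ino_release_many_t)(void *handle,
                                              vinodeno_t *inos,
                                              size_t count);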

@jtlayton
Contributor Author

That's a possibility, though I'm not sure batching them really provides any benefit here.

Alternately, we could just batch them up and do all of the individual "upcalls" outside the client_lock. That wouldn't cost that much, and it may allow the results to be more immediate. For now, I'll stick with it as-is, just to get a proof-of-concept patch going. Once I have something working we can argue over the details.

@jtlayton jtlayton force-pushed the wip-45114 branch 2 times, most recently from da03047 to d8f2dbd Compare April 22, 2020 21:22
@jtlayton jtlayton requested a review from lxbsz April 22, 2020 21:22
@jtlayton
Contributor Author

Was able to successfully test this today and verify that it was sending release request callbacks to ganesha. The ganesha piece still needs some work, but this piece is more or less working.

For now, I'm still not batching these up. I don't quite see the point: is there extra overhead for multiple queued ino_release_cb jobs that would make it worthwhile? The gain isn't clear, and batching would mean queueing them up manually on a list or something in trim_caps and then handing that to the finisher job.

@jtlayton jtlayton removed the DNM label Apr 22, 2020
@jtlayton jtlayton requested review from a team and removed request for batrick, gregsfortytwo, ukernel and lxbsz April 22, 2020 21:29
@ukernel
Contributor

ukernel commented Apr 23, 2020

LGTM. I agree with Greg: we'd better batch these up. The new inode release upcall may be used by ceph-fuse in the future, and the overhead of one upcall per inode is too high.

@jtlayton
Contributor Author

Where does the overhead come from, though? Is it from the fact that we're queueing this to the finisher, or is it the (hypothetical) context switch from calling into the kernel to release the inode?

If it's the former, then batching may help. If it's the latter, then probably not -- I doubt libfuse has a batched interface for something like this. We'd just end up batching them up in one thread and then dispatching them individually in the finisher.

@ukernel
Contributor

ukernel commented Apr 23, 2020

> Where does the overhead come from, though? Is it from the fact that we're queueing this to the finisher, or is it the (hypothetical) context switch from calling into the kernel to release the inode?
>
> If it's the former, then batching may help. If it's the latter, then probably not -- I doubt libfuse has a batched interface for something like this. We'd just end up batching them up in one thread and then dispatching them individually in the finisher.

I meant the latter. You are right; fuse doesn't have an interface like this.

@mattbenjamin
Contributor

mattbenjamin commented Apr 23, 2020

I think the idea of doing the upcalls without the client lock was a sensible one. Has that made it in? As I remarked in the email thread, I think batching makes for a good future improvement, even if the benefit isn't immediate, especially if the upper layer isn't required to release synchronously, or can optionally defer it.

@jtlayton
Contributor Author

This does the callbacks without the client_lock: each one is queued to a new "finisher", which dequeues and runs the jobs without taking the client_lock. This is pretty consistent with how all of the libcephfs callbacks work.
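In a simplified sketch, the pattern is (class and member names approximate, not the verbatim code from the PR):

    // Context queued from trim_caps() while client_lock is held.
    class C_Client_CacheRelease : public Context {
      Client *client;
      vinodeno_t ino;
    public:
      C_Client_CacheRelease(Client *c, Inode *in)
        : client(c), ino(in->vino()) {}
      void finish(int r) override {
        // runs later on the finisher thread, with client_lock NOT held
        client->ino_release_cb(client->callback_handle, ino);
      }
    };

    void Client::_schedule_ino_release_callback(Inode *in) {
      if (ino_release_cb)  // only if the application registered one
        async_ino_releasor.queue(new C_Client_CacheRelease(this, in));
    }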

I have the corresponding patches for ganesha as well, but I don't want to merge those until the ceph bits go in:

https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/490847
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/490848

@mattbenjamin
Contributor

@jtlayton cool--thanks for explaining

@jtlayton jtlayton force-pushed the wip-45114 branch 5 times, most recently from a940f5b to 0d7295e Compare April 29, 2020 17:14
@jtlayton jtlayton marked this pull request as ready for review April 30, 2020 11:26
@jtlayton
Contributor Author

jtlayton commented Apr 30, 2020

Testcase seems to work. The only thing I'm not sure about is whether it would be better to piggyback this test onto the existing cluster that gets set up for the client_trim_caps test. The cluster configuration for that is identical, so we might be able to run both tests on the same one.

http://pulpito.ceph.com/jlayton-2020-04-30_11:04:23-fs-wip-jlayton-45114-distro-basic-smithi/

@jtlayton
Contributor Author

Updated the testcase to piggyback this test onto the cluster setup for ceph_test_trim_caps. I think it's "officially" ready now.

client: only override umask_cb with non-NULL values

Client::init sets this, but if we later call ll_register_callbacks again
with a new set of function pointers that has umask_cb set to nullptr,
it'll override the value in the cmount.

Only reset umask_cb if the one in args is not nullptr.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
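The fix amounts to a guard in the registration path; roughly (a sketch, not the verbatim diff):

    // in Client::ll_register_callbacks(): keep the umask_cb that
    // Client::init() installed unless the caller actually supplied one
    if (args->umask_cb)
      umask_cb = args->umask_cb;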
client: make client_dentry_callback_t more friendly for C

C doesn't have the string type, and doesn't understand references.
Change client_dentry_callback_t to take separate pointer and length
arguments.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
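Roughly the change in shape (the old signature is paraphrased):

    // before (not usable from C): name arrived as a std::string reference
    // typedef void (*client_dentry_callback_t)(void *handle, vinodeno_t dirino,
    //                                          vinodeno_t ino, std::string &name);

    // after: plain pointer + explicit length
    typedef void (*client_dentry_callback_t)(void *handle, vinodeno_t dirino,
                                             vinodeno_t ino, const char *name,
                                             size_t len);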
client: rename ceph_statx.h to ceph_ll_client.h

...so we can use it to include other definitions and types that need to
be shared with in-tree code that doesn't want to include libcephfs.h.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
client: move callback typedefs and arg struct into ceph_ll_client.h

Put them in a common interface header file. This also allows us to
eliminate the duplicate definition of ceph_deleg_cb_t in Delegation.h.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
client: expose ceph_ll_register_callbacks via libcephfs

Allow applications based on libcephfs to register callbacks, like we
do for ceph-fuse.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
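Schematically, the exposed entry point is a thin wrapper over the Client method (a sketch, not the verbatim code):

    extern "C" void ceph_ll_register_callbacks(struct ceph_mount_info *cmount,
                                               struct ceph_client_callback_args *args)
    {
      // forward straight to the Client instance behind this mount
      cmount->get_client()->ll_register_callbacks(args);
    }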
client: add a new inode release request callback

trim_caps() walks the list of caps on the session, releases non-auth
caps, and attempts to trim dentries until the cache size is under the
max_caps value requested by the MDS.

This is fine for FUSE, but doesn't really match the use-case of
nfs-ganesha. Ganesha typically looks up inodes by inode number, not
by dentry. It's quite possible that after a restart, we may have a
ton of outstanding inodes with no dentries associated with them.

Ganesha holds a reference to each inode, so libcephfs can't release
them, and we don't have a way to request that ganesha do so.

Add a new ino_release_callback and finisher. The intent is to allow
libcephfs to "upcall" to the application and request that it release
references to a specific inode.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
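The shape of the new upcall is deliberately minimal: one inode per call. The typedef below is essentially what ceph_ll_client.h gains, though treat it as a sketch:

    /* ask the application to drop any references it holds on this inode */
    typedef void (*client_ino_release_t)(void *handle, vinodeno_t ino);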
test: add a new program for testing ino_release_cb

Create a bunch of files and get their inode numbers. Remount, look them
all up by inode number and hold references. Stop looking up inodes as
soon as we get a callback from libcephfs. If we got the callback, return
success. Fail otherwise.

Since this has the same cluster setup as the other client_trim_caps
testcase, we can piggyback onto that task.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
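In outline, the test body looks something like this (heavily abbreviated; assumes got_release is set by the registered ino_release callback, and that inos[] was recorded before the remount; ceph_ll_lookup_inode() is the existing libcephfs call):

    static volatile int got_release;  /* set to 1 by our ino_release callback */

    /* sketch: pin inodes by number until the MDS pressures us to release */
    static int run_test(struct ceph_mount_info *cmount,
                        struct inodeno_t *inos, int nfiles)
    {
        for (int i = 0; i < nfiles; i++) {
            struct Inode *in;

            if (got_release)
                return 0;  /* callback fired: test passes */
            /* look the inode up by number and keep the reference */
            if (ceph_ll_lookup_inode(cmount, inos[i], &in) < 0)
                return 1;
        }
        return got_release ? 0 : 1;  /* fail if no callback ever arrived */
    }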
@jtlayton
Contributor Author

jtlayton commented May 5, 2020

@batrick : can you pull this into your PR runs? I'd like to get this merged soon, as the ganesha patches have already made it into ganesha's "next" branch, and we won't really know how much this helps until we can get it tested in environments where it's a problem.

Member

@gregsfortytwo gregsfortytwo left a comment


Test looks good!

Reviewed-by: Greg Farnum gfarnum@redhat.com

batrick added a commit to batrick/ceph that referenced this pull request May 15, 2020
* refs/pull/34596/head:
	test: add a new program for testing ino_release_cb
	client: add a new inode release request callback
	client: expose ceph_ll_register_callbacks via libcephfs
	client: move callback typedefs and arg struct into ceph_ll_client.h
	client: rename ceph_statx.h to ceph_ll_client.h
	client: make client_dentry_callback_t more friendly for C
	client: only override umask_cb with non-NULL values

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Zheng Yan <zyan@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
@batrick batrick merged commit 356c64a into ceph:master May 18, 2020
@@ -4345,6 +4385,7 @@ void Client::trim_caps(MetaSession *s, uint64_t max)
       if (all && in->ino != MDS_INO_ROOT) {
 	ldout(cct, 20) << __func__ << " counting as trimmed: " << *in << dendl;
 	trimmed++;
+	_schedule_ino_release_callback(in.get());
Contributor

@sepia-liu sepia-liu May 19, 2020


This works for releasing file inodes, but not for directory inodes. When ganesha holds a reference (ll_ref = 1) on a directory inode, the dentry stays pinned and all == false, so even though ganesha could release the inode, the callback is never invoked here. I merged the changes to both libcephfs and ganesha, and in testing found that directory inodes could not be released.
Does this fit here? Thanks @jtlayton
