client: expose Client::ll_register_callback via libcephfs #34596
Conversation
DNM for now, as I need to write the corresponding code for ganesha, and I don't want to merge this until it's working end-to-end.
Force-pushed from 7b3780e to 42efe72
Also added a new ino_release callback. The intent is to advise the application that we would like it to release references to this inode if possible. Ganesha can then conditionally look them up in its cache and drop them if they are idle. The fact that we have to queue these up makes the math a bit fuzzy, IMO, but I don't see a reasonable alternative.
Seems fine in concept. I am a bit worried about only being able to request release on one inode at a time, though — no idea how expensive all the overhead turns out to be, but an individual callback repeated 10k+ times sounds expensive. We probably also want either to just pass through a generic "please release inodes"/"please clean up whatever you can of the inodes", or else at least be able to specify more than one inode at a time? Or perhaps the overhead isn't really high enough to worry about that and I'm blowing smoke; we aren't actually crossing the kernel boundary here...
I think this is probably the best we can do. I intend to make the release callback in ganesha synchronous, so basically we'll queue this to the finisher, and it'll call into ganesha and put the reference. Ideally we'd just do all of this without having to queue up work, but the fact that the client_lock is held while calling this makes it a bit hairy to do so.
Well if that's going to be an issue and we end up having to do queues, I'm quite sure we'll want to batch up inode releases. We could presumably pass around an array (perhaps of fixed size) of requested inodes to free about as easily as a single ino number?
That's a possibility, though I'm not sure batching them really provides any benefit here. Alternately we could just batch them up and do all of the individual "upcalls" outside the client_lock. That wouldn't cost much and it may allow the results to be more immediate. For now, I'll stick with it as-is, just to get a proof of concept patch going. Once I have something working we can argue over the details.
Force-pushed from da03047 to d8f2dbd
Was able to successfully test this today and verify that it was sending release request callbacks to ganesha. The ganesha piece still needs some work, but this piece is more or less working. For now, I'm still not batching these up, as I don't quite see the point. Is there extra overhead for multiple queued ino_release_cb jobs that would make that worthwhile? The gain isn't clear, and batching would mean queueing them up manually on a list or something in trim_caps and then handing that to the finisher job.
LGTM. I agree with Greg: we'd better batch these up. The new release-ino upcall may be used by ceph-fuse in the future, and the overhead of one upcall per inode is too high.
Where does the overhead come from, though? Is it from the fact that we're queueing this to the finisher, or is it the (hypothetical) context switch from calling into the kernel to release the inode? If it's the former, then batching may help. If it's the latter, then probably not -- I doubt libfuse has a batched interface for something like this. We'd just end up batching them up in one thread and then dispatching them individually in the finisher.
I mean the latter. You are right, fuse doesn't have an interface like this.
I think the idea of doing the upcalls without the client lock was a sensible one. Has that made it in? As I remarked in the email thread, I think batching makes for a good future improvement, even if the benefit isn't immediate--especially if the upper layer isn't required to release synchronously, or perhaps optionally is not.
This does the callbacks without the client_lock, by queueing them to a new finisher that dequeues and runs the jobs without taking the client_lock. This is pretty consistent with how all of the libcephfs callbacks work. I have the corresponding patches for ganesha as well, but I don't want to merge those until the ceph bits go in: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/490847
@jtlayton cool--thanks for explaining
Force-pushed from a940f5b to 0d7295e
Testcase seems to work. The only thing I'm not sure about is whether it would be better to piggyback this test onto the existing cluster that gets set up for the client_trim_caps test. The cluster configuration for that is identical, so we might be able to run both tests on the same one. http://pulpito.ceph.com/jlayton-2020-04-30_11:04:23-fs-wip-jlayton-45114-distro-basic-smithi/
Updated the testcase to piggyback this test onto the cluster setup for ceph_test_trim_caps. I think it's "officially" ready now.
Client::init sets this, but if we later call ll_register_callbacks again with a new set of function pointers that has umask_cb set to nullptr, it'll override the value in the cmount. Only reset umask_cb if the one in args is not nullptr. Signed-off-by: Jeff Layton <jlayton@redhat.com>
C doesn't have the string type, and doesn't understand references. Change client_dentry_callback_t to take separate pointer and length arguments. Signed-off-by: Jeff Layton <jlayton@redhat.com>
...so we can use it to include other definitions and types that need to be shared with in-tree code that doesn't want to include libcephfs.h. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Put them in a common interface header file. This also allows us to eliminate the duplicate definition of ceph_deleg_cb_t in Delegation.h. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Allow applications based on libcephfs to register callbacks, like we do for ceph-fuse. Signed-off-by: Jeff Layton <jlayton@redhat.com>
trim_caps() walks the list of caps on the session, and releases non-auth caps, and attempts to trim dentries until the cache size is under the max_caps value requested by MDS. This is fine for FUSE, but doesn't really match the use-case of nfs-ganesha. Ganesha typically looks up inodes by inode number, not by dentry. It's quite possible that after a restart, we may have a ton of outstanding inodes with no dentries associated with them. Ganesha holds a reference to each inode, so libcephfs can't release them, and we don't have a way to request that ganesha do so. Add a new ino_release_callback and finisher. The intent is to allow libcephfs to "upcall" to the application and request that it release references to a specific inode. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Create a bunch of files and get their inode numbers. Remount, look them all up by inode number and hold references. Stop looking up inodes as soon as we get a callback from libcephfs. If we got the callback, return success. Fail otherwise. Since this has the same cluster setup as the other client_trim_caps testcase, we can piggyback onto that task. Signed-off-by: Jeff Layton <jlayton@redhat.com>
@batrick : can you pull this into your PR runs? I'd like to get this merged soon, as the ganesha patches have already made it into ganesha's "next" branch, and we won't really know how much this helps until we can get it tested in environments where it's a problem.
Test looks good!
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
* refs/pull/34596/head:
	test: add a new program for testing ino_release_cb
	client: add a new inode release request callback
	client: expose ceph_ll_register_callbacks via libcephfs
	client: move callback typedefs and arg struct into ceph_ll_client.h
	client: rename ceph_statx.h to ceph_ll_client.h
	client: make client_dentry_callback_t more friendly for C
	client: only override umask_cb with non-NULL values

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Zheng Yan <zyan@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
@@ -4345,6 +4385,7 @@ void Client::trim_caps(MetaSession *s, uint64_t max)
     if (all && in->ino != MDS_INO_ROOT) {
       ldout(cct, 20) << __func__ << " counting as trimmed: " << *in << dendl;
       trimmed++;
+     _schedule_ino_release_callback(in.get());
This is fine for releasing a file inode, but it doesn't work for releasing a directory inode. When ganesha holds a ref (ll_ref == 1) on a directory inode, the dentry is pinned and all is false, so even though ganesha could release it, the callback is never made here. I merged the changes to both libcephfs and ganesha, and testing found that directory inodes could not be released.
Can that case be handled here? Thanks @jtlayton
We need to wire up the invalidation callbacks so that ganesha can do a better job of keeping its caches under control. This is the first step in that process, as the current callback interface isn't suitable for C.
Fixes: https://tracker.ceph.com/issues/45114
Signed-off-by: Jeff Layton <jlayton@redhat.com>