Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pile of statx patches #10922

Merged
merged 37 commits into from Sep 6, 2016
Merged

First pile of statx patches #10922

merged 37 commits into from Sep 6, 2016

Conversation

jtlayton
Copy link
Contributor

@jtlayton jtlayton commented Aug 30, 2016

github wouldn't allow me to reopen my old PR, so opening a new one here. Most of the changes are in response to Greg's review of the PR here:

#10842

This patchset is the first pile of patches to add ceph_statx support to the userland client. The basic idea is to add a new interface patterned after the proposed statx() interface for Linux. The idea there is to extend the traditional stat() interface in several ways:

  1. allow access to more inode attribute fields, in particular the birthtime (btime) and a change attribute (a'la NFSv4), in a way that is extensible in the future.

  2. allow applications to request just specific fields in the ceph_statx struct be filled out. Note that you can (and will) often get more than you request, but you should always get what is requested or an error.

  3. allow applications to do a "lazy" statx, where we fill out the attribute fields from whatever we have in cache, without querying the server. This is handy for performance reasons or in cases where we don't really need up-to-date information.

In addition, a ceph_setattrx has also been added, which allows the setting of the btime as well as the other fields that ceph_setattr allows. I also have a companion patchset for samba that converts it to use these new interfaces. That seems to work properly, though I still need to roll a testcase to verify that setting the btime via smb actually works.

Pay close attention to the first patch in the file as well, which I think may help performance in some cases by allowing us to request the caps needed to do an operation during the lookup phase of path-based operations.

I've also started some work to do a patchset for ganesha, but that requires adding a number of new ceph_ll_* calls. I think it's probably best to get this merged first and then do that part afterward.

jtlayton and others added 30 commits August 29, 2016 07:16
ll_walk expects to get a set of attributes out of a path_walk. Pass a
caps mask parameter into path_walk, and then apply it when we reach the
last component of the path.

This may prevent us from having to further iteract with the server after
the pathwalk, in some cases. If we know that we're going to need certain
caps to do the actual operation we can request them during the lookup
and may have all that we need by the time we go to do the real request.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
...in preparation for adding a ll_getattrx.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
We'll need them in Client.cc now in addition to FUSE specific code.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
New interfaces for fetching extended (and selective) stat information.
Additionally, applications can specify AT_NO_ATTR_SYNC in the flags to
indicate that they want to do a "lazy" statx that just hands out the
inode info from the cache, or AT_SYMLINK_NOFOLLOW to avoid following
symlinks when walking the path.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Create 2 clients. Create a file in client1, and do a lookup of it in
client2, and then ll_getattrx it from client2. chmod the file from
client1, ll_getattrx it client2 (this time with AT_NO_ATTR_SYNC) and
ensure that the ctime change is not seen.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Currently we don't have a mechanism to set the btime, but we will need
that eventually. If we want to allow the client to cache that change, we
need to be able to pass it back and forth between client and server.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
...that is mapped to BTIME feature.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
...and pass it around appropriately.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
…t hold CAP_FILE_EXCL

Suppose we have two clients. client1 holds FILE_EXCL cap and client2
holds AUTH_EXCL.  Both have the change_attr at the same value (call it
1).  client1 does 2 writes and its change_attr goes to 3. The client1
then queries for the change_attr and gets back 3 from the cache. The MDS
then recalls FILE_EXCL from client1 and now the MDS and client1 have the
same change_attr (3).

client2 then does a chmod on the file, and its change_attr goes to 2.
client1 then does a statx with STX_VERSION|STX_MODE. The MDS recalls the
AUTH_EXCL cap from client2, the change_attr in the MClientCaps is less
than the one in the MDS inode, so it gets discarded. client1 then sees
a new mode but the change_attr value has not changed, which violates the
rules.

Fix this with an extra increment of the MDS copy of the change_attr when
the caps being returned are dirty, and they don't contain exclusive write
caps.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
The semantics for a change_attr are that it should be incremented
whenever there is a change to the ctime in the inode. Add those
increments for the simple case of regular files. Directories however can
be fragmented so we'll need to do something more elaborate there.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
…mtime

...rather than messing around with references. While we're at it, we
can also make the argument optional, which allows us to drop an unused
stack variable from CDir::split.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
To give us change attribute support for directories. Whenever we gather
the dirfrags, we just select the largest change_attr out of the set.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
We're going to need to introduce new versions of these structures in
order to expand the setattr union member. Rename the existing ones so
that it's clear that they are for legacy clients and servers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Unfortunately, the only option here is to rev the MClientRequest
version as the ceph_mds_request_head is not currently versioned. Add a
new ceph_mds_request_head, which contains a new ceph_mds_request_args
structure.

The new ceph_mds_request_head is now versioned via a __le16
at the beginning of it, and then the args structure is expanded to hold
the btime. When we get a legacy ceph_mds_request_head, we just set the
new fields to zero. When encoding a reply to a legacy client, we simply
don't encode the version in the head, or the btime in the setattr union
member.

Reluctantly-Signed-off-by: Jeff Layton <jlayton@redhat.com>
This adds a new set of libcephfs calls: ceph_ll_setattrx and
ceph_setattrx. This allows clients to set the btime in addition to other
values that are typically settable via ceph_setattr calls.

Currently, the setattrx mask uses the same CEPH_SETATTR values that the
ceph_setattr interface uses. I'm not sure this is what we will want
though. Would it be better to rephrase that via STATX_* constants?

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
@jtlayton
Copy link
Contributor Author

jtlayton commented Aug 30, 2016

Note that while this is largely a feature request, we may want to consider backporting this patch to earlier releases:

 client: pass a mask parameter to path_walk

I think that may be a performance win in many cases, though I don't have any numbers to back that up, currently.

@@ -7571,7 +7576,7 @@ int Client::open(const char *relpath, int flags, mode_t mode, int stripe_unit,
string dname = dirpath.last_dentry();
dirpath.pop_dentry();
InodeRef dir;
r = path_walk(dirpath, &dir, true, uid, gid);
r = path_walk(dirpath, &dir, true, 0, uid, gid);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably ought to request CEPH_CAP_AUTH_SHARED here if client_permissions is true. I'll plan to fold that into this patch once others have had a chance to comment.

::decode(head, p);
} else {
struct ceph_mds_request_head_legacy *old_mds_head =
(struct ceph_mds_request_head_legacy *)&(head.oldest_client_tid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just create a new (genuine) ceph_mds_request_head_legacy and then invoke a copy function to move it into our real one?
I realize just using the pointer is more efficient but hopefully this is an uncommon case, and doing it with pointer magic makes it really hard to find. Imagine in future when we do an incompat change and totally reorder all the member fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could. I'll hold my nose and do that. ;)

@gregsfortytwo
Copy link
Member

I think I've been over the new bits here; everything else looks good.

(In future, it's a lot easier to do follow-on reviews if you do add-on FIXUP/SQUASH patches that we can look at, then squash down before final merge.)

@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 1, 2016

FWIW, I've done a blog post expliaining why I think that this method of handling the change attribute is OK:

https://jtlayton.wordpress.com/2016/09/01/cephfs-and-the-nfsv4-change-attribute/

@ffilz
Copy link
Contributor

ffilz commented Sep 1, 2016

Confused by my getting assigned the pr...

…e'll need to check perms locally

Signed-off-by: Jeff Layton <jlayton@redhat.com>
…s on clients that hold CAP_FILE_EXCL"

This reverts commit 26ab2fa.

I don't think this is necessary. We don't necessarily need to increment
the change attribute on every change, as long as we ensure that it gets
bumped iff there were changes since the last time you queried for it.

IOW: it's ok to have a single change_attr change encompass a large set
of changes, as long as you ensure that that it is larger by at least 1
after all of those changes.

In order to look at (and potentially cache) other attributes under that
change_attribute (e.g. owner or mode), you need to recall any outstanding
exclusive caps for those attrs. That causes their change_attrs to be
synched to the largest, which is enough to ensure that it changed in
some way.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
…t's newer

I think this is more correct in the client. If we see a newer
change_attr, then we always want to take note of that fact, even if
we have the right caps.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
…y encode/decode

As requested by Greg...

Declare a legacy object on the stack, and do an extra copy to/from it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
@gregsfortytwo
Copy link
Member

@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 1, 2016

@ffilz: only assigned to you to make sure you saw the set

@gregsfortytwo: I replied on the blog. There should be no reliance on any particular client or server behavior here. That said, if you can come up with a scenario where this scheme falls down, then I'd be very interested to hear it.

OTOH, if you're talking about the ceph change attr giving guarantees over and above what the NFSv4 spec mandates, then that's another matter entirely.

Are you suggesting we ought to allow clients to inspect the change_attr and somehow determine whether their cached attrs are still valid, even when they don't have the caps that they'd normally need to see them? If so, I'm skeptical but am interested to hear more of what you're thinking.


::decode(old_mds_head, p);
memcpy(&head.oldest_client_tid, &old_mds_head, sizeof(old_mds_head));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is getting lost in translation. Enjoy your ability to use C++ and write "invisible" functions; instead of memcpy do something like
head.copy_from_legacy(old_mds_head);
or perhaps copy_legacy_head(head, old_mds_head) since these are declared as shared C structs. That way if we update stuff, it requires changing a single function located next to the ceph_mds_request_head struct instead of searching out everybody who copies them.

Granted, if we only have the one user maybe it's not such a big deal and is fine as it stands. That's just the sort of protective programming we can engage in even though it makes the kernel devs angry for being inefficient. ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does trigger my gag reflex, but ok...

@gregsfortytwo
Copy link
Member

Discussed this on irc as well and I think the protocol is all good.

Reviewed-by: Greg Farnum gfarnum@redhat.com

…hared caps

Otherwise, someone could potentially query for just CEPH_STATX_VERSION,
and see it as unchanged, even when there are changes buffered up on
other clients.

By doing this, I don't think we'll incur any perf hit in the common use
case which is ganesha querying for all attributes. We are adding Xs
here, but unless there is a lot of xattr activity I don't think that
will generally cause a lot of exclusive cap recalls.

Also, we don't actually need AUTH caps to fetch CEPH_STATX_RDEV, remove
that from statx_to_mask. PIN is sufficient there.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 2, 2016

Now that I thought about it some more, I think both @gregsfortytwo and @ffilz are right, and we need to ensure that any outstanding exclusive caps are revoked when you want to fetch the change attr. The patch that I pushed to the branch late last night should do the right thing there.

…d_legacy

Signed-off-by: Jeff Layton <jlayton@redhat.com>
@gregsfortytwo
Copy link
Member

Updates look good.

Just to be clear, there's nothing in here or the expected wrapper code that will automatically trigger a change_attr lookup unless it's asked for explicitly via statx (or otherwise genuinely needed), right? (I don't see anything, but that would be a really obnoxious performance bug to track down.) I'm thinking specifically of default masks and whatnot.

@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 2, 2016

Yes, the only to get the change_attr is to specifically request it via CEPH_STATX_VERSION flag, in the "want" mask.

@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 2, 2016

Note too that I don't think the existing code is quite correct wrt to the ctime. It seems like client1 could cache a mode change (for instance) and if client2 doesn't request As caps, then you might not see the ctime change for that. We may need to consider altering how that works here as well...

@gregsfortytwo
Copy link
Member

That wouldn't surprise me and might be one of the deliberate cheats Sage snuck in...stuff like times basically forces synchronicity and that isn't much fun. :/
But if there are issues we should open tracker bugs for them and can discuss the cost of the fix, if we want to make it configurable, etc.

@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 2, 2016

Actually, I think we're ok with the ctime. In order to actually query for the ctime, you have to do a ceph_stat() anyway (or the FUSE equivalent), which means that you're going to ask for all the shared caps anyway.

Where it's not yet correct is in statx_to_mask() function. I'll fix...

Much like the change_attr, ctime changes can potentially be cached on
clients. Request all shared caps if the want mask specifies the ctime.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
@jtlayton jtlayton merged commit 0358fc0 into master Sep 6, 2016
@jtlayton jtlayton deleted the wip-jlayton-statx branch September 6, 2016 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cephfs Ceph File System feature
Projects
None yet
6 participants