cephfs: fix future rctimes #37938
Conversation
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@vshankar @batrick At CERN we are using some iterations of Owncloud file sharing with a custom storage backend, and we are trying to create a module for Owncloud that uses a Ceph backend via cephfs. In our clusters we have some files with future mtimes, and that has affected the recursive ctimes we use to detect which files have changed and need to be synchronised. Because of the future rctimes we have to check the whole tree for changes (a huge performance impact). Can you have a look at this please?
@thmour -- @mchangir is making some really nice progress on fixing recursive stats in cephfs (especially for snapshots) -- requesting his review.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
I would also like to put in my vote for this to be merged. Currently I cannot use rctime, as some of the directories have rctimes ~70 years in the future.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@mchangir can you check this out? We need it for cephfs mirroring and for a project where I use Ceph as a backend for ownCloud. In some of our clusters we have future rctimes, so we can't use these file synchronisation tools.
I'm mostly OK with this change as it provides a way for the user to fix rctime inconsistencies (note that fixing rctime, especially for snapshots, is going to be a non-trivial effort). I'll let @mchangir and @batrick review the details of the change, though.
@thmour Side note - CephFS supports asynchronous mirroring of snapshots via the cephfs-mirror tool (available from the Ceph Pacific release onward).
Editing rctimes seems like a reasonable solution to me, but two issues:
- The wrong lock I commented on.
- Do we need to worry about the security model? I guess maybe it's the same as accepting client ctimes on files? Which is interesting because client timestamps used to be the default everywhere but we've been gradually moving away from it... :/
```cpp
@@ -5823,6 +5823,34 @@ void Server::handle_set_vxattr(MDRequestRef& mdr, CInode *cur)
     auto pi = cur->project_inode(mdr);
     cur->setxattr_ephemeral_dist(val);
     pip = pi.inode.get();
   } else if (name == "ceph.dir.rctime") {
     if (!cur->is_dir()) {
```
How do you update an inode's rctime? You could fix all of the rctimes but none of the inodes?
OK - nice. I was thinking about introducing more changes to this PR for that, which is unneeded - so +1
src/mds/Server.cc
Outdated
```cpp
  }
  utime_t val(sec, nsec);

  if (!xlock_nestlock(mdr, cur))
```
Fails with the following crash here: https://pulpito.ceph.com/vshankar-2022-02-05_17:27:49-fs-wip-vshankar-testing-20220201-113815-testing-default-smithi/6663067/
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10531-g09fc2e9e/rpm/el8/BUILD/ceph-17.0.0-10531-g09fc2e9e/src/mds/Locker.cc: 4782: ceph_abort_msg("abort() called")
ceph version 17.0.0-10531-g09fc2e9e (09fc2e9e470435f1e6dbf462a51853391f4a4cc0) quincy (dev)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f83dd2ec91f]
2: (Locker::simple_lock(SimpleLock*, bool*)+0x588) [0x560d6e1a4108]
3: (Locker::xlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl>&)+0x380) [0x560d6e1a5dd0]
4: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x24a0) [0x560d6e1bb8c0]
5: (Server::xlock_nestlock(boost::intrusive_ptr<MDRequestImpl>&, CInode*)+0xd2) [0x560d6e00fab2]
6: (Server::handle_set_vxattr(boost::intrusive_ptr<MDRequestImpl>&, CInode*)+0x14d4) [0x560d6e036364]
7: (Server::handle_client_setxattr(boost::intrusive_ptr<MDRequestImpl>&)+0xcd) [0x560d6e03979d]
8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xc7b) [0x560d6e0600ab]
9: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x566) [0x560d6e06bda6]
10: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x133) [0x560d6e074fe3]
11: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xd3c) [0x560d6dfcb74c]
12: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7cb) [0x560d6dfce39b]
13: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x560d6dfce9bc]
14: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x560d6dfbd8b8]
15: (DispatchQueue::entry()+0x14fa) [0x7f83dd57240a]
16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f83dd628891]
17: /lib64/libpthread.so.0(+0x814a) [0x7f83dc2c614a]
18: clone()
I think this needs to be a wrlock rather than an xlock.
Also, BTW, as per the comment here(*), it seems that we may want to acquire some additional locks? @batrick @gregsfortytwo What is dirlock? It seems to me like it is something that existed and got removed?
> Also, BTW, as per the comment here(*), it seems that we may want to acquire some additional locks? @batrick @gregsfortytwo What is dirlock? It seems to me like it is something that existed and got removed?
I ran `git log -G dirlock -- src/mds/` and got

```
commit fca6c27
Author: Sage Weil <sweil@redhat.com>
Date:   Thu Dec 18 10:36:47 2008 -0800

    mds: kill the dirlock
```
LOL
sigh
So it looks like Locker::xlock_start() checks the current lock state and calls either simple_xlock() or simple_lock() based on that.
```cpp
if (lock->get_state() == LOCK_LOCK || lock->get_state() == LOCK_XLOCKDONE) {
  mut->start_locking(lock);
  simple_xlock(lock);
} else {
  simple_lock(lock);
}
```
And we ended up in the simple_lock branch and it aborted because it doesn't recognize the current state when we get there. I really don't know why that might have happened — it might be that we just need the wrlock, but it also might be that we need more state guards in place elsewhere whose semantics I don't have in my head any more.
Yeah, that state isn't valid for scatterlock.
For the lock state machine in general, we would need an extensive design write-up covering each state with its nuances to really make sense of what states to add, which isn't going to be straightforward. In the interest of time, switching to wrlock might just be OK.
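As a toy model of why the nestlock xlock aborts (this is not the real Ceph lock code; all states and names below are illustrative stand-ins): `xlock_start()` only handles a couple of states directly, and anything else falls through to `simple_lock()`, which aborts on states it does not expect.

```python
from enum import Enum, auto

class State(Enum):
    LOCK = auto()       # states xlock_start() handles directly
    XLOCKDONE = auto()
    SYNC = auto()       # a state simple_lock() can transition from
    MIX = auto()        # a state simple_lock() does not expect here (illustrative)

def xlock_start(state):
    # Mirrors the branch quoted above: only LOCK/XLOCKDONE go straight
    # to simple_xlock(); everything else falls into simple_lock().
    if state in (State.LOCK, State.XLOCKDONE):
        return "simple_xlock"
    return simple_lock(state)

def simple_lock(state):
    # The real simple_lock() switches on the current state and calls
    # ceph_abort() on states it does not recognize -- which is what the
    # backtrace above shows happening via the nestlock.
    if state is State.SYNC:
        return "simple_lock"
    raise RuntimeError("abort() called")  # analogue of ceph_abort_msg
```

A wrlock avoids this path entirely because it never enters `xlock_start()`, which is why the suggestion above is to switch.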
It is observed that clients have files with invalid ctimes on cephfs. Even if those files can be fixed, the parent directories up to the root have their rctime set to an invalid value. Make rctime modifiable and let sysadmins fix those invalid rctimes recursively (bottom-up).
Signed-off-by: Theofilos Mouratidis <t.mour@cern.ch>
Will run it through test this week. thx @thmour
Setting this xattr from the kclient fails since it's considered a read-only attribute - https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/xattr.c#L365 Test failure here - https://pulpito.ceph.com/vshankar-2022-02-28_04:32:15-fs-wip-vshankar-testing-20220226-211550-testing-default-smithi/6710176/ This would require the kclient to allow ceph.dir.rctime to be modified (this PR allows the user-space driver to set this xattr). @jtlayton ^^
@jtlayton If my
No, I don't think so. This doesn't rely on GETVXATTR. The problem is that the rstats xattrs are marked READONLY so they aren't allowed to be changed currently. You'd need to add support for setting them to the kclient.
I'll send a fix for that.
```cpp
    }
  }

  if (changed) {
    pi->ctime = std::max(pi->ctime, ceph_clock_now());
```
The `ctime` change here seems buggy. The client could cache the `ctime` change, such as when the `Fx` or `Ax` caps are issued. What if the cached `ctime` is flushed to the MDS 1 minute later? The `ctime` here will always be increased by 1 minute?
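The concern can be illustrated numerically (a toy sketch; the function names are invented for illustration): clamping to the server clock at flush time inflates the stored ctime by however long the client held the cached change, while trusting the client-supplied ctime preserves it.

```python
def mds_apply_setattr_inflating(stored_ctime, flush_time):
    # The questioned pattern: clamp ctime to "now" on the MDS side.
    # If a client cached the change (Fx/Ax caps) and flushes a minute
    # later, the recorded ctime drifts forward to the flush time.
    return max(stored_ctime, flush_time)

def mds_apply_setattr_client(stored_ctime, client_ctime):
    # Alternative: trust the ctime the client sent with the change.
    return client_ctime

t_change = 1_000.0          # when the client actually made the change
t_flush = t_change + 60.0   # caps flushed to the MDS one minute later

assert mds_apply_setattr_inflating(t_change, t_flush) == t_flush   # drifted by 60s
assert mds_apply_setattr_client(t_change, t_change) == t_change    # preserved
```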
```cpp
    if (new_mtime > pi.inode->rstat.rctime)
      pi.inode->rstat.rctime = new_mtime;
  }
  pi.inode->ctime = ceph_clock_now();
```
This seems to be breaking pjd test - https://pulpito.ceph.com/vshankar-2022-03-14_08:09:40-fs-wip-vshankar-testing1-20220314-103400-testing-default-smithi/6735687/
Yeah, and only recovery will set the `new_size`; the `mtime` will also be read from the data pool objects. I don't see any case where a user-modified `mtime` could affect the `ctime` here.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
In our cephfs cluster we get users who import files with invalid ctimes.
This affects the rctime all the way up to the root, and the rctime doesn't get updated on new writes until the future date is reached.
Applications that depend on rctime to detect file changes efficiently (e.g. rsync) can't be used.
Make rctime modifiable so sysadmins can fix it after fixing the invalid files' ctimes.
Also make mtime not change the rctime, because mtime is a user-modifiable field and can be set to the future;
rctime should only change based on the value of ctime instead.
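The bottom-up repair described above could be scripted along these lines (a sketch; `fix_rctimes` and the `set_rctime` callback are hypothetical names, and on a real CephFS mount the callback would perform the actual xattr write this PR enables):

```python
import os

def fix_rctimes(root, set_rctime):
    # Walk bottom-up and clamp each directory's rctime to the newest
    # ctime found beneath it. With topdown=False, os.walk visits every
    # child directory before its parent, so `newest` already holds the
    # repaired value for each subdirectory when the parent is reached.
    newest = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        times = [os.lstat(dirpath).st_ctime]
        for f in filenames:
            times.append(os.lstat(os.path.join(dirpath, f)).st_ctime)
        for d in dirnames:
            sub = os.path.join(dirpath, d)
            if sub in newest:          # already visited (bottom-up)
                times.append(newest[sub])
        rctime = max(times)
        newest[dirpath] = rctime
        set_rctime(dirpath, rctime)    # stand-in for the real xattr write
    return newest
```

On a real mount the callback might shell out to something like `setfattr -n ceph.dir.rctime -v "<sec>.<nsec>" <dir>`; the exact value format the MDS parses is an assumption taken from the patch and should be verified against the merged code.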