cephfs: fix future rctimes #37938

Closed
wants to merge 2 commits

Conversation

@thmour commented Nov 3, 2020

In our CephFS cluster we get users who import files with invalid (future) ctimes.
This propagates to the rctime of every ancestor directory up to the root, and those rctimes are not updated by new writes until the future date has passed.
Applications that rely on rctime to detect file changes efficiently (e.g. rsync-style synchronisation) can no longer do so.
Make rctime modifiable so sysadmins can repair it after correcting the invalid file's ctime.
Also stop mtime from updating rctime, because mtime is a user-modifiable field and can be set to a future date;
rctime should instead change only based on ctime.
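
For illustration, once the offending file's ctime has been corrected, a sysadmin could reset each ancestor directory's rctime through this vxattr from a user-space client. Below is a minimal sketch using the libcephfs C API; the path is made up, and the value format is assumed to be the same "seconds.nanoseconds" string that ceph.dir.rctime reports today (check the new handle_set_vxattr parsing for the exact accepted format):

#include <cephfs/libcephfs.h>
#include <cstdio>
#include <cstring>

int main() {
  struct ceph_mount_info *cmount = nullptr;
  if (ceph_create(&cmount, nullptr) < 0 ||
      ceph_conf_read_file(cmount, nullptr) < 0 ||
      ceph_mount(cmount, "/") < 0) {
    fprintf(stderr, "failed to mount cephfs\n");
    return 1;
  }

  // Assumed value format: "<seconds>.<nanoseconds>", matching what
  // getfattr -n ceph.dir.rctime reports for a directory today.
  const char *rctime = "1604361600.0";
  int r = ceph_setxattr(cmount, "/volumes/mygroup/mydir", "ceph.dir.rctime",
                        rctime, strlen(rctime), 0);
  if (r < 0)
    fprintf(stderr, "setting ceph.dir.rctime failed: %d\n", r);

  ceph_unmount(cmount);
  ceph_release(cmount);
  return r < 0 ? 1 : 0;
}

Repeating this from the fixed file's parent directory up to the root (bottom-up) clears the future rctime along the whole path.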

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@batrick added the cephfs (Ceph File System) and needs-review labels Nov 3, 2020
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

stale bot commented Jun 11, 2021

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

stale bot added the stale label Jun 11, 2021
@thmour (Author) commented Jun 23, 2021

@vshankar @batrick at CERN we run several iterations of ownCloud file sharing with a custom storage backend, and we are trying to create an ownCloud module that uses a Ceph backend with CephFS. In our clusters some files have future mtimes, which has affected the recursive ctimes we use to detect which files have changed and need to be synchronised. Because of the future rctimes we have to check the whole tree for changes (a huge performance impact). Can you have a look at this, please?
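
For context, the change-detection pattern described here boils down to comparing a directory's ceph.dir.rctime against the time of the last successful sync. A rough sketch with the libcephfs C API (the helper name and last_sync value are illustrative; cmount is an already mounted libcephfs handle):

#include <cephfs/libcephfs.h>
#include <cstdlib>

// Decide whether a subtree needs re-scanning by comparing its recursive
// ctime with the last successful sync time (seconds since the epoch).
bool subtree_changed_since(struct ceph_mount_info *cmount,
                           const char *dir, double last_sync)
{
  char buf[64] = {0};
  // ceph.dir.rctime is reported as "<seconds>.<nanoseconds>"
  int n = ceph_getxattr(cmount, dir, "ceph.dir.rctime", buf, sizeof(buf) - 1);
  if (n <= 0)
    return true;                // on error, fall back to a full scan
  return atof(buf) > last_sync; // second granularity is enough here
}

A future rctime makes this test return true forever, which is exactly the full-tree rescan problem described above.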

stale bot removed the stale label Jun 23, 2021
@vshankar requested a review from mchangir June 23, 2021 09:47
@vshankar (Contributor)

> @vshankar @batrick at CERN we run several iterations of ownCloud file sharing with a custom storage backend, and we are trying to create an ownCloud module that uses a Ceph backend with CephFS. In our clusters some files have future mtimes, which has affected the recursive ctimes we use to detect which files have changed and need to be synchronised. Because of the future rctimes we have to check the whole tree for changes (a huge performance impact). Can you have a look at this, please?

@thmour -- @mchangir is making really nice progress on fixing recursive stats in CephFS (especially for snapshots) -- requesting his review.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@dajrivera

I would also like to put in my vote for this to be merged. Currently I cannot use rctime as some of the directories have rctimes ~70 years in the future.

stale bot commented Jan 9, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

stale bot added the stale label Jan 9, 2022
@thmour (Author) commented Jan 10, 2022

@mchangir can you check this out? We need it for CephFS mirroring and for a project where we use Ceph as a backend for ownCloud. Some of our clusters have future rctimes, so we can't use these file synchronisation tools.

stale bot removed the stale label Jan 10, 2022
@vshankar (Contributor)

> @mchangir can you check this out? We need it for CephFS mirroring and for a project where we use Ceph as a backend for ownCloud. Some of our clusters have future rctimes, so we can't use these file synchronisation tools.

I'm mostly ok with this change as it provides a way to fix rctime inconsistencies for the user (note that fixing rctime esp. for snapshots is going to be a non-trivial effort).

I'll let @mchangir and @batrick review the details of the change though.

@vshankar (Contributor)

> @mchangir can you check this out? We need it for CephFS mirroring and for a project where we use Ceph as a backend for ownCloud. Some of our clusters have future rctimes, so we can't use these file synchronisation tools.

@thmour Side note - CephFS supports asynchronous mirroring of snapshots via the cephfs-mirror tool (available from the Ceph Pacific release onward).

@mchangir (Contributor) left a comment

LGTM
@vshankar fyi

Just a small question to @thmour
If rctime is updated manually, that rctime has to bubble up to the root dir or snap-root dir appropriately.
I don't see that happening in handle_set_vxattr()

@gregsfortytwo (Member) left a comment

Editing rctimes seems like a reasonable solution to me, but two issues:

  1. The wrong lock I commented on.
  2. Do we need to worry about the security model? I guess maybe it's the same as accepting client ctimes on files? Which is interesting because client timestamps used to be the default everywhere but we've been gradually moving away from it... :/

src/mds/Server.cc (outdated)
src/mds/Locker.cc
src/mds/Locker.cc
src/mds/MDCache.cc
@@ -5823,6 +5823,34 @@ void Server::handle_set_vxattr(MDRequestRef& mdr, CInode *cur)
    auto pi = cur->project_inode(mdr);
    cur->setxattr_ephemeral_dist(val);
    pip = pi.inode.get();
  } else if (name == "ceph.dir.rctime") {
    if (!cur->is_dir()) {
(Member)

How do you update an inode's rctime? You could fix all of the rctimes but none of the inodes?

@vshankar (Contributor)

> Recursive stats need to be maintained inside the MDS, or I'd expect all kinds of things to break when they later do get pushed around by changes. We shouldn't be twiddling them in user-accessible calls in ways which prevent that reconciliation from happening; @mchangir's next comment suggests this PR is behaving right so I think we're okay, though.

OK - nice. I was thinking about introducing more changes to this PR for that, which is unneeded - so +1

    }
    utime_t val(sec, nsec);

    if (!xlock_nestlock(mdr, cur))
(Contributor)

Fails with the following crash here: https://pulpito.ceph.com/vshankar-2022-02-05_17:27:49-fs-wip-vshankar-testing-20220201-113815-testing-default-smithi/6663067/

/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10531-g09fc2e9e/rpm/el8/BUILD/ceph-17.0.0-10531-g09fc2e9e/src/mds/Locker.cc: 4782: ceph_abort_msg("abort() called")

 ceph version 17.0.0-10531-g09fc2e9e (09fc2e9e470435f1e6dbf462a51853391f4a4cc0) quincy (dev)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f83dd2ec91f]
 2: (Locker::simple_lock(SimpleLock*, bool*)+0x588) [0x560d6e1a4108]
 3: (Locker::xlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl>&)+0x380) [0x560d6e1a5dd0]
 4: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x24a0) [0x560d6e1bb8c0]
 5: (Server::xlock_nestlock(boost::intrusive_ptr<MDRequestImpl>&, CInode*)+0xd2) [0x560d6e00fab2]
 6: (Server::handle_set_vxattr(boost::intrusive_ptr<MDRequestImpl>&, CInode*)+0x14d4) [0x560d6e036364]
 7: (Server::handle_client_setxattr(boost::intrusive_ptr<MDRequestImpl>&)+0xcd) [0x560d6e03979d]
 8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xc7b) [0x560d6e0600ab]
 9: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x566) [0x560d6e06bda6]
 10: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x133) [0x560d6e074fe3]
 11: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xd3c) [0x560d6dfcb74c]
 12: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7cb) [0x560d6dfce39b]
 13: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x560d6dfce9bc]
 14: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x560d6dfbd8b8]
 15: (DispatchQueue::entry()+0x14fa) [0x7f83dd57240a]
 16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f83dd628891]
 17: /lib64/libpthread.so.0(+0x814a) [0x7f83dc2c614a]
 18: clone()

I think this needs to be a wrlock rather than an xlock.

Also, BTW, as per the comment here(*), it seems we may want to acquire some additional locks? @batrick @gregsfortytwo What is dirlock? It seems to me like it is something that existed and got removed?

(Member)

> Also, BTW, as per the comment here(*), it seems we may want to acquire some additional locks? @batrick @gregsfortytwo What is dirlock? It seems to me like it is something that existed and got removed?

I ran git log -G dirlock -- src/mds/

and got

commit fca6c27
Author: Sage Weil sweil@redhat.com
Date: Thu Dec 18 10:36:47 2008 -0800

mds: kill the dirlock

LOL
sigh

So it looks like Locker::xlock_start() checks the current lock state and calls either simple_xlock() or simple_lock() based on that.

if (lock->get_state() == LOCK_LOCK || lock->get_state() == LOCK_XLOCKDONE) {
  mut->start_locking(lock);
  simple_xlock(lock);
} else {
  simple_lock(lock);
}

And we ended up in the simple_lock branch and it aborted because it doesn't recognize the current state when we get there. I really don't know why that might have happened — it might be that we just need the wrlock, but it also might be that we need more state guards in place elsewhere whose semantics I don't have in my head any more.

(Contributor)

Yeah, that state isn't valid for a scatterlock.

For the lock state machine in general, we would need an extensive design write-up covering each state and its nuances to really make sense of which states to add, and that isn't going to be straightforward. In the interest of time, switching to a wrlock might just be OK.
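
To make that concrete, switching the helper from an xlock to a wrlock on the inode's nestlock might look roughly like the sketch below. This is only an illustration against the interfaces visible in the backtrace above (MutationImpl::LockOpVec, Locker::acquire_locks); the helper name is hypothetical and this is not the PR's actual diff:

// Hypothetical helper, not the PR's code: take a wrlock on the nestlock
// instead of an xlock, so handle_set_vxattr() does not drive the scatterlock
// through the state transition that aborted above.
bool Server::wrlock_nestlock(MDRequestRef& mdr, CInode *cur)
{
  MutationImpl::LockOpVec lov;
  lov.add_wrlock(&cur->nestlock);   // instead of lov.add_xlock(&cur->nestlock)
  return mds->locker->acquire_locks(mdr, lov);
}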

It has been observed that clients create files with invalid ctimes on CephFS.
Even if those files are fixed, the parent directories up to the root keep their rctime set to an invalid value.
Make rctime modifiable so sysadmins can fix those invalid rctimes recursively (bottom-up).

Signed-off-by: Theofilos Mouratidis <t.mour@cern.ch>
@vshankar (Contributor)

Will run it through testing this week. Thanks, @thmour.

@vshankar (Contributor) commented Mar 2, 2022

Setting this xattr from the kclient fails since it's considered a read-only attribute - https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/xattr.c#L365

Test failure here - https://pulpito.ceph.com/vshankar-2022-02-28_04:32:15-fs-wip-vshankar-testing-20220226-211550-testing-default-smithi/6710176/

This would require the kclient to allow ceph.dir.rctime to be modified (this PR allows the user-space driver to set this xattr).

@jtlayton ^^

@mchangir (Contributor) commented Mar 2, 2022

@jtlayton If my getvxattr kclient PR is merged into the testing kernel, then this should have just worked?
But my userland PR is not merged yet.

@jtlayton (Contributor) commented Mar 2, 2022

> @jtlayton If my getvxattr kclient PR is merged into the testing kernel, then this should have just worked? But my userland PR is not merged yet.

No, I don't think so. This doesn't rely on GETVXATTR. The problem is that the rstats xattrs are marked READONLY so they aren't allowed to be changed currently. You'd need to add support for setting them to the kclient.

@vshankar (Contributor) commented Mar 2, 2022

>> @jtlayton If my getvxattr kclient PR is merged into the testing kernel, then this should have just worked? But my userland PR is not merged yet.
>
> No, I don't think so. This doesn't rely on GETVXATTR. The problem is that the rstats xattrs are marked READONLY so they aren't allowed to be changed currently. You'd need to add support for setting them to the kclient.

I'll send a fix for that.

    }
  }

  if (changed) {
    pi->ctime = std::max(pi->ctime, ceph_clock_now());
(Member)

The ctime change here seems buggy. The client could cache the ctime change, for example if the Fx or Ax caps are issued. What if the cached ctime is flushed to the MDS one minute later? The ctime here would then always be bumped forward by one minute?

    if (new_mtime > pi.inode->rstat.rctime)
      pi.inode->rstat.rctime = new_mtime;
  }
  pi.inode->ctime = ceph_clock_now();

@lxbsz (Member) commented Mar 17, 2022

Yeah, and only recovery will set the new_size; the mtime will also be read from the data pool objects. I don't see any case where a user-modified mtime could affect the ctime here.

@djgalloway changed the base branch from master to main July 7, 2022 00:00
github-actions bot commented Sep 5, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions bot removed the stale label Oct 28, 2022
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions bot added the stale label Dec 27, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

github-actions bot closed this Jan 26, 2023