Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mds: check relevant caps for fs include root_squash #57192

Merged
merged 13 commits into from
May 7, 2024
Merged

Conversation

batrick
Copy link
Member

@batrick batrick commented May 1, 2024

I'm not satisfied with this approach completely:

  • There should be a way to override the MDS and allow the broken root_squash caps to continue working. Evicting all affected clients is not acceptable.
  • Setting the client feature bit on the MDSMap has no effect.

Will think about this more tomorrow and start some discussions with the team.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
  • Component impact
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@batrick
Copy link
Member Author

batrick commented May 3, 2024

This PR is under test in https://tracker.ceph.com/issues/65771.

Copy link

github-actions bot commented May 3, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Copy link
Contributor

@leonid-s-usov leonid-s-usov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A speedy turnaround with this one, great job, Patrick!
As usual, I'm cautious about adding a managed cache to the internal state where const calculation is cheap. We can afford to scan a list of a few tens of sessions once in 5 seconds and thus avoid inflation of the managed state and potential bugs this can bring.

src/mds/SessionMap.h Outdated Show resolved Hide resolved
n += bits_per_block;
n /= bits_per_block;
_vec.resize(n, 0);
n += bits_per_block;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: something's wrong with the indentation in this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/mds/MDSAuthCaps.h Outdated Show resolved Hide resolved
src/mds/MDSAuthCaps.h Outdated Show resolved Hide resolved
src/mds/MDSAuthCaps.h Outdated Show resolved Hide resolved
qa/tasks/cephfs/test_admin.py Outdated Show resolved Hide resolved
and a client upgrade is required.

This is a HEALTH_ERR warning because of the danger of inconsistency and lost
data. It is recommended to either upgrade your clients, discontinue using
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lost data

What sort of scenario would this happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rishabh-d-dave
Copy link
Contributor

@batrick I know it's going to be squashed so it doesn't matter, but sign-off in last commit is missing. Signed-off-by check in CI failed due to it

Copy link
Member

@lxbsz lxbsz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@batrick batrick requested a review from chrisphoffman May 6, 2024 15:22
Comment on lines 583 to +584
self.update_attrs(**kwargs)

retval = self.mount(mntopts=mntopts, check_status=check_status)
retval = self.mount(mntopts=mntopts, check_status=check_status, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing kwargs is redundant because each k-v pair in kwargs are set object attributes by update_attrs()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commit message in qa: pass kwargs to mount from remount.

# should succeed
with self.assert_cluster_log("with broken root_squash implementation"):
keyring_path = self.mount_a.client_remote.mktemp(data=keyring)
self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.


CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK = 21
# all but CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK
features = ",".join([str(i) for i in range(CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.


captester = CapTester(self.mount_a, '/')
captester.conduct_pos_test_for_read_caps()
captester.conduct_pos_test_for_open_caps()

def test_rootsquash_nofeature(self):
"""
That having root_squash on an fs without the feature bit raises an HEALTH_ERR.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this line would currently look in teuthology.log -

2024-05-01T05:49:57.520 INFO:tasks.cephfs_test_runner:Starting test: test_add_already_in_use_data_pool (tasks.cephfs.test_admin.TestAddDataPool)
That command try to add data pool which is already in use with another fs.

So I suggest slight modification -

Suggested change
That having root_squash on an fs without the feature bit raises an HEALTH_ERR.
Test that having root_squash on an fs without the feature bit raises an HEALTH_ERR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Test ..." is implicit in this construction. I'm using the style begun by John Spray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but it's not printed in teuthology.log. I've copied an example from recent run.

Anyways, I don't really mind. Since this PR is urgent, I've approved it regardless of these nits.

@@ -1507,13 +1507,72 @@ def test_multifs_rootsquash_nofeature(self):
mntargs = [f"--client_debug_inject_features={features}"]

# should succeed
keyring_path = self.mount_a.client_remote.mktemp(data=keyring)
self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs1.name)
with self.assert_cluster_log("report clients with broken root_squash", present=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.

self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs1.name)
with self.assert_cluster_log("report clients with broken root_squash", present=False):
keyring_path = self.mount_a.client_remote.mktemp(data=keyring)
self.mount_a.remount(client_id=self.client_id, client_keyring_path=keyring_path, mntargs=mntargs, cephfs_name=self.fs1.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, will be better if wrap such lines before 81 chars.

@batrick
Copy link
Member Author

batrick commented May 6, 2024

Folks, I'm not considering further nits on this PR in the interest of time as this PR is blocking v18.2.3.

@batrick
Copy link
Member Author

batrick commented May 6, 2024

I will be rebasing and squashing fixes now.

For testing purposes.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
batrick added a commit to batrick/ceph that referenced this pull request May 7, 2024
* refs/pull/57192/head:
	PendingReleaseNotes: add note on the client incompatibility health warning and feature bit
	doc/cephfs: add client_mds_auth_caps client feature bit
	doc/cephfs: add missing client feature bits
	doc/cephfs: document MDS_CLIENTS_BROKEN_ROOTSQUASH health error
	qa: add tests for MDS_CLIENTS_BROKEN_ROOTSQUASH
	mds: raise health warning if client lacks feature for root_squash
	mon/MDSMonitor: add note about missing metadata inclusion
	mds: check relevant caps for fs include root_squash
	mds: refactor out fs_name match in MDSAuthCaps
	qa: test for root_squash with multiple caps
	qa: pass kwargs to mount from remount
	qa: simplify update_attrs and only update relevant keys
	client: allow overriding client features
@vshankar
Copy link
Contributor

vshankar commented May 7, 2024

Folks, I'm not considering further nits on this PR in the interest of time as this PR is blocking v18.2.3.

Agreed and +1 for this.

batrick added 12 commits May 7, 2024 08:19
So we can just pass the caller's kwargs to update_attrs.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
So we can pass mntargs.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Where the client has root_squash for one cap but not for another. The fs
without root_squash should not necessarily reject the client.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
When denying client reconnects because the MDS caps include root_squash and the
client features do not include CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK, ensure those
caps are only for the file system the MDS is joined to.

Fixes: https://tracker.ceph.com/issues/65733
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
There is a "client_count" metadata on the health warning that apparently was
intended to be used for aggregating warnings but never was. Add a TODO item for
that.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Rather than evict all clients lacking this feature bit, raise a health error
that pushes the administrator to address it. This avoids the surprise of having
all affected clients suddenly evicted in the cluster.

Fixes: https://tracker.ceph.com/issues/65733
Fixes: 954ed30
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
…rning and feature bit

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick
Copy link
Member Author

batrick commented May 7, 2024

jenkins test make check arm64

@batrick
Copy link
Member Author

batrick commented May 7, 2024

@batrick
Copy link
Member Author

batrick commented May 7, 2024

jenkins test make check

@batrick
Copy link
Member Author

batrick commented May 7, 2024

jenkins test make check arm64

@batrick batrick merged commit b54c9e8 into ceph:main May 7, 2024
11 checks passed
@batrick batrick deleted the i65733 branch May 7, 2024 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants