mgr/nfs: use object_format decorators to simplify response handling #46209

phlogistonjohn · 2022-05-09T18:35:49Z

Depends on: #45467

The changes above add a new Python module to the mgr: a general way to handle mgr command responses and formatting. As yet it is unused. This series of changes converts the nfs mgr module to use the object_format decorators. This serves both as a demonstration of how the decorators can be used and of how they benefit a mgr module: the code within the module becomes simpler and more pythonic. Only at the "outermost" layer do we concern ourselves with creating mgr response tuples. Because the nfs module has considerable complexity and history, not all existing APIs can use the simplest object_format.Responder decorator. A few edge cases (discussed below) were found, and these required the addition of two simpler decorators, "EmptyResponder" and "ErrorResponseHandler".

There are approximately three kinds of conversion that were done:

- A function that already returned JSON could be mapped directly to Responder.
- A function that makes changes to system state and (usually) returned no information was mapped to EmptyResponder.
- A function that passed raw data stored in a rados object to the client. In the one case of this, we used the ErrorResponseHandler to ensure exceptions of the right base class are converted to response tuples. Otherwise the only other change was to isolate the mgr response logic to the CLICommand function.
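The three kinds of conversion can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in written for illustration: `sketch_responder`, `sketch_empty_responder`, `sketch_error_handler` and `SketchErrorResponse` only mimic the idea and are not the actual object_format API, and the example functions are invented.

```python
import errno
import functools
import json
from typing import Any, Callable, Tuple

# Hypothetical stand-ins for illustration only; NOT the real object_format API.
MgrResponse = Tuple[int, str, str]  # (return code, body, status/error text)


class SketchErrorResponse(Exception):
    """An error that knows how to render itself as a response tuple."""

    def __init__(self, msg: str, return_value: int = -errno.EINVAL) -> None:
        super().__init__(msg)
        self.return_value = return_value

    def format_response(self) -> MgrResponse:
        return self.return_value, "", str(self)


def sketch_responder(fn: Callable[..., Any]) -> Callable[..., MgrResponse]:
    """Kind 1: the function returns plain data; serialize it as JSON."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> MgrResponse:
        try:
            return 0, json.dumps(fn(*args, **kwargs)), ""
        except SketchErrorResponse as err:
            return err.format_response()
    return wrapper


def sketch_empty_responder(fn: Callable[..., Any]) -> Callable[..., MgrResponse]:
    """Kind 2: the function changes state; any return value is ignored."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> MgrResponse:
        try:
            fn(*args, **kwargs)
            return 0, "", ""
        except SketchErrorResponse as err:
            return err.format_response()
    return wrapper


def sketch_error_handler(fn: Callable[..., MgrResponse]) -> Callable[..., MgrResponse]:
    """Kind 3: the function builds its own tuple; only errors are converted."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> MgrResponse:
        try:
            return fn(*args, **kwargs)
        except SketchErrorResponse as err:
            return err.format_response()
    return wrapper


@sketch_responder
def list_exports(cluster_id: str) -> list:
    return ["/a", "/b"]


@sketch_empty_responder
def remove_export(cluster_id: str, path: str) -> None:
    if path != "/a":
        raise SketchErrorResponse("no such export", -errno.ENOENT)


@sketch_error_handler
def get_raw_config(cluster_id: str) -> MgrResponse:
    return 0, "raw config text", ""
```

In this sketch the decorated functions never build response tuples for the common case, which is the simplification the description refers to.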

I am filing this PR in draft state until the predecessor PR is merged. As it'll be in draft, discussion of things like the general approach and naming is quite welcome.



phlogistonjohn · 2022-05-09T18:42:19Z

CC: @adk3798 @rkachach @epuertat
@ajarr may also be interested since you were making changes related to this topic recently

FWIW I still need to test parts of this. I'll leave it in draft until that's been done.

Note: I picked the nfs module because I recently worked in this area and was somewhat familiar with the module. If people have suggestions for other modules that would benefit from being updated to use object_format, please let me know and we can discuss that. I find that only when the "rubber hits the road" do we find out stuff that could be missing from the general approach of #45467.

adk3798

Seems pretty great. Even the conversion process from returning the old tuple to using the decorators looks fairly straightforward. Just a few minor things and questions from me.

src/pybind/mgr/object_format.py

src/pybind/mgr/nfs/export.py

adk3798 · 2022-06-14T20:02:00Z

src/pybind/mgr/nfs/export.py

+                self._delete_export(cluster_id=cluster_id, pseudo_path=None,
+                                    export_obj=export)
+            except Exception as e:
+                raise NFSException(f"Failed to delete export {export.export_id}: {e}")


Is the fact that we're not raising an ErrorResponse wrapping the other exception here, as in a lot of other cases, related to the fact that we're called by a function using the EmptyResponder decorator rather than the regular Responder?

I think you should find most (all?) of the ErrorResponse exceptions are being added where the code used to return error-condition tuples. In this case the code was already raising an NFSException. AFAICT, the only place that calls delete_all_exports is delete_nfs_cluster in cluster.py, which will end up catching all exceptions and using ErrorResponse.wrap on the caught exception.

Until I looked just now I assumed that some other code might call delete_all_exports and catch NFSException. I still think it's worth changing as little as possible, though.
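The flow described in this thread (an inner function raises a module-specific exception, and the single outer caller catches everything and wraps it) might look roughly like the sketch below. The class and function bodies are hypothetical stand-ins; only the names `delete_all_exports`, `NFSException` and `ErrorResponse.wrap` come from the discussion, and the real implementations may differ.

```python
import errno


class SketchNFSException(Exception):
    """Hypothetical stand-in for the module-specific exception."""


class SketchErrorResponse(Exception):
    """Hypothetical stand-in for an error that carries a return code."""

    def __init__(self, msg, return_value=-errno.EINVAL):
        super().__init__(msg)
        self.return_value = return_value

    @classmethod
    def wrap(cls, exc, return_value=-errno.EINVAL):
        # Pass through exceptions that are already of this type.
        if isinstance(exc, cls):
            return exc
        wrapped = cls(str(exc), return_value)
        wrapped.__cause__ = exc  # keep the original for debugging
        return wrapped


def delete_all_exports(export_ids):
    # Inner layer: keeps raising the module exception, as before.
    for export_id in export_ids:
        try:
            if export_id < 0:  # invented failure condition for the sketch
                raise ValueError("invalid export id")
        except Exception as e:
            raise SketchNFSException(f"Failed to delete export {export_id}: {e}")


def delete_cluster(export_ids):
    # Outer layer: the only caller; converts any failure in one place.
    try:
        delete_all_exports(export_ids)
    except Exception as e:
        raise SketchErrorResponse.wrap(e)
```

Under these assumptions the inner function does not need to know about response handling at all, which is consistent with leaving it unchanged.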

adk3798 · 2022-06-14T20:05:40Z

src/pybind/mgr/nfs/cluster.py

-                return 0, "", ""
-            return 0, "", "Cluster does not exist"
+                return
+            raise ErrorResponse("Cluster does not exist",


I see that this previously returned a return code of 0 even while including an error message. Is the change to now return -errno.ENOENT as the return code on purpose?

Hmm. That was probably an oversight, but I'm not sure what's best, retaining the somewhat strange old behavior of sending the success error code when this is clearly an error or trying to avoid any behavior change here. Would be happy to hear your opinion.

I guess it depends if we think it's more important for the command to be idempotent or to make sure users are aware that the cluster they deleted isn't there anymore. We actually had a command like this in cephadm that would just print success if you tried to delete a nonexistent service, but eventually somebody requested we have it print an error when you do that. So I guess maybe raising here is the right thing.

Yeah, my subconscious was probably encouraging the idea that errors should be errors. Otherwise it is not worth reporting at all. :-)

@ajarr flagged this in our standup today. See my review comment. :)
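To make the behavior change in this thread concrete, here is a small sketch contrasting the two options. The functions, the `set` used as cluster storage, and the `ClusterNotFound` class are all invented for illustration and are not the module's code.

```python
import errno


def delete_cluster_old(existing: set, cluster_id: str):
    """Old style: always return code 0, even when nothing was deleted."""
    if cluster_id in existing:
        existing.discard(cluster_id)
        return 0, "", ""
    return 0, "", "Cluster does not exist"


class ClusterNotFound(Exception):
    """Hypothetical error type carrying a negative errno as return code."""
    return_value = -errno.ENOENT


def delete_cluster_new(existing: set, cluster_id: str) -> None:
    """New style: a missing cluster is reported as an error."""
    if cluster_id not in existing:
        raise ClusterNotFound("Cluster does not exist")
    existing.discard(cluster_id)
```

The old form is idempotent from the caller's point of view, while the new form makes a script notice that the cluster was not there; that is the trade-off weighed above.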

src/pybind/mgr/nfs/module.py

src/pybind/mgr/nfs/cluster.py

src/pybind/mgr/nfs/export.py

src/pybind/mgr/nfs/tests/test_nfs.py

github-actions · 2022-06-15T17:56:33Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

phlogistonjohn · 2022-06-15T21:14:09Z

I've applied a number of small changes that I figured would be noncontroversial. There are a few open sub-threads that I have not acted on, pending replies from the reviewers. Threads I've acted upon should be marked as such.

Signed-off-by: John Mulligan <jmulligan@redhat.com>

…ator Signed-off-by: John Mulligan <jmulligan@redhat.com>

Signed-off-by: John Mulligan <jmulligan@redhat.com>

The "export apply" functionality is unusual in that it allows either one or multiple nested requests to change or create an export. The previous implementation would concatenate the results of multiple change operations into a single string. It also would continue making changes if one case failed, adding the error to the string and setting a non-zero error code. The updated version keeps the general behavior but returns structured JSON (or other formatted data) with one entry per change request. In order to accomplish this and match the old behavior as closely as possible we add an intermediate type (AppliedExportResults) that can return both the response data (the `to_simplified` method) and track if there was a failure mixed in with the various updates (the `mgr_return_value` method). Signed-off-by: John Mulligan <jmulligan@redhat.com>
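A rough sketch of the intermediate type described in this commit message follows. Only the method names `to_simplified` and `mgr_return_value` come from the text; the class name, the fields, the `append` helper, the shape of each entry, and the choice of `-errno.EIO` as the failure code are assumptions made for illustration.

```python
import errno
from typing import Dict, List


class AppliedResultsSketch:
    """Hypothetical stand-in: collects one entry per change request."""

    def __init__(self) -> None:
        self.changes: List[Dict[str, str]] = []
        self.has_error = False

    def append(self, result: Dict[str, str]) -> None:
        # Keep going after a failure, but remember that one happened.
        if result.get("state") == "error":
            self.has_error = True
        self.changes.append(result)

    def to_simplified(self) -> List[Dict[str, str]]:
        # Structured data that a formatter can render as JSON, etc.
        return self.changes

    def mgr_return_value(self) -> int:
        # Assumed error code; non-zero if any single request failed.
        return -errno.EIO if self.has_error else 0


results = AppliedResultsSketch()
results.append({"pseudo": "/a", "state": "added"})
results.append({"pseudo": "/b", "state": "error", "msg": "invalid request"})
```

Separating the data from the return value lets every request be reported individually while the overall command still signals failure.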

Signed-off-by: John Mulligan <jmulligan@redhat.com>

…rator Signed-off-by: John Mulligan <jmulligan@redhat.com>

…decorator Signed-off-by: John Mulligan <jmulligan@redhat.com>

This decorator is no longer needed as equivalent functionality is handled internally by the class. Signed-off-by: John Mulligan <jmulligan@redhat.com>

This function is now unused as we no longer need to coerce exceptions into response tuples at this layer in the code. Signed-off-by: John Mulligan <jmulligan@redhat.com>

These formatting changes are made by autopep8 when running tox. Signed-off-by: John Mulligan <jmulligan@redhat.com>

The patches that add object formatting / decorators to the nfs module also made error handling more generic when accessing an nfs cluster, which now returns a nonzero exit code. A test that only checked an error message was added after the PR adding the object format support. Update the test to match the new nfs module behavior and fix a typo. Signed-off-by: John Mulligan <jmulligan@redhat.com>

phlogistonjohn · 2023-01-18T20:58:57Z

@adk3798 when do you think we can next retry merging this?

adk3798 · 2023-01-18T22:26:34Z

> @adk3798 when do you think we can next retry merging this?

As soon as CentOS 8 builds are possible again, which I think is when pushing to and pulling from the quay.ceph.io container registry works again. I tried testing it last weekend, but that's when the builds broke. Since this seemed to cause a test failure in the last run, we can't really merge it without checking that the patch you added for the test works.

adk3798 · 2023-01-22T16:16:02Z

https://pulpito.ceph.com/adking-2023-01-21_05:38:21-orch:cephadm-wip-adk-testing-2023-01-20-1359-distro-default-smithi/

Lots of failures (13) but all accounted for

- 3 instances of 'list' object has no attribute 'get'. The failure was on a new test for setting mon crush location introduced by a PR in the run. That PR can't be merged, but this shouldn't block merging anything else.
- 2 instances of the test_cephadm.sh test failing, caused by the same mon crush location PR. Failed with local variable 'config_json' referenced before assignment
- 1 instance of https://tracker.ceph.com/issues/49287, known issue
- 2 instances of https://tracker.ceph.com/issues/58526, known issue
- 2 instances of https://tracker.ceph.com/issues/58535, known issue
- 1 instance of https://tracker.ceph.com/issues/57771, known issue
- 1 instance of https://tracker.ceph.com/issues/57755, known issue
- 1 failed basic smoke test that timed out waiting for all 8 OSDs to come up. Did 5 reruns of this test with the same build (https://pulpito.ceph.com/adking-2023-01-22_15:28:20-orch:cephadm-wip-adk-testing-2023-01-20-1359-distro-default-smithi/), all of which passed. My guess is an I/O error creating the last OSD, which I've seen happen (infrequently) when we make a bunch of OSDs on the split-up nvme drive we use for having multiple OSDs on these machines. Considering it a one-off for now, and not blocking merging over it.

Overall, PRs in the run should be okay to merge other than the mon crush location one causing failures. Will start to try and clean up the test suite now that we're able to make builds and run tests again.

adk3798 · 2023-01-22T16:36:45Z

@phlogistonjohn were we planning to backport this to quincy? I know we did backport the original PR that added the decorator.

phlogistonjohn · 2023-01-23T14:50:53Z

No, IIRC we discussed this in one of the Weekly meetings and opted not to backport it. If we find out that not backporting it means we have difficulties backporting other things we can "easily" change course and backport this series later.

when you create an nfs export from dashboard it leaves this traceback and error

```
Feb 09 14:15:54 ceph-node-00 ceph-mgr[3235]: [dashboard ERROR taskexec] Error while calling Task(ns=nfs/create, md={'path': 'e2e.nfs.bucket', 'fsal': 'RGW', 'cluster_id': 'testnfs'})
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 550, in _run
    val = self.task.fn(*self.task.fn_args, **self.task.fn_kwargs)  # type: ignore
  File "/usr/share/ceph/mgr/dashboard/controllers/nfs.py", line 148, in create
    ret, _, err = export_mgr.apply_export(cluster_id, json.dumps(raw_ex))
TypeError: 'AppliedExportResults' object is not iterable
Feb 09 14:15:54 ceph-node-00 ceph-mgr[3235]: [dashboard INFO taskmgr] finished Task(ns=nfs/create, md={'path': 'e2e.nfs.bucket', 'fsal': 'RGW', 'cluster_id': 'testnfs'})
Feb 09 14:15:54 ceph-node-00 ceph-mgr[3235]: [dashboard INFO request] [::ffff:192.168.100.1:43896] [POST] [500] [0.767s] [admin] [172.0B] /api/nfs-ganesha/export
```

This started after ceph#46209, so dashboard code needs to be adapted

Fixes: https://tracker.ceph.com/issues/58681
Signed-off-by: Nizamudeen A <nia@redhat.com>

when you create/edit an nfs export from dashboard it leaves this traceback and error

```
Feb 09 14:15:54 ceph-node-00 ceph-mgr[3235]: [dashboard ERROR taskexec] Error while calling Task(ns=nfs/create, md={'path': 'e2e.nfs.bucket', 'fsal': 'RGW', 'cluster_id': 'testnfs'})
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/tools.py", line 550, in _run
    val = self.task.fn(*self.task.fn_args, **self.task.fn_kwargs)  # type: ignore
  File "/usr/share/ceph/mgr/dashboard/controllers/nfs.py", line 148, in create
    ret, _, err = export_mgr.apply_export(cluster_id, json.dumps(raw_ex))
TypeError: 'AppliedExportResults' object is not iterable
Feb 09 14:15:54 ceph-node-00 ceph-mgr[3235]: [dashboard INFO taskmgr] finished Task(ns=nfs/create, md={'path': 'e2e.nfs.bucket', 'fsal': 'RGW', 'cluster_id': 'testnfs'})
Feb 09 14:15:54 ceph-node-00 ceph-mgr[3235]: [dashboard INFO request] [::ffff:192.168.100.1:43896] [POST] [500] [0.767s] [admin] [172.0B] /api/nfs-ganesha/export
```

This started after ceph#46209, so dashboard code needs to be adapted

Fixes: https://tracker.ceph.com/issues/58681
Signed-off-by: Nizamudeen A <nia@redhat.com>
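The failure in the traceback can be reproduced in isolation: unpacking a result object as if it were a three-element tuple raises TypeError. The sketch below uses an invented `ResultsSketch` class and `apply_export_sketch` function, and the "adapted" call is only one plausible way a caller could consume such an object; it is not the fix that was actually applied to the dashboard.

```python
class ResultsSketch:
    """Hypothetical result object; not iterable, so it cannot be unpacked."""

    def __init__(self, changes, has_error=False):
        self.changes = changes
        self.has_error = has_error

    def to_simplified(self):
        return self.changes

    def mgr_return_value(self):
        return -1 if self.has_error else 0  # placeholder error code


def apply_export_sketch():
    # Stand-in for a call that used to return (ret, out, err).
    return ResultsSketch([{"pseudo": "/a", "state": "added"}])


# Old calling convention: fails because the object is not a tuple.
unpack_failed = False
try:
    ret, _, err = apply_export_sketch()
except TypeError:
    unpack_failed = True

# Adapted calling convention: ask the object for what is needed.
result = apply_export_sketch()
ret = result.mgr_return_value()
data = result.to_simplified()
```

Callers inside the mgr that bypass the CLI layer receive the Python object directly, which is why this kind of change can surface outside the nfs module itself.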

github-actions bot added mgr nfs orchestrator pybind labels May 9, 2022

phlogistonjohn added the cephadm label May 9, 2022

phlogistonjohn force-pushed the jjm-format-nfs-mod branch from 8c8adca to 9fee8c2 Compare May 13, 2022 14:49

github-actions bot added build/ops CI Continuous Integration labels May 13, 2022

phlogistonjohn force-pushed the jjm-format-nfs-mod branch 2 times, most recently from f9d56bd to 695762b Compare May 24, 2022 15:35

phlogistonjohn marked this pull request as ready for review May 24, 2022 17:46

djgalloway changed the base branch from master to main May 25, 2022 19:59

phlogistonjohn force-pushed the jjm-format-nfs-mod branch from 695762b to e3159fc Compare June 9, 2022 18:19

adk3798 reviewed Jun 14, 2022


adk3798 requested review from rkachach, epuertat and ajarr June 14, 2022 21:04