feat: enhance microceph orchestrator — apply/remove/restart methods, tests, CI#709
feat: enhance microceph orchestrator — apply/remove/restart methods, tests, CI#709UtkarshBhatthere wants to merge 4 commits into
Conversation
|
This PR SHOULD be merged post tentacle switch. |
There was a problem hiding this comment.
Pull request overview
Extends the MicroCeph orchestrator plugin from read-only queries to full lifecycle management by adding service apply/remove/restart operations, hardening existing read paths, and adding unit + CI/integration test coverage.
Changes:
- Add apply/remove/restart orchestrator methods and shared helpers in
microceph.module. - Extend the Python MicroCeph client with enable/delete/restart service endpoints and safer list methods.
- Add a Python unit test suite plus a shell-based integration test script, wired into GitHub Actions.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
microceph-orch/src/microceph/module.py |
Adds lifecycle methods (apply/remove/restart) and improves host/service/daemon/inventory listing behavior. |
microceph-orch/src/microceph/client/cluster.py |
Adds service enable/delete/restart APIs and normalizes list methods to return [] instead of None. |
microceph-orch/tests/test_module.py |
Adds orchestrator unit tests covering new lifecycle methods and updated read operations. |
microceph-orch/tests/test_client.py |
Adds client-layer unit tests for the new service endpoints and list-method behavior. |
microceph-orch/tests/stubs.py |
Provides minimal Ceph/orchestrator type stubs to run tests outside the snap environment. |
microceph-orch/tests/conftest.py |
Injects mocked Ceph/snap-only modules and provides fixtures for orchestrator/client testing. |
tests/scripts/test-orch.sh |
Adds CLI-level integration tests for ceph orch operations (status/ls/ps/device/apply/rm/restart). |
.github/workflows/tests.yml |
Runs Python unit tests in CI and executes the new integration script in single- and multi-node jobs. |
.gitignore |
Ignores Python bytecode/cache artifacts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Implement apply methods: rgw, nfs, rbd-mirror, mon, mgr, mds, cephfs-mirror - Implement remove_service, remove_host, service_action (restart) - Add client methods: enable_service, delete_service, delete_nfs_service, restart_services - Harden existing methods: null safety, filter support, error handling - Improve get_inventory: use OSD disk list + cluster members for complete host coverage - Add 77 unit tests with mocked Ceph environment - Consolidate service name parsing (_parse_service_name) Known limitations: - Per-host targeting not supported (unix socket, local node only) - RGW SSL not configurable (Ceph spec lacks private key field) Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Co-authored-by: Claude <noreply@anthropic.com>
af2d1cc to
d0d376b
Compare
- Add Python unit tests (77 tests) to unit-tests job - Add single-node orchestrator integration test after RGW tests - Add multi-node orchestrator integration test to multi-node job - Add test-orch.sh to tests/scripts/ alongside existing test infra - NFS tests are resilient to environments where NFS service fails to start Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Co-authored-by: Claude <noreply@anthropic.com>
d0d376b to
662d0e9
Compare
- Add stub smb mgr module (patches/0003) to satisfy dashboard imports. The ceph-mgr-smb package is not available in the Ubuntu distribution but the tentacle dashboard imports from it unconditionally. - Add python3-openssl to snap stage-packages for dashboard SSL support. Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com> Co-authored-by: Claude <noreply@anthropic.com>
88e03b9 to
82aa9ec
Compare
- Match existing service hosts by (service, group_id) so a new NFS cluster on a host that already runs a different NFS cluster isn't incorrectly skipped as "already active". - Drop unused host_count variable in test-orch.sh; the existing assert_contains already checks for the expected string. Copilot unit-tests-install concern (#3056285558) is a non-issue: the workflow already runs pip install pytest before pytest. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
4b84c72 to
0266ffc
Compare
- Match existing service hosts by (service, group_id) so a new NFS cluster on a host that already runs a different NFS cluster isn't incorrectly skipped as "already active". - Drop unused host_count variable in test-orch.sh; the existing assert_contains already checks for the expected string. Copilot unit-tests-install concern (#3056285558) is a non-issue: the workflow already runs pip install pytest before pytest. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
bc411e8 to
fbdd459
Compare
johnramsden
left a comment
There was a problem hiding this comment.
This looks very thorough, great test suite.
It would be good to add some documentation regarding the functionality of the module.
For the UseTarget functionality, could we simply pass the Target and host name along with the request (/1.0/services?target={hostname}) and it will be proxied from the local node? In a quick local test this seemed to work (at least I did not receive an error when passing along a valid target, and it did error when I passed an invalid target).
sabaini
left a comment
There was a problem hiding this comment.
Hey @UtkarshBhatthere some drive-by comments -- shaping up well!
| NOTE: Per-host targeting is not yet supported. The Python client | ||
| connects via unix socket to the local node only. MicroCeph's Go | ||
| client uses UseTarget() to proxy to specific hosts, but this is | ||
| not exposed through the socket. See todo #6. |
There was a problem hiding this comment.
Ack -- this should be a blocker for this PR as otherwise we'd get commands applied on the wrong host
There was a problem hiding this comment.
FTr. should also add a test that checks for this
| name=service_name, | ||
| payload=payload, | ||
| wait=True, | ||
| ) |
There was a problem hiding this comment.
Hm, does this respect spec.service_id?
| for item in resources: | ||
| if isinstance(item, dict) and 'disks' in item: | ||
| for disk in item.get('disks', []): | ||
| local_disk_ids.add(disk.get('id', '')) |
There was a problem hiding this comment.
Do we ever use local_disk_ids?
| return self._apply_service("cephfs-mirror", spec, "{}") | ||
|
|
||
| @handle_orch_error | ||
| def remove_service(self, service_name: str, force: bool = False) -> OrchResult[str]: |
There was a problem hiding this comment.
suggest to use _force as it's not being used except as a placeholder
fbdd459 to
88e03b9
Compare
Summary
Extends the MicroCeph orchestrator module from read-only to full lifecycle management. Adds write operations (apply, remove, restart), hardens existing read operations, and introduces comprehensive unit and integration tests.
Changes
Orchestrator module (
microceph-orch/src/microceph/module.py)New methods:
apply_rgw,apply_nfs,apply_rbd_mirror,apply_mon,apply_mgr,apply_mds,apply_cephfs_mirror— enable services via the MicroCeph APIremove_service— remove services (NFS-aware, passescluster_id)remove_host— remove a host from the clusterservice_action— restart services (only action currently supported by MicroCeph)Improved existing methods:
get_hosts— safe access with.get(), handles missing:in addressdescribe_service— now respectsservice_namefilter paramlist_daemons— added missing filters (service_name,daemon_id,host), safe NFS info parsingget_inventory— cross-references/1.0/resources(all devices) with/1.0/disks(OSD devices) to report both available and used disksavailable— catches generic exceptions beyondRemoteExceptionHelpers:
_parse_service_name— replaces_elaborate_service, usespartition()to correctly handle dotted names likenfs.my.cluster_apply_service— common apply logic with existing-service check_get_existing_service_hosts— queries running services to skip re-enablementClient layer (
microceph-orch/src/microceph/client/cluster.py)enable_service(name, payload, wait)—PUT /1.0/services/<name>delete_service(name)—DELETE /1.0/services/<name>delete_nfs_service(cluster_id)—DELETE /1.0/services/nfswith bodyrestart_services(services)—POST /1.0/services/restartlist_*methods now returnor []instead of potentiallyNoneTests
microceph-orch/tests/) — mocked Ceph environment, covers all orchestrator methods including error pathstests/scripts/test-orch.sh) — 26ceph orchCLI tests, validated on single-node and 3-node clustersunit-testsjob, integration tests in bothsingle-system-testsandmulti-node-testsjobsKnown limitations
UseTarget()to proxy to specific hosts, but this is not exposed through the socket. Placement specs are validated but services are placed on the local node.RGWSpechas no private key field. MicroCeph requires both cert and key for SSL. Until a key injection mechanism is added, RGW deploys in non-SSL mode.Testing
cd microceph-orch && PYTHONPATH=src:tests python3 -m pytest tests/ -vtests/scripts/test-orch.sh