Skip to content

feat: enhance microceph orchestrator — apply/remove/restart methods, tests, CI#709

Closed
UtkarshBhatthere wants to merge 4 commits into
release/tentaclePlusV3from
feat/tentacle/orchPlus
Closed

feat: enhance microceph orchestrator — apply/remove/restart methods, tests, CI#709
UtkarshBhatthere wants to merge 4 commits into
release/tentaclePlusV3from
feat/tentacle/orchPlus

Conversation

@UtkarshBhatthere
Copy link
Copy Markdown
Contributor

Summary

Extends the MicroCeph orchestrator module from read-only to full lifecycle management. Adds write operations (apply, remove, restart), hardens existing read operations, and introduces comprehensive unit and integration tests.

Changes

Orchestrator module (microceph-orch/src/microceph/module.py)

New methods:

  • apply_rgw, apply_nfs, apply_rbd_mirror, apply_mon, apply_mgr, apply_mds, apply_cephfs_mirror — enable services via the MicroCeph API
  • remove_service — remove services (NFS-aware, passes cluster_id)
  • remove_host — remove a host from the cluster
  • service_action — restart services (only action currently supported by MicroCeph)

Improved existing methods:

  • get_hosts — safe access with .get(), handles missing : in address
  • describe_service — now respects service_name filter param
  • list_daemons — added missing filters (service_name, daemon_id, host), safe NFS info parsing
  • get_inventory — cross-references /1.0/resources (all devices) with /1.0/disks (OSD devices) to report both available and used disks
  • available — catches generic exceptions beyond RemoteException

Helpers:

  • _parse_service_name — replaces _elaborate_service, uses partition() to correctly handle dotted names like nfs.my.cluster
  • _apply_service — common apply logic with existing-service check
  • _get_existing_service_hosts — queries running services to skip re-enablement

Client layer (microceph-orch/src/microceph/client/cluster.py)

  • enable_service(name, payload, wait)PUT /1.0/services/<name>
  • delete_service(name)DELETE /1.0/services/<name>
  • delete_nfs_service(cluster_id)DELETE /1.0/services/nfs with body
  • restart_services(services)POST /1.0/services/restart
  • All list_* methods now return or [] instead of potentially None

Tests

  • 77 unit tests (microceph-orch/tests/) — mocked Ceph environment, covers all orchestrator methods including error paths
  • Integration test script (tests/scripts/test-orch.sh) — 26 ceph orch CLI tests, validated on single-node and 3-node clusters
  • CI integration — Python unit tests in unit-tests job, integration tests in both single-system-tests and multi-node-tests jobs

Known limitations

  • Per-host service targeting: The Python client connects via unix socket to the local node only. The Go client uses UseTarget() to proxy to specific hosts, but this is not exposed through the socket. Placement specs are validated but services are placed on the local node.
  • RGW SSL: The Ceph RGWSpec has no private key field. MicroCeph requires both cert and key for SSL. Until a key injection mechanism is added, RGW deploys in non-SSL mode.

Testing

  • Unit tests: cd microceph-orch && PYTHONPATH=src:tests python3 -m pytest tests/ -v
  • Integration (single-node): install snap, bootstrap, enable orchestrator, run tests/scripts/test-orch.sh
  • Integration (multi-node): 3-node cluster, all 26 tests pass

@UtkarshBhatthere
Copy link
Copy Markdown
Contributor Author

This PR SHOULD be merged post tentacle switch.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Extends the MicroCeph orchestrator plugin from read-only queries to full lifecycle management by adding service apply/remove/restart operations, hardening existing read paths, and adding unit + CI/integration test coverage.

Changes:

  • Add apply/remove/restart orchestrator methods and shared helpers in microceph.module.
  • Extend the Python MicroCeph client with enable/delete/restart service endpoints and safer list methods.
  • Add a Python unit test suite plus a shell-based integration test script, wired into GitHub Actions.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
microceph-orch/src/microceph/module.py Adds lifecycle methods (apply/remove/restart) and improves host/service/daemon/inventory listing behavior.
microceph-orch/src/microceph/client/cluster.py Adds service enable/delete/restart APIs and normalizes list methods to return [] instead of None.
microceph-orch/tests/test_module.py Adds orchestrator unit tests covering new lifecycle methods and updated read operations.
microceph-orch/tests/test_client.py Adds client-layer unit tests for the new service endpoints and list-method behavior.
microceph-orch/tests/stubs.py Provides minimal Ceph/orchestrator type stubs to run tests outside the snap environment.
microceph-orch/tests/conftest.py Injects mocked Ceph/snap-only modules and provides fixtures for orchestrator/client testing.
tests/scripts/test-orch.sh Adds CLI-level integration tests for ceph orch operations (status/ls/ps/device/apply/rm/restart).
.github/workflows/tests.yml Runs Python unit tests in CI and executes the new integration script in single- and multi-node jobs.
.gitignore Ignores Python bytecode/cache artifacts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread microceph-orch/src/microceph/module.py
Comment thread .github/workflows/tests.yml
Comment thread tests/scripts/test-orch.sh
- Implement apply methods: rgw, nfs, rbd-mirror, mon, mgr, mds, cephfs-mirror
- Implement remove_service, remove_host, service_action (restart)
- Add client methods: enable_service, delete_service, delete_nfs_service, restart_services
- Harden existing methods: null safety, filter support, error handling
- Improve get_inventory: use OSD disk list + cluster members for complete host coverage
- Add 77 unit tests with mocked Ceph environment
- Consolidate service name parsing (_parse_service_name)

Known limitations:
- Per-host targeting not supported (unix socket, local node only)
- RGW SSL not configurable (Ceph spec lacks private key field)

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Co-authored-by: Claude <noreply@anthropic.com>
@UtkarshBhatthere UtkarshBhatthere force-pushed the feat/tentacle/orchPlus branch 3 times, most recently from af2d1cc to d0d376b Compare April 9, 2026 09:45
UtkarshBhatthere and others added 2 commits April 9, 2026 18:39
- Add Python unit tests (77 tests) to unit-tests job
- Add single-node orchestrator integration test after RGW tests
- Add multi-node orchestrator integration test to multi-node job
- Add test-orch.sh to tests/scripts/ alongside existing test infra
- NFS tests are resilient to environments where NFS service fails to start

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Co-authored-by: Claude <noreply@anthropic.com>
@UtkarshBhatthere UtkarshBhatthere force-pushed the feat/tentacle/orchPlus branch from d0d376b to 662d0e9 Compare April 9, 2026 13:10
- Add stub smb mgr module (patches/0003) to satisfy dashboard imports.
  The ceph-mgr-smb package is not available in the Ubuntu distribution
  but the tentacle dashboard imports from it unconditionally.
- Add python3-openssl to snap stage-packages for dashboard SSL support.

Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Co-authored-by: Claude <noreply@anthropic.com>
@UtkarshBhatthere UtkarshBhatthere changed the base branch from main to release/tentaclePlusV3 April 13, 2026 16:52
UtkarshBhatthere added a commit that referenced this pull request Apr 13, 2026
- Match existing service hosts by (service, group_id) so a new
  NFS cluster on a host that already runs a different NFS
  cluster isn't incorrectly skipped as "already active".
- Drop unused host_count variable in test-orch.sh; the existing
  assert_contains already checks for the expected string.

Copilot unit-tests-install concern (#3056285558) is a non-issue:
the workflow already runs pip install pytest before pytest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
UtkarshBhatthere added a commit that referenced this pull request Apr 14, 2026
- Match existing service hosts by (service, group_id) so a new
  NFS cluster on a host that already runs a different NFS
  cluster isn't incorrectly skipped as "already active".
- Drop unused host_count variable in test-orch.sh; the existing
  assert_contains already checks for the expected string.

Copilot unit-tests-install concern (#3056285558) is a non-issue:
the workflow already runs pip install pytest before pytest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Copy link
Copy Markdown
Member

@johnramsden johnramsden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very thorough, great test suite.

It would be good to add some documentation regarding the functionality of the module.

For the UseTarget functionality, could we simply pass the Target and host name along with the request (/1.0/services?target={hostname}) and it will be proxied from the local node? In a quick local test this seemed to work (at least I did not receive an error when passing along a valid target, and it did error when I passed an invalid target).

Copy link
Copy Markdown
Collaborator

@sabaini sabaini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @UtkarshBhatthere some drive-by comments -- shaping up well!

NOTE: Per-host targeting is not yet supported. The Python client
connects via unix socket to the local node only. MicroCeph's Go
client uses UseTarget() to proxy to specific hosts, but this is
not exposed through the socket. See todo #6.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack -- this should be a blocker for this PR as otherwise we'd get commands applied on the wrong host

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTr. should also add a test that checks for this

name=service_name,
payload=payload,
wait=True,
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, does this respect spec.service_id?

for item in resources:
if isinstance(item, dict) and 'disks' in item:
for disk in item.get('disks', []):
local_disk_ids.add(disk.get('id', ''))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever use local_disk_ids?

return self._apply_service("cephfs-mirror", spec, "{}")

@handle_orch_error
def remove_service(self, service_name: str, force: bool = False) -> OrchResult[str]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to use _force as it's not being used except as a placeholder

@UtkarshBhatthere UtkarshBhatthere deleted the feat/tentacle/orchPlus branch April 28, 2026 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants