fix: Resolve referencing stale paths#708
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR fixes MicroCeph’s use of revision-specific snap paths (e.g., /var/snap/microceph/<rev>/run) by switching to the stable current symlink and adding migrations/tests to repair existing stale config entries.
Changes:
- Update path constants and config generation to use
/var/snap/microceph/current/...instead of revision-specific$SNAP_DATA/.... - Add a startup migration (
migrateStaleRunDir) to rewrite stalerun dir/CCacheDirentries in existing configs. - Add unit/integration coverage for the migration and hook it into CI.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
microceph/constants/constants.go |
Derives ConfPath/RunPath via the stable current symlink. |
microceph/ceph/start.go |
Adds migration that rewrites stale run-dir paths and runs it during daemon startup. |
microceph/ceph/start_test.go |
Adds unit tests for stale run-dir migration behavior. |
microceph/tests/testutils.go |
Mirrors snapd’s current -> SNAP_DATA symlink in tests. |
microceph/ceph/config.go |
Uses constants.GetPathConst() for conf/run paths in config updates. |
microceph/ceph/join.go |
Uses constants.GetPathConst().ConfPath for ceph.conf path discovery. |
microceph/ceph/rgw_test.go |
Asserts RGW config uses current/run. |
tests/scripts/actionutils.sh |
Adds integration test functions to validate migration for RGW and NFS. |
.github/workflows/tests.yml |
Runs the new integration tests in CI. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -316,6 +406,7 @@ func Start(ctx context.Context, s interfaces.StateInterface) error { | |||
| case <-time.After(10 * time.Second): | |||
| } | |||
| } | |||
There was a problem hiding this comment.
Re-enablement is guarded by + "" + serviceStartMu + "" + , but the newly-added config migration runs outside that lock. Since the migration edits config files that other concurrent operations/hooks can also rewrite (e.g., config regeneration during + "" + snapctl + "" + actions), it can race and cause lost updates or inconsistent state. Consider taking the same mutex around + "" + migrateStaleRunDir() + "" + and + "" + reEnableServices() + "" + (mandatory if other code paths can concurrently write these configs).
| } | |
| } | |
| serviceStartMu.Lock() | |
| defer serviceStartMu.Unlock() |
There was a problem hiding this comment.
The restart actions do not occur in parallel with enabling services, the following happens sequentially:
migrateStaleRunDir()
reEnableServices(ctx, s) I don't think this is the right lock for the job. If we didn't want to consider locking on file updates what we'd probably want to look at would be having a separate lock that covers all situations where configuration might be modified.
For example enabling RGW. I don't believe this is an actual concern though because the writing of the configuration file is atomic so you cannot have the situation where a file is half written and therefore the migration would be attempting to happen in the middle of a rewrite.
|
Addressed AI reviews & rewrote commits |
sabaini
left a comment
There was a problem hiding this comment.
@johnramsden great catch 🙏 one tiny style nit otherwise lgtm
UtkarshBhatthere
left a comment
There was a problem hiding this comment.
The changes overall lgtm, one nit from me and a pending comment from Peter.
SNAP_DATA resolves to a revision-specific directory at runtime (e.g.
/var/snap/microceph/1630). Paths derived from it and baked into config
files become dangling references after a snap refresh garbage-collects
old revision directories.
Fix the root cause in GetPathConst() by deriving ConfPath and RunPath
from the stable 'current' symlink that snapd always maintains pointing
to the active revision. Remove per-callsite os.Getenv("SNAP_DATA") path
construction in config.go, join.go, and start.go so all consumers
inherit the correct paths automatically.
Add migrateStaleRunDir(), called on every daemon startup before
reEnableServices, to rewrite any revision-specific run dir found in
existing radosgw.conf or ganesha.conf files. This repairs installations
that were broken by a snap refresh before this fix was applied.
Use errors.Is(err, os.ErrNotExist) in fixConfigLine rather than the
deprecated os.IsNotExist().
Closes: canonical#705
Closes: canonical#159
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Add assertion to TestEnableRGW that the generated radosgw.conf uses the stable 'current' symlink for run dir rather than a revision-specific path. Add migrateStaleRunDir test cases covering: no config files present, already-correct paths left unchanged, and stale revision-specific paths rewritten correctly for both radosgw.conf and ganesha.conf. Fix ReadCephConfig to T.Fatal on missing files rather than silently returning an empty string. Replace uses of the duplicate readConf helper in start_test.go with ReadCephConfig. Fix two rgw_test.go cases that relied on silent error behaviour to assert a file was absent; use os.Stat + os.IsNotExist instead. Add integration test functions test_rgw_stale_run_dir_migration and test_nfs_stale_run_dir_migration to actionutils.sh, and wire them into the single-system CI job. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Split one-line assign/test constructs into separate assign and if statements per CONTRIBUTING.md convention, and replace the literal 0644 permission with constants.PermissionUserRwWorldRAccess. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: John Ramsden <john.ramsden@canonical.com>
# Description Microceph references old snap paths which can be recycled after multiple snap refreshes. Full reproduction of the issue can be found here #159 (comment) The solution is to use the symbolic link which is always present pointing to current. This is set up by the snap environment and will always be the correct path. Something similar was already fixed in #387 Fixes #645 Fixes #159 Fixes #705 ## Type of change - Bug fix (non-breaking change which fixes an issue) ## How has this been tested? * Unit tests exercise migration * Integration tests exercise migration ## Contributor checklist Please check that you have: - [x] self-reviewed the code in this PR - [x] added code comments, particularly in less straightforward areas - [x] checked and added or updated relevant documentation - [x] checked and added or updated relevant release notes - [x] added tests to verify effectiveness of this change --------- Signed-off-by: John Ramsden <john.ramsden@canonical.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Description
Microceph references old snap paths which can be recycled after multiple snap refreshes.
Full reproduction of the issue can be found here #159 (comment)
The solution is to use the symbolic link which is always present pointing to current. This is set up by the snap environment and will always be the correct path.
Something similar was already fixed in #387
Fixes #645
Fixes #159
Fixes #705
Type of change
How has this been tested?
Contributor checklist
Please check that you have: