Skip to content

fix: Resolve referencing stale paths#708

Merged
johnramsden merged 3 commits intocanonical:mainfrom
johnramsden:159-path
Apr 9, 2026
Merged

fix: Resolve referencing stale paths#708
johnramsden merged 3 commits intocanonical:mainfrom
johnramsden:159-path

Conversation

@johnramsden
Copy link
Copy Markdown
Member

@johnramsden johnramsden commented Apr 7, 2026

Description

Microceph references old snap paths which can be recycled after multiple snap refreshes.

Full reproduction of the issue can be found here #159 (comment)

The solution is to use the symbolic link which is always present pointing to current. This is set up by the snap environment and will always be the correct path.

Something similar was already fixed in #387

Fixes #645
Fixes #159
Fixes #705

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

  • Unit tests exercise migration
  • Integration tests exercise migration

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes MicroCeph’s use of revision-specific snap paths (e.g., /var/snap/microceph/<rev>/run) by switching to the stable current symlink and adding migrations/tests to repair existing stale config entries.

Changes:

  • Update path constants and config generation to use /var/snap/microceph/current/... instead of revision-specific $SNAP_DATA/....
  • Add a startup migration (migrateStaleRunDir) to rewrite stale run dir / CCacheDir entries in existing configs.
  • Add unit/integration coverage for the migration and hook it into CI.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
microceph/constants/constants.go Derives ConfPath/RunPath via the stable current symlink.
microceph/ceph/start.go Adds migration that rewrites stale run-dir paths and runs it during daemon startup.
microceph/ceph/start_test.go Adds unit tests for stale run-dir migration behavior.
microceph/tests/testutils.go Mirrors snapd’s current -> SNAP_DATA symlink in tests.
microceph/ceph/config.go Uses constants.GetPathConst() for conf/run paths in config updates.
microceph/ceph/join.go Uses constants.GetPathConst().ConfPath for ceph.conf path discovery.
microceph/ceph/rgw_test.go Asserts RGW config uses current/run.
tests/scripts/actionutils.sh Adds integration test functions to validate migration for RGW and NFS.
.github/workflows/tests.yml Runs the new integration tests in CI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread microceph/ceph/start.go Outdated
Comment thread microceph/ceph/start.go
Comment thread microceph/ceph/start.go
Comment thread microceph/ceph/start.go
@@ -316,6 +406,7 @@ func Start(ctx context.Context, s interfaces.StateInterface) error {
case <-time.After(10 * time.Second):
}
}
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-enablement is guarded by + "" + serviceStartMu + "" + , but the newly-added config migration runs outside that lock. Since the migration edits config files that other concurrent operations/hooks can also rewrite (e.g., config regeneration during + "" + snapctl + "" + actions), it can race and cause lost updates or inconsistent state. Consider taking the same mutex around + "" + migrateStaleRunDir() + "" + and + "" + reEnableServices() + "" + (mandatory if other code paths can concurrently write these configs).

Suggested change
}
}
serviceStartMu.Lock()
defer serviceStartMu.Unlock()

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restart actions do not occur in parallel with enabling services, the following happens sequentially:

		migrateStaleRunDir()
		reEnableServices(ctx, s) 

I don't think this is the right lock for the job. If we didn't want to consider locking on file updates what we'd probably want to look at would be having a separate lock that covers all situations where configuration might be modified.

For example enabling RGW. I don't believe this is an actual concern though because the writing of the configuration file is atomic so you cannot have the situation where a file is half written and therefore the migration would be attempting to happen in the middle of a rewrite.

Comment thread tests/scripts/actionutils.sh
Comment thread tests/scripts/actionutils.sh
@johnramsden
Copy link
Copy Markdown
Member Author

Addressed AI reviews & rewrote commits

@johnramsden johnramsden requested a review from sabaini April 7, 2026 23:41
Copy link
Copy Markdown
Collaborator

@sabaini sabaini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnramsden great catch 🙏 one tiny style nit otherwise lgtm

Comment thread microceph/ceph/start.go Outdated
Comment thread microceph/ceph/start.go Outdated
Copy link
Copy Markdown
Contributor

@UtkarshBhatthere UtkarshBhatthere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes overall lgtm, one nit from me and a pending comment from Peter.

johnramsden and others added 3 commits April 8, 2026 17:34
SNAP_DATA resolves to a revision-specific directory at runtime (e.g.
/var/snap/microceph/1630). Paths derived from it and baked into config
files become dangling references after a snap refresh garbage-collects
old revision directories.

Fix the root cause in GetPathConst() by deriving ConfPath and RunPath
from the stable 'current' symlink that snapd always maintains pointing
to the active revision. Remove per-callsite os.Getenv("SNAP_DATA") path
construction in config.go, join.go, and start.go so all consumers
inherit the correct paths automatically.

Add migrateStaleRunDir(), called on every daemon startup before
reEnableServices, to rewrite any revision-specific run dir found in
existing radosgw.conf or ganesha.conf files. This repairs installations
that were broken by a snap refresh before this fix was applied.

Use errors.Is(err, os.ErrNotExist) in fixConfigLine rather than the
deprecated os.IsNotExist().

Closes: canonical#705
Closes: canonical#159

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Add assertion to TestEnableRGW that the generated radosgw.conf uses the
stable 'current' symlink for run dir rather than a revision-specific
path.

Add migrateStaleRunDir test cases covering: no config files present,
already-correct paths left unchanged, and stale revision-specific paths
rewritten correctly for both radosgw.conf and ganesha.conf.

Fix ReadCephConfig to T.Fatal on missing files rather than silently
returning an empty string. Replace uses of the duplicate readConf helper
in start_test.go with ReadCephConfig. Fix two rgw_test.go cases that
relied on silent error behaviour to assert a file was absent; use
os.Stat + os.IsNotExist instead.

Add integration test functions test_rgw_stale_run_dir_migration and
test_nfs_stale_run_dir_migration to actionutils.sh, and wire them into
the single-system CI job.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Split one-line assign/test constructs into separate assign and if
statements per CONTRIBUTING.md convention, and replace the literal
0644 permission with constants.PermissionUserRwWorldRAccess.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden johnramsden merged commit 12ff308 into canonical:main Apr 9, 2026
45 of 46 checks passed
skoech pushed a commit that referenced this pull request Apr 9, 2026
# Description

Microceph references old snap paths which can be recycled after multiple
snap refreshes.

Full reproduction of the issue can be found here
#159 (comment)

The solution is to use the symbolic link which is always present
pointing to current. This is set up by the snap environment and will
always be the correct path.

Something similar was already fixed in
#387

Fixes #645 
Fixes #159 
Fixes #705 

## Type of change

- Bug fix (non-breaking change which fixes an issue)

## How has this been tested?

* Unit tests exercise migration
* Integration tests exercise migration

## Contributor checklist

Please check that you have:

- [x] self-reviewed the code in this PR
- [x] added code comments, particularly in less straightforward areas
- [x] checked and added or updated relevant documentation
- [x] checked and added or updated relevant release notes
- [x] added tests to verify effectiveness of this change

---------

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@johnramsden johnramsden deleted the 159-path branch April 14, 2026 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants