fix: Resolve missing unlock encrypted wal/db at osd start #680

Merged
sabaini merged 3 commits into canonical:main from johnramsden:john/CEPH-1598 on Feb 27, 2026
@johnramsden
Member

Description

On node reboot, or whenever the encrypted devices have been closed, encrypted WAL/DB devices are not unlocked at OSD start, which causes the OSD to fail on startup.

The relevant issue #655 describes the problem very well:

There is a discrepancy between the OSD Creation logic (Go) and the OSD Startup logic (Shell).

  1. Creation (Correct):
    In microceph/ceph/osd.go, the function setupEncryptedOSD correctly handles Block, WAL, and DB using the corresponding "suffix". It formats and opens (unlocks) all of them explicitly.

  2. Startup (Bug):
    In snapcraft/commands/osd.start, the spawn() function iterates through the OSD directories.

Because the shell script ignores the unencrypted.wal symlink, the mapper is never created, and BlueStore cannot access its write-ahead log.

Proposed Fix

Update snapcraft/commands/osd.start to check for and unlock separate WAL/DB devices if they exist, mirroring the logic for the main block device.

Pseudocode logic missing in osd.start:

# Unlock a separate DB device, if this OSD has one
if [ -b "${i}/unencrypted.db" ]; then
    maybe_unlock "${i}/unencrypted.db" "${nr}" "$( get_key "${nr}" )"
fi
# Unlock a separate WAL device, if this OSD has one
if [ -b "${i}/unencrypted.wal" ]; then
    maybe_unlock "${i}/unencrypted.wal" "${nr}" "$( get_key "${nr}" )"
fi
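
As a rough sketch of how the two checks above could be generalized, the snippet below loops over the WAL/DB suffixes and reports which paths would need unlocking. Only the `unencrypted.db` and `unencrypted.wal` names come from the issue; `list_unlock_targets` and the `DEVICE_TEST` override are hypothetical names introduced here so the loop can be tried without real block devices, and this is not necessarily how the merged change is structured.

```shell
#!/bin/sh
# Sketch only. list_unlock_targets and DEVICE_TEST are hypothetical names,
# not part of microceph's osd.start.

# File test applied to each candidate path. Defaults to -b (block device),
# matching the pseudocode in the issue; can be overridden for dry runs.
DEVICE_TEST="${DEVICE_TEST:--b}"

# Print, one per line, the unencrypted.<suffix> paths under the given OSD
# directory that pass the file test and would therefore need unlocking.
list_unlock_targets() {
    osd_dir="$1"
    for suffix in db wal; do
        path="${osd_dir}/unencrypted.${suffix}"
        if test "${DEVICE_TEST}" "${path}"; then
            echo "${path}"
        fi
    done
}
```

In osd.start, each printed path would then be passed to maybe_unlock together with the OSD number and key, as in the pseudocode above.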

Add tests and fix the unlock.

Fixes #655

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

This has been tested locally by executing:

INITIAL_OSD_COUNT=$(microceph.ceph -s -f json | jq -r '.osdmap.num_in_osds // 0')
EXPECTED=$((INITIAL_OSD_COUNT + 1))
tests/scripts/actionutils.sh test_encrypted_wal_db_startup "${EXPECTED}" 

The test, which has also been added to CI, creates three loopback devices and then creates an OSD with an encrypted WAL and DB. This step succeeds.

The test then shuts down the OSD, closes the relevant encrypted devices, and starts the OSD back up. The OSD should come back up successfully; without the fix, it fails instead.
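
The final check, that the OSD count returns to the expected value after the restart, could look roughly like the polling helper below. The `microceph.ceph -s -f json | jq` query is the one used in the commands above; `get_osd_count`, `wait_for_osds`, the retry parameters, and the `OSD_COUNT_CMD` override are hypothetical and are not taken from tests/scripts/actionutils.sh.

```shell
#!/bin/sh
# Sketch only. Function names, retry defaults and OSD_COUNT_CMD are
# assumptions made for illustration.

# Number of OSDs currently "in", using the query from the PR description.
get_osd_count() {
    microceph.ceph -s -f json | jq -r '.osdmap.num_in_osds // 0'
}

# wait_for_osds EXPECTED [ATTEMPTS] [DELAY_SECONDS]
# Poll until at least EXPECTED OSDs are reported; return 1 if the count is
# still too low after ATTEMPTS polls.
wait_for_osds() {
    expected="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    n=0
    while [ "${n}" -lt "${attempts}" ]; do
        # OSD_COUNT_CMD lets a caller substitute another count source.
        count="$("${OSD_COUNT_CMD:-get_osd_count}")"
        if [ "${count:-0}" -ge "${expected}" ]; then
            return 0
        fi
        n=$((n + 1))
        sleep "${delay}"
    done
    echo "expected ${expected} OSDs, last saw ${count:-0}" >&2
    return 1
}
```

For example, `wait_for_osds "${EXPECTED}"` after restarting the OSD would fail if the encrypted WAL/DB devices were never unlocked and the OSD did not come back.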

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

johnramsden and others added 2 commits February 25, 2026 16:04
On reboot encrypted wal/dbs will not be unlocked correctly and will lead to issues on startup.

Add a relevant test that will demonstrate failure to unlock.

The test manually closes the encrypted devices, simulating a reboot. It then tries to start microceph back up and verifies that the devices are available (currently fails).

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* osd start was not unlocking encrypted wal/db
* Add functionality to unlock by looking at wal/db suffix

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@sabaini sabaini left a comment

Hey @johnramsden thanks lgtm in general, only a minor testing nit!

Comment thread tests/scripts/actionutils.sh
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden
Member Author

I believe the current errors are not related to this PR

@johnramsden johnramsden requested a review from sabaini February 26, 2026 23:27
Collaborator

@sabaini sabaini left a comment

Good stuff, thx!

@sabaini sabaini merged commit 1e74227 into canonical:main Feb 27, 2026
35 of 40 checks passed
@johnramsden johnramsden deleted the john/CEPH-1598 branch April 14, 2026 17:16
Development

Successfully merging this pull request may close these issues.

OSD fails to start after reboot if using encrypted WAL or DB devices

2 participants