ext4: group-aware allocator, metadata-aware dedup alignment, e2fsck-gated harness #72
Merged
Two latent, shipping bugs in the ext4 image writer, both surfaced by gating validation on the real `e2fsck`/kernel instead of the lenient in-crate reader:

1. s_journal_uuid was stamped with the filesystem UUID. That field names an *external* journal device, so the kernel and e2fsck searched for a nonexistent external journal and aborted before checking anything, meaning no image this writer produced could ever be fsck-validated. Zeroed it (the internal journal carries its UUID in the jbd2 superblock).
2. Multi-block-group images (>128 MiB) placed file data on the Group 1 backup superblock + group descriptors (block 32768), producing multiply-claimed blocks the kernel rejects. This is real corruption (e.g. /home, /media read back as "Structure needs cleaning").
   New group-aware allocator: file data skips the reserved backup-SB/GDT blocks at sparse_super group boundaries and fragments around them (write_file_data, physical_runs, rewritten extent emission keyed on data_start_block). Unfragmented files produce byte-identical output to before.

Validated on synthetic and real (python:3.12-slim) images: e2fsck clean, kernel mount reads all files, byte-exact content across fragmentation, determinism preserved, full ext4 suite green.

Adds:
- ext4/tests/fsck_validity.rs: e2fsck-oracle integration harness.
- glidefs/src/bin/dedup_probe.rs: empirical dedup probe (replaces the unsound oci_dedup_measure model).
- WriterOption::AlignData: opt-in, gated off, marked KNOWN-LIMITATION (block-alignment dedup work in progress; not yet metadata-aware).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
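The reserved-block skipping described in item 2 can be sketched as follows. This is a minimal illustration, not the writer's actual code: the function signatures, the `gdt_blocks` parameter, and the run-merging behavior are assumptions made for the example. It relies only on the standard sparse_super rule (backups live in groups 0, 1, and powers of 3, 5, and 7) and on 32768 blocks per group for 4 KiB blocks.

```rust
/// Blocks per group with 4 KiB blocks (one bitmap block covers 8 * 4096 blocks).
const BLOCKS_PER_GROUP: u64 = 32_768;

/// True if `n` equals base^k for some k >= 0.
fn is_power_of(mut n: u64, base: u64) -> bool {
    if n == 0 {
        return false;
    }
    while n % base == 0 {
        n /= base;
    }
    n == 1
}

/// sparse_super rule: groups 0, 1 and powers of 3, 5, 7 carry a backup.
fn has_backup_superblock(group: u64) -> bool {
    group <= 1 || is_power_of(group, 3) || is_power_of(group, 5) || is_power_of(group, 7)
}

/// Reserved range [start, end) at the head of `group`, if any.
/// `gdt_blocks` (group-descriptor blocks after the superblock copy) is an
/// assumed input; the real writer derives it from the filesystem size.
fn reserved_range(group: u64, gdt_blocks: u64) -> Option<(u64, u64)> {
    if has_backup_superblock(group) {
        let start = group * BLOCKS_PER_GROUP;
        Some((start, start + 1 + gdt_blocks))
    } else {
        None
    }
}

/// Hypothetical stand-in for `physical_runs`: place `len` data blocks from
/// `start` onward, returning (start, len) runs that avoid reserved blocks.
fn physical_runs(start: u64, len: u64, gdt_blocks: u64) -> Vec<(u64, u64)> {
    let mut runs: Vec<(u64, u64)> = Vec::new();
    let (mut pos, mut remaining) = (start, len);
    while remaining > 0 {
        let group = pos / BLOCKS_PER_GROUP;
        if let Some((rs, re)) = reserved_range(group, gdt_blocks) {
            if pos >= rs && pos < re {
                pos = re; // step over the backup superblock + GDT
                continue;
            }
        }
        // Only go as far as the next group boundary, then re-check.
        let take = remaining.min((group + 1) * BLOCKS_PER_GROUP - pos);
        match runs.last_mut() {
            // Contiguous with the previous run: extend instead of fragmenting.
            Some((s, l)) if *s + *l == pos => *l += take,
            _ => runs.push((pos, take)),
        }
        pos += take;
        remaining -= take;
    }
    runs
}
```

Under these assumptions, a 16-block file starting at block 32760 splits into two runs around the Group 1 backup, while a file crossing into Group 2 (which carries no backup) stays a single run.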
…rness

Completes the block-alignment dedup work on top of the group-aware allocator.

Alignment (WriterOption::AlignData) is now correct and e2fsck-clean:
- Composes with reserved-block skipping: when an aligned file start lands on a backup-superblock/GDT block, write_file_data skips past it before recording data_start_block.
- Padding gaps are unreferenced free space, not data. record_free_hole tracks them (excluding reserved metadata blocks), and close() clears them from the otherwise-dense block bitmap so the free counts are correct. Previously the dense bitmap marked padding as used, which e2fsck rejected.

Validated end to end on real images (python:3.12/3.13-slim): aligned builds are e2fsck-clean, kernel-mount and read byte-exact, deterministic, and realize the dedup win: 26%->52% cross-image (3-image set), matching the file-level ground truth (52%) and capturing 98% of the FastCDC ceiling, with zero stored cost (padding is dropped zeros).

Adds fuzz_multigroup_validity_and_content: random multi-group filesets (mixed small/medium/large) must be e2fsck-clean AND read back byte-exact, both unaligned and aligned. Deterministic seeds reproduce failures; EXT4_FUZZ_SEEDS scales coverage (ran clean at 64 seeds = 128 images). This is the generalized gate over the size/position space where the data-on-metadata and alignment-bitmap bugs lived.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
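The padding-hole accounting above can be illustrated with a toy bump allocator. This is a sketch under stated assumptions: the `Allocator` type, `place`, and this `close` are hypothetical names for the example, alignment is expressed in filesystem blocks, and reserved-metadata exclusion is left out for brevity.

```rust
/// Round `x` up to a multiple of `align` (assumes align >= 1).
fn align_up(x: u64, align: u64) -> u64 {
    (x + align - 1) / align * align
}

/// Toy bump allocator over filesystem blocks (hypothetical, for illustration).
struct Allocator {
    next: u64,              // next unallocated block
    holes: Vec<(u64, u64)>, // (start, len) padding gaps skipped by alignment
}

impl Allocator {
    /// Place `len` blocks at the next `align_blocks` boundary and remember
    /// the skipped gap so it can be reported as free later.
    fn place(&mut self, len: u64, align_blocks: u64) -> u64 {
        let start = align_up(self.next, align_blocks);
        if start > self.next {
            self.holes.push((self.next, start - self.next));
        }
        self.next = start + len;
        start
    }

    /// Build the block bitmap: dense up to the high-water mark, then clear
    /// the recorded padding holes so free counts match what fsck computes.
    fn close(&self) -> Vec<bool> {
        let mut used = vec![true; self.next as usize];
        for &(start, len) in &self.holes {
            for block in start..start + len {
                used[block as usize] = false;
            }
        }
        used
    }
}
```

With 4 KiB filesystem blocks, a 128 KiB grid corresponds to `align_blocks = 32`. Placing a 3-block file and then an aligned 40-block file leaves blocks 3 through 31 as a hole that the bitmap must report as free.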
Wire the validated dedup alignment into both bless paths:
- run_bless_oci (merged image) and run_bless_oci_layered via layer_store
(per-layer) now pass AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE }
(128 KiB = the volume's block size). Threshold = one full dedup block:
measured as the sweet spot — captures 99% of the FastCDC ceiling (34%
cross-image reduction on the 3-image probe set) with less logical
inflation than a 16 KiB threshold.
- device_size estimates get headroom for alignment padding (bless x3->x4,
layer_store x2->x3). Padding is holes/zeros the block store drops, so it
costs address space, not stored bytes.
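To make the threshold and headroom reasoning concrete, here is a rough sketch of how alignment inflates the logical layout. The `AlignData` struct below only mirrors the field names from the commit message; treating the threshold as `size >= min_size`, and ignoring 4 KiB filesystem-block rounding and metadata, are simplifying assumptions.

```rust
/// Dedup block size named in the commit message (128 KiB).
const BLOCK_SIZE: u64 = 128 * 1024;

/// Stand-in mirroring the option's field names (not the real WriterOption).
struct AlignData {
    align: u64,
    min_size: u64,
}

/// Round `x` up to a multiple of `align` (assumes align >= 1).
fn align_up(x: u64, align: u64) -> u64 {
    (x + align - 1) / align * align
}

/// Logical bytes consumed when files are laid out in order and every file
/// at or above `min_size` starts on an `align` boundary.
fn padded_layout_size(file_sizes: &[u64], opt: &AlignData) -> u64 {
    let mut cursor = 0u64;
    for &size in file_sizes {
        if size >= opt.min_size {
            cursor = align_up(cursor, opt.align); // padding is a hole, not data
        }
        cursor += size;
    }
    cursor
}
```

For files of 1000, 200000, 50, and 131072 bytes, the unpadded total is 332122 bytes while the aligned layout spans 524288 bytes. The difference is address space only, which is why the device-size estimate needs headroom even though stored bytes do not grow.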
Automate the strongest oracle: kernel_mount_content loop-mounts the image
with the real Linux ext4 driver and verifies every file byte-exact
against known input, for both aligned and unaligned builds. Opt-in via
EXT4_MOUNT_TEST=1 (needs root/passwordless sudo), skips by default so CI
stays green; runs in a privileged/nightly job. Verified passing locally.
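An opt-in gate of this kind can be sketched as below. Only the variable name EXT4_MOUNT_TEST and the value 1 come from the commit message; the helper names and the choice to trim whitespace are assumptions for illustration.

```rust
use std::env;

/// Pure decision function, kept separate so it can be tested without
/// mutating the process environment. Only the exact value "1" opts in.
fn mount_test_enabled(raw: Option<&str>) -> bool {
    matches!(raw.map(str::trim), Some("1"))
}

/// Reads the real environment variable; unset or non-UTF-8 means "skip".
fn mount_test_enabled_from_env() -> bool {
    mount_test_enabled(env::var("EXT4_MOUNT_TEST").ok().as_deref())
}
```

A test body would return early when `mount_test_enabled_from_env()` is false, so unprivileged CI runs report a pass without attempting the loop mount.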
dedup_probe: alignment threshold is now configurable via
DEDUP_ALIGN_THRESHOLD for measurement sweeps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
router.rs run_bless_oci_task (the in-server background bless triggered via the API) was missing the AlignData wiring and device-size headroom that the two CLI bless paths got. Wire it consistently so every bless path produces grid-aligned, cross-image-deduppable base images.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The group-aware allocator skipped reserved block-group backup-superblock/GDT blocks for FILE DATA, but the journal, inode table, and bitmaps are written contiguously at close() via raw writes and could still land on (or straddle) them.

Since bless enables a 4 MiB journal, this was a real corruption path: e.g. a ~124 MiB image puts the journal inode across block 32768 (Group 1's backup superblock) -> multiply-claimed block the kernel rejects. The inode table straddles the same way for workloads with many files. Caught by running the fsck harness with the production journal config (it previously ran journal-less).

Fix: reserve_contiguous(n) places a contiguous run clear of reserved regions (skipping past any it would straddle, recording the gap as a free hole); used for the journal, inode table, and bitmaps.

Tests now run with Journal(1024) (matching bless) and add:
- fsck_journal_straddles_group_boundary: sweeps 120-136 MiB
- fsck_inode_table_straddles_boundary: many-file workloads at the boundary

Both verified to fail before the fix (journal: inode 8 multiply-claim; inode table: "Group 1's inode table at 32768 conflicts"). Full suite + 64-seed fuzz + kernel mount green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
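The placement rule behind reserve_contiguous can be sketched like this. The signature is an assumption for the example (the real function takes only `n` and presumably reads allocator state); here the cursor and the sorted list of reserved `[start, end)` ranges are passed in explicitly.

```rust
/// Find the first start >= `cursor` such that [start, start + n) overlaps no
/// reserved range. Returns the start plus the skipped gaps as (start, len)
/// holes. Reserved blocks themselves are never counted as holes.
/// `reserved` is assumed sorted by start and non-overlapping.
fn reserve_contiguous(cursor: u64, n: u64, reserved: &[(u64, u64)]) -> (u64, Vec<(u64, u64)>) {
    let mut start = cursor;
    let mut holes = Vec::new();
    loop {
        // First reserved range that the candidate window would touch.
        let hit = reserved
            .iter()
            .copied()
            .find(|&(rs, re)| rs < start + n && re > start);
        match hit {
            Some((rs, re)) => {
                if rs > start {
                    holes.push((start, rs - start)); // free gap before the reserved range
                }
                start = re; // retry immediately after the reserved range
            }
            None => return (start, holes),
        }
    }
}
```

For a 1024-block journal (4 MiB at 4 KiB blocks) with the cursor at block 31800, the window would straddle block 32768, so under this sketch the run moves to the first block after the reserved range and blocks 31800 through 32767 are recorded as a free hole.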
Bring the doc in line with actual behavior:
- Extent building: file data fragments around reserved blocks (write_file_data / physical_runs / write_extents), not always contiguous.
- New Block-Group Metadata Reservation section: sparse_super backup superblocks at block 32768 etc., data + close()-time structures skip them (reserve_contiguous), free-hole bitmap accounting; the multi-group corruption bug and why the in-crate reader hid it.
- On-disk layout: correct flex_bg layout (metadata clustered at the end), reserved holes, journal placement.
- WriterOption table: add Uuid, Journal (s_journal_uuid must be zero), AlignData.
- Fix stale "why no journal" (now optional, bless enables it) and the zero-UUID determinism/checksum claims.
- Testing: document the fsck_validity e2fsck/mount/fuzz harness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Improves host/S3 deduplication by aligning large file payloads in blessed base images to the dedup block grid — and, in the process, fixes a pre-existing shipping data-corruption bug in the ext4 writer that was masked by a lenient in-crate reader.
The work is gated end-to-end on kernel-grade oracles (`e2fsck`, loop-mount), not the crate's own reader.
The dedup win
Fixed 128 KiB blocks over ext4 miss byte-identical content placed at different offsets (positional windows). On real images this loses ~half the achievable dedup. Aligning large file payloads to the grid makes the existing content-addressing fire.
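The positional-window effect is easy to reproduce at toy scale. The sketch below uses an 8-byte block as a stand-in for the 128 KiB grid and compares raw chunks instead of content hashes; the `build` and `shared_blocks` helpers are illustrative only and are not part of the PR.

```rust
use std::collections::HashSet;

/// Tiny stand-in for the 128 KiB dedup block.
const BLOCK: usize = 8;

/// Lay out `prefix` followed by `payload`; with `align`, zero-pad so the
/// payload starts on a block boundary.
fn build(prefix: &[u8], payload: &[u8], align: bool) -> Vec<u8> {
    let mut img = prefix.to_vec();
    if align {
        while img.len() % BLOCK != 0 {
            img.push(0);
        }
    }
    img.extend_from_slice(payload);
    img
}

/// Count fixed-size blocks of `b` whose exact contents also occur in `a`.
fn shared_blocks(a: &[u8], b: &[u8]) -> usize {
    let seen: HashSet<&[u8]> = a.chunks(BLOCK).collect();
    b.chunks(BLOCK).filter(|c| seen.contains(c)).count()
}
```

Two images that carry the same 64-byte payload after prefixes of 3 and 5 bytes share no blocks when unaligned, because every window cuts the payload at a different phase. With alignment on, all 8 payload blocks match.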
The bug it surfaced
Validating with the real `e2fsck` (after fixing a stray `s_journal_uuid` that made e2fsck abort) revealed that multi-block-group images (>128 MiB) placed file data on the Group 1 backup superblock (block 32768) → multiply-claimed blocks the kernel rejects → `/home`, `/media` etc. read back as "Structure needs cleaning". Pre-existing, independent of alignment, hidden by the lenient reader.
Fix: a group-aware allocator skips reserved backup-superblock/GDT blocks and fragments files around them. Alignment then composes on top (metadata-aware placement + padding holes cleared from the bitmap).
Changes
- `s_journal_uuid` fix; group-aware allocator (`write_file_data`, `physical_runs`, rewritten extent emission); metadata-aware `WriterOption::AlignData` (opt-in) with free-hole bitmap accounting.
- Bless paths pass `AlignData` at the 128 KiB block grid, with device-size headroom for padding.
- Harness (`ext4/tests/fsck_validity.rs`): e2fsck-oracle integration tests: single/multi-group clean, content-survives-fragmentation, a property fuzzer over random multi-group layouts (both align modes; `EXT4_FUZZ_SEEDS` to scale; ran clean at 64 seeds = 128 images), and an opt-in `kernel_mount_content` check (real loop-mount, byte-exact; `EXT4_MOUNT_TEST=1`).
- `glidefs/src/bin/dedup_probe.rs`: empirical dedup probe (replaces the unsound `oci_dedup_measure` model).
Validation
- `e2fsck` clean on synthetic + real (debian, python:3.12) images
- `bless` of debian (single-group) and python:3.12 (multi-group) completes end-to-end with alignment on
Notes / follow-ups
- A `bless_writer_options()` helper so the three bless sites can't drift (this PR already had one missing call site); would also unify a random-vs-deterministic UUID inconsistency in the server-side path.
🤖 Generated with Claude Code