Skip to content

rpm: reclaim orphaned files by SHA-256 digest matching#84

Merged
jlebon merged 11 commits intomainfrom
issue-35-selinux-policy
Mar 11, 2026
Merged

rpm: reclaim orphaned files by SHA-256 digest matching#84
jlebon merged 11 commits intomainfrom
issue-35-selinux-policy

Conversation

@jlebon
Copy link
Copy Markdown
Member

@jlebon jlebon commented Mar 10, 2026

In rpm-ostree-composed bootable containers (e.g. Fedora CoreOS), some
files are moved from their RPM-declared paths to different locations
(e.g. SELinux policy from /var/lib/selinux to /etc/selinux). Since
the RPM database still records the original paths, these files end up
unclaimed (~1,700 files / ~18 MB on FCOS).

For the SELinux policy specifically, I think it would make sense to
add a /var/lib/selinux symlink; this is both helpful to humans and
to chunkah which already knows to canonicalize RPM paths. But there are
some other paths which wouldn't be caught by this either, e.g. the EFI
bootupd blob migrations to /usr/lib/efi/.

Anyway, one way we can detect these moves for now is to hash them. After
loading the rpmdb, we build an index mapping SHA-256 digests of orphaned
entries (paths in rpmdb but not on rootfs at their expected paths) to
their ComponentIds. And then during the weak claiming phase, we hash
the files and match against the index to reclaim them into their correct
SRPM component.

Assisted-by: Claude Opus 4.6

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust mechanism to reclaim orphaned files in RPM-based container images by matching their SHA-256 digests. This is a significant improvement, particularly for environments like Fedora CoreOS where files are moved post-installation. The architectural shift to a two-pass system with 'strong' and 'weak' claims is well-conceived and enhances extensibility. The implementation is solid, featuring smart optimizations like size-based filtering to avoid unnecessary hashing and careful handling of ambiguous digests. The addition of a component manifest output and comprehensive end-to-end tests further strengthens the quality of this change.

Comment thread src/components/rpm.rs
@jlebon jlebon force-pushed the issue-35-selinux-policy branch from 251aadf to f71c156 Compare March 10, 2026 14:18
jlebon added 5 commits March 10, 2026 10:29
We want to know about all failures in one go since they might be
independent.
This gives trait implementors access to the full file metadata (size,
mode, inode, etc.) rather than just the type.

Prep for adding more claiming logic which will use this information.

Assisted-by: Claude Opus 4.6
So far, we've only had strong path claims. I.e. these are claims derived
from authoritative databases living in the rootfs. But there's also room
for a second pass of weaker path claims which are more "best-effort".

We want to give a chance for every ComponentsRepo to say "this path
definitely belongs in one of my components" (strong) _as well as_
something like "if no one else has a definite answer for this path, let
me examine it more closely to see if it belongs to one of my components"
(weak). The first is cheap, the second may require work, which is why we
don't want to do it right off the bat in the first pass on all the files
in the rootfs.

The specific use case for this will become clearer in a later commit in
this patch series.

Assisted-by: Claude Opus 4.6
The bigfiles backend fits the use case for "weak" claims. It's a
heuristic that's not really derived from any authoritative source to try
to achieve better packing.

Assisted-by: Claude Opus 4.6
I want to write tests which check that components were split in a
certain way. Checking the layers themselves is not foolproof because it
depends on how components were packed together.

Add (hidden/unstable for now) support for writing a manifest which
details all the components that were written and their paths.

Assisted-by: OpenCode (Claude Opus 4.6)
@jlebon jlebon force-pushed the issue-35-selinux-policy branch from f71c156 to 7e706fa Compare March 10, 2026 14:29
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented Mar 10, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust mechanism to reclaim orphaned files in RPM-based systems by matching their SHA-256 digests. This is a significant improvement for handling composed file systems where files are moved from their original RPM-declared paths. The core of the change is a new two-pass claiming system that distinguishes between "strong" path-based claims and "weak" heuristic-based claims, like digest matching. The implementation is well-structured, includes performance optimizations such as pre-filtering files by size before hashing, and is accompanied by thorough unit and end-to-end tests that validate the new functionality. My review includes a couple of suggestions to further enhance maintainability and performance.

Note: Security Review did not run due to the size of the PR.

Comment thread src/components/mod.rs
Comment thread src/components/rpm.rs
jlebon added 6 commits March 10, 2026 19:38
In rpm-ostree-composed bootable containers (e.g. Fedora CoreOS), some
files are moved from their RPM-declared paths to different locations
(e.g. SELinux policy from `/var/lib/selinux` to `/etc/selinux`). Since
the RPM database still records the original paths, these files end up
unclaimed (~1,700 files / ~18 MB on FCOS).

For the SELinux policy specifically, I think it would make sense to
add a `/var/lib/selinux` symlink; this is both helpful to humans and
to chunkah which already knows to canonicalize RPM paths. But there are
some other paths which wouldn't be caught by this either, e.g. the EFI
bootupd blob migrations to `/usr/lib/efi/`.

Anyway, one way we can detect these moves for now is to hash them. After
loading the rpmdb, we build an index mapping SHA-256 digests of orphaned
entries (paths in rpmdb but not on rootfs at their expected paths) to
their ComponentIds. And then during the weak claiming phase, we hash
the files and match against the index to reclaim them into their correct
SRPM component.

Assisted-by: Claude Opus 4.6
Part of debugging a weird issue happening in GHA where the total size of
components are different than what I see locally.
Add a hidden `--trace-logfile` FILE flag that writes trace-level logs to
a file while leaving stderr output at the user-selected verbosity. This
is useful for debugging failed builds where the full trace output is
needed but would be too noisy on stderr.

Part of debugging a weird GHA issue where sizes reported there are
different than locally.

Assisted-by: OpenCode (Claude Opus 4.6)
Add an --output-dir option (defaulting to tests/e2e/results/) that
creates per-test subdirectories and exports OUTPUT_DIR for test scripts
to write artifacts into.

This gives tests a stable location for output files like manifests and
trace logs that can later be uploaded as CI artifacts.

Assisted-by: OpenCode (Claude Opus 4.6)
Switch test-fcos.sh from creating its own tmpdir for output to using the
OUTPUT_DIR exported by run.sh. This means test artifacts (manifest,
peak-mem, trace log) persist after the test run for inspection and CI
upload. Also add --trace-logfile to capture full trace output alongside
the other artifacts.

Assisted-by: OpenCode (Claude Opus 4.6)
Upload the test output directory as a CI artifact on every run (pass or
fail) so that manifests, trace logs, and peak memory stats are available
for debugging failed CI runs without needing to reproduce locally.

Assisted-by: OpenCode (Claude Opus 4.6)
@jlebon jlebon force-pushed the issue-35-selinux-policy branch from e44ed26 to 60c51a9 Compare March 10, 2026 23:39
@jlebon jlebon merged commit 1b2faef into main Mar 11, 2026
10 checks passed
@jlebon jlebon deleted the issue-35-selinux-policy branch March 11, 2026 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant