rpm: reclaim orphaned files by SHA-256 digest matching#84
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a robust mechanism to reclaim orphaned files in RPM-based container images by matching their SHA-256 digests. This is a significant improvement, particularly for environments like Fedora CoreOS where files are moved post-installation. The architectural shift to a two-pass system with 'strong' and 'weak' claims is well-conceived and enhances extensibility. The implementation is solid, featuring smart optimizations like size-based filtering to avoid unnecessary hashing and careful handling of ambiguous digests. The addition of a component manifest output and comprehensive end-to-end tests further strengthens the quality of this change.
251aadf to
f71c156
Compare
We want to know about all failures in one go since they might be independent.
This gives trait implementors access to the full file metadata (size, mode, inode, etc.) rather than just the type. Prep for adding more claiming logic which will use this information. Assisted-by: Claude Opus 4.6
So far, we've only had strong path claims. I.e. these are claims derived from authoritative databases living in the rootfs. But there's also room for a second pass of weaker path claims which are more "best-effort". We want to give a chance for every ComponentsRepo to say "this path definitely belongs in one of my components" (strong) _as well as_ something like "if no one else has a definite answer for this path, let me examine it more closely to see if it belongs to one of my components" (weak). The first is cheap, the second may require work, which is why we don't want to do it right off the bat in the first pass on all the files in the rootfs. The specific use case for this will become clearer in a later commit in this patch series. Assisted-by: Claude Opus 4.6
The bigfiles backend fits the use case for "weak" claims. It's a heuristic that's not really derived from any authoritative source to try to achieve better packing. Assisted-by: Claude Opus 4.6
I want to write tests which check that components were split in a certain way. Checking the layers themselves is not foolproof because it depends on how components were packed together. Add (hidden/unstable for now) support for writing a manifest which details all the components that were written and their paths. Assisted-by: OpenCode (Claude Opus 4.6)
f71c156 to
7e706fa
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a robust mechanism to reclaim orphaned files in RPM-based systems by matching their SHA-256 digests. This is a significant improvement for handling composed file systems where files are moved from their original RPM-declared paths. The core of the change is a new two-pass claiming system that distinguishes between "strong" path-based claims and "weak" heuristic-based claims, like digest matching. The implementation is well-structured, includes performance optimizations such as pre-filtering files by size before hashing, and is accompanied by thorough unit and end-to-end tests that validate the new functionality. My review includes a couple of suggestions to further enhance maintainability and performance.
Note: Security Review did not run due to the size of the PR.
In rpm-ostree-composed bootable containers (e.g. Fedora CoreOS), some files are moved from their RPM-declared paths to different locations (e.g. SELinux policy from `/var/lib/selinux` to `/etc/selinux`). Since the RPM database still records the original paths, these files end up unclaimed (~1,700 files / ~18 MB on FCOS). For the SELinux policy specifically, I think it would make sense to add a `/var/lib/selinux` symlink; this is both helpful to humans and to chunkah which already knows to canonicalize RPM paths. But there are some other paths which wouldn't be caught by this either, e.g. the EFI bootupd blob migrations to `/usr/lib/efi/`. Anyway, one way we can detect these moves for now is to hash them. After loading the rpmdb, we build an index mapping SHA-256 digests of orphaned entries (paths in rpmdb but not on rootfs at their expected paths) to their ComponentIds. And then during the weak claiming phase, we hash the files and match against the index to reclaim them into their correct SRPM component. Assisted-by: Claude Opus 4.6
Part of debugging a weird issue happening in GHA where the total size of components are different than what I see locally.
Add a hidden `--trace-logfile` FILE flag that writes trace-level logs to a file while leaving stderr output at the user-selected verbosity. This is useful for debugging failed builds where the full trace output is needed but would be too noisy on stderr. Part of debugging a weird GHA issue where sizes reported there are different than locally. Assisted-by: OpenCode (Claude Opus 4.6)
Add an --output-dir option (defaulting to tests/e2e/results/) that creates per-test subdirectories and exports OUTPUT_DIR for test scripts to write artifacts into. This gives tests a stable location for output files like manifests and trace logs that can later be uploaded as CI artifacts. Assisted-by: OpenCode (Claude Opus 4.6)
Switch test-fcos.sh from creating its own tmpdir for output to using the OUTPUT_DIR exported by run.sh. This means test artifacts (manifest, peak-mem, trace log) persist after the test run for inspection and CI upload. Also add --trace-logfile to capture full trace output alongside the other artifacts. Assisted-by: OpenCode (Claude Opus 4.6)
Upload the test output directory as a CI artifact on every run (pass or fail) so that manifests, trace logs, and peak memory stats are available for debugging failed CI runs without needing to reproduce locally. Assisted-by: OpenCode (Claude Opus 4.6)
e44ed26 to
60c51a9
Compare
In rpm-ostree-composed bootable containers (e.g. Fedora CoreOS), some
files are moved from their RPM-declared paths to different locations
(e.g. SELinux policy from
/var/lib/selinuxto/etc/selinux). Sincethe RPM database still records the original paths, these files end up
unclaimed (~1,700 files / ~18 MB on FCOS).
For the SELinux policy specifically, I think it would make sense to
add a
/var/lib/selinuxsymlink; this is both helpful to humans andto chunkah which already knows to canonicalize RPM paths. But there are
some other paths which wouldn't be caught by this either, e.g. the EFI
bootupd blob migrations to
/usr/lib/efi/.Anyway, one way we can detect these moves for now is to hash them. After
loading the rpmdb, we build an index mapping SHA-256 digests of orphaned
entries (paths in rpmdb but not on rootfs at their expected paths) to
their ComponentIds. And then during the weak claiming phase, we hash
the files and match against the index to reclaim them into their correct
SRPM component.
Assisted-by: Claude Opus 4.6