The packing algorithm over-merges some components #97

@solacelost

Description

Even with very large layer counts, the packing algorithm sometimes merges components that don't belong together while keeping other components completely isolated. I don't mean that packing isn't working as intended, but that the way packing is designed needs a more nuanced approach. This is not about the academic case of suboptimal packing with a greedy algorithm, but about the real-world user experience of chunking builds in a way that provides close-to-optimal chunking strategies and prevents larger layers from changing due to trivial changes.

The following dive screenshots are from a 256-layer chunking of an image designed for running Steam Big Picture mode as a graphical session in gamescope:

(screenshot)

This single layer is almost half of the image size. It contains large files (which were not installed via RPM, but were given xattrs and were therefore marked as their own component rather than picked up by the big-file repo) alongside rpm-ostree objects, a few random fonts, some sideloaded ungrouped files like scripts and a wallpaper PNG, and some seemingly random, unrelated RPM content like libraries from mesa-freeworld RPMs installed from RPM Fusion.

(screenshot)

This layer contains the kernel in the build, installed from an RPM in a COPR repo, plus some related RPM content such as kernel-modules-extra and the perf RPM. This looks like good chunking to me, though there were perhaps a few more kernel-related RPMs that could have landed here, as well as the initrd.

(screenshot)

This layer has just botocore, which, considering its size, makes some amount of sense to me.

(screenshot)

This layer just has subscription-manager (I built this from my base GUI-less image, which includes a bunch of developer tooling for my day job, including the ability to do entitled builds). Considering its relative stability and small size, isolating it does not make sense to me. Having this package in its own layer does mean that the maybe-biweekly releases will have a minimal diff on update, with a single layer changed, but because the layer is so small, the real-world impact on updates is far less useful than having the kernel and kernel-related packages together in a single layer.

I think that, right now, the algorithm tends to reward mixing very large components with very small components, maximizing their "expected value." Multiple very large components packed together also have an outsized impact on those very small components. The result is that trivial changes end up bundled alongside very large files, typically producing one very large layer while the rest of the components are spread reasonably well across the other layers: fairly small layers for 99% of the image, alongside that one massive layer.
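To make the failure mode concrete, here is a toy sketch (not the project's actual scorer; the names, sizes, and change probabilities are all hypothetical) of why a size-only "expected value" rewards pairing a huge stable component with a tiny frequently-changing one, even though every update then refetches the whole layer:

```python
def merge_score(a, b):
    # a, b: (size_mb, change_probability) tuples (illustrative units).
    # A size-only score: a bigger merged layer "saves" more layer slots,
    # so this happily pairs huge components with tiny ones.
    return a[0] + b[0]

def expected_download(a, b):
    # Expected MB re-downloaded per update if a and b share a layer:
    # the whole layer is refetched when either component changes.
    p_change = 1 - (1 - a[1]) * (1 - b[1])
    return (a[0] + b[0]) * p_change

big_stable = (900, 0.01)   # e.g. large game assets, rarely change
tiny_churny = (5, 0.9)     # e.g. a script tweaked nearly every build

print(merge_score(big_stable, tiny_churny))        # 905
print(expected_download(big_stable, tiny_churny))  # ~815 MB refetched per update
```

The merge looks great by size alone, but the expected transfer cost shows why a churny 5 MB script shouldn't ride in a 900 MB layer.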

I'm not sure what the right solution here is. Some options:

- Have the scoring algorithm take relative size differences of components into account and try to spread them out.
- Add more intelligence to the RpmRepo component to identify and group related packages into a single component even when they aren't built from a single SRPM (work that would then need to go into every Repo component for other package managers).
- A stronger tendency for unclaimed components not to merge with any other components.

Maybe all of those?
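As a starting point for experimenting with the size-difference idea, here's one hypothetical affinity term (the function name and formula are mine, not from the codebase) that discounts merges between components whose sizes differ by orders of magnitude:

```python
import math

def size_affinity(size_a, size_b):
    # Hypothetical penalty term: 1.0 when sizes are equal, shrinking
    # toward 0 as the sizes diverge by orders of magnitude. Could be
    # multiplied into a merge score to discourage folding a 5 MB
    # package into a 900 MB layer just to fill space.
    ratio = max(size_a, size_b) / min(size_a, size_b)
    return 1.0 / (1.0 + math.log10(ratio))

print(round(size_affinity(900, 850), 3))  # ≈ 0.976: similar sizes, merge OK
print(round(size_affinity(900, 5), 3))    # ≈ 0.307: discourage this merge
```

This is only one shape the penalty could take; the real work would be tuning it against actual update diffs.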

For now, I'm opening the issue and happy to discuss and experiment with some other strategies.
