The packing algorithm over-merges some components #97

@solacelost

Description

Even with very large layer counts, the packing algorithm sometimes merges components that don't belong together while keeping other components completely isolated. I don't mean that packing isn't working as intended, but that the way packing is designed needs a more nuanced approach. This is not about the academic case of suboptimal packing with a greedy algorithm, but about the real-world user experience of chunking builds in a way that provides close-to-optimal chunking strategies and prevents larger layers from changing due to trivial changes.

The following dive screenshots are from a 256-layer chunking of an image designed for running Steam Big Picture mode as a graphical session in gamescope:

(screenshot)

This single layer is almost half of the image size. It contains large files (which were not installed via RPM, but were given xattrs and were therefore marked as their own component rather than picked up by the big-file repo) alongside rpm-ostree objects, a few random fonts, some sideloaded ungrouped files like scripts and a wallpaper PNG, and some seemingly random, unrelated RPM content like libraries from mesa-freeworld RPMs installed from RPM Fusion.

(screenshot)

This layer contains the kernel in the build, installed from an RPM in a COPR repo, plus some related RPM content such as kernel-modules-extra and the perf RPM. This looks like good chunking to me, though there were perhaps a few more kernel-related RPMs that could have landed here, as well as the initrd.

(screenshot)

This layer has just botocore, which, considering its size, makes some amount of sense to me.

(screenshot)

This layer just has subscription-manager (I built this from my base GUI-less image, which includes a bunch of developer tooling for my day job, including the ability to do entitled builds). Considering its relative stability and small size, isolating it does not make sense to me. Having this package in its own layer does mean that the maybe-biweekly releases will have a minimal diff on update, with a single layer changed, but because the layer is so small, the real-world impact on updates is far less useful than having the kernel and kernel-related packages together in a single layer.

I think that, right now, the algorithm tends to reward mixing very large components with very small components, maximizing their "expected value." Multiple very large components packed together also have an outsized impact on those very small components. The result is that trivial changes end up bundled alongside very large files, typically producing one very large layer while the rest of the components are spread reasonably well across the other layers: fairly small layers for 99% of the image, alongside that one massive layer.
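To make the failure mode concrete, here is a toy sketch (not the project's actual scorer; the names, sizes, and change probabilities are all hypothetical) of why a size-only "expected value" rewards pairing a huge stable component with a tiny frequently-changing one, even though every update then refetches the whole layer:

```python
def merge_score(a, b):
    # a, b: (size_mb, change_probability) tuples (illustrative units).
    # A size-only score: a bigger merged layer "saves" more layer slots,
    # so this happily pairs huge components with tiny ones.
    return a[0] + b[0]

def expected_download(a, b):
    # Expected MB re-downloaded per update if a and b share a layer:
    # the whole layer is refetched when either component changes.
    p_change = 1 - (1 - a[1]) * (1 - b[1])
    return (a[0] + b[0]) * p_change

big_stable = (900, 0.01)   # e.g. large game assets, rarely change
tiny_churny = (5, 0.9)     # e.g. a script tweaked nearly every build

print(merge_score(big_stable, tiny_churny))        # 905
print(expected_download(big_stable, tiny_churny))  # ~815 MB refetched per update
```

The merge looks great by size alone, but the expected transfer cost shows why a churny 5 MB script shouldn't ride in a 900 MB layer.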

I'm not sure what the right solution here is. Some options:

- Have the scoring algorithm take relative size differences of components into account and try to spread them out.
- Add more intelligence to the RpmRepo component to identify and group related packages into a single component even when they aren't built from a single SRPM (work that would then need to go into every Repo component for other package managers).
- A stronger tendency for unclaimed components not to merge with any other components.

Maybe all of those?
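As a starting point for experimenting with the size-difference idea, here's one hypothetical affinity term (the function name and formula are mine, not from the codebase) that discounts merges between components whose sizes differ by orders of magnitude:

```python
import math

def size_affinity(size_a, size_b):
    # Hypothetical penalty term: 1.0 when sizes are equal, shrinking
    # toward 0 as the sizes diverge by orders of magnitude. Could be
    # multiplied into a merge score to discourage folding a 5 MB
    # package into a 900 MB layer just to fill space.
    ratio = max(size_a, size_b) / min(size_a, size_b)
    return 1.0 / (1.0 + math.log10(ratio))

print(round(size_affinity(900, 850), 3))  # ≈ 0.976: similar sizes, merge OK
print(round(size_affinity(900, 5), 3))    # ≈ 0.307: discourage this merge
```

This is only one shape the penalty could take; the real work would be tuning it against actual update diffs.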

For now, I'm opening the issue and happy to discuss and experiment with some other strategies.
