
Enable zstd:chunked support in containers/image #775

Merged
4 commits merged into containers:master on May 14, 2021

Conversation

giuseppe
Member

@giuseppe giuseppe commented Nov 19, 2020

This introduces a bunch of new APIs to enable zstd:chunked support in containers/image.

It adds a new custom variant of zstd compression that permits retrieving each file separately.

The idea is based on CRFS and its stargz format for having seekable and indexable tarballs.


One disadvantage of the stargz format is that a custom file is added to the tarball to store the layer metadata, and the metadata file is part of the image itself. Clients that are not aware of the stargz format will propagate the metadata file inside of the containers.

The zstd compression supports embedding additional data as part of the stream that the zstd decompressor will ignore (skippable frame), so the issue above with CRFS can be solved directly within the zstd compression format.
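The skippable-frame mechanics are simple enough to sketch. A minimal Go illustration (function names are illustrative, not this PR's actual API): a frame starts with one of the sixteen magic numbers 0x184D2A50..0x184D2A5F, followed by a 4-byte little-endian payload length, and compliant decompressors skip it wholesale.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Skippable frames use magic numbers 0x184D2A50..0x184D2A5F, followed by a
// 4-byte little-endian payload length, then the payload itself. Compliant
// zstd decompressors ignore these frames entirely.
const skippableMagicBase = 0x184D2A50

// wrapInSkippableFrame embeds arbitrary metadata (e.g. a TOC) so that a
// plain zstd decoder skips over it.
func wrapInSkippableFrame(payload []byte) []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, uint32(skippableMagicBase))
	binary.Write(&buf, binary.LittleEndian, uint32(len(payload)))
	buf.Write(payload)
	return buf.Bytes()
}

// extractSkippableFrame reverses the wrapping, returning the payload.
func extractSkippableFrame(frame []byte) ([]byte, error) {
	if len(frame) < 8 {
		return nil, fmt.Errorf("frame too short")
	}
	magic := binary.LittleEndian.Uint32(frame[0:4])
	if magic&0xFFFFFFF0 != skippableMagicBase {
		return nil, fmt.Errorf("not a skippable frame: 0x%08X", magic)
	}
	size := binary.LittleEndian.Uint32(frame[4:8])
	if int(size) > len(frame)-8 {
		return nil, fmt.Errorf("truncated payload")
	}
	return frame[8 : 8+size], nil
}

func main() {
	frame := wrapInSkippableFrame([]byte(`{"entries":[]}`))
	payload, err := extractSkippableFrame(frame)
	fmt.Println(string(payload), err)
}
```

Because the magic number is outside the range of regular zstd frames, old clients simply decompress past the metadata without ever seeing it.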

Besides this minor advantage, zstd is much faster and compresses better than gzip, so let's take this opportunity to push the zstd format further.

The zstd compression has been supported by the OCI image spec since August 2019 (opencontainers/image-spec#788) and by containers/image since then.

Clients that are not aware of the zstd:chunked format won't notice any difference when handling a blob that uses the variant.
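Per-file retrieval works because the TOC records where each file's compressed data lives inside the blob, so a client can fetch only that slice with an HTTP Range request against the registry. A hypothetical sketch (the struct fields here are invented for illustration; the real TOC in this PR carries many more fields):

```go
package main

import "fmt"

// tocEntry is a simplified, hypothetical TOC record. The actual metadata in
// this PR also carries digests, tar header data, and chunk information.
type tocEntry struct {
	Name           string
	Offset         int64 // offset of the file's compressed data in the blob
	CompressedSize int64
}

// rangeHeader builds the HTTP Range header value a lazy-pulling client
// could send to fetch just one file's bytes from the registry blob.
func rangeHeader(e tocEntry) string {
	return fmt.Sprintf("bytes=%d-%d", e.Offset, e.Offset+e.CompressedSize-1)
}

func main() {
	e := tocEntry{Name: "usr/bin/bash", Offset: 4096, CompressedSize: 1024}
	fmt.Println(rangeHeader(e)) // bytes=4096-5119
}
```

Registries already support Range requests for blob downloads, which is why no server-side changes are needed for lazy pulling.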

Since access to host files is required for the deduplication, the untar doesn't happen in a separate process running in a chroot; it happens in the original mount namespace. To mitigate possible breakages, openat2(2) is used where available.

More details here: containers/image#1084

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com

@rhatdan
Member

rhatdan commented Nov 19, 2020

Lint is not happy.

@giuseppe giuseppe force-pushed the zstd-chunked branch 5 times, most recently from eb34310 to 69ca881 Compare November 23, 2020 15:28
@giuseppe giuseppe force-pushed the zstd-chunked branch 15 times, most recently from fe91364 to 66a8934 Compare December 1, 2020 12:18
@giuseppe
Member Author

giuseppe commented Dec 1, 2020

could I get some feedback on the containers/storage side (especially about the new interfaces)?

The current implementation is feature complete, with some parts that can be added later, e.g. a cache for host files.

One thing to address is how to store the layer metadata. I think it will be better to use a separate file. [Addressed in the last version]

@giuseppe giuseppe force-pushed the zstd-chunked branch 6 times, most recently from b8cfe6b to 264b1a2 Compare December 4, 2020 21:20
@cgwalters
Contributor

cgwalters commented May 6, 2021

I have a much higher level comment: As we all know, the original Docker format was basically "the simplest thing that could possibly work": layers of tar and gzip. What ended up being standardized is just tweaks on that.

This work addresses some of those issues (zstd is way better than gzip just as a baseline, seekability, etc.). But one thing that IMO is so key about both ostree and git is that the format is reproducible - there is exactly one canonical representation of a git commit and an ostree commit (unlike tar with its many variants, extensions, ability to have files in different orders, overwriting earlier files with later, starting with ./ or just the empty string, etc.).

If we're going to create effectively a new container format I think it's worth considering addressing that at the same time. Which would here probably end up as something like a "canonical tar format" - and e.g. require that files in the input stream are sorted by name in UTF-8 order, etc.

@ktock
Contributor

ktock commented May 6, 2021

Which would here probably end up as something like a "canonical tar format"

It's already discussed in the "OCIv2" image format (discussion notes, YouTube session by @cyphar), which doesn't use tar at all.
IMO we should keep the initial scope of zstd:chunked (i.e. lazy pulling).

@rhatdan
Member

rhatdan commented May 6, 2021

I am hesitant to push this too far: if we end up breaking the installed base of container registries and container engines, we will hit a huge headwind that prevents us from improving what we have. I don't mind a grand strategy for image format 2.0, but if we can make small improvements in 1.0, I would like to do that first.

@cgwalters
Contributor

Honestly I think cyphar just went off the rails a bit here - it's likely all the goals could be achieved with a canonical tar representation and hence be compatible with all existing tar parsers.

@mtrmac
Collaborator

mtrmac commented May 6, 2021

But one thing that IMO is so key about both ostree and git is that the format is reproducible - there is exactly one canonical representation of a git commit and an ostree commit

That seems hard to do with compressed artifacts (we want to allow for bug fixes or improvements to the compression implementation, and for image creators to use different programming languages = different implementations — also the implementations might not produce reproducible data just because it wasn’t a goal), if we don’t keep the compressed variant around (like I guess Git does?). We don’t currently keep it around, and the disk space impact could be pretty significant in absolute terms, easily tens of gigabytes (I have no idea what it is relative vs. available disk space).


If we're going to create effectively a new container format I think it's worth considering addressing that at the same time. Which would here probably end up as something like a "canonical tar format" - and e.g. require that files in the input stream are sorted by name in UTF-8 order, etc.

Yeah, this pair of PRs is big enough and invasive enough that it does warrant some thinking whether we could be getting more features for similar amount of complexity.

Intuitively I don’t see reproducibility (from pulled and expanded image stores) as that much of an end goal; users shouldn’t be copying images by pull&push in any case, it’s going to be less efficient than a straight registry-to-registry copy no matter what we do. (Reproducibility is a useful property for a build system, which is a more constrained problem, e.g. it can choose to use a reproducible compression implementation or freeze an implementation version in a way the whole ecosystem can’t.)

It seems interesting to me to think about the size/speed aspects of the incremental pull, to see if we can improve those. (If it turns out that we need a canonical uncompressed tar format for that, sure, let’s; it just doesn’t look like a natural place to start.) At least my life would be easier if we had only a small number of delta/incremental-update mechanisms, instead of giving the user a menu of 5 and having to worry about the 2^5 possible code paths in the worst case; so a format that allows best-of-all-worlds is a nice dream :)

OTOH that’s all entirely theoretical until there’s an idea for how to do that, and interest+time to implement it; and this PR shouldn’t be blocked on an unachievable hypothetical.


[In a sense, the uncompressed tar format is reproducible already, due to tar-split. (That‘s the other kind of reproducible, from a pulled image, not from a filesystem during a build process.) I don’t know whether that helps because I’m not sure what we need reproducibility/canonical form for. Just for more layer reuse across separately-built images?]

@cgwalters
Contributor

cgwalters commented May 6, 2021

That seems hard to do with compressed artifacts

Yes; we can really only care about the uncompressed version. Clients would need to verify the uncompressed checksum and not the compressed one.

That said, after some reflection I think "canonical tar" could be done after this in theory - they're mostly orthogonal problems.

[In a sense, the uncompressed tar format is reproducible already, due to tar-split.

(looked this up, seems to be 9b04055)

I don’t know whether that helps because I’m not sure what we need reproducibility/canonical form for. Just for more layer reuse across separately-built images?]

Basically the same rationale as https://reproducible-builds.org/ - the downside of tar-split is basically that the original tar stream becomes canonical and has to be carried around.

@giuseppe
Member Author

giuseppe commented May 7, 2021

I've marked all the APIs in containers/storage and containers/image as experimental, to have the freedom of introducing breaking changes until we are happy with them. I am planning to do more incremental improvements later, as well as possibly supporting stargz/estargz for compatibility with clients and registries that don't support zstd yet.

The reason I am pushing to get this merged is so that we can finally get users to play with it and give feedback, instead of expecting them to deal with re-vendoring containers/storage and containers/image into the container tools.

We can still improve the new format and add tar reproducibility later. I think the tar-split blob can be embedded in the zstd stream, the same way the TOC metadata is added right now.

@giuseppe
Member Author

giuseppe commented May 7, 2021

Fedora 32 vfs fails to start, but I guess we can probably drop tests on Fedora 32

@Romain-Geissler-1A

Romain-Geissler-1A commented May 7, 2021

Hi,

FYI, I am speaking from a docker/podman user point of view.

If I understand correctly, this proposal is more or less similar to the eStargz proposal, but using zstd instead, so that metadata unknown to old clients can be injected without impact, whereas eStargz currently results in some unexpected files being injected inside the running container. Also, if I understand correctly, this allows the client side to share parts of files found in different and unrelated images, if those parts share the same hash, and that is really cool!

While this proposal seems to be a very good improvement on the "pull" side, is it also an improvement for the "push" scenario and for storage on the remote registry? I mean, let's say a file's content already exists in some image on a remote registry. If I build a local image using this zstd format which contains a file with the same content, when I run "podman push", will it push the full zstd tar file, or only the parts the registry doesn't know yet? And on the registry side, will it be able to serve the "big" zstd file as a collection of smaller concatenated zstd tar files, and thus only store the split zstd tar files rather than the "big" ones? If this pull request also eventually allows improving the "push" scenario, that is super cool!

Let me explain why we at Amadeus might be interested in such a proposal (as users). We have many applications which are C++ based. We use (for now) mainly dynamic linking, and contrary to other companies our teams are very interdependent: we have several layers of C++ middleware teams developing diverse libraries that are eventually reused by application developers. Our applications are quite big and tend to need many dependencies from various teams; compiled .so shared libraries are often 10MB files, up to sometimes 100+MB, even after being stripped. We have thousands of these libraries in total, and big applications might require a lot of them. In production images, we install part of these libraries using "RUN dnf install some-amadeus-package", but the majority are installed in a simpler way, by just untarring some tar.gz file in a new OCI layer.

To develop internally, we also use Docker images, so these come with GNU C/C++ based toolchains, which are also very heavy (a single toolchain containing both gcc and LLVM tools can easily weigh 1GB in total). While we do offer one "standard SDK" base image where most tools are present, some teams choose to develop their own image, writing their own Dockerfile. Most of our tools are available as RPM packages, so they naturally write "RUN dnf install some-amadeus-packages && dnf clean all".

The fact that we have many teams dependent on each other, and that the set of dependencies differs in every team, means that in the end we have many different SDK/production images in our registries, and layer storage diffs are very rarely shared between images, even though we have MANY big files in common across them. Anything based on "dnf" is doomed to fail to share layers, as "dnf install" results in unique timestamps, but not only that (i.e. we can't just fix it by using "podman build --timestamp"): "dnf" internally relies on a database, which will contain a different package list for every team, and will thus look different from a filesystem point of view, leading to a different tar.gz sha256.

So on the Amadeus side, proposals that would allow sharing not only whole layer filesystem-diff tar.gz blobs, but really the individual file contents (ignoring metadata) inside these tar.gz (or tar.zstd) files, would definitely help us save a lot of disk space and network bandwidth by downloading/pushing just the "new" files. We are concerned with the registry side as well, not only the client side, as we host our own internal JFrog Artifactory registry, and we regularly receive mails asking us to clean up all the Docker images we don't use anymore, as it fills up pretty quickly.
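The file-level sharing described above boils down to a content-addressed lookup: index known local files by the digest of their content, and skip downloading anything whose digest is already present. A toy sketch (all names here are illustrative, not the PR's actual code):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hostFileIndex is a toy stand-in for the kind of lookup this PR performs:
// known local files are indexed by the SHA-256 of their content, and a file
// from a layer whose digest is already present can be linked or copied
// locally instead of being fetched from the registry.
type hostFileIndex map[[sha256.Size]byte]string

func (idx hostFileIndex) add(path string, content []byte) {
	idx[sha256.Sum256(content)] = path
}

// lookup returns a local path holding identical content, if any.
func (idx hostFileIndex) lookup(content []byte) (string, bool) {
	path, ok := idx[sha256.Sum256(content)]
	return path, ok
}

func main() {
	idx := hostFileIndex{}
	idx.add("/usr/lib64/libfoo.so", []byte("shared library bits"))

	// A layer being pulled contains the same bytes under a different name:
	// the content is found locally and need not be downloaded again.
	path, ok := idx.lookup([]byte("shared library bits"))
	fmt.Println(path, ok)
}
```

Because the index keys on content rather than file names or timestamps, identical .so files scattered across unrelated images dedupe regardless of how each image was built.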

I am curious about this one, and about the mentioned OCIv2 as well ;)

Cheers,
Romain

@ktock
Contributor

ktock commented May 7, 2021

@Romain-Geissler-1A

This is exactly one of the requirements of OCIv2! It might require the API changes on the registry side.

But if the registry can understand seekable compression formats (e.g. zstd:chunked, eStargz, stargz) and the TOC format, I think we don't even need to change the image spec: we could perform PATCH uploads to these layers at chunk-level granularity.

Anyway, as mentioned in #775 (comment), I think we can take care of it in follow-up PRs.

@giuseppe
Member Author

@mtrmac @cgwalters are you fine with the current status of this PR? Or anything against moving this forward?

@cgwalters
Contributor

In rpm-ostree we have rpm-ostree ex <subcommand> which is a place to ship commands that are marked experimental. Does podman have anything like this? Basically it creates a middle ground between the otherwise vast gulf separating "code in a PR" and "code shipped to vast numbers of people via distribution stable releases".

In particular as soon as this is vendored in podman for example, then it becomes effectively frozen in that we need to support the format ~forever right? Whereas if it requires something like echo enable-experimental-zstd-chunked: true >> ~/.containers/storage.conf that's a different thing.

I only skimmed the design, and barely any of the code.

@mtrmac
Collaborator

mtrmac commented May 11, 2021

@mtrmac @cgwalters are you fine with the current status of this PR? Or anything against moving this forward?

  • I am still seriously worried about host file reuse. Why is it known to be secure?
  • Shouldn’t @nalind review this either way? (I’m not all that familiar with the codebase, and I have only quickly skimmed it once.)

@rhatdan
Member

rhatdan commented May 11, 2021

As far as scanning for matching files in the base OS goes, @giuseppe has agreed to make this optional and limited to /usr. We can turn it off by default, and then users with certain workloads could turn it on.

@giuseppe
Member Author

As far as scanning for matching files in the base OS goes, @giuseppe has agreed to make this optional and limited to /usr. We can turn it off by default, and then users with certain workloads could turn it on.

it is already disabled by default in the last version of these patches and it must be enabled in the storage.conf configuration file.

In particular as soon as this is vendored in podman for example, then it becomes effectively frozen in that we need to support the format ~forever right? Whereas if it requires something like echo enable-experimental-zstd-chunked: true >> ~/.containers/storage.conf that's a different thing.

it won't be enabled by default yet; it must be explicitly configured (haven't decided yet whether through the storage.conf config file or through some env variable)

@giuseppe
Member Author

@mtrmac @nalind @rhatdan so fine to move forward with this PR (and considering the changes needed in c/image)?

@rhatdan
Member

rhatdan commented May 14, 2021

Since this is experimental, I am going to merge, so we can move forward.

@rhatdan rhatdan merged commit 4f69205 into containers:master May 14, 2021
@cgwalters
Contributor

I randomly came across https://github.com/dragonflyoss/image-service - not sure if you all have seen it. May be worth a "Related projects" section somewhere?
