Docker tarsum #11

Closed
giuseppe opened this Issue Mar 16, 2016 · 20 comments

@giuseppe
Member

giuseppe commented Mar 16, 2016

I would like to get the tarsum of docker save'd images so that I can use the same sha256 used in the v2 registry. Is Skopeo the right place to add this functionality? I would like something like `skopeo tarsum /path/to/foo.tar` that internally uses https://github.com/docker/docker/tree/master/pkg/tarsum

What do you think?

@runcom

Member

runcom commented Mar 16, 2016

Makes sense, but we already have https://github.com/vbatts/docker-utils#dockertarsum. Could you have a look at it?

@rhatdan

Member

rhatdan commented Mar 16, 2016

@vbatts PTAL

@giuseppe

Member

giuseppe commented Mar 16, 2016

How is the sha256 checksum computed for each layer? I have tried both:

docker save busybox | dockertarsum

and

docker save busybox | tar xO d51a083a3b01fe8c58086903595b91fc975de59a9e9ececec755df384a181026/layer.tar | dockertarsum

but I still don't get the same hash I see in the v2 registry image manifest (under fsLayers).

@giuseppe

Member

giuseppe commented Mar 17, 2016

I got a bit confused by all the different IDs that are around, so this feature is probably not needed in Skopeo.

My idea was that we would skip downloading layers that we have already imported into the OSTree repository from a docker save'd tarball, e.g.

docker save -o busybox.tar busybox
atomic pull --tar=busybox.tar
atomic pull --docker busybox.tar # Does a pull from the registry

The second pull would not download anything. To do that, though, I would have to map each imported layer to the same sha256 used by fsLayers.

It seems the sha256 listed under fsLayers is just the sha256sum of the binary blob you download from the v2 registry. Given it is compressed, I don't know if it is possible to make it reproducible starting from a docker save'd image.
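
A minimal sketch of why the two checksums disagree, assuming the registry blob is simply the gzip-compressed form of layer.tar. The tiny in-memory archive and the helper names (`buildLayer`, `gzipBytes`, `digest`) are hypothetical stand-ins, not part of Docker, Skopeo or atomic:

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// buildLayer returns a tiny in-memory tar archive standing in for a layer.tar.
func buildLayer() []byte {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("hello from a layer\n")
	hdr := &tar.Header{
		Typeflag: tar.TypeReg,
		Name:     "etc/motd",
		Mode:     0644,
		Size:     int64(len(body)),
	}
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}
	if _, err := tw.Write(body); err != nil {
		panic(err)
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// gzipBytes compresses data with Go's compress/gzip at the default level.
func gzipBytes(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// digest formats a sha256 sum in the "sha256:<hex>" form used by manifests.
func digest(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

func main() {
	layer := buildLayer()
	// Same content, two different byte streams, two different digests.
	fmt.Println("uncompressed layer.tar:", digest(layer))
	fmt.Println("gzip-compressed blob:  ", digest(gzipBytes(layer)))
}
```

The two printed digests differ, which matches the mismatch between hashing a layer.tar from docker save and the value listed under fsLayers.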

Is there a way to do that or any plan to have a v2 similar layout for docker save'd images?

@rhatdan

Member

rhatdan commented Mar 17, 2016

@vbatts ^^

@cgwalters

cgwalters commented Mar 17, 2016

We'd need to look at how the Docker Engine handles this now...the core problem that @vbatts was trying to solve with tarsum was similar to pristine-tar: being able to regenerate a tarball with the same checksum from disk content.

But the Docker people just went with straight sha256...so Engine must do something here. One thing I'd suggested was that if you were re-uploading a layer, the Engine would keep track of the registry it downloaded it from, and rather than re-synthesize it on the client, just re-fetch from the server. Was something like that implemented?

@cgwalters

cgwalters commented Mar 17, 2016

That said, do we actually need the ability to do docker save and have it avoid redownloads for anything?

Oh actually, this gets into the whole "docker save doesn't do v2" problem which the OSBS people hit...I don't have a link handy.

@runcom

Member

runcom commented Mar 17, 2016

@giuseppe the sha256 stored in the manifest is just the sha256sum of the downloaded layer's tar file, e.g. I downloaded layer a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4.tar from docker.io/busybox

sha256sum a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4.tar
a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4  a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4.tar

If I add a random file to that tar I get a different sha sum:

sha256sum a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4.tar
4589eafb71a90c4acbd360e68de3517b5d7844dbaaf5fb3ca9fe841f3ae1e754  a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4.tar

@runcom

Member

runcom commented Mar 17, 2016

Oh actually, this gets into the whole "docker save doesn't do v2" problem which the OSBS people hit...I don't have a link handy.

This is one of the issues, though.

@giuseppe

Member

giuseppe commented Mar 17, 2016

@runcom, yes, I realized that afterwards. Probably the reason for calculating the sha256 over the compressed blob, before decompression, was to reduce the number of operations done on unverified data; but now there is the issue of how to get the same sha256 given the same tarball.

@cgwalters yes, I agree that it is not a blocking issue, the only additional cost is to redownload the image if the two methods are used together (import a docker save'd image + fetch it from a registry).

@cgwalters

cgwalters commented Mar 17, 2016

BTW the discussion on compression, checksums, and reproducibility mirrors the rationale for https://github.com/cgwalters/git-evtag (and we should probably have an OSTree equivalent).

@vbatts

Contributor

vbatts commented Mar 22, 2016

tarsum is not used.

@vbatts

Contributor

vbatts commented Mar 22, 2016

I'm trying to figure out what issue we're solving here. Does something still need attention?

@giuseppe

Member

giuseppe commented Mar 22, 2016

@vbatts The issue is that the checksums for layers imported from a docker save'd tarball are different from those shown in the manifest file (under fsLayers). If we first import to the OSTree repository from a tarball, and later fetch the same image from the registry, we will need to download each layer again, even though it is the same image and the same version.

Is there a way that I can get the same sha256 checksums that are under fsLayers from the image.tar file obtained as docker pull IMAGE && docker save -o image.tar IMAGE?

@runcom

Member

runcom commented Mar 22, 2016

We could try to convert the manifest to the docker save format and generate the hash (I can probably take a look at it). Vincent, is it feasible this way?

@vbatts

Contributor

vbatts commented Mar 22, 2016

The checksum of the blob in the fsLayers is of the layer.tar itself. For example:

vbatts@f ~ (master) $ docker pull fedora:latest
latest: Pulling from library/fedora

a3ed95caeb02: Already exists
236608c7b546: Pull complete 
Digest: sha256:1fa98be10c550ffabde65246ed2df16be28dc896d6e370dab56b98460bd27823
Status: Downloaded newer image for fedora:latest
vbatts@f ~ (master) $ docker save fedora:latest | tar -tv
drwxr-xr-x 0/0               0 2016-03-04 18:40 768d4f50f65f00831244703e57f64134771289e3de919a576441c9140e037ea2/  
-rw-r--r-- 0/0               3 2016-03-04 18:40 768d4f50f65f00831244703e57f64134771289e3de919a576441c9140e037ea2/VERSION  
-rw-r--r-- 0/0             388 2016-03-04 18:40 768d4f50f65f00831244703e57f64134771289e3de919a576441c9140e037ea2/json  
-rw-r--r-- 0/0            1024 2016-03-04 18:40 768d4f50f65f00831244703e57f64134771289e3de919a576441c9140e037ea2/layer.tar  
drwxr-xr-x 0/0               0 2016-03-04 18:40 9a233237d70560774705931fc55fe1a3a4619cccf2d0a76671256080c2af6fdb/  
-rw-r--r-- 0/0               3 2016-03-04 18:40 9a233237d70560774705931fc55fe1a3a4619cccf2d0a76671256080c2af6fdb/VERSION  
-rw-r--r-- 0/0            1195 2016-03-04 18:40 9a233237d70560774705931fc55fe1a3a4619cccf2d0a76671256080c2af6fdb/json  
-rw-r--r-- 0/0       212476928 2016-03-04 18:40 9a233237d70560774705931fc55fe1a3a4619cccf2d0a76671256080c2af6fdb/layer.tar  
-rw-r--r-- 0/0            1667 2016-03-04 18:40 ddd5c9c1d0f2a08c5d53958a2590495d4f8a6166e2c1331380178af425ac9f3c.json
-rw-r--r-- 0/0             279 1970-01-01 00:00 manifest.json  
-rw-r--r-- 0/0              89 1970-01-01 00:00 repositories
vbatts@f ~ (master) $ docker save fedora:latest | tar xO 768d4f50f65f00831244703e57f64134771289e3de919a576441c9140e037ea2/layer.tar | sha256sum
5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef  -
vbatts@f ~ (master) $ docker save fedora:latest | tar xO 9a233237d70560774705931fc55fe1a3a4619cccf2d0a76671256080c2af6fdb/layer.tar | sha256sum
4f9e31a2233f97f1fe18c26d44effd10a5ea3d9839299cf88003c85aea75391c  -

Though now that I'm thinking about it, they do compress it before pushing. While gzip (deflate) is deterministic, its output still varies by implementation. To get the same stream, you'd have to use golang's 'compress/gzip' library and the same compression level that the docker engine uses. It would only take a couple of lines to write an executable for this.

@giuseppe

Member

giuseppe commented Mar 23, 2016

Thanks for confirming it. I am going to add a helper process to atomic that can be used to compute the sha256 for the tarballs.

I am going to close this issue, since it is not related to Skopeo anyway.

@cgwalters

cgwalters commented Apr 7, 2016

Ugh, so everything (including Docker engine) that interacts with Docker images now needs to use a specific implementation of compress/flate forever? And if anyone improves the implementation in golang, then every project will have to carry a forked copy of the current version.

@vbatts

Contributor

vbatts commented Apr 18, 2016

@cgwalters compression is the bane of a lot of this. The compress/gzip in golang is consistent with itself (for each compression level), just as GNU gzip is (including the --no-name flag). But there are numerous implementations of the deflate algorithm, and non-trivial ways to unpack and repack content with the same implementation.
So yes, many of the assumptions of content-addressability that folks are depending on depend on specific setups of golang's compress/gzip.

@vbatts

Contributor

vbatts commented Apr 18, 2016

@cgwalters further, this is my big argument for addressing based on the digest of the uncompressed tar. That, however, invalidates many APIs dealing with digests of opaque blobs, and conflicts with the desire to not transport huge uncompressed archives.
