This repository has been archived by the owner on Sep 12, 2018. It is now read-only.

NG: access image via immutable identifier #804

Closed
ncdc opened this issue Dec 1, 2014 · 35 comments


ncdc commented Dec 1, 2014

In OpenShift, we would like to be able to access an image (i.e. docker pull/create/run) via an immutable identifier that uniquely identifies an image for all points in time. Here are some use cases:

  1. During a deployment of a group of containers, the same image must be used when creating each container. If the image we're using is foo/bar:latest (let's call this Rev1) and someone pushes an updated foo/bar:latest manifest in the middle of the deployment (Rev2), some containers might be created using Rev1 and others with Rev2. The correct behavior is for all containers to be running based on Rev1.
  2. A group of containers has been deployed (foo/bar:latest - Rev1). As development continues, foo/bar:latest is updated several times, but none of these newer image manifests has been deployed yet. The user decides to scale up the existing deployment, so OpenShift needs to create and start new containers using the same image that is currently deployed (Rev1). foo/bar:latest can't be used because that's no longer Rev1 - it's now something else (e.g. Rev17).

With the v1 registry, we created a custom extension that responds to the tag_created signal, creates a new tag whose name is the image's id (since a v1 image has an id), and then HTTP POSTs a payload to OpenShift with

  1. namespace
  2. repository
  3. tag
  4. image id
  5. image metadata

OpenShift users can create deployment triggers that watch for changes to foo/bar:latest and then perform new deployments. When deploying, OpenShift inspects the image to find its id and then translates foo/bar:latest to foo/bar:$image_id. This, combined with our custom v1 registry extension, allows us to pull an image by its id. It also lets us deploy a specific image by id, as our deployments don't refer to :latest but instead to the id.

@dmp42 suggested that OpenShift could pull foo/bar:latest, generate a new tag (based on commit id, date/time, etc) that is unique and consistent for all time, push the new tag, and then use that when deploying. This creates a couple of problems:

  1. The tag isn't immutable, which could be problematic if someone accidentally pushed an updated image manifest for a tag that is meant to never change
  2. The new manifest would be signed by OpenShift, not by the user creating the image; if a user wants to run an image that he/she signed, we lose this ability.

It would be really nice to have immutable identifiers for image manifests that are consistent for all time.

@dmp42 @stevvooe @smarterclayton @wking thoughts?


dmp42 commented Dec 1, 2014

Thanks for the writeup @ncdc

What you ask for (immutable "references" to tags) equals "complete history of every tag version" in the new lingo. Indeed, this is a casualty of the new design - by making clear the difference between an image (a list of layers, name and signature) and a layer (a binary blob), tags are no longer aliases of immutable layer ids.

I'm not sure I have a good solution for you yet, so, let's keep this open and give it some thinking.


stevvooe commented Dec 2, 2014

@ncdc With V2, it's very likely we'll have changes to the way things must work. While we should be avoiding serious changes, I hope your team can be somewhat open to this, in the hope that we'll all get some benefits.

That said, it seems like we need some sort of content-addressable digest of V2 image manifests to make this work. There are two levels that might be interesting to applications:

  1. A signature-independent content-addressable digest taking into account the following fields:
    • "schemaVersion"
    • "architecture"
    • "fsLayers.blobSum"
    • "history.v2Compatibility" (maybe omitting internal fields, like "created")
  2. A signature-dependent content-addressable digest that includes the fields from above and the content in "signatures". This might actually just be the sha256 hash of the content, given that the signature is dependent on name and tag.

Number 1 would provide addressability of "identical" images, whereas number 2 would provide addressability over specific builds. The main issue with this is that it makes registry garbage collection nearly impossible, because all layers will technically be referenced by all manifests.

Please note that this is just a thought exercise.
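A rough sketch of those two digest levels (the field selection follows the list above; the canonical-JSON step and the helper names are assumptions for illustration, not the actual registry code):

```python
import hashlib
import json

def content_digest(manifest):
    # Level 1: signature-independent digest over the listed fields only,
    # so name/tag and signatures do not affect the result. Note: in actual
    # schema-1 manifests the history field is spelled "v1Compatibility".
    subset = {
        "schemaVersion": manifest["schemaVersion"],
        "architecture": manifest["architecture"],
        "fsLayers": [{"blobSum": l["blobSum"]} for l in manifest["fsLayers"]],
        "history": [{"v1Compatibility": h["v1Compatibility"]}
                    for h in manifest["history"]],
    }
    payload = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def signed_digest(manifest_bytes):
    # Level 2: signature-dependent digest is just the hash of the full
    # payload, since the signatures already cover name and tag.
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()
```

With this split, two pushes of an identical image under different names share a level-1 digest but differ at level 2.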


wking commented Dec 2, 2014

On Tue, Dec 02, 2014 at 10:56:43AM -0800, Stephen Day wrote:

  1. A signature-independent content-addressable digest taking into
    account the following fields:
    • "schemaVersion"
    • "architecture"
    • "fsLayers.blobSum"
    • "history.v2Compatibility" (maybe omitting internal fields, like "created")

I think you mean v1Compatibility, but other than that this is just the
set of fields that get signed (moby/moby#8093) excepting ‘name’
and ‘tag’. That makes sense to me.

  2. A signature-dependent content-addressable digest that includes
    the fields from above and the content in "signatures". This might
    actually just be the sha256 hash of the content, given that the
    signature is dependent on name and tag.

You don't want to add ‘name’ and ‘tag’ back here? And can you explain
why you want to embed the signature-independent content here instead
of referencing it with a content-addressable hash of the
signature-independent content (giving us back thin tags)? Or is this
an either/or proposal and not a both/and? In that case, why not take
a both/and approach?

The main issue with this is that it makes registry garbage
collection nearly impossible, because all layers will technically be
referenced by all manifests.

Can you elaborate on this? Manifests should only reference the layers
in their history, and I see no problem if many tags share references
to common layers. In fact, I think shared references like that are a
good thing for quota allocation [1,2].


ncdc commented Dec 2, 2014

The main issue with this is that it makes registry garbage collection nearly impossible, because all layers will technically be referenced by all manifests.

If pruning is pluggable, you could have a default implementation that prunes layers and manifests as soon as they become stale (i.e. no longer actively referenced by a tag). This would require thin tags I would imagine. Other pluggable implementations could defer to an external system (e.g. OpenShift) for quota enforcement and pruning decisions, or support simple caps such as keeping at most n revisions of a manifest for namespace/repo:tag.
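A minimal sketch of such a keep-at-most-n policy (the oldest-first revision list is a hypothetical data model, purely for illustration):

```python
def prunable_revisions(revisions, keep=3):
    """Return the manifest digests a keep-at-most-n policy would prune.

    `revisions` lists a tag's manifest digests oldest-first; the newest
    `keep` entries survive. keep=1 matches the current behavior of only
    one live revision per tag.
    """
    if keep < 1:
        raise ValueError("must keep at least the current revision")
    return revisions[:-keep]
```

A pluggable pruner could apply this per repo, or defer the decision to an external system entirely.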


wking commented Dec 2, 2014

On Tue, Dec 02, 2014 at 11:52:05AM -0800, Andy Goldstein wrote:

… keeping at most n revisions of a manifest for
namespace/repo:tag.

How would you access these older manifests? Are you imagining new API
endpoints to get manifests by their content-addressable hash? I don't
see why you can't just prune immediately, and have folks interested in
maintaining references create their own thin tags to the material they
care about.


ncdc commented Dec 2, 2014

I was thinking about accessing via the content-addressable hash, yes. Although if we can have thin tags back, I think that would likely be sufficient for our needs. We'd either migrate our v1 extension to v2, where the registry extension automatically adds a thin tag on every push, or we'd have our deployment code create the thin tag on the fly and then use that as the image name when deploying.

dmp42 added this to the GO-RC milestone Dec 2, 2014

stevvooe commented Dec 3, 2014

@ncdc It sounds like the real problem is creating a consistent identifier to a
manifest, such that it doesn't change when a deployment is kicked off. I'm
going to go over some stuff about garbage collection and then we'll address
that root issue.

The issue with permanently exporting references becomes apparent when one
looks at the garbage collection implementation. Let's say we have the
following list of pushes for the image tagged with a (in the same repo), each
image version identified by "revision":

tag revision  layers
--- --------- ------
a   a1        layer(0), layer(2)
a   a0        layer(0), layer(1)

The above table would represent two pushes. One relying on layer(0) and
layer(1) and one replacing layer(1) with layer(2). The accessible roots
(or "image ids") are a, a0, and a1. The target of a is temporally
dependent on whether a1 has been pushed or not.

We can use some diagrams to understand how these references develop with each
push. Before the push of a1, all known layers are referenced:

         [a]
          |
         [a0]
         / \ 
        /   \
[layer(1)] [layer(0)]

After the push, we can see that all of the layers are still referenced, and
therefore not safe for delete, but a0 is technically orphaned:

                   [a]
                    |
         [a0]      [a1]
         / \       / \
        /   \     /   \
[layer(1)] [layer(0)] [layer(2)] 

In the approach where no references are ever removed, such as this content-
addressable proposal, all layers must remain forever, unless you allow
deletion by id (a0, a1, etc.).

Under the current V2 registry scheme, a0 and a1 don't exist, so layer(1)
becomes unreferenced after the push of a1 and can be "safely" deleted. We
can see that layer(1) is trivially identified for deletion in this approach:

                    [a]
                    / \
                   /   \
[layer(1)] [layer(0)] [layer(2)]

The proposed content-addressable scheme can be mimicked by pushing manifests
that link different versions of a (where / represents the same manifest
pushed with different tags):

       [a0]        [a/a1]
       /  \         / \
      /    \       /   \
[layer(1)] [layer(0)] [layer(2)]

I'm not saying that we should never implement content-addressable manifest
storage, but it's a lot of bookkeeping for a system that should be simplified.
The current approach allows users to control the deletion pattern while still
allowing one to keep multiple versions of the same image around without a lot
of work on either side.
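The collection rule in the diagrams above can be sketched as reference counting from reachable roots (a toy model; the names are made up):

```python
def deletable_layers(roots, manifests, stored_layers):
    """roots: ids of manifests still reachable (tag targets, plus any
    content-addressed revisions being retained); manifests: id -> set of
    layer ids it references. A stored layer is deletable once no root
    references it.
    """
    live = set()
    for mid in roots:
        live |= manifests[mid]
    return stored_layers - live
```

Under the current scheme only the tag target is a root, so layer(1) becomes collectable after the push of a1; under the content-addressable scheme every revision stays a root and nothing is ever collectable.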

For @ncdc's use case, referring to the most recent diagram, a0 would be version
0 and a1 would be version 1 of the pushed manifest, but the latest would be
referenced via a. The real problem is programmatically associating a with
a1 (or the "latest" version) when a notification activates a deployment. As a corollary: given an image tag a, what are its equivalent images, and how can a permanent reference be retained?

This use case is what we really need to focus on.


wking commented Dec 3, 2014

On Tue, Dec 02, 2014 at 06:44:54PM -0800, Stephen Day wrote:

After the push, we can see that all of the layers are still
referenced, and therefore not safe for delete, but a0 is
technically orphaned:

Ah, I'd delete a0 and layer(1) once a1 had been pushed, since I don't
see a point to referencing untagged manifests [1]. Then you don't
have this garbage-collection issue. Pushing and deleting thin tags
allows anyone to mark out existing manifests that they want to
preserve access to (at the cost of sharing the quota load for that
manifest and its referenced layers [2]). There's also no need to
access a manifest by its ID over the client↔registry API; you just use
the ID when you create your thin tag, and push/get/delete that tag
using the existing API.

I'm not saying that we should never implement content-addressable
manifest storage, but it's a lot of bookkeeping for a system that
should be simplified. The current approach allows users to control the
deletion pattern while still allowing one to keep multiple versions
of the same image around without a lot of work on either side.

I don't see the additional bookkeeping cost to my proposal, except for
a separation between the content-addressable parts of the manifest and
the name/tag/signature parts. That can happen purely inside the
registry, with no need to adjust the client's API calls or the storage
driver API.

The real problem is programmatically associating a with a1 (or
"latest" version) when a notification activates a deployment.

In my scheme, the lightweight tag body would embed the
content-addressable manifest ID directly [3]. So going from a tag (a)
to a content-addressable manifest ID (a1) is trivial. If you don't
want to expose this to clients over the client↔registry API, you can
always inline the content-addressable manifest before returning the
tag (but clients would still need to be able to calculate the
content-addressable ID if they wanted to create new tags).

As a corollary, given an image tag a, what are its equivalent images…

This is where refcount arrays come in [4], and I think that's handled
by the layerindex/ stuff that landed with #729.

… and how can a permanent reference be retained?

Just add a lightweight tag pointing to any content-addressable
manifest you want to preserve, and remove your tag when you no longer
need access to that manifest.

@titanous

A similar issue is moby/moby#4106, and I added a comment here: moby/moby#9015 (comment)

@stevvooe

Another problem with content-addressable manifest ids is that it muddies the role of a name/tag reference. If name and tag are omitted from the calculation of such an id and multiple manifests with different names and tags have identical ids, which manifest should be returned?

It would stand to reason that the id reference would have to be namespaced by the repository, rather than by pure id. And, at the same time, if an image could be referenced by both name+id and name+tag, it would break the guarantee that the url /v2/<name>/manifests/<tag> always returns a manifest with name and tag. Does the registry change the manifest? If so, who signs the updated manifest?


wking commented Dec 18, 2014

On Thu, Dec 18, 2014 at 02:48:06PM -0800, Stephen Day wrote:

Another problem with content-addressable manifest ids is that it
muddies the role of a name/tag reference. If name and tag are
omitted from the calculation of such an id and multiple manifests
with different names and tags have identical ids, which manifest
should be returned?

Can't we just return the content-addressable manifest without a
name/tag? Why does the Docker engine need a name? Currently we get
along fine with arbitrarily generated names like
511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158.

… who signs the updated manifest?

One reason I like thin tags and detached signatures, which you can
pass back to the client so it knows:

  • Docker thinks that image is scratch:latest
  • Trevor thinks that image is wking/empty:5


ncdc commented Dec 19, 2014

@stevvooe @wking @titanous @dmp42 how about something like this:

Normal push command as we know it today:

docker push foo/bar:latest (where the digest for the image is D1)

Command to flag an image (by tag & digest) so it won't be garbage collected on future pushes:

docker mark foo/bar:latest (marks the current image)
docker mark foo/bar:latest@D1 (marks a specific image by digest)

Do another push. This time, because we asked to keep the previous version of the image around, it's not deleted:

docker push foo/bar:latest (digest=D2)

Retrieve latest version of image:

docker pull foo/bar:latest (gets you the most recent, D2)

Retrieve a specific version:

docker pull foo/bar:latest@D1
docker pull foo/bar:latest@D2

Command to remove the hold on a particular image + digest:

docker unmark foo/bar:latest@D1

I'm trying to think of something that doesn't require huge changes to the v2 proposals, but that still can give us "pull by immutable identifier." This gives control to the user to decide which images to hold on to, doesn't require the manual use of mutable tags, and still allows the registry to do garbage collection. What do you all think?

EDIT: instead of mark/unmark described above, a better UX would be to automatically preserve n revisions/digests per manifest. The value of n could be controlled globally by the registry operator, and optionally set by operators and/or users per repository as well. A value of 1 would keep the current behavior of only allowing 1 revision per tag.

@titanous

@ncdc As I mentioned earlier, we can't assume write access on the part of the user (or robot) doing the deploy. So unless the mark API only required read access to the image, this is not viable for us.

For instance, this would preclude using any public images on Docker Hub, as well as images created by other teams in larger organizations.


ncdc commented Dec 19, 2014

@titanous if Bob wants to deploy Alice's alice/apache:latest (digest=D1) and Alice then pushes a new version of the image (digest=D2), it should be up to Alice and/or the registry operator to decide if D1 is preserved, wouldn't you think? Bob shouldn't be able to impact that decision, since it's Alice (and/or the registry operator) who is paying for storage of her images.


ncdc commented Dec 19, 2014

In other words, Bob could refer to alice/apache:latest@D1, but there's no guarantee that image will always be around, since it's not Bob's decision to make.

@titanous

@ncdc Correct, I'm not proposing any policies with regards to garbage collection.


ncdc commented Dec 19, 2014

@titanous so... is my suggestion still an issue for you? The marking is strictly a means to inform the registry not to GC a particular manifest.

@smarterclayton

Marking for preservation (or whatever you call it) is more composable than having garbage collection live entirely inside the docker registry. The idea of the registry defining rules for collection and quota, but the user being able to tag and mark as well as to pull by id, seems like it would allow the registry to function as a general-purpose store for images for both humans and systems.

Invoking mark on an image is a request by a user to preserve that image. I guess one downside is that the next question is whether you need ref counting. I'm assuming that mark is the image owner's choice, and that higher-level systems that are managing images are really the image owners anyway, so they can implement their own ref counting.

@titanous

@ncdc As long as we can fetch manifests by digest without marking them, then it's totally fine.


ncdc commented Dec 19, 2014

@titanous yes, marking shouldn't be required to fetch by digest


ncdc commented Dec 19, 2014

@smarterclayton I don't think there's a need for ref counting, since layers can be deleted when they're referenced by 0 manifests.


ncdc commented Dec 19, 2014

Some additional questions I just thought of: what do you do if you have a few different revisions of foo/bar:latest, and then you delete foo/bar:latest? Does "latest" move back to the previous revision? Do we delete all traces of that image? Do we just delete the ability to pull foo/bar:latest but keep the ability to pull foo/bar:latest@$digest?


ncdc commented Dec 19, 2014

FYI, I have "pull by tag + digest" working locally, and there's support for "docker run" as well. Here's the quickly hacked together prototype:

ncdc/docker-registry@docker:next-generation...ng-pull-by-digest
ncdc/docker@dmcgowan:v2-registry-pushpull...v2-registry-pullbydigest


ncdc commented Dec 19, 2014

And in the prototype above, the TagStore json file now looks like this: https://gist.github.com/ncdc/c6fb6cba18dfe679a3b6


ncdc commented Dec 19, 2014

@titanous @wking if you have time, please give my POC a spin and let me know what you think.

jlhawn commented Dec 30, 2014

@ncdc said:

docker pull foo/bar:latest@D1
docker pull foo/bar:latest@D2

I kind of like this style of specifying a version a la a git commit.

@stevvooe I don't think it would be too difficult to add this feature. The registry can still be dumb about the content: just hash the manifest/JWS payload, getting some content-addressable ID for the manifest like e36eb1f73548649b.... So when I push an image like:

docker push jlhawn/my-app:3.1.4

The registry will store a manifest that is addressable either by that name (until the name is deleted/updated) or by the hash e36eb1f73548649b... until it is explicitly deleted by the user/administrator.

So I could pull it using:

jlhawn/my-app:3.1.4

or:

jlhawn/my-app@e36eb1f73548649b...

Personally, I would prefer that people not rely on "alias" tags but instead treat tags like version numbers and not allow users to push the same version again. The registry could see that there already exists a manifest with this name and refuse to overwrite it. This would force the user/admin to explicitly delete that version of the manifest from the registry. I think this would essentially enforce the desired behavior - with the exception being that a user may delete a manifest version and then re-upload a changed one with the original name, but they couldn't do it by accidentally overwriting it.
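That refuse-to-overwrite behavior could be sketched like this (a toy store, not the actual registry API; all names here are made up):

```python
class VersionedRepo:
    """Toy repository treating tags as immutable versions: re-pushing an
    existing tag is refused, and every manifest stays addressable by its
    content hash until explicitly deleted."""

    def __init__(self):
        self._tags = {}     # tag -> digest
        self._digests = {}  # digest -> manifest payload

    def push(self, tag, digest, manifest):
        if tag in self._tags:
            raise ValueError(
                "tag %r already exists; delete it explicitly first" % tag)
        self._tags[tag] = digest
        self._digests[digest] = manifest

    def delete_tag(self, tag):
        del self._tags[tag]  # the digest reference remains valid

    def pull(self, ref):
        # ref is either a tag ("3.1.4") or a digest ("e36eb1f7...")
        if ref in self._tags:
            return self._digests[self._tags[ref]]
        return self._digests[ref]
```

Accidental overwrites become impossible; only an explicit delete-then-push can change what a version name means.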

@stevvooe

@jlhawn I think I am on board with this approach, although id references will be explicitly namespaced by the tag. From your example, jlhawn/my-app:3.1.4@e36eb1f73548649b... would be valid, but jlhawn/my-app@e36eb1f73548649b... would not be. This is a simplifying restriction due to the way "tags" work in the V2 manifest format that we may lift at a later point in time.

Ultimately, I feel a lot of the contention comes from the term "tag". @mmdriley is correct in that the git model is implied by the nomenclature. I also agree that the git model of tags is appropriate. However, the field known as "tag" in the V2 manifest is not the right way to implement that style of "alias tags". Arguably, we should change this field to "version".

I think we can avoid some premature decisions by doing the following:

  1. For the initial version, the manifest id is controlled by the registry. The manifest id should be returned as part of the response to a manifest PUT, in addition to a Location header with the canonical URL for the manifest (i.e. /v2/<name>/manifests/<tag>/<digest>).
  2. The "digest" of the manifest is the sha256 of the "unsigned" portion of the manifest, with sorted object keys. This should only be calculated by the registry for the time being.
  3. PUT operations on the manifest are no longer destructive. If the content is different, the "tag" is updated to point at the new content. All revisions remain addressable by digest.
  4. The DELETE method on /v2/<name>/manifests/<tag> should be clarified to delete all revisions of a given tag, whereas DELETE on /v2/<name>/manifests/<tag>/<digest> should only delete the revision with the requested digest.
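Points 1-3 could be sketched roughly as follows (the exact canonicalization details, such as separators and encoding, are assumptions; this is a toy model, not the registry implementation):

```python
import hashlib
import json

def manifest_digest(unsigned):
    # Point 2: sha256 of the unsigned portion with sorted object keys.
    payload = json.dumps(unsigned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class TagStore:
    """Point 3: PUT moves the tag, but every revision stays addressable
    by its digest."""

    def __init__(self):
        self.revisions = {}  # digest -> unsigned manifest
        self.tags = {}       # tag -> current digest

    def put(self, tag, unsigned):
        digest = manifest_digest(unsigned)
        self.revisions[digest] = unsigned
        self.tags[tag] = digest
        return digest  # point 1: the id is returned to the client
```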

I'll repeat the proposed supporting endpoints from moby/moby#9015 comments for reference:

Method  Path                                 Entity    Description
------  -----------------------------------  --------  -----------
GET     /v2/<name>/manifests/<tag>/<digest>  Manifest  Fetch the manifest identified by name, tag and digest.
DELETE  /v2/<name>/manifests/<tag>/<digest>  Manifest  Delete the manifest identified by name, tag and digest.

If we can agree on this as an interim compromise, I think we can move forward and meet the requirements of this request.

Please let me know if any clarification is required.

cc @ncdc


ncdc commented Jan 5, 2015

@stevvooe I definitely like your suggestion of having the registry compute the digest and return it to the client.

I know you had previously expressed concerns about the ability to perform GC if the registry retains copies of every revision of every manifest (unless they're somehow deleted, either manually by a user or automatically via some sort of policy). Are these concerns still an issue for you, or are they mitigated by the delete mechanics listed in bullet 4 above?

cc @smarterclayton - any additional thoughts on the previous comment?


stevvooe commented Jan 5, 2015

@ncdc

Are these concerns still an issue for you, or are they mitigated by the delete mechanics listed in bullet 4 above?

They are mitigated by bullet 4 above. This approach saves everything unless specifically asked to delete it. An external webhook service can then be used to control manifest lifecycle. This keeps GC simple (ref counting) and separates it from lifecycle management. It also reduces the possibility of data loss upon manifest updates.


wking commented Jan 5, 2015

On Tue, Dec 30, 2014 at 11:50:48AM -0800, Stephen Day wrote:

  2. The "digest" of the manifest is the sha256 of the "unsigned"
    portion of the manifest, with sorted object keys. This should only be
    calculated by the registry for the time being.
  3. PUT operations on the manifest are no longer destructive. If
    the content is different, the "tag" is updated to point at the
    new content. All revisions remain addressable by digest.

The PUT operation will still be destructive if only the signatures
change (e.g. new signatures or signers), since the digest of the
unsigned portion stays the same. I'm fine with that, but thought I'd
point it out for completeness.


stevvooe commented Jan 6, 2015

@wking We may lean towards just taking the entire hash of the content to address this. There seem to be problems with specialized hashes no matter what way we try to cut this up. We may want to discuss storing the signatures separately from the manifest.


stevvooe commented Jan 6, 2015

@wking We may actually be able to merge the signatures on the registry side.


wking commented Jan 6, 2015

On Mon, Jan 05, 2015 at 05:10:31PM -0800, Stephen Day wrote:

There seem to be problems with specialized hashes no matter what way
we try to cut this up.

And the way to avoid this is to just hash the whole thing ;). If “the
whole thing” is too much (and I think it is), I'd break the manifest
up into a bunch of individually content-addressable objects (e.g. the
layer tarballs, the image metadata, signed name/tag/image
certifications).
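That breakup could be sketched as a tiny content-addressed store (illustrative only; the JSON list-of-digests "manifest" format here is an assumption, not a proposed wire format):

```python
import hashlib
import json

class BlobStore:
    """Each object (layer tarball, metadata blob, detached signature) is
    stored under the hash of its bytes; a manifest is then just another
    stored object listing digests."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest):
        return self._blobs[digest]

def put_manifest(store, parts):
    # Store every part, then store the list of their digests; the
    # returned digest addresses the whole tree of objects.
    refs = [store.put(p) for p in parts]
    return store.put(json.dumps(refs).encode("utf-8"))
```

Signatures stored this way are naturally detached: they reference the objects they certify by digest and can be added or removed without touching them.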

We may want to discuss storing the signatures separately from the
manifest.

Sounds good to me (this is where I started out with moby/moby#6070
and [1]). I suggested (but didn't elaborate on) additional endpoints
for independently distributing opaque signatures too [2]. Of course,
those proposals were framed in terms of my preferred thin registry,
with all of the validation happening on the client side. With the
current thick registry, you'd have to specify a signature format as
well as signature endpoints and have the registry validate signatures
as they're uploaded, but I think all of that has been worked out for
the new registry code anyway.


stevvooe commented Jan 7, 2015

I've spec'ed out a proposal for implementation in distribution/distribution#46.

docker-archive locked and limited conversation to collaborators Jan 8, 2015

stevvooe commented Jan 8, 2015

I'm going to close this issue, for now, since it has been superseded by distribution/distribution#46. If there is further discussion to be had, please take it there.

stevvooe closed this as completed Jan 8, 2015