Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs/spec: Proposal for new manifest format #1068

Merged
merged 1 commit into from Dec 19, 2015

Conversation

aaronlehmann
Copy link
Contributor

This is a follow-on to PR #62, and it borrows much of the format
from #993, but uses specific formats for the image manifest and manifest
list (fat manifest) instead of a combined generic format.

The intent of this proposed manifest format is to allow multi-arch, and
allow for full content-addressability of images in the Docker engine.

Signed-off-by: Aaron Lehmann aaron.lehmann@docker.com

@aaronlehmann
Copy link
Contributor Author

- **`layers`** *array*

The layer list is ordered starting from the base image (opposite order of schema1).
The `baseLayer` label is intended to improve human comprehensibility of the layer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be better if the specification left room for this kind of unstructured annotation without calling them out? This way, different implementations can add/remove information intended for humans / UI without a schema change?

For examples, in the future we might want to know the "top" layer, or per-layer what it's Dockerfile directive was, or...

Maybe this should be "comment": "base layer"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with that, or we can avoid calling it out in the spec. The intention was to clarify that the label in the example doesn't have any semantic meaning to the engine.

@dmcgowan
Copy link
Collaborator

dmcgowan commented Oct 7, 2015

We should add a compatibility section at the end which will suggest how we will support clients.

The original suggestion is that clients which support this would provide an "Accept" header and the registry will now set the "Content-Type". When either of these headers are missing the assumption made by the other endpoint would be that schema version 1 is used.


- **`layers`** *array*

The layer list is ordered starting from the base image (opposite order of schema1).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation behind reversing this order from the one in schema1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ordering in schema1 was awkward because engines ended up having to process it in reverse. The base layer needs to be added to the graph first, but this was at the end of the list in schema1.

@estesp
Copy link
Contributor

estesp commented Oct 13, 2015

I like this as a cleaner alternate to #993. I assume we still have some questions to answer based on the questions in-line above from @mattmoor and others, but given we can find answers to those, I'm LGTM on this spec design for supporting multi-arch.

@bowlofeggs
Copy link
Contributor

Will this format allow there to be a canonical format for manifests so that there is exactly one way to calculate the digest of a manifest? @TomasTomecek has been working on writing a Python library to calculate the digest of a manifest which has been tricky due to manifests not having a "correct" way to be formatted (things like spacing, ordering of the keys, how to remove signatures, etc.)

The signature is particularly troublesome, since it is included inside the manifest. Most crypto signing libraries transmit the signature outside of the data that is being signed instead of inside it (think of PGP). I think that design would really make digest validation much easier. Another option is to consider a signed manifest and the same manifest with different or no signatures not to be the "same" so they get different digests.

@mattmoor
Copy link
Contributor

@rbarlow Very big +1, this drove us nuts because there is also this, which would make things very straightforward.

- **`schemaVersion`** *int*

This field specifies the image manifest schema version as an integer. A
manifest list uses version `3` to distinguish it from an image manifest.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit strange to me. Doesn't this imply that if you want to rev the regular manifest version, the next available value would be 4?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought this up in the DWG meeting last week as well.

I think that if these objects are going to have perpetually distinct schema versions, we should split them into distinct API objects (with their own schema number to boot).

I've been trying to articulate this for a couple meetings now, probably badly. I will see if I can put together a proposal that better articulates it. I think where we left things last week was to wait for Steve to weigh in this week.

@stevvooe
Copy link
Collaborator

@rbarlow The content used to calculate the digest of a manifest is set by the original generator of the raw bytes. This avoids the need for specifications like Canonical JSON that purport to bring stable generation when they really don't. There are two many cross-language variations JSON generation to always get this right. This is handled simply by saving the raw bytes of the serialized manifest. This is also flexible in that it doesn't prevent implementations from generating deterministic manifests but doesn't require them where they are not needed.

@bowlofeggs
Copy link
Contributor

@stevvooe The problem is that signatures get inserted into the manifest, changing it. If a client needs to calculate the digest of a retrieved manifest that has signatures, that client must carefully remove the signatures and try to guess how the manifest was formatted when its digest was calculated. Since there is no canonical format for the manifest, there isn't a guarantee that doing this is possible.

If the signatures were instead stored separately of the manifest, there would be no need for any client to perform any manipulation at all on the manifest to calculate the digest. This is how PGP emails work, for example. The data being signed and the signature are separate, rather than having the signature inside the data that is being signed. This makes signature verification straightforward. The Docker manifest's inclusion of the signatures within the manifest itself make it difficult to calculate the digest of the manifest and also make signature verification difficult.

I am proposing that the digest and the signatures be separated from the manifest so that a client can calculate a plain old checksum on exactly the digest bits that were retrieved without having to do any manipulation at all. For example, something like this would be viable:

{'key1': 'value1', 'key2': 'value2', etc...}
------- BEGIN DOCKER SIGNATURES -------
[SIGNATURE_1, SIGNATURE_2, etc.]
------- END DOCKER SIGNATURES -------

In the example above, the client would strip the signature block and run the digest algorithm on the exact bits that are left over (the JSON). This way the client doesn't have to manipulate the JSON at all. If the original manifest used spaces and newlines or did it all in one line, the client calculates the checksum the same way - on whatever bits were received. This makes digest validation so much simpler, and it also aids in signature verification which suffers from the same problem.

Does that make sense?

@aaronlehmann
Copy link
Contributor Author

@rbarlow: This revision of the manifest won't include in-line signatures. Signatures will be handled outside the manifest by signing the manifest's digest, for example with Docker Content Trust.

This will be much less confusing, and I think it resolves the concern about canonical vs. non-canonical JSON.

@bowlofeggs
Copy link
Contributor

@aaronlehmann That is excellent news! Any hints on what sort of time frame we can expect this change to be included in Docker? I and some others are working on integrating Pulp with Docker's v2 API, and this digest calculation is currently an issue for us. Knowing a scale of time might help inform our decision about what we should do in the meantime.

@aaronlehmann
Copy link
Contributor Author

@rbarlow: Support for this new manifest format is planned for Docker 1.10, which will ship in a few months. I'm not sure what your compatibility plans are, but one thing I'd keep in mind is that you would need to handle the old manifest format if you want to interoperate with engines older than Docker 1.10, or registries that haven't been updated to support the new format.

I'm going to add some details on backward compatibility when I update this proposal.

@aaronlehmann
Copy link
Contributor Author

I've updated the proposal with the following changes:

  • s/mediaType/mediatype/
  • schemaVersion 2 used throughout, instead of a combination of 2 and 3
  • The label field has been replaced by a more specific way of specifying a platform. The platform key maps to an object containing fields for the architecture, OS, CPU variant, and required CPU features.
  • Added a backward compatibility section, explaining how the registry will support versions of the engine which don't support the new manifest format.

PTAL

@aaronlehmann aaronlehmann force-pushed the manifest-proposal branch 5 times, most recently from f2bc72f to c68abe3 Compare October 16, 2015 16:57
@aaronlehmann
Copy link
Contributor Author

I was only thinking that the uploads might be ordered in a way where the dependencies can't be checked immediately. This probably won't be a problem in practice. BTW, the code that's in progress for the Docker registry validates dependencies referenced by the manifest, with an exception for the proxy mode.

@harche
Copy link
Contributor

harche commented Dec 17, 2015

@aaronlehmann @stevvooe @estesp Should I create a separate issue to discuss how the CLI going to look like for fat manifests?

@stevvooe
Copy link
Collaborator

@harche Yes. This would probably make the most sense. We can have the proposal stage in the distribution project for now and dispatch to docker/docker depending on where the actual feature will land.

@aaronlehmann
Copy link
Contributor Author

I've updated this proposal to make a few minor changes:

  • Include the content type in-band in both image manifests and manifest lists. This makes it easy to distinguish between image manifests and manifests lists in case there's no content type available out-of-band.
  • Changed mediatype to mediaType, because this is the way the distribution codebase already serializes Descriptor objects, and I don't want to deviate from that. If people prefer to change this serialization (and it's safe to do so), I can change it back.

I feel like this is ready to be merged. PTAL.

I have working code that implements the formats in this proposal, which I'll open as a PR against docker/distribution once #1268 is merged.

@estesp
Copy link
Contributor

estesp commented Dec 18, 2015

Thanks @aaronlehmann ; looks good and glad to see the other PRs showing up with implementation. I'm "non-maintainer LGTM" on this one! :)

@RichardScothern
Copy link
Contributor

LGTM

On Thu, Dec 17, 2015 at 8:17 PM, Phil Estes notifications@github.com
wrote:

Thanks @aaronlehmann https://github.com/aaronlehmann ; looks good and
glad to see the other PRs showing up with implementation. I'm
"non-maintainer LGTM" on this one! :)


Reply to this email directly or view it on GitHub
#1068 (comment).

@stevvooe
Copy link
Collaborator

Include the content type in-band in both image manifests and manifest lists. This makes it easy to distinguish between image manifests and manifests lists in case there's no content type available out-of-band.

Content-type should always be in-band, from the trusted source. Content-type returned by registries (out-of-band) should not be trusted. We might want to say this explicitly, as this is important for security.

Changed mediatype to mediaType, because this is the way the distribution codebase already serializes Descriptor objects, and I don't want to deviate from that. If people prefer to change this serialization (and it's safe to do so), I can change it back.

Please don't. Leave as "mediatype". The fact that descriptor has camelCase is unintentional. In general, keys should be case-insensitive.

@aaronlehmann
Copy link
Contributor Author

Please don't. Leave as "mediatype". The fact that descriptor has camelCase is unintentional. In general, keys should be case-insensitive.

Would you like me to change it to mediatype in descriptor as well? My implementation of the schema2 manifest uses distribution.Descriptor directly, and I would like to avoid forking it.

@stevvooe
Copy link
Collaborator

Would you like me to change it to mediatype in descriptor as well? My implementation of the schema2 manifest uses distribution.Descriptor directly, and I would like to avoid forking it

@aaronlehmann I don't think we use that anywhere, so go ahead.

@aaronlehmann
Copy link
Contributor Author

Reverted the mediatype change; will change the Descriptor serialization to match in #1281.

@aaronlehmann
Copy link
Contributor Author

@stevvooe: Changing the Descriptor serialization would change affect the notification serialization format. Are we confident enough that implementation will be case-insensitive to change the casing there?

@stevvooe
Copy link
Collaborator

@aaronlehmann I guess we can leave it alone and change to mediaType in this specification. We have no guarantee that consumers wrote appropriately agile code to handle case-insensitivity. Sorry for the confusion.

@aaronlehmann
Copy link
Contributor Author

Changed back. No worries.

This is a follow-on to PR distribution#62, and it borrows much of the format
from distribution#993, but uses specific formats for the image manifest and manifest
list (fat manifest) instead of a combined generic format.

The intent of this proposed manifest format is to allow multi-arch, and
allow for full content-addressability of images in the Docker engine.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@dmp42
Copy link
Contributor

dmp42 commented Dec 19, 2015

LGTM

1 similar comment
@stevvooe
Copy link
Collaborator

LGTM

stevvooe added a commit that referenced this pull request Dec 19, 2015
docs/spec: Proposal for new manifest format
@stevvooe stevvooe merged commit ed3346e into distribution:master Dec 19, 2015
thaJeztah pushed a commit to thaJeztah/distribution that referenced this pull request Apr 22, 2021
thaJeztah pushed a commit to thaJeztah/distribution that referenced this pull request Apr 22, 2021
thaJeztah pushed a commit to thaJeztah/distribution that referenced this pull request Jan 19, 2022
thaJeztah pushed a commit to thaJeztah/distribution that referenced this pull request Jan 19, 2022
dylanrhysscott pushed a commit to digitalocean/docker-distribution that referenced this pull request Jan 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet