Distribution does not conform to canonical format #1066

Closed
1 of 3 tasks
TomasTomecek opened this issue Oct 5, 2015 · 20 comments

@TomasTomecek

Right now when I fetch manifest via curl:

  • it contains whitespace
  • it contains, at the same time, unicode-specified characters and utf8-encoded characters
  • dict keys are not sorted
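For illustration, here is a minimal Python sketch of the three properties listed above: no insignificant whitespace, one consistent character representation, and sorted keys. The manifest dict and the `canonical_json` helper are hypothetical, and this does not claim to cover every rule in the linked spec.

```python
import json


def canonical_json(obj):
    """Serialize with sorted keys, no whitespace between tokens, UTF-8 bytes."""
    text = json.dumps(
        obj,
        sort_keys=True,          # lexically sorted object keys
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # emit characters directly, not as \uXXXX
    )
    return text.encode("utf-8")


# Hypothetical manifest-like data, not a real registry response.
manifest = {"tag": "latest", "name": "example/app", "schemaVersion": 1}

indented = json.dumps(manifest, indent=3).encode("utf-8")
canonical = canonical_json(manifest)

print(indented.decode("utf-8"))
print(canonical.decode("utf-8"))
```

Both byte strings decode to the same data, but only the second has a single predictable form.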

https://docs.docker.com/v1.6/registry/spec/json/

https://github.com/docker/distribution/blob/master/docs/spec/json.md

Edited: updated spec link, struck out obsolete text

@RichardScothern self-assigned this Oct 5, 2015

@RichardScothern
Contributor

This is unfortunately correct. I do not know the history of this, but manifest signatures are tied to a non-canonical JSON format.

The manifest specification is being replaced (see here); perhaps canonical JSON can be a goal for the new one.

@RichardScothern
Contributor

In the meantime, we should probably remove the erroneous documentation. @dcmgowan , @stevvooe thoughts?

@TomasTomecek
Author

Since #1065 is closed now, can we resolve this one? The current implementation doesn't conform to the spec you linked there.

@RichardScothern
Contributor

The manifest specification does not conform to canonical JSON, so when we either 1) change the documentation or 2) have a manifest spec that does produce canonical JSON I would be happy to close this issue.

@aaronlehmann
Contributor

I'm not sure I understand what the issue is here. I don't see anything in https://github.com/docker/distribution/blob/master/docs/spec/json.md prohibiting Unicode characters. It says: Resulting "JSON text" shall always be encoded in UTF-8. What problem would canonical JSON solve?

@RichardScothern
Contributor

The issue is that the document specifies a canonical JSON format which distribution does not conform to (points 3 and 4)

@dmcgowan
Collaborator

The current manifest format schemaVersion is not canonical JSON and cannot be canonical JSON because of the way signatures are added. The linked document does not guarantee that distribution APIs will always deliver canonical JSON, and that is not something we should be striving for. We want to be able to store and get these JSON blobs by content hash, which means we should not be attempting to deliver canonical JSON. However, clients or servers should be able to use the documented format in order to provide a more consistent way to format and hash content; that is out of scope for the manifest format, in my opinion.

@RichardScothern
Contributor

Discussed offline with Derek. Closing.

@TomasTomecek
Author

Quoting the document:

To provide consistent content hashing of JSON objects throughout Docker Distribution APIs

Shall I understand this in a way that you have created a specification which is in fact not being used?

I am going through the manifest spec and it doesn't say anything about encoding, whitespace, or key order. On the other hand, it mentions something interesting:

It is a provisional manifest to provide a compatibility with the V1 Image format

When can we expect a proper, not provisional, manifest?

To give you some background, I am working on code (in Python) which should output the digest of a provided manifest. In order to do so, I need to understand how distribution calculates the manifest digest. I was quite surprised that digest is being computed from an indented utf8-encoded manifest (when RFC suggests using unicode).

@aaronlehmann
Contributor

Quoting the document:

To provide consistent content hashing of JSON objects throughout Docker Distribution APIs

Shall I understand this in a way that you have created a specification which is in fact not being used?

This specification is for responses generated by the distribution APIs. Manifests are uploaded by clients, and the registry does not specify or impose a canonical format for them.

I was quite surprised that digest is being computed from an indented utf8-encoded manifest (when RFC suggests using unicode).

Unicode is a character set. It does not imply any particular character encoding. UTF-8 is the most commonly used character encoding for Unicode. You are probably thinking of UCS-2.

Note the JSON RFC says:

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
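To make the distinction concrete, here is a small Python sketch (not taken from the distribution codebase) showing that the same string can be written as `\uXXXX` escapes or as raw UTF-8 bytes. Both are valid JSON and decode to the same value, but the byte sequences, and therefore any hash over them, differ.

```python
import hashlib
import json

name = "ľščťť"

# Same JSON value, two valid serializations.
escaped = json.dumps(name, ensure_ascii=True).encode("utf-8")   # \uXXXX escapes
literal = json.dumps(name, ensure_ascii=False).encode("utf-8")  # raw UTF-8 bytes

print(escaped)
print(literal)

# A parser sees no difference between the two forms.
same_value = json.loads(escaped) == json.loads(literal)

# A hash function does, because it only sees bytes.
same_digest = (
    hashlib.sha256(escaped).hexdigest() == hashlib.sha256(literal).hexdigest()
)
print(same_value, same_digest)
```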

@bowlofeggs
Contributor

@aaronlehmann does this mean that there is not a canonical format for manifests at all, or simply that the registry does not enforce a format?

@aaronlehmann
Contributor

@rbarlow There is no canonical format for manifests.

@stevvooe
Collaborator

Unless a canonical key order is defined for a particular schema

Please make sure to read the entire sentence and any included qualifiers. In the case of the manifest, it has a canonical ordering that differs from sorted.

In general, you should avoid "round tripping" manifests. De-serialize the contents, but only re-serialize if the data fields have changed. Effectively, the byte contents should only be generated once. If this care is not taken, it makes generating digests inconsistent and error prone.
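A minimal Python sketch of why round tripping is risky, assuming for illustration that a digest is a SHA-256 over the exact bytes (how a signed manifest's digest is derived is not covered here). The `received` payload and the `digest` helper are hypothetical.

```python
import hashlib
import json


def digest(raw: bytes) -> str:
    """Hash the exact bytes given; no parsing, no normalization."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


# Hypothetical bytes as they might arrive from a client or registry.
received = (
    b'{\n   "schemaVersion": 1,\n   "name": "example/app",\n   "tag": "latest"\n}'
)

# Round trip: parse, then serialize again with library defaults.
parsed = json.loads(received)
round_tripped = json.dumps(parsed).encode("utf-8")

print(digest(received))
print(digest(round_tripped))  # differs, although no field changed
```

The parsed data is identical in both cases; only the whitespace changed, which is enough to change the hash.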

@TomasTomecek
Author

In general, you should avoid "round tripping" manifests. De-serialize the contents, but only re-serialize if the data fields have changed.

How do you compute digest then?

Effectively, the byte contents should only be generated once. If this care is not taken, it makes generating digests inconsistent and error prone.

I read this as: "Manifests and digests are very fragile and you should NOT play with those."

Would be nice if computing digest would be made easier, or having a tool for such job (by tool I mean a binary/executable, which provides output for a given input; not a service).

@aaronlehmann

This specification is for responses generated by the distribution APIs. Manifests are uploaded by clients, and the registry does not specify or impose a canonical format for them.

Yes, but isn't the code which generates manifest and digest part of distribution codebase?

Unicode is a character set. It does not imply any particular character encoding. UTF-8 is the most commonly used character encoding for Unicode. You are probably thinking of UCS-2.

I know what I'm thinking.

Here's a snip from dockerfile:

MAINTAINER ľščťť <mail@example.com> "asd & qwe"

Here's how it looks in manifest:

MAINTAINER \xc4\xbe\xc5\xa1\xc4\x8d\xc5\xa5\xc5\xa5 \\\\u003cmail@example.com\\\\u003e \\\\\\"asd \\\\u0026 qwe\\\\\\"

ľščťť are utf8-encoded. That is not what I would expect. When rfc says this:

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point.

If you also look at json.org, grammar for strings mentions unicode characters.

Therefore what I would expect is this:

MAINTAINER \\u013e\\u0161\\u010d\\u0165\\u0165 \\\\u003cmail@example.com\\\\u003e \\\\\\"asd \\\\u0026 qwe\\\\\\"

Well, actually this:

MAINTAINER \\u013e\\u0161\\u010d\\u0165\\u0165 <mail@example.com> \\\\\\"asd & qwe\\\\\\"

(I still don't understand why you escape <>&)
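For reference, Go's `encoding/json` escapes `<`, `>` and `&` as `\u003c`, `\u003e` and `\u0026` by default, while leaving other non-ASCII characters as UTF-8. Whether that is the exact code path producing this manifest is an assumption on my part. A Python sketch that imitates this style of output (the `html_safe_dumps` helper is hypothetical):

```python
import json


def html_safe_dumps(value):
    """Serialize to JSON, keeping non-ASCII as-is but escaping <, > and &."""
    text = json.dumps(value, ensure_ascii=False)
    # In JSON text these three characters can only occur inside strings,
    # so a plain textual replacement keeps the document valid.
    for char, escape in (("<", "\\u003c"), (">", "\\u003e"), ("&", "\\u0026")):
        text = text.replace(char, escape)
    return text


line = 'MAINTAINER ľščťť <mail@example.com> "asd & qwe"'
encoded = html_safe_dumps(line)
print(encoded)

# The escapes decode back to the original characters.
decoded = json.loads(encoded)
```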

@stevvooe
Collaborator

I read this as: "Manifests and digests are very fragile and you should NOT play with those."

I'm sorry if I wasn't completely clear. You are astute in acknowledging that stable hash generation of non-deterministic formats (i.e. JSON) is typically fragile. However, this does not mean that you cannot "play" with the contents. This just means that one should preserve the bytes and only regenerate on a change. This is how the registry approaches this. It deserializes the content, saves the raw bytes, and reads the appropriate fields. When the actual data is stored, it just uses the raw bytes, so the client's hash and the registry's digest will always match. This is both secure and preserves the stability of the hash without relying on differing libraries agreeing on deterministic generation.
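The approach described above can be sketched in Python roughly as follows. This is an illustration, not the registry's actual Go implementation; the `StoredManifest` class and the SHA-256-over-raw-bytes digest are assumptions made for the example.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredManifest:
    raw: bytes    # exact bytes as uploaded; the only input to the digest
    fields: dict  # parsed view, used for reading values only

    @classmethod
    def from_upload(cls, raw: bytes) -> "StoredManifest":
        # Parse once for field access, but keep the original bytes untouched.
        return cls(raw=raw, fields=json.loads(raw))

    @property
    def digest(self) -> str:
        # Never re-serialize `fields` here; hash what was actually received.
        return "sha256:" + hashlib.sha256(self.raw).hexdigest()


# Hypothetical upload with arbitrary whitespace and key order.
upload = b'{ "tag" : "latest",\n  "name" : "example/app" }'
stored = StoredManifest.from_upload(upload)

print(stored.fields["name"])
print(stored.digest)
```

Because the stored bytes are never regenerated, the digest computed by whoever produced the bytes and the one computed here agree regardless of formatting.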

Would be nice if computing digest would be made easier, or having a tool for such job (by tool I mean a binary/executable, which provides output for a given input; not a service).

We would be more than happy to accept a contribution for a tool that accomplishes this. The packages available in the distribution project are more than capable of making this straightforward. Please let us know if you need more guidance.

@TomasTomecek
Author

We would be more than happy to accept a contribution for a tool that accomplishes this. The packages available in the distribution project are more than capable of making this straightforward. Please let us know if you need more guidance.

The question is: what would be a suitable place for such a tool? The first thing that comes to my mind is docker engine (that's why I submitted this issue back then) since:

  • that's the place where manifests are being created
  • digest is being verified
  • is the most well-known CLI tool

We already have a set of tools which accomplish something similar:

| Link | Language | Uses distribution's packages? | Description |
| --- | --- | --- | --- |
| docker-manifest | Go | Yes | Output manifest from existing v1-archive |
| docker-tar-push | Go | Yes | Push provided v1-archive to a v2 registry |
| digest-of-a-manifest | Python | No | Output digest of provided manifest |

So my question is: do you want such a tool to be part of the docker-maintained codebase, or is it supposed to be an external thing?

@stevvooe
Collaborator

@TomasTomecek I'm very confused by your request. If you need to calculate digests for a manifest, just implement it in Python. Just make sure to follow the above advice (make sure to spend some time carefully reading my responses). If you need a tool to support that, go ahead and implement it in Go. If you want to submit it back to a Docker project, we would welcome it.

@TomasTomecek
Author

My last response has only a single message:

If you want to submit it back to a Docker project, we would welcome it.

Is there a chance that you will merge it?

@stevvooe
Collaborator

@TomasTomecek Yes, if it is well done and fits a need.

My confusion comes from the need of a CLI tool. Generating manifests and calculating their digest should be trivial in Python.

@TomasTomecek
Author

@stevvooe here it is: moby/moby#17402
