This repository has been archived by the owner on Sep 12, 2018. It is now read-only.

Initial implementation of storage layer path mapper #729

Merged: 1 commit into docker-archive:next-generation on Nov 14, 2014

Conversation

stevvooe (Contributor)

We've added a path mapper to support simple mapping between path objects used in the storage layer and the underlying file system. The goal is to ensure that paths are calculated in only one place and that their format is kept separate from the data that makes up the path components.

This commit only includes the spec implementation needed to support layer reads. Further specs will come along with their implementations.
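
As a rough illustration of the idea (a sketch only, not the code in this commit; the type names, fields, and exact path layout here are assumptions):

    package main

    import (
        "fmt"
        "path"
    )

    // pathSpec describes a logical object in the storage layer whose
    // backend path can be computed up front, without any listing.
    type pathSpec interface {
        pathSpec()
    }

    // blobPathSpec refers to a layer blob by its content address.
    // The field names are illustrative only.
    type blobPathSpec struct {
        alg    string // e.g. "sha256"
        digest string // hex digest
    }

    func (blobPathSpec) pathSpec() {}

    // pathMapper turns pathSpecs into backend paths so the path format
    // is calculated in exactly one place.
    type pathMapper struct {
        root    string // e.g. "/docker/registry"
        version string // e.g. "v2"
    }

    func (pm *pathMapper) path(spec pathSpec) (string, error) {
        switch v := spec.(type) {
        case blobPathSpec:
            // e.g. /docker/registry/v2/blob/sha256/<hex digest>
            return path.Join(pm.root, pm.version, "blob", v.alg, v.digest), nil
        default:
            return "", fmt.Errorf("unknown path spec: %#v", v)
        }
    }

    func main() {
        pm := &pathMapper{root: "/docker/registry", version: "v2"}
        fmt.Println(pm.path(blobPathSpec{alg: "sha256", digest: "01a6..."}))
    }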

stevvooe added this to the GO-alpha milestone Nov 13, 2014
dmp42 (Contributor) commented Nov 13, 2014

LGTM

dmp42 (Contributor) commented Nov 13, 2014

cc @BrianBland

@BrianBland commented:

LGTM

wking (Contributor) commented Nov 14, 2014

On Thu, Nov 13, 2014 at 03:42:06PM -0800, Stephen Day wrote:

We've added a path mapper to support simple mapping between path
objects used in the storage layer and the underlying file system.

I'm not sure why we need an “underlying file system” model on top of
our abstract storage [1]. Why assume that the storage engine should
use a filesystem-like layout? Looking at the proposed registry-client
API (moby/moby#9015), we'll need to be able to:

  • Refer to manifests by repository name and tag.
  • List tags for a given repository.
  • Refer to layers by their tarsum.

For other tasks, we'll want endpoints to:

  • Iterate over all layers (for garbage collection [2,3]).
  • Iterate over all repositories (for initializing the search index
    [4]).
  • List repositories that consume a given layer (for auth [5] or
    listing aliases [6]).

Those don't need to be registry endpoints though, if the utility can
talk directly to the storage driver. However, with in-container
drivers [7], we will need registry endpoints.

The content-addressable mapping you're proposing is good (just flat
sha256 sums, which the storage backend may choose to shard/fan-out if
it likes [8]).

It looks like layerindex is addressing the “List repositories that
consume a given layer” requirement, and handling that in a parallel
namespace (layerindex/ vs blob/) is fine with me. I don't understand
the <tarsum version>/<tarsum hash alg> bits though; can't we just use
the tarsum as an opaque string?

layerindex:tarsum+sha256:589…e03

(where I've used a colon instead of slashes between “layerindex” and
the tarsum because you don't need to be able to iterate over
layerindex entries). The value for that key would be a list of
reference tags in JSON. Or, if you want per-tag entries:

layerindex:<tarsum>/<name>:<tag>

For example:

layerindex:tarsum+sha256:589…e03/library/debian:latest

You don't even need a value for that key.
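
To make the proposed key shape concrete, a minimal sketch (the helper name and the write-an-empty-value convention are illustrative assumptions, not an agreed API):

    package main

    import "fmt"

    // layerIndexKey builds the flat per-tag key proposed above, e.g.
    // "layerindex:tarsum+sha256:589…e03/library/debian:latest".
    func layerIndexKey(tarsum, name, tag string) string {
        return fmt.Sprintf("layerindex:%s/%s:%s", tarsum, name, tag)
    }

    func main() {
        // Recording that library/debian:latest consumes a layer is then a
        // single write of an empty value under this key; nothing has to be
        // listed to add an entry, only to enumerate a layer's consumers.
        fmt.Println(layerIndexKey("tarsum+sha256:589...e03", "library/debian", "latest"))
    }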

I don't understand the need for repositories/<name>/layers/…. Can't
the registry look that up in the manifest? Dropping it would let you
use:

repositories/<name>/…

for storing the manifests.

There's also which sounded to me like out-of-manifest certificate
revocations [9,10], and those would have to live somewhere too.

stevvooe (Contributor, Author)

@wking

After I clarify some of the issues you brought up, I think you'll find this layout points the next generation registry in the direction you're looking for. Right now, we are focused on standing up a working registry that implements the V2 API so we can validate the concepts.

I'm not sure why we need an “underlying file system” model on top of
our abstract storage [1]. Why assume that the storage engine should
use a filesystem-like layout?

This mapping scheme has the very nice property that it doesn't actually leverage file system operations for layer lookup (i.e. directory listing). While the scheme is hierarchical like a filesystem, all the paths for the main operations are pre-calculated lookups followed by a read, which maps very well to keyed object stores.

Currently, the next generation storagedriver effectively implements a VFS. The work I am doing now will definitely expose a higher-level interface and possibly a mid-level interface, both of which could be candidates for future driver interfaces, but not yet.

Those don't need to be registry endpoints though, if the utility can
talk directly to the storage driver.

By exposing the registry layout to out of band clients, we would be effectively exporting a new API, which needs to be versioned and controlled. We are still working on the V2 HTTP registry API and we want to focus on that. Though the use cases (listing repos, layers, etc.) you've listed will likely be added to the registry API to support management tasks, the general concept of safe, application-level registry operations would likely be implemented as part of an extension mechanism.

It looks like layerindex is addressing the “List repositories that
consume a given layer” requirement, and handling that in a parallel
namespace (layerindex/ vs blob/) is fine with me.
... I don't understand the need for repositories/<name>/layers/…. Can't
the registry look that up in the manifest?

While it can serve that purpose, this is actually part of the access control methodology and partially a by-product of the v2 image manifest format. The layerindex allows one to look up the "owning" repository of the target layer, identified by tarsum, without knowing the name of the repository. If the user doesn't have access to one of the listed repositories, they can't get to the blob id, stored under repositories/<name>/layers/. The purpose of repositories/<name>/layers/ is to track the layers that have been uploaded under the particular namespace. Parsing the manifests wouldn't meet this use case and would be rather expensive.
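
A sketch of that lookup order, using a toy in-memory stand-in for the layout (the type and helper names are invented for illustration; the real code reads these entries through the storage driver):

    package main

    import "fmt"

    // index is a toy stand-in for the layout above:
    // layerindex/<tarsum> -> repository names, and
    // repositories/<name>/layers/<tarsum> -> blob id.
    type index struct {
        layerIndex map[string][]string          // tarsum -> repositories
        layerLinks map[string]map[string]string // repository -> tarsum -> blob id
        canRead    map[string]map[string]bool   // user -> repository -> allowed
    }

    // resolveLayer finds candidate repositories via the layer index, keeps
    // only those the user may read, and only then exposes the blob id.
    func (ix *index) resolveLayer(user, tarsum string) (string, error) {
        for _, repo := range ix.layerIndex[tarsum] {
            if !ix.canRead[user][repo] {
                continue // no access to this namespace; keep looking
            }
            if blob, ok := ix.layerLinks[repo][tarsum]; ok {
                return blob, nil
            }
        }
        return "", fmt.Errorf("layer %s: access denied or unknown", tarsum)
    }

    func main() {
        ix := &index{
            layerIndex: map[string][]string{"tarsum+sha256:589": {"library/debian"}},
            layerLinks: map[string]map[string]string{
                "library/debian": {"tarsum+sha256:589": "sha256:01a6..."},
            },
            canRead: map[string]map[string]bool{"alice": {"library/debian": true}},
        }
        fmt.Println(ix.resolveLayer("alice", "tarsum+sha256:589"))   // blob id
        fmt.Println(ix.resolveLayer("mallory", "tarsum+sha256:589")) // denied
    }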

This scheme supports the use case of two parties uploading identical layers without requiring access to each other's repositories. This avoids the pathological case of "claimed tarsums". With this approach, two parties can use the same tarsum for a layer, but both parties have to prove they have the content, even if they have the same checksum. If this is not considered, the tarsum effectively becomes a content access key, and anyone who knows your tarsum can get access to your content. Conversely, if one uploads a simple layer, they could claim that tarsum and prevent others from using it, even if it was generated independently.

This will become more obvious as we put the access controls in place.

I don't understand
the <tarsum version>/<tarsum hash alg> bits though, can't we just use
the tarsum as an opaque string?

This allows multiple versions of indexes in the directory, referencing the same content, without blowing up directory listing. It also keeps the + out of filenames, which might be problematic for certain backends. It also allows other, non-tarsum indexes to exist over layer files in the future. We may want to change the scheme to be index/layer/tarsum to free up indexes over other objects, as well.
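
For example, a sketch of how such an identifier might be split into hierarchical components (the exact layout here is an assumption, not the committed spec):

    package main

    import (
        "fmt"
        "path"
        "strings"
    )

    // tarsumPath splits an identifier like "tarsum.v1+sha256:589..." into
    // components so that '+' and ':' never appear inside a path segment
    // and multiple tarsum versions can coexist under layerindex/.
    func tarsumPath(tarsum string) (string, error) {
        versionAlg := strings.SplitN(tarsum, "+", 2)
        if len(versionAlg) != 2 {
            return "", fmt.Errorf("invalid tarsum %q", tarsum)
        }
        algHex := strings.SplitN(versionAlg[1], ":", 2)
        if len(algHex) != 2 {
            return "", fmt.Errorf("invalid tarsum %q", tarsum)
        }
        // e.g. layerindex/tarsum.v1/sha256/589...
        return path.Join("layerindex", versionAlg[0], algHex[0], algHex[1]), nil
    }

    func main() {
        fmt.Println(tarsumPath("tarsum.v1+sha256:589..."))
    }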

For example:

 layerindex:tarsum+sha256:589…e03/library/debian:latest

You don't even need a value for that key.

Even though the number of repositories with the same tarsum but no mutual access should be small, in accordance with your earlier comments, directory listing should be avoided. Concurrent updates will be problematic, but that is a fine tradeoff.

wking (Contributor) commented Nov 14, 2014

On Thu, Nov 13, 2014 at 06:36:19PM -0800, Stephen Day wrote:

Right now, we are focused on standing up a working registry that
implements the V2 API so we can validate the concepts.

That's fair enough; obviously we don't need to have everything
polished out of the gate. If I'm pushing back here, it's mostly
because the old docker_registry.core.driver.Base was a mess that was
too complicated to touch ;). I don't want to end up in that hole
again.

I'm not sure why we need an “underlying file system” model on top
of our abstract storage [1]. Why assume that the storage engine
should use a filesystem-like layout?

This mapping scheme has the very nice property that it doesn't
actually leverage file system operations for layer lookup (ie
directory listing). While the scheme is hierarchical like a
filesystem, all the paths for the main operations are pre-calculated
lookups followed by a read, which maps very well to keyed object
stores.

Right, key/value is easy with known keys. The hard part is getting a
layout so that listing scales well, which means minimizing the number
of bits where we need listing to work. The current storage API
doesn't have a way to declare “I'm going to need a list for keys with
prefixes like this”, but if we can say that we'll only need to iterate
over key lists at slash breaks [1], it gets easier to optimize your
listing.

Currently, the next generation storagedriver effectively implements
a VFS. The work I am doing now will definitely expose a higher-level
interface…

I don't want multiple levels of interfaces. If the registry authors
are thinking about the storage as a file system, they'll constrain
possible storage implementations to things that are like filesystems.
I want the storage API to be (and stay) as abstract as possible so
storage driver authors have maximum flexibility to optimize their
implementation for their particular backend.

Those don't need to be registry endpoints though, if the utility
can talk directly to the storage driver.

By exposing the registry layout to out of band clients, we would be
effectively exporting a new API, which needs to be versioned and
controlled. We are still working on the V2 HTTP registry API and we
want to focus on that. Though the use cases (listing repos, layers,
etc.) you've listed will likely be added to the registry API to
support management tasks, the general concept of safe,
application-level registry operations would likely be implemented as
part of an extension mechanism.

I'm fine with that, but they're not in the v2 HTTP client/registry API
now. I'm just saying that I don't really care if the extra
functionality is defined in the client/registry API or in a separate
extension/registry API. So long as I can bind to it from a
webhook-driven external extension (#689).

It looks like layerindex is addressing the “List repositories that
consume a given layer” requirement, and handling that in a
parallel namespace (layerindex/ vs blob/) is fine with me. ... I
don't understand the need for repositories/<name>/layers/…. Can't
the registry look that up in the manifest?

While it can serve that purpose, this is actually part of the access
control methodology and partially a by-product of the v2 image
manifest format.

The registry/storage API shouldn't have to care about the
client/registry auth.

The layerindex allows one to look up the "owning" repository of
the target layer, identified by tarsum, without knowing the name of
the repository.

Aren't you just storing consuming <name>:<tag>s? You don't need a
single owner with content-addressable layers [2].

If the user doesn't have access to one of the listed repositories,
they can't get to the blob id, stored under
repositories/<name>/layers/.

Agreed, but I'd handle this in the registry itself (see the “PUT
/v1/images/(image_id)/layer” discussion in [3]), not offload it to
your storage layout.

The purpose of repositories/<name>/layers/ is to track the layers
that have been uploaded under the particular namespace. Parsing the
manifests wouldn't meet this use case and would be rather expensive.

First, the storage driver is free to optimize this. I just don't
think the registry needs to care about that optimization. Second,
why do you need to know which are already uploaded? There's no API in
moby/moby#9015 for “Tell me which layers behind <name>:<tag> you
still need”.

This scheme supports the use case of two parties uploading identical
layers without requiring access to each other's repositories. This
avoids the pathological case of "claimed tarsums". With this
approach, two parties can use the same tarsum for a layer, but both
parties have to prove they have the content, even if they have the
same checksum. If this is not considered, the tarsum effectively
becomes a content access key, and anyone who knows your tarsum
can get access to your content. Conversely, if one uploads a simple
layer, they could claim that tarsum and prevent others from using it,
even if it was generated independently.

First, you can't just claim a tarsum without content that has that
tarsum, because the registry is also checking the sums before
finalizing the push [4,5]. Second, sha256 covers a big space. You're
going to have a hard time squatting on any meaningful fraction of it
unless you can predict tarsums for layers you want to hijack ahead of
time. Even if you do know the tarsum ahead of time (e.g. you've
broken into your target's Docker host), you still need time to hash
all of your malicious tarballs to find one that matches your target
tarsum. I'm happy just trusting sha256 to be one-way, and so is the
Docker security model, which is based on signing these tarsums
(moby/moby#8093).

I don't understand the <tarsum version>/<tarsum hash alg> bits
though, can't we just use the tarsum as an opaque string?

This allows multiple versions of indexes in the directory,
referencing the same content, without blowing up directory listing.

Why would you need to store multiple indexes to the same content with
different tarsum versions and hash algorithms?

It also keeps the + out of filenames, which might be problematic
for certain backends.

They're not filenames, they're keys ;). If the storage driver wants
to use a backing filesystem, it's up to it to choose an appropriate
escape mechanism (e.g. URL-encoding the key).
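
A sketch of what that driver-side escaping could look like with net/url (the function name is hypothetical; only the standard-library call is real):

    package main

    import (
        "fmt"
        "net/url"
        "strings"
    )

    // escapeKey URL-encodes each slash-separated segment of a key so that
    // characters like '+' or ':' never reach a backend that cannot handle
    // them, while the slash structure stays intact.
    func escapeKey(key string) string {
        segments := strings.Split(key, "/")
        for i, s := range segments {
            segments[i] = url.QueryEscape(s)
        }
        return strings.Join(segments, "/")
    }

    func main() {
        fmt.Println(escapeKey("layerindex/tarsum+sha256:589...e03"))
        // Output: layerindex/tarsum%2Bsha256%3A589...e03
    }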

It also allows other, non-tarsum indexes to exist over layer files
in the future. We may want to change the scheme to be
index/layer/tarsum to free up indexes over other objects, as well.

I'm happy to kick this can down the road. In the event that we need
additional indexes, there are always other root namespaces
(thing1index/, thing2index/, …).

For example:

 layerindex:tarsum+sha256:589…e03/library/debian:latest

You don't even need a value for that key.

Even though the number of repositories with the same tarsum but no
mutual access should be small, in accordance with your earlier
comments, directory listing should be avoided. Concurrent updates
will be problematic, but that is a fine tradeoff.

I'm fine with a single key:

layerindex:tarsum+sha256:589…e03

too [6]. I don't think listing this will be a big problem though. If
you're looking for “do you have access to layer <tarsum>”, the auth
lookup will be:

  1. What namespaces or image names do you have access to? (query
    against the auth backend, no registry access needed)

  2. Iterate through the namespaces and image names found above, and
    check if listing:

    layerindex:tarsum+sha256:589…e03/<namespace>/

    or

    layerindex:tarsum+sha256:589…e03/<namespace>/<image>:

    returns any keys, bailing (and granting access) after the first key
    hit. You can make all those requests concurrently, so it may even
    be faster than the single-key approach for layers that have lots
    of consumers. If you don't want the concurrency, you'll need
    another endpoint in the storage-driver API for:

    ListAny(delimiters []string) ([]string, error)

    that's like the current:

    List(path string) ([]string, error)

    but takes a series of possible prefix strings (see the sketch at
    the end of this comment).

    This was in reference to an earlier iteration of the pull
    request, which no longer seems accessible on GitHub. Relevant
    quote from @dmp42:

    “We are going to use move in the registry, a lot. Tarsums /
    checksums need to be computed, and the file needs to be
    complete. So, the layers will be moved to their final destination
    once they are complete. Whether the driver implements a smart
    "move", or is just smart when copying, should be hidden under the
    specific method call.”
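
Here is the sketch referred to above: a ListAny built on top of a List call shaped like the one in the current driver interface (the interface shape and the toy in-memory driver are assumptions for illustration):

    package main

    import (
        "fmt"
        "strings"
    )

    // lister is a stand-in for the relevant slice of the storage-driver
    // interface; the real API may differ.
    type lister interface {
        List(prefix string) ([]string, error)
    }

    // listAny returns keys found under any of the given prefixes, bailing
    // out after the first prefix that yields a hit, which matches the
    // "grant access after the first key" use described above.
    func listAny(d lister, prefixes []string) ([]string, error) {
        for _, p := range prefixes {
            keys, err := d.List(p)
            if err != nil {
                return nil, err
            }
            if len(keys) > 0 {
                return keys, nil
            }
        }
        return nil, nil
    }

    // memLister is a toy in-memory driver for demonstration only.
    type memLister []string

    func (m memLister) List(prefix string) ([]string, error) {
        var out []string
        for _, k := range m {
            if strings.HasPrefix(k, prefix) {
                out = append(out, k)
            }
        }
        return out, nil
    }

    func main() {
        d := memLister{"layerindex:tarsum+sha256:589/library/debian:latest"}
        keys, _ := listAny(d, []string{
            "layerindex:tarsum+sha256:589/other/",
            "layerindex:tarsum+sha256:589/library/",
        })
        fmt.Println(keys)
    }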

dmp42 (Contributor) commented Nov 14, 2014

@wking

I believe you guys are pretty much saying the same thing, although probably with different words and looking at it from different perspectives.

A couple of notes, though:

  • I need to insist again on that: drivers must be simple to implement, and safe by default - the more responsibility you delegate to drivers, the more duplication you will get into them, the riskier and the more complex they are going to be
  • we do need provision for multiple versions of tarsums - tarsum will have to be modified in the future - like it had to be modified in the past (include xattrs, remove mtime), if only for security reasons, and we will live in a world where content will link to layers through multiple different versions of tarsum - there is no discussion on that
  • it's trivial to produce two layers with the same tarsum that have different shas - this is not only expected, but the reason tarsum was designed in the first place
  • keys vs. filesystems paths is not a debate, and doesn't change things as far as allowed characters in path are concerned: unsafe chars should be escaped to a common, safe subset, and this must be done by the registry, not by individual drivers - and I can assure you there are problems with the "+" sign for example, right now, on S3 + Cloudfront
  • internal interfaces are just that: internal APIs to keep things clear and reasonably modular. I don't think there should be a discussion on "I don't want multiple levels of interface" :-)
  • discussions on the extension mechanism should be kept in the corresponding issue

Hope that helps.

dmp42 merged commit 8a64db6 into docker-archive:next-generation on Nov 14, 2014
wking (Contributor) commented Nov 14, 2014

On Fri, Nov 14, 2014 at 11:06:13AM -0800, Olivier Gambier wrote:

  • we do need provision for multiple versions of tarsums - tarsum
    will have to be modified in the future - like it had to be
    modified in the past (include xattrs, remove mtime), if only for
    security reasons, and we will live in a world where content will
    link to layers through multiple different versions of tarsum -
    there is no discussion on that

And both of our schemes mention the tarsum version and hash. I'm just
saying that we don't need to be able to iterate over keys starting
with:

layerindex/tarsum/v1/

we just need to be able to iterate over keys starting with:

layerindex:tarsum+sha256:589…e03

(if we take the per-tag entry approach [1]). If we take the
multiple-tags-in-a-single-entry approach, we don't need to be able to
iterate over layerindex at all (unless we want to garbage-collect it).

  • it's trivial to produce two layers with the same tarsum that have
    different shas
    - this is not only expected, but the reason tarsum
    was designed in the first place

I thought the issue was concern about different layers with the same
tarsum (I don't think that will happen). I don't think the transfer
SHAs come into this at all.

  • keys vs. filesystems paths is not a debate, and doesn't change
    things as far as allowed characters in path are concerned: unsafe
    chars should be escaped to a common, safe subset, and this must be
    done by the registry, not by individual drivers - and I can assure
    you there are problems with the "+" sign for example, right now,
    on S3 + Cloudfront

Escaping that in the registry is fine, but personally, I think that's
just working around buggy storage-drivers.

  • internal interfaces are just that: internal APIs to keep things
    clear and reasonably modular. I don't think there should be a
    discussion on "I don't want multiple levels of interface" :-)

Right, and we have an existing abstract storage API. I think the
mental model of a filesystem will get us into trouble, because it
makes it easier to make implicit (and possibly false) assumptions
about the API. For example, it's not currently clear to me if I can
list on non-slash-terminated prefixes:

List("blob/sha256/01")

and get

01a60e35df88d8b49546cb3f8f4ba4f406870f9b8e1f394c9d48ab73548d748d
01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b

which would be useful for parallel garbage-collection handlers (one
could garbage-collect the 00 prefix, another 01, …). So there I want
more list flexibility than a filesystem offers (without globbing).
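
A sketch of that sharded garbage-collection idea, assuming List accepts non-slash-terminated prefixes (the list and gcBlob functions here stand in for the driver call and the real collection logic):

    package main

    import (
        "fmt"
        "strings"
        "sync"
    )

    // collect runs one worker per two-hex-character prefix under
    // blob/sha256/, so the blob namespace can be garbage-collected in
    // parallel shards.
    func collect(list func(prefix string) []string, gcBlob func(key string)) {
        var wg sync.WaitGroup
        for i := 0; i < 256; i++ {
            prefix := fmt.Sprintf("blob/sha256/%02x", i)
            wg.Add(1)
            go func(prefix string) {
                defer wg.Done()
                for _, key := range list(prefix) {
                    gcBlob(key) // decide whether the blob is unreferenced, etc.
                }
            }(prefix)
        }
        wg.Wait()
    }

    func main() {
        keys := []string{
            "blob/sha256/01a60e35df88d8b49546cb3f8f4ba4f406870f9b8e1f394c9d48ab73548d748d",
            "blob/sha256/01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b",
        }
        list := func(prefix string) []string {
            var out []string
            for _, k := range keys {
                if strings.HasPrefix(k, prefix) {
                    out = append(out, k)
                }
            }
            return out
        }
        collect(list, func(key string) { fmt.Println("would inspect", key) })
    }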

wking (Contributor) commented Nov 14, 2014

On Fri, Nov 14, 2014 at 11:24:58AM -0800, W. Trevor King wrote:

Fri, Nov 14, 2014 at 11:06:13AM -0800, Olivier Gambier:

  • keys vs. filesystems paths is not a debate, and doesn't change
    things as far as allowed characters in path are concerned:
    unsafe chars should be escaped to a common, safe subset, and
    this must be done by the registry, not by individual drivers -
    and I can assure you there are problems with the "+" sign for
    example, right now, on S3 + Cloudfront

Escaping that in the registry is fine, but personally, I think
that's just working around buggy storage-drivers.

And the more workarounds like this you build into the registry, the
leakier the storage API gets. Now new registry code has to remember
that it's supposed to work around this storage-driver bug. “What key
characters does my backend support” is a question that only has a
clear answer at the storage-driver level, and you can write generic
tests (#731) so that the registry code can be confident that the
storage driver is doing a good job.

dmp42 (Contributor) commented Nov 14, 2014

Now new registry code has to remember that it's supposed to work around this storage-driver bug.

No it does not need to.
Registry code uses an internal API that deals with that once and for all, in one place.

dmp42 (Contributor) commented Nov 14, 2014

“What key characters does my backend support”

Simple enough to devise. Likely something like: a-zA-Z0-9_.:-
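
A sketch of the registry-side escape this implies, done once before any key reaches a driver (the exact safe set and the %XX scheme are assumptions):

    package main

    import "fmt"

    // escapeUnsafe rewrites any byte outside a conservative alphabet
    // (a-zA-Z0-9_.:- plus '/' as the separator) as %XX, so drivers never
    // see characters such as '+'.
    func escapeUnsafe(key string) string {
        out := make([]byte, 0, len(key))
        for i := 0; i < len(key); i++ {
            c := key[i]
            switch {
            case 'a' <= c && c <= 'z', 'A' <= c && c <= 'Z', '0' <= c && c <= '9',
                c == '_', c == '.', c == ':', c == '-', c == '/':
                out = append(out, c)
            default:
                out = append(out, []byte(fmt.Sprintf("%%%02X", c))...)
            }
        }
        return string(out)
    }

    func main() {
        fmt.Println(escapeUnsafe("layerindex/tarsum+sha256:589"))
        // Output: layerindex/tarsum%2Bsha256:589
    }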

wking (Contributor) commented Nov 14, 2014

On Fri, Nov 14, 2014 at 02:15:57PM -0800, Olivier Gambier wrote:

Now new registry code has to remember that it's supposed to work
around this storage-driver bug.

No it does not. Registry code uses an internal API that deals with
that once and for all, in one place.

So:

bulk registry code
←(internal API)→
storage-backend-workarounds
←(storage-driver API)→
storage driver

that gives you back clean abstractions in the bulk registry code, but
it also gives you more APIs to test and document. The alternative
(pushing the storage-backend workarounds into the storage drivers)
minimizes the number of API layers, but means you'd need libraries
(with their own APIs and docs) if you wanted to share workarounds
between drivers. For URL-encoding keys, suitable libraries and docs
already exist [1,2].

wking (Contributor) commented Nov 14, 2014

On Fri, Nov 14, 2014 at 02:20:03PM -0800, Olivier Gambier wrote:

“What key characters does my backend support”

Simple enough to devise. Likely something like: a-zA-Z0-9_.:-

I agree that that's likely to work, but it's another bit you have to
write up in the storage-driver API “We'll only use keys that match the
regexp '[a-zA-Z0-9_.:-/]*', so you'll only need to escape keys if your
storage driver is sensitive to those characters.” There's less shared
information if you say “Keys can be any byte sequence. If your
storage backend doesn't deal with some bytes as keys, make sure you
translate the keys to something it does support.”

dmp42 (Contributor) commented Nov 14, 2014

I agree that that's likely to work, but it's another bit you have to write up in the storage-driver API “We'll only use keys that match the regexp '[a-zA-Z0-9_.:-/]*', so you'll only need to escape keys if your storage driver is sensitive to those characters.”

Fortunately it's simple enough to test exhaustively (unlike testing exhaustively "some bytes").
People who pass the test suite have nothing to do; people who fail can look into the (pretty self-explanatory) test.

There's less shared information if you say “Keys can be any byte sequence. If your storage backend doesn't deal with some bytes as keys, make sure you translate the keys to something it does support.”

Then people don't do the work properly because (insert random reason), and we end up with a variety of bug reports for third-party code that we have to carry around.

One step further: what about path traversal? Or security at large? Do we trust the driver authors as well to do that properly?

[shared libraries to do the work in drivers]

So we maintain them, recommend that people use them, have to upgrade them and go through all drivers so they keep up with the latest version?

Maybe I'm just opinionated :-) but I strongly believe that we do need to own some critical stuff inside the registry, and that a great API is a good API that doesn't let you shoot yourself in the foot.

Driver authors should focus on stuff that has value: provide storage primitives - not implement petty sanity checking.

Either way :)

wking (Contributor) commented Nov 14, 2014

On Fri, Nov 14, 2014 at 02:53:15PM -0800, Olivier Gambier wrote:

I agree that that's likely to work, but it's another bit you have
to write up in the storage-driver API “We'll only use keys that
match the regexp '[a-zA-Z0-9_.:-/]*', so you'll only need to
escape keys if your storage driver is sensitive to those
characters.”

Fortunately it's simple enough to test exhaustively (unlike testing
exhaustively "some bytes").

How is it easier? You have an infinite number of possible keys in
both cases. Realistically, testing anything that you suspected might
be a problem (e.g. +, /, ‘ ’, … in a key) is equally easy in both
cases.

There's less shared information if you say “Keys can be any byte
sequence. If your storage backend doesn't deal with some bytes as
keys, make sure you translate the keys to something it does
support.”

Then people don't do the work properly because (insert random
reason), and we end up with a variety of bug reports for
third-party code that we have to carry around.

You don't have to carry third-party code. Suggested workflow:

  1. You get a bug report that identifies a problem in a storage driver.
  2. You write a storage-API test to expose the bug.
  3. You warn the driver authors and file a bug with them. Now there
    are some choices:
    a. They fix their driver.
       4. You bundle their new version in future official releases.
    b. They don't fix their driver.
       4. You drop their driver from future official releases, and
          point people at the upstream bug report.

With active upstream maintainers you get a.4, and everyone's happy.
With lazy upstream maintainers you get b.4, and you don't have to
worry about it.

One step further: what about path traversal? or security at large?
do we trust the driver authors as well to do that properly?

Absolutely not. We write a test suite to validate all of the APIs,
and make it as detailed as we like to ensure we never see anything
that worries us or has been reported before.

[shared libraries to do the work in drivers]

So we maintain them, recommend that people use them, have to upgrade
them and go through all drivers so they keep up with the latest
version?

No, whoever maintains net/url has you covered on the maintenance side,
and your test suite has you covered on the updates side. It's less
work than maintaining workaround code in the registry itself.

Maybe I'm just opinionated :-) but I strongly believe that we do
need to own some critical stuff inside the registry and that a great
API is a good API that doesn't let you shoot yourself in the foot.

Fair enough, I'm opinionated too ;). I'm fine making the
storage-driver API the great API, but to me “great API” means “the
simplest mental model required for efficiency, with good docs and a
comprehensive test suite,” not “doesn't let you shoot yourself in the
foot” (although I think that my definition makes it hard to shoot
yourself in the foot too).

Driver authors should focus on stuff that has value: provide
storage primitives - not implement petty sanity checking.

Encoding your keys so your backend can handle them doesn't sound like
“petty sanity checking” to me. It's not like the current storage API
is a complicated mess of hard-to-implement features ;). I think
storage driver authors are capable of implementing it as it stands,
without the registry holding their hand. And I think that a good test
suite monitoring for compliance will catch any bugs that they write
while implementing their driver.

But all of my pushback here is just me trying to save you future work
;). Until I learn Go, I won't have to work with or maintain the
internal API. If my arguments for removing it aren't convincing,
that's fine with me.
