
NG: Transactional storage #626

Closed

wking opened this issue Oct 10, 2014 · 13 comments

@wking
Contributor

wking commented Oct 10, 2014

We've had some difficulties ensuring atomic transactions with the current registry. This makes it difficult to do things like robustly refcount images (or anything else where you'd need to touch metadata in several locations at once). I've been poking around to get a feel for how other folks handle transactions with huge blobs.

  • Redis (and most key/value stores we might use for the smaller metadata) supports transactions, and there is support in asyncio-redis, if we end up implementing the new registry in Python with aiohttp (see the sketch after this list).
  • Transactional filesystems seem a bit harder to come by, but if you don't have a huge volume of transactions, you can likely squeak by with something like Toggenburger's libbtrfstrans (although the project doesn't seem to be maintained or have version-controlled code).
  • Microsoft's SQL Server supports FILESTREAM, which seems to give transactional access through the database layer while storing the actual blobs on the filesystem. Oracle Database has something similar with BFILE.
  • Gropengießer and Sattler have a paper on transactional S3. I only skimmed it, but it seems to describe unpublished code for their client library and service layer.
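
For the Redis option, here's a minimal sketch of what a transactional refcount update could look like, assuming redis-py (the asyncio-redis API is analogous); the key names are hypothetical, not part of any existing registry schema:

```python
# Hypothetical key layout: "image:<id>:layers" and "layer:<digest>:refcount".
import redis

r = redis.Redis()

def link_layer_to_image(image_id: str, layer_digest: str) -> None:
    """Record an image->layer reference and bump the layer's refcount
    in one MULTI/EXEC batch, so readers never see a half-applied update."""
    with r.pipeline(transaction=True) as pipe:  # wraps commands in MULTI/EXEC
        pipe.sadd(f"image:{image_id}:layers", layer_digest)
        pipe.incr(f"layer:{layer_digest}:refcount")
        pipe.execute()  # both commands run as one uninterrupted batch
```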

This is not my area of expertise though, and it's certainly possible that there's a mainstream, open-source, transactional storage solution for large binary blobs; I just can't find it ;).

@dmp42
Contributor

dmp42 commented Oct 28, 2014

I don't expect miracles unfortunately. Distributed stores are unlikely to grow wings...

On the other hand, we had problems with that aspect because of the design / workflow of registry v1.
I expect (and wish, and will pay extra attention) the new workflow to be more resilient in that respect.

That might include using "locks" on the transactional storage (redis).
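
One way to read the "locks" suggestion is Redis's optimistic locking via WATCH; a hedged sketch with redis-py, where the key name is illustrative:

```python
import redis

r = redis.Redis()

def decr_refcount_if_positive(layer_digest: str) -> bool:
    """Optimistically 'lock' a key with WATCH: the EXEC aborts if any
    other client writes the key between our read and our write."""
    key = f"layer:{layer_digest}:refcount"
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                  # start optimistic lock
                count = int(pipe.get(key) or 0)  # immediate-mode read
                if count <= 0:
                    pipe.unwatch()
                    return False
                pipe.multi()                     # back to queued mode
                pipe.decr(key)
                pipe.execute()                   # raises WatchError on conflict
                return True
            except redis.WatchError:
                continue                         # lost the race; retry
```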

@wking
Contributor Author

wking commented Oct 28, 2014

On Tue, Oct 28, 2014 at 01:40:29PM -0700, Olivier Gambier wrote:

That might include using "locks" on the transactional storage
(redis).

Redis supports this already; we just need to add at least
start/commit/abort (in Redis: MULTI/EXEC/DISCARD) to the atomic
storage API. I'm still not sure how accepted my independent
atomic/streaming strategy is [1], but that should be pretty doable.
With transactional atomic storage and content-addressable streaming
storage, I don't expect any problems beyond occasional orphan entries
in the streaming storage. An occasional garbage-collection pass could
check for and clean that sort of thing up. With reference information
in the transactional atomic storage, it would even be pretty cheap:
just remove anything from streaming storage that's over a day old and
has no references listed in its atomic storage reference file. Of
course, you still have to iterate over all the items in the streaming
storage, so some way to slowly work through a complete list of entries
would be good.
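
A rough sketch of that garbage-collection pass; `streaming` and `atomic` stand in for hypothetical driver objects, and every method name here is illustrative rather than an existing API:

```python
import time

ONE_DAY = 24 * 60 * 60

def gc_pass(streaming, atomic):
    """Walk the streaming store and delete old, unreferenced blobs."""
    for digest in streaming.list_entries():  # slow, possibly paged walk
        if time.time() - streaming.mtime(digest) < ONE_DAY:
            continue  # too young: an upload may still be in flight
        if not atomic.get(f"refs:{digest}"):
            streaming.delete(digest)  # orphan: nothing references it
```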

@dmp42
Contributor

dmp42 commented Oct 28, 2014

content-addressable streaming storage

The content-addressability part is up to the registry, not to the storage itself (drivers shouldn't need to know anything about that).

Of course, you still have to iterate over all the items in the streaming storage, so some way to slowly work through a complete list of entries would be good.

That was the same problem with v1... such an approach is not practical with any crowded storage.

About the rest, I would rather keep the "transactional storage" as a "simple cache" and not as a full-blown requirement.

@wking
Contributor Author

wking commented Oct 28, 2014

On Tue, Oct 28, 2014 at 03:00:58PM -0700, Olivier Gambier wrote:

content-addressable streaming storage

The content-addressability part is up to the registry, not to the
storage itself (drivers shouldn't need to know anything about that).

If you're going to have content-addressability I'd definitely put it
in at the storage level. Knowing that content is write-once and named
based on its content allows you to make a bunch of simplifying choices
in your implementation (e.g. no need to worry about racy writes).
Requiring generic storage just so you can layer content-addressability
on top seems like a waste.
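
To make the simplification concrete, here's a minimal sketch of a content-addressed write for a local-filesystem driver, assuming SHA-256 naming; the function and layout are invented for illustration:

```python
import hashlib
import os
import tempfile

def put_blob(root: str, stream) -> str:
    """Stream a blob to a temp file, hash it as we go, then rename it
    to its digest. Racing writers produce identical bytes under the
    same name, so the last rename winning is harmless."""
    h = hashlib.sha256()
    fd, tmp = tempfile.mkstemp(dir=root)
    with os.fdopen(fd, "wb") as f:
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            h.update(chunk)
            f.write(chunk)
    digest = h.hexdigest()
    os.replace(tmp, os.path.join(root, digest))  # atomic within one filesystem
    return digest
```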

Of course, you still have to iterate over all the items in the
streaming storage, so some way to slowly work through a complete
list of entries would be good.

That was the same problem with v1... such an approach is not
practical with any crowded storage.

That just means your garbage collection process takes longer to
complete its pass. There's no reason it has to complete quickly
though, so I don't mind if it takes a month or two to gradually chew
through looking for old orphans. If you want faster cleanup, just run
a few garbage collectors in parallel on separate shards. If you don't
care about cleanup, just don't run any garbage collectors. I don't
see why it would be a large load on the storage driver.

About the rest, I would rather keep the "transactional storage" as a
"simple cache" and not as full-blown requirement.

How can a cache add transactionality?

@dmp42
Contributor

dmp42 commented Oct 28, 2014

If you're going to have content-addressability I'd definitely put it in at the storage level.

That would duplicate code into every driver, doesn't solve race-condition problems, requires the drivers to do more work, and makes it more difficult to change the addressability model if we want to (e.g. multiple versions of tarsum).

So, it's a no on this :-)

@wking
Contributor Author

wking commented Oct 28, 2014

On Tue, Oct 28, 2014 at 03:26:33PM -0700, Olivier Gambier wrote:

If you're going to have content-addressability I'd definitely put
it in at the storage level.

That would duplicate code into every driver, doesn't solve
race-condition problems, requires the drivers to do more work, and
makes it more difficult to change the addressability model if we
want to (e.g. multiple versions of tarsum).

So I'd use a simple hash for content addressability. For another
reason why using a braindead hash is useful, see [1].

If you go ahead with a fancy hash anyway, extending the API to have:

  • “I'm writing $COUNT bytes to $FANCY_HASH which should have a SHA-1
    hash $BRAINDEAD_HASH.”
  • “How many bytes have been written to $FANCY_HASH?”
  • “I'm continuing an earlier write to $FANCY_HASH with $COUNT
    additional bytes starting at $OFFSET.”

would be a fairly small change. And the rest of the streaming
storage layer could be content-addressable (no changes after a
complete write, no renames).

@wking
Contributor Author

wking commented Oct 28, 2014

On Tue, Oct 28, 2014 at 03:26:33PM -0700, Olivier Gambier wrote:

That … doesn't solve race-condition problems

Why not? With content-addressable storage (even with
externally-supplied names and resumable partial writes [1]), you have
the same bytes behind the name (or $FANCY_HASH name). If multiple
clients are writing those bytes to the same path, who cares? It's the
same bytes. I don't expect the degradation in disk life from
occasional duplicate writes to be a big deal ;).

@dmp42
Contributor

dmp42 commented Oct 28, 2014

  • you don't know the hash before reading to the end of the file - meanwhile, you need to write it somewhere - bottom line: you don't know the hash beforehand - and if you are going to argue that the client should send it, then you just allowed random content to overwrite any other content :-)
  • duplicating the hashing mechanism in every driver is bad - duplicated code is bad - error prone - a maintenance headache

@wking
Contributor Author

wking commented Oct 28, 2014

On Tue, Oct 28, 2014 at 03:54:50PM -0700, Olivier Gambier wrote:

  • you don't know the hash before reading to the end of the file -
    meanwhile, you need to write it somewhere - bottom line: you don't
    know the hash beforehand - and if you are going to argue that the
    client should send it, then you just allowed random content to
    overwrite any other content :-)

Ah, that is another reason why calculating the name in the storage
driver would be useful.

  • duplicating the hashing mechanism in every driver is bad -
    duplicated code is bad - error prone - a maintenance headache

I'd be surprised if any language folks would use lacked a sha1 or
sha256 implementation. So you could expect the client and engine to
both use it, and they'd both share some library implementation.
That's not bad; it's just good separation of concerns. So here's my
list of options in order of decreasing preference:

a. Use a hash that folks have already implemented (e.g. sha256), with
possible restrictions on clients to generate consistent tarballs
[1].
b. Write a tarsum library that storage drivers and the engine can
share. You'll need one for each language folks want to use.
c. Create a streaming storage API that lets you centralize hash
calculation in the registry. With post-write hashes, that looks
like:

 * “I'm going to write $COUNT bytes which should have a SHA-1 hash
   $BRAINDEAD_HASH, please give me a token for this write.”
 * “How many bytes have been written to $TOKEN?”
 * “I'm continuing an earlier write to $TOKEN with $COUNT
   additional bytes starting at $OFFSET.”
 * “I'm finishing my write to $TOKEN.  Please check the SHA-1 hash
   of what you got against my initial request, and if it matches
   store the content as $FANCY_HASH.”

There's no need for a generic move here; we just get the name for
the final object at the end instead of the beginning. A rough sketch
of that token API is below.
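
To illustrate option (c), here's a hedged in-memory sketch of that token API; the class and method names are invented, and a real driver would spool to disk rather than a bytearray:

```python
import hashlib
import uuid

class StreamingStore:
    def __init__(self):
        self._writes = {}  # token -> (expected sha1 hex, declared size, buffer)

    def start_write(self, count: int, sha1_hex: str) -> str:
        """'I'm going to write $COUNT bytes with SHA-1 $BRAINDEAD_HASH.'"""
        token = uuid.uuid4().hex
        self._writes[token] = (sha1_hex, count, bytearray())
        return token

    def bytes_written(self, token: str) -> int:
        """'How many bytes have been written to $TOKEN?'"""
        return len(self._writes[token][2])

    def continue_write(self, token: str, offset: int, data: bytes) -> None:
        """'I'm continuing an earlier write to $TOKEN at $OFFSET.'"""
        buf = self._writes[token][2]
        if offset != len(buf):
            raise ValueError("resume offset must match bytes already written")
        buf.extend(data)

    def finish_write(self, token: str, fancy_hash: str) -> None:
        """'Check the SHA-1; if it matches, store the content as $FANCY_HASH.'"""
        sha1_hex, count, buf = self._writes.pop(token)
        if len(buf) != count or hashlib.sha1(buf).hexdigest() != sha1_hex:
            raise ValueError("size or SHA-1 mismatch; write rejected")
        self._store_as(fancy_hash, bytes(buf))  # content gets its final name here

    def _store_as(self, name: str, blob: bytes) -> None:
        raise NotImplementedError  # backend-specific final placement
```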

@wking
Contributor Author

wking commented Oct 29, 2014

On Tue, Oct 28, 2014 at 03:13:29PM -0700, W. Trevor King wrote:

Knowing that content is write-once and named based on its content
allows you to make a bunch of simplifying choices in your
implementation (e.g. no need to worry about racy writes).

Another possible optimization is that you can make the initial,
partial writes (before you know the name [1,2]) to some faster media,
and then after the file is complete with a verified wire checksum, you
can push it to slower storage. For example, local filesystem storage
could write to a tmpfs and then push to disk, and S3 storage could
write to tmpfs (or a local disk) and then push to S3.
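
A small sketch of that staging idea for a local-filesystem driver, assuming a tmpfs mount for partial writes; the paths and the wire-checksum parameter are illustrative:

```python
import hashlib
import os
import shutil

STAGING = "/dev/shm/registry-staging"  # fast media (tmpfs) for partial writes
FINAL = "/var/lib/registry/blobs"      # slower, durable storage

def stage_then_commit(stream, expected_sha256: str) -> str:
    """Spool to fast media, verify the wire checksum, then move the
    complete file to slower storage under its digest name."""
    os.makedirs(STAGING, exist_ok=True)
    os.makedirs(FINAL, exist_ok=True)
    tmp = os.path.join(STAGING, expected_sha256 + ".partial")
    h = hashlib.sha256()
    with open(tmp, "wb") as f:
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            h.update(chunk)
            f.write(chunk)
    if h.hexdigest() != expected_sha256:
        os.unlink(tmp)
        raise ValueError("wire checksum mismatch")
    dest = os.path.join(FINAL, expected_sha256)
    shutil.move(tmp, dest)  # cross-device move: copy then unlink
    return dest
```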

Is it worth avoiding copies on S3? It looks like S3 has a transparent
copy operation that hides the move, but it only works on files that
are ≤ 5 GB [3]. Maybe S3-wrapping libraries work around that
transparently, so it's not a big deal?

@dmp42
Contributor

dmp42 commented Oct 29, 2014

We have very few layers above 500MB, not to mention above 5GB.
But then again, if we wanted to use local fs before pushing to the driver (like we did in the past!), the registry would do that and have ALL drivers benefit from it, instead of putting (again) that complexity into the driver.

Either way, I still fail to understand what benefit we could possibly get from delegating (the same) intelligence into (each and every) driver :-).

@wking
Contributor Author

wking commented Oct 29, 2014

On Wed, Oct 29, 2014 at 10:18:51AM -0700, Olivier Gambier wrote:

We have very few layers above 500MB, not to mention above 5GB.

If you want a 5GB cap on layers, then that will work (assuming other
streaming storage backends don't have tighter restrictions).
Otherwise things may get interesting when someone breaks that limit
on S3.

Either way, I still fail to understand what benefit we could
possibly get from delegating (the same) intelligence into (each and
every) drivers :-).

It's not the same intelligence. Maybe I have a dozen repositories
locally, never have write-contention or broken uploads, and want
something that's ridiculously simple. Maybe I have thousands of
repositories and want to maximize reliability and response times at
the cost of some extra complexity. In this case, local filesystem
storage will likely benefit from having the initial unnamed file
written to the same filesystem as it will eventually be stored on
(because a local rename once we know the name is cheap). But I can
see folks going either way for S3 (remote moves are expensive, but
maybe they don't have enough local storage to cache while we wait for
the name). The benefit to delegating these “how do we store things
efficiently” decisions to the drivers is that they are storage-side
issues, not registry-side issues. If some storage-driver maintainers
want to get together and write up shared intelligence in a library,
that would be great. But baking that knowledge into the registry
just breaks down the storage abstraction, and makes the registry code
more complex. Since libraries are cheap in general, and
write-then-move is easy to implement in particular, I see no benefit
to weakening the abstraction here.

@stevvooe
Contributor

stevvooe commented Jan 9, 2015

We are using this discussion in the ongoing roadmap for distribution but don't have actionable work based on this issue. Closing for now.

stevvooe closed this as completed Jan 9, 2015