Skip to content

Object Chunking and Garbage Collection

reiddraper edited this page Apr 10, 2012 · 33 revisions

Object Chunking and Garbage Collection

this is a wip

Once it's done, it should subsume these pages:


Note:

The large file material here should be valid for Riak CS 1.0, and the GC implementation described is currently under development.

Also, I'm going to refer to Riak CS by the name Moss, as it's less of a pain to type :)

Goals

The simple goal is the support objects up to 5GB in size, with the same API and behavioral characteristics as S3. S3 offers eventually consistent access to objects, with no concurrency control other than last-write-wins. This does mean that there might be one "version" of a file being read on one machine, while another is accepting a write for the same file, and yet another is deleting the file (but who's view for the delete is yet another version). All of this has to happen while also garbage collecting overwritten and failed-uploaded files. To accomplish this, we make heavy use of sibling resolution and Riak's eventual consistency properties.

Implementation Overview

Moss splits files up into small chunks, and stores each chunk as a separate Riak object. Files less than the size of one chunk will simply be stored as a single chunk. There is a separate manifest object that is used to record metadata about the file, such as the key, md5, and how many blocks have been written so far. Manifests and individual blocks are stored in different buckets. A user bucket called "foo", will, for example, correspond to two buckets in Riak, one for the manifests, and one for the blocks.

There is a manifest for each PUT operation on a particular key, each with a different UUID. Because of things like garbage collection and concurrent PUTs, the object stored at the manifest {Bucket, Key} is a collection of manifests (an orddict), not just a single manifest. Furthermore, {allow_mult, true} is set on this bucket, so there may be several siblings, each which are a collection of manifests. Depending on the operation, a piece of code may examine all of the manifests, or only a particular one.

Individual blocks are immutable, as they include the UUID as part of their key. They can however, be deleted.

Manifests can be in one of four different states:

  • writing: A manifest is in this state when it is first created, and as the blocks are being written to Riak.

  • active: A manifest is in this state once all of the blocks have been written to Riak. This is the only state that a manifest can be in and serve GET requests.

  • pending_delete: A manifest is in this state when it has been deleted by the user, or a newer manifest has been oberserved in the active state (ie. this one has been overwritten). At this point, the blocks for the manifest are not yet deleted. After the manifest has been in this state for some (configurable) amount of time, the individual blocks will start to be deleted.

  • deleted: A manifest is in this state state when all of the blocks have been deleted. The manifest itself is not removed, for some (configurable) time, so that it can act as a tombstone.

When two manifests with the same UUID are in conflict (because they came from manifest-collection siblings), they are resolved with this code.

Operations

Note, each of these operations might also spawn off garbage collection, which we'll talk about later.

  • GET:

    1. Retrieve manifests
    2. Resolve siblings
    3. Select all active manifests, and choose the one with the most recent write_start_time (a property of the manifest). If there are no active manifests, retun 404.

    Note, we specifically don't write the resolved manifests back to Riak, as this can create a situation where to actors both resolve siblings, therefore creating siblings again. This follows the general rule-of-thumb, don't make a write to Riak just to resolve siblings.

  • PUT:

    1. Create a new manifest (with fresh UUID) in the writing state. Once all of the blocks have been written, change the state to active.

    Note, each time the manifest is written back to Riak, we resolve siblings, and write back with the correct vector clock.

  • DELETE:

    1. Retrieve manifests
    2. Resolve siblings
    3. Select all active manifests, and mark them as pending_delete.

Garbage Collection

Moss implements lazy, distributed probabilistic garbage collection. It has a lot in common with how we do vector clock pruning. Let's start with the negative tradeoffs this implementation makes:

  1. A "slow" reader may have blocks GC'd as it's reading a file.

  2. A read from vnodes that have been in a "long" network partition (including machine failure, etc.), might return a manifest that claims to be in the active state, but has since been deleted and had some or all of the blocks garbage collected.

The GC implementation gives the deployer four knobs to tune things like "slow" and "long", used previously:

  1. delete_leeway_time
  2. delete_tombstone_time
  3. retry_delete_time
  4. partial_upload_delete_time

They're commenting here.

The typical lifecycle for a file that has been deleted (or overwritten) is to first go into the pending_delete state. It stays here for (by default) 24-hours before the actual blocks are deleted. This is so that we don't delete the blocks out from under slow readers, or actors that are in network partitions. The delete_leeway_time allows for control over the minimum time that manifests remain in this state. Once all of the blocks have been deleted, the manifest enters the deleted state, where it remains, as a tombstone, for delete_tombstone_time.

We call the GC implementation lazy because collection only happens when you observe a manifest that needs garbage collecting. We don't, for example, set a timer for 24-hours when you delete something, for the blocks to be deleted 24-hours later. This means that there are certain access patterns that can cause garbage to get accumulated. We might consider a reaper process that occasionaly goes through the cluster, looking for these situations.

There is pure function that determines whether a manifest needs GC or not, located here.

TODO: document pruning of delete tombstones