
Object Chunking and Garbage Collection

reiddraper edited this page May 22, 2012 · 33 revisions


This page is a work in progress.

Once complete, it should subsume several existing pages.


Note:

The large file material here should be valid for Riak CS 1.0, and the GC implementation described is currently under development.

Goals

The goal is simple: support objects up to 5GB in size, with the same API and behavioral characteristics as S3. S3 offers eventually consistent access to objects, with no concurrency control other than last-write-wins. This means that one "version" of a file might be read on one machine while another machine accepts a write for the same file, and yet another deletes the file (whose view of the file for the delete may be yet another version). All of this has to happen while overwritten files and failed uploads are also being garbage collected. To accomplish this, we make heavy use of sibling resolution and Riak's eventual-consistency properties.

Implementation Overview

Riak CS splits files up into small blocks, and stores each block as a separate Riak object. Files smaller than one block are simply stored as a single block. There is a separate manifest object that records metadata about the file, such as the key, md5, and how many blocks have been written so far. Manifests and individual blocks are stored in different buckets. A user bucket called "foo", for example, corresponds to two buckets in Riak: one for the manifests and one for the blocks.
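The chunking step above can be sketched as follows. This is a minimal illustration, not the Riak CS implementation; the block key format (`uuid:sequence`) and the 1 MiB block size are assumptions for the example.

```python
BLOCK_SIZE = 1024 * 1024  # 1 MiB; the real block size is configurable

def block_keys(uuid, data, block_size=BLOCK_SIZE):
    """Split an object's bytes into fixed-size blocks and yield
    (block_key, block_bytes) pairs. Each key embeds the per-upload
    UUID plus a sequence number, so blocks are effectively immutable."""
    num_blocks = max(1, -(-len(data) // block_size))  # ceiling division
    for seq in range(num_blocks):
        chunk = data[seq * block_size:(seq + 1) * block_size]
        yield (f"{uuid}:{seq}", chunk)
```

A file smaller than one block yields exactly one pair, matching the single-block case described above.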

There is a manifest for each PUT operation on a particular key, each with a different UUID. Because of things like garbage collection and concurrent PUTs, the object stored at the manifest {Bucket, Key} is a collection of manifests (an orddict), not just a single manifest. Furthermore, {allow_mult, true} is set on this bucket, so there may be several siblings, each of which is a collection of manifests. Depending on the operation, a piece of code may examine all of the manifests, or only a particular one. The Erlang record definition for manifests is well commented here.

Individual blocks are immutable, as they include the UUID as part of their key. They can, however, be deleted.

Manifests can be in one of five states:

  • writing: A manifest is in this state when it is first created, and as the blocks are being written to Riak.

  • active: A manifest is in this state once all of the blocks have been written to Riak. This is the only state in which a manifest can serve GET requests.

  • pending_delete: A manifest is in this state when it has been deleted by the user, or a newer manifest has been observed in the active state (i.e., this one has been overwritten). At this point, the blocks for the manifest are not yet deleted. After the manifest has been in this state for some (configurable) amount of time, the individual blocks will start to be deleted.

  • scheduled_delete: A manifest is in this state when its information has been written to the riak-cs-gc bucket, thus scheduling the file's blocks for deletion.

  • deleted: A manifest is in this state when all of the blocks have been deleted. The manifest itself is not removed for some (configurable) time, so that it can act as a tombstone.

When two manifests with the same UUID are in conflict (because they came from manifest-collection siblings), they are resolved with this code.
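One way to picture same-UUID conflict resolution is by state precedence: a manifest that is further along its lifecycle wins. This is a sketch of that idea only; the actual resolution code also merges timestamp and progress fields, which are elided here.

```python
# Lifecycle order from the five states above; later states take precedence.
STATE_RANK = {"writing": 0, "active": 1, "pending_delete": 2,
              "scheduled_delete": 3, "deleted": 4}

def merge_same_uuid(m1, m2):
    """Resolve two sibling copies of a manifest with the same UUID by
    keeping the copy whose state is furthest along the lifecycle."""
    return m1 if STATE_RANK[m1["state"]] >= STATE_RANK[m2["state"]] else m2
```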

Operations

Note: each of these operations might also trigger garbage collection, which we discuss later.

  • GET:

    1. Retrieve manifests
    2. Resolve siblings
    3. Select all active manifests, and choose the one with the most recent write_start_time (a property of the manifest). If there are no active manifests, return 404.

    Note: we specifically don't write the resolved manifests back to Riak, as this can create a situation where two actors both resolve siblings, thereby creating siblings again. This follows the general rule of thumb: don't write to Riak just to resolve siblings.
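    The GET steps above can be sketched as follows. Each sibling is modeled as a dict of manifests keyed by UUID; same-UUID resolution is simplified to "last copy seen" for brevity.

    ```python
    def get_active_manifest(sibling_manifest_sets):
        """GET path: merge the sibling manifest collections, select all
        active manifests, and return the one with the most recent
        write_start_time. Returns None (a 404 to the client) when no
        active manifest exists."""
        merged = {}
        for manifests in sibling_manifest_sets:   # one dict per sibling
            for uuid, m in manifests.items():
                merged[uuid] = m                  # same-UUID merge elided
        active = [m for m in merged.values() if m["state"] == "active"]
        return max(active, key=lambda m: m["write_start_time"], default=None)
    ```

    Note that the function only reads; per the rule of thumb above, the merged result is never written back.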

  • PUT:

    1. Create a new manifest (with fresh UUID) in the writing state. Once all of the blocks have been written, change the state to active.

    Note: each time the manifest is written back to Riak, we resolve siblings and write back with the correct vector clock.
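    A PUT can be sketched as below. The in-memory `store` dict is a hypothetical stand-in for Riak; the real code resolves siblings and supplies the vector clock on every manifest write, both of which are elided here.

    ```python
    import uuid as uuidlib

    def put_object(bucket, key, blocks, store):
        """PUT sketch: create a manifest with a fresh UUID in the
        `writing` state, write each block, then flip the manifest to
        `active` once all blocks are persisted."""
        manifests = store.setdefault((bucket, key), {})  # manifest collection
        m = {"uuid": uuidlib.uuid4().hex, "state": "writing",
             "blocks_written": 0}
        manifests[m["uuid"]] = m
        for seq, chunk in enumerate(blocks):
            store[("blocks", bucket, key, m["uuid"], seq)] = chunk
            m["blocks_written"] = seq + 1
        m["state"] = "active"                            # all blocks written
        return m["uuid"]
    ```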

  • DELETE:

    1. Retrieve manifests
    2. Resolve siblings
    3. Select all active manifests, and mark them as pending_delete.
    4. Check for any pending_delete manifests and attempt to move them to the scheduled_delete state. More details on this in the next section.
    5. Check for any deleted manifests whose tombstone deletion window has elapsed and remove them.
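    Steps 3 and 5 of the DELETE path can be sketched as follows; steps 1–2 (fetch and sibling resolution) and step 4 (moving manifests to scheduled_delete, covered in the next section) are elided. `store` is again a hypothetical dict standing in for Riak, and the field names follow the manifest fields discussed below.

    ```python
    import time

    def delete_object(bucket, key, store, tombstone_seconds, now=None):
        """Mark active manifests pending_delete, and prune deleted
        manifests whose tombstone window has elapsed."""
        now = time.time() if now is None else now
        manifests = store.get((bucket, key), {})
        for m in manifests.values():
            if m["state"] == "active":
                m["state"] = "pending_delete"
                m["delete_marked_time"] = now
        expired = [u for u, m in manifests.items()
                   if m["state"] == "deleted"
                   and now - m["last_block_deleted_time"] > tombstone_seconds]
        for u in expired:
            del manifests[u]
    ```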

Garbage Collection

First, a reminder that for a given named file, multiple internal versions of that file may be stored in the system at one time. Each version of the file is accessible through a file manifest that includes a UUID identifying that particular version. There is at most one active manifest for a named file at any one time, and the active version is the only one that is externally available to a Riak CS user.

The lifecycle for a file that has been deleted (or overwritten) is to change the state of the manifest to pending_delete and write a time value representing the current time to the delete_marked_time field. In the case of an overwrite, a new file manifest is written with the state set to active; in the case of a delete, the file is no longer externally available.

Once the active manifest's state has been set to pending_delete, the manifest information for any versions found in the pending_delete state is collected into a CRDT set, and the set is written as a value to the riak-cs-gc bucket, keyed by a time value representing the current epoch time plus the leeway interval. If that write is successful, the state of each manifest in the set is updated to scheduled_delete. This indicates that the blocks of the file have been scheduled for deletion by the garbage collection daemon, and prevents other manifest-resolution processes for the file from scheduling unnecessary deletions.
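The scheduling write can be sketched as below. Representing the key as a stringified epoch time and the value as a plain set of UUIDs are assumptions for illustration; the real value is a CRDT set of manifest information.

```python
def gc_bucket_entry(pending_manifests, now_epoch, leeway_seconds):
    """Build the riak-cs-gc entry for a batch of pending_delete
    manifests: the key is current epoch time plus the leeway interval,
    and the value is the set of scheduled manifest versions."""
    key = str(now_epoch + leeway_seconds)
    value = {m["uuid"] for m in pending_manifests}
    return key, value
```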

The manifest remains in the scheduled_delete state until the garbage collection daemon completes the deletion of all of the file blocks. Once the file blocks are deleted, the GC daemon updates the manifest: it sets the state to deleted, updates the last_block_deleted_time field, and sets delete_blocks_remaining to the empty set. Once the manifest enters the deleted state, it remains as a tombstone for delete_tombstone_time.

Garbage Collection Daemon

The actual deletion of file blocks is handled by a garbage collection daemon that wakes up at specific intervals and checks the riak-cs-gc bucket for any scheduled entries that are eligible for deletion. The daemon enters a working state and begins to delete the file blocks associated with the eligible keys and continues until all keys have been processed. The duration of this working state varies depending on the number of keys involved and the size of the files they represent. The daemon checks for messages after processing each file so that the work interval may be manually interrupted if needed.

Deletion eligibility is determined using the key values in the riak-cs-gc bucket which are time values. If the current time according to the daemon is later than the time represented by a key plus the leeway interval (gc_seconds_per_slice * num_leeway_slices seconds), the blocks for the file manifest stored at that key are eligible for deletion and the daemon attempts to delete them.

The daemon gathers the eligible keys for deletion by performing a secondary index range query on the $key index with a lower bound of time 0 and an upper bound of the current time. This allows the daemon to collect all the keys that are eligible for deletion and have some way of accounting for clock skew.
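The eligibility query can be mimicked as below. This sketch filters an in-memory list of keys; the real daemon issues a secondary index range query against the $key index rather than scanning keys itself.

```python
def eligible_gc_keys(gc_keys, now_epoch):
    """Mimic the $key 2i range query with bounds [0, now]. Each key is
    an epoch-time string that already includes the leeway interval, so
    any key <= now is eligible for block deletion."""
    return sorted(k for k in gc_keys if 0 <= int(k) <= now_epoch)
```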

Once the file blocks represented by a key in the riak-cs-gc bucket have all been deleted, the daemon updates the original file manifest in the objects bucket to set the state to deleted, and the key and value are deleted from the riak-cs-gc bucket.

Tradeoffs

  1. A slow reader may have blocks GC'd out from under it as it reads a file, if the read exceeds the leeway interval.
  2. There is some reliance on system clocks and this could lead to file blocks being deleted earlier or later than their intended eligibility window dictates due to clock skew.
  3. A network partition (or machine failure) lasting longer than delete_tombstone_time could cause a manifest to "come back to life" and appear active; it would then continually serve requests for blocks that could not be found.

Configuration

The GC implementation gives the deployer two knobs to tune:

  1. delete_tombstone_time
  2. leeway_seconds

They are commented here.
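A hypothetical riak_cs section of an app.config file showing where these knobs would be set; the values are examples only, not recommended defaults.

```erlang
{riak_cs, [
    %% seconds to wait after scheduling before blocks may be deleted
    {leeway_seconds, 86400},
    %% seconds a deleted manifest is kept as a tombstone
    {delete_tombstone_time, 604800}
]}.
```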

TODO: document pruning of delete tombstones