Object Chunking and Garbage Collection

Note:

The large file material here should be valid for Riak CS 1.0, and the GC implementation described is currently under development, to be released in Riak CS 1.1.

Goals

The simple goal is the support objects up to 5GB in size, with the same API and behavioral characteristics as S3. S3 offers eventually consistent access to objects, with no concurrency control other than last-write-wins. This does mean that there might be one "version" of a file being read on one machine, while another is accepting a write for the same file, and yet another is deleting the file (but who's view for the delete is yet another version). All of this has to happen while also garbage collecting overwritten and failed-uploaded files. To accomplish this, we make heavy use of sibling resolution and Riak's eventual consistency properties.

Implementation Overview

Riak CS splits files up into small blocks, and stores each block as a separate Riak object. Files less than the size of one block will simply be stored as a single block. There is a separate manifest object that is used to record metadata about the file, such as the key, MD5, and how many blocks have been written so far. Manifests and individual blocks are stored in different buckets. A user bucket called "foo", will, for example, correspond to two buckets in Riak, one for the manifests, and one for the blocks.

There is a manifest for each PUT operation on a particular key, each with a different UUID. Because of things like garbage collection and concurrent PUTs, the object stored at the manifest {Bucket, Key} is a collection of manifests (an orddict), not just a single manifest. Furthermore, {allow_mult, true} is set on this bucket, so there may be several siblings, each which are a collection of manifests. Depending on the operation, a piece of code may examine all of the manifests, or only a particular one. The erlang record definition for manifests is well commented here.

Individual blocks are immutable, as they include the UUID as part of their key. They can however, be deleted.

Manifests can be in one of four different states:

writing: A manifest is in this state when it is first created, and as the blocks are being written to Riak.
active: A manifest is in this state once all of the blocks have been written to Riak. This is the only state that a manifest can be in and serve GET requests.
pending_delete: When a user deletes a file or a file is overwritten, the manifest is first put in the pending_delete state, before the move to the GC bucket is successful. Once the manifest has been successfully moved, it will go into the scheduled_delete state.
scheduled_delete: A manifest is in this state when its information has been written to the riak-cs-gc bucket thus scheduling the file blocks for deletion. The manifest stays around for as long as leeway_seconds, and is pruned lazily.

When two manifests with the same UUID are in conflict (because they came from manifest-collection siblings), they are resolved with this code. NOTE/TO-UPDATE: riak_moss_manifest_resolution.erl has been changed by the gc147-gc feature branch.

Operations

GET:
1. Retrieve manifests
2. Resolve siblings
3. Select all active manifests, and choose the one with the most recent write_start_time (a property of the manifest). If there are no active manifests, return 404.
Note, we specifically don't write the resolved manifests back to Riak, as this can create a situation where to actors both resolve siblings, therefore creating siblings again. This follows the general rule-of-thumb, don't make a write to Riak just to resolve siblings.
PUT:
1. Create a new manifest (with fresh UUID) in the writing state. Once all of the blocks have been written, change the state to active.
2. Follow the same steps as in DELETE, to delete any manifests that this overwrites with the following exception: Manifests found in the writing state are not marked as pending_delete unless their last_block_written_time is leeway_seconds in the past.
Note, each time the manifest is written back to Riak, we resolve siblings, and write back with the correct vector clock.
DELETE:
1. Retrieve manifests
2. Resolve siblings
3. Select all active and writing manifests, and mark them as pending_delete.
4. "Move" these manifests to the GC bucket
5. Mark them as scheduled_delete

Garbage Collection

Overview

First, a reminder that for a given named object, multiple internal versions of that object may be stored in the system at one time. Each version of the object is accessible by an object manifest that includes a UUID identifying that particular version. There is at most one active manifest for a named object at any one time and the active version is the only one that is externally available to a Riak CS user.

Garbage collection of an object version involves several different actions. These actions can be divided into synchronous actions that occur while the user is waiting for notification of successful command completion and asynchronous actions that are not directly tied to user actions and occur in the background. These two action groups are described in more detail in the following sections.

Synchronous GC Actions

There are two direct actions a user may take to initiate the garbage collection of an object version: overwriting the object with a new version or deleting the object.

When an object version is overwritten a new object manifest is written with the state set to active and this new version becomes what is available to the user, but in the case of a delete the object is no longer externally available.

Also, as part of the overwrite or delete action, a set of eligible manifest versions are determined and the state of each eligible manifest is changed to pending_delete and the delete_marked_time field is set to a time value representing the current time.

The method for compiling the list of eligible manifests is dependent on the operation.

For object overwrites, the previously active manifest version is selected along with any manifest versions that are in the writing state where the last_block_written_time field (or the write_start_time if last_block_written_time is undefined) of the manifest represents a time value greater than leeway_seconds seconds ago. If a manifest version remains in the writing state for greater than leeway_seconds seconds, it is assumed that that manifest version represents a failed upload attempt and therefore it is acceptable to reap any object blocks that may have been written. Manifest versions in the writing state whose last_block_written_time has not exceeded the leeway_seconds threshold are not deemed eligible because they could represent an object version that is still in the progress of writing its blocks.

Object deletes are more straightforward. Since no object is externally available to the user after the delete operation, then any manifest versions in the active or writing state are eligible to be cleaned up. There is no concern about reaping the object version that is currently being written to become the next active version.

Once the states of the eligible manifests have been updated to pending_delete the manifest information for any pending_delete manifest versions are collected into a CRDT set and the set is written as a value to the riak-cs-gc bucket keyed by a time value representing the current epoch time plus the leeway interval (i.e. the leeway_seconds configuration option). If that write is successful then the state for each manifest in the set is updated to scheduled_delete. This indicates that the blocks of the object have been scheduled for deletion by the garbage collection daemon and avoids other manifest resolution processes for the object from scheduling unnecessary deletions.

Once the manifest enters the scheduled_delete state it remains as a tombstone for a minimum of leeway_seconds.

After these actions have been attempted, the synchronous portion of the garbage collection process is concluded and a response is return to the user who issued the request.

Garbage Collection Daemon

The asynchronous portion of the garbage collection process is orchestrated by the garbage collection daemon that wakes up at specific intervals and checks the riak-cs-gc bucket for any scheduled entries that are eligible for reaping. The daemon enters a working state and begins to delete the object blocks associated with the eligible keys and continues until all keys have been processed. The duration of this working state varies depending on the number of keys involved and the size of the objects they represent. The daemon checks for messages after processing each object so that the work interval may be manually interrupted if needed.

Deletion eligibility is determined using the key values in the riak-cs-gc bucket which are time values. If the current time according to the daemon is later than the time represented by a key plus the leeway interval (gc_seconds_per_slice * num_leeway_slices seconds), the blocks for the object manifest stored at that key are eligible for deletion and the daemon attempts to delete them.

The daemon gathers the eligible keys for deletion by performing a secondary index range query on the $key index with a lower bound of time 0 and an upper bound of the current time. This allows the daemon to collect all the keys that are eligible for deletion and have some way of accounting for clock skew.

Once the object blocks represented by a key in the riak-cs-gc bucket have all been deleted, the key is deleted from the riak-cs-gc bucket.

Controlling the GC Daemon

The garbage collection daemon may be queried and manipulated using the riak-cs-gc script. The script is installed to the sbin directory along with the primary riak-cs script. The available commands that can be used with the riak-cs-gc script are listed below. Running the script with no command provided displays a list of the available commands.

batch - Manually start garbage collection for a batch of eligible objects
status - Get the current status of the garbage collection daemon. The output is dependent on the current state of the daemon.
pause - Pause the current batch of object garbage collection. It has no effect if there is no active batch.
resume - Resume a paused garbage collection batch. It has no effect if there is no previously paused batch.

Manifest Updates

Manifest versions are retrieved and updated by the riak_moss_manifest_fsm module with very few exceptions. This module encapsulates the logic needed to retrieve the manifests, resolve any conflicts due to siblings, and write updated manifest versions back to Riak.

Object Block Reaping

The actual deletion of the blocks of an object is managed by the riak_cs_delete_fsm module. It starts up a number of delete workers (based on the configured delete concurrency) and passes off object block information to those workers who in turn carry out the actual delete operation for that block. The delete workers are instances of the riak_moss_block_server module.

Once a worker deletes a block it notifies the delete fsm and waits for notification about another block to delete. Once all blocks of an object are deleted then the delete fsm starts an instance of the manifest fsm to handle deleting the manifest version from the object manifest data structure and if there are no remaining manifest versions to delete the entire object manifest data structure. The goal of this final step is to avoid the cost of scanning through empty manifest keys that could linger indefinitely.

Trade-offs

An slow reader may have blocks GC'd as it is reading an object if the read exceeds the leeway interval.
There is some reliance on system clocks and this could lead to object blocks being deleted earlier or later than their intended eligibility window dictates due to clock skew.
A network partition (or machine failure) lasting longer than leeway_seconds could cause a manifest to "come back to life" and appear active, it would then continually serve requests whose blocks could not be found.

Configuration

The GC implementation gives the deployer two knobs to tune things like slow, used previously:

leeway_seconds
gc_interval
gc_retry_interval

They are commented here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly