Object Chunking and Garbage Collection
Once it's done, it should subsume these pages:
The large file material here should be valid for Riak CS 1.0, and the GC implementation described is currently under development.
The simple goal is the support objects up to 5GB in size, with the same API and behavioral characteristics as S3. S3 offers eventually consistent access to objects, with no concurrency control other than last-write-wins. This does mean that there might be one "version" of a file being read on one machine, while another is accepting a write for the same file, and yet another is deleting the file (but who's view for the delete is yet another version). All of this has to happen while also garbage collecting overwritten and failed-uploaded files. To accomplish this, we make heavy use of sibling resolution and Riak's eventual consistency properties.
Riak CS splits files up into small blocks, and stores each block as a separate Riak object. Files less than the size of one block will simply be stored as a single block. There is a separate manifest object that is used to record metadata about the file, such as the key, md5, and how many blocks have been written so far. Manifests and individual blocks are stored in different buckets. A user bucket called "foo", will, for example, correspond to two buckets in Riak, one for the manifests, and one for the blocks.
There is a manifest for each PUT operation
on a particular key, each with a different
UUID. Because of things like garbage
collection and concurrent PUTs, the object stored at the
manifest {Bucket, Key} is a collection of manifests (an orddict),
not just a single manifest. Furthermore, {allow_mult, true}
is
set on this bucket, so there may be several siblings, each which
are a collection of manifests. Depending on the operation, a piece
of code may examine all of the manifests, or only a particular one.
The erlang record definition for manifests is well commented
here.
Individual blocks are immutable, as they include the UUID as part of their key. They can however, be deleted.
Manifests can be in one of four different states:
-
writing
: A manifest is in this state when it is first created, and as the blocks are being written to Riak. -
active
: A manifest is in this state once all of the blocks have been written to Riak. This is the only state that a manifest can be in and serve GET requests. -
pending_delete
: A manifest is in this state when it has been deleted by the user, or a newer manifest has been oberserved in theactive
state (ie. this one has been overwritten). At this point, the blocks for the manifest are not yet deleted. After the manifest has been in this state for some (configurable) amount of time, the individual blocks will start to be deleted. -
scheduled_delete
: A manifest is in this state when its information has been written to theriak-cs-gc
bucket thus scheduling the file blocks for deletion. -
deleted
: A manifest is in this state state when all of the blocks have been deleted. The manifest itself is not removed, for some (configurable) time, so that it can act as a tombstone.
When two manifests with the same UUID are in conflict (because they came from manifest-collection siblings), they are resolved with this code.
Note, each of these operations might also spawn off garbage collection, which we'll talk about later.
-
GET
:- Retrieve manifests
- Resolve siblings
- Select all
active
manifests, and choose the one with the most recentwrite_start_time
(a property of the manifest). If there are noactive
manifests, retun404
.
Note, we specifically don't write the resolved manifests back to Riak, as this can create a situation where to actors both resolve siblings, therefore creating siblings again. This follows the general rule-of-thumb, don't make a write to Riak just to resolve siblings.
-
PUT
:- Create a new manifest (with fresh UUID)
in the
writing
state. Once all of the blocks have been written, change the state toactive
.
Note, each time the manifest is written back to Riak, we resolve siblings, and write back with the correct vector clock.
- Create a new manifest (with fresh UUID)
in the
-
DELETE
:- Retrieve manifests
- Resolve siblings
- Select all
active
manifests, and mark them aspending_delete
. - Check for any
pending_delete
manifests and attempt to move them to thescheduled_delete
state. More details on this in the next section. - Check for any
deleted
manifests whose tombstone deletion window has elapsed and remove them.
First, a reminder that for a given named file, multiple internal versions of that file may be stored in the system at one time. Each version of the file is accessible by a file manifest that includes a UUID identifying that particular version. There is at most 1 active manifest for a named file at any one time and the active version is only one that is externally available to a Riak CS user.
The lifecycle for a file that has been deleted (or overwritten) is to
change the state of the manifest to pending_delete
and write a time
value representing the current time to the delete_marked_time
field.
In the case of an overwrite a new file manifest is written with the
state set to active
, but in the case of a delete the file is no
longer externally available.
Once a manifest has been updated to set the state of the active
manifest to pending_delete
the manifest information for any versions
that are found in the pending_delete
state are collected into a CRDT
set and the set is written as a value to the riak-cs-gc
bucket keyed
by a time value representing the current epoch time plus the leeway
interval. If that write is successful then the state for each manifest
in the set is updated to scheduled_delete
. This indicates that the
blocks of the file have been scheduled for deletion by the garbage
collection daemon and avoids other manifest resolution processes for
the file from scheduling unnecessary deletions.
The manifest remains in the scheduled_delete
state until the garbage
collection daemon completes the deletion of all of the file blocks.
Once the file blocks are deleted the gc daemon updates the manifest
and sets the state to deleted
, updates the last_block_deleted_time
field, and sets delete_blocks_remaining
to be an empty set. Once the
manifest enters the deleted
state it remains as a tombstone for
delete_tombstone_time
.
The actual deletion of file blocks is handled by a garbage collection
daemon that wakes up at specific intervals and checks the riak-cs-gc
bucket for any scheduled entries that are eligible for deletion. The
daemon enters a working state and begins to delete the file blocks
associated with the eligible keys and continues until all keys have
been processed. The duration of this working state varies depending
on the number of keys involved and the size of the files they
represent. The daemon checks for messages after processing each file
so that the work interval may be manually interrupted if needed.
Deletion eligibility is determined using the key values in
the riak-cs-gc
bucket which are time values. If the current time
according to the daemon is later than the time represented by a key
plus the leeway interval (gc_seconds_per_slice
* num_leeway_slices
seconds), the blocks for the file manifest stored at that key are
eligible for deletion and the daemon attempts to delete them.
The daemon gathers the eligible keys for deletion by performing a
secondary index range query on the $key
index with a lower bound of
time 0 and an upper bound of the current time. This allows the
daemon to collect all the keys that are eligible for deletion and have
some way of accounting for clock skew.
Once the file blocks represented by a key in the riak-cs-gc
bucket
have all been deleted, the daemon updates the original file manifest
in the objects bucket to set the state to deleted
and the key and
value are deleted from the riak-cs-gc
bucket.
- An slow reader may have blocks GC'd as it is reading a file if the read exceeds the leeway interval.
- There is some reliance on system clocks and this could lead to file blocks being deleted earlier or later than their intended eligibility window dictates due to clock skew.
- A network partition (or machine failure) lasting longer than
delete_tombstone_time
could cause a manifest to "come back to life" and appear active, it would then continually serve requests whose blocks could not be found.
The GC implementation gives the deployer two knobs to tune things like slow, used previously:
delete_tombstone_time
leeway_seconds
They are commented here.