DAOS Container

A container represents an object address space inside a pool and is identified by a UUID. To access a container, an application must first connect to the pool and then create or open the container. If the application is authorized to access the container, it obtains a container handle. The handle carries capabilities that authorize any process in the application to access the container and its contents. The opening process may share this handle with any or all of its peers. Their capabilities are revoked when the container is closed.

Metadata Layout

The Container Service (cont_svc) stores the metadata for containers and provides an API to query and update container state and to manage the container life cycle. Container metadata is organized as a hierarchy of key-value stores (KVS) that is replicated over a number of servers using the Raft consensus protocol. Raft uses strong leadership: client requests can only be serviced by the service leader, while non-leader replicas merely respond with a hint pointing to the current leader so that the client can retry. cont_svc derives from a generic replicated service module, rsvc (see: Replicated Services: Architecture), whose implementation helps clients find the current leader.
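The leader redirect described above can be sketched as a small client retry loop. This is a minimal model, not the rsvc implementation: the Replica and Reply classes, the leader_hint field, and send_with_leader_search are hypothetical names, and the sketch assumes every replica already knows who the current leader is.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    ok: bool
    leader_hint: Optional[int] = None  # replica index suggested by a non-leader
    value: Optional[str] = None


class Replica:
    """Hypothetical stand-in for one replica of the service."""

    def __init__(self, index, leader):
        self.index = index
        self.leader = leader  # index this replica believes is the leader

    def handle(self, request):
        if self.index != self.leader:
            # Non-leaders do not service the request; they only return a hint.
            return Reply(ok=False, leader_hint=self.leader)
        return Reply(ok=True, value=f"handled {request} on replica {self.index}")


def send_with_leader_search(replicas, request, start=0, max_tries=5):
    """Send a request, following leader hints until the leader answers."""
    target = start
    for _ in range(max_tries):
        reply = replicas[target].handle(request)
        if reply.ok:
            return reply.value
        target = reply.leader_hint  # retry against the hinted leader
    raise RuntimeError("no leader found")


replicas = [Replica(i, leader=2) for i in range(3)]
result = send_with_leader_search(replicas, "CONT_OPEN")
```

The client starts at an arbitrary replica and needs at most one redirect in this model. A real client would also have to cope with stale hints and leader changes, which is why the loop is bounded by max_tries.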

Figure: Container Service Layout

The top-level KVS root has two children:

  1. Containers KVS: Holds the Container Properties KVSs, indexed by container UUID. The user supplies this UUID when creating a new container.
  2. Container Handles KVS: Stores data about the container handles opened by applications, indexed by handle UUID. The client generates this UUID when opening a container. The metadata associated with a container handle includes its capabilities (e.g., read-only or read-write) and its per-handle epoch state. When a container handle is closed, the corresponding entry is removed from this store.
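The two-level layout above can be pictured with nested dictionaries. This is only an in-memory sketch: the key names ("containers", "container_handles", "capabilities", "epoch_state") and the helper functions are hypothetical, and the real KVS is replicated and persistent.

```python
import uuid

# Top-level KVS root with its two children, modeled as nested dicts.
root = {
    "containers": {},         # container UUID -> Container Properties KVS
    "container_handles": {},  # handle UUID -> per-handle metadata
}


def create_container(cont_uuid, props=None):
    """Add a Container Properties KVS keyed by the user-supplied UUID."""
    root["containers"][cont_uuid] = dict(props or {})


def open_handle(cont_uuid, read_only):
    """Record a new handle; the handle UUID is generated on the client side."""
    if cont_uuid not in root["containers"]:
        raise KeyError("unknown container")
    handle_uuid = uuid.uuid4()
    root["container_handles"][handle_uuid] = {
        "container": cont_uuid,
        "capabilities": "ro" if read_only else "rw",
        "epoch_state": {},  # placeholder for the per-handle epoch state
    }
    return handle_uuid


def close_handle(handle_uuid):
    """Closing a handle removes its entry from the Container Handles KVS."""
    del root["container_handles"][handle_uuid]


cont = uuid.uuid4()
create_container(cont)
hdl = open_handle(cont, read_only=True)
```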

The Container Properties KVS stores per-container metadata, which consists of many mutable and immutable scalar-valued properties as well as other KVSs, as shown in the figure above.

Users can create, delete, and list persistent snapshots, which are essentially epochs that will not be aggregated away. A snapshot remains readable until it is explicitly destroyed. A container can also be rolled back to a particular snapshot (see: Storage Model: DAOS Container and Transaction Model: Container Snapshot).

Users can also define custom attributes for containers. These are essentially name-value pairs in which the name is a null-terminated string and the value is an arbitrary sequence of bytes. The Container Service allows clients to retrieve and update multiple attributes at a time, as well as to list the names of stored attributes.
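A minimal sketch of the attribute semantics described above, assuming an in-memory store. The ContainerAttributes class and its method names are hypothetical and do not mirror the DAOS client API. Because names are null-terminated strings, the model rejects names that contain an embedded NUL character.

```python
class ContainerAttributes:
    """Hypothetical in-memory model of per-container user attributes."""

    def __init__(self):
        self._attrs = {}  # attribute name (str) -> value (bytes)

    def set(self, pairs):
        """Update several attributes at once; validate all before applying."""
        for name, value in pairs.items():
            if "\0" in name:
                raise ValueError("name must not contain an embedded NUL")
            if not isinstance(value, (bytes, bytearray)):
                raise TypeError("value must be a byte sequence")
        for name, value in pairs.items():
            self._attrs[name] = bytes(value)

    def get(self, names):
        """Retrieve several attribute values in one call."""
        return [self._attrs[name] for name in names]

    def list_names(self):
        """List the names of all stored attributes."""
        return sorted(self._attrs)


attrs = ContainerAttributes()
attrs.set({"owner": b"alice", "checksum": b"\x00\xff\x10"})
values = attrs.get(["checksum", "owner"])
```

Values are opaque bytes, so a value may contain zero bytes even though a name may not.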

Target Service (TO BE UPDATED)

The Target Service maps the global object address space of a DAOS container onto the local object address space of a VOS container within the target's VOS pool (vpool), and calls VOS methods on behalf of the Container Service (see: VOS Concepts). It caches per-thread information on container objects and open handles in volatile memory for ready access.

Target Faults

Given hundreds of thousands of targets, the epoch protocol must allow progress in the presence of target faults. Since pool and container services are highly available, the problem is mainly concerned with target services. The solution is based on the assumption that losing some targets may not necessarily cause any application data loss, as there may be enough redundancy created by the DAOS-SR layer to hide the faults from applications. Moreover, an application might even want to ignore a particular data loss (which the DAOS-SR layer is unable to hide), for it has enough application-level redundancy to cope or it simply does not care.

When a write, flush, or discard operation fails, the DAOS-SR layer determines whether there is sufficient redundancy left to continue with the epoch. If the failure can be hidden, and assuming that the target in question has not already been disabled in the pool map (e.g., as a result of a RAS notification), the DAOS-SR layer must disable the target before committing the epoch. For the epoch protocol, the resulting pool map update effectively records the fact that the target may store an undefined set of write operations in the epoch and should be avoided. This also applies to applications that choose to ignore similar failures that the DAOS-SR layer cannot hide.
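The decision above can be illustrated with a deliberately simple model. The redundancy rule used here is an assumption made for the example only: an object is replicated on a set of targets, and a failure counts as hidden if at least one other replica target is still enabled. The actual DAOS-SR calculation is not described in this document, and the function names are hypothetical.

```python
def can_hide_failure(replica_targets, failed_target, disabled):
    """Assumed rule: hidden if some other replica target is still enabled."""
    survivors = [
        t for t in replica_targets
        if t != failed_target and t not in disabled
    ]
    return len(survivors) >= 1


def on_write_failure(replica_targets, failed_target, pool_map_disabled):
    """Return True if the epoch may still be committed after the failure."""
    if not can_hide_failure(replica_targets, failed_target, pool_map_disabled):
        # Not enough redundancy left; the application decides what to do.
        return False
    # Disable the target in the pool map before the epoch is committed.
    # Adding to a set is a no-op if the target was already disabled.
    pool_map_disabled.add(failed_target)
    return True


disabled = set()
first = on_write_failure([1, 2, 3], 2, disabled)   # two replicas remain
second = on_write_failure([1, 2, 3], 1, disabled)  # one replica remains
third = on_write_failure([1, 2, 3], 3, disabled)   # no replica remains
```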

Object ID Allocator

The OID allocator is a helper service that allows users to allocate a unique set of 64-bit unsigned integers within a container. This is helpful for applications or middleware that have no easy way to allocate unique DAOS object IDs in a scalable manner. The largest allocated ID is tracked in the Container Properties KVS for future access to that container. This service does not guarantee that the allocated IDs are sequential, and several ID ranges may be discarded at container close.

The allocator is implemented on the server side using an Incast Variable (IV). The root of the IV tree tracks the highest object ID used in the container. A client may request a new allocation from any server running the Container Target Service, i.e., any node in the IV tree. When a new request arrives, the server first checks whether enough previously allocated IDs are available locally. If not, it forwards a request to its parent, asking for a larger range of OIDs. The parent performs the same check and keeps forwarding to its own parent until the request is satisfied or reaches the IV root, which updates the incast variable for the maximum OID allocated in the container metadata. At each tree level, the number of OIDs requested is increased so that future OID allocation requests can be satisfied faster.
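The forwarding scheme above can be sketched as a tree of nodes that each cache a range of IDs. This is a simplified single-process model: the OidNode class, the growth multiplier, and the policy of dropping leftover local IDs on refill are assumptions made for the example, not details taken from the implementation.

```python
class OidNode:
    """One node of a hypothetical IV tree; the root has parent=None."""

    def __init__(self, parent=None, growth=4):
        self.parent = parent
        self.growth = growth  # assumed multiplier applied at each tree level
        self.next_oid = 0     # start of the locally cached range
        self.end_oid = 0      # one past the end of the locally cached range
        self.max_oid = 0      # root only: first OID not yet handed out

    def alloc(self, count):
        """Return the first OID of a range of `count` unique OIDs."""
        if self.parent is None:
            # Root: advance the maximum recorded in the container metadata.
            start = self.max_oid
            self.max_oid += count
            return start
        if self.end_oid - self.next_oid < count:
            # Not enough IDs cached locally: ask the parent for a larger
            # range. Leftover local IDs are dropped in this sketch, so the
            # IDs a client sees are unique but not necessarily sequential.
            bigger = count * self.growth
            self.next_oid = self.parent.alloc(bigger)
            self.end_oid = self.next_oid + bigger
        start = self.next_oid
        self.next_oid += count
        return start


root = OidNode()
mid = OidNode(parent=root)
leaf_a = OidNode(parent=mid)
leaf_b = OidNode(parent=mid)

a1 = leaf_a.alloc(10)  # forwarded up to the root
b1 = leaf_b.alloc(10)  # satisfied by the range cached at `mid`
a2 = leaf_a.alloc(10)  # satisfied from leaf_a's own cache
```

Only the first request travels all the way to the root. Later requests are served from ranges cached lower in the tree, which is the reason for asking for more IDs at each level.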

Container Operations

A client creates a new container by sending a CONT_CREATE request to the Container Service with the pool handle and a UUID. The client must first establish a pool connection to obtain a pool handle. Optionally, the request can also contain a list of properties to be set on the newly created container. In response, the Container Service creates the corresponding Container Properties KVS with the UUID as the key. Creating a container does not require involvement of the Target Service.

Clients may now open a container by supplying the open pool handle and the container UUID along with flags (e.g., read-only or read-write). The client library sends a CONT_OPEN request with a locally generated handle UUID to the Container Service, which then uses an IV (Incast Variable) to broadcast the handle asynchronously to all enabled targets in the pool. On successful completion, the Container Service creates a new entry in the Container Handles KVS.

A client can close a container handle that is no longer needed by sending a CONT_CLOSE request to the Container Service. The Container Service broadcasts a collective CONT_TGT_CLOSE to all enabled targets in order to close the container handle. It then deletes the corresponding entry from the Container Handles KVS and discards any updates performed on the handle that were not committed.

A container is destroyed when the client sends a CONT_DESTROY request to the Container Service causing it to purge all metadata. Similarly, the targets collectively receive a CONT_TGT_DESTROY request from the Container Service and drop all data associated with that container including all the objects within that container. The client can optionally destroy a container forcibly in case it has handles that are currently open.
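The four operations above can be tied together in a small in-memory model. This is a sketch under stated assumptions: the Target and ContainerService classes and their method names are hypothetical, the pool handle is omitted, broadcasts are modeled as synchronous loops, and a forced destroy is assumed to close the open handles first.

```python
import uuid


class Target:
    """Hypothetical target service: tracks open handles and stored data."""

    def __init__(self):
        self.handles = set()
        self.data = {}  # container UUID -> objects stored on this target


class ContainerService:
    """Hypothetical model of the Container Service metadata operations."""

    def __init__(self, targets):
        self.targets = targets
        self.containers = {}  # Containers KVS
        self.handles = {}     # Container Handles KVS

    def cont_create(self, cont_uuid, props=None):
        # Creating a container touches metadata only, not the targets.
        self.containers[cont_uuid] = dict(props or {})

    def cont_open(self, cont_uuid, handle_uuid, flags):
        if cont_uuid not in self.containers:
            raise KeyError("unknown container")
        for target in self.targets:  # stands in for the IV broadcast
            target.handles.add(handle_uuid)
        self.handles[handle_uuid] = {"container": cont_uuid, "flags": flags}

    def cont_close(self, handle_uuid):
        for target in self.targets:  # stands in for CONT_TGT_CLOSE
            target.handles.discard(handle_uuid)
        del self.handles[handle_uuid]

    def cont_destroy(self, cont_uuid, force=False):
        open_handles = [h for h, meta in self.handles.items()
                        if meta["container"] == cont_uuid]
        if open_handles and not force:
            raise RuntimeError("container has open handles")
        for handle_uuid in open_handles:
            self.cont_close(handle_uuid)
        for target in self.targets:  # stands in for CONT_TGT_DESTROY
            target.data.pop(cont_uuid, None)
        del self.containers[cont_uuid]


targets = [Target(), Target()]
svc = ContainerService(targets)
cont = uuid.uuid4()
hdl = uuid.uuid4()  # generated locally by the client
svc.cont_create(cont)
svc.cont_open(cont, hdl, "rw")
```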

Epoch Protocol

The epoch protocol implements the epoch model described in the Transaction Model. The Container Service manages the epochs of a container; it maintains the definitive epoch state as part of the container metadata, whereas the target services have little knowledge of the global epoch state. Epoch commit, discard, and aggregate procedures are therefore all driven by the Container Service.

On each target, the target service eagerly stores incoming write operations into the matching VOS container. If a container handle discards an epoch, VOS helps discard all write operations associated with that container handle. When a write operation succeeds, it is immediately visible to conflicting operations in equal or higher epochs. A conflicting write operation with the same epoch will be rejected by VOS unless it is associated with the same container handle and has the same content as the one that is already executed.
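The write rules above can be modeled in a few lines. In this sketch two writes are treated as conflicting when they address the same key, which is an assumption made for the example. The VosContainerModel class and its methods are hypothetical and are not the VOS API.

```python
class VosContainerModel:
    """Hypothetical model of per-epoch write handling in a VOS container."""

    def __init__(self):
        self.writes = {}  # (key, epoch) -> (handle, content)

    def write(self, key, epoch, handle, content):
        """Store a write; return False if it is rejected as a conflict."""
        existing = self.writes.get((key, epoch))
        if existing is not None:
            # Same epoch: accepted only as an exact repeat, i.e. the same
            # container handle writing the same content again.
            return existing == (handle, content)
        self.writes[(key, epoch)] = (handle, content)
        return True

    def read(self, key, epoch):
        """Return the content of the latest write at or below `epoch`."""
        epochs = [e for (k, e) in self.writes if k == key and e <= epoch]
        if not epochs:
            return None
        return self.writes[(key, max(epochs))][1]

    def discard(self, handle, epoch):
        """Drop every write made through `handle` in `epoch`."""
        self.writes = {
            k: v for k, v in self.writes.items()
            if not (k[1] == epoch and v[0] == handle)
        }


vos = VosContainerModel()
accepted = vos.write("k", 5, "h1", b"a")
repeated = vos.write("k", 5, "h1", b"a")      # same handle, same content
other_handle = vos.write("k", 5, "h2", b"a")  # different handle
other_data = vos.write("k", 5, "h1", b"b")    # different content
```

A successful write is visible to reads at the same or a higher epoch, and it disappears again if the handle discards that epoch.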

Before committing an epoch, an application must ensure that a sufficient set of write operations for this epoch have been persisted by the target services. The application may decide that losing some write operations is acceptable, depending on the redundancy scheme each of them employs. Committing an epoch of a container handle results in a CONT_EPOCH_COMMIT request to the corresponding container service, which simply updates the metadata. When the update becomes persistent, the container service replies to the client with the new epoch state.