Hierarchical Storage Management API

Amir Goldstein edited this page Jul 20, 2023 · 50 revisions

Introduction

Hierarchical Storage Management (a.k.a. HSM) refers to storage systems where content may be seamlessly migrated between storage tiers.

When integrating HSM with a local filesystem, HSM and filesystem need an API to communicate at least the following events:

  • HSM is about to migrate away the content of a filesystem object
  • HSM has finished migrating the content of a filesystem object (into the filesystem or away from it)
  • A filesystem object without content is about to be accessed
  • The content of a filesystem object was modified

For example, when the filesystem notifies HSM that an object without content is about to be accessed, HSM needs to migrate the content of the object back into the filesystem and notify the filesystem whether the migration has completed or failed. If the migration succeeded, the filesystem grants access to the object content; if it failed, the filesystem denies access to the object with incomplete content.

HSM for Linux

HSM is a very old concept and many different HSM systems are in use or have been in use on different operating systems.

HSM systems for Linux are also available, despite the fact that there is no generic HSM API available in Linux.

Most of these HSM systems use proprietary code, so information about them is limited to pieces of information collected from public sources. Following are some examples of how HSM systems for Linux work.

FUSE pass-through

Probably the most common way to implement an HSM system on Linux is to use a FUSE overlaid filesystem that exposes access-safe filesystem objects.

The FUSE filesystem intercepts all filesystem object access on the overlaid filesystem and controls access to objects on the local filesystem, so it can manage seamless migration of local filesystem content.

The libprojfs port of the Windows ProjFS API for Linux is an example of a FUSE-based HSM.

The use of a FUSE overlaid filesystem has a noticeable performance cost, even when accessing objects whose complete content is on the local filesystem.

The Android FUSE BPF project is a proposed way to reduce this performance cost.

Out-of-tree driver

There is at least one known example of an HSM that is implemented in the kernel, but was not upstreamed.

The Android Incremental File System can be seen as an HSM to access Android applications directly from the cloud without having to install them locally.

It was implemented as a kernel driver, because the performance overhead of FUSE (context switches in particular) was considered unacceptable for the Android workloads.

XFS and DMAPI

DMAPI is an old standard for HSM-filesystem API that is implemented by some non-Linux filesystems. The only Linux native filesystem that had partial support for DMAPI is XFS, but XFS DMAPI support was never fully upstreamed.

In Linux kernel v2.6.36, commit 288699fecaff ("xfs: drop dmapi hooks") removed most of the XFS DMAPI code, saying that "If we'll ever get HSM support in mainline at least the namespace events can be done much saner in the VFS instead of the individual filesystem, so it's not like this is much help for future work."

Providing a generic VFS implementation for HSM API based on fanotify is exactly what this project aims to do.

Generic Linux HSM API

The following sections will list the APIs already available in the Linux kernel that could be utilized by an HSM implementation and the missing APIs that will need to be added to the Linux kernel in order to enable a complete and useful HSM implementation.

Existing APIs

A very basic HSM can be implemented with the fanotify APIs available since Linux kernel v5.1, together with file leases and hole punching:

  • Call fcntl(fd, F_SETLEASE, F_WRLCK) to acquire exclusive access to a file object before migrating its content away
  • Call fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, ...) to evict the local object content after migrating its content away
  • Subscribe to FAN_OPEN_PERM permission events on objects with evicted content to approve filesystem object access
  • Subscribe to FAN_MODIFY fanotify events on a filesystem mark to be notified after every file content modification
  • Subscribe to dirent modification events on a filesystem mark to be notified after every directory content modification

This basic HSM implementation would have some major drawbacks:

  1. HSM needs to scan the entire data set on startup, to find objects with evicted content and subscribe to permission events
  2. Modification events can be lost on system crash, so HSM needs to scan the entire data set on startup, looking for changed content
  3. HSM cannot support migrating directory content, because HSM is not called before looking up path components
  4. HSM cannot support migrating partial file content, because HSM is not provided the file range information on content access
  5. Writing file content in the context of handling a permission event can lead to deadlock with another process doing filesystem freeze
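
The first two steps of the basic HSM above (take a write lease, then punch out the content) can be sketched in a few lines. This is only an illustration under stated assumptions: try_evict is a hypothetical helper name, the fallocate(2) call goes through ctypes on a 64-bit Linux libc, and a real HSM would migrate the content away before punching the hole and would also handle the lease break signal.

```python
import ctypes
import fcntl
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Assumption: 64-bit Linux libc, where off_t is 64 bits wide.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def try_evict(path):
    """Punch out the whole content of path while holding a write lease.

    Returns True if the content was evicted, and False if the lease was
    denied or the filesystem does not support leases or hole punching.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Denied if any other file descriptor is open on the file.
            fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_WRLCK)
        except OSError:
            return False
        try:
            size = os.fstat(fd).st_size
            if size == 0:
                return True
            # PUNCH_HOLE must be combined with KEEP_SIZE: the file keeps
            # its size and the punched range reads back as zeroes.
            mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
            return _libc.fallocate(fd, mode, 0, size) == 0
        finally:
            fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_UNLCK)
    finally:
        os.close(fd)
```

After a successful call the file size is unchanged but its blocks are freed, which is why the filesystem alone cannot tell a reader that the zeroes are not the real content; that is the job of the permission events.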

New fanotify events and information

The following fanotify events and information would need to be added to tackle the functional shortcomings of the aforementioned basic HSM:

  • FAN_PRE_ACCESS pre-content permission events to allow filling file content before it is accessed
  • FAN_PRE_MODIFY pre-content pre-modify permission events to allow filling file content before it is modified
  • Report file range information in pre-content permission events
  • FAN_PATH_ACCESS lookup permission event to approve lookup of entries in a directory with evicted content
  • FAN_PATH_MODIFY pre-modify permission events to approve modifications to directory content (e.g. create, delete, rename) before they take place
  • FAN_MARK_SYNC command to wait for in-progress modifications for which a pre-modify event was already delivered

HSM can use the lookup permission event to lazily subscribe to lookup and pre-modify permission events on children during path lookup. This avoids scanning on startup for objects with evicted content and for objects with unmodified content.

With the lookup permission event and file range info, HSM will be able to support migrating directory and partial file content.

With the pre-modify permission events and the FAN_MARK_SYNC command, HSM will be able to maintain a crash-safe registry of modified filesystem objects.

Resource Utilization vs. Performance

When lazily walking a large data set, HSM needs to subscribe to permission events on every filesystem object with incomplete content and on every filesystem object not listed in the modified objects registry.

In the worst case, HSM may end up setting inode marks on the entire data set. Inode marks pin the inode in the inode cache, so in that worst case, the fanotify marks could deplete system resources.

This resource utilization problem could be addressed by subscribing to permission events with a mount or filesystem mark instead. However, that results in calling HSM and waiting for its approval before every filesystem object access or modification, which would have an unacceptable performance impact on real-life workloads.

Evictable inode marks

One solution to this problem is an fanotify feature called evictable inode marks, which was merged in Linux kernel v5.19.

The idea is to subscribe to permission events with a mount or filesystem mark and then, as HSM handles the permission events, to lazily add evictable inode marks with ignore masks on the files and directories that it is NOT interested in getting events for.

The difference from lazily walking the data set is that those evictable inode marks with ignore masks do not pin the inode and may be evicted from the inode cache together with it. That is safe because, if a mark is evicted, HSM simply gets an undesired event and re-establishes the evictable inode mark with ignore mask.
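
The trade-off can be illustrated with a toy model, which is not kernel code and does not call fanotify. It assumes a bounded LRU set as a stand-in for the inode cache; the class names EvictableIgnoreMarks and ToyHSM are hypothetical. Every access goes through the "mount mark" unless an ignore mark is still cached for that inode.

```python
from collections import OrderedDict


class EvictableIgnoreMarks:
    """Toy stand-in for evictable inode marks: a bounded LRU set of inodes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._marks = OrderedDict()

    def add(self, ino):
        self._marks[ino] = True
        self._marks.move_to_end(ino)
        while len(self._marks) > self.capacity:
            # Models the inode, and its mark, being evicted from cache.
            self._marks.popitem(last=False)

    def ignores(self, ino):
        if ino in self._marks:
            self._marks.move_to_end(ino)
            return True
        return False


class ToyHSM:
    def __init__(self, needs_content, capacity):
        self.needs_content = set(needs_content)
        self.marks = EvictableIgnoreMarks(capacity)
        self.events_handled = 0

    def access(self, ino):
        """Simulate one access; return True if HSM had to handle an event."""
        if self.marks.ignores(ino):
            return False
        self.events_handled += 1
        # Pretend the content is filled in if it was missing.
        self.needs_content.discard(ino)
        # Object is now uninteresting: ignore it until the mark is evicted.
        self.marks.add(ino)
        return True
```

In this model the number of marks never exceeds the chosen capacity, and the price is an occasional extra event for an object whose mark was evicted.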

Persistent inode marks

The concept of evictable inode marks can be further enhanced to support persistent inode marks that are stored on disk in an xattr on the inode.

When HSM sets a persistent inode mark with ignore mask on a file or directory, it also does not pin the inode in the inode cache. Unlike the evictable inode mark, however, the persistent inode mark is re-established by the kernel automatically when the inode is loaded back into the inode cache, without going through the cycle of sending a permission event to HSM and letting HSM re-establish an evictable inode mark.

Avoiding the unneeded userspace cycle is nice, but the biggest performance benefit of persistent inode marks would be achieved if HSM subscribes to permission events with persistent inode marks and not with a mount or filesystem mark. In that case, HSM lazily walks the data set, setting up persistent inode marks only on the files and directories it needs to migrate. There is then almost no fsnotify performance penalty for accessing filesystem objects whose full content is available in the filesystem and which are already recorded in the modified objects registry.

Synchronizing access to objects

Users must not be allowed to access filesystem objects with evicted content, and HSM must not evict filesystem content while that content is being accessed, so synchronization is needed to make sure that no race can allow either.

Populating object content

One way or another, HSM needs to make sure to set up fanotify marks on startup that guarantee the delivery of permission events before any access or modification to filesystem objects with incomplete content.

HSM systems can be set up in a way that does not expose the filesystem mount to users until the HSM service is running and all the fanotify marks have been established.

HSM must be allowed to populate the content of the accessed object, file or directory, by calling VFS system calls (e.g. write(2), mknodat(2)) before responding to the permission event. For that reason, VFS locks on the object must not be held in the context of permission event handling.

Once a filesystem object content is fully populated, HSM may remove the fanotify mark or add an fanotify mark with ignore mask to suppress future access permission events for that object, but that can only be done after the object content has been populated.

It would be undesirable for the system calls used to populate the object content to generate permission events themselves. To avoid those recursive permission events, fanotify permission events carry a special file descriptor of the object that can be used in system calls (e.g. write(2), mknodat(2)) to modify the content of the object without generating permission events.
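
The populate-then-respond order can be sketched as follows. The struct layouts and constants are those of the existing fanotify permission event interface in <linux/fanotify.h>; handle_events and the populate callback are hypothetical names. The sketch only parses an event buffer and builds the response records. A real service would read the buffer from the fanotify group fd, write the responses back to it, and close each event fd.

```python
import struct

# Layouts and values from <linux/fanotify.h>.
EVENT_METADATA = struct.Struct("=IBBHQii")  # struct fanotify_event_metadata
RESPONSE = struct.Struct("=iI")             # struct fanotify_response
FAN_OPEN_PERM = 0x00010000
FAN_ACCESS_PERM = 0x00020000
FAN_ALLOW = 0x01
FAN_DENY = 0x02


def handle_events(buf, populate):
    """Build response records for every permission event found in buf.

    populate(fd) is called with the event's file descriptor before the
    verdict is decided, and is expected to raise OSError if the content
    could not be migrated back.
    """
    responses = []
    offset = 0
    while offset + EVENT_METADATA.size <= len(buf):
        event_len, _vers, _rsvd, _meta_len, mask, fd, _pid = \
            EVENT_METADATA.unpack_from(buf, offset)
        if event_len < EVENT_METADATA.size:
            break  # malformed record, stop parsing
        if mask & (FAN_OPEN_PERM | FAN_ACCESS_PERM):
            try:
                populate(fd)          # fill content via the event fd
                verdict = FAN_ALLOW   # migration succeeded: grant access
            except OSError:
                verdict = FAN_DENY    # migration failed: deny access
            responses.append(RESPONSE.pack(fd, verdict))
        offset += event_len
    return b"".join(responses)
```

Because the response is only produced after populate() returns, the process that triggered the event stays blocked until the content is in place or the access is denied.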

Evicting file content

HSM can evict regular file content, fully or partially, using the FALLOC_FL_PUNCH_HOLE command to fallocate(2), but special care needs to be taken to synchronize this operation with object content access.

HSM will use the F_SETLEASE command to try to acquire an exclusive write lease on an object before evicting its content. If any process has an open file descriptor to the object, the exclusive write lease will be denied. If HSM is granted the exclusive write lease, it will add permission events to the event mask or remove permission events from the ignore mask of the fanotify inode mark on the object.

Updating the fanotify mark masks on the inode object would need to be synchronized against the fsnotify hooks' checks for marks and event masks to guarantee the delivery of future permission events before HSM starts to evict the file content.

When HSM is using permission events in the ignore mask of inode marks to suppress permission events on objects with populated content, there is always a mount or filesystem mark with permission events in its mask. This guarantees that the fsnotify hooks' lockless check of the sb/mount/inode event masks will not drop the event before taking the srcu_read_lock() to iterate over the inode marks.

The synchronization is thus easier to achieve in this mode, without paying a performance penalty in the likely case that no fsnotify marks are interested in permission events at all.

Once HSM has guaranteed the delivery of future permission events, and before it starts evicting file content, HSM should release and re-acquire the write lease on the file to make sure that there are no pending write lease breakers.

If any process opens the file while HSM is holding the write lease, HSM will receive a signal to break the write lease. If HSM receives a signal to break the write lease or receives permission event on the file before starting to evict file content, HSM should abort file content eviction.

If HSM receives a permission event on the file after it has started or finished evicting file content, HSM will perform the operations to populate the file content.

Reference implementation

To demonstrate how all the pieces of an HSM implementation using fanotify API come together, a reference implementation was created based on the open source code of HTTPDirFS.

The code of the reference implementation is available on this HTTPDirFS POC branch and the kernel patches that implement the fanotify HSM APIs are available on this fanotify POC branch.

Direct access to local filesystem

HTTPDirFS can be seen as a special case of HSM, which knows how to migrate content from a website to a local filesystem cache of that website.

HTTPDirFS implements a FUSE overlaid filesystem that exposes access-safe filesystem objects, whose content is migrated to the local filesystem on first access.

The reference implementation demonstrates the use of fanotify HSM APIs for migrating object content to the local filesystem on direct access to the local filesystem, without the need for an overlaid filesystem. This mode of operation is called fanotify mode.

fanotify mode is invoked with the httpdirfs command line argument --fanotify. fanotify mode requires a kernel with the fanotify HSM API patches.

HTTPDirFS fanotify mode uses a bind mount to expose the filesystem objects to user access and sets an fanotify mount mark with permission events on the mount before moving it into place.

Cache state management

HTTPDirFS manages the cache state of objects - which parts of the object content are cached in the local filesystem - in a metadata registry.

An object can be in one of three states:

  • Uninitialized (deny open)
  • Incomplete content (access events)
  • Populated content (no events)

When a file is created in the local filesystem, its cache state is Uninitialized and any open syscall is denied.

HTTPDirFS creates the cache entries lazily on first access to an unpopulated directory. After the local sparse file and its metadata registry entry are created, an evictable inode mark is set to ignore open permission events, allowing the local file to be opened.

On first user access to an object with incomplete content, HTTPDirFS starts downloading the object content over HTTP to the local file and updates the metadata registry with the downloaded parts.

On user access to an object with fully cached content, HTTPDirFS sets an evictable inode mark with an ignore mask for access permission events, suppressing future permission events on access to that object.

When persistent inode marks are supported, HTTPDirFS uses persistent inode marks instead of evictable inode marks, so permission events will not be delivered on access to an object with fully cached content even after restarting HTTPDirFS.
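
The three states and their transitions can be modelled as a small state machine. This is a simplified sketch, not the HTTPDirFS code: CacheEntry is a hypothetical name, and the cached content is assumed to be a single prefix of the file, whereas the real metadata registry tracks downloaded parts.

```python
from enum import Enum


class CacheState(Enum):
    UNINITIALIZED = "deny open"
    INCOMPLETE = "access events"
    POPULATED = "no events"


class CacheEntry:
    def __init__(self, size):
        self.size = size
        self.cached = 0  # assumption: cached content is a prefix of the file
        self.state = CacheState.UNINITIALIZED

    def initialize(self):
        """Sparse file and registry entry exist: ignore open permission events."""
        self.state = CacheState.INCOMPLETE

    def open_allowed(self):
        return self.state is not CacheState.UNINITIALIZED

    def access(self, end):
        """Handle an access up to byte offset end; return bytes downloaded."""
        if self.state is CacheState.UNINITIALIZED:
            raise PermissionError("open is denied in this state")
        if self.state is CacheState.POPULATED:
            return 0  # no permission event is delivered in this state
        if self.cached >= self.size:
            # Fully cached: set the ignore mask for access events.
            self.state = CacheState.POPULATED
            return 0
        end = min(end, self.size)
        downloaded = max(0, end - self.cached)
        self.cached += downloaded
        return downloaded
```

As in the description above, the switch to the Populated state happens on an access event that finds the content already fully cached, not at the moment the last byte is downloaded.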

Invalidating local cache

HTTPDirFS has a user command, evict, that can be used to evict the cache of a fully or partly cached local file.

The command runs the following sequence to synchronize cache evict with cache populate by the fanotify mode service:

  1. Acquire exclusive write lease (no existing fds nor in-progress opens)
  2. Remove ignore mask (deny completion of future opens)
  3. Reacquire write lease (no pending lease breaks from in-progress opens)
  4. Unlink the metadata registry entry
  5. Punch out the content of local file
  6. Restore inode mark with ignore mask to allow open (back to Incomplete state)

When terminated by the write lease break async signal or another handled signal, the command restores the ignore mask to allow future opens, but the lease breaker's open call will still be denied.

The command could also be killed by an unhandled signal leaving the object in the Uninitialized state. In that case, the object will remain inaccessible until the command is run again to reset the cache to Incomplete state.

Accessing file content requires a process with an open fd, but when eviction starts there are no other processes with open fds and no other processes that started an open before content eviction started. Any process trying to open the file after eviction has started will be denied by the fanotify mode mount watch until content eviction is completed.

In any case, as long as all users only access files through the watched mount, they cannot access the content of a file while the evict command is evicting content.
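
The ordering of the six-step sequence above, including the abort path on a lease break, can be sketched as a toy model. ToyObject, LeaseBroken and the acquire_lease callback are hypothetical stand-ins for the real lease, fanotify mark and registry operations; nothing here touches the kernel.

```python
class LeaseBroken(Exception):
    """Raised by the lease helper when the write lease cannot be held."""


class ToyObject:
    def __init__(self, content):
        self.content = content
        self.registry_entry = {"cached": len(content)}
        self.ignore_open = True  # ignore mask present: opens are allowed
        self.steps = []


def evict(obj, acquire_lease):
    """Run the evict sequence; return True if the content was evicted."""
    try:
        acquire_lease()               # 1. no existing fds, no opens in progress
    except LeaseBroken:
        return False                  # nothing was changed yet
    obj.steps.append("lease")
    obj.ignore_open = False           # 2. future opens cannot complete
    obj.steps.append("deny-open")
    try:
        acquire_lease()               # 3. no pending lease breaks
    except LeaseBroken:
        obj.ignore_open = True        # abort: allow future opens again
        obj.steps.append("abort")
        return False
    obj.steps.append("re-lease")
    obj.registry_entry = None         # 4. unlink the registry entry
    obj.steps.append("unlink-entry")
    obj.content = b""                 # 5. punch out the local content
    obj.steps.append("punch")
    obj.ignore_open = True            # 6. opens are allowed again
    obj.steps.append("allow-open")
    return True
```

The point of the model is the ordering: the content is only touched after opens have been blocked and the lease has been re-acquired, and an abort before that point leaves the content intact.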

Tracking local modifications

HTTPDirFS is a read-only local cache of a read-only website, but an HSM will typically need to migrate the local file to another storage tier before the file can be evicted.

To demonstrate this use case, HTTPDirFS fanotify mode maintains the dirty state for local cached objects.

In the general case, local objects could also be created, deleted or renamed which can be considered as a dirty state for the local directory object.

The reference implementation does not support dirty directory state, so the mount mark watches create, delete and rename permission events and denies them all.

The mount mark also watches pre-modify permission events and denies them for files with Incomplete content.

TODO:

Modify is allowed only for fully cached files and HTTPDirFS writes a dirty record in the metadata registry before allowing the file modification.

The evict command refuses to evict the content of a file with a dirty record.

HTTPDirFS fanotify mode uses dedicated fanotify groups to subscribe to pre-modify permission events. These groups may be set up and torn down periodically, but there is always at least one group handling pre-modify permission events.

Each such group records the changes since the epoch of period T, and the dirty records written by that group carry the period identifier T.

After writing a dirty record, an evictable mark with ignore mask is set up to ignore future pre-modify permission events to that group.

HTTPDirFS fanotify mode provides a method to iterate files that were marked dirty during a specific time period:

  • There can be several dirty records per file/directory, each representing a different time period
  • For each file marked dirty from period T, all its ancestor dirs are also marked dirty from period T
  • When the metadata registry has a dirty record from period T for the root directory, the metadata tree can be traversed following only children files/directories that have a dirty record from period T
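
The recursive dirty records can be sketched with an in-memory registry. DirtyRegistry is a hypothetical name and the dictionaries stand in for the on-disk metadata registry; the sketch only shows why marking ancestors makes the per-period traversal cheap.

```python
import posixpath
from collections import defaultdict


class DirtyRegistry:
    def __init__(self):
        self.records = defaultdict(set)   # path -> set of period ids
        self.children = defaultdict(set)  # directory path -> child paths

    def mark_dirty(self, path, period):
        """Record path and all of its ancestor directories as dirty."""
        while True:
            self.records[path].add(period)
            if path == "/":
                break
            parent = posixpath.dirname(path)
            self.children[parent].add(path)
            path = parent

    def iter_dirty(self, period, path="/"):
        """Yield paths dirty in period, descending only into dirty children."""
        if period not in self.records.get(path, ()):
            return
        yield path
        for child in sorted(self.children.get(path, ())):
            yield from self.iter_dirty(period, child)
```

For example, after mark_dirty("/a/b.txt", 1) and mark_dirty("/c/d.txt", 2), iterating period 1 visits only "/", "/a" and "/a/b.txt" and never descends into "/c".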

Modified files query

HSM may have different policies for writing modified file content to a slower storage tier. Modified file content could be uploaded shortly after the modification, queued for periodic batch uploads, uploaded on-demand before evicting content, or a combination of all those policies.

TODO:

The reference implementation does not upload any files at all. Instead, it implements a query to get a list of the files modified since a checkpoint in time.

Similar to the git fsmonitor hook v2 interface, the query takes a since checkpoint input argument and reports a new upto checkpoint along with the query results.

When that query is run periodically, for subsequent checkpoints, it is guaranteed that:

  1. If a file was modified, it will show up in at least one of the query results
  2. If no modification to a file has started since the last query started, the next query result will not list the file as modified

The recursive dirty records and per-period dirty records are used to guarantee forward progress towards an eventually clean state of all of the files in a race-free and crash-safe manner.

When querying modifications since the latest checkpoint T, a new checkpoint is created using the following sequence:

  1. A new group with id T+1 subscribes to pre-modify events and starts recording dirty records since T+1
  2. The FAN_MARK_SYNC command is executed to wait for in-progress modifications that have already been recorded
  3. The filesystem syncfs() command is executed to submit I/O and wait for dirty data that has already been recorded
  4. The group with id T unsubscribes from pre-modify events and stops recording dirty records since T
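
The overlap between group T and group T+1 in the four steps above is what prevents a modification from being missed. A toy model makes this visible. ModificationTracker and the during callback are hypothetical; steps 2 and 3 are no-ops here because the model has no in-flight I/O, and only queries since the latest checkpoint are supported.

```python
class ModificationTracker:
    def __init__(self):
        self.latest = 0
        self.active = {0}  # ids of groups subscribed to pre-modify events
        self.dirty = {}    # path -> set of period ids

    def pre_modify(self, path):
        """Every active group writes a dirty record for its own period."""
        for period in self.active:
            self.dirty.setdefault(path, set()).add(period)

    def query(self, since, during=None):
        """Return (paths modified since the checkpoint, new checkpoint)."""
        if since != self.latest:
            raise ValueError("sketch only supports the latest checkpoint")
        upto = since + 1
        self.active.add(upto)       # 1. group T+1 starts recording
        if during is not None:
            during()                # modifications racing with the query
        # 2. FAN_MARK_SYNC and 3. syncfs() would wait here in a real HSM.
        self.active.discard(since)  # 4. group T stops recording
        self.latest = upto
        modified = sorted(path for path, periods in self.dirty.items()
                          if since in periods)
        return modified, upto
```

A modification that races with the query is recorded by both groups, so it may be reported twice, which is allowed by the first guarantee. A file with no new modifications drops out of the results after that, as the second guarantee requires.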