
Request for comments: ZenFS, a RocksDB FileSystem for Zoned Block Devices #6961

Closed

Conversation

yhr (Contributor) commented Jun 10, 2020

This is a request for comments for a new file system for zoned block devices.

The pull request is based on top of another open pull request, #6878, which enables db_bench and db_stress to be used with custom file systems.

With this pull request I am mainly asking for comments on the high-level architecture and feedback on the following:

What applicable testing is available?
So far I have mainly run smoke testing using db_bench and db_stress, and I am looking for ways to do e.g. recovery/crash/power-fail testing.

Would a completely self-contained file system be preferable?
Currently ZenFS stores log and lock files on the default file system. The reason for this is to avoid duplicating already-working code and to keep the log files easily accessible.

What kind of workloads are most interesting to optimize for?

Looking forward to feedback as I finish up my laundry list of todos and optimizations.

Thanks!

Overview

ZenFS is a simple file system that utilizes RocksDB's FileSystem interface to place files into zones on a raw zoned block device. By separating files into zones and utilizing write life time hints to co-locate data of similar life times, the system write amplification is greatly reduced (compared to conventional block devices) while keeping the ZenFS capacity overhead at a very reasonable level.
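To illustrate the co-location idea, here is a minimal sketch of a zone allocator that prefers zones already holding data with the same lifetime hint. The types and names below are made up for the example and are not the actual ZenFS code:

#include <cstdint>
#include <vector>

// Hypothetical lifetime classes, mirroring the idea of write life time hints.
enum class LifetimeHint { kNone, kShort, kMedium, kLong, kExtreme };

struct Zone {
  uint64_t capacity_left = 0;  // writable bytes remaining in the zone
  bool has_data = false;       // zone already holds file data
  LifetimeHint lifetime = LifetimeHint::kNone;  // hint of the data in the zone
};

// Pick a zone for new file data: prefer a zone whose existing data carries the
// same lifetime hint, so data that is expected to die together is placed
// together and whole zones can later be reset without relocating live data.
Zone* PickZone(std::vector<Zone>& zones, LifetimeHint hint, uint64_t needed) {
  Zone* empty = nullptr;
  for (Zone& z : zones) {
    if (z.capacity_left < needed) continue;
    if (z.has_data && z.lifetime == hint) return &z;  // best match
    if (!z.has_data && empty == nullptr) empty = &z;  // fallback: fresh zone
  }
  return empty;  // nullptr if no suitable zone is available
}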

ZenFS is designed to work with host-managed zoned spinning disks as well as NVMe SSDs with Zoned Namespaces.

Some of the ideas and concepts in ZenFS are based on earlier work done by Abutalib Aghayev and Marc Acosta.

Dependencies

ZenFS depends on libzbd and Linux kernel 5.4 or later to perform zone management operations.
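As a quick sanity check before pointing ZenFS at a device (not part of ZenFS itself, just a generic sysfs lookup), the zone model reported by the kernel can be read from sysfs; a sketch:

#include <fstream>
#include <iostream>
#include <string>

// Read the zone model the kernel reports for a block device. Possible values
// are "none", "host-aware" and "host-managed"; ZenFS targets host-managed
// devices (and memory-backed zoned null block devices for testing).
int main(int argc, char** argv) {
  std::string dev = (argc > 1) ? argv[1] : "nullb1";
  std::ifstream f("/sys/class/block/" + dev + "/queue/zoned");
  std::string model;
  if (!(f >> model)) {
    std::cerr << "cannot read zone model for " << dev << "\n";
    return 1;
  }
  std::cout << dev << ": " << model << "\n";
  return (model == "host-managed") ? 0 : 1;
}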

Architecture overview

(Figure: zenfs stack)

ZenFS implements the FileSystem API and stores all data files on a raw zoned block device. Log and lock files are stored on the default file system under a configurable directory. Zone management is done through libzbd, and ZenFS I/O is done through normal pread/pwrite calls.

Optimizing the IO path is on the TODO list.
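For illustration, here is a sketch of what the current, unoptimized read path amounts to: a plain positioned read on the raw device file descriptor at the extent's byte offset. This is not the actual ZenFS code, just the general shape of it:

#include <cerrno>
#include <cstdint>
#include <unistd.h>

// Read `len` bytes of file data stored at byte offset `dev_offset` on the raw
// zoned block device. `dev_fd` is an open descriptor for e.g. /dev/nullb1.
// Returns 0 on success, -errno on failure.
int ZonedRead(int dev_fd, uint64_t dev_offset, char* buf, size_t len) {
  while (len > 0) {
    ssize_t r = pread(dev_fd, buf, len, static_cast<off_t>(dev_offset));
    if (r < 0) {
      if (errno == EINTR) continue;  // retry interrupted reads
      return -errno;
    }
    if (r == 0) return -EIO;  // unexpected end of device
    buf += r;
    len -= static_cast<size_t>(r);
    dev_offset += static_cast<uint64_t>(r);
  }
  return 0;
}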

Example usage

This example issues 100 million random inserts followed by as many overwrites on a 100G memory-backed zoned null block device. Target file sizes are set up to align with the zone size.

make db_bench zenfs

DEV=nullb1
ZONE_SZ_SECS=$(cat /sys/class/block/$DEV/queue/chunk_sectors)
FUZZ=5
ZONE_CAP=$((ZONE_SZ_SECS * 512))
BASE_FZ=$(($ZONE_CAP  * (100 - $FUZZ) / 100))
WB_SIZE=$(($BASE_FZ * 2))

TARGET_FZ_BASE=$WB_SIZE
TARGET_FILE_SIZE_MULTIPLIER=2
MAX_BYTES_FOR_LEVEL_BASE=$((2 * $TARGET_FZ_BASE))

./zenfs mkfs --zbd=/dev/$DEV --aux_path=/tmp/zenfs_$DEV --finish_threshold=$FUZZ --force

./db_bench --fs_uri=zenfs://$DEV --key_size=16 --value_size=800 --target_file_size_base=$TARGET_FZ_BASE --write_buffer_size=$WB_SIZE --max_bytes_for_level_base=$MAX_BYTES_FOR_LEVEL_BASE --max_bytes_for_level_multiplier=4 --use_direct_io_for_flush_and_compaction --max_background_jobs=$(nproc) --num=100000000 --benchmarks=fillrandom,overwrite

The graph below shows the capacity usage over time.
As ZenFS does not do any garbage collection, the write amplification is 1.

(Figure: zenfs capacity usage over time)

File system implementation

Files are mapped into a set of extents (a minimal sketch of this mapping follows the list below):

  • Extents are block-aligned, contiguous regions on the block device
  • Extents do not span across zones
  • A zone may contain more than one extent
  • Extents from different files may share zones
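
A minimal sketch of such a mapping (illustrative types only, not the actual ZenFS data structures):

#include <cstdint>
#include <vector>

struct Zone;  // zone bookkeeping (write pointer, remaining capacity, ...)

// A block-aligned, contiguous region on the device, fully contained in one zone.
struct Extent {
  uint64_t dev_offset;  // byte offset on the block device
  uint64_t length;      // extent length in bytes
  Zone* zone;           // owning zone; several extents may share a zone
};

// A file is an ordered list of extents; its size is the sum of extent lengths.
struct ZoneFile {
  std::vector<Extent> extents;

  // Translate a logical file offset into (extent, offset within that extent).
  // Returns nullptr if the offset is past the end of the file.
  const Extent* Lookup(uint64_t file_offset, uint64_t* extent_offset) const {
    for (const Extent& e : extents) {
      if (file_offset < e.length) {
        *extent_offset = file_offset;
        return &e;
      }
      file_offset -= e.length;
    }
    return nullptr;
  }
};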

Reclaim

ZenFS is exceptionally lazy in its current state of implementation and does not do any garbage collection whatsoever. As files get deleted, the zones' used-capacity counters drop, and when a counter reaches zero the zone can be reset and reused.
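
A sketch of this reclaim rule (illustrative only; the actual zone reset goes through libzbd and is not shown here):

#include <cstdint>

struct Zone {
  uint64_t used_capacity = 0;  // bytes of live (not yet deleted) file data
};

// Called for each extent of a file when that file is deleted. When the last
// live byte in a zone goes away, the zone can be reset and reused.
bool OnExtentDeleted(Zone& zone, uint64_t extent_length) {
  zone.used_capacity -= extent_length;
  if (zone.used_capacity == 0) {
    // Reset the zone on the device here (via libzbd) and mark it reusable.
    return true;
  }
  return false;
}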

Metadata

Metadata is stored in a rolling log in the first zones of the block device.

Each valid metadata zone contains:

  • A superblock with the current sequence number and global file system metadata
  • At least one snapshot of all files in the file system

The metadata format is currently experimental. More extensive testing is needed and support for differential updates is planned to be implemented before bumping up the version to 1.0.
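
The description above implies a simple recovery rule: pick the metadata zone whose superblock carries the highest sequence number and restore the file set from its snapshot. A sketch of that rule, with illustrative structures only since the on-disk format is experimental and not specified here:

#include <cstdint>
#include <vector>

struct Superblock {
  uint64_t sequence;  // bumped every time the metadata log rolls to a new zone
  // ... global file system metadata ...
};

struct MetaZone {
  bool valid = false;      // has a readable superblock and at least one snapshot
  Superblock superblock{};
};

// Recovery: the valid metadata zone with the highest sequence number holds the
// current state (a snapshot of all files, possibly followed by updates).
const MetaZone* FindCurrentMetaZone(const std::vector<MetaZone>& meta_zones) {
  const MetaZone* current = nullptr;
  for (const MetaZone& z : meta_zones) {
    if (!z.valid) continue;
    if (!current || z.superblock.sequence > current->superblock.sequence) {
      current = &z;
    }
  }
  return current;  // nullptr means no valid metadata was found
}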

yhr added 6 commits June 4, 2020 15:55
Summary:
If a composite env is created with Env::Default, there is no way
to guarantee that all threads have been joined by the time the
composite env is destroyed.

If threads that have been started by the composite env have a
reference back to the composite env we risk a segfault since the
composite environment might be destroyed before all the threads
have been joined upon the destruction of Env::Default.

I.e., in case of an exit() in the user program the base env
must be destroyed, joining all running threads, before the
composite env can be destroyed.

So, to ensure that destruction can be done in the right order,
add a new version of NewCompositeEnv, which lets the user order a
base environment and its composite environment(s) as statics
within its own compilation unit.

The NewCompositeEnv requires changes to the default posix env
constructor and destructor to allow for more than one global,
static base environment.
Summary:
Add the parameter --fs_uri to db_bench, creating a composite
env combining the default env with a specified registered
rocksdb file system.
Summary:
Add the parameter --fs_uri to db_stress, creating a composite
env combining the default env with a specified registered
rocksdb file system.
Summary:
Register the posix file system so it can be loaded with the uri posix://

This is useful when testing custom file system support in e.g.
db_bench and db_stress.
Summary:

Host-managed zoned block devices enable an application to
do smart data placement by making informed decisions
on how to place data into the storage media's erase units.
This improves system write amplification and/or enables
greater media capacity utilization.

ZenFS is a simple file system that implements RocksDB's
FileSystem interface to place files into zones on a raw
zoned block device. By separating files into zones
and utilizing the write life time hints to co-locate
data of similar life times, the system write amplification
is greatly reduced while keeping the ZenFS capacity
overhead at a very reasonable level.

ZenFS depends on libzbd and Linux kernel 5.4 or later to do
zone management operations.

Some of the ideas and concepts in ZenFS are based on earlier
work done by Abutalib Aghayev and Marc Acosta.

Files are mapped into a set of extents:

* Extents are block-aligned, contiguous regions on the block device
* Extents do not span across zones
* A zone may contain more than one extent
* Extents from different files may share zones

Log files and LOCK files are routed to a configurable
path in the default file system.

ZenFS is exceptionally lazy in its current state of implementation
and does not do any garbage collection whatsoever. As files
get deleted, the zones' used-capacity counters drop, and when
a counter reaches zero the zone can be reset and reused.

Metadata is stored in a rolling log in the first zones of the block device.

Each valid meta data zone contains:

* A superblock with the current sequence number and global file system metadata
* At least one snapshot of all files in the file system

The metadata format is currently experimental. More extensive
testing is needed and support for differential updates is planned
to be implemented before bumping up the version to 1.0.
Summary:

This adds a file system management tool for ZenFS. It is required
for setting up the metadata for the filesystem.

It also allows the user to list files on the file system and
can be extended in the future to do other management tasks
like offline garbage collection, backups etc.

Examples:

1. Create a zenfs file system on zoned block device /dev/nullb1 with
auxiliary file storage (logfiles, lockfiles) under /tmp/zenfs_nullb1 and
with a finishing threshold for zones of 5%. If the zone has less than
finish_threshold capacity left, no additional extents will be mapped
to the zone.

./zenfs mkfs --zbd=/dev/nullb1 --aux_path=/tmp/zenfs_nullb1 --finish_threshold=5
ZenFS file system created. Free space: 31744 MB

2. List files:

./zenfs list --zbd=/dev/nullb1 --path=/rocksdbtest/dbbench
/rocksdbtest/dbbench/LOG
/rocksdbtest/dbbench/LOCK
/rocksdbtest/dbbench/000003.log
/rocksdbtest/dbbench/CURRENT
/rocksdbtest/dbbench/IDENTITY
/rocksdbtest/dbbench/MANIFEST-000001
/rocksdbtest/dbbench/OPTIONS-000005
zhichao-cao (Contributor) commented

@yhr Thanks for the work, very interesting. Is this the implementation of the work you introduced on SDC last year? (https://www.snia.org/sites/default/files/SDC/2019/presentations/NVMe/Holmberg_Hans_Accelerating_RocksDB_with_Zoned_Namespaces.pdf)

yhr (Contributor, Author) commented Jun 11, 2020

@zhichao-cao , yes it is a continuation of that work. Thanks for linking to the presentation, it's useful for understanding the big picture.

qihui81 commented Jul 16, 2020

What if a zone on the device turns read-only or goes offline? How is the io_zones state in memory kept in sync?

yhr (Contributor, Author) commented Jul 16, 2020

@qihui81, devices with zone active excursions (which may transition e.g. open zones to read-only) are not supported by Linux. The ZenFS code checks the state of a zone after it is reset through the report zones ioctl, so zones going offline are handled at that point.
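
For illustration, a post-reset check of this kind could use the kernel's zone report ioctl from linux/blkzoned.h directly (ZenFS itself goes through libzbd; this sketch is not the actual ZenFS code):

#include <sys/ioctl.h>
#include <linux/blkzoned.h>

// After resetting a zone, query its condition through the kernel zone report
// ioctl and skip zones that came back read-only or offline. `sector` is the
// zone start in 512-byte sectors; returns true if the zone is usable.
bool ZoneIsUsable(int dev_fd, __u64 sector) {
  // One report header followed by a single zone descriptor.
  alignas(struct blk_zone_report) char buf[sizeof(struct blk_zone_report) +
                                           sizeof(struct blk_zone)] = {};
  auto* report = reinterpret_cast<struct blk_zone_report*>(buf);
  report->sector = sector;
  report->nr_zones = 1;
  if (ioctl(dev_fd, BLKREPORTZONE, report) < 0 || report->nr_zones < 1) {
    return false;  // could not query the zone state
  }
  const __u8 cond = report->zones[0].cond;
  return cond != BLK_ZONE_COND_OFFLINE && cond != BLK_ZONE_COND_READONLY;
}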

yhr (Contributor, Author) commented Nov 2, 2020

Closing this RFC, as I've created a new pull request to pull in the code: #7626

yhr closed this Nov 2, 2020