
V3 idea: Drop support for offset reads & writes. Decompose large CAS objects into shallow Merkle trees #178

Open
EdSchouten opened this issue Oct 16, 2020 · 11 comments · May be fixed by #233

EdSchouten commented Oct 16, 2020

The size of objects stored in the CAS has a pretty large spread. We can have tiny Directory objects that are less than 100 bytes in size, but we can also have output files that are gigabytes in size. Dealing with these large files is annoying:

  • When using the digest as a key for sharding the CAS across multiple backends, you notice that you get an almost perfect distribution in terms of request count. Below is an example read count, 1h average, of a setup that has eight shards:

[Screenshot: read request count per shard (eight shards), 1h average]

But if you take a look at the amount of data read in bytes, again a 1h average, it's a lot less balanced. We sometimes see the busiest shard receive 50% more traffic than the least busy one. And that's measured across a full hour.

[Screenshot: bytes read per shard (eight shards), 1h average]

  • Many storage systems out there put limits on the maximum size of an object: Redis can only store 512 MB objects, Cassandra has a 2 GB limit on column values, and Ceph's RADOS has a maximum object size of 128 MB. (Buildbarn's LocalBlobAccess also has certain limits.) It is possible to decompose objects into chunks on the storage infrastructure side, but that removes the ability to perform data integrity checking at the chunk level, as multiple chunks need to be concatenated to perform checksum validation. When chunks are also spread across shards, an individual shard loses the ability to validate its own data entirely.
  • Connection drops may occur while uploading files. When this happens, there are two options: A) the client re-uploads the file from the beginning, or B) the client somehow requests that the upload continue from the last received position. The latter requires the server to hold on to partial uploads and negotiate with the client at which offset the upload needs to resume. That isn't fun to implement.
  • The current approach doesn't make it possible to do validated reads of chunks of a file. If workers implement lazy loading of file contents (e.g., using a worker FUSE file system), it may be preferable to not read large files as a whole, but only read the parts that are actually used by build actions. This can't be done in a validated way if the only thing you have is the SHA-sum of the file as a whole.
  • Relatedly, in Add support for transferring compressed blobs via ByteStream #168 we're discussing how support for compressing CAS contents can be added to REv2. As we've noticed, the need for supporting partial uploads & downloads complicates that a lot.

My suggestion is that REv3 drops support for offset reads & writes entirely. Instead, we should model large files as shallow (1-level deep) Merkle trees of chunks with some fixed size. Instead of sending a single ByteStream request to download such a large file, a client must (or may?) first read a manifest object from the CAS containing a list of digests of chunks whose contents need to be concatenated. When uploading a file, FindMissingBlobs() should be called against the digest of the manifest object and each of the digests of the chunks. This allows a client to skip uploading parts that are already in the CAS, which both speeds up resumption of partially uploaded files and adds (a rough version of) deduplication of identical regions across multiple files. Because large objects don't occur very frequently (read: almost all objects tend to be small), this indirection doesn't really impact performance for most workloads. Compression can be added by simply compressing individual chunks.
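To make the proposed flow concrete, below is a rough sketch of the client-side upload path (in Go; Manifest, Digest, and CAS are hypothetical stand-ins with an assumed chunk size and SHA-256 digests, not types from REv2 or any existing implementation):

```go
// Rough sketch only: hypothetical client-side decomposition for upload.
// Manifest, Digest, and CAS are illustrative stand-ins, not REv2/REv3 types.
package casdecompose

import (
	"crypto/sha256"
	"encoding/hex"
)

const chunkSize = 2 * 1024 * 1024 // assumed fixed decomposition size (2 MiB)

type Digest struct {
	Hash      string
	SizeBytes int64
}

// Manifest is the hypothetical manifest object: an ordered list of chunk
// digests whose contents, when concatenated, reproduce the original file.
type Manifest struct {
	Chunks []Digest
}

// CAS abstracts the two operations the proposal relies on:
// FindMissingBlobs() and uploading individual blobs.
type CAS interface {
	FindMissing(digests []Digest) []Digest
	Put(d Digest, data []byte)
}

func digestOf(data []byte) Digest {
	h := sha256.Sum256(data)
	return Digest{Hash: hex.EncodeToString(h[:]), SizeBytes: int64(len(data))}
}

// UploadDecomposed splits data into fixed-size chunks, asks the CAS which
// chunks are missing, and uploads only those. The returned manifest would
// then be serialized and uploaded as a regular (small) CAS object.
func UploadDecomposed(cas CAS, data []byte) Manifest {
	var m Manifest
	chunks := map[Digest][]byte{}
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		d := digestOf(chunk)
		m.Chunks = append(m.Chunks, d)
		chunks[d] = chunk
	}
	// Only chunks the CAS does not already have are transferred, which gives
	// both upload resumption and cross-file deduplication of identical regions.
	for _, d := range cas.FindMissing(m.Chunks) {
		cas.Put(d, chunks[d])
	}
	return m
}
```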

Earlier this year I experimented with this concept for Buildbarn (ADR#3), where the decomposition size can be any power of 2 that is 2 KiB or larger. It used a slightly simplified version of BLAKE3 to hash objects. Unfortunately, I never fully completed it or merged it into master, and it didn't take the desire to do compression into account. I think that both decomposition and compression should be considered, and they likely cannot be discussed separately. My hope is that for REv3 we find the time to solve this properly.

ulfjack commented Oct 16, 2020

You can do that today in the server, as long as you can support read and write offsets. I'm not sure about exposing this complexity to the client - it's not required in most cases, but it makes every client significantly more complex.

EdSchouten commented Oct 16, 2020

I'm not sure about exposing this complexity to the client - it's not required in most cases, but it makes every client significantly more complex.

But there are cases in which it is desired, and stating that it isn't required in most cases is not helping. Any protocol out there contains features that aren't needed in most cases, yet they aren't removed because people depend on them. If we removed all features from HTTP that aren't needed in most cases, we'd have Gopher.

I want to make use of a feature similar to Builds without the Bytes, except that I want a FUSE file system on the client side that still allows me to access any output files on demand. Some of these output files may be large. Right now reads against these files will hang until they have been downloaded over the ByteStream protocol entirely, which makes ad hoc exploration hard. This is because there is no way to do partial file validation.
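As an illustration of why the manifest helps here, the sketch below shows how a client-side FUSE layer could serve a partial read with per-chunk validation (again in Go, using the same hypothetical Manifest/Digest shapes as the earlier sketch; fetch stands in for a CAS read of a single chunk):

```go
// Rough sketch only: validated partial reads of a decomposed file, as a FUSE
// backend might implement them. Manifest/Digest are the same hypothetical
// types as in the earlier sketch; fetch stands in for a CAS read of one chunk.
package partialread

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const chunkSize = 2 * 1024 * 1024 // must match the size used when decomposing

type Digest struct {
	Hash      string
	SizeBytes int64
}

type Manifest struct {
	Chunks []Digest
}

// ReadAt serves a read of len(p) bytes at the given offset. Only the chunks
// overlapping the requested range are fetched, and each chunk is validated
// against its digest from the manifest before its bytes are returned.
func ReadAt(m Manifest, fetch func(Digest) ([]byte, error), p []byte, offset int64) (int, error) {
	n := 0
	for n < len(p) {
		pos := offset + int64(n)
		idx := int(pos / chunkSize)
		if idx >= len(m.Chunks) {
			break // read past end of file
		}
		want := m.Chunks[idx]
		chunk, err := fetch(want)
		if err != nil {
			return n, err
		}
		sum := sha256.Sum256(chunk)
		if hex.EncodeToString(sum[:]) != want.Hash {
			return n, fmt.Errorf("chunk %d failed digest validation", idx)
		}
		within := int(pos % chunkSize)
		if within >= len(chunk) {
			break // short final chunk: nothing left to read
		}
		n += copy(p[n:], chunk[within:])
	}
	return n, nil
}
```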

@glukasiknuro

I have collected some initial data to check whether the deduplication enabled by chunking would help reduce the size of the data. This focuses on data residing in the cache; checking in-flight data may be a next step.

Basic results:

  • zstd compression by itself reduces the cache size by close to 5x
  • adding chunking at 1 MB boundaries on top of compression yields a slight but not significant further improvement
  • content-defined chunking (FastCDC using Gear; see the sketch after the stats below) plus compression reduces the cache size by close to 7x

The stats below show the contents of a bazel-remote cache CAS directory when entries are chunked, only unique chunks are preserved, and all chunks are compressed using zstd. It would be nice to get some additional results, as these numbers may not be representative for other workloads.

The tool itself with more details is here: https://github.com/glukasiknuro/bazel-cache-chunking

CC @mostynb

Processed successfully 468997 files, total size: 1.7 TB

No chunking ZSTD CGO SPEED  size: 343 GB (avg: 732 kB), 20.53% of total  (4.87x effective space)  chunks: 468997    duplicate: 0
        1MB ZSTD CGO SPEED  size: 323 GB (avg: 688 kB), 19.29% of total  (5.18x effective space)  chunks: 1888847   duplicate: 134986
       Gear ZSTD CGO SPEED  size: 244 GB (avg: 519 kB), 14.55% of total  (6.87x effective space)  chunks: 1171361   duplicate: 440797
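For readers unfamiliar with content-defined chunking, here is a rough sketch of a Gear-style rolling-hash chunker, the building block FastCDC refines (in Go; the seed, mask, and size limits are illustrative values only, and the linked tool above contains the actual implementation behind these numbers):

```go
// Rough sketch only: a minimal Gear-style rolling-hash chunker to illustrate
// how content-defined boundaries are found. Parameters are illustrative.
package cdc

import "math/rand"

// gearTable is a fixed table of 256 random 64-bit values. Using a fixed seed
// makes chunk boundaries deterministic across client and server.
var gearTable = func() [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

const (
	minChunk = 512 * 1024      // do not cut before this many bytes
	maxChunk = 4 * 1024 * 1024 // force a cut at this size
	boundary = (1 << 20) - 1   // mask: expected boundary roughly every 1 MiB past minChunk
)

// NextBoundary returns the length of the next chunk at the start of data.
// A boundary is declared when the low bits of the rolling Gear hash are all
// zero, so inserting or deleting bytes only shifts boundaries locally.
func NextBoundary(data []byte) int {
	if len(data) <= minChunk {
		return len(data)
	}
	var h uint64
	end := len(data)
	if end > maxChunk {
		end = maxChunk
	}
	for i := 0; i < end; i++ {
		h = (h << 1) + gearTable[data[i]]
		if i >= minChunk && h&boundary == 0 {
			return i + 1
		}
	}
	return end
}

// Split cuts data into content-defined chunks by repeatedly taking the next
// boundary.
func Split(data []byte) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := NextBoundary(data)
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}
```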

@EdSchouten

Nice! Thanks for doing these measurements.

Important to mention that deduplication of identical chunks is not one of the main reasons I created this proposal. It is all about ensuring that large blobs simply don’t exist. Large blobs are annoying to work with, especially if you want partial access, resumption of transfers, compression, etc. while still having guaranteed data integrity.

mostynb commented Mar 12, 2021

IIRC @Erikma and @johnterickson reported that transferring large files in parallel using chunks was significantly faster with azure blob storage than a single upload/download stream (and I imagine S3 and GCS would be similar).

@glukasiknuro

Also, from a somewhat related item on the Bazel side: remote/performance: support rsync compression

In early experiments we are seeing up to 90% reduction in required download bandwidth.

The result above may be from before Builds without the Bytes was introduced, though.

@glukasiknuro

IIRC @Erikma and @johnterickson reported that transferring large files in parallel using chunks was significantly faster with azure blob storage than a single upload/download stream (and I imagine S3 and GCS would be similar).

For GCS we are using composite uploads and parallel sliced downloads, which greatly improve speed for gigabyte-sized files (an article mentioning this: https://jbrojbrojbro.medium.com/slice-up-your-life-large-file-download-optimization-50ee623b708c).

@sluongng

Regarding breaking down a large blob into multiple smaller chunks: here is some related art being applied by cloud vendors to distribute container images:

A quick summary:

  1. Large blobs here are broken down into either chunks or blocks.
  2. Client / server can share files in a P2P fashion.
  3. Blobs are lazily loaded.

Out of the 3 points above, I think [1] is proposed in this issue.

[2] and [3] could be client/server-specific implementation details, especially around a FUSE client, but it would be nice to take P2P into consideration for V3.

@sluongng

Worth noting that the large-blob performance issue is quite real: rules_docker used to have to disable remote caching for a bunch of intermediary artifacts during a container image build (bazelbuild/rules_docker#2043).

If this data could be broken down into reusable chunks/blocks, the cache hit rate should be much higher, eliminating the need to disable the remote cache for intermediate artifacts.

EdSchouten linked a pull request Nov 7, 2022 that will close this issue
@sluongng

Returning to this from the V3 doc:

Another idea is to acknowledge the existence of bigger blobs and set a size boundary above which a blob gets special treatment (i.e., a maximum size up to which the server accepts a blob directly; all files bigger than that need to go through special treatment).

Similar to how Yandex's Arc VCS does it (https://habr.com/ru/companies/yandex/articles/482926/), we could introduce a metadata type BlobRef that stores metadata about the chunks that make up a bigger blob. We should have a separate ChunkingAlgorithm to let different implementations define how to chunk bigger blobs into smaller chunks and generate the BlobRef accordingly. Examples of chunking algorithms include SHA256TREE, ReedSolomon (erasure encoding), FastCDC, etc.
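As a rough illustration of this shape (in Go rather than protobuf, with all field names and the algorithm list being assumptions rather than an agreed design), such a BlobRef and ChunkingAlgorithm could look like:

```go
// Rough sketch only: a possible shape for the BlobRef metadata type and
// ChunkingAlgorithm selector suggested above. Field names and the set of
// algorithms are illustrative assumptions, not an agreed REv3 design.
package blobref

type ChunkingAlgorithm int32

const (
	CHUNKING_ALGORITHM_UNSPECIFIED ChunkingAlgorithm = iota
	SHA256TREE   // fixed-size decomposition with tree hashing
	FASTCDC      // content-defined chunking
	REED_SOLOMON // erasure coding of the blob into recoverable blocks
)

type Digest struct {
	Hash      string
	SizeBytes int64
}

// BlobRef describes how a blob larger than the agreed size boundary was
// decomposed. The original blob is reproduced by fetching each chunk digest
// in order and concatenating (or decoding) the results.
type BlobRef struct {
	// Digest of the logical blob as a whole, so existing messages can keep
	// referring to the blob unchanged.
	BlobDigest Digest
	// Algorithm used to split the blob; client and server must agree on it
	// to produce identical chunk boundaries.
	Algorithm ChunkingAlgorithm
	// Digests of the chunks, in order.
	ChunkDigests []Digest
}
```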

roloffs commented Dec 1, 2023

To add some thoughts to this discussion from the perspective of justbuild development: we like the idea of avoiding large blobs in the remote-execution CAS, for the reasons already discussed. In bazelbuild/remote-apis#282, we proposed a content-defined blob-splitting scheme to save network traffic and, in the future, also storage. Currently we are happy to verify blobs by reading through their entire data, which is why we focused on traffic and storage reduction rather than on optimized verification of split large blobs. If there were a solution that models the Merkle tree not over fixed-size chunks but over content-defined ones, it would not only avoid large blobs but also save traffic and storage. This requires a fixed content-defined splitting algorithm with all randomness known, e.g., in terms of seed and random number generator. We don't yet have a complete solution, but we wanted to put this consideration into the discussion, not only for this issue but also for REv3 development, since we are highly interested in having such network-traffic and storage-reduction concepts available in REv3.
