
V3 idea: Drop support for offset reads & writes. Decompose large CAS objects into shallow Merkle trees #178

Open
EdSchouten opened this issue Oct 16, 2020 · 11 comments · May be fixed by #233

EdSchouten commented Oct 16, 2020

The size of objects stored in the CAS has a pretty large spread. We can have tiny Directory objects that are less than 100 bytes in size, but we can also have output files that are gigabytes in size. Dealing with these large files is annoying:

  • When using the digest as a key for sharding the CAS across multiple backends, you notice that you get an almost perfect distribution in terms of request count. Below is an example read count, 1h average, of a setup that has eight shards:

[Screenshot: read request count per shard (eight shards), 1h average]

But if you take a look at the amount of data read in bytes, again a 1h average, it's a lot less balanced. We sometimes see the busiest shard receive 50% more traffic than the least busy one. And that's measured across a full hour.

[Screenshot: bytes read per shard (eight shards), 1h average]

  • Many storage systems out there put limits on the maximum size of an object: Redis can only store 512 MB objects, Cassandra has a 2 GB limit on column values, and Ceph's RADOS has a maximum object size of 128 MB. (Buildbarn's LocalBlobAccess also has certain limits.) It is possible to decompose objects into chunks on the storage infrastructure side, but that removes the ability to perform data integrity checking at the chunk level, as multiple chunks need to be concatenated to perform checksum validation. When chunks are also spread across shards, an individual shard loses the ability to validate its own data entirely.
  • Connection drops may occur while uploading files. When this happens, there are two options: A) the client re-uploads the file from the beginning, or B) the client somehow requests that the upload continue from the last received position. The latter requires the server to hold on to partial uploads and negotiate with the client at which offset the upload needs to resume. That isn't fun to implement.
  • The current approach doesn't make it possible to do validated reads of chunks of a file. If workers implement lazy loading of file contents (e.g., using a worker FUSE file system), it may be preferable to not read large files as a whole, but only read the parts that are actually used by build actions. This can't be done in a validated way if the only thing you have is the SHA-sum of the file as a whole.
  • Relatedly, in Add support for transferring compressed blobs via ByteStream #168 we're discussing how support for compressing CAS contents can be added to REv2. As we've noticed, the need for supporting partial uploads & downloads complicates that a lot.

My suggestion is that REv3 drops support for offset reads & writes entirely. Instead, we should model large files as shallow (1-level deep) Merkle trees of chunks with some fixed size. Instead of sending a single ByteStream request to download such a large file, a client must (or may?) first read a manifest object from the CAS containing a list of digests of chunks whose contents need to be concatenated. When uploading a file, FindMissingBlobs() should be called against the digest of the manifest object and each of the digests of the chunks. This allows a client to skip uploading parts that are already in the CAS, which both speeds up resumption of partially uploaded files and adds (a rough version of) deduplication of identical regions across multiple files. Because large objects don't occur very frequently (read: almost all objects tend to be small), this indirection doesn't really impact performance for most workloads. Compression can be added by simply compressing individual chunks.
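To make the proposed flow concrete, below is a rough sketch of the client-side upload path (in Go; Manifest, Digest, and CAS are hypothetical stand-ins with an assumed chunk size and SHA-256 digests, not types from REv2 or any existing implementation):

```go
// Rough sketch only: hypothetical client-side decomposition for upload.
// Manifest, Digest, and CAS are illustrative stand-ins, not REv2/REv3 types.
package casdecompose

import (
	"crypto/sha256"
	"encoding/hex"
)

const chunkSize = 2 * 1024 * 1024 // assumed fixed decomposition size (2 MiB)

type Digest struct {
	Hash      string
	SizeBytes int64
}

// Manifest is the hypothetical manifest object: an ordered list of chunk
// digests whose contents, when concatenated, reproduce the original file.
type Manifest struct {
	Chunks []Digest
}

// CAS abstracts the two operations the proposal relies on:
// FindMissingBlobs() and uploading individual blobs.
type CAS interface {
	FindMissing(digests []Digest) []Digest
	Put(d Digest, data []byte)
}

func digestOf(data []byte) Digest {
	h := sha256.Sum256(data)
	return Digest{Hash: hex.EncodeToString(h[:]), SizeBytes: int64(len(data))}
}

// UploadDecomposed splits data into fixed-size chunks, asks the CAS which
// chunks are missing, and uploads only those. The returned manifest would
// then be serialized and uploaded as a regular (small) CAS object.
func UploadDecomposed(cas CAS, data []byte) Manifest {
	var m Manifest
	chunks := map[Digest][]byte{}
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		d := digestOf(chunk)
		m.Chunks = append(m.Chunks, d)
		chunks[d] = chunk
	}
	// Only chunks the CAS does not already have are transferred, which gives
	// both upload resumption and cross-file deduplication of identical regions.
	for _, d := range cas.FindMissing(m.Chunks) {
		cas.Put(d, chunks[d])
	}
	return m
}
```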

Earlier this year I experimented with this concept for Buildbarn (ADR#3), where the decomposition size can be any power of 2 that is 2 KiB or larger. It used a slightly simplified version of BLAKE3 to hash objects. Unfortunately, I never fully completed it or merged it into master, and it didn't take the desire to do compression into account. I think that both decomposition and compression should be considered, and they likely cannot be discussed separately. My hope is that for REv3 we find the time to solve this properly.

ulfjack commented Oct 16, 2020

You can do that today in the server, as long as you can support read and write offsets. I'm not sure about exposing this complexity to the client - it's not required in most cases, but it makes every client significantly more complex.

EdSchouten commented Oct 16, 2020

I'm not sure about exposing this complexity to the client - it's not required in most cases, but it makes every client significantly more complex.

But there are cases in which it is desired, and stating that it isn't required in most cases is not helping. Any protocol out there contains features that aren't needed in most cases, yet they aren't removed because people depend on them. If we removed all features from HTTP that aren't needed in most cases, we'd have Gopher.

I want to make use of a feature similar to Builds without the Bytes, except that I want a FUSE file system on the client side that still allows me to access any output files on demand. Some of these output files may be large. Right now reads against these files will hang until they have been downloaded over the ByteStream protocol entirely, which makes ad hoc exploration hard. This is because there is no way to do partial file validation.
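As an illustration of why the manifest helps here, the sketch below shows how a client-side FUSE layer could serve a partial read with per-chunk validation (again in Go, using the same hypothetical Manifest/Digest shapes as the earlier sketch; fetch stands in for a CAS read of a single chunk):

```go
// Rough sketch only: validated partial reads of a decomposed file, as a FUSE
// backend might implement them. Manifest/Digest are the same hypothetical
// types as in the earlier sketch; fetch stands in for a CAS read of one chunk.
package partialread

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const chunkSize = 2 * 1024 * 1024 // must match the size used when decomposing

type Digest struct {
	Hash      string
	SizeBytes int64
}

type Manifest struct {
	Chunks []Digest
}

// ReadAt serves a read of len(p) bytes at the given offset. Only the chunks
// overlapping the requested range are fetched, and each chunk is validated
// against its digest from the manifest before its bytes are returned.
func ReadAt(m Manifest, fetch func(Digest) ([]byte, error), p []byte, offset int64) (int, error) {
	n := 0
	for n < len(p) {
		pos := offset + int64(n)
		idx := int(pos / chunkSize)
		if idx >= len(m.Chunks) {
			break // read past end of file
		}
		want := m.Chunks[idx]
		chunk, err := fetch(want)
		if err != nil {
			return n, err
		}
		sum := sha256.Sum256(chunk)
		if hex.EncodeToString(sum[:]) != want.Hash {
			return n, fmt.Errorf("chunk %d failed digest validation", idx)
		}
		within := int(pos % chunkSize)
		if within >= len(chunk) {
			break // short final chunk: nothing left to read
		}
		n += copy(p[n:], chunk[within:])
	}
	return n, nil
}
```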

@glukasiknuro

I have collected some initial data to check whether the deduplication enabled by chunking would help reduce the size of the data. This focuses on data residing in the cache; checking in-flight data may be a next step.

Basic results:

  • zstd compression by itself reduces the cache size by close to 5x
  • adding chunking at 1 MB boundaries on top of compression yields a slight but not significant further improvement
  • content-defined chunking (FastCDC using Gear; see the sketch after the stats below) plus compression reduces the cache size by close to 7x

The stats below show the contents of a bazel-remote cache CAS directory when entries are chunked, only unique chunks are preserved, and all chunks are compressed using zstd. It would be nice to get some additional results, as these numbers may not be representative for other workloads.

The tool itself with more details is here: https://github.com/glukasiknuro/bazel-cache-chunking

CC @mostynb

Processed successfully 468997 files, total size: 1.7 TB

No chunking ZSTD CGO SPEED  size: 343 GB (avg: 732 kB), 20.53% of total  (4.87x effective space)  chunks: 468997    duplicate: 0
        1MB ZSTD CGO SPEED  size: 323 GB (avg: 688 kB), 19.29% of total  (5.18x effective space)  chunks: 1888847   duplicate: 134986
       Gear ZSTD CGO SPEED  size: 244 GB (avg: 519 kB), 14.55% of total  (6.87x effective space)  chunks: 1171361   duplicate: 440797
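For readers unfamiliar with content-defined chunking, here is a rough sketch of a Gear-style rolling-hash chunker, the building block FastCDC refines (in Go; the seed, mask, and size limits are illustrative values only, and the linked tool above contains the actual implementation behind these numbers):

```go
// Rough sketch only: a minimal Gear-style rolling-hash chunker to illustrate
// how content-defined boundaries are found. Parameters are illustrative.
package cdc

import "math/rand"

// gearTable is a fixed table of 256 random 64-bit values. Using a fixed seed
// makes chunk boundaries deterministic across client and server.
var gearTable = func() [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

const (
	minChunk = 512 * 1024      // do not cut before this many bytes
	maxChunk = 4 * 1024 * 1024 // force a cut at this size
	boundary = (1 << 20) - 1   // mask: expected boundary roughly every 1 MiB past minChunk
)

// NextBoundary returns the length of the next chunk at the start of data.
// A boundary is declared when the low bits of the rolling Gear hash are all
// zero, so inserting or deleting bytes only shifts boundaries locally.
func NextBoundary(data []byte) int {
	if len(data) <= minChunk {
		return len(data)
	}
	var h uint64
	end := len(data)
	if end > maxChunk {
		end = maxChunk
	}
	for i := 0; i < end; i++ {
		h = (h << 1) + gearTable[data[i]]
		if i >= minChunk && h&boundary == 0 {
			return i + 1
		}
	}
	return end
}

// Split cuts data into content-defined chunks by repeatedly taking the next
// boundary.
func Split(data []byte) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := NextBoundary(data)
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}
```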

@EdSchouten

Nice! Thanks for doing these measurements.

Important to mention that deduplication of identical chunks is not one of the main reasons I created this proposal. It is all about ensuring that large blobs simply don’t exist. Large blobs are annoying to work with, especially if you want partial access, resumption of transfers, compression, etc. while still having guaranteed data integrity.

mostynb commented Mar 12, 2021

IIRC @Erikma and @johnterickson reported that transferring large files in parallel using chunks was significantly faster with azure blob storage than a single upload/download stream (and I imagine S3 and GCS would be similar).

@glukasiknuro

Also, from a somewhat related item on the Bazel side: remote/performance: support rsync compression

In early experiments we are seeing up to 90% reduction in required download bandwidth.

The result above may be from before Builds without the Bytes was introduced, though.

@glukasiknuro

IIRC @Erikma and @johnterickson reported that transferring large files in parallel using chunks was significantly faster with azure blob storage than a single upload/download stream (and I imagine S3 and GCS would be similar).

For GCS we are using composite uploads and parallel sliced downloads, which greatly improve speed for gigabyte-sized files (an article mentioning this: https://jbrojbrojbro.medium.com/slice-up-your-life-large-file-download-optimization-50ee623b708c).

@sluongng

Regarding breaking down a large blob into multiple smaller chunks: here is some related art being applied by cloud vendors to distribute container images:

A quick summary:

  1. Large blobs here are broken down into either chunks or blocks.
  2. Client / server can share files in a P2P fashion.
  3. Blobs are lazily loaded.

Out of the 3 points above, I think [1] is proposed in this issue.

[2] and [3] could be client/server-specific implementation details, especially around a FUSE client, but it would be nice to take P2P into consideration for V3.

@sluongng

Worth noting that the large-blob performance issue is quite real: rules_docker used to have to disable remote caching for a bunch of intermediary artifacts during a container image build (bazelbuild/rules_docker#2043).

If this data could be broken down into reusable chunks/blocks, the cache hit rate should be much higher, eliminating the need to disable the remote cache for intermediate artifacts.

EdSchouten linked a pull request Nov 7, 2022 that will close this issue
@sluongng

Returning to this from the V3 doc:

Another idea is to acknowledge the existence of bigger blobs and set a size boundary above which a blob gets special treatment (i.e., a maximum size up to which the server accepts a blob directly; all files bigger than that need to go through special treatment).

Similar to how Yandex's Arc VCS does it (https://habr.com/ru/companies/yandex/articles/482926/), we could introduce a metadata type BlobRef that stores metadata about the chunks that make up a bigger blob. We should have a separate ChunkingAlgorithm to let different implementations define how to chunk bigger blobs into smaller chunks and generate the BlobRef accordingly. Examples of chunking algorithms include SHA256TREE, ReedSolomon (erasure encoding), FastCDC, etc.
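As a rough illustration of this shape (in Go rather than protobuf, with all field names and the algorithm list being assumptions rather than an agreed design), such a BlobRef and ChunkingAlgorithm could look like:

```go
// Rough sketch only: a possible shape for the BlobRef metadata type and
// ChunkingAlgorithm selector suggested above. Field names and the set of
// algorithms are illustrative assumptions, not an agreed REv3 design.
package blobref

type ChunkingAlgorithm int32

const (
	CHUNKING_ALGORITHM_UNSPECIFIED ChunkingAlgorithm = iota
	SHA256TREE   // fixed-size decomposition with tree hashing
	FASTCDC      // content-defined chunking
	REED_SOLOMON // erasure coding of the blob into recoverable blocks
)

type Digest struct {
	Hash      string
	SizeBytes int64
}

// BlobRef describes how a blob larger than the agreed size boundary was
// decomposed. The original blob is reproduced by fetching each chunk digest
// in order and concatenating (or decoding) the results.
type BlobRef struct {
	// Digest of the logical blob as a whole, so existing messages can keep
	// referring to the blob unchanged.
	BlobDigest Digest
	// Algorithm used to split the blob; client and server must agree on it
	// to produce identical chunk boundaries.
	Algorithm ChunkingAlgorithm
	// Digests of the chunks, in order.
	ChunkDigests []Digest
}
```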

roloffs commented Dec 1, 2023

To add some thoughts to this discussion from the perspective of justbuild development: we like the idea of avoiding large blobs in the remote-execution CAS, for the reasons already discussed. In bazelbuild/remote-apis#282, we proposed a content-defined blob-splitting scheme to save network traffic and, in the future, also storage. Currently we are happy to verify blobs by reading through their entire data, which is why we focused on traffic and storage reduction rather than on optimized verification of split large blobs. If there were a solution that models the Merkle tree not over fixed-size chunks but over content-defined ones, it would not only avoid large blobs but also save traffic and storage. This requires a fixed content-defined splitting algorithm with all randomness known, e.g., in terms of seed and random number generator. We don't yet have a complete solution, but we wanted to put this consideration into the discussion, not only for this issue but also for REv3 development, since we are highly interested in having such network-traffic and storage-reduction concepts available in REv3.
