
Create an "end to end" Content Addressable Storage / CDC Chunking Example #9592

@alamb

Description

@kszucs has added support for Content Addressable Chunking in #9450 ❤️

This is an important feature for Hugging Face's Xet filesystem, which automatically deduplicates multiple copies of the same data.

I think, however, that it is a much more interesting use case more broadly and would be applicable to many users, for example those who "compact" data stored in Parquet files on object storage. The compacted versions often share a substantial number of similar bytes / pages, but as far as I know they aren't typically deduplicated today.

To help others understand more easily how to take advantage of this feature, I think it would help to have a simple working example showing how to build such a Content Addressable Storage system.

BTW using this feature anyone could implement a "parquet page store" storing only unique parquet pages and some metadata to reassemble the parquet files.

Is this easy to show? I realize this is an important use case for Hugging Face, but it would be nice to have an example of how this could be used by others who are not using the Xet filesystem.

I have been thinking of a page store prototype for a while actually, that would kinda look like:

  1. iterate over the parquet pages using a page reader
  2. use a hash function such as xxhash, SHA, or BLAKE to assign a unique key to the page based on its content (this is different from the gearhash, since chunking is already done by the parquet writer)
  3. write out the page to a hashtable-like storage system, such as a kv store or object store, though the choice really depends on the use case
  4. maintain the necessary metadata to reassemble the original parquet file from the stored pages
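
The four steps above can be sketched roughly as follows. This is a minimal illustration, not an existing API: `PageStore`, `put_file`, `get_file`, and the page payloads are hypothetical names and data, plain `bytes` objects stand in for pages that a real page reader would yield, and a `dict` stands in for the kv or object store. The manifest here only records page order; reassembling a real Parquet file would also need page headers, offsets, and footer metadata.

```python
import hashlib


class PageStore:
    """Toy content-addressed page store (hypothetical, for illustration only)."""

    def __init__(self):
        self.pages = {}      # digest -> page bytes (stand-in for a kv / object store)
        self.manifests = {}  # file name -> ordered list of page digests

    def put_file(self, name, pages):
        keys = []
        for page in pages:
            # Step 2: derive the key from the page content.
            key = hashlib.blake2b(page, digest_size=16).hexdigest()
            # Step 3: store the page only if this content has not been seen before.
            self.pages.setdefault(key, page)
            keys.append(key)
        # Step 4: keep the metadata needed to reassemble the file.
        self.manifests[name] = keys

    def get_file(self, name):
        return [self.pages[key] for key in self.manifests[name]]


# Hypothetical page payloads. The "compacted" file is assumed to reuse
# three pages byte-for-byte and to add one new page.
part_a = [b"page-1", b"page-2"]
part_b = [b"page-3", b"page-4"]
compacted = [b"page-1", b"page-2", b"page-3", b"page-5"]

store = PageStore()
store.put_file("part_a.parquet", part_a)
store.put_file("part_b.parquet", part_b)
store.put_file("compacted.parquet", compacted)

pages_written = len(part_a) + len(part_b) + len(compacted)
print(f"{pages_written} pages written, {len(store.pages)} unique pages stored")
```

Under the assumption that compaction reproduces identical page bytes, the compacted file costs only one additional stored page in this toy run.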

A format-agnostic CAS is different, since it does the chunking on the byte stream directly. I have a naive and very simple implementation of that here: https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs
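
To make the contrast concrete, here is a rough sketch of a format-agnostic store that chunks the raw byte stream itself. It is not the linked `store.rs` and not the chunker used by the parquet writer or Xet: the gear-style rolling hash is simplified, and the names `cdc_chunks`, `put`, `get` and the size parameters are assumptions chosen for illustration.

```python
import hashlib
import random

# Gear table: 256 pseudo-random 64-bit values. The fixed seed keeps chunk
# boundaries reproducible across runs (an arbitrary choice for this sketch).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data, avg_bits=10, min_size=256, max_size=4096):
    """Split data at content-defined boundaries using a simplified gear hash."""
    cut_mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes are shifted out, so a cut decision
        # depends only on the most recent bytes of content.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i + 1 - start
        if (size >= min_size and (h & cut_mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def put(store, data):
    """Store unique chunks; return the ordered keys needed to rebuild data."""
    keys = []
    for chunk in cdc_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)
        keys.append(key)
    return keys


def get(store, keys):
    """Reassemble the original bytes from the stored chunks."""
    return b"".join(store[key] for key in keys)


# Two versions of the same blob, differing by a small insertion near the start.
data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))
edited = original[:1000] + b"a few inserted bytes" + original[1000:]

store = {}
keys_original = put(store, original)
keys_edited = put(store, edited)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(f"logical bytes: {len(original) + len(edited)}, stored bytes: {stored_bytes}")
```

Because boundaries depend on content rather than on fixed offsets, the chunks after the insertion line up again once the two streams share a cut point, so most of the second copy is stored only once. Fixed-size chunking would shift every later chunk and share almost nothing.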

Originally posted by @kszucs in #9450 (comment)
