I'm curious what the right way is to support custom, application-defined byte blocks in Parquet files.
There are use cases where it's valuable to embed arbitrary byte ranges, such as bloom filter extensions, custom statistics, secondary indexes, or application-specific metadata, directly inside a Parquet file.
Today, key_value_metadata can store small values, but it isn't designed for large binary blobs with efficient random access: the values live inside the Thrift footer, so readers must deserialize the entire footer just to reach them.
I prototyped one approach that adds a new Thrift field to FileMetaData:
struct CustomBlock {
  1: required string name;
  2: required i64 offset;
  3: required i64 length;
  4: optional string block_type;
}

// new field 10 of FileMetaData (field name illustrative)
10: optional list<CustomBlock> custom_blocks;
Blocks are written after the bloom filters and the column/offset indexes but before the Thrift footer. Readers that don't recognize field 10 skip it, since Thrift ignores unknown field ids, so the file stays valid.
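To make the layout and the read path concrete, here is a toy, non-Parquet sketch of the idea under stated assumptions: the JSON footer stands in for the Thrift FileMetaData, and the custom_block key is illustrative. The point is that the footer carries only an (offset, length) pointer, so a reader can seek straight to the block with one ranged read instead of deserializing the block alongside the footer:

```python
import io
import json
import struct

def write_toy_file(payload: bytes) -> bytes:
    """Lay out: data, custom block, footer, 4-byte footer length."""
    buf = io.BytesIO()
    buf.write(b"DATA...")                      # stand-in for row groups
    block_offset = buf.tell()
    buf.write(payload)                         # the custom byte block
    footer = json.dumps(
        {"custom_block": {"offset": block_offset, "length": len(payload)}}
    ).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # trailing footer length, as in Parquet
    return buf.getvalue()

def read_block(data: bytes) -> bytes:
    """Parse only the small footer, then do one targeted read of the block."""
    footer_len = struct.unpack("<I", data[-4:])[0]
    footer = json.loads(data[-4 - footer_len:-4])
    ref = footer["custom_block"]
    return data[ref["offset"]: ref["offset"] + ref["length"]]

f = write_toy_file(b"secondary-index-bytes")
assert read_block(f) == b"secondary-index-bytes"
```

A reader that ignores the custom_block entry still finds the footer by its trailing length, which mirrors why unknown-field skipping keeps the real file valid.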
This seems to work, but it raises open questions: is this the right mechanism, and is the Thrift footer even the right layer to change?
Motivation
The main use case I'm exploring is storing precomputed secondary indexes inside Parquet files so that query engines can seek directly to them without external sidecar files. Keeping everything in one file limits write amplification and simplifies cache invalidation.