[parquet]: support embedding custom byte blocks alongside row group data? #9860

@friendlymatthew

I'm curious what the right way is to support custom, application-defined byte blocks in Parquet files.

There are use cases where it's valuable to embed arbitrary byte ranges, such as bloom filter extensions, custom statistics, secondary indexes, or application-specific metadata, directly inside a Parquet file.

Today, key_value_metadata can store small values, but it isn't designed for large binary blobs with efficient random access: readers must deserialize the entire footer to access them.

I prototyped one approach that adds a new Thrift field to FileMetaData:

// field 10 of FileMetaData
struct CustomBlock {
  1: required string name
  2: required i64 offset
  3: required i64 length
  4: optional string block_type
}

Blocks are written after the bloom filters and the column and offset indexes, but before the Thrift footer. Readers that don't recognize field 10 skip it, so the file stays valid.
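To make the layout concrete, here is a minimal sketch of how a writer could record a block's offset and length and how a reader could later fetch that block with a single seek. It uses only the Rust standard library and is not the prototype's code or any existing parquet crate API: the `CustomBlock` Rust struct and the `write_block` / `read_block` helpers are hypothetical names that mirror the Thrift struct above.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical Rust mirror of the proposed `CustomBlock` Thrift struct.
#[derive(Debug, Clone, PartialEq)]
pub struct CustomBlock {
    pub name: String,
    pub offset: i64,
    pub length: i64,
    pub block_type: Option<String>,
}

/// Appends `payload` at the writer's current position and returns the
/// descriptor that would later be serialized into the footer.
/// Assumes the caller has already written row groups and indexes.
pub fn write_block<W: Write + Seek>(
    writer: &mut W,
    name: &str,
    block_type: Option<&str>,
    payload: &[u8],
) -> io::Result<CustomBlock> {
    let offset = writer.stream_position()?;
    writer.write_all(payload)?;
    Ok(CustomBlock {
        name: name.to_string(),
        offset: offset as i64,
        length: payload.len() as i64,
        block_type: block_type.map(str::to_string),
    })
}

/// Reads exactly one block by seeking to its recorded offset, without
/// touching any other part of the file.
pub fn read_block<R: Read + Seek>(reader: &mut R, block: &CustomBlock) -> io::Result<Vec<u8>> {
    // Thrift i64 is signed, so reject negative values before converting.
    let offset = u64::try_from(block.offset)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "negative block offset"))?;
    let length = usize::try_from(block.length)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "invalid block length"))?;
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; length];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

A real reader would probably also check that `offset + length` falls before the footer before allocating, since both values come from untrusted file contents.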

This seems to work, but it raises open questions, such as whether this is the right mechanism, or whether this is even the right layer to change.

Motivation

The main use case I'm exploring is storing precomputed secondary indexes inside Parquet files so that query engines can seek directly to them without external sidecar files. Keeping everything in one file simplifies write amplification and cache invalidation.

Labels: question (further information is requested)