I'm curious what the right way is to support custom, application-defined byte blocks in Parquet files.
There are use cases where it's valuable to embed arbitrary byte ranges, such as bloom filter extensions, custom statistics, secondary indexes, or application-specific metadata, directly inside a Parquet file.
Today, key_value_metadata can store small values, but it isn't designed for large binary blobs with efficient random access: the values live inside the Thrift footer, so readers must deserialize the entire footer just to reach them.
I prototyped one approach that adds a new Thrift field to FileMetaData:
struct CustomBlock {
  1: required string name;
  2: required i64 offset;
  3: required i64 length;
  4: optional string block_type;
}

// new field 10 of FileMetaData (field name illustrative)
10: optional list<CustomBlock> custom_blocks;
Blocks are written after the bloom filters and the column/offset indexes but before the Thrift footer. Readers that don't recognize field 10 skip it, since Thrift ignores unknown field ids, so the file stays valid.
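To make the layout and the read path concrete, here is a toy, non-Parquet sketch of the idea under stated assumptions: the JSON footer stands in for the Thrift FileMetaData, and the custom_block key is illustrative. The point is that the footer carries only an (offset, length) pointer, so a reader can seek straight to the block with one ranged read instead of deserializing the block alongside the footer:

```python
import io
import json
import struct

def write_toy_file(payload: bytes) -> bytes:
    """Lay out: data, custom block, footer, 4-byte footer length."""
    buf = io.BytesIO()
    buf.write(b"DATA...")                      # stand-in for row groups
    block_offset = buf.tell()
    buf.write(payload)                         # the custom byte block
    footer = json.dumps(
        {"custom_block": {"offset": block_offset, "length": len(payload)}}
    ).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # trailing footer length, as in Parquet
    return buf.getvalue()

def read_block(data: bytes) -> bytes:
    """Parse only the small footer, then do one targeted read of the block."""
    footer_len = struct.unpack("<I", data[-4:])[0]
    footer = json.loads(data[-4 - footer_len:-4])
    ref = footer["custom_block"]
    return data[ref["offset"]: ref["offset"] + ref["length"]]

f = write_toy_file(b"secondary-index-bytes")
assert read_block(f) == b"secondary-index-bytes"
```

A reader that ignores the custom_block entry still finds the footer by its trailing length, which mirrors why unknown-field skipping keeps the real file valid.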
This seems to work, but it raises open questions: is this the right mechanism, and is the Thrift footer even the right layer to change?
Motivation
The main use case I'm exploring is storing precomputed secondary indexes inside Parquet files so that query engines can seek directly to them without external sidecar files. Keeping everything in one file limits write amplification and simplifies cache invalidation.