Reading arrow schemas from parquet files is expensive #22200

@fpetkovski

Description

We have a specific use case in one of our deployments where a small subset of files ends up serving heavy reads, many of which are point lookups. I am noticing in profiles that most of the CPU time is spent inferring the arrow schema from the ARROW:schema Parquet footer metadata. The other expensive part is rebuilding the bloom filter on the predicate column over and over again.

In our case we know the arrow schema for each file and are okay with providing it ourselves. One option might be to add the schema as an optional field on PartitionedFile, which the opener would prioritize, if set, before falling back to inferring it from the parquet footer. I don't yet have a good solution for reusing bloom filters, but I am open to ideas on what can be done to inject more information into the Parquet opener ahead of time. I am happy to open a separate issue for them as well.

The flamegraph below is taken from one of our production deployments, focused only on the stack frames doing parquet file reads. We are currently on DataFusion 53.0.0.

[Flamegraph: production CPU profile focused on parquet file read stack frames]
