Reading arrow schemas from parquet files is expensive #22200

@fpetkovski

Description

We have a specific use case in one of our deployments where a small subset of files ends up serving heavy reads, many of which are point lookups. I am noticing in profiles that most of the CPU time is spent inferring the arrow schema from the ARROW:schema Parquet footer metadata. The other expensive part is rebuilding the bloom filter on the predicate column over and over again.

In our case we know the arrow schema for each file and are okay with providing it ourselves. One option might be to add the schema as an optional field on PartitionedFile, which the opener would prioritize, if set, before falling back to inferring it from the parquet footer. I don't yet have a good solution for reusing bloom filters, but I am open to ideas on what can be done to inject more information into the Parquet opener ahead of time. I am happy to open a separate issue for them as well.

The flamegraph below is taken from one of our production deployments, focused only on the stack frames doing parquet file reads. We are currently on DataFusion 53.0.0.

[Flamegraph: production CPU profile focused on parquet file read stack frames]
