We have a specific use case in one of our deployments where a small subset of files ends up serving heavy reads, many of which are point lookups. Profiles show that most of the CPU time is spent inferring the Arrow schema from the `ARROW:schema` Parquet metadata. The other expensive part is rebuilding the bloom filter on the predicate column over and over again.
In our case we know the Arrow schema for each file and are okay with providing it ourselves. One option is to add it as an optional field on `PartitionedFile`, so that the opener can prioritize it when set instead of inferring it from the Parquet footer. I don't yet have a good solution for reusing bloom filters, but I am open to ideas on how to inject more information into the Parquet opener ahead of time. I am also happy to open a separate issue for that.
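
To make the proposal concrete, here is a minimal sketch of the precedence I have in mind. It does not use the real DataFusion types: `Schema` and `PartitionedFile` below are simplified stand-ins, the `arrow_schema` field and the `resolve_schema` helper are hypothetical names, and the `infer_from_footer` closure stands for whatever the opener does today to decode the schema from the footer.

```rust
use std::sync::Arc;

/// Stand-in for `arrow::datatypes::Schema`; only field names are modelled.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

pub type SchemaRef = Arc<Schema>;

/// Simplified stand-in for DataFusion's `PartitionedFile`.
pub struct PartitionedFile {
    pub path: String,
    /// Hypothetical new field: a schema the caller already knows for this file.
    pub arrow_schema: Option<SchemaRef>,
}

/// Hypothetical helper showing the precedence the opener could apply.
/// Returns the schema to use and whether footer inference actually ran.
pub fn resolve_schema<F>(file: &PartitionedFile, infer_from_footer: F) -> (SchemaRef, bool)
where
    F: FnOnce(&str) -> SchemaRef,
{
    match &file.arrow_schema {
        // Caller supplied a schema: reuse it and skip footer inference.
        Some(schema) => (Arc::clone(schema), false),
        // No schema supplied: fall back to the existing inference path.
        None => (infer_from_footer(&file.path), true),
    }
}
```

With this shape, files that carry no schema behave exactly as before, so the field would be purely opt-in.
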
The flamegraph below is taken from one of our production deployments, and I have focused it only on the stack frames doing Parquet file reads. We are currently on DataFusion 53.0.0.
