Improve query performance by caching Parquet footers and Bloom filters #1597

gaffer01 · 2023-12-04T16:22:01Z

Background

We should investigate ways to improve query performance.

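It is common in LSM stores to keep a Bloom filter alongside each file. When querying for a key, the Bloom filter is checked first. If it reports that the key is not in the file, the cost of opening the file and reading pages to look for data can be skipped. A Bloom filter can return false positives but never false negatives, so skipping a file on a negative answer is always safe.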
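As an illustration of the check described above, here is a minimal sketch of a per-file Bloom filter and the "skip files that cannot contain the key" step. It is not the project's implementation: the class `BloomFilter`, the helper `files_to_open`, the SHA-256 based double hashing, and the sizing parameters are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Illustrative fixed-size Bloom filter (hypothetical, not project code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def files_to_open(key: bytes, filters: dict[str, BloomFilter]) -> list[str]:
    """Return only the files whose filter cannot rule the key out."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]
```

A query would then open only the files returned by `files_to_open`, accepting that a small fraction of those are false positives.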

Reading a Parquet file requires reading its footer first. If the footer were copied from the file to a higher-performing storage layer, a file could potentially be opened by reading the footer from that layer and the data from S3.
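For context on why the footer comes first: an unencrypted Parquet file ends with the footer metadata, then a 4-byte little-endian footer length, then the magic bytes `PAR1`. A reader therefore fetches the last 8 bytes to find the footer before it can locate any data. The sketch below computes the footer's byte range from that trailer; the function name `footer_byte_range` is hypothetical, and encrypted files (which use a different magic) are deliberately not handled.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TRAILER_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_byte_range(file_size: int, trailer: bytes) -> tuple[int, int]:
    """Return (start, end) offsets of the footer metadata, end exclusive.

    `trailer` is the last 8 bytes of the file.
    """
    if len(trailer) != TRAILER_SIZE or trailer[4:] != PARQUET_MAGIC:
        raise ValueError("not an unencrypted Parquet trailer")
    (footer_len,) = struct.unpack("<I", trailer[:4])
    end = file_size - TRAILER_SIZE
    start = end - footer_len
    # The file also begins with the 4-byte magic, so the footer cannot start before it.
    if start < len(PARQUET_MAGIC):
        raise ValueError("footer length exceeds file size")
    return start, end
```

Against S3 this would typically mean one ranged read for the trailer and a second for the footer, both of which could be avoided if the footer bytes were already held in a faster store.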

Description

Storing Bloom filters of the keys will be simple. However, for this to benefit query performance, the Bloom filters must be stored somewhere that is fast to read. Two cases matter:

- If the key is in the file, reading the Bloom filter must not add significantly to the overall query time.
- If the key is not in the file, reading the Bloom filter must take much less time than simply opening the Parquet file.
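The trade-off above can be made concrete with a simple expected-cost model. This is only a back-of-the-envelope sketch: the function names and any latency figures fed into them are hypothetical, not measurements from this system.

```python
def expected_cost_with_filter(
    t_filter_ms: float,
    t_file_ms: float,
    p_present: float,
    false_positive_rate: float,
) -> float:
    """Expected per-file lookup time when the Bloom filter is read first.

    The filter is always read. The file is opened when the key is really
    present, or when the filter gives a false positive.
    """
    p_open = p_present + (1.0 - p_present) * false_positive_rate
    return t_filter_ms + p_open * t_file_ms


def filter_is_worthwhile(
    t_filter_ms: float,
    t_file_ms: float,
    p_present: float,
    false_positive_rate: float,
) -> bool:
    """True if checking the filter first beats always opening the file."""
    with_filter = expected_cost_with_filter(
        t_filter_ms, t_file_ms, p_present, false_positive_rate
    )
    return with_filter < t_file_ms
```

With made-up inputs, a filter read that is a small fraction of `t_file_ms` pays off whenever most files do not contain the key, whereas a filter read that costs nearly as much as opening the file gives little benefit. When the key is always present (`p_present = 1.0`), the filter read is pure overhead.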

We can investigate whether Parquet file footers can be stored in a higher-performance storage system than S3, reducing query time by reading the footers from there and the pages from the Parquet file in S3. The higher-performance storage system might be a lower-latency S3 storage class (e.g. https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/), or a local NVM drive in the case where queries are run from an EC2 server.
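One possible shape for this is a read-through cache: look for the footer in the fast store, and fall back to S3 only on a miss. The sketch below is an assumption about how that could look, not a proposed design. `FooterCache`, the dict-backed `fast_store`, and the `read_footer_from_s3` callable are all hypothetical stand-ins, and it assumes files are never rewritten in place, so a cached footer never goes stale.

```python
from typing import Callable, Dict


class FooterCache:
    """Read-through cache for Parquet footer bytes (illustrative only)."""

    def __init__(
        self,
        fast_store: Dict[str, bytes],
        read_footer_from_s3: Callable[[str], bytes],
    ) -> None:
        # fast_store stands in for the faster layer (e.g. a local drive).
        self._fast_store = fast_store
        # read_footer_from_s3 stands in for the slow path that reads the
        # footer out of the Parquet file itself.
        self._read_footer_from_s3 = read_footer_from_s3

    def get(self, path: str) -> bytes:
        footer = self._fast_store.get(path)
        if footer is None:
            # Miss: pay the slow read once, then keep a copy for later queries.
            footer = self._read_footer_from_s3(path)
            self._fast_store[path] = footer
        return footer
```

A reader would call `get(path)` to obtain the footer and then issue ranged reads to S3 for only the pages the query needs.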

gaffer01 added the enhancement (New feature or request) and epic labels on Dec 4, 2023
