Background
We should investigate improvements to the query performance.
Log-structured merge (LSM) stores commonly keep a Bloom filter alongside each file. A query for a key checks the Bloom filter first; if the filter reports that the key is not in the file, the cost of opening the file and reading pages to search for the key is avoided.
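As an illustration of the per-file check described above, here is a minimal sketch in Python. The `BloomFilter` class, its sizing parameters, and the `files_to_open` helper are hypothetical names chosen for this example, not part of any existing codebase; a real implementation would size the filter from the expected key count and target false positive rate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; sizes are illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key is definitely absent; True means "maybe present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def files_to_open(key, filters_by_file):
    """Return only the files whose filter cannot rule the key out."""
    return [name for name, bf in filters_by_file.items() if bf.might_contain(key)]
```

A filter can return a false positive but never a false negative, so skipping a file on a negative answer is always safe.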
Reading a Parquet file requires reading the footer first. If the footer were copied from the file to a higher-performance storage layer, the file could potentially be opened by reading the footer from that layer and the data from S3.
Description
Storing Bloom filters of the keys will be simple. However, for this to benefit queries, the Bloom filters must be stored somewhere that is fast to read. When the key is in the file, reading the filter must not add significantly to the overall query time. When the key is not in the file, reading the filter must be much faster than simply opening the Parquet file.
We can investigate whether Parquet file footers can be stored in a higher-performance storage system than S3, so that query time is reduced by reading the footers from there and the pages from the Parquet file in S3. The higher-performance storage system might be a lower-latency S3 storage class (e.g. https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/), or a local NVMe drive in the case where queries are run from an EC2 server.
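To show which bytes would need to be copied to the faster store, here is a minimal sketch in Python. It relies on the Parquet layout for unencrypted files, where the file ends with the serialized footer metadata, a 4-byte little-endian footer length, and the magic bytes `PAR1`. The function names `footer_byte_range` and `extract_footer` are hypothetical, and the sketch works on in-memory bytes rather than S3.

```python
import struct
from typing import Tuple

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte footer length + 4-byte magic


def footer_byte_range(file_size: int, last_8_bytes: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the footer metadata within the file."""
    if len(last_8_bytes) != TAIL_SIZE or last_8_bytes[4:] != PARQUET_MAGIC:
        raise ValueError("not the tail of an unencrypted Parquet file")
    (footer_len,) = struct.unpack("<I", last_8_bytes[:4])
    offset = file_size - TAIL_SIZE - footer_len
    if offset < len(PARQUET_MAGIC):
        raise ValueError("footer length is larger than the file allows")
    return offset, footer_len


def extract_footer(file_bytes: bytes) -> bytes:
    """Return the footer metadata plus the 8-byte tail, ready to be cached."""
    offset, length = footer_byte_range(len(file_bytes), file_bytes[-TAIL_SIZE:])
    return file_bytes[offset:offset + length + TAIL_SIZE]
```

Against S3, the same two steps would be ranged reads: one for the last 8 bytes, then one for the footer range. Whether a given Parquet reader can be handed a footer from one location and data pages from another is part of what needs investigating.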