-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ParquetMetadata::memory_size size estimation #5965
Conversation
use crate::schema::types::SchemaDescriptor; | ||
use std::sync::Arc; | ||
|
||
/// Trait for calculating the size of various containers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to add as many memory estimation calculations as possible in their own module rather than sprinkling it next to the structure definitions. I put it next to the structures when the internal fields are private.
I can put the code next to the definitions if people think that would be cleaner / less likely to be forgotten if new fields are added in the future
29abb9e
to
c00cb27
Compare
c00cb27
to
be3c014
Compare
@@ -176,6 +179,28 @@ impl ParquetMetaData { | |||
self.offset_index.as_ref() | |||
} | |||
|
|||
/// Estimate of the bytes allocated to store `ParquetMetadata` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the (only) new API introduced in this PR
]]), | ||
); | ||
|
||
let bigger_expected_size = 2304; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shows there is non trivial overhead with storing these structures -- 2K already for a 2 column with a single row group
Require HeapSize for ParquetValueType
I plan to merge this in later today as well unless there are any additional comments or people would like additional time to review |
Draft as I need to implement memory size calculations for the schema structs as wellWhich issue does this PR close?
Closes #1729
Rationale for this change
For systems that want to do low latency queries on parquet files stored in
object store, it is important to somehow provide the ParquetMetadata to the reader to
avoid the overhead of fetching the file footer and re-parsing the metadata.
For example, when using the
ArrowReaderMetadata
API:arrow-rs/parquet/src/arrow/arrow_reader/mod.rs
Lines 293 to 315 in ed018a3
One way to provide
ParquetMetadata
to the Arrow reader is to cache it in memory and for large numbersof parquet files this can consume non trivial memory. Thus accurately understanding the memory requirements
is important
What changes are included in this PR?
ParquetMetadata::memory_size()
, named similarly toarrow::Array::get_array_memory_size
Are there any user-facing changes?
There is a new function in the API