Add ParquetMetadata::memory_size size estimation #5965

Merged
merged 3 commits into from
Jul 2, 2024

@alamb alamb commented Jun 26, 2024

Draft as I need to implement memory size calculations for the schema structs as well

Which issue does this PR close?

Closes #1729

Rationale for this change

For systems that want to do low latency queries on parquet files stored in
object store, it is important to somehow provide the ParquetMetadata to the reader to
avoid the overhead of fetching the file footer and re-parsing the metadata.

For example, when using the ArrowReaderMetadata API:

/// The metadata necessary to construct a [`ArrowReaderBuilder`]
///
/// Note this structure is cheaply clone-able as it consists of several arcs.
///
/// This structure allows
///
/// 1. Loading metadata for a file once and then using that same metadata to
/// construct multiple separate readers, for example, to distribute readers
/// across multiple threads
///
/// 2. Using a cached copy of the [`ParquetMetadata`] rather than reading it
/// from the file each time a reader is constructed.
///
/// [`ParquetMetadata`]: crate::file::metadata::ParquetMetaData
#[derive(Debug, Clone)]
pub struct ArrowReaderMetadata {
    /// The Parquet Metadata, if known a priori
    pub(crate) metadata: Arc<ParquetMetaData>,
    /// The Arrow Schema
    pub(crate) schema: SchemaRef,
    pub(crate) fields: Option<Arc<ParquetField>>,
}

One way to provide ParquetMetadata to the Arrow reader is to cache it in memory. For large numbers
of parquet files, this cache can consume a non-trivial amount of memory, so accurately understanding
the memory requirements is important.
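The caching scenario above can be sketched as a byte-budgeted cache. This is a minimal illustration only: `MetadataCache`, its methods, and the budget policy are hypothetical names that are not part of the parquet crate, and the metadata type is left generic so the sketch compiles with the standard library alone. In practice the `estimated_bytes` argument would come from the size estimate this PR adds.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical cache of per-file metadata, keyed by file path.
/// It tracks the estimated bytes held so it can refuse to grow past a budget.
pub struct MetadataCache<M> {
    entries: HashMap<String, (Arc<M>, usize)>,
    used_bytes: usize,
    budget_bytes: usize,
}

impl<M> MetadataCache<M> {
    pub fn new(budget_bytes: usize) -> Self {
        Self {
            entries: HashMap::new(),
            used_bytes: 0,
            budget_bytes,
        }
    }

    /// Insert (or replace) the metadata for `path`.
    /// Returns `false` and leaves the cache unchanged if the entry
    /// would push the estimated total past the budget.
    pub fn insert(&mut self, path: &str, metadata: Arc<M>, estimated_bytes: usize) -> bool {
        // When replacing an entry, its old size is released first.
        let previous = self.entries.get(path).map(|(_, size)| *size).unwrap_or(0);
        let projected = self.used_bytes - previous + estimated_bytes;
        if projected > self.budget_bytes {
            return false;
        }
        self.entries
            .insert(path.to_string(), (metadata, estimated_bytes));
        self.used_bytes = projected;
        true
    }

    /// Cheap lookup: cloning an `Arc` only bumps a reference count.
    pub fn get(&self, path: &str) -> Option<Arc<M>> {
        self.entries.get(path).map(|(m, _)| Arc::clone(m))
    }

    /// Estimated bytes currently accounted for by the cache.
    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```

Without a size estimate, a cache like this could only bound the number of entries, which says little about actual memory use when files differ widely in column and row group counts.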

What changes are included in this PR?

  1. Add ParquetMetadata::memory_size(), named similarly to arrow::Array::get_array_memory_size
  2. Add a non-public trait to help calculate memory usage
  3. Tests
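The helper-trait approach in item 2 can be sketched roughly as follows. This is a simplified illustration, not the code merged in the PR: the trait name `HeapSize` appears in the commit list, but the impls, the `ColumnInfo` struct, and the free function `memory_size` here are assumptions made for the example. The idea is that each type reports only the bytes it owns on the heap, and the total is the inline size of the struct plus that heap size.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Sketch of a helper trait: bytes owned on the heap,
/// NOT including `size_of::<Self>()` itself.
pub trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for i64 {
    fn heap_size(&self) -> usize {
        0 // plain value, no heap allocation
    }
}

impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity() // allocated buffer, not just the used length
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // The vector's own buffer plus whatever each element owns.
        let buffer = self.capacity() * size_of::<T>();
        let nested: usize = self.iter().map(|t| t.heap_size()).sum();
        buffer + nested
    }
}

impl<T: HeapSize> HeapSize for Option<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().map(|t| t.heap_size()).unwrap_or(0)
    }
}

impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Assumption: count the pointee in full and ignore the reference
        // counters. Data shared by several Arcs is counted once per handle.
        size_of::<T>() + (**self).heap_size()
    }
}

/// Hypothetical stand-in for one of the metadata structs.
pub struct ColumnInfo {
    pub name: String,
    pub null_counts: Vec<i64>,
    pub comment: Option<String>,
}

impl HeapSize for ColumnInfo {
    fn heap_size(&self) -> usize {
        self.name.heap_size() + self.null_counts.heap_size() + self.comment.heap_size()
    }
}

/// Total estimate: inline size of the value plus the heap it owns.
pub fn memory_size<T: HeapSize>(value: &T) -> usize {
    size_of::<T>() + value.heap_size()
}
```

One consequence of this design is that every struct must list its fields by hand, so a newly added field is silently left out of the estimate unless its impl is updated.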

Are there any user-facing changes?

There is one new public function in the API: ParquetMetadata::memory_size().

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 26, 2024
use crate::schema::types::SchemaDescriptor;
use std::sync::Arc;

/// Trait for calculating the size of various containers
@alamb alamb Jun 26, 2024

I chose to put as many of the memory estimation calculations as possible in their own module rather than sprinkling them next to the structure definitions. I put them next to the structures only when the internal fields are private.

I can put the code next to the definitions if people think that would be cleaner / less likely to be forgotten if new fields are added in the future

@alamb alamb force-pushed the alamb/memory_accounting branch 2 times, most recently from 29abb9e to c00cb27 Compare June 27, 2024 00:44
@@ -176,6 +179,28 @@ impl ParquetMetaData {
self.offset_index.as_ref()
}

/// Estimate of the bytes allocated to store `ParquetMetadata`

Here is the (only) new API introduced in this PR

]]),
);

let bigger_expected_size = 2304;

This shows there is non-trivial overhead in storing these structures -- already about 2K for a 2-column file with a single row group

Require HeapSize for ParquetValueType
@alamb
alamb commented Jul 2, 2024

I plan to merge this later today as well, unless there are additional comments or people would like more time to review

@alamb alamb merged commit e61fb62 into apache:master Jul 2, 2024
16 checks passed
@alamb alamb deleted the alamb/memory_accounting branch July 2, 2024 18:01
Successfully merging this pull request may close these issues.

Add memory size estimation for ParquetMetadata
2 participants