
Parquet WriterProperties.max_row_group_size not wired up #257

Closed
crepererum opened this issue May 5, 2021 · 1 comment · Fixed by #381

@crepererum
Contributor

Describe the bug
This property (that can be set via the WriterPropertiesBuilder):

max_row_group_size: usize,

can only be retrieved using this getter:

/// Returns max size for a row group.
pub fn max_row_group_size(&self) -> usize {
    self.max_row_group_size
}

but this getter is never used. In fact, quickly trying out this property shows it has no effect. I think it should probably be wired up here:

/// Write a RecordBatch to writer
///
/// *NOTE:* The writer currently does not support all Arrow data types
pub fn write(&mut self, batch: &RecordBatch) -> Result<()> {
    // validate batch schema against writer's supplied schema
    if self.arrow_schema != batch.schema() {
        return Err(ParquetError::ArrowError(
            "Record batch schema does not match writer schema".to_string(),
        ));
    }
    // compute the definition and repetition levels of the batch
    let batch_level = LevelInfo::new_from_batch(batch);
    let mut row_group_writer = self.writer.next_row_group()?;
    for (array, field) in batch.columns().iter().zip(batch.schema().fields()) {
        let mut levels = batch_level.calculate_array_levels(array, field, false);
        // Reverse levels as we pop() them when writing arrays
        levels.reverse();
        write_leaves(&mut row_group_writer, array, &mut levels)?;
    }
    self.writer.close_row_group(row_group_writer)
}

where the incoming RecordBatch would be split into chunks of the configured size, each of which is then written as its own row group.
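The chunking step itself is straightforward. Here is a minimal sketch (plain Rust, using a hypothetical free function `row_group_ranges` over a row count rather than the real arrow `RecordBatch` API) of how the batch could be divided into row-group-sized ranges:

```rust
/// Split `num_rows` into (offset, length) ranges of at most
/// `max_row_group_size` rows each; each range would become one row group.
fn row_group_ranges(num_rows: usize, max_row_group_size: usize) -> Vec<(usize, usize)> {
    assert!(max_row_group_size > 0, "row group size must be positive");
    (0..num_rows)
        .step_by(max_row_group_size)
        .map(|offset| (offset, max_row_group_size.min(num_rows - offset)))
        .collect()
}

fn main() {
    // 3 rows with max_row_group_size = 1 should yield 3 row groups,
    // matching the expected behavior described in this issue.
    let ranges = row_group_ranges(3, 1);
    println!("{:?}", ranges); // [(0, 1), (1, 1), (2, 1)]
    assert_eq!(ranges.len(), 3);
}
```

In the actual writer, each `(offset, length)` range would presumably be turned into a slice of the incoming batch (e.g. via `RecordBatch::slice`) and pushed through its own `next_row_group()` / `close_row_group()` cycle in the loop above; as the comment below notes, that slicing is exactly what was blocked at the time.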

To Reproduce
Steps to reproduce the behavior:

  1. Create a RecordBatch with 3 rows.
  2. Set WriterProperties.max_row_group_size to 1.
  3. Create a Parquet file.
  4. See that the Parquet file has only 1 row group (it should have 3).

Expected behavior
Record batches created from arrow should respect WriterProperties.max_row_group_size.

Additional context
Commit in question is 508f25c10032857da34ea88cc8166f0741616a32.

@crepererum crepererum added the bug label May 5, 2021
@nevi-me

nevi-me commented May 5, 2021

Thanks for noticing this @crepererum, it's related to #225, which is blocked by the inability to slice structs and lists correctly.

We want to use the slice facility to limit row group sizes. I was thinking about this a few hours ago: maybe we could track the offset and length in parquet instead of relying on arrow.
