
ARROW-5129: [Rust] Column writer bug: check dictionary encoder when adding a new data page #4152

Closed
wants to merge 3 commits into from
55 changes: 54 additions & 1 deletion rust/parquet/src/column/writer.rs
@@ -421,7 +421,15 @@ impl<T: DataType> ColumnWriterImpl<T> {
     /// Returns true if there is enough data for a data page, false otherwise.
     #[inline]
     fn should_add_data_page(&self) -> bool {
-        self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
+        match self.dict_encoder {
+            Some(ref encoder) => {
+                encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
+            }
+            None => {
+                self.encoder.estimated_data_encoded_size()
+                    >= self.props.data_pagesize_limit()
+            }
+        }
     }

/// Performs dictionary fallback.
@@ -1379,6 +1387,51 @@ mod tests {
);
}

#[test]
fn test_column_writer_add_data_pages_with_dict() {
// ARROW-5129: Test verifies that we add data page in case of dictionary encoding
// and no fallback has occurred so far.
let file = get_temp_file("test_column_writer_add_data_pages_with_dict", &[]);
let sink = FileSink::new(&file);
let page_writer = Box::new(SerializedPageWriter::new(sink));
let props = Rc::new(
WriterProperties::builder()
.set_data_pagesize_limit(15) // actually each page will have size 15-18 bytes
.set_write_batch_size(3) // write 3 values at a time
.build(),
);
let data = &[1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
let mut writer = get_test_column_writer::<Int32Type>(page_writer, 0, 0, props);
writer.write_batch(data, None, None).unwrap();
let (bytes_written, _, _) = writer.close().unwrap();

// Read pages and check the sequence
let source = FileSource::new(&file, 0, bytes_written as usize);
let mut page_reader = Box::new(
SerializedPageReader::new(
source,
data.len() as i64,
Compression::UNCOMPRESSED,
Int32Type::get_physical_type(),
)
.unwrap(),
);
let mut res = Vec::new();
while let Some(page) = page_reader.get_next_page().unwrap() {
res.push((page.page_type(), page.num_values()));
}
assert_eq!(
res,
vec![
(PageType::DICTIONARY_PAGE, 10),
(PageType::DATA_PAGE, 3),
(PageType::DATA_PAGE, 3),
(PageType::DATA_PAGE, 3),
(PageType::DATA_PAGE, 1)
]
);
}

/// Performs write-read roundtrip with randomly generated values and levels.
/// `max_size` is maximum number of values or levels (if `max_def_level` > 0) to write
/// for a column.
Expand Down