-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix record delimiting on row group boundaries (#2025) #2027
Conversation
@@ -124,7 +124,7 @@ impl<OffsetSize: OffsetSizeTrait> ArrayReader for ListArrayReader<OffsetSize> { | |||
|
|||
// The output offsets for the computed ListArray | |||
let mut list_offsets: Vec<OffsetSize> = | |||
Vec::with_capacity(next_batch_array.len()); | |||
Vec::with_capacity(next_batch_array.len() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by fix
@@ -193,10 +180,7 @@ where | |||
}; | |||
|
|||
// Try to more value from parquet pages | |||
let values_read = self.read_one_batch(batch_size)?; | |||
if values_read < batch_size { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the cause of the bug, the end of the column chunk was being detected based on if read_on_batch returned less than the batch size. If the batch_size happens to exactly match the remaining records this would fail, and the reader would miss the last record in the chunk
9472312
to
1fb75f8
Compare
1fb75f8
to
ad58424
Compare
7ffd362
to
abefdd4
Compare
I don't understand what the CI is doing, it seems to be checking out some ref that doesn't exist??? |
5e12b71
to
63c7394
Compare
63c7394
to
7681047
Compare
7681047
to
b9322fc
Compare
Turns out it was a combination of PEBCAC and a logical conflict when CI merged the lastest master in. |
@@ -731,4 +739,48 @@ mod tests { | |||
assert_eq!(5000, record_reader.num_values()); | |||
} | |||
} | |||
|
|||
#[test] | |||
fn test_row_group_boundary() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test passes on master, without the changes in this PR 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat expected, I'll see if I can make it fail
@@ -1528,4 +1528,57 @@ mod tests { | |||
|
|||
assert_eq!(total_rows, expected_rows); | |||
} | |||
|
|||
#[test] | |||
fn test_row_group_exact_multiple() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I remove the code in this PR this test fails like the following (I was expecting the reader to miss the last record)
---- arrow::arrow_reader::tests::test_row_group_exact_multiple stdout ----
thread 'arrow::arrow_reader::tests::test_row_group_exact_multiple' panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError("Parquet error: Not all children array length are the same!")', parquet/src/arrow/arrow_reader.rs:1581:53
stack backtrace:
0: rust_begin_unwind
at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/result.rs:1785:5
3: core::result::Result<T,E>::unwrap
at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/result.rs:1078:23
4: parquet::arrow::arrow_reader::tests::test_row_group_exact_multiple
at ./src/arrow/arrow_reader.rs:1581:23
5: parquet::arrow::arrow_reader::tests::test_row_group_exact_multiple::{{closure}}
at ./src/arrow/arrow_reader.rs:1533:5
6: core::ops::function::FnOnce::call_once
at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/ops/function.rs:248:5
7: core::ops::function::FnOnce::call_once
at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only misses the last record for the repeated array, and so you end up with this error
@tustvold lint failed ! |
0fe7e57
to
e5f57df
Compare
e5f57df
to
4ecf5d8
Compare
So the batch_size bug was a bug, but would be masked by the MIN_BATCH_SIZE setting of 1024 and so would only occur with batch sizes >= 1024. The actual cause of the failure was more subtle. The bug was that once exhausted RecordReader would continue to re-read the last record on subsequent calls to read_records. This would only occur if it returned exactly the batch_size number of records, when reaching the end of a chunk. This wouldn't actually read any new data and so would end up actually returning less records than it claimed to have read. So
If only a single list column this somewhat funky behaviour would not be noticeable, as it would eventually read all the data, corresponding to the total number of rows. However, it would read batches of 8, 7, 5 instead of 8, 8, 4, which would throw off the StructArrayReader. |
Which issue does this PR close?
Closes #2025
Rationale for this change
The previous behaviour was incorrect
What changes are included in this PR?
Fixes the delimiting logic in RecordReader
Are there any user-facing changes?
No