Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reading/writing nested null arrays (#1480) (#1036) (#1399) #1481

Merged
merged 1 commit into from
Mar 25, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #1480
Closes #1036
Closes #1399

Rationale for this change

See tickets

What changes are included in this PR?

Fixes various bugs in handling of null arrays

Are there any user-facing changes?

Previous incorrect behaviour is now fixed

@@ -887,7 +886,7 @@ fn remove_indices(
Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
}
}
ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len()))),
ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len() - indices.len()))),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original version of #1448 was this, which is correct, it was then changed in a subsequent follow up commit. I'm not sure why...

FYI @viirya

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I remember by deducting the null indices, the resulting record batch will be empty, but actually there are a row with empty list in previous issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was caused by #1448 (comment) if I'm not mistaken

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1513,6 +1513,43 @@ mod tests {
required_and_optional::<LargeStringArray, _>(raw_strs);
}

#[test]
fn null_list_single_column() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1564,6 +1601,25 @@ mod tests {
one_column_roundtrip(values, true, Some(SMALL_SIZE / 2));
}

#[test]
fn list_nested_nulls() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was just something I wrote whilst exploring the problem space, and I think is generally useful

// TODO: The behaviour of a <list<null>> is untested
DataType::Null => vec![list_level],
DataType::Boolean
DataType::Null
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix for #1036

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, i think this is correct.

DataType::Null
| DataType::Boolean
// A NullArray is entirely nulls, despite not containing a null buffer
DataType::Null => ((0..=(len as i64)).collect(), vec![false; len]),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix for #1480


// convert to arrays
let array = arrow::array::NullArray::new(records_read);
let array = arrow::array::NullArray::new(self.record_reader.num_values());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug, the NullArray should have the value count number of elements, i.e. the same number of elements as the definition/repetition levels if any.

In the non-nested case the two quantities will be the same, but in the event of a nested array a single record might correspond to 1 or more values.

This would then cause this line to fire - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader.rs#L930

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 24, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #1481 (a1e37df) into master (e778c10) will decrease coverage by 0.01%.
The diff coverage is 92.30%.

@@            Coverage Diff             @@
##           master    #1481      +/-   ##
==========================================
- Coverage   82.72%   82.71%   -0.02%     
==========================================
  Files         187      187              
  Lines       54186    54239      +53     
==========================================
+ Hits        44824    44862      +38     
- Misses       9362     9377      +15     
Impacted Files Coverage Δ
parquet/src/arrow/levels.rs 84.80% <25.00%> (+0.12%) ⬆️
parquet/src/arrow/array_reader.rs 78.79% <100.00%> (+0.06%) ⬆️
parquet/src/arrow/arrow_writer.rs 97.65% <100.00%> (+0.09%) ⬆️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.46%) ⬇️
arrow/src/array/data.rs 82.80% <0.00%> (-0.16%) ⬇️
integration-testing/src/lib.rs 0.00% <0.00%> (ø)
arrow/src/array/array_binary.rs 92.62% <0.00%> (ø)
arrow/src/array/array_dictionary.rs 91.91% <0.00%> (+0.24%) ⬆️
arrow/src/datatypes/field.rs 54.10% <0.00%> (+0.30%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e778c10...a1e37df. Read the comment docs.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the PR, the changes that you made make sense to me, and I can see how they address the issues that you've listed.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good.

@alamb alamb merged commit 6bf3b3a into apache:master Mar 25, 2022
@alamb
Copy link
Contributor

alamb commented Mar 25, 2022

Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
5 participants