Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NullArrayReader (#1245) #1246

Merged
merged 1 commit into from
Jan 29, 2022
Merged

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #1245.

Rationale for this change

Originally reported by @bjchambers, a sanity check added in #1054 fires when reading NullArrays.

This highlighted two problems:

  • NullArrayReader has always been somewhat broken as it would never flush the accumulated null buffer, instead just growing it indefinitely
  • There were literally no tests for NullArrayReader 😱

What changes are included in this PR?

Fixes the panic, and adds a basic test of the NullArrayReader

Are there any user-facing changes?

NullArrayReader should work again

@@ -214,6 +214,10 @@ where
// save definition and repetition buffers
self.def_levels_buffer = self.record_reader.consume_def_levels()?;
self.rep_levels_buffer = self.record_reader.consume_rep_levels()?;

// Must consume bitmap buffer
self.record_reader.consume_bitmap_buffer()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is even getting computed at all is definitely a valid question, but at the same time a workload where the speed of reading arrays of nulls is the bottleneck feels decidedly... niche 😆

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One place it comes up is where you have multiple Parquet files representing a "table", with one or more columns being relatively sparse. If you have a file dropped every hour, then it may be that some of the files have a null column while others have a few values.

It doesn't seem likely it would be the bottleneck (the other columns in the file probably would be), but that's at least how it's come up for us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't mean to suggest NullArrayReader itself is niche, but rather there is probably a limited benefit to optimising it to not compute the bitmask unnecessarily 😄

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tustvold -- I'll get this out in arrow 9.0.0 next week; I can try to release a patchset sooner if you need @bjchambers

@bjchambers
Copy link
Contributor

bjchambers commented Jan 29, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet v8.0.0 panics when reading all null column to NullArray
3 participants