Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support full u32 and u64 roundtrip through parquet #258

Merged
merged 5 commits into from
May 10, 2021

Conversation

crepererum
Copy link
Contributor

Which issue does this PR close?

Closes #254 .

Rationale for this change

Up until now u64 values larger than i64::MAX were silently truncated to "invalid" when going through parquet. With this change we follow the C++ stack and

What changes are included in this PR?

  • a bit of prep (first two commits)
  • map all u64 values to i64 via overflow/reinterpret cast
  • account of logical type is_signed flag in statistics calculation (as per spec)

Are there any user-facing changes?

Yes, users can now expect to have full u64 storage support. Old files should still work since we previously marked values larger than i64::MAX as invalid.

@codecov-commenter
Copy link

codecov-commenter commented May 5, 2021

Codecov Report

Merging #258 (2710f31) into master (8bd769b) will increase coverage by 0.01%.
The diff coverage is 90.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #258      +/-   ##
==========================================
+ Coverage   82.53%   82.54%   +0.01%     
==========================================
  Files         162      162              
  Lines       43796    43862      +66     
==========================================
+ Hits        36149    36208      +59     
- Misses       7647     7654       +7     
Impacted Files Coverage Δ
parquet/src/column/writer.rs 93.64% <84.00%> (-0.13%) ⬇️
parquet/src/arrow/array_reader.rs 77.12% <85.71%> (-0.02%) ⬇️
parquet/src/arrow/arrow_writer.rs 98.04% <94.33%> (-0.38%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8bd769b...2710f31. Read the comment docs.

@@ -380,6 +380,18 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
}
Arc::new(builder.finish()) as ArrayRef
}
ArrowType::UInt64 => match array.data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was confusing to me (as it is manipulating the Arrow array after it has been created rather than manipulating the data prior to creating the array).

It also seems like it will result in a copy of all Int64 columns which may not be idea.

I wonder if you considered creating the Unit64Array directly from the parquet data up here: https://github.com/apache/arrow-rs/pull/258/files#diff-0d6bed48d78c5a2472b7680a8185cabdc0bd259d6484e184439ed7830060661fR316-R319
instead?

That may be clearer to understand as well as more performant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see we always cast the array, see default case in line 383 (pre-patch) / 395 (post-patch). So we would always copy, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of the additional copy that comes from the arity call -- specifically https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/arity.rs#L64

The arity call will be in addition to any copying that cast does, I think.

Also, when the types are the same, then cast is a noop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@alamb alamb added the parquet Changes to the parquet crate label May 5, 2021
@crepererum crepererum requested a review from alamb May 6, 2021 07:38
@alamb
Copy link
Contributor

alamb commented May 6, 2021

The test failure
https://github.com/apache/arrow-rs/pull/258/checks?check_run_id=2516139774 seems to be due to a network problem. Retriggering.


  nightly-2021-03-24-x86_64-unknown-linux-gnu installed - rustc 1.53.0-nightly (673d0db5e 2021-03-23)

info: checking for self-updates
error: could not download file from 'https://static.rust-lang.org/rustup/release-stable.toml' to '/tmp/rustup-updatemwNOD1/release-stable.toml'
error: caused by: failed to make network request
error: caused by: error sending request for url (https://static.rust-lang.org/rustup/release-stable.toml): connection error: Connection reset by peer (os error 104)
error: caused by: connection error: Connection reset by peer (os error 104)
error: caused by: Connection reset by peer (os error 104)
Error: Process completed with exit code 1.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is looking great. Thanks @crepererum !

@alamb alamb requested a review from nevi-me May 6, 2021 11:45
@nevi-me
Copy link
Contributor

nevi-me commented May 7, 2021

@crepererum thanks for picking this up, and working on it. Do we have the same risk for u32 <> i32?

@crepererum
Copy link
Contributor Author

@nevi-me good catch. Indeed there was the same bug. Added a test and fixed as well.

@crepererum
Copy link
Contributor Author

Will rebase once #267 is merged.

@nevi-me nevi-me changed the title support full u64 roundtrip through parquet support full u32 and u64 roundtrip through parquet May 7, 2021
@alamb
Copy link
Contributor

alamb commented May 7, 2021

#267 has been merged

Seems logical since all other kernels are re-exported as well under this
flat hierarchy.
- updates arrow to parquet type mapping to use reinterpret/overflow cast
  for u64<->i64 similar to what the C++ stack does
- changes statistics calculation to account for the fact that u64 should
  be compared unsigned (as per spec)

Fixes apache#254.
This is idential to the solution we now have for u64.
@crepererum
Copy link
Contributor Author

ready :)


/// Evaluate `a > b` according to underlying logical type.
fn compare_greater(&self, a: &T::T, b: &T::T) -> bool {
if let Some(LogicalType::INTEGER(int_type)) = self.descr.logical_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: not all implementations might write LogicalType, even though ConvertedType is deprecated. I had to think a bit about this, but it's fine because it's on the write-side, where we will always write LogicalType going forward.

If this was on the read side, we'd have to also check ConvertedType as a fallback.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

u64::MAX does not roundtrip through parquet
4 participants