Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

u64::MAX does not roundtrip through parquet #254

Closed
crepererum opened this issue May 4, 2021 · 2 comments · Fixed by #258
Closed

u64::MAX does not roundtrip through parquet #254

crepererum opened this issue May 4, 2021 · 2 comments · Fixed by #258
Assignees
Labels
bug parquet Changes to the parquet crate

Comments

@crepererum
Copy link
Contributor

Describe the bug
u64::MAX gets truncated to 0 when storing to parquet and reading back.

To Reproduce
Add the following test:

#[test]
fn u64_min_max() {
    let values = Arc::new(UInt64Array::from_iter_values(vec![u64::MIN, u64::MAX]));
    one_column_roundtrip("u64_min_max_single_column", values, false);
}

Expected behavior
All values either roundtrip correctly or some error is produced - at least they should not be silently truncated.

Additional context
Tested on commit 8f030db53d9eda901c82db9daf94339fc447d0db.

@crepererum crepererum added the bug label May 4, 2021
@crepererum
Copy link
Contributor Author

Some more context: the logical type u64 is stored as the physical "i64 w/o sign bit". As per format specs:

If a stored value is larger than the maximum allowed by the annotation, the behavior is not defined and can be determined by the implementation. Implementations must not write values that are larger than the annotation allows.

so in theory we could map the logical range (i64:MAX as u64)..u64::MAX to the physical range i64:MIN..0i64. From looking at the C++ code I think (!!!) this is also what C++ is doing:

https://github.com/apache/arrow/blob/0ee3b90cac8f8eb3bf512e51a7f941fcead026d9/cpp/src/parquet/arrow/reader_internal.cc#L307-L322

aka: "reinterpret signed as unsigned ints".

Then the only bit that might be tricky (or not in which case a to-be-written test will pass) is to get the statistics right.

@crepererum
Copy link
Contributor Author

BTW: I'm working on that issue.

crepererum added a commit to crepererum/arrow-rs that referenced this issue May 5, 2021
- updates arrow to parquet type mapping to use reinterpret/overflow cast
  for u64<->i64 similar to what the C++ stack does
- changes statistics calculation to account for the fact that u64 should
  be compared unsigned (as per spec)

Fixes apache#254.
crepererum added a commit to crepererum/arrow-rs that referenced this issue May 7, 2021
- updates arrow to parquet type mapping to use reinterpret/overflow cast
  for u64<->i64 similar to what the C++ stack does
- changes statistics calculation to account for the fact that u64 should
  be compared unsigned (as per spec)

Fixes apache#254.
crepererum added a commit to crepererum/arrow-rs that referenced this issue May 10, 2021
- updates arrow to parquet type mapping to use reinterpret/overflow cast
  for u64<->i64 similar to what the C++ stack does
- changes statistics calculation to account for the fact that u64 should
  be compared unsigned (as per spec)

Fixes apache#254.
nevi-me pushed a commit that referenced this issue May 10, 2021
* re-export arity kernels in `arrow::compute`

Seems logical since all other kernels are re-exported as well under this
flat hierarchy.

* return file from `parquet::arrow::arrow_writer::tests::[one_column]_roundtrip`

* support full arrow u64 through parquet

- updates arrow to parquet type mapping to use reinterpret/overflow cast
  for u64<->i64 similar to what the C++ stack does
- changes statistics calculation to account for the fact that u64 should
  be compared unsigned (as per spec)

Fixes #254.

* avoid copying array when reading u64 from parquet

* support full arrow u32 through parquet

This is idential to the solution we now have for u64.
@jorgecarleitao jorgecarleitao added the parquet Changes to the parquet crate label May 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants