Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11265: [Rust] Made bool not ArrowNativeType #9212

Closed
wants to merge 1 commit into from
Closed

ARROW-11265: [Rust] Made bool not ArrowNativeType #9212

wants to merge 1 commit into from

Conversation

jorgecarleitao
Copy link
Member

@jorgecarleitao jorgecarleitao commented Jan 16, 2021

This PR removes the risk of boolean values to be converted to bytes via ToByteSlice by explicitly making ArrowNativeType be only used in types whose in-memory representation in Rust equates to the in-memory representation in Arrow. bool in Rust is a byte and in Arrow it is a bit.

Overall, the direction of this PR is to have the traits represent one aspect of the type. In this case, ArrowNativeType is currently

  • a type that has the same in memory representation (ToByteSlice is implemented for it)
  • a json serializable type
  • something that can be casted to/from usize.

This poses a problem because:

  1. bools are serializable, not castable to usize, have different memory representation
  2. fixed size (iX, uX) are serializable, castable to usize, have the same memory representation
  3. fixed floating (f32, f64) are serializable, not castable to usize, have the same memory representation

however, they all implement ArrowNativeType.

This PR focus on splitting the json-serializable part of it.

@github-actions
Copy link

@codecov-io
Copy link

Codecov Report

Merging #9212 (1cd33d0) into master (1393188) will increase coverage by 0.00%.
The diff coverage is 80.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #9212   +/-   ##
=======================================
  Coverage   81.61%   81.61%           
=======================================
  Files         215      215           
  Lines       51867    51866    -1     
=======================================
  Hits        42329    42329           
+ Misses       9538     9537    -1     
Impacted Files Coverage Δ
rust/arrow/src/util/integration_util.rs 66.87% <0.00%> (+0.14%) ⬆️
rust/arrow/src/compute/kernels/sort.rs 93.56% <100.00%> (ø)
rust/arrow/src/datatypes.rs 78.47% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eaa7b7a...1cd33d0. Read the comment docs.

@nevi-me nevi-me self-requested a review January 16, 2021 13:17
@@ -449,12 +449,10 @@ impl ArrowJsonBatch {
for i in 0..col.len() {
if col.is_null(i) {
validity.push(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated, but unless I'm reading this incorrectly (likely am), I would have thought that we'd validity.push(0) here for null values.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation @jorgecarleitao

@alamb
Copy link
Contributor

alamb commented Jan 17, 2021

I apologize for the delay in merging Rust PRs -- the 3.0 release is being finalized now and are planning to minimize entropy by postponing merging changes not critical for the release until the process was complete. I hope the process is complete in the next few days. There is more discussion in the mailing list

@jhorstmann
Copy link
Contributor

jhorstmann commented Jan 17, 2021

I did not see this, but had the same idea at around the same time :) This matches now also the distinction between PrimitiveArray and BooleanArray.

Another motivation for this change is that this could make Buffer::typed_data safe and allow us to remove the num::Num trait bound from that method.


/// Trait expressing a Rust type that has the same in-memory representation
/// as Arrow. This includes `i16`, `f32`, but excludes `bool` (which in arrow is represented in bits).
/// In little endian machines, types that implement [`ArrowNativeType`] can be memcopied to arrow buffers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the in-memory format require little-endian? I'd expect conversion to happen when reading data into memory, which should make memcpy work on bot LE and BE machines. Assuming absence of other endianness bugs of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not require it but is little-endian by default.

I recall someone wanting to add big-endian support to the C++ impl.

Copy link
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this has always been a source of corner cases (bools being bitpacked) so nice to clean it up.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jorgecarleitao

/// In little endian machines, types that implement [`ArrowNativeType`] can be memcopied to arrow buffers
/// as is.
pub trait ArrowNativeType:
fmt::Debug + Send + Sync + Copy + PartialOrd + FromStr + Default + JsonSerializable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice approach (breaking out the JsonSerializable part) 👍

@alamb alamb closed this in 01c5aec Jan 20, 2021
@alamb
Copy link
Contributor

alamb commented Jan 20, 2021

I merged this branch locally into master and re-ran the tests. Things looked good.

kszucs pushed a commit that referenced this pull request Jan 25, 2021
This PR removes the risk of boolean values to be converted to bytes via `ToByteSlice` by explicitly making `ArrowNativeType` be only used in types whose in-memory representation in Rust equates to the in-memory representation in Arrow. `bool` in Rust is a byte and in Arrow it is a bit.

Overall, the direction of this PR is to have the traits represent one aspect of the type. In this case, `ArrowNativeType` is currently
* a type that has the same in memory representation (ToByteSlice is implemented for it)
* a json serializable type
* something that can be casted to/from `usize`.

This poses a problem because:

1. bools are serializable, not castable to usize, have different memory representation
2. fixed size (iX, uX) are serializable, castable to usize, have the same memory representation
3. fixed floating (f32, f64) are serializable, not castable to usize, have the same memory representation

however, they all implement `ArrowNativeType`.

This PR focus on splitting the json-serializable part of it.

Closes #9212 from jorgecarleitao/fix_trait

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR removes the risk of boolean values to be converted to bytes via `ToByteSlice` by explicitly making `ArrowNativeType` be only used in types whose in-memory representation in Rust equates to the in-memory representation in Arrow. `bool` in Rust is a byte and in Arrow it is a bit.

Overall, the direction of this PR is to have the traits represent one aspect of the type. In this case, `ArrowNativeType` is currently
* a type that has the same in memory representation (ToByteSlice is implemented for it)
* a json serializable type
* something that can be casted to/from `usize`.

This poses a problem because:

1. bools are serializable, not castable to usize, have different memory representation
2. fixed size (iX, uX) are serializable, castable to usize, have the same memory representation
3. fixed floating (f32, f64) are serializable, not castable to usize, have the same memory representation

however, they all implement `ArrowNativeType`.

This PR focus on splitting the json-serializable part of it.

Closes apache#9212 from jorgecarleitao/fix_trait

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This PR removes the risk of boolean values to be converted to bytes via `ToByteSlice` by explicitly making `ArrowNativeType` be only used in types whose in-memory representation in Rust equates to the in-memory representation in Arrow. `bool` in Rust is a byte and in Arrow it is a bit.

Overall, the direction of this PR is to have the traits represent one aspect of the type. In this case, `ArrowNativeType` is currently
* a type that has the same in memory representation (ToByteSlice is implemented for it)
* a json serializable type
* something that can be casted to/from `usize`.

This poses a problem because:

1. bools are serializable, not castable to usize, have different memory representation
2. fixed size (iX, uX) are serializable, castable to usize, have the same memory representation
3. fixed floating (f32, f64) are serializable, not castable to usize, have the same memory representation

however, they all implement `ArrowNativeType`.

This PR focus on splitting the json-serializable part of it.

Closes apache#9212 from jorgecarleitao/fix_trait

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants