-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-2583: [Rust] Buffer should be typeless #2330
Conversation
Seems like a reasonable refactoring; I don't know enough about Rust to know what would be the best design, but the way that the C++ library developed made having a generic data container for arrays (i.e. ArrayData) desirable architecturally. |
Thanks @wesm . I think this PR is mostly ready for review - I still need to fix the Windows test failure. @andygrove @pitrou @crepererum @xhochy : it would be great if you can take a look too. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! I don't know much about Rust, but I left a few comments.
rust/src/array_data.rs
Outdated
offset: i64, | ||
|
||
/// The buffers for this array data. Note that depending on the array types, this | ||
/// could hold different types of buffers (e.g., value buffer, value offset buffer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps say "kinds" instead of "types" if you want to avoid ambiguity about the Rust type of buffers.
rust/src/array_data.rs
Outdated
/// at different positions. | ||
buffers: Vec<Buffer>, | ||
|
||
/// The child(ren) of this array. Only non-empty for `ListArray` and `StructArray`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say something like "Only non-empty for nested types, currently List and Struct".
rust/src/array_data.rs
Outdated
/// The child(ren) of this array. Only non-empty for `ListArray` and `StructArray`. | ||
child_data: Vec<ArrayDataRef>, | ||
|
||
/// The null bit map. A `None` value for this indicates all values are non-null in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually spell it "bitmap".
rust/src/array_data.rs
Outdated
) -> Self { | ||
if null_count < 0 { | ||
null_count = if let Some(ref buf) = null_bit_buffer { | ||
count_set_bits(buf.data()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, in the C++ implementation we call count_set_bits
lazily (when a caller first asks for the null count). YMMV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I thought about that but moving this to null_count()
will require it to take a &mut self
instead of &self
(since now it needs to modify self.null_count
), which is not very desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently Rust allows interior mutability, which is adequate for such cached properties:
https://ricardomartins.cc/2016/06/08/interior-mutability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I forgot about Cell
. Will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually @pitrou : the Cell
is not thread-safe in this case so we cannot use it. I thought about using AtomicI64
but it is nightly-only. I'm not sure how significant this is for performance reason - can we leave it as a TODO for now?
rust/src/array_data.rs
Outdated
|
||
/// Returns a copy of the data type for this array data. | ||
pub fn data_type(&self) -> DataType { | ||
self.data_type.clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, why is this a copy? Data types are mutable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data types are not mutable. The other option is to return a reference here. However, if you need to store this reference anywhere in a struct, you'd have to add a lifetime parameter in that struct, which will get populated to any struct that use this struct, and so on. Personally I'd like to avoid this since it makes the design much more restrictive.
I think it's a good idea to add another version of this method that returns a reference though, in case anyone only needs to use the result in the local stack. Will do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(IMHO!) In Rust, the user should decide if the clone is required or not. In many cases, you could read out information of the data type w/o cloning the entire type. It would follow the "zero-cost abstraction"-principle. And for buffers
and child_data
you do the same, so it would be consistent. Also, data_type_ref
is not really "cheap/efficient by default".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got you point. I think we can remove the data_type
and rename data_type_ref
to data_type
. This keep it consistent with others like buffer
and child_data
. What do you think?
rust/src/array.rs
Outdated
fn from(data: ArrayDataRef) -> Self { | ||
debug_assert!(data.buffers().len() == 1); | ||
debug_assert!(data.child_data().len() == 1); | ||
let values = make_array(data.child_data()[0].clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could assert more things, for example that offsets[0] == 0
and offsets[array length + 1] == len(values array)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
rust/src/array.rs
Outdated
fn from(vec: Vec<T>) -> Self { | ||
PrimitiveArray::from(Buffer::from(vec)) | ||
impl<'a> From<Vec<&'a str>> for BinaryArray { | ||
fn from(v: Vec<&'a str>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, I suppose you'll want to add a BinaryBuilder at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will add that in a follow-up PR.
rust/src/array.rs
Outdated
null_count: 0, | ||
validity_bitmap: None, | ||
impl From<Vec<ArrayRef>> for StructArray { | ||
fn from(v: Vec<ArrayRef>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll probably want a variant of this method that also takes a vector of field names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is supposed to only used for testing, and I really want to add more tests when the ArrayBuilder
is implemented. For now, let me change this to accept Vec<(Field,ArrayRef)>
.
rust/src/array.rs
Outdated
for arr in &v { | ||
let idx = count.to_string(); | ||
count += 1; | ||
field_types.push(Field::new(&idx, arr.data().data_type(), false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is nullable
false? In Arrow you can have individual null elements in a non-null struct value...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some dummy value for testing. I'll change it so that the nullable
is passed in.
rust/src/array.rs
Outdated
fn from(v: Vec<ArrayRef>) -> Self { | ||
let mut field_types = vec![]; | ||
let mut count = 0; | ||
for arr in &v { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why &v
here while you're using plain v
in the other for
loop below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v
means to consume the unique reference. We cannot consume the reference twice so the first use needs to be &v
- just read it. We can change the second one to &v
as well and it should have no difference.
I'll have a look tomorrow. Thanks for the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply.
First of all, thank you very much for this PR. Great work! Could you please have a look at my comments, esp. regarding security-related questions?
rust/src/array.rs
Outdated
|
||
/// Returns a raw pointer to the values of this array. | ||
pub fn raw_values(&self) -> *const $native_ty { | ||
unsafe { mem::transmute(self.raw_values.get().offset(self.data.offset() as isize)) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but this would require the *const u8
is aligned correctly. I think we should check this here. Otherwise, you end up with undefined behavior (or on x86, at least with a SIGBUS).
rust/src/array_data.rs
Outdated
|
||
/// Returns a copy of the data type for this array data. | ||
pub fn data_type(&self) -> DataType { | ||
self.data_type.clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(IMHO!) In Rust, the user should decide if the clone is required or not. In many cases, you could read out information of the data type w/o cloning the entire type. It would follow the "zero-cost abstraction"-principle. And for buffers
and child_data
you do the same, so it would be consistent. Also, data_type_ref
is not really "cheap/efficient by default".
rust/src/util/bit_util.rs
Outdated
|
||
/// Returns the nearest multiple of `factor` that is `>=` than `num`. Here `factor` must | ||
/// be a power of 2. | ||
pub fn round_upto_power_of_2(num: i64, factor: i64) -> i64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this a private function, I think most users are already OK with https://doc.rust-lang.org/std/primitive.u32.html#method.next_power_of_two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
rust/src/util/bit_util.rs
Outdated
|
||
/// Returns whether bit at position `i` in `data` is set or not. | ||
#[inline] | ||
pub fn get_bit(data: &[u8], i: usize) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like you're implementing quite a solid amount of bitfield/-vector stuff here. There are loads of crates that do this and I think we should use them, for the following reason: your implementation might be very slow, if LLVM doesn't remove your division and modulo ops. And there might be other things (edge cases, optimizations) that I think we don't need to care if you just don't invent our own bitfields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can do that, but here we are just doing very simple stuff like get_bit
, set_bit
, so I don't see a strong need to depend on another crate just for now.
rust/src/array_data.rs
Outdated
} | ||
|
||
/// Returns the length (i.e., number of elements) of this array. | ||
pub fn length(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a) Rust uses len
for container sizes, so I think that's what we should call it.
b) why is the internal datatype i64
and you cast to usize
here? Let's say we don't support 16bit systems and length hopefully does not contain negative values (which you don't check in ArrayDataBuilder::len
, on both 32 and 64 bit this conversion can never fail). So why not storing usize
in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think we should use len
and don't return usize
here.
I followed the C++ ArrayData
structure and used i64
here, but it seems Arrow format specify that Array length should be stored as a 32-bit signed integer. @wesm : could you clarify why int64_t
is used for C++ ArrayData
length - seems int32_t
is enough?
Because of the format specification, we can't use usize
to store the length. Also, to my understanding Array is a logical concept and could contain one or more physical Buffers, therefore the length could go beyond the limit of usize
(e.g., you can have a 32-bit long Array on a 16-bit platform).
rust/src/array.rs
Outdated
let mut null_buffer = Vec::with_capacity(size); | ||
unsafe { | ||
null_buffer.set_len(size); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so now you have allocated a vector with undefined data it it, is that really what you want? I would rather prefer to have an explicit state here (like all-zero).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
rust/src/array.rs
Outdated
impl BinaryArray { | ||
/// Returns the element at index `i` as a byte slice. | ||
pub fn get_value(&self, mut i: i64) -> &[u8] { | ||
i += self.data.offset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh, that may overflow in production builds (overflows in Rust are only checked in debug builds) and lead to a security issue some lines below, because you read from a out-of-bounds position. Please add some checks here that this doesn't happen:
- check that the integer addition doesn't overflow ( https://doc.rust-lang.org/std/primitive.usize.html#method.checked_add )
- check that the initial
i
is actually in-range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! fixed.
rust/src/array.rs
Outdated
debug_assert!(data.buffers().len() == 1); | ||
debug_assert!(data.child_data().len() == 1); | ||
let values = make_array(data.child_data()[0].clone()); | ||
let value_offsets = data.buffers()[0].raw_data() as *const i32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if this pointer here is actually aligned and points to a valid location?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's similar to the last alignment comment. It will be good if we can add some alignment check but I don't know what to use in Rust. Suggestion will be appreciated.
@sunchao Sorry I'm a bit late to the party here but I'm working on getting up to speed and have set some time aside tomorrow to pull your branch and understand the refactor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran out of time to really give this a proper review today. It looks good overall and I've made a couple minor comments. I'll try and continue with the review during this week.
rust/src/buffer.rs
Outdated
unsafe { slice::from_raw_parts(self.data.offset(start as isize), end - start) } | ||
impl PartialEq for BufferData { | ||
fn eq(&self, other: &BufferData) -> bool { | ||
unsafe { memory::memcmp(self.ptr, other.ptr, self.len as usize) == 0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to compare self.len
to other.len
first before calling memcmp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Done.
rust/src/memory.rs
Outdated
@@ -68,6 +68,19 @@ pub fn free_aligned(p: *const u8) { | |||
} | |||
} | |||
|
|||
pub fn memcpy(dst: *mut u8, src: *const u8, len: usize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this method be marked unsafe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes agreed. Done.
Thanks for the great comments @pitrou @crepererum and @andygrove ! I'll try to address them soon.
No worries @andygrove - really appreciate it. |
Codecov Report
@@ Coverage Diff @@
## master #2330 +/- ##
==========================================
+ Coverage 84.73% 87.23% +2.49%
==========================================
Files 293 13 -280
Lines 45471 1347 -44124
==========================================
- Hits 38529 1175 -37354
+ Misses 6905 172 -6733
+ Partials 37 0 -37
Continue to review full report at Codecov.
|
@sunchao regarding alignment: for C++, I've put quite some effort in the ability to read arrow from partly untrusted sources (using flatbuffer checks and custom runtime checks). Also, I think the check is rather cheap and may improve debugging experience and UX. Regarding "what to use": honestly I thought a simple |
Thanks @crepererum ! will add the function and update the code. |
rust/src/array.rs
Outdated
@@ -209,10 +215,11 @@ macro_rules! def_primitive_array { | |||
impl<T: ArrowPrimitiveType> From<ArrayDataRef> for PrimitiveArray<T> { | |||
fn from(data: ArrayDataRef) -> Self { | |||
assert!(data.buffers().len() == 1); | |||
let values = data.buffers()[0].copy(); | |||
let raw_values = data.buffers()[0].raw_data(); | |||
debug_assert!(memory::is_aligned::<u8>(raw_values, mem::size_of::<T>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this an debug assert only? I also would like to protect my production code against these kind of errors.
Also, could you test this? There's should_panic
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point. I'll change this to assert
. For testing though, I need to find a way to bypass allocate_aligned
and inject some mis-aligned data for Buffer
. Let me think about it...
rust/src/array.rs
Outdated
@@ -288,7 +299,9 @@ impl From<ArrayDataRef> for ListArray { | |||
debug_assert!(data.buffers().len() == 1); | |||
debug_assert!(data.child_data().len() == 1); | |||
let values = make_array(data.child_data()[0].clone()); | |||
let value_offsets = data.buffers()[0].raw_data() as *const i32; | |||
let raw_value_offsets = data.buffers()[0].raw_data(); | |||
debug_assert!(memory::is_aligned(raw_value_offsets, mem::size_of::<i32>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
rust/src/array.rs
Outdated
@@ -364,11 +385,12 @@ impl BinaryArray { | |||
impl From<ArrayDataRef> for BinaryArray { | |||
fn from(data: ArrayDataRef) -> Self { | |||
assert!(data.buffers().len() == 2); | |||
let value_offsets = data.buffers()[0].raw_data() as *const u8 as *const i32; | |||
let raw_value_offsets = data.buffers()[0].raw_data(); | |||
debug_assert!(memory::is_aligned(raw_value_offsets, mem::size_of::<i32>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
rust/src/memory.rs
Outdated
@@ -79,6 +79,13 @@ extern "C" { | |||
pub fn memcmp(p1: *const u8, p2: *const u8, len: usize) -> i32; | |||
} | |||
|
|||
/// Check if the pointer `p` is aligned to offset `a`. | |||
pub fn is_aligned<T>(p: *const T, a: usize) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is from libcore, but could we have some rough tests for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
rust/src/array.rs
Outdated
/// Pointer to the value array. The lifetime of this must be <= to the value buffer | ||
/// stored in `data`, so it's safe to store. | ||
raw_values: RawPtrBox<u8>, | ||
_phantom: PhantomData<T>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not required anymore because of the change in RawPtrBox
. Beforehand, it was there because the compiler requires you to use all type parameters (otherwise, rust would have some soundness problems). See https://play.rust-lang.org/?gist=32d1891decd3e0dfcdf14f92ca123e63&version=stable&mode=debug&edition=2015 for what would have happened w/o the marker.
rust/benches/array_from_vec.rs
Outdated
@@ -37,7 +35,7 @@ fn array_from_vec(n: usize) { | |||
let arr_data = ArrayDataBuilder::new(DataType::Int32) | |||
.add_buffer(Buffer::from(v)) | |||
.build(); | |||
let _: PrimitiveArray<i32> = PrimitiveArray::from(Rc::new(arr_data)); | |||
criterion::black_box(PrimitiveArray::<i32>::from(arr_data)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a side note:
It's completely fine to use this to compare one change to another (like "did this PR result in any regression"), but it may not work to compare the rust against the C++ implementation, because the black box is not probably implemented under stable rust (see criterion source code and the reason why black_box
is unstable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @crepererum . Interesting to know more about black_box
. I think the purpose for this benchmark is exactly the former case: check whether there's any perf regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @crepererum . Really appreciate these great comments!
rust/src/array_data.rs
Outdated
) -> Self { | ||
if null_count < 0 { | ||
null_count = if let Some(ref buf) = null_bit_buffer { | ||
count_set_bits(buf.data()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually @pitrou : the Cell
is not thread-safe in this case so we cannot use it. I thought about using AtomicI64
but it is nightly-only. I'm not sure how significant this is for performance reason - can we leave it as a TODO for now?
rust/src/array.rs
Outdated
@@ -209,10 +215,11 @@ macro_rules! def_primitive_array { | |||
impl<T: ArrowPrimitiveType> From<ArrayDataRef> for PrimitiveArray<T> { | |||
fn from(data: ArrayDataRef) -> Self { | |||
assert!(data.buffers().len() == 1); | |||
let values = data.buffers()[0].copy(); | |||
let raw_values = data.buffers()[0].raw_data(); | |||
debug_assert!(memory::is_aligned::<u8>(raw_values, mem::size_of::<T>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point. I'll change this to assert
. For testing though, I need to find a way to bypass allocate_aligned
and inject some mis-aligned data for Buffer
. Let me think about it...
rust/src/array.rs
Outdated
/// Pointer to the value array. The lifetime of this must be <= to the value buffer | ||
/// stored in `data`, so it's safe to store. | ||
raw_values: RawPtrBox<u8>, | ||
_phantom: PhantomData<T>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes PhantomData
is no longer needed. Done.
rust/src/memory.rs
Outdated
@@ -79,6 +79,13 @@ extern "C" { | |||
pub fn memcmp(p1: *const u8, p2: *const u8, len: usize) -> i32; | |||
} | |||
|
|||
/// Check if the pointer `p` is aligned to offset `a`. | |||
pub fn is_aligned<T>(p: *const T, a: usize) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Sorry for dragging this for so long. I've addressed most of the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last tiny request.
rust/src/array.rs
Outdated
fn test_from_i32() { | ||
let a = PrimitiveArray::from(vec![15, 14, 13, 12, 11]); | ||
assert_eq!(5, a.len()); | ||
#[should_panic(expected = "")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just picking one should_panic
here, but this applies to all of them:
assert!
can also take a message, see https://doc.rust-lang.org/std/macro.assert.html , esp. the formatting example at the very end. BUT: since technically this wasn't solved before your PR, let's just agree on creating a ticket (like "rust asserts should all provide messages"), so we can get THIS PR merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. I'll file a separate JIRA for this later.
rust/benches/array_from_vec.rs
Outdated
for i in 0..n { | ||
v.push(i as i32); | ||
v.push((i % 0xffff) as u8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be i & 0xfff
. Yes, technically, this is the same for positive integers (which is ensured since n: usize
and therefore i: usize
), but modulo operator is a) usually slower (might be implemented using DIV
) and b) cannot be optimized that well.
See https://godbolt.org/g/bL2jJE (debug build)
and https://godbolt.org/g/bQi1Zs (release build)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting to know! Will change. I think you mean i & 0xffff
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've meant i & 0xffff
. Sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO good to go.
@andygrove you cool to merge this? |
@wesm Yes, I think so. I need to read Uwe's instructions again on how to merge as this will be my first one as a committer. I will try and get to this tomorrow, but Saturday at the latest. |
Ok, I will go ahead and merge, there will be other PRs to merge |
I just went through all the setup and I'm ready now to merge future PRs. |
This changes the existing `Buffer` class to be non-generic over type `T`, since a `Buffer` class should just represent a plain byte array and interpretation of the data within the buffer should be done on a higher level, such as in `Array`. While working on this, I found that I also need to make significant changes on the `Array` and `List` types, since they are currently heavily tied with the `Buffer<T>` implementation. The new implementation follows arrow-cpp and defines a `ArrayData` struct which provides the common operations on a Arrow array. Subtypes of `Array` then provide specific operations for the types they represent. For instance, one can get a primitive value at index `i` for `PrimitiveArray` type, or can get a column at index `i` for `StructArray`. I removed `List` since it's no longer necessary. Removed `PrimitiveArray::{min,max}` for now but plan to add them back. Author: Chao Sun <sunchao@uber.com> Closes apache#2330 from sunchao/ARROW-2583 and squashes the following commits: 91c580b <Chao Sun> Fix lint 0e8a8dc <Chao Sun> Address review comments 21b8d1d <Chao Sun> Fix lint 2493d12 <Chao Sun> Fix a few more issues and add more tests 383cc3e <Chao Sun> More refactoring 2ee3cf9 <Chao Sun> Fix lint a29ae4a <Chao Sun> Fix test for is_aligned c194165 <Chao Sun> Fix Buffer offset and test for Array alignment a3206cc <Chao Sun> Address review comments 1863448 <Chao Sun> Fix lint 1e8dab5 <Chao Sun> In is_aligned(), should use align_of instead of size_of 363e7cf <Chao Sun> Fix bench. Change Buffer#copy() to Buffer#clone() 042796b <Chao Sun> Add check for pointer alignment 18e5dea <Chao Sun> Address comments 51327fe <Chao Sun> Address comments ac782f1 <Chao Sun> Remove commented out code 08fb847 <Chao Sun> Fix to_bytes() collision and test failure c3c0f6c <Chao Sun> Fix style 83e1a1f <Chao Sun> Bring back min and max for PrimitiveArray 7e57fd0 <Chao Sun> ARROW-2583: Buffer should be typeless
This changes the existing
Buffer
class to be non-generic over typeT
, since aBuffer
class should just represent a plain byte array and interpretation of the data within the buffer should be done on a higher level, such as inArray
.While working on this, I found that I also need to make significant changes on the
Array
andList
types, since they are currently heavily tied with theBuffer<T>
implementation. The new implementation follows arrow-cpp and defines aArrayData
struct which provides the common operations on a Arrow array. Subtypes ofArray
then provide specific operations for the types they represent. For instance, one can get a primitive value at indexi
forPrimitiveArray
type, or can get a column at indexi
forStructArray
.I removed
List
since it's no longer necessary. RemovedPrimitiveArray::{min,max}
for now but plan to add them back.