Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2583: [Rust] Buffer should be typeless #2330

Closed
wants to merge 20 commits into from

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Jul 27, 2018

This changes the existing Buffer class to be non-generic over type T, since a Buffer class should just represent a plain byte array and interpretation of the data within the buffer should be done on a higher level, such as in Array.

While working on this, I found that I also need to make significant changes on the Array and List types, since they are currently heavily tied with the Buffer<T> implementation. The new implementation follows arrow-cpp and defines a ArrayData struct which provides the common operations on a Arrow array. Subtypes of Array then provide specific operations for the types they represent. For instance, one can get a primitive value at index i for PrimitiveArray type, or can get a column at index i for StructArray.

I removed List since it's no longer necessary. Removed PrimitiveArray::{min,max} for now but plan to add them back.

@wesm
Copy link
Member

wesm commented Jul 27, 2018

Seems like a reasonable refactoring; I don't know enough about Rust to know what would be the best design, but the way that the C++ library developed made having a generic data container for arrays (i.e. ArrayData) desirable architecturally.

@sunchao
Copy link
Member Author

sunchao commented Aug 1, 2018

Thanks @wesm . I think this PR is mostly ready for review - I still need to fix the Windows test failure.

@andygrove @pitrou @crepererum @xhochy : it would be great if you can take a look too. Thanks!

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I don't know much about Rust, but I left a few comments.

offset: i64,

/// The buffers for this array data. Note that depending on the array types, this
/// could hold different types of buffers (e.g., value buffer, value offset buffer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps say "kinds" instead of "types" if you want to avoid ambiguity about the Rust type of buffers.

/// at different positions.
buffers: Vec<Buffer>,

/// The child(ren) of this array. Only non-empty for `ListArray` and `StructArray`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say something like "Only non-empty for nested types, currently List and Struct".

/// The child(ren) of this array. Only non-empty for `ListArray` and `StructArray`.
child_data: Vec<ArrayDataRef>,

/// The null bit map. A `None` value for this indicates all values are non-null in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually spell it "bitmap".

) -> Self {
if null_count < 0 {
null_count = if let Some(ref buf) = null_bit_buffer {
count_set_bits(buf.data())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, in the C++ implementation we call count_set_bits lazily (when a caller first asks for the null count). YMMV.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought about that but moving this to null_count() will require it to take a &mut self instead of &self (since now it needs to modify self.null_count), which is not very desirable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently Rust allows interior mutability, which is adequate for such cached properties:
https://ricardomartins.cc/2016/06/08/interior-mutability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot about Cell. Will change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @pitrou : the Cell is not thread-safe in this case so we cannot use it. I thought about using AtomicI64 but it is nightly-only. I'm not sure how significant this is for performance reason - can we leave it as a TODO for now?


/// Returns a copy of the data type for this array data.
pub fn data_type(&self) -> DataType {
self.data_type.clone()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, why is this a copy? Data types are mutable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data types are not mutable. The other option is to return a reference here. However, if you need to store this reference anywhere in a struct, you'd have to add a lifetime parameter in that struct, which will get populated to any struct that use this struct, and so on. Personally I'd like to avoid this since it makes the design much more restrictive.

I think it's a good idea to add another version of this method that returns a reference though, in case anyone only needs to use the result in the local stack. Will do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(IMHO!) In Rust, the user should decide if the clone is required or not. In many cases, you could read out information of the data type w/o cloning the entire type. It would follow the "zero-cost abstraction"-principle. And for buffers and child_data you do the same, so it would be consistent. Also, data_type_ref is not really "cheap/efficient by default".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you point. I think we can remove the data_type and rename data_type_ref to data_type. This keep it consistent with others like buffer and child_data. What do you think?

fn from(data: ArrayDataRef) -> Self {
debug_assert!(data.buffers().len() == 1);
debug_assert!(data.child_data().len() == 1);
let values = make_array(data.child_data()[0].clone());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could assert more things, for example that offsets[0] == 0 and offsets[array length + 1] == len(values array).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

fn from(vec: Vec<T>) -> Self {
PrimitiveArray::from(Buffer::from(vec))
impl<'a> From<Vec<&'a str>> for BinaryArray {
fn from(v: Vec<&'a str>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, I suppose you'll want to add a BinaryBuilder at some point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will add that in a follow-up PR.

null_count: 0,
validity_bitmap: None,
impl From<Vec<ArrayRef>> for StructArray {
fn from(v: Vec<ArrayRef>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably want a variant of this method that also takes a vector of field names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to only used for testing, and I really want to add more tests when the ArrayBuilder is implemented. For now, let me change this to accept Vec<(Field,ArrayRef)>.

for arr in &v {
let idx = count.to_string();
count += 1;
field_types.push(Field::new(&idx, arr.data().data_type(), false));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is nullable false? In Arrow you can have individual null elements in a non-null struct value...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some dummy value for testing. I'll change it so that the nullable is passed in.

fn from(v: Vec<ArrayRef>) -> Self {
let mut field_types = vec![];
let mut count = 0;
for arr in &v {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why &v here while you're using plain v in the other for loop below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v means to consume the unique reference. We cannot consume the reference twice so the first use needs to be &v - just read it. We can change the second one to &v as well and it should have no difference.

@crepererum
Copy link
Contributor

I'll have a look tomorrow. Thanks for the PR.

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply.

First of all, thank you very much for this PR. Great work! Could you please have a look at my comments, esp. regarding security-related questions?


/// Returns a raw pointer to the values of this array.
pub fn raw_values(&self) -> *const $native_ty {
unsafe { mem::transmute(self.raw_values.get().offset(self.data.offset() as isize)) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this would require the *const u8 is aligned correctly. I think we should check this here. Otherwise, you end up with undefined behavior (or on x86, at least with a SIGBUS).


/// Returns a copy of the data type for this array data.
pub fn data_type(&self) -> DataType {
self.data_type.clone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(IMHO!) In Rust, the user should decide if the clone is required or not. In many cases, you could read out information of the data type w/o cloning the entire type. It would follow the "zero-cost abstraction"-principle. And for buffers and child_data you do the same, so it would be consistent. Also, data_type_ref is not really "cheap/efficient by default".


/// Returns the nearest multiple of `factor` that is `>=` than `num`. Here `factor` must
/// be a power of 2.
pub fn round_upto_power_of_2(num: i64, factor: i64) -> i64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this a private function, I think most users are already OK with https://doc.rust-lang.org/std/primitive.u32.html#method.next_power_of_two

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/// Returns whether bit at position `i` in `data` is set or not.
#[inline]
pub fn get_bit(data: &[u8], i: usize) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you're implementing quite a solid amount of bitfield/-vector stuff here. There are loads of crates that do this and I think we should use them, for the following reason: your implementation might be very slow, if LLVM doesn't remove your division and modulo ops. And there might be other things (edge cases, optimizations) that I think we don't need to care if you just don't invent our own bitfields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can do that, but here we are just doing very simple stuff like get_bit, set_bit, so I don't see a strong need to depend on another crate just for now.

}

/// Returns the length (i.e., number of elements) of this array.
pub fn length(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) Rust uses len for container sizes, so I think that's what we should call it.
b) why is the internal datatype i64 and you cast to usize here? Let's say we don't support 16bit systems and length hopefully does not contain negative values (which you don't check in ArrayDataBuilder::len, on both 32 and 64 bit this conversion can never fail). So why not storing usize in the first place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think we should use len and don't return usize here.

I followed the C++ ArrayData structure and used i64 here, but it seems Arrow format specify that Array length should be stored as a 32-bit signed integer. @wesm : could you clarify why int64_t is used for C++ ArrayData length - seems int32_t is enough?

Because of the format specification, we can't use usize to store the length. Also, to my understanding Array is a logical concept and could contain one or more physical Buffers, therefore the length could go beyond the limit of usize(e.g., you can have a 32-bit long Array on a 16-bit platform).

let mut null_buffer = Vec::with_capacity(size);
unsafe {
null_buffer.set_len(size);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so now you have allocated a vector with undefined data it it, is that really what you want? I would rather prefer to have an explicit state here (like all-zero).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

impl BinaryArray {
/// Returns the element at index `i` as a byte slice.
pub fn get_value(&self, mut i: i64) -> &[u8] {
i += self.data.offset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh, that may overflow in production builds (overflows in Rust are only checked in debug builds) and lead to a security issue some lines below, because you read from a out-of-bounds position. Please add some checks here that this doesn't happen:

  1. check that the integer addition doesn't overflow ( https://doc.rust-lang.org/std/primitive.usize.html#method.checked_add )
  2. check that the initial i is actually in-range.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! fixed.

debug_assert!(data.buffers().len() == 1);
debug_assert!(data.child_data().len() == 1);
let values = make_array(data.child_data()[0].clone());
let value_offsets = data.buffers()[0].raw_data() as *const i32;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this pointer here is actually aligned and points to a valid location?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's similar to the last alignment comment. It will be good if we can add some alignment check but I don't know what to use in Rust. Suggestion will be appreciated.

@andygrove
Copy link
Member

@sunchao Sorry I'm a bit late to the party here but I'm working on getting up to speed and have set some time aside tomorrow to pull your branch and understand the refactor.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran out of time to really give this a proper review today. It looks good overall and I've made a couple minor comments. I'll try and continue with the review during this week.

unsafe { slice::from_raw_parts(self.data.offset(start as isize), end - start) }
impl PartialEq for BufferData {
fn eq(&self, other: &BufferData) -> bool {
unsafe { memory::memcmp(self.ptr, other.ptr, self.len as usize) == 0 }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to compare self.len to other.len first before calling memcmp?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done.

@@ -68,6 +68,19 @@ pub fn free_aligned(p: *const u8) {
}
}

pub fn memcpy(dst: *mut u8, src: *const u8, len: usize) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this method be marked unsafe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed. Done.

@sunchao
Copy link
Member Author

sunchao commented Aug 6, 2018

Thanks for the great comments @pitrou @crepererum and @andygrove ! I'll try to address them soon.

I'll try and continue with the review during this week.

No worries @andygrove - really appreciate it.

@codecov-io
Copy link

codecov-io commented Aug 8, 2018

Codecov Report

Merging #2330 into master will increase coverage by 2.49%.
The diff coverage is 90.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2330      +/-   ##
==========================================
+ Coverage   84.73%   87.23%   +2.49%     
==========================================
  Files         293       13     -280     
  Lines       45471     1347   -44124     
==========================================
- Hits        38529     1175   -37354     
+ Misses       6905      172    -6733     
+ Partials       37        0      -37
Impacted Files Coverage Δ
rust/src/lib.rs 100% <ø> (ø) ⬆️
rust/src/util/test_util.rs 0% <0%> (ø)
rust/src/memory_pool.rs 100% <100%> (ø) ⬆️
rust/src/builder.rs 97.56% <100%> (-0.25%) ⬇️
rust/src/record_batch.rs 97.77% <100%> (-2.23%) ⬇️
rust/src/memory.rs 94.44% <100%> (+6.94%) ⬆️
rust/src/util/bit_util.rs 100% <100%> (ø)
rust/src/datatypes.rs 73.42% <100%> (+1.46%) ⬆️
rust/src/array.rs 83.89% <85.88%> (-1.69%) ⬇️
rust/src/bitmap.rs 94.87% <88.23%> (-5.13%) ⬇️
... and 288 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5fda431...91c580b. Read the comment docs.

@crepererum
Copy link
Contributor

@sunchao regarding alignment: for C++, I've put quite some effort in the ability to read arrow from partly untrusted sources (using flatbuffer checks and custom runtime checks). Also, I think the check is rather cheap and may improve debugging experience and UX. Regarding "what to use": honestly I thought a simple ptr % alignment == 0 would do, but that's wrong. Luckily, the Rust libcore impl. contains a fast-path for the check, which could easily be used (just place a helper function in our memory module). Please note that the check is simple, but the actual offset calculation may be hard (see this comment and the libcore code).

@sunchao
Copy link
Member Author

sunchao commented Aug 9, 2018

Thanks @crepererum ! will add the function and update the code.

@@ -209,10 +215,11 @@ macro_rules! def_primitive_array {
impl<T: ArrowPrimitiveType> From<ArrayDataRef> for PrimitiveArray<T> {
fn from(data: ArrayDataRef) -> Self {
assert!(data.buffers().len() == 1);
let values = data.buffers()[0].copy();
let raw_values = data.buffers()[0].raw_data();
debug_assert!(memory::is_aligned::<u8>(raw_values, mem::size_of::<T>()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this an debug assert only? I also would like to protect my production code against these kind of errors.

Also, could you test this? There's should_panic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. I'll change this to assert. For testing though, I need to find a way to bypass allocate_aligned and inject some mis-aligned data for Buffer. Let me think about it...

@@ -288,7 +299,9 @@ impl From<ArrayDataRef> for ListArray {
debug_assert!(data.buffers().len() == 1);
debug_assert!(data.child_data().len() == 1);
let values = make_array(data.child_data()[0].clone());
let value_offsets = data.buffers()[0].raw_data() as *const i32;
let raw_value_offsets = data.buffers()[0].raw_data();
debug_assert!(memory::is_aligned(raw_value_offsets, mem::size_of::<i32>()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -364,11 +385,12 @@ impl BinaryArray {
impl From<ArrayDataRef> for BinaryArray {
fn from(data: ArrayDataRef) -> Self {
assert!(data.buffers().len() == 2);
let value_offsets = data.buffers()[0].raw_data() as *const u8 as *const i32;
let raw_value_offsets = data.buffers()[0].raw_data();
debug_assert!(memory::is_aligned(raw_value_offsets, mem::size_of::<i32>()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -79,6 +79,13 @@ extern "C" {
pub fn memcmp(p1: *const u8, p2: *const u8, len: usize) -> i32;
}

/// Check if the pointer `p` is aligned to offset `a`.
pub fn is_aligned<T>(p: *const T, a: usize) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is from libcore, but could we have some rough tests for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/// Pointer to the value array. The lifetime of this must be <= to the value buffer
/// stored in `data`, so it's safe to store.
raw_values: RawPtrBox<u8>,
_phantom: PhantomData<T>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not required anymore because of the change in RawPtrBox. Beforehand, it was there because the compiler requires you to use all type parameters (otherwise, rust would have some soundness problems). See https://play.rust-lang.org/?gist=32d1891decd3e0dfcdf14f92ca123e63&version=stable&mode=debug&edition=2015 for what would have happened w/o the marker.

@@ -37,7 +35,7 @@ fn array_from_vec(n: usize) {
let arr_data = ArrayDataBuilder::new(DataType::Int32)
.add_buffer(Buffer::from(v))
.build();
let _: PrimitiveArray<i32> = PrimitiveArray::from(Rc::new(arr_data));
criterion::black_box(PrimitiveArray::<i32>::from(arr_data));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a side note:
It's completely fine to use this to compare one change to another (like "did this PR result in any regression"), but it may not work to compare the rust against the C++ implementation, because the black box is not probably implemented under stable rust (see criterion source code and the reason why black_box is unstable).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crepererum . Interesting to know more about black_box. I think the purpose for this benchmark is exactly the former case: check whether there's any perf regression.

Copy link
Member Author

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crepererum . Really appreciate these great comments!

) -> Self {
if null_count < 0 {
null_count = if let Some(ref buf) = null_bit_buffer {
count_set_bits(buf.data())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @pitrou : the Cell is not thread-safe in this case so we cannot use it. I thought about using AtomicI64 but it is nightly-only. I'm not sure how significant this is for performance reason - can we leave it as a TODO for now?

@@ -209,10 +215,11 @@ macro_rules! def_primitive_array {
impl<T: ArrowPrimitiveType> From<ArrayDataRef> for PrimitiveArray<T> {
fn from(data: ArrayDataRef) -> Self {
assert!(data.buffers().len() == 1);
let values = data.buffers()[0].copy();
let raw_values = data.buffers()[0].raw_data();
debug_assert!(memory::is_aligned::<u8>(raw_values, mem::size_of::<T>()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. I'll change this to assert. For testing though, I need to find a way to bypass allocate_aligned and inject some mis-aligned data for Buffer. Let me think about it...

/// Pointer to the value array. The lifetime of this must be <= to the value buffer
/// stored in `data`, so it's safe to store.
raw_values: RawPtrBox<u8>,
_phantom: PhantomData<T>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes PhantomData is no longer needed. Done.

@@ -79,6 +79,13 @@ extern "C" {
pub fn memcmp(p1: *const u8, p2: *const u8, len: usize) -> i32;
}

/// Check if the pointer `p` is aligned to offset `a`.
pub fn is_aligned<T>(p: *const T, a: usize) -> bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@sunchao sunchao changed the title [WIP] ARROW-2583: [Rust] Buffer should be typeless ARROW-2583: [Rust] Buffer should be typeless Aug 10, 2018
@sunchao
Copy link
Member Author

sunchao commented Aug 14, 2018

Sorry for dragging this for so long. I've addressed most of the comments.
@crepererum @pitrou @wesm @andygrove : could you take another look? Thanks!

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last tiny request.

fn test_from_i32() {
let a = PrimitiveArray::from(vec![15, 14, 13, 12, 11]);
assert_eq!(5, a.len());
#[should_panic(expected = "")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just picking one should_panic here, but this applies to all of them:
assert! can also take a message, see https://doc.rust-lang.org/std/macro.assert.html , esp. the formatting example at the very end. BUT: since technically this wasn't solved before your PR, let's just agree on creating a ticket (like "rust asserts should all provide messages"), so we can get THIS PR merged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. I'll file a separate JIRA for this later.

for i in 0..n {
v.push(i as i32);
v.push((i % 0xffff) as u8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be i & 0xfff. Yes, technically, this is the same for positive integers (which is ensured since n: usize and therefore i: usize), but modulo operator is a) usually slower (might be implemented using DIV) and b) cannot be optimized that well.

See https://godbolt.org/g/bL2jJE (debug build)
and https://godbolt.org/g/bQi1Zs (release build)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, interesting to know! Will change. I think you mean i & 0xffff right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've meant i & 0xffff. Sorry.

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO good to go.

@wesm
Copy link
Member

wesm commented Aug 15, 2018

@andygrove you cool to merge this?

@andygrove
Copy link
Member

@wesm Yes, I think so. I need to read Uwe's instructions again on how to merge as this will be my first one as a committer. I will try and get to this tomorrow, but Saturday at the latest.

@wesm
Copy link
Member

wesm commented Aug 17, 2018

Ok, I will go ahead and merge, there will be other PRs to merge

@wesm wesm closed this in 1209a80 Aug 17, 2018
@andygrove
Copy link
Member

I just went through all the setup and I'm ready now to merge future PRs.

stephanie-wang pushed a commit to stephanie-wang/arrow that referenced this pull request Aug 29, 2018
This changes the existing `Buffer` class to be non-generic over type `T`, since a `Buffer` class should just represent a plain byte array and interpretation of the data within the buffer should be done on a higher level, such as in `Array`.

While working on this, I found that I also need to make significant changes on the `Array` and `List` types, since they are currently heavily tied with the `Buffer<T>` implementation. The new implementation follows arrow-cpp and defines a `ArrayData` struct which provides the common operations on a Arrow array. Subtypes of `Array` then provide specific operations for the types they represent. For instance, one can get a primitive value at index `i` for `PrimitiveArray` type, or can get a column at index `i` for `StructArray`.

I removed `List` since it's no longer necessary. Removed `PrimitiveArray::{min,max}` for now but plan to add them back.

Author: Chao Sun <sunchao@uber.com>

Closes apache#2330 from sunchao/ARROW-2583 and squashes the following commits:

91c580b <Chao Sun> Fix lint
0e8a8dc <Chao Sun> Address review comments
21b8d1d <Chao Sun> Fix lint
2493d12 <Chao Sun> Fix a few more issues and add more tests
383cc3e <Chao Sun> More refactoring
2ee3cf9 <Chao Sun> Fix lint
a29ae4a <Chao Sun> Fix test for is_aligned
c194165 <Chao Sun> Fix Buffer offset and test for Array alignment
a3206cc <Chao Sun> Address review comments
1863448 <Chao Sun> Fix lint
1e8dab5 <Chao Sun> In is_aligned(), should use align_of instead of size_of
363e7cf <Chao Sun> Fix bench. Change Buffer#copy() to Buffer#clone()
042796b <Chao Sun> Add check for pointer alignment
18e5dea <Chao Sun> Address comments
51327fe <Chao Sun> Address comments
ac782f1 <Chao Sun> Remove commented out code
08fb847 <Chao Sun> Fix to_bytes() collision and test failure
c3c0f6c <Chao Sun> Fix style
83e1a1f <Chao Sun> Bring back min and max for PrimitiveArray
7e57fd0 <Chao Sun>  ARROW-2583:  Buffer should be typeless
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants