Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2521: [Rust] Refactor Rust API to use traits and generic to represent Array instead of enum #1971

Closed
wants to merge 26 commits into from

Conversation

Projects
None yet
7 participants
@andygrove
Copy link
Contributor

commented May 1, 2018

No description provided.

use super::memory::*;

/// Buffer<T> is essentially just a Vec<T> for fixed-width primitive types and the start of the
/// memory region is aligned at a 64-byte boundary
pub struct Buffer<T> {
pub struct Buffer<T>

This comment has been minimized.

Copy link
@pitrou

pitrou May 1, 2018

Contributor

I'm curious about the design choice here. In the C++ world, Arrow buffers are typeless (they just point to spans of bytes), it's arrays that give them their meaning. Do you benefit from giving buffers a specific type?

This comment has been minimized.

Copy link
@andygrove

andygrove May 1, 2018

Author Contributor

I think I do. I have type-safe Builder<T> where the push() method does not need to do any conversion per value being pushed, so it seems as efficient as possible. Then the memory is passed to the Buffer<T> where looking up values by index also seems as efficient as possible without any calculations on each lookup.

I am spending more time looking at the C++ implementation now to try and learn from it and keep the Rust version consistent. If you think there are arguments for using a Buffer<u8> instead of Buffer<T> I'd like to hear those too.

This comment has been minimized.

Copy link
@sunchao

sunchao May 14, 2018

Contributor

It seems this is based on the assumption that other languages have no problem interpreting the memory layout used to represent T in Rust, right? For instance, &'static str is a ArrowPrimitiveType - what's its memory layout look like? can it be correctly interpreted in languages such as Java?

Another thing is, this makes it hard to use a different implementation for a particular T. For instance, if we want to use bit-packing for boolean type, then this doesn't allow it, since bool always occupy 1 byte.

IMHO Buffer should only care about how to represent a continuous memory segment. Interpreting the memory layout into different logical types should be handled in a different module (e.g., array).

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

So the overall question here is: does Buffer know the memory layout for primitive types or does it not? And what are primitive types? For example --- and as @sunchao pointed out correctly --- strings are not that primitive, since the memory layout differs heavily amongst languages (e.g. C uses a pointer to a char-array where the last character is '\0', C++ stores the pointer to a char-array and it's length but in implementation-definied order, some Java versions use singelton objects and so on). And even if the types are primitive and are not packed (like booleans for example), we still have the endiannes problem (see Specs). So I think we may need to implement the conversion and read/write routines for every type and may be able to use macros for common cases (like u8, u16, u32, u64 are very similar).

Since we now have memory-IO for every type, the Buffer could be typeless. I think the conversion-free buffer-to-buffer copies are something that sound great, but I think the Array-type should deal with it, not the buffer. That keeps the design somehow clean and we don't need to repeat the types in that many layers.

This comment has been minimized.

Copy link
@andygrove

andygrove May 15, 2018

Author Contributor

OK so if I understand we want to use Buffer for all memory rather than Buffer<T> and then have Array take care of the type conversions. That makes sense to me, but again I think that is a separate PR from converting array from enum to trait/generic. I'm concerned that we're expanding the scope of this particular PR.

pub struct Buffer<T>
where
T: ArrowPrimitiveType,
{
/// Contiguous memory region holding instances of primitive T
data: *const T,
/// Number of elements in the buffer
len: i32,

This comment has been minimized.

Copy link
@pitrou

pitrou May 1, 2018

Contributor

I'm curious why you don't make the length 64-bit?

This comment has been minimized.

Copy link
@andygrove

andygrove May 1, 2018

Author Contributor

Naturally I would have used usize but I understood that Arrow requirs the type i32 to be used for array lengths?

This comment has been minimized.

Copy link
@pitrou

pitrou May 1, 2018

Contributor

The memory layout spec and the C++ implementation seem out of sync in that regard. The C++ implementation uses 64-bit ints for array sizes (it should perhaps use size_t or ssize_t). What must be 32-bit ints is offsets for list / string / binary arrays. @wesm

@xhochy

This comment has been minimized.

Copy link
Member

commented May 2, 2018

This needs a rebase

@andygrove

This comment has been minimized.

Copy link
Contributor Author

commented May 2, 2018

@pitrou I went ahead and changed to use usize everywhere (except for list offsets)

@andygrove

This comment has been minimized.

Copy link
Contributor Author

commented May 4, 2018

I wrote a blog post about this design and posted it to Reddit to get feedback. Let's see what people think over this weekend and if there are no objections I'd like to get this merged on Monday or soon after.

https://www.reddit.com/r/rust/comments/8gy45t/refactoring_apache_arrow_to_use_traits_and/

andygrove added some commits May 5, 2018

@crepererum
Copy link
Contributor

left a comment

Overall, I think this is a good change and makes the entire API must "rusty". My main (fixable) critique is the documentation. It's currently really hard to figure out the relation of the different traits and types and I think just a sentence here and there could improve this quite a lot.

use super::list::*;
use super::list_builder::*;

/// Array data type

This comment has been minimized.

Copy link
@crepererum

crepererum May 7, 2018

Contributor

If ArrayData is a "Array data type", then what exactly is an Array? Or is ArrayData the "data type of an Array"? This doesn't seem to be right either, because len, null_count, validity_bitmap and as_any don't make up a data type.


/// Array data type
pub trait ArrayData {
fn len(&self) -> usize;

This comment has been minimized.

Copy link
@crepererum

crepererum May 7, 2018

Contributor

All functions here lack documentation.

}
}

/// Array of T

This comment has been minimized.

Copy link
@crepererum

crepererum May 7, 2018

Contributor

Again, what's the difference to the Array type below?

}
}

///// This method mostly just used for unit tests

This comment has been minimized.

Copy link
@crepererum

crepererum May 7, 2018

Contributor

That's what you think, but since it's public, people will use it for all kinds of things. So I suggest you just remove this comment altogether.


impl RecordBatch {
pub fn new(schema: Rc<Schema>, columns: Vec<Rc<Array>>) -> Self {
assert!(columns.len() > 0);

This comment has been minimized.

Copy link
@crepererum

crepererum May 7, 2018

Contributor

I think you're missing some sanity checks here. At least, you must ensure that all arrays have the same length, otherwise, neither the container name ("batch") nor the implementation of num_rows make any sense.

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

🔝 this still is a TODO.

/// Primitive type (ints, floats, strings)
pub trait ArrowPrimitiveType: Copy + PartialOrd + 'static {}

impl ArrowPrimitiveType for bool {}

This comment has been minimized.

Copy link
@crepererum

crepererum May 7, 2018

Contributor

Your design choice of using a trait instead of an enum has some important implications:

@andygrove

This comment has been minimized.

Copy link
Contributor Author

commented May 9, 2018

@crepererum Thanks for reviewing and the great feedback. I'll work to address these points over this coming weekend.

andygrove added some commits May 12, 2018

@andygrove

This comment has been minimized.

Copy link
Contributor Author

commented May 12, 2018

@crepererum This is ready for another review

@codecov-io

This comment has been minimized.

Copy link

commented May 12, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@15e4811). Click here to learn what that means.
The diff coverage is 86.15%.

Impacted file tree graph

@@           Coverage Diff            @@
##             master   #1971   +/-   ##
========================================
  Coverage          ?   87.2%           
========================================
  Files             ?      12           
  Lines             ?     883           
  Branches          ?       0           
========================================
  Hits              ?     770           
  Misses            ?     113           
  Partials          ?       0
Impacted Files Coverage Δ
rust/src/lib.rs 100% <ø> (ø)
rust/src/record_batch.rs 100% <100%> (ø)
rust/src/bitmap.rs 100% <100%> (ø)
rust/src/list.rs 100% <100%> (ø)
rust/src/builder.rs 97.81% <100%> (ø)
rust/src/list_builder.rs 100% <100%> (ø)
rust/src/datatypes.rs 71.83% <53.84%> (ø)
rust/src/array.rs 84.26% <82.73%> (ø)
rust/src/buffer.rs 89.14% <95%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15e4811...a04d66a. Read the comment docs.

@sunchao
Copy link
Contributor

left a comment

Thanks @andygrove ! I'm interested in this too as I'm looking at how to integrate this with parquet-rs. Left some comments.

}
}
/// Array of T
pub struct BufferArray<T: ArrowPrimitiveType> {

This comment has been minimized.

Copy link
@sunchao

sunchao May 14, 2018

Contributor

wonder if it's better to call this PrimitiveArray?

This comment has been minimized.

Copy link
@kszucs

kszucs May 15, 2018

Member

+1 BufferArray is a bit misleading

use super::memory::*;

/// Buffer<T> is essentially just a Vec<T> for fixed-width primitive types and the start of the
/// memory region is aligned at a 64-byte boundary
pub struct Buffer<T> {
pub struct Buffer<T>

This comment has been minimized.

Copy link
@sunchao

sunchao May 14, 2018

Contributor

It seems this is based on the assumption that other languages have no problem interpreting the memory layout used to represent T in Rust, right? For instance, &'static str is a ArrowPrimitiveType - what's its memory layout look like? can it be correctly interpreted in languages such as Java?

Another thing is, this makes it hard to use a different implementation for a particular T. For instance, if we want to use bit-packing for boolean type, then this doesn't allow it, since bool always occupy 1 byte.

IMHO Buffer should only care about how to represent a continuous memory segment. Interpreting the memory layout into different logical types should be handled in a different module (e.g., array).

null_count: i32,
/// If null_count is greater than zero then the validity_bitmap will be Some(Bitmap)
/// Array of List<T>
pub struct ListArray<T: ArrowPrimitiveType> {

This comment has been minimized.

Copy link
@sunchao

sunchao May 14, 2018

Contributor

Seems there's no constructor to initialize null_count and validity_bitmap (validity_bitmap is always None and null_count is always 0). Should we add one?

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

As far as I remember, null_count is a cached count extracted from validity_bitmap. I'm not sure we even need this as a member here, since this could also be handled by the BitMap type, and a member method ListArray::null_count(&self) -> usize could return 0 if validity_bitmap is None. For now, this may be OK and this could be done in another PR.

This comment has been minimized.

Copy link
@andygrove

andygrove May 15, 2018

Author Contributor

Buffer<T> does not support strings. It only supports fixed-width types (bool, ints, floats). Strings are represented by List<u8>. I believe the Arrow spec states that all languages must support the same endianness for these types to ensure compatibility. Arrow doesn't have a bit-packed boolean type AFAIK.

Regarding the null count / constructor questions, we generally need to use the builders to build Buffer/List arrays that contain null values. There is more we can do I'm sure but the focus of this PR is for changing the underlying data structures from enum to trait/generics. We didn't have great null handling before either and I think we should handle that in a separate PR once we've taken care of this refactor if there are no objections.

This comment has been minimized.

Copy link
@xhochy

xhochy May 15, 2018

Member

Arrow doesn't have a bit-packed boolean type AFAIK.

No, the boolean type is always bit-packed according to the spec and implemented as such in Java and C++

This comment has been minimized.

Copy link
@crepererum

crepererum May 15, 2018

Contributor

OK, but this isn't implemented right now and we could do this in another PR. So this boolean types probably won't be primitive types. (that's one possible solution)

};
/// Trait for dealing with different types of Array at runtime when the type of the
/// array is not known in advance
pub trait Array {

This comment has been minimized.

Copy link
@sunchao

sunchao May 14, 2018

Contributor

do you plan to add more methods? such as slice, is_null, etc? If so, perhaps add a TODO?

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

I don't think we should add methods as we need, not because we think that we may --- under some circumstances --- need them someday. So a TODO is not required.

@crepererum
Copy link
Contributor

left a comment

I'm afraid @sunchao brought up a very important question: does the buffer contain type information or not? And at which point to we do the layout-conversion? This needs to be addressed.

};
/// Trait for dealing with different types of Array at runtime when the type of the
/// array is not known in advance
pub trait Array {

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

I don't think we should add methods as we need, not because we think that we may --- under some circumstances --- need them someday. So a TODO is not required.

null_count: i32,
/// If null_count is greater than zero then the validity_bitmap will be Some(Bitmap)
/// Array of List<T>
pub struct ListArray<T: ArrowPrimitiveType> {

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

As far as I remember, null_count is a cached count extracted from validity_bitmap. I'm not sure we even need this as a member here, since this could also be handled by the BitMap type, and a member method ListArray::null_count(&self) -> usize could return 0 if validity_bitmap is None. For now, this may be OK and this could be done in another PR.

use super::memory::*;

/// Buffer<T> is essentially just a Vec<T> for fixed-width primitive types and the start of the
/// memory region is aligned at a 64-byte boundary
pub struct Buffer<T> {
pub struct Buffer<T>

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

So the overall question here is: does Buffer know the memory layout for primitive types or does it not? And what are primitive types? For example --- and as @sunchao pointed out correctly --- strings are not that primitive, since the memory layout differs heavily amongst languages (e.g. C uses a pointer to a char-array where the last character is '\0', C++ stores the pointer to a char-array and it's length but in implementation-definied order, some Java versions use singelton objects and so on). And even if the types are primitive and are not packed (like booleans for example), we still have the endiannes problem (see Specs). So I think we may need to implement the conversion and read/write routines for every type and may be able to use macros for common cases (like u8, u16, u32, u64 are very similar).

Since we now have memory-IO for every type, the Buffer could be typeless. I think the conversion-free buffer-to-buffer copies are something that sound great, but I think the Array-type should deal with it, not the buffer. That keeps the design somehow clean and we don't need to repeat the types in that many layers.


impl RecordBatch {
pub fn new(schema: Rc<Schema>, columns: Vec<Rc<Array>>) -> Self {
assert!(columns.len() > 0);

This comment has been minimized.

Copy link
@crepererum

crepererum May 14, 2018

Contributor

🔝 this still is a TODO.

@crepererum

This comment has been minimized.

Copy link
Contributor

commented May 15, 2018

@andygrove thank you for the thoughtful response. Considering your explanations, I'm fine this the PR. The RecordBatch comment should still be resolved though. Thanks a lot for your effort!

@andygrove

This comment has been minimized.

Copy link
Contributor Author

commented May 15, 2018

Thanks @crepererum! I've pushed a fix for the RecordBatch TODO and will file JIRAs today for the other issues that we discussed.

@crepererum

This comment has been minimized.

Copy link
Contributor

commented May 15, 2018

@sunchao

This comment has been minimized.

Copy link
Contributor

commented May 15, 2018

Merging this first sounds good to me too. Thanks @andygrove for the effort!

@andygrove

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2018

@crepererum Are you OK to approve this PR now?

@sunchao Thanks. I think we can start to move a little faster once this PR is merged.

@sunchao

This comment has been minimized.

Copy link
Contributor

commented May 16, 2018

@andygrove one minor thing: would you consider renaming BufferArray to PrimitiveArray? It seems the latter is more intuitive.

andygrove added some commits May 16, 2018

@xhochy

xhochy approved these changes May 17, 2018

Copy link
Member

left a comment

+1, LGTM

For me it looks like all points from @crepererum were addressed and only the formatting was still outstanding (which is also fixed now).

@xhochy xhochy closed this in 635ee1f May 17, 2018

fjetter added a commit to fjetter/arrow that referenced this pull request May 25, 2018

ARROW-2521: [Rust] Refactor Rust API to use traits and generic to rep…
…resent Array instead of enum

Author: Andy Grove <andygrove73@gmail.com>

Closes apache#1971 from andygrove/refactor_rust_api_v2 and squashes the following commits:

a04d66a <Andy Grove> cargo fmt with 1.26.0
f3f71dd <Andy Grove> Rename BufferArray to PrimitiveArray
10714a1 <Andy Grove> cargo fmt
b2d9e42 <Andy Grove> add assertions to RecordBatch
d577510 <Andy Grove> Remove need to clone array
be3a981 <Andy Grove> cargo fmt
22f907a <Andy Grove> Renaming structs and traits and adding documentation
4add4f0 <Andy Grove> Revert "Add type coercion helper method"
51270de <Andy Grove> Add type coercion helper method
cc40ba4 <Andy Grove> Removing macros, implemented min/max for arrays of primitives
01bc953 <Andy Grove> implement min/max for primitive array
b2659b1 <Andy Grove> run cargo fmt with stable rust
66c016e <Andy Grove> use usize instead of i32 (except for list offsets)
dbe49a7 <Andy Grove> Rebase
d1bfdca <Andy Grove> Merge branch 'master' of github.com:andygrove/arrow
2bae169 <Andy Grove> Refactor Rust API to use traits and generic to represent Array instead of enum
52de6a1 <Andy Grove> Merge branch 'master' of github.com:andygrove/arrow
0e2606b <Andy Grove> Merge remote-tracking branch 'upstream/master'
d883da2 <Andy Grove> Merge remote-tracking branch 'upstream/master'
589ef71 <Andy Grove> Merge remote-tracking branch 'upstream/master'
bd4fbb5 <Andy Grove> Merge remote-tracking branch 'upstream/master'
9c8a10a <Andy Grove> Merge remote-tracking branch 'upstream/master'
05592f8 <Andy Grove> Merge remote-tracking branch 'upstream/master'
8c0e698 <Andy Grove> Merge remote-tracking branch 'upstream/master'
31ef90b <Andy Grove> Merge remote-tracking branch 'upstream/master'
2f87c70 <Andy Grove> Fix build - add missing import
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.