Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3347: [Rust] Implement PrimitiveArrayBuilder #2858

Closed
wants to merge 6 commits into from

Conversation

Projects
None yet
6 participants
@paddyhoran
Copy link
Contributor

paddyhoran commented Oct 29, 2018

Adds builder for PrimitiveArray.

@sunchao has mentioned that it's unfortunate that we have to rely on macros to define the impl block for types implementing ArrowPrimitiveType. When specialization lands in stable we can remove much/all of this but for now we have to rely on macros.

This implementation mostly focuses on being correct. However, maybe we should add push_value_raw and push_null_raw and allow the caller to handle updating the bitmap (i.e. avoid checking if the bitmap is_some on every push)? If so, I can add this as a separate PR (along with other optimizations).

@paddyhoran

This comment has been minimized.

Copy link
Contributor Author

paddyhoran commented Oct 29, 2018

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Oct 29, 2018

Codecov Report

Merging #2858 into master will increase coverage by 0.93%.
The diff coverage is 92.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2858      +/-   ##
==========================================
+ Coverage   87.55%   88.49%   +0.93%     
==========================================
  Files         410      361      -49     
  Lines       63486    61702    -1784     
==========================================
- Hits        55586    54602     -984     
+ Misses       7828     7100     -728     
+ Partials       72        0      -72
Impacted Files Coverage Δ
rust/src/array.rs 87.61% <100%> (+0.11%) ⬆️
rust/src/builder.rs 94.17% <92.14%> (-1.53%) ⬇️
cpp/src/arrow/status.cc 43.54% <0%> (-4.67%) ⬇️
cpp/src/arrow/python/io.cc 94.91% <0%> (-3.11%) ⬇️
python/pyarrow/tests/test_io.py 98.68% <0%> (-0.47%) ⬇️
python/pyarrow/ipc.pxi 68.36% <0%> (-0.29%) ⬇️
python/pyarrow/tests/test_cuda.py 1.75% <0%> (-0.06%) ⬇️
python/pyarrow/_plasma.pyx 64.88% <0%> (-0.03%) ⬇️
cpp/src/arrow/io/hdfs.cc 0.32% <0%> (-0.01%) ⬇️
python/pyarrow/lib.pxd 0% <0%> (ø) ⬆️
... and 72 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d61988d...42841fd. Read the comment docs.

T: ArrowPrimitiveType,
{
values_builder: BufferBuilder<T>,
bitmap_builder: Option<BufferBuilder<bool>>,

This comment has been minimized.

Copy link
@sunchao

sunchao Oct 29, 2018

Contributor

Not sure how much we can save by making this an option, and having multiple checks in push* doesn't look good - can we start with a allocated bitmap_builder and optimize later if this is indeed an issue?

This comment has been minimized.

Copy link
@paddyhoran

paddyhoran Oct 29, 2018

Author Contributor

Yea, I was on the fence about this. Sounds good, I'll remove the Option


/// Pushes a value of type T into the builder
pub fn push(&mut self, v: $native_ty) -> Result<()> {
if self.bitmap_builder.is_some() {

This comment has been minimized.

Copy link
@sunchao

sunchao Oct 29, 2018

Contributor

We can use if let syntax - it looks nicer :)

This comment has been minimized.

Copy link
@paddyhoran

paddyhoran Oct 29, 2018

Author Contributor

+1

}

/// Pushes an Option<T> into the builder
pub fn push_option(&mut self, v: Option<$native_ty>) -> Result<()> {

This comment has been minimized.

Copy link
@sunchao

sunchao Oct 29, 2018

Contributor

Like C++, perhaps we can have a version with the following interface:

push_values(&mut self, values: &[$native_ty], is_valid: &[u8])

which can allow us to efficiently memcpy the values and null array.

This comment has been minimized.

Copy link
@paddyhoran

paddyhoran Oct 29, 2018

Author Contributor

+1

This comment has been minimized.

Copy link
@paddyhoran

paddyhoran Oct 29, 2018

Author Contributor

@sunchao ok with you if we add this in a follow up PR?

This comment has been minimized.

Copy link
@sunchao

sunchao Oct 29, 2018

Contributor

Sure. SGTM.

This comment has been minimized.

Copy link
@kszucs

This comment has been minimized.

Copy link
@paddyhoran

paddyhoran Nov 1, 2018

Author Contributor

Thanks @kszucs, was waiting until it was merged...

}
}

/// Returns the capacity of this builder measured in slots of type T

This comment has been minimized.

Copy link
@sunchao

sunchao Oct 29, 2018

Contributor

nit: can we use T (with backticks) in all the places?

This comment has been minimized.

Copy link
@paddyhoran

paddyhoran Oct 29, 2018

Author Contributor

Sure, agreed.

Ok(())
}

/// Pushes an Option<T> into the builder

This comment has been minimized.

Copy link
@sunchao

sunchao Oct 29, 2018

Contributor

nit: also Option<T>.

This comment has been minimized.

Copy link
@paddyhoran

paddyhoran Oct 29, 2018

Author Contributor

+1

@paddyhoran

This comment has been minimized.

Copy link
Contributor Author

paddyhoran commented Oct 29, 2018

@sunchao, thanks for the review, appreciated. Will update soon

@crepererum

This comment has been minimized.

Copy link
Contributor

crepererum commented Oct 30, 2018

The underlying BufferBuilder is already wrong here, but I think capacity and len should be usize (or at least u64 if you like to overflow on 32bit systems). Maybe this could be fixed in a follow-up PR.

@paddyhoran

This comment has been minimized.

Copy link
Contributor Author

paddyhoran commented Oct 30, 2018

@crepererum I'm not sure about this, the spec says that lengths should be i32, I believe that this is to ensure compatibility with different implementations, in particular, with Java.

But different implementations can provide support up to i64, for instance, C++ seems to use i64. I think that we may need to add assertions to guard against overflows though (I recall seeing this in the C++ implementation but I can't find it now)?

@xhochy or @wesm could you comment on this?

@sunchao
Copy link
Contributor

sunchao left a comment

Thanks @paddyhoran . LGTM except one nit.

Show resolved Hide resolved rust/src/builder.rs Outdated
@xhochy

This comment has been minimized.

Copy link
Member

xhochy commented Oct 31, 2018

Implementations are free to support larger than i32 lengths as long as they stay inside their scope. But you need to be able to split these things up when you pass Arrow data to other implementations. You can pass these i64 length buffers to C++. But in the case that you deal with "any" Arrow implementation, you should only pass i32 sized buffers along.

@paddyhoran

This comment has been minimized.

Copy link
Contributor Author

paddyhoran commented Nov 1, 2018

@crepererum my thinking is that as BufferBuilder is generic over T so len and capacity are measuring array slots and so i64 is used. The resulting Buffer is type-less and so it's len is usize.

Does this seem right to you or do you still see an issue?

@crepererum

This comment has been minimized.

Copy link
Contributor

crepererum commented Nov 1, 2018

I still don't see how "array slots" can be negative though. Same goes for everything that @xhochy said. How can a length be negative?

Not matter how we decide, we have basically two options:
a) use signed integers but assert that the are not negative (if nobody comes up with a good reason why this should be the case)
b) use unsigned integer but check that they don't exceed std::i32::MAX, although I'm kinda unhappy about the fact that we only support 2G elements.

@xhochy

This comment has been minimized.

Copy link
Member

xhochy commented Nov 1, 2018

These lengths cannot be negative. We use signed types due to some weird language behaviours in Java and C++. Sometimes, e.g. for the null count, we use -1 as a placeholder value that we need to compute this value first.

Thus feel free in the Rust implementation to use one of the things @crepererum proposed.

@paddyhoran

This comment has been minimized.

Copy link
Contributor Author

paddyhoran commented Nov 1, 2018

Thanks @xhochy. @crepererum the jira is here.

Is this PR ready to merge then?

@crepererum

This comment has been minimized.

Copy link
Contributor

crepererum commented Nov 1, 2018

Great. Thanks @paddyhoran for the work, patience and issue creation. Thanks @xhochy for the clarifications. IMHO the PR is now ready to merge.

paddyhoran added some commits Oct 27, 2018

@kszucs kszucs force-pushed the paddyhoran:ARROW-3347 branch from 42841fd to 13b8b7d Nov 1, 2018

@kszucs

This comment has been minimized.

Copy link
Member

kszucs commented Nov 1, 2018

Rebased on top of #2868

@sunchao

This comment has been minimized.

Copy link
Contributor

sunchao commented Nov 1, 2018

But you need to be able to split these things up when you pass Arrow data to other implementations.

@xhochy Is there any existing example on this? I assume if we choose usize on a 64-bit platform, we have to handle this conversion somehow in order to be truly compatible with the standard, right?

@kszucs

kszucs approved these changes Nov 1, 2018

@kszucs kszucs closed this in 919119f Nov 1, 2018

@kszucs

This comment has been minimized.

Copy link
Member

kszucs commented Nov 1, 2018

Thanks for the PR and the reviews!

@kszucs

This comment has been minimized.

Copy link
Member

kszucs commented Nov 1, 2018

@sunchao We should add rust to the integration testing suite: https://github.com/apache/arrow/tree/master/integration

@sunchao

This comment has been minimized.

Copy link
Contributor

sunchao commented Nov 1, 2018

@sunchao We should add rust to the integration testing suite: https://github.com/apache/arrow/tree/master/integration

Sure. Filed ARROW-3688.

@paddyhoran paddyhoran deleted the paddyhoran:ARROW-3347 branch Nov 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.