Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3347: [Rust] Implement PrimitiveArrayBuilder #2858

Closed
wants to merge 6 commits into from

Conversation

paddyhoran
Copy link
Contributor

@paddyhoran paddyhoran commented Oct 29, 2018

Adds builder for PrimitiveArray.

@sunchao has mentioned that it's unfortunate that we have to rely on macros to define the impl block for types implementing ArrowPrimitiveType. When specialization lands in stable we can remove much/all of this but for now we have to rely on macros.

This implementation mostly focuses on being correct. However, maybe we should add push_value_raw and push_null_raw and allow the caller to handle updating the bitmap (i.e. avoid checking if the bitmap is_some on every push)? If so, I can add this as a separate PR (along with other optimizations).

@paddyhoran
Copy link
Contributor Author

@codecov-io
Copy link

codecov-io commented Oct 29, 2018

Codecov Report

Merging #2858 into master will increase coverage by 0.93%.
The diff coverage is 92.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2858      +/-   ##
==========================================
+ Coverage   87.55%   88.49%   +0.93%     
==========================================
  Files         410      361      -49     
  Lines       63486    61702    -1784     
==========================================
- Hits        55586    54602     -984     
+ Misses       7828     7100     -728     
+ Partials       72        0      -72
Impacted Files Coverage Δ
rust/src/array.rs 87.61% <100%> (+0.11%) ⬆️
rust/src/builder.rs 94.17% <92.14%> (-1.53%) ⬇️
cpp/src/arrow/status.cc 43.54% <0%> (-4.67%) ⬇️
cpp/src/arrow/python/io.cc 94.91% <0%> (-3.11%) ⬇️
python/pyarrow/tests/test_io.py 98.68% <0%> (-0.47%) ⬇️
python/pyarrow/ipc.pxi 68.36% <0%> (-0.29%) ⬇️
python/pyarrow/tests/test_cuda.py 1.75% <0%> (-0.06%) ⬇️
python/pyarrow/_plasma.pyx 64.88% <0%> (-0.03%) ⬇️
cpp/src/arrow/io/hdfs.cc 0.32% <0%> (-0.01%) ⬇️
python/pyarrow/lib.pxd 0% <0%> (ø) ⬆️
... and 72 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d61988d...42841fd. Read the comment docs.

T: ArrowPrimitiveType,
{
values_builder: BufferBuilder<T>,
bitmap_builder: Option<BufferBuilder<bool>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how much we can save by making this an option, and having multiple checks in push* doesn't look good - can we start with a allocated bitmap_builder and optimize later if this is indeed an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I was on the fence about this. Sounds good, I'll remove the Option


/// Pushes a value of type T into the builder
pub fn push(&mut self, v: $native_ty) -> Result<()> {
if self.bitmap_builder.is_some() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use if let syntax - it looks nicer :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}

/// Pushes an Option<T> into the builder
pub fn push_option(&mut self, v: Option<$native_ty>) -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like C++, perhaps we can have a version with the following interface:

push_values(&mut self, values: &[$native_ty], is_valid: &[u8])

which can allow us to efficiently memcpy the values and null array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunchao ok with you if we add this in a follow up PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. SGTM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kszucs, was waiting until it was merged...

}
}

/// Returns the capacity of this builder measured in slots of type T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we use T (with backticks) in all the places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, agreed.

Ok(())
}

/// Pushes an Option<T> into the builder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: also Option<T>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@paddyhoran
Copy link
Contributor Author

@sunchao, thanks for the review, appreciated. Will update soon

@crepererum
Copy link
Contributor

The underlying BufferBuilder is already wrong here, but I think capacity and len should be usize (or at least u64 if you like to overflow on 32bit systems). Maybe this could be fixed in a follow-up PR.

@paddyhoran
Copy link
Contributor Author

@crepererum I'm not sure about this, the spec says that lengths should be i32, I believe that this is to ensure compatibility with different implementations, in particular, with Java.

But different implementations can provide support up to i64, for instance, C++ seems to use i64. I think that we may need to add assertions to guard against overflows though (I recall seeing this in the C++ implementation but I can't find it now)?

@xhochy or @wesm could you comment on this?

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @paddyhoran . LGTM except one nit.

rust/src/builder.rs Outdated Show resolved Hide resolved
@xhochy
Copy link
Member

xhochy commented Oct 31, 2018

Implementations are free to support larger than i32 lengths as long as they stay inside their scope. But you need to be able to split these things up when you pass Arrow data to other implementations. You can pass these i64 length buffers to C++. But in the case that you deal with "any" Arrow implementation, you should only pass i32 sized buffers along.

@paddyhoran
Copy link
Contributor Author

paddyhoran commented Nov 1, 2018

@crepererum my thinking is that as BufferBuilder is generic over T so len and capacity are measuring array slots and so i64 is used. The resulting Buffer is type-less and so it's len is usize.

Does this seem right to you or do you still see an issue?

@crepererum
Copy link
Contributor

I still don't see how "array slots" can be negative though. Same goes for everything that @xhochy said. How can a length be negative?

Not matter how we decide, we have basically two options:
a) use signed integers but assert that the are not negative (if nobody comes up with a good reason why this should be the case)
b) use unsigned integer but check that they don't exceed std::i32::MAX, although I'm kinda unhappy about the fact that we only support 2G elements.

@xhochy
Copy link
Member

xhochy commented Nov 1, 2018

These lengths cannot be negative. We use signed types due to some weird language behaviours in Java and C++. Sometimes, e.g. for the null count, we use -1 as a placeholder value that we need to compute this value first.

Thus feel free in the Rust implementation to use one of the things @crepererum proposed.

@paddyhoran
Copy link
Contributor Author

Thanks @xhochy. @crepererum the jira is here.

Is this PR ready to merge then?

@crepererum
Copy link
Contributor

Great. Thanks @paddyhoran for the work, patience and issue creation. Thanks @xhochy for the clarifications. IMHO the PR is now ready to merge.

@kszucs
Copy link
Member

kszucs commented Nov 1, 2018

Rebased on top of #2868

@sunchao
Copy link
Member

sunchao commented Nov 1, 2018

But you need to be able to split these things up when you pass Arrow data to other implementations.

@xhochy Is there any existing example on this? I assume if we choose usize on a 64-bit platform, we have to handle this conversion somehow in order to be truly compatible with the standard, right?

@kszucs kszucs closed this in 919119f Nov 1, 2018
@kszucs
Copy link
Member

kszucs commented Nov 1, 2018

Thanks for the PR and the reviews!

@kszucs
Copy link
Member

kszucs commented Nov 1, 2018

@sunchao We should add rust to the integration testing suite: https://github.com/apache/arrow/tree/master/integration

@sunchao
Copy link
Member

sunchao commented Nov 1, 2018

@sunchao We should add rust to the integration testing suite: https://github.com/apache/arrow/tree/master/integration

Sure. Filed ARROW-3688.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants