Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6031: [Java] Support iterating a vector by ArrowBufPointer #4950

Closed
wants to merge 1 commit into from

Conversation

liyafan82
Copy link
Contributor

Provide the functionality to traverse a vector (fixed-width vector & variable-width vector) by an iterator. This is convenient for scenarios when accessing vector elements in sequence.

@codecov-io
Copy link

codecov-io commented Jul 26, 2019

Codecov Report

Merging #4950 into master will increase coverage by 2.04%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4950      +/-   ##
==========================================
+ Coverage   87.64%   89.68%   +2.04%     
==========================================
  Files        1030      690     -340     
  Lines      148327   103813   -44514     
  Branches     1437        0    -1437     
==========================================
- Hits       129995    93103   -36892     
+ Misses      17970    10710    -7260     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/arrow/json/converter.cc 90.05% <0%> (-1.76%) ⬇️
cpp/src/arrow/json/chunked_builder.cc 79.91% <0%> (-1.68%) ⬇️
python/pyarrow/tests/test_parquet.py 96.56% <0%> (-0.07%) ⬇️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
... and 333 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bcf5897...bb05ce7. Read the comment docs.

* Vector for which each data element resides in a continuous memory region,
* so it can be pointed to by an {@link org.apache.arrow.memory.util.ArrowBufPointer}.
*/
public interface ElementAddressableVector extends ValueVector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is a new interface needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, getDataPointer methods are supported by FixedWidthVector and VariableWidthVector interfaces. To implement the iterator properly, we need a common interface with the method.

However, the common interface should not be ValueVector, because some sub-class should not support the getDataPointer operator, as their elements are not stored in a continuous memory region.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems like a better name for this might be PrimitiveVector? All primitive types should be able to have a direct pointer to there elements?

Copy link
Contributor

@jacques-n jacques-n Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-0.9

I'm really struggling with this being added to the interface. I generally don't think this is a non-performant interface. It is one thing to be external to the core apis. Adding into the core APIs seems like it suggests that this is a good way to access data. For example, if you want to make a performant hash implementation, you wouldn't use this interface. What are we really solving with ArrowBufPointer? I feel like we're letting traditional java-isms leak into a codebase that was trying to avoid them for the most part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add some detail, if you're reading 4 byte values, you should be working with ints in java, not a wrapper object that points to a byte buffer with an offset and length. A huge part of the performance you can get with Arrow is by using known structures: most are 1,2,4,8,16 byte widths. Writing things for each of those widths is way better than trying to layer a java object interface on top.

@liyafan82
Copy link
Contributor Author

liyafan82 commented Jul 31, 2019

@jacques-n thanks a lot for your comments.

I am convinced that these APIs should not be added to the core APIs. In fact, they are not intended to be the primary way of accessing vector data elements.

The iterator can be something nice to have, but not essential.

I can think of two applications of this iterator:

  1. quickly calculating/combining hash code & determining element equality.
  2. for traversing some "vectors" that does not support random access, like compressed vectors.

@liyafan82
Copy link
Contributor Author

@jacques-n @emkornfield
I have removed the methods out of the core APIs. Please check if it looks good. Thank yo so much.

@emkornfield
Copy link
Contributor

@jacques-n any more concerns if the APIs are not directly on Vector classes?

@emkornfield
Copy link
Contributor

I think @jacques-n is out this week, if there are no more comments next week we can probably merge this.

@liyafan82
Copy link
Contributor Author

I think @jacques-n is out this week, if there are no more comments next week we can probably merge this.

@emkornfield Thanks for your comments.

I have revised the PR a little, since now we have a common super interface for fixed width vectors and the variable width vectors.

@jacques-n 's comments make sense. However, I think this is at least useful for variable width vectors.

import org.apache.arrow.memory.util.hash.SimpleHasher;
import org.apache.arrow.vector.ElementAddressableVector;

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to clarify someplace that the value returned by next() is only valid until the subsequent call, this isn't typical of iterators as far as I know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Thanks.
To address this issue. I have also added a method that allows the client to specify the pointer to populate:
ArrowBufPointer next(ArrowBufPointer outPointer)

/**
* Retrieves the next pointer from the vector.
* @return the pointer pointing to the next element in the vector.
* Note that the returned point is only valid before the next call to this method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Note that the returned point is only valid before the next call to this method.
* Note that the returned pointer is only valid before the next call to this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Thank you.

* @param outPointer the pointer to populate.
* @return the pointer to populate.
*/
public ArrowBufPointer next(ArrowBufPointer outPointer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public ArrowBufPointer next(ArrowBufPointer outPointer) {
public void next(ArrowBufPointer outPointer) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thanks.

/**
* The hasher to calculate the hash code.
*/
private final ArrowBufHasher hasher;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears that hasher is not used, could it be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Thank you.
The hasher is used to construct the ArrowBufPointer.

@emkornfield
Copy link
Contributor

+1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants