New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6031: [Java] Support iterating a vector by ArrowBufPointer #4950
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4950 +/- ##
==========================================
+ Coverage 87.64% 89.68% +2.04%
==========================================
Files 1030 690 -340
Lines 148327 103813 -44514
Branches 1437 0 -1437
==========================================
- Hits 129995 93103 -36892
+ Misses 17970 10710 -7260
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
* Vector for which each data element resides in a continuous memory region, | ||
* so it can be pointed to by an {@link org.apache.arrow.memory.util.ArrowBufPointer}. | ||
*/ | ||
public interface ElementAddressableVector extends ValueVector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is a new interface needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, getDataPointer methods are supported by FixedWidthVector and VariableWidthVector interfaces. To implement the iterator properly, we need a common interface with the method.
However, the common interface should not be ValueVector, because some sub-class should not support the getDataPointer operator, as their elements are not stored in a continuous memory region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it seems like a better name for this might be PrimitiveVector? All primitive types should be able to have a direct pointer to there elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-0.9
I'm really struggling with this being added to the interface. I generally don't think this is a non-performant interface. It is one thing to be external to the core apis. Adding into the core APIs seems like it suggests that this is a good way to access data. For example, if you want to make a performant hash implementation, you wouldn't use this interface. What are we really solving with ArrowBufPointer? I feel like we're letting traditional java-isms leak into a codebase that was trying to avoid them for the most part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add some detail, if you're reading 4 byte values, you should be working with ints in java, not a wrapper object that points to a byte buffer with an offset and length. A huge part of the performance you can get with Arrow is by using known structures: most are 1,2,4,8,16 byte widths. Writing things for each of those widths is way better than trying to layer a java object interface on top.
@jacques-n thanks a lot for your comments. I am convinced that these APIs should not be added to the core APIs. In fact, they are not intended to be the primary way of accessing vector data elements. The iterator can be something nice to have, but not essential. I can think of two applications of this iterator:
|
@jacques-n @emkornfield |
@jacques-n any more concerns if the APIs are not directly on Vector classes? |
I think @jacques-n is out this week, if there are no more comments next week we can probably merge this. |
3f718bd
to
9496333
Compare
@emkornfield Thanks for your comments. I have revised the PR a little, since now we have a common super interface for fixed width vectors and the variable width vectors. @jacques-n 's comments make sense. However, I think this is at least useful for variable width vectors. |
import org.apache.arrow.memory.util.hash.SimpleHasher; | ||
import org.apache.arrow.vector.ElementAddressableVector; | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to clarify someplace that the value returned by next() is only valid until the subsequent call, this isn't typical of iterators as far as I know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Thanks.
To address this issue. I have also added a method that allows the client to specify the pointer to populate:
ArrowBufPointer next(ArrowBufPointer outPointer)
9496333
to
52818cb
Compare
/** | ||
* Retrieves the next pointer from the vector. | ||
* @return the pointer pointing to the next element in the vector. | ||
* Note that the returned point is only valid before the next call to this method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Note that the returned point is only valid before the next call to this method. | |
* Note that the returned pointer is only valid before the next call to this method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Thank you.
* @param outPointer the pointer to populate. | ||
* @return the pointer to populate. | ||
*/ | ||
public ArrowBufPointer next(ArrowBufPointer outPointer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public ArrowBufPointer next(ArrowBufPointer outPointer) { | |
public void next(ArrowBufPointer outPointer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thanks.
52818cb
to
929ffa0
Compare
/** | ||
* The hasher to calculate the hash code. | ||
*/ | ||
private final ArrowBufHasher hasher; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears that hasher is not used, could it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Thank you.
The hasher is used to construct the ArrowBufPointer.
929ffa0
to
bb05ce7
Compare
+1. |
Provide the functionality to traverse a vector (fixed-width vector & variable-width vector) by an iterator. This is convenient for scenarios when accessing vector elements in sequence.