Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6352: [Java] Add implementation of DenseUnionVector #5473

Closed
wants to merge 8 commits into from

Conversation

liyafan82
Copy link
Contributor

Today only Sparse unions are supported. We should have a dense union implementation vector that conforms to the IPC protocol (the current sparse union vector doesn't do this and there are other JIRAs covering making it compatible).

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems to propagate the fact that there can only be one Vector of a given minor type within the union. This isn't according to the spec, I would prefer to see something that might not be as efficient but implements the specification cleanly, instead of copying the same template pattern that SparseUnion uses.

list.add(offsetBuffer);
list.addAll(java.util.Arrays.asList(internalStruct.getBuffers(clear)));
}
if (clear) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this call clear instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think so. Calling clear/close is essentially the same as calling release, which decrements the refcount by 1.

* each time the vector is accessed.
* Source code generated using FreeMarker template ${.template_name}
*/
public class DenseUnionVector extends UnionVector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this relationship should necessarily hold without a larger refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. The implementations are separated now.

@liyafan82
Copy link
Contributor Author

liyafan82 commented Sep 24, 2019

This change seems to propagate the fact that there can only be one Vector of a given minor type within the union. This isn't according to the spec, I would prefer to see something that might not be as efficient but implements the specification cleanly, instead of copying the same template pattern that SparseUnion uses.

Thanks for your comments. You mean DenseUnionVector and SparseUnionVector should have separate minor type, and separate implementation, right?

@emkornfield
Copy link
Contributor

Thanks for your comments. You mean DenseUnionVector and SparseUnionVector should have separate minor type, and separate implementation, right?

Separate implemention yes. Seperate minor type I'm not sure about I need to refresh myself on the implications of having two minor types corresponding to the same parameterized IPC specification type.

@liyafan82
Copy link
Contributor Author

Thanks for your comments. You mean DenseUnionVector and SparseUnionVector should have separate minor type, and separate implementation, right?

Separate implemention yes. Seperate minor type I'm not sure about I need to refresh myself on the implications of having two minor types corresponding to the same parameterized IPC specification type.

Thanks for your feedback. I have separated the implementations for UnionVector and DenseUnionVector. In addition, the implementations for reader/writer are also separate. However, implementations for list/struct reader/writer are shared. Please take a look.

@SuppressWarnings("unused")
public class DenseUnionReader extends AbstractFieldReader {

private BaseReader[] readers = new BaseReader[43];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it is propagating the incompatibity with the IPC specification. Specifically a union should be able to have more then one vector for each minor type. is this the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking this question: UnionVector and DenseUnionVector are sharing the same minor type? If so, the answer is yes for now.
Do you think we should provide separate minor types for them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I'm asking if there can be two different vectors of the same type in the union (e.g. two different struct vectors).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the clarification.

You are right. This implementation does not support more than one vector for each minor type.
This is aligning with the current (Sparse) UnionVector.

Should we resolve this problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO densevector shouldn't propagate the same problems the existing UnionVector has with compatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your feedback. I will resolve it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield I have updated the code to allow arbitrary relative types for the desne union vector. Please take a look.

Brief change list:

  1. For StructVector and ListVector, a custom type id is registered when they are added, if the same relative type has never occured.
  2. the type id of the custom relative types start from MinorType.values().length, and the maximum value is 128.
  3. some additional data structures are added to track the existing custom relative types, struct vector, and list vectors.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see this refactored so we do base classes of the vector and reader as non-templated code and then the implementation adds the templated code (or vice versa). This is to minimize the use of java code in freemarker templates. We have tried to eliminate that in a bunch of places. We haven't refactored in the readers/writers but we should avoid introducing new debt on this front.

</#list>
</#list>

public void setType(int index, int typeId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typeIds should probably be "byte" everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised accordingly. Thanks.

}

public boolean isNull(int index) {
return (typeBuffer.getByte(index * TYPE_WIDTH) == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be a separate validity buffer or there needs to be a more discussion the mailing list. As the spec is currently written this doesn't conform.

Copy link
Contributor Author

@liyafan82 liyafan82 Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a real problem, and UnionVector also had this problem.
We solved it by using the underlying vector's validity buffer (please see #5752 (comment)).

Here we solve it with a similar method. Please see if it is OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more thoughts, I think the solution for UnionVector does not work for DenseUnionVector. So I have revised the code to add a validity buffer.


private ValueVector getVector(int index) {
int type = typeBuffer.getByte(index * TYPE_WIDTH);
if (type < Types.MinorType.values().length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still doesn't conform to the spec as written. There can pontentially be more then one vector of each primitive type (e.g. 2 IntVectors). Again we should try to close on this on the mailing list, but I thought last time it was discussed we agreed to keep it general.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will fix this.

I am a little curious about in what scenario do we need to have 2 IntVectors? Two IntVectors with different meta data?

Another question is, do we need to fix this problem for the (sparse) UnionVector in a follow up issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have revised to code to support more than one vector for each vector type. Please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to review over the weekend. To answer the question on use-case Arrow's unions underlying vectors are named meaning they could have two integers with different semantics. Another example of a data format that allows this is protobufs with oneof fields. Whether this is a common case is up for debate (I can't think of a good use-case at the moment) and the last time it was discussed on the mailing list the most common case is "json" representations which would exclude the duplicate integer use-cases. This is how the SparseUnionVector is implemented in Java.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the clarification.Have a great weekend.

@liyafan82
Copy link
Contributor Author

I would like to see this refactored so we do base classes of the vector and reader as non-templated code and then the implementation adds the templated code (or vice versa). This is to minimize the use of java code in freemarker templates. We have tried to eliminate that in a bunch of places. We haven't refactored in the readers/writers but we should avoid introducing new debt on this front.

Thanks for the suggestion. I will refactor the code towards that direction.

/**
* The key is type Id, and the value is vector.
*/
private ValueVector[] childVectors = new ValueVector[Byte.MAX_VALUE + 1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the +1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The valid type id should be from 0 to Byte.MAX_VALUE, both inclusive, so the total number is Byte.MAX_VALUE - 0 + 1.

private static final byte TYPE_WIDTH = 1;
public static final byte OFFSET_WIDTH = 4;

private static final FieldType INTERNAL_STRUCT_TYPE = new FieldType(false /*nullable*/,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: lets use /nullable=/false, instead of putting it after the values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thank you.

if (ownBuffers.size() != 3) {
throw new IllegalArgumentException("Illegal buffer count, expected " + 2 + ", got: " + ownBuffers.size());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this reference typeIds metadata data someplace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Revised the exception message.

@@ -170,6 +171,11 @@ public Boolean visit(UnionVector left, Range range) {
return compareUnionVectors(range);
}

@Override
public Boolean visit(DenseUnionVector left, Range range) {
throw new UnsupportedOperationException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open a JIRA to fill this and add a comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will create one.

FieldReader reader = new DenseUnionReader(DenseUnionVector.this);
int offset = offsetBuffer.getInt(index * OFFSET_WIDTH);
reader.setPosition(offset);
holder.reader = reader;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be a check to see if reader is already initialized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean we can reuse the reader here? I am not sure, because the reader is stateful and there can be more than one client for this vector.
The implementation here aligns with the sparse union vector.

if (this.fieldType == null) {
fieldType = FieldType.nullable(new ArrowType.Union(Dense, typeIds));
} else {
final UnionMode mode = ((ArrowType.Union) this.fieldType.getType()).getMode();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this always be Dense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Revised accordingly.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick pass through and I think a a high level this looks like a reasonable start (modulo comments). The real test will be adding this to integration tests. Did you want to do this in this review or open up a JIRA for follow-up.

I'm also not sure if this adresses @jacques-n comment about removing templates, but I'll let him comment on that.

@liyafan82
Copy link
Contributor Author

Took a quick pass through and I think a a high level this looks like a reasonable start (modulo comments). The real test will be adding this to integration tests. Did you want to do this in this review or open up a JIRA for follow-up.

I'm also not sure if this adresses @jacques-n comment about removing templates, but I'll let him comment on that.

@emkornfield First, I would like to express my sincere appreciation to you, as this PR is so large and reviewing it is so effort-consuming.

To make the code easier to review, I want to do integration test in a follow-up issue.

@jacques-n 's comment has not been resolved yet, although I believe the suggested changes/refactorings are reasonable and beneficial. If @jacques-n agrees, I think we can do it in a follow up issue (this PR is already large enough). In addition, I think we can also apply the refactoring to the sparse union vector.

@emkornfield
Copy link
Contributor

OK, going to merge now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants