-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6078: [Java] Implement dictionary-encoded subfields for List type #4972
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4972 +/- ##
==========================================
+ Coverage 87.62% 89.76% +2.13%
==========================================
Files 1014 685 -329
Lines 145908 102521 -43387
Branches 1437 0 -1437
==========================================
- Hits 127857 92025 -35832
+ Misses 17689 10496 -7193
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ListSubfieldEncoder.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ListSubfieldEncoder.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ListSubfieldEncoder.java
Outdated
Show resolved
Hide resolved
@emkornfield Since FixedListSizeVector is a specific case of ListVector(don't know why not inherit from ListVector before), I did some refactor:
Please help take a look, thanks! |
java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java
Outdated
Show resolved
Hide resolved
@emkornfield I did a refactor for this as you suggested:
Please help take a look again, thanks a lot! |
1ceb3cd
to
3f1bc08
Compare
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ListSubfieldEncoder.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseListVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseListVector.java
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseListVector.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions on simplifying BaseListVector.
Thanks, fixed now. |
I updated this PR, please see if you have other comments, thanks |
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseListVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ListSubfieldEncoder.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ListSubfieldEncoder.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks cleaner, I'm still not sure about the cloning which seems in-efficient compared to copying references to buffers.
You are right, I fixed it with getFieldBuffers/loadFieldBuffers instead and remove replaceDataVector in BaseListVector. |
@emkornfield PR updated, please help take a look, thanks! Actually, I haven't found re-request button :) |
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ListSubfieldEncoder.java
Outdated
Show resolved
Hide resolved
allocator, null); | ||
|
||
final ArrowFieldNode fieldNode = new ArrowFieldNode(vector.getValueCount(), vector.getNullCount()); | ||
cloned.loadFieldBuffers(fieldNode, vector.getFieldBuffers()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you need to call retain on the buffers. Also it looks like ListVector doesn't round trip correctly on when it is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think retain is called in loadFieldBuffers already.
Also it looks like ListVector doesn't round trip correctly on when it is empty
How to validity this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called in loadfieldbuffers, but decrement is also called to remove the reference count from the original vector ....
Create a unit test that tries to encode/decide a brand new vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loadFieldBuffers looks like:
offsetBuffer.getReferenceManager().release();
offsetBuffer = offBuffer.getReferenceManager().retain(offBuffer, allocator);
It decrement it's own offsetBuffer and retain passed offBuffer, I think it's no need to call retain outside? And it works well with the encoded vector even if I clear the original vector.
And I am not quite understand with
Also it looks like ListVector doesn't round trip correctly on when it is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, about retention, I wasn't reading carefully.
I think the empty ListVector is a separate issue that we can cleanup later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I guess you could describe empty ListVector issue in a separate JIRA and just assign to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few more comments on cloning but this is very close to mergable.
…istSubfieldEncoder.java Co-Authored-By: emkornfield <emkornfield@gmail.com>
I really like this idea, but I believe that it's not supported in the IPC protocol, as a dictionary can only be applied to a field vector, so for a dictionary encoded ListVector, the dictionary values would have to be Lists themselves (please correct me if I'm wrong). Would it be worthwhile to raise a discussion on the mailing list to support this more formally? |
Thanks for your attention, here dictionary is the same type as original vector, for example: |
I thought dictionary-encoded vectors were always assumed to be (Tiny/Big)IntVectors, but maybe that is not the case. |
@elahrvivaz I don't think implementations have good support for this use-case yet, but my understanding it is intended to be a supported use case (See https://github.com/apache/arrow/pull/1848/files and corresponding JIRA as well as an old ML thread Schema.fbs places dictionary metadata on Fields which are recursive structures so I think it should be able to support dictionary at a nested level. |
@emkornfield thanks for the clarification |
We have good support for dictionaries on children of nested types. I think JavaScript even supports dictionaries inside dictionaries (which is also permitted by the spec IIUC) |
+1, thank you. |
Related to ARROW-6078.
For example, int type List (valueCount = 5) has data like below:
10, 20
10, 20
30, 40, 50
30, 40, 50
10, 20
could be encoded to:
0, 1
0, 1
2, 3, 4
2, 3, 4
0, 1
with list type dictionary
10, 20, 30, 40, 50
or
10,
20,
30,
40,
50