ARROW-6078: [Java] Implement dictionary-encoded subfields for List type #4972

Closed
wants to merge 7 commits into from

tianchen92
Contributor

Related to ARROW-6078.
For example, a List vector of int type (valueCount = 5) with data like the following:
10, 20
10, 20
30, 40, 50
30, 40, 50
10, 20
could be encoded as:
0, 1
0, 1
2, 3, 4
2, 3, 4
0, 1
with a list-type dictionary holding either a single list:
10, 20, 30, 40, 50
or one single-element list per value:
10,
20,
30,
40,
50
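
The mapping in the example above can be sketched in plain Java, without any Arrow classes. This is only an illustration of the idea, not the code in this PR: `ListSubfieldSketch`, `buildDictionary`, and `encode` are hypothetical names, and ordinary `java.util` lists stand in for ListVector and its data vector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration only; all names here are hypothetical. */
class ListSubfieldSketch {

  /** Collects distinct subfield values in first-seen order; list position is the index. */
  static List<Integer> buildDictionary(List<List<Integer>> lists) {
    Map<Integer, Integer> seen = new LinkedHashMap<>();
    for (List<Integer> list : lists) {
      for (Integer value : list) {
        seen.putIfAbsent(value, seen.size());
      }
    }
    return new ArrayList<>(seen.keySet());
  }

  /** Replaces every subfield value with its index in the dictionary, keeping list boundaries. */
  static List<List<Integer>> encode(List<List<Integer>> lists, List<Integer> dictionary) {
    List<List<Integer>> encoded = new ArrayList<>();
    for (List<Integer> list : lists) {
      List<Integer> indices = new ArrayList<>();
      for (Integer value : list) {
        int index = dictionary.indexOf(value);
        if (index < 0) {
          throw new IllegalArgumentException("value not in dictionary: " + value);
        }
        indices.add(index);
      }
      encoded.add(indices);
    }
    return encoded;
  }

  public static void main(String[] args) {
    List<List<Integer>> data = Arrays.asList(
        Arrays.asList(10, 20),
        Arrays.asList(10, 20),
        Arrays.asList(30, 40, 50),
        Arrays.asList(30, 40, 50),
        Arrays.asList(10, 20));
    List<Integer> dictionary = buildDictionary(data);
    System.out.println(dictionary);               // [10, 20, 30, 40, 50]
    System.out.println(encode(data, dictionary)); // [[0, 1], [0, 1], [2, 3, 4], [2, 3, 4], [0, 1]]
  }
}
```

The list structure (offsets and validity) is untouched by the encoding; only the values inside each list are replaced by indices.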

@codecov-io

codecov-io commented Jul 31, 2019

Codecov Report

Merging #4972 into master will increase coverage by 2.13%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4972      +/-   ##
==========================================
+ Coverage   87.62%   89.76%   +2.13%     
==========================================
  Files        1014      685     -329     
  Lines      145908   102521   -43387     
  Branches     1437        0    -1437     
==========================================
- Hits       127857    92025   -35832     
+ Misses      17689    10496    -7193     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/arrow/testing/gtest_util.h 80.53% <0%> (-16.84%) ⬇️
cpp/src/parquet/arrow/reader.h 65% <0%> (-15%) ⬇️
cpp/src/arrow/testing/util.h 91.3% <0%> (-8.7%) ⬇️
cpp/src/parquet/thrift.h 86.13% <0%> (-7.9%) ⬇️
cpp/src/arrow/util/compression.cc 82.14% <0%> (-4.96%) ⬇️
cpp/src/parquet/arrow/reader.cc 81.43% <0%> (-3.48%) ⬇️
cpp/src/arrow/status.cc 48.88% <0%> (-3.34%) ⬇️
cpp/src/parquet/properties_test.cc 96.96% <0%> (-3.04%) ⬇️
cpp/src/arrow/dataset/file_base.h 87.5% <0%> (-2.98%) ⬇️
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
... and 431 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a40d6b6...5d2f751. Read the comment docs.

@tianchen92
Contributor Author

@emkornfield Since FixedSizeListVector is a specific case of ListVector (I don't know why it didn't inherit from ListVector before), I did some refactoring:

  1. Made FixedSizeListVector extend ListVector, UnionFixedListWriter extend UnionListWriter, and UnionFixedListReader extend UnionListReader, to remove a lot of duplicated logic.
  2. Made ListSubfieldEncoder non-static to avoid some problems in DictionaryEncoder mentioned in another thread (e.g. dictionary reuse).

Please help take a look, thanks!

@tianchen92
Contributor Author

@emkornfield I refactored this as you suggested:

  1. Added BaseListVector for BaseRepeatedValueVector/FixedSizeListVector.
  2. Extracted the loop logic in DictionaryEncoder for reuse.

Please help take a look again, thanks a lot!

@tianchen92 tianchen92 force-pushed the ARROW-1175 branch 2 times, most recently from 1ceb3cd to 3f1bc08 Compare August 19, 2019 16:43
Contributor

@emkornfield emkornfield left a comment

Some suggestions on simplifying BaseListVector.

@tianchen92
Contributor Author

Some suggestions on simplifying BaseListVector.

Thanks, fixed now.

@tianchen92
Contributor Author

Some suggestions on simplifying BaseListVector.

I updated this PR; please see if you have other comments. Thanks.

Contributor

@emkornfield emkornfield left a comment

This looks cleaner. I'm still not sure about the cloning, which seems inefficient compared to copying references to buffers.

@tianchen92
Contributor Author

This looks cleaner. I'm still not sure about the cloning, which seems inefficient compared to copying references to buffers.

You are right. I fixed it by using getFieldBuffers/loadFieldBuffers instead, and removed replaceDataVector from BaseListVector.

@tianchen92 tianchen92 closed this Aug 28, 2019
@tianchen92 tianchen92 reopened this Aug 28, 2019
@tianchen92
Contributor Author

@emkornfield PR updated, please take a look, thanks! Actually, I haven't found the re-request button :)

allocator, null);

final ArrowFieldNode fieldNode = new ArrowFieldNode(vector.getValueCount(), vector.getNullCount());
cloned.loadFieldBuffers(fieldNode, vector.getFieldBuffers());
Contributor

I think you need to call retain on the buffers. Also, it looks like ListVector doesn't round trip correctly when it is empty.

Contributor Author

Thanks, I think retain is already called in loadFieldBuffers.

Also, it looks like ListVector doesn't round trip correctly when it is empty.

How can I verify this?

Contributor

It is called in loadFieldBuffers, but decrement is also called to remove the reference count from the original vector ....

Create a unit test that tries to encode/decode a brand-new vector.

Contributor Author

@tianchen92 tianchen92 Aug 29, 2019

loadFieldBuffers looks like:

offsetBuffer.getReferenceManager().release();
offsetBuffer = offBuffer.getReferenceManager().retain(offBuffer, allocator);

It releases its own offsetBuffer and retains the passed-in offBuffer, so I think there is no need to call retain outside. It also works well with the encoded vector even if I clear the original vector.
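
The ownership transfer described in this comment can be illustrated with a toy reference-counting model. This is a simplified sketch, not Arrow's memory API: `Buf` and `Vec` are hypothetical stand-ins for a reference-counted buffer and a vector that owns one offset buffer, and `load` only mirrors the release-then-retain order of the two lines quoted above.

```java
/** Toy model only; Buf and Vec are hypothetical stand-ins, not Arrow classes. */
class RefCountSketch {

  /** A buffer whose memory is considered freed when the count reaches zero. */
  static final class Buf {
    private int refCount = 1; // the creator holds the first reference

    Buf retain() {
      if (refCount <= 0) {
        throw new IllegalStateException("buffer already freed");
      }
      refCount++;
      return this;
    }

    void release() {
      if (refCount <= 0) {
        throw new IllegalStateException("buffer already freed");
      }
      refCount--;
    }

    int refCount() {
      return refCount;
    }
  }

  /** A vector that owns exactly one offset buffer. */
  static final class Vec {
    private Buf offsetBuffer = new Buf();

    Buf getOffsetBuffer() {
      return offsetBuffer;
    }

    /** Drops the vector's own buffer, then takes a new reference on the incoming one. */
    void load(Buf incoming) {
      offsetBuffer.release();
      offsetBuffer = incoming.retain();
    }

    /** Gives up the current buffer and starts over with a fresh one. */
    void clear() {
      offsetBuffer.release();
      offsetBuffer = new Buf();
    }
  }

  public static void main(String[] args) {
    Vec original = new Vec();
    Vec cloned = new Vec();
    Buf shared = original.getOffsetBuffer();

    cloned.load(shared);                    // original and cloned each hold a reference
    System.out.println(shared.refCount());  // 2

    original.clear();                       // only the original's reference is dropped
    System.out.println(shared.refCount());  // 1
  }
}
```

In this model the release inside `load` applies to the buffer the loading vector held before the call, not to the incoming buffer, so the shared buffer stays alive after the original vector is cleared.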

And I don't quite understand this:

Also, it looks like ListVector doesn't round trip correctly when it is empty.

Contributor

@emkornfield emkornfield Aug 30, 2019

You are correct about retention; I wasn't reading carefully.

I think the empty ListVector is a separate issue that we can clean up later.

Contributor Author

Thanks, I guess you could describe the empty ListVector issue in a separate JIRA and just assign it to me.

Contributor

@emkornfield emkornfield left a comment

A few more comments on cloning, but this is very close to mergeable.

…istSubfieldEncoder.java

Co-Authored-By: emkornfield <emkornfield@gmail.com>
@elahrvivaz
Contributor

I really like this idea, but I believe that it's not supported in the IPC protocol, as a dictionary can only be applied to a field vector, so for a dictionary encoded ListVector, the dictionary values would have to be Lists themselves (please correct me if I'm wrong). Would it be worthwhile to raise a discussion on the mailing list to support this more formally?

@tianchen92
Contributor Author

I really like this idea, but I believe that it's not supported in the IPC protocol, as a dictionary can only be applied to a field vector, so for a dictionary encoded ListVector, the dictionary values would have to be Lists themselves (please correct me if I'm wrong). Would it be worthwhile to raise a discussion on the mailing list to support this more formally?

Thanks for your attention. Here the dictionary has the same type as the original vector, for example:
i. the original vector: a ListVector whose data vector is of varchar type.
ii. the dictionary: a ListVector whose data vector is a varchar vector holding the unique values.
iii. the encoded vector: a ListVector whose data vector is of int type.
In this case, isn't it just like normal encoding, and fine with IPC as long as we write both the dictionary and the encoded vector? Or am I missing something about IPC? Thanks!

@elahrvivaz
Contributor

I thought dictionary-encoded vectors were always assumed to be (Tiny/Big)IntVectors, but maybe that is not the case.

@emkornfield
Contributor

@elahrvivaz I don't think implementations have good support for this use case yet, but my understanding is that it is intended to be a supported use case (see https://github.com/apache/arrow/pull/1848/files and the corresponding JIRA, as well as an old ML thread).

Schema.fbs places dictionary metadata on Fields, which are recursive structures, so I think it should be able to support dictionaries at a nested level.
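
To make the point about recursive Fields concrete, here is a minimal plain-Java sketch. It is a simplified model and not the Arrow schema classes: `Field`, `dictionaryId`, and `dictionaryEncodedPaths` are hypothetical names chosen for illustration. Because every field can carry its own optional dictionary id and its own children, a dictionary can be attached to the child of a list rather than to the list itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model only; these are hypothetical names, not the Arrow schema classes. */
class NestedDictionarySketch {

  /** A field with an optional dictionary id and any number of child fields. */
  static final class Field {
    final String name;
    final Long dictionaryId; // null when the field is not dictionary-encoded
    final List<Field> children;

    Field(String name, Long dictionaryId, Field... children) {
      this.name = name;
      this.dictionaryId = dictionaryId;
      this.children = Arrays.asList(children);
    }
  }

  /** Walks the field tree and records every dictionary-encoded field, at any depth. */
  static List<String> dictionaryEncodedPaths(Field root) {
    List<String> out = new ArrayList<>();
    collect(root, "", out);
    return out;
  }

  private static void collect(Field field, String parentPath, List<String> out) {
    String path = parentPath.isEmpty() ? field.name : parentPath + "." + field.name;
    if (field.dictionaryId != null) {
      out.add(path + " -> dictionary " + field.dictionaryId);
    }
    for (Field child : field.children) {
      collect(child, path, out);
    }
  }

  public static void main(String[] args) {
    // A list field whose child (the subfield) is dictionary-encoded with id 1.
    Field item = new Field("item", 1L);
    Field tags = new Field("tags", null, item);
    System.out.println(dictionaryEncodedPaths(tags)); // [tags.item -> dictionary 1]
  }
}
```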

@elahrvivaz
Contributor

@emkornfield thanks for the clarification

@wesm
Member

wesm commented Aug 29, 2019

We have good support for dictionaries on children of nested types. I think JavaScript even supports dictionaries inside dictionaries (which is also permitted by the spec IIUC)

@emkornfield
Contributor

+1, thank you.


6 participants