Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6473: Dictionary encoding format clarifications/future proofing #5585

Closed
wants to merge 10 commits into from

Conversation

emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Oct 5, 2019

This needs to be discussed first on the mailing list. It is a consolidation of recent dictionary encoding threads: 1, 2 and 3

@github-actions
Copy link

github-actions bot commented Oct 5, 2019

.. note:: An edge-case for interleaved dictionary and record batches occurs
when the the record batches contain dictionary encoded arrays that are
completely null. In this case, the dictionary for the encoded column might
appear after the first record batch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? This may complicate implementations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm fine if we revise, the specification to require a dictionary for each ID in the schema even if it is empty. Requiring all dictionaries at the beginning seemed like a change in specification which is why I proposed this. Lets see if anyone else has any thoughts on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not especially concerned about this at least on the C++ side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative to this is to require an empty dictionary batch to be sent. It is also difficult to know whether the dictionary should be empty without checking the dictionary indices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't think either option is too bad we just need to decide whether an empty batch is required. I think if require implementations to support dictionary delta encoding and dictionary resets I don't think it make it any easier on the consumer side to require a dictionary (on the producer side I can imagine most implementations might not bother checking).

@codecov-io
Copy link

Codecov Report

Merging #5585 into master will decrease coverage by 0.39%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #5585     +/-   ##
=========================================
- Coverage   89.18%   88.79%   -0.4%     
=========================================
  Files         924      983     +59     
  Lines      127616   132170   +4554     
  Branches     1501     1501             
=========================================
+ Hits       113817   117359   +3542     
- Misses      13434    14446   +1012     
  Partials      365      365
Impacted Files Coverage Δ
cpp/src/arrow/json/converter.cc 90.05% <0%> (-1.76%) ⬇️
cpp/src/arrow/json/chunked_builder.cc 80% <0%> (-1.67%) ⬇️
cpp/src/arrow/csv/column_builder.cc 95.54% <0%> (-1.49%) ⬇️
python/pyarrow/plasma.py 58.9% <0%> (-1.37%) ⬇️
python/pyarrow/tests/test_parquet.py 95.24% <0%> (-0.06%) ⬇️
r/R/feather.R 63.33% <0%> (ø)
r/src/recordbatch.cpp 87.76% <0%> (ø)
r/src/table.cpp 87.61% <0%> (ø)
r/R/array-data.R 20% <0%> (ø)
r/R/filesystem.R 70.45% <0%> (ø)
... and 56 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed753fd...c556f5a. Read the comment docs.

@@ -267,7 +267,9 @@ table KeyValue {

/// ----------------------------------------------------------------------
/// Dictionary encoding metadata

/// Maintained for forwards compatibility, in the future
/// Dictionaries might be (sparse) maps betwen Index and Value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "sparse map" here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the word "type" suggests an Arrow datatype, perhaps use some other word such as "kind"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to "Kind". I reworded as discontinuous indices.

@wesm
Copy link
Member

wesm commented Oct 17, 2019

The parquet-testing changes should not be here

.. note:: An edge-case for interleaved dictionary and record batches occurs
when the the record batches contain dictionary encoded arrays that are
completely null. In this case, the dictionary for the encoded column might
appear after the first record batch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative to this is to require an empty dictionary batch to be sent. It is also difficult to know whether the dictionary should be empty without checking the dictionary indices.

file.
file. Further more, it is invalid to have more then one non-delta
dictionary batch per dictionary ID. Delta dictionaries are applied
in the order they appear in the file footer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're allowing delta dictionaries in the file format at all, then why can there only be one?

If you're reading the file, it seems you could assemble all the deltas during the initial pass to create the "master" dictionary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're allowing delta dictionaries in the file format at all, then why can there only be one?
I agree. I added emphasis on what I think I wrote. This only allows for one "non-delta" dictionary.
Further more, it is invalid to have more then one non-delta dictionary batch per dictionary ID

Do you think we Should we disallow Delta dictionaries?


.. note:: Implementations should check to ensure the dictionary constraints
are satisfied. In future revisions of the specification this requirement
might be relaxed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow what this note is getting at

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I should simply remove this. I included it because there was some discussion on if dictionary replacement should be allowed in the file format

@@ -283,6 +286,8 @@ table DictionaryEncoding {
/// is used to represent ordered categorical data, and we provide a way to
/// preserve that metadata here
isOrdered: bool;

dictionaryKind: DictionaryKind;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I need to spend some time looking at the sparseness / encoding proposal to understand this better. Is your thinking that alternative dictionary-like encodings would not be handled through some other metadata, and better here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't part of the proposal but came from a paper that Ippokratis linked to (https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to provide more context. It seems that in the future for ordered dictionaries having explicit numeric keys might be advantageous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this now

@emkornfield emkornfield force-pushed the dict_document branch 2 times, most recently from e859f45 to c439a83 Compare October 18, 2019 05:40
file.
file. There must be one dictionary batch per dictionary encoded column.
Even if a all record batches are null for a column an empty dictionary
batch is expected. Delta dictionaries are not used in the file format.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou updated per ML conversation, does this look reasonable to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me yes, but it looks like the conversation is not done ;-)

@emkornfield
Copy link
Contributor Author

will revert the testing change, somehow it cropped back up.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me

@@ -283,6 +286,8 @@ table DictionaryEncoding {
/// is used to represent ordered categorical data, and we provide a way to
/// preserve that metadata here
isOrdered: bool;

dictionaryKind: DictionaryKind;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this now

@emkornfield
Copy link
Contributor Author

Merging based on vote.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants