-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6473: Dictionary encoding format clarifications/future proofing #5585
Conversation
.. note:: An edge-case for interleaved dictionary and record batches occurs | ||
when the the record batches contain dictionary encoded arrays that are | ||
completely null. In this case, the dictionary for the encoded column might | ||
appear after the first record batch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? This may complicate implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm fine if we revise, the specification to require a dictionary for each ID in the schema even if it is empty. Requiring all dictionaries at the beginning seemed like a change in specification which is why I proposed this. Lets see if anyone else has any thoughts on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not especially concerned about this at least on the C++ side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative to this is to require an empty dictionary batch to be sent. It is also difficult to know whether the dictionary should be empty without checking the dictionary indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't think either option is too bad we just need to decide whether an empty batch is required. I think if require implementations to support dictionary delta encoding and dictionary resets I don't think it make it any easier on the consumer side to require a dictionary (on the producer side I can imagine most implementations might not bother checking).
Codecov Report
@@ Coverage Diff @@
## master #5585 +/- ##
=========================================
- Coverage 89.18% 88.79% -0.4%
=========================================
Files 924 983 +59
Lines 127616 132170 +4554
Branches 1501 1501
=========================================
+ Hits 113817 117359 +3542
- Misses 13434 14446 +1012
Partials 365 365
Continue to review full report at Codecov.
|
format/Schema.fbs
Outdated
@@ -267,7 +267,9 @@ table KeyValue { | |||
|
|||
/// ---------------------------------------------------------------------- | |||
/// Dictionary encoding metadata | |||
|
|||
/// Maintained for forwards compatibility, in the future | |||
/// Dictionaries might be (sparse) maps betwen Index and Value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a "sparse map" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the word "type" suggests an Arrow datatype, perhaps use some other word such as "kind"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to "Kind". I reworded as discontinuous indices.
bd8c266
to
c75ac30
Compare
The parquet-testing changes should not be here |
.. note:: An edge-case for interleaved dictionary and record batches occurs | ||
when the the record batches contain dictionary encoded arrays that are | ||
completely null. In this case, the dictionary for the encoded column might | ||
appear after the first record batch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative to this is to require an empty dictionary batch to be sent. It is also difficult to know whether the dictionary should be empty without checking the dictionary indices.
docs/source/format/Columnar.rst
Outdated
file. | ||
file. Further more, it is invalid to have more then one non-delta | ||
dictionary batch per dictionary ID. Delta dictionaries are applied | ||
in the order they appear in the file footer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're allowing delta dictionaries in the file format at all, then why can there only be one?
If you're reading the file, it seems you could assemble all the deltas during the initial pass to create the "master" dictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're allowing delta dictionaries in the file format at all, then why can there only be one?
I agree. I added emphasis on what I think I wrote. This only allows for one "non-delta" dictionary.
Further more, it is invalid to have more then one non-delta dictionary batch per dictionary ID
Do you think we Should we disallow Delta dictionaries?
docs/source/format/Columnar.rst
Outdated
|
||
.. note:: Implementations should check to ensure the dictionary constraints | ||
are satisfied. In future revisions of the specification this requirement | ||
might be relaxed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow what this note is getting at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I should simply remove this. I included it because there was some discussion on if dictionary replacement should be allowed in the file format
@@ -283,6 +286,8 @@ table DictionaryEncoding { | |||
/// is used to represent ordered categorical data, and we provide a way to | |||
/// preserve that metadata here | |||
isOrdered: bool; | |||
|
|||
dictionaryKind: DictionaryKind; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I need to spend some time looking at the sparseness / encoding proposal to understand this better. Is your thinking that alternative dictionary-like encodings would not be handled through some other metadata, and better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't part of the proposal but came from a paper that Ippokratis linked to (https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to provide more context. It seems that in the future for ordered dictionaries having explicit numeric keys might be advantageous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this now
e859f45
to
c439a83
Compare
docs/source/format/Columnar.rst
Outdated
file. | ||
file. There must be one dictionary batch per dictionary encoded column. | ||
Even if a all record batches are null for a column an empty dictionary | ||
batch is expected. Delta dictionaries are not used in the file format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou updated per ML conversation, does this look reasonable to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me yes, but it looks like the conversation is not done ;-)
6f8d49c
to
99fa52c
Compare
will revert the testing change, somehow it cropped back up. |
99fa52c
to
a418422
Compare
a418422
to
d1a0804
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me
@@ -283,6 +286,8 @@ table DictionaryEncoding { | |||
/// is used to represent ordered categorical data, and we provide a way to | |||
/// preserve that metadata here | |||
isOrdered: bool; | |||
|
|||
dictionaryKind: DictionaryKind; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this now
Merging based on vote. |
This needs to be discussed first on the mailing list. It is a consolidation of recent dictionary encoding threads: 1, 2 and 3