ARROW-6473: Dictionary encoding format clarifications/future proofing #5585

emkornfield · 2019-10-05T22:51:45Z

This needs to be discussed first on the mailing list. It is a consolidation of recent dictionary encoding threads: 1, 2 and 3

github-actions · 2019-10-05T23:02:06Z

https://issues.apache.org/jira/browse/ARROW-6473

pitrou · 2019-10-06T09:36:10Z

docs/source/format/Columnar.rst

+.. note:: An edge-case for interleaved dictionary and record batches occurs
+   when the the record batches contain dictionary encoded arrays that are
+   completely null. In this case, the dictionary for the encoded column might
+   appear after the first record batch.


Is this necessary? This may complicate implementations.

I'm fine if we revise the specification to require a dictionary for each ID in the schema, even if it is empty. Requiring all dictionaries at the beginning seemed like a change in the specification, which is why I proposed this. Let's see if anyone else has thoughts on it.

I'm not especially concerned about this at least on the C++ side.

The alternative to this is to require an empty dictionary batch to be sent. It is also difficult to know whether the dictionary should be empty without checking the dictionary indices.

Yes, I don't think either option is too bad; we just need to decide whether an empty batch is required. If we require implementations to support dictionary delta encoding and dictionary resets, I don't think requiring a dictionary makes it any easier on the consumer side (on the producer side, I can imagine most implementations might not bother checking).
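
As a rough illustration of the edge case discussed in this thread, the following Python sketch models a stream consumer that accepts a record batch whose dictionary-encoded column is entirely null before any dictionary for that ID has arrived. The message tuples, the `decode_stream` function, and the use of `None` for null indices are hypothetical simplifications for this example, not the Arrow IPC API.

```python
def decode_stream(messages):
    """Decode a toy IPC-like stream.

    Each message is either ("dictionary", dict_id, values) or
    ("record", dict_id, indices), where indices may contain None for nulls.
    This framing is hypothetical and only models the ordering question.
    """
    dictionaries = {}  # dictionary ID -> list of values received so far
    decoded = []
    for kind, dict_id, payload in messages:
        if kind == "dictionary":
            dictionaries[dict_id] = list(payload)
        elif kind == "record":
            values = dictionaries.get(dict_id)
            if values is None:
                # No dictionary yet: acceptable only if every index is null.
                if any(i is not None for i in payload):
                    raise ValueError(
                        f"dictionary {dict_id} used before it was sent"
                    )
                values = []
            decoded.append([None if i is None else values[i] for i in payload])
        else:
            raise ValueError(f"unknown message kind: {kind}")
    return decoded


stream = [
    ("record", 0, [None, None]),    # all-null batch, no dictionary yet
    ("dictionary", 0, ["a", "b"]),  # dictionary arrives after the first batch
    ("record", 0, [1, None, 0]),
]
print(decode_stream(stream))
```

Requiring an empty dictionary batch up front, the alternative raised above, would remove the "no dictionary yet" branch on the consumer side at the cost of an extra message from the producer.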

codecov-io · 2019-10-07T02:45:45Z

Codecov Report

Merging #5585 into master will decrease coverage by 0.39%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master    #5585     +/-   ##
=========================================
- Coverage   89.18%   88.79%   -0.4%     
=========================================
  Files         924      983     +59     
  Lines      127616   132170   +4554     
  Branches     1501     1501             
=========================================
+ Hits       113817   117359   +3542     
- Misses      13434    14446   +1012     
  Partials      365      365

Impacted Files	Coverage Δ
cpp/src/arrow/json/converter.cc	`90.05% <0%> (-1.76%)`	⬇️
cpp/src/arrow/json/chunked_builder.cc	`80% <0%> (-1.67%)`	⬇️
cpp/src/arrow/csv/column_builder.cc	`95.54% <0%> (-1.49%)`	⬇️
python/pyarrow/plasma.py	`58.9% <0%> (-1.37%)`	⬇️
python/pyarrow/tests/test_parquet.py	`95.24% <0%> (-0.06%)`	⬇️
r/R/feather.R	`63.33% <0%> (ø)`
r/src/recordbatch.cpp	`87.76% <0%> (ø)`
r/src/table.cpp	`87.61% <0%> (ø)`
r/R/array-data.R	`20% <0%> (ø)`
r/R/filesystem.R	`70.45% <0%> (ø)`
... and 56 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed753fd...c556f5a. Read the comment docs.

pitrou · 2019-10-16T18:42:14Z

format/Schema.fbs

@@ -267,7 +267,9 @@ table KeyValue {

 /// ----------------------------------------------------------------------
 /// Dictionary encoding metadata
-
+/// Maintained for forwards compatibility, in the future
+/// Dictionaries might be (sparse) maps betwen Index and Value.


What is a "sparse map" here?

Also the word "type" suggests an Arrow datatype, perhaps use some other word such as "kind"?

Updated to "Kind". I reworded as discontinuous indices.

wesm · 2019-10-17T21:12:56Z

The parquet-testing changes should not be here

wesm · 2019-10-17T21:14:34Z

docs/source/format/Columnar.rst

+.. note:: An edge-case for interleaved dictionary and record batches occurs
+   when the the record batches contain dictionary encoded arrays that are
+   completely null. In this case, the dictionary for the encoded column might
+   appear after the first record batch.


The alternative to this is to require an empty dictionary batch to be sent. It is also difficult to know whether the dictionary should be empty without checking the dictionary indices.

wesm · 2019-10-17T21:15:59Z

docs/source/format/Columnar.rst

-file.
+file. Further more, it is invalid to have more then one non-delta
+dictionary batch per dictionary ID.  Delta dictionaries are applied
+in the order they appear in the file footer.


If we're allowing delta dictionaries in the file format at all, then why can there only be one?

If you're reading the file, it seems you could assemble all the deltas during the initial pass to create the "master" dictionary

"If we're allowing delta dictionaries in the file format at all, then why can there only be one?"
I agree. I added emphasis on what I think I wrote. This only allows for one "non-delta" dictionary:
"Further more, it is invalid to have more then one non-delta dictionary batch per dictionary ID"

Do you think we should disallow delta dictionaries?
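
To make the quoted wording concrete, here is a minimal Python sketch of how a reader might assemble the final dictionary for each ID from dictionary batches listed in footer order: at most one non-delta batch per ID, with deltas appended in the order they appear. The `(dict_id, is_delta, values)` tuples and the `assemble_dictionaries` name are hypothetical, and rejecting a delta that precedes its base is an assumption of this sketch rather than something the thread states.

```python
def assemble_dictionaries(dictionary_batches):
    """Build one dictionary per ID from batches given in footer order.

    Each batch is a hypothetical (dict_id, is_delta, values) tuple.
    """
    dictionaries = {}
    for dict_id, is_delta, values in dictionary_batches:
        if is_delta:
            if dict_id not in dictionaries:
                # Assumption of this sketch: a delta needs a base to extend.
                raise ValueError(f"delta for ID {dict_id} has no base dictionary")
            # Deltas are applied in order by appending their values.
            dictionaries[dict_id].extend(values)
        else:
            if dict_id in dictionaries:
                raise ValueError(
                    f"more than one non-delta dictionary batch for ID {dict_id}"
                )
            dictionaries[dict_id] = list(values)
    return dictionaries


footer_order = [
    (0, False, ["a", "b"]),
    (1, False, ["x"]),
    (0, True, ["c"]),
    (0, True, ["d", "e"]),
]
print(assemble_dictionaries(footer_order))
```

Because deltas only append, indices written against an earlier state of the dictionary remain valid against the assembled one.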

wesm · 2019-10-17T21:16:23Z

docs/source/format/Columnar.rst

+
+.. note:: Implementations should check to ensure the dictionary constraints
+  are satisfied.  In future revisions of the specification this requirement
+  might be relaxed.


I don't follow what this note is getting at

I think I should simply remove this. I included it because there was some discussion on whether dictionary replacement should be allowed in the file format.

wesm · 2019-10-17T21:17:45Z

format/Schema.fbs

@@ -283,6 +286,8 @@ table DictionaryEncoding {
  /// is used to represent ordered categorical data, and we provide a way to
  /// preserve that metadata here
  isOrdered: bool;
+
+  dictionaryKind: DictionaryKind;


I suppose I need to spend some time looking at the sparseness / encoding proposal to understand this better. Is your thinking that alternative dictionary-like encodings would be better handled here rather than through some other metadata?

This isn't part of the proposal, but came from a paper that Ippokratis linked to (https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf) to provide more context. It seems that, in the future, having explicit numeric keys might be advantageous for ordered dictionaries.

I understand this now
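
One way to picture "discontinuous indices" is a dictionary whose keys are explicit, gapped integers chosen so that key order matches value order. The sketch below illustrates that general idea only; the `SparseOrderedDictionary` class, its default gap of 10, and its insertion policy are hypothetical and are not part of this PR or of the Arrow format.

```python
import bisect


class SparseOrderedDictionary:
    """Toy order-preserving dictionary with explicit, gapped integer keys."""

    def __init__(self, sorted_values, gap=10):
        self.gap = gap
        self.values = list(sorted_values)
        self.keys = [i * gap for i in range(len(self.values))]

    def insert(self, value):
        """Insert a value and return its key; existing keys never change."""
        pos = bisect.bisect_left(self.values, value)
        if pos < len(self.values) and self.values[pos] == value:
            return self.keys[pos]  # already present
        if not self.keys:
            key = 0
        elif pos == 0:
            key = self.keys[0] - self.gap
        elif pos == len(self.keys):
            key = self.keys[-1] + self.gap
        else:
            lo, hi = self.keys[pos - 1], self.keys[pos]
            if hi - lo < 2:
                raise ValueError("no free key between neighbours; re-encode")
            key = (lo + hi) // 2
        self.values.insert(pos, value)
        self.keys.insert(pos, key)
        return key


d = SparseOrderedDictionary(["apple", "cherry"])
print(d.insert("banana"), d.keys, d.values)
```

With contiguous indices, inserting "banana" between "apple" and "cherry" while keeping the dictionary ordered would force already-written indices to be rewritten; with gapped keys the existing keys stay valid.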

emkornfield · 2019-10-25T04:03:40Z

docs/source/format/Columnar.rst

-file.
+file. There must be one dictionary batch per dictionary encoded column.
+Even if a all record batches are null for a column an empty dictionary
+batch is expected. Delta dictionaries are not used in the file format.


@pitrou updated per ML conversation, does this look reasonable to you?

To me yes, but it looks like the conversation is not done ;-)
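
For comparison with the earlier wording, the revised text quoted above (exactly one dictionary batch per dictionary-encoded column, possibly empty, and no deltas in the file format) could be checked along these lines. This is only a sketch of that draft wording, which was still under discussion at this point; `validate_file_dictionaries` and the `(dict_id, is_delta, values)` tuples are hypothetical names for this example.

```python
def validate_file_dictionaries(schema_dict_ids, dictionary_batches):
    """Check the draft 'one dictionary batch per encoded column' rule.

    schema_dict_ids: IDs of the dictionary-encoded columns in the schema.
    dictionary_batches: hypothetical (dict_id, is_delta, values) tuples.
    """
    seen = set()
    for dict_id, is_delta, _values in dictionary_batches:
        if is_delta:
            raise ValueError(f"delta dictionary for ID {dict_id} in file format")
        if dict_id in seen:
            raise ValueError(f"duplicate dictionary batch for ID {dict_id}")
        seen.add(dict_id)
    missing = set(schema_dict_ids) - seen
    if missing:
        # Even an all-null column is expected to have an (empty) dictionary.
        raise ValueError(f"missing dictionary batch for IDs {sorted(missing)}")
    return True


# Column 1 is entirely null, so its dictionary batch is present but empty.
print(validate_file_dictionaries([0, 1], [(0, False, ["a", "b"]), (1, False, [])]))
```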

emkornfield · 2019-11-21T20:26:59Z

Will revert the testing change; somehow it cropped back up.

This reverts commit 72f170c.

wesm

+1 from me

wesm · 2019-12-02T01:00:10Z

format/Schema.fbs

@@ -283,6 +286,8 @@ table DictionaryEncoding {
  /// is used to represent ordered categorical data, and we provide a way to
  /// preserve that metadata here
  isOrdered: bool;
+
+  dictionaryKind: DictionaryKind;


I understand this now

emkornfield · 2019-12-02T01:22:53Z

Merging based on vote.

pitrou reviewed Oct 6, 2019

pitrou reviewed Oct 16, 2019

emkornfield force-pushed the dict_document branch from bd8c266 to c75ac30 Compare October 17, 2019 02:34

wesm reviewed Oct 17, 2019

emkornfield force-pushed the dict_document branch 2 times, most recently from e859f45 to c439a83 Compare October 18, 2019 05:40

emkornfield commented Oct 25, 2019

emkornfield force-pushed the dict_document branch from 6f8d49c to 99fa52c Compare November 21, 2019 04:36

emkornfield force-pushed the dict_document branch from 99fa52c to a418422 Compare November 22, 2019 05:22

emkornfield added 9 commits November 26, 2019 21:32

Proposal

2f0724c

undo related change.

509f2d0

remove duplicate the

3d65c75

Update based on review.

720a05e

revert testing

e58e5df

address feedback

7c1f171

Revert "remove duplicate the"

65c709d

This reverts commit 72f170c.

remove duplicate the

5213782

be explicit about dictionary replacement

d1a0804

emkornfield force-pushed the dict_document branch from a418422 to d1a0804 Compare November 27, 2019 05:32

update to latest submodule

ee8cbfd

wesm approved these changes Dec 2, 2019

emkornfield closed this in 0ddc1f4 Dec 2, 2019

This was referenced Dec 2, 2019

[Format] Clarify dictionary encoding edge cases #22842

Closed

[Integration] Ensure dictionary IPC implementations match spec clarifications #23572

Open
