ARROW-6473: Dictionary encoding format clarifications/future proofing
This needs to be discussed first on the mailing list.  It is a consolidation of recent dictionary encoding threads: [1](https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E), [2](https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E) and [3](https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E)

Closes #5585 from emkornfield/dict_document and squashes the following commits:

ee8cbfd <Micah Kornfield> update to latest submodule
d1a0804 <Micah Kornfield> be explicit about dictionary replacement
5213782 <Micah Kornfield> remove duplicate the
65c709d <Micah Kornfield> Revert "remove duplicate the"
7c1f171 <Micah Kornfield>  address feedback
e58e5df <Micah Kornfield> revert testing
720a05e <Micah Kornfield> Update based on review.
3d65c75 <Micah Kornfield> remove duplicate the
509f2d0 <emkornfield> undo related change.
2f0724c <Micah Kornfield> Proposal

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
emkornfield committed Dec 2, 2019
1 parent 5c2bb6f commit 0ddc1f4
Showing 3 changed files with 48 additions and 3 deletions.
41 changes: 40 additions & 1 deletion docs/source/format/Columnar.rst
@@ -986,6 +986,11 @@ a ``RecordBatch`` it should be defined in a ``DictionaryBatch``. ::
<RECORD BATCH n - 1>
<EOS [optional]: 0xFFFFFFFF 0x00000000>

.. note:: An edge case for interleaved dictionary and record batches occurs
when the record batches contain dictionary encoded arrays that are
completely null. In this case, the dictionary for the encoded column might
appear after the first record batch.
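
The edge case in the note can be sketched in Python. This is only an illustration under stated assumptions, not Arrow's reader: `decode_column` is a hypothetical helper, null slots are modelled as `None`, and a dictionary is a plain list of strings.

```python
from typing import List, Optional, Sequence


def decode_column(
    indices: Sequence[Optional[int]],
    dictionary: Optional[List[str]],
) -> List[Optional[str]]:
    """Decode one dictionary-encoded column (hypothetical helper).

    `dictionary` is None when no DictionaryBatch has been seen yet for
    this column. That is tolerated only if every slot is null.
    """
    if dictionary is None:
        if any(i is not None for i in indices):
            raise ValueError("non-null index seen before its dictionary")
        # All-null column: there is nothing to look up, so the missing
        # dictionary is harmless.
        return [None] * len(indices)
    return [None if i is None else dictionary[i] for i in indices]
```

A reader written this way accepts a first record batch whose encoded column is entirely null and picks up the dictionary when it arrives later in the stream.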

When a stream reader implementation is reading a stream, after each
message, it may read the next 8 bytes to determine both if the stream
continues and the size of the message metadata that follows. Once the
@@ -1019,7 +1024,10 @@ Schematically we have: ::
In the file format, there is no requirement that dictionary keys
should be defined in a ``DictionaryBatch`` before they are used in a
``RecordBatch``, as long as the keys are defined somewhere in the
file.
file. Furthermore, it is invalid to have more than one **non-delta**
dictionary batch per dictionary ID (i.e. dictionary replacement is not
supported). Delta dictionaries are applied in the order they appear in
the file footer.
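
The file-format rules above could be checked along the following lines. This is a minimal sketch, not Arrow's implementation: `build_file_dictionaries` and the `(dictionary_id, is_delta, values)` tuple layout are hypothetical, and collecting base dictionaries in a first pass (so that a delta may be listed before its base) is an assumption of the sketch rather than something the text states.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for a dictionary block listed in the file footer:
# (dictionary_id, is_delta, values), given in footer order.
DictionaryBlock = Tuple[int, bool, List[str]]


def build_file_dictionaries(
    blocks: Iterable[DictionaryBlock],
) -> Dict[int, List[str]]:
    blocks = list(blocks)
    dictionaries: Dict[int, List[str]] = {}

    # Pass 1: collect base dictionaries. A second non-delta batch for the
    # same ID would be a replacement, which the file format forbids.
    for dict_id, is_delta, values in blocks:
        if is_delta:
            continue
        if dict_id in dictionaries:
            raise ValueError(f"duplicate non-delta dictionary for id {dict_id}")
        dictionaries[dict_id] = list(values)

    # Pass 2: apply deltas in the order they appear in the footer.
    for dict_id, is_delta, values in blocks:
        if not is_delta:
            continue
        if dict_id not in dictionaries:
            raise ValueError(f"delta for unknown dictionary id {dict_id}")
        dictionaries[dict_id].extend(values)

    return dictionaries
```

For example, a base dictionary `["A", "B"]` followed by deltas `["C"]` and `["D"]` for the same ID yields `["A", "B", "C", "D"]`, while two non-delta batches for one ID are rejected.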

Dictionary Messages
-------------------
@@ -1073,6 +1081,37 @@ form: ::
0
EOS

Alternatively, if ``isDelta`` is set to false, then the dictionary
replaces the existing dictionary for the same ID. Using the same
example as above, an alternate encoding could be: ::


<SCHEMA>
<DICTIONARY 0>
(0) "A"
(1) "B"
(2) "C"

<RECORD BATCH 0>
0
1
2
1

<DICTIONARY 0>
(0) "A"
(1) "C"
(2) "D"
(3) "E"

<RECORD BATCH 1>
2
1
3
0
EOS
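
The replacement behaviour in this stream can be traced with a small Python sketch. The message tuples and `decode_stream` are hypothetical stand-ins for illustration, not Arrow's IPC API; dictionaries are plain lists and record batches are lists of integer indices.

```python
from typing import Dict, List, Tuple


def decode_stream(messages: List[Tuple]) -> List[List[str]]:
    """Decode record batches against the dictionary state of a stream.

    Hypothetical message shapes:
      ("dictionary", dict_id, is_delta, values)
      ("record", dict_id, indices)
    """
    dictionaries: Dict[int, List[str]] = {}
    decoded: List[List[str]] = []
    for msg in messages:
        if msg[0] == "dictionary":
            _, dict_id, is_delta, values = msg
            if is_delta:
                # Delta: append to the existing dictionary for this ID
                # (raises KeyError if no dictionary exists yet).
                dictionaries[dict_id].extend(values)
            else:
                # Non-delta: replace whatever was there before.
                dictionaries[dict_id] = list(values)
        else:
            _, dict_id, indices = msg
            current = dictionaries[dict_id]
            decoded.append([current[i] for i in indices])
    return decoded


# The stream shown above, using dictionary ID 0 throughout.
stream = [
    ("dictionary", 0, False, ["A", "B", "C"]),
    ("record", 0, [0, 1, 2, 1]),
    ("dictionary", 0, False, ["A", "C", "D", "E"]),
    ("record", 0, [2, 1, 3, 0]),
]
batches = decode_stream(stream)
```

Here `batches[0]` is `["A", "B", "C", "B"]` and `batches[1]` is `["D", "C", "E", "A"]`: each record batch is decoded against whichever dictionary was most recently installed for its ID.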


Custom Application Metadata
---------------------------

3 changes: 2 additions & 1 deletion format/Message.fbs
@@ -74,7 +74,8 @@ table DictionaryBatch {
data: RecordBatch;

/// If isDelta is true the values in the dictionary are to be appended to a
/// dictionary with the indicated id
/// dictionary with the indicated id. If isDelta is false this dictionary
/// should replace the existing dictionary.
isDelta: bool = false;
}

7 changes: 6 additions & 1 deletion format/Schema.fbs
@@ -267,7 +267,10 @@ table KeyValue {

/// ----------------------------------------------------------------------
/// Dictionary encoding metadata

/// Maintained for forwards compatibility. In the future,
/// dictionaries might be explicit maps between integers and values,
/// allowing for non-contiguous index values.
enum DictionaryKind : short { DenseArray }
table DictionaryEncoding {
/// The known dictionary id in the application where this data is used. In
/// the file or streaming formats, the dictionary ids are found in the
@@ -283,6 +286,8 @@ table DictionaryEncoding {
/// is used to represent ordered categorical data, and we provide a way to
/// preserve that metadata here
isOrdered: bool;

dictionaryKind: DictionaryKind;
}

/// ----------------------------------------------------------------------