[C++] Extend DictionaryBuilder to support delta dictionaries #18142

@asfimport


The IPC format allows sending additional dictionary batches with a previously seen id and an isDelta flag, which extend an existing dictionary with new entries. Right now, the DictionaryBuilder (as well as the IPC writer and reader) does not support generating delta dictionaries.

This pull request contains a basic implementation of the DictionaryBuilder with delta dictionary support. Usage of the API can be seen in the dictionary tests. The basic idea is that the user simply reuses the builder object after calling Finish(Array*) for the first time. Subsequent calls to Append create new entries only for unseen elements and reuse the ids from previous dictionaries for elements that have already been seen.
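The reuse semantics described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyDeltaDictionaryBuilder` and `ToyBatch` are hypothetical names invented for illustration, they use plain standard-library containers, and they are not the Arrow `DictionaryBuilder` API or the implementation in this pull request.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy model only (hypothetical types, not arrow::DictionaryBuilder).
struct ToyBatch {
  std::vector<std::string> dictionary;  // entries that are new since the last Finish()
  std::vector<int32_t> indices;         // ids into the concatenation of all dictionaries so far
};

class ToyDeltaDictionaryBuilder {
 public:
  // Appends one value. A value seen in any earlier batch keeps its old id;
  // an unseen value gets the next free id and joins the pending delta.
  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      const int32_t id = static_cast<int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_values_.push_back(value);
    }
    pending_indices_.push_back(it->second);
  }

  // Returns the pending delta and indices. The id map is kept, so the
  // builder can be reused and the next Finish() yields only new entries.
  ToyBatch Finish() {
    ToyBatch out{std::move(pending_values_), std::move(pending_indices_)};
    pending_values_.clear();
    pending_indices_.clear();
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> ids_;  // every value seen so far
  std::vector<std::string> pending_values_;
  std::vector<int32_t> pending_indices_;
};
```

In this model, appending "a", "b", "a" and finishing yields the dictionary {"a", "b"}. Appending "b", "c" afterwards and finishing again yields a delta containing only "c", while the index for "b" still refers to its id from the first dictionary.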

Some considerations:

  1. The API is pretty implicit. An additional flag for Finish that explicitly indicates the intent to use the builder for delta dictionary generation might be useful for avoiding errors.

  2. Right now the implementation uses an additional "overflow dictionary" to store the items already seen. This adds a copy on each Finish call and an additional lookup on each GetItem or Append call. I assume we might get away with returning Array slices from Finish, which would remove the need for the additional overflow dictionary. If the gist of the PR is approved, I can look into further optimizations.

    The Writer and Reader extensions would be pretty simple, since the DictionaryBuilder API remains basically the same. 
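The slice idea from consideration 2 could look roughly like the following toy sketch. It assumes a single append-only value store with one lookup table, where Finish only reports which range of the store is new. `ToySlicingBuilder` and `ToySlice` are hypothetical names for illustration; real Arrow arrays and buffers work differently, and this is not the proposed implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an Array slice: a range into the shared value store.
struct ToySlice {
  std::size_t offset;
  std::size_t length;
};

class ToySlicingBuilder {
 public:
  // Returns the id of the value, adding it to the single store if unseen.
  // Only one lookup is needed; there is no second "overflow" table.
  int32_t Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it != ids_.end()) return it->second;
    const int32_t id = static_cast<int32_t>(values_.size());
    values_.push_back(value);
    ids_.emplace(value, id);
    return id;
  }

  // Reports the entries added since the previous Finish() as a slice.
  // No values are copied; only the boundary is advanced.
  ToySlice Finish() {
    ToySlice out{finished_, values_.size() - finished_};
    finished_ = values_.size();
    return out;
  }

  // Reads a dictionary value by its global id.
  const std::string& Value(std::size_t id) const { return values_.at(id); }

 private:
  std::vector<std::string> values_;               // all dictionary values, in id order
  std::unordered_map<std::string, int32_t> ids_;  // value -> id
  std::size_t finished_ = 0;                      // number of values already emitted
};
```

The trade-off in this sketch is that the slices are only meaningful while the builder's store is alive, which is an artifact of the toy model rather than a statement about how Arrow manages buffer ownership.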

Reporter: Dimitri Vorona / @alendit
Assignee: Dimitri Vorona / @alendit

Externally tracked issue: #1629

Note: This issue was originally created as ARROW-2176. Please see the migration documentation for further details.