Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Support re-emitting dictionaries in ArrowStreamWriter #18547

Closed
asfimport opened this issue Mar 5, 2021 · 3 comments · Fixed by #35920
Closed

[Java] Support re-emitting dictionaries in ArrowStreamWriter #18547

asfimport opened this issue Mar 5, 2021 · 3 comments · Fixed by #35920

Comments

@asfimport
Copy link

The ArrowStreamWriter currently takes a DictionaryProvider at construction time and emits the used dicts once.

However, the streaming format allows for the dictionaries to change between record batches. It would be useful to support this mechanism. It can be worked around in various ways (e.g. manually re-emitting DictionaryBatches between calling writeBatch), but this isn't very pleasant.

We'd somehow have to reconcile this with the abstract ArrowWriter parent and the ArrowFileWriter sibling. In the latter, for example, this mechanism is not supported.

An example solution (but perhaps we can do better) might be to add a virtual writeBatch(Provider provider) method, that is UnsupportedOperationException in ArrowFileWriter, and re-emits the used dicts in ArrowStreamWriter.

In the present context just looking at dictionary replacement, not dictionary delta's.

 

 

Reporter: Joris Peeters / @jmgpeeters

Note: This issue was originally created as ARROW-11869. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Todd Farmer / @toddfarmer:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

@adamreeve
Copy link
Contributor

We've run into this issue and I'd like to work on a fix. It appears that the C++ implementation handles this transparently by re-writing any dictionaries that have changed, so it should be possible to implement a similar solution in Java:

Status WriteDictionaries(const RecordBatch& batch) {

@adamreeve
Copy link
Contributor

take

@lidavidm lidavidm added this to the 13.0.0 milestone Jun 9, 2023
lidavidm pushed a commit that referenced this issue Jun 9, 2023
#35920)

### Rationale for this change

This allows writing IPC streams where dictionary values change between record batches.

### What changes are included in this PR?

* Add new abstract `void ensureDictionariesWritten(DictionaryProvider provider, Set<Long> dictionaryIdsUsed)` to the base `ArrowWriter` class
* Move existing logic that only writes dictionaries once into the `ArrowFileWriter` class
* Implement replacement dictionary writing in `ArrowStreamWriter` by keeping copies of previously written dictionaries

### Are these changes tested?

Yes, I've added a new unit test for this

### Are there any user-facing changes?

Yes, `ArrowStreamWriter` will now write replacement dictionaries when dictionary values change between batches.

**This PR includes breaking changes to public APIs.**

`ArrowWriter` has a new abstract `ensureDictionariesWritten` method. This will only affect users directly inheriting from  `ArrowWriter` rather than `ArrowFileWriter` or `ArrowStreamWriter`.

There's also a behaviour change to `ArrowWriter`, where previously dictionaries were read from a `DictionaryProvider` on construction, but this is now delayed until the first batch is written.
* Closes: #18547

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants