You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The ArrowStreamWriter currently takes a DictionaryProvider at construction time and emits the used dicts once.
However, the streaming format allows for the dictionaries to change between record batches. It would be useful to support this mechanism. It can be worked around in various ways (e.g. manually re-emitting DictionaryBatches between calling writeBatch), but this isn't very pleasant.
We'd somehow have to reconcile this with the abstract ArrowWriter parent and the ArrowFileWriter sibling. In the latter, for example, this mechanism is not supported.
An example solution (but perhaps we can do better) might be to add a virtual writeBatch(Provider provider) method, that is UnsupportedOperationException in ArrowFileWriter, and re-emits the used dicts in ArrowStreamWriter.
In the present context just looking at dictionary replacement, not dictionary delta's.
Todd Farmer / @toddfarmer:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
We've run into this issue and I'd like to work on a fix. It appears that the C++ implementation handles this transparently by re-writing any dictionaries that have changed, so it should be possible to implement a similar solution in Java:
#35920)
### Rationale for this change
This allows writing IPC streams where dictionary values change between record batches.
### What changes are included in this PR?
* Add new abstract `void ensureDictionariesWritten(DictionaryProvider provider, Set<Long> dictionaryIdsUsed)` to the base `ArrowWriter` class
* Move existing logic that only writes dictionaries once into the `ArrowFileWriter` class
* Implement replacement dictionary writing in `ArrowStreamWriter` by keeping copies of previously written dictionaries
### Are these changes tested?
Yes, I've added a new unit test for this
### Are there any user-facing changes?
Yes, `ArrowStreamWriter` will now write replacement dictionaries when dictionary values change between batches.
**This PR includes breaking changes to public APIs.**
`ArrowWriter` has a new abstract `ensureDictionariesWritten` method. This will only affect users directly inheriting from `ArrowWriter` rather than `ArrowFileWriter` or `ArrowStreamWriter`.
There's also a behaviour change to `ArrowWriter`, where previously dictionaries were read from a `DictionaryProvider` on construction, but this is now delayed until the first batch is written.
* Closes: #18547
Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
The ArrowStreamWriter currently takes a DictionaryProvider at construction time and emits the used dicts once.
However, the streaming format allows for the dictionaries to change between record batches. It would be useful to support this mechanism. It can be worked around in various ways (e.g. manually re-emitting DictionaryBatches between calling writeBatch), but this isn't very pleasant.
We'd somehow have to reconcile this with the abstract ArrowWriter parent and the ArrowFileWriter sibling. In the latter, for example, this mechanism is not supported.
An example solution (but perhaps we can do better) might be to add a virtual
writeBatch(Provider provider)
method, that is UnsupportedOperationException in ArrowFileWriter, and re-emits the used dicts in ArrowStreamWriter.In the present context just looking at dictionary replacement, not dictionary delta's.
Reporter: Joris Peeters / @jmgpeeters
Note: This issue was originally created as ARROW-11869. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: