ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer #8302
@wesm This is a first draft. It is functional (i.e. fixes the underlying issue) but doesn't try to detect deltas.
// TODO: Check for delta dictionaries. Can we scan for deltas while computing
// the RecordBatch payload to save time?
RETURN_NOT_OK(WriteDictionaries(batch));
I assume this will populate last_dictionaries_ entries even when the length of the dictionary is zero? I recall there was some discussion about empty dictionaries (and whether they need to be written at all) at the start of a stream.
Ah, we can skip zero-length dictionaries explicitly, but won't it confuse the IPC reader?
I'm not sure if skipped dictionaries will work, but they are permitted by the format, so it should probably be investigated in a follow-up issue.
c414eec to cd759d0
I still need to add tests for this (I'll do that once #8309 is merged). Detecting and writing delta dictionaries is not implemented, but that's a feature, not a bug, so it may also wait for another PR.
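For readers following the delta discussion: a delta dictionary only appends entries to the dictionary already sent, so detecting one amounts to checking that the previous dictionary is a strict prefix of the new one. The sketch below illustrates that check on plain vectors of strings. The names `Dictionary`, `DictionaryChange`, `ClassifyChange` and `DeltaEntries` are hypothetical and are not part of the Arrow API or of this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary's values; not an Arrow type.
using Dictionary = std::vector<std::string>;

enum class DictionaryChange { kUnchanged, kDelta, kReplacement };

// Classify how `current` relates to the dictionary previously written.
// kDelta means `previous` is a strict prefix of `current`, i.e. entries
// were only appended. Anything else that differs is a full replacement.
DictionaryChange ClassifyChange(const Dictionary& previous,
                                const Dictionary& current) {
  if (current == previous) {
    return DictionaryChange::kUnchanged;
  }
  if (current.size() > previous.size() &&
      std::equal(previous.begin(), previous.end(), current.begin())) {
    return DictionaryChange::kDelta;
  }
  return DictionaryChange::kReplacement;
}

// Return only the appended entries. The caller must first check that
// ClassifyChange(previous, current) returned kDelta.
Dictionary DeltaEntries(const Dictionary& previous, const Dictionary& current) {
  const auto offset = static_cast<std::ptrdiff_t>(previous.size());
  return Dictionary(current.begin() + offset, current.end());
}
```

With `{"x", "y"}` already written, `{"x", "y", "z"}` classifies as a delta carrying only `"z"`, while `{"y", "x"}` is a replacement. A real implementation would compare Arrow arrays rather than string vectors, and the comparison cost is what the TODO about scanning during payload computation is concerned with.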
cd759d0 to 3e5e6df
When a dictionary changes from the previous batch, we should re-emit it.
3e5e6df to 4e5dac1
I can take a final look this morning.
When a dictionary changes from the previous batch, it is emitted again in the IPC stream. If this happens when writing the IPC file format, an error is returned.

Closes apache#8302 from pitrou/ARROW-10121-ipc-emit-dicts

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm@apache.org>
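The behavior described in the commit message can be sketched with a toy writer that remembers the last dictionary written for each dictionary id. This is a simplified illustration, not the code from this PR: `ToyDictionaryWriter`, its `bool` return value (standing in for an error status), and the `emitted()` log are all hypothetical. Only the member name `last_dictionaries_` is taken from the review discussion.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary's values; not an Arrow type.
using Dictionary = std::vector<std::string>;

// Toy model of the writer's bookkeeping, assuming one dictionary per id.
class ToyDictionaryWriter {
 public:
  explicit ToyDictionaryWriter(bool is_file_format)
      : is_file_format_(is_file_format) {}

  // Called once per dictionary for every batch. Returns false where the
  // real writer would return an error status.
  bool WriteDictionary(std::int64_t id, const Dictionary& dict) {
    auto it = last_dictionaries_.find(id);
    if (it == last_dictionaries_.end()) {
      // First time this id is seen: always emit.
      emitted_.push_back(id);
      last_dictionaries_.emplace(id, dict);
      return true;
    }
    if (it->second == dict) {
      // Same dictionary as in the previous batch: nothing to emit.
      return true;
    }
    if (is_file_format_) {
      // In this sketch the file format rejects a changed dictionary.
      return false;
    }
    // Stream format: emit the changed dictionary again.
    emitted_.push_back(id);
    it->second = dict;
    return true;
  }

  // Ids of the dictionary messages emitted so far, in order.
  const std::vector<std::int64_t>& emitted() const { return emitted_; }

 private:
  bool is_file_format_;
  std::map<std::int64_t, Dictionary> last_dictionaries_;
  std::vector<std::int64_t> emitted_;
};
```

Writing the same dictionary for id 0 twice to a stream writer emits it once; writing a different dictionary afterwards emits it a second time. The same sequence against a file writer fails on the changed dictionary.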