Ensure dict encoded index types match from record batch to record batch (#148)

Fixes #144.

The core issue: the initial record batch had a dict-encoded column whose index type ended up as Int8. Subsequent record batches, however, go through a different code path for dict-encoded columns, because we need to check whether a dictionary delta message must be sent (i.e. there are new pooled values that need to be serialized). In that code path, the index type was computed from the total length of the input column instead of matching what was already serialized in the initial schema message.

This opens up the question of another possible failure: an initial dict-encoded column may be serialized with an index type of Int8, yet subsequent record batches may include enough unique values to overflow that index type. I've added an error check for this case. Currently it's a fatal error that stops the `Arrow.write` process completely. I'm not quite sure what the best recommendation is in that case: ultimately the user needs to widen the index type of the first record batch's column, but perhaps we should allow passing a dict-encoded index type to the overall `Arrow.write` function so users can easily specify what that type should be.

The other change in this PR is on the reading side, since we're now tracking the index type in the `DictEncoding` type itself (probably not coincidentally, this is what the arrow-json struct already does). For reading, we already have access to the dictionary field, so it's just a matter of deserializing the index type before constructing the `DictEncoding` struct.
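To illustrate the two pieces of logic described above, here is a minimal sketch (in Python, not Arrow.jl source; the function names `smallest_index_type` and `check_index_type` are hypothetical): picking the narrowest signed index type for a dictionary, and erroring when later batches add enough pooled values to overflow the index type that was already serialized in the initial schema message.

```python
# Maximum index value representable by each signed index type.
INT_RANGES = {
    "Int8": 127,
    "Int16": 32767,
    "Int32": 2147483647,
    "Int64": 9223372036854775807,
}

def smallest_index_type(num_unique):
    # Narrowest signed type whose range covers indices 0..num_unique-1.
    for name, max_index in INT_RANGES.items():
        if num_unique - 1 <= max_index:
            return name
    raise ValueError("too many unique values for any index type")

def check_index_type(initial_type, total_unique):
    # Subsequent record batches must reuse the index type serialized in
    # the initial schema message; fail loudly if new pooled values from
    # a dictionary delta would overflow it.
    if total_unique - 1 > INT_RANGES[initial_type]:
        raise OverflowError(
            f"dict-encoded column overflowed index type {initial_type}; "
            f"{total_unique} unique values need "
            f"{smallest_index_type(total_unique)}"
        )
    return initial_type
```

The bug fixed here corresponds to calling something like `smallest_index_type` on each batch independently instead of carrying the initial type forward; the new fatal error corresponds to the `OverflowError` branch.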
3 changed files with 38 additions and 11 deletions.