Don't use ChainedVector as DictEncoding data array unless necessary (#110)

Fixes #109. The issue: when reading Arrow record batches with dict-encoded
columns, we eagerly used a `ChainedVector` as the underlying array backing the
`DictEncoding`, in case subsequent record batches added additional elements to
the dict encoding. That's too eager, though: in the common case, such as
"feather" files, the dict encoding values are all known and provided in the
first record batch. In fact, several language implementations don't even
support this kind of "delta" dict update in subsequent record batches. This PR
therefore uses a regular array to back the dict encoding for the first record
batch, and only promotes to a `ChainedVector` if we actually receive a delta
update.
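For illustration, here's a minimal sketch of that promote-on-demand pattern
outside of Arrow.jl internals. The vectors `first_batch`, `delta_batch`, and
`dict_data` are made-up stand-ins for the dictionary values parsed from each
batch; `ChainedVector` comes from SentinelArrays.jl, which Arrow.jl depends on:

```julia
# Sketch of the lazy-promotion idea, not actual Arrow.jl internals.
# Keep a plain Vector until a delta dictionary batch actually shows up,
# then wrap the existing data plus the new values in a ChainedVector.
using SentinelArrays: ChainedVector

first_batch = ["apple", "banana", "cherry"]   # dictionary values from the first record batch
dict_data = first_batch                       # common case: plain Vector, no wrapping

delta_batch = ["date", "elderberry"]          # a later "delta" dictionary update (rare)
if dict_data isa ChainedVector
    append!(dict_data, delta_batch)           # already promoted: just append the new chunk
else
    dict_data = ChainedVector([dict_data, delta_batch])  # promote only now
end

@assert length(dict_data) == 5
```

The `isa ChainedVector` branch only exists so that repeated deltas keep
appending to the same `ChainedVector` instead of re-wrapping it every time,
which mirrors the change in the diff below.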
quinnj committed Jan 23, 2021
1 parent 12eb00a commit e7cd867c05233fb0b306d99a528d4a1474b9f8a6
Showing 1 changed file with 7 additions and 2 deletions.
@@ -234,13 +234,18 @@ function Table(bytes::Vector{UInt8}, off::Integer=1, tlen::Union{Integer, Nothin
                 field = dictencoded[id]
                 values, _, _ = build(field, field.type, batch, recordbatch, dictencodings, Int64(1), Int64(1), convert)
                 dictencoding = dictencodings[id]
-                append!(dictencoding.data, values)
+                if typeof(dictencoding.data) <: ChainedVector
+                    append!(dictencoding.data, values)
+                else
+                    A = ChainedVector([dictencoding.data, values])
+                    dictencodings[id] = DictEncoding{eltype(A), typeof(A)}(id, A, field.dictionary.isOrdered, values.metadata)
+                end
                 continue
             end
             # new dictencoding or replace
             field = dictencoded[id]
             values, _, _ = build(field, field.type, batch, recordbatch, dictencodings, Int64(1), Int64(1), convert)
-            A = ChainedVector([values])
+            A = values
             dictencodings[id] = DictEncoding{eltype(A), typeof(A)}(id, A, field.dictionary.isOrdered, values.metadata)
             @debug 1 "parsed dictionary batch message: id=$id, data=$values\n"
         elseif header isa Meta.RecordBatch
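After this change, reading a typical single-record-batch file (the usual
output of feather-style writers) leaves the dictionary pool backed by a plain
array; only files that actually ship delta dictionary batches pay for the
`ChainedVector` wrapper. A hypothetical read, assuming a file `data.arrow`
with a dict-encoded column `x`:

```julia
using Arrow

tbl = Arrow.Table("data.arrow")   # hypothetical single-record-batch file
col = tbl.x                       # dict-encoded column (Arrow.DictEncoded)
first(col)                        # lookups now resolve against a plain-Vector-backed pool
```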
