Hey!
We've been using both adbc_snowflake and databricks-adbc to pull data, and we noticed that string columns always come back as plain Utf8, no matter how repetitive the values are. Even when a 1M-row column has only 10 distinct values, we get a flat Utf8 array rather than Dictionary<Int32, Utf8>.
A few questions:
- Is this the intended behavior? We weren't sure whether the warehouse ships strings as dictionary-encoded on the wire and the driver flattens them, or whether they always come in flat from the source.
- Is there an option we're missing? Something to either preserve dictionary encoding (if it exists upstream), or otherwise get Dictionary<Int32, Utf8> back from the driver?
- If there isn't, would you be open to adding one? Something like a cardinality threshold — columns where the distinct-to-row-count ratio is below some limit get cast to Dictionary<Int32, Utf8> before reaching the consumer.
It matters to us because we serialize the data to Arrow IPC and send it over a network. Dictionary-encoded strings are a few times smaller on the wire and the decoder on the other end has a fast path for that type. Today we walk each RecordBatch ourselves and cast qualifying Utf8 columns to dictionaries — it works, but it feels like the kind of thing that should live in the driver, especially since every consumer doing IPC over a network probably wants the same thing.
Happy to share more detail if useful. Thanks!