
String columns come back as Utf8 even when cardinality is tiny — is that expected? #4271

@PavloPolovyi

Description

Hey!

We've been using both adbc_snowflake and databricks-adbc to pull data, and we noticed that string columns always come back as plain Utf8, no matter how repetitive the values are. Even when a 1M-row column has only 10 distinct values, we get a flat Utf8 array rather than Dictionary<Int32, Utf8>.
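
For context, here's roughly what we're doing. A minimal sketch, assuming the Python DBAPI bindings (Snowflake shown, but we see the same from the Databricks driver); the URI, table, and column names are placeholders:

```python
# Minimal repro sketch. Assumes the Python DBAPI bindings for the
# Snowflake driver; URI, table, and column names are placeholders.
import adbc_driver_snowflake.dbapi

with adbc_driver_snowflake.dbapi.connect("<snowflake-uri>") as conn:
    with conn.cursor() as cur:
        # low_card_col has ~10 distinct values across ~1M rows
        cur.execute("SELECT low_card_col FROM big_table")
        table = cur.fetch_arrow_table()

# Hoped for: dictionary<values=string, indices=int32>
# Actual:    string (plain Utf8)
print(table.schema)
```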

A few questions:

  1. Is this the intended behavior? We weren't sure whether the warehouse ships strings as dictionary-encoded on the wire and the driver flattens them, or whether they always come in flat from the source.
  2. Is there an option we're missing? Something to either preserve dictionary encoding (if it exists upstream), or otherwise get Dictionary<Int32, Utf8> back from the driver?
  3. If there isn't, would you be open to adding one? Something like a cardinality threshold: columns where the distinct-to-row-count ratio is below some limit get cast to Dictionary<Int32, Utf8> before they reach the consumer (essentially what our workaround sketch below does).

It matters to us because we serialize the data to Arrow IPC and send it over the network. Dictionary-encoded strings are a few times smaller on the wire, and the decoder on the other end has a fast path for that type. Today we walk each RecordBatch ourselves and cast qualifying Utf8 columns to dictionaries (sketch below); it works, but it feels like the kind of thing that should live in the driver, especially since every consumer doing IPC over a network probably wants the same thing.
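
Roughly, the current workaround looks like this. A minimal sketch in pyarrow; the threshold value and the helper name are ours, not anything the driver provides:

```python
import pyarrow as pa
import pyarrow.compute as pc

def dict_encode_low_cardinality(batch: pa.RecordBatch,
                                max_ratio: float = 0.1) -> pa.RecordBatch:
    """Cast Utf8 columns whose distinct-to-row ratio is below max_ratio
    to Dictionary<Int32, Utf8>; leave every other column untouched."""
    out = []
    for col in batch.columns:
        if pa.types.is_string(col.type) and batch.num_rows > 0:
            distinct = pc.count_distinct(col).as_py()
            if distinct / batch.num_rows < max_ratio:
                col = col.dictionary_encode()  # int32 indices by default
        out.append(col)
    return pa.RecordBatch.from_arrays(out, names=batch.schema.names)
```

One wrinkle worth noting: because we encode per batch, each batch can end up with a different dictionary, so the IPC writer has to emit dictionary replacements (or the dictionaries have to be unified up front).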

Happy to share more detail if useful. Thanks!
