
String columns come back as Utf8 even when cardinality is tiny — is that expected? #4271

@PavloPolovyi

Description

Hey!

We've been using both adbc_snowflake and databricks-adbc to pull data, and we noticed that string columns always come back as plain Utf8, no matter how repetitive the values are. Even when a 1M-row column has only 10 distinct values, we get a flat Utf8 array rather than Dictionary<Int32, Utf8>.
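
For context, here's roughly what we're doing. A minimal sketch, assuming the Python DBAPI bindings (Snowflake shown, but we see the same from the Databricks driver); the URI, table, and column names are placeholders:

```python
# Minimal repro sketch. Assumes the Python DBAPI bindings for the
# Snowflake driver; URI, table, and column names are placeholders.
import adbc_driver_snowflake.dbapi

with adbc_driver_snowflake.dbapi.connect("<snowflake-uri>") as conn:
    with conn.cursor() as cur:
        # low_card_col has ~10 distinct values across ~1M rows
        cur.execute("SELECT low_card_col FROM big_table")
        table = cur.fetch_arrow_table()

# Hoped for: dictionary<values=string, indices=int32>
# Actual:    string (plain Utf8)
print(table.schema)
```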

A few questions:

  1. Is this the intended behavior? We weren't sure whether the warehouse ships strings as dictionary-encoded on the wire and the driver flattens them, or whether they always come in flat from the source.
  2. Is there an option we're missing? Something to either preserve dictionary encoding (if it exists upstream), or otherwise get Dictionary<Int32, Utf8> back from the driver?
  3. If there isn't, would you be open to adding one? Something like a cardinality threshold: columns where the distinct-to-row-count ratio is below some limit get cast to Dictionary<Int32, Utf8> before they reach the consumer (essentially what our workaround sketch below does).

It matters to us because we serialize the data to Arrow IPC and send it over the network. Dictionary-encoded strings are a few times smaller on the wire, and the decoder on the other end has a fast path for that type. Today we walk each RecordBatch ourselves and cast qualifying Utf8 columns to dictionaries (sketch below); it works, but it feels like the kind of thing that should live in the driver, especially since every consumer doing IPC over a network probably wants the same thing.
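
Roughly, the current workaround looks like this. A minimal sketch in pyarrow; the threshold value and the helper name are ours, not anything the driver provides:

```python
import pyarrow as pa
import pyarrow.compute as pc

def dict_encode_low_cardinality(batch: pa.RecordBatch,
                                max_ratio: float = 0.1) -> pa.RecordBatch:
    """Cast Utf8 columns whose distinct-to-row ratio is below max_ratio
    to Dictionary<Int32, Utf8>; leave every other column untouched."""
    out = []
    for col in batch.columns:
        if pa.types.is_string(col.type) and batch.num_rows > 0:
            distinct = pc.count_distinct(col).as_py()
            if distinct / batch.num_rows < max_ratio:
                col = col.dictionary_encode()  # int32 indices by default
        out.append(col)
    return pa.RecordBatch.from_arrays(out, names=batch.schema.names)
```

One wrinkle worth noting: because we encode per batch, each batch can end up with a different dictionary, so the IPC writer has to emit dictionary replacements (or the dictionaries have to be unified up front).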

Happy to share more detail if useful. Thanks!
