Skip to content

Support dictionary literals in Substrait producer#22041

Draft
alexanderbianchi wants to merge 1 commit intoapache:branch-53from
alexanderbianchi:codex/substrait-dictionary-literals
Draft

Support dictionary literals in Substrait producer#22041
alexanderbianchi wants to merge 1 commit intoapache:branch-53from
alexanderbianchi:codex/substrait-dictionary-literals

Conversation

@alexanderbianchi
Copy link
Copy Markdown
Contributor

@alexanderbianchi alexanderbianchi commented May 6, 2026

Rationale for this change

The Substrait producer currently fails when a logical plan contains a ScalarValue::Dictionary, for example a string literal that was coerced to match a dictionary-encoded string column. Dictionary encoding is a physical representation detail; the Substrait literal should carry the logical inner value.

A bit more context on why this takes the approach of serializing the inner value:

There are two unrelated "dictionary" concepts that are easy to conflate here:

Map/dictionary value:
  {"service": "beagle", "dc": "us1"}
  logical object/map type

Arrow dictionary encoding:
  Dictionary(Int32, Utf8)
  physically encoded string column: integer keys + string values

This PR is about the second one. The logical value is still just a string.

The failing path we hit was:

metric_name = 'req.latency'

where the table schema exposes metric_name as:

Dictionary(Int32, Utf8)

DataFusion type coercion makes both sides of the equality compatible, so the predicate becomes conceptually:

Column(metric_name: Dictionary(Int32, Utf8))
=
Literal(Dictionary(Int32, Utf8("req.latency")))

That scalar is not a map/object value. It is DataFusion representing a string scalar that has been coerced to match a dictionary-encoded string column.

The Substrait producer then failed with:

Unsupported literal: Dictionary(Int32, Utf8("req.latency"))

In Substrait, there is no useful distinction between a string scalar and a "dictionary-encoded string scalar" here. Dictionary encoding is meaningful for arrays/columns, not for a single scalar literal. So the intended encoding is just the logical literal value:

Substrait string literal "req.latency"

The column/scan can still be dictionary encoded when the plan is consumed against a table schema where metric_name is Dictionary(Int32, Utf8). At that point DataFusion can again apply its normal coercion/execution behavior for comparing the dictionary column to the string literal.

So the key point is: this PR is not trying to encode dictionary array layout into Substrait literals. It is preserving the logical scalar value while avoiding a producer failure caused by DataFusion's internal ScalarValue::Dictionary representation after type coercion.

What changes are included in this PR?

This unwraps ScalarValue::Dictionary(_, value) in the Substrait literal producer and serializes the inner scalar value. It also adds a regression test showing a dictionary-encoded UTF8 scalar is emitted as a string literal and round trips as the logical UTF8 scalar.

Are these changes tested?

Yes:

cargo +1.91 test -p datafusion-substrait dictionary_literals_are_serialized_as_inner_values -- --nocapture

@github-actions github-actions Bot added the substrait Changes to the substrait crate label May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant