fix: rebind RecursiveQueryExec batches to the declared output schema #21770
Merged
adriangb merged 1 commit intoapache:mainfrom Apr 22, 2026
kosiew
approved these changes
Apr 22, 2026
When a recursive CTE's anchor term aliases a computed column (e.g. `upper(val) AS val`) and the recursive term leaves the same expression un-aliased (`upper(r.val)`), `RecursiveQueryExec` declared its output schema from the anchor but forwarded batches from both branches with their native schemas intact. Downstream consumers that key on `batch.schema().field(i).name()` (TopK for ORDER BY + LIMIT, CSV/JSON writers, custom collectors) then observed the recursive branch's leaked field name instead of the anchor's.

Rebind each emitted batch to the declared output schema in `RecursiveQueryStream::push_batch`. Logical-plan coercion in `LogicalPlanBuilder::to_recursive_query` already guarantees matching column types, so this is a zero-copy field rebind.

Regression coverage:
- Rust test in `datafusion/core/tests/sql/select.rs` asserts every collected `RecordBatch` carries the anchor's field names.
- sqllogictest in `cte.slt` round-trips the result through a headered CSV file (whose header row is written from each batch's own schema) and re-reads it to surface the leaked name inside SLT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3d4b807 to
47d75a6
Which issue does this PR close?
Rationale for this change
A recursive CTE whose anchor aliases a computed column (e.g. `upper(val) AS val`) and whose recursive term leaves the same expression un-aliased (`upper(r.val)`) currently returns the wrong column name, but only when the outer query has both `ORDER BY` and `LIMIT`. The plan-level schema is correct (taken from the anchor), but `RecursiveQueryExec` forwards recursive-term `RecordBatch`es with their native schemas intact. Downstream consumers that key on `batch.schema().field(i).name()` (`SortExec`'s TopK path, CSV/JSON writers, user-code collectors) then observe the leaked recursive-branch name instead of the anchor's.

MRE (fails on `datafusion-cli` pre-fix):

Pre-fix header column reads `upper(r.val)`; expected `val`.

Only `ORDER BY + LIMIT` triggers it because:
- `SortExec` without fetch re-materialises batches via `ExternalSorter` (stable schema).
- `LimitExec` without sort sits above `RecursiveQueryExec`, never mixing branches.
- `SortExec` with fetch uses the TopK path, which emits `interleave_record_batch` output that carries whichever input batch's schema was used last.

What changes are included in this PR?
In `RecursiveQueryStream::push_batch`, rebind each emitted batch to the declared output schema (taken from the anchor term). Logical-plan coercion in `LogicalPlanBuilder::to_recursive_query` already guarantees matching column types, so this is a zero-copy field rebind. 14 lines of production code + comment.

Are these changes tested?
Yes.
- `datafusion/core/tests/sql/select.rs::test_recursive_cte_batch_schema_stable_with_order_by_limit`: runs the MRE and asserts every collected `RecordBatch`'s schema field names equal `["id", "parent_id", "ts", "val"]`. Fails pre-fix with `left: ["id", "parent_id", "ts", "upper(r.val)"]`.
- `datafusion/sqllogictest/test_files/cte.slt`: round-trips the buggy query through a headered CSV file (whose header is written from each batch's schema) and re-reads it as headerless CSV so the header row is compared as a data row. SLT otherwise cannot assert column names directly, so this is the only way to surface the leak inside SLT.

Both regression tests were verified to fail on the base branch before the fix was applied and pass after.
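The idea behind the two regression checks above can be pictured with a small stdlib-only toy model. This is a sketch under stated assumptions, not the PR's test code: `NamedBatch`, `first_schema_mismatch`, `write_csv_with_header` and `read_headerless` are hypothetical names invented for illustration, and the CSV handling ignores quoting and escaping.

```rust
/// Toy stand-in for a record batch (hypothetical, not an Arrow type):
/// the batch's own column names plus rows of string cells.
struct NamedBatch {
    field_names: Vec<String>,
    rows: Vec<Vec<String>>,
}

/// Mirrors the Rust test's idea: return the index and field names of the
/// first batch whose names differ from `expected`, or `None` if all match.
fn first_schema_mismatch(
    batches: &[NamedBatch],
    expected: &[&str],
) -> Option<(usize, Vec<String>)> {
    batches.iter().enumerate().find_map(|(i, b)| {
        let same = b
            .field_names
            .iter()
            .map(String::as_str)
            .eq(expected.iter().copied());
        if same {
            None
        } else {
            Some((i, b.field_names.clone()))
        }
    })
}

/// Mirrors the SLT trick, step 1: write a header taken from the batch's
/// own field names, then the data rows. No quoting or escaping is done.
fn write_csv_with_header(batch: &NamedBatch) -> String {
    let mut out = batch.field_names.join(",");
    out.push('\n');
    for row in &batch.rows {
        out.push_str(&row.join(","));
        out.push('\n');
    }
    out
}

/// Mirrors the SLT trick, step 2: read the text as headerless CSV, so the
/// header line comes back as an ordinary data row that can be compared.
fn read_headerless(csv: &str) -> Vec<Vec<String>> {
    csv.lines()
        .map(|line| line.split(',').map(str::to_string).collect())
        .collect()
}
```

In this model, a batch that kept the name `upper(r.val)` is reported by `first_schema_mismatch`, and the same name reappears as the first row returned by `read_headerless`, which is what makes a leaked name visible to a row-comparing test.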
Are there any user-facing changes?
Recursive CTEs with mismatched anchor/recursive column names will now emit batches with the anchor-declared names consistently, regardless of downstream operators. No API changes.
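For readers unfamiliar with what a zero-copy field rebind means here, the following is a minimal stdlib-only sketch of the idea, not the code from this PR. `Field`, `Schema`, `Batch`, `rebind` and `schema` are hypothetical toy types and helpers; in arrow-rs terms the analogous operation would presumably be building a new `RecordBatch` from the declared schema and the existing column arrays, but that is an assumption, since the diff is not shown here.

```rust
use std::sync::Arc;

/// Toy stand-in for an Arrow field: a column name plus a type tag.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: &'static str,
}

/// Toy stand-in for an Arrow schema.
#[derive(Debug, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

/// Toy stand-in for a record batch: a shared schema plus shared column buffers.
#[derive(Debug, Clone)]
struct Batch {
    schema: Arc<Schema>,
    columns: Vec<Arc<Vec<String>>>,
}

/// Re-stamp `batch` with the `declared` schema while reusing its column buffers.
/// Only field names may differ; a column-count or type mismatch is an error.
fn rebind(batch: &Batch, declared: &Arc<Schema>) -> Result<Batch, String> {
    let actual = &batch.schema.fields;
    if actual.len() != declared.fields.len() {
        return Err(format!(
            "expected {} columns, got {}",
            declared.fields.len(),
            actual.len()
        ));
    }
    for (have, want) in actual.iter().zip(&declared.fields) {
        if have.data_type != want.data_type {
            return Err(format!(
                "column {:?} has type {}, declared type is {}",
                have.name, have.data_type, want.data_type
            ));
        }
    }
    Ok(Batch {
        schema: Arc::clone(declared),
        // Cloning each Arc bumps a reference count; the cell data is not copied.
        columns: batch.columns.clone(),
    })
}

/// Hypothetical convenience constructor used only by this example.
fn schema(cols: &[(&str, &'static str)]) -> Arc<Schema> {
    Arc::new(Schema {
        fields: cols
            .iter()
            .map(|&(name, data_type)| Field {
                name: name.to_string(),
                data_type,
            })
            .collect(),
    })
}
```

Rebinding a batch whose second field is named `upper(r.val)` against a declared schema of `id`, `val` yields a batch that reports `val` while pointing at the same column buffers. The type check in the sketch stands in for the guarantee that the PR attributes to logical-plan coercion.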