[SPARK-55794][FOLLOWUP][SQL] Wrap outer star expansion results in Alias in ResolveReferences#55606
Conversation
47d5007 to
939ad7a
Compare
939ad7a to
498a3f5
Compare
cloud-fan
left a comment
There was a problem hiding this comment.
Thanks for the followup. The Project-output ExprId leak fix is correct and the new derived-table-wrapper test demonstrates the bug well. Two concerns:
1. Scope of the wrap (design). expand is shared with the function-arg / CreateStruct / CreateArray / In / Murmur3Hash / XxHash64 callers (Analyzer.scala:1969-1995). Those callers don't expose Project.output, so they don't have the leak you're fixing. Wrapping their expansions in Alias is functionally harmless — CleanupAliases strips non-top-level aliases at runtime — but the enclosing Alias's auto-generated name is computed via toPrettySQL before CleanupAliases runs, so the inner AS c1, AS c2 gets baked into the column name. This is visible in selectExcept.sql.out:
- +- Project [coalesce(...) AS coalesce(outer(v1.c1), outer(v1.c2), ...)#x]
+ +- Project [coalesce(...) AS coalesce(outer(v1.c1) AS c1, outer(v1.c2) AS c2, ...)#x]
and in the aliasedNewArray.sql rename in ResolveSubquerySuite. Could the alias wrapping be moved to buildExpandedProjectList (the Project / Aggregate / CollectMetrics call site) instead, so non-projection callers keep bare OuterReferences and don't pick up this naming change?
2. PR description. Given the column-name change above, "Does this PR introduce any user-facing change? No" isn't quite right for queries shaped like SELECT (SELECT array(t1.*) FROM ...) FROM t1 — the outer scalarsubquery(...) column name picks up the new AS c1, AS c2 text. Worth calling out, even if you decide it's acceptable.
Also: the new tests cover Project paths nicely; if you keep the wrap inside expand, consider adding a function-arg case (e.g. SELECT (SELECT array(t1.*) FROM ...) FROM t1) to scalar-subquery-select.sql to lock in the chosen behavior.
| Try(s.expand(u.children.head, resolver)) match { | ||
| case Success(expanded) => expanded.map(wrapOuterReference) | ||
| case Success(expanded) => | ||
| expanded.map { |
There was a problem hiding this comment.
Design: this expand helper is shared with the function-arg / CreateStruct / CreateArray / In / Murmur3Hash / XxHash64 callers further down (lines 1969-1995). Only the project-list path (buildExpandedProjectList) exposes Project.output, which is where the ExprId leak you're fixing actually surfaces. The other callers don't need an Alias wrap — and as the selectExcept.sql.out and ResolveSubquerySuite diffs show, wrapping them changes the auto-generated column name (the inner AS c1, AS c2 gets baked into the surrounding alias's toPrettySQL before CleanupAliases runs).
Could the wrapping be moved to buildExpandedProjectList (and similar project-list call sites) so function-arg expansions keep bare OuterReferences? That would scope the change to where the leak actually exists and avoid the column-name regression.
| alias.withNewChildren(Seq(wrapOuterReference(alias.child))) | ||
| .asInstanceOf[Alias] |
There was a problem hiding this comment.
Minor: wrapOuterReference[E <: Expression](e: E): E recursively transforms Attribute nodes and preserves the rest of the tree, so applying it to the whole alias produces the same Alias(GetStructField(OuterReference(attr), i), name) and avoids the withNewChildren + cast dance.
| alias.withNewChildren(Seq(wrapOuterReference(alias.child))) | |
| .asInstanceOf[Alias] | |
| wrapOuterReference(alias) |
| case e => | ||
| Alias(wrapOuterReference(e), toPrettySQL(e))() |
There was a problem hiding this comment.
Nit: the inner case e => shadows the outer case e: AnalysisException from the surrounding catch. It still works (the outer e is referenced from case Failure(_) => throw e at a different scope) but reads confusingly.
| case e => | |
| Alias(wrapOuterReference(e), toPrettySQL(e))() | |
| case other => | |
| Alias(wrapOuterReference(other), toPrettySQL(other))() |
What changes were proposed in this pull request?
Wrap outer star expansion results in Alias in ResolveReferences.expandStarExpressionInValidOperators. When a targeted star (e.g. t1.*)
inside a scalar subquery resolves from the outer scope, each expanded attribute is now wrapped in Alias(OuterReference(attr), name)
instead of bare OuterReference(attr).
For struct star expansion (e.g. t1.s.*), the expansion already produces Alias(GetStructField(...), fieldName). In that case we wrap
only the inner child with OuterReference, preserving the existing Alias to avoid double aliasing.
Why are the changes needed?
Without the Alias, the OuterReference attribute's ExprId leaks into the subquery scope via Project.output (since
OuterReference.toAttribute strips the wrapper). When a derived table wraps the outer star expansion, downstream operators (e.g. SELECT
*) reference this leaked ExprId directly, which can cause issues with expression ID tracking. Wrapping in Alias gives each outer
reference a fresh ExprId in the subquery's scope.
Example:
SELECT (SELECT * FROM (SELECT t1.* FROM VALUES(2) AS t2(col1) LIMIT 1)) FROM VALUES(1) AS t1(col1)
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added SQL golden file tests in scalar-subquery-select.sql:
Also fixed a pre-existing missing semicolon in the test file.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-6)