Skip to content

[SPARK-52498][SQL] Fix false ambiguous self-join error with foldable condition#56056

Closed
shrirangmhalgi wants to merge 1 commit into
apache:masterfrom
shrirangmhalgi:SPARK-52498-self-join-ambiguity
Closed

[SPARK-52498][SQL] Fix false ambiguous self-join error with foldable condition#56056
shrirangmhalgi wants to merge 1 commit into
apache:masterfrom
shrirangmhalgi:SPARK-52498-self-join-ambiguity

Conversation

@shrirangmhalgi
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Fix DetectAmbiguousSelfJoin to not flag column references as ambiguous when the root plan is a Project on top of a self-join with a foldable join condition (e.g., df.join(df, df("col") === 0).select(df("col"))).

When the join condition compares a column to a literal, it doesn't matter which side the column comes from - both sides have identical data. The ambiguity check was incorrectly rejecting this pattern.

Why are the changes needed?

df.join(df, df("col") === 0).select(df("col")) throws AnalysisException: Column are ambiguous with the regular resolver when spark.sql.analyzer.failAmbiguousSelfJoin is true. The single-pass resolver handles this correctly. This inconsistency breaks multi-layer self-join patterns that work fine with the single-pass resolver.

Does this PR introduce any user-facing change?

Yes. Self-join queries with foldable conditions followed by select no longer throw a false ambiguity error.

Design approach

First attempted a broader fix: skip ambiguity check for any column reference whose exprId matches the plan's output (outputExprIds.contains(attr.exprId)). This was too permissive - broke 4 existing tests (SPARK-28344: fail ambiguous self join - column ref in Project, join three tables, SPARK-33071, SPARK-35454: join four tables) because it suppressed legitimate ambiguity errors where the user genuinely needs to alias.

Narrowed to the specific case: only skip when the root plan is a Project directly on top of a self-join (leftId == rightId) with a foldable join condition. This correctly targets the false positive without affecting real ambiguity detection.

How was this patch tested?

Added a test in DataFrameSelfJoinSuite verifying single-layer and multi-layer self-joins with foldable conditions. All 23 existing self-join tests are passing.

Was this patch authored or co-authored using generative AI tooling?

Yes.

…condition

DetectAmbiguousSelfJoin incorrectly flags column references as ambiguous in select(df(col)) after df.join(df, df(col) === 0). When the join condition is foldable (comparing to a literal), it does not matter which side the column comes from, so the select should not be ambiguous.

The fix skips the ambiguity check when the root plan is a Project on top of a self-join with a foldable join condition.
@shrirangmhalgi
Copy link
Copy Markdown
Contributor Author

@cloud-fan could you please review the following PR.

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to push back on the premise of this PR. See the inline comment at the new comment block — the short version is that df.join(df, df("col") === 0).select(df("col")) is genuinely ambiguous in the sense DetectAmbiguousSelfJoin is concerned with, and the rule is correct to reject it. Suggest closing or reworking, and aligning the single-pass resolver to also reject this case instead.

Comment on lines +157 to +178
// SPARK-52498: For a Project on top of a self-join with a foldable join condition
// (e.g., df.join(df, df("col") === 0).select(df("col"))), the column references
// in the select are not ambiguous because the foldable condition means it doesn't
// matter which side the column comes from.
val isProjectOverFoldableSelfJoin = plan match {
case Project(_, Join(
LogicalPlanWithDatasetId(_, leftId),
LogicalPlanWithDatasetId(_, rightId),
_, Some(condition), _)) if leftId == rightId =>
condition.collectFirst {
case Equality(_, b) if b.foldable => true
case Equality(a, _) if a.foldable => true
}.isDefined
case _ => false
}
if (isProjectOverFoldableSelfJoin) {
Nil
} else {
ambiguousColRefs.toSeq.map { ref =>
colRefAttrs.find(attr => toColumnReference(attr) == ref).get
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the premise of this exemption needs reconsidering — the case it allows is genuinely ambiguous.

The justification in the comment is inaccurate. "The foldable condition means it doesn't matter which side the column comes from" is not true. The condition df("col") === 0 resolves to one side only (LEFT, since df("col") carries LEFT's exprId after ResolveDeduplicateRelations). The join is effectively:

SELECT * FROM df L JOIN df R ON L.col = 0

That filters L.col = 0 but leaves R.col unconstrained. For df = [0, 1, 2]:

L.col R.col
0 0
0 1
0 2

select df("col")) → LEFT.col → [0, 0, 0]. Selecting the right side (hypothetically) would give [0, 1, 2]. The two sides do not agree. (Note JoinWith.resolveSelfJoinCondition only rewrites trivially-true EqualTo(a, b) if a.sameRef(b)df("col") === 0 is not rewritten.)

So the column reference is ambiguous in exactly the way this rule exists to flag: the user wrote df("col") and got LEFT, but a reasonable reader of df.join(df, df("col") === 0).select(df("col")) might expect the result to be "col from the join result" — which isn't a single well-defined thing.

Internal consistency with the existing tests. SPARK-28344: fail ambiguous self join - column ref in Project in DataFrameSelfJoinSuite (line 148) already establishes that self_join.select(df("col")) is ambiguous and must throw. The PR carves out an exemption when the join condition happens to contain a foldable equality — but the foldable equality does no semantic work here (it doesn't make L.col == R.col, doesn't make df("col") resolve any differently). It's just the discriminator that separates the new test case from the four existing tests that broke with the broader fix mentioned in the PR description. Without a principled reason, this looks like reverse-engineering the rule to fit one test.

Implication for the motivation. The PR description says the single-pass resolver handles this "correctly." If we agree the case is ambiguous, then single-pass is the one with the bug — the right direction is to align single-pass with DetectAmbiguousSelfJoin (preserve the error), not to weaken DetectAmbiguousSelfJoin.

Would you mind reconsidering the direction? Happy to discuss if there's a use case I'm missing — but as-is I don't think we should land this exemption.

@shrirangmhalgi
Copy link
Copy Markdown
Contributor Author

@cloud-fan - Thanks for the thorough explanation. You're right that df("col") === 0 only constrains one side, making L.col ≠ R.col. I incorrectly assumed foldable meant symmetric. The ambiguity is real and the rule is correct to flag it. Closing this PR. The right direction is aligning the single-pass resolver to also reject this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants