Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 #43614

Closed
wants to merge 6 commits into from

Conversation

amaliujia
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes an issue that if a CTE is referenced by a non-referenced CTE, then this CTE should also have ref count as 0 and goes through CheckAnalysis0. This will guarantee analyzer throw proper error message for problematic CTE which is not referenced.

Why are the changes needed?

To improve error message for non-referenced CTE case.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

Was this patch authored or co-authored using generative AI tooling?

NO

@github-actions github-actions bot added the SQL label Nov 1, 2023
@amaliujia
Copy link
Contributor Author

@cloud-fan

@amaliujia amaliujia force-pushed the cte_ref branch 2 times, most recently from e876a60 to 96defe4 Compare November 1, 2023 05:52
val e = intercept[AnalysisException](sql(
s"""
|with
|a as (select * from t),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|a as (select * from t),
|a as (select * from non_exist),

@@ -120,6 +120,30 @@ case class InlineCTE(alwaysInline: Boolean = false) extends Rule[LogicalPlan] {
}
}
}
// If a CTE ref count is 0, the other CTE that is referencing it should also have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about changing the ref count directly. I think the ref count is correct, the problem is in CheckAnalysis when we decide which CTE relations need to be checked ahead. The current condition is simply refCount == 0, but we should also check CTE relations referenced by CTE relations that has 0 refCount.

Copy link
Contributor

@peter-toth peter-toth Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @cloud-fan. Probably you just need to call cleanCTEMap() in CheckAnalysis too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to address this comment.

// If a CTE ref count is 0, the other CTE that is referencing it should also have
// a 0 ref count.
val visited: mutable.Map[Long, Boolean] = mutable.Map.empty.withDefaultValue(false)
cteMap.foreach { case(cteId, _) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cteMap.foreach { case(cteId, _) =>
cteMap.foreach { case (cteId, (_, refCount, _)) =>

cteMap.foreach { case(cteId, _) =>
val (_, refCount, _) = cteMap(cteId)
if (refCount == 0 && !visited(cteId) ) {
reconcileCTERefCount(cteMap, visited, cteId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we should inline reconcileCTERefCount here.

@amaliujia
Copy link
Contributor Author

The code has been updated. Please take a look.

@@ -156,7 +171,15 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
// If a CTE relation is never used, it will disappear after inline. Here we explicitly check
// analysis for it, to make sure the entire query plan is valid.
try {
if (refCount == 0) checkAnalysis0(relation.child)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can keep this check as it is, but call the already existing cleanCTEMap() right after buildCTEMap() a bit above? cleanCTEMap()'s comment seems wrong and it doesn't remove CTE defs without useful refs, but it just set ref counts to 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current code is better as it guarantees the leaf CTE relations (do not reference any other) are checked first. Maybe we can make it clearer using comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I added some comments to explain we are trying to check the leaf relation first because that gives the most accurate error message.

Copy link
Contributor

@peter-toth peter-toth Nov 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply, I was travelling this week.

The reason why I would suggest the cleanCTEMap() approach is because you might be checking a CTE 2 times now. Not in this new code, as visited map makes sure to check a CTE only once here, but later when we check the inlined plan it might contain a CTE that was referenced from an other CTE with ref count = 0 (so the CTE is checked in the new code) and also from a third CTE with valid references (so the CTE is checked in the inlined plan again).

As for checking the leafs first, checking the CTEs in id order guarantees that. This is because how we do CTE resolution (assign ids to defs) in CTESubstitution. cleanCTEMap() also uses this property to quickly clear useless transitive reference counts.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 24edc0e Nov 9, 2023
beliefer added a commit that referenced this pull request Nov 10, 2023
…ions

### What changes were proposed in this pull request?
#43614 let unreferenced `CTE` checked by `CheckAnalysis0`.
This PR follows up #43614 to simplify the code for check unreferenced CTE relations.

### Why are the changes needed?
Simplify the code for check unreferenced CTE relations

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Exists test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43727 from beliefer/SPARK-45752_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants