[SPARK-56050][SQL] Eagerly resolve IDENTIFIER() with string literals at parse time#54888
[SPARK-56050][SQL] Eagerly resolve IDENTIFIER() with string literals at parse time#54888cloud-fan wants to merge 2 commits intoapache:masterfrom
Conversation
… parse time
When IDENTIFIER() is called with a string literal argument (e.g., IDENTIFIER('db.table')), the identifier can be resolved immediately at parse time rather than deferring to analysis via PlanWithUnresolvedIdentifier.
This optimizes withIdentClause to check if the expression is a string literal and resolve it eagerly, falling back to deferred resolution for non-literal expressions.
5beb0d9 to
05867ee
Compare
…withIdentClause The IdentifierReferenceContext only covers the identifier part of the SQL statement, not the full statement. Wrapping the builder call with withOrigin(ctx) caused ParseException to capture the narrow identifier position instead of the full statement origin set by the calling visit methods. Co-authored-by: Isaac
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Do you think we can have a test coverage for this, @cloud-fan ?
BTW, FYI, when I ask Cluade Caude to write a PR description based on the pull request template (here, I used the file reference feature of Claude Claude), Claude Code works well by respecting the ASF and Apache Spark Pull Request template like the the following. In these days, it's rare to write the PR description from the scratches.
spark/.github/PULL_REQUEST_TEMPLATE
Lines 55 to 56 in 11a99d0
https://www.apache.org/legal/generative-tooling.html
### Was this patch authored or co-authored using generative AI tooling?
- Yes, Claude Opus 4.6.
+ Generated-by: Claude Code (Claude Opus 4.6)|
cc @peter-toth for this SQL parser optimization. |
|
Merged to master for Apache Spark 4.2.0 because the test coverage or PR description is nit. |
|
it's a perf improvement so I didn't add a new test. Actually the PR description was generated by claude code, hopefully llm can follow the PR template more strictly after we improving |
…at parse time
### What changes were proposed in this pull request?
This PR optimizes `withIdentClause` in `AstBuilder` to eagerly resolve `IDENTIFIER()` calls with string literal arguments at parse time instead of deferring to the analysis phase via `PlanWithUnresolvedIdentifier`.
### Why are the changes needed?
When `IDENTIFIER()` is called with a string literal (e.g., `IDENTIFIER('db.table')`), the value is already known at parse time. The current implementation always defers resolution to analysis by wrapping it in `PlanWithUnresolvedIdentifier`, which is unnecessary overhead for this case.
The optimized `withIdentClause` resolves string literals eagerly and falls back to `PlanWithUnresolvedIdentifier` (deferred resolution) for non-literal expressions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
Yes, Claude Opus 4.6.
Closes apache#54888 from cloud-fan/SPARK-XXX-resolve-identifier-at-parse-time.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…hOrigin pitfall Followup to apache#54888. Adding a comment to withIdentClause explaining why withOrigin(ctx) should not be added to this method, and where to add it instead.
…hOrigin pitfall ### What changes were proposed in this pull request? Followup to #54888. This PR adds a documentation comment to `withIdentClause` in `AstBuilder` explaining why `withOrigin(ctx)` should not be added to the method body, and where to place it instead (inside specific call site builder lambdas). ### Why are the changes needed? 1. `withIdentClause` is called from many places with builders that create different types of plans — some create full command plans (e.g., `ReplaceTable`, `CreateFunction`), not just identifier-related plans. 2. Adding `withOrigin(ctx)` to the method body would set `CurrentOrigin` to the identifier reference for all code in the builder, including error-throwing validation code. Since `ParseException.getQueryContext()` reads from `CurrentOrigin`, this would cause error contexts to incorrectly point to the identifier instead of the full statement. 3. This is a non-obvious pitfall that is worth documenting to prevent future contributors from introducing this bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Comment-only change, no tests needed. ### Was this patch authored or co-authored using generative AI tooling? Yes Closes #55001 from cloud-fan/SPARK-56050-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR optimizes
withIdentClauseinAstBuilderto eagerly resolveIDENTIFIER()calls with string literal arguments at parse time instead of deferring to the analysis phase viaPlanWithUnresolvedIdentifier.Why are the changes needed?
When
IDENTIFIER()is called with a string literal (e.g.,IDENTIFIER('db.table')), the value is already known at parse time. The current implementation always defers resolution to analysis by wrapping it inPlanWithUnresolvedIdentifier, which is unnecessary overhead for this case.The optimized
withIdentClauseresolves string literals eagerly and falls back toPlanWithUnresolvedIdentifier(deferred resolution) for non-literal expressions.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
Yes, Claude Opus 4.6.