
fix(spark): Improve query error handling for UnresolvedException #18147

Open
nada-attia wants to merge 6 commits into apache:master from nada-attia:nada_oss_commit_porting_05

Conversation


@nada-attia nada-attia commented Feb 9, 2026

Describe the issue this Pull Request addresses

When Spark SQL queries contain unresolved columns or tables (e.g., typos, missing table definitions), users receive a cryptic error message like "Invalid call to dataType on unresolved object" which provides no actionable information. This PR improves error handling to catch UnresolvedException and provide user-friendly error messages that help users identify and fix the issue.

Summary and Changelog

Summary: Improved error handling in Hudi's Spark SQL analysis phase to provide clear, actionable error messages when queries contain unresolved references.

Changelog:

  • Modified ProducesHudiMetaFields.unapply in HoodieAnalysis.scala to catch UnresolvedException
  • Added collectUnresolvedReferences helper method to identify specific unresolved column/table names
  • Throws an AnalysisException with a helpful error message that includes:
    • List of unresolved references found in the query
    • Suggestions to check for typos, missing table definitions, incorrect schema references
    • Original error message for debugging
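The changelog above can be sketched in plain Scala. This is a minimal mock of the described catch-and-rethrow pattern: `UnresolvedException`, `AnalysisException`, and `Plan` are simplified stand-ins here, not Spark's real classes, and the helper mirrors the `collectUnresolvedReferences` method named in the changelog.

```scala
// Mock exception types standing in for Spark's real classes.
case class UnresolvedException(msg: String) extends RuntimeException(msg)
case class AnalysisException(msg: String) extends RuntimeException(msg)

// Mock logical plan: maps each column name to whether it resolved.
case class Plan(columns: Map[String, Boolean]) {
  def output: Seq[String] = columns.keys.toSeq.map { name =>
    if (!columns(name))
      throw UnresolvedException("Invalid call to dataType on unresolved object")
    name
  }
}

object ProducesHudiMetaFields {
  // Helper mirroring collectUnresolvedReferences: list names that never resolved.
  private def collectUnresolvedReferences(plan: Plan): Seq[String] =
    plan.columns.collect { case (name, false) => name }.toSeq

  def unapply(plan: Plan): Option[Seq[String]] =
    try Some(plan.output)
    catch {
      case _: UnresolvedException =>
        val refs = collectUnresolvedReferences(plan)
        // Re-throw with the actionable message described in the changelog.
        throw AnalysisException(
          "Failed to resolve query. The query contains unresolved columns or tables. " +
            s"Unresolved references: [${refs.mkString(", ")}]. " +
            "Please check for: (1) typos in column or table names, " +
            "(2) missing table definitions, (3) incorrect database/schema references, " +
            "(4) columns that don't exist in the source tables.")
    }
}
```

A resolved plan matches normally, while a plan with an unresolved column surfaces the enhanced message instead of the cryptic `dataType` error.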

Impact

  • User-facing change: Users will now see clear error messages like:
    Failed to resolve query. The query contains unresolved columns or tables. Unresolved references: [nonexistent_column].
    Please check for: (1) typos in column or table names, (2) missing table definitions, 
    (3) incorrect database/schema references, (4) columns that don't exist in the source tables.
    
  • No API changes
  • No performance impact

Risk Level

Low - This change only affects error handling in the analysis phase. It catches a specific exception type (UnresolvedException) and re-throws it as a more informative AnalysisException. Normal query execution paths are unaffected.

Documentation Update

None - This is an internal improvement to error messages that doesn't require documentation updates.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

closes #18151

Summary:
User errors (referencing non-existent columns in query expressions) are bubbling up as the following:

```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object
```

This message is not useful for customers. This diff improves the error handling to provide more useful information.

Test Plan:
Ran the following:

```
% TEST_CATEGORY=hudi-error-handling TEST_NAME=testDeeplyNestedUnionWithInvalidColumn drogon launch --app hudi_spark_integ_test -c dca1 -tb -d
```

Reviewers: syalla, O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, pwason

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, pwason

JIRA Issues: HUDI-7572

Differential Revision: https://code.uberinternal.com/D20875953
Add TestHoodieAnalysisErrorHandling.scala with tests to verify that:
- MergeInto with unresolved columns in source query provides helpful error messages
- MergeInto with unresolved columns in ON condition provides helpful error messages
- InsertInto from non-existent source table provides helpful error messages
- MergeInto with typos in column names provides helpful error messages
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 9, 2026
@apache apache deleted a comment from hudi-bot Feb 10, 2026
- Update InsertInto test to check for 'typos in column or table names' and 'unresolved' in error message
- Fix MergeInto typo test to reference non-existent source.pricee when source only has 'price' column

@yihua yihua left a comment


Thanks for contributing! Improving error messages for unresolved references is a worthwhile goal. My main concern is that throwing from inside a pattern-match extractor changes the control flow in a way that could prevent Spark's own (often better) error reporting from kicking in — please see the inline comments for details.

```scala
  }
} catch {
  case e: UnresolvedException =>
    val unresolvedRefs = collectUnresolvedReferences(plan)
```

This extractor is used inside pattern matches like query match { case ProducesHudiMetaFields(output) => ...; case _ => None }. Before this change, an UnresolvedException from analyzer.execute() would propagate naturally and Spark's own analyzer would eventually produce its own (usually quite good) error message. Now we're catching it and throwing a new AnalysisException — this short-circuits Spark's normal error-handling and could surface a less precise Hudi-specific message for cases where Spark would have reported the exact column/table problem. Have you considered returning None here instead of throwing, and letting Spark's built-in analysis error reporting handle it? That would preserve the pattern-match fallthrough semantics this extractor is designed for.
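The suggested alternative can be sketched with a plain-Scala mock (not Spark's real classes): on UnresolvedException the extractor returns None, so the enclosing `query match { case ProducesHudiMetaFields(out) => ...; case _ => None }` falls through and Spark's own analysis error can surface later.

```scala
// Mock exception type standing in for Spark's UnresolvedException.
case class UnresolvedException(msg: String) extends RuntimeException(msg)

object ProducesHudiMetaFieldsLenient {
  // `analyze` stands in for the analyzer.execute(plan) call in the real extractor.
  def unapply(analyze: () => Seq[String]): Option[Seq[String]] =
    try Some(analyze())
    catch {
      // Return None so the pattern match falls through, instead of
      // short-circuiting Spark's own error reporting with a new exception.
      case _: UnresolvedException => None
    }
}
```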

```scala
}
throw new AnalysisException(
  s"Failed to resolve query. The query contains unresolved columns or tables.$unresolvedInfo " +
  s"Please check for: (1) typos in column or table names, (2) missing table definitions, " +
```

I wonder if the error message is too Hudi-specific for something that's really a general Spark analysis failure. Spark already provides messages like the one below; wrapping it in a generic "check for typos" message might actually lose information. Could we revisit the exception-throwing logic to follow Spark's standard?


For example, on Spark 3.5, the second and fourth test cases already throw a readable exception without the changes in this PR:

```
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `target`.`nonexistent_id` cannot be resolved. Did you mean one of the following? [`target`.`name`, `target`.`price`, `target`.`id`, `target`.`ts`, `target`.`_hoodie_file_name`].; line 6 pos 3;
'MergeIntoTable ('target.nonexistent_id = id#106), [updateaction(None, assignment(id#115, id#106), assignment(name#116, name#107), assignment(price#117, price#108), assignment(ts#118, ts#109))], [insertaction(None, assignment(id#115, id#106), assignment(name#116, name#107), assignment(price#117, price#108), assignment(ts#118, ts#109))]
:- SubqueryAlias target
:  +- SubqueryAlias spark_catalog.default.htesthoodieanalysiserrorhandling_2
:     +- Relation spark_catalog.default.htesthoodieanalysiserrorhandling_2[_hoodie_commit_time#110,_hoodie_commit_seqno#111,_hoodie_record_key#112,_hoodie_partition_path#113,_hoodie_file_name#114,id#115,name#116,price#117,ts#118] HudiFileGroup
+- SubqueryAlias source
   +- Project [1 AS id#106, updated AS name#107, 20.0 AS price#108, 2000 AS ts#109]
      +- OneRowRelation

	at org.apache.spark.sql.errors.QueryCompilationErrors$.unresolvedAttributeError(QueryCompilationErrors.scala:306)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$$failUnresolvedAttribute(CheckAnalysis.scala:141)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:299)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:297)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name source.pricee cannot be resolved. Did you mean one of the following? [target._hoodie_commit_seqno, target._hoodie_commit_time, target._hoodie_file_name, target._hoodie_partition_path, target._hoodie_record_key, source.id, target.id, source.name, target.name, source.price, target.price, source.ts, target.ts].; line 10 pos 10
org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name source.pricee cannot be resolved. Did you mean one of the following? [target._hoodie_commit_seqno, target._hoodie_commit_time, target._hoodie_file_name, target._hoodie_partition_path, target._hoodie_record_key, source.id, target.id, source.name, target.name, source.price, target.price, source.ts, target.ts].; line 10 pos 10
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
	at org.apache.spark.sql.HoodieSpark35CatalystPlanUtils$.failAnalysisForMIT(HoodieSpark35CatalystPlanUtils.scala:80)
	at org.apache.spark.sql.hudi.analysis.ResolveReferences.$anonfun$resolveMergeExprOrFail$2(HoodieSparkBaseAnalysis.scala:270)
	at org.apache.spark.sql.hudi.analysis.ResolveReferences.$anonfun$resolveMergeExprOrFail$2$adapted(HoodieSparkBaseAnalysis.scala:265)
	at scala.collection.mutable.LinkedHashSet.foreach(LinkedHashSet.scala:95)
	at org.apache.spark.sql.catalyst.expressions.AttributeSet.foreach(AttributeSet.scala:137)
	at org.apache.spark.sql.hudi.analysis.ResolveReferences.resolveMergeExprOrFail(HoodieSparkBaseAnalysis.scala:265)
	at org.apache.spark.sql.hudi.analysis.ResolveReferences.$anonfun$resolveAssignments$1(HoodieSparkBaseAnalysis.scala:254)
```


For example, we already have a dedicated fail-analysis method; on Spark 3.5 it uses the proper error classification, which should be used here if required:

`sparkAdapter.getCatalystPlanUtils.failAnalysisForMIT`:

```scala
override def failAnalysisForMIT(a: Attribute, cols: String): Unit = {
  a.failAnalysis(
    errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
    messageParameters = Map(
      "objectName" -> a.sql,
      "proposal" -> cols))
}
```


It would be good to check version-specific exception handling in Hudi Spark integration as well.

```scala
}

// Verify the error message contains helpful information
val errorMessage = exception.getMessage
```

Several test assertions use `||` with `contains("cannot be resolved")`; that would match Spark's existing error messages even without this PR. Could you tighten the assertions to specifically verify the new behavior (e.g., assert on "Failed to resolve query" or "Please check for") so these tests actually validate the change? It would also be good to account for exception-message differences across Spark versions.
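A hypothetical tightened assertion, anchored on the markers this PR adds ("Failed to resolve query", "Please check for") rather than Spark's generic "cannot be resolved" text that pre-dates the change. The `errorMessage` value below is a stand-in for `exception.getMessage` in the real test.

```scala
// Stand-in for the message produced by the PR's enhanced error handling.
val errorMessage =
  "Failed to resolve query. The query contains unresolved columns or tables. " +
    "Please check for: (1) typos in column or table names, (2) missing table definitions."

// Assert on markers specific to the new behavior, with self-documenting failures.
assert(errorMessage.contains("Failed to resolve query"),
  s"Expected Hudi's enhanced error prefix, got: $errorMessage")
assert(errorMessage.contains("Please check for"),
  s"Expected the actionable guidance section, got: $errorMessage")

// Spark's pre-existing message would (correctly) fail the tightened checks.
val sparkMessage = "[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column cannot be resolved."
assert(!sparkMessage.contains("Failed to resolve query"))
```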

```scala
Some(resolved.output)
} else {
None
if (resolved.output.exists(attr => isMetaField(attr.name))) {
```

Instead, should we fix these places to let Spark's analysis produce a clear error?

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala:

```diff
       val sourceTable = if (sourceTableO.resolved) sourceTableO else analyzer.execute(sourceTableO)
       val m = mO.asInstanceOf[MergeIntoTable].copy(targetTable = targetTable, sourceTable = sourceTable)
       // END: custom Hudi change
-      EliminateSubqueryAliases(targetTable) match {
+      // If source table still has unresolved references (e.g., non-existent columns/tables),
+      // return the partially resolved plan and let Spark's CheckAnalysis produce a clear error.
+      if (!sourceTable.resolved) {
+        m
+      } else EliminateSubqueryAliases(targetTable) match {
         case r: NamedRelation if r.skipSchemaResolution =>
           // Do not resolve the expression if the target table accepts any schema.
```

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:

```diff
         analyzer.execute(plan)
       }

-        if (resolved.output.exists(attr => isMetaField(attr.name))) {
+        // If the plan is still not resolved (e.g., references non-existent tables/columns),
+        // skip meta-field checks and let Spark's standard analysis produce a clear error.
+        if (resolved.resolved && resolved.output.exists(attr => isMetaField(attr.name))) {
           Some(resolved.output)
         } else {
```

…d query errors

Move AnalysisException creation for unresolved query errors to version-specific
CatalystPlanUtils implementations to fix Spark 4.x build failures.

Spark 4.x removed the simple AnalysisException(message) constructor, requiring
the use of errorClass-based constructors instead. This change:

- Adds failUnresolvedQuery method to HoodieCatalystPlansUtils interface
- Implements version-specific exception creation:
  - Spark 3.3/3.4/3.5: Uses AnalysisException(message) constructor
  - Spark 4.0: Uses AnalysisException(errorClass, messageParameters)
- Updates HoodieAnalysis to delegate to sparkAdapter.getCatalystPlanUtils
- Updates test assertions to be specific to the enhanced error message format
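The version dispatch described above can be sketched as follows. The trait and method names follow the commit text, but the bodies are mocks: `RuntimeException` stands in for Spark's `AnalysisException`, and the Spark 4.0 errorClass plumbing is only gestured at, not Spark's real API.

```scala
trait HoodieCatalystPlansUtils {
  def failUnresolvedQuery(message: String): Nothing
}

// Spark 3.3/3.4/3.5: the plain message constructor still exists,
// so the message can be passed through directly.
object Spark3xPlanUtils extends HoodieCatalystPlansUtils {
  def failUnresolvedQuery(message: String): Nothing =
    throw new RuntimeException(message)
}

// Spark 4.0: only errorClass-based constructors remain, so the message
// must travel as a parameter of an error class (mocked as a prefix here).
object Spark40PlanUtils extends HoodieCatalystPlansUtils {
  def failUnresolvedQuery(message: String): Nothing =
    throw new RuntimeException(s"[UNRESOLVED] $message")
}
```

Callers then delegate through the interface (as HoodieAnalysis does via `sparkAdapter.getCatalystPlanUtils` per the commit text), so only the per-version modules touch the exception constructors.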
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Feb 12, 2026
nada-attia and others added 2 commits February 12, 2026 13:40
Update test assertions to match actual error handling behavior:
- ON condition/UPDATE clause errors: Caught by Spark's standard analysis,
  expect [UNRESOLVED_COLUMN.WITH_SUGGESTION] error with column suggestions
- InsertInto unresolved table: Goes through Hudi's enhanced error handling,
  expect "Failed to resolve query" with detailed guidance

Each assertion is now specific and self-documenting with individual
failure messages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…esolvedException

Remove test cases that result in Spark's AnalysisException with
UNRESOLVED_COLUMN error class, as these are not handled by Hudi's
UnresolvedException error handling in ProducesHudiMetaFields.unapply.

Removed tests:
- MergeInto with unresolved column in ON condition
- MergeInto with typo in column name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Feb 13, 2026
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build


Labels

size:M PR with lines of changes in (100, 300]


Development

Successfully merging this pull request may close these issues.

Spark SQL queries with unresolved columns show cryptic error messages

3 participants
