
[SPARK-43867][SQL] Improve suggested candidates for unresolved attribute#41368

Closed
MaxGekk wants to merge 10 commits into apache:master from MaxGekk:fix-suggested-column-list

Conversation

@MaxGekk (Member) commented May 29, 2023

What changes were proposed in this pull request?

In the PR, I propose to change the approach of stripping the common part of candidate qualifiers in StringUtils.orderSuggestedIdentifiersBySimilarity:

  1. If all candidates have the same qualifier including namespace and table name, drop it. It should be dropped if the base string (unresolved attribute) doesn't include a namespace and table name. For example:
    • [ns1.table1.col1, ns1.table1.col2] -> [col1, col2] for unresolved attribute col0
    • [ns1.table1.col1, ns1.table1.col2] -> [table1.col1, table1.col2] for unresolved attribute table1.col0
  2. If all candidates belong to the same namespace, just drop it. It should be dropped for any non-fully qualified unresolved attribute. For example:
    • [ns1.table1.col1, ns1.table2.col2] -> [table1.col1, table2.col2] for unresolved attribute col0 or table0.col0
    • [ns1.table1.col1, ns1.table1.col2] -> [ns1.table1.col1, ns1.table1.col2] for unresolved attribute ns0.table0.col0
  3. Otherwise, take the suggested candidates as is.
  4. Sort the candidate list by Levenshtein distance to the unresolved attribute.
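The four steps above can be sketched as follows. This is an illustrative Python sketch only: the actual change lives in Scala's StringUtils.orderSuggestedIdentifiersBySimilarity, and the names order_suggestions and levenshtein below are hypothetical, not part of the PR.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def order_suggestions(base, candidates):
    # base: name parts of the unresolved attribute, e.g. ["table1", "col0"]
    # candidates: name parts of each candidate, e.g. [["ns1", "table1", "col1"], ...]
    # Step 1: all candidates share the full qualifier and the base is unqualified.
    if len({tuple(c[:-1]) for c in candidates}) == 1 and len(base) == 1:
        stripped = [c[-1:] for c in candidates]
    # Step 2: all candidates share the namespace and the base is not fully qualified.
    elif len({tuple(c[:-2]) for c in candidates}) == 1 and len(base) < 3:
        stripped = [c[-2:] for c in candidates]
    # Step 3: otherwise keep the candidates as is.
    else:
        stripped = candidates
    # Step 4: order by Levenshtein distance to the unresolved attribute.
    return sorted((".".join(c) for c in stripped),
                  key=lambda n: levenshtein(".".join(base), n))
```

For the first example above, order_suggestions(["col0"], [["ns1", "table1", "col1"], ["ns1", "table1", "col2"]]) yields ["col1", "col2"].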

Why are the changes needed?

This should improve user experience with Spark SQL by simplifying the error message about an unresolved attribute.

Does this PR introduce any user-facing change?

Yes, it changes the error message.

How was this patch tested?

By running the existing test suites:

$ build/sbt "test:testOnly *AnalysisErrorSuite"
$ build/sbt "test:testOnly *QueryCompilationErrorsSuite"
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *DatasetUnpivotSuite"
$ build/sbt "test:testOnly *DatasetSuite"

@github-actions github-actions bot added the SQL label May 29, 2023
Comment thread sql/core/src/test/resources/sql-tests/results/identifier-clause.sql.out Outdated
@MaxGekk MaxGekk changed the title [WIP][SQL] Improve suggested candidates for unresolved attribute [WIP][SPARK-43867][SQL] Improve suggested candidates for unresolved attribute May 29, 2023
@MaxGekk MaxGekk changed the title [WIP][SPARK-43867][SQL] Improve suggested candidates for unresolved attribute [SPARK-43867][SQL] Improve suggested candidates for unresolved attribute May 29, 2023
@MaxGekk MaxGekk marked this pull request as ready for review May 29, 2023 20:34
@MaxGekk MaxGekk requested a review from cloud-fan May 29, 2023 20:34
@MaxGekk (Member, Author) commented May 29, 2023

@cloud-fan @vitaliili-db @bersprockets @srielau Could you review this PR, please.

@bersprockets (Contributor) commented:

I don't know all the details about prefixes (how many parts there can be, etc.), but given that, this looks fine to me.

Related to this area, but not caused by this PR, is the oddity of seeing auto-generated prefixes, which I assume are internal to Spark, in the list of suggestions:

with v1 as (
 select * from values (1, 2) as (c1, c2)
),
v2 as (
  select * from values (2, 3) as (c1, c2)
)
select v1.b
from (
  select coalesce(v1.c1, v2.c1) as c1, v1.c1 as v1_c1, v1.c2 as v1_c2, v2.c1 as v2_c1, v2.c2 as v2_c2
  from v1
  full outer join v2
  on v1.c1 = v2.c1
);
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `v1`.`b` cannot be resolved. Did you mean one of the following?
[`__auto_generated_subquery_name`.`c1`, `__auto_generated_subquery_name`.`v1_c1`, `__auto_generated_subquery_name`.`v1_c2`, `__auto_generated_subquery_name`.`v2_c1`, `__auto_generated_subquery_name`.`v2_c2`].; line 7 pos 7;

Someone might follow the suggestion and use __auto_generated_subquery_name.c1 in their query, only to have a later update to Spark change the internal name. Not sure if that's in scope here.

@@ -83,35 +83,27 @@ object StringUtils extends Logging {
private[spark] def orderSuggestedIdentifiersBySimilarity(
baseString: String,
@cloud-fan (Contributor) commented May 30, 2023

can we let the caller pass the column name as Seq[String]?


@MaxGekk (Member, Author) replied:


Unfortunately not. In some cases, the caller has to deal with attribute sub-classes where the qualifier is defined as:

  override def qualifier: Seq[String] = throw new UnresolvedException("qualifier")

and there is no method that returns all the name parts.


@MaxGekk (Member, Author) replied:


I would do such refactoring outside of this PR since it is not related to the algorithm we focus on here.


@MaxGekk (Member, Author) replied:


Also, the caller gets struct fields (not attributes).

@@ -83,35 +83,27 @@ object StringUtils extends Logging {
private[spark] def orderSuggestedIdentifiersBySimilarity(
baseString: String,
testStrings: Seq[String]): Seq[String] = {

Contributor commented:

ditto, it should be Seq[Seq[String]]


@MaxGekk (Member, Author) replied:


done
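
Passing the name parts as a Seq[Seq[String]] rather than pre-joined strings keeps multi-part names unambiguous, since an identifier part may itself contain a dot. A minimal illustration of the ambiguity (Python for brevity; the helper names are hypothetical):

```python
# Naively joining name parts loses information: an identifier part may
# itself contain a dot, so different part lists collapse to one string.
def naive_join(parts):
    return ".".join(parts)

assert naive_join(["a.b", "c"]) == naive_join(["a", "b", "c"])  # "a.b.c" either way

# Quoting each part (as Spark error messages do with backticks)
# keeps the original parts recoverable.
def quoted(parts):
    return ".".join("`%s`" % p for p in parts)

assert quoted(["a.b", "c"]) != quoted(["a", "b", "c"])
```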

Comment thread sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala Outdated
@MaxGekk (Member, Author) commented May 30, 2023

Someone might follow the suggestion and use __auto_generated_subquery_name.c1 in their query, only to have a later update to Spark change the internal name. Not sure if that's in scope here.

I do believe it is out of scope of this PR. Let's strip the prefix separately.

@MaxGekk (Member, Author) commented May 31, 2023

Merging to master. Thank you, @cloud-fan @bersprockets for review.

@MaxGekk MaxGekk closed this in a889342 May 31, 2023
@MaxGekk (Member, Author) commented Jun 1, 2023

Someone might follow the suggestion and use __auto_generated_subquery_name.c1 in their query, only to have a later update to Spark change the internal name. Not sure if that's in scope here.

Here is the PR #41411

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023

Closes apache#41368 from MaxGekk/fix-suggested-column-list.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
3 participants