
[SPARK-43867][SQL] Improve suggested candidates for unresolved attribute#41368

Closed
MaxGekk wants to merge 10 commits into apache:master from MaxGekk:fix-suggested-column-list

Conversation

@MaxGekk (Member) commented May 29, 2023

What changes were proposed in this pull request?

In the PR, I propose to change the approach of stripping the common part of candidate qualifiers in StringUtils.orderSuggestedIdentifiersBySimilarity:

  1. If all candidates have the same qualifier including namespace and table name, drop it. It should be dropped if the base string (unresolved attribute) doesn't include a namespace and table name. For example:
    • [ns1.table1.col1, ns1.table1.col2] -> [col1, col2] for unresolved attribute col0
    • [ns1.table1.col1, ns1.table1.col2] -> [table1.col1, table1.col2] for unresolved attribute table1.col0
  2. If all candidates belong to the same namespace, just drop it. It should be dropped for any non-fully qualified unresolved attribute. For example:
    • [ns1.table1.col1, ns1.table2.col2] -> [table1.col1, table2.col2] for unresolved attribute col0 or table0.col0
    • [ns1.table1.col1, ns1.table1.col2] -> [ns1.table1.col1, ns1.table1.col2] for unresolved attribute ns0.table0.col0
  3. Otherwise, take the suggested candidates as is.
  4. Sort the candidate list by Levenshtein distance to the unresolved attribute.
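The four steps above can be sketched as follows. This is an illustrative Python sketch only: the actual change lives in Scala's StringUtils.orderSuggestedIdentifiersBySimilarity, and the names order_suggestions and levenshtein below are hypothetical, not part of the PR.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def order_suggestions(base, candidates):
    # base: name parts of the unresolved attribute, e.g. ["table1", "col0"]
    # candidates: name parts of each candidate, e.g. [["ns1", "table1", "col1"], ...]
    # Step 1: all candidates share the full qualifier and the base is unqualified.
    if len({tuple(c[:-1]) for c in candidates}) == 1 and len(base) == 1:
        stripped = [c[-1:] for c in candidates]
    # Step 2: all candidates share the namespace and the base is not fully qualified.
    elif len({tuple(c[:-2]) for c in candidates}) == 1 and len(base) < 3:
        stripped = [c[-2:] for c in candidates]
    # Step 3: otherwise keep the candidates as is.
    else:
        stripped = candidates
    # Step 4: order by Levenshtein distance to the unresolved attribute.
    return sorted((".".join(c) for c in stripped),
                  key=lambda n: levenshtein(".".join(base), n))
```

For the first example above, order_suggestions(["col0"], [["ns1", "table1", "col1"], ["ns1", "table1", "col2"]]) yields ["col1", "col2"].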

Why are the changes needed?

This should improve user experience with Spark SQL by simplifying the error message about an unresolved attribute.

Does this PR introduce any user-facing change?

Yes, it changes the error message.

How was this patch tested?

By running the existing test suites:

$ build/sbt "test:testOnly *AnalysisErrorSuite"
$ build/sbt "test:testOnly *QueryCompilationErrorsSuite"
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *DatasetUnpivotSuite"
$ build/sbt "test:testOnly *DatasetSuite"

@github-actions github-actions bot added the SQL label May 29, 2023
Comment thread sql/core/src/test/resources/sql-tests/results/identifier-clause.sql.out Outdated
@MaxGekk MaxGekk changed the title [WIP][SQL] Improve suggested candidates for unresolved attribute [WIP][SPARK-43867][SQL] Improve suggested candidates for unresolved attribute May 29, 2023
@MaxGekk MaxGekk changed the title [WIP][SPARK-43867][SQL] Improve suggested candidates for unresolved attribute [SPARK-43867][SQL] Improve suggested candidates for unresolved attribute May 29, 2023
@MaxGekk MaxGekk marked this pull request as ready for review May 29, 2023 20:34
@MaxGekk MaxGekk requested a review from cloud-fan May 29, 2023 20:34
@MaxGekk (Member, Author) commented May 29, 2023

@cloud-fan @vitaliili-db @bersprockets @srielau Could you review this PR, please.

@bersprockets (Contributor) commented:

I don't know all the details about prefixes (how many parts there can be, etc.), but given that, this looks fine to me.

Related to this area, but not caused by this PR, is the oddity of seeing auto-generated prefixes, which I assume are internal to Spark, in the list of suggestions:

with v1 as (
 select * from values (1, 2) as (c1, c2)
),
v2 as (
  select * from values (2, 3) as (c1, c2)
)
select v1.b
from (
  select coalesce(v1.c1, v2.c1) as c1, v1.c1 as v1_c1, v1.c2 as v1_c2, v2.c1 as v2_c1, v2.c2 as v2_c2
  from v1
  full outer join v2
  on v1.c1 = v2.c1
);
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `v1`.`b` cannot be resolved. Did you mean one of the following?
[`__auto_generated_subquery_name`.`c1`, `__auto_generated_subquery_name`.`v1_c1`, `__auto_generated_subquery_name`.`v1_c2`, `__auto_generated_subquery_name`.`v2_c1`, `__auto_generated_subquery_name`.`v2_c2`].; line 7 pos 7;

Someone might follow the suggestion and use __auto_generated_subquery_name.c1 in their query, only to have a later update to Spark change the internal name. Not sure if that's in scope here.

@@ -83,35 +83,27 @@ object StringUtils extends Logging {
private[spark] def orderSuggestedIdentifiersBySimilarity(
baseString: String,
@cloud-fan (Contributor) commented May 30, 2023

can we let the caller pass the column name as Seq[String]?


@MaxGekk (Member, Author) replied:


Unfortunately not. In some cases, the caller has to deal with attribute sub-classes where the qualifier is defined as:

  override def qualifier: Seq[String] = throw new UnresolvedException("qualifier")

and there is no method that returns all the name parts.


@MaxGekk (Member, Author) replied:


I would do such refactoring outside of this PR since it is not related to the algorithm we focus on here.


@MaxGekk (Member, Author) replied:


Also, the caller gets struct fields (not attributes).

@@ -83,35 +83,27 @@ object StringUtils extends Logging {
private[spark] def orderSuggestedIdentifiersBySimilarity(
baseString: String,
testStrings: Seq[String]): Seq[String] = {

Contributor commented:

ditto, it should be Seq[Seq[String]]


@MaxGekk (Member, Author) replied:


done
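
Passing the name parts as a Seq[Seq[String]] rather than pre-joined strings keeps multi-part names unambiguous, since an identifier part may itself contain a dot. A minimal illustration of the ambiguity (Python for brevity; the helper names are hypothetical):

```python
# Naively joining name parts loses information: an identifier part may
# itself contain a dot, so different part lists collapse to one string.
def naive_join(parts):
    return ".".join(parts)

assert naive_join(["a.b", "c"]) == naive_join(["a", "b", "c"])  # "a.b.c" either way

# Quoting each part (as Spark error messages do with backticks)
# keeps the original parts recoverable.
def quoted(parts):
    return ".".join("`%s`" % p for p in parts)

assert quoted(["a.b", "c"]) != quoted(["a", "b", "c"])
```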

Comment thread sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala Outdated
@MaxGekk (Member, Author) commented May 30, 2023

Someone might follow the suggestion and use __auto_generated_subquery_name.c1 in their query, only to have a later update to Spark change the internal name. Not sure if that's in scope here.

I do believe it is out of scope of this PR. Let's strip the prefix separately.

@MaxGekk (Member, Author) commented May 31, 2023

Merging to master. Thank you, @cloud-fan @bersprockets for review.

@MaxGekk MaxGekk closed this in a889342 May 31, 2023
@MaxGekk (Member, Author) commented Jun 1, 2023

Someone might follow the suggestion and use __auto_generated_subquery_name.c1 in their query, only to have a later update to Spark change the internal name. Not sure if that's in scope here.

Here is the PR #41411

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023

Closes apache#41368 from MaxGekk/fix-suggested-column-list.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
3 participants