[SPARK-43867][SQL] Improve suggested candidates for unresolved attribute #41368
MaxGekk wants to merge 10 commits into apache:master from MaxGekk/fix-suggested-column-list
Conversation
@cloud-fan @vitaliili-db @bersprockets @srielau Could you review this PR, please?
I don't know all the details about prefixes (how many parts there can be, etc.), but given that, this looks fine to me. Related to this area, but not caused by this PR, is the oddity of seeing auto-generated prefixes, which I assume are internal to Spark, in the list of suggestions: someone might follow the suggestion and use
@@ -83,35 +83,27 @@ object StringUtils extends Logging {
  private[spark] def orderSuggestedIdentifiersBySimilarity(
      baseString: String,
Can we let the caller pass the column name as `Seq[String]`?
Unfortunately, not. In some cases, the caller has to deal with attribute sub-classes where the qualifier is defined as:
override def qualifier: Seq[String] = throw new UnresolvedException("qualifier")
and there is no method that returns all name parts.
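The constraint above can be illustrated with a minimal sketch. The type names here (`AttrLike`, `ResolvedAttr`, `UnresolvedAttr`) are hypothetical stand-ins for Spark's `Attribute` hierarchy in `org.apache.spark.sql.catalyst`; the point is only that `qualifier` is not safe to call on every sub-class:

```scala
// Hypothetical minimal types, not Spark's real classes: they only
// illustrate why a caller cannot uniformly build Seq[String] name
// parts from every attribute sub-class.
class UnresolvedException(field: String)
  extends RuntimeException(s"Invalid call to $field on unresolved object")

trait AttrLike {
  def name: String
  def qualifier: Seq[String]
}

// A resolved attribute can report its qualifier parts...
case class ResolvedAttr(name: String, qualifier: Seq[String]) extends AttrLike

// ...but an unresolved one throws on the same call.
case class UnresolvedAttr(name: String) extends AttrLike {
  override def qualifier: Seq[String] = throw new UnresolvedException("qualifier")
}
```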
I would try to do such a refactoring outside of this PR since it is not related to the algorithm we are focusing on.
Also, the caller gets struct fields (not attributes).
@@ -83,35 +83,27 @@ object StringUtils extends Logging {
  private[spark] def orderSuggestedIdentifiersBySimilarity(
      baseString: String,
      testStrings: Seq[String]): Seq[String] = {
Ditto, it should be `Seq[Seq[String]]`.
I do believe it is out of the scope of this PR. Let's strip the prefix separately.
Merging to master. Thank you, @cloud-fan @bersprockets, for the review.
Here is the PR #41411 |
### What changes were proposed in this pull request?
In the PR, I propose to change the approach of stripping the common part of candidate qualifiers in `StringUtils.orderSuggestedIdentifiersBySimilarity`:
1. If all candidates have the same qualifier, including namespace and table name, drop it. It should be dropped when the base string (the unresolved attribute) doesn't include a namespace or table name. For example:
- `[ns1.table1.col1, ns1.table1.col2] -> [col1, col2]` for unresolved attribute `col0`
- `[ns1.table1.col1, ns1.table1.col2] -> [table1.col1, table1.col2]` for unresolved attribute `table1.col0`
2. If all candidates belong to the same namespace, just drop it. It should be dropped for any non-fully qualified unresolved attribute. For example:
- `[ns1.table1.col1, ns1.table2.col2] -> [table1.col1, table2.col2]` for unresolved attribute `col0` or `table0.col0`
- `[ns1.table1.col1, ns1.table1.col2] -> [ns1.table1.col1, ns1.table1.col2]` for unresolved attribute `ns0.table0.col0`
3. Otherwise, take the suggested candidates as is.
4. Sort the candidate list using the Levenshtein distance.
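The steps above can be sketched roughly as follows. This is a simplified illustration, not Spark's actual `StringUtils.orderSuggestedIdentifiersBySimilarity`: the `levenshtein` helper is a plain dynamic-programming implementation, candidates are modeled as name parts (`Seq("ns1", "table1", "col1")`), and the number of trailing parts to keep is derived from how qualified the unresolved attribute is:

```scala
object SuggestionSketch {
  // Plain dynamic-programming Levenshtein edit distance (illustration only).
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
        d(i - 1)(j - 1) + cost)
    }
    d(a.length)(b.length)
  }

  // baseParts: the unresolved attribute's name parts, e.g. Seq("table1", "col0").
  // candidates: fully qualified candidate columns as name parts.
  def suggest(baseParts: Seq[String], candidates: Seq[Seq[String]]): Seq[String] = {
    val keep =
      if (candidates.map(_.dropRight(1)).distinct.size <= 1) {
        // Rule 1: all candidates share namespace and table. Keep as many
        // trailing parts as the unresolved attribute has, so `col0` sees
        // bare columns and `ns0.table0.col0` still sees full names.
        math.max(baseParts.size, 1)
      } else if (candidates.map(_.dropRight(2)).distinct.size <= 1) {
        // Rule 2: same namespace only. Drop it unless the attribute
        // is fully qualified.
        math.max(baseParts.size, 2)
      } else {
        Int.MaxValue // Rule 3: keep the candidates as is.
      }
    // Rule 4: order the stripped candidates by edit distance to the base.
    val base = baseParts.mkString(".")
    candidates.map(_.takeRight(keep).mkString(".")).sortBy(levenshtein(base, _))
  }
}
```

For instance, `suggest(Seq("col0"), Seq(Seq("ns1", "table1", "col1"), Seq("ns1", "table1", "col2")))` yields the bare column names, matching the first example above.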
### Why are the changes needed?
This should improve user experience with Spark SQL by simplifying the error message about an unresolved attribute.
### Does this PR introduce _any_ user-facing change?
Yes, it changes the error message.
### How was this patch tested?
By running the existing test suites:
```
$ build/sbt "test:testOnly *AnalysisErrorSuite"
$ build/sbt "test:testOnly *QueryCompilationErrorsSuite"
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *DatasetUnpivotSuite"
$ build/sbt "test:testOnly *DatasetSuite"
```
Closes apache#41368 from MaxGekk/fix-suggested-column-list.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>