[SPARK-48670][SQL] Providing suggestion as part of error message when invalid collation name is given#47040
[SPARK-48670][SQL] Providing suggestion as part of error message when invalid collation name is given#47040dbatomic wants to merge 8 commits intoapache:masterfrom
Conversation
nikolamand-db
left a comment
There was a problem hiding this comment.
LGTM, added few minor suggestions.
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala
Show resolved
Hide resolved
|
nit: Could you change title in the ticket to follow this one, so that JIRA picks PR up. |
| } | ||
|
|
||
| // Split modifiers and locale name. | ||
| final int MODIFIER_LENGTH = 3; |
There was a problem hiding this comment.
I wonder if we have this somewhere (else) in CollationFactory already? If not, perhaps that (outside of getClosestSuggestionOnInvalidName) would be the place to put it
There was a problem hiding this comment.
for example, the collationSpecs
There was a problem hiding this comment.
Let's wait until we have at least one more usage and then we can give it more visibility.
| ("UNICODE_LCASE_X","UNICODE"), | ||
| ("UTF8_UNICODE","UTF8_LCASE"), | ||
| ("UTF8_BINARY_UNICODE","UTF8_BINARY"), | ||
| ("CI_UNICODE", "UNICODE"), |
There was a problem hiding this comment.
for possible future improvement: could we maybe modify the proposal choice so that we get "UNICODE_CI" here?
for example, by adding some sort of second criteria (apart from Levenshtein)
There was a problem hiding this comment.
Yeah, that would be possible with more fine tuning. My proposal is:
- Go with current suggestions.
- Based on telemetry, iterate when we see what are the most common customer mistakes.
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Show resolved
Hide resolved
|
|
||
| // Find the closest locale name. | ||
| final String finalLocaleName = localeName; | ||
| String closestLocale = Collections.min(List.of(validRootNames), Comparator.comparingInt( |
There was a problem hiding this comment.
shall we return more than one proposals?
There was a problem hiding this comment.
FYI, for column not exist error, we provide up to 5 proposals. It may be too much for string collations, but sometimes the distances are very close and it makes more sense to return more than one proposals.
There was a problem hiding this comment.
I initially went with 3 proposals but result looked weird - locale names are sometimes pretty distant so suggestions were very far off.
In the latest iteration I propose following:
- Take top 3.
- Always return the closest one.
- For the other 2, check if their distance is < levenshtein distance threshold (number of characters needed to be changed in order to reach the correct name). If we are below the limit we include other suggestions as well.
Btw, soon we will include some way to list all collations (e.g. SHOW COLLATIONS) and extend error message to point to a way how to see all collations available on the system.
|
thanks, merging to master! |
What changes were proposed in this pull request?
This PR improves error reporting in collation space. Currently, when invalid collation name is provided, caller just gets information that collation name can't be accepted. This PR will also return a suggestion on valid collation name that is similar to invalid one that was provided.
We propose following rules on generating the suggestion:
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?
Existing tests for invalid collation names are extended to also cover suggestion checks.
Was this patch authored or co-authored using generative AI tooling?