[SPARK-48670][SQL] Providing suggestion as part of error message when invalid collation name is given #47040

Closed
dbatomic wants to merge 8 commits into apache:master from dbatomic:collation_suggestion_on_error

@dbatomic
Contributor

What changes were proposed in this pull request?

This PR improves error reporting for collations. Currently, when an invalid collation name is provided, the caller is only told that the collation name is not accepted. With this PR, the error also suggests a valid collation name that is similar to the invalid one that was provided.

We propose the following rules for generating the suggestion:

  1. Find the closest valid locale, as measured by Levenshtein distance.
  2. Remove duplicate modifiers (e.g. CS_AI_AI_CS becomes CS_AI).
  3. Remove invalid combinations (e.g. CS_CI becomes CS).
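
The three rules above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class name, the `suggest` helper, and the short `VALID_ROOTS` list are hypothetical stand-ins, and only the CS/CI conflict named in the description is handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;

class CollationSuggestionSketch {
  // Hypothetical stand-ins for the valid locale names and modifiers.
  static final List<String> VALID_ROOTS = List.of("UNICODE", "EN", "DE", "SR");
  static final List<String> VALID_MODIFIERS = List.of("_CS", "_CI", "_AS", "_AI");
  static final int MODIFIER_LENGTH = 3;

  // Classic dynamic-programming Levenshtein distance (two rolling rows).
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      int[] curr = new int[b.length() + 1];
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      prev = curr;
    }
    return prev[b.length()];
  }

  // True if the name still ends with a known modifier and has a non-empty prefix.
  static boolean endsWithModifier(String name) {
    return name.length() > MODIFIER_LENGTH
        && VALID_MODIFIERS.stream().anyMatch(name::endsWith);
  }

  static String suggest(String invalidName) {
    // Split trailing modifiers from the locale name.
    String localeName = invalidName.toUpperCase(Locale.ROOT);
    List<String> modifiers = new ArrayList<>();
    while (endsWithModifier(localeName)) {
      modifiers.add(localeName.substring(localeName.length() - MODIFIER_LENGTH));
      localeName = localeName.substring(0, localeName.length() - MODIFIER_LENGTH);
    }
    Collections.reverse(modifiers); // restore left-to-right order

    // Rule 2: drop duplicate modifiers, keeping the first occurrence of each.
    List<String> unique = new ArrayList<>(new LinkedHashSet<>(modifiers));

    // Rule 3: drop the conflicting combination (CS_CI becomes CS).
    if (unique.contains("_CS") && unique.contains("_CI")) {
      unique.remove("_CI");
    }

    // Rule 1: pick the valid locale with the smallest Levenshtein distance.
    final String target = localeName;
    String closest = Collections.min(VALID_ROOTS,
        Comparator.comparingInt((String root) -> levenshtein(root, target)));
    return closest + String.join("", unique);
  }

  public static void main(String[] args) {
    System.out.println(suggest("UNICODE_CS_AI_AI_CS")); // UNICODE_CS_AI
    System.out.println(suggest("UNICODE_CS_CI"));       // UNICODE_CS
    System.out.println(suggest("UNICOD_CI"));           // UNICODE_CI
  }
}
```

Modifiers are only recognized as suffixes in this sketch, so a misplaced prefix such as `CI_UNICODE` is treated as part of the locale name and resolves to plain `UNICODE`.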

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Existing tests for invalid collation names are extended to also cover suggestion checks.

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added the SQL label Jun 20, 2024
Contributor

@nikolamand-db nikolamand-db left a comment

LGTM, added a few minor suggestions.

@mihailomilosevic2001
Contributor

nit: Could you change the title of the ticket to match this one, so that JIRA picks the PR up?

}

// Split modifiers and locale name.
final int MODIFIER_LENGTH = 3;
Contributor

@uros-db uros-db Jun 24, 2024

I wonder if we have this somewhere (else) in CollationFactory already? If not, perhaps that (outside of getClosestSuggestionOnInvalidName) would be the place to put it.

Contributor

for example, the collationSpecs

Contributor Author

Let's wait until we have at least one more usage and then we can give it more visibility.

("UNICODE_LCASE_X","UNICODE"),
("UTF8_UNICODE","UTF8_LCASE"),
("UTF8_BINARY_UNICODE","UTF8_BINARY"),
("CI_UNICODE", "UNICODE"),
Contributor

For a possible future improvement: could we modify how the suggestion is chosen so that we get "UNICODE_CI" here, for example by adding a second criterion (apart from Levenshtein distance)?

Contributor Author

Yeah, that would be possible with more fine-tuning. My proposal is:

  1. Go with the current suggestions.
  2. Iterate based on telemetry, once we see what the most common customer mistakes are.

Contributor

@uros-db uros-db left a comment

lgtm


// Find the closest locale name.
final String finalLocaleName = localeName;
String closestLocale = Collections.min(List.of(validRootNames), Comparator.comparingInt(
Contributor

Shall we return more than one proposal?

Contributor

FYI, for the column-does-not-exist error, we provide up to 5 proposals. That may be too many for string collations, but sometimes the distances are very close and it makes more sense to return more than one proposal.

Contributor Author

I initially went with 3 proposals, but the result looked weird: locale names are sometimes pretty distant from each other, so the suggestions were very far off.

In the latest iteration I propose the following:

  1. Take the top 3.
  2. Always return the closest one.
  3. For the other 2, check whether their distance is below a Levenshtein distance threshold (the number of characters that need to be changed to reach the correct name). If they are below the limit, we include them as suggestions as well.

Btw, soon we will add a way to list all collations (e.g. SHOW COLLATIONS) and extend the error message to point to how to see all collations available on the system.
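
For illustration, the top-3 selection described in this comment could look roughly like the sketch below. It is not the merged code: `topSuggestions`, `maxSuggestions`, and `threshold` are hypothetical names, the threshold value is an assumption, and the list of valid names is made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopSuggestionsSketch {
  // Classic dynamic-programming Levenshtein distance (two rolling rows).
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      int[] curr = new int[b.length() + 1];
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      prev = curr;
    }
    return prev[b.length()];
  }

  // Returns up to maxSuggestions names: the closest one always, and the
  // remaining candidates only if their distance is strictly below threshold.
  static List<String> topSuggestions(
      String invalid, List<String> validNames, int maxSuggestions, int threshold) {
    List<String> sorted = new ArrayList<>(validNames);
    // List.sort is stable, so ties keep their original relative order.
    sorted.sort(Comparator.comparingInt((String name) -> levenshtein(name, invalid)));

    List<String> result = new ArrayList<>();
    int limit = Math.min(maxSuggestions, sorted.size());
    for (int i = 0; i < limit; i++) {
      String candidate = sorted.get(i);
      if (i == 0 || levenshtein(candidate, invalid) < threshold) {
        result.add(candidate);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> names = List.of("UNICODE", "EN", "EN_USA", "DE");
    System.out.println(topSuggestions("EN_US", names, 3, 4)); // [EN_USA, EN]
    System.out.println(topSuggestions("EN_US", names, 3, 3)); // [EN_USA]
  }
}
```

With a tight threshold only the closest name survives, which matches the concern above that far-off locale names make poor suggestions.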

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 7a1608b Jun 26, 2024