Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix string comparison to avoid using the collator. #87

Merged
merged 3 commits into from
Dec 10, 2023

Conversation

nolaviz
Copy link
Contributor

@nolaviz nolaviz commented Jan 13, 2023

Benchmarking at Google has shown an up-to-1000x performance degradation from using the collator.

shown an up-to-1000x performance degradation from using the collator.
@ddekany
Copy link
Contributor

ddekany commented Jan 13, 2023

There's a complication here. The Collator does UNICODE normalization, while String.equals does not. Like if you put together "á" from "a" and the combining accent character, they won't be equal anymore with this patch, while they look exactly the same for the user. On the other hand, we don't want to know which string comes first, so maybe just doing UNICODE normalization and then calling equals will solve this.

@ddekany
Copy link
Contributor

ddekany commented Jan 13, 2023

Also, when it comes to backward compatibility, I don't know right now what other magic Collators do... Like if different kind of dashes are equal and such. So probably this will need a configuration.incompatibleImprovements >= 2.3.33 condition. Not sure yet, had to research more...

@nolaviz
Copy link
Contributor Author

nolaviz commented Jan 13, 2023

The collator used here (Environment.getCollator()) is always initialized to Collator.getInstance(getLocale()); OpenJDK's CollatorProviderImpl disables decomposition on the collators it returns (https://github.com/openjdk/jdk/blob/be8e6d05db2f623626506e64a2fb7caf755d5d06/src/java.base/share/classes/sun/util/locale/provider/CollatorProviderImpl.java#L122). This means - for the equality case - we can expect the collator to always fall back to String.compareTo (https://github.com/openjdk/jdk/blob/be8e6d05db2f623626506e64a2fb7caf755d5d06/src/java.base/share/classes/java/text/RuleBasedCollator.java#L557).

@ddekany
Copy link
Contributor

ddekany commented Jan 13, 2023

Quickly checked on Windows 10, Oracle Java 8... it does (de)normalization apparently:

        String s1 = "a\u0301";
        String s2 = "á";
        System.out.println(s1.equals(s2)); // false => not equal
        System.out.println(Collator.getInstance(Locale.US).compare(s1, s2)); // 0 => equal

If this either worked or not based on if it's OpenJDK, or Oracle, or who knows based on what other factor, that's another issue to fix in itself. If (de)normalization is not terribly slow, then probably I would chose doing that.

@nolaviz
Copy link
Contributor Author

nolaviz commented Jan 13, 2023

Just to note - a very simplistic benchmark (https://gist.github.com/nolaviz/94455756da704892c4dca377062ab82a) shows the following results:

String.compareTo: PT0.804698705S
Normalization + String.compareTo: PT1.274591945S
Collator.compare: PT3M59.458385549S

i.e., Collator.compare is ~295x slower than String.compareTo in this mini-benchmark, while normalization + compareTo is just ~1.6x slower.

This should be more compatible with the previous behavior
(Collator-based).
@ddekany
Copy link
Contributor

ddekany commented Jan 17, 2023

The last thing we have to figure out if we can always do this (normalize then equals), or it has to be activated by cfg.incompatibleImprovements >= 2.3.33. That depends on whether some collators do fancier normalization than this.

@ddekany
Copy link
Contributor

ddekany commented Jan 26, 2023

I see you made this activated by incompatibleImprovements recently. I guess to be on the safe side, that's how it will be, but, have you found some concrete backward compatibility issues?

@nolaviz
Copy link
Contributor Author

nolaviz commented Jan 26, 2023

I haven't found any concrete backward compatibility issues. But - better safe than sorry.

@ddekany ddekany merged commit dbacf3b into apache:2.3-gae Dec 10, 2023
asfgit pushed a commit that referenced this pull request Dec 10, 2023
asfgit pushed a commit that referenced this pull request Dec 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants