Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collation not respected properly by LIKE and REGEXP_LIKE #4176

Closed
knutwannheden opened this issue Aug 23, 2022 · 11 comments
Closed

Collation not respected properly by LIKE and REGEXP_LIKE #4176

knutwannheden opened this issue Aug 23, 2022 · 11 comments
Assignees
Labels
bug Something isn't working

Comments

@knutwannheden
Copy link

When executing the following script:

create table t(n varchar(100) collate utf8mb4_unicode_ci);

insert into t(n) values ('schön'), ('schon');

select *
from t
where n like '%o%';

the result from the last query returns one row only, whereas two rows would be expected. For reference see: https://www.db-fiddle.com/f/doWaAhgwu5qhG2dYo9A34A/0

@timsehn
Copy link
Sponsor Contributor

timsehn commented Aug 23, 2022

We're working through charset/collation tests/bugs over the next couple days. @Hydrocharged will look into this with that.

@timsehn timsehn added the bug Something isn't working label Aug 23, 2022
@timsehn
Copy link
Sponsor Contributor

timsehn commented Aug 23, 2022

@Hydrocharged tells me regexes are not supported with collations. We'll update the docs.

@knutwannheden
Copy link
Author

Including LIKE expressions?

@Hydrocharged
Copy link
Contributor

Hydrocharged commented Aug 24, 2022

Indeed, all LIKE expressions use regexes to determine matches. Our regex logic is not collation-aware and treats all strings as the default collation (utf8mb4_0900_bin), which will especially return incorrect results for case-insensitive collations. We're researching a solution, but it will take quite some time.

@timsehn
Copy link
Sponsor Contributor

timsehn commented Aug 24, 2022

Maybe we should error (or warn) when a user uses a regex with a non-default utf8 collation or another charset.

@knutwannheden
Copy link
Author

knutwannheden commented Aug 24, 2022

A LIKE expression may not need a regular expression to be implemented, but regardless, I would like to point out, that if the query in the example is changed to use a regular equality predicate (e.g. n = 'schön' or n = 'schon'), it should also return two rows, so this behavior is not specific to LIKE expressions in MySQL. See https://www.db-fiddle.com/f/omaXi9JrUj2mFbQoDKBpLX/0

@Hydrocharged
Copy link
Contributor

Hydrocharged commented Aug 24, 2022

@timsehn I'll add a warning

@knutwannheden You are correct that LIKE expressions are not exactly regular expressions, but are a form of pattern matching. REGEXP_LIKE implements full regular expression pattern matching. At the moment, we do not support any form of pattern matching over collations. I apologize for the lack of proper terminology on my part. Also, the regular equality check should work just fine. Here is the output from my machine:

$ dolt sql -q "create table t(n varchar(100) collate utf8mb4_unicode_ci);"

$ dolt sql -q "insert into t(n) values ('schön'), ('schon');"
Query OK, 2 rows affected

$ dolt sql -q "select * from t where n = 'schön'"
+-------+
| n     |
+-------+
| schon |
| schön |
+-------+

$ dolt sql -q "select * from t where n = 'schon'"
+-------+
| n     |
+-------+
| schon |
| schön |
+-------+

Are you receiving different results? Only pattern matching is not supported for collations, all other forms of matching should work. If not, definitely file a bug and we'll get on it!

@knutwannheden
Copy link
Author

Hmm... Not sure what happened there, but I tested it again now and can confirm it works with the equality operator.

I will adjust the issue title to make it clear that it pertains to LIKE.

@knutwannheden knutwannheden changed the title Bug in utf8mb4_unicode_ci collation Collation not respected properly by LIKE and REGEXP_LIKE Aug 24, 2022
@Hydrocharged
Copy link
Contributor

Thank you for the title update!

Is it possible that you used a different connector the first time? Theoretically it shouldn't make a difference, but we have found some very strange bugs when using different connectors.

@knutwannheden
Copy link
Author

Based on the most recent release notes, I get the impression that this issue can be closed. I will try to verify.

@Hydrocharged
Copy link
Contributor

Only LIKE was implemented. I created a new issue for REGEXP_LIKE specifically. We added the original issue as a test case, so I'll go ahead and close this issue. For REGEXP_LIKE, we'll use the other issue.

#4427

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants