Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix case-insensetive search in UTF8 strings #30663

Merged
merged 1 commit into from
Oct 27, 2021

Conversation

tavplubix
Copy link
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Functions for case-insensitive search in UTF8 strings like positionCaseInsensitiveUTF8 and countSubstringsCaseInsensitiveUTF8 might find substrings that actually does not match, it's fixed.

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Oct 25, 2021
@alexey-milovidov
Copy link
Member

Do I understand correctly that we just perform full comparison after found a candidate match, and before this PR it was something less trivial?

PS. Do we have perf test for it?

@alexey-milovidov alexey-milovidov self-assigned this Oct 26, 2021
@tavplubix
Copy link
Member Author

we just perform full comparison after found a candidate match?

Yep

before this PR it was something less trivial

Not precisely, before this PR we just skipped first cache_valid_len bytes of needle assuming that these bytes match (this assumption was incorrect) and then performed full comparison of the rest bytes. The only difference is that now we do not skip needle prefix and compare it too.

PS. Do we have perf test for it?

<!-- Sophisticated substring search. -->
<query>SELECT count() FROM hits_10m_single WHERE positionCaseInsensitiveUTF8(URL, 'новости') != 0 SETTINGS max_threads = 1</query>
<query>SELECT count() FROM hits_100m_single WHERE positionCaseInsensitiveUTF8(URL, 'новости') != 0</query>

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Oct 26, 2021

Alright! I expect there will be no visible performance degradation.

@alexey-milovidov
Copy link
Member

Integration tests were temporarily broken for a moment - ok.

@alexey-milovidov
Copy link
Member

No difference in peft test.

@alexey-milovidov alexey-milovidov merged commit 9f9b496 into master Oct 27, 2021
@alexey-milovidov alexey-milovidov deleted the fix_five_years_old_bug branch October 27, 2021 09:10
alexey-milovidov added a commit that referenced this pull request Oct 27, 2021
Backport #30663 to 21.9: Fix case-insensetive search in UTF8 strings
alexey-milovidov added a commit that referenced this pull request Oct 27, 2021
Backport #30663 to 21.8: Fix case-insensetive search in UTF8 strings
alexey-milovidov added a commit that referenced this pull request Oct 27, 2021
Backport #30663 to 21.10: Fix case-insensetive search in UTF8 strings
alexey-milovidov added a commit that referenced this pull request Oct 27, 2021
Backport #30663 to 21.3: Fix case-insensetive search in UTF8 strings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-bugfix Pull request with bugfix, not backported by default
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants