Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EZP-30481: Character "İ" causes publishing to fail with legacy search #2754

Closed
wants to merge 3 commits into from

Conversation

glye
Copy link
Member

@glye glye commented Aug 28, 2019

Question Answer
JIRA issue EZP-30481
Bug/Improvement yes
New feature no
Target version 7.5 (should be 6.7 for final fix)
BC breaks TBD
Tests pass TBD
Doc needed no

Problem

  • Publishing Turkish text can lead to SQL errors in the legacy search, because of collation and transformation issues with Turkish dotted and dotless i's. Turkish has dotted and dotless i's in both upper and lower case: İ i - I ı. Case changes in PHP, and non-Turkish collation in the database, leads to errors in conversion between them. This in itself is a minor issue here.
  • We cannot expect Turkish to behave 100% correctly in eZ Platform without adding Turkish transformation rules. This is not a bug - we do not guarantee correct transformations of every language in the world (it's an enormous task). Adding this is a feature request.
  • Changing letter case is locale dependent. PHP mbstring functions cannot be trusted to do this correctly in every language.
  • However, eZ Platform should not produce invalid SQL queries, even if a language isn't fully supported in eZ Platform or in PHP.
  • Legacy codebase is probably also affected by this, since the code is ported from there.
  • The original code is performance optimised by working in batches. This means there is no way to determine the relation between the word we try to index, and the word that ends up being stored in the DB. I suspect a complete fix will require us to work on one word at a time, which will hurt performance.

Fix

  • I believe a proper, by-the-book, correct-in-all-cases fix would require eZ Platform to be fully aware of the collation rules in the DB, the locale rules in PHP, and to have correct transformation rules for Turkish. This is out of scope. Avoiding SQL errors is the main goal, and making the transformed words searchable.
  • be9c599 fixes the symptom, not the cause, by inserting nothing rather than a faulty ezsearch_object_word_link entry. This means the corresponding word is not found in searches, and it leaves inconsistencies in the ezsearch_word table, but those faults also existed before the fix.
  • 91abe52 ensures words with transformed chars are searchable. It keeps the efficient batch processing of most words. Only when words end up not being stored as is, because of transformation, will it process those words one by one. It creates references from the original word to the transformed word in the DB, which means indexWords() doesn't crash, and the words are searchable. (This makes the previous commit redundant, but I like it as a failsafe anyway, though it ought to have error logging.)

TODO:

  • Implement feature / fix a bug.
  • Implement tests.
  • Fix new code according to Coding Standards ($ composer fix-cs).
  • Ask for Code Review.

@glye glye marked this pull request as draft April 22, 2021 15:07
@glye glye changed the title [WIP] EZP-30481: Character "İ" causes publishing to fail with legacy search EZP-30481: Character "İ" causes publishing to fail with legacy search Apr 22, 2021
@sonarcloud
Copy link

sonarcloud bot commented Oct 22, 2021

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@adamwojs
Copy link
Member

Closing PR as obsolete. eZ Platform 2.5 has reached EOM, so please reopen PR in ezsystems/ezplatform-kernel (bug fixes) or ibexa/core (features/improvements) if issue is still valid for v3.3 / v4.x and you are willing to work on it.

@adamwojs adamwojs closed this Apr 19, 2022
@adamwojs adamwojs deleted the ezp30481_turkish_i_breaks_legacy_search branch April 19, 2022 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants