New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the Unicode collation for almost equal literals and the external vocabulary #312
Fix the Unicode collation for almost equal literals and the external vocabulary #312
Conversation
…sure total ordering.
- The Operation classes now have a getChildren() method that returns non-owning pointers to all of their children - Each Operation holds a vector of strings to store warnings that are emitted during the result computation. Those can be recursively retrieved using the collectWarnings() function that internally uses the getChildren() function mentioned above
- Out-of-vocab entries in value clauses previously triggered exceptions. - Now rows that contain unknown words are ignored and trigger a warning in the result json. - This is also tested in the end-to-end test
…abulary, because everything else leads to incorrect behavior.
The code is still somewhat ugly, but already much less uglier.
The external vocabulary was so far created using the ICU collation (or a wrong version of it bc. of the "Externalization Prefix". This should be now fixed, but we still should refactor the Vocabulary to use proper strict and static typin
This now only fixes the unicode stuff.
@joka921 Thanks, Johannes, having a well-defined total order makes a lot of sense! Can you say something about the average performance compared to an ordinary strcmp, now that you pass your comparison objects by reference and the factor 5-10 from before has gone away? |
@joka921 can you look at the merge conflicts? I'll try to do some review this weekend |
…lVocabularyUnicodeFix # Conflicts: # e2e/scientists_queries.yaml # src/index/Index.cpp # src/index/StringSortComparator.h
Ok, I fixed the conflicts so you can have a peek. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Introduced a "TOTAL" collation order (first sort by IDENTICAL Level, then by the language tag, and then by the actual binary representation). This always is and has to be used during index build to enforce, that bitwise different strings never compare equal and have a deterministic ordering.
It also allows us to use a different (lower) Sort Level when using the index, e.g. "which collation Level do we want to use for Filters".
Also included the Correct collation into the external vocabulary which was broken before in its ordering.