Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the Unicode collation for almost equal literals and the external vocabulary #312

Merged
merged 13 commits into from Mar 21, 2020

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Jan 26, 2020

  • Introduced a "TOTAL" collation order (first sort by IDENTICAL Level, then by the language tag, and then by the actual binary representation). This always is and has to be used during index build to enforce, that bitwise different strings never compare equal and have a deterministic ordering.

  • It also allows us to use a different (lower) Sort Level when using the index, e.g. "which collation Level do we want to use for Filters".

  • Also included the Correct collation into the external vocabulary which was broken before in its ordering.

- The Operation classes now have a getChildren()
  method that returns non-owning pointers to all of their children

- Each Operation holds a vector of strings to store warnings that are emitted during the result computation. Those can be recursively retrieved
  using the collectWarnings() function that internally uses the getChildren()
  function mentioned above
- Out-of-vocab entries in value clauses previously triggered exceptions.
- Now rows that contain unknown words are ignored and trigger a warning in the result json.
- This is also tested in the end-to-end test
…abulary, because everything else leads to incorrect behavior.
The code is still somewhat ugly, but already much less uglier.
The external vocabulary was so far created using the ICU collation (or a wrong version of it bc.
of the "Externalization Prefix". This should be now fixed, but we still should refactor the Vocabulary to use proper strict and static typin
This now only fixes the unicode stuff.
@joka921 joka921 requested a review from niklas88 January 26, 2020 10:02
@hannahbast
Copy link
Member

@joka921 Thanks, Johannes, having a well-defined total order makes a lot of sense! Can you say something about the average performance compared to an ordinary strcmp, now that you pass your comparison objects by reference and the factor 5-10 from before has gone away?

@niklas88
Copy link
Member

@joka921 can you look at the merge conflicts? I'll try to do some review this weekend

…lVocabularyUnicodeFix

# Conflicts:
#	e2e/scientists_queries.yaml
#	src/index/Index.cpp
#	src/index/StringSortComparator.h
@joka921
Copy link
Member Author

joka921 commented Mar 21, 2020

Ok, I fixed the conflicts so you can have a peek.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@niklas88 niklas88 merged commit 0c0f5de into ad-freiburg:master Mar 21, 2020
@joka921 joka921 deleted the f.ExternalVocabularyUnicodeFix branch May 8, 2021 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants