Fix the Unicode collation for almost equal literals and the external vocabulary #312

joka921 · 2020-01-26T10:01:37Z

Introduced a "TOTAL" collation order: sort first by the IDENTICAL level, then by the language tag, and finally by the actual binary representation. This order is, and has to be, always used during the index build to ensure that bitwise different strings never compare equal and that the ordering is deterministic.
It also allows us to use a different (lower) sort level when querying the index, e.g. to decide which collation level we want to use for filters.
Also included the correct collation in the external vocabulary, whose ordering was broken before.
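
A minimal sketch of the three-stage "TOTAL" comparison described above. It is not the code from this PR: `Literal`, `collationKey`, `totalLess` and `collationLess` are hypothetical names, and `collationKey` is only an ASCII lower-casing stand-in for an ICU sort key so that the example compiles without ICU.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical representation of a literal: content plus optional language tag.
struct Literal {
  std::string content;
  std::string langtag;  // empty if the literal has no language tag
};

// ASSUMPTION: stand-in for a real collation sort key (the PR uses ICU).
// It only lower-cases ASCII, so "Abc" and "abc" get the same key.
std::string collationKey(const std::string& s) {
  std::string key = s;
  std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return key;
}

// Total order: collation key first, then language tag, then raw bytes.
// Two literals that differ in any byte or in their language tag never
// compare equal.
bool totalLess(const Literal& a, const Literal& b) {
  const std::string keyA = collationKey(a.content);
  const std::string keyB = collationKey(b.content);
  if (keyA != keyB) return keyA < keyB;
  if (a.langtag != b.langtag) return a.langtag < b.langtag;
  return a.content < b.content;
}

// Weaker order on the collation key only, e.g. for evaluating filters.
// Here distinct literals may compare equal.
bool collationLess(const Literal& a, const Literal& b) {
  return collationKey(a.content) < collationKey(b.content);
}
```

With this split, the index can be built with `totalLess`, while a filter could use `collationLess` and treat `"Abc"@en` and `"abc"@en` as equal.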

- The Operation classes now have a getChildren() method that returns non-owning pointers to all of their children.
- Each Operation holds a vector of strings to store warnings that are emitted during the result computation. Those can be recursively retrieved using the collectWarnings() function, which internally uses the getChildren() function mentioned above.
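
The recursive warning collection can be sketched roughly as follows. Only the names `getChildren()` and `collectWarnings()` come from the description; the signatures, `addWarning`, and the `Leaf` and `Join` classes are assumptions made for illustration and do not mirror the actual class hierarchy.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Operation {
 public:
  virtual ~Operation() = default;

  // Non-owning pointers to the direct children of this operation.
  virtual std::vector<Operation*> getChildren() = 0;

  // Hypothetical helper: record a warning during result computation.
  void addWarning(std::string warning) {
    warnings_.push_back(std::move(warning));
  }

  // Own warnings first, then the warnings of all descendants (pre-order).
  std::vector<std::string> collectWarnings() {
    std::vector<std::string> result = warnings_;
    for (Operation* child : getChildren()) {
      if (child == nullptr) continue;
      std::vector<std::string> sub = child->collectWarnings();
      result.insert(result.end(), sub.begin(), sub.end());
    }
    return result;
  }

 private:
  std::vector<std::string> warnings_;
};

// Hypothetical operation without children.
class Leaf : public Operation {
 public:
  std::vector<Operation*> getChildren() override { return {}; }
};

// Hypothetical binary operation that owns its two children.
class Join : public Operation {
 public:
  Join(std::unique_ptr<Operation> left, std::unique_ptr<Operation> right)
      : left_(std::move(left)), right_(std::move(right)) {}

  std::vector<Operation*> getChildren() override {
    return {left_.get(), right_.get()};
  }

 private:
  std::unique_ptr<Operation> left_;
  std::unique_ptr<Operation> right_;
};
```

Returning raw pointers keeps ownership with the parent operation; callers such as `collectWarnings()` only traverse the tree.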

The external vocabulary was so far created using the ICU collation (or a wrong version of it because of the "Externalization Prefix"). This should now be fixed, but we should still refactor the Vocabulary to use proper strict and static typing.

hannahbast · 2020-01-26T10:18:53Z

@joka921 Thanks, Johannes, having a well-defined total order makes a lot of sense! Can you say something about the average performance compared to an ordinary strcmp, now that you pass your comparison objects by reference and the factor 5-10 from before has gone away?

niklas88 · 2020-03-20T16:59:13Z

@joka921 can you look at the merge conflicts? I'll try to do some review this weekend

joka921 · 2020-03-21T17:35:27Z

Ok, I fixed the conflicts so you can have a peek.

niklas88

LGTM

joka921 added 11 commits January 11, 2020 19:52

Set the default Collation Level at IndexBuild time to IDENTICAL to en…

5dbe3fc

…sure total ordering.

Implemented and tested Unicode Normalization.

c5cbdaf

Actually Normalize the strings.

c281e71

Merge remote-tracking branch 'upstream/master' into upstream

95cdb0d

Fixed the Values out-of-vocab bug

3783e08

- Out-of-vocab entries in value clauses previously triggered exceptions.
- Now rows that contain unknown words are ignored and trigger a warning in the result json.
- This is also tested in the end-to-end test.
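
The behavior described in this commit message can be sketched as follows. This is an illustration under assumptions, not the committed code: `Vocabulary` is modeled as a plain hash map, and `Id`, `ValuesResult`, `resolveValues` and the warning text are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Id = std::size_t;
// ASSUMPTION: the vocabulary is modeled as a simple word -> id map.
using Vocabulary = std::unordered_map<std::string, Id>;

struct ValuesResult {
  std::vector<std::vector<Id>> rows;  // rows whose words are all known
  std::vector<std::string> warnings;  // one entry per dropped row
};

// Translate the rows of a VALUES clause to ids. A row that contains a word
// missing from the vocabulary is skipped and reported as a warning instead
// of raising an exception.
ValuesResult resolveValues(const Vocabulary& vocab,
                           const std::vector<std::vector<std::string>>& rows) {
  ValuesResult result;
  for (const auto& row : rows) {
    std::vector<Id> ids;
    bool allKnown = true;
    for (const auto& word : row) {
      auto it = vocab.find(word);
      if (it == vocab.end()) {
        result.warnings.push_back("The word " + word +
                                  " is not part of the vocabulary, "
                                  "ignoring its row");
        allKnown = false;
        break;
      }
      ids.push_back(it->second);
    }
    if (allKnown) result.rows.push_back(std::move(ids));
  }
  return result;
}
```

The collected warnings could then be attached to the JSON result, as the second bullet describes.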

Reset everything to use the IDENTICAL Level when dealing with the voc…

8875333

…abulary, because everything else leads to incorrect behavior.

The whole Result to JSON pipeline now uses nlohmann::json.

4c51116

The code is still somewhat ugly, but already much less so.

Refactored the json s.t. the E2E script now likes it.

e1db584

Completely split this from the Values business.

62dcc6e

This now only fixes the unicode stuff.

joka921 requested a review from niklas88 January 26, 2020 10:02

joka921 added 2 commits March 21, 2020 18:22

Merge remote-tracking branch 'remotes/upstream/master' into f.Externa…

4ecc828

…lVocabularyUnicodeFix
# Conflicts:
#   e2e/scientists_queries.yaml
#   src/index/Index.cpp
#   src/index/StringSortComparator.h

Fixed the bugs that occurred during merging and ran clang-format

ddceed5

niklas88 approved these changes Mar 21, 2020

niklas88 merged commit 0c0f5de into ad-freiburg:master Mar 21, 2020

joka921 deleted the f.ExternalVocabularyUnicodeFix branch May 8, 2021 09:08
