F.case insensitive label sorting #209

joka921 · 2019-03-16T18:20:03Z

Allow case-insensitive ordering/filtering as an index-build-time option.

Setting "ignore-case": true in settings.json enables this for the index build.
Entity names in triples still have to match exactly.
All filters on string values are case-insensitive (this includes the prefix filters used for autocompletion).

joka921 · 2019-03-16T20:44:50Z

CAVEAT: Currently "Hannibal Hamlin"@en < "Hannibal"@en (because of the closing ").
I have to check what the SPARQL standard says here and adapt accordingly (probably compare without the language tags).
But as far as I can tell, this should be broken with the default sorting as well.
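To illustrate the caveat: after the shared prefix `"Hannibal`, the first literal continues with a space (0x20) and the second with the closing quote (0x22), so a byte-wise comparison puts the longer literal first. The sketch below reproduces this and shows one possible fix, comparing only the quoted values. `literalValue`, `lessRaw` and `lessByValue` are hypothetical names for illustration, not code from this PR.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not part of the PR): returns the text between the
// first and the last double quote of a literal such as "Hannibal"@en.
// Inputs without two quotes are returned unchanged.
inline std::string_view literalValue(std::string_view literal) {
  const auto first = literal.find('"');
  const auto last = literal.rfind('"');
  if (first == std::string_view::npos || first == last) {
    return literal;
  }
  return literal.substr(first + 1, last - first - 1);
}

// Byte-wise comparison of the full literals, quotes and tag included.
inline bool lessRaw(std::string_view a, std::string_view b) { return a < b; }

// Comparison of the quoted values only, ignoring quotes and language tag.
inline bool lessByValue(std::string_view a, std::string_view b) {
  return literalValue(a) < literalValue(b);
}
```

With this, `lessRaw` orders "Hannibal Hamlin"@en before "Hannibal"@en, while `lessByValue` orders them the other way round. What the SPARQL standard requires here is still the open question from the comment above.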

niklas88

Just a first pass. I'm still a bit unclear about the details, so most of my comments are actually questions. Also, I think we have to look into Unicode awareness. We do have ad_utility::getLowercaseUtf8(), but it's a bit ugly and we have to look at the performance impact.

niklas88 · 2019-03-18T16:04:37Z

src/engine/Filter.cpp

+          switch (_type) {
+            case SparqlFilter::GE:
+            case SparqlFilter::LT: {
+              std::transform(rhs_string.begin(), rhs_string.end(),


We have ad_utility::getUppercaseUtf8(), which handles UTF-8 (though it's quite ugly at the moment).

niklas88 · 2019-03-18T16:05:47Z

src/engine/Filter.cpp

+            case SparqlFilter::GT:
+            case SparqlFilter::LE: {
+              std::transform(rhs_string.begin(), rhs_string.end(),
+                             rhs_string.begin(), ::tolower);


ad_utility::getLowercaseUtf8()
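As a small illustration of why the byte-wise `::tolower` in the diff is not enough: in the default "C" locale it only maps the ASCII letters A-Z, so the two bytes of a UTF-8 encoded "Ä" (0xC3 0x84) pass through unchanged. `toLowerBytewise` is a hypothetical name for this sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lowercases byte by byte, like the std::transform call in the diff.
// The cast to unsigned char avoids undefined behaviour for bytes >= 0x80.
// In the default "C" locale only the ASCII letters A-Z are affected.
inline std::string toLowerBytewise(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}
```

Applied to "ÄB" this yields "Äb" rather than "äb", which is the gap a UTF-8 aware function would have to close.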

niklas88 · 2019-03-18T16:09:01Z

src/index/Index.cpp

    LOG(INFO) << "Done\n";
+
+    if (_vocabPrefixCompressed && _vocab.getCaseInsensitiveOrdering()) {


This looks like a copy of lines 223 and onwards? Can we use a method for it?

niklas88 · 2019-03-18T16:11:16Z

src/index/Vocabulary.h

+      // BUG<Johannes>
+      // We have to make sure that the literals are all in contiguous space to
+      // make this work.
+      // TODO<Johannes> Ideally we want to have this also when doing


Is this still true after the latest commits?

niklas88 · 2019-03-18T16:13:19Z

src/index/Vocabulary.h

+    const auto result =
+        std::mismatch(a.val.cbegin(), a.val.cend(), b.val.cbegin(),
+                      b.val.cend(), [](const auto& lhs, const auto& rhs) {
+                        return tolower(lhs) == tolower(rhs);


This too should use a Unicode-aware tolower().

I think we currently have to map the whole string to lowercase as UTF-8 and then do the mismatch step, because variable-length encodings are not well supported in C++ (as long as the performance is OK; I will argue about this below).
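A minimal sketch of the "lowercase everything first, then mismatch" idea. It assumes a UTF-8 aware lowercase function is available; `lowercasePlaceholder` below is a stand-in that only handles ASCII, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Stand-in for a UTF-8 aware lowercase function such as
// ad_utility::getLowercaseUtf8(); this placeholder handles ASCII only.
inline std::string lowercasePlaceholder(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Case-insensitive "less than": lowercase both strings completely, then
// locate the first differing byte with std::mismatch.
inline bool lessIgnoreCase(const std::string& a, const std::string& b) {
  const std::string la = lowercasePlaceholder(a);
  const std::string lb = lowercasePlaceholder(b);
  const auto result =
      std::mismatch(la.cbegin(), la.cend(), lb.cbegin(), lb.cend());
  if (result.second == lb.cend()) {
    // b is a prefix of a (or both are equal), so a is not smaller.
    return false;
  }
  if (result.first == la.cend()) {
    // a is a proper prefix of b.
    return true;
  }
  // Both iterators are dereferenceable here. Compare as unsigned bytes so
  // that multi-byte UTF-8 sequences sort after ASCII.
  return static_cast<unsigned char>(*result.first) <
         static_cast<unsigned char>(*result.second);
}
```

Note that this sketch treats strings that differ only in case as equivalent. It does no tie-breaking between "ALPHA" and "alpha", which the comparator in the PR may handle differently.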

niklas88 · 2019-03-18T16:14:55Z

src/index/Vocabulary.h

+
+    // neither string is a prefix of the other, look at the first mismatch
+    // character if we have reach here, both iterators are save to dereference.
+    return tolower(*result.first) < tolower(*result.second);


This too needs Unicode awareness

niklas88 · 2019-03-18T16:22:13Z

src/index/VocabularyImpl.h

@@ -280,8 +281,9 @@ template <class S>
 bool PrefixComparator<S>::operator()(const string& lhs,
                                     const string& rhs) const {
  // TODO<joka921> use string_view for the substrings
-  return (lhs.size() > _prefixLength ? lhs.substr(0, _prefixLength) : lhs) <
-         (rhs.size() > _prefixLength ? rhs.substr(0, _prefixLength) : rhs);
+  return _vocab->getCaseComparator()(


What about the above TODO, is this harder than we thought?
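For reference, a sketch of what the string_view variant of the old, case-sensitive prefix comparison could look like. `prefixLess` is a hypothetical free function; the PR itself routes the comparison through `_vocab->getCaseComparator()`.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Compares at most `prefixLength` leading bytes of both strings.
// std::string_view::substr clamps the count to the available length and
// does not allocate, unlike std::string::substr.
inline bool prefixLess(std::string_view lhs, std::string_view rhs,
                       std::size_t prefixLength) {
  return lhs.substr(0, prefixLength) < rhs.substr(0, prefixLength);
}
```

Because the clamping is built in, the explicit `size() > _prefixLength` checks from the removed lines are no longer needed in this form.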

niklas88 · 2019-03-18T16:22:57Z

test/VocabularyTest.cpp

+  ASSERT_TRUE(comp("ALPHA", "alpha"));
+
+  // TODO: check what to do about these cases
+  // ASSERT_TRUE(comp("\"Hannibal\"@en", "\"Hannibal Hamlin\"@en"));


Shouldn't this work with the latest commit that splits off the language tag and only compares values?

joka921 · 2019-03-18T17:50:18Z

In general, the performance is as follows:
We need the comparisons for the initial vocabulary sorting (timing is irrelevant here, since this is not a bottleneck in index building). Then we need one binary search for each entity in the query (in a triple or a filter; this can also be a literal). That is log(n) comparisons times a few entries, so I don't think it matters too much, but we can and should always measure.
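To make the log(n) estimate concrete, here is a small sketch that counts how often a binary search calls the comparator. It assumes the vocabulary is a sorted `std::vector<std::string>`, which is a simplification of the real vocabulary; `LookupResult` and `lookupCounting` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct LookupResult {
  bool found;               // whether `word` is in the vocabulary
  std::size_t comparisons;  // comparator calls made by the binary search
};

// Looks up `word` in a vocabulary that was sorted with `less` and counts
// the comparator calls of std::lower_bound.
template <typename Less>
LookupResult lookupCounting(const std::vector<std::string>& sortedVocab,
                            const std::string& word, Less less) {
  std::size_t comparisons = 0;
  auto counting = [&](const std::string& a, const std::string& b) {
    ++comparisons;
    return less(a, b);
  };
  const auto it =
      std::lower_bound(sortedVocab.begin(), sortedVocab.end(), word, counting);
  // The final equality check uses the uncounted comparator.
  const bool found = it != sortedVocab.end() && !less(word, *it);
  return {found, comparisons};
}
```

For a vocabulary with 1024 entries this stays at roughly a dozen comparator calls per lookup, so even a comparator that lowercases both operands is only invoked a handful of times per entity in the query.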

joka921 · 2019-03-21T09:27:27Z

This current rebased commit should work.

Still to do: end-to-end / filter tests.

This does the job for completely case-insensitive filtering. Otherwise, the ordering follows the lowercase UTF-8 codepoints.

There are solutions that also manage to sort like a ä b instead of a b ä, but those are all very slow (they use std::locale(), which I think always acquires locks, or they have to compare certain codepoints manually). This would currently slow down the index build, but maybe I can do something with parallel sorting here (if the locking is not the problem). For now I would use this version to test the prefix autocompletion.
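A short sketch of why codepoint order gives a b ä: in UTF-8, "ä" is encoded as the bytes 0xC3 0xA4, and std::string compares bytes as unsigned values, so any multi-byte sequence sorts after all ASCII letters. `sortByBytes` is a hypothetical name for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts by raw UTF-8 bytes, which is the same as sorting by codepoint.
// No locale is involved, so "ä" (0xC3 0xA4) ends up after "b" (0x62).
inline std::vector<std::string> sortByBytes(std::vector<std::string> words) {
  std::sort(words.begin(), words.end());
  return words;
}
```

Sorting the words b, ä, a this way yields a, b, ä. Getting a, ä, b would need a locale-aware collation, which is the slow path described above.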

joka921 · 2019-04-04T14:31:59Z

I think this feature is now complete with respect to the code.
I should still document it in the guides. Question:
Should case-insensitive sorting be the default or not?

niklas88 · 2019-04-04T14:41:49Z

I still have to build a test index with this and synchronize with the version you used for Blazegraph. Then we can decide what the impact of this is.

joka921 · 2019-04-05T07:48:19Z

I just rebased this onto the merged parser, in case you want to build an index.

- Enabled by setting "ignore-case": true in the configuration JSON
- Strings that differ only in their case form a contiguous range
- Range filters (<, <=, >=, >) are case-insensitive
- Works on UTF-8 using Niklas' case-conversion methods
- Includes unit tests for the sorting operator
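The "contiguous range" property can be sketched with `std::equal_range`: once the vocabulary is sorted with a case-insensitive comparator, all case variants of a word sit next to each other and can be found with a single binary search. The comparator below is an ASCII-only placeholder for the UTF-8 aware one in the PR, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only case-insensitive ordering, used as a placeholder for the
// UTF-8 aware comparator of the PR.
inline bool lessIgnoreCaseAscii(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(),
      [](unsigned char x, unsigned char y) {
        return std::tolower(x) < std::tolower(y);
      });
}

// Number of vocabulary entries that equal `word` up to case. The vocabulary
// must already be sorted with lessIgnoreCaseAscii.
inline std::size_t countCaseVariants(
    const std::vector<std::string>& sortedVocab, const std::string& word) {
  const auto range = std::equal_range(sortedVocab.begin(), sortedVocab.end(),
                                      word, lessIgnoreCaseAscii);
  return static_cast<std::size_t>(range.second - range.first);
}
```

The same pair of iterators is what a case-insensitive equality or range filter would iterate over.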


niklas88 · 2019-04-10T12:17:55Z

@joka921 do you think we should merge this before the #227 PR?

niklas88 · 2019-04-11T09:17:39Z

As discussed offline, let's merge this as part of #227, since that also has the fix for prefix search.

niklas88 reviewed Mar 18, 2019


joka921 force-pushed the f.caseInsensitiveLabelSorting branch from fb3ae82 to fdf9275 Compare March 21, 2019 09:22

joka921 force-pushed the f.caseInsensitiveLabelSorting branch from fdf9275 to dd4eff8 Compare April 4, 2019 13:19

joka921 force-pushed the f.caseInsensitiveLabelSorting branch from 3a30d12 to 5892974 Compare April 5, 2019 07:46

joka921 mentioned this pull request Apr 8, 2019

Faster Index Build- Phase 1 #227

Merged

joka921 added 2 commits April 9, 2019 17:33

Removed a warning + made case-insensitive the default for Wikidata

4b0acf7

joka921 force-pushed the f.caseInsensitiveLabelSorting branch from 5892974 to 4b0acf7 Compare April 9, 2019 15:35

niklas88 closed this Apr 11, 2019

niklas88 mentioned this pull request Apr 16, 2019

Fast case insensitive sort order and regex #149

Closed

joka921 deleted the f.caseInsensitiveLabelSorting branch August 24, 2022 09:09
