Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

F.case insensitive label sorting #209

Closed

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Mar 16, 2019

Allow case-insensitive ordering/filtering as a index-build-time option.

  • setting "ignore-case: true" in the settings.json will enable this setting for the index build.
  • Entity Names in triples have to match exactly.
  • All filters for string values are done in a case-insensitive manner (this includes the prefix filters for autocompletion)

@joka921
Copy link
Member Author

joka921 commented Mar 16, 2019

CAVEAT: Currently "Hannibal Hamlin"@en < "Hannibal"@en (because of the ").
I have to check, what the SPARQL standard says here and adapt it (compare without the language tags.
But as it seems, this should also be broken with the default sorting.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a first pass, I'm still a bit unclear about the details so most of my comments are actually questions. Also I think we have to look into Unicode awareness. We do have ad_utility::getLowercaseUtf8() but it's a bit ugly and we have to lo look at the performance impact.

switch (_type) {
case SparqlFilter::GE:
case SparqlFilter::LT: {
std::transform(rhs_string.begin(), rhs_string.end(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have ad_utility::getUppercaseUtf8() which handles UTF8 (though it's quite ugly at the moment)

case SparqlFilter::GT:
case SparqlFilter::LE: {
std::transform(rhs_string.begin(), rhs_string.end(),
rhs_string.begin(), ::tolower);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ad_utility::getLowercaseUtf8()

LOG(INFO) << "Done\n";

if (_vocabPrefixCompressed && _vocab.getCaseInsensitiveOrdering()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a copy of lines 223 and onwards? Can we use a method for it?

// BUG<Johannes>
// We have to make sure that the literals are all in contiguous space to
// make this work.
// TODO<Johannes> Ideally we want to have this also when doing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still true after the latest commits?

const auto result =
std::mismatch(a.val.cbegin(), a.val.cend(), b.val.cbegin(),
b.val.cend(), [](const auto& lhs, const auto& rhs) {
return tolower(lhs) == tolower(rhs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too should use an Unicode aware tolower()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we currently have to map the whole string to lowercase using utf-8 and then do the mismatch thing because in c++ the whole "variable length" business is not well-supported. (as long as the performance is ok, I will argue about this below).


// neither string is a prefix of the other, look at the first mismatch
// character if we have reach here, both iterators are save to dereference.
return tolower(*result.first) < tolower(*result.second);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too needs Unicode awareness

@@ -280,8 +281,9 @@ template <class S>
bool PrefixComparator<S>::operator()(const string& lhs,
const string& rhs) const {
// TODO<joka921> use string_view for the substrings
return (lhs.size() > _prefixLength ? lhs.substr(0, _prefixLength) : lhs) <
(rhs.size() > _prefixLength ? rhs.substr(0, _prefixLength) : rhs);
return _vocab->getCaseComparator()(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the above TODO, is this harder than we thought?

ASSERT_TRUE(comp("ALPHA", "alpha"));

// TODO: check what to do about these cases
// ASSERT_TRUE(comp("\"Hannibal\"@en", "\"Hannibal Hamlin\"@en"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this work with the latest commit that splits of the language tag and only compares values?

@joka921
Copy link
Member Author

joka921 commented Mar 18, 2019

In General, Performance is as follows:
We need the comparisons for the initial vocabulary sorting. (Timing irrelevant, not a bottleneck in Index Building). And then we need one binary search for each Entity in the query (in a triple or a filter, can also be a literal). This is log(n) * few entries, so I don't think it matters too much, but we can and shall always measure.

@joka921 joka921 force-pushed the f.caseInsensitiveLabelSorting branch from fb3ae82 to fdf9275 Compare March 21, 2019 09:22
@joka921
Copy link
Member Author

joka921 commented Mar 21, 2019

This current rebased commit should work

  • still todo: End-to-end / filter tests

This does the job for completely case-insensitive filtering. Otherwise the ordering is according to lowercase utf codepoints.

There are solutions that also manage to sort like a ä b instead of a b ä but those are all very very slow (using std::locale() which i think always aquires locks etc or has to do a manual comparison of certain codepoints.) This would currently slow down the index build, but maybe I can do something with parallel sorting here (if it is not the locking that is the problem). But I would use this version to test the prefix autocompletion.

@joka921 joka921 force-pushed the f.caseInsensitiveLabelSorting branch from fdf9275 to dd4eff8 Compare April 4, 2019 13:19
@joka921
Copy link
Member Author

joka921 commented Apr 4, 2019

I think this feature is currently complete wrt to the code.
I still should document it in the guides. Question:
Should the case-insensitive sorting be the default or not?

@niklas88
Copy link
Member

niklas88 commented Apr 4, 2019 via email

@joka921 joka921 force-pushed the f.caseInsensitiveLabelSorting branch from 3a30d12 to 5892974 Compare April 5, 2019 07:46
@joka921
Copy link
Member Author

joka921 commented Apr 5, 2019

I just rebased this the the merged parser, in case you want to build an index.

- Enabled by setting "ignore-case":true in the configuration json
- Strings that only differ in their case form a contiguous range
- Range filters (< <= >= >) are case-insensitive
- Works on UTF-8 using Niklas' case-conversion methods
- Includes Unit-Tests for the sorting operator.
@joka921 joka921 force-pushed the f.caseInsensitiveLabelSorting branch from 5892974 to 4b0acf7 Compare April 9, 2019 15:35
@niklas88
Copy link
Member

@joka921 do you think we should merge this before the #227 PR?

@niklas88
Copy link
Member

As discussed offline, lets merge this as part of #227 since that also has the fix for prefix search

@niklas88 niklas88 closed this Apr 11, 2019
@joka921 joka921 deleted the f.caseInsensitiveLabelSorting branch August 24, 2022 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants