New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
F.case insensitive label sorting #209
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a first pass, I'm still a bit unclear about the details so most of my comments are actually questions. Also I think we have to look into Unicode awareness. We do have ad_utility::getLowercaseUtf8()
but it's a bit ugly and we have to lo look at the performance impact.
src/engine/Filter.cpp
Outdated
switch (_type) { | ||
case SparqlFilter::GE: | ||
case SparqlFilter::LT: { | ||
std::transform(rhs_string.begin(), rhs_string.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have ad_utility::getUppercaseUtf8()
which handles UTF8 (though it's quite ugly at the moment)
src/engine/Filter.cpp
Outdated
case SparqlFilter::GT: | ||
case SparqlFilter::LE: { | ||
std::transform(rhs_string.begin(), rhs_string.end(), | ||
rhs_string.begin(), ::tolower); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ad_utility::getLowercaseUtf8()
src/index/Index.cpp
Outdated
LOG(INFO) << "Done\n"; | ||
|
||
if (_vocabPrefixCompressed && _vocab.getCaseInsensitiveOrdering()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a copy of lines 223 and onwards? Can we use a method for it?
// BUG<Johannes> | ||
// We have to make sure that the literals are all in contiguous space to | ||
// make this work. | ||
// TODO<Johannes> Ideally we want to have this also when doing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still true after the latest commits?
src/index/Vocabulary.h
Outdated
const auto result = | ||
std::mismatch(a.val.cbegin(), a.val.cend(), b.val.cbegin(), | ||
b.val.cend(), [](const auto& lhs, const auto& rhs) { | ||
return tolower(lhs) == tolower(rhs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too should use an Unicode aware tolower()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we currently have to map the whole string to lowercase using utf-8 and then do the mismatch thing because in c++ the whole "variable length" business is not well-supported. (as long as the performance is ok, I will argue about this below).
src/index/Vocabulary.h
Outdated
|
||
// neither string is a prefix of the other, look at the first mismatch | ||
// character if we have reach here, both iterators are save to dereference. | ||
return tolower(*result.first) < tolower(*result.second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too needs Unicode awareness
@@ -280,8 +281,9 @@ template <class S> | |||
bool PrefixComparator<S>::operator()(const string& lhs, | |||
const string& rhs) const { | |||
// TODO<joka921> use string_view for the substrings | |||
return (lhs.size() > _prefixLength ? lhs.substr(0, _prefixLength) : lhs) < | |||
(rhs.size() > _prefixLength ? rhs.substr(0, _prefixLength) : rhs); | |||
return _vocab->getCaseComparator()( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the above TODO, is this harder than we thought?
test/VocabularyTest.cpp
Outdated
ASSERT_TRUE(comp("ALPHA", "alpha")); | ||
|
||
// TODO: check what to do about these cases | ||
// ASSERT_TRUE(comp("\"Hannibal\"@en", "\"Hannibal Hamlin\"@en")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this work with the latest commit that splits of the language tag and only compares values?
In General, Performance is as follows: |
fb3ae82
to
fdf9275
Compare
This current rebased commit should work
This does the job for completely case-insensitive filtering. Otherwise the ordering is according to lowercase utf codepoints. There are solutions that also manage to sort like a ä b instead of a b ä but those are all very very slow (using std::locale() which i think always aquires locks etc or has to do a manual comparison of certain codepoints.) This would currently slow down the index build, but maybe I can do something with parallel sorting here (if it is not the locking that is the problem). But I would use this version to test the prefix autocompletion. |
fdf9275
to
dd4eff8
Compare
I think this feature is currently complete wrt to the code. |
I still have to built a test index with this and synchronize with the version you used for Blazegraph. Then we can decide what the impact of this is
|
3a30d12
to
5892974
Compare
I just rebased this the the merged parser, in case you want to build an index. |
- Enabled by setting "ignore-case":true in the configuration json - Strings that only differ in their case form a contiguous range - Range filters (< <= >= >) are case-insensitive - Works on UTF-8 using Niklas' case-conversion methods - Includes Unit-Tests for the sorting operator.
5892974
to
4b0acf7
Compare
As discussed offline, lets merge this as part of #227 since that also has the fix for prefix search |
Allow case-insensitive ordering/filtering as a index-build-time option.