Faster Index Build- Phase 1 #227
Conversation
@joka921 ok I've got this running with the newly built case-insensitive Wikidata Full, and completion now works flawlessly, well done! It's not even significantly slower, so that's pretty nice too.
- Enabled by setting "ignore-case": true in the configuration JSON
- Strings that only differ in their case form a contiguous range
- Range filters (<, <=, >=, >) and prefix filters are case-insensitive
- Works on UTF-8 using Niklas' case-conversion methods
- Includes unit tests for the sorting operator
- Blank nodes are left as is; this saves us backing up a hash map
- Set the default constants for the IndexBuilder pipeline to better values
- The partial vocabularies are sorted and written asynchronously while the parser continues working. This helps especially with case-insensitive sorting, which takes several minutes per partial vocabulary.
- Set the default STXXL disk size to 1 TB. Being too big should do no harm due to sparse files, but setting it too small slowed down the index build for some reason.
- Permutations with the same first-order key don't require fully re-sorting the triple vector. This commit exploits that, saving us three calls to STXXL::sort (~3 hours in total for full Wikidata).
- Settings like "internal languages" or "externalized prefixes" are only useful when this flag is activated.
- Automatically activate _onDiskLiterals when one of those settings is present.
…ters
- In the vocabulary merge, the k-way merge (read from disk, determine the order via a priority queue) and the writing of the result to the partial ID maps are now performed in a parallel, pipelined way. This increases the speed of this step (previously: 3 hours).
- Fixed the case-insensitive prefix filtering: prefix filters have a separate mechanism which I had previously forgotten.
- Made the StringSortComparator also work for "partial literals" like "field of work (no trailing quotation marks), as they occur in prefix filters for autocompletion.
d6da472 to b8e2e4a
rhs = StringSortComparator::rdfLiteralToValueForLT(rhs);

LOG(INFO) << "upperBound was converted to " << upperBoundStr << '\n';
LOG(INFO) << "lowerBound was converted to " << rhs << '\n';
More comments + lower log level
src/engine/Filter.cpp
Outdated
    break;
  }
}
LOG(INFO) << "Finished conversion of filter value" << rhs_string
remove logging
}
return rhs_string;
}
Document those methods
src/index/Vocabulary.h
Outdated
} else {
  // TODO<Johannes> Ideally we want to have this also when doing
  // case-insensitive compare, but it currently breaks the prefix
  // compression (there we really need ordering by correct bytes)
@niklas88 I could implement this also for case-insensitive sort (sorting first by the "inner" literal value and only then by the language tag). Should I?
I'm a bit confused by this comment. I thought we built two sorting orders, with one being used for prefix compression, so how does this break it, and what exactly are you proposing to do here?
The point is the following: "pref!" < "pref", because ! < " in ASCII (assume all quotation marks to be escaped).
This can be circumvented by first getting the actual string value without the quotation marks and then sorting by that. I am doing this in the case-insensitive case, but it is also possible for case-sensitive sorting. In that case we would also have to build two vocabularies, one for compression and one not, even when using case-sensitive search. However, this only affects words ending with !, since that is the only printable ASCII character with a smaller value than ". Does this make more sense to you and help you make an informed decision?
Yes, that makes sense. I'd propose we keep the case-sensitive variant as is. Having it be just byte-wise comparison makes it at least easier to reason about. Also, adding the described behavior sounds like complication for little gain.
274c0d8 to 6efd6b9
Some comments, mostly needed clarifications. Overall looks really solid to me, great work!
#include "index/Vocabulary.h"
#include "index/VocabularyGenerator.h"

int main(int argc, char** argv) {
This gives an unused-parameter warning for the int argc parameter.
// Copyright 2019, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach(joka921) <johannes.kalmbach@gmail.com>
//
Could you add a small comment on what this is used for? I assume it's for manually merging vocabularies if there was an error? Is this generally useful?
I will add the comment. The reason is more "benchmarking the vocabulary merging without having to wait 9 hours for the TurtleParser".
#include "index/Vocabulary.h"
#include "index/VocabularyGenerator.h"

int main(int argc, char** argv) {
This results in an unused-parameter error for int argc.
src/index/VocabularyGenerator.cpp
Outdated
  return comp(p1.first, p2.first);
};
if constexpr (USE_PARALLEL_SORT) {
  if (USE_PARALLEL_SORT && doParallelSort) {
With the outer if constexpr you can drop the inner USE_PARALLEL_SORT check.
@@ -55,7 +55,7 @@ void Distinct::computeResult(ResultTable* result) {
  subRes->_resultTypes.begin(),
  subRes->_resultTypes.end());
  result->_localVocab = subRes->_localVocab;
- int width = subRes->_data.size();
+ int width = subRes->_data.cols();
Uhoh, good find!
src/util/MmapVector.h
Outdated
/*
template <typename... Args>
MmapVector(Args&&... args) : MmapVector() {
  open(std::forward<Args>(args)...);
}
*/
Remove commented code
src/util/StringUtils.h
Outdated
@@ -234,7 +234,7 @@ string getUppercase(const string& orig) {
}

// ____________________________________________________________________________
- string getLowercaseUtf8(const string& orig) {
+ string getLowercaseUtf8(std::string_view orig) {
This can be const std::string_view
Same as above.
src/util/StringUtils.h
Outdated
@@ -263,7 +263,7 @@ string getLowercaseUtf8(const string& orig) {
}

// ____________________________________________________________________________
- string getUppercaseUtf8(const string& orig) {
+ string getUppercaseUtf8(std::string_view orig) {
This can be const std::string_view
src/util/StringUtils.h
Outdated
// being escaped by backslashes. If it is not found at all, string::npos is
// returned.
inline size_t findLiteralEnd(std::string_view input,
                             std::string_view literalEnd) {
This can be const std::string_view
- string createBlankNode() {
-   string res = "_:" + std::to_string(_numBlankNodes);
+ string createAnonNode() {
+   string res = ANON_NODE_PREFIX + ":" + std::to_string(_numBlankNodes);
    _numBlankNodes++;
Any reason this is now called "Anon" instead of "Blank"? Also, we are parsing in parallel now, right? So what prevents _numBlankNodes++ updates from racing?
1. This is only used for anon nodes now: those must be unique, so we need to find our own representation with the counter plus some prefix that never occurs in any knowledge base. Previously I also did this for blank nodes, but that slowed down the parser. Blank nodes can be repeated, and we have to make sure that the same blank node in the input stands for the same blank node in the graph/internal representation. For this we can simply reuse the blank node label from the input. So I renamed this, since it is only used for unique anon nodes.
2. There is only one parser thread, and the parser itself is not thread-safe (there is no really good way to parse in parallel with LL(1) grammars). There could be a parallel parsing of Wikidata using their dumps, under the following assumptions:
   - All prefix/base declarations come before all triples.
   - There is a unique way to determine the beginning of a statement (e.g. newlines that are not followed by whitespace).
   I even thought about recommending this to the W3 community, as it would make the parsing so much faster. But we are currently getting closer to the disk write being our limiting factor. Give me parallel disks and there is much more that we can speed up.

So currently: the parser is not thread-safe.
@niklas88 Thanks for your review. I have added some comment clarifications, please check whether you agree. All the stuff that has no comments will be fixed exactly in the way you suggested, since I agree with you in those places.
LGTM
This builds on #209 and tweaks the parser and the vocabulary-building pipeline so that the first stage of index building gets faster.