
Faster Index Build - Phase 1 #227

Merged: 8 commits into ad-freiburg:master from f.fasterIndexBuild2019-1 on Apr 14, 2019

Conversation

@joka921 (Member) commented Apr 8, 2019

This builds on #209 and tweaks the parser and the vocabulary-building pipeline so that the first stage of index building gets faster.

@niklas88 (Member) commented:

@joka921 ok, I've got this running with the newly built case-insensitive Wikidata Full, and completion now works flawlessly. Well done! It's not even significantly slower, so that's pretty nice too.

- Enabled by setting "ignore-case": true in the configuration JSON
- Strings that only differ in their case form a contiguous range
- Range filters (<, <=, >=, >) and prefix filters are case-insensitive
- Works on UTF-8 using Niklas' case-conversion methods
- Includes unit tests for the sorting operator
- Blank nodes are left as is; this saves us backing up a hash map
- Set the default constants for the IndexBuilder pipeline to better values
- The partial vocabularies are sorted and written asynchronously while the parser continues working (see the sketch after this list). This helps especially with case-insensitive sorting, which takes several minutes per partial vocabulary.
- Set the default STXXL disk size to 1 TB. Being too big should do no harm due to sparse files, but setting it too small slowed down the index build for some reason.
- Permutations with the same first-order key don't require fully re-sorting the triple vector. This is exploited in this commit and saves us three calls to STXXL::sort (~3 hours in total for full Wikidata).
- Settings like "internal languages" or "externalized prefixes" are only useful when this flag is activated.
- Automatically activate _onDiskLiterals when one of those settings is present.
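The asynchronous sort-and-write step could look roughly like the following minimal sketch; the function name, the plain-text output format, and the comparator choice are illustrative, not QLever's actual pipeline code.

#include <algorithm>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Sketch: sort a finished batch of words and write it on a background
// thread while the parser keeps filling the next batch.
std::future<void> sortAndWriteAsync(std::vector<std::string> words,
                                    std::string filename) {
  return std::async(std::launch::async,
                    [words = std::move(words),
                     filename = std::move(filename)]() mutable {
    // The real pipeline would sort with the (possibly case-insensitive)
    // StringSortComparator here.
    std::sort(words.begin(), words.end());
    std::ofstream out(filename);
    for (const auto& w : words) {
      out << w << '\n';
    }
  });
}

// Usage: keep the future, let the parser fill the next batch, and wait
// for all futures before the merge step:
//   auto fut = sortAndWriteAsync(std::move(batch), "partial-vocab-0.txt");
//   ... parse the next batch ...
//   fut.wait();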

- In the vocabulary merge, the k-way merge (reading from disk and determining the order via a priority queue) and the writing of the result to the partial id maps are now performed in a parallel, pipelined way. This speeds up the step, which previously took about 3 hours. (A sketch of the merge core follows below.)

- Fixed the case-insensitive prefix filtering:
   - Prefix filters have a separate mechanism which I had previously forgotten.
   - Made the StringSortComparator also work for "partial literals" like "field of work (without the trailing quotation mark), as they occur in prefix filters for autocompletion.
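A minimal sketch of the merge core described above: each partial vocabulary is an already-sorted run, and a priority queue always yields the globally smallest word next. The real code streams the runs from disk and writes the partial id maps in a pipelined fashion; all names below are illustrative.

#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merge k sorted word lists into one sorted list.
std::vector<std::string> kWayMerge(
    const std::vector<std::vector<std::string>>& runs) {
  using Entry = std::pair<std::string, size_t>;  // (word, index of its run)
  auto cmp = [](const Entry& a, const Entry& b) { return a.first > b.first; };
  std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> queue(cmp);

  std::vector<size_t> pos(runs.size(), 0);
  for (size_t i = 0; i < runs.size(); ++i) {
    if (!runs[i].empty()) queue.emplace(runs[i][0], i);
  }

  std::vector<std::string> merged;
  while (!queue.empty()) {
    auto [word, src] = queue.top();
    queue.pop();
    merged.push_back(std::move(word));
    // Refill from the run the smallest word came from.
    if (++pos[src] < runs[src].size()) {
      queue.emplace(runs[src][pos[src]], src);
    }
  }
  return merged;
}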
src/engine/Distinct.cpp (resolved)
rhs = StringSortComparator::rdfLiteralToValueForLT(rhs);

LOG(INFO) << "upperBound was converted to " << upperBoundStr << '\n';
LOG(INFO) << "lowerBound was converted to " << rhs << '\n';
joka921 (Member Author): More comments + lower log level

src/engine/Filter.cpp (outdated)
break;
}
}
LOG(INFO) << "Finished conversion of filter value" << rhs_string
joka921 (Member Author): remove logging

}
return rhs_string;
}

joka921 (Member Author): Document those methods

} else {
// TODO<Johannes> Ideally we want to have this also when doing
// case-insensitive compare, but it currently breaks the prefix
// compression (there we really need ordering by correct bytes)
joka921 (Member Author): @niklas88 I could implement this also for the case-insensitive sort (sorting first by the "inner" literal value and only then by the language tag); should I?

niklas88 (Member): I'm a bit confused by this comment. I thought we built two sorting orders, with one being used for prefix compression, so how does this break it, and what exactly are you proposing to do here?

joka921 (Member Author): The point is the following:
"pref!" < "pref", because ! < " in ASCII (assume all quotation marks to be escaped).
This can be circumvented by first extracting the actual string value without the quotation marks and then sorting by that. I am doing this in the case-insensitive case, but it is also possible for case-sensitive sorting.

In that case we would also have to build two vocabularies, one for compression and one without, even when using case-sensitive search.

However, this only affects words ending with !, since this is the only printable ASCII character that has a smaller value than ".

Does this make more sense to you and help you make an informed decision?
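To make the ordering concrete, here is a tiny self-contained example of the byte-wise comparison on the quoted literal representations:

#include <cassert>
#include <string>

int main() {
  // Literals as stored in the vocabulary, including the quotation marks.
  std::string withBang = "\"pref!\"";  // the literal pref!
  std::string plain = "\"pref\"";      // the literal pref
  // Byte-wise, the strings first differ after "pref": '!' (0x21) is
  // smaller than '"' (0x22), so the quoted form of pref! sorts before
  // the quoted form of pref, although the inner values sort the other
  // way around.
  assert(withBang < plain);
  assert(std::string("pref!") > std::string("pref"));
}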

niklas88 (Member): Yes, that makes sense. I'd propose we keep the case-sensitive variant as is. Having it be just byte-wise comparison makes it at least easier to reason about. Also, adding the described behavior sounds like complication for little gain.

@niklas88 (Member) left a review comment:

Some comments, mostly needed clarifications. Overall this looks really solid to me, great work!

#include "index/Vocabulary.h"
#include "index/VocabularyGenerator.h"

int main(int argc, char** argv) {
niklas88 (Member): This gives an unused-parameter warning for the int argc parameter.
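Two common ways to silence such a warning are to leave the parameter unnamed or to mark it [[maybe_unused]] (C++17); a minimal sketch, not necessarily how it was fixed in this PR:

#include <iostream>

// Option 1: mark the parameter as possibly unused.
int main([[maybe_unused]] int argc, char** argv) {
  std::cout << "program name: " << argv[0] << '\n';
  return 0;
}

// Option 2 (alternative): leave the parameter unnamed:
//   int main(int, char** argv) { ... }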

// Copyright 2019, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach(joka921) <johannes.kalmbach@gmail.com>
//
niklas88 (Member): Could you add a small comment about what this is used for? I assume it's for manually merging vocabularies if there was an error? Is this generally useful?

joka921 (Member Author): I will add the comment. The reason is more "benchmarking the vocabulary merging without having to wait 9 hours for the TurtleParser".

#include "index/Vocabulary.h"
#include "index/VocabularyGenerator.h"

int main(int argc, char** argv) {
niklas88 (Member): This results in an unused-parameter error for int argc.

return comp(p1.first, p2.first);
};
if constexpr (USE_PARALLEL_SORT) {
if (USE_PARALLEL_SORT && doParallelSort) {
niklas88 (Member): With the outer if constexpr you can drop the inner USE_PARALLEL_SORT.
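The suggested structure would look roughly like this; sortWords is a hypothetical helper for illustration, and the parallel branch is only indicated by a comment:

#include <algorithm>
#include <vector>

constexpr bool USE_PARALLEL_SORT = true;  // assumed compile-time constant

// Inside the if constexpr branch the constant is already known to be
// true, so only the runtime flag needs checking.
void sortWords(std::vector<int>& words, bool doParallelSort) {
  if constexpr (USE_PARALLEL_SORT) {
    if (doParallelSort) {
      // The parallel sort implementation would go here.
      std::sort(words.begin(), words.end());
      return;
    }
  }
  std::sort(words.begin(), words.end());
}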

@@ -55,7 +55,7 @@ void Distinct::computeResult(ResultTable* result) {
subRes->_resultTypes.begin(),
subRes->_resultTypes.end());
result->_localVocab = subRes->_localVocab;
-  int width = subRes->_data.size();
+  int width = subRes->_data.cols();
niklas88 (Member): Uhoh, good find!

template <typename... Args>
MmapVector(Args&&... args) : MmapVector() {
open(std::forward<Args>(args)...);
}
*/
niklas88 (Member): Remove commented code.

@@ -234,7 +234,7 @@ string getUppercase(const string& orig) {
}

// ____________________________________________________________________________
-string getLowercaseUtf8(const string& orig) {
+string getLowercaseUtf8(std::string_view orig) {
niklas88 (Member): This can be const std::string_view.

joka921 (Member Author): Same as above.

@@ -263,7 +263,7 @@ string getLowercaseUtf8(const string& orig) {
}

// ____________________________________________________________________________
-string getUppercaseUtf8(const string& orig) {
+string getUppercaseUtf8(std::string_view orig) {
niklas88 (Member): This can be const std::string_view.

// being escaped by backslashes. If it is not found at all, string::npos is
// returned.
inline size_t findLiteralEnd(std::string_view input,
std::string_view literalEnd) {
niklas88 (Member): This can be const std::string_view.
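Given the contract described in the comment (return the position of the first occurrence of literalEnd that is not escaped by backslashes, else string::npos), an implementation sketch, not necessarily the actual one, could look like this:

#include <cstddef>
#include <string_view>

inline size_t findLiteralEnd(const std::string_view input,
                             const std::string_view literalEnd) {
  size_t pos = input.find(literalEnd);
  while (pos != std::string_view::npos) {
    // Count the backslashes directly preceding the candidate position.
    size_t numBackslashes = 0;
    while (numBackslashes < pos && input[pos - 1 - numBackslashes] == '\\') {
      ++numBackslashes;
    }
    // An even number of backslashes means the delimiter itself is
    // not escaped.
    if (numBackslashes % 2 == 0) return pos;
    pos = input.find(literalEnd, pos + 1);
  }
  return pos;  // npos: no unescaped occurrence found
}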

-string createBlankNode() {
-  string res = "_:" + std::to_string(_numBlankNodes);
+string createAnonNode() {
+  string res = ANON_NODE_PREFIX + ":" + std::to_string(_numBlankNodes);
_numBlankNodes++;
niklas88 (Member): Any reason this is now called "Anon" instead of "Blank"? Also, we are parsing in parallel now, right? So what prevents the _numBlankNodes++ updates from racing?

joka921 (Member Author):

1. This is only used for anonymous nodes now: those must be unique, so we need to find our own representation with the counter plus some prefix that never occurs in any knowledge base. Previously I also did this for blank nodes, but that slowed down the parser. Blank nodes can be repeated, and we have to make sure that the same blank node in the input stands for the same blank node in the graph/internal representation. For this we can simply reuse the blank node label from the input. So I renamed the function, since it is now only used for unique anonymous nodes.

2. There is only one parser thread, and the parser itself is not thread-safe (there is no really good way to parse in parallel with LL(1) grammars). There could be parallel parsing of Wikidata using their dumps, under the following assumptions:

   - All prefix/base declarations come before all triples.
   - There is a unique way to determine the beginning of a statement (e.g. newlines that are not followed by whitespace).

I even thought about recommending this to the W3C community, as it would make the parsing so much faster. But we are currently getting closer to the disk write being our limiting factor. Give me parallel disks, though, and there is much more that we can speed up.

So currently: the parser is not thread-safe.
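A minimal sketch of the counter-based naming for anonymous nodes described above; ANON_NODE_PREFIX stands for whatever reserved prefix the index uses, and its value here is only a placeholder:

#include <cstddef>
#include <string>

// Placeholder for the reserved prefix that never occurs in any
// knowledge base.
static const std::string ANON_NODE_PREFIX = "qlever-internal-anon";

class AnonNodeFactory {
  size_t _numBlankNodes = 0;

 public:
  // Deliberately not thread-safe: there is exactly one parser thread,
  // so the counter increment cannot race.
  std::string createAnonNode() {
    return ANON_NODE_PREFIX + ":" + std::to_string(_numBlankNodes++);
  }
};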

@joka921 (Member Author) commented Apr 12, 2019

@niklas88 Thanks for your review. I have added some clarifying comments; please check whether you agree. Everything without a reply will be fixed exactly in the way you suggested, since I agree with you in those places.

@niklas88 (Member) left a review: LGTM

@niklas88 merged commit b6e4c4d into ad-freiburg:master on Apr 14, 2019.
@joka921 deleted the f.fasterIndexBuild2019-1 branch on Apr 15, 2019.