
Faster Index Build - Phase 1 #227

Merged: 8 commits into ad-freiburg:master from f.fasterIndexBuild2019-1 on Apr 14, 2019

Conversation

@joka921 (Member) commented Apr 8, 2019

This builds on #209 and tweaks the parser and the vocabulary-building pipeline so that the first stage of index building gets faster.

@niklas88 (Member) commented:

@joka921 ok, I've got this running with the newly built case-insensitive Wikidata Full, and completion now works flawlessly. Well done! It's not even significantly slower, so that's pretty nice too.

- Enabled by setting "ignore-case": true in the configuration JSON
- Strings that only differ in their case form a contiguous range
- Range filters (<, <=, >=, >) and prefix filters are case-insensitive
- Works on UTF-8 using Niklas' case-conversion methods
- Includes unit tests for the sorting operator
- Blank nodes are left as is; this saves us backing up a hash map
- Set the default constants for the IndexBuilder pipeline to better values
- The partial vocabularies are sorted and written asynchronously while the parser continues working (see the sketch after this list). This helps especially with case-insensitive sorting, which takes several minutes per partial vocabulary.
- Set the default STXXL disk size to 1 TB. Being too big should do no harm due to sparse files, but setting it too small slowed down the index build for some reason.
- Permutations with the same first-order key don't require fully re-sorting the triple vector. This is exploited in this commit and saves us three calls to STXXL::sort (~3 hours in total for full Wikidata).
- Settings like "internal languages" or "externalized prefixes" are only useful when this flag is activated.
- Automatically activate _onDiskLiterals when one of those settings is present.
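The asynchronous sort-and-write step could look roughly like the following minimal sketch; the function name, the plain-text output format, and the comparator choice are illustrative, not QLever's actual pipeline code.

#include <algorithm>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Sketch: sort a finished batch of words and write it on a background
// thread while the parser keeps filling the next batch.
std::future<void> sortAndWriteAsync(std::vector<std::string> words,
                                    std::string filename) {
  return std::async(std::launch::async,
                    [words = std::move(words),
                     filename = std::move(filename)]() mutable {
    // The real pipeline would sort with the (possibly case-insensitive)
    // StringSortComparator here.
    std::sort(words.begin(), words.end());
    std::ofstream out(filename);
    for (const auto& w : words) {
      out << w << '\n';
    }
  });
}

// Usage: keep the future, let the parser fill the next batch, and wait
// for all futures before the merge step:
//   auto fut = sortAndWriteAsync(std::move(batch), "partial-vocab-0.txt");
//   ... parse the next batch ...
//   fut.wait();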

- In the vocabulary merge, the k-way merge (reading from disk and determining the order via a priority queue) and the writing of the result to the partial id maps are now performed in a parallel, pipelined way. This speeds up the step, which previously took about 3 hours. (A sketch of the merge core follows below.)

- Fixed the case-insensitive prefix filtering:
   - Prefix filters have a separate mechanism which I had previously forgotten.
   - Made the StringSortComparator also work for "partial literals" like "field of work (without the trailing quotation mark), as they occur in prefix filters for autocompletion.
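A minimal sketch of the merge core described above: each partial vocabulary is an already-sorted run, and a priority queue always yields the globally smallest word next. The real code streams the runs from disk and writes the partial id maps in a pipelined fashion; all names below are illustrative.

#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merge k sorted word lists into one sorted list.
std::vector<std::string> kWayMerge(
    const std::vector<std::vector<std::string>>& runs) {
  using Entry = std::pair<std::string, size_t>;  // (word, index of its run)
  auto cmp = [](const Entry& a, const Entry& b) { return a.first > b.first; };
  std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> queue(cmp);

  std::vector<size_t> pos(runs.size(), 0);
  for (size_t i = 0; i < runs.size(); ++i) {
    if (!runs[i].empty()) queue.emplace(runs[i][0], i);
  }

  std::vector<std::string> merged;
  while (!queue.empty()) {
    auto [word, src] = queue.top();
    queue.pop();
    merged.push_back(std::move(word));
    // Refill from the run the smallest word came from.
    if (++pos[src] < runs[src].size()) {
      queue.emplace(runs[src][pos[src]], src);
    }
  }
  return merged;
}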
src/engine/Distinct.cpp (resolved)
rhs = StringSortComparator::rdfLiteralToValueForLT(rhs);

LOG(INFO) << "upperBound was converted to " << upperBoundStr << '\n';
LOG(INFO) << "lowerBound was converted to " << rhs << '\n';
joka921 (Member Author): More comments + lower log level

src/engine/Filter.cpp (outdated)
break;
}
}
LOG(INFO) << "Finished conversion of filter value" << rhs_string
joka921 (Member Author): remove logging

}
return rhs_string;
}

joka921 (Member Author): Document those methods

} else {
// TODO<Johannes> Ideally we want to have this also when doing
// case-insensitive compare, but it currently breaks the prefix
// compression (there we really need ordering by correct bytes)
joka921 (Member Author): @niklas88 I could implement this also for the case-insensitive sort (sorting first by the "inner" literal value and only then by the language tag); should I?

niklas88 (Member): I'm a bit confused by this comment. I thought we built two sorting orders, with one being used for prefix compression, so how does this break it, and what exactly are you proposing to do here?

joka921 (Member Author): The point is the following:
"pref!" < "pref", because ! < " in ASCII (assume all quotation marks to be escaped).
This can be circumvented by first extracting the actual string value without the quotation marks and then sorting by that. I am doing this in the case-insensitive case, but it is also possible for case-sensitive sorting.

In that case we would also have to build two vocabularies, one for compression and one without, even when using case-sensitive search.

However, this only affects words ending with !, since this is the only printable ASCII character that has a smaller value than ".

Does this make more sense to you and help you make an informed decision?
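To make the ordering concrete, here is a tiny self-contained example of the byte-wise comparison on the quoted literal representations:

#include <cassert>
#include <string>

int main() {
  // Literals as stored in the vocabulary, including the quotation marks.
  std::string withBang = "\"pref!\"";  // the literal pref!
  std::string plain = "\"pref\"";      // the literal pref
  // Byte-wise, the strings first differ after "pref": '!' (0x21) is
  // smaller than '"' (0x22), so the quoted form of pref! sorts before
  // the quoted form of pref, although the inner values sort the other
  // way around.
  assert(withBang < plain);
  assert(std::string("pref!") > std::string("pref"));
}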

niklas88 (Member): Yes, that makes sense. I'd propose we keep the case-sensitive variant as is. Having it be just byte-wise comparison makes it at least easier to reason about. Also, adding the described behavior sounds like complication for little gain.

@niklas88 (Member) left a review comment:

Some comments, mostly needed clarifications. Overall this looks really solid to me, great work!

#include "index/Vocabulary.h"
#include "index/VocabularyGenerator.h"

int main(int argc, char** argv) {
niklas88 (Member): This gives an unused-parameter warning for the int argc parameter.
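Two common ways to silence such a warning are to leave the parameter unnamed or to mark it [[maybe_unused]] (C++17); a minimal sketch, not necessarily how it was fixed in this PR:

#include <iostream>

// Option 1: mark the parameter as possibly unused.
int main([[maybe_unused]] int argc, char** argv) {
  std::cout << "program name: " << argv[0] << '\n';
  return 0;
}

// Option 2 (alternative): leave the parameter unnamed:
//   int main(int, char** argv) { ... }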

// Copyright 2019, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach(joka921) <johannes.kalmbach@gmail.com>
//
niklas88 (Member): Could you add a small comment about what this is used for? I assume it's for manually merging vocabularies if there was an error? Is this generally useful?

joka921 (Member Author): I will add the comment. The reason is more "benchmarking the vocabulary merging without having to wait 9 hours for the TurtleParser".

#include "index/Vocabulary.h"
#include "index/VocabularyGenerator.h"

int main(int argc, char** argv) {
niklas88 (Member): This results in an unused-parameter error for int argc.

return comp(p1.first, p2.first);
};
if constexpr (USE_PARALLEL_SORT) {
if (USE_PARALLEL_SORT && doParallelSort) {
niklas88 (Member): With the outer if constexpr you can drop the inner USE_PARALLEL_SORT.
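The suggested structure would look roughly like this; sortWords is a hypothetical helper for illustration, and the parallel branch is only indicated by a comment:

#include <algorithm>
#include <vector>

constexpr bool USE_PARALLEL_SORT = true;  // assumed compile-time constant

// Inside the if constexpr branch the constant is already known to be
// true, so only the runtime flag needs checking.
void sortWords(std::vector<int>& words, bool doParallelSort) {
  if constexpr (USE_PARALLEL_SORT) {
    if (doParallelSort) {
      // The parallel sort implementation would go here.
      std::sort(words.begin(), words.end());
      return;
    }
  }
  std::sort(words.begin(), words.end());
}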

@@ -55,7 +55,7 @@ void Distinct::computeResult(ResultTable* result) {
subRes->_resultTypes.begin(),
subRes->_resultTypes.end());
result->_localVocab = subRes->_localVocab;
-  int width = subRes->_data.size();
+  int width = subRes->_data.cols();
niklas88 (Member): Uhoh, good find!

template <typename... Args>
MmapVector(Args&&... args) : MmapVector() {
open(std::forward<Args>(args)...);
}
*/
niklas88 (Member): Remove commented code.

@@ -234,7 +234,7 @@ string getUppercase(const string& orig) {
}

// ____________________________________________________________________________
-string getLowercaseUtf8(const string& orig) {
+string getLowercaseUtf8(std::string_view orig) {
niklas88 (Member): This can be const std::string_view.

joka921 (Member Author): Same as above.

@@ -263,7 +263,7 @@ string getLowercaseUtf8(const string& orig) {
}

// ____________________________________________________________________________
-string getUppercaseUtf8(const string& orig) {
+string getUppercaseUtf8(std::string_view orig) {
niklas88 (Member): This can be const std::string_view.

// being escaped by backslashes. If it is not found at all, string::npos is
// returned.
inline size_t findLiteralEnd(std::string_view input,
std::string_view literalEnd) {
niklas88 (Member): This can be const std::string_view.
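Given the contract described in the comment (return the position of the first occurrence of literalEnd that is not escaped by backslashes, else string::npos), an implementation sketch, not necessarily the actual one, could look like this:

#include <cstddef>
#include <string_view>

inline size_t findLiteralEnd(const std::string_view input,
                             const std::string_view literalEnd) {
  size_t pos = input.find(literalEnd);
  while (pos != std::string_view::npos) {
    // Count the backslashes directly preceding the candidate position.
    size_t numBackslashes = 0;
    while (numBackslashes < pos && input[pos - 1 - numBackslashes] == '\\') {
      ++numBackslashes;
    }
    // An even number of backslashes means the delimiter itself is
    // not escaped.
    if (numBackslashes % 2 == 0) return pos;
    pos = input.find(literalEnd, pos + 1);
  }
  return pos;  // npos: no unescaped occurrence found
}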

-string createBlankNode() {
-  string res = "_:" + std::to_string(_numBlankNodes);
+string createAnonNode() {
+  string res = ANON_NODE_PREFIX + ":" + std::to_string(_numBlankNodes);
_numBlankNodes++;
niklas88 (Member): Any reason this is now called "Anon" instead of "Blank"? Also, we are parsing in parallel now, right? So what prevents the _numBlankNodes++ updates from racing?

joka921 (Member Author):

1. This is only used for anonymous nodes now: those must be unique, so we need to find our own representation with the counter plus some prefix that never occurs in any knowledge base. Previously I also did this for blank nodes, but that slowed down the parser. Blank nodes can be repeated, and we have to make sure that the same blank node in the input stands for the same blank node in the graph/internal representation. For this we can simply reuse the blank node label from the input. So I renamed the function, since it is now only used for unique anonymous nodes.

2. There is only one parser thread, and the parser itself is not thread-safe (there is no really good way to parse in parallel with LL(1) grammars). There could be parallel parsing of Wikidata using their dumps, under the following assumptions:

   - All prefix/base declarations come before all triples.
   - There is a unique way to determine the beginning of a statement (e.g. newlines that are not followed by whitespace).

I even thought about recommending this to the W3C community, as it would make the parsing so much faster. But we are currently getting closer to the disk write being our limiting factor. Give me parallel disks, though, and there is much more that we can speed up.

So currently: the parser is not thread-safe.
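A minimal sketch of the counter-based naming for anonymous nodes described above; ANON_NODE_PREFIX stands for whatever reserved prefix the index uses, and its value here is only a placeholder:

#include <cstddef>
#include <string>

// Placeholder for the reserved prefix that never occurs in any
// knowledge base.
static const std::string ANON_NODE_PREFIX = "qlever-internal-anon";

class AnonNodeFactory {
  size_t _numBlankNodes = 0;

 public:
  // Deliberately not thread-safe: there is exactly one parser thread,
  // so the counter increment cannot race.
  std::string createAnonNode() {
    return ANON_NODE_PREFIX + ":" + std::to_string(_numBlankNodes++);
  }
};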

@joka921 (Member Author) commented Apr 12, 2019

@niklas88 Thanks for your review. I have added some clarifying comments; please check whether you agree. Everything without a reply will be fixed exactly in the way you suggested, since I agree with you in those places.

@niklas88 (Member) left a review: LGTM

@niklas88 merged commit b6e4c4d into ad-freiburg:master on Apr 14, 2019.
@joka921 deleted the f.fasterIndexBuild2019-1 branch on Apr 15, 2019.