Speeding up the first phase of Index Building #302
Conversation
@joka921 That sounds amazing! I have a few questions:
Ok, so for each triple we have to do the following steps:
In a pipeline this looks as follows:
Trick a): all of those levels happen at the same time for different batches (the pipeline principle). After we are finished, we have the problem that some words may have multiple Ids (if they occurred in different triples that were assigned to different hash maps). Unifying these to one Id and updating the triples happens after we are done. The speedup is such that the parser becomes the bottleneck; it runs at about 40 million triples per minute.
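The disjoint-Id scheme described above can be sketched roughly like this (a minimal illustration with `std::unordered_map`, not the actual QLever code; all names here are made up):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Each batch writes to its own hash map, and the maps hand out Ids from
// disjoint ranges because we know an upper bound on the number of triples
// per batch.
using Id = uint64_t;

struct BatchMap {
  Id nextId;  // first Id of this batch's disjoint range
  std::unordered_map<std::string, Id> map;

  // Assign a batch-local Id the first time we see a word.
  Id getId(const std::string& word) {
    auto [it, isNew] = map.try_emplace(word, nextId);
    if (isNew) ++nextId;
    return it->second;
  }
};

// After all batches are done, unify duplicate words to a single Id.
// Returns the mapping from old (per-batch) Ids to final global Ids,
// which is then used to rewrite the triples.
std::unordered_map<Id, Id> unifyIds(const std::vector<BatchMap>& batches) {
  std::unordered_map<std::string, Id> global;
  std::unordered_map<Id, Id> oldToNew;
  Id next = 0;
  for (const auto& batch : batches) {
    for (const auto& [word, oldId] : batch.map) {
      auto [it, isNew] = global.try_emplace(word, next);
      if (isNew) ++next;
      oldToNew[oldId] = it->second;
    }
  }
  return oldToNew;
}
```

A word that occurred in two batches gets two different Ids first; the unification step collapses them to one.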
@joka921 Thanks for the explanation! You write that "We make sure that the hashMaps hand out disjoint sets of Ids (this is easy because we know the maximum number of triples we deal with at once in this whole process)." Does this mean that with this PR there are gaps in the ID space, or do these gaps vanish in the ID unification process? I am slightly worried about code complexity and maintainability. Do I understand you correctly that all the complexity is hidden in the BatchedPipeline class? Is the code outside that class reasonably simple? For example, if someone else wants to extend the parser in some (not overly dramatic) way, will they be able to do it without understanding the intricacies of your batched pipeline?
if this is too dense, each of the inner lambdas can also be set up externally
@niklas88 The Travis build says that it has passed when I click on the
I just reran the build and this time it's propagated correctly. So yeah, looks like a Travis hiccup. I'll look into this in more detail soon, but it already sounds pretty amazing!
}
  if (i % 10000000 == 0) {
    LOG(INFO) << "Lines (from KB-file) processed: " << i << '\n';
  }
}
@joka921 Could you add a few well-placed and concise comments in this code block so that the structure becomes clearer and what the individual lambdas do?
I think you mentioned that you could also replace the lambdas with well-named functions. I think this might indeed be helpful here.
I just tried to build a version of the current master (includes the unicode upgrade) with this PR merged locally and encountered this error while building the docker container:
`git submodule init`
`git submodule update`
should download abseil.
Did the merge work without conflicts?
Hannah Bast <notifications@github.com> wrote on Sat., Jan. 11, 2020, 09:12:
CMake Error at CMakeLists.txt:60 (add_subdirectory):
The source directory
/app/third_party/abseil-cpp
does not contain a CMakeLists.txt file.
@joka921 In fact, it didn't, sorry for the confusion. Below you see the list of the conflicting files (and the GitHub page for the PR shows the exact same list of files). So I will wait until you have modified this pull request to be mergeable with the current master.
Force-pushed from 3069736 to 9c801b7.
Awesome, I have started a build with this code at 9:39 today, see my WA message. Fingers crossed!
@ 5dbe3fc
Didn't we say earlier that using the QUATERNARY level is enough and that using the IDENTICAL level would affect the performance negatively, or am I confusing something here? Here is the corresponding quote from the ICU documentation, which Niklas included in his comments on this PR on 2020-01-02 13:57, highlights added by me:
"Identical Level: When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the NFD form of each string are compared at this level, just in case there is no difference at levels 1-4 . For example, Hebrew cantillation marks are only distinguished at this level. This level should be used sparingly, as only code point values differences between two strings is an extremely rare occurrence. Using this level substantially decreases the performance for both incremental comparison and sort key generation (as well as increasing the sort key length). It is also known as level 5 strength."
@joka921 A minor detail: can you please call the intermediate files
Note that this entails three changes: (1) a dash before the final number, (2) fixed-width formatting of the number, so that the lexicographic order makes more sense when seeing them in a directory listing, (3) a dash instead of an underscore. If you feel that a fixed width of three is not enough, you can also make it four.
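For illustration, the fixed-width formatting from point (2) could look like this (a sketch; the base name and function name are made up, since the actual file name is not shown above):

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Build a file name with a dash before a zero-padded, fixed-width number,
// so that lexicographic order in a directory listing matches numeric order.
std::string partialFileName(const std::string& base, int i, int width = 3) {
  std::ostringstream os;
  os << base << '-' << std::setw(width) << std::setfill('0') << i;
  return os.str();
}
```

With a width of 3, file number 7 becomes `...-007`, which sorts before `...-010` lexicographically as well as numerically.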
I will make this optional. Without level 5 there can be literals that compare equal but are different. This leads to index builds that behave equally but produce different indices in case two of those equal literals are sorted in a different order in the vocabulary. I want to have reproducible index builds, at least optionally, as a check for changes in the index builder. Then I can simply run cmp on the index files to see if I have broken anything.
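To illustrate why the tiebreaker makes the order reproducible, here is a toy sketch (ASCII case-folding stands in for the weaker ICU collation levels; `compareWithTiebreak` is a made-up name, not ICU or QLever code):

```cpp
#include <cctype>
#include <string>

// Compare case-insensitively first; only on a tie, fall back to a byte-wise
// comparison, analogous to ICU's IDENTICAL (level 5) strength. The byte-wise
// tiebreak makes the total order deterministic, which is what makes index
// builds reproducible.
int compareWithTiebreak(const std::string& a, const std::string& b) {
  auto lower = [](const std::string& s) {
    std::string r = s;
    for (char& c : r)
      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return r;
  };
  std::string la = lower(a), lb = lower(b);
  if (la != lb) return la < lb ? -1 : 1;  // decided at the weaker level
  // Tie at the weaker level: break it by exact bytes.
  return a == b ? 0 : (a.compare(b) < 0 ? -1 : 1);
}
```

Without the tiebreak, "abc" and "ABC" would compare equal, and their relative order in the vocabulary could differ between builds.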
@joka921 Travis complains that
I only did a first, very rough pass of this. Looks great so far, but I had a few comments and would like to look at this with my non-after-work eyes again. If you merge the CTRE PR, I think some of it is in here too, so it would become smaller, right?
src/index/ConstantsIndexCreation.h
Outdated
@@ -53,3 +53,7 @@ static const std::string PARTIAL_MMAP_IDS = ".partial-ids-mmap";

// ________________________________________________________________
static const std::string TMP_BASENAME_COMPRESSION = ".tmp.compression_index";

// _________________________________________________________________
// TODO: Comment
yeah I guess you're right
src/index/VocabularyGeneratorImpl.h
Outdated
} else {
  std::sort(begin(els), end(els), pred);

// ____________________________________________________________________________________________________________
absl::flat_hash_map<Id, Id> createInternalMapping(std::vector<std::pair<string, Id>>* elsPtr) {
I'd like to stick to just one hash map implementation in normal QLever code (i.e. in libraries it's ok). So I think we should make `util::HashMap` use `absl::flat_hash_map`. I think I once refactored it so that this shouldn't be too hard. What do you think?
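The suggestion amounts to keeping a single alias so that call sites never name the underlying implementation. A minimal sketch (with `std::unordered_map` standing in, since abseil is not assumed here; swapping to `absl::flat_hash_map` would then only change this one alias):

```cpp
#include <string>
#include <unordered_map>

// One project-wide alias: call sites write ad_utility::HashMap and never
// depend on the concrete hash map library behind it.
namespace ad_utility {
template <typename K, typename V>
using HashMap = std::unordered_map<K, V>;
}
```

Code written against `ad_utility::HashMap` keeps compiling unchanged when the alias target is switched.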
For this PR I removed absl again, since the parallelism takes care of my current performance issues and we can do the change in a clean, separate PR.
A few more comments but this is great work!
src/index/Index.cpp
Outdated
@@ -235,8 +256,11 @@ VocabularyData Index::passFileForVocabulary(const string& filename,
VocabularyMerger::VocMergeRes mergeRes;
{
  VocabularyMerger v;
  mergeRes =
      v.mergeVocabulary(_onDiskBase, numFiles, _vocab.getCaseComparator());
  auto identicalPred = [c = _vocab.getCaseComparator()](const auto& a,
can be const auto
I changed it, but can you point out any source where making lambdas `const auto` helps performance? Even the "all things constexpr" guys seem to always use plain `auto` for lambdas. The only thing that you can prevent that way is moving the lambda out, but typically compilers see through the lambda very well, and the call operator of lambdas is `const` by default, so I am not convinced that this substantially helps the code.
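A tiny illustration of the point being made (sketch only): a lambda's call operator is `const` by default, so a `const` copy is just as callable, and `const auto` mainly prevents reassigning or moving the closure object.

```cpp
// demo() is a made-up name for this illustration.
int demo() {
  auto f = [](int x) { return x + 1; };
  const auto g = f;  // still callable: operator() of the closure is const
  return g(41);
}
```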
Hmm, good point, I guess you're right.
src/index/Index.cpp
Outdated
ItemMap& map = *mapPtr;
// ____
// TODO<joka921: are those unused now and can be
// removed?>_______________________________________________________________________
Well that's not too hard to test in a compiled language :D
src/index/Index.h
Outdated
@@ -444,7 +445,11 @@ class Index {
  LOG(DEBUG) << "Scan done, got " << result->size() << " elements.\n";
}

using ItemMap = ad_utility::HashMap<string, Id>;
template <typename K, typename V>
using HashMap = absl::flat_hash_map<K, V>;
As in the other comment, if we can I'd prefer a single hash map library to be used and I think Abseil is definitely a great one and this would also kill dependence on order. I'd be fine with splitting this in another PR of course
src/index/Index.h
Outdated
template <class Map>
static Id assignNextId(Map* mapPtr, const string& key);

// TODO<joka921> This should also be unused
Then it should be removed ;-) removed code is bug free code
* @param input The String to be normalized. Must be UTF-8 encoded.
* @return The NFC canonical form of the input in UTF-8 encoding.
*/
std::string normalizeUtf8(std::string_view input) const {
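As background for why NFC normalization is needed at all, here is the motivating fact (not an implementation, just the byte-level observation; the variable names are made up):

```cpp
#include <string>

// The same visible string "é" has two UTF-8 encodings: the precomposed
// code point U+00E9, and "e" followed by the combining acute accent U+0301.
// NFC maps both to the precomposed form, so that byte-wise comparison of
// normalized strings behaves as expected.
const std::string precomposed = "\xC3\xA9";   // U+00E9, 2 bytes
const std::string decomposed = "e\xCC\x81";   // 'e' + U+0301, 3 bytes
```

Without normalization, these two byte sequences would sort as different vocabulary entries despite representing the same text.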
Can we move the LocaleManager to the `util/` folder, together with the other string helpers?
I opened Issue #313 for this so I don't forget it. I do not want to do this while I still have PRs open that modify it to not get into rebasing hell. Otherwise I agree.
src/util/BatchedPipeline.h
Outdated
return std::move(_buffer[_bufferPosition++]);
} else {
  // we can only reach this if the buffer is exhausted and there is nothing
  // more to parse
no parsing
src/util/BatchedPipeline.h
Outdated
/**
 * @brief setup a pipeline that efficiently creates and transforms values. The
 * Concurrency is used between the different levels
I think steps or stages would be more fitting words than levels, because that implies a hierarchy.
I chose stages.
namespace detail {
/* Implementation of setupTupleFromCallable (see below)
 * Needed because we need the Pattern matching on the index_sequence
 * TODO<joka921> In C++ 20 this could be done in place with templated
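The pattern the comment refers to looks roughly like this (a simplified sketch, not the actual `setupTupleFromCallable` from BatchedPipeline.h): call `f(I)` for each index in `0..N-1` and collect the results into a tuple, with a helper that pattern-matches on the `index_sequence`.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Helper that pattern-matches on the index_sequence.
template <typename F, std::size_t... I>
auto setupTupleImpl(F&& f, std::index_sequence<I...>) {
  return std::make_tuple(f(I)...);
}

template <std::size_t N, typename F>
auto setupTupleFromCallable(F&& f) {
  return setupTupleImpl(std::forward<F>(f), std::make_index_sequence<N>{});
}

// In C++20 the helper can be avoided with a templated lambda, e.g.:
//   [&]<std::size_t... I>(std::index_sequence<I...>) {
//     return std::make_tuple(f(I)...);
//   }(std::make_index_sequence<N>{});
```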
I hope we'll get a C++20 capable compiler with Ubuntu 20.04
I highly doubt it, since the standard is feature-frozen but not yet published.
test/BatchedPipelineTest.cpp
Outdated
}
return std::nullopt;
},
[a = int(0)](const auto& x) mutable {
The use of `int(0)` is inconsistent with the above use of `[i = 0]`.
Great work and thanks for addressing my questions!
Force-pushed from 903a42d to 90c946c.
- Identified the writing of triple elements to hash maps for deduplication as a bottleneck.
- Sped this up by writing to multiple hash maps at once and merging them afterwards.
- Also concurrently pipelined all the other steps in the first index building phase.
- That way the Turtle parsing now becomes the bottleneck (teaser: this can be sped up by 30% using compile-time regexes, so this might become even faster in a later PR).
- This was done by implementing an abstract, templated BatchedPipeline that abstracts over the creation and transformation of values in a pipeline and allows controlling the degree of concurrency used on each level and between the different levels.
- This pipeline infrastructure was heavily unit-tested to ensure its correctness, since it internally uses quite some template magic.
- This commit also introduces absl::flat_hash_map, which is faster than the dense_hash_map used before. If in doubt I can also revert this change, since in the meantime the parallelism is what helps us more than the faster hash maps. But I thought that we wanted to try those out anyway.
- This can already be reviewed, especially the BatchedPipeline.h file. Before merging I would suggest merging the Unicode PR, because there is some merging work to be done (but not too much) between those PRs.
…sure total ordering.
… Builds TODO: this also appears in another Branch, after merging, rewrite the history?
I would like to test the performance before merging this, as it should be at least somewhat fast.
…ement when we first see it - This boosts up the speed in our parallel pipeline.
It is much faster, especially for large hash maps. Replaced the default implementation of ad_utility::HashMap; we no longer need a default-key provider. absl strictly randomizes the iteration order of the hash map, so some unit tests had to be changed.
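The unit-test fix implied above is to stop depending on iteration order at all. A sketch (using `std::unordered_map`, whose order is merely unspecified; `absl::flat_hash_map` even randomizes it per run; `sortedEntries` is a made-up helper name):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Extract the map's entries and sort them, so that tests can assert against
// a deterministic view regardless of the map's iteration order.
std::vector<std::pair<std::string, int>> sortedEntries(
    const std::unordered_map<std::string, int>& m) {
  std::vector<std::pair<std::string, int>> v(m.begin(), m.end());
  std::sort(v.begin(), v.end());
  return v;
}
```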
Force-pushed from 96babbd to cc711ec.
…o account for the overhead of the SortKeys.
@hannahbast Sorry for the internal and meaningless commit messages, I used this branch to transport content from my local machine to Galera. @hannahbast @niklas88 I have decided to integrate two more changes into this PR; they are separate commits and thus should be easy to review separately.
All in all, maybe @hannahbast should try building an index using this PR, and @niklas88 could have a look at the two additional changes.
@joka921 the code looks good but the commit for replacing
@joka921 I actually started on fixing that damn
@niklas88 @hannahbast The issue of the QueryPlannerTest seemed fixable to me (whenever the QueryPlanner can choose between equivalent trees, we break the tie according to the CacheKey). However, implementing this I stumbled upon a probably serious issue: #314
- (Mostly) in the unit tests there sometimes are Execution(Sub)Trees that are equivalent and return the same costEstimate(). The query planner must deterministically decide between them to make the current unit tests pass.
- This is not the case with the newly integrated absl::flat_hash_map, which purposely randomizes its iteration order.
- With this commit, the QueryPlanner detects that it is in unit test mode (no ExecutionContext assigned) and then deterministically chooses the alternative with the smaller cache key on equal cost.
- Some of the unit tests had to be adapted to match this behavior.
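The tie-breaking rule described above can be sketched like this (illustrative types and names only, not the actual QueryPlanner code):

```cpp
#include <string>
#include <vector>

// On equal cost estimates, deterministically prefer the plan with the
// smaller cache key, so the choice no longer depends on hash-map
// iteration order.
struct Plan {
  int cost;
  std::string cacheKey;
};

Plan pickPlan(const std::vector<Plan>& candidates) {
  Plan best = candidates.front();
  for (const auto& p : candidates) {
    if (p.cost < best.cost ||
        (p.cost == best.cost && p.cacheKey < best.cacheKey)) {
      best = p;
    }
  }
  return best;
}
```

Whatever order the candidates arrive in, the same plan wins, which is exactly what the unit tests need.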
@niklas88 I applied a fix for the QueryPlannerTest business. Let me know what you think of it (last commit) and verify whether this indeed fixes the problem (I don't have that many machines at my disposal with different architectures).
@joka921 the build currently fails both on Travis and locally with GCC 9.2.x.
On the other hand absl uses
@niklas88 there is
Ok, so I'm still getting a test failure in the … The abseil CMake magic used …
I just spent some time running a debugger on the first of your failing unit tests.
@niklas88
test_fail.txt
Previously the size estimate dummies for execution trees used std::hash<string>, which is implementation-defined. Thus the QueryPlannerTest failed on some platforms without indicating a bug in the QueryPlanner or QLever in general. Now we use deterministic estimates.
@niklas88
Great work! The last commit now also fixes #294 and makes all tests pass on s390x, including the E2E tests.