Faster Index Building using a single-pass approach. #139

joka921 · 2018-10-07T12:44:03Z

Only parsing the KnowledgeBase file once.

While parsing the file (nt/ttl/tsv) we write a vector of temporary ids which correspond to the
order of appearance in the partial vocabulary.
Merging the partial vocabularies now creates a
mapping from temporary to global ids.
In a second pass, we use this mapping to convert
temporary to global ids. This is much faster
than parsing the input file a second time.
In addition, the parser is run in parallel (using std::async) during the first pass.
This gives us another speedup.

This pull request speeds up the index building process by parsing the knowledge base file only once (previously twice), which was a serious bottleneck.

- While parsing the file (nt/ttl/tsv) we write a vector of temporary ids which correspond to the order of appearance in the partial vocabulary.
- Merging the partial vocabularies now creates a mapping from temporary/partial ids to global ids.
- In a second pass, we use this mapping to convert temporary ids to global ids. This is much faster than parsing the input file a second time.
- In addition, the parser is run in parallel (using std::async) during the first pass. This gives us a further speedup.
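To make the id conversion concrete, here is a minimal sketch of the idea, not the code from this PR. The names `Id`, `IdTriple`, `PartialVocab`, `buildIdMaps` and `convertToGlobalIds` are made up for illustration, and it assumes each partial vocabulary lists every word once, in order of first appearance, so that a word's temporary id is its index.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Id = std::size_t;
using IdTriple = std::array<Id, 3>;

// Assumption: words in order of first appearance, each word only once.
// The temporary id of a word is its index in this vector.
using PartialVocab = std::vector<std::string>;

// Merge all partial vocabularies into one sorted, duplicate-free global
// vocabulary. Returns one table per partial vocabulary that maps
// temporary id -> global id.
std::vector<std::vector<Id>> buildIdMaps(
    const std::vector<PartialVocab>& partials,
    std::vector<std::string>* globalVocab) {
  globalVocab->clear();
  for (const auto& p : partials) {
    globalVocab->insert(globalVocab->end(), p.begin(), p.end());
  }
  std::sort(globalVocab->begin(), globalVocab->end());
  globalVocab->erase(std::unique(globalVocab->begin(), globalVocab->end()),
                     globalVocab->end());

  std::vector<std::vector<Id>> idMaps;
  for (const auto& p : partials) {
    std::vector<Id> idMap;
    idMap.reserve(p.size());
    for (const auto& word : p) {
      // The global id is the position of the word in the sorted vocabulary.
      auto it = std::lower_bound(globalVocab->begin(), globalVocab->end(), word);
      idMap.push_back(static_cast<Id>(it - globalVocab->begin()));
    }
    idMaps.push_back(std::move(idMap));
  }
  return idMaps;
}

// Second pass: rewrite the temporary ids in place. No string parsing is
// needed here, only one table lookup per id.
void convertToGlobalIds(std::vector<IdTriple>& triples,
                        const std::vector<Id>& idMap) {
  for (auto& triple : triples) {
    for (auto& id : triple) {
      assert(id < idMap.size());
      id = idMap[id];
    }
  }
}
```

The second pass touches only fixed-size ids, which is why it can be much cheaper than parsing the text input again.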

- BufferedVector: a dynamic array that handles both small and large data efficiently.
- Internally it holds a std::vector and an MmapVector.
- As long as the vector is small, the std::vector is used (fast).
- If the size goes beyond a user-defined threshold, the data is shifted to the MmapVector (probably slower, but saves RAM).
- We use this while creating the single relations for the permutations.
- Most of them are really small, but previously big relations caused RAM problems in this step.
- The BufferedVector efficiently deals with this problem.
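The switching behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the class added in this PR: `BufferedVectorSketch` and `isExternal()` are hypothetical names, and a `std::deque` stands in for the MmapVector so that the example is self-contained.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sketch. `BigStore` stands in for a disk-backed container
// such as the MmapVector; std::deque is used here only so that the
// example compiles without further dependencies.
template <class T, class BigStore = std::deque<T>>
class BufferedVectorSketch {
 public:
  explicit BufferedVectorSketch(std::size_t threshold)
      : _threshold(threshold) {}

  void push_back(const T& value) {
    if (!_isExternal && _small.size() >= _threshold) {
      // Growing beyond the threshold: shift all data to the big store
      // and release the memory of the in-RAM vector.
      _big.assign(_small.begin(), _small.end());
      _small.clear();
      _small.shrink_to_fit();
      _isExternal = true;
    }
    if (_isExternal) {
      _big.push_back(value);
    } else {
      _small.push_back(value);
    }
  }

  const T& operator[](std::size_t i) const {
    return _isExternal ? _big[i] : _small[i];
  }

  std::size_t size() const {
    return _isExternal ? _big.size() : _small.size();
  }

  // True once the data has been shifted to the big store.
  bool isExternal() const { return _isExternal; }

 private:
  std::size_t _threshold;
  bool _isExternal = false;
  std::vector<T> _small;
  BigStore _big;
};
```

With a threshold of 2, the first two `push_back` calls stay in the std::vector and the third one triggers the shift. Callers see a single vector-like interface either way.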

niklas88

LGTM with a few proposed changes to variable and file names. Are you already running a full Wikidata rebuild with this?

niklas88 · 2018-12-17T09:13:14Z

CMakeLists.txt

@@ -161,5 +161,6 @@ add_test(HashMapTest test/HashMapTest)
 add_test(HashSetTest test/HashSetTest)
 add_test(VocabularyGeneratorTest test/VocabularyGeneratorTest)
 add_test(MmapVectorTest test/MmapVectorTest)
+add_test(BuferedVectorTest test/BufferedVectorTest)


Should be "Buffered" not "Bufered"

niklas88 · 2018-12-17T09:25:05Z

src/index/Index.h

 using std::array;
 using std::string;
 using std::tuple;
 using std::vector;

 using json = nlohmann::json;

-using IdPairMMapVec = ad_utility::MmapVector<std::array<Id, 2>>;
 // a simple struct for better naming
 struct LinesAndWords {


I think with the new parameters LinesAndWords is not the right name for this struct anymore. Maybe VocabularyData?

niklas88 · 2018-12-17T09:27:02Z

src/index/Index.h

  // Create Vocabulary and directly write it to disk. Create ExtVec which can be
  // used for creating permutations
  // Member _vocab will be empty after this because it is not needed for index
  // creation once the ExtVec is set up and it would be a waste of RAM
  template <class Parser>
-  ExtVec createExtVecAndVocab(const string& ntFile);
+  std::unique_ptr<ExtVec> createExtVecAndVocab(const string& ntFile);


I'm not super happy with the whole ExtVec naming, it sounds a lot like a "how" not a "why" or "what". In this case can't we just drop it from the method name?

The ExtVec is renamed to StxxlVec and the method is now
createIdTriplesAndVocab. Is that better?

niklas88 · 2018-12-17T09:30:16Z

src/index/VocabularyGenerator.cpp

@@ -18,44 +18,66 @@
 #include "./ConstantsIndexCreation.h"
 #include "./Vocabulary.h"

-class PairCompare {
+// helper struct used in the priority queue for merging.
+// represents tokens/words in a certain partial vocabular


niklas88 · 2018-12-17T09:31:18Z

src/index/VocabularyGenerator.cpp

-class PairCompare {
+// helper struct used in the priority queue for merging.
+// represents tokens/words in a certain partial vocabular
+struct QueueValue {


Maybe QueueWord because it's a word of the vocabulary?
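For context, a priority-queue merge of partial vocabularies could look roughly like the sketch below. It is not the code under review: the fields of `QueueWord`, the comparator `QueueWordGreater` and the function `mergeSortedVocabs` are assumptions for illustration, and it assumes every partial vocabulary is already sorted so that a word's position serves as its partial id.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// A word of one partial vocabulary (field names are assumptions).
struct QueueWord {
  std::string word;
  std::size_t partialFileId;  // which partial vocabulary it came from
  std::size_t partialWordId;  // its position in that partial vocabulary
};

// Makes std::priority_queue deliver the lexicographically smallest word first.
struct QueueWordGreater {
  bool operator()(const QueueWord& a, const QueueWord& b) const {
    return a.word > b.word;
  }
};

// Merge sorted partial vocabularies into one global vocabulary.
// Afterwards (*idMaps)[f][i] is the global id of word i of partial
// vocabulary f.
std::vector<std::string> mergeSortedVocabs(
    const std::vector<std::vector<std::string>>& partials,
    std::vector<std::vector<std::size_t>>* idMaps) {
  std::priority_queue<QueueWord, std::vector<QueueWord>, QueueWordGreater> queue;
  idMaps->assign(partials.size(), std::vector<std::size_t>{});
  for (std::size_t f = 0; f < partials.size(); ++f) {
    (*idMaps)[f].resize(partials[f].size());
    if (!partials[f].empty()) {
      queue.push(QueueWord{partials[f][0], f, 0});
    }
  }

  std::vector<std::string> global;
  while (!queue.empty()) {
    QueueWord top = queue.top();
    queue.pop();
    // Equal words from different partial vocabularies share one global id.
    if (global.empty() || global.back() != top.word) {
      global.push_back(top.word);
    }
    (*idMaps)[top.partialFileId][top.partialWordId] = global.size() - 1;
    // Refill the queue from the vocabulary the popped word came from.
    std::size_t next = top.partialWordId + 1;
    if (next < partials[top.partialFileId].size()) {
      queue.push(QueueWord{partials[top.partialFileId][next],
                           top.partialFileId, next});
    }
  }
  return global;
}
```

The queue never holds more than one word per partial vocabulary, so memory use is independent of the vocabulary size.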

niklas88 · 2018-12-17T09:42:44Z

src/parser/ParallelParseBuffer.h

+  // until the (asynchronous) call to parseBatch has finished. Returns false if
+  // the parser has completely parsed the file. In this case the argument is
+  // untouched
+  bool getline(std::array<string, 3>& spo) {


should probably be called getTriple()

niklas88 · 2018-12-17T09:44:06Z

src/parser/ParallelParseBuffer.h

+  std::future<std::pair<bool, std::vector<array<string, 3>>>> _fut;
+
+  // this function extracts _bufferSize many triples from the parser.
+  // If the bool argument is false, the parser is exhausted and further calls


This should say "bool return value", not "bool argument".
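For readers without the full file, the buffering scheme discussed in these two comments can be sketched as below. This is a simplified illustration, not the class from this PR: `ParallelParseBufferSketch`, `VectorParserSketch`, `StringTriple` and `startNextBatch` are hypothetical names, and the parser is assumed to offer a `getTriple` method that returns false when it is exhausted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

using StringTriple = std::array<std::string, 3>;

// Hypothetical stand-in for a real parser: hands out triples from a vector.
class VectorParserSketch {
 public:
  explicit VectorParserSketch(std::vector<StringTriple> triples)
      : _triples(std::move(triples)) {}
  bool getTriple(StringTriple& spo) {
    if (_pos >= _triples.size()) return false;
    spo = _triples[_pos++];
    return true;
  }

 private:
  std::vector<StringTriple> _triples;
  std::size_t _pos = 0;
};

template <class Parser>
class ParallelParseBufferSketch {
 public:
  ParallelParseBufferSketch(Parser parser, std::size_t bufferSize)
      : _parser(std::move(parser)), _bufferSize(bufferSize) {
    assert(bufferSize > 0);
    startNextBatch();
  }
  // The async task captures `this`, so the object must not be copied.
  ParallelParseBufferSketch(const ParallelParseBufferSketch&) = delete;
  ParallelParseBufferSketch& operator=(const ParallelParseBufferSketch&) = delete;

  // Returns false once the parser is exhausted; `spo` is then untouched.
  bool getTriple(StringTriple& spo) {
    if (_pos == _buffer.size()) {
      if (!_fut.valid()) return false;
      auto result = _fut.get();  // blocks until the batch is ready
      _buffer = std::move(result.second);
      _pos = 0;
      // Parse the next batch in the background while this one is consumed.
      if (result.first) startNextBatch();
      if (_buffer.empty()) return false;
    }
    spo = std::move(_buffer[_pos++]);
    return true;
  }

 private:
  // Extracts up to _bufferSize triples. The bool return value is false
  // if the parser is exhausted.
  std::pair<bool, std::vector<StringTriple>> parseBatch() {
    std::vector<StringTriple> batch;
    batch.reserve(_bufferSize);
    StringTriple spo;
    while (batch.size() < _bufferSize) {
      if (!_parser.getTriple(spo)) return std::make_pair(false, std::move(batch));
      batch.push_back(spo);
    }
    return std::make_pair(true, std::move(batch));
  }

  void startNextBatch() {
    _fut = std::async(std::launch::async, [this] { return parseBatch(); });
  }

  Parser _parser;
  std::size_t _bufferSize;
  std::vector<StringTriple> _buffer;
  std::size_t _pos = 0;
  // Declared last so it is destroyed first: its destructor waits for a
  // running task before the parser goes away.
  std::future<std::pair<bool, std::vector<StringTriple>>> _fut;
};
```

Only the background task touches the parser and only the caller touches the buffer, so no locking is needed in this sketch; the future is the single synchronization point.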

joka921 · 2018-12-20T14:02:12Z

I have made the requested changes. I also ran end-to-end builds for Wikidata Truthy without any trouble.
The single-pass part was also used for Wikidata Full, but there I had some issues with the permutations, which are fixed here (STXXL with OpenMP can be made to compile, but behaves strangely, etc.).

niklas88

LGTM

niklas88 added this to In progress in QLever Oct 8, 2018

joka921 force-pushed the f.singlePassProper branch from 6968447 to 49e8e84 Compare November 30, 2018 11:54

joka921 force-pushed the f.singlePassProper branch 2 times, most recently from 47c02b0 to 8256da1 Compare December 13, 2018 13:37

joka921 force-pushed the f.singlePassProper branch from f4f83ac to 441f999 Compare December 16, 2018 18:01

joka921 force-pushed the f.singlePassProper branch from 441f999 to 6a7937b Compare December 16, 2018 18:26

niklas88 reviewed Dec 17, 2018

Changes on PR requested by Niklas

87b5e17

niklas88 approved these changes Dec 21, 2018

QLever automation moved this from In progress to Review Approved Dec 21, 2018

niklas88 merged commit 8e798c4 into ad-freiburg:master Dec 21, 2018

QLever automation moved this from Review Approved to Done Dec 21, 2018

joka921 deleted the f.singlePassProper branch May 8, 2021 09:15
