Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster Index Building using a single-pass approach. #139

Merged
merged 3 commits into from Dec 21, 2018

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Oct 7, 2018

Only parsing the KnowledgeBase file once.

  • While parsing the file (nt/ttl/tsv) we write a Vector of temporary ids which correspond to the
    order of appearance in the partial vocabulary.

  • Merging the partial vocabularies now creates a
    mapping from temporary to global ids.

  • In a second pass, use this mapping to convert
    temporary to global ids. This is much faster
    than parsing the input file for a second time

  • in Addition, the parser is run in parallel (using std::async) during the first pass.
    This gives us another speedup.

@niklas88 niklas88 added this to In progress in QLever Oct 8, 2018
@joka921 joka921 force-pushed the f.singlePassProper branch 2 times, most recently from 47c02b0 to 8256da1 Compare December 13, 2018 13:37
This pull request speeds up the index building process by only parsing
the knowledge base file once (previously twice) which was a serious
bottleneck.

- While parsing the file (nt/ttl/tsv) we write a Vector of temporary ids which correspond to the
  order of appearance in the partial vocabulary.
- Merging the partial vocabularies now creates a
  mapping from temporary/partial ids to global ids.

- In a second pass, use this mapping to convert
  temporary ids to global ids. This is much faster
  than parsing the input file for a second time

- in Addition, the parser is run in parallel (using std::async) during the first pass.
  This gives us more speedup.
- Dynamic Array that handles small and big data efficiently as well
- internally holds a std::vector and an MmapVector
- As long as the vector is small, the std::vector is used (fast)
- If the size goes beyond a user-defined threshold the data is shifted
to the MmapVector (probably slower, but saves RAM)

- We use this while creating the Single relations for the permutations.
- Most of them are really small but previously some problems occured
with big relations causing RAM problems in this step.
- The BufferedVector efficiently deals with this problem
Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few proposed changes to variable and file names. Are you already running a full Wikidata rebuild with this?

CMakeLists.txt Outdated
@@ -161,5 +161,6 @@ add_test(HashMapTest test/HashMapTest)
add_test(HashSetTest test/HashSetTest)
add_test(VocabularyGeneratorTest test/VocabularyGeneratorTest)
add_test(MmapVectorTest test/MmapVectorTest)
add_test(BuferedVectorTest test/BufferedVectorTest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "Buffered" not "Bufered"

using std::array;
using std::string;
using std::tuple;
using std::vector;

using json = nlohmann::json;

using IdPairMMapVec = ad_utility::MmapVector<std::array<Id, 2>>;
// a simple struct for better naming
struct LinesAndWords {
Copy link
Member

@niklas88 niklas88 Dec 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with the new parameters LinesAndWords is not the right name for this stuct anymore. Maybe VocabularyData?

// Create Vocabulary and directly write it to disk. Create ExtVec which can be
// used for creating permutations
// Member _vocab will be empty after this because it is not needed for index
// creation once the ExtVec is set up and it would be a waste of RAM
template <class Parser>
ExtVec createExtVecAndVocab(const string& ntFile);
std::unique_ptr<ExtVec> createExtVecAndVocab(const string& ntFile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy with the whole ExtVec naming, it sounds a lot like a "how" not a "why" or "what". In this case can't we just drop it from the method name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ExtVec is renamed to StxxlVec and the Method now is
createIdTriplesAndVocab. Is that better?

@@ -18,44 +18,66 @@
#include "./ConstantsIndexCreation.h"
#include "./Vocabulary.h"

class PairCompare {
// helper struct used in the priority queue for merging.
// represents tokens/words in a certain partial vocabular
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vocabulary

class PairCompare {
// helper struct used in the priority queue for merging.
// represents tokens/words in a certain partial vocabular
struct QueueValue {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe QueueWord because it's a word of the vocabulary?

// until the (asynchronous) call to parseBatch has finished. Returns false if
// the parser has completely parsed the file. In this case the argument is
// untouched
bool getline(std::array<string, 3>& spo) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably be called getTriple()

std::future<std::pair<bool, std::vector<array<string, 3>>>> _fut;

// this function extracts _bufferSize many triples from the parser.
// If the bool argument is false, the parser is exhausted and further calls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool return value

@joka921
Copy link
Member Author

joka921 commented Dec 20, 2018

I have changed the requested places. I also ran end-to-end builds for Wikidata Truthy without any trouble.
The single passing part was also used for Wikidata Full, but there I had some issues with the permutations which were bugfixed here (STXXL with OpenMP can be made to compile, but behaves strangely etc.)

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

QLever automation moved this from In progress to Review Approved Dec 21, 2018
@niklas88 niklas88 merged commit 8e798c4 into ad-freiburg:master Dec 21, 2018
QLever automation moved this from Review Approved to Done Dec 21, 2018
@joka921 joka921 deleted the f.singlePassProper branch May 8, 2021 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
QLever
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

2 participants