New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster Index Building using a single-pass approach. #139
Conversation
6968447
to
49e8e84
Compare
47c02b0
to
8256da1
Compare
This pull request speeds up the index building process by only parsing the knowledge base file once (previously twice) which was a serious bottleneck. - While parsing the file (nt/ttl/tsv) we write a Vector of temporary ids which correspond to the order of appearance in the partial vocabulary. - Merging the partial vocabularies now creates a mapping from temporary/partial ids to global ids. - In a second pass, use this mapping to convert temporary ids to global ids. This is much faster than parsing the input file for a second time - in Addition, the parser is run in parallel (using std::async) during the first pass. This gives us more speedup.
f4f83ac
to
441f999
Compare
- Dynamic Array that handles small and big data efficiently as well - internally holds a std::vector and an MmapVector - As long as the vector is small, the std::vector is used (fast) - If the size goes beyond a user-defined threshold the data is shifted to the MmapVector (probably slower, but saves RAM) - We use this while creating the Single relations for the permutations. - Most of them are really small but previously some problems occured with big relations causing RAM problems in this step. - The BufferedVector efficiently deals with this problem
441f999
to
6a7937b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a few proposed changes to variable and file names. Are you already running a full Wikidata rebuild with this?
CMakeLists.txt
Outdated
@@ -161,5 +161,6 @@ add_test(HashMapTest test/HashMapTest) | |||
add_test(HashSetTest test/HashSetTest) | |||
add_test(VocabularyGeneratorTest test/VocabularyGeneratorTest) | |||
add_test(MmapVectorTest test/MmapVectorTest) | |||
add_test(BuferedVectorTest test/BufferedVectorTest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "Buffered" not "Bufered"
src/index/Index.h
Outdated
using std::array; | ||
using std::string; | ||
using std::tuple; | ||
using std::vector; | ||
|
||
using json = nlohmann::json; | ||
|
||
using IdPairMMapVec = ad_utility::MmapVector<std::array<Id, 2>>; | ||
// a simple struct for better naming | ||
struct LinesAndWords { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with the new parameters LinesAndWords
is not the right name for this stuct anymore. Maybe VocabularyData
?
src/index/Index.h
Outdated
// Create Vocabulary and directly write it to disk. Create ExtVec which can be | ||
// used for creating permutations | ||
// Member _vocab will be empty after this because it is not needed for index | ||
// creation once the ExtVec is set up and it would be a waste of RAM | ||
template <class Parser> | ||
ExtVec createExtVecAndVocab(const string& ntFile); | ||
std::unique_ptr<ExtVec> createExtVecAndVocab(const string& ntFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy with the whole ExtVec
naming, it sounds a lot like a "how" not a "why" or "what". In this case can't we just drop it from the method name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ExtVec is renamed to StxxlVec and the Method now is
createIdTriplesAndVocab
. Is that better?
src/index/VocabularyGenerator.cpp
Outdated
@@ -18,44 +18,66 @@ | |||
#include "./ConstantsIndexCreation.h" | |||
#include "./Vocabulary.h" | |||
|
|||
class PairCompare { | |||
// helper struct used in the priority queue for merging. | |||
// represents tokens/words in a certain partial vocabular |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vocabulary
src/index/VocabularyGenerator.cpp
Outdated
class PairCompare { | ||
// helper struct used in the priority queue for merging. | ||
// represents tokens/words in a certain partial vocabular | ||
struct QueueValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe QueueWord
because it's a word of the vocabulary?
src/parser/ParallelParseBuffer.h
Outdated
// until the (asynchronous) call to parseBatch has finished. Returns false if | ||
// the parser has completely parsed the file. In this case the argument is | ||
// untouched | ||
bool getline(std::array<string, 3>& spo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably be called getTriple()
std::future<std::pair<bool, std::vector<array<string, 3>>>> _fut; | ||
|
||
// this function extracts _bufferSize many triples from the parser. | ||
// If the bool argument is false, the parser is exhausted and further calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool return value
I have changed the requested places. I also ran end-to-end builds for Wikidata Truthy without any trouble. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Only parsing the KnowledgeBase file once.
While parsing the file (nt/ttl/tsv) we write a Vector of temporary ids which correspond to the
order of appearance in the partial vocabulary.
Merging the partial vocabularies now creates a
mapping from temporary to global ids.
In a second pass, use this mapping to convert
temporary to global ids. This is much faster
than parsing the input file for a second time
in Addition, the parser is run in parallel (using std::async) during the first pass.
This gives us another speedup.