New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficient index creation #59
Efficient index creation #59
Conversation
I think something went wrong with the duplicated commits, I see them in a local checkout too. How did you rebase on the current upstream version? The way I usually do this for feature branches is by having an To clean things up at the end I like to do Alternatively you can also do this in a single step with just an upstream remote and As for the build failure on |
Ok @joka921 I've made the Travis build use GCC 5, can you try a |
@joka921 looks like you stil merged instead of rebased? Also something is weird with the branch name |
@joka921 sorry I was wrong, apparently rebase doesn't work across remotes so you really do need the upstream tracking branch. |
e60c913
to
55840e1
Compare
So I tried this with your changes on my version and here is what I did:
Then a vim window opens and I replaced pick with drop for the duplicated old commits |
4e71d73
to
49df2e6
Compare
@niklas88 Thank you for your help. After hardresetting, rebasing and force-pushing I think I got a single-commit version online which I consider to be on the state I actually wanted it. Thank you for your help, I think I have learned something for the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I added quite a few comments, this really looks quite good overall, especially for a first upstream contribution. I'm not super sure about some of my comments so I'd be glad to hear your thoughts on them.
src/index/Index.cpp
Outdated
while (p.getLine(spo)) { | ||
if (ad_utility::isXsdValue(spo[2])) { | ||
spo[2] = ad_utility::convertValueLiteralToIndexWord(spo[2]); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
superfluous spaces before end of line
src/index/Index.cpp
Outdated
++i; | ||
if (i % 10000000 == 0) { | ||
std::cout << "Lines processed: " << i << '\n'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these always direct std::cout
instead of LOG(INFO)
, I'm not sure at the moment?
src/index/Index.cpp
Outdated
LOG(INFO) << "Lines processed: " << i << '\n'; | ||
std::cout << "writing partial vocab no. " << numFiles << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should definitely be LOG(INFO)
and move it to the point where we are actually writing the file
src/index/Index.cpp
Outdated
if (items.size() > 0) { | ||
// write remainder | ||
|
||
std::cout << "writing partial vocab no. " << numFiles << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOG(LEVEL)
src/index/Index.cpp
Outdated
LOG(INFO) << "Making pass over NTriples " << ntFile | ||
<< " and creating stxxl vector.\n"; | ||
array<string, 3> spo; | ||
NTriplesParser p(ntFile); | ||
google::sparse_hash_map<string, Id> vocabMap = _vocab.asMap(); | ||
LOG(INFO) << "Reading partial vocab ...\n"; | ||
google::sparse_hash_map<string, Id> vocabMap = vocabMapFromPartialIndexedFile(_onDiskBase + "partialVocabulary0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should stick with using a .extension
scheme (i.e. add a .
), also others are all lowercase with -
word seperators
src/index/VocabularyGenerator.cpp
Outdated
if (infiles[i].read((char*)&len, sizeof(len))) { | ||
vecs[i].emplace_back(); | ||
vecs[i].back().resize(len); | ||
infiles[i].read(&(vecs[i].back()[0]), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope we can soon use C++17, non const std::vector::data()
really is much nicer to read
src/index/VocabularyGenerator.cpp
Outdated
google::sparse_hash_map<string, Id> vocabMapFromPartialIndexedFile(const string& partialFile) { | ||
std::ifstream file(partialFile, std::ios_base::binary); | ||
google::sparse_hash_map<string, Id> vocabMap; | ||
unsigned int len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint32_t
src/index/VocabularyGenerator.cpp
Outdated
while (file.read((char*)&len, sizeof(len))) { | ||
std::string word; | ||
word.resize(len); | ||
file.read(&(word[0]), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove parenthesis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[] has indeed higher precedence than & but I still find this clearer to read, but can remove it if you wish, since I now know about the preference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no strong preference but looked weird to me
src/index/VocabularyGenerator.h
Outdated
using std::string; | ||
// _______________________________________________________________ | ||
// merge the partial vocabularies at basenamepartialVocabulary0 to basename + | ||
// partialVocabulary + numFiles-1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why always with the last?
@@ -0,0 +1,5 @@ | |||
#include "./VocabularyGenerator.h" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is integrated with the index building, or is this just for testing? That said there should be a VocabularyGeneratorTest
@joka921 yes, that single commit version looks a lot cleaner. I think you have done well, this stuff is quite tricky. Just two years ago I had to learn that same procedure on a PR for the Go programming language standard library, only that reviewer was the founder of LiveJournal and his time reviewing my changes (and mistakes) probably cost Google a couple hundred bucks when he could have fixed them in far less time. Still I learned a lot and it's definitely come in handy already |
// _____________________________________________________________________________ | ||
void ExternalVocabulary::buildFromTextFile(const string& textFileName, | ||
const string& outFileName) { | ||
_file.open(outFileName.c_str(), "w"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing a check if the open()
actually worked. I just ran this in docker and had wrong permissions on the volume where the index is created and only when creating the real index did I see it fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a check in the current version
One more thing, am I seeing this right and the partial vocabularies are currently not deleted after writing the full vocabulary? Ideally deleting them would also mean that we could continue after writing only some of them, however I think to work correctly we would need to first write each partial vocabulary and then atomically rename (mv) it to the final name. |
On my Wikipedia Freebase (not Easy) index build I'm currently getting a lot of
|
I have just fixed your problem. This is not normal and an error, I should let the program terminate at that point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more nitpicks but overall I really like these changes. Now my Index build on metropolis ran out of disk instead of RAM 🎉 so I hopefully with more stuff deleted it will go throught next time
src/index/Index.cpp
Outdated
@@ -91,7 +91,7 @@ Index::ExtVec Index::createExtVecAndVocabFromNTriples(const string& ntFile, | |||
bool onDiskLiterals) { | |||
size_t nofLines = passNTriplesFileForVocabulary(ntFile, onDiskLiterals, NUM_TRIPLES_PER_PARTIAL_VOCAB); | |||
if (onDiskLiterals) { | |||
_vocab.externalizeLiteralsFromTextFile(onDiskBase + "externalTextFile", onDiskBase + ".literals-index"); | |||
_vocab.externalizeLiteralsFromTextFile(onDiskBase + ".externalTextFile", onDiskBase + ".literals-index"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename this suffix, all other use foo-bar
instead of fooBar
, also File
as a suffix adds no information so I propose .externalized-text
src/global/Constants.h
Outdated
@@ -51,3 +52,4 @@ static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19; | |||
// How many lines are parsed at once during index creation. | |||
// Reduce to save RAM | |||
static const int NUM_TRIPLES_PER_PARTIAL_VOCAB = 100000000; | |||
static const std::string PARTIAL_VOCAB_FILE_NAME = ".partial_vocabulary"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use -
instead of _
to make even more consistent. I think I misstated it for snake case in the last review.
src/index/VocabularyGenerator.cpp
Outdated
@@ -20,12 +21,15 @@ void mergeVocabulary(const std::string& basename, size_t numFiles) { | |||
std::vector<std::fstream> infiles; | |||
std::vector<std::vector<std::string>> vecs(numFiles); | |||
std::ofstream outfile(basename + ".vocabulary"); | |||
std::ofstream outfileExternal(basename + "externalTextFile"); | |||
AD_CHECK(outfile.is_open()); | |||
std::ofstream outfileExternal(basename + ".externalTextFile"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should make this a constant like for .partial-vocabulary
src/index/VocabularyGenerator.cpp
Outdated
infiles[top.second].write((char*)&totalWritten, sizeof(totalWritten)); | ||
|
||
auto minusOne = totalWritten - 1; | ||
infiles[top.second].write((char*)&minusOne, sizeof(totalWritten)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use sizeof()
on the same variable to be more consistent.
test/CMakeLists.txt
Outdated
SparsehashTest | ||
VocabularyGeneratorTest | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This )
seems to be wrongly indented
test/VocabularyGeneratorTest.cpp
Outdated
// may throw? | ||
MergeVocabularyTest() { | ||
// name of random subdirectory | ||
_basePath = boost::filesystem::unique_path("qleverVocTest%%%%_%%%%_%%%%_%%%%"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still some Boost use or am I looking at outdated code?
15daf5a
to
9de42ca
Compare
src/index/VocabularyGenerator.cpp
Outdated
uint32_t len; | ||
while (file.read((char*)&len, sizeof(len))) { | ||
std::string word; | ||
word.resize(len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of initializing a default string and only then resizing it probably makes more sense to create a string of the correct size filled with zeros using the size based constructor.
As for the buffering since we read sequentially and only know the size after reading the length I don't think custom buffering makes sense here. Let's just leave that to the OS. Alternatively it may also make more sense to read the whole partialFile
into a single buffer in one go if it's not too large. That would save some syscalls.
src/index/Index.cpp
Outdated
@@ -159,6 +160,11 @@ void Index::createFromNTriplesFile(const string& ntFile, | |||
_patterns, _maxNumPatterns); | |||
} | |||
openFileHandles(); | |||
// we have deleted the vocabulary in the process to save ram, so now we have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So since it is in Index.cpp
it's not just for text indices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this is a duplicate.
The actual solution is by setting _onDiskLiterals and _onDiskBase before, then it suffices to read it in Index.Text.cpp. I am going to fix this tomorrow.
8730c41
to
5875961
Compare
- Using k-way merge on partial vocabulary to reduce size of hashmaps needed during creation - external literals (long/from "exotic" languages) are directly handled on disk without ever completely reading them into RAM - added UnitTest for mergeVocabulary - Added new file /src/indexConstantsIndexCreation.h to store constants which are only used during index creation TODO: - TODO: Only implemented for NTRIPLE-Files so far, also handle for CSV (a lot of code duplication) - make text index creation also efficient.
- Running IndexBuilderMain with option -A assumes already existing index at location specified by option -i with valid vocabulary and uses this to add a text index as specified by -w and -k - adds ability to continue runs which have suceeded in building an index but failed at creating the text index. - removed unused buffers in vocabulary Generator : With binary files, which are contiguously read, the paging system should provide enough buffers.
13ef438
to
70c845e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall but still have some comments
onDiskLiterals); | ||
} else { | ||
index.createFromOnDiskIndex(baseName, allPermutations, onDiskLiterals); | ||
// TODO(j.kalmbach): onDiskLiterals is now redundant in all other functions, probably |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it everywhere but for createFromOnDiskIndex, there the name is so integral to the function that it is probably better to use an argument.
src/index/VocabularyGenerator.cpp
Outdated
std::vector<std::vector<std::string>::iterator> iterators; | ||
|
||
using pair_T = std::pair<std::string, size_t>; | ||
std::priority_queue<pair_T, std::vector<pair_T>, PairCompare> queue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<T>
is the default Container for a priority queue so I think we should not make this explicit if it's the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not possible, since I need the third template parameter (the custom comparator) and one cannot "skip" template parameters to my knowledge.
src/index/VocabularyGenerator.cpp
Outdated
if (top.first < string({EXTERNALIZED_LITERALS_PREFIX})) { | ||
outfile << top.first << std::endl; | ||
} else { | ||
outfileExternal << top.first << std::endl; | ||
} | ||
|
||
// write id to partial vocabulary | ||
infiles[top.second].seekp(infiles[top.second].tellp()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a flush()
would make the intent clearer, or at least a seek()
to the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean by "a seek() to the end". We are not at the end of the file but in the middle (the ids which we write are directly next to the words we are reading.
This worked and there was some trouble with the flush(). I am also not sure about the performance, maybe this seek can be cheaper than a full flush to enforce the standard-compliance?
Could you in this case agree with explaining this by a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, the interface of istream suggests, that the write and the read position are in fact two different things. They are implemented as one and the same position, but my version deals explicitly with the strangeness off the seekp and seekg function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see
// _____________________________________________________________________________ | ||
void ExternalVocabulary::buildFromTextFile(const string& textFileName, | ||
const string& outFileName) { | ||
_file.open(outFileName.c_str(), "w"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a check in the current version
src/index/Index.cpp
Outdated
if (!keepTempFiles) { | ||
// remove temporary files only used during index creation | ||
LOG(INFO) << "Removing temporary files (partial vocabulary and external text file...\n"; | ||
string removeCommand1 = "rm " + onDiskBase + EXTERNAL_LITS_TEXT_FILE_NAME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use "rm --"
to make sure no one can pass e.g. -rf
as part of onDiskBase
. Also I'm not sure if we really should use the system()
function here, sadly the only nice standard way needs C++17. Now that Ubuntu 18.04 is out I'd be willing to use that in QLever though for a lot of our machines that means it really only works in a docker container at the moment. So maybe for now just add a TODO
src/index/Index.cpp
Outdated
// _____________________________________________________________________________ | ||
void Index::createFromNTriplesFile(const string& ntFile, | ||
const string& onDiskBase, | ||
bool allPermutations, bool onDiskLiterals) { | ||
bool allPermutations, bool onDiskLiterals, | ||
bool keepTempFiles) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indent looks wrong, did you try using our .clang-format
settings?
src/index/VocabularyGenerator.cpp
Outdated
|
||
std::vector<std::fstream> infiles; | ||
// currently only one word is buffered, no matter how big the bufferSize is | ||
// TODO: buffer, or throw the buffer out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you threw it out, remove this comment
test/CMakeLists.txt
Outdated
add_library(tests | ||
SparqlParserTest | ||
StringUtilsTest | ||
StringUtilsTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This look wrongly indented
test/CMakeLists.txt
Outdated
@@ -69,4 +73,5 @@ add_library(tests | |||
ConversionsTest | |||
SparsehashTest | |||
GroupByTest | |||
) | |||
VocabularyGeneratorTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also mismatched indent
test/VocabularyGeneratorTest.cpp
Outdated
_path0 = std::string(PARTIAL_VOCAB_FILE_NAME + std::to_string(0)); | ||
_path1 = std::string(PARTIAL_VOCAB_FILE_NAME + std::to_string(1)); | ||
// those names can be random | ||
_pathExp0 = std::string(".partialVocabExp0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we moved to snake_case_names
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for filenames it was names-with-dashes, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple missing spaces left. Feel free to merge after
src/index/Index.cpp
Outdated
@@ -99,9 +99,12 @@ Index::ExtVec Index::createExtVecAndVocabFromNTriples(const string& ntFile) { | |||
if (!_keepTempFiles) { | |||
// remove temporary files only used during index creation | |||
LOG(INFO) << "Removing temporary files (partial vocabulary and external text file...\n"; | |||
string removeCommand1 = "rm " + _onDiskBase + EXTERNAL_LITS_TEXT_FILE_NAME; | |||
|
|||
//TODO: using system and rm is not really elegant nor portable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space between //
and TODO
src/index/Index.cpp
Outdated
|
||
//TODO: using system and rm is not really elegant nor portable. | ||
// use std::filesystem as soon as QLever is ported to C++17 | ||
string removeCommand1 = "rm --" + _onDiskBase + EXTERNAL_LITS_TEXT_FILE_NAME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space before _onDiskBase
, sorry this was my mistake
src/index/VocabularyGenerator.cpp
Outdated
if (top.first < string({EXTERNALIZED_LITERALS_PREFIX})) { | ||
outfile << top.first << std::endl; | ||
} else { | ||
outfileExternal << top.first << std::endl; | ||
} | ||
|
||
// write id to partial vocabulary | ||
infiles[top.second].seekp(infiles[top.second].tellp()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see
264f882
to
eafc456
Compare
-base name and externalization settings are now handled via member functions of Index class. - exception: createFromOnDiskIndex still gets the name of the index and saves it internally (I considered this to be more intuitive)
eafc456
to
08539e0
Compare
Efficient index creation for RDF-NTriples