
Efficient index creation #59

Merged
merged 4 commits into from Jul 11, 2018

Conversation

@joka921 joka921 commented Jun 7, 2018

Efficient index creation for RDF-NTriples

@niklas88 niklas88 self-requested a review June 7, 2018 13:34
@niklas88 niklas88 changed the title F.efficient index creation Efficient index creation Jun 8, 2018
niklas88 commented Jun 8, 2018

I think something went wrong with the duplicated commits, I see them in a local checkout too. How did you rebase on the current upstream version? The way I usually do this for feature branches is by having an upstream branch and remote so that this branch can follow upstream:master with just git pull --ff-only upstream master. Then I can rebase my work onto that branch. When changes are large, merging can be easier than rebasing, though. See also this tutorial.

To clean things up at the end I like to do git rebase --interactive upstream and then squash the fix commits. When the PR is already pushed, pushing the new version needs --force, but since you are working on that branch alone that's OK.

Alternatively you can also do this in a single step with just an upstream remote and git rebase --interactive upstream master

As for the build failure on emplace_back(), I too don't really like that fix. Let's see if I can easily get a more recent GCC into Travis. I somehow thought Travis was on 16.04, but it really still is on 14.04, so I think we should target GCC 5.4.x as used on our 16.04 boxes.

niklas88 commented Jun 8, 2018

Ok @joka921, I've made the Travis build use GCC 5. Can you try a git rebase --interactive upstream master, dropping the duplicated commits and squashing the fix-up commits into the ones with real changes?

niklas88 commented Jun 8, 2018

@joka921 looks like you still merged instead of rebased? Also, something is weird with the branch name.

niklas88 commented Jun 8, 2018

@joka921 sorry, I was wrong; apparently rebase doesn't work across remotes, so you really do need the upstream tracking branch.

niklas88 commented Jun 8, 2018

So I tried this with your changes on my version and here is what I did:

git checkout -b efficient-index-creation
git pull git://github.com/joka921/QLever.git f.efficientIndexCreation
git rebase --interactive master # my master is the same as this version so updated with --ff-only

Then a vim window opens (it shows instructions at the bottom) and I replaced pick with drop for the duplicated old commits, and fixup for the commits that are just tiny fix-ups like removing the unused variable. And then you need to git push --force to your GitHub fork.

@joka921 joka921 force-pushed the f.efficientIndexCreation branch 2 times, most recently from 4e71d73 to 49df2e6 Compare June 8, 2018 15:03
joka921 commented Jun 8, 2018

@niklas88 Thank you for your help. After hard-resetting, rebasing and force-pushing I think I got a single-commit version online, which I consider to be in the state I actually wanted. I think I have learned something for the future.

@niklas88 niklas88 left a comment

Although I added quite a few comments, this really looks quite good overall, especially for a first upstream contribution. I'm not super sure about some of my comments so I'd be glad to hear your thoughts on them.

while (p.getLine(spo)) {
if (ad_utility::isXsdValue(spo[2])) {
spo[2] = ad_utility::convertValueLiteralToIndexWord(spo[2]);
}
}
Member

superfluous spaces before end of line

++i;
if (i % 10000000 == 0) {
std::cout << "Lines processed: " << i << '\n';
Member

Should these be LOG(INFO) instead of direct std::cout? I'm not sure at the moment.

LOG(INFO) << "Lines processed: " << i << '\n';
std::cout << "writing partial vocab no. " << numFiles << std::endl;
Member

This should definitely be LOG(INFO); also, move it to the point where we are actually writing the file.

if (items.size() > 0) {
// write remainder

std::cout << "writing partial vocab no. " << numFiles << std::endl;
Member

LOG(LEVEL)

LOG(INFO) << "Making pass over NTriples " << ntFile
<< " and creating stxxl vector.\n";
array<string, 3> spo;
NTriplesParser p(ntFile);
google::sparse_hash_map<string, Id> vocabMap = _vocab.asMap();
LOG(INFO) << "Reading partial vocab ...\n";
google::sparse_hash_map<string, Id> vocabMap = vocabMapFromPartialIndexedFile(_onDiskBase + "partialVocabulary0");
Member

I think we should stick with a .extension scheme (i.e. add a leading .); also, the others are all lowercase with - as word separator.

if (infiles[i].read((char*)&len, sizeof(len))) {
vecs[i].emplace_back();
vecs[i].back().resize(len);
infiles[i].read(&(vecs[i].back()[0]), len);
Member

I hope we can soon use C++17; non-const std::string::data() really is much nicer to read.
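
For comparison, a minimal sketch of the same read with C++17's non-const std::string::data() (the helper name readWord and its stream/vector parameters are illustrative, not code from this PR):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Read one length-prefixed word from a binary stream and append it to words.
// With C++17, std::string::data() is non-const, so the &(word[0]) workaround
// is no longer needed; emplace_back() also returns a reference to the new element.
bool readWord(std::ifstream& in, std::vector<std::string>& words) {
  uint32_t len;
  if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) {
    return false;
  }
  std::string& word = words.emplace_back();
  word.resize(len);
  return static_cast<bool>(in.read(word.data(), len));
}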

google::sparse_hash_map<string, Id> vocabMapFromPartialIndexedFile(const string& partialFile) {
std::ifstream file(partialFile, std::ios_base::binary);
google::sparse_hash_map<string, Id> vocabMap;
unsigned int len;
Member

uint32_t

while (file.read((char*)&len, sizeof(len))) {
std::string word;
word.resize(len);
file.read(&(word[0]), len);
Member

Remove parenthesis

Member Author

[] has indeed higher precedence than &, but I still find this clearer to read. I can remove it if you wish, since I now know about the preference.

Member

no strong preference but looked weird to me

using std::string;
// _______________________________________________________________
// merge the partial vocabularies at basenamepartialVocabulary0 to basename +
// partialVocabulary + numFiles-1
Member

Why always with the last?

@@ -0,0 +1,5 @@
#include "./VocabularyGenerator.h"

Member

I thought this was integrated with the index building, or is this just for testing? That said, there should be a VocabularyGeneratorTest.

niklas88 commented Jun 8, 2018

@joka921 yes, that single-commit version looks a lot cleaner. I think you have done well; this stuff is quite tricky. Just two years ago I had to learn that same procedure on a PR for the Go programming language standard library, only that the reviewer was the founder of LiveJournal, and his time reviewing my changes (and mistakes) probably cost Google a couple hundred bucks when he could have fixed them in far less time. Still, I learned a lot and it's definitely come in handy already.

// _____________________________________________________________________________
void ExternalVocabulary::buildFromTextFile(const string& textFileName,
const string& outFileName) {
_file.open(outFileName.c_str(), "w");
Member

This is missing a check whether the open() actually worked. I just ran this in Docker with wrong permissions on the volume where the index is created, and only when creating the real index did I see it fail.

Member Author

Fixed

Member

I don't see a check in the current version
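
For illustration only, the kind of check meant, sketched here with a plain std::ofstream and an exception; the actual code uses a custom File class and the AD_CHECK macro seen elsewhere in this PR, whose exact API may differ:

#include <fstream>
#include <stdexcept>
#include <string>

// Open the output file and fail loudly right away if that did not work,
// instead of silently writing into a bad stream and only failing much later
// (e.g. with wrong permissions on a Docker volume).
std::ofstream openOutputChecked(const std::string& outFileName) {
  std::ofstream out(outFileName, std::ios_base::binary);
  if (!out.is_open()) {
    throw std::runtime_error("Could not open " + outFileName + " for writing");
  }
  return out;
}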

@niklas88

One more thing: am I seeing this right that the partial vocabularies are currently not deleted after writing the full vocabulary? Ideally we could also continue a build after writing only some of them; however, I think for that to work correctly we would need to first write each partial vocabulary and then atomically rename (mv) it to the final name.
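
A minimal sketch of the write-then-rename idea mentioned here (the helper name and the writer callback are illustrative; std::rename replaces the target atomically on POSIX when both names are on the same filesystem):

#include <cstdio>      // std::rename
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>

// Write a partial vocabulary under a temporary name first and only rename it
// to its final name once it is complete. A file with the final name is then
// always complete, so an interrupted index build could safely skip it on resume.
void writePartialVocabAtomically(const std::string& finalName,
                                 const std::function<void(std::ofstream&)>& writeContents) {
  const std::string tmpName = finalName + ".tmp";
  {
    std::ofstream out(tmpName, std::ios_base::binary);
    if (!out) {
      throw std::runtime_error("Could not open " + tmpName + " for writing");
    }
    writeContents(out);
  }  // out is flushed and closed here
  if (std::rename(tmpName.c_str(), finalName.c_str()) != 0) {
    throw std::runtime_error("Could not rename " + tmpName + " to " + finalName);
  }
}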

@niklas88

On my Wikipedia Freebase (not Easy) index build I'm currently getting a lot of "not found in partial Vocab" log entries; is this normal?

Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: "9788420743523"@en
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0b1ldcq>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/common.topic.notable_types>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0775xx5>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0b1ldcq>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/key/soft.isbn>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: "9788420743523"
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0b1ldcq>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/key/soft.isbn>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: "9788420743523"
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0b1ldcq>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/common.topic.notable_for>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/g.1258_r_bd>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0b1ldcq>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/type.object.permission>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0775xvm>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/m.0b1ldcq>
Fri Jun 15 09:25:12.107 - INFO:  not found in partial Vocab: <http://rdf.freebase.com/ns/type.object.key>

joka921 commented Jun 15, 2018

I have just fixed your problem. This is not normal but an error; I should let the program terminate at that point.
There was a wrong name for the first partialVocabulary file (an error during refactoring). I fixed it and also added an assertion which checks this. When I have time I think I should write an end-to-end test for the index creation to avoid stuff like this.

@niklas88 niklas88 left a comment

A few more nitpicks, but overall I really like these changes. Now my index build on metropolis ran out of disk instead of RAM 🎉 so hopefully, with more stuff deleted, it will go through next time.

@@ -91,7 +91,7 @@ Index::ExtVec Index::createExtVecAndVocabFromNTriples(const string& ntFile,
bool onDiskLiterals) {
size_t nofLines = passNTriplesFileForVocabulary(ntFile, onDiskLiterals, NUM_TRIPLES_PER_PARTIAL_VOCAB);
if (onDiskLiterals) {
_vocab.externalizeLiteralsFromTextFile(onDiskBase + "externalTextFile", onDiskBase + ".literals-index");
_vocab.externalizeLiteralsFromTextFile(onDiskBase + ".externalTextFile", onDiskBase + ".literals-index");
Member

Rename this suffix: all others use foo-bar instead of fooBar, and File as a suffix adds no information, so I propose .externalized-text.

@@ -51,3 +52,4 @@ static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;
// How many lines are parsed at once during index creation.
// Reduce to save RAM
static const int NUM_TRIPLES_PER_PARTIAL_VOCAB = 100000000;
static const std::string PARTIAL_VOCAB_FILE_NAME = ".partial_vocabulary";
Member

Use - instead of _ to make it even more consistent. I think I misstated it as snake case in the last review.

@@ -20,12 +21,15 @@ void mergeVocabulary(const std::string& basename, size_t numFiles) {
std::vector<std::fstream> infiles;
std::vector<std::vector<std::string>> vecs(numFiles);
std::ofstream outfile(basename + ".vocabulary");
std::ofstream outfileExternal(basename + "externalTextFile");
AD_CHECK(outfile.is_open());
std::ofstream outfileExternal(basename + ".externalTextFile");
Member

We should probably make this a constant, like for .partial-vocabulary.

infiles[top.second].write((char*)&totalWritten, sizeof(totalWritten));

auto minusOne = totalWritten - 1;
infiles[top.second].write((char*)&minusOne, sizeof(totalWritten));
Member

use sizeof() on the same variable to be more consistent.

SparsehashTest
VocabularyGeneratorTest
)
Member

This ) seems to be wrongly indented

// may throw?
MergeVocabularyTest() {
// name of random subdirectory
_basePath = boost::filesystem::unique_path("qleverVocTest%%%%_%%%%_%%%%_%%%%");
Member

Still some Boost use or am I looking at outdated code?

@joka921 joka921 force-pushed the f.efficientIndexCreation branch 3 times, most recently from 15daf5a to 9de42ca Compare June 24, 2018 12:52
uint32_t len;
while (file.read((char*)&len, sizeof(len))) {
std::string word;
word.resize(len);
Member

Instead of initializing a default string and only then resizing it, it probably makes more sense to create a string of the correct size, filled with zeros, using the size-based constructor.

As for the buffering: since we read sequentially and only know the size after reading the length, I don't think custom buffering makes sense here. Let's just leave that to the OS. Alternatively, it may also make more sense to read the whole partialFile into a single buffer in one go if it's not too large; that would save some syscalls.
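
Two small sketches of the variants mentioned above: constructing the string at its final size, and slurping the whole partial file into one buffer (the function names are illustrative; &word[0] is used so the sketch also compiles pre-C++17):

#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Variant 1: create the string with the correct size, filled with zeros,
// instead of default-constructing it and resizing afterwards.
bool readOneWord(std::ifstream& file, std::string& out) {
  uint32_t len;
  if (!file.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
  std::string word(len, '\0');  // size-based constructor
  if (!file.read(&word[0], len)) return false;
  out = std::move(word);
  return true;
}

// Variant 2: read the whole partial file into a single buffer in one call and
// parse the length-prefixed words from memory afterwards; this saves syscalls.
std::string slurpFile(const std::string& fileName) {
  std::ifstream file(fileName, std::ios_base::binary | std::ios_base::ate);
  if (!file) throw std::runtime_error("Could not open " + fileName);
  std::string buffer(static_cast<size_t>(file.tellg()), '\0');
  file.seekg(0);
  file.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
  return buffer;
}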

@@ -159,6 +160,11 @@ void Index::createFromNTriplesFile(const string& ntFile,
_patterns, _maxNumPatterns);
}
openFileHandles();
// we have deleted the vocabulary in the process to save ram, so now we have
Member

So since it is in Index.cpp it's not just for text indices?

Member Author

You're right, this is a duplicate.
The actual solution is to set _onDiskLiterals and _onDiskBase before; then it suffices to read it in Index.Text.cpp. I am going to fix this tomorrow.

- Using k-way merge on partial vocabulary to reduce size of hashmaps needed during creation
- external literals (long/from "exotic" languages) are directly handled on disk without ever completely reading them into RAM
- added UnitTest for mergeVocabulary
- Added new file /src/indexConstantsIndexCreation.h to store constants which are
  only used during index creation

TODO:
- Only implemented for NTriples files so far; also handle CSV (a lot of code duplication)
- make text index creation also efficient.
- Running IndexBuilderMain with option -A assumes already existing index at
  location specified by option -i with
  valid vocabulary and uses this to add a text index as specified by -w and -k
- adds ability to continue runs which have succeeded in building an index but
  failed at creating the text index.

- removed unused buffers in the vocabulary generator: with binary files, which are
  read contiguously, the paging system should provide enough buffering.
@niklas88 niklas88 left a comment

LGTM overall, but I still have some comments.

onDiskLiterals);
} else {
index.createFromOnDiskIndex(baseName, allPermutations, onDiskLiterals);
// TODO(j.kalmbach): onDiskLiterals is now redundant in all other functions, probably
Member

So remove it

Member Author

Removed it everywhere except for createFromOnDiskIndex; there the name is so integral to the function that it is probably better to keep it as an argument.

std::vector<std::vector<std::string>::iterator> iterators;

using pair_T = std::pair<std::string, size_t>;
std::priority_queue<pair_T, std::vector<pair_T>, PairCompare> queue;
Member

std::vector<T> is the default container for a priority_queue, so I think we should not make this explicit if it's the default.

Member Author

I think this is not possible, since I need the third template parameter (the custom comparator) and one cannot "skip" template parameters to my knowledge.
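
To make this concrete, a minimal sketch of the k-way merge with such a queue (pair_T and PairCompare mirror the names in this PR; the rest of the function is illustrative, not the PR's actual code). The container argument std::vector<pair_T> indeed has to be spelled out, because the comparator is the third template parameter and template parameters cannot be skipped:

#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using pair_T = std::pair<std::string, size_t>;  // (word, index of the source vocabulary)

// Orders the queue so that the lexicographically smallest word is on top.
struct PairCompare {
  bool operator()(const pair_T& a, const pair_T& b) const { return a.first > b.first; }
};

// Merge k sorted word lists into one sorted, duplicate-free list. Only one
// word per input list is held in the queue at any time.
std::vector<std::string> kWayMerge(const std::vector<std::vector<std::string>>& inputs) {
  std::priority_queue<pair_T, std::vector<pair_T>, PairCompare> queue;
  std::vector<size_t> pos(inputs.size(), 0);
  for (size_t i = 0; i < inputs.size(); ++i) {
    if (!inputs[i].empty()) queue.emplace(inputs[i][0], i);
  }
  std::vector<std::string> merged;
  while (!queue.empty()) {
    pair_T top = queue.top();
    queue.pop();
    if (merged.empty() || merged.back() != top.first) {
      merged.push_back(top.first);  // skip duplicates coming from different inputs
    }
    if (++pos[top.second] < inputs[top.second].size()) {
      queue.emplace(inputs[top.second][pos[top.second]], top.second);
    }
  }
  return merged;
}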

if (top.first < string({EXTERNALIZED_LITERALS_PREFIX})) {
outfile << top.first << std::endl;
} else {
outfileExternal << top.first << std::endl;
}

// write id to partial vocabulary
infiles[top.second].seekp(infiles[top.second].tellp());
Member

I think a flush() would make the intent clearer, or at least a seek() to the end.

Member Author

I am not sure what you mean by "a seek() to the end". We are not at the end of the file but in the middle (the ids which we write are directly next to the words we are reading).
This worked, and there was some trouble with the flush(). I am also not sure about the performance; maybe this seek is cheaper than a full flush while still enforcing standard compliance?
In that case, would you agree with explaining this in a comment?

Member Author

Additionally, the iostream interface suggests that the write and the read position are in fact two different things. They are implemented as one and the same position, but my version deals explicitly with the strangeness of the seekp and seekg functions.
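
A self-contained sketch of the pattern under discussion, with a made-up file name: on a single std::fstream the standard (via the underlying C stdio rules) requires a file-positioning call between a read and a following write, and seeking to the current position satisfies that without moving the offset:

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
  const std::string name = "inout.bin";  // hypothetical file, for illustration only
  { std::ofstream create(name, std::ios_base::binary); create << "abcdef"; }

  std::fstream f(name, std::ios_base::in | std::ios_base::out | std::ios_base::binary);
  char buf[3];
  f.read(buf, 3);  // read "abc"; the shared file position is now at offset 3

  // Switching from reading to writing needs an intervening seek; seeking to the
  // current position keeps the offset while making the switch well-defined.
  // (A flush alone is only specified for the opposite direction, write -> read.)
  f.seekp(f.tellp());
  f.write("XYZ", 3);  // overwrites "def"
  f.close();

  std::ifstream check(name, std::ios_base::binary);
  std::string contents((std::istreambuf_iterator<char>(check)),
                       std::istreambuf_iterator<char>());
  std::cout << contents << '\n';  // prints "abcXYZ"
}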

Member

Ok I see


if (!keepTempFiles) {
// remove temporary files only used during index creation
LOG(INFO) << "Removing temporary files (partial vocabulary and external text file...\n";
string removeCommand1 = "rm " + onDiskBase + EXTERNAL_LITS_TEXT_FILE_NAME;
Member

use "rm --" to make sure no one can pass e.g. -rf as part of onDiskBase. Also I'm not sure if we really should use the system() function here, sadly the only nice standard way needs C++17. Now that Ubuntu 18.04 is out I'd be willing to use that in QLever though for a lot of our machines that means it really only works in a docker container at the moment. So maybe for now just add a TODO

// _____________________________________________________________________________
void Index::createFromNTriplesFile(const string& ntFile,
const string& onDiskBase,
bool allPermutations, bool onDiskLiterals) {
bool allPermutations, bool onDiskLiterals,
bool keepTempFiles) {
Member

This indent looks wrong; did you try using our .clang-format settings?


std::vector<std::fstream> infiles;
// currently only one word is buffered, no matter how big the bufferSize is
// TODO: buffer, or throw the buffer out
Member

If you threw it out, remove this comment

add_library(tests
SparqlParserTest
StringUtilsTest
StringUtilsTest
Member

This looks wrongly indented.

@@ -69,4 +73,5 @@ add_library(tests
ConversionsTest
SparsehashTest
GroupByTest
)
VocabularyGeneratorTest
Member

also mismatched indent

_path0 = std::string(PARTIAL_VOCAB_FILE_NAME + std::to_string(0));
_path1 = std::string(PARTIAL_VOCAB_FILE_NAME + std::to_string(1));
// those names can be random
_pathExp0 = std::string(".partialVocabExp0");
Member

I thought we moved to snake_case_names?

@joka921 joka921 Jul 11, 2018

I think for filenames it was names-with-dashes; fixed.

@niklas88 niklas88 left a comment

Just a couple of missing spaces left. Feel free to merge afterwards.

@@ -99,9 +99,12 @@ Index::ExtVec Index::createExtVecAndVocabFromNTriples(const string& ntFile) {
if (!_keepTempFiles) {
// remove temporary files only used during index creation
LOG(INFO) << "Removing temporary files (partial vocabulary and external text file...\n";
string removeCommand1 = "rm " + _onDiskBase + EXTERNAL_LITS_TEXT_FILE_NAME;

//TODO: using system and rm is not really elegant nor portable.
Member

Missing space between // and TODO


//TODO: using system and rm is not really elegant nor portable.
// use std::filesystem as soon as QLever is ported to C++17
string removeCommand1 = "rm --" + _onDiskBase + EXTERNAL_LITS_TEXT_FILE_NAME;
Member

Missing space before _onDiskBase; sorry, this was my mistake.


- base name and externalization settings are now handled via
  member functions of the Index class.
- exception: createFromOnDiskIndex still gets the name of the index and saves it
  internally (I considered this to be more intuitive)
@niklas88 niklas88 merged commit 6db0324 into ad-freiburg:master Jul 11, 2018
@joka921 joka921 deleted the f.efficientIndexCreation branch August 23, 2022 20:40