New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prefix Compression and faster startup time #105
Conversation
src/MetaDataConverterMain.cpp
Outdated
permutName + MMAP_FILE_SUFFIX); | ||
*/ | ||
addBlockListToMmapMetaDataPermutation(permutName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I will add the automatic version detection so that everything is correctly converted and only if necessary
src/index/Index.Text.cpp
Outdated
@@ -509,7 +509,7 @@ void Index::openTextFileHandle() { | |||
} | |||
|
|||
// _____________________________________________________________________________ | |||
string Index::wordIdToString(Id id) const { return _textVocab[id]; } | |||
const string& Index::wordIdToString(Id id) const { return _textVocab[id]; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how does this not return a reference to a local variable? When the string is compressed, AccessReturnType
is std::string
and the result of expandPrefix()
. Now this returns a const string&
to it, but then that returned string immediately goes out of scope?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _textVocab
is not compressed and thus is able to return a valid reference.
I could also change this return value to be the proper AccessReturnType but I liked this interface more and as soon as we would think about compressing the text vocabulary, the compiler will complain.
src/index/IndexMetaData.h
Outdated
@@ -163,6 +169,13 @@ using IndexMetaDataMmapView = IndexMetaData< | |||
// constants for Magic Numbers to separate different types of MetaData; | |||
const size_t MAGIC_NUMBER_MMAP_META_DATA = static_cast<size_t>(-1); | |||
const size_t MAGIC_NUMBER_SPARSE_META_DATA = static_cast<size_t>(-2); | |||
const size_t MAGIC_NUMBER_MMAP_META_DATA_BLOCK_LIST = static_cast<size_t>(-3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this gone? The version is still called V_BLOCK_LIST_AND_STATISTICS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would suggest a magic number that stays fixed from now on and a version Tag which can be incremented. The magic number must still discriminate whether we have a version tag at all, and thus we have 4 Magic Numbers for sparse/dense and with/without version.
The version tag describes the features of this version (thus "block list" and "statistics").
In one of the next PRs I will clean the serialization/deserialization of the IndexMetaData a bit more, e.g. only have one createFromByteBuffer that handles sparse and dense meta data using constexpr if etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought wherever we added a magic number we also added a version field? If not maybe we should just drop support for those files where we didn't do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is a design decision:
- Dropping the support makes the code easier
- Dropping the support breaks ALL Indices created so far
- We could also archive the old versions of createFromByteBuf somewhere for the MetaDataConverter. We would then have some code duplications, but somewhere in a frozen state far away from the actual production code.
Just tell me what you like best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean didn't we add a version wherever we added a magic number? So that we only got two formats
- No magic number and no version = legacy format
- Magic number and version = can increment version and keep magic number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately we forgot to add a version with the IndexMetaData.
But I think we could implement your suggestion and we would break only the Indices built in the last days, using the "MMapBasedVector"-PR. I think this only affects your freebase builds which were converted from old builds and could be converted again. Then we would only have ONE magic number for sparse/dense each and the rest is handled by a version tag which I consider to be cleaner. What do you think of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I converted most of our QLever indices already and @hannahbast built a new version of Wikipedia + Freebase Easy but that's ok. I'd rather we break some intermediate indices now than be left with a worse design later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@niklas88 Ok, I think I finally got you now.( I first implemented a version that can read ANYTHING). What I can do with this design is Inform the User, that his index can not be fixed (should only affect you and Prof.Bast)
EDIT:
This should not be necessary, we can keep everything.
The versioning is done in an extra function, where the "intermediate" Magic number is correctly converted to the version tag 0. What I could do is Rename the constants so that the
old ones have a _DEPRECATED suffix and the newer ones NO _VERSION suffix.
But I think there have been so many changes that you could have a look again at my completely refactored serialization of the createFromByteBuffer things.
6c9e7b1
to
8291aff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great so far
@@ -4,3 +4,6 @@ | |||
[submodule "third_party/googletest"] | |||
path = third_party/googletest | |||
url = https://github.com/google/googletest.git | |||
[submodule "third_party/json"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think version 3.2 was released 2 days ago so we might want to make sure we are at that version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed, how do I choose/fixate a certain commit of a certain branch of the submodule for my project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMakeLists.txt
Outdated
@@ -106,6 +113,9 @@ target_link_libraries (WriteIndexListsMain engine ${CMAKE_THREAD_LIBS_INIT}) | |||
add_executable(MetaDataConverterMain src/MetaDataConverterMain.cpp) | |||
target_link_libraries (MetaDataConverterMain metaConverter ${CMAKE_THREAD_LIBS_INIT}) | |||
|
|||
add_executable(PrefixHeuristicMain src/PrefixHeuristicMain.cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that executable needs a better name, to make it clear what it does. I.e. does it just print the compression list as information or does it compress an existing vocabulary on disk..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is right. In fact this currently does not do anything "useful" for Qlever but I probably can use it for my thesis to evaluate the potential of the method. I integrated it here to make use of the easy linking with cmake. Is it okay to leave it there?
And actually you remind me of something: we have to update the converter or some part of the program to also convert/compress the vocabulary to the new format (one of the disadvantages of "the first char is always a prefix code"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep it but it should say what it does
e2e/e2e.sh
Outdated
@@ -41,7 +41,7 @@ INDEX="e2e_data/scientists-index" | |||
if [ "$1" != "no-index" ]; then | |||
rm -f "$INDEX.*" | |||
pushd "./build" | |||
./IndexBuilderMain -a -l -i "../$INDEX" \ | |||
./IndexBuilderMain -a -l -c -i "../$INDEX" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have quite a few flags. Maybe we should start making some of them the default/mandatory. Is there any reason why one would not want compression? I think eliminating some of those could even affect some hot paths or at least make the code simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you agree to make compression the default (which perfectly makes sense), I happily agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think we should do that, especially since it's a different vocabulary format and otherwise that will lead to confusion.
src/PrefixHeuristicMain.cpp
Outdated
@@ -0,0 +1,19 @@ | |||
// Copyright 2018, University of Freiburg, | |||
// Chair of Algorithms and Data Structures. | |||
// Author: Johannes Kalmbach<joka921> (johannes.kalmbach@gmail.com) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a comment what it does. Also this should probably be added to the usage string. At least if we expect that tool to be useful in the future i.e.
@@ -15,7 +15,7 @@ int main(int argc, char** argv) { | |||
exit(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a bit of additional information on what this does
src/index/Vocabulary.h
Outdated
*/ | ||
|
||
// TODO< this overload is needed for creating the prefix heuristic | ||
template <class StringRange, typename = std::enable_if_t<_isCompressed>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document what StringRange
might be
|
||
// set the list of prefixes for words which will become part of the | ||
// externalized vocabulary. Good for entity names that normally don't appear | ||
// in queries or results but take a lot of space (e.g. Wikidata statements) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above
// in the same order as the infile | ||
// prefixes - a list of prefixes which we will compress | ||
// | ||
// Returns: A json array with information about the prefixes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have this as a json object rather than say vector<tuple<size_t, string>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right, then we can handle the internals of the .configuration file all in one place in the index class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it for this place. But I am not so sure anymore if this is the way to go (it also affects some different places):
This is a serialization information, which the Vocabulary needs exactly in this form to setup the compression when initializing an index from disk. The index class does not need to understand the exact format but only has to store this serialized information (in this case a json type) in a file.
I think this is a common pattern when serializing, just call serialize on all the members:)
src/index/VocabularyImpl.h
Outdated
@@ -0,0 +1,346 @@ | |||
// Copyright 2011, University of Freiburg, | |||
// Chair of Algorithms and Data Structures. | |||
// Author: Björn Buchhold <buchholb> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add youself as author, also in the non impl header
src/util/StringUtils.h
Outdated
@@ -61,6 +61,9 @@ inline bool endsWith(const string& text, const string& suffix); | |||
//! will return false. Case sensitive. | |||
inline bool endsWith(const string& text, const char* suffix); | |||
|
|||
//! Returns the longest prefix that the two arguments have in common | |||
inline string commonPrefix(const string& a, const string& b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string_view
?
And don't worry about that test failure, I found the bug and will fix it tomorrow afternoon. |
f1983e7
to
888b6b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more comments but this is shaping up nicely
src/MetaDataConverterMain.cpp
Outdated
@@ -9,6 +9,15 @@ | |||
#include "./util/File.h" | |||
|
|||
// _________________________________________________________ | |||
// Opens an index from disk. Determines whether this index was built by an older | |||
// QLever version and has to be updated in ordere to use it (efficiently or at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order
src/global/Constants.h
Outdated
@@ -51,3 +51,12 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30; | |||
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19; | |||
|
|||
static const std::string MMAP_FILE_SUFFIX = ".meta-mmap"; | |||
static const std::string CONFIGURATION_FILE = ".configuration"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
".conf"
is more canonical I think
src/global/Constants.h
Outdated
@@ -51,3 +51,12 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30; | |||
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19; | |||
|
|||
static const std::string MMAP_FILE_SUFFIX = ".meta-mmap"; | |||
static const std::string CONFIGURATION_FILE = ".configuration"; | |||
|
|||
static constexpr size_t MIN_COMPRESSION_PREFIX = 128; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these should be unit8_t
to make it clear that they fit in a single byte
src/global/Constants.h
Outdated
static constexpr size_t NUM_COMPRESSION_PREFIXES = 127; | ||
// if this is the first character of a compressed string, this means that no | ||
// compression has been applied to a word | ||
static const char NO_PREFIX_CHAR = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
char
signedness is implementation defined (and different between ARM Linux and x86_64 Linux) so on x86_64 this will be a negative number while on ARM it will be positive. Therefore I'd make this uint8_t
as well.
} | ||
|
||
// explicit conversions to strings and string_views | ||
string toString() const { return *this; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is toType()
a convention we use? I think it's asString()
for QueryExecutionTree
. Is there some standard for this? I think in Boost I've seen foo.as<Bar>()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c++ stl has std::to_string(42); But if this is the convention I am going to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No makes sense to keep it in our camel case format.
// __________________________________________________________________________ | ||
inline void notifyCreated(const string& filename, bool hasConvertedSuffix) { | ||
if (hasConvertedSuffix) { | ||
std::cout << "created new file " << filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice
src/index/VocabularyImpl.h
Outdated
template <typename> | ||
string Vocabulary<S>::expandPrefix(const CompressedString& word) const { | ||
assert(!word.empty()); | ||
auto idx = static_cast<unsigned char>(word[0]) - MIN_COMPRESSION_PREFIX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find unint8_t
clearer here because it's not really a character
src/index/VocabularyImpl.h
Outdated
if (idx < NUM_COMPRESSION_PREFIXES) { | ||
return _prefixMap[idx] + word.toStringView().substr(1); | ||
} else { | ||
return string(word.toStringView().substr(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the outer string()
is redundant here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is not. Converting from string_view to string is not possible and construction is explicit.
src/index/VocabularyImpl.h
Outdated
if (ad_utility::startsWith(word, p._fulltext)) { | ||
auto res = CompressedString::fromString( | ||
p._prefix + std::string_view(word).substr(p._fulltext.size())); | ||
LOG(DEBUG) << "compressed " << word << " to " << res.toString() << '\n'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that toString()
needed? I'd think that ostream
handles string_view
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a string view but a CompressedString. But I can teach ostream to do this, would probably be cleaner. But in this place I anyway want to throw out that LOG, because it is unreadable and much too verbose for every word. (This was for tracking a bug, that has been removed for some time now).
src/index/VocabularyImpl.h
Outdated
el = ""; | ||
} | ||
_prefixVec.clear(); | ||
unsigned char idx = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unit8_t
src/global/Constants.h
Outdated
static constexpr size_t MIN_COMPRESSION_PREFIX = 128; | ||
static constexpr size_t NUM_COMPRESSION_PREFIXES = 127; | ||
// Constants for the range of valid compression prefixes | ||
// all printable characters are left out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say ASCII not printable, we are assuming UTF-8 in most of QLever. Also, is this really relevant now that we always use the first byte to indicate compression. We could just as easily use 'N'
and 'C'
for (non-) compressed.
Can you rebase on the freshly merged disambiguate optional work with the changes to |
a54c454
to
99d9d34
Compare
src/PrefixHeuristicEvaluatorMain.cpp
Outdated
@@ -34,7 +34,7 @@ int main(int argc, char** argv) { | |||
exit(1); | |||
} | |||
|
|||
for (const auto& p : calculatePrefixes(argv[1], 127, 1)) { | |||
for (const auto& p : calculatePrefixes(argv[1], 127, 1, true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should use the NUM_COMPRESSION_PREFIXES
constant so that it always matches the encoding actually in use
So am I seeing this right, the last couple of e2e test failures were just flakyness? |
@niklas88 The first one was a bug with getopt (I removed the -l flag which was still present in the e2e test). |
@joka921 yeah, maybe something todo with the server startup. I've seen Travis CI being flaky before so don't worry. What are we still missing besides rebasing into coherent commits? |
I Implemented the changes we discussed on Monday so I think we are good for now. The things that could be improved are all stuff for separate PRs in my opinion. If you agree I will squash the commits and merge this, I just wanted to give you the chance to look at it again first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after fixing the redefinition compilation error
- Implemented the Prefix Compression Heuristic using a Trie-based greedy algorithm - Integrated Prefix compression into the Vocabulary class (templated for prefix compression / no prefix compression) - Converted SPO and SOP to Mmap based data - faster startup time for ServerMain by not iterating over the complete Mmap Vector for statistics - Added nlohnmann/json as a submodule - IndexBuilder writes a configuration file (json) that contains information needed for the prefix compression. Can be extended for other settings and statistics - it is also possible to externalize entities from the vocabulary (e.g. Wikidata statement ids) - it is possible to pass a settings-json file to the index builder. Currently supported: Prefixes that shall be externalized (see above) - Storing statistics in MMap based Meta data - only calculate expensive statistics at index creation time - also add a simple versioning system to the meta data
34c1977
to
e9279fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix compilation
-- from every format for IndexMetaData we have introduced so far it is now possible to automatically determine the correct version to read and convert it Eliminated -l flag for ServerMain - this boolean flag can not be chosen by the user and has to match the settings chosen at index-build time. This information is passed via the .meta-data.json file Prefix Compression now also takes into account that we ALWAYS add a code prefix even if there's nothing to compress that way, the Prefix " will also be compressed (one byte of saving per literal that is not being compressed otherwise) Separation between different types of vocabulary. We have disabled externalizing parts of an uncompressed vocabulary, because there the method resolving IDs is returning a reference which also does not work with external literals.
e9279fa
to
f4bf41a
Compare
Current status: I am going to start a self-review later/tomorrow
This is a lot of change for one PR but they are separated relatively well from each other.
Implemented the Prefix Compression Heuristic using a Trie-based
greedy algorithm
Integrated Prefix compression into the Vocabulary class
(templated for prefix compression / no prefix compression)
Converted SPO and SOP to Mmap based data
faster startup time for ServerMain by not iterating over the complete
Mmap Vector for statistics
Added nlohnmann/json as a submodule
IndexBuilder writes a configuration file (json) that contains
information needed for the prefix compression. Can be extended for
other settings and statistics
it is also possible to externalize entities from the vocabulary (e.g.
Wikidata statement ids)
it is possible to pass a settings-json file to the index builder.
Currently supported: Prefixes that shall be externalized (see above)