Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefix Compression and faster startup time #105

Merged
merged 2 commits into from Sep 1, 2018

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Aug 19, 2018

Current status: I am going to start a self-review later/tomorrow
This is a lot of change for one PR but they are separated relatively well from each other.

  • Implemented the Prefix Compression Heuristic using a Trie-based
    greedy algorithm

  • Integrated Prefix compression into the Vocabulary class
    (templated for prefix compression / no prefix compression)

  • Converted SPO and SOP to Mmap based data

  • faster startup time for ServerMain by not iterating over the complete
    Mmap Vector for statistics

  • Added nlohnmann/json as a submodule

  • IndexBuilder writes a configuration file (json) that contains
    information needed for the prefix compression. Can be extended for
    other settings and statistics

  • it is also possible to externalize entities from the vocabulary (e.g.
    Wikidata statement ids)

  • it is possible to pass a settings-json file to the index builder.
    Currently supported: Prefixes that shall be externalized (see above)

permutName + MMAP_FILE_SUFFIX);
*/
addBlockListToMmapMetaDataPermutation(permutName,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I will add the automatic version detection so that everything is correctly converted and only if necessary

@joka921 joka921 requested a review from niklas88 August 20, 2018 09:35
@@ -509,7 +509,7 @@ void Index::openTextFileHandle() {
}

// _____________________________________________________________________________
string Index::wordIdToString(Id id) const { return _textVocab[id]; }
const string& Index::wordIdToString(Id id) const { return _textVocab[id]; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how does this not return a reference to a local variable? When the string is compressed, AccessReturnType is std::string and the result of expandPrefix(). Now this returns a const string& to it, but then that returned string immediately goes out of scope?!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _textVocab is not compressed and thus is able to return a valid reference.
I could also change this return value to be the proper AccessReturnType but I liked this interface more and as soon as we would think about compressing the text vocabulary, the compiler will complain.

@@ -163,6 +169,13 @@ using IndexMetaDataMmapView = IndexMetaData<
// constants for Magic Numbers to separate different types of MetaData;
const size_t MAGIC_NUMBER_MMAP_META_DATA = static_cast<size_t>(-1);
const size_t MAGIC_NUMBER_SPARSE_META_DATA = static_cast<size_t>(-2);
const size_t MAGIC_NUMBER_MMAP_META_DATA_BLOCK_LIST = static_cast<size_t>(-3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this gone? The version is still called V_BLOCK_LIST_AND_STATISTICS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would suggest a magic number that stays fixed from now on and a version Tag which can be incremented. The magic number must still discriminate whether we have a version tag at all, and thus we have 4 Magic Numbers for sparse/dense and with/without version.

The version tag describes the features of this version (thus "block list" and "statistics").

In one of the next PRs I will clean the serialization/deserialization of the IndexMetaData a bit more, e.g. only have one createFromByteBuffer that handles sparse and dense meta data using constexpr if etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought wherever we added a magic number we also added a version field? If not maybe we should just drop support for those files where we didn't do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is a design decision:

  1. Dropping the support makes the code easier
  2. Dropping the support breaks ALL Indices created so far
  3. We could also archive the old versions of createFromByteBuf somewhere for the MetaDataConverter. We would then have some code duplications, but somewhere in a frozen state far away from the actual production code.

Just tell me what you like best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean didn't we add a version wherever we added a magic number? So that we only got two formats

  • No magic number and no version = legacy format
  • Magic number and version = can increment version and keep magic number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we forgot to add a version with the IndexMetaData.
But I think we could implement your suggestion and we would break only the Indices built in the last days, using the "MMapBasedVector"-PR. I think this only affects your freebase builds which were converted from old builds and could be converted again. Then we would only have ONE magic number for sparse/dense each and the rest is handled by a version tag which I consider to be cleaner. What do you think of this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I converted most of our QLever indices already and @hannahbast built a new version of Wikipedia + Freebase Easy but that's ok. I'd rather we break some intermediate indices now than be left with a worse design later.

Copy link
Member Author

@joka921 joka921 Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niklas88 Ok, I think I finally got you now.( I first implemented a version that can read ANYTHING). What I can do with this design is Inform the User, that his index can not be fixed (should only affect you and Prof.Bast)
EDIT:
This should not be necessary, we can keep everything.
The versioning is done in an extra function, where the "intermediate" Magic number is correctly converted to the version tag 0. What I could do is Rename the constants so that the
old ones have a _DEPRECATED suffix and the newer ones NO _VERSION suffix.
But I think there have been so many changes that you could have a look again at my completely refactored serialization of the createFromByteBuffer things.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far

@@ -4,3 +4,6 @@
[submodule "third_party/googletest"]
path = third_party/googletest
url = https://github.com/google/googletest.git
[submodule "third_party/json"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think version 3.2 was released 2 days ago so we might want to make sure we are at that version

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed, how do I choose/fixate a certain commit of a certain branch of the submodule for my project?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMakeLists.txt Outdated
@@ -106,6 +113,9 @@ target_link_libraries (WriteIndexListsMain engine ${CMAKE_THREAD_LIBS_INIT})
add_executable(MetaDataConverterMain src/MetaDataConverterMain.cpp)
target_link_libraries (MetaDataConverterMain metaConverter ${CMAKE_THREAD_LIBS_INIT})

add_executable(PrefixHeuristicMain src/PrefixHeuristicMain.cpp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that executable needs a better name, to make it clear what it does. I.e. does it just print the compression list as information or does it compress an existing vocabulary on disk..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is right. In fact this currently does not do anything "useful" for Qlever but I probably can use it for my thesis to evaluate the potential of the method. I integrated it here to make use of the easy linking with cmake. Is it okay to leave it there?

And actually you remind me of something: we have to update the converter or some part of the program to also convert/compress the vocabulary to the new format (one of the disadvantages of "the first char is always a prefix code"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it but it should say what it does

e2e/e2e.sh Outdated
@@ -41,7 +41,7 @@ INDEX="e2e_data/scientists-index"
if [ "$1" != "no-index" ]; then
rm -f "$INDEX.*"
pushd "./build"
./IndexBuilderMain -a -l -i "../$INDEX" \
./IndexBuilderMain -a -l -c -i "../$INDEX" \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have quite a few flags. Maybe we should start making some of them the default/mandatory. Is there any reason why one would not want compression? I think eliminating some of those could even affect some hot paths or at least make the code simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you agree to make compression the default (which perfectly makes sense), I happily agree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we should do that, especially since it's a different vocabulary format and otherwise that will lead to confusion.

@@ -0,0 +1,19 @@
// Copyright 2018, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach<joka921> (johannes.kalmbach@gmail.com)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a comment what it does. Also this should probably be added to the usage string. At least if we expect that tool to be useful in the future i.e.

@@ -15,7 +15,7 @@ int main(int argc, char** argv) {
exit(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a bit of additional information on what this does

*/

// TODO< this overload is needed for creating the prefix heuristic
template <class StringRange, typename = std::enable_if_t<_isCompressed>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document what StringRange might be


// set the list of prefixes for words which will become part of the
// externalized vocabulary. Good for entity names that normally don't appear
// in queries or results but take a lot of space (e.g. Wikidata statements)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

// in the same order as the infile
// prefixes - a list of prefixes which we will compress
//
// Returns: A json array with information about the prefixes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have this as a json object rather than say vector<tuple<size_t, string>>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right, then we can handle the internals of the .configuration file all in one place in the index class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it for this place. But I am not so sure anymore if this is the way to go (it also affects some different places):
This is a serialization information, which the Vocabulary needs exactly in this form to setup the compression when initializing an index from disk. The index class does not need to understand the exact format but only has to store this serialized information (in this case a json type) in a file.

I think this is a common pattern when serializing, just call serialize on all the members:)

@@ -0,0 +1,346 @@
// Copyright 2011, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Björn Buchhold <buchholb>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add youself as author, also in the non impl header

@@ -61,6 +61,9 @@ inline bool endsWith(const string& text, const string& suffix);
//! will return false. Case sensitive.
inline bool endsWith(const string& text, const char* suffix);

//! Returns the longest prefix that the two arguments have in common
inline string commonPrefix(const string& a, const string& b);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string_view?

@joka921
Copy link
Member Author

joka921 commented Aug 22, 2018

And don't worry about that test failure, I found the bug and will fix it tomorrow afternoon.
The test is not good there: it implicitly checks implementation details (assumes a certain size for the serialization of the MetaData instead of explicitly checking it or blackbox test it). The effect was a nondeterministic test failure (works on my machine) for code which is probably fine. You'll see what I mean tomorrow.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more comments but this is shaping up nicely

@@ -9,6 +9,15 @@
#include "./util/File.h"

// _________________________________________________________
// Opens an index from disk. Determines whether this index was built by an older
// QLever version and has to be updated in ordere to use it (efficiently or at
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order

@@ -51,3 +51,12 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30;
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;

static const std::string MMAP_FILE_SUFFIX = ".meta-mmap";
static const std::string CONFIGURATION_FILE = ".configuration";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

".conf" is more canonical I think

@@ -51,3 +51,12 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30;
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;

static const std::string MMAP_FILE_SUFFIX = ".meta-mmap";
static const std::string CONFIGURATION_FILE = ".configuration";

static constexpr size_t MIN_COMPRESSION_PREFIX = 128;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be unit8_t to make it clear that they fit in a single byte

static constexpr size_t NUM_COMPRESSION_PREFIXES = 127;
// if this is the first character of a compressed string, this means that no
// compression has been applied to a word
static const char NO_PREFIX_CHAR =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char signedness is implementation defined (and different between ARM Linux and x86_64 Linux) so on x86_64 this will be a negative number while on ARM it will be positive. Therefore I'd make this uint8_t as well.

}

// explicit conversions to strings and string_views
string toString() const { return *this; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is toType() a convention we use? I think it's asString() for QueryExecutionTree. Is there some standard for this? I think in Boost I've seen foo.as<Bar>()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c++ stl has std::to_string(42); But if this is the convention I am going to change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No makes sense to keep it in our camel case format.

// __________________________________________________________________________
inline void notifyCreated(const string& filename, bool hasConvertedSuffix) {
if (hasConvertedSuffix) {
std::cout << "created new file " << filename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice

template <typename>
string Vocabulary<S>::expandPrefix(const CompressedString& word) const {
assert(!word.empty());
auto idx = static_cast<unsigned char>(word[0]) - MIN_COMPRESSION_PREFIX;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find unint8_t clearer here because it's not really a character

if (idx < NUM_COMPRESSION_PREFIXES) {
return _prefixMap[idx] + word.toStringView().substr(1);
} else {
return string(word.toStringView().substr(1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the outer string() is redundant here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not. Converting from string_view to string is not possible and construction is explicit.

if (ad_utility::startsWith(word, p._fulltext)) {
auto res = CompressedString::fromString(
p._prefix + std::string_view(word).substr(p._fulltext.size()));
LOG(DEBUG) << "compressed " << word << " to " << res.toString() << '\n';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that toString() needed? I'd think that ostream handles string_view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a string view but a CompressedString. But I can teach ostream to do this, would probably be cleaner. But in this place I anyway want to throw out that LOG, because it is unreadable and much too verbose for every word. (This was for tracking a bug, that has been removed for some time now).

el = "";
}
_prefixVec.clear();
unsigned char idx = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit8_t

static constexpr size_t MIN_COMPRESSION_PREFIX = 128;
static constexpr size_t NUM_COMPRESSION_PREFIXES = 127;
// Constants for the range of valid compression prefixes
// all printable characters are left out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say ASCII not printable, we are assuming UTF-8 in most of QLever. Also, is this really relevant now that we always use the first byte to indicate compression. We could just as easily use 'N' and 'C' for (non-) compressed.

@niklas88
Copy link
Member

Can you rebase on the freshly merged disambiguate optional work with the changes to operator[] as discussed in that PR. After that I think we can merge this.

@@ -34,7 +34,7 @@ int main(int argc, char** argv) {
exit(1);
}

for (const auto& p : calculatePrefixes(argv[1], 127, 1)) {
for (const auto& p : calculatePrefixes(argv[1], 127, 1, true)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should use the NUM_COMPRESSION_PREFIXES constant so that it always matches the encoding actually in use

@niklas88
Copy link
Member

So am I seeing this right, the last couple of e2e test failures were just flakyness?

@joka921
Copy link
Member Author

joka921 commented Aug 29, 2018

@niklas88 The first one was a bug with getopt (I removed the -l flag which was still present in the e2e test).
The second failure I did not understand (worked fine locally, also with Release builds). Obviously the server failed to open, so I added a Server run in the foreground which of course let the test stall but yielded no results. After simply removing this again it worked fine so I assume that this one build I cannot explain (3ccb9ec) was some versioning/checkout problem with Travis.

@niklas88
Copy link
Member

@joka921 yeah, maybe something todo with the server startup. I've seen Travis CI being flaky before so don't worry. What are we still missing besides rebasing into coherent commits?

@joka921
Copy link
Member Author

joka921 commented Aug 29, 2018

I Implemented the changes we discussed on Monday so I think we are good for now. The things that could be improved are all stuff for separate PRs in my opinion. If you agree I will squash the commits and merge this, I just wanted to give you the chance to look at it again first.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after fixing the redefinition compilation error

@niklas88 niklas88 added this to Review in QLever Aug 31, 2018
@niklas88 niklas88 moved this from Review to Reviewer approved in QLever Aug 31, 2018
- Implemented the Prefix Compression Heuristic using a Trie-based
  greedy algorithm

- Integrated Prefix compression into the Vocabulary class
  (templated for prefix compression / no prefix compression)

- Converted SPO and SOP to Mmap based data

- faster startup time for ServerMain by not iterating over the complete
  Mmap Vector for statistics

- Added nlohnmann/json as a submodule
- IndexBuilder writes a configuration file (json) that contains
  information needed for the prefix compression. Can be extended for
  other settings and statistics

- it is also possible to externalize entities from the vocabulary (e.g.
  Wikidata statement ids)

- it is possible to pass a settings-json file to the index builder.
  Currently supported: Prefixes that shall be externalized (see above)

 - Storing statistics in MMap based Meta data
   - only calculate expensive statistics at index creation time
- also add a simple versioning system to the meta data
QLever automation moved this from Review Approved to Review Sep 1, 2018
Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix compilation

-- from every format for IndexMetaData we have introduced so far it is
now possible to automatically determine the correct version to read and
convert it

Eliminated -l flag for ServerMain

- this boolean flag can not be chosen by the user and has to match the
  settings chosen at index-build time. This information is passed via
  the .meta-data.json file

Prefix Compression now also takes into account that
we ALWAYS add a code prefix even if there's nothing to compress
that way, the Prefix " will also be compressed (one byte of saving per
literal that is not being compressed otherwise)

Separation between different types of vocabulary.

We have disabled externalizing parts of an uncompressed vocabulary,
because there the method resolving IDs is returning a reference which
also does not work with external literals.
@joka921 joka921 merged commit 240daaa into ad-freiburg:master Sep 1, 2018
QLever automation moved this from Review to Done Sep 1, 2018
@joka921 joka921 deleted the f.prefixCompressionNew branch August 23, 2022 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
QLever
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

2 participants