Prefix Compression and faster startup time #105

joka921 · 2018-08-19T15:33:12Z

Current status: I am going to start a self-review later/tomorrow.
This is a lot of changes for one PR, but they are relatively well separated from each other.

- Implemented the prefix compression heuristic using a trie-based greedy algorithm.
- Integrated prefix compression into the Vocabulary class (templated for prefix compression / no prefix compression).
- Converted SPO and SOP to Mmap-based data.
- Faster startup time for ServerMain by not iterating over the complete Mmap vector for statistics.
- Added nlohmann/json as a submodule.
- IndexBuilder writes a configuration file (JSON) that contains the information needed for the prefix compression. It can be extended for other settings and statistics.
- It is also possible to externalize entities from the vocabulary (e.g. Wikidata statement IDs).
- It is possible to pass a settings JSON file to the index builder. Currently supported: prefixes that shall be externalized (see above).

joka921 · 2018-08-19T15:35:02Z

src/MetaDataConverterMain.cpp

                                        permutName + MMAP_FILE_SUFFIX);
+                                        */
+    addBlockListToMmapMetaDataPermutation(permutName,


Here I will add the automatic version detection, so that everything is converted correctly and only when necessary.

niklas88 · 2018-08-20T10:57:49Z

src/index/Index.Text.cpp

@@ -509,7 +509,7 @@ void Index::openTextFileHandle() {
 }

 // _____________________________________________________________________________
-string Index::wordIdToString(Id id) const { return _textVocab[id]; }
+const string& Index::wordIdToString(Id id) const { return _textVocab[id]; }


Hmm, how does this not return a reference to a local variable? When the string is compressed, AccessReturnType is std::string and the result of expandPrefix(). Now this returns a const string& to it, but then that returned string immediately goes out of scope?!

The _textVocab is not compressed and is therefore able to return a valid reference.
I could also change this return value to the proper AccessReturnType, but I liked this interface more, and as soon as we think about compressing the text vocabulary, the compiler will complain.
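To make the lifetime question concrete, here is a minimal sketch with two hypothetical stand-in classes (`PlainVocab` and `CompressedVocab` are not QLever types, and the hard-coded prefix is made up). It only illustrates the general rule: a reference is fine when it points into storage owned by the vocabulary, while a word that is rebuilt on access has to be returned by value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an uncompressed vocabulary: the words are stored
// as std::string, so operator[] can hand out a reference to a stored element.
class PlainVocab {
 public:
  explicit PlainVocab(std::vector<std::string> words)
      : _words(std::move(words)) {}
  const std::string& operator[](std::size_t id) const { return _words[id]; }

 private:
  std::vector<std::string> _words;
};

// Hypothetical stand-in for a compressed vocabulary: the word is rebuilt on
// every access, so operator[] has to return by value.
class CompressedVocab {
 public:
  explicit CompressedVocab(std::vector<std::string> suffixes)
      : _suffixes(std::move(suffixes)) {}
  std::string operator[](std::size_t id) const {
    return "<http://" + _suffixes[id];  // made-up prefix for illustration
  }

 private:
  std::vector<std::string> _suffixes;
};

// Safe: the returned reference points into storage owned by `vocab`.
inline const std::string& wordIdToString(const PlainVocab& vocab,
                                         std::size_t id) {
  return vocab[id];
}

// Returning `const std::string&` here would refer to a temporary that is
// destroyed when the function returns, so this overload returns by value.
inline std::string wordIdToString(const CompressedVocab& vocab,
                                  std::size_t id) {
  return vocab[id];
}
```

Most compilers only warn about returning a reference to a temporary, so it is worth checking that the warning is actually enabled if the interface relies on it.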

niklas88 · 2018-08-20T11:00:36Z

src/index/IndexMetaData.h

@@ -163,6 +169,13 @@ using IndexMetaDataMmapView = IndexMetaData<
 // constants for Magic Numbers to separate different types of MetaData;
 const size_t MAGIC_NUMBER_MMAP_META_DATA = static_cast<size_t>(-1);
 const size_t MAGIC_NUMBER_SPARSE_META_DATA = static_cast<size_t>(-2);
-const size_t MAGIC_NUMBER_MMAP_META_DATA_BLOCK_LIST = static_cast<size_t>(-3);


Where has this gone? The version is still called V_BLOCK_LIST_AND_STATISTICS.

Yes, I would suggest a magic number that stays fixed from now on and a version tag that can be incremented. The magic number must still tell us whether we have a version tag at all, so we end up with four magic numbers: sparse/dense, each with/without version.

The version tag describes the features of this version (thus "block list" and "statistics").

In one of the next PRs I will clean the serialization/deserialization of the IndexMetaData a bit more, e.g. only have one createFromByteBuffer that handles sparse and dense meta data using constexpr if etc.

I thought wherever we added a magic number we also added a version field? If not maybe we should just drop support for those files where we didn't do that?

Okay, this is a design decision:

- Dropping the support makes the code simpler.
- Dropping the support breaks ALL indices created so far.

We could also archive the old versions of createFromByteBuf somewhere for the MetaDataConverter. We would then have some code duplications, but somewhere in a frozen state far away from the actual production code.

Just tell me what you like best.

I mean, didn't we add a version wherever we added a magic number? Then we would only have two formats:

- No magic number and no version = legacy format
- Magic number and version = can increment version and keep magic number

Unfortunately, we forgot to add a version to the IndexMetaData.
But I think we could implement your suggestion, and we would only break the indices built in the last few days using the "MMapBasedVector" PR. I think this only affects your Freebase builds, which were converted from old builds and could be converted again. Then we would have only ONE magic number each for sparse and dense, and the rest would be handled by a version tag, which I consider cleaner. What do you think?

I converted most of our QLever indices already and @hannahbast built a new version of Wikipedia + Freebase Easy but that's ok. I'd rather we break some intermediate indices now than be left with a worse design later.

@niklas88 OK, I think I finally understand you now (I first implemented a version that can read ANYTHING). What I can do with this design is inform the user that their index cannot be fixed (this should only affect you and Prof. Bast).
EDIT:
This should not be necessary; we can keep everything.
The versioning is done in an extra function, where the "intermediate" magic number is correctly converted to the version tag 0. What I could do is rename the constants so that the old ones have a _DEPRECATED suffix and the newer ones have no _VERSION suffix.
But I think there have been so many changes that you could have another look at my completely refactored serialization code around createFromByteBuffer.
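For reference, the scheme discussed in this thread could look roughly like the sketch below. It assumes the magic number is the first 8-byte word of the serialized meta data and that a version tag, if present, follows it directly. All constant values, `HeaderInfo` and `readHeader` are hypothetical names for illustration and not the code in this PR; the oldest format without any magic number is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Illustrative values only; the real constants live in IndexMetaData.h.
constexpr std::uint64_t MAGIC_MMAP_NO_VERSION = static_cast<std::uint64_t>(-1);
constexpr std::uint64_t MAGIC_SPARSE_NO_VERSION = static_cast<std::uint64_t>(-2);
constexpr std::uint64_t MAGIC_MMAP_WITH_VERSION = static_cast<std::uint64_t>(-5);
constexpr std::uint64_t MAGIC_SPARSE_WITH_VERSION = static_cast<std::uint64_t>(-6);

struct HeaderInfo {
  bool isMmapBased;        // dense (mmap) vs. sparse meta data
  std::uint64_t version;   // 0 for the "intermediate" formats without a tag
  std::size_t headerSize;  // bytes used by the magic number (and version tag)
};

// Returns std::nullopt if the buffer is too short or does not start with a
// known magic number (e.g. the legacy format, which this sketch ignores).
inline std::optional<HeaderInfo> readHeader(const unsigned char* buf,
                                            std::size_t size) {
  std::uint64_t magic = 0;
  if (size < sizeof(magic)) return std::nullopt;
  std::memcpy(&magic, buf, sizeof(magic));

  // Magic number without a version tag: map it to version 0.
  if (magic == MAGIC_MMAP_NO_VERSION || magic == MAGIC_SPARSE_NO_VERSION) {
    return HeaderInfo{magic == MAGIC_MMAP_NO_VERSION, 0, sizeof(magic)};
  }
  // Magic number followed by an explicit version tag.
  if (magic == MAGIC_MMAP_WITH_VERSION || magic == MAGIC_SPARSE_WITH_VERSION) {
    std::uint64_t version = 0;
    if (size < sizeof(magic) + sizeof(version)) return std::nullopt;
    std::memcpy(&version, buf + sizeof(magic), sizeof(version));
    return HeaderInfo{magic == MAGIC_MMAP_WITH_VERSION, version,
                      sizeof(magic) + sizeof(version)};
  }
  return std::nullopt;
}
```

With this layout, new features only bump the version tag, and the set of magic numbers never has to grow again.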

niklas88

Looks great so far

niklas88 · 2018-08-22T13:52:14Z

.gitmodules

@@ -4,3 +4,6 @@
 [submodule "third_party/googletest"]
 	path = third_party/googletest
 	url = https://github.com/google/googletest.git
+[submodule "third_party/json"]


I think version 3.2 was released 2 days ago so we might want to make sure we are at that version

Yes indeed. How do I pin the submodule to a certain commit of a certain branch for my project?

https://stackoverflow.com/a/5828396/692303

niklas88 · 2018-08-22T13:54:10Z

CMakeLists.txt

@@ -106,6 +113,9 @@ target_link_libraries (WriteIndexListsMain engine ${CMAKE_THREAD_LIBS_INIT})
 add_executable(MetaDataConverterMain src/MetaDataConverterMain.cpp)
 target_link_libraries (MetaDataConverterMain metaConverter ${CMAKE_THREAD_LIBS_INIT})

+add_executable(PrefixHeuristicMain src/PrefixHeuristicMain.cpp)


I think that executable needs a better name, to make it clear what it does. I.e. does it just print the compression list as information or does it compress an existing vocabulary on disk..

Yes, that is right. In fact, this currently does not do anything "useful" for QLever, but I can probably use it for my thesis to evaluate the potential of the method. I integrated it here to make use of the easy linking with CMake. Is it okay to leave it there?

And actually you remind me of something: we have to update the converter or some other part of the program to also convert/compress the vocabulary to the new format (one of the disadvantages of "the first char is always a prefix code").

I think we should keep it but it should say what it does

niklas88 · 2018-08-22T13:55:19Z

e2e/e2e.sh

@@ -41,7 +41,7 @@ INDEX="e2e_data/scientists-index"
 if [ "$1" != "no-index" ]; then
 	rm -f "$INDEX.*"
 	pushd "./build"
-	./IndexBuilderMain -a -l -i "../$INDEX" \
+	./IndexBuilderMain -a -l -c -i "../$INDEX" \


We already have quite a few flags. Maybe we should start making some of them the default/mandatory. Is there any reason why one would not want compression? I think eliminating some of those could even affect some hot paths or at least make the code simpler.

If you agree to make compression the default (which makes perfect sense), I happily agree.

Yes I think we should do that, especially since it's a different vocabulary format and otherwise that will lead to confusion.

niklas88 · 2018-08-22T13:56:59Z

src/PrefixHeuristicMain.cpp

@@ -0,0 +1,19 @@
+// Copyright 2018, University of Freiburg,
+// Chair of Algorithms and Data Structures.
+// Author: Johannes Kalmbach<joka921> (johannes.kalmbach@gmail.com)


Needs a comment describing what it does. Also, this should probably be added to the usage string, at least if we expect that tool to be useful in the future, i.e.

niklas88 · 2018-08-22T13:57:58Z

src/MetaDataConverterMain.cpp

@@ -15,7 +15,7 @@ int main(int argc, char** argv) {
    exit(1);


Add a bit of additional information on what this does

niklas88 · 2018-08-22T15:17:25Z

src/index/Vocabulary.h

+  */
+
+  // TODO< this overload is needed for creating the prefix heuristic
+  template <class StringRange, typename = std::enable_if_t<_isCompressed>>


Document what StringRange might be

niklas88 · 2018-08-22T15:17:42Z

src/index/Vocabulary.h

+
+  // set the list of prefixes for words which will become part of the
+  // externalized vocabulary. Good for entity names that normally don't appear
+  // in queries or results but take a lot of space (e.g. Wikidata statements)


niklas88 · 2018-08-22T15:19:11Z

src/index/Vocabulary.h

+  //            in the same order as the infile
+  //   prefixes - a list of prefixes which we will compress
+  //
+  //   Returns: A json array with information about the prefixes,


Does it make sense to have this as a json object rather than say vector<tuple<size_t, string>>?

Yes you are right, then we can handle the internals of the .configuration file all in one place in the index class.

I changed it in this place. But I am no longer sure that this is the way to go (it also affects some other places):
This is serialization information, which the Vocabulary needs in exactly this form to set up the compression when initializing an index from disk. The Index class does not need to understand the exact format; it only has to store this serialized information (in this case a JSON type) in a file.

I think this is a common pattern when serializing, just call serialize on all the members:)

niklas88 · 2018-08-22T15:22:00Z

src/index/VocabularyImpl.h

@@ -0,0 +1,346 @@
+// Copyright 2011, University of Freiburg,
+// Chair of Algorithms and Data Structures.
+// Author: Björn Buchhold <buchholb>


Add yourself as author, also in the non-impl header

niklas88 · 2018-08-22T15:23:17Z

src/util/StringUtils.h

@@ -61,6 +61,9 @@ inline bool endsWith(const string& text, const string& suffix);
 //! will return false. Case sensitive.
 inline bool endsWith(const string& text, const char* suffix);

+//! Returns the longest prefix that the two arguments have in common
+inline string commonPrefix(const string& a, const string& b);


string_view?
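A `std::string_view` variant could look like the following sketch. The signature is an assumption, not the one in this PR: it takes both arguments by view and returns a view into the first argument, so no string is allocated, but the result is only valid as long as the storage behind `a` is alive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the longest common prefix of `a` and `b` as a view into `a`.
// The comparison is byte-wise, so for UTF-8 input the result may end in the
// middle of a multi-byte code point.
inline std::string_view commonPrefix(std::string_view a, std::string_view b) {
  // The four-iterator overload of std::mismatch stops at the shorter range.
  const auto firstDiff = std::mismatch(a.begin(), a.end(), b.begin(), b.end());
  return a.substr(0, static_cast<std::size_t>(firstDiff.first - a.begin()));
}
```

Callers that need an owning result can still write `std::string(commonPrefix(a, b))`.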

joka921 · 2018-08-22T19:57:50Z

And don't worry about that test failure; I found the bug and will fix it tomorrow afternoon.
The test is not good there: it implicitly checks implementation details (it assumes a certain size for the serialization of the MetaData instead of explicitly checking it or testing it as a black box). The effect was a nondeterministic test failure (works on my machine) for code which is probably fine. You'll see what I mean tomorrow.

niklas88

A couple more comments but this is shaping up nicely

niklas88 · 2018-08-23T19:47:23Z

src/MetaDataConverterMain.cpp

@@ -9,6 +9,15 @@
 #include "./util/File.h"

 // _________________________________________________________
+// Opens an index from disk. Determines whether this index was built by an older
+// QLever version and has to be updated in ordere to use it (efficiently or at


niklas88 · 2018-08-24T08:03:04Z

src/global/Constants.h

@@ -51,3 +51,12 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30;
 static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;

 static const std::string MMAP_FILE_SUFFIX = ".meta-mmap";
+static const std::string CONFIGURATION_FILE = ".configuration";


".conf" is more canonical I think

niklas88 · 2018-08-24T08:04:22Z

src/global/Constants.h

@@ -51,3 +51,12 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30;
 static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;

 static const std::string MMAP_FILE_SUFFIX = ".meta-mmap";
+static const std::string CONFIGURATION_FILE = ".configuration";
+
+static constexpr size_t MIN_COMPRESSION_PREFIX = 128;


I think these should be uint8_t to make it clear that they fit in a single byte

niklas88 · 2018-08-24T08:11:48Z

src/global/Constants.h

+static constexpr size_t NUM_COMPRESSION_PREFIXES = 127;
+// if this is the first character of a compressed string, this means that no
+// compression has been applied to  a word
+static const char NO_PREFIX_CHAR =


char signedness is implementation defined (and different between ARM Linux and x86_64 Linux) so on x86_64 this will be a negative number while on ARM it will be positive. Therefore I'd make this uint8_t as well.
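A small sketch of what the unsigned variant buys us. The two constants use the values from the diff above; `firstByte` and `hasPrefixCode` are hypothetical helpers for illustration. Reading the first byte through `uint8_t` gives the same value on every platform, whether plain `char` is signed or not.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Values taken from the diff under review, typed as single bytes.
constexpr std::uint8_t MIN_COMPRESSION_PREFIX = 128;
constexpr std::uint8_t NUM_COMPRESSION_PREFIXES = 127;

// Reads the first byte of an encoded word as an unsigned value, regardless of
// whether plain `char` is signed on the target platform.
inline std::uint8_t firstByte(const std::string& encoded) {
  return static_cast<std::uint8_t>(encoded[0]);
}

// True if the word starts with one of the bytes reserved for prefix codes.
inline bool hasPrefixCode(const std::string& encoded) {
  if (encoded.empty()) return false;
  const std::uint8_t b = firstByte(encoded);
  // Both operands are promoted to int, so the subtraction cannot wrap around.
  return b >= MIN_COMPRESSION_PREFIX &&
         b - MIN_COMPRESSION_PREFIX < NUM_COMPRESSION_PREFIXES;
}
```

Comparing a plain `char` against 128 directly would never be true on platforms where `char` is signed, which is exactly the portability trap described above.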

niklas88 · 2018-08-24T08:16:09Z

src/index/CompressedString.h

+  }
+
+  // explicit conversions to strings and string_views
+  string toString() const { return *this; }


is toType() a convention we use? I think it's asString() for QueryExecutionTree. Is there some standard for this? I think in Boost I've seen foo.as<Bar>()?

The C++ standard library has std::to_string(42). But if asString() is the convention, I am going to change it.

No, it makes sense to keep it in our camel-case format.

niklas88 · 2018-08-24T09:03:49Z

src/index/MetaDataConverter.h

+// __________________________________________________________________________
+inline void notifyCreated(const string& filename, bool hasConvertedSuffix) {
+  if (hasConvertedSuffix) {
+    std::cout << "created new file " << filename


This is nice

niklas88 · 2018-08-24T09:10:52Z

src/index/VocabularyImpl.h

+template <typename>
+string Vocabulary<S>::expandPrefix(const CompressedString& word) const {
+  assert(!word.empty());
+  auto idx = static_cast<unsigned char>(word[0]) - MIN_COMPRESSION_PREFIX;


I find uint8_t clearer here because it's not really a character

niklas88 · 2018-08-24T09:15:04Z

src/index/VocabularyImpl.h

+  if (idx < NUM_COMPRESSION_PREFIXES) {
+    return _prefixMap[idx] + word.toStringView().substr(1);
+  } else {
+    return string(word.toStringView().substr(1));


I think the outer string() is redundant here

No, it is not. There is no implicit conversion from string_view to string; the corresponding constructor is explicit.
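The following self-contained sketch shows both points: the decoding step and why the explicit `std::string(...)` is required. It is a simplified stand-in, not the member function from this PR: `PrefixTable` is a hypothetical table type, the word is passed as a plain `std::string_view` instead of a `CompressedString`, and the constants are typed as `uint8_t`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

constexpr std::uint8_t MIN_COMPRESSION_PREFIX = 128;
constexpr std::uint8_t NUM_COMPRESSION_PREFIXES = 127;

// Hypothetical prefix table: slot i holds the prefix for code byte 128 + i.
using PrefixTable = std::array<std::string, NUM_COMPRESSION_PREFIXES>;

// Replaces the leading code byte of `word` by the prefix it stands for. A
// leading byte outside the code range means "no prefix" and is just dropped.
inline std::string expandPrefix(const PrefixTable& prefixes,
                                std::string_view word) {
  assert(!word.empty());
  // Both operands are promoted to int, so idx may be negative.
  const int idx = static_cast<std::uint8_t>(word[0]) - MIN_COMPRESSION_PREFIX;
  const std::string_view rest = word.substr(1);
  if (idx >= 0 && idx < NUM_COMPRESSION_PREFIXES) {
    std::string result = prefixes[static_cast<std::size_t>(idx)];
    result.append(rest);  // append accepts a string_view since C++17
    return result;
  }
  // `return rest;` would not compile: the std::string constructor taking a
  // std::string_view is explicit, so the conversion must be spelled out.
  return std::string(rest);
}
```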

niklas88 · 2018-08-24T09:16:25Z

src/index/VocabularyImpl.h

+    if (ad_utility::startsWith(word, p._fulltext)) {
+      auto res = CompressedString::fromString(
+          p._prefix + std::string_view(word).substr(p._fulltext.size()));
+      LOG(DEBUG) << "compressed " << word << " to " << res.toString() << '\n';


is that toString() needed? I'd think that ostream handles string_view?

This is not a string_view but a CompressedString. I can teach ostream to handle it, which would probably be cleaner. But in this place I want to throw out that LOG anyway, because it is unreadable and much too verbose for every word. (It was for tracking down a bug that was fixed some time ago.)
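Teaching ostream about the type only takes a free `operator<<`. The class below is a heavily reduced, hypothetical stand-in for CompressedString (the real class in this PR has more members); only `fromString` and `toStringView` are borrowed from the diff, and their exact signatures are assumed.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical, heavily reduced stand-in for the CompressedString class.
class CompressedString {
 public:
  static CompressedString fromString(std::string s) {
    CompressedString c;
    c._data = std::move(s);
    return c;
  }
  std::string_view toStringView() const { return _data; }

 private:
  std::string _data;
};

// Streams the stored bytes as they are, including the leading code byte, so
// the output shows the compressed form and not the expanded word.
inline std::ostream& operator<<(std::ostream& os, const CompressedString& s) {
  return os << s.toStringView();
}
```

With this overload, a call like `LOG(DEBUG) << res` would work without `toString()`, assuming the logging macro forwards to a `std::ostream`.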

niklas88 · 2018-08-24T09:17:24Z

src/index/VocabularyImpl.h

+    el = "";
+  }
+  _prefixVec.clear();
+  unsigned char idx = 0;


niklas88 · 2018-08-27T08:14:39Z

src/global/Constants.h

-static constexpr size_t MIN_COMPRESSION_PREFIX = 128;
-static constexpr size_t NUM_COMPRESSION_PREFIXES = 127;
+// Constants for the range of valid compression prefixes
+// all printable characters are left out


Say ASCII, not printable; we are assuming UTF-8 in most of QLever. Also, is this really relevant now that we always use the first byte to indicate compression? We could just as easily use 'N' and 'C' for (non-)compressed.

niklas88 · 2018-08-27T08:16:08Z

Can you rebase on the freshly merged disambiguate-optional work, with the changes to operator[] as discussed in that PR? After that I think we can merge this.

niklas88 · 2018-08-28T12:42:50Z

src/PrefixHeuristicEvaluatorMain.cpp

@@ -34,7 +34,7 @@ int main(int argc, char** argv) {
    exit(1);
  }

-  for (const auto& p : calculatePrefixes(argv[1], 127, 1)) {
+  for (const auto& p : calculatePrefixes(argv[1], 127, 1, true)) {


I think this should use the NUM_COMPRESSION_PREFIXES constant so that it always matches the encoding actually in use

niklas88 · 2018-08-29T08:43:26Z

So am I seeing this right, the last couple of e2e test failures were just flakiness?

joka921 · 2018-08-29T08:54:39Z

@niklas88 The first one was a bug with getopt (I removed the -l flag, which was still present in the e2e test).
I did not understand the second failure (it worked fine locally, also with Release builds). Obviously the server failed to open, so I added a server run in the foreground, which of course made the test stall but yielded no results. After simply removing this again it worked fine, so I assume that the one build I cannot explain (3ccb9ec) was some versioning/checkout problem with Travis.

niklas88 · 2018-08-29T12:22:49Z

@joka921 Yeah, maybe something to do with the server startup. I've seen Travis CI being flaky before, so don't worry. What are we still missing besides rebasing into coherent commits?

joka921 · 2018-08-29T14:57:55Z

I implemented the changes we discussed on Monday, so I think we are good for now. The things that could be improved are all material for separate PRs, in my opinion. If you agree, I will squash the commits and merge this; I just wanted to give you the chance to look at it again first.

niklas88

LGTM after fixing the redefinition compilation error

- Implemented the Prefix Compression Heuristic using a Trie-based greedy algorithm
- Integrated Prefix compression into the Vocabulary class (templated for prefix compression / no prefix compression)
- Converted SPO and SOP to Mmap based data
- faster startup time for ServerMain by not iterating over the complete Mmap Vector for statistics
- Added nlohmann/json as a submodule
- IndexBuilder writes a configuration file (json) that contains information needed for the prefix compression. Can be extended for other settings and statistics
- it is also possible to externalize entities from the vocabulary (e.g. Wikidata statement ids)
- it is possible to pass a settings-json file to the index builder. Currently supported: Prefixes that shall be externalized (see above)
- Storing statistics in MMap based Meta data
- only calculate expensive statistics at index creation time
- also add a simple versioning system to the meta data

niklas88

Fix compilation

- For every IndexMetaData format we have introduced so far, it is now possible to automatically determine the correct version to read and convert it.
- Eliminated the -l flag for ServerMain: this boolean flag cannot be chosen by the user and has to match the settings chosen at index-build time. This information is passed via the .meta-data.json file.
- Prefix compression now also takes into account that we ALWAYS add a code prefix, even if there is nothing to compress. That way, the prefix " will also be compressed (one byte saved per literal that is not compressed otherwise).
- Separation between different types of vocabulary. We have disabled externalizing parts of an uncompressed vocabulary, because there the method resolving IDs returns a reference, which does not work with external literals.

joka921 requested a review from niklas88 August 20, 2018 09:35

niklas88 reviewed Aug 20, 2018

joka921 force-pushed the f.prefixCompressionNew branch from 6c9e7b1 to 8291aff August 22, 2018 09:36

niklas88 reviewed Aug 22, 2018

joka921 force-pushed the f.prefixCompressionNew branch from f1983e7 to 888b6b4 August 23, 2018 14:00

niklas88 reviewed Aug 24, 2018

niklas88 reviewed Aug 27, 2018

joka921 force-pushed the f.prefixCompressionNew branch from a54c454 to 99d9d34 August 28, 2018 11:49

niklas88 reviewed Aug 28, 2018

niklas88 approved these changes Aug 29, 2018

niklas88 added this to Review in QLever Aug 31, 2018

niklas88 moved this from Review to Reviewer approved in QLever Aug 31, 2018

joka921 force-pushed the f.prefixCompressionNew branch from 34c1977 to e9279fa September 1, 2018 07:46

QLever automation moved this from Review Approved to Review Sep 1, 2018

niklas88 suggested changes Sep 1, 2018

joka921 force-pushed the f.prefixCompressionNew branch from e9279fa to f4bf41a September 1, 2018 11:50

joka921 merged commit 240daaa into ad-freiburg:master Sep 1, 2018

QLever automation moved this from Review to Done Sep 1, 2018

joka921 deleted the f.prefixCompressionNew branch August 23, 2022 20:40
