Merge pull request #80 from joka921/f.mmapBasedArray
Dense Meta Data using MMap-Based arrays.
joka921 committed Aug 12, 2018
2 parents 77bde8c + 5194941 commit 7b6e89b
Showing 26 changed files with 2,334 additions and 431 deletions.
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -103,6 +103,8 @@ target_link_libraries (ServerMain engine ${CMAKE_THREAD_LIBS_INIT})
add_executable(WriteIndexListsMain src/WriteIndexListsMain.cpp)
target_link_libraries (WriteIndexListsMain engine ${CMAKE_THREAD_LIBS_INIT})

add_executable(MetaDataConverterMain src/MetaDataConverterMain.cpp)
target_link_libraries (MetaDataConverterMain metaConverter ${CMAKE_THREAD_LIBS_INIT})

#add_executable(TextFilterComparison src/experiments/TextFilterComparison.cpp)
#target_link_libraries (TextFilterComparison experiments)
@@ -126,3 +128,4 @@ add_test(QueryPlannerTest test/QueryPlannerTest)
add_test(ConversionsTest test/ConversionsTest)
add_test(SparsehashTest test/SparsehashTest)
add_test(VocabularyGeneratorTest test/VocabularyGeneratorTest)
add_test(MmapVectorTest test/MmapVectorTest)
67 changes: 25 additions & 42 deletions README.md
@@ -380,6 +380,24 @@ Group by is supported, but aggregate aliases may currently only be used within t
Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggregates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`.
Group concat also supports a custom separator: `(GROUP_CONCAT(?a ; separator=" ; ") as ?concat)`. The xsd types float, decimal and integer are recognized as numbers. Other types or unbound variables (e.g. no entries for an optional part) in one of the aggregates that need to interpret the variable (e.g. AVG) lead to either no result or NaN. MAX with an unbound variable will always return the unbound variable.

# 6. Converting Old Indices For Current QLever Versions

We have recently changed the way the index meta data (the offsets of relations
within the permutations) is stored. Old index builds with 6 permutations will
not work directly with the current QLever version, while indices with 2
permutations will work but emit a warning at runtime. We provide a converter
that rewrites only the meta data, without requiring a rebuild of the index.
Just run `./MetaDataConverterMain <index-prefix>`. This will not overwrite the
old index automatically but copy the permutations and create new files with the
suffix `.converted` (e.g. `<index-prefix>.index.ops.converted`). These suffixes
have to be removed manually in order to use the converted index (rename to
`<index-prefix>.index.ops` in our example). Please consider creating backups of
the original index files before overwriting them like this. Please note that
for 6 permutations the converter also builds new files
`<index-prefix>.index.xxx.meta-mmap` in which parts of the meta data of the OPS
and OSP permutations are stored.
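The conversion-plus-rename workflow above can be sketched as follows. This is a hedged sketch: `myindex` is a placeholder prefix, and since the converter binary is not available here, its `.converted` output files are simulated with `touch`.

```shell
# In reality you would first run:  ./MetaDataConverterMain myindex
prefix="myindex"

# Simulate the converter's output files (stand-ins for real converted permutations).
for perm in pso pos spo sop osp ops; do
  touch "${prefix}.index.${perm}.converted"
done

# Strip the .converted suffix, backing up any original file first.
for f in "${prefix}".index.*.converted; do
  target="${f%.converted}"          # e.g. myindex.index.ops.converted -> myindex.index.ops
  if [ -e "$target" ]; then
    cp "$target" "$target.bak"      # keep a backup of the original
  fi
  mv "$f" "$target"
done
ls "${prefix}".index.*
```

The backup step mirrors the advice above: keep the original files around until the converted index has been verified to load.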


# How to obtain data to play around with

## Use the tiny examples contained in the repository
@@ -526,29 +544,11 @@ If you have problems, try to rebuild when compiling with -DCMAKE_BUILD_TYPE=Debu
In particular also rebuild the index.
The release build assumes machine written words- and docsfiles and omits sanity checks for the sake of speed.

## Excessive RAM usage
## High RAM Usage During Runtime

QLever uses an on-disk index and is usually able to operate with pretty low RAM
usage. However, there are data layouts that currently lead to an excessive
amount of memory being used. A future release should take care of that. There
are two scenarios where this can happen.

### High RAM usage during index construction

While building an index, QLever does need more memory than when running. Part of
this required memory is what is described below for runtime memory usage.
However, in addition, not all memory seems to be freed as soon as possible. This
needs further investigation. For now, there is an easy workaround: build the
KB and text index in two steps (two calls of IndexBuilderMain), or pre-build
the index on a server with plenty of available resources.

In general, not building all 6 permutations helps a lot. If this suffices (e.g.
for emulating the Broccoli search engine), it reduces RAM usage very significantly.


### High RAM usage during runtime

Firstly, note that even very large text corpora have little impact on
memory usage. Larger KBs are much more problematic.
There are two things that can contribute to high RAM usage (and large startup
times) during runtime:
@@ -561,16 +561,11 @@ by editing directly in the code during index construction)

2) Building all 6 permutations over large KBs (or generally having a
permutation where the primary ordering element takes many different values).

Typically, the problem occurs for the OPS and OSP permutations over KBs with many
different objects, especially string literals. As of now (release for CIKM
2017), the index keeps some meta data in RAM for each main element of a
permutation. For the two "main" permutations, PSO and POS, this is very
reasonable and resembles what is done in the Broccoli search engine: having a
few bytes (32, plus extra bytes for blocks inside relations, which aren't the main problem anymore)
for each of a few hundred thousand predicates is no problem, even for the largest KBs. However, keeping the same meta data for
several hundreds of millions of objects (500M for Freebase, required twice, for
OSP and OPS, plus 2 times 125M subjects) quickly adds up beyond acceptable numbers.
To handle this issue, the meta data of OPS and OSP is not loaded into RAM but
read from disk. This saves a lot of RAM with only little impact on the speed of
the query execution. We will evaluate whether it is worth also externalizing the
SPO and SOP permutations in this way to further reduce the RAM usage, or to let
the user decide which permutations shall be stored in which format.

Workarounds:

@@ -580,15 +575,3 @@ but is not very "clean".
* Reduce the ID size by switching from 64-bit to 32-bit IDs. However, this would
only save a small portion of the memory, since it doesn't affect pointers or
byte offsets into index files.

* There is a branch called six\_permut\_smaller\_memory\_footprint that tackles the
problem in general but is not finished yet. In short, the idea is to store the
meta data within the on-disk index and make it configurable which lists to
load into RAM on startup (PSO and POS probably being the sensible default, but
sometimes also having SPO can be a nice addition). Whatever isn't loaded into
RAM has to be read from disk at query time if it is actually needed.
In particular, however, query planning has to respect which meta data is available in RAM, and special care has to be taken for queries involving ?s ?p ?o triples. This is the reason why the change isn't trivial.
The branch also adds another readme, PERFORMANCE\_TUNING.md, with more
detailed information about the trade-offs. However, this file isn't finished
yet, either.

25 changes: 21 additions & 4 deletions e2e/scientists_queries.yaml
@@ -94,7 +94,7 @@ queries:
checks:
- num_cols: 2
# The query returns too many rows, the current limit is 4096
# - num_rows: 5295
# - num_rows: 5295
- selected: ["?place", "?count2"]
- order_numeric: {"dir": "DESC", "var": "?count2"}
- query: scientists-order-by-aggregate-avg
@@ -109,8 +109,8 @@ GROUP BY ?profession
GROUP BY ?profession
ORDER BY ASC((AVG(?height) as ?avg))
checks:
- num_cols: 2
- num_rows: 209
- num_cols: 2
- num_rows: 209
- selected: ["?profession", "?avg2"]
- order_numeric: {"dir": "ASC", "var": "?avg2"}
- query: group-by-profession-average-height
@@ -174,7 +174,7 @@ queries:
- selected: ["?r", "?count"]
- contains_row: ["<Religion>", 1185]
- order_numeric: {"dir": "DESC", "var": "?count"}
- query : has-predicate-full
- query : has-predicate-full
solutions:
- type: no-text
sparql: |
@@ -200,3 +200,20 @@ sparql: |
- num_cols: 2
- selected: ["?entity", "?r"]
- contains_row: ["<Geographer>", "<Profession>"]
- query : full-osp-scan
solutions:
- type: no-text
sparql: |
SELECT DISTINCT ?p WHERE {
?x <is-a> <Scientist> .
?y <is-a> <Scientist> .
?x ?p ?y .
}
checks:
- num_rows: 17
- num_cols: 1
- selected: ["?p"]
- contains_row: ["<Academic_advisor>"]
- contains_row: ["<Named_after>"]
- contains_row: ["<Influenced_By>"]
- contains_row: ["<Production_staff>"]
43 changes: 43 additions & 0 deletions src/MetaDataConverterMain.cpp
@@ -0,0 +1,43 @@
// Copyright 2018, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach (johannes.kalmbach@gmail.com)
//
#include "./index/MetaDataConverter.h"
#include <array>
#include <iostream>
#include "./global/Constants.h"
#include "./util/File.h"

// _________________________________________________________
int main (int argc, char** argv) {
if (argc != 2) {
std::cerr << "Usage: ./MetaDataConverterMain <indexPrefix>\n";
exit(1);
}
std::string in = argv[1];
std::array<std::string, 4> sparseNames{".pso", ".pos", ".spo", ".sop"};
for (const auto& n : sparseNames) {
std::string permutName = in + ".index" + n;
if (!ad_utility::File::exists(permutName)) {
std::cerr << "Permutation file " << permutName
<< " was not found. Maybe not all permutations were built for "
"this index. Skipping\n";
continue;
}
addMagicNumberToSparseMetaDataPermutation(permutName,
permutName + ".converted");
}

std::array<std::string, 2> denseNames{".osp", ".ops"};
for (const auto& n : denseNames) {
std::string permutName = in + ".index" + n;
if (!ad_utility::File::exists(permutName)) {
std::cerr << "Permutation file " << permutName
<< " was not found. Maybe not all permutations were built for "
"this index. Skipping\n";
continue;
}
convertHmapBasedPermutatationToMmap(permutName, permutName + ".converted",
permutName + MMAP_FILE_SUFFIX);
}
}
4 changes: 4 additions & 0 deletions src/global/Constants.h
@@ -3,6 +3,8 @@
// Author: Björn Buchhold (buchhold@informatik.uni-freiburg.de)
#pragma once

#include <string>

static const int STXXL_MEMORY_TO_USE = 1024 * 1024 * 1024;
static const int STXXL_DISK_SIZE_INDEX_BUILDER = 500 * 1000;
static const int STXXL_DISK_SIZE_INDEX_TEST = 10;
@@ -47,3 +49,5 @@ static const int DEFAULT_NOF_VALUE_INTEGER_DIGITS = 50;
static const int DEFAULT_NOF_VALUE_EXPONENT_DIGITS = 20;
static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30;
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;

static const std::string MMAP_FILE_SUFFIX = ".meta-mmap";
10 changes: 9 additions & 1 deletion src/index/CMakeLists.txt
@@ -5,10 +5,18 @@ add_library(index
VocabularyGenerator.h VocabularyGenerator.cpp
ConstantsIndexCreation.h
ExternalVocabulary.h ExternalVocabulary.cpp
IndexMetaData.h IndexMetaData.cpp
IndexMetaData.h IndexMetaDataImpl.h
MetaDataTypes.h MetaDataTypes.cpp
MetaDataHandler.h
StxxlSortFunctors.h
TextMetaData.cpp TextMetaData.h
DocsDB.cpp DocsDB.h
FTSAlgorithms.cpp FTSAlgorithms.h)

target_link_libraries(index parser ${STXXL_LIBRARIES})

add_library(metaConverter
MetaDataConverter.cpp MetaDataConverter.h)

target_link_libraries(metaConverter index)

2 changes: 1 addition & 1 deletion src/index/ExternalVocabulary.h
@@ -46,7 +46,7 @@ class ExternalVocabulary {
private:
mutable ad_utility::File _file;
off_t _startOfOffsets;
size_t _size;
size_t _size = 0;

Id binarySearchInVocab(const string& word) const;
};
16 changes: 8 additions & 8 deletions src/index/Index.Text.cpp
@@ -2,13 +2,13 @@
// Chair of Algorithms and Data Structures.
// Author: Björn Buchhold (buchhold@informatik.uni-freiburg.de)

#include "./Index.h"
#include <stxxl/algorithm>
#include <tuple>
#include <utility>
#include "../parser/ContextFileParser.h"
#include "../util/Simple8bCode.h"
#include "./FTSAlgorithms.h"
#include "./Index.h"

// _____________________________________________________________________________
void Index::addTextFromContextFile(const string& contextFile) {
@@ -469,11 +469,10 @@ void Index::createCodebooks(const vector<Index::Posting>& postings,
[](const std::pair<Id, size_t>& a, const std::pair<Id, size_t>& b) {
return a.second > b.second;
});
std::sort(
sfVec.begin(), sfVec.end(),
[](const std::pair<Score, size_t>& a, const std::pair<Score, size_t>& b) {
return a.second > b.second;
});
std::sort(sfVec.begin(), sfVec.end(), [](const std::pair<Score, size_t>& a,
const std::pair<Score, size_t>& b) {
return a.second > b.second;
});
for (size_t j = 0; j < wfVec.size(); ++j) {
wordCodebook.push_back(wfVec[j].first);
wordCodemap[wfVec[j].first] = j;
@@ -573,8 +572,9 @@ void Index::getWordPostingsForTerm(const string& term, vector<Id>& cids,
entityTerm
? _textMeta.getBlockInfoByEntityId(idRange._first)
: _textMeta.getBlockInfoByWordRange(idRange._first, idRange._last);
if (tbmd._cl.hasMultipleWords() && !(tbmd._firstWordId == idRange._first &&
tbmd._lastWordId == idRange._last)) {
if (tbmd._cl.hasMultipleWords() &&
!(tbmd._firstWordId == idRange._first &&
tbmd._lastWordId == idRange._last)) {
vector<Id> blockCids;
vector<Id> blockWids;
vector<Score> blockScores;
