Merge pull request #80 from joka921/f.mmapBasedArray
Dense Meta Data using MMap-Based arrays.
joka921 committed Aug 12, 2018
2 parents 77bde8c + 5194941 commit 7b6e89b
Showing 26 changed files with 2,334 additions and 431 deletions.
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -103,6 +103,8 @@ target_link_libraries (ServerMain engine ${CMAKE_THREAD_LIBS_INIT})
add_executable(WriteIndexListsMain src/WriteIndexListsMain.cpp)
target_link_libraries (WriteIndexListsMain engine ${CMAKE_THREAD_LIBS_INIT})

add_executable(MetaDataConverterMain src/MetaDataConverterMain.cpp)
target_link_libraries (MetaDataConverterMain metaConverter ${CMAKE_THREAD_LIBS_INIT})

#add_executable(TextFilterComparison src/experiments/TextFilterComparison.cpp)
#target_link_libraries (TextFilterComparison experiments)
@@ -126,3 +128,4 @@ add_test(QueryPlannerTest test/QueryPlannerTest)
add_test(ConversionsTest test/ConversionsTest)
add_test(SparsehashTest test/SparsehashTest)
add_test(VocabularyGeneratorTest test/VocabularyGeneratorTest)
add_test(MmapVectorTest test/MmapVectorTest)
67 changes: 25 additions & 42 deletions README.md
@@ -380,6 +380,24 @@ Group by is supported, but aggregate aliases may currently only be used within t
Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggregates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`.
Group concat also supports a custom separator: `(GROUP_CONCAT(?a ; separator=" ; ") as ?concat)`. The xsd types float, decimal and integer are recognized as numbers. Other types or unbound variables (e.g. no entries for an optional part) in one of the aggregates that need to interpret the variable (e.g. AVG) lead to either no result or NaN. MAX with an unbound variable will always return the unbound variable.

# 6. Converting Old Indices For Current QLever Versions

We have recently changed the way the index meta data (the offsets of relations
within the permutations) is stored. Old index builds with 6 permutations will
not work directly with the current QLever version, while indices with 2
permutations will work but emit a warning at runtime. We provide a converter
that rewrites only the meta data, without requiring a rebuild of the index.
Just run `./MetaDataConverterMain <index-prefix>`. This will not overwrite the
old index automatically but copy the permutations and create new files with the
suffix `.converted` (e.g. `<index-prefix>.index.ops.converted`). These suffixes
have to be removed manually in order to use the converted index (rename to
`<index-prefix>.index.ops` in our example). Please consider creating backups of
the original index files before overwriting them like this. Please note that
for 6 permutations the converter also builds new files
`<index-prefix>.index.xxx.meta-mmap` in which parts of the meta data of the OPS
and OSP permutations are stored.
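The conversion-plus-rename workflow above can be sketched as follows. This is a hedged sketch: `myindex` is a placeholder prefix, and since the converter binary is not available here, its `.converted` output files are simulated with `touch`.

```shell
# In reality you would first run:  ./MetaDataConverterMain myindex
prefix="myindex"

# Simulate the converter's output files (stand-ins for real converted permutations).
for perm in pso pos spo sop osp ops; do
  touch "${prefix}.index.${perm}.converted"
done

# Strip the .converted suffix, backing up any original file first.
for f in "${prefix}".index.*.converted; do
  target="${f%.converted}"          # e.g. myindex.index.ops.converted -> myindex.index.ops
  if [ -e "$target" ]; then
    cp "$target" "$target.bak"      # keep a backup of the original
  fi
  mv "$f" "$target"
done
ls "${prefix}".index.*
```

The backup step mirrors the advice above: keep the original files around until the converted index has been verified to load.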


# How to obtain data to play around with

## Use the tiny examples contained in the repository
@@ -526,29 +544,11 @@ If you have problems, try to rebuild when compiling with -DCMAKE_BUILD_TYPE=Debu
In particular also rebuild the index.
The release build assumes machine written words- and docsfiles and omits sanity checks for the sake of speed.

## Excessive RAM usage
## High RAM Usage During Runtime

QLever uses an on-disk index and is usually able to operate with pretty low RAM
usage. However, there are data layouts that currently lead to an excessive
amount of memory being used. A future release should take care of that. There
are two scenarios where this can happen.

### High RAM usage during index construction

While building an index, QLever does need more memory than when running. Part of
this required memory is what is described below for runtime memory usage.
However, in addition, not all memory seems to be freed as soon as possible. This
needs further investigation. For now, there is an easy workaround: build the
KB and text index in two steps (two calls of IndexBuilderMain), or pre-build
the index on a server with plenty of available resources.

In general, not building all 6 permutations helps a lot. If this suffices (e.g.
for emulating the Broccoli search engine), it reduces RAM usage very significantly.


### High RAM usage during runtime

Firstly, note that even very large text corpora have little impact on
memory usage. Larger KBs are much more problematic.
There are two things that can contribute to high RAM usage (and large startup
times) during runtime:
@@ -561,16 +561,11 @@ by editing directly in the code during index construction)

2) Building all 6 permutations over large KBs (or generally having a
permutation where the primary ordering element takes many different values).

Typically, the problem occurs for the OPS and OSP permutations over KBs with many
different objects, especially string literals. As of now (release for CIKM
2017), the index keeps some meta data in RAM for each main element of a
permutation. For the two "main" permutations, PSO and POS, this is very
reasonable and resembles what is done in the Broccoli search engine: having a
few bytes (32, plus extra bytes for blocks inside relations, which aren't the main problem anymore)
for each of a few hundred thousand predicates is no problem, even for the largest KBs. However, keeping the same meta data for
several hundreds of millions of objects (500M for Freebase, required twice, for
OSP and OPS, plus 2 times 125M subjects) quickly adds up beyond acceptable numbers.
To handle this issue, the meta data of OPS and OSP is not loaded into RAM but
read from disk. This saves a lot of RAM with only little impact on the speed of
the query execution. We will evaluate whether it is worth also externalizing the
SPO and SOP permutations in this way to further reduce the RAM usage, or to let
the user decide which permutations shall be stored in which format.

Workarounds:

@@ -580,15 +575,3 @@ but is not very "clean".
* Reduce the ID size by switching from 64-bit to 32-bit IDs. However, this would
only save a small portion of the memory, since it doesn't affect pointers or
byte offsets into index files.

* There is a branch called six\_permut\_smaller\_memory\_footprint that tackles the
problem in general but is not finished yet. In short, the idea is to store the
meta data within the on-disk index and make it configurable which lists to
load into RAM on startup (PSO and POS probably being the sensible default, but
sometimes also having SPO can be a nice addition). Whatever isn't loaded into
RAM has to be read from disk at query time if it is actually needed.
In particular, however, query planning has to respect which meta data is available in RAM, and special care has to be taken for queries involving ?s ?p ?o triples. This is the reason why the change isn't trivial.
The branch also adds another readme, PERFORMANCE\_TUNING.md, with more
detailed information about the trade-offs. However, this file isn't finished
yet, either.

25 changes: 21 additions & 4 deletions e2e/scientists_queries.yaml
@@ -94,7 +94,7 @@ queries:
checks:
- num_cols: 2
# The query returns too many rows, the current limit is 4096
# - num_rows: 5295
# - num_rows: 5295
- selected: ["?place", "?count2"]
- order_numeric: {"dir": "DESC", "var": "?count2"}
- query: scientists-order-by-aggregate-avg
@@ -109,8 +109,8 @@ GROUP BY ?profession
GROUP BY ?profession
ORDER BY ASC((AVG(?height) as ?avg))
checks:
- num_cols: 2
- num_rows: 209
- num_cols: 2
- num_rows: 209
- selected: ["?profession", "?avg2"]
- order_numeric: {"dir": "ASC", "var": "?avg2"}
- query: group-by-profession-average-height
@@ -174,7 +174,7 @@ queries:
- selected: ["?r", "?count"]
- contains_row: ["<Religion>", 1185]
- order_numeric: {"dir": "DESC", "var": "?count"}
- query : has-predicate-full
- query : has-predicate-full
solutions:
- type: no-text
sparql: |
@@ -200,3 +200,20 @@ sparql: |
- num_cols: 2
- selected: ["?entity", "?r"]
- contains_row: ["<Geographer>", "<Profession>"]
- query : full-osp-scan
solutions:
- type: no-text
sparql: |
SELECT DISTINCT ?p WHERE {
?x <is-a> <Scientist> .
?y <is-a> <Scientist> .
?x ?p ?y .
}
checks:
- num_rows: 17
- num_cols: 1
- selected: ["?p"]
- contains_row: ["<Academic_advisor>"]
- contains_row: ["<Named_after>"]
- contains_row: ["<Influenced_By>"]
- contains_row: ["<Production_staff>"]
43 changes: 43 additions & 0 deletions src/MetaDataConverterMain.cpp
@@ -0,0 +1,43 @@
// Copyright 2018, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach (johannes.kalmbach@gmail.com)
//
#include "./index/MetaDataConverter.h"
#include <array>
#include <iostream>
#include "./global/Constants.h"
#include "./util/File.h"

// _________________________________________________________
int main (int argc, char** argv) {
if (argc != 2) {
std::cerr << "Usage: ./MetaDataConverterMain <indexPrefix>\n";
exit(1);
}
std::string in = argv[1];
std::array<std::string, 4> sparseNames{".pso", ".pos", ".spo", ".sop"};
for (const auto& n : sparseNames) {
std::string permutName = in + ".index" + n;
if (!ad_utility::File::exists(permutName)) {
std::cerr << "Permutation file " << permutName
<< " was not found. Maybe not all permutations were built for "
"this index. Skipping\n";
continue;
}
addMagicNumberToSparseMetaDataPermutation(permutName,
permutName + ".converted");
}

std::array<std::string, 2> denseNames{".osp", ".ops"};
for (const auto& n : denseNames) {
std::string permutName = in + ".index" + n;
if (!ad_utility::File::exists(permutName)) {
std::cerr << "Permutation file " << permutName
<< " was not found. Maybe not all permutations were built for "
"this index. Skipping\n";
continue;
}
convertHmapBasedPermutatationToMmap(permutName, permutName + ".converted",
permutName + MMAP_FILE_SUFFIX);
}
}
4 changes: 4 additions & 0 deletions src/global/Constants.h
@@ -3,6 +3,8 @@
// Author: Björn Buchhold (buchhold@informatik.uni-freiburg.de)
#pragma once

#include <string>

static const int STXXL_MEMORY_TO_USE = 1024 * 1024 * 1024;
static const int STXXL_DISK_SIZE_INDEX_BUILDER = 500 * 1000;
static const int STXXL_DISK_SIZE_INDEX_TEST = 10;
@@ -47,3 +49,5 @@ static const int DEFAULT_NOF_VALUE_INTEGER_DIGITS = 50;
static const int DEFAULT_NOF_VALUE_EXPONENT_DIGITS = 20;
static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30;
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;

static const std::string MMAP_FILE_SUFFIX = ".meta-mmap";
10 changes: 9 additions & 1 deletion src/index/CMakeLists.txt
@@ -5,10 +5,18 @@ add_library(index
VocabularyGenerator.h VocabularyGenerator.cpp
ConstantsIndexCreation.h
ExternalVocabulary.h ExternalVocabulary.cpp
IndexMetaData.h IndexMetaData.cpp
IndexMetaData.h IndexMetaDataImpl.h
MetaDataTypes.h MetaDataTypes.cpp
MetaDataHandler.h
StxxlSortFunctors.h
TextMetaData.cpp TextMetaData.h
DocsDB.cpp DocsDB.h
FTSAlgorithms.cpp FTSAlgorithms.h)

target_link_libraries(index parser ${STXXL_LIBRARIES})

add_library(metaConverter
MetaDataConverter.cpp MetaDataConverter.h)

target_link_libraries(metaConverter index)

2 changes: 1 addition & 1 deletion src/index/ExternalVocabulary.h
@@ -46,7 +46,7 @@ class ExternalVocabulary {
private:
mutable ad_utility::File _file;
off_t _startOfOffsets;
size_t _size;
size_t _size = 0;

Id binarySearchInVocab(const string& word) const;
};
16 changes: 8 additions & 8 deletions src/index/Index.Text.cpp
@@ -2,13 +2,13 @@
// Chair of Algorithms and Data Structures.
// Author: Björn Buchhold (buchhold@informatik.uni-freiburg.de)

#include "./Index.h"
#include <stxxl/algorithm>
#include <tuple>
#include <utility>
#include "../parser/ContextFileParser.h"
#include "../util/Simple8bCode.h"
#include "./FTSAlgorithms.h"
#include "./Index.h"

// _____________________________________________________________________________
void Index::addTextFromContextFile(const string& contextFile) {
@@ -469,11 +469,10 @@ void Index::createCodebooks(const vector<Index::Posting>& postings,
[](const std::pair<Id, size_t>& a, const std::pair<Id, size_t>& b) {
return a.second > b.second;
});
std::sort(
sfVec.begin(), sfVec.end(),
[](const std::pair<Score, size_t>& a, const std::pair<Score, size_t>& b) {
return a.second > b.second;
});
std::sort(sfVec.begin(), sfVec.end(), [](const std::pair<Score, size_t>& a,
const std::pair<Score, size_t>& b) {
return a.second > b.second;
});
for (size_t j = 0; j < wfVec.size(); ++j) {
wordCodebook.push_back(wfVec[j].first);
wordCodemap[wfVec[j].first] = j;
@@ -573,8 +572,9 @@ void Index::getWordPostingsForTerm(const string& term, vector<Id>& cids,
entityTerm
? _textMeta.getBlockInfoByEntityId(idRange._first)
: _textMeta.getBlockInfoByWordRange(idRange._first, idRange._last);
if (tbmd._cl.hasMultipleWords() && !(tbmd._firstWordId == idRange._first &&
tbmd._lastWordId == idRange._last)) {
if (tbmd._cl.hasMultipleWords() &&
!(tbmd._firstWordId == idRange._first &&
tbmd._lastWordId == idRange._last)) {
vector<Id> blockCids;
vector<Id> blockWids;
vector<Score> blockScores;
