Faster Index Build- Phase 1 #227

Merged 8 commits on Apr 14, 2019
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -159,6 +159,9 @@ target_link_libraries (PrefixHeuristicEvaluatorMain index ${CMAKE_THREAD_LIBS_IN
add_executable(TurtleParserMain src/TurtleParserMain.cpp)
target_link_libraries(TurtleParserMain parser ${CMAKE_THREAD_LIBS_INIT})

+ add_executable(VocabularyMergerMain src/VocabularyMergerMain.cpp)
+ target_link_libraries(VocabularyMergerMain index ${CMAKE_THREAD_LIBS_INIT})
+
#add_executable(TextFilterComparison src/experiments/TextFilterComparison.cpp)
#target_link_libraries (TextFilterComparison experiments)

23 changes: 23 additions & 0 deletions src/VocabularyMergerMain.cpp
@@ -0,0 +1,23 @@
// Copyright 2019, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach(joka921) <johannes.kalmbach@gmail.com>
//
Member

Could you add a short comment explaining what this is used for? I assume it's for manually merging vocabularies if there was an error. Is this generally useful?

Member Author

I will add the comment. The reason is more "benchmarking the vocabulary merging without having to wait 9 hours for the TurtleParser".

// Only performs the "mergeVocabulary" step of the IndexBuilder pipeline
// Can be used e.g. for benchmarking this step to develop faster IndexBuilders.

#include <cstdlib>
#include <iostream>
#include <string>

#include "index/Vocabulary.h"
#include "index/VocabularyGenerator.h"

// ____________________________________________________________________________________________________
int main(int argc, char** argv) {
Member

This gives an unused-parameter warning for the int argc parameter.

Member

This results in an unused-parameter error for int argc.

  if (argc != 3) {
    std::cerr
        << "Usage: " << argv[0]
        << " <basename of index> <number of partial vocabulary files to merge>\n";
    // Stop here: argv[1] and argv[2] are only valid with both arguments given.
    return 1;
  }
  std::string basename = argv[1];
  size_t numFiles = atoi(argv[2]);

  VocabularyMerger m;
  m.mergeVocabulary(basename, numFiles, StringSortComparator());
}
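
For readers unfamiliar with the step this tool isolates: merging several sorted partial vocabularies is commonly done as a k-way merge. The following is a minimal sketch of that general technique using in-memory vectors. The function `mergeSortedVocabularies` is a hypothetical name, and nothing here is taken from QLever's `VocabularyMerger`, which works on files and may differ substantially.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not QLever's VocabularyMerger: merges several sorted
// word lists into one sorted, duplicate-free list (a k-way merge).
std::vector<std::string> mergeSortedVocabularies(
    const std::vector<std::vector<std::string>>& partials) {
  // Heap entry: (next word, index of the partial list it came from).
  using Entry = std::pair<std::string, std::size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  std::vector<std::size_t> positions(partials.size(), 0);

  // Seed the heap with the first word of every non-empty list.
  for (std::size_t i = 0; i < partials.size(); ++i) {
    if (!partials[i].empty()) heap.emplace(partials[i][0], i);
  }

  std::vector<std::string> merged;
  while (!heap.empty()) {
    auto [word, source] = heap.top();
    heap.pop();
    // Each input must be sorted, otherwise the output order is wrong.
    assert(merged.empty() || merged.back() <= word);
    // Skip words that were already emitted from another list.
    if (merged.empty() || merged.back() != word) merged.push_back(word);
    // Refill the heap from the list we just consumed from.
    if (++positions[source] < partials[source].size()) {
      heap.emplace(partials[source][positions[source]], source);
    }
  }
  return merged;
}
```

With k lists and n words in total, the heap never holds more than k entries, so the merge runs in O(n log k) string comparisons.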
2 changes: 1 addition & 1 deletion src/engine/Distinct.cpp
@@ -55,7 +55,7 @@ void Distinct::computeResult(ResultTable* result) {
subRes->_resultTypes.begin(),
subRes->_resultTypes.end());
result->_localVocab = subRes->_localVocab;
- int width = subRes->_data.size();
+ int width = subRes->_data.cols();
Member

Uhoh, good find!

niklas88 marked this conversation as resolved.
CALL_FIXED_SIZE_1(width, getEngine().distinct, subRes->_data, _keepIndices,
&result->_data);
LOG(DEBUG) << "Distinct result computation done." << endl;
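
The one-line fix in this hunk swaps `size()` for `cols()` when computing `width`. A small sketch of why the two can differ, assuming (this is not confirmed by the diff itself) that `size()` counts rows while `cols()` counts columns. The `Table` class below is hypothetical and only mirrors the method names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major table; the method names mirror the calls in the
// diff, but this is not QLever's actual class.
class Table {
 public:
  explicit Table(std::size_t numCols) : _cols(numCols) { assert(numCols > 0); }

  // Appends one row; the row must have exactly cols() entries.
  void pushRow(const std::vector<int>& row) {
    assert(row.size() == _cols);
    _data.insert(_data.end(), row.begin(), row.end());
  }

  std::size_t size() const { return _data.size() / _cols; }  // number of rows
  std::size_t cols() const { return _cols; }                 // number of columns

 private:
  std::size_t _cols;
  std::vector<int> _data;  // all rows stored back to back
};
```

Under that assumption, a table with three rows of two columns reports `size() == 3` but `cols() == 2`, so dispatching on `size()` would select the wrong fixed width.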
40 changes: 39 additions & 1 deletion src/engine/Filter.cpp
@@ -3,6 +3,7 @@
// Author: Björn Buchhold (buchhold@informatik.uni-freiburg.de)

#include "Filter.h"
#include <algorithm>
#include <optional>
#include <regex>
#include <sstream>
@@ -415,10 +416,25 @@ void Filter::computeFilterFixedValue(
// remove the leading '^' symbol
std::string rhs = _rhs.substr(1);
std::string upperBoundStr = rhs;
- upperBoundStr[upperBoundStr.size() - 1]++;
+ if (getIndex().getVocab().isCaseInsensitiveOrdering()) {
+ upperBoundStr = ad_utility::getUppercaseUtf8(upperBoundStr);
+ upperBoundStr[upperBoundStr.size() - 1]++;
+ upperBoundStr =
+ StringSortComparator::rdfLiteralToValueForLT(upperBoundStr);
+ // less than and greater equal require the same value
+ rhs = StringSortComparator::rdfLiteralToValueForLT(rhs);
+
+ LOG(INFO) << "upperBound was converted to " << upperBoundStr << '\n';
+ LOG(INFO) << "lowerBound was converted to " << rhs << '\n';
Member Author

This needs more comments and a lower log level.

+ } else {
+ upperBoundStr[upperBoundStr.size() - 1]++;
+ }
+
size_t upperBound =
getIndex().getVocab().getValueIdForLT(upperBoundStr);
size_t lowerBound = getIndex().getVocab().getValueIdForGE(rhs);
+ LOG(DEBUG) << "upper and lower bound are " << upperBound << ' '
+ << lowerBound << std::endl;
if (lhs_is_sorted) {
// The input data is sorted, use binary search to locate the first
// and last element that match rhs and copy the range.
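
The prefix filter in this hunk relies on a standard trick: incrementing the last character of the prefix yields the smallest string that sorts after every word with that prefix, so two binary searches delimit the matching range. A minimal sketch for a plain byte-wise sorted vocabulary follows. `prefixRange` is a hypothetical helper, and the case-insensitive handling added in this PR is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the half-open index range [first, last) of all
// words in a byte-wise sorted vocabulary that start with `prefix`.
std::pair<std::size_t, std::size_t> prefixRange(
    const std::vector<std::string>& sortedVocab, const std::string& prefix) {
  assert(!prefix.empty());
  // Smallest string greater than every string starting with `prefix`:
  // increment the last character. (Assumes that character is not already the
  // maximum char value, which would overflow.)
  std::string upperBound = prefix;
  upperBound.back()++;

  auto first =
      std::lower_bound(sortedVocab.begin(), sortedVocab.end(), prefix);
  auto last =
      std::lower_bound(sortedVocab.begin(), sortedVocab.end(), upperBound);
  return {static_cast<std::size_t>(first - sortedVocab.begin()),
          static_cast<std::size_t>(last - sortedVocab.begin())};
}
```

For the prefix `bet`, the computed upper bound is `beu`, so `beta` and `betamax` fall inside the range while `bob` does not.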
@@ -504,6 +520,28 @@ void Filter::computeResultFixedValue(
rhs_string = ad_utility::convertValueLiteralToIndexWord(rhs_string);
} else if (ad_utility::isNumeric(_rhs)) {
rhs_string = ad_utility::convertNumericToIndexWord(rhs_string);
+ } else {
+ if (getIndex().getVocab().isCaseInsensitiveOrdering()) {
+ // We have to move to the correct end of the
+ // "same letters but different case" - range
+ // to make the filters work
+ switch (_type) {
+ case SparqlFilter::GE:
+ case SparqlFilter::LT: {
+ rhs_string =
+ StringSortComparator::rdfLiteralToValueForLT(rhs_string);
+ }
+
+ break;
+ case SparqlFilter::GT:
+ case SparqlFilter::LE: {
+ rhs_string =
+ StringSortComparator::rdfLiteralToValueForGT(rhs_string);
+ } break;
+ default:
+ break;
+ }
+ }
}
if (_type == SparqlFilter::EQ || _type == SparqlFilter::NE) {
if (!getIndex().getVocab().getId(_rhs, &rhs)) {
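
The switch in this hunk picks one end of the "same letters but different case" range depending on the comparison. The idea can be sketched with `std::equal_range` on a case-insensitively sorted list: strict comparisons (`<`, and by symmetry `>=`) use the start of the range, while `<=` and `>` use its end. The names `lessIgnoreCase` and `countBelow` are hypothetical, the comparator is ASCII-only, and none of this reproduces `StringSortComparator`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Case-insensitive "less than" for ASCII strings (illustrative only; the PR
// uses its own StringSortComparator).
bool lessIgnoreCase(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(),
      [](unsigned char x, unsigned char y) {
        return std::toupper(x) < std::toupper(y);
      });
}

// Hypothetical helper: number of words that satisfy `word < rhs` (strict ==
// true) or `word <= rhs` (strict == false) in a vocabulary sorted with
// lessIgnoreCase.
std::size_t countBelow(const std::vector<std::string>& vocab,
                       const std::string& rhs, bool strict) {
  assert(std::is_sorted(vocab.begin(), vocab.end(), lessIgnoreCase));
  // All words equal to rhs up to case form one contiguous range.
  auto range =
      std::equal_range(vocab.begin(), vocab.end(), rhs, lessIgnoreCase);
  // `<` stops before that range, `<=` includes all of it.
  auto boundary = strict ? range.first : range.second;
  return static_cast<std::size_t>(boundary - vocab.begin());
}
```

Picking the wrong end would either drop or wrongly include the words that differ from the filter value only in case.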
7 changes: 5 additions & 2 deletions src/global/Constants.h
@@ -6,14 +6,14 @@
#include <string>

static const size_t STXXL_MEMORY_TO_USE = 1024L * 1024L * 1024L * 2L;
- static const size_t STXXL_DISK_SIZE_INDEX_BUILDER = 500 * 1000;
+ static const size_t STXXL_DISK_SIZE_INDEX_BUILDER = 1000 * 1000;
static const size_t STXXL_DISK_SIZE_INDEX_TEST = 10;

static const size_t NOF_SUBTREES_TO_CACHE = 1000;
static const size_t MAX_NOF_ROWS_IN_RESULT = 100000;
static const size_t MIN_WORD_PREFIX_SIZE = 4;
static const char PREFIX_CHAR = '*';
- static const char EXTERNALIZED_LITERALS_PREFIX = 127;
+ static const std::string EXTERNALIZED_LITERALS_PREFIX = std::string({127});
static const size_t MAX_NOF_NODES = 64;
static const size_t MAX_NOF_FILTERS = 64;

@@ -36,6 +36,9 @@ static const char INTERNAL_TEXT_MATCH_PREDICATE[] =
static const char HAS_PREDICATE_PREDICATE[] =
"<QLever-internal-function/has-predicate>";

+ // For anonymous nodes in Turtle.
+ static const std::string ANON_NODE_PREFIX = "QLever-Anon-Node";
+
static const std::string URI_PREFIX = "<QLever-internal-function/";

static const std::string LANGUAGE_PREDICATE = URI_PREFIX + "langtag>";
5 changes: 4 additions & 1 deletion src/index/ConstantsIndexCreation.h
@@ -28,7 +28,7 @@ static const size_t PARSER_BATCH_SIZE = 1000000;
// That many triples does the turtle parser have to buffer before the call to
// getline returns (unless our input reaches EOF). This makes parsing from
// streams faster.
- static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 1000;
+ static const size_t PARSER_MIN_TRIPLES_AT_ONCE = 100000;

// When reading from a file, Chunks of this size will
// be fed to the parser at once (100 << 20 is exactly 100 MiB
@@ -50,3 +50,6 @@ static const size_t THRESHOLD_RELATION_CREATION = 2 << 20;
// ________________________________________________________________
static const std::string PARTIAL_VOCAB_FILE_NAME = ".partial-vocabulary";
static const std::string PARTIAL_MMAP_IDS = ".partial-ids-mmap";
+
+ // ________________________________________________________________
+ static const std::string TMP_BASENAME_COMPRESSION = ".tmp.compression_index";