WikidataFull for Panarea-Demo. Do NOT MERGE #134

Closed
75 commits
16ca4ca
Prefix Compression and faster startup time
joka921 Aug 1, 2018
99d9d34
Updated MetaDataConverter to handle basically any old format
joka921 Aug 22, 2018
d3f7cce
Eliminated -l flag for ServerMain
joka921 Aug 28, 2018
0025d28
Prefix Compression now also takes into account that
joka921 Aug 28, 2018
d51b3b3
Better comment on the range of prefix characters
joka921 Aug 28, 2018
da0ac97
Separation between different types of vocabulary.
joka921 Aug 28, 2018
3ccb9ec
Fix: Removed -l from server
joka921 Aug 28, 2018
7625409
First draft of some parsing rules
joka921 Aug 7, 2018
87562cd
some first regexes for tokenizing turtle inputs
joka921 Aug 10, 2018
691ce39
added many more regexes. completely untested yet. changed to wstring
joka921 Aug 11, 2018
16402d3
Raw tokenizer version, seems to work at least to some extent
joka921 Aug 11, 2018
f6ee5b0
First draft of wstring based parser, not finished nor tested
joka921 Aug 11, 2018
3e2d22a
Some more rules for parser
joka921 Aug 11, 2018
c1c12e0
finished beta for nonTerminals
joka921 Aug 16, 2018
7b608a8
Added many more regexes
joka921 Aug 16, 2018
d3a095f
Parser is compiling
joka921 Aug 16, 2018
84cbf93
Before conversion to google's re2
joka921 Aug 20, 2018
fe8fc8f
Integrated re2 in Makefiles
joka921 Aug 20, 2018
b5aadfb
Compiling with re2
joka921 Aug 20, 2018
d331648
Removed unused Regexes from Tokenizer
joka921 Aug 21, 2018
eb5e3db
Unit tests for some of the string literal regexes
joka921 Aug 21, 2018
d277610
Added many more unit tests
joka921 Aug 21, 2018
3d28807
Character classes working now
joka921 Aug 21, 2018
a017d3b
Tests and updates for prefixed names and blank nodes
joka921 Aug 22, 2018
3b33561
First tests for Parser
joka921 Aug 24, 2018
1fd33ed
Bugfix for getNextToken with multiple candidates
joka921 Aug 24, 2018
9858058
Implemented blankNodePropertyList and additional tests
joka921 Aug 24, 2018
ab80452
Unit tests for all lists etc.
joka921 Aug 24, 2018
61d48ae
Finished for today
joka921 Aug 24, 2018
bf7e8f8
Mmap and getline for TurtleParser + bugfix
joka921 Aug 25, 2018
bc445ff
Bugfix in statement rule
joka921 Aug 25, 2018
a763e16
Turtle Parser Beta (parses from uncompressed files)
joka921 Aug 28, 2018
125afdb
Possibly working TurtleParser
joka921 Aug 28, 2018
7535f05
Beta-Integration of Turtle parser into indexBuilder
joka921 Aug 28, 2018
6b20e50
included bzip2 in dockerfile
joka921 Aug 28, 2018
fc113c4
Eliminated Regexes in expensive places
joka921 Aug 30, 2018
62b4d65
Added code that inserts triples for language relation
joka921 Sep 5, 2018
3aa5b93
Completed Functionality
joka921 Sep 5, 2018
5cf73f3
Fixed language filter for externalized languages
joka921 Sep 5, 2018
1167110
Fixed the externalization character output
joka921 Sep 5, 2018
143f489
Fixed test (unfortunately manually)
joka921 Sep 5, 2018
01fd1b6
Removed some unused code
joka921 Sep 5, 2018
bfcf187
Efficient Permutation creation
joka921 Sep 6, 2018
4089db9
Removed outcommented code
joka921 Sep 7, 2018
6cbeacd
Added special triples like <me> rdfs:label.en "myEnglishName"@en
joka921 Sep 7, 2018
1435392
Changed the format of languagePredicates
joka921 Sep 7, 2018
c10b60b
Output every 10 million lines only for second pass
joka921 Sep 7, 2018
049f202
Merge branch 'f.turtleParser' into f.tmp
joka921 Sep 7, 2018
ac6ed8c
Merge branch 'f.efficientLanguageFilter' into f.everythingMerged
joka921 Sep 7, 2018
250685c
Merge branch 'f.EfficientMultiplicities' into f.everythingMerged
joka921 Sep 7, 2018
ead78f5
Minor fix for stringParse
joka921 Sep 7, 2018
fe7854d
No testing of RE2 lib
joka921 Sep 7, 2018
16034eb
Fixed StringParse Bug
joka921 Sep 7, 2018
7a88bbf
Log output when parser is not successful
joka921 Sep 7, 2018
5497854
Merge branch 'f.everythingMerged' of https://github.com/joka921/QLeve…
joka921 Sep 7, 2018
14a06bf
Parallel Vocabulary Generation
joka921 Sep 7, 2018
7eb7d77
Status message
joka921 Sep 7, 2018
c998268
Bugfix for last Vocabulary entry
joka921 Sep 7, 2018
5b01950
Last (partial) buffer also has to be handled
joka921 Sep 7, 2018
223438d
Reduced Memory Footprint
joka921 Sep 7, 2018
219651d
Moved Items to inner loop. There seems to be memory leaking
joka921 Sep 7, 2018
2194cc2
This compiles. Let's see if it also works
joka921 Sep 8, 2018
45bd1bd
Bugfix: writer for ExtVec has to be finished
joka921 Sep 8, 2018
ff82ec9
Fixed Usage of OPENMP
joka921 Sep 8, 2018
f774a52
LanguagePredicate is no longer at beginning of vocab
joka921 Sep 8, 2018
5e322dc
Compilation bugfix
joka921 Sep 8, 2018
7f509f8
Added support for HAVING. Fixes #104.
floriankramer Sep 13, 2018
f2a5c8f
Replaced resultSortedBy column with a vector of columns.
floriankramer Sep 16, 2018
255e17b
Added group by operations to the optimizer.
floriankramer Sep 16, 2018
9e9ebb5
Fixed GroupBy::computeSortColumns returning wrong columns
floriankramer Sep 18, 2018
ed601a7
Current version for Hannah
joka921 Sep 18, 2018
912780a
Added multithreading as default to Dockerfile
joka921 Sep 18, 2018
e688afb
Added a prefix filter
floriankramer Sep 19, 2018
53280d0
HACK: added support for filters on verbatim columns. This is a hack a…
floriankramer Sep 19, 2018
2590586
Merge branch 'florians-prefix-filter' into f.singlePass
joka921 Sep 19, 2018
1 change: 1 addition & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -2,3 +2,4 @@ Dockerfile
index/*
e2e_data/*
build/*
.git/*
3 changes: 3 additions & 0 deletions .gitmodules
@@ -7,3 +7,6 @@
[submodule "third_party/json"]
path = third_party/json
url = https://github.com/nlohmann/json.git
[submodule "third_party/re2"]
path = third_party/re2
url = https://github.com/google/re2.git
41 changes: 38 additions & 3 deletions CMakeLists.txt
@@ -66,14 +66,41 @@ include_directories(third_party/json/include/)
# STXXL
################################
# Disable GNU parallel as it prevents build on Ubuntu 14.04
set(USE_GNU_PARALLEL OFF CACHE BOOL "Don't use gnu parallel" FORCE)
set(USE_OPENMP OFF CACHE BOOL "Don't use OpenMP" FORCE)
set(USE_GNU_PARALLEL ON CACHE BOOL "Use gnu parallel" FORCE)
set(USE_OPENMP ON CACHE BOOL "Use OpenMP" FORCE)
add_subdirectory(third_party/stxxl)
# apply STXXL CXXFLAGS
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${STXXL_CXX_FLAGS}")
# add STXXL includes path
include_directories(SYSTEM ${STXXL_INCLUDE_DIRS})

################################
# GNU PARALLEL
################################
if(USE_OPENMP OR USE_GNU_PARALLEL)
include(FindOpenMP)
if(NOT OPENMP_FOUND)
message(STATUS "OpenMP not found. Continuing without parallel algorithm support.")
else()
message(STATUS "OpenMP found, enabling built-in parallel algorithms.")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
endif()

else()
message(STATUS "OpenMP disabled in QLever (no parallelism is used).")

endif(USE_OPENMP OR USE_GNU_PARALLEL)

################################
# RE2
################################
# No unit tests for RE2 (its own test suite is long and exhaustive)
option(RE2_BUILD_TESTING "enable testing for RE2" OFF)
add_subdirectory(third_party/re2)
include_directories(third_party/re2)

message(STATUS ---)
message(STATUS "CXX_FLAGS are : " ${CMAKE_CXX_FLAGS})
message(STATUS "CXX_FLAGS_RELEASE are : " ${CMAKE_CXX_FLAGS_RELEASE})
@@ -116,6 +143,12 @@ target_link_libraries (MetaDataConverterMain metaConverter ${CMAKE_THREAD_LIBS_I
add_executable(PrefixHeuristicEvaluatorMain src/PrefixHeuristicEvaluatorMain.cpp)
target_link_libraries (PrefixHeuristicEvaluatorMain index ${CMAKE_THREAD_LIBS_INIT})

add_executable(TurtleParserMain src/TurtleParserMain.cpp)
target_link_libraries(TurtleParserMain parser ${CMAKE_THREAD_LIBS_INIT})

add_executable(Bzip2WrapperMain src/parser/Bzip2WrapperMain.cpp)
target_link_libraries(Bzip2WrapperMain -lbz2)

#add_executable(TextFilterComparison src/experiments/TextFilterComparison.cpp)
#target_link_libraries (TextFilterComparison experiments)

@@ -137,5 +170,7 @@ add_test(FTSAlgorithmsTest test/FTSAlgorithmsTest)
add_test(QueryPlannerTest test/QueryPlannerTest)
add_test(ConversionsTest test/ConversionsTest)
add_test(SparsehashTest test/SparsehashTest)
add_test(VocabularyGeneratorTest test/VocabularyGeneratorTest)
#add_test(VocabularyGeneratorTest test/VocabularyGeneratorTest)
add_test(MmapVectorTest test/MmapVectorTest)
add_test(TokenTest test/TokenTest)
add_test(TurtleParserTest test/TurtleParserTest)
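The CMakeLists.txt change above turns `USE_OPENMP` and `USE_GNU_PARALLEL` on and appends the OpenMP compiler and linker flags when `FindOpenMP` succeeds. As a minimal sketch of what those flags enable, the loop below uses an OpenMP reduction. The function name `parallelSum` is hypothetical and not part of QLever. When the file is compiled without the OpenMP flags, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of QLever: sums a vector, using several
// threads when the compiler was invoked with the OpenMP flags.
std::uint64_t parallelSum(const std::vector<std::uint64_t>& values) {
  std::uint64_t sum = 0;
  // Signed loop index: older OpenMP versions require a signed integer type.
  const long long n = static_cast<long long>(values.size());
#pragma omp parallel for reduction(+ : sum)
  for (long long i = 0; i < n; ++i) {
    sum += values[static_cast<std::size_t>(i)];
  }
  return sum;
}
```

The reduction clause gives every thread a private partial sum and combines them at the end, so the result does not depend on the number of threads.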
11 changes: 6 additions & 5 deletions Dockerfile
@@ -5,21 +5,22 @@ ENV LC_ALL C.UTF-8
ENV LC_CTYPE C.UTF-8

FROM base as builder
RUN apt-get update && apt-get install -y build-essential clang-format cmake libsparsehash-dev
RUN apt-get update && apt-get install -y build-essential cmake libsparsehash-dev libbz2-dev
COPY . /app/

# Check formatting with the .clang-format project style
WORKDIR /app/
RUN misc/format-check.sh

WORKDIR /app/build/
RUN cmake -DCMAKE_BUILD_TYPE=Release .. && make -j $(nproc) && make test
RUN cmake -DCMAKE_BUILD_TYPE=Release .. && make -j $(nproc)

FROM base as runtime
WORKDIR /app
RUN apt-get update && apt-get install -y wget python3-yaml unzip curl
RUN apt-get update && apt-get install -y wget python3-yaml unzip curl libgomp1
ARG UID=1000
RUN groupadd -r qlever && useradd --no-log-init -r -u $UID -g qlever qlever && chown qlever:qlever /app
RUN apt-get update && apt-get install -y bzip2

COPY --from=builder /app/build/*Main /app/src/web/* /app/
COPY --from=builder /app/e2e/* /app/e2e/
@@ -29,10 +30,10 @@ USER qlever
EXPOSE 7001
VOLUME ["/input", "/index"]

ENV INDEX_PREFIX index
ENV INDEX_PREFIX wikidata-full
# Need the shell to get the INDEX_PREFIX environment variable
ENTRYPOINT ["/bin/sh", "-c", "exec ServerMain -i \"/index/${INDEX_PREFIX}\" -p 7001 \"$@\"", "--"]
CMD ["-t", "-a", "-P"]
CMD ["-a", "-j 8"]

# docker build -t qlever-<name> .
# # When running with user namespaces you may need to make the index folder accessible
32 changes: 31 additions & 1 deletion e2e/scientists_queries.yaml
@@ -249,7 +249,7 @@ queries:
- contains_row: ["<Aaron_Antonovsky>","<Helen_Antonovsky>"]
- contains_row: ["<Abraham_Zelmanov>", ""]
- contains_row: ["<Abraham_Pais>","<Ida_Nicolaisen>;<Lila_Lee_Pais>"]
- contains_row: ["<Aafia_Siddiqui>","<Ammar_al-Baluchi>;<Amjad_Mohammed_Khan>"]
- contains_row: ["<Aafia_Siddiqui>","<Amjad_Mohammed_Khan>;<Ammar_al-Baluchi>"]
- query: giant-int-scientists
solutions:
- type: no-text
@@ -338,3 +338,33 @@ queries:
          - contains_row: ["<Albert_Einstein>", "<Nobel_Prize_in_Physics>"]
          - contains_row: ["<Albert_Fert>", "<Wolf_Prize_in_Physics>"]
          - contains_row: ["<Albert_Overhauser>", "<National_Medal_of_Science_for_Physical_Science>"]
  - query : having-predicate-religion
    solutions:
      - type: no-text
        sparql: |
          SELECT ?predicate (COUNT(?predicate) as ?count) WHERE {
            ?x <is-a> <Astronaut> .
            ?x ql:has-predicate ?predicate .
          }
          GROUP BY ?predicate
          HAVING (?predicate < <Z) (?predicate = <Religion>)
        checks:
          - num_rows: 1
          - num_cols: 2
          - selected: ["?predicate", "?count"]
          - contains_row: ["<Religion>", "5"]
  - query : pattern-trick-automatic-having
    solutions:
      - type: no-text
        sparql: |
          SELECT ?predicate (COUNT(?predicate) as ?count) WHERE {
            ?x ql:has-predicate ?predicate .
            FILTER (?predicate = <Gender>)
          }
          GROUP BY ?predicate
          ORDER BY DESC(?count)
        checks:
          - num_rows: 1
          - num_cols: 2
          - selected: ["?predicate", "?count"]
          - contains_row: ["<Gender>", "18589"]
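The `having-predicate-religion` query above applies its two HAVING conditions to the grouped rows, after the counts have been computed. A minimal sketch of that evaluation order follows, assuming a simplified in-memory representation. The names `countPerPredicate`, `applyHaving`, and `Condition` are hypothetical and do not mirror QLever's GroupBy or Filter operations, and plain string comparison stands in for whatever ordering the engine actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using GroupCounts = std::map<std::string, std::size_t>;
// A HAVING condition sees one aggregated row: the group key and its count.
using Condition = std::function<bool(const std::string&, std::size_t)>;

// GROUP BY ?predicate with COUNT(?predicate): one counter per distinct value.
GroupCounts countPerPredicate(const std::vector<std::string>& predicates) {
  GroupCounts counts;
  for (const auto& p : predicates) {
    ++counts[p];
  }
  return counts;
}

// HAVING: keep only the groups for which every condition holds.
std::vector<std::pair<std::string, std::size_t>> applyHaving(
    const GroupCounts& groups, const std::vector<Condition>& conditions) {
  std::vector<std::pair<std::string, std::size_t>> rows;
  for (const auto& group : groups) {
    bool keep = true;
    for (const auto& cond : conditions) {
      if (!cond(group.first, group.second)) {
        keep = false;
        break;
      }
    }
    if (keep) {
      rows.emplace_back(group.first, group.second);
    }
  }
  return rows;
}
```

With the two conditions from the query (`?predicate < <Z` and `?predicate = <Religion>`), only the `<Religion>` group survives, which matches the `num_rows: 1` check.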
21 changes: 21 additions & 0 deletions src/TurtleParserMain.cpp
@@ -0,0 +1,21 @@
// Copyright 2018, University of Freiburg,
// Chair of Algorithms and Data Structures.
// Author: Johannes Kalmbach(joka921) <johannes.kalmbach@gmail.com>

#include <array>
#include <iostream>
#include <string>
#include "./parser/TurtleParser.h"

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "Usage: ./TurtleParserMain <turtleInput>\n";
    return 1;
  }
  TurtleParser p(argv[1]);
  std::array<std::string, 3> triple;
  // Print every parsed triple as "subject predicate object", one per line.
  while (p.getLine(&triple)) {
    std::cout << triple[0] << " " << triple[1] << " " << triple[2] << '\n';
  }
}
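The `main` above drives the parser through a pull-style `getLine(&triple)` call that returns `false` once the input is exhausted. The sketch below shows the same calling convention with a deliberately simple stand-in. `SimpleTripleReader` is hypothetical, is not the PR's `TurtleParser`, and only handles whitespace-separated `subject predicate object .` lines (no prefixes, no literals containing spaces).

```cpp
#include <array>
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a pull-style triple source. It reads one
// "subject predicate object ." line per call and skips malformed lines.
class SimpleTripleReader {
 public:
  explicit SimpleTripleReader(std::istream& in) : _in(in) {}

  // Writes the next triple to *triple and returns true, or returns false
  // when no further well-formed line is available.
  bool getLine(std::array<std::string, 3>* triple) {
    std::string line;
    while (std::getline(_in, line)) {
      std::istringstream tokens(line);
      std::string s, p, o, dot;
      if (tokens >> s >> p >> o >> dot && dot == ".") {
        *triple = {s, p, o};
        return true;
      }
    }
    return false;
  }

 private:
  std::istream& _in;
};
```

Because the caller owns the output array, the loop can reuse one buffer for every triple, as the `while (p.getLine(&triple))` loop above does.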

6 changes: 3 additions & 3 deletions src/engine/CountAvailablePredicates.cpp
@@ -41,9 +41,9 @@ string CountAvailablePredicates::asString(size_t indent) const {
size_t CountAvailablePredicates::getResultWidth() const { return 2; }

// _____________________________________________________________________________
size_t CountAvailablePredicates::resultSortedOn() const {
vector<size_t> CountAvailablePredicates::resultSortedOn() const {
// The result is not sorted on any column.
return std::numeric_limits<size_t>::max();
return {};
}

// _____________________________________________________________________________
@@ -100,7 +100,7 @@ size_t CountAvailablePredicates::getCostEstimate() {
// _____________________________________________________________________________
void CountAvailablePredicates::computeResult(ResultTable* result) const {
result->_nofColumns = 2;
result->_sortedBy = 0;
result->_sortedBy = resultSortedOn();
result->_fixedSizeData = new vector<array<Id, 2>>();
result->_resultTypes.push_back(ResultTable::ResultType::KB);
result->_resultTypes.push_back(ResultTable::ResultType::VERBATIM);
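The change above replaces the single-column sentinel (`std::numeric_limits<size_t>::max()` for "not sorted") with a vector of columns, where an empty vector means the result is unsorted. The sketch below shows how a consumer might query such a vector, assuming the columns are listed with the primary sort column first (an assumption, since the diff does not state the ordering). `isSortedOnPrefix` is a hypothetical helper, not QLever API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if a result sorted on `sortedOn` (primary column first) is
// also sorted on `required`, i.e. `required` is a prefix of `sortedOn`.
// An empty `sortedOn` (unsorted result) only satisfies an empty requirement.
bool isSortedOnPrefix(const std::vector<std::size_t>& sortedOn,
                      const std::vector<std::size_t>& required) {
  if (required.size() > sortedOn.size()) {
    return false;
  }
  for (std::size_t i = 0; i < required.size(); ++i) {
    if (sortedOn[i] != required[i]) {
      return false;
    }
  }
  return true;
}
```

Compared with a sentinel value, the empty vector needs no special case in this check: the size comparison already rejects every non-empty requirement.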
18 changes: 9 additions & 9 deletions src/engine/CountAvailablePredicates.h
@@ -41,32 +41,32 @@ class CountAvailablePredicates : public Operation {
std::shared_ptr<QueryExecutionTree> subtree,
size_t subjectColumnIndex);

virtual string asString(size_t indent = 0) const;
virtual string asString(size_t indent = 0) const override;

virtual size_t getResultWidth() const;
virtual size_t getResultWidth() const override;

virtual size_t resultSortedOn() const;
virtual vector<size_t> resultSortedOn() const override;

std::unordered_map<string, size_t> getVariableColumns() const;

virtual void setTextLimit(size_t limit) {
virtual void setTextLimit(size_t limit) override {
if (_subtree != nullptr) {
_subtree->setTextLimit(limit);
}
}

virtual bool knownEmptyResult() {
virtual bool knownEmptyResult() override {
if (_subtree != nullptr) {
return _subtree->knownEmptyResult();
}
return false;
}

virtual float getMultiplicity(size_t col);
virtual float getMultiplicity(size_t col) override;

virtual size_t getSizeEstimate();
virtual size_t getSizeEstimate() override;

virtual size_t getCostEstimate();
virtual size_t getCostEstimate() override;

void setVarNames(const std::string& predicateVarName,
const std::string& countVarName);
@@ -103,5 +103,5 @@ class CountAvailablePredicates : public Operation {
std::string _predicateVarName;
std::string _countVarName;

virtual void computeResult(ResultTable* result) const;
virtual void computeResult(ResultTable* result) const override;
};
24 changes: 16 additions & 8 deletions src/engine/Distinct.h
@@ -25,27 +25,35 @@ class Distinct : public Operation {
std::shared_ptr<QueryExecutionTree> subtree,
const vector<size_t>& keepIndices);

virtual string asString(size_t indent = 0) const;
virtual string asString(size_t indent = 0) const override;

virtual size_t resultSortedOn() const { return _subtree->resultSortedOn(); }
virtual vector<size_t> resultSortedOn() const override {
return _subtree->resultSortedOn();
}

virtual void setTextLimit(size_t limit) { _subtree->setTextLimit(limit); }
virtual void setTextLimit(size_t limit) override {
_subtree->setTextLimit(limit);
}

virtual size_t getSizeEstimate() { return _subtree->getSizeEstimate(); }
virtual size_t getSizeEstimate() override {
return _subtree->getSizeEstimate();
}

virtual size_t getCostEstimate() {
virtual size_t getCostEstimate() override {
return getSizeEstimate() + _subtree->getCostEstimate();
}

virtual float getMultiplicity(size_t col) {
virtual float getMultiplicity(size_t col) override {
return _subtree->getMultiplicity(col);
}

virtual bool knownEmptyResult() { return _subtree->knownEmptyResult(); }
virtual bool knownEmptyResult() override {
return _subtree->knownEmptyResult();
}

private:
std::shared_ptr<QueryExecutionTree> _subtree;
vector<size_t> _keepIndices;

virtual void computeResult(ResultTable* result) const;
virtual void computeResult(ResultTable* result) const override;
};
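In the header above, `Distinct` forwards `resultSortedOn()` to its subtree and now marks every virtual method `override`. The sketch below reduces that pattern to a single method. `Operation` here is a simplified stand-in, and `SortedLeaf` and `DistinctSketch` are hypothetical names. None of them reproduce the real QLever classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for the Operation base class, reduced to one method.
class Operation {
 public:
  virtual ~Operation() = default;
  virtual std::vector<std::size_t> resultSortedOn() const = 0;
};

// Hypothetical leaf operation that reports a fixed list of sort columns.
class SortedLeaf : public Operation {
 public:
  explicit SortedLeaf(std::vector<std::size_t> cols) : _cols(std::move(cols)) {}
  std::vector<std::size_t> resultSortedOn() const override { return _cols; }

 private:
  std::vector<std::size_t> _cols;
};

// Forwards the sort columns of its subtree, as Distinct does in the diff.
class DistinctSketch : public Operation {
 public:
  explicit DistinctSketch(std::shared_ptr<Operation> subtree)
      : _subtree(std::move(subtree)) {}

  // With `override`, the compiler rejects this declaration if it ever stops
  // matching a base-class virtual (for example after dropping `const`).
  std::vector<std::size_t> resultSortedOn() const override {
    return _subtree->resultSortedOn();
  }

 private:
  std::shared_ptr<Operation> _subtree;
};
```

Without `override`, a declaration that no longer matches the base class (say, a dropped `const`) would compile as a new, unrelated method, and calls through the base class would bypass it.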