Predicate paths #244

floriankramer · 2019-05-02T12:31:11Z

Adds support for property paths, but does not support transitive operations (e.g *) yet.

niklas88

I really like what I'm seeing so far! The PropertyPathParser looks well built and you already prepared for transitive patterns to be done by special operations. I think this addition will be great in enabling so many more of the Wikidata Query Service examples to be easily used with QLever.
I've only made a preliminary review but only a few nitpicks at the moment.
Do you want to merge this before or after adding transitive path patterns?

src/parser/PropertyPathParser.cpp

niklas88 · 2019-05-03T08:51:16Z

src/parser/PropertyPathParser.cpp

+  DELIMITER_CHARS['^'] = true;
+
+  // Init the valid characters with all delimiters
+  std::memcpy(VALID_CHARS.data(), DELIMITER_CHARS.data(), 256);


As above I would expect std::copy (from <algorithms>) to result in the same machine code without resorting to C style

I replaced the memcpy with an assignment instead, which should have exactly the same effect.

test/SparqlParserTest.cpp

src/parser/PropertyPathParser.h

niklas88 · 2019-05-03T10:05:46Z

src/engine/QueryPlanner.cpp

-QueryPlanner::QueryPlanner(QueryExecutionContext* qec) : _qec(qec) {}
+QueryPlanner::QueryPlanner(QueryExecutionContext* qec)
+    : _qec(qec), _randVarCount(0) {
+  _randomSeed = time(NULL);


Do we want to set this to something static to make runs comparable?

Seems like a good idea. I've set the seed to 42.

src/engine/QueryPlanner.cpp

niklas88 · 2019-05-03T10:13:58Z

src/engine/QueryPlanner.cpp

    }
    // TODO: Could use more refactoring.
    // In Earlier versions there were no ql:contains... predicates but
    // a symmetric <in-text> predicate. Therefore some parts are still more
    // complex than need be.
-    if (t._p == CONTAINS_WORD_PREDICATE || t._p == CONTAINS_ENTITY_PREDICATE) {
-      contextVars.insert(t._s);
+    if (t._p._iri == CONTAINS_WORD_PREDICATE ||


Is the above TODO still valid? This doesn't look too complicated

I'm not sure. Do we know who added this TODO?

According to git blame it was added by Björn 2 years ago, so I would say that's a no

niklas88 · 2019-05-03T10:18:16Z

src/engine/QueryPlanner.cpp

+}
+
+// _____________________________________________________________________________
+std::string QueryPlanner::generateRandomVarName() {


Ok so I think since you are using _randVarCount anyway, we really only need the prefix to be unique against all user defined variables. Maybe we could just use something that's forbidden in standard SPARQL. For example we could start it with 0x00b7 ("·" MIDDLE DOT), which according to the grammar rule can't appear in the beginning. Interestingly this exact character for a very similar purpose has precedent in the Go compiler

Using a fixed instead of a random prefix seems like a good idea. I'd use a colon though, as it is not allowed as the first character of a variable (only [a-z][A-Z][0-9] '_' and some unicode characters are allowed) and it's a less obscure character.

src/engine/QueryPlanner.cpp

niklas88 · 2019-05-03T10:46:16Z

By the way, you can use the following bash script with my ubuntu-clang-format container to format the code on Arch (it will download the image automatically)

#!/usr/bin/env sh
docker run --rm -v "$(pwd):/input" -u $(id -u):$(id -g) niklas88/ubuntu-clang-format $@

With zsh I usually run the following command where ubuntu-clang-format is the above script in my PATH.

ubuntu-clang-format -i (src|test)/**/*.(cpp|h)

src/engine/QueryPlanner.cpp

floriankramer · 2019-05-07T09:04:33Z

My plan was to implement the transitive operation and add some e2e tests before merging this. Should be done by the end of the week.

niklas88 · 2019-05-15T08:31:50Z

The transitive operation doesn't work right in my limited testing (its running on Wikidata Full at the moment). Which I think should have been caught by better E2E Tests, I do realize though that this is somewhat hindered by the limits of the Scientists KB.

The following query gives fewer results than the same but without /wdt:P279* which can't be since * should also match the empty extension.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?river WHERE
{
  ?river wdt:P31/wdt:P279* wd:Q355304 .
  ?river wdt:P30 wd:Q51 .
}

Also the following query adding OPTIONAL crashes the instance

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?river ?location WHERE
{
  ?river wdt:P31/wdt:P279* wd:Q355304 .
  ?river wdt:P30 wd:Q51 .
  OPTIONAL { ?river wdt:P625 ?location . }
}

This can't entirely be a problem with the * or OPTIONAL since the following also crashes

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?river WHERE
{
  ?river wdt:P31/wdt:P279/wdt:P279 wd:Q355304 .
  ?river wdt:P30 wd:Q51 .
}

Interestingly using wdt:P31|(wdt:P31/wdt:P279) seems to work and correctly includes the result from both sides of the |, it also matches the result of the Wikidata Query Service

floriankramer · 2019-05-15T10:56:50Z

@niklas88 I just split the swicth of GraphPatterns to using std::shared_ptrs pretty much everywhere into a separate commit (it was so far part of the single commit in #245) and added it onto this pr. This fixed several memory related issues with the GraphPatternOperations and may fix the two crashes you had.

floriankramer · 2019-05-15T11:11:46Z

@niklas88 Currently the Transitive operation with a minimum distance of 0 just adds a connection from everything on the left side of its subquery to itself, which only works in a few highly specific cases. Solving this properly would probably require modification of the property paths before optimization to ensure that all transitive operators with a minimum distance of 0 are replaced by a union over the transitive operator with a minimum distance of 1 and the same property path with the transitive operation removed.

niklas88 · 2019-05-15T13:31:24Z

@floriankramer sadly the shared_ptr<> change doesn't fix the crashes I reported earlier. Still I think this is a worthwhile change and I'm not surprised that it fixes other memory issues.

floriankramer · 2019-05-17T15:08:49Z

@niklas88 With the latest changes the transitive operation with a minimum path length of 0 should work correctly, and I wasn't able to reproduce the crashes. An instance with the latest code is currently running on galera on port 7004.

niklas88

This is a lot to review so I'll still need several passes but here are my first comments for the current version. I can also confirm that the previously reported crashes are gone in the new version. I suspect that the reserve() I commented on means that it will run out of memory for longer paths pretty fast.

Also as always this is really great stuff. I'd suspect that we could still optimize on the transitive path operation but I think this is a very good start and it makes sense to start with a canonical implementation and go from there.

niklas88 · 2019-05-21T10:09:57Z

src/engine/QueryPlanner.cpp

+  // We need a union over every combination of joins that exclude any subset the
+  // child paths which are marked as can_be_null.
+  std::vector<std::shared_ptr<ParsedQuery::GraphPattern>> join_patterns;
+  join_patterns.reserve((1 << num_null_children));


I don't think we really want to reserve 2^num_null_children join patterns? With the above limit to 64, that will easily be much more than any computer on earth has RAM

I've changed the code slightly to reduce the exponent and I've also limited the number of children to 10 (we can probably safely do more, but in most use cases that have the user input a query more should hopefully not be required). Before resolving empty nodes into a set of unions the sequence path's children are now split into chunks, with each chunk containing at least one non empty node (and thus the chunk itself being non empty).

src/engine/QueryPlanner.cpp

niklas88 · 2019-05-21T10:16:17Z

src/engine/TransitivePath.cpp

+  for (size_t i = 0; i < indent; ++i) {
+    os << " ";
+  }
+  os << "TRANSITIVE left " << _left << " right " << _right << " minDist "


All other operations use the class name here. Also its automatically capitalized in Hannah's visualization.

I've changed it to TRANSITIVE_PATH

I meant they use the same casing as the class name

niklas88 · 2019-05-21T10:21:51Z

src/engine/TransitivePath.cpp

+// _____________________________________________________________________________
+std::string TransitivePath::getDescriptor() const {
+  std::ostringstream os;
+  os << "TRANSITIVE left " << _left << " right " << _right << " minDist "


niklas88 · 2019-05-21T10:29:46Z

src/engine/QueryPlanner.cpp

+  op->_pathData._left = left;
+  op->_pathData._right = right;
+  op->_pathData._min = 1;
+  op->_pathData._max = std::numeric_limits<size_t>::max();


Maybe we should use a general limit that we can realisitcally compute?

Shouldn't we just try to compute as many indirections as possible, and rely on a general query timeout to prevent this going rampant? That way we never return wrong results (and we support the *<max> syntax Hannah suggested, if the user wants to manually limit the maximum.)

Ok yes that makes sense.

niklas88 · 2019-05-21T12:47:30Z

e2e/scientists_queries.yaml

+      - num_cols: 1
+      - selected: ["?b"]
+      - contains_row: ["<Albert_Heim>"] 
+      - contains_row: ["<Walter_Alvarez>"] 


I think we should add a couple more examples here, e.g. using | and / with differing left and right side as well as empty *. I think we can even make them a bit more realistic e.g. maybe occupations have further predicates.

Also there should be an example for transitive min/max

niklas88 · 2019-05-21T12:48:21Z

src/engine/IdTable.h

@@ -508,7 +508,7 @@ class IdTableStatic : private IdTableImpl<COLS> {
  using Row = typename IdTableImpl<COLS>::row_type;
  using ConstRow = typename IdTableImpl<COLS>::const_row_type;
  using iterator = typename IdTableImpl<COLS>::iterator;
-  using const_iterator = const typename IdTableImpl<COLS>::iterator;
+  using const_iterator = typename IdTableImpl<COLS>::iterator;


Good find makes sense that the iterator itself isn't const

In theory we should have a const_iterator that does not allow changed values, but when I began implementing that I didn't find a nice way to write such an iterator that does not lead to insane amounts of code duplication (as the only difference is removing some const versions of methods) and the templating with base classes gets annoying quickly, as the iterator iterator class is inside of another class.

Yeah I think it's ok to have it not actively enforce the constness

niklas88 · 2019-05-21T12:56:08Z

src/engine/QueryPlanner.cpp

-              new Union(_qec, left->_qet, right->_qet));
-          QueryExecutionTree& tree = *unionPlans.back()._qet.get();
+              std::make_shared<Union>(_qec, left->_qet, right->_qet));
+          QueryExecutionTree& tree = *childPlanStorage.back()._qet.get();


Do we need to use raw pointers here instead of making the childPlanStorage use shared_ptr. What's the reasoning behind the childPlans rename?

childPlanStorage stores SubtreePlans. The reference instead of a shared_ptr is a microoptimization, I guess, as we know that the shared_ptr will not be deleted and the storage is not freed. unionPlans was renamed to reflect that it also stores SubtreePlans for TransitivePath operations (instead of just storing them for Union operations).

src/engine/QueryPlanner.cpp

niklas88 · 2019-05-21T13:12:57Z

src/engine/QueryPlanner.cpp

@@ -737,11 +840,11 @@ QueryPlanner::TripleGraph QueryPlanner::createTripleGraph(
 // _____________________________________________________________________________
 vector<QueryPlanner::SubtreePlan> QueryPlanner::seedWithScansAndText(
    const QueryPlanner::TripleGraph& tg,
-    const vector<QueryPlanner::SubtreePlan*>& children) const {
+    const vector<const QueryPlanner::SubtreePlan*>& children) {


Maybe we should use a vector of shared_ptr here too. That said we also need to keep track if the shared_ptr change has a significant performance impact. Can't see an impact for completions though

Currently we don't use shared_ptr as the data is stored in the childPlanStorage and subqueryPlans vectors which are guaranteed to have the required lifetime to ensure that the data is never used after deletion. Unless we want to use these SubtreePlans somewhere else shared_ptr seems like an unecessary overhead to me (although it might not be noticeable in the end, given the small size of the children vector).

niklas88 · 2019-05-23T13:52:11Z

@floriankramer are you getting mails for failing CI or is that still not working?

floriankramer · 2019-06-03T08:32:13Z

@niklas88 I didn't see a mail for the travis failure. Do I have to subscribe to those somewhere?

niklas88 · 2019-06-04T10:18:38Z

@floriankramer I would guess that this is controlled by the GitHub notification settings in your profile. I'm not sure however since Checks aren't explicitly mentioned there.

niklas88

LGTM

niklas88 · 2019-06-07T08:04:15Z

src/parser/ParsedQuery.cpp

+void ParsedQuery::GraphPattern::recomputeIds(size_t* id_count) {
+  bool allocatedIdCounter = false;
+  if (id_count == nullptr) {
+    id_count = new size_t(0);


I missed this during the review but we could just use a stack var here and pass a pointer to that to the recursive call only if the parameter was null. This leaves the same variable in the recursive calls unused but that should still be faster than an allocation.

floriankramer added 3 commits April 30, 2019 17:21

Implemented a property path parser.

98fb25e

First implementation of property path optimization

91e5599

Implemeted property paths apart from all transitive operations.

d020b99

floriankramer requested a review from niklas88 May 2, 2019 12:31

niklas88 reviewed May 3, 2019

View reviewed changes

src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved

floriankramer added 3 commits May 7, 2019 16:04

Added support for transitive paths.

6682aeb

Added support for transitive paths with a fixed side.

2b1f24a

Added some basic e2e tests

a873f4f

floriankramer changed the title ~~Predicate paths (WIP)~~ Predicate paths May 7, 2019

floriankramer requested a review from niklas88 May 7, 2019 15:26

floriankramer mentioned this pull request May 8, 2019

Better pattern trick detection #245

Merged

Switched GraphPatterns to shared_ptr.

0fbe689

Merge branch 'master' into predicate_paths

eb374e6

Fixed transitive paths with length 0 returning the wrong results.

a0659b0

niklas88 reviewed May 21, 2019

View reviewed changes

Code cleanup and slight improvements to sequence path based seeding.

f28b9f9

Fixed cast to wrong type.

1a60197

Use class name in asString() to match other ops

892f106

niklas88 mentioned this pull request Jun 6, 2019

Write a Proper SparQL parser using Antlr4 #236

Closed

niklas88 approved these changes Jun 7, 2019

View reviewed changes

niklas88 merged commit 47fc38c into ad-freiburg:master Jun 7, 2019

niklas88 reviewed Jun 7, 2019

View reviewed changes

Predicate paths #244

Predicate paths #244

Conversation

floriankramer commented May 2, 2019

niklas88 left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niklas88 May 3, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niklas88 commented May 3, 2019 • edited

floriankramer commented May 7, 2019

niklas88 commented May 15, 2019 • edited

floriankramer commented May 15, 2019 • edited

floriankramer commented May 15, 2019

niklas88 commented May 15, 2019

floriankramer commented May 17, 2019

niklas88 left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

floriankramer May 23, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niklas88 commented May 23, 2019

floriankramer commented Jun 3, 2019

niklas88 commented Jun 4, 2019

niklas88 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niklas88 left a comment •

edited

niklas88 May 3, 2019 •

edited

niklas88 commented May 3, 2019 •

edited

niklas88 commented May 15, 2019 •

edited

floriankramer commented May 15, 2019 •

edited

niklas88 left a comment •

edited

floriankramer May 23, 2019 •

edited