Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Predicate paths #244

Merged
merged 12 commits into from Jun 7, 2019
Merged

Predicate paths #244

merged 12 commits into from Jun 7, 2019

Conversation

floriankramer
Copy link
Member

Adds support for property paths, but does not support transitive operations (e.g *) yet.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like what I'm seeing so far! The PropertyPathParser looks well built and you already prepared for transitive patterns to be done by special operations. I think this addition will be great in enabling so many more of the Wikidata Query Service examples to be easily used with QLever.
I've only made a preliminary review but only a few nitpicks at the moment.
Do you want to merge this before or after adding transitive path patterns?

src/parser/PropertyPathParser.cpp Outdated Show resolved Hide resolved
DELIMITER_CHARS['^'] = true;

// Init the valid characters with all delimiters
std::memcpy(VALID_CHARS.data(), DELIMITER_CHARS.data(), 256);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above I would expect std::copy (from <algorithms>) to result in the same machine code without resorting to C style

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the memcpy with an assignment instead, which should have exactly the same effect.

test/SparqlParserTest.cpp Show resolved Hide resolved
src/parser/PropertyPathParser.h Outdated Show resolved Hide resolved
QueryPlanner::QueryPlanner(QueryExecutionContext* qec) : _qec(qec) {}
QueryPlanner::QueryPlanner(QueryExecutionContext* qec)
: _qec(qec), _randVarCount(0) {
_randomSeed = time(NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to set this to something static to make runs comparable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea. I've set the seed to 42.

src/engine/QueryPlanner.cpp Show resolved Hide resolved
}
// TODO: Could use more refactoring.
// In Earlier versions there were no ql:contains... predicates but
// a symmetric <in-text> predicate. Therefore some parts are still more
// complex than need be.
if (t._p == CONTAINS_WORD_PREDICATE || t._p == CONTAINS_ENTITY_PREDICATE) {
contextVars.insert(t._s);
if (t._p._iri == CONTAINS_WORD_PREDICATE ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the above TODO still valid? This doesn't look too complicated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Do we know who added this TODO?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to git blame it was added by Björn 2 years ago, so I would say that's a no

}

// _____________________________________________________________________________
std::string QueryPlanner::generateRandomVarName() {
Copy link
Member

@niklas88 niklas88 May 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so I think since you are using _randVarCount anyway, we really only need the prefix to be unique against all user defined variables. Maybe we could just use something that's forbidden in standard SPARQL. For example we could start it with 0x00b7 ("·" MIDDLE DOT), which according to the grammar rule can't appear in the beginning. Interestingly this exact character for a very similar purpose has precedent in the Go compiler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a fixed instead of a random prefix seems like a good idea. I'd use a colon though, as it is not allowed as the first character of a variable (only [a-z][A-Z][0-9] '_' and some unicode characters are allowed) and it's a less obscure character.

src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
@niklas88
Copy link
Member

niklas88 commented May 3, 2019

By the way, you can use the following bash script with my ubuntu-clang-format container to format the code on Arch (it will download the image automatically)

#!/usr/bin/env sh
docker run --rm -v "$(pwd):/input" -u $(id -u):$(id -g) niklas88/ubuntu-clang-format $@

With zsh I usually run the following command where ubuntu-clang-format is the above script in my PATH.

ubuntu-clang-format -i (src|test)/**/*.(cpp|h)

@floriankramer
Copy link
Member Author

My plan was to implement the transitive operation and add some e2e tests before merging this. Should be done by the end of the week.

@floriankramer floriankramer changed the title Predicate paths (WIP) Predicate paths May 7, 2019
@niklas88
Copy link
Member

niklas88 commented May 15, 2019

The transitive operation doesn't work right in my limited testing (its running on Wikidata Full at the moment). Which I think should have been caught by better E2E Tests, I do realize though that this is somewhat hindered by the limits of the Scientists KB.

The following query gives fewer results than the same but without /wdt:P279* which can't be since * should also match the empty extension.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?river WHERE
{
  ?river wdt:P31/wdt:P279* wd:Q355304 .
  ?river wdt:P30 wd:Q51 .
}

Also the following query adding OPTIONAL crashes the instance

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?river ?location WHERE
{
  ?river wdt:P31/wdt:P279* wd:Q355304 .
  ?river wdt:P30 wd:Q51 .
  OPTIONAL { ?river wdt:P625 ?location . }
}

This can't entirely be a problem with the * or OPTIONAL since the following also crashes

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?river WHERE
{
  ?river wdt:P31/wdt:P279/wdt:P279 wd:Q355304 .
  ?river wdt:P30 wd:Q51 .
}

Interestingly using wdt:P31|(wdt:P31/wdt:P279) seems to work and correctly includes the result from both sides of the |, it also matches the result of the Wikidata Query Service

@floriankramer
Copy link
Member Author

floriankramer commented May 15, 2019

@niklas88 I just split the swicth of GraphPatterns to using std::shared_ptrs pretty much everywhere into a separate commit (it was so far part of the single commit in #245) and added it onto this pr. This fixed several memory related issues with the GraphPatternOperations and may fix the two crashes you had.

@floriankramer
Copy link
Member Author

@niklas88 Currently the Transitive operation with a minimum distance of 0 just adds a connection from everything on the left side of its subquery to itself, which only works in a few highly specific cases. Solving this properly would probably require modification of the property paths before optimization to ensure that all transitive operators with a minimum distance of 0 are replaced by a union over the transitive operator with a minimum distance of 1 and the same property path with the transitive operation removed.

@niklas88
Copy link
Member

@floriankramer sadly the shared_ptr<> change doesn't fix the crashes I reported earlier. Still I think this is a worthwhile change and I'm not surprised that it fixes other memory issues.

@floriankramer
Copy link
Member Author

@niklas88 With the latest changes the transitive operation with a minimum path length of 0 should work correctly, and I wasn't able to reproduce the crashes. An instance with the latest code is currently running on galera on port 7004.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot to review so I'll still need several passes but here are my first comments for the current version. I can also confirm that the previously reported crashes are gone in the new version. I suspect that the reserve() I commented on means that it will run out of memory for longer paths pretty fast.

Also as always this is really great stuff. I'd suspect that we could still optimize on the transitive path operation but I think this is a very good start and it makes sense to start with a canonical implementation and go from there.

// We need a union over every combination of joins that exclude any subset the
// child paths which are marked as can_be_null.
std::vector<std::shared_ptr<ParsedQuery::GraphPattern>> join_patterns;
join_patterns.reserve((1 << num_null_children));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we really want to reserve 2^num_null_children join patterns? With the above limit to 64, that will easily be much more than any computer on earth has RAM

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the code slightly to reduce the exponent and I've also limited the number of children to 10 (we can probably safely do more, but in most use cases that have the user input a query more should hopefully not be required). Before resolving empty nodes into a set of unions the sequence path's children are now split into chunks, with each chunk containing at least one non empty node (and thus the chunk itself being non empty).

src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
for (size_t i = 0; i < indent; ++i) {
os << " ";
}
os << "TRANSITIVE left " << _left << " right " << _right << " minDist "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other operations use the class name here. Also its automatically capitalized in Hannah's visualization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to TRANSITIVE_PATH

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant they use the same casing as the class name

// _____________________________________________________________________________
std::string TransitivePath::getDescriptor() const {
std::ostringstream os;
os << "TRANSITIVE left " << _left << " right " << _right << " minDist "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

op->_pathData._left = left;
op->_pathData._right = right;
op->_pathData._min = 1;
op->_pathData._max = std::numeric_limits<size_t>::max();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a general limit that we can realisitcally compute?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just try to compute as many indirections as possible, and rely on a general query timeout to prevent this going rampant? That way we never return wrong results (and we support the *<max> syntax Hannah suggested, if the user wants to manually limit the maximum.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yes that makes sense.

- num_cols: 1
- selected: ["?b"]
- contains_row: ["<Albert_Heim>"]
- contains_row: ["<Walter_Alvarez>"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a couple more examples here, e.g. using | and / with differing left and right side as well as empty *. I think we can even make them a bit more realistic e.g. maybe occupations have further predicates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there should be an example for transitive min/max

@@ -508,7 +508,7 @@ class IdTableStatic : private IdTableImpl<COLS> {
using Row = typename IdTableImpl<COLS>::row_type;
using ConstRow = typename IdTableImpl<COLS>::const_row_type;
using iterator = typename IdTableImpl<COLS>::iterator;
using const_iterator = const typename IdTableImpl<COLS>::iterator;
using const_iterator = typename IdTableImpl<COLS>::iterator;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find makes sense that the iterator itself isn't const

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory we should have a const_iterator that does not allow changed values, but when I began implementing that I didn't find a nice way to write such an iterator that does not lead to insane amounts of code duplication (as the only difference is removing some const versions of methods) and the templating with base classes gets annoying quickly, as the iterator iterator class is inside of another class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it's ok to have it not actively enforce the constness

new Union(_qec, left->_qet, right->_qet));
QueryExecutionTree& tree = *unionPlans.back()._qet.get();
std::make_shared<Union>(_qec, left->_qet, right->_qet));
QueryExecutionTree& tree = *childPlanStorage.back()._qet.get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use raw pointers here instead of making the childPlanStorage use shared_ptr. What's the reasoning behind the childPlans rename?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

childPlanStorage stores SubtreePlans. The reference instead of a shared_ptr is a microoptimization, I guess, as we know that the shared_ptr will not be deleted and the storage is not freed. unionPlans was renamed to reflect that it also stores SubtreePlans for TransitivePath operations (instead of just storing them for Union operations).

src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
@@ -737,11 +840,11 @@ QueryPlanner::TripleGraph QueryPlanner::createTripleGraph(
// _____________________________________________________________________________
vector<QueryPlanner::SubtreePlan> QueryPlanner::seedWithScansAndText(
const QueryPlanner::TripleGraph& tg,
const vector<QueryPlanner::SubtreePlan*>& children) const {
const vector<const QueryPlanner::SubtreePlan*>& children) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a vector of shared_ptr here too. That said we also need to keep track if the shared_ptr change has a significant performance impact. Can't see an impact for completions though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we don't use shared_ptr as the data is stored in the childPlanStorage and subqueryPlans vectors which are guaranteed to have the required lifetime to ensure that the data is never used after deletion. Unless we want to use these SubtreePlans somewhere else shared_ptr seems like an unecessary overhead to me (although it might not be noticeable in the end, given the small size of the children vector).

@niklas88
Copy link
Member

@floriankramer are you getting mails for failing CI or is that still not working?

@floriankramer
Copy link
Member Author

@niklas88 I didn't see a mail for the travis failure. Do I have to subscribe to those somewhere?

@niklas88
Copy link
Member

niklas88 commented Jun 4, 2019

@floriankramer I would guess that this is controlled by the GitHub notification settings in your profile. I'm not sure however since Checks aren't explicitly mentioned there.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@niklas88 niklas88 merged commit 47fc38c into ad-freiburg:master Jun 7, 2019
void ParsedQuery::GraphPattern::recomputeIds(size_t* id_count) {
bool allocatedIdCounter = false;
if (id_count == nullptr) {
id_count = new size_t(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this during the review but we could just use a stack var here and pass a pointer to that to the recursive call only if the parameter was null. This leaves the same variable in the recursive calls unused but that should still be faster than an allocation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants