New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Predicate paths #244
Predicate paths #244
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like what I'm seeing so far! The PropertyPathParser
looks well built and you already prepared for transitive patterns to be done by special operations. I think this addition will be great in enabling so many more of the Wikidata Query Service examples to be easily used with QLever.
I've only made a preliminary review but only a few nitpicks at the moment.
Do you want to merge this before or after adding transitive path patterns?
src/parser/PropertyPathParser.cpp
Outdated
DELIMITER_CHARS['^'] = true; | ||
|
||
// Init the valid characters with all delimiters | ||
std::memcpy(VALID_CHARS.data(), DELIMITER_CHARS.data(), 256); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above I would expect std::copy
(from <algorithms>
) to result in the same machine code without resorting to C style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the memcpy with an assignment instead, which should have exactly the same effect.
src/engine/QueryPlanner.cpp
Outdated
QueryPlanner::QueryPlanner(QueryExecutionContext* qec) : _qec(qec) {} | ||
QueryPlanner::QueryPlanner(QueryExecutionContext* qec) | ||
: _qec(qec), _randVarCount(0) { | ||
_randomSeed = time(NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to set this to something static to make runs comparable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea. I've set the seed to 42.
} | ||
// TODO: Could use more refactoring. | ||
// In Earlier versions there were no ql:contains... predicates but | ||
// a symmetric <in-text> predicate. Therefore some parts are still more | ||
// complex than need be. | ||
if (t._p == CONTAINS_WORD_PREDICATE || t._p == CONTAINS_ENTITY_PREDICATE) { | ||
contextVars.insert(t._s); | ||
if (t._p._iri == CONTAINS_WORD_PREDICATE || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the above TODO still valid? This doesn't look too complicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. Do we know who added this TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to git blame
it was added by Björn 2 years ago, so I would say that's a no
src/engine/QueryPlanner.cpp
Outdated
} | ||
|
||
// _____________________________________________________________________________ | ||
std::string QueryPlanner::generateRandomVarName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so I think since you are using _randVarCount
anyway, we really only need the prefix to be unique against all user defined variables. Maybe we could just use something that's forbidden in standard SPARQL. For example we could start it with 0x00b7 ("·" MIDDLE DOT), which according to the grammar rule can't appear in the beginning. Interestingly this exact character for a very similar purpose has precedent in the Go compiler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a fixed instead of a random prefix seems like a good idea. I'd use a colon though, as it is not allowed as the first character of a variable (only [a-z][A-Z][0-9] '_' and some unicode characters are allowed) and it's a less obscure character.
By the way, you can use the following bash script with my ubuntu-clang-format container to format the code on Arch (it will download the image automatically)
With
|
My plan was to implement the transitive operation and add some e2e tests before merging this. Should be done by the end of the week. |
The transitive operation doesn't work right in my limited testing (its running on Wikidata Full at the moment). Which I think should have been caught by better E2E Tests, I do realize though that this is somewhat hindered by the limits of the Scientists KB. The following query gives fewer results than the same but without
Also the following query adding
This can't entirely be a problem with the
Interestingly using |
@niklas88 I just split the swicth of GraphPatterns to using std::shared_ptrs pretty much everywhere into a separate commit (it was so far part of the single commit in #245) and added it onto this pr. This fixed several memory related issues with the GraphPatternOperations and may fix the two crashes you had. |
@niklas88 Currently the Transitive operation with a minimum distance of 0 just adds a connection from everything on the left side of its subquery to itself, which only works in a few highly specific cases. Solving this properly would probably require modification of the property paths before optimization to ensure that all transitive operators with a minimum distance of 0 are replaced by a union over the transitive operator with a minimum distance of 1 and the same property path with the transitive operation removed. |
@floriankramer sadly the |
@niklas88 With the latest changes the transitive operation with a minimum path length of 0 should work correctly, and I wasn't able to reproduce the crashes. An instance with the latest code is currently running on galera on port 7004. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot to review so I'll still need several passes but here are my first comments for the current version. I can also confirm that the previously reported crashes are gone in the new version. I suspect that the reserve()
I commented on means that it will run out of memory for longer paths pretty fast.
Also as always this is really great stuff. I'd suspect that we could still optimize on the transitive path operation but I think this is a very good start and it makes sense to start with a canonical implementation and go from there.
src/engine/QueryPlanner.cpp
Outdated
// We need a union over every combination of joins that exclude any subset the | ||
// child paths which are marked as can_be_null. | ||
std::vector<std::shared_ptr<ParsedQuery::GraphPattern>> join_patterns; | ||
join_patterns.reserve((1 << num_null_children)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we really want to reserve 2^num_null_children
join patterns? With the above limit to 64, that will easily be much more than any computer on earth has RAM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the code slightly to reduce the exponent and I've also limited the number of children to 10 (we can probably safely do more, but in most use cases that have the user input a query more should hopefully not be required). Before resolving empty nodes into a set of unions the sequence path's children are now split into chunks, with each chunk containing at least one non empty node (and thus the chunk itself being non empty).
src/engine/TransitivePath.cpp
Outdated
for (size_t i = 0; i < indent; ++i) { | ||
os << " "; | ||
} | ||
os << "TRANSITIVE left " << _left << " right " << _right << " minDist " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All other operations use the class name here. Also its automatically capitalized in Hannah's visualization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it to TRANSITIVE_PATH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant they use the same casing as the class name
src/engine/TransitivePath.cpp
Outdated
// _____________________________________________________________________________ | ||
std::string TransitivePath::getDescriptor() const { | ||
std::ostringstream os; | ||
os << "TRANSITIVE left " << _left << " right " << _right << " minDist " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above
op->_pathData._left = left; | ||
op->_pathData._right = right; | ||
op->_pathData._min = 1; | ||
op->_pathData._max = std::numeric_limits<size_t>::max(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should use a general limit that we can realisitcally compute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just try to compute as many indirections as possible, and rely on a general query timeout to prevent this going rampant? That way we never return wrong results (and we support the *<max>
syntax Hannah suggested, if the user wants to manually limit the maximum.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yes that makes sense.
- num_cols: 1 | ||
- selected: ["?b"] | ||
- contains_row: ["<Albert_Heim>"] | ||
- contains_row: ["<Walter_Alvarez>"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a couple more examples here, e.g. using |
and /
with differing left and right side as well as empty *
. I think we can even make them a bit more realistic e.g. maybe occupations have further predicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also there should be an example for transitive min/max
@@ -508,7 +508,7 @@ class IdTableStatic : private IdTableImpl<COLS> { | |||
using Row = typename IdTableImpl<COLS>::row_type; | |||
using ConstRow = typename IdTableImpl<COLS>::const_row_type; | |||
using iterator = typename IdTableImpl<COLS>::iterator; | |||
using const_iterator = const typename IdTableImpl<COLS>::iterator; | |||
using const_iterator = typename IdTableImpl<COLS>::iterator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find makes sense that the iterator itself isn't const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory we should have a const_iterator that does not allow changed values, but when I began implementing that I didn't find a nice way to write such an iterator that does not lead to insane amounts of code duplication (as the only difference is removing some const versions of methods) and the templating with base classes gets annoying quickly, as the iterator iterator class is inside of another class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it's ok to have it not actively enforce the constness
new Union(_qec, left->_qet, right->_qet)); | ||
QueryExecutionTree& tree = *unionPlans.back()._qet.get(); | ||
std::make_shared<Union>(_qec, left->_qet, right->_qet)); | ||
QueryExecutionTree& tree = *childPlanStorage.back()._qet.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to use raw pointers here instead of making the childPlanStorage
use shared_ptr
. What's the reasoning behind the childPlans
rename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
childPlanStorage stores SubtreePlans
. The reference instead of a shared_ptr is a microoptimization, I guess, as we know that the shared_ptr will not be deleted and the storage is not freed. unionPlans
was renamed to reflect that it also stores SubtreePlans
for TransitivePath
operations (instead of just storing them for Union
operations).
@@ -737,11 +840,11 @@ QueryPlanner::TripleGraph QueryPlanner::createTripleGraph( | |||
// _____________________________________________________________________________ | |||
vector<QueryPlanner::SubtreePlan> QueryPlanner::seedWithScansAndText( | |||
const QueryPlanner::TripleGraph& tg, | |||
const vector<QueryPlanner::SubtreePlan*>& children) const { | |||
const vector<const QueryPlanner::SubtreePlan*>& children) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should use a vector of shared_ptr
here too. That said we also need to keep track if the shared_ptr
change has a significant performance impact. Can't see an impact for completions though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we don't use shared_ptr
as the data is stored in the childPlanStorage
and subqueryPlans
vectors which are guaranteed to have the required lifetime to ensure that the data is never used after deletion. Unless we want to use these SubtreePlans
somewhere else shared_ptr
seems like an unecessary overhead to me (although it might not be noticeable in the end, given the small size of the children
vector).
@floriankramer are you getting mails for failing CI or is that still not working? |
@niklas88 I didn't see a mail for the travis failure. Do I have to subscribe to those somewhere? |
@floriankramer I would guess that this is controlled by the GitHub notification settings in your profile. I'm not sure however since Checks aren't explicitly mentioned there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
void ParsedQuery::GraphPattern::recomputeIds(size_t* id_count) { | ||
bool allocatedIdCounter = false; | ||
if (id_count == nullptr) { | ||
id_count = new size_t(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this during the review but we could just use a stack var here and pass a pointer to that to the recursive call only if the parameter was null. This leaves the same variable in the recursive calls unused but that should still be faster than an allocation.
Adds support for property paths, but does not support transitive operations (e.g *) yet.