Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optional property paths (*, ?) are very slow #695

Closed
VladimirAlexiev opened this issue Dec 14, 2016 · 9 comments
Closed

optional property paths (*, ?) are very slow #695

VladimirAlexiev opened this issue Dec 14, 2016 · 9 comments
Labels
🐞 bug issue is a bug
Milestone

Comments

@VladimirAlexiev
Copy link

VladimirAlexiev commented Dec 14, 2016

(Created from OWLIM-1104).

Prop paths using optional constructs (p? and p*) are extremely slow.
These paths alone are necessarily slow because they must return all nodes.
But nobody uses them alone, people always use them in combination with a non-optional path.
So one has to use workarounds, eg rewrite q/p? to q|q/p, to get good perfomance.

Use cases:

  1. In Getty queries Vladimir Alexiev uses many rewritten patterns like this:
gvp:placeType/gvp:broaderGenericExtended? ->
gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended)
  1. In OWLIM-2627 Barry Norton suggests a similar but wordier workaround for BM data:
?x q/p? ?y ->
{?x q ?y} union {?x q ?z . ?z p ?y}
  1. SHACL uses rdf:List for many of its constructs (sometimes unnecessarily) and uses rdf:rest*/rdf:first to unroll the list, eg see ClosedConstraintComponent.
    Unless fixed, one can't use the SHACL SPARQL Definitions to implement SHACL in rdf4j

  2. Another typical example is poor man's RDFS inference: rdf:type/rdfs:subClassOf*

  3. American Art Collaborative discusses using rdf:List for "people depicted, with order". In this case a natural query would be
    crm:P62_depicts/(rdf:rest*/rdf:first)?
    but should be rewritten to
    crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

It seems to me that this can be fixed by internally rewriting to an alternative (union) path: one without the optional property and another with it.

p?/q -> q|p/q
p*/q -> q|p+/q
q/p? -> q|q/p
q/p* -> q|q/p+

Implementaton suggestion:

  • add a path characteristic "positive" (cannot be empty) vs "non-negative" (can be empty); the names are by analogy with integers.
  • design rewrite rules to produce alternatives where the positive path is always present, but the non-negative path alternates

More examples:

p?/q? -> ()|p|q|p/q

This alone will be slow since it's non-negative (() is the bad part), but when combined with a positive path, it can be made fast again, eg:

p?/q?/r ->
  r | (()|p|q|p/q)/r ->
  r | r|p/r|q/r|p/q/r ->
  r|p/r|q/r|p/q/r
@VladimirAlexiev
Copy link
Author

necessarily slow

The reason is that ?x prop* ?y must return EVERY node:
every node is connected by a zero-length path of any prop to itself.

@VladimirAlexiev
Copy link
Author

VladimirAlexiev commented Dec 14, 2016

Some more thoughts on implementation.

First compute path nature (POS, NON) (see prop paths):

  • iri is POS
  • ^elt is same
  • elt1 / elt2 is POS if either one is POS, else NON
  • elt1 | elt2 is POS if both are POS, else NON
  • elt* is NON
  • elt+ is same
  • elt? is NON
  • !whatever is POS (it's a path of length one)
  • (elt) is same

Then the following rewrite rules are executed repeatedly.
pX are POS paths, nX are NON paths, qX are either, () is the empty path (which is NON).

  1. Shorthands
    The 2010 prop path language has some extras that boil down to:
    elt1 ^ elt2 is shorthand for elt1 / ^elt2
    elt{n} is shorthand for elt/.../elt (n times without ?)
    elt{,n} is shorthand for elt?/.../elt? (n times with ?)
    elt{n,} is shorthand for elt/.../elt / elt* (n times without ? then one with *)
    elt{n,m} is shorthand for elt/.../elt / elt?/.../elt? (n without ? then m-n with ?)

  2. Prioritize
    Push POS paths to the left of alternatives (I assume they are evaluated left to right). This matters for LIMIT queries.
    n|p -> p|n

  3. Expand Alternatives
    Push alternatives to the top level.
    ^(q1|q2) -> ^q1 | ^q2
    q1/(q2|q3) -> q1/q2 | q1/q3
    (q1|q2)/q3 -> q1/q3 | q2/q3
    (q1|q2)* -> q1* | q2*
    (q1|q2)+ -> q1+ | q2+
    (q1|q2)? -> q1? | q2?
    q* -> q+ | ()
    q? -> q | ()

  4. Simplify
    q|...|q -> q|... (where ... is any number of alternatives)
    q/() -> q
    ()/q -> q

  5. Simplify iterations
    The expansion could create some consecutive iteration symbols that should also be simplified. But because such are allowed in the original path, I guess there already is such simplification:
    q** -> q*
    q*+ -> q*
    q+* -> q*
    q++ -> q+
    q*? -> q*
    q?* -> q*
    q+? -> q?
    q?+ -> q?
    q?? -> q?

I think that's all that's needed.

@VladimirAlexiev
Copy link
Author

VladimirAlexiev commented Dec 14, 2016

Eg if you try the rewrites on Use Case 5 above, you get this sequence
crm:P62_depicts/(rdf:rest*/rdf:first)? expand
crm:P62_depicts/(rdf:rest+|())/rdf:first)? expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+|())/rdf:first) expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+/rdf:first | ()/rdf:first) expand
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/()/rdf:first simplify
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/rdf:first

This is correct and fast enough. But the most natural way to write it is the fastest, because it finds the shortest paths first (and returns elements in the desired order):
crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

So I think we need these changes:

  1. Prioritize
    q|() -> ()|q
    n|p -> p|n only if n != ()

and then:

  1. Eliminate empties
    ()|... -> ... but only at top level
    This means that a query ?x p* ?y will be treated as ?x p+ ?y and will NOT return every node. This may fail some conformance test, but is exactly the wanted behavior (whoever wants to find all nodes in a repo?)

@abrokenjester
Copy link
Contributor

Possibly related is issue #689 , which turned out to be partly caused by a bug in binding retention, but also a related query optimization issue. We've improved query optimization for arbitrary-length paths in RDF4J 2.1.4 to the effect that the optimizer will almost always execute the zero-or-more part last (leading to most variables being bound already and therefore far less options to search through). I have not yet looked at your analysis in sufficient detail to figure out whether that corresponds with your findings, but in practical terms it might be good to try your slow queries with the new planner, and see if there's an improvement.

@VladimirAlexiev
Copy link
Author

In addition to performance, it's important to return the shortest-path results first,
because that is the expected order (e.g. for list unrolling rdf:rest*/rdf:first).

@catch-point
Copy link

I confirmed the issue exists in RDF4J 2.1.3, 2.1.6, and 2.2
Using the dataset http://vocab.getty.edu/dataset/aat/full.zip I had results similar to the following on each.
prefix gvp:http://vocab.getty.edu/ontology#
ask { ?x gvp:broader|(gvp:broader/gvp:broaderGenericExtended) ?y } # ~25ms
ask { ?x gvp:broader/gvp:broaderGenericExtended? ?y } # ~2500ms

@abrokenjester
Copy link
Contributor

It's been a while since I looked at this but we should investigate how the planner orders execution of the path elements and how it ensures that as much as possible, it executes paths with bound start or end points first.

As for output ordering: a SPARQL query result is unordered by nature, so any ordering that you rely on (other than an externally imposed one using an ORDER BY clause) is bound to give you problems in the future.

@abrokenjester
Copy link
Contributor

Also surprised that the issue still exists post 2.1.4 (and the fix for #689). Might be worth checking if that fix was somehow accidentally not included in later releases.

@VladimirAlexiev
Copy link
Author

As for output ordering: a SPARQL query result is unordered by nature

It won't hurt to have a sensible order, esp for rdf:List

abrokenjester pushed a commit that referenced this issue Apr 3, 2017
Fix #695: Don't treat path modifiers as subqueries
catch-point pushed a commit to catch-point/rdf4j that referenced this issue Apr 19, 2017
Signed-off-by: James Leigh <james.leigh@ontotext.com>
catch-point pushed a commit to catch-point/rdf4j that referenced this issue Apr 20, 2017
…4j#695-join-projection

Fix eclipse-rdf4j#695: Don't treat path modifiers as subqueries
hmottestad pushed a commit that referenced this issue Apr 21, 2017
Fix #695: Don't treat path modifiers as subqueries
@abrokenjester abrokenjester added the 🐞 bug issue is a bug label May 4, 2017
@abrokenjester abrokenjester added this to the 2.2.1 milestone May 4, 2017
heshanjse pushed a commit to heshanjse/rdf4j that referenced this issue Aug 27, 2017
Signed-off-by: James Leigh <james.leigh@ontotext.com>

Signed-off-by: Heshan Jayasinghe <shanujse@gmail.com>
heshanjse pushed a commit to heshanjse/rdf4j that referenced this issue Aug 27, 2017
…4j#695-join-projection

Fix eclipse-rdf4j#695: Don't treat path modifiers as subqueries
Signed-off-by: Heshan Jayasinghe <shanujse@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug issue is a bug
Projects
None yet
Development

No branches or pull requests

3 participants