New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optional property paths (*, ?) are very slow #695

Closed
VladimirAlexiev opened this Issue Dec 14, 2016 · 9 comments

Comments

Projects
None yet
3 participants
@VladimirAlexiev

VladimirAlexiev commented Dec 14, 2016

(Created from OWLIM-1104).

Prop paths using optional constructs (p? and p*) are extremely slow.
These paths alone are necessarily slow because they must return all nodes.
But nobody uses them alone, people always use them in combination with a non-optional path.
So one has to use workarounds, eg rewrite q/p? to q|q/p, to get good perfomance.

Use cases:

  1. In Getty queries Vladimir Alexiev uses many rewritten patterns like this:
gvp:placeType/gvp:broaderGenericExtended? ->
gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended)
  1. In OWLIM-2627 Barry Norton suggests a similar but wordier workaround for BM data:
?x q/p? ?y ->
{?x q ?y} union {?x q ?z . ?z p ?y}
  1. SHACL uses rdf:List for many of its constructs (sometimes unnecessarily) and uses rdf:rest*/rdf:first to unroll the list, eg see ClosedConstraintComponent.
    Unless fixed, one can't use the SHACL SPARQL Definitions to implement SHACL in rdf4j

  2. Another typical example is poor man's RDFS inference: rdf:type/rdfs:subClassOf*

  3. American Art Collaborative discusses using rdf:List for "people depicted, with order". In this case a natural query would be
    crm:P62_depicts/(rdf:rest*/rdf:first)?
    but should be rewritten to
    crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

It seems to me that this can be fixed by internally rewriting to an alternative (union) path: one without the optional property and another with it.

p?/q -> q|p/q
p*/q -> q|p+/q
q/p? -> q|q/p
q/p* -> q|q/p+

Implementaton suggestion:

  • add a path characteristic "positive" (cannot be empty) vs "non-negative" (can be empty); the names are by analogy with integers.
  • design rewrite rules to produce alternatives where the positive path is always present, but the non-negative path alternates

More examples:

p?/q? -> ()|p|q|p/q

This alone will be slow since it's non-negative (() is the bad part), but when combined with a positive path, it can be made fast again, eg:

p?/q?/r ->
  r | (()|p|q|p/q)/r ->
  r | r|p/r|q/r|p/q/r ->
  r|p/r|q/r|p/q/r
@VladimirAlexiev

This comment has been minimized.

VladimirAlexiev commented Dec 14, 2016

necessarily slow

The reason is that ?x prop* ?y must return EVERY node:
every node is connected by a zero-length path of any prop to itself.

@VladimirAlexiev

This comment has been minimized.

VladimirAlexiev commented Dec 14, 2016

Some more thoughts on implementation.

First compute path nature (POS, NON) (see prop paths):

  • iri is POS
  • ^elt is same
  • elt1 / elt2 is POS if either one is POS, else NON
  • elt1 | elt2 is POS if both are POS, else NON
  • elt* is NON
  • elt+ is same
  • elt? is NON
  • !whatever is POS (it's a path of length one)
  • (elt) is same

Then the following rewrite rules are executed repeatedly.
pX are POS paths, nX are NON paths, qX are either, () is the empty path (which is NON).

  1. Shorthands
    The 2010 prop path language has some extras that boil down to:
    elt1 ^ elt2 is shorthand for elt1 / ^elt2
    elt{n} is shorthand for elt/.../elt (n times without ?)
    elt{,n} is shorthand for elt?/.../elt? (n times with ?)
    elt{n,} is shorthand for elt/.../elt / elt* (n times without ? then one with *)
    elt{n,m} is shorthand for elt/.../elt / elt?/.../elt? (n without ? then m-n with ?)

  2. Prioritize
    Push POS paths to the left of alternatives (I assume they are evaluated left to right). This matters for LIMIT queries.
    n|p -> p|n

  3. Expand Alternatives
    Push alternatives to the top level.
    ^(q1|q2) -> ^q1 | ^q2
    q1/(q2|q3) -> q1/q2 | q1/q3
    (q1|q2)/q3 -> q1/q3 | q2/q3
    (q1|q2)* -> q1* | q2*
    (q1|q2)+ -> q1+ | q2+
    (q1|q2)? -> q1? | q2?
    q* -> q+ | ()
    q? -> q | ()

  4. Simplify
    q|...|q -> q|... (where ... is any number of alternatives)
    q/() -> q
    ()/q -> q

  5. Simplify iterations
    The expansion could create some consecutive iteration symbols that should also be simplified. But because such are allowed in the original path, I guess there already is such simplification:
    q** -> q*
    q*+ -> q*
    q+* -> q*
    q++ -> q+
    q*? -> q*
    q?* -> q*
    q+? -> q?
    q?+ -> q?
    q?? -> q?

I think that's all that's needed.

@VladimirAlexiev

This comment has been minimized.

VladimirAlexiev commented Dec 14, 2016

Eg if you try the rewrites on Use Case 5 above, you get this sequence
crm:P62_depicts/(rdf:rest*/rdf:first)? expand
crm:P62_depicts/(rdf:rest+|())/rdf:first)? expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+|())/rdf:first) expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+/rdf:first | ()/rdf:first) expand
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/()/rdf:first simplify
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/rdf:first

This is correct and fast enough. But the most natural way to write it is the fastest, because it finds the shortest paths first (and returns elements in the desired order):
crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

So I think we need these changes:

  1. Prioritize
    q|() -> ()|q
    n|p -> p|n only if n != ()

and then:

  1. Eliminate empties
    ()|... -> ... but only at top level
    This means that a query ?x p* ?y will be treated as ?x p+ ?y and will NOT return every node. This may fail some conformance test, but is exactly the wanted behavior (whoever wants to find all nodes in a repo?)
@jeenbroekstra

This comment has been minimized.

Contributor

jeenbroekstra commented Dec 17, 2016

Possibly related is issue #689 , which turned out to be partly caused by a bug in binding retention, but also a related query optimization issue. We've improved query optimization for arbitrary-length paths in RDF4J 2.1.4 to the effect that the optimizer will almost always execute the zero-or-more part last (leading to most variables being bound already and therefore far less options to search through). I have not yet looked at your analysis in sufficient detail to figure out whether that corresponds with your findings, but in practical terms it might be good to try your slow queries with the new planner, and see if there's an improvement.

@VladimirAlexiev

This comment has been minimized.

VladimirAlexiev commented Mar 22, 2017

In addition to performance, it's important to return the shortest-path results first,
because that is the expected order (e.g. for list unrolling rdf:rest*/rdf:first).

@jamesrdf

This comment has been minimized.

jamesrdf commented Mar 29, 2017

I confirmed the issue exists in RDF4J 2.1.3, 2.1.6, and 2.2
Using the dataset http://vocab.getty.edu/dataset/aat/full.zip I had results similar to the following on each.
prefix gvp:http://vocab.getty.edu/ontology#
ask { ?x gvp:broader|(gvp:broader/gvp:broaderGenericExtended) ?y } # ~25ms
ask { ?x gvp:broader/gvp:broaderGenericExtended? ?y } # ~2500ms

@jeenbroekstra

This comment has been minimized.

Contributor

jeenbroekstra commented Mar 30, 2017

It's been a while since I looked at this but we should investigate how the planner orders execution of the path elements and how it ensures that as much as possible, it executes paths with bound start or end points first.

As for output ordering: a SPARQL query result is unordered by nature, so any ordering that you rely on (other than an externally imposed one using an ORDER BY clause) is bound to give you problems in the future.

@jeenbroekstra

This comment has been minimized.

Contributor

jeenbroekstra commented Mar 30, 2017

Also surprised that the issue still exists post 2.1.4 (and the fix for #689). Might be worth checking if that fix was somehow accidentally not included in later releases.

@VladimirAlexiev

This comment has been minimized.

VladimirAlexiev commented Mar 30, 2017

As for output ordering: a SPARQL query result is unordered by nature

It won't hurt to have a sensible order, esp for rdf:List

jeenbroekstra added a commit that referenced this issue Apr 3, 2017

Merge pull request #797 from jamesrdf/issues/#695-join-projection
Fix #695: Don't treat path modifiers as subqueries

jamesrdf pushed a commit to jamesrdf/rdf4j that referenced this issue Apr 19, 2017

James Leigh
Fix eclipse#695: Don't treat path modifiers as subqueries
Signed-off-by: James Leigh <james.leigh@ontotext.com>

jamesrdf pushed a commit to jamesrdf/rdf4j that referenced this issue Apr 20, 2017

Merge pull request eclipse#797 from jamesrdf/issues/eclipse#695-join-…
…projection

Fix eclipse#695: Don't treat path modifiers as subqueries

hmottestad added a commit that referenced this issue Apr 21, 2017

Merge pull request #797 from jamesrdf/issues/#695-join-projection
Fix #695: Don't treat path modifiers as subqueries

@jeenbroekstra jeenbroekstra added the bug label May 4, 2017

@jeenbroekstra jeenbroekstra added this to the 2.2.1 milestone May 4, 2017

heshanjse pushed a commit to heshanjse/rdf4j that referenced this issue Aug 27, 2017

James Leigh
Fix eclipse#695: Don't treat path modifiers as subqueries
Signed-off-by: James Leigh <james.leigh@ontotext.com>

Signed-off-by: Heshan Jayasinghe <shanujse@gmail.com>

heshanjse pushed a commit to heshanjse/rdf4j that referenced this issue Aug 27, 2017

Merge pull request eclipse#797 from jamesrdf/issues/eclipse#695-join-…
…projection

Fix eclipse#695: Don't treat path modifiers as subqueries
Signed-off-by: Heshan Jayasinghe <shanujse@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment