optional property paths (*, ?) are very slow #695

VladimirAlexiev · 2016-12-14T08:13:36Z

(Created from OWLIM-1104).

Prop paths using optional constructs (p? and p*) are extremely slow.
These paths alone are necessarily slow because they must return all nodes.
But nobody uses them alone, people always use them in combination with a non-optional path.
So one has to use workarounds, eg rewrite q/p? to q|q/p, to get good perfomance.

Use cases:

In Getty queries Vladimir Alexiev uses many rewritten patterns like this:

gvp:placeType/gvp:broaderGenericExtended? ->
gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended)

In OWLIM-2627 Barry Norton suggests a similar but wordier workaround for BM data:

?x q/p? ?y ->
{?x q ?y} union {?x q ?z . ?z p ?y}

SHACL uses rdf:List for many of its constructs (sometimes unnecessarily) and uses rdf:rest*/rdf:first to unroll the list, eg see ClosedConstraintComponent.
Unless fixed, one can't use the SHACL SPARQL Definitions to implement SHACL in rdf4j
Another typical example is poor man's RDFS inference: rdf:type/rdfs:subClassOf*
American Art Collaborative discusses using rdf:List for "people depicted, with order". In this case a natural query would be
crm:P62_depicts/(rdf:rest*/rdf:first)?
but should be rewritten to
crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

It seems to me that this can be fixed by internally rewriting to an alternative (union) path: one without the optional property and another with it.

p?/q -> q|p/q
p*/q -> q|p+/q
q/p? -> q|q/p
q/p* -> q|q/p+

Implementaton suggestion:

add a path characteristic "positive" (cannot be empty) vs "non-negative" (can be empty); the names are by analogy with integers.
design rewrite rules to produce alternatives where the positive path is always present, but the non-negative path alternates

More examples:

p?/q? -> ()|p|q|p/q

This alone will be slow since it's non-negative (() is the bad part), but when combined with a positive path, it can be made fast again, eg:

p?/q?/r ->
  r | (()|p|q|p/q)/r ->
  r | r|p/r|q/r|p/q/r ->
  r|p/r|q/r|p/q/r

The text was updated successfully, but these errors were encountered:

VladimirAlexiev · 2016-12-14T08:16:58Z

necessarily slow

The reason is that ?x prop* ?y must return EVERY node:
every node is connected by a zero-length path of any prop to itself.

VladimirAlexiev · 2016-12-14T09:22:23Z

Some more thoughts on implementation.

First compute path nature (POS, NON) (see prop paths):

iri is POS
^elt is same
elt1 / elt2 is POS if either one is POS, else NON
elt1 | elt2 is POS if both are POS, else NON
elt* is NON
elt+ is same
elt? is NON
!whatever is POS (it's a path of length one)
(elt) is same

Then the following rewrite rules are executed repeatedly.
pX are POS paths, nX are NON paths, qX are either, () is the empty path (which is NON).

Shorthands
The 2010 prop path language has some extras that boil down to:
elt1 ^ elt2 is shorthand for elt1 / ^elt2
elt{n} is shorthand for elt/.../elt (n times without ?)
elt{,n} is shorthand for elt?/.../elt? (n times with ?)
elt{n,} is shorthand for elt/.../elt / elt* (n times without ? then one with *)
elt{n,m} is shorthand for elt/.../elt / elt?/.../elt? (n without ? then m-n with ?)
Prioritize
Push POS paths to the left of alternatives (I assume they are evaluated left to right). This matters for LIMIT queries.
n|p -> p|n
Expand Alternatives
Push alternatives to the top level.
^(q1|q2) -> ^q1 | ^q2
q1/(q2|q3) -> q1/q2 | q1/q3
(q1|q2)/q3 -> q1/q3 | q2/q3
(q1|q2)* -> q1* | q2*
(q1|q2)+ -> q1+ | q2+
(q1|q2)? -> q1? | q2?
q* -> q+ | ()
q? -> q | ()
Simplify
q|...|q -> q|... (where ... is any number of alternatives)
q/() -> q
()/q -> q
Simplify iterations
The expansion could create some consecutive iteration symbols that should also be simplified. But because such are allowed in the original path, I guess there already is such simplification:
q** -> q*
q*+ -> q*
q+* -> q*
q++ -> q+
q*? -> q*
q?* -> q*
q+? -> q?
q?+ -> q?
q?? -> q?

I think that's all that's needed.

VladimirAlexiev · 2016-12-14T10:05:50Z

Eg if you try the rewrites on Use Case 5 above, you get this sequence
crm:P62_depicts/(rdf:rest*/rdf:first)? expand
crm:P62_depicts/(rdf:rest+|())/rdf:first)? expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+|())/rdf:first) expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+/rdf:first | ()/rdf:first) expand
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/()/rdf:first simplify
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/rdf:first

This is correct and fast enough. But the most natural way to write it is the fastest, because it finds the shortest paths first (and returns elements in the desired order):
crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

So I think we need these changes:

Prioritize
q|() -> ()|q
n|p -> p|n only if n != ()

and then:

Eliminate empties
()|... -> ... but only at top level
This means that a query ?x p* ?y will be treated as ?x p+ ?y and will NOT return every node. This may fail some conformance test, but is exactly the wanted behavior (whoever wants to find all nodes in a repo?)

abrokenjester · 2016-12-17T00:05:36Z

Possibly related is issue #689 , which turned out to be partly caused by a bug in binding retention, but also a related query optimization issue. We've improved query optimization for arbitrary-length paths in RDF4J 2.1.4 to the effect that the optimizer will almost always execute the zero-or-more part last (leading to most variables being bound already and therefore far less options to search through). I have not yet looked at your analysis in sufficient detail to figure out whether that corresponds with your findings, but in practical terms it might be good to try your slow queries with the new planner, and see if there's an improvement.

VladimirAlexiev · 2017-03-22T10:05:39Z

In addition to performance, it's important to return the shortest-path results first,
because that is the expected order (e.g. for list unrolling rdf:rest*/rdf:first).

catch-point · 2017-03-29T20:31:27Z

I confirmed the issue exists in RDF4J 2.1.3, 2.1.6, and 2.2
Using the dataset http://vocab.getty.edu/dataset/aat/full.zip I had results similar to the following on each.
prefix gvp:http://vocab.getty.edu/ontology#
ask { ?x gvp:broader|(gvp:broader/gvp:broaderGenericExtended) ?y } # ~25ms
ask { ?x gvp:broader/gvp:broaderGenericExtended? ?y } # ~2500ms

abrokenjester · 2017-03-30T02:52:46Z

It's been a while since I looked at this but we should investigate how the planner orders execution of the path elements and how it ensures that as much as possible, it executes paths with bound start or end points first.

As for output ordering: a SPARQL query result is unordered by nature, so any ordering that you rely on (other than an externally imposed one using an ORDER BY clause) is bound to give you problems in the future.

abrokenjester · 2017-03-30T02:54:19Z

Also surprised that the issue still exists post 2.1.4 (and the fix for #689). Might be worth checking if that fix was somehow accidentally not included in later releases.

VladimirAlexiev · 2017-03-30T09:28:26Z

As for output ordering: a SPARQL query result is unordered by nature

It won't hurt to have a sensible order, esp for rdf:List

people who use rdf:rest*/rdf:first will get a nasty surprise if it's not
I would guess that Jena's list:member https://jena.apache.org/documentation/query/extension.html#property-functions returns them in order

Fix #695: Don't treat path modifiers as subqueries

Signed-off-by: James Leigh <james.leigh@ontotext.com>

…4j#695-join-projection Fix eclipse-rdf4j#695: Don't treat path modifiers as subqueries

Fix #695: Don't treat path modifiers as subqueries

Signed-off-by: James Leigh <james.leigh@ontotext.com> Signed-off-by: Heshan Jayasinghe <shanujse@gmail.com>

…4j#695-join-projection Fix eclipse-rdf4j#695: Don't treat path modifiers as subqueries Signed-off-by: Heshan Jayasinghe <shanujse@gmail.com>

VladimirAlexiev mentioned this issue Dec 14, 2016

Complexity of Lists designforcontext/aac_review_tool#20

Merged

catch-point mentioned this issue Mar 31, 2017

Fix #695: Don't treat path modifiers as subqueries #797

Merged

abrokenjester closed this as completed in 61007ee Apr 3, 2017

abrokenjester pushed a commit that referenced this issue Apr 3, 2017

Merge pull request #797 from jamesrdf/issues/#695-join-projection

b8a5668

Fix #695: Don't treat path modifiers as subqueries

catch-point pushed a commit to catch-point/rdf4j that referenced this issue Apr 19, 2017

Fix eclipse-rdf4j#695: Don't treat path modifiers as subqueries

e971397

Signed-off-by: James Leigh <james.leigh@ontotext.com>

catch-point pushed a commit to catch-point/rdf4j that referenced this issue Apr 20, 2017

Merge pull request eclipse-rdf4j#797 from jamesrdf/issues/eclipse-rdf…

18b4df4

…4j#695-join-projection Fix eclipse-rdf4j#695: Don't treat path modifiers as subqueries

hmottestad pushed a commit that referenced this issue Apr 21, 2017

Merge pull request #797 from jamesrdf/issues/#695-join-projection

cb07890

Fix #695: Don't treat path modifiers as subqueries

VladimirAlexiev mentioned this issue Apr 27, 2017

Add support for Shapes Constraint Language (SHACL) #743

Closed

abrokenjester added the 🐞 bug issue is a bug label May 4, 2017

abrokenjester added this to the 2.2.1 milestone May 4, 2017

heshanjse pushed a commit to heshanjse/rdf4j that referenced this issue Aug 27, 2017

Fix eclipse-rdf4j#695: Don't treat path modifiers as subqueries

bbe1a38

Signed-off-by: James Leigh <james.leigh@ontotext.com> Signed-off-by: Heshan Jayasinghe <shanujse@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optional property paths (*, ?) are very slow #695

optional property paths (*, ?) are very slow #695

VladimirAlexiev commented Dec 14, 2016 •

edited

VladimirAlexiev commented Dec 14, 2016

VladimirAlexiev commented Dec 14, 2016 •

edited

VladimirAlexiev commented Dec 14, 2016 •

edited

abrokenjester commented Dec 17, 2016

VladimirAlexiev commented Mar 22, 2017

catch-point commented Mar 29, 2017

abrokenjester commented Mar 30, 2017

abrokenjester commented Mar 30, 2017

VladimirAlexiev commented Mar 30, 2017

optional property paths (*, ?) are very slow #695

optional property paths (*, ?) are very slow #695

Comments

VladimirAlexiev commented Dec 14, 2016 • edited

VladimirAlexiev commented Dec 14, 2016

VladimirAlexiev commented Dec 14, 2016 • edited

VladimirAlexiev commented Dec 14, 2016 • edited

abrokenjester commented Dec 17, 2016

VladimirAlexiev commented Mar 22, 2017

catch-point commented Mar 29, 2017

abrokenjester commented Mar 30, 2017

abrokenjester commented Mar 30, 2017

VladimirAlexiev commented Mar 30, 2017

VladimirAlexiev commented Dec 14, 2016 •

edited

VladimirAlexiev commented Dec 14, 2016 •

edited

VladimirAlexiev commented Dec 14, 2016 •

edited