feat(gravsearch): improve gravsearch performance by using unions in p…
…requery (DEV-492) (#2045)

* Update SparqlTransformer.scala

* feat: add superPropertyOf map to ontology cache

* refactor: reduce logging noise

* chore: add clean-sbt target to makefile

* feat: replace property path query statements with unions for subPropertyOf*

* feat: use unions for subclasses

* refactor: tidy up some old mess

* refactor: add more logging

* add limiting param to transformer to reduce inference

* ignore failing test for now

* feat: start working on reducing union options on basis of the query

* tidy up

* minor improvements

* get tests to pass

* feat: limit subclasses

* feat: include optimization in count query

* test: minimal test for compound objects with gravsearch

* test: test simulated inference with union patterns

* refactor: start tidying up

* refactor: more tidying up

* refactor: tidy up

* docs: start documenting the changes

* refactor: remove unused code

* docs: update documentation

* refactor: remove some code smells

* refactor: tidy up, improve variable naming and add documentation

* refactor: format SparqlTransformerSpec.scala

* Apply suggestions from code review

Co-authored-by: irinaschubert <irina.schubert@dasch.swiss>

* tidy up

* wrap up according to review

Co-authored-by: irinaschubert <irina.schubert@dasch.swiss>
BalduinLandolt and irinaschubert committed May 10, 2022
1 parent a9fda7e commit 40354a7d0ee7bc4954adb87e8b16ba4d9fc45784
Showing with 985 additions and 527 deletions.
  1. +7 −0 Makefile
  2. +1 −1 docs/01-introduction/what-is-knora.md
  3. +23 −29 docs/03-apis/api-v2/query-language.md
  4. +13 −34 docs/05-internals/design/api-v2/gravsearch.md
  5. +21 −51 docs/05-internals/design/api-v2/query-design.md
  6. +164 −25 webapi/src/main/scala/org/knora/webapi/messages/util/search/QueryTraverser.scala
  7. +105 −49 webapi/src/main/scala/org/knora/webapi/messages/util/search/SparqlTransformer.scala
  8. +2 −1 ...n/scala/org/knora/webapi/messages/util/search/gravsearch/prequery/AbstractPrequeryGenerator.scala
  9. +6 −2 .../util/search/gravsearch/prequery/NonTriplestoreSpecificGravsearchToCountPrequeryTransformer.scala
  10. +11 −5 ...sages/util/search/gravsearch/prequery/NonTriplestoreSpecificGravsearchToPrequeryTransformer.scala
  11. +6 −3 ...n/scala/org/knora/webapi/messages/util/search/gravsearch/types/GravsearchTypeInspectionUtil.scala
  12. +311 −279 webapi/src/main/scala/org/knora/webapi/responders/v2/SearchResponderV2.scala
  13. +36 −6 webapi/src/main/scala/org/knora/webapi/responders/v2/ontology/Cache.scala
  14. +19 −0 webapi/src/main/scala/org/knora/webapi/responders/v2/ontology/OntologyHelpers.scala
  15. +10 −1 webapi/src/main/scala/org/knora/webapi/store/triplestore/http/HttpTriplestoreConnector.scala
  16. +130 −38 webapi/src/test/scala/org/knora/webapi/messages/util/search/SparqlTransformerSpec.scala
  17. +4 −1 ...l/search/gravsearch/prequery/NonTriplestoreSpecificGravsearchToCountPrequeryTransformerSpec.scala
  18. +2 −1 ...s/util/search/gravsearch/prequery/NonTriplestoreSpecificGravsearchToPrequeryTransformerSpec.scala
  19. +17 −0 webapi/src/test/scala/org/knora/webapi/responders/v2/SearchResponderV2Spec.scala
  20. +96 −0 webapi/src/test/scala/org/knora/webapi/responders/v2/SearchResponderV2SpecFullData.scala
  21. +1 −1 webapi/src/test/scala/org/knora/webapi/util/StartupUtils.scala
@@ -280,6 +280,13 @@ clean-local-tmp:
@rm -rf .tmp
@mkdir .tmp

.PHONY: clean-metals
clean-metals: ## clean SBT and Metals related stuff
@rm -rf .bloop
@rm -rf .bsp
@rm -rf .metals
@rm -rf target

clean: docs-clean clean-local-tmp clean-docker clean-sipi-tmp ## clean build artifacts
@rm -rf .env

@@ -74,7 +74,7 @@ and can regenerate the original XML document at any time.

DSP-API provides a search language, [Gravsearch](../03-apis/api-v2/query-language.md),
that is designed to meet the needs of humanities researchers. Gravsearch supports DSP-API's
humanities-focused data structures, including calendar-independent dates and standoff markup, as well
as fast full-text searches. This allows searches to combine text-related criteria with any other
criteria. For example, you could search for a text that contains a certain word
and also mentions a person who lived in the same city as another person who is the
@@ -13,15 +13,15 @@ criteria) while avoiding their drawbacks in terms of performance and
security (see [The Enduring Myth of the SPARQL
Endpoint](https://daverog.wordpress.com/2013/06/04/the-enduring-myth-of-the-sparql-endpoint/)).
It also has the benefit of enabling clients to work with a simpler RDF
data model than the one the API actually uses to store data in the
triplestore, and makes it possible to provide better error-checking.

Rather than being processed directly by the triplestore, a Gravsearch query
is interpreted by the API, which enforces certain
restrictions on the query, and implements paging and permission
checking. The API server generates SPARQL based on the Gravsearch query
submitted, queries the triplestore, filters the results according to the
user's permissions, and returns each page of query results as an
API response. Thus, Gravsearch is a hybrid between a RESTful API and a
SPARQL endpoint.

@@ -80,14 +80,14 @@ If a gravsearch query times out, a `504 Gateway Timeout` will be returned.
A Gravsearch query can be written in either of the two
[DSP-API v2 schemas](introduction.md#api-schema). The simple schema
is easier to work with, and is sufficient if you don't need to query
anything below the level of a DSP-API value. If your query needs to refer to
standoff markup, you must use the complex schema. Each query must use a single
schema, with one exception (see [Date Comparisons](#date-comparisons)).

Gravsearch query results can be requested in the simple or complex schema;
see [API Schema](introduction.md#api-schema).

All examples hereafter run with the DSP stack started locally as documented in the section [Getting Started with DSP-API](../../04-publishing-deployment/getting-started.md). If you access another stack, you can check the IRI of the ontology you are targeting by requesting the [ontologies metadata](ontology-information.md#querying-ontology-metadata).

### Using the Simple Schema

@@ -100,8 +100,7 @@ PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>
```

In the simple schema, DSP-API values are represented as literals, which can be used in `FILTER` expressions
(see [Filtering on Values in the Simple Schema](#filtering-on-values-in-the-simple-schema)).

### Using the Complex Schema
@@ -115,7 +114,7 @@ PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/v2#>
```

In the complex schema, DSP-API values are represented as objects belonging
to subclasses of `knora-api:Value`, e.g. `knora-api:TextValue`, and have
predicates of their own, which can be used in `FILTER` expressions
(see [Filtering on Values in the Complex Schema](#filtering-on-values-in-the-complex-schema)).
@@ -182,7 +181,7 @@ permission to see a matching dependent resource, the link value is hidden.
## Paging

Gravsearch results are returned in pages. The maximum number of main
resources per page is determined by the API (and can be configured
in `application.conf` via the setting `app/v2/resources-sequence/results-per-page`).
If some resources have been filtered out because the user does not have
permission to see them, a page could contain fewer results, or no results.
@@ -195,25 +194,20 @@ one at a time, until the response does not contain `knora-api:mayHaveMoreResults
## Inference

Gravsearch queries are understood to imply a subset of
[RDFS reasoning](https://www.w3.org/TR/rdf11-mt/). This is done by the API by expanding the incoming query.

Specifically, if a statement pattern specifies a property, the pattern will
also match subproperties of that property, and if a statement specifies that
a subject has a particular `rdf:type`, the statement will also match subjects
belonging to subclasses of that type.
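As a rough sketch of what this entailment means in practice (using hypothetical class names, not the actual DSP-API expansion code), a pattern matching a type also matches all of its transitive subclasses:

```python
# Illustration only: computing the transitive subclass closure of a type,
# which is the set of types a statement with that rdf:type also matches.

def transitive_subclasses(cls, direct_subclasses):
    """Return cls plus all of its direct and indirect subclasses."""
    result = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for sub in direct_subclasses.get(current, []):
            if sub not in result:
                result.add(sub)
                stack.append(sub)
    return result

# Hypothetical hierarchy for illustration:
hierarchy = {
    "knora-api:Resource": ["incunabula:book", "incunabula:page"],
    "incunabula:book": ["incunabula:rareBook"],
}

print(sorted(transitive_subclasses("knora-api:Resource", hierarchy)))
```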

If you know that reasoning will not return any additional results for
your query, you can disable it by adding this line to the `WHERE` clause, which may improve query performance:

```sparql
knora-api:GravsearchOptions knora-api:useInference false .
```


## Gravsearch Syntax

Every Gravsearch query is a valid SPARQL 1.1
@@ -244,8 +238,8 @@ clauses use the following patterns, with the specified restrictions:
unordered set of triples. However, a Gravsearch query returns an
ordered list of resources, which can be ordered by the values of
specified properties. If the query is written in the complex schema,
items below the level of DSP-API values may not be used in `ORDER BY`.
- `BIND`: The value assigned must be a DSP resource IRI.

### Resources, Properties, and Values

@@ -269,7 +263,7 @@ must be represented as a query variable.

#### Filtering on Values in the Simple Schema

In the simple schema, a variable representing a DSP-API value can be used
directly in a `FILTER` expression. For example:

```
@@ -279,7 +273,7 @@ FILTER(?title = "Zeitglöcklein des Lebens und Leidens Christi")

Here the type of `?title` is `xsd:string`.

The following value types can be compared with literals in `FILTER`
expressions in the simple schema:

- Text values (`xsd:string`)
@@ -295,7 +289,7 @@ performing an exact match on a list node's label. Labels can be given in differe
If one of the given list node labels matches, it is considered a match.
Note that in the simple schema, uniqueness is not guaranteed (as opposed to the complex schema).

A DSP-API value may not be represented as the literal object of a predicate;
for example, this is not allowed:

```
@@ -304,9 +298,9 @@ for example, this is not allowed:

#### Filtering on Values in the Complex Schema

In the complex schema, variables representing DSP-API values are not literals.
You must add something to the query (generally a statement) to get a literal
from a DSP-API value. For example:

```
?book incunabula:title ?title .
@@ -479,7 +473,7 @@ within a single paragraph.
If you are only interested in specifying that a resource has some text
value containing a standoff link to another resource, the most efficient
way is to use the property `knora-api:hasStandoffLinkTo`, whose subjects and objects
are resources. This property is automatically maintained by the API. For example:

```
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
@@ -623,7 +617,7 @@ CONSTRUCT {

### Filtering on `rdfs:label`

The `rdfs:label` of a resource is not a DSP-API value, but you can still search for it.
This can be done in the same ways in the simple or complex schema:

Using a string literal object:
@@ -708,8 +702,8 @@ clause but not in the `CONSTRUCT` clause, the matching resources or values
will not be included in the results.

If the query is written in the complex schema, all variables in the `CONSTRUCT`
clause must refer to DSP-API resources, DSP-API values, or properties. Data below
the level of values may not be mentioned in the `CONSTRUCT` clause.

Predicates from the `rdf`, `rdfs`, and `owl` ontologies may not be used
in the `CONSTRUCT` clause. The `rdfs:label` of each matching resource is always
@@ -921,7 +915,7 @@ adding statements with the predicate `rdf:type`. The subject must be a resource
and the object must either be `knora-api:Resource` (if the subject is a resource)
or the subject's specific type (if it is a value).

For example, consider this query that uses a non-DSP property:

```
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>
@@ -992,7 +986,7 @@ CONSTRUCT {
Note that it only makes sense to use `dcterms:title` in the simple schema, because
its object is supposed to be a literal.

Here is another example, using a non-DSP class:

```
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
@@ -128,7 +128,7 @@ pattern orders must be optimised by moving `LuceneQueryPatterns` to the beginnin
- `ConstructToConstructTransformer` (extends `WhereTransformer`): instructions how to turn a triplestore independent Construct query into a triplestore dependent Construct query (implementation of inference).

The traits listed above define methods that are implemented in the transformer classes and called by `QueryTraverser` to perform SPARQL to SPARQL conversions.
When iterating over the statements of the input query, the transformer class' transformation methods are called to perform the conversion.

### Prequery

@@ -152,7 +152,7 @@ Next, the Gravsearch query's WHERE clause is transformed and the prequery (SELEC
The transformation of the Gravsearch query's WHERE clause relies on the implementation of the abstract class `AbstractPrequeryGenerator`.

`AbstractPrequeryGenerator` contains members whose state is changed during the iteration over the statements of the input query.
They can then be used to create the converted query.

- `mainResourceVariable: Option[QueryVariable]`: SPARQL variable representing the main resource of the input query. Present in the prequery's SELECT clause.
- `dependentResourceVariables: mutable.Set[QueryVariable]`: a set of SPARQL variables representing dependent resources in the input query. Used in an aggregation function in the prequery's SELECT clause (see below).
@@ -288,29 +288,12 @@ to the maximum allowed page size, the predicate

## Inference

Gravsearch queries support a subset of RDFS reasoning (see [Inference](../../../03-apis/api-v2/query-language.md#inference) in the API documentation
on Gravsearch). This is implemented as follows:

To simulate RDF inference, the API expands the prequery on the basis of the available ontologies. To this end, `SparqlTransformer.transformStatementInWhereForNoInference` expands all `rdfs:subClassOf` and `rdfs:subPropertyOf` statements using `UNION` statements for all subclasses and subproperties from the ontologies (equivalent to `rdfs:subClassOf*` and `rdfs:subPropertyOf*`).
Similarly, `SparqlTransformer.transformStatementInWhereForNoInference` replaces `knora-api:standoffTagHasStartAncestor` with `knora-base:standoffTagHasStartParent*`.
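The idea behind the `UNION` expansion can be sketched as follows (a simplified Python illustration, not the actual Scala implementation; the property names are hypothetical):

```python
# Sketch only: rewrite a single triple pattern into one UNION branch per
# known (sub)property, which is equivalent to rdfs:subPropertyOf* but
# avoids the property path at query time.

def expand_to_union(subject, predicate, obj, subproperties):
    """Rewrite 's p o' as a UNION over p and all of its subproperties."""
    props = [predicate] + sorted(subproperties.get(predicate, []))
    if len(props) == 1:
        # No known subproperties: keep the plain statement.
        return f"{subject} {predicate} {obj} ."
    branches = [f"{{ {subject} {p} {obj} . }}" for p in props]
    return "\nUNION\n".join(branches)

# Hypothetical subproperty map (in DSP-API this comes from the ontology cache):
sub_props = {"knora-api:hasValue": ["incunabula:title", "incunabula:pagenum"]}
print(expand_to_union("?book", "knora-api:hasValue", "?v", sub_props))
```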


# Optimisation of generated SPARQL

@@ -320,8 +303,7 @@ Lucene queries to the beginning of the block in which they occur.

## Query Optimization by Topological Sorting of Statements

In Jena Fuseki, query performance depends heavily on the order of the query statements. For example, a query such as the one below:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
@@ -370,8 +352,7 @@ The rest of the query then reads:
?letter beol:creationDate ?date .
```

Since users cannot be expected to know about performance of triplestores in order to write efficient queries, an optimization method to automatically rearrange the statements of the given queries has been implemented.
Upon receiving the Gravsearch query, the algorithm converts the query to a graph. For each statement pattern,
the subject of the statement is the origin node, the predicate is a directed edge, and the object
is the target node. For the query above, this conversion would result in the following graph:
@@ -384,17 +365,16 @@ topological sorting algorithm](https://en.wikipedia.org/wiki/Topological_sorting
The algorithm returns the nodes of the graph ordered in several layers, where the
root element `?letter` is in layer 0, `[?date, ?person1, ?person2]` are in layer 1, `[?gnd1, ?gnd2]` in layer 2, and the
leaf nodes `[(DE-588)118531379, (DE-588)118696149]` are given in the last layer (i.e. layer 3).
According to Kahn's algorithm, there are multiple valid permutations of the topological order. The graph in the example
above has 24 valid permutations of topological order. Here are two of them (nodes are ordered from left to right with the
highest order to the lowest):

- `(?letter, ?date, ?person2, ?person1, ?gnd2, ?gnd1, (DE-588)118696149, (DE-588)118531379)`
- `(?letter, ?date, ?person1, ?person2, ?gnd1, ?gnd2, (DE-588)118531379, (DE-588)118696149)`.
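The layering step described above can be sketched with Kahn's algorithm (an illustration only, not the actual query-optimisation code): statement patterns become subject-to-object edges, and nodes are grouped into layers by repeatedly removing nodes with no remaining incoming edges.

```python
from collections import defaultdict

def kahn_layers(edges):
    """Group the nodes of a DAG into topological layers (Kahn's algorithm)."""
    nodes = {n for edge in edges for n in edge}
    indegree = defaultdict(int)
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
        indegree[dst] += 1
    # Layer 0: nodes with no incoming edges (the root, e.g. ?letter).
    layer = sorted(n for n in nodes if indegree[n] == 0)
    layers = []
    while layer:
        layers.append(layer)
        next_layer = set()
        for node in layer:
            for dst in adjacency[node]:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    next_layer.add(dst)
        layer = sorted(next_layer)
    return layers

# Edges derived from the example query's statement patterns:
edges = [
    ("?letter", "?date"), ("?letter", "?person1"), ("?letter", "?person2"),
    ("?person1", "?gnd1"), ("?person2", "?gnd2"),
    ("?gnd1", "(DE-588)118531379"), ("?gnd2", "(DE-588)118696149"),
]
print(kahn_layers(edges))
```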

From all valid topological orders, one is chosen based on certain criteria; for example, the leaf node should not
belong to a statement that has predicate `rdf:type`, since that could match all resources of the specified type.
Once the best order is chosen, it is used to re-arrange the query statements. Starting from the last leaf node, i.e.
`(DE-588)118696149`, the method finds the statement pattern which has this node as its object, and brings this statement
to the top of the query. This rearrangement continues so that the statements with the fewest dependencies on other
statements are all brought to the top of the query. The resulting query is as follows:
@@ -423,8 +403,7 @@ CONSTRUCT {

Note that the position of `FILTER` statements does not play a significant role in the optimization.

If a Gravsearch query contains statements in `UNION`, `OPTIONAL`, `MINUS`, or `FILTER NOT EXISTS`, they are reordered
by defining a graph per block. For example, consider the following query with `UNION`:

```sparql
