Implement Standoff Search #630
Basic idea: use the Lucene index to filter out all the text values that do not contain the search term (for optimization). Then select those text values that have an italic standoff tag containing the search term: first get the whole text marked up as italic, and then check that it contains the search term using a FILTER with a regex. Then check that the italic standoff node has some parent of type paragraph, using property path syntax.

Performance: property path syntax is slow in our experience, so I expect queries making use of it to be slow.

Lucene and regex: both have their own syntax. We have to think about what possibilities we would like to offer to the user: Boolean logic, wildcards, etc.
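A rough sketch of such a query. The property and class names used here (`valueHasString`, `valueHasStandoff`, `standoffTagHasStart`/`End`, `standoffTagHasStartParent`, the standoff tag classes) are assumptions for illustration, not necessarily the final Knora ontology:

```sparql
PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
PREFIX standoff: <http://www.knora.org/ontology/standoff#>

SELECT ?textValue WHERE {
  ?textValue knora-base:valueHasString ?string ;
             knora-base:valueHasStandoff ?italicTag .
  ?italicTag a standoff:StandoffItalicTag ;
             knora-base:standoffTagHasStart ?start ;
             knora-base:standoffTagHasEnd ?end .

  # Get the whole text marked up as italic ...
  BIND(SUBSTR(?string, ?start + 1, ?end - ?start) AS ?italicText)

  # ... and check that it contains the search term.
  FILTER regex(?italicText, "searchTerm", "i")

  # Check that the italic tag has some ancestor of type paragraph
  # (property path syntax, expected to be slow).
  ?italicTag knora-base:standoffTagHasStartParent+ ?paragraphTag .
  ?paragraphTag a standoff:StandoffParagraphTag .
}
```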
@benjamingeer suggests: On GraphDB, we could add our own inference rule for standoff tags, so we could use inference instead of property path syntax. Try adding this to KnoraRules.pie, just before the section "Knora-specific consistency checks".
Then in your query, instead of this:
you should be able to write this:
Keep in mind that if you want the immediate parent, you will now have to use http://www.ontotext.com/explicit.
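To illustrate the difference (a sketch with an assumed parent property name; the actual rule added to KnoraRules.pie is not shown above):

```sparql
# With property path syntax (no inference), ancestors are traversed at query time:
?tag knora-base:standoffTagHasStartParent+ ?parent .

# With a custom inference rule that materialises the transitive closure,
# the same triple pattern would match all ancestors without a property path:
?tag knora-base:standoffTagHasStartParent ?parent .

# To get only the immediate (asserted) parent when inference is on,
# restrict the pattern to GraphDB's explicit statements:
GRAPH <http://www.ontotext.com/explicit> {
  ?tag knora-base:standoffTagHasStartParent ?immediateParent .
}
```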
We should have a look at how the XML database eXist-db handles searches involving markup and literal text: […]
To match part of a text value, it looks like we might be able to implement custom functions using the RDF4J SPARQL parser: http://docs.rdf4j.org/custom-sparql-functions/ Then in KnarQL, we could write filters like this:
Otherwise, we could use statements instead of filters:
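The two alternatives contrasted above might look roughly like this (the function and property names are hypothetical, invented here for illustration):

```sparql
# As a filter, using a custom SPARQL function registered via the RDF4J parser:
FILTER(knora-api:matchTextInStandoff(?string, ?standoffTag, "searchTerm"))

# Or as a statement, to be rewritten by the KnarQL-to-SPARQL transformer:
?textValue knora-api:textMatchesInStandoff "searchTerm" .
```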
Maybe we have to provide a custom implementation of the Lucene indexer (for GraphDB: org.apache.lucene.analysis.Analyzer, com.ontotext.trree.plugin.lucene.AnalyzerFactory; see http://graphdb.ontotext.com/documentation/standard/full-text-search.html#creating-an-index).
Or rather a Tokenizer. Another problem is that, depending on the type of markup, the sequence of the plain text may not be the relevant one for tokenizing. There may be parts of the text that are comments, or there may be constructs like deletions: […]
@mattssp But if the plain text contains […], another way would be to separate different variants into different resources. The first resource could represent the diplomatic transcription (with […]). Then you could have different resources for different variants, e.g. one would contain […]. That would make all the variants searchable, without the need to customise the full-text search engine (which we can perhaps do with GraphDB using Lucene, but perhaps not with other triplestores).
@benjamingeer That would introduce a lot of redundancy. Still, I see your point. Perhaps there could be a way to mark up search terms for complex sequences via an additional standoff markup layer. These could be indexed easily.
Some sort of preprocessor, parametrized by the mapping, could create this upon creation of the resource.
If I understand your idea correctly, I think the problem is that, in general, it's not possible to predict which sequences of words people will want to search for.

I agree that it is best to avoid redundancy when possible. But on the other hand, I think that it's often not possible to find a single data representation that will meet all needs. For example, people who do quantitative analyses often need something like a spreadsheet that can be fed into statistical software like R. Here the only solution is to generate such a spreadsheet for the purpose of running the analysis. One of our goals in API v2 is to facilitate such scenarios. Similarly, I doubt that there is a single representation of text with markup that will satisfy everyone. I suspect that in some cases, it will always be necessary to convert text from one form to another before analysing it, e.g. to extract an edited text from a transcription.

Also, we have to consider trade-offs between storage and performance. Eliminating redundancy reduces storage requirements. But storage is cheap, and often it's not easy to get acceptable performance in complex RDF searches. It can be worth using more storage to make searches perform better. And given our limited resources, we have to consider the development effort that would be necessary to produce a more complex implementation. If we store the actual text that we want to search (e.g. the edited text), then we can use Lucene (and other similar products) to search it, without any additional development effort.

Therefore I'm inclined to think that it's worth storing edited text separately from transcriptions.
I am getting back to standoff, finally :-) Consider the following query (in contrast to the one above #630 (comment)):
The query searches for a string that is both marked up as underlined and as a paragraph, but does not say that there is a relation between the underline and the paragraph (e.g., if you think about different standoff layers). The problem here, however, is that the two matches might not be related at all if the string occurs several times in the same text value. I think we would have to check that the start and end indexes are related (they are identical, or one range is contained in the other).
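A sketch of such an index check, again with assumed property names:

```sparql
?underlineTag knora-base:standoffTagHasStart ?uStart ;
              knora-base:standoffTagHasEnd   ?uEnd .
?paragraphTag knora-base:standoffTagHasStart ?pStart ;
              knora-base:standoffTagHasEnd   ?pEnd .

# The ranges are identical, or the underline range is
# contained in the paragraph range.
FILTER(?pStart <= ?uStart && ?uEnd <= ?pEnd)
```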
57a9b85 provides the functionality to restrict a full-text search to a certain standoff class.
We could make a property […]
@tobiasschweizer Could you write: […]
@benjamingeer Yes, I think I could do that. I think the prequery should contain what we already have for the full-text search: https://github.com/dhlab-basel/Knora/blob/adeb458b5f0aa3a6f85a12a749b25e13d21bd2c2/webapi/src/main/twirl/queries/sparql/v2/searchFulltextGraphDB.scala.txt#L73-L95 Parts of this code block would have to be generated automatically, since Gravsearch does not contain it (substring handling).
To filter on a […]: something like the […]
But now I realise that it's actually not correct that a […]
The type checker could make sure that you don’t mix schemas, by checking that there is only one schema used in each statement. |
Perhaps the conversion from complex to internal wouldn't be difficult. We could just forbid the use of […]
So with the current design, the question is what […] should be.
Here the type of […] is unclear. I think the simplest way to handle this would be just to detect that the complex schema is being used in the query (the parser could set a flag for that), and if so, disable the automatic generation of additional statements in […]
After #899, I think what's left for this is:
Provide standoff search possibilities: search for a text that is marked up in a certain way.
An example in SPARQL:
Search for the word "Mesure" that is marked up as italic and happens to be inside a paragraph. The paragraph does not need to be the immediate parent.
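The elided example query might have looked roughly like this (a hedged sketch; the property and class names are assumptions for illustration):

```sparql
SELECT ?textValue WHERE {
  ?textValue knora-base:valueHasString ?string ;
             knora-base:valueHasStandoff ?italicTag .
  ?italicTag a standoff:StandoffItalicTag ;
             knora-base:standoffTagHasStart ?start ;
             knora-base:standoffTagHasEnd ?end .

  # The text marked up as italic must contain "Mesure".
  BIND(SUBSTR(?string, ?start + 1, ?end - ?start) AS ?italicText)
  FILTER regex(?italicText, "Mesure")

  # The italic tag must have some paragraph ancestor,
  # not necessarily its immediate parent.
  ?italicTag knora-base:standoffTagHasStartParent+ ?paragraphTag .
  ?paragraphTag a standoff:StandoffParagraphTag .
}
```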