Implement Standoff Search #630
Basic idea: use the Lucene index to filter out all the text values that do not contain the search term (for optimization). Then select those text values that have an italic standoff tag containing the search term: first get the whole text marked up as italic, and then check that it contains the search term using a FILTER with a regex. Then check that the italic standoff node has some parent of type paragraph, using property path syntax.

Performance: property path syntax is slow in our experience, so I expect queries making use of it to be slow.

Lucene and regex: both have their own syntax. We have to think about what possibilities we would like to offer to the user: Boolean logic, wildcards, etc.
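A rough sketch of such a query. The property and class names used here (`valueHasString`, `valueHasStandoff`, `standoffTagHasStart`/`End`, `standoffTagHasStartParent`, the standoff tag classes) are assumptions for illustration, not necessarily the final Knora ontology:

```sparql
PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
PREFIX standoff: <http://www.knora.org/ontology/standoff#>

SELECT ?textValue WHERE {
  ?textValue knora-base:valueHasString ?string ;
             knora-base:valueHasStandoff ?italicTag .
  ?italicTag a standoff:StandoffItalicTag ;
             knora-base:standoffTagHasStart ?start ;
             knora-base:standoffTagHasEnd ?end .

  # Get the whole text marked up as italic ...
  BIND(SUBSTR(?string, ?start + 1, ?end - ?start) AS ?italicText)

  # ... and check that it contains the search term.
  FILTER regex(?italicText, "searchTerm", "i")

  # Check that the italic tag has some ancestor of type paragraph
  # (property path syntax, expected to be slow).
  ?italicTag knora-base:standoffTagHasStartParent+ ?paragraphTag .
  ?paragraphTag a standoff:StandoffParagraphTag .
}
```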
@benjamingeer suggests: On GraphDB, we could add our own inference rule for standoff tags, so we could use inference instead of property path syntax. Try adding this to KnoraRules.pie, just before the section "Knora-specific consistency checks".
Then in your query, instead of this:
you should be able to write this:
Keep in mind that if you want the immediate parent, you will now have to use http://www.ontotext.com/explicit.
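To illustrate the difference (a sketch with an assumed parent property name; the actual rule added to KnoraRules.pie is not shown above):

```sparql
# With property path syntax (no inference), ancestors are traversed at query time:
?tag knora-base:standoffTagHasStartParent+ ?parent .

# With a custom inference rule that materialises the transitive closure,
# the same triple pattern would match all ancestors without a property path:
?tag knora-base:standoffTagHasStartParent ?parent .

# To get only the immediate (asserted) parent when inference is on,
# restrict the pattern to GraphDB's explicit statements:
GRAPH <http://www.ontotext.com/explicit> {
  ?tag knora-base:standoffTagHasStartParent ?immediateParent .
}
```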
We should have a look at how the XML database eXist-db handles searches involving markup and literal text: […]
To match part of a text value, it looks like we might be able to implement custom functions using the RDF4J SPARQL parser: http://docs.rdf4j.org/custom-sparql-functions/ Then in KnarQL, we could write filters like this:
Otherwise, we could use statements instead of filters:
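The two alternatives contrasted above might look roughly like this (the function and property names are hypothetical, invented here for illustration):

```sparql
# As a filter, using a custom SPARQL function registered via the RDF4J parser:
FILTER(knora-api:matchTextInStandoff(?string, ?standoffTag, "searchTerm"))

# Or as a statement, to be rewritten by the KnarQL-to-SPARQL transformer:
?textValue knora-api:textMatchesInStandoff "searchTerm" .
```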
Maybe we have to provide a custom implementation of the Lucene indexer (for GraphDB: org.apache.lucene.analysis.Analyzer, com.ontotext.trree.plugin.lucene.AnalyzerFactory; see http://graphdb.ontotext.com/documentation/standard/full-text-search.html#creating-an-index).
Or rather a Tokenizer. Another problem is that, depending on the type of markup, the sequence of the plain text may not be the relevant one for tokenizing. There may be parts of the text that are comments, or there may be constructs like deletions: […]
@mattssp But if the plain text contains […], another way would be to separate different variants into different resources. The first resource could represent the diplomatic transcription (with […]). Then you could have different resources for different variants, e.g. one would contain […]. That would make all the variants searchable, without the need to customise the full-text search engine (which we can perhaps do with GraphDB using Lucene, but perhaps not with other triplestores).
@benjamingeer That would introduce a lot of redundancy. Still, I see your point. Perhaps there could be a way to mark up search terms for complex sequences via an additional standoff markup layer. These could be indexed easily.
Some sort of preprocessor, parametrized by the mapping, could create this upon creation of the resource.
If I understand your idea correctly, I think the problem is that, in general, it's not possible to predict which sequences of words people will want to search for.

I agree that it is best to avoid redundancy when possible. But on the other hand, I think that it's often not possible to find a single data representation that will meet all needs. For example, people who do quantitative analyses often need something like a spreadsheet that can be fed into statistical software like R. Here the only solution is to generate such a spreadsheet for the purpose of running the analysis. One of our goals in API v2 is to facilitate such scenarios. Similarly, I doubt that there is a single representation of text with markup that will satisfy everyone. I suspect that in some cases, it will always be necessary to convert text from one form to another before analysing it, e.g. to extract an edited text from a transcription.

Also, we have to consider trade-offs between storage and performance. Eliminating redundancy reduces storage requirements. But storage is cheap, and often it's not easy to get acceptable performance in complex RDF searches. It can be worth using more storage to make searches perform better. And given our limited resources, we have to consider the development effort that would be necessary to produce a more complex implementation. If we store the actual text that we want to search (e.g. the edited text), then we can use Lucene (and other similar products) to search it, without any additional development effort.

Therefore I'm inclined to think that it's worth storing edited text separately from transcriptions.
I am getting back to standoff, finally :-) Consider the following query (in contrast to the one above #630 (comment)):
The query searches for a string that is both marked up as underlined and as a paragraph, but does not say that there is a relation between the underline and the paragraph (e.g., if you think about different standoff layers). The problem here, however, is that the two matches might not be related at all if the string occurs several times in the same text value. I think we would have to check that the start and end indexes are related (they are identical, or one range is contained in the other).
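A sketch of such an index check, again with assumed property names:

```sparql
?underlineTag knora-base:standoffTagHasStart ?uStart ;
              knora-base:standoffTagHasEnd   ?uEnd .
?paragraphTag knora-base:standoffTagHasStart ?pStart ;
              knora-base:standoffTagHasEnd   ?pEnd .

# The ranges are identical, or the underline range is
# contained in the paragraph range.
FILTER(?pStart <= ?uStart && ?uEnd <= ?pEnd)
```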
57a9b85 provides the functionality to restrict a full-text search to a certain standoff class.
We could make a property […]
@tobiasschweizer Could you write: […]
@benjamingeer Yes, I think I could do that. I think the prequery should contain what we already have for the full-text search: https://github.com/dhlab-basel/Knora/blob/adeb458b5f0aa3a6f85a12a749b25e13d21bd2c2/webapi/src/main/twirl/queries/sparql/v2/searchFulltextGraphDB.scala.txt#L73-L95 Parts of this code block would have to be generated automatically, since Gravsearch does not contain it (substring handling).
To filter on a […]: something like the […]
But now I realise that it's actually not correct that a […]
The type checker could make sure that you don’t mix schemas, by checking that there is only one schema used in each statement. |
Perhaps the conversion from complex to internal wouldn't be difficult. We could just forbid the use of […]
So with the current design, the question is what […] should be.
Here the type of […] is unclear. I think the simplest way to handle this would be just to detect that the complex schema is being used in the query (the parser could set a flag for that), and if so, disable the automatic generation of additional statements in […]
After #899, I think what's left for this is:
Provide standoff search possibilities: search for a text that is marked up in a certain way.
An example in SPARQL:
Search for the word "Mesure" that is marked up as italic and happens to be inside a paragraph. The paragraph does not need to be the immediate parent.
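The elided example query might have looked roughly like this (a hedged sketch; the property and class names are assumptions for illustration):

```sparql
SELECT ?textValue WHERE {
  ?textValue knora-base:valueHasString ?string ;
             knora-base:valueHasStandoff ?italicTag .
  ?italicTag a standoff:StandoffItalicTag ;
             knora-base:standoffTagHasStart ?start ;
             knora-base:standoffTagHasEnd ?end .

  # The text marked up as italic must contain "Mesure".
  BIND(SUBSTR(?string, ?start + 1, ?end - ?start) AS ?italicText)
  FILTER regex(?italicText, "Mesure")

  # The italic tag must have some paragraph ancestor,
  # not necessarily its immediate parent.
  ?italicTag knora-base:standoffTagHasStartParent+ ?paragraphTag .
  ?paragraphTag a standoff:StandoffParagraphTag .
}
```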