Inconsistent Text Query: Additional entity variable adds results #55
So as another interesting tidbit. Exporting (as TSV) all text with "star" and an entity mention also only brings up a single text snippet when grepping for "Rafael Rosell".
Also, adding another entity gives more results.
Hey, I'd assume it's a problem with the distinct. Say there is one
matching doc t0 where RR and EntityX occur.
Then the first query has the result:
?t=t0 ?person=RR
However, the second query has results:
?t=t0 ?person=RR ?unbound=EntityX
?t=t0 ?person=RR ?unbound=RR
FILTER(?unbound != ?person) prevents this, but if there is also EntityY
in that context, the effect intensifies accordingly. The DISTINCT
"""should""" take care of that, but DISTINCT is not implemented properly
at all in the current version -- sorry.
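As a rough illustration of the effect described above, here is a minimal Python sketch. It is a toy model, not QLever code: the `entities_in_text` dictionary and the helper names are hypothetical, and the single record t0 with RR and EntityX is taken from the example above. It shows that the extra variable multiplies result rows, while the set of distinct text records stays the same.

```python
# Hypothetical toy data: one text record t0 that mentions two entities.
entities_in_text = {"t0": ["RR", "EntityX"]}


def one_entity_variable(person):
    """Rows (?t, ?person) for every text record that mentions `person`."""
    return [(t, person) for t, ents in entities_in_text.items() if person in ents]


def two_entity_variables(person):
    """Rows (?t, ?person, ?unbound): one row per entity co-occurring in ?t."""
    return [
        (t, person, other)
        for t, ents in entities_in_text.items()
        if person in ents
        for other in ents
    ]


def distinct_texts(rows):
    """What a working DISTINCT on ?t alone would leave over."""
    return sorted({row[0] for row in rows})


rows_one = one_entity_variable("RR")
rows_two = two_entity_variables("RR")
print(rows_one)  # one row for t0
print(rows_two)  # two rows for t0, one per binding of ?unbound
print(distinct_texts(rows_one) == distinct_texts(rows_two))
```

With a second co-occurring entity such as EntityY the second query would produce three rows for t0, which is the "intensifying" effect mentioned above.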
EDIT: removed some other text I included in my reply email that was not supposed to be public on github
Best regards
Björn
On 08.05.2018 at 19:01, Niklas Schnelle wrote:
…
The following SPARQL + Text query retrieves 1 result row on Freebase + ClueWeb:

    PREFIX fb: <http://rdf.freebase.com/ns/>
    PREFIX fbk: <http://rdf.freebase.com/key/>
    PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT TEXT(?t) WHERE {
      ?person rdf:label "Rafael ***@***.*** .
      ?t ql:contains-entity ?person .
      ?t ql:contains-word "star" .
    }

Adding another entity variable clearly should NOT give more results, yet it does:

    PREFIX fb: <http://rdf.freebase.com/ns/>
    PREFIX fbk: <http://rdf.freebase.com/key/>
    PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT TEXT(?t) WHERE {
      ?person rdf:label "Rafael ***@***.*** .
      ?t ql:contains-entity ?person .
      ?t ql:contains-entity ?unbound .
      ?t ql:contains-word "star" .
    }

I tested this on the current commit as well as on aeaf26a, which was just before the OPTIONAL changes; both show the same behavior.
Thank you for your input; it is quite appreciated. This problem being linked to |
So this problem also happens on the scientists collection, e.g. with the following query (with and without
Adding the |
I haven't worked with the text operations before, so I'm not quite sure if I properly understand how they are supposed to work. But based upon what I inferred from the code, the behaviour does seem normal to me. Here is how I understand the operation to function: |
@floriankramer so as I understand it a triple Another effect of this is that
should be equivalent to a classic keyword-based text search for "manhattan". Interestingly doing a
gives 714 results while
gives 666, so this might already be broken!? Still, you might be onto something here. It seems that there are some implicit constraints that, from my understanding, shouldn't be there. For example, the following query should only exclude text snippets containing "manhattan" without any entity mention at all:
Yet the following query
still finds 5 more (tested for distinctness by exporting to TSV and running "sort | uniq | wc -l" on the result).
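For reference, the distinctness check can also be sketched in Python. This is only an approximation of "sort | uniq | wc -l": the function name and the `skip_columns` parameter are hypothetical, and unlike the shell pipeline it ignores blank lines. Dropping leading columns (for example a record ID) before comparing is optional.

```python
import io


def count_distinct_rows(lines, skip_columns=0):
    """Count distinct TSV rows, optionally ignoring the first `skip_columns` columns."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines are skipped (the shell pipeline would count them)
        fields = line.split("\t")
        seen.add("\t".join(fields[skip_columns:]))
    return len(seen)


# Made-up sample export: an ID column followed by a text column.
sample = io.StringIO("1\tfoo\n2\tfoo\n3\tbar\n")
print(count_distinct_rows(sample))

sample.seek(0)
print(count_distinct_rows(sample, skip_columns=1))
```

For a real export, pass an open file handle instead of the `io.StringIO` sample.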
I'm still at work, but tomorrow I should find some time to also have a look. Is there a running instance that I can have a look at? For now, I guess a good idea may be to select Note that your examples are perfectly valid (still a problem with DISTINCT, though) if the same record IDs get repeated. While adding another triple should only lower or not affect at all the number of matching text records, more lines in the result are perfectly fine (again, of course, unless DISTINCT works properly), simply because of the cross product that has to be built between matching entities. If the query really returns non-matching text record IDs, it may also be interesting to see whether they're +/- to those that match, but I'm starting with far-fetched speculation, I guess. So, can I reproduce without building and deploying my own instance?
@Buchhold the original query can be tested on http://qlever.informatik.uni-freiburg.de (or for the old UI at http://qlever.informatik.uni-freiburg.de/api/c+f/ or internally at vulcano:7001). The later queries work on the scientists collection. If you want, I can keep one running internally as well, but setting that up currently takes little more than a few minutes as it now works with the |
@niklas88
and an entity occurs in two documents, only one will be returned for that entity. When adding another unbound variable, these missing documents could be included in the result if they have the highest score out of all documents including a specific combination of entities and there are other documents with a higher score that contain each one of these entities but not the combination. When reading the docsfile in bash to determine the number of distinct documents containing Manhattan, you forgot to cut away the first column of the TSV. Using |
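The scoring effect described in this comment can be sketched with a small Python toy model. It is not QLever's implementation: the limit of one text record per entity combination is an assumption read off the description above, and the `docs` data, scores, and the `top_docs` helper are all hypothetical.

```python
from itertools import product


def top_docs(docs, person, extra_vars):
    """Keep only the best-scoring document per combination of extra entity bindings.

    docs maps a document ID to (score, set of entities); `person` must occur
    in a document for it to match; `extra_vars` is the number of additional
    unbound entity variables in the query.
    """
    best = {}
    for doc_id, (score, entities) in docs.items():
        if person not in entities:
            continue
        # Every way of binding the extra variables to entities of this document.
        for combo in product(sorted(entities), repeat=extra_vars):
            if combo not in best or score > best[combo][0]:
                best[combo] = (score, doc_id)
    return sorted({doc_id for _, doc_id in best.values()})


# Made-up data: d1 scores higher but mentions only RR; d2 mentions RR and EntityX.
docs = {
    "d1": (3, {"RR"}),
    "d2": (1, {"RR", "EntityX"}),
}

print(top_docs(docs, "RR", extra_vars=0))  # only the top document for RR
print(top_docs(docs, "RR", extra_vars=1))  # d2 now wins for the (EntityX,) binding
```

In this model, d2 is hidden in the first query because d1 outscores it for RR, and it resurfaces in the second query because it is the best (and only) document for the combination with EntityX.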
@floriankramer oh wow 🤯 that's what I call a great analysis and explanation, thank you very much. I really didn't know about So I guess this is really more of a documentation issue, since it even slipped Björn's and Hannah's minds, who've been working on QLever for far longer. Also, we might still want to figure out why we're seeing fewer results than the grep; maybe there is in fact an off-by-one error.
The following SPARQL + Text query retrieves 1 result row on Freebase + ClueWeb
Adding another entity variable clearly should NOT give more results, yet it does
I tested this on the last commit before multithreading as well as on aeaf26a, which was just before the OPTIONAL changes; both show the same behavior.