Inconsistent Text Query: Additional entity variable adds results #55

niklas88 · 2018-05-08T17:01:40Z

The following SPARQL + Text query retrieves 1 result row on Freebase + ClueWeb

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?person rdf:label "Rafael Rosell"@en .
  ?t ql:contains-entity ?person .
  ?t ql:contains-word "star" .
}

Adding another entity variable clearly should NOT give more results, yet it does

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?person rdf:label "Rafael Rosell"@en .
  ?t ql:contains-entity ?person .
  ?t ql:contains-entity ?unbound .
  ?t ql:contains-word "star" .
}

I tested this on the last commit before multithreading as well as on aeaf26a, which was just before the OPTIONAL changes. Both show the same behavior.


niklas88 · 2018-05-08T17:25:11Z

Another interesting tidbit: exporting (as TSV) all text snippets with "star" and an entity mention also brings up only a single text snippet when grepping for "Rafael Rosell".

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?t ql:contains-entity ?ent .
  ?t ql:contains-word "star" .
}

Here, too, adding another entity variable gives more results.

Buchhold · 2018-05-08T21:11:00Z

Hey,

I'd assume it's a problem with the DISTINCT. Say there is one matching doc t0 where RR and EntityX occur. Then the first query has the result:

?t=t0 ?person=RR

However, the second query has the results:

?t=t0 ?person=RR ?unbound=EntityX
?t=t0 ?person=RR ?unbound=RR

FILTER(?unbound != ?person) prevents this, but if there is also EntityY in that context, the effect intensifies accordingly. The DISTINCT """should""" take care of that, but DISTINCT is not implemented properly at all in the current version -- sorry.

EDIT: removed some other text I included in my reply email that was not supposed to be public on GitHub.

Best regards
Björn
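The row duplication described in this reply can be reproduced with a small toy model. This is only a sketch of the cross-product argument, not QLever code: the `mentions` mapping and both helper functions are hypothetical and exist purely for illustration.

```python
# Hypothetical toy context: text record t0 mentions two entities.
mentions = {"t0": ["RR", "EntityX"]}

def rows_one_var(mentions, person):
    """Rows for a single triple: ?t ql:contains-entity ?person (person fixed)."""
    return [(t, person) for t, ents in mentions.items() if person in ents]

def rows_two_vars(mentions, person):
    """Rows after adding a second triple: ?t ql:contains-entity ?unbound."""
    return [
        (t, person, other)
        for t, ents in mentions.items()
        if person in ents
        for other in ents  # every entity in the record, including person itself
    ]

one = rows_one_var(mentions, "RR")
two = rows_two_vars(mentions, "RR")

# Projecting onto ?t and deduplicating is what a working DISTINCT would do.
distinct_texts = {t for t, _, _ in two}

print(one)             # one row for t0
print(two)             # two rows for t0, one per value of ?unbound
print(distinct_texts)  # still a single text record
```

In this model the extra variable only multiplies rows for the same text record; it never introduces a new record, so it cannot by itself explain additional distinct snippets.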

niklas88 · 2018-05-09T07:36:46Z

Thank you for your input, it is much appreciated. A link to DISTINCT was also my first guess, but as you can try in the new QLever UI, the same problem occurs even without DISTINCT. In that case the query with 1 entity still retrieves 1 result, while the one with 2 entities finds 10 (the 3 from the DISTINCT case plus duplicates).

niklas88 · 2018-05-09T08:21:52Z

This problem also happens on the scientists collection, e.g. with the following query (with and without DISTINCT):

SELECT DISTINCT TEXT(?t) WHERE {
  ?t ql:contains-word "manhattan" .
  ?t ql:contains-entity ?ent .
  ?t ql:contains-entity ?ent2 .
} ORDER BY DESC(SCORE(?t))

Adding the … ?ent2 line adds more results. Note that I would kind of understand it if it were the other way around and I got fewer results with the second entity, as that would be explained by not considering contexts with only one entity. The actual behavior, however, is that it gives more results, which I really can't explain.

floriankramer · 2018-05-09T10:51:59Z

I haven't worked with the text operations before, so I'm not quite sure if I properly understand how they are supposed to work. But based on what I inferred from the code, the behaviour does seem normal to me. Here is how I understand the operation to function:
When using only one variable, a list of documents, the entities occurring in them and the documents' scores for the given words is created. It is then filtered so that a single result remains per entity, mapping the entity to the highest-scoring document.
When two variables are used, the same list of documents, entities and scores is created. This time, however, an entry is added for every combination of two entities occurring together in a document (including combinations of an entity with itself). Then, for every combination created, the document with the highest score for that combination is returned. This can create more results (up to the square of the number of results of a query with only one variable, if every entity that occurs in one of the documents occurs in all of them). It can also return documents that would not be returned for the query with one variable (as a document may have a low score but a unique combination of two entities).
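The behaviour described above can be sketched as a toy model. This follows only the description in this comment, not QLever's actual implementation; the `docs` index, the scores, and both function names are made up for illustration.

```python
from itertools import product

# Hypothetical toy index: document -> (score for the query words, entities mentioned).
docs = {
    "d1": (3, {"A"}),
    "d2": (2, {"B"}),
    "d3": (1, {"A", "B"}),
}

def best_doc_per_entity(docs):
    """One variable: keep only the highest-scoring document per entity."""
    best = {}
    for doc, (score, entities) in docs.items():
        for entity in entities:
            if entity not in best or score > best[entity][0]:
                best[entity] = (score, doc)
    return {entity: doc for entity, (_, doc) in best.items()}

def best_doc_per_pair(docs):
    """Two variables: keep the highest-scoring document per entity pair."""
    best = {}
    for doc, (score, entities) in docs.items():
        # All ordered pairs, including an entity paired with itself.
        for pair in product(entities, repeat=2):
            if pair not in best or score > best[pair][0]:
                best[pair] = (score, doc)
    return {pair: doc for pair, (_, doc) in best.items()}

one_var = best_doc_per_entity(docs)
two_vars = best_doc_per_pair(docs)

print(sorted(set(one_var.values())))   # documents seen with one variable
print(sorted(set(two_vars.values())))  # documents seen with two variables
```

With this data the one-variable query never shows d3, because d1 and d2 score higher for A and B individually. The two-variable query does show it, because d3 is the only document containing the pair (A, B).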

niklas88 · 2018-05-09T13:28:56Z

@floriankramer As I understand it, a triple ?t ql:contains-entity ?ent is supposed to match a list of all text snippets in which the entity ?ent is mentioned. If the entity is an unbound variable, the corresponding result should be the list of all entity mentions. Of course, additional constraints normally apply, either on the entity, e.g. ?ent fb:type.object.type fb:book.book reducing ?t to all mentions of books, or on the text snippet, e.g. ?t ql:contains-word "read" reducing it to those text snippets containing the word "read". So it's not just the highest-scoring document.

Another effect of this is that

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
}

should be equivalent to a classic keyword-based text search for "manhattan". Interestingly, doing a

grep  "[Mm]anhattan" scientists.docsfile.tsv | sort | uniq |  wc -l

gives 714 results while

sort manhattan.tsv| uniq | wc -l

gives 666, so this might already be broken!?

Still, you might be onto something here. It seems that there are some implicit constraints that, from my understanding, shouldn't be there.

For example the following query should only exclude text snippets containing "manhattan" without any entity mention at all:

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
 ?t ql:contains-entity ?ent .
}

Yet the following query

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
 ?t ql:contains-entity ?ent .
 ?t ql:contains-entity ?ent1 .
}

still finds 5 more (tested for distinctness by exporting to TSV and running "sort | uniq | wc -l" on the result)

Buchhold · 2018-05-09T13:42:49Z

I'm still at work, but tomorrow I should find some time to have a look as well. Is there a running instance that I can look at?

For now, I guess a good idea may be to select ?t as well. Without the TEXT(...) function, this should yield the record ID of the matching text snippet.

Note that your examples are perfectly valid (still a problem with DISTINCT, though) if the same record IDs get repeated. While adding another triple should only lower, or not affect at all, the number of matching text records, more lines in the result are perfectly fine (again, of course, unless DISTINCT works properly) -- simply because of the cross product that has to be built between matching entities.

If the query really returns non-matching text record IDs, it may also be interesting if they're +/- to those that match, but I'm starting with far-fetched speculation, I guess. So, can I reproduce this without building and deploying my own instance?

niklas88 · 2018-05-09T14:05:09Z

@Buchhold The original query can be tested on http://qlever.informatik.uni-freiburg.de (or for the old UI at http://qlever.informatik.uni-freiburg.de/api/c+f/ or internally at vulcano:7001). The later queries work on the scientists collection. If you want, I can keep one running internally as well, but setting that up currently takes little more than a few minutes, as it now works with the libsparsehash-dev package from Ubuntu.

floriankramer · 2018-05-09T14:42:27Z

@niklas88
The text limit determining how many documents are returned per entity is set to 1, so when running the query

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
 ?t ql:contains-entity ?ent .
}

and an entity occurs in two documents, only one of them will be returned for that entity. When adding another unbound variable, these missing documents can be included in the result if they have the highest score out of all documents containing a specific combination of entities, and there are other documents with a higher score that contain each one of these entities but not the combination.

When reading the docsfile in bash to determine the number of distinct documents containing Manhattan, you forgot to cut away the first column of the TSV. Using grep "[Mm]anhattan" scientists.docsfile.tsv | cut -f2 | sort | uniq | wc -l gives 667 results for me, which is still one more than QLever returns, though.
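Why cutting away the first column changes the count can be shown with a tiny made-up file. This sketch assumes the first column holds a per-line identifier and the second the text; the thread only states that the first column has to be removed, so the sample lines below are hypothetical.

```python
import re

# Hypothetical docsfile lines in the form "<first column>\t<text>".
lines = [
    "1\tManhattan Project",
    "2\tManhattan Project",   # same text, different first column
    "3\tborn in manhattan",
    "4\tunrelated text",
]

matching = [line for line in lines if re.search(r"[Mm]anhattan", line)]

# Like: grep ... | sort | uniq | wc -l  (whole lines are compared)
distinct_lines = len(set(matching))

# Like: grep ... | cut -f2 | sort | uniq | wc -l  (only the text is compared)
distinct_texts = len({line.split("\t", 1)[1] for line in matching})

print(distinct_lines, distinct_texts)
```

Lines that differ only in the first column count as distinct when whole lines are compared, which inflates the number relative to the count of distinct texts.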

niklas88 · 2018-05-11T07:47:27Z

@floriankramer oh wow 🤯 that's what I call a great analysis and explanation, thank you very much. I really didn't know about TEXTLIMIT; I must have somehow skipped that section of the README and then never come into contact with it. Also good catch with the grep. Testing the original query with TEXTLIMIT 100, I get all 4 matching contexts.

So I guess this is really more of a documentation issue, since it even slipped the minds of Björn and Hannah, who have been working on QLever for far longer. Also, we might still want to figure out why we're seeing fewer results than the grep; maybe there is in fact an off-by-one error.
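The effect of the text limit can be illustrated with a toy model. It assumes, as described earlier in the thread, that the limit caps how many text records are kept per entity, ordered by score. The `hits` data and the `texts_for_entities` function are hypothetical and do not reflect QLever internals.

```python
from collections import defaultdict

# Hypothetical hits: (score, entity, text record).
hits = [
    (4, "RR", "t0"),
    (3, "RR", "t1"),
    (2, "RR", "t2"),
    (1, "RR", "t3"),
]

def texts_for_entities(hits, text_limit):
    """Keep at most text_limit highest-scoring text records per entity."""
    per_entity = defaultdict(list)
    for score, entity, text in hits:
        per_entity[entity].append((score, text))
    result = []
    for entity, scored in per_entity.items():
        scored.sort(reverse=True)  # highest score first
        result.extend((entity, text) for _, text in scored[:text_limit])
    return result

print(texts_for_entities(hits, text_limit=1))    # only the best record
print(texts_for_entities(hits, text_limit=100))  # all four records
```

With a limit of 1 only the best-scoring record survives per entity; a sufficiently large limit returns every matching record, which matches the jump from 1 to 4 contexts reported here.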

niklas88 closed this as completed May 23, 2018
