Inconsistent Text Query: Additional entity variable adds results #55

Closed
niklas88 opened this issue May 8, 2018 · 10 comments

@niklas88
Member

niklas88 commented May 8, 2018

The following SPARQL + Text query retrieves 1 result row on Freebase + ClueWeb:

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?person rdf:label "Rafael Rosell"@en .
  ?t ql:contains-entity ?person .
  ?t ql:contains-word "star" .
}

Adding another entity variable clearly should NOT give more results, yet it does:

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?person rdf:label "Rafael Rosell"@en .
  ?t ql:contains-entity ?person .
  ?t ql:contains-entity ?unbound .
  ?t ql:contains-word "star" .
}

I tested this on the last commit before multithreading as well as on aeaf26a, which was just before the OPTIONAL changes. Both show the same behavior.

@niklas88
Member Author

niklas88 commented May 8, 2018

Another interesting tidbit: exporting (as TSV) all text with "star" and an entity mention also brings up only a single text snippet when grepping for "Rafael Rosell".

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?t ql:contains-entity ?ent .
  ?t ql:contains-word "star" .
}

Here too, adding another entity variable gives more results.

@Buchhold
Member

Buchhold commented May 8, 2018 via email

@niklas88
Member Author

niklas88 commented May 9, 2018

Thank you for your input, it is much appreciated. That this problem is linked to DISTINCT was also my first guess, but as you can try on the new QLever UI, the same problem occurs even without DISTINCT. In that case the query with 1 entity still retrieves 1 result, while the one with 2 entities finds 10 (the 3 from the DISTINCT case plus duplicates).

@niklas88
Member Author

niklas88 commented May 9, 2018

This problem also happens on the scientists collection, e.g. with the following query (with and without DISTINCT):

SELECT DISTINCT TEXT(?t) WHERE {
  ?t ql:contains-word "manhattan" .
  ?t ql:contains-entity ?ent .
  ?t ql:contains-entity ?ent2 .
} ORDER BY DESC(SCORE(?t))

Adding the … ?ent2 line adds more results. Note that I would kind of understand it if it were the other way around and I got fewer results with the second entity, as that could be explained by contexts with only one entity no longer being considered. The actual behavior, however, is that it gives more results, which I really can't explain.

@floriankramer
Member

I haven't worked with the text operations before, so I'm not quite sure I properly understand how they are supposed to work. But based on what I inferred from the code, the behaviour does seem normal to me. Here is how I understand the operation to function:
When using only one variable, a list of documents, the entities occurring in them, and the documents' scores for the given words is created. It is then filtered so that a single result remains per entity, mapping the entity to the highest-scoring document.
When two variables are used, the same list of documents, entities and scores is created. This time, though, an entry is added for every combination of two entities occurring together in a document (including combinations of an entity with itself). Then, for every combination, the document with the highest score for that combination is returned. This can create more results (up to the square of the number of results of the query with only one variable, if every entity that occurs in one of the documents occurs in all of them), and it can also return documents that would not be returned for the query with one variable (as a document may have a low score but a unique combination of two entities).

@niklas88
Member Author

niklas88 commented May 9, 2018

@floriankramer as I understand it, a triple ?t ql:contains-entity ?ent is supposed to match the list of all text snippets in which the entity ?ent is mentioned. If the entity is an unbound variable, the corresponding result should be the list of all entity mentions. Of course, additional constraints normally apply, either on the entity, e.g. ?ent fb:type.object.type fb:book.book, reducing ?t to all mentions of books, or on the text snippet, e.g. ?t ql:contains-word "read", reducing ?t to those text snippets containing the word "read". So it's not just the highest-scoring document.

Another effect of this is that

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
}

should be equivalent to a classic keyword-based text search for "manhattan". Interestingly, running

grep  "[Mm]anhattan" scientists.docsfile.tsv | sort | uniq |  wc -l

gives 714 results while

sort manhattan.tsv| uniq | wc -l

gives 666, so this might already be broken!?

Still, you might be onto something here. It seems that there are some implicit constraints that, from my understanding, shouldn't be there.

For example the following query should only exclude text snippets containing "manhattan" without any entity mention at all:

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
 ?t ql:contains-entity ?ent .
}

Yet the following query

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
 ?t ql:contains-entity ?ent .
 ?t ql:contains-entity ?ent1 .
}

still finds 5 more (tested for distinctness by exporting to TSV and running "sort | uniq | wc -l" on the result)

@Buchhold
Member

Buchhold commented May 9, 2018

I'm still at work, but tomorrow I should find some time to have a look as well. Is there a running instance I can look at?

For now, I guess a good idea may be to select ?t as well. Without the TEXT(...) function, this should yield the record ID of the matching text snippet.

Note that your examples are perfectly valid (there is still a problem with DISTINCT, though) if the same record IDs get repeated. While adding another triple should only lower, or not affect at all, the number of matching text records, more lines in the result are perfectly fine (again, unless DISTINCT works properly, of course), simply because of the cross product that has to be built between matching entities.

If the query really returns non-matching text record IDs, it may also be interesting whether they're +/- to those that match, but I guess I'm starting on far-fetched speculation. So, can I reproduce this without building and deploying my own instance?
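
The cross-product argument above can be illustrated with a small Python sketch. This is not QLever code: the record IDs, the entity names and both helper functions are made up for illustration, and no text limit is modelled. It only shows that a second entity variable multiplies the number of result rows while the set of distinct record IDs stays the same.

```python
from itertools import product

# Hypothetical toy data: record ID -> entities mentioned in that record.
RECORDS = {
    "r1": ["A", "B", "C"],
    "r2": ["A"],
}


def rows_one_variable(records):
    """One row per (record, entity) match, as for one contains-entity triple."""
    return [(rid, e) for rid, ents in records.items() for e in ents]


def rows_two_variables(records):
    """One row per (record, entity, entity) match: a cross product per record."""
    return [
        (rid, e1, e2)
        for rid, ents in records.items()
        for e1, e2 in product(ents, repeat=2)
    ]


one = rows_one_variable(RECORDS)
two = rows_two_variables(RECORDS)

# More rows with two variables ...
print(len(one), len(two))
# ... but exactly the same distinct record IDs.
print({row[0] for row in one} == {row[0] for row in two})
```

On this toy data the one-variable version produces 4 rows and the two-variable version 10, yet both cover the same two records. Selecting ?t, as suggested above, is what would show whether the real results behave this way.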

@niklas88
Member Author

niklas88 commented May 9, 2018

@Buchhold the original query can be tested on http://qlever.informatik.uni-freiburg.de (or, for the old UI, at http://qlever.informatik.uni-freiburg.de/api/c+f/ or internally at vulcano:7001). The later queries work on the scientists collection. If you want, I can keep one running internally as well, but setting that up currently takes little more than a few minutes, as it now works with the libsparsehash-dev package from Ubuntu.

@floriankramer
Member

@niklas88
The text limit, which determines how many documents are returned per entity, is set to 1, so when running the query

SELECT TEXT(?t) WHERE {
 ?t ql:contains-word "manhattan" .
 ?t ql:contains-entity ?ent .
}

and an entity occurs in two documents, only one will be returned for that entity. When another unbound variable is added, these missing documents can be included in the result if they have the highest score among all documents containing a specific combination of entities, while other documents with a higher score each contain one of these entities but not the combination.
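
A minimal Python sketch of this effect, assuming the behaviour is as described here (keep only the best-scoring document per entity, or per entity combination, when the text limit is 1). It is a toy model, not QLever's implementation; the records, scores and function names are hypothetical.

```python
from itertools import product

# Hypothetical toy data: (record ID, score, entities mentioned in the record).
RECORDS = [
    ("r1", 3, {"A", "B"}),
    ("r2", 2, {"A", "C"}),
    ("r3", 1, {"B", "C"}),
]


def best_per_key(rows):
    """Keep only the highest-scoring record for each key (text limit 1)."""
    best = {}
    for key, record_id, score in rows:
        if key not in best or score > best[key][1]:
            best[key] = (record_id, score)
    return best


def one_variable(records):
    """Key is a single entity."""
    rows = [(e, rid, s) for rid, s, ents in records for e in sorted(ents)]
    return best_per_key(rows)


def two_variables(records):
    """Key is an ordered pair of entities that co-occur in a record."""
    rows = [
        ((e1, e2), rid, s)
        for rid, s, ents in records
        for e1, e2 in product(sorted(ents), repeat=2)
    ]
    return best_per_key(rows)


single = {rid for rid, _ in one_variable(RECORDS).values()}
pair = {rid for rid, _ in two_variables(RECORDS).values()}

print(sorted(single))  # records surviving with one entity variable
print(sorted(pair))    # records surviving with two entity variables
```

Here r3 never wins for B or C alone, because r1 and r2 score higher, but it is the only record containing both B and C. It therefore shows up only once the second variable is added, so the second set is strictly larger than the first.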

When reading the docsfile in bash to determine the number of distinct documents containing Manhattan, you forgot to cut away the first column of the TSV. Using grep "[Mm]anhattan" scientists.docsfile.tsv | cut -f2 | sort | uniq | wc -l yields 667 results for me, which is still one more than QLever returns, though.
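
To see why the first column matters, here is a rough Python equivalent of the two pipelines, run on a few made-up lines instead of the real docsfile. The assumed layout (record ID, tab, text) follows from the cut -f2 above; the sample rows themselves are hypothetical.

```python
import re

# Hypothetical docsfile rows in the assumed layout "<record id>\t<text>".
DOCSFILE_LINES = [
    "1\tThe Manhattan Project began in 1942.",
    "2\tThe Manhattan Project began in 1942.",
    "3\tShe moved to manhattan.",
    "4\tNo match here.",
]

PATTERN = re.compile(r"[Mm]anhattan")


def count_distinct(lines, text_only):
    """Mimic grep | [cut -f2] | sort | uniq | wc -l on in-memory lines."""
    matching = [line for line in lines if PATTERN.search(line)]
    if text_only:
        # Equivalent of `cut -f2`: keep only the text column.
        matching = [line.split("\t")[1] for line in matching]
    return len(set(matching))


print(count_distinct(DOCSFILE_LINES, text_only=False))  # IDs keep duplicates apart
print(count_distinct(DOCSFILE_LINES, text_only=True))   # identical texts collapse
```

With the ID column left in, two rows with identical text still count as two distinct lines, which inflates the number compared to a count over distinct texts.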

@niklas88
Member Author

@floriankramer oh wow 🤯 that's what I call a great analysis and explanation, thank you very much. I really didn't know about TEXTLIMIT; I must have somehow skipped that section of the README and then never came into contact with it. Also, good catch with the grep. Testing the original query with TEXTLIMIT 100, I get all 4 matching contexts.

So I guess this is really more of a documentation issue, since it even slipped Björn's and Hannah's minds, who have been working on QLever for far longer. We might also still want to figure out why we're seeing fewer results than the grep; maybe there is in fact an off-by-one error.
