Scores from words file are not used for ql_textscore computation #1133

Open
aindlq opened this issue Nov 2, 2023 · 2 comments

Comments

aindlq commented Nov 2, 2023

It looks like, when doing a text search with ql:contains-entity and ql:contains-word, the ql_textscore_* variable simply holds the number of matching documents per entity and does not take the score column from the words file into account.

From the documentation:

The SCORE(?text) returns the number of matching records (sums of the score in the wordsfile, see above).

To me this looks like a bug, because ordering by the "real" score is extremely useful. What is the expected behavior?
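To make the difference concrete, here is a minimal Python sketch of the two scoring variants described above: counting matching records per entity versus summing the word's score over those records. It is not QLever's implementation. The posting layout (word, entity flag, record ID, score), the sample data, and the helper `score_entities` are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical wordsfile postings: (word, is_entity, record_id, score).
# The column layout and all values are assumptions for illustration only.
POSTINGS = [
    ("madonna", 0, 1, 1),
    ("<Painting_A>", 1, 1, 1),
    ("madonna", 0, 2, 1),
    ("<Painting_A>", 1, 2, 1),
    ("madonna", 0, 3, 5),
    ("<Painting_B>", 1, 3, 1),
]


def score_entities(postings, word):
    """Return {entity: (matching_record_count, summed_word_score)}."""
    # Score of the search word in each record that contains it.
    word_score = {
        rec: score
        for w, is_entity, rec, score in postings
        if not is_entity and w == word
    }
    counts = defaultdict(int)
    sums = defaultdict(int)
    for w, is_entity, rec, _ in postings:
        # Only entity postings in records that matched the word contribute.
        if is_entity and rec in word_score:
            counts[w] += 1
            sums[w] += word_score[rec]
    return {entity: (counts[entity], sums[entity]) for entity in counts}


result = score_entities(POSTINGS, "madonna")
by_count = sorted(result, key=lambda e: result[e][0], reverse=True)
by_sum = sorted(result, key=lambda e: result[e][1], reverse=True)
```

With this sample data the two orderings disagree: `<Painting_A>` ranks first by record count, while `<Painting_B>` ranks first by summed score, which is the ordering the issue asks for.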

joka921 (Member) commented Nov 23, 2023

@NickG-1 is currently working on a thorough refactoring of the text index, which also exports the real score.
However, we will (at least temporarily) drop the TEXTLIMIT feature: it doesn't quite fit the SPARQL standard, and we don't find ourselves using it very often.
Would that be an issue for you?

aindlq (Author) commented Nov 24, 2023

@joka921 thank you for the update! That is good to know.

I haven't used it so far, precisely because it is a non-standard SPARQL extension and all our tooling expects standard SPARQL at various levels of the system. So having it specified with a magic predicate is much preferable to the non-standard TEXTLIMIT.

In my view, something like a text limit will certainly be necessary at some point, because otherwise one can run into trouble with search queries that return too many documents.

For example, in our dataset about works of art, querying for an "anonymous" author or for "Madonna" artworks will produce too many matched documents. But it is definitely not a showstopper.
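As a rough illustration of what such a limit could do, the Python sketch below keeps only the highest-scoring records per entity. This is an assumption about the general idea, not a description of how TEXTLIMIT behaves in QLever. The helper `limit_records_per_entity`, the tuple layout, and the sample data are hypothetical.

```python
import heapq
from collections import defaultdict


def limit_records_per_entity(matches, limit):
    """Keep at most `limit` highest-scoring record IDs for each entity.

    `matches` is an iterable of (entity, record_id, score) tuples.
    """
    by_entity = defaultdict(list)
    for entity, record_id, score in matches:
        # Store the score first so tuples compare by score.
        by_entity[entity].append((score, record_id))
    return {
        entity: [record_id for _, record_id in heapq.nlargest(limit, records)]
        for entity, records in by_entity.items()
    }


# Hypothetical matches for a very common search term.
matches = [
    ("<Anonymous>", 10, 2),
    ("<Anonymous>", 11, 7),
    ("<Anonymous>", 12, 4),
    ("<Raphael>", 20, 1),
]
limited = limit_records_per_entity(matches, 2)
```

Here `<Anonymous>` is cut down from three matched records to its two best-scoring ones, while `<Raphael>` keeps its single record.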

One more thing: when working with bigger documents, I think it is more convenient to get back just the ID of the matched document rather than the whole document text.
