Look into QLever as a potential query engine for Scholia #1774

Open
Daniel-Mietchen opened this issue Jan 27, 2022 · 14 comments
Labels
documentation hopefully helpful explanations of how things work dumps things related to data dumps enhancement some suggestions to improve Scholia events events relevant to Scholia SPARQL the way Scholia queries Wikidata

Comments

@Daniel-Mietchen
Member

Is your feature request related to a problem? Please describe.
Scholia currently queries the Wikidata Query Service, which relies on Blazegraph, which is suspected to fail within the next few years, as per

Describe the solution you'd like
One of the options to address this is to use another query engine, e.g. QLever.

Describe alternatives you've considered

@Daniel-Mietchen Daniel-Mietchen added enhancement some suggestions to improve Scholia SPARQL the way Scholia queries Wikidata dumps things related to data dumps labels Jan 27, 2022
@Daniel-Mietchen
Member Author

Over in

we had a brief discussion about QLever, including an example query https://qlever.cs.uni-freiburg.de/wikidata/J8PSek for which I am pasting a screenshot below:
[Screenshot taken 2022-01-27 at 23:10:35: QLever web interface, page title "The QLever SPARQL engine: fast, scalable, with autocompletion and text search"]

@Daniel-Mietchen Daniel-Mietchen added the documentation hopefully helpful explanations of how things work label Jan 27, 2022
@WolfgangFahl
Collaborator

WolfgangFahl commented Jan 28, 2022

The corresponding query on the Wikidata Query Service takes 2.7 s to run as of 2022-01-28.

With a pdf filter it is slightly slower.

@WolfgangFahl
Collaborator

The more elaborate query, which should also show the list of authors, times out on the Wikidata Query Service and fails on QLever.

@WolfgangFahl
Collaborator

# 
# Example Query for 
# https://github.com/WDscholia/scholia/issues/1774
#
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Scholarly articles with full text
SELECT ?paper ?paperLabel ?publishedIn ?publishedInLabel ?event ?eventLabel ?fullText ?authors
WHERE 
{
  ?paper wdt:P31 wd:Q13442814.
  ?paper rdfs:label ?paperLabel. 
  filter(lang(?paperLabel) = "en").
  ?paper wdt:P953 ?fullText. 
  ?paper wdt:P953 ?fullText filter (strends(str(?fullText), ".pdf" )). 
  #filter(regex(?fullText, "\\.pdf\\>$" )). 
  ?paper wdt:P1433 ?publishedIn.
  ?publishedIn rdfs:label ?publishedInLabel. 
  filter(lang(?publishedInLabel) = "en" ).
  ?publishedIn wdt:P4745 ?event. 
  ?event rdfs:label ?eventLabel. 
  filter(lang(?eventLabel) = "en"). 
  OPTIONAL
  {    
     SELECT (GROUP_CONCAT(?authorLabel) as ?authors) WHERE {
       ?paper wdt:P50 ?author.
       ?author rdfs:label ?authorLabel filter(lang(?authorLabel) = 'en').
     } GROUP BY ?paper
  }
}

@WolfgangFahl
Collaborator

The authors query:

#
# test Query for https://github.com/WDscholia/scholia/issues/1774
# WF 2022-01-28
#
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Authors
SELECT ?work ?workLabel ?author ?authorLabel
WHERE 
{
  ?work wdt:P50 ?author. 
  ?work rdfs:label ?workLabel .
  ?author rdfs:label ?authorLabel. 
}
LIMIT 10

This runs on QLever in about 30 s, with more than 300 million results (limited to 10), while the Wikidata Query Service takes only 0.2 s.

@WolfgangFahl
Collaborator

Looks like we need a proper set of queries to do a fair comparison and check for compatibility.
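
One way to make such a comparison repeatable is a small harness that runs every query against every endpoint and records the wall-clock time, the number of rows, and any error. The following is only a minimal sketch under stated assumptions: `fetch_json` and `run_benchmark` are hypothetical helper names, and the QLever API URL is an assumption that should be checked against the instance being tested.

```python
import json
import time
import urllib.parse
import urllib.request

# Endpoint URLs are assumptions for illustration; adjust to the instances under test.
ENDPOINTS = {
    "wdqs": "https://query.wikidata.org/sparql",
    "qlever": "https://qlever.cs.uni-freiburg.de/api/wikidata",
}


def fetch_json(endpoint_url, query, timeout=60):
    """Send a SPARQL query via HTTP GET and return the parsed JSON results."""
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        f"{endpoint_url}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "scholia-benchmark-sketch/0.1",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def run_benchmark(queries, endpoints, fetch=fetch_json):
    """Run every named query on every endpoint; record time, row count, or error."""
    rows = []
    for query_name, query in queries.items():
        for endpoint_name, url in endpoints.items():
            start = time.perf_counter()
            try:
                result = fetch(url, query)
                outcome = {"rows": len(result["results"]["bindings"]), "error": None}
            except Exception as error:  # timeouts, HTTP errors, unsupported syntax
                outcome = {"rows": None, "error": str(error)}
            outcome.update(
                query=query_name,
                endpoint=endpoint_name,
                seconds=time.perf_counter() - start,
            )
            rows.append(outcome)
    return rows
```

Calling `run_benchmark({"authors": query_text}, ENDPOINTS)` would return one record per query and endpoint. The `fetch` parameter is injectable so the harness can be tried out without network access. A failed or unsupported query shows up as an `error` entry instead of aborting the run, which also covers the compatibility check.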

@dpriskorn

If we find that QLever is a good alternative for Scholia, I would like to help set it up in the WMC Toolforge Kubernetes cluster :)

@Daniel-Mietchen
Member Author

Just for the record, there is a Phabricator ticket, "Evaluate QLever as a time lagging SPARQL backend to offload the BlazeGraph cluster", with overlapping threads and participants.

@Daniel-Mietchen
Member Author

> Looks like we need a proper set of queries to do a fair comparison and check for compatibility.

Perhaps some Scholia queries could be part of that benchmarking set.
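
To get Scholia queries into such a benchmarking set, they could be collected from files on disk. The sketch below assumes that the queries live as `*.sparql` files in a single directory and that entity-specific queries contain a placeholder such as `{{ q }}`. Both the directory layout and the placeholder syntax are assumptions here, and `load_queries` is a hypothetical helper, not part of Scholia.

```python
from pathlib import Path


def load_queries(directory, qid, placeholder="{{ q }}"):
    """Read every *.sparql file in `directory` into a {name: query} dict.

    `placeholder` (an assumed template marker) is replaced by `qid`
    so that templated queries become runnable.
    """
    queries = {}
    for path in sorted(Path(directory).glob("*.sparql")):
        text = path.read_text(encoding="utf-8")
        queries[path.stem] = text.replace(placeholder, qid)
    return queries
```

The returned dictionary maps each file name (without its extension) to a query string, so it can be passed to whatever tool runs the comparison.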

@Daniel-Mietchen
Member Author

> If we find that QLever is a good alternative for Scholia, I would like to help set it up in the WMC Toolforge Kubernetes cluster :)

Perhaps we won't find out whether it can be that alternative if we do not have such test instances to play around with and to test the workflows (e.g. including exports/ dumps and database refreshes).

@egonw
Collaborator

egonw commented Feb 16, 2024

Testing is easier now, and we should be able to just change this line to test with QLever:

https://github.com/egonw/scholia/blob/8ff64dee13940ad28fd5d7dd97ad4bdc2d2628b4/scholia/query.py#L64
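
The linked line is not reproduced in this thread, so what follows is only a sketch of the general idea: read the SPARQL endpoint URL from configuration instead of hard-coding it, so that switching between engines needs no code change. The environment variable `SCHOLIA_SPARQL_ENDPOINT` and the function `get_sparql_endpoint` are hypothetical names, not Scholia's actual API.

```python
import os

# Default: the public Wikidata Query Service SPARQL endpoint.
DEFAULT_ENDPOINT = "https://query.wikidata.org/sparql"


def get_sparql_endpoint(environ=None):
    """Return the endpoint URL, honouring a (hypothetical) override variable."""
    environ = os.environ if environ is None else environ
    # An unset or blank variable falls back to the default endpoint.
    url = environ.get("SCHOLIA_SPARQL_ENDPOINT", "").strip()
    return url or DEFAULT_ENDPOINT
```

With this in place, a test run against another engine would only require setting the variable to that engine's endpoint URL before starting the application.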

@dpriskorn

# 
# Example Query for 
# https://github.com/WDscholia/scholia/issues/1774
#
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Scholarly articles with full text
SELECT ?paper ?paperLabel ?publishedIn ?publishedInLabel ?event ?eventLabel ?fullText ?authors
WHERE 
{
  ?paper wdt:P31 wd:Q13442814.
  ?paper rdfs:label ?paperLabel. 
  filter(lang(?paperLabel) = "en").
  ?paper wdt:P953 ?fullText. 
  ?paper wdt:P953 ?fullText filter (strends(str(?fullText), ".pdf" )). 
  #filter(regex(?fullText, "\\.pdf\\>$" )). 
  ?paper wdt:P1433 ?publishedIn.
  ?publishedIn rdfs:label ?publishedInLabel. 
  filter(lang(?publishedInLabel) = "en" ).
  ?publishedIn wdt:P4745 ?event. 
  ?event rdfs:label ?eventLabel. 
  filter(lang(?eventLabel) = "en"). 
  OPTIONAL
  {    
     SELECT (GROUP_CONCAT(?authorLabel) as ?authors) WHERE {
       ?paper wdt:P50 ?author.
       ?author rdfs:label ?authorLabel filter(lang(?authorLabel) = 'en').
     } GROUP BY ?paper
  }
}

I found a small bug in that query: a str() was missing in the regex filter. It is fixed here:
https://qlever.cs.uni-freiburg.de/wikidata/KBDY3n
My guess: the author part times out because of a missing index in QLever; we are doing something the designers have not tested.

@WolfgangFahl
Collaborator

@dpriskorn Note that #2412 is intended to provide a viable migration path. And yes, we'd like to run a bunch of Wikidata endpoints with different technologies in the Wikimedia Foundation's data center. Who would be our contact for this?

@fnielsen
Collaborator

Do we have a place to set up a Synia webapp? It could point to that endpoint, and we could test queries more easily.


5 participants