Load SDC dump to Wikidata endpoint #985

Open
tuukka opened this issue May 20, 2023 · 21 comments

@tuukka

tuukka commented May 20, 2023

Have you considered supporting SDC (Structured Data on Commons) yet? It is the Wikibase instance holding metadata about the images in Wikimedia Commons, and it uses Wikidata items and properties as its vocabulary. In practice, it extends Wikidata with more images and depiction information.

The support might be as easy as loading the SDC dump into the Wikidata endpoint. Alternatively, there could be a separate SDC endpoint but it would also need to contain (a subset of) Wikidata.

The RDF dumps are available here: https://dumps.wikimedia.org/other/wikibase/commonswiki/

More on SDC: https://commons.wikimedia.org/wiki/Commons:Structured_data

EDIT: Documentation of the triples in the dump: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#MediaInfo and https://www.mediawiki.org/wiki/Extension:WikibaseMediaInfo/RDF_mapping

@hannahbast
Member

hannahbast commented May 20, 2023

@tuukka Thanks for the suggestion. I am downloading it right now (that takes a few hours) and will build a QLever instance for it (that will take a few more hours). Looking forward to seeing what's in there, especially since it appears to be quite big (37 GB bz2-compressed).

Do you know why the WCQS does not have unauthenticated access like the WDQS does?

And can you provide one or two useful example queries?

@tuukka
Author

tuukka commented May 20, 2023

Do you know why the WCQS does not have unauthenticated access like the WDQS does?

As I understand it: performance reasons - WMF is unwilling to provide more endpoints while there is no solution to the performance needs of the regular Wikidata Query Service either.

@tuukka
Author

tuukka commented May 20, 2023

And can you provide one or two useful example queries?

Here are some WCQS example queries from the community: https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples

To start with a comparison, in Wikidata, you get image(s) of Douglas Adams like this:

SELECT ?image {
    wd:Q42 wdt:P18 ?image . # Douglas Adams
}

The result is e.g. image http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg

In SDC, you can get the image above and all other images depicting Douglas Adams like this:

SELECT ?file ?image {
    ?file wdt:P180 wd:Q42 . # depicts: Douglas Adams
    ?file schema:url ?image .
}

And the result is e.g. file https://commons.wikimedia.org/entity/M10031710 with image http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg, i.e. the same image URL as above.

Combining ontology information from Wikidata, you can query e.g. all quality images depicting any hummingbird species: [original source]

SELECT ?file ?image {
    ?species wdt:P171/wdt:P171* wd:Q43624. # parent taxon: hummingbird
    ?file wdt:P180 ?species . # depicts
    ?file wdt:P6731 wd:Q63348069 . # Commons quality assessment: Commons quality image
    ?file schema:url ?image .
}

@hannahbast
Member

hannahbast commented May 20, 2023

The instance is up and running now (it took < 2 h to build it). Here are links to your two example queries:

https://qlever.cs.uni-freiburg.de/wikimedia-commons/4TOZwl

https://qlever.cs.uni-freiburg.de/wikimedia-commons/MyAdzj

@tuukka
Author

tuukka commented May 20, 2023

Wow, thank you! I didn't realise you had already implemented federated queries with the SERVICE keyword too.
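
For example, I think this means I can combine the SDC triples with the Wikidata instance in a single query roughly like this (just a sketch; I'm assuming the Wikidata instance is reachable at https://qlever.cs.uni-freiburg.de/api/wikidata):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?file ?image ?label WHERE {
    ?file wdt:P180 ?item .     # depicts (SDC side)
    ?file schema:url ?image .
    SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
        ?item wdt:P31 wd:Q5 .  # instance of: human (Wikidata side)
        ?item rdfs:label ?label .
        FILTER (LANG(?label) = "en")
    }
}
LIMIT 100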

I've now let some people at the Wikimedia Hackathon know about this (unfortunately I couldn't attend myself); this can be very valuable to everyone building tools for SDC.

The holy grail application of this would be faceted search; do you have any tips regarding that? I found ql:has-predicate, is that what we should build on top of? And you wouldn't happen to have a UI similar to this already? 😁 https://github.com/joseignm/GraFa

What I have so far:
Property counts https://qlever.cs.uni-freiburg.de/wikidata/XBe4M8
Object counts https://qlever.cs.uni-freiburg.de/wikidata/zu5gUm
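
For context, the kind of property-count query I have in mind for the SDC data would be roughly this (just a sketch; I'm assuming ql: is QLever's predefined prefix, so that the magic predicate ql:has-predicate works as documented):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?p (COUNT(?file) AS ?count) WHERE {
    ?file wdt:P180 wd:Q42 .      # current selection: files depicting Douglas Adams
    ?file ql:has-predicate ?p .  # which other properties occur on these files?
}
GROUP BY ?p
ORDER BY DESC(?count)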

@hannahbast
Member

Isn't the context-sensitive autocompletion of the QLever UI doing this (and much more)?

For example, if you go to https://qlever.cs.uni-freiburg.de/wikidata you can

  1. Type S and hit Return to get the SELECT * WHERE { ... } query template.
  2. Type a variable name, for example subject
  3. Type any prefix of instance of (or any other alias of wdt:P31) and select wdt:P31/wdt:P279* from the list of suggestions
  4. Type the prefix of any class (for example, per for Person) and select from the list of suggestions
  5. Execute the query

You can incrementally construct arbitrary queries that way.
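
For reference, the query produced by these five steps would look roughly like this (a sketch, assuming the class suggestion resolves to human, wd:Q5):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT * WHERE {
  ?subject wdt:P31/wdt:P279* wd:Q5 .  # ?subject is an instance of (a subclass of) human
}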

@hannahbast
Member

PS: You can also take your query and extend it by a prefix filter, like so (prefix filters are very efficient in QLever):

https://qlever.cs.uni-freiburg.de/wikidata/nN9IDv

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?object (SAMPLE(?object_label) AS ?label) (COUNT(?object) as ?count) WHERE {
  ?item wdt:P18 ?image .
  ?item wdt:P31/wdt:P279* wd:Q838948 .
  ?item wdt:P180 ?object .
  ?object rdfs:label ?object_label .
  FILTER (LANG(?object_label) = "en") .
  FILTER REGEX(STR(?object_label), "^per")
}
GROUP BY ?object ?object_label 
ORDER BY DESC(?count)

@tuukka
Author

tuukka commented May 20, 2023

Isn't the context-sensitive autocompletion of the QLever UI doing this (and much more)?

We mostly have in mind end users who don't understand SPARQL 😁 But yes, the functionality is more or less there in the current editor; that's how I figured it should be feasible. Although I don't see the counts for the autocompletion candidates in the UI 🤔

PS: You can also take your query and extend it by a prefix filter

Good point. I think I need to add a LIMIT so that I don't end up with too many options client-side, and instead use a server-side prefix filter like that.

I think I'll first add simple, non-faceted depictions to Wikidocumentaries though, e.g. here (earlier this was based only on Wikidata and a text search of Commons): https://wikidocumentaries-demo.wmcloud.org/wikipedia/en/Hummingbird?language=en

@hannahbast
Member

hannahbast commented May 21, 2023

Can you explain the use case for the faceted search a bit more? What is it that users ultimately want when, for example, they type per in order to find human (Q5), or hum to find hummingbird (Q43624)?

Is the goal just to find the right QID? You can also do that with the search box at the top right of https://www.wikidata.org, right? In our Information Retrieval lecture, we have an exercise where the goal is to build a version of this with fuzzy search (that is, you can make mistakes). Here is a demo: https://qlever.cs.uni-freiburg.de/wikidata-entity-search (for example, type huming).

If that is not the primary goal, what are the subsequent steps?

@tuukka
Author

tuukka commented May 21, 2023

In addition to making better query builders, I have in mind using faceted search as a powerful tool for exploring big collections (museums, archives, shops) with potentially spotty metadata.

My example queries above come from the hackathon participants' test case of exploring works of art that have photos available. The property counts show there are 800k such works in Wikidata alone, so I can't go through them one by one. But the counts also give me the idea that I could filter by e.g. collection, location, author, material, what is depicted, etc. Or also by e.g. color, but then I wouldn't get that many results. Say I want to choose what is depicted. Next, I can see I could add a filter to see e.g. 14k portraits or 4k horses. I can continue until I (hopefully) see what I want, perhaps after backtracking a few times.

There will be some complications: for instance, in the hummingbird case I want to filter by a property path instead of a direct property. But e.g. the Wikidata Query Builder seems to have some knowledge of typical property paths to use ("Include related values in the search").
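
For instance, the "include related values" idea for the depicts facet might translate into a property path roughly like this (just a sketch, leaving aside for now whether the taxonomy triples live in the same index or behind a SERVICE clause):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT ?file ?image {
    # the direct facet would be: ?file wdt:P180 wd:Q43624 .
    # "related values": also match files depicting any taxon below hummingbird
    ?file wdt:P180/wdt:P171* wd:Q43624 .
    ?file schema:url ?image .
}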

@hannahbast
Member

Interesting, thanks.

As a matter of fact, our older UIs were all faceted-search UIs, for example: https://broccoli.cs.uni-freiburg.de . You can start with a class (for example, Person or Written Work) and then refine from there using the facets (select a subclass, select an instance, add a property, refine via co-occurring words). Is that the kind of thing you imagine?

Such UIs are easier to use, but limited in the kind of query you can ask on the data. That's why we eventually developed QLever. UI-wise, the idea was that it can be useful in two ways:

  1. You can use it to incrementally construct arbitrary SPARQL queries, as done in the QLever UI. That is very powerful, but asks too much of some users.

  2. You can, with little effort, build a special-purpose UI on top of the API. This requires that the suggestions can themselves be computed very efficiently via SPARQL queries, which is the case for QLever (other SPARQL engines are not good at these kinds of queries).
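
For example, a facet-suggestion query for such a special-purpose UI would look very much like the prefix-filter query above, just parameterized by the user's current selection (a sketch, not the queries of any particular UI):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?object (COUNT(?file) AS ?count) WHERE {
  ?file wdt:P6731 wd:Q63348069 .  # current selection: Commons quality images
  ?file wdt:P180 ?object .        # facet to suggest next: depicted entities
}
GROUP BY ?object
ORDER BY DESC(?count)
LIMIT 50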

@tuukka
Author

tuukka commented May 22, 2023

I knew you were working on the cutting edge, but Broccoli must've been 10 years ahead of its time!

Regarding your second point, do you happen to have an example/code of such a special-purpose UI?

We now have a first, very limited but testable implementation of faceted browsing in Wikidocumentaries; see "Depictions from Wikimedia Commons", e.g. here: https://wikidocumentaries-demo.wmcloud.org/wikipedia/en/Birds?language=en

Our code is available here, and any feedback is welcome, especially on how to improve the SPARQL queries: https://github.com/Wikidocumentaries/wikidocumentaries-ui/blob/master/src/components/topic_page/DepictingImages.vue

@hannahbast
Member

hannahbast commented May 24, 2023

Quick question: Queries like https://qlever.cs.uni-freiburg.de/wikimedia-commons/MdUKbU are coming from you, right?

I am asking because for some reason the contained SERVICE queries all take 5 + epsilon seconds and we don't yet know why (they should be much faster, since the respective queries to the Wikidata instance are fast).

If you want to ask more such queries, I would for now (until we resolve that problem) simply build a joint index for Wikidata and Wikimedia Commons.

And out of curiosity: who is generating the traffic, is it you via tests or is it actual users?

@tuukka
Author

tuukka commented May 24, 2023

Sorry for the delay, for some reason I didn't get a notification about your message.

Quick question: Queries like https://qlever.cs.uni-freiburg.de/wikimedia-commons/MdUKbU are coming from you, right?

Right.

I am asking because for some reason the contained SERVICE queries all take 5 + epsilon seconds and we don't yet know why (they should be much faster, since the respective queries to the Wikidata instance are fast).

Good to know. Something that might matter is that I currently make multiple parallel requests; perhaps they interact badly?

If you want to ask more such queries, I would for now (until we resolve that problem) simply build a joint index for Wikidata and Wikimedia Commons.

I wanted to ask about that anyway, since the need for SERVICE makes some query features more difficult to write and some perhaps impossible: if I want to ask for ql:has-predicate of something inside SERVICE, can it take into account the restrictions caused by triples outside of SERVICE?
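
To illustrate what I mean (just a sketch, again assuming the Wikidata instance is reachable at https://qlever.cs.uni-freiburg.de/api/wikidata and that ql:has-predicate is allowed inside SERVICE at all):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?p (COUNT(DISTINCT ?item) AS ?count) WHERE {
    ?file wdt:P180 ?item .  # restriction outside SERVICE (SDC side)
    SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
        ?item ql:has-predicate ?p .  # ideally computed only for the ?item bindings from outside
    }
}
GROUP BY ?p
ORDER BY DESC(?count)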

Also, if preferable and the necessary scripts / instructions are available, I may be able to set up an instance on a Wikimedia Cloud VPS.

And out of curiosity: who is generating the traffic, is it you via tests or is it actual users?

Probably both, and also bots. If you have User-Agent logs, you should be able to tell apart Googlebot, my dev environment (on Linux Firefox), and actual users.

@tuukka
Author

tuukka commented May 24, 2023


Also, if you can log the Origin header of the requests, you will see which queries come from the deployed version at https://wikidocumentaries-demo.wmcloud.org/

And let me know if I should limit the number and/or complexity of the requests I'm sending.

@hannahbast
Member

hannahbast commented May 25, 2023

Quick update: We found (already yesterday) the reason why the SERVICE queries always took "5 + epsilon" seconds. The respective QLever backend ran inside a Docker container, and it so happened that Docker containers on that particular machine had a five-second latency for any network-related request (probably due to problems with DNS lookups). As a quick fix, the backend now runs outside of Docker and the SERVICE queries are as fast as they should be, for example: https://qlever.cs.uni-freiburg.de/wikimedia-commons/fwdZ1M

@tuukka
Author

tuukka commented Jun 21, 2023

I'm trying to finish a "version 0.9" of the UI, but I'm getting a lot of nondeterministic 400 out-of-memory responses. I had a look at the query analyzer and two things stood out:

  1. Even if there are few items (and files), ?file (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608) ?item is slow. It seems it would be faster to do ?file ?p ?item and filter afterwards (see the sketch after the full query link below). Is this expected?

  2. Even if there are few files (and images), ?file schema:url ?image is slow:

INDEX SCAN ?file <url> ?image
Cols: ?file, ?image
Size: 87,160,263 x 2 [~ 87,160,263]
Time: 137ms [~ 87,160,263]

Full query: https://qlever.cs.uni-freiburg.de/wikimedia-commons/dGYCdH
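
For point 1, the reformulation I had in mind looks roughly like this (just a sketch, with a concrete item standing in for the handful of ?item bindings in the full query):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT ?file ?p ?image {
    VALUES ?p { wdt:P180 wdt:P921 wdt:P6243 wdt:P195 wdt:P608 }
    ?file ?p wd:Q43624 .  # generic triple instead of the property union, restricted via VALUES
    ?file schema:url ?image .
}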

And more generally, are there any new thoughts from your side regarding this topic (faceted browsing of SDC + Wikidata, and my implementation of it)?

@hannahbast
Member

I am currently traveling and will try to look at it tonight or tomorrow. Maybe @joka921 can say something about the out of memory responses?

@tuukka
Author

tuukka commented Jun 21, 2023

Thank you @hannahbast!

After I wrote my previous message, the SPARQL endpoint went down and now only responds with 503 Service Unavailable: https://qlever.cs.uni-freiburg.de/api/wikimedia-commons

@hannahbast
Member

I am sorry, I don't know what happened, but the endpoint is now up again. More tomorrow

@joka921
Member

joka921 commented Jun 28, 2023

@tuukka Thanks for your feedback. It seems like your faceted-search system issues a lot of queries that follow a similar template and only have a small variable part (that is very typical for applications where a frontend internally issues SPARQL queries). The easiest solution would be to identify the building blocks of your queries that

  1. Are part of (almost) every query your system issues and
  2. Are comparatively expensive to compute.

Given the example queries earlier in this thread, it seems like this could be the case for the complete schema:url predicate and for the union (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608). We could then precompute these building blocks and pin them to our subtree cache, so they wouldn't have to be computed from scratch for every query that uses them. By default we pin, for example, the predicates for English labels, as they occur in almost every query.
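
Concretely, these two building blocks correspond to the results of queries along the following lines (a sketch of what would be precomputed and pinned, not of the pinning mechanism itself):

# the complete schema:url predicate
PREFIX schema: <http://schema.org/>
SELECT ?file ?image WHERE {
  ?file schema:url ?image .
}

# the union of "depicts-like" properties used by the UI
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?file ?item WHERE {
  ?file (wdt:P180|wdt:P921|wdt:P6243|wdt:P195|wdt:P608) ?item .
}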

Additionally, we could in general try to perform some query engineering (reformulating queries in a way that is equivalent but cheaper to compute; query planning is a hard problem, and sometimes we can help the system).

If you can identify such parts of your queries that occur in many of your requests and point them out, then we can try to pin some of them to the cache and see whether this helps your system.
