Feat add endpoint and pages options to sparql filter #105
… endpoint which will be used
…fetches the page urls from SPARQL
Hmmph, I need to redo this, as it deletes configs.

It was OK after all; it was just a weird whitespace-change effect, because the edit was at the end of the file.
Pull request overview
Adds support to the SPARQL filter for (1) querying against a configurable SPARQL endpoint and (2) a new mode=pages where the query returns article URIs directly (for multiwiki/incubator use cases like sparqlbridge).
Changes:
- Extend `SparqlFilter` to accept `endpoint` and `mode` (`items` vs `pages`) and to query the configured endpoint.
- Add tests covering the new endpoint/mode plumbing and `pages` URL filtering behavior.
- Expose `endpoint`/`mode` template parameters in multiple site configs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| ukbot/filters.py | Adds endpoint + mode support, endpoint validation, and pages-mode URI parsing. |
| test/test_filters.py | Adds unit tests for SparqlFilter endpoint/mode behavior and pages filtering. |
| config/sites/nowiki.yml | Exposes endpoint and mode as SPARQL filter params. |
| config/sites/glwiki.yml | Exposes endpoint and mode as SPARQL filter params. |
| config/sites/fiwiki.yml | Exposes endpoint and mode as SPARQL filter params. |
| config/sites/euwiki.yml | Exposes endpoint and mode as SPARQL filter params. |
| config/sites/eswiki.yml | Exposes endpoint and mode as SPARQL filter params. |
| config/sites/enwiki.yml | Exposes endpoint and mode as SPARQL filter params. |
| config/sites/cawiki.yml | Exposes endpoint and mode as SPARQL filter params. |
| config/config.se.yml | Adds endpoint mapping for the SPARQL filter (but not mode). |
    query_param = cfg['params']['query']
    if not tpl.has_param(query_param):
        raise RuntimeError(_('No "%s" parameter given') % cfg['params']['query'])

    endpoint_param = cfg['params'].get('endpoint')
SparqlFilter.make is now reading query_param from cfg['params']['query'] (a localized parameter name) and then passing that into tpl.has_param(...) / tpl.get_raw_param(...). Because FilterTemplate.has_param/get_raw_param already localize the provided internal key, this effectively double-localizes and will fail on wikis where the query param is translated (e.g. nowiki uses spørring). Use the internal keys ('query', 'endpoint', 'mode') when calling tpl.*, and let FilterTemplate handle localization.
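The double-localization problem can be illustrated with a small stand-in. This is only a sketch: `FakeFilterTemplate` is a hypothetical class, not the real `FilterTemplate`, and it assumes that localization is a strict lookup from internal key to localized name, so passing an already-localized name fails.

```python
class FakeFilterTemplate:
    """Hypothetical stand-in for FilterTemplate (not the real class)."""

    def __init__(self, localized_names, template_params):
        self.localized_names = localized_names  # internal key -> localized name
        self.template_params = template_params  # localized name -> raw value

    def _localize(self, key):
        # Assumption: only internal keys are known; anything else is an error.
        return self.localized_names[key]

    def has_param(self, key):
        return self._localize(key) in self.template_params

    def get_raw_param(self, key):
        return self.template_params[self._localize(key)]


def read_query(tpl):
    """Read the query using the internal key and let the template localize it."""
    if not tpl.has_param('query'):
        raise RuntimeError('No "query" parameter given')
    return tpl.get_raw_param('query')


# nowiki-style setup: the internal key 'query' is written as 'spørring' on-wiki.
cfg_params = {'query': 'spørring'}
tpl = FakeFilterTemplate(cfg_params, {'spørring': '?item wdt:P31 wd:Q5 .'})

query = read_query(tpl)  # internal key: localized exactly once

try:
    tpl.has_param(cfg_params['query'])  # already-localized name: localized twice
    double_localized_ok = True
except KeyError:
    double_localized_ok = False
```

On an English config the two names coincide, which is why the bug would only surface on wikis with translated parameter names.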
    self.endpoint = endpoint or 'https://query.wikidata.org/sparql'
    endpoint_scheme = urllib.parse.urlparse(self.endpoint).scheme.lower()
    if endpoint_scheme not in ['http', 'https']:
        raise ValueError('Invalid sparql endpoint scheme: %s' % endpoint_scheme)
    if mode not in ['items', 'pages']:
Allowing a user-provided endpoint makes the bot perform outbound HTTP requests to arbitrary hosts (contest pages are wiki-editable). Validating only the URL scheme (http/https) still permits SSRF to localhost/private IP ranges and internal services. Consider adding host allowlisting in config, and/or explicitly blocking localhost + RFC1918/link-local ranges after DNS resolution, so only trusted SPARQL endpoints can be used.
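One possible shape for such a check, as a sketch only: `validate_endpoint`, `resolve_addresses` and `ALLOWED_SPARQL_HOSTS` are hypothetical names, and the allowlist contents are an assumption rather than anything taken from the bot's config. The resolver is injectable so the check can be tested without network access.

```python
import ipaddress
import socket
import urllib.parse

# Hypothetical allowlist; in practice this would come from the bot's config.
ALLOWED_SPARQL_HOSTS = {'query.wikidata.org', 'query-main.wikidata.org'}


def resolve_addresses(host, port):
    """Resolve host to a set of ipaddress objects using the system resolver."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Strip any IPv6 scope id ('fe80::1%eth0') before parsing.
    return {ipaddress.ip_address(info[4][0].split('%')[0]) for info in infos}


def is_public_address(addr):
    """Reject loopback, RFC 1918/private, link-local and other special ranges."""
    return not (addr.is_private or addr.is_loopback or addr.is_link_local
                or addr.is_multicast or addr.is_reserved or addr.is_unspecified)


def validate_endpoint(endpoint, allowed_hosts=None, resolver=resolve_addresses):
    """Return endpoint if it passes the scheme, allowlist and address checks."""
    parsed = urllib.parse.urlparse(endpoint)
    scheme = parsed.scheme.lower()
    if scheme not in ('http', 'https'):
        raise ValueError('Invalid sparql endpoint scheme: %s' % scheme)
    host = parsed.hostname  # already lower-cased by urlparse
    if not host:
        raise ValueError('Sparql endpoint has no host: %s' % endpoint)
    if allowed_hosts is not None and host not in allowed_hosts:
        raise ValueError('Sparql endpoint host not allowed: %s' % host)
    port = parsed.port or (443 if scheme == 'https' else 80)
    for addr in resolver(host, port):
        if not is_public_address(addr):
            raise ValueError('Sparql endpoint resolves to a non-public address: %s' % addr)
    return endpoint
```

Note that this validates the address at check time only; unless the HTTP client connects to the same resolved address, a DNS change between the check and the request could still bypass it, so the allowlist is the stronger of the two measures.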
    item_var = 'item'
    if self.mode == 'pages':
        self.add_pages()
        logger.info('SparqlFilter: Initialized with %d articles', len(self.page_keys))
        return
In mode='pages', add_pages() relies on do_query() selecting the first variable in the SPARQL result (head.vars[0]). This means a valid query that returns ?article but lists another variable first (e.g. SELECT ?item ?article WHERE ...) will silently produce wrong/empty results. Since the PR description says ?article is the intended variable, consider letting do_query() accept an explicit variable name (e.g. var='article') and raising a clear error if it is missing.
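Selecting the variable by name could look roughly like the following. `extract_variable` is a hypothetical helper, not the existing `do_query()`; it assumes the response follows the standard SPARQL JSON results layout (`head.vars` and `results.bindings`).

```python
def extract_variable(result, var):
    """Return the values bound to `var` in a SPARQL JSON result.

    Raises ValueError when the query did not select `var` at all, instead of
    silently falling back to whichever variable happens to be listed first.
    """
    available = result.get('head', {}).get('vars', [])
    if var not in available:
        raise ValueError(
            'SPARQL result has no ?%s variable (got: %s)' % (var, ', '.join(available))
        )
    values = []
    for binding in result.get('results', {}).get('bindings', []):
        cell = binding.get(var)
        if cell is not None:  # the variable may be unbound in some rows
            values.append(cell['value'])
    return values


# A result where ?article is selected but is not the first variable.
sample = {
    'head': {'vars': ['item', 'article']},
    'results': {'bindings': [
        {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q1'},
         'article': {'type': 'uri', 'value': 'https://fi.wikipedia.org/wiki/Foo'}},
        {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q2'}},
    ]},
}
articles = extract_variable(sample, 'article')
```

With this shape, `pages` mode would ask for `'article'` and `items` mode for `'item'`, and a query that omits the expected variable fails loudly rather than producing an empty filter.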
    @patch('ukbot.filters.SparqlFilter.fetch')
    def test_make_reads_endpoint_param(self, fetch_mock):
        tpl = Mock()
        tpl.sites = Mock()
        tpl.has_param = lambda name: name in ['query', 'endpoint']
        tpl.get_raw_param = lambda name: {
            'query': 'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . }',
            'endpoint': 'https://example.org/sparql',
        }[name]
These tests mock tpl directly and therefore don’t exercise the real localization logic in FilterTemplate (where parameter names may be translated, e.g. query -> spørring). Given the changes in SparqlFilter.make, adding a test that uses a real FilterTemplate instance (or at least simulates has_param/get_raw_param localization behavior) would catch regressions on non-English configs.
    ignore: ignore
    sparql: sparql # as in {{ ukb criterion | sparql }}
    query: query # as in {{ ukb criterion | sparql | query=... }}
    endpoint: endpoint
    pages:
config/config.se.yml adds endpoint but not mode, while the new SparqlFilter supports a mode parameter and the other site configs in this PR expose it. If this config is still used, add the corresponding mode translation/mapping here as well so mode=pages can be set on sewiki contests.
Description
This adds support for using full SELECT queries and for returning article URIs directly from SPARQL in the ?article variable.

Example:

    {{Viikon kilpailu kriteerit|sparql|mode=pages|query=SELECT ?article WHERE { hint:Query hint:optimizer "None" . SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> { SERVICE <https://sparqlbridge.toolforge.org/newpages/sparql/wiki=fi,smn,olo,se,incubator&include_edited_pages=1&timestamp=20260331&user_list_page=w:fi:Wikiprojekti:Punaisten_linkkien_naiset/2026> { SELECT ?article ?item WHERE { ?article <http://schema.org/about> ?item . } GROUP BY ?article ?item } } ?item wdt:P21 ?gender . FILTER (?gender NOT IN (wd:Q6581097, wd:Q44148, wd:Q2449503)) } GROUP BY ?article }}

The endpoint definition works like this:

    {{Viikon kilpailu kriteerit|sparql|endpoint=https://query-main.wikidata.org/sparql|mode=pages|query=?item wdt:P31 wd:Q146. }}

How to test

    ukbot --page Wikiprojekti:Punaisten_linkkien_naiset/2026-test --simulate config/config.fi-pln.yml

Note: QLever can fail with status code 503; this is not related to our code.
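In pages mode the query returns article URIs rather than Wikidata items, so each URI has to be mapped back to a site and page title. The sketch below shows one way that could be done; `parse_article_uri` is a hypothetical helper, and the assumption that articles live under a `/wiki/` path prefix is not taken from the PR's actual parsing code.

```python
import urllib.parse


def parse_article_uri(uri):
    """Split an article URI into (host, page title).

    Assumes MediaWiki-style URLs of the form https://<host>/wiki/<Title>.
    """
    parsed = urllib.parse.urlparse(uri)
    if parsed.scheme not in ('http', 'https') or not parsed.hostname:
        raise ValueError('Not an http(s) article URI: %s' % uri)
    prefix = '/wiki/'
    if not parsed.path.startswith(prefix):
        raise ValueError('Unexpected article path: %s' % parsed.path)
    # Decode percent-escapes and use spaces, as in displayed page titles.
    title = urllib.parse.unquote(parsed.path[len(prefix):]).replace('_', ' ')
    if not title:
        raise ValueError('Article URI has no title: %s' % uri)
    return parsed.hostname, title


host, title = parse_article_uri('https://fi.wikipedia.org/wiki/Minna_Canth')
```

Incubator pages keep their prefix in the title under this scheme, for example `Wp/sms/...` on `incubator.wikimedia.org`.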
#104
What type of PR is this? (check all applicable)
Related Tickets & Documents
Tested?
Added to documentation?
[optional] Are there any pre- or post-deployment tasks we need to perform?