Find entities (people, places, organizations) in Wikidata.

cwrc/wikidata-entity-lookup

wikidata-entity-lookup

Travis Codecov version downloads GPL-3.0 semantic-release Commitizen friendly experimental

  1. Overview
  2. Installation
  3. Use
  4. API
  5. Development

Overview

Finds entities (people, places, organizations, titles) in Wikidata through the entity search module (wbsearchentities) of the MediaWiki API (Wikidata is built on MediaWiki). It is meant to be used with cwrc-public-entity-dialogs, where it runs in the browser.

Although it will not work in Node.js as-is, it uses the Fetch API for HTTP requests, so it could likely run there with a fetch implementation that works in both the browser and Node.js, such as isomorphic-fetch.

Why not SPARQL

Wikidata supports SPARQL (https://www.mediawiki.org/wiki/Wikidata_query_service), but SPARQL has limited support for full text search. SPARQL mostly seems to expect that you know exactly what you are matching on, so a query that spells out the label exactly works fine:

SELECT DISTINCT ?s WHERE {
  ?s ?label "The Rolling Stones"@en .
  ?s ?p ?o
}

We'd like, however, to match with full text search, so that we can match on partial strings, variant spellings, etc. Even in the simple case above, someone searching for The Rolling Stones would have to fully specify 'The Rolling Stones' and not just 'Rolling Stones'. If they left out 'The', the query wouldn't return the result.

There is a SPARQL CONTAINS function that can be used within a FILTER and that matches substrings, which would be better. CONTAINS only supports exact substring matches, with no fuzzy querying, but for names that might be fine.

CONTAINS seems to work fine with some data stores like Getty, but the same query that works on Getty only works occasionally on Wikidata and mostly times out.

There are alternatives to CONTAINS, most notably REGEX, but as described at https://www.cray.com/blog/dont-use-hammer-screw-nail-alternatives-regex-sparql/, REGEX has even worse performance than CONTAINS.
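For illustration, a substring query of this kind could be assembled along the following lines. This is only a sketch: the helper name `buildContainsQuery`, the choice of `rdfs:label` as the predicate, and the `LIMIT` clause are assumptions made for the example and are not part of this package.

```javascript
// Hypothetical helper (not part of this package): builds a SPARQL query that
// matches English labels containing the given term, ignoring case.
function buildContainsQuery(term, limit = 10) {
  // Escape backslashes and double quotes so the term stays a valid string literal.
  const escaped = term
    .toLowerCase()
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"');
  return [
    'PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>',
    'SELECT DISTINCT ?s ?label WHERE {',
    '  ?s rdfs:label ?label .',
    '  FILTER(LANG(?label) = "en")',
    `  FILTER(CONTAINS(LCASE(STR(?label)), "${escaped}"))`,
    '}',
    `LIMIT ${limit}`
  ].join('\n');
}

console.log(buildContainsQuery('Rolling Stones'));
```

Lowercasing both sides makes the match case-insensitive, but it is still a plain substring match, so variant spellings would not be found.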

A further alternative is to use some of the custom full text SPARQL search functions that specific triplestores might offer, and maybe since we are controlling the queries that might be fine. Wikidata, however, doesn’t seem to have anything like this, and while support for full text search in SPARQL is planned, it’s been in the queue for a while: https://phabricator.wikimedia.org/T141813

Wikidata does, though, have another API, the MediaWiki API (Wikidata is built on MediaWiki), at www.wikidata.org/w/. In particular, it has an entity search function that is probably mostly what we want:

https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities
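As a rough sketch, a request URL for this module could be put together by hand as shown here. The helper name `buildEntitySearchUrl` and its default values are assumptions for illustration; this package itself delegates URL construction to wikidata-sdk.

```javascript
// Hypothetical helper: builds a wbsearchentities request URL by hand.
function buildEntitySearchUrl(search, { language = 'en', limit = 10 } = {}) {
  const url = new URL('https://www.wikidata.org/w/api.php');
  url.search = new URLSearchParams({
    action: 'wbsearchentities',
    search,               // the text the user typed
    language,             // language to search labels and aliases in
    limit: String(limit), // maximum number of results
    format: 'json',
    origin: '*'           // allows anonymous cross-origin requests from a browser
  }).toString();
  return url.toString();
}

console.log(buildEntitySearchUrl('Rolling Stones'));
```

Note that no parameter here restricts results to people, places, or organizations; this is the limitation discussed in this section.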

It does not, however, allow specifying the ‘type’ of the entity, i.e., person, place, etc. (in the way that VIAF does) and so we couldn't easily return results by entity type (which we could with SPARQL). Instead one has to show results with mixed entity types for any query. So a query for a person might return results that include any entity type, including places, organizations, or titles.

There are a couple of npm packages for querying Wikidata, wikidata-sdk among them.

Both use the www.wikidata.org/w/api.php API mentioned above. The wikidata-sdk package also allows SPARQL querying but, again, without full text search: it assumes you know exactly the string you are matching.

In summary, if we knew the exact string to match, then we could use SPARQL and thereby filter by type. Otherwise, we have to use the custom, non-SPARQL API, which supports full text search on entities but doesn’t allow filtering the entities by entity type (person, place, org, title).

For now, we've chosen the latter. In particular, we use the wikidata-sdk npm package to construct the URLs for calls to the Wikidata entity search API, which we then invoke (using the Fetch API) to get our results.

Installation

npm i wikidata-entity-lookup

Use

import wikidataLookup from 'wikidata-entity-lookup';

API

findPerson(query)

findPlace(query)

findOrganization(query)

findTitle(query)

where the 'query' argument is an object:

{
    entity: "The name of the thing the user wants to find.",
    options: "TBD"
}

and all find* methods return promises that resolve to an object like the following:

{
    "id": "http://wikidata.org/wikidata/9447148209321300460003/",
    "name": "Fay Jones School of Architecture and Design",
    "nameType": "Corporate",
    "originalQueryString": "jones",
    "repository": "wikidata",
    "uri": "http://wikidata.org/9447148209321300460003/",
    "uriForDisplay": "https://wikidata.org/9447148209321300460003/"
}
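To make the shape above more concrete, here is a minimal sketch of how a single search hit might be converted into such an object. It is not the package's actual code: the function name `toLookupResult` is hypothetical, and the input fields (`label`, `concepturi`, `url`) are assumed from the wbsearchentities response format.

```javascript
// Hypothetical mapping from one wbsearchentities hit to the result shape above.
function toLookupResult(item, originalQueryString, nameType) {
  return {
    id: item.concepturi,
    name: item.label,
    nameType,
    originalQueryString,
    repository: 'wikidata',
    uri: item.concepturi,
    // The API is assumed to return a protocol-relative URL (//www.wikidata.org/wiki/Q...),
    // so add a scheme to make it usable as a display link.
    uriForDisplay: item.url.startsWith('//') ? `https:${item.url}` : item.url
  };
}

// Hand-written sample hit, trimmed to the fields the sketch uses.
const sampleHit = {
  id: 'Q42',
  label: 'Douglas Adams',
  concepturi: 'http://www.wikidata.org/entity/Q42',
  url: '//www.wikidata.org/wiki/Q42'
};

console.log(toLookupResult(sampleHit, 'adams', 'Personal'));
```

Because the search API cannot filter by entity type, the `nameType` in this sketch simply reflects which kind of lookup the caller asked for, not a type confirmed by Wikidata.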

There are a further four methods that are mainly made available to facilitate testing (to make it easier to mock calls to the wikidata service):

getPersonLookupURI(query)

getPlaceLookupURI(query)

getOrganizationLookupURI(query)

getTitleLookupURI(query)

where the 'query' argument is the entity name to find, and the methods return the Wikidata URL that in turn returns results for the query.

Development

CWRC-Writer-Dev-Docs describes general development practices for CWRC-Writer GitHub repositories, including this one.

Mocking

We use fetch-mock to mock http calls (which we make using the Fetch API rather than XMLHttpRequest).
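The idea behind mocking can be sketched without any library. The example below deliberately swaps fetch-mock for a hand-rolled stub so that it has no dependencies; `searchEntities` and `makeFakeFetch` are hypothetical names, and the real tests in this repository use fetch-mock instead.

```javascript
// Hypothetical lookup function that takes the fetch implementation as a
// parameter, so a test can pass in a stub instead of hitting the network.
async function searchEntities(url, fetchImpl) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const body = await response.json();
  return body.search.map((item) => item.label);
}

// Hand-rolled stand-in for fetch: records the URLs it is called with and
// resolves to a minimal Response-like object wrapping canned JSON.
function makeFakeFetch(cannedBody) {
  const calls = [];
  const fakeFetch = async (url) => {
    calls.push(url);
    return { ok: true, status: 200, json: async () => cannedBody };
  };
  fakeFetch.calls = calls;
  return fakeFetch;
}

const fakeFetch = makeFakeFetch({ search: [{ label: 'Douglas Adams' }] });
searchEntities('https://example.org/lookup', fakeFetch).then((labels) => {
  console.log(labels, fakeFetch.calls);
});
```

A stub like this lets a test assert both on the parsed results and on the URL that was requested, which is also what the get*LookupURI methods are exposed for.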

Continuous Integration

We use Travis.

Release

We follow SemVer, which Semantic Release makes easy. Semantic Release reads our commit messages to set the version number, publishes to NPM, and finally generates a changelog and a release (including a git tag) on GitHub.