# Accessing Wikidata with Python
Leon Kastler<br/>
September 5, 2018. Zurich, Switzerland.

In this session, we introduce two ways to access Wikidata: via its API and via [pywikibot](https://www.mediawiki.org/wiki/Manual:Pywikibot), a Python library.

# Wikidata API
We start with the [Wikidata API](https://www.wikidata.org/w/api.php), an HTTP-based API to search, retrieve, and manipulate entities in Wikidata.
All API calls go to the URI https://www.wikidata.org/w/api.php.
All requests are HTTPS calls, so we can use Python's [requests](http://docs.python-requests.org) or any other HTTP/S library, e.g. `urllib` from the standard library.
The structure of an HTTP/S call or request is fairly simple: a _client_ sends a _request_ to a _server_ and receives a _response_ from it.
There are different kinds of requests a client can make, but we use only two of them here: _GET_ and _POST_.
Simplified, the two types define how _parameters_, which give further information, are submitted and let the server know where to look for them.
The two are not interchangeable. A good rule of thumb is to use _GET_ for getting data and _POST_ for adding or manipulating data.
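
To make the difference concrete, the following minimal sketch uses only Python's standard library to show where the same parameters end up in each case. It sends nothing over the network, and the parameter values are only illustrative.

```python
from urllib.parse import urlencode

API_URI = 'https://www.wikidata.org/w/api.php'
parameters = {'action': 'wbsearchentities', 'format': 'json'}

# encode the parameters as key=value pairs joined by '&'
encoded = urlencode(parameters)

# GET: the parameters become part of the URL (the query string)
get_url = API_URI + '?' + encoded
# POST: the URL stays unchanged and the parameters travel in the request body
post_url = API_URI
post_body = encoded.encode('utf-8')

print('GET  url: ', get_url)
print('POST url: ', post_url)
print('POST body:', post_body)
```

With `requests`, the `params` argument of `requests.get` and the `data` argument of `requests.post` take care of this encoding for us.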

In [11]:
import requests

API_URI = 'https://www.wikidata.org/w/api.php'

## search entities
The operation we want the API to perform is defined by a so-called _action_, which is passed to the API as the `action` parameter.
The first action we use is called [`wbsearchentities`](https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities), which stands for "Wikibase entity search"; Wikibase is the software that powers Wikidata itself.
It offers the same functionality as the search box of Wikidata; in fact, the search box uses the same API action.
You can see the complete action description [here](https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities).
In general, it is useful to set the response format to `json` with the `format` parameter, so that we can handle the response easily.
Let's do a search for entities with "Zürich" in German and see what we find:

In [12]:
search_parameters = {
    'action': 'wbsearchentities',
    'format': 'json',
    'language': 'de',
    'type': 'item',
}

search_parameters['search'] = 'Zürich'

response = requests.get(API_URI, params=search_parameters)

search_results = response.json()['search']

print('fields of a search hit:', search_results[0].keys())

print()

for hit in search_results:
    page_id = hit['id']
    label = hit['label'] if 'label' in hit else ''
    description = '(' + hit['description'] + ')' if 'description' in hit else ''
    print(page_id + ': ' + label + ' ' + description)

fields of a search hit: dict_keys(['repository', 'id', 'concepturi', 'title', 'pageid', 'url', 'label', 'description', 'match'])

Q72: Zürich (capital of the canton of Zürich, Switzerland)
Q11943: canton of Zürich (canton of Switzerland)
Q660732: Zürich District 
Q30998: Zurich (region of Switzerland)
Q19240447: Zürich 
Q33729618: Zürich (Wikimedia duplicated page)
Q68165: Kloten (municipality in Switzerland)


So what happened here?
In lines `1` to `6`, we defined the basic parameters for our search.
We set the `action` and the response format, used the `language` parameter to state that we search for a German word, and used the `type` parameter to state that we search for a [Wikidata item](https://www.mediawiki.org/wiki/Wikibase/API#Wikibase_and_Wikidata).
We then added the `search` parameter in line `8`.
The actual API call is executed in line `10` with `requests.get`.
As we can see here, this is a _GET_ request.
In line `12`, we parse the JSON response, which contains the search result if the request was successful.
We print the fields contained in a search hit in line `14`.
The interesting fields are:

- `id`: the entity's Q-number
- `concepturi`: the entity's URI for dereferencing
- `label`: the entity's label in the requested language (if existing).
- `description`: the entity's description in the requested language (if existing).

From line `18` on, we iterate over the results, extract each item's Q-number and, if they exist, its German label and description, and print them out in line `22`.
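
The checks for missing fields in the loop can also be written with `dict.get`, which returns a default when a field is absent. The sketch below wraps that in a helper (the name `format_hit` is ours, not part of the API) and applies it to two hits copied from the output above, reduced to the fields used.

```python
def format_hit(hit):
    """Return 'Q-number: label (description)' for one search hit."""
    label = hit.get('label', '')
    description = hit.get('description')
    text = hit['id'] + ': ' + label
    if description:
        text += ' (' + description + ')'
    return text


# two hits from the output above, trimmed to the fields we need
sample_hits = [
    {'id': 'Q72', 'label': 'Zürich',
     'description': 'capital of the canton of Zürich, Switzerland'},
    {'id': 'Q660732', 'label': 'Zürich District'},
]

for hit in sample_hits:
    print(format_hit(hit))
```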

## retrieve an entity
The search itself returns only little information about each hit, so we want more.
The action [`wbgetentities`](https://www.wikidata.org/w/api.php?action=help&modules=wbgetentities) lets us retrieve all information about an entity.
Have a look at the following code:

In [13]:
response = requests.get(API_URI, params={
    'action': 'wbgetentities',
    'format': 'json',
    'ids': 'Q72|P17'
})

# item: Zurich
zurich = response.json()['entities']['Q72']

print('Zurich\'s fields:', zurich.keys())

print()

print('Zurich\'s Q-Number:', zurich['title'])
print('Zurich\'s English label:', zurich['labels']['en']['value'])
print('Zurich\'s English description:', zurich['descriptions']['en']['value'])

print('Claims about Zurich: ', zurich['claims'].keys())

Zurich's fields: dict_keys(['pageid', 'ns', 'title', 'lastrevid', 'modified', 'type', 'id', 'labels', 'descriptions', 'aliases', 'claims', 'sitelinks'])

Zurich's Q-Number: Q72
Zurich's English label: Zürich
Zurich's English description: capital of the canton of Zürich, Switzerland
Claims about Zurich:  dict_keys(['P1151', 'P31', 'P1036', 'P17', 'P131', 'P190', 'P30', 'P94', 'P373', 'P402', 'P242', 'P18', 'P281', 'P47', 'P421', 'P473', 'P625', 'P6', 'P771', 'P37', 'P856', 'P910', 'P948', 'P227', 'P244', 'P214', 'P998', 'P982', 'P646', 'P902', 'P268', 'P269', 'P41', 'P1464', 'P1465', 'P1566', 'P1456', 'P1740', 'P150', 'P1791', 'P1792', 'P1376', 'P166', 'P2046', 'P194', 'P935', 'P1296', 'P1997', 'P1417', 'P2044', 'P1281', 'P2959', 'P3222', 'P3417', 'P1448', 'P1842', 'P3984', 'P206', 'P1325', 'P2347', 'P1225', 'P3241', 'P2184', 'P2581', 'P4672', 'P1937', 'P361', 'P3219', 'P1705', 'P5019', 'P463', 'P1313', 'P571', 'P1889', 'P1082', 'P213', 'P5573', 'P949', 'P1435'])


We condensed the `requests` call starting in line `1`, so that all parameters are filled in directly.
Note that we can ask for multiple entities with the `ids` parameter by separating them with the `|` character.
In line `8`, we extract the info for [_Q72_ (Zurich)](https://www.wikidata.org/wiki/Q72).
We call `keys()` on the extracted information in line `10` to show what it offers.
For us, the important ones are:

- `title`: the page title of the entity (the Q-number for items; for properties, the P-number prefixed with `Property:`)
- `labels`: all multilingual labels for the entity
- `descriptions`: all multilingual descriptions for the entity
- `claims`: claims made about the entity

From line `14` on, we print the Q-number, English label and description, and the keys for claims made about Zurich.
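
Not every entity has a label in every language, so indexing `labels` directly can raise a `KeyError`. A minimal sketch of a lookup with a fallback language follows. The helper name `get_label` and the trimmed sample entity are illustrative assumptions; the nesting follows the response shown above.

```python
def get_label(entity, language, fallback='en'):
    """Return the label in `language`, else in `fallback`, else the entity id."""
    labels = entity.get('labels', {})
    for code in (language, fallback):
        if code in labels:
            return labels[code]['value']
    return entity['id']


# trimmed stand-in for response.json()['entities']['Q72']
sample_entity = {
    'id': 'Q72',
    'labels': {'en': {'language': 'en', 'value': 'Zürich'}},
}

print(get_label(sample_entity, 'en'))
# the sample has no Romansh ('rm') label, so the English one is used
print(get_label(sample_entity, 'rm'))
```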

We can now repeat the procedure for the [property country](https://www.wikidata.org/wiki/Property:P17), since properties are accessed in the same way as items.

In [14]:
# property: country
property_country = response.json()['entities']['P17']

print('property Country\'s P-Number:', property_country['title'])
print('property Country\'s English label:', property_country['labels']['en']['value'])
print('property Country\'s English description:', property_country['descriptions']['en']['value'])

print('Claims about property Country: ', property_country['claims'].keys())

property Country's P-Number: Property:P17
property Country's English label: country
property Country's English description: sovereign state of this item; don't use on humans
Claims about property Country:  dict_keys(['P1659', 'P1629', 'P1647', 'P31', 'P2875', 'P1709', 'P1628', 'P3713', 'P3254', 'P2302', 'P1855', 'P3734', 'P3709'])
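
Both entities came from a single request because `ids` accepts several identifiers separated by `|`. For longer lists, a sketch like the following can prepare the `ids` values. The helper name `build_id_batches` is hypothetical, and the default batch size of 50 is an assumption based on the limit the API help lists for ordinary users, so it is worth checking there before relying on it.

```python
def build_id_batches(ids, batch_size=50):
    """Split ids into '|'-joined strings of at most batch_size ids each."""
    return ['|'.join(ids[start:start + batch_size])
            for start in range(0, len(ids), batch_size)]


# one batch, as in the request above
print(build_id_batches(['Q72', 'P17']))
# a small batch size to show the splitting
print(build_id_batches(['Q72', 'P17', 'P31'], batch_size=2))
```

Each returned string can then be used as the `ids` parameter of one `wbgetentities` request.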


## retrieve preferred ranked claims for property "instance of" of entity "Zurich"
For the following example, we retrieve specific claims directly, without fetching the whole entity first.
The action [`wbgetclaims`](https://www.wikidata.org/w/api.php?action=help&modules=wbgetclaims) lets us retrieve the claims for a specific property of a specific entity and also apply further constraints, for example returning only preferred claims.
Have a look at this example:

In [15]:
response = requests.get(API_URI, params={
    'action': 'wbgetclaims',
    'entity': 'Q72',
    'format': 'json',
    'property': 'P31',
})
claims_for_P31 = response.json()['claims']['P31']

response = requests.get(API_URI, params={
    'action': 'wbgetclaims',
    'entity': 'Q72',
    'format': 'json',
    'property': 'P31',
    'rank': 'preferred'
})
preferred_claims_for_P31 = response.json()['claims']['P31']

print('number of claims for Q72\'s P31:', len(claims_for_P31))
print('number of preferred claims for Q72\'s P31:', len(preferred_claims_for_P31))

number of claims for Q72's P31: 6
number of preferred claims for Q72's P31: 4


We skip the explanation of the requests, since they look similar to the previous ones.
We first retrieve all claims for [P31 (instance of)](https://www.wikidata.org/wiki/Property:P31) for [Q72 (Zurich)](https://www.wikidata.org/wiki/Q72) and then only the preferred ones.
As we can see from the output of lines `18` and `19`, there are 6 claims in total, but only 4 are regarded as preferred.
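
Each claim also carries its own `rank` field, so the same distinction can be made locally on the unfiltered result. The sketch below counts claims per rank. The sample list is made up for illustration and only mimics the shape of a `wbgetclaims` response, and the helper name `count_ranks` is ours.

```python
from collections import Counter


def count_ranks(claims):
    """Count how many claims have each rank (preferred, normal, deprecated)."""
    return Counter(claim['rank'] for claim in claims)


# made-up stand-in for response.json()['claims']['P31']
sample_claims = [
    {'rank': 'preferred'},
    {'rank': 'preferred'},
    {'rank': 'normal'},
]

ranks = count_ranks(sample_claims)
print(ranks['preferred'], 'preferred,', ranks['normal'], 'normal')
```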

## wikibase add claim
We now want to create a claim for an entity.
For this example, we use the entity [`Q15397819`](https://www.wikidata.org/wiki/Q15397819), one of the so-called sandbox entities.
A general rule for scripts and software that manipulate Wikidata is to be very careful about what they do.
While developing a program, always use either one of the sandbox entities or the [Wikidata sandbox](https://test.wikidata.org/wiki/Wikidata:Main_Page) itself.

For the claim creation, we use the action [`wbcreateclaim`](https://www.wikidata.org/w/api.php?action=help&modules=wbcreateclaim). However, since this adds data to Wikidata, we need a _token_ that identifies us as the originator of the API call.
To get the token, we have to go through a three-step authentication.

### create a session and retrieve a login token
The first step is to create a so-called _session_.
Under normal conditions, our API calls are not related to each other, but since the authentication has multiple steps, we need to connect them.
This is what the session is for.
The first action we need is `query`, which lets us request a so-called _login token_ that allows us to log in.
The following code requests this token and stores it in a variable.

In [16]:
# create session
session = requests.Session()
# retrieve login token
response = session.get(API_URI, params={
    'action': 'query',
    'format': 'json',
    'meta': 'tokens',
    'type': 'login',
})
token = response.json()['query']['tokens']['logintoken']
print('retrieved login token:', token)

retrieved login token: 4aae7167ad3a5277c70dddd9a2100b4e5b901238+\


Next, we perform the actual login with our new token.
In general, it is not recommended to use your normal login password; use so-called [bot passwords](https://www.mediawiki.org/wiki/Manual:Bot_passwords) instead.
A bot password works like a separate login for your account, for which you can set restricted access rights.
Once you no longer need a bot password, you can remove it.
To create a bot password in Wikidata, use [this link](https://www.mediawiki.org/wiki/Special:BotPasswords).
Once the bot password has been created, we can use the `login` action to log in.

In [17]:
# log in
response = session.post(API_URI, data={
    'action': 'login',
    'lgname': 'Lkastler@wikidata_zurich',
    'lgpassword': '7343er7i5aspf9182s13mp70j8b9tjj3',
    'lgtoken': token,
    'format': 'json'
})

print('login:', response.json()['login']['result'])

login: Success


Great!
Next, we retrieve another token with the `query` action but this time without the `type` parameter (see how we retrieved a login token).

In [18]:
# retrieve actual token
response = session.get(API_URI, params={
    'action': 'query',
    'format': 'json',
    'meta': 'tokens',
})
token = response.json()['query']['tokens']['csrftoken']

print('working token:', token)

working token: 28874f5c55402bbe69d60754fe625bac5b90123b+\
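
The three steps so far can be collected into one function. This is a sketch under the assumption that `session` behaves like the `requests.Session` used above; the function name `fetch_csrf_token` and the environment variable names in the usage note are our own choices.

```python
API_URI = 'https://www.wikidata.org/w/api.php'


def fetch_csrf_token(session, username, password):
    """Log in with a bot password and return a CSRF token for editing."""
    # step 1: request a login token
    response = session.get(API_URI, params={
        'action': 'query', 'format': 'json',
        'meta': 'tokens', 'type': 'login',
    })
    login_token = response.json()['query']['tokens']['logintoken']

    # step 2: log in with the bot password and the login token
    response = session.post(API_URI, data={
        'action': 'login', 'format': 'json',
        'lgname': username, 'lgpassword': password, 'lgtoken': login_token,
    })
    result = response.json()['login']['result']
    if result != 'Success':
        raise RuntimeError('login failed: ' + result)

    # step 3: request the token used for edits (the default token type)
    response = session.get(API_URI, params={
        'action': 'query', 'format': 'json', 'meta': 'tokens',
    })
    return response.json()['query']['tokens']['csrftoken']
```

It could be called as `fetch_csrf_token(requests.Session(), os.environ['WIKIDATA_BOT_USER'], os.environ['WIKIDATA_BOT_PASSWORD'])`, which keeps the bot password out of the notebook itself.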


Ok, so we now have a token that we can use for all following requests within this session.
We now create a claim with the [`wbcreateclaim`](https://www.wikidata.org/w/api.php?action=help&modules=wbcreateclaim) action, setting the property [`P103`](https://www.wikidata.org/wiki/Property:P103) (native language) of the [sandbox item](https://www.wikidata.org/wiki/Q15397819) `Q15397819` to [`Q188`](https://www.wikidata.org/wiki/Q188) (German).

In [19]:
# create claim
response = session.post(API_URI, data={
    'action': 'wbcreateclaim',
    'entity': 'Q15397819',
    'format': 'json',
    'property': 'P103',
    'snaktype': 'value',
    'summary': 'add claim to sandbox',
    'token': token,
    'value': '{"entity-type":"item","id":"Q188"}',
})

print(response.json())

{'pageinfo': {'lastrevid': 740694614}, 'success': 1, 'claim': {'mainsnak': {'snaktype': 'value', 'property': 'P103', 'hash': 'eeedb26365d535268f5dc9d92a5fafddba00d858', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 188, 'id': 'Q188'}, 'type': 'wikibase-entityid'}, 'datatype': 'wikibase-item'}, 'type': 'statement', 'id': 'Q15397819$98266984-2CCB-40DF-B00F-D8EA56CE1E8A', 'rank': 'normal'}}


This action has multiple important parameters:

1. We need to define the `snaktype`, of which there are three: `novalue`, `value`, and `somevalue`:
  - `novalue` states that the property has no value for this entity.
  - `value` states that the claim has a specific, known value, which we pass in the `value` parameter.
  - `somevalue` states that a value exists but is unknown.
2. We need to provide our working token via the `token` parameter.
3. We should add a commit summary to express what we wanted to do via the `summary` parameter. This will be shown in the item's history.
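
In the API, the `value` parameter is only sent together with `snaktype` set to `value`. The following sketch builds the parameter dictionary accordingly for an item-valued property. The helper name `build_claim_parameters` is hypothetical and `'TOKEN'` is a placeholder for the working token retrieved earlier.

```python
import json


def build_claim_parameters(entity, prop, token, snaktype='value', item_id=None):
    """Build wbcreateclaim parameters for an item-valued property."""
    parameters = {
        'action': 'wbcreateclaim',
        'format': 'json',
        'entity': entity,
        'property': prop,
        'snaktype': snaktype,
        'token': token,
    }
    if snaktype == 'value':
        # only a 'value' snak carries a value, serialized as JSON
        parameters['value'] = json.dumps({'entity-type': 'item', 'id': item_id})
    return parameters


print(build_claim_parameters('Q15397819', 'P103', 'TOKEN', item_id='Q188'))
print(build_claim_parameters('Q15397819', 'P103', 'TOKEN', snaktype='novalue'))
```

The resulting dictionary can be passed as `data` to `session.post`, as in the cell above.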

## Using pywikibot
As seen in the last example, using the HTTP/S API can be complicated and error-prone.
We also ignored some API usage etiquette, like [respecting maxlag](https://www.mediawiki.org/wiki/Manual:Maxlag_parameter) and so on.
[Pywikibot](https://www.mediawiki.org/wiki/Manual:Pywikibot/Wikidata) is a Python library that offers a quality-of-life improvement here.
The usage is similar to API calls, especially when it comes to what we retrieve.
You need to configure your user in the `user-config.py` file (see https://www.mediawiki.org/wiki/Manual:Pywikibot/Wikidata).
When using [PAWS](https://paws.wmflabs.org), you can create this file as a text file directly in the root folder.
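
For scripts that stay with the plain API, the maxlag etiquette mentioned above can be sketched as follows. The request carries a `maxlag` parameter, and when the response reports the error code `maxlag`, we wait for the time suggested in the `Retry-After` header and try again. The function name, the retry count, and the 5-second defaults are our assumptions, and `session` is expected to behave like a `requests.Session`.

```python
import time


def get_with_maxlag(session, url, params, maxlag=5, retries=3, wait=time.sleep):
    """Send a GET request with maxlag set and retry while the servers are lagged."""
    params = dict(params, maxlag=maxlag)
    for _ in range(retries):
        response = session.get(url, params=params)
        result = response.json()
        if result.get('error', {}).get('code') != 'maxlag':
            return result
        # the server suggests a waiting time in the Retry-After header
        wait(int(response.headers.get('Retry-After', 5)))
    raise RuntimeError('servers still lagged after ' + str(retries) + ' attempts')
```

The `wait` argument only exists so that the waiting can be replaced when trying the function out; by default it is `time.sleep`.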

In [20]:
import pywikibot

site = pywikibot.Site('wikidata', 'wikidata')
repo = site.data_repository()

We first need to import the module itself.
After that, we define which _site_ we want to use. Pywikibot supports all Wikimedia wikis, like Wikipedia or Wikimedia Commons, but we want to use Wikidata, of course.
The first parameter of `pywikibot.Site` is the language code used for ordinary wikis, and the second is the wiki family. Wikidata has no language editions, but you can set the first parameter to `test` in order to access the sandbox.

In [21]:
zurich = pywikibot.ItemPage(repo, 'Q72')
zurich_item = zurich.get()

print(zurich_item['labels']['en'])
print(zurich_item['descriptions']['en'])

zurich_claims = zurich_item['claims']
print(zurich_claims.keys())

Zürich
capital of the canton of Zürich, Switzerland
dict_keys(['P1151', 'P31', 'P1036', 'P17', 'P131', 'P190', 'P30', 'P94', 'P373', 'P402', 'P242', 'P18', 'P281', 'P47', 'P421', 'P473', 'P625', 'P6', 'P771', 'P37', 'P856', 'P910', 'P948', 'P227', 'P244', 'P214', 'P998', 'P982', 'P646', 'P902', 'P268', 'P269', 'P41', 'P1464', 'P1465', 'P1566', 'P1456', 'P1740', 'P150', 'P1791', 'P1792', 'P1376', 'P166', 'P2046', 'P194', 'P935', 'P1296', 'P1997', 'P1417', 'P2044', 'P1281', 'P2959', 'P3222', 'P3417', 'P1448', 'P1842', 'P3984', 'P206', 'P1325', 'P2347', 'P1225', 'P3241', 'P2184', 'P2581', 'P4672', 'P1937', 'P361', 'P3219', 'P1705', 'P5019', 'P463', 'P1313', 'P571', 'P1889', 'P1082', 'P213', 'P5573', 'P949', 'P1435'])


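To retrieve an item, we use `pywikibot.ItemPage` and need to call `get` afterwards.
The rest is nearly identical to the API; note that labels and descriptions come back as plain strings here, without the nested `value` field.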

In [22]:
sandbox = pywikibot.ItemPage(repo, 'Q15397819')
# change German label
sandbox.editLabels(labels={'de': 'Hallo Welt'}, summary=u'edit label for sandbox item')

# add claim that Sandbox 3's color is black with reference "retrieved" on March 20th 2014
property_color = 'P462'
color_black = pywikibot.ItemPage(repo, 'Q23445')

black_color_claim = pywikibot.Claim(repo, property_color)
black_color_claim.setTarget(color_black)

sandbox.addClaim(black_color_claim, summary=u'add black color claim')

# prepare retrieved reference
retrieved = pywikibot.Claim(repo, 'P813')
date = pywikibot.WbTime(year=2014, month=3, day=20)
retrieved.setTarget(date)

# add reference to claim
black_color_claim.addSources([retrieved], summary='added source to claim')

Sleeping for 19.5 seconds, 2018-09-05 17:29:02
Sleeping for 19.4 seconds, 2018-09-05 17:29:22


You can see in the history of Q15397819 the changes we made: https://www.wikidata.org/w/index.php?title=Q15397819&action=history

For adding qualifiers and other use cases, have a look at the [Wikidata:Creating a bot](https://www.wikidata.org/wiki/Wikidata:Creating_a_bot) page.

## References
- [Wikidata - Data Access](https://www.wikidata.org/wiki/Wikidata:Data_access)
- [Wikidata API Help](https://www.wikidata.org/w/api.php?action=help)
- [Wikibase and Wikidata](https://www.mediawiki.org/wiki/Wikibase/API)
- [Wikidata Glossary](https://www.wikidata.org/wiki/Wikidata:Glossary)

- [Pywikibot Documentation](https://doc.wikimedia.org/pywikibot/master/)
- [Extended documentation for using pywikibot in Wikidata](https://www.wikidata.org/wiki/Wikidata:Creating_a_bot)
- [Wikidata Pywikibot Tutorial for python 3](https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial)
- [Python Requests](http://docs.python-requests.org/en/master/user/quickstart/#make-a-request)

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.