The easiest way to get an initial data set out of Wikidata is via the SPARQL endpoint.

In [28]:
%pip install SPARQLWrapper pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


To know the shape of the data we want to retrieve from Wikidata, we will look up a dynamic list of properties that are external identifiers, since these can match entities to other data sets.

There are thousands of external identifiers in Wikidata, so we will limit our search to those found on a handful of fairly well-known place items.

Viewable online at https://w.wiki/72sh

In [29]:
from SPARQLWrapper import SPARQLWrapper, JSON

# Set up the SPARQL endpoint
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Set the query
query = '''
SELECT DISTINCT ?property ?propertyLabel
WHERE {
  VALUES ?item {
    wd:Q243   # Eiffel Tower
    wd:Q41225 # Big Ben
    wd:Q9188  # Empire State Building
    wd:Q37200 # Great Pyramid of Giza
    wd:Q934   # North Pole
    wd:Q1899  # Kyiv
  }
  ?item ?p ?statement .
  ?property wikibase:claim ?p .
  ?property a wikibase:Property ;
            wikibase:propertyType wikibase:ExternalId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
'''

# Set the SPARQL query and request JSON response
sparql.setMethod("POST")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

# Execute the SPARQL query
results = sparql.query().convert()

# Extract the property data from the results
properties = []
for result in results['results']['bindings']:
    prop_id = result['property']['value'].split('/')[-1]
    prop_label = result['propertyLabel']['value']
    properties.append({'property_id': prop_id, 'property_label': prop_label})

# Convert the list of properties to a Pandas DataFrame
import pandas as pd
df_properties = pd.DataFrame(properties)

# Print the DataFrame
print(df_properties)

    property_id                       property_label
0         P1669  Cultural Objects Names Authority ID
1         P3108                              Yelp ID
2         P4272                    DPLA subject term
3         P4986                 Routard.com place ID
4         P9346           France24 topic ID (French)
..          ...                                  ...
164       P9000        World History Encyclopedia ID
165       P3836                   Pinterest username
166       P5421      Trading Card Database person ID
167       P7982            Hrvatska enciklopedija ID
168       P7302                      Digital Giza ID

[169 rows x 2 columns]
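
For reference, the SPARQL JSON results format nests each row under `results.bindings`, with every variable mapped to an object that carries a `value` key. Below is a minimal sketch of the extraction step, run against a hand-written sample response rather than the live endpoint. The two rows are copied from the output above; the `sample` dict and the `extract_properties` helper are illustrative names, not part of the notebook.

```python
# Hand-written sample in the SPARQL JSON results shape (assumption: trimmed
# to the keys the notebook reads; a real response carries more metadata).
sample = {
    "results": {
        "bindings": [
            {
                "property": {"type": "uri", "value": "http://www.wikidata.org/entity/P1669"},
                "propertyLabel": {"type": "literal", "value": "Cultural Objects Names Authority ID"},
            },
            {
                "property": {"type": "uri", "value": "http://www.wikidata.org/entity/P3108"},
                "propertyLabel": {"type": "literal", "value": "Yelp ID"},
            },
        ]
    }
}


def extract_properties(response):
    """Return one {'property_id', 'property_label'} dict per binding."""
    rows = []
    for binding in response["results"]["bindings"]:
        # The property ID is the last path segment of the entity URI.
        prop_id = binding["property"]["value"].rsplit("/", 1)[-1]
        rows.append({
            "property_id": prop_id,
            "property_label": binding["propertyLabel"]["value"],
        })
    return rows


rows = extract_properties(sample)
print(rows)
```

A list of dicts in this shape can be passed straight to `pd.DataFrame`, which is what the cell above does.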


The query below looks for all locations within 2.3 km of the Empire State Building.
Viewable online at https://w.wiki/72rQ

In [30]:
from SPARQLWrapper import SPARQLWrapper, JSON
import re

# Set up the SPARQL endpoint
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Set the base query template
base_query = '''SELECT ?place ?placeLabel ?location {extra_selects}
WHERE
{{
  wd:Q9188 wdt:P625 ?loc .
  SERVICE wikibase:around {{
      ?place wdt:P625 ?location .
      bd:serviceParam wikibase:center ?loc .
      bd:serviceParam wikibase:radius "2.3" .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
  {optional_clauses}
}}
'''

# Get the list of properties from the DataFrame
properties = df_properties['property_id'].tolist()

# Generate the SELECT clause and OPTIONAL clauses
select_clause = ' '.join(['?{}'.format(re.sub(r'\W+', '', p)) for p in properties])
optionals = '\n'.join(['OPTIONAL {{ ?place wdt:{} ?{} }}.'.format(p, re.sub(r'\W+', '', p)) for p in properties])

# Build the complete query by inserting the clauses into the base template
query = base_query.format(extra_selects=select_clause, optional_clauses=optionals)

# Set the SPARQL query and request JSON response
sparql.setMethod("POST")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

# Execute the SPARQL query
results = sparql.query().convert()

# Process the query results and convert to DataFrame
bindings = results['results']['bindings']
data = [{k: v['value'] for k, v in binding.items()} for binding in bindings]
df_results = pd.DataFrame(data)

# Print the DataFrame
print(df_results)

                                          place  \
0        http://www.wikidata.org/entity/Q502218   
1        http://www.wikidata.org/entity/Q502218   
2        http://www.wikidata.org/entity/Q502218   
3        http://www.wikidata.org/entity/Q502218   
4        http://www.wikidata.org/entity/Q652452   
...                                         ...   
3151  http://www.wikidata.org/entity/Q107015084   
3152  http://www.wikidata.org/entity/Q107518923   
3153  http://www.wikidata.org/entity/Q107519732   
3154  http://www.wikidata.org/entity/Q107519846   
3155  http://www.wikidata.org/entity/Q107657642   

                               location                              P4272  \
0         Point(-73.98675 40.737694444)  National Academy of Design (U.S.)   
1         Point(-73.98675 40.737694444)  National Academy of Design (U.S.)   
2         Point(-73.98675 40.737694444)  National Academy of Design (U.S.)   
3         Point(-73.98675 40.737694444)  National Academy of Design (U.S.) 
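
The `location` column holds WKT point literals, which list longitude first and latitude second: `Point(-73.98675 40.737694444)` is longitude -73.98675, latitude 40.737694444, which places it in Manhattan. Below is a minimal sketch of a parser for these values, assuming the `Point(x y)` form shown in the output above. `parse_wkt_point` is a hypothetical helper, not part of any library used here.

```python
import re

# Matches "Point(<longitude> <latitude>)" as printed in the query results.
_POINT_RE = re.compile(
    r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$",
    re.IGNORECASE,
)


def parse_wkt_point(wkt):
    """Return (latitude, longitude) as floats from a WKT point literal."""
    match = _POINT_RE.match(wkt.strip())
    if match is None:
        raise ValueError("not a WKT point: {!r}".format(wkt))
    # WKT order is x then y, i.e. longitude then latitude.
    longitude, latitude = (float(group) for group in match.groups())
    return latitude, longitude


lat, lon = parse_wkt_point("Point(-73.98675 40.737694444)")
print(lat, lon)
```

Returning the pair as `(latitude, longitude)` matches the keyword order most coordinate constructors expect, so the swap happens in one place only.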

Now let's slam this data into a Wikibase.
For that we will use wikidataintegrator.

In [31]:
%pip install wikidataintegrator

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [41]:
import json
import re
from wikidataintegrator import wdi_core, wdi_login
from SPARQLWrapper import SPARQLWrapper, JSON

# Set up the SPARQL endpoint
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Auth with Wikibase
with open('./.secret.json') as f:
    data = json.load(f)
    wb_username = data['demo_user']
    wb_password = data['demo_password']
wb_endpoint = 'https://overture-demo.wikibase.cloud/w/api.php'
wb_login = wdi_login.WDLogin(user=wb_username, pwd=wb_password, mediawiki_api_url=wb_endpoint)

# Load the base property map
with open('./property_map.json') as f:
    property_map = json.load(f)
wikidata_map_property = property_map['wikidata']
location_property = property_map['location']

# Iterate through the columns and create any properties that are missing
for col in df_results.columns:
    if col in ['place', 'placeLabel', 'location']:
        continue
    # Check using SPARQL if the property already exists with the matching statement
    query = '''
    SELECT ?property WHERE {{
        ?property wdt:{} wd:{} .
    }}
    '''.format(wikidata_map_property, col)
    sparql.setMethod("POST")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    if len(results['results']['bindings']) > 0:
        print('Property already exists: {}'.format(results['results']['bindings'][0]['property']['value']))
        continue

    # Get the label from df_properties
    label = df_properties[df_properties['property_id'] == col]['property_label'].values[0]

    # Create the property
    property = wdi_core.WDItemEngine(mediawiki_api_url=wb_endpoint)
    property.set_label(label)
    property.set_aliases(["Wikidata:" + col])
    property.statements.append(wdi_core.WDString(col, prop_nr=wikidata_map_property))
    # Write and catch errors, as wikibase.cloud can be a bit slow in allowing us to lookup existing things sometimes
    try:
        property.write(login=wb_login,entity_type='property', property_datatype='external-id')
        property_map[col] = property.wd_item_id
    except Exception as e:
        if 'label-conflict' in str(e):
            # Add it to the map
            property_map[col] = re.search(r'\[\[Property:([^|]+)\|', str(e)).group(1)
            print("Property already exists: " + label + " is " + property_map[col])
        else:
            print("Error creating property: " + str(e))
            raise

# Iterate over the DataFrame rows and create items in Wikibase
for index, row in df_results.iterrows():
    # Extract the wikidata id from the place that looks like "http://www.wikidata.org/entity/Q33341"
    wikidata_id = row['place'].split('/')[-1]

    # And extract the location, a WKT literal in longitude-latitude order, like "Point(-73.985656 40.748433)"
    long, lat = row['location'].split('(')[1].split(')')[0].split(' ')
    # TODO actually handle precision..?

    # Use the query service to see if the item already exists
    query = '''
    SELECT ?item WHERE {{
        ?item wdt:{} wd:{} .
    }}
    '''.format(wikidata_map_property, wikidata_id)
    sparql.setMethod("POST")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    if len(results['results']['bindings']) > 0:
        print('Local Item already exists: {}'.format(results['results']['bindings'][0]['item']['value']))
        continue

    # Create the item
    item = wdi_core.WDItemEngine(mediawiki_api_url=wb_endpoint)
    item.set_label(row['placeLabel'])
    item.statements.append(wdi_core.WDGlobeCoordinate(latitude=lat, longitude=long, precision=0.0001, prop_nr=location_property))
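    # WDString takes the value first, then the property number
    item.statements.append(wdi_core.WDString(wikidata_id, prop_nr=wikidata_map_property))
    # Add a statement for every value in the row
    for k, v in row.items():
        if k in ['place', 'placeLabel', 'location']:
            continue
        # Skip missing values: pandas stores them as NaN, which is truthy
        if pd.notna(v):
            item.statements.append(wdi_core.WDString(v, prop_nr=property_map[k]))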

    

    


Error while writing to Wikidata
Property already exists: DPLA subject term is P41
Error while writing to Wikidata
Property already exists: Bing entity ID is P42
Error while writing to Wikidata
Property already exists: Quora topic ID is P43
Error while writing to Wikidata
Property already exists: WorldCat Identities ID (superseded) is P30
Error while writing to Wikidata
Property already exists: Google Arts & Culture partner ID is P49
Error while writing to Wikidata
Property already exists: Twitter username is P50
Error while writing to Wikidata
Property already exists: Google Maps Customer ID is P51
Error while writing to Wikidata
Property already exists: Facebook ID is P40
Error while writing to Wikidata
Property already exists: Instagram username is P53
Error while writing to Wikidata
Property already exists: National Library of Israel J9U ID is P54
Error while writing to Wikidata
Property already exists: TripAdvisor ID is P55
Error while writing to Wikidata
Property already exists: O

AttributeError: 'float' object has no attribute 'startswith'
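
The `AttributeError` above is consistent with a missing value reaching a constructor that expects a string. pandas fills cells that are absent from a row with `NaN`, which is a `float` and is truthy, so a plain `if v:` check does not skip it. This is a likely cause rather than a confirmed one, since the traceback is cut off. Below is a minimal sketch of the behaviour using only the standard library; `is_missing` is an illustrative helper.

```python
import math

nan = float("nan")  # the value pandas uses for a missing cell

# NaN is truthy, so "if v:" lets it through.
print(bool(nan))

# Calling a string method on it fails in the same way as the traceback above.
try:
    nan.startswith("P")
except AttributeError as exc:
    print(exc)


def is_missing(value):
    """True for None and NaN, False for everything else."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Sample row: one identifier present, one missing.
row = {"P4272": "National Academy of Design (U.S.)", "P3108": nan}
present = {k: v for k, v in row.items() if not is_missing(v)}
print(present)
```

Inside the notebook, where pandas is already imported, `pd.notna(v)` performs the same check.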