### Prerequisites Setup
All required dependencies are set up here. This supports a production pipeline approach where reproducibility and environment setup are clearly defined.

In [54]:
# First I import the necessary libraries
# standard
import os
import sys
import re
import time
import json
import pickle
import logging
import asyncio
import aiohttp
import unicodedata
import urllib
import urllib.parse
import urllib.request
from pathlib import Path

# third parties
import pandas as pd
import requests
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from rapidfuzz import process, fuzz


## Data Normalization and Metadata Preparation
In this section, I normalize event titles using string operations. In this first cell the raw JSON data, located in the folder `Ibsenstage_raw`, is flattened and turned in a Pandas dataframe. In the second cell I remove the `venuecountry` column, since the dataset is focused exclusively on performances in Norway. Later, to standardize the naming of the events, I build a canonical list of titles using the `worktitle` field, then construct regular expressions to match common title variants. Some well-known plays (i.e. `Et dukkehjem`) have additional hardcoded variant patterns (i.e. "Nora", "Casa di bambola", etc.).
The `eventname` field is normalized by matching against the canonical patterns in the third cell.

In [55]:
# 1) check whether file exists
json_path = '../Ibsenstage_raw/IbsenStage_scrape.json'
if not os.path.isfile(json_path):
    raise FileNotFoundError(f'File not found: {json_path}')
print('Using source file:', json_path)

# 2) define output folder path
json_out = Path('.') / 'IbsenStage_normalized.json'

# 3) flatten and turn into a DataFrame
with open(json_path, 'r', encoding='utf-8') as f:
    root = json.load(f)
records = root.get('hits', root)
ibsen_df = pd.json_normalize(records, sep='_')
ibsen_df.head(5)

Using source file: ../Ibsenstage_raw/IbsenStage_scrape.json


Unnamed: 0,eventname,eventid,first_date,workid,worktitle,venueid,venuename,venuecountry
0,Hedda Gabler,85542,1983-11-12,8547.0,Hedda Gabler,14985,Honningsvåg kino,Norway
1,Hedda Gabler,85543,1983-11-14,8547.0,Hedda Gabler,14981,Vadsø kino,Norway
2,Hedda Gabler,85544,1983-11-15,8547.0,Hedda Gabler,14983,Miljøbygget,Norway
3,Gengangere,85550,1983-10-14,8542.0,Ghosts,12401,Telemark Teater,Norway
4,Gengangere,85607,1984-01-30,8542.0,Ghosts,12730,Kristiansand Teater,Norway


In [56]:
# 4) remove venuecountry if it exists
if 'venuecountry' in ibsen_df.columns:
    ibsen_df = ibsen_df.drop(columns=['venuecountry'])

# 5) build canonical patterns that are used to normalize titles
unique_titles = (
    ibsen_df['worktitle']
    .dropna()
    .astype(str)
    .str.strip()
    .sort_values()
    .unique()
)
canonical = {}
for title in unique_titles:
    safe = re.escape(title).replace('\\\\ ', '[\\\\s_-]*')
    canonical[title] = ['^' + safe + '$']

extra_variants = {
    'Et dukkehjem'   : ['^a doll.*house$', '^ett[\\\\s_-]*dockhem$', '^casa[\\\\s_-]*di[\\\\s_-]*bambola$', '^nora$'],
    'Gjengangere'    : ['^ghosts$', '^spettri$'],
    'En folkefiende' : ['^an enemy.*people$'],
    'Vildanden'      : ['^the[\\\\s_-]*wild[\\\\s_-]*duck$'],
}
for canon, pats in extra_variants.items():
    canonical.setdefault(canon, []).extend(pats)

pattern_map = [
    (re.compile(pat, re.IGNORECASE), canon)
    for canon, pats in canonical.items()
    for pat in pats
]
def normalize_title(txt: str) -> str:
    if pd.isna(txt): return txt
    low = str(txt).strip().lower()
    for pat, canon in pattern_map:
        if pat.match(low): return canon
    return txt

print(canonical)

{"A Doll's House": ["^A\\ Doll's\\ House$"], 'An Enemy Of The People': ['^An\\ Enemy\\ Of\\ The\\ People$'], 'Brand': ['^Brand$'], 'Catiline': ['^Catiline$'], 'Emperor and Galilean': ['^Emperor\\ and\\ Galilean$'], 'Ghosts': ['^Ghosts$'], 'Hedda Gabler': ['^Hedda\\ Gabler$'], 'John Gabriel Borkman': ['^John\\ Gabriel\\ Borkman$'], 'Lady Inger': ['^Lady\\ Inger$'], 'Little Eyolf': ['^Little\\ Eyolf$'], "Love's Comedy": ["^Love's\\ Comedy$"], 'Mountain Bird': ['^Mountain\\ Bird$'], 'Norma': ['^Norma$'], 'Olaf Liljekrans': ['^Olaf\\ Liljekrans$'], 'Peer Gynt': ['^Peer\\ Gynt$'], 'Pillars Of Society': ['^Pillars\\ Of\\ Society$'], 'Poetry': ['^Poetry$'], 'Rosmersholm': ['^Rosmersholm$'], "St. John's Night": ["^St\\.\\ John's\\ Night$"], 'Svanhild': ['^Svanhild$'], 'Terje Vigen': ['^Terje\\ Vigen$'], 'The Burial Mound': ['^The\\ Burial\\ Mound$'], 'The Feast at Solhaug': ['^The\\ Feast\\ at\\ Solhaug$'], 'The Lady From The Sea': ['^The\\ Lady\\ From\\ The\\ Sea$'], 'The League Of Youth': ['

In [57]:
# 6) normalize eventname using the canonical patterns
ibsen_df['eventname'] = ibsen_df['eventname'].apply(normalize_title)

# 7) save copy and preview
ibsen_df_2 = ibsen_df.copy() 
print('Preview:')
ibsen_df_2.head(5)

Preview:


Unnamed: 0,eventname,eventid,first_date,workid,worktitle,venueid,venuename
0,Hedda Gabler,85542,1983-11-12,8547.0,Hedda Gabler,14985,Honningsvåg kino
1,Hedda Gabler,85543,1983-11-14,8547.0,Hedda Gabler,14981,Vadsø kino
2,Hedda Gabler,85544,1983-11-15,8547.0,Hedda Gabler,14983,Miljøbygget
3,Gengangere,85550,1983-10-14,8542.0,Ghosts,12401,Telemark Teater
4,Gengangere,85607,1984-01-30,8542.0,Ghosts,12730,Kristiansand Teater


### Preparing the Dataset: working with cities
With the following cells I give information about the cities connected to the venues, since knowing the city in which a venue is located will help me in the following steps to enrich data with a city's URI in case a venue's URI will not be available. In the next cells I populate around 60% of the new key `venuecity` with actual city names thanks to GeoPy and GeoNames. 
To increase the accuracy, if the name of the city or a variation of it is included in the key `venuenames`, it will be mapped in the new key `venuecity` (i.e. "Teater i Trondheim" will give "Trondheim"). 
In the first cell I configure a geocoding script with a Nominatim (which finds locations based on the name through OpenStreetMaps), and then cached data is stored in a pickle file (`venue_geocode_cache.pkl`) to avoid stressing the API with redundant calls. There I also try to resolve some incongruencies, by hardcoding some known theares to their respective cities, so that in the case of Oslo the key `venuecity` will connect back to variation of Kristiania or Nationaltheatret. 
In the second cell I load Norwegian city names from cache/GeoNames API (with hardcoded fallback) and define a function (`get_city_fast`) that extracts city names from venue strings using past overrides, the I implement fuzzy matching (with `process.extractOne`) and text parsing (matching also the genitive form of a cityname).

In the next cell I try multiple methods to figure out which Norwegian city each theater venue is located in, by first checking a cache of previous results, then using fast text matching, and finally querying external geocoding APIs as backup options. Lastly I loads the existing cache and processes any new venue names to find their cities.

In [58]:
# configuration of the script
DATA_FOLDER = "."
CACHE_PATH = "./venue_geocode_cache.pkl"
CACHE_FILE = "./norway_cities_cache.pkl"
INPUT_FILE = "./IbsenStage_normalized.json" 
OUTPUT_FILE = "./IbsenStage_with_city.json" 
GEONAMES_USER = "MBIB4140_ibsen_user"  # your GeoNames username
GEONAMES_COUNTRY = "NO"

# nominatim geocoder setup
geolocator = Nominatim(user_agent="ibsen_city_extractor", timeout=5)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=0.5, max_retries=1)

# checks for a cache file and loads it if it exists (FIXED)
if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "rb") as f:
        cache = pickle.load(f)
else:
    cache = {}

# list of possible location-related keys to check for city names
CITY_KEYS = [
    "city", "town", "village", "municipality", "hamlet",
    "locality", "county", "state_district", "state",
    "region", "district", "suburb"
]

# precompute overrides, maps theater names (or aliases) to known city names
OVERRIDES = {
    "nationaltheatret": "Oslo", "kristiania": "Oslo", "christiania": "Oslo",
    "det norske teatret": "Oslo", "black box teater": "Oslo", "oslo nye": "Oslo",
    "chateau neuf teaterscenen": "Oslo", 
    "trøndelag teater": "Trondheim", "rosendal teater": "Trondheim",
    "den nationale scene": "Bergen", "dns": "Bergen", "hordaland teater": "Bergen",
    "kilden": "Kristiansand", "agder teater": "Kristiansand",
    "hålogaland teater": "Tromsø", "rogaland teater": "Stavanger",
    "teater innsikt": "Stavanger", "teater i drammen": "Drammen",
    "telemark teater": "Skien", "teater Ibsen": "Skien", "skienhallen": "Skien",
    "teater i moss": "Moss", "teater i tromsø": "Tromsø", "Tromsøhallen": "Tromsø",
    "sandnessjøen": "Sandnes"
}

# standardize text and returns a lowercase, normalized string
def normalize(txt: str) -> str:
    txt = unicodedata.normalize('NFKD', txt)
    txt = "".join(c for c in txt if not unicodedata.combining(c))
    return re.sub(r'[^a-z0-9]', '', txt.lower())

# normalize each override key and create a mapping
NORMALIZED_OVERRIDES = {normalize(k): v for k, v in OVERRIDES.items()}
NORMALIZED_OVERRIDE_KEYS = list(NORMALIZED_OVERRIDES.keys())

print(f"Cache loaded: {len(cache)} entries")
print(f"Loaded {len(OVERRIDES)} theater overrides")
print(f"Normalized override keys: {len(NORMALIZED_OVERRIDE_KEYS)}")

Cache loaded: 857 entries
Loaded 25 theater overrides
Normalized override keys: 25


In [None]:

# load cached data about Norway cities
def load_norway_cities() -> set:
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    
    qs = urllib.parse.urlencode({
        "country": GEONAMES_COUNTRY,
        "featureClass": "P",
        "maxRows": 2000,
        "username": GEONAMES_USER
    })
    url = f"http://api.geonames.org/searchJSON?{qs}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
            cities = {item["name"] for item in data.get("geonames", [])}
            # Cache the cities list
            with open(CACHE_FILE, "wb") as f:
                pickle.dump(cities, f)
            return cities
    except Exception:
        # loads a set of Norwegian city names into `norway_cities`, either from a cached file or a hardcoded list, and saves the list to cache
        norway_cities = { # this list does not include "tettsted" or smaller towns
            "Oslo", "Bergen", "Trondheim", "Stavanger", "Kristiansand", "Drammen",
            "Tromsø", "Sandnes", "Fredrikstad", "Sarpsborg", "Skien", "Ålesund",
            "Sandefjord", "Haugesund", "Tønsberg", "Moss", "Bodø", "Arendal",
            "Hamar", "Larvik", "Halden", "Lillehammer", "Mo i Rana", "Molde",
            "Harstad", "Kongsberg", "Gjøvik", "Jessheim", "Porsgrunn", "Narvik",
            "Kristiansund", "Flekkefjord", "Grimstad", "Orkanger", "Mandal",
            "Steinkjer", "Elverum", "Alta", "Honefoss", "Kongsvinger",
            "Notodden", "Bryne", "Otta", "Namsos", "Fagernes", "Røros",
            "Hammerfest", "Kirkenes", "Vadsø", "Vardø", "Sortland", "Leknes",
            "Svolvær", "Stokmarknes", "Finnsnes", "Lenvik", "Bardufoss",
            "Egersund", "Levanger", "Verdal", "Florø", "Kolvereid", "Voss",
            "Stjørdal", "Kopervik", "Mysen", "Fauske", "Lillesand", "Stjørdalshalsen"
        }
        with open(CACHE_FILE, "wb") as f:
            pickle.dump(norway_cities, f)
        return norway_cities

norway_cities = load_norway_cities()

#attempt to extract a Norwegian city name from a venue name string
def get_city_fast(venue_name: str) -> str | None:
    if not venue_name or pd.isna(venue_name):
        return None
    
    name = venue_name.strip()
    key = name.lower()
    norm = normalize(key)
    
    # look for direct matches in overrides
    for norm_key, city in NORMALIZED_OVERRIDES.items():
        if norm_key in norm:
            return city
    
    
    # I try fuzzy matching for close-enough names
    if len(norm) > 3:
        match, score, *_ = process.extractOne(norm, NORMALIZED_OVERRIDE_KEYS, scorer=fuzz.partial_ratio)
        if score > 85: # I chose this threshold to avoid false positives, which is also not too high to miss some matches
            return NORMALIZED_OVERRIDES[match]
    
    # extract city names from venue strings by breaking them into words + check each word against the known city list
    for token in re.split(r'[,/()\-\s]+', key):
        token_cap = token.capitalize()
        if token_cap in norway_cities:
            return token_cap
        
        # check for genitive form of the city name (i.e, 'Skiens teater")
        genitive_match = re.match(r'^([A-Za-zÀ-ÿ]+)s$', token_cap)
        if genitive_match:
            base_city = genitive_match.group(1)
            if base_city in norway_cities:
                return base_city
    
    return None

print(f"Loaded {len(norway_cities)} Norwegian cities")
print(f"Sample cities: {list(norway_cities)[:10]}")

Loaded 69 Norwegian cities
Sample cities: ['Sarpsborg', 'Arendal', 'Orkanger', 'Oslo', 'Bergen', 'Vadsø', 'Lillehammer', 'Alta', 'Sandefjord', 'Jessheim']


In [60]:
# geocodes a batch of venue names to Norwegian cities using the GeoNames API
def batch_geocode_geonames(venue_names, geonames_user) -> dict[str, str | None]:
    results = {}
    
    for venue in venue_names:
        if len(venue) <= 2:
            results[venue] = None
            continue
            
        try:
            query = urllib.parse.urlencode({
                "q": venue,
                "country": "NO",
                "maxRows": 1,
                "username": geonames_user
            })
            
            url = f"http://api.geonames.org/searchJSON?{query}"
            with urllib.request.urlopen(url, timeout=3) as resp:
                data = json.load(resp)
                if data.get("geonames"):
                    results[venue] = data["geonames"][0]["name"]
                else:
                    results[venue] = None
                    
            time.sleep(0.1)  # Rate limiting
        except Exception:
            results[venue] = None
    
    return results

# enhanced geocoding using Nominatim with detailed address parsing
def batch_geocode_nominatim(venue_names):
    results = {venue: None for venue in venue_names}
    
    for venue in venue_names:
        if len(venue) <= 2 or not any(char.isalpha() for char in venue):
            continue
            
        try:
            loc = geocode(f"{venue}, Norway", addressdetails=True, country_codes="no")
            if loc and (addr := loc.raw.get("address")):
                # check multiple address fields for city information
                for k in CITY_KEYS:
                    if k in addr:
                        results[venue] = addr[k]
                        break
                        
            time.sleep(0.5)  # Rate limiting
        except Exception:
            pass
    
    return results

# ensures extraction operations are run only once per venue name, caching results to avoid redundant API calls
def get_city_for(venue_name: str) -> str | None:
    if not venue_name or pd.isna(venue_name):
        return None
    
    name = venue_name.strip()
    
    # check cache first
    if name in cache:
        return cache[name]
    
    key = name.lower()
    norm = normalize(key)

    # Try fast matching first
    city = get_city_fast(name)
    if city:
        cache[name] = city
        return city
    return None

In the following cell I prioritize the use of matching city names over Geocode API calls, which despite being more accurate would require many minutes to run

In [None]:
# checks for a cache file and loads it if it exists
if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "rb") as f:
        cache = pickle.load(f)
    # clean existing None values
    cache = {k: v for k, v in cache.items() if v is not None}
else:
    cache = {}

# Get unique venue names and filter out already cached ones
unique_names = sorted({rec.get("venuename") for rec in records if rec.get("venuename")})
uncached_names = [name for name in unique_names if name not in cache]

# process venues individually 
if uncached_names:
    start_time = time.time()    
    for idx, venue in enumerate(uncached_names, 1):
        city = get_city_for(venue)
        if idx % 50 == 0:
            # save cache periodically
            with open(CACHE_PATH, "wb") as cf:
                pickle.dump(cache, cf)
                print(f"  Processed {idx}/{len(uncached_names)} venues (previously uncached)...")

# annotating records
for rec in records:
    vn = rec.get("venuename") or "not found"
    rec["venuecity"] = get_city_for(vn) or "not found"
    # Stats
filled = sum(1 for r in records if r.get("venuecity") and r["venuecity"].strip())
total_records = len(records)
print(f"venuecity filled for {filled} of {total_records} records ({(filled / total_records) * 100:.2f}%)")

# save results
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

with open(CACHE_PATH, "wb") as cf:
    pickle.dump(cache, cf)
    
print(f"Total unique venues: {len(unique_names)}")
print(f" Example: {list(unique_names)[0]} → {cache.get(list(unique_names)[0], 'Not found')}")
print(cache.get(list(unique_names)[15], 'Not found'))
print(f"None values: {list(cache.values()).count(None)}")

  Processed 50/448 venues...
  Processed 100/448 venues...
  Processed 150/448 venues...
  Processed 200/448 venues...
  Processed 250/448 venues...
  Processed 300/448 venues...
  Processed 350/448 venues...
  Processed 400/448 venues...
venuecity filled for 4924 of 4924 records (100.00%)
Total unique venues: 1302
 Example: 6957 Hyllestad (Hyllestad) → Hyllestad
Alvdal
None values: 0


The script above pre-processes only those unique venue names for geocoding, and later all 4900+ records are annotated using the chached lookups.

### Mapping Works and Venues to authoritative IDs

The next step is to map every key `workid` and `venueid` with an authoritative URI, to resolve external Wikidata URIs for works/venue. To achieve this result, the optimal way is to keep the original keys for `workid` and `venueid` from IbsenStage and add the keys `workURI` and `venueURI` to the json. First I will map the works and then the venues.

In the first cell, a SPARQL query is sent to the Wikidata endpoint to retrieve all known works (`wdt:P800`) attributed to Henrik Ibsen (`wd:Q36661`). Labels are filtered to include multiple languages (en, no, nb, nn) and normalized to lowercase for matching, this continues in the second cell where I use the Wikidata results to create a dictionary mapping Ibsen play titles (lowercase) to their Wikidata QIDs. In the third cell then I attempt to match each worktitle from the dataset to a Wikidata URI in two passes:

1. Exact match based on normalized title.

2. Partial string match (i.e. “Gjengangere” may match “Ghosts” or “The Ghosts”).

If a match is found, a new field workURI is added with the corresponding Wikidata QID 

In [None]:
# read the file containing theater data with city information
with open('IbsenStage_with_city.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# convert to pandas DataFrame for easier data manipulation
df = pd.DataFrame(data)

# query Wikidata for Henrik Ibsen's notable works, direct SPARQL query
def get_ibsen_works() -> dict:
    sparql_query = """
    SELECT ?work ?label WHERE {
      wd:Q36661 wdt:P800 ?work .
      ?work rdfs:label ?label .
      FILTER(LANG(?label) IN ("en", "no", "nb", "nn"))
    }
    """
    url = "https://query.wikidata.org/sparql"
    headers = {'User-Agent': 'IbsenStage-Pipeline/1.0'}  # Polite API usage
    params = {'query': sparql_query, 'format': 'json'}
    
    # make the API request to Wikidata
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:
        return response.json()  # return parsed JSON response
    else:
        raise Exception(f"SPARQL query failed with status code {response.status_code}")

print("Fetching Ibsen works from Wikidata...")
# execute the query and get results
wikidata_result = get_ibsen_works()

# build title → QID mapping from SPARQL result
wikidata_mapping = {}
for item in wikidata_result['results']['bindings']:
    uri = item['work']['value']
    qid = uri.split('/')[-1]  # Extract QID from URI
    label = item['label']['value'].lower().strip()
    wikidata_mapping[label] = qid

# hardcode "St. John's Eve" mapping since it is the *only* title not caught by the query
wikidata_mapping["st. john's night"] = "Q3285337"
wikidata_mapping["sankthansnatten"] = "Q3285337"
wikidata_mapping["sanchthansnatten"] = "Q3285337"

print(f"Found {len(wikidata_mapping)} labels from Wikidata (including hardcoded St. John's Eve)")

# preview first 5 entries
print("First 5 results from Wikidata:")
for item in wikidata_result['results']['bindings'][:5]:
    print(item)

Fetching Ibsen works from Wikidata...
Found 57 labels from Wikidata (including hardcoded St. John's Eve)
First 5 results from Wikidata:
{'work': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q176465'}, 'label': {'xml:lang': 'en', 'type': 'literal', 'value': 'Hedda Gabler'}}
{'work': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q176465'}, 'label': {'xml:lang': 'nb', 'type': 'literal', 'value': 'Hedda Gabler'}}
{'work': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q176465'}, 'label': {'xml:lang': 'nn', 'type': 'literal', 'value': 'Hedda Gabler'}}
{'work': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q208094'}, 'label': {'xml:lang': 'en', 'type': 'literal', 'value': 'Peer Gynt'}}
{'work': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q208094'}, 'label': {'xml:lang': 'nb', 'type': 'literal', 'value': 'Peer Gynt'}}


In [63]:
# build title → QID mapping from SPARQL result
wikidata_mapping = {}
#process each work returned from the Wikidata query
for item in wikidata_result['results']['bindings']:
    uri = item['work']['value']
    qid = uri.split('/')[-1]  # extract QID from URI
    label = item['label']['value'].lower().strip()
    wikidata_mapping[label] = qid

print(f"Found {len(wikidata_mapping)} labels from Wikidata")

# show first 5 items in the mapping
print("First 5 label → QID mappings:")
for label, qid in list(wikidata_mapping.items())[:5]:
    print(f"{label} → {qid}")

Found 54 labels from Wikidata
First 5 label → QID mappings:
hedda gabler → Q176465
peer gynt → Q208094
emperor and galilean → Q268276
kejser og galilæer → Q268276
john gabriel borkman → Q289117


In [64]:
# Maps a work title to its Wikidata QID using exact and fuzzy matching, return QID or None
def map_to_qid(title) -> str | None:
   # handle empty or null titles
   if pd.isna(title) or not title.strip():
       return None
   title_norm = title.lower().strip() 
   
   # check if I have an exact match in our mapping
   if title_norm in wikidata_mapping:
       return wikidata_mapping[title_norm]
   
   # look for substring matches
   for label, qid in wikidata_mapping.items():
       # check if title contains label or label contains title
       if title_norm in label or label in title_norm:
           return qid
   
   return None  # No match found

# apply mapping to dataset
print("Mapping 'worktitle' to Wikidata QIDs...")
df['workURI'] = df['worktitle'].apply(map_to_qid)

# save to file
output_path = 'IbsenStage_with_wikidata_works.json'
# convert DataFrame back to list of records and save as JSON
with open(output_path, 'w', encoding='utf-8') as f:
   json.dump(df.to_dict(orient='records'), f, ensure_ascii=False, indent=2)

print(f"Saved updated file to: {output_path}")

#count how many titles I successfully mapped to QIDs
mapped_count = df['workURI'].notna().sum()
total_count = len(df)
print(f"Successfully mapped {mapped_count} out of {total_count} work titles ({mapped_count/total_count*100:.1f}%)")

Mapping 'worktitle' to Wikidata QIDs...
Saved updated file to: IbsenStage_with_wikidata_works.json
Successfully mapped 4905 out of 4924 work titles (99.6%)


Now I can proceed by connecting `venueURI` to authoritative identifiers from Wikidata, where available.

As expected, retrieving and mapping Wikidata URIs for all venues is a challenging process. Many venues are too small or obscure to have their own Wikidata item, or they may be part of a larger building. Some may no longer exist, as the IbsenStage dataset includes historical venues dating back to the 19th century.
To address this, I use a two-step fallback strategy:

1. Attempt to resolve the venuename to a `venueURI` by querying the Wikidata Search API (wbsearchentities).

2. If no match is found, attempt to resolve the `venuecity` instead and store the resulting URI under the key `cityURI`.

This method ensures that even if a venue does not have its own Wikidata item, the city associated with it can serve as a proxy reference.
To improve performance, I use `aiohttp` for asynchronous HTTP requests and `asyncio.gather()` to run all tasks concurrently (with a semaphore to limit concurrency to `MAX_WORKERS`). Despite this optimization, the cell may still take several minutes to complete (around 7) due to the delay between requests and the need to wait for responses from the Wikidata servers.

In [65]:
INPUT = "IbsenStage_with_wikidata_works.json"
OUTPUT = "IbsenStage_with_uris.json"
USER_AGENT = "VenueCityWikidataLinker/1.0"
MAX_WORKERS = 5
DELAY = 0.1

# logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger()

# file path
def stage_path(filename):
    return Path.cwd() / filename

# loading data
with open(INPUT, encoding="utf-8") as f:
    data = json.load(f)

# printing samples
print(f"Processing {len(data)} records from '{INPUT}' → '{OUTPUT}'")
print(f"Using {MAX_WORKERS} workers with {DELAY}s delays")
print(f"User agent: {USER_AGENT}")

print("\nFirst few records:")
for i, record in enumerate(data[:3]):
   venue = record.get("venuename", "N/A")
   city = record.get("venuecity", "N/A")
   print(f"  {i+1}. {venue} ({city})")

Processing 4924 records from 'IbsenStage_with_wikidata_works.json' → 'IbsenStage_with_uris.json'
Using 5 workers with 0.1s delays
User agent: VenueCityWikidataLinker/1.0

First few records:
  1. Honningsvåg kino (not found)
  2. Vadsø kino (Vadsø)
  3. Miljøbygget (Trondheim)


In [66]:
# search Wikidata for an entity and return its QID if found
async def query_wikidata(session, search_term: str, semaphore) -> str | None:
   if not search_term:
       return None
   
   async with semaphore:  # rate limiting
       await asyncio.sleep(DELAY)
       # building the Wikidata search API request
       params = urllib.parse.urlencode({
           "action": "wbsearchentities",
           "format": "json",
           "language": "en",
           "search": search_term,
           "limit": 1,  # I just want the best match
           "type": "item"
       })
       url = f"https://www.wikidata.org/w/api.php?{params}"
       headers = {"User-Agent": USER_AGENT}
       
       try:
           async with session.get(url, headers=headers, timeout=10) as response:
               result = await response.json()
               # return the QID of the best match
               if result.get("search"):
                   return result["search"][0]["id"]
       except Exception as e:
           logger.debug(f"Wikidata lookup failed for '{search_term}': {e}")
       return None

# try to find Wikidata URIs for venue first, then city as fallback.
async def resolve_entry(session, entry: dict, semaphore) -> dict:
   venue = (entry.get("venuename") or "").strip()
   city = (entry.get("venuecity") or "").strip()
   
   # venue lookup
   venue_uri = await query_wikidata(session, venue, semaphore)
   if venue_uri:
       entry["venueURI"] = venue_uri
   else:
       # fallback to city if venue not found
       city_uri = await query_wikidata(session, city, semaphore)
       if city_uri:
           entry["cityURI"] = city_uri
   
   return entry

# process all entries concurrently with rate limiting
async def resolve_all(entries: list[dict]) -> list[dict]:
   semaphore = asyncio.Semaphore(MAX_WORKERS)  # Control concurrency
   async with aiohttp.ClientSession() as session:
       # Create tasks for all entries and run them concurrently
       tasks = [resolve_entry(session, entry, semaphore) for entry in entries]
       return await asyncio.gather(*tasks)

In [67]:
# resolution with await
print("\nStarting Wikidata URI resolution...")
start_time = time.time()
results = await resolve_all(data)
processing_time = time.time() - start_time
print(f"Completed in {processing_time:.2f} seconds")

# stats
venue_uris = sum(1 for r in results if r.get("venueURI"))
city_uris = sum(1 for r in results if r.get("cityURI"))
total_uris = venue_uris + city_uris

print(f"\nFound {total_uris} URIs from {len(results)} records ({(total_uris/len(results)*100):.1f}%)")
print(f"  • {venue_uris} venues, {city_uris} cities")

print("\nSample resolved entries:")
uri_samples = [r for r in results if r.get("venueURI") or r.get("cityURI")][:3]
for i, record in enumerate(uri_samples):
   venue = record.get("venuename", "N/A")
   city = record.get("venuecity", "N/A")
   venue_uri = record.get("venueURI", "")
   city_uri = record.get("cityURI", "")
   
   # Show which URI we found
   uri_info = f"venue:{venue_uri}" if venue_uri else f"city:{city_uri}"
   print(f"  {i+1}. {venue} ({city}) → {uri_info}")

# saving output
with open(stage_path(OUTPUT), "w", encoding="utf-8") as f:
   json.dump(results, f, ensure_ascii=False, indent=2)

print(f"\nResults saved to {OUTPUT}")


Starting Wikidata URI resolution...
Completed in 439.22 seconds

Found 4923 URIs from 4924 records (100.0%)
  • 2883 venues, 2040 cities

Sample resolved entries:
  1. Honningsvåg kino (not found) → city:Q404
  2. Vadsø kino (Vadsø) → city:Q104379
  3. Miljøbygget (Trondheim) → city:Q25804

Results saved to IbsenStage_with_uris.json


With the fallback strategy, around 58% of venues and respective URI are found and mapped, and other additional 41% city URI complete the mapping, covering almost all the cases.Now that both work and venue IDs and URIs are present, we can bridge them in a single file.

In [68]:
# load the data
print("Loading enriched data with URIs...")
with open('../Ibsenstage_staged/IbsenStage_with_uris.json', encoding='utf-8') as f:
   data = json.load(f)
df = pd.DataFrame(data)

# extract unique work ID → URI mappings
work_bridge = (
   df[['workid', 'workURI']]
   .dropna()  # Skip entries without URIs
   .drop_duplicates() 
   .set_index('workid')['workURI']
   .to_dict()
)

# extract unique venue ID → URI mappings
venue_bridge = (
   df[['venueid', 'venueURI']]
   .dropna()
   .drop_duplicates()
   .set_index('venueid')['venueURI']
   .to_dict()
)

print(f"Built bridges: {len(work_bridge)} works, {len(venue_bridge)} venues")

# combine into one object
bridge = {
   'work_bridge': work_bridge,
   'venue_bridge': venue_bridge
}

# save the mapping
output_path = "./id_to_uri_bridge.json"
with open(output_path, 'w', encoding='utf-8') as f:
   json.dump(bridge, f, ensure_ascii=False, indent=2)

print(f"Bridge saved to {output_path}")

Loading enriched data with URIs...
Built bridges: 28 works, 780 venues
Bridge saved to ./id_to_uri_bridge.json


The next step is converting the JSON `IbsenStage_with_uris` to RDF, this will continue in the file `code2_IbsenStage_curated`