This notebook demonstrates how we used Wikidata to derive direct mappings between SSH-LCSH vocabulary concepts and Getty AAT concepts.

In [72]:
"""
=============================================================================
Module    : get-SSH-LCSH-AAT-mappings.ipynb
Classes   : 
Project   : ATRIUM
Creator   : Ceri Binding, University of South Wales / Prifysgol de Cymru
Contact   : ceri.binding@southwales.ac.uk
Summary   : Python notebook demonstrating use of Wikidata to derive vocab mappings
Imports   : os, requests, pyoxigraph, pathlib, rdflib, SPARQLWrapper, datetime
License   : https://github.com/cbinding/ATRIUM-data/blob/main/LICENSE.txt
=============================================================================
History
18/04/2024 CFB Initially created script
=============================================================================
"""
# import required libraries
import os
import requests
from pyoxigraph import *
from pathlib import Path
from rdflib import Graph, ConjunctiveGraph
from SPARQLWrapper import SPARQLWrapper, RDFXML
from datetime import datetime as DT, timezone

# function to run a SPARQL CONSTRUCT query against the specified endpoint
# the response is requested as RDF/XML, parsed to an rdflib Graph, then serialized as returnFormat
# returnFormat supports "xml", "n3", "turtle", "nt", "pretty-xml", "trix", "trig", "nquads", "json-ld" and "hext"
def querySparqlEndpoint(endpointURI: str="", sparqlQuery: str="", returnFormat: str="nt"):
  sparql = SPARQLWrapper(endpointURI)
  sparql.setQuery(sparqlQuery)
  sparql.setMethod("POST")
  sparql.setReturnFormat(RDFXML)
  results = sparql.queryAndConvert()
  return results.serialize(format=returnFormat)

# current time in UTC (the trailing "Z" denotes UTC, so local time must not be used)
def timestamp():
  return DT.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

The TRIPLE Vocabulary (SSH-LCSH) is a multilingual social sciences and humanities (SSH) vocabulary based on LCSH, available at https://www.semantics.gr/authorities/vocabularies/SSH-LCSH?language=en
The site reports 3375 semantic resources (concepts). Download the SSH-LCSH vocabulary in N-Triples format and save it to a local file.

In [73]:
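url = "https://www.semantics.gr/authorities/vocabularies/SSH-LCSH/n-triples"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here rather than saving an error page as data
Path("./data").mkdir(parents=True, exist_ok=True)  # later cells also write to ./data
with open("./data/SSH-LCSH.nt", "w", encoding="utf-8") as file:
    file.write(response.text)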

print(f"[last run {timestamp()}]")

[last run 2024-04-18T11:45:24Z]


Query the Getty vocabularies SPARQL endpoint (https://vocab.getty.edu/sparql) to retrieve Art & Architecture Thesaurus (AAT) concept URIs with their English preferred labels. These labels are used to review the mappings produced later. (Note that the result count reported below is a line count of the N-Triples output, with at least 2 triples per concept: a label and a type statement.)

In [74]:
# Getty AAT SPARQL query to get prefLabels (English) for each Concept (saved as NTriples output)
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
# aat: is predefined at the Getty endpoint; declared here so the query is self-contained
PREFIX aat: <http://vocab.getty.edu/aat/>
CONSTRUCT { 
  ?uri skos:prefLabel ?lbl . 
  ?uri a gvp:Concept .
}
#SELECT ?uri (STR(?lbl) AS ?label) 
WHERE {
  ?uri a gvp:Concept; skos:inScheme aat:; 
    gvp:prefLabelGVP [gvp:term ?lbl] .
  FILTER(langMatches(lang(?lbl), "en"))
}
"""
results = querySparqlEndpoint(endpointURI="https://vocab.getty.edu/sparql", sparqlQuery=query)
with open("./data/AAT-prefLabels.nt", "w") as file: 
  file.write(results)
  
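# computed outside the f-string: nested double quotes and backslashes need Python 3.12+
line_count = len(results.split("\n"))  # counts lines (including any blank ones), not triples
print(f"{line_count} results [last run {timestamp()}]")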

116343 results [last run 2024-04-18T11:45:37Z]
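
The figure printed above is a count of output lines rather than of triples, so it can be slightly higher than the number of statements saved. As a minimal sketch (the helper name `count_triples` is hypothetical and not part of the notebook), the statements in N-Triples text can be counted by skipping blank lines and comment lines:

```python
def count_triples(ntriples_text: str) -> int:
    """Count statements in N-Triples text, skipping blank lines and comments."""
    count = 0
    for line in ntriples_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Toy input: two statements surrounded by blank lines
sample = (
    "\n"
    '<http://vocab.getty.edu/aat/300028051> <http://www.w3.org/2004/02/skos/core#prefLabel> "books"@en .\n'
    "<http://vocab.getty.edu/aat/300028051> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://vocab.getty.edu/ontology#Concept> .\n"
    "\n"
)

print(len(sample.split("\n")), count_triples(sample))  # prints: 5 2
```

This relies on N-Triples holding one statement per line, so it would not be suitable for Turtle or other multi-line serializations.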


Query the Wikidata SPARQL endpoint (https://query.wikidata.org/) to retrieve <LCSH URI> skos:closeMatch <AAT URI> mappings and save them to an N-Triples format file (7937 triples returned, 17/04/2024). The mappings are generated from Wikidata records that have both LCSH and AAT identifiers.
Note: Wikidata LCSH URIs retrieved using the wdtn:P244 property are prefixed 'http://id.loc.gov/authorities/names/', which appears to be a mistake when referring to subject terms (e.g. http://www.wikidata.org/entity/Q167555 references http://id.loc.gov/authorities/names/sh85095140, but this should be http://id.loc.gov/authorities/subjects/sh85095140). To match the LCSH URIs as (correctly) referenced in the SSH-LCSH vocabulary, we concatenate the correct prefix with the LCSH concept identifier (wdt:P244) instead of using the URI retrieved from Wikidata.

In [75]:
# Wikidata SPARQL query to generate LCSH-closeMatch-AAT mappings (saved as NTriples output)
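query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# wdt: and wdtn: are predefined at the Wikidata endpoint; declared here so the query is self-contained
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wdtn: <http://www.wikidata.org/prop/direct-normalized/>
CONSTRUCT { ?lcsh_uri skos:closeMatch ?aat_uri }
WHERE { 
  ?wd_uri wdt:P244 ?lcsh_id; wdtn:P1014 ?aat_uri .
  BIND(IRI(CONCAT('http://id.loc.gov/authorities/subjects/', ?lcsh_id)) AS ?lcsh_uri) .
}
"""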
results = querySparqlEndpoint(endpointURI="https://query.wikidata.org/sparql", sparqlQuery=query)
with open("./data/LCSH-closeMatch-AAT.nt", "w") as file: 
  file.write(results)
  
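# computed outside the f-string: nested double quotes and backslashes need Python 3.12+
line_count = len(results.split("\n"))  # counts lines (including any blank ones), not triples
print(f"{line_count} results [last run {timestamp()}]")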

7939 results [last run 2024-04-18T11:45:38Z]
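
The prefix correction described above is done inside the SPARQL query with CONCAT. For anyone post-processing URIs already retrieved from Wikidata, the same idea might look like the sketch below in plain Python. The function name `lcsh_subject_uri` is hypothetical, and the sketch assumes the identifier is always the last path segment of the URI:

```python
LCSH_SUBJECTS = "http://id.loc.gov/authorities/subjects/"


def lcsh_subject_uri(lcsh_ref: str) -> str:
    """Build a subjects URI from an LCSH identifier, or from any URI ending in one."""
    # keep only the final path segment, e.g. "sh85095140"
    identifier = lcsh_ref.rstrip("/").rsplit("/", 1)[-1]
    return LCSH_SUBJECTS + identifier


# Example taken from the note above (Wikidata entity Q167555)
print(lcsh_subject_uri("http://id.loc.gov/authorities/names/sh85095140"))
# prints: http://id.loc.gov/authorities/subjects/sh85095140
```

A bare identifier such as `"sh85095140"` gives the same result, because it contains no path separator to split on.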


Combine the SSH-LCSH vocabulary, the AAT data and the LCSH-closeMatch-AAT mappings into a single rdflib graph.

In [76]:

graph = Graph(store="Oxigraph") # speed advantage over plain rdflib.Graph
#graph = ConjunctiveGraph(store="Oxigraph")
graph.parse("./data/SSH-LCSH.nt")
graph.parse("./data/AAT-prefLabels.nt")
graph.parse("./data/LCSH-closeMatch-AAT.nt")


<Graph identifier=Nccb6d01817cd479bbe0b8437ae1cf9cd (<class 'rdflib.graph.Graph'>)>

Use the LCSH-closeMatch-AAT mappings to produce SSH-closeMatch-AAT mappings. Save results to an NTriples RDF file.

In [77]:
query = """
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
  PREFIX gvp: <http://vocab.getty.edu/ontology#>
  CONSTRUCT { ?ssh_uri skos:closeMatch ?aat_uri }
  WHERE {
    ?ssh_uri a skos:Concept; skos:exactMatch ?lcsh_uri .
    ?lcsh_uri skos:closeMatch ?aat_uri .
    ?aat_uri a gvp:Concept .
}
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-closeMatch-AAT.nt", format="nt")

# add the resultant mappings to the existing graph
graph.parse("./data/SSH-LCSH-closeMatch-AAT.nt")

print(f"{len(results)} results [last run {timestamp()}]")


392 results [last run 2024-04-18T11:45:57Z]
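
The CONSTRUCT query above is in effect a join: an SSH-LCSH concept inherits the AAT mapping of the LCSH heading it exactly matches, provided the target is a known AAT concept. The sketch below restates that logic in plain Python on toy data. The function name and the short `ssh:`, `lcsh:` and `aat:` identifiers are invented for illustration; the notebook itself performs this step in SPARQL.

```python
def derive_close_matches(ssh_exact_lcsh, lcsh_close_aat, aat_concepts):
    """Join (ssh, lcsh) pairs with (lcsh, aat) pairs, keeping only known AAT concepts."""
    # index the LCSH -> AAT mappings by LCSH identifier
    aat_by_lcsh = {}
    for lcsh, aat in lcsh_close_aat:
        aat_by_lcsh.setdefault(lcsh, set()).add(aat)
    return {
        (ssh, aat)
        for ssh, lcsh in ssh_exact_lcsh
        for aat in aat_by_lcsh.get(lcsh, ())
        if aat in aat_concepts
    }


# Toy data (hypothetical identifiers)
ssh_exact_lcsh = {("ssh:1", "lcsh:1"), ("ssh:2", "lcsh:2")}
lcsh_close_aat = {("lcsh:1", "aat:10"), ("lcsh:2", "aat:99"), ("lcsh:3", "aat:30")}
aat_concepts = {"aat:10", "aat:30"}  # aat:99 is not a known concept

print(derive_close_matches(ssh_exact_lcsh, lcsh_close_aat, aat_concepts))
# prints: {('ssh:1', 'aat:10')}
```

Here `ssh:2` is dropped because its target is not in the set of AAT concepts, and `lcsh:3` contributes nothing because no SSH-LCSH concept matches it.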


In [78]:
def countUnmappedConcepts():
    # count SSH-LCSH concepts with no mapping to AAT
    query = """
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX gvp: <http://vocab.getty.edu/ontology#>
        SELECT (COUNT(DISTINCT ?ssh_uri) AS ?counter) WHERE {
        ?ssh_uri a skos:Concept .
        MINUS { ?ssh_uri ?property [a gvp:Concept] }
        }
    """
    results = graph.query(query)
    # the query yields a single row holding the count, so return its first value
    for result in results:
        return result[0]

print(f"{countUnmappedConcepts()} unmapped concepts [last run {timestamp()}]")
# 17/04/2024 - returns 2982 (88.4%), so 393 (11.6%) successfully mapped

2982 unmapped concepts [last run 2024-04-18T11:45:58Z]


In [79]:
# create skos:broadMatch relationships for concepts
# not yet mapped, using vocabulary broader relationship
query = """
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
  PREFIX gvp: <http://vocab.getty.edu/ontology#>
  CONSTRUCT { ?ssh_uri skos:broadMatch ?aat_uri }
  WHERE {
    ?ssh_uri a skos:Concept; skos:broader ?broader_uri .
    ?broader_uri a skos:Concept; skos:closeMatch ?aat_uri .
    ?aat_uri a gvp:Concept .
    MINUS { ?ssh_uri ?property [a gvp:Concept] }
  }
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-broadMatch-AAT.nt", format="nt")
# add the resultant mappings to the existing graph
graph.parse("./data/SSH-LCSH-broadMatch-AAT.nt")

print(f"{len(results)} results [last run {timestamp()}]")

3347 results [last run 2024-04-18T11:45:58Z]
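
The query above gives each still-unmapped concept a skos:broadMatch to whatever its broader concept closely matches. The sketch below restates that rule in plain Python on toy data, with a `steps` parameter covering the two-step variant run in a later cell. The function name, the dictionary layout and the identifiers are all hypothetical; the notebook performs this step in SPARQL.

```python
def derive_broad_matches(broader, close_match, already_mapped, steps=1):
    """Map unmapped concepts to the AAT targets of the concept `steps` levels above them.

    broader:        dict of concept -> set of broader concepts
    close_match:    dict of concept -> set of AAT concepts it closely matches
    already_mapped: set of concepts that already have any AAT mapping
    """
    result = set()
    for concept in broader:
        if concept in already_mapped:
            continue  # plays the role of the MINUS clause in the query
        frontier = {concept}
        for _ in range(steps):
            # move one level up the hierarchy
            frontier = {b for c in frontier for b in broader.get(c, ())}
        for ancestor in frontier:
            for aat in close_match.get(ancestor, ()):
                result.add((concept, aat))
    return result


# Toy hierarchy: grandchild -> child -> parent, where only parent has a closeMatch
broader = {"ssh:parent": set(), "ssh:child": {"ssh:parent"}, "ssh:grandchild": {"ssh:child"}}
close_match = {"ssh:parent": {"aat:10"}}
mapped = {"ssh:parent"}

one_step = derive_broad_matches(broader, close_match, mapped, steps=1)
mapped |= {concept for concept, _ in one_step}
two_step = derive_broad_matches(broader, close_match, mapped, steps=2)
print(one_step, two_step)
# prints: {('ssh:child', 'aat:10')} {('ssh:grandchild', 'aat:10')}
```

Updating `mapped` between the two passes corresponds to parsing the first set of results back into the graph before running the next query.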


In [80]:
print(f"{countUnmappedConcepts()} unmapped concepts [last run {timestamp()}]")
# 17/04/2024 - returns 665 (19.7%), so 2710 (80.3%) successfully mapped


665 unmapped concepts [last run 2024-04-18T11:45:59Z]


In [81]:
# create skos:broadMatch relationships for concepts
# not yet mapped, using TWO steps of broader relationship 
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    CONSTRUCT { ?ssh_uri skos:broadMatch ?aat_uri }
    WHERE {
    ?ssh_uri a skos:Concept; skos:broader/skos:broader ?broader_uri .
    ?broader_uri a skos:Concept; skos:closeMatch ?aat_uri .
    ?aat_uri a gvp:Concept .
    MINUS { ?ssh_uri ?property [a gvp:Concept] }
    }
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-broadbroadMatch-AAT.nt", format="nt")
# add the resultant mappings to the existing graph
graph.parse("./data/SSH-LCSH-broadbroadMatch-AAT.nt")

print(f"{len(results)} results [last run {timestamp()}]")

367 results [last run 2024-04-18T11:45:59Z]


In [82]:
print(f"{countUnmappedConcepts()} unmapped concepts [last run {timestamp()}]")
# 17/04/2024 - returns 398 (11.8%), so 2977 (88.2%) successfully mapped

398 unmapped concepts [last run 2024-04-18T11:45:59Z]
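
The percentages quoted in the code comments follow from the unmapped counts and the total number of concepts. The sketch below reproduces that arithmetic, assuming the total is the 3375 concepts reported by the vocabulary site; the helper name `coverage` is hypothetical.

```python
def coverage(total: int, unmapped: int) -> tuple[int, float, float]:
    """Return (mapped count, mapped %, unmapped %), with percentages to 1 decimal place."""
    mapped = total - unmapped
    return mapped, round(100 * mapped / total, 1), round(100 * unmapped / total, 1)


TOTAL_CONCEPTS = 3375  # assumption: the figure reported by the vocabulary site

# unmapped counts after the closeMatch, one-step and two-step broadMatch stages
for unmapped in (2982, 665, 398):
    mapped, mapped_pct, unmapped_pct = coverage(TOTAL_CONCEPTS, unmapped)
    print(f"{unmapped} unmapped ({unmapped_pct}%), {mapped} mapped ({mapped_pct}%)")
# prints:
# 2982 unmapped (88.4%), 393 mapped (11.6%)
# 665 unmapped (19.7%), 2710 mapped (80.3%)
# 398 unmapped (11.8%), 2977 mapped (88.2%)
```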


In [83]:
# list the mappings created (for review)
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    SELECT DISTINCT ?ssh_uri (str(?ssh_lbl) AS ?ssh_label) ?rel ?aat_uri (str(?aat_lbl) AS ?aat_label)    
    WHERE {
        ?ssh_uri skos:closeMatch|skos:broadMatch ?aat_uri . 
        ?ssh_uri a skos:Concept .
        ?aat_uri a gvp:Concept .         
        ?ssh_uri ?rel ?aat_uri .        
        OPTIONAL { 
            ?ssh_uri skos:prefLabel ?ssh_lbl . 
            FILTER(langMatches(lang(?ssh_lbl), "en")) 
        }
        OPTIONAL { 
            ?aat_uri skos:prefLabel ?aat_lbl . 
            FILTER(langMatches(lang(?aat_lbl), "en")) 
        }
    }
"""
mappings = graph.query(query)
mappings.serialize(destination="./data/SSH-LCSH-matched-AAT.csv", format="csv")

print(f"{len(mappings)} mappings [last run {timestamp()}]")

4138 mappings [last run 2024-04-18T11:46:00Z]


In [84]:
# display some of the mappings produced for review
# Note: SSH-LCSH records showing a blank label have no English preferred label
# in the vocabulary; the mapping is unaffected by this
import pandas as pd
from IPython.display import display, HTML
df = pd.DataFrame(mappings)
display(HTML(df.fillna("").to_html(index=False, header=False, max_rows=25)))

0,1,2,3,4
http://semantics.gr/authorities/SSH-LCSH/sh85028889,,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300055912,commedia dell'arte
http://semantics.gr/authorities/SSH-LCSH/sh85081863,Mass media,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300055812,mass media
http://semantics.gr/authorities/SSH-LCSH/sh85067272,Interior decoration,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300161596,interior decoration
http://semantics.gr/authorities/SSH-LCSH/sh85066150,Information science,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300054574,information science
http://semantics.gr/authorities/SSH-LCSH/sh85076723,Library science,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300054576,library science
http://semantics.gr/authorities/SSH-LCSH/sh85108411,Psychoanalysis,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300054450,psychoanalysis
http://semantics.gr/authorities/SSH-LCSH/sh85148092,Word (Linguistics),http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300250895,words
http://semantics.gr/authorities/SSH-LCSH/sh85036316,Decorative arts,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300054168,decorative arts
http://semantics.gr/authorities/SSH-LCSH/sh85119950,Semiotics,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300054254,semiotics
http://semantics.gr/authorities/SSH-LCSH/sh85015738,Books,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300028051,books
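
When reviewing the saved CSV it can help to see how many rows use each mapping relation. The sketch below tallies them with the standard library. It assumes the CSV has a header row named after the SELECT variables (`ssh_uri`, `ssh_label`, `rel`, `aat_uri`, `aat_label`), which should be checked against the actual file. The helper name `count_by_relation` is hypothetical, and the two sample rows are copied from the output above.

```python
import csv
import io
from collections import Counter


def count_by_relation(csv_text: str) -> Counter:
    """Tally rows by the local name of the mapping relation, e.g. 'closeMatch'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # keep only the part of the relation URI after '#'
    return Counter(row["rel"].rsplit("#", 1)[-1] for row in reader)


sample = """ssh_uri,ssh_label,rel,aat_uri,aat_label
http://semantics.gr/authorities/SSH-LCSH/sh85081863,Mass media,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300055812,mass media
http://semantics.gr/authorities/SSH-LCSH/sh85015738,Books,http://www.w3.org/2004/02/skos/core#closeMatch,http://vocab.getty.edu/aat/300028051,books
"""

print(count_by_relation(sample))  # prints: Counter({'closeMatch': 2})
```

For the real file, the text could be read with `Path("./data/SSH-LCSH-matched-AAT.csv").read_text(encoding="utf-8")` and passed to the same function.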


In [85]:
# list the remaining records not mapped (TODO: not yet tested in the notebook; output to a DataFrame)
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    SELECT DISTINCT ?ssh_uri (str(?lbl) AS ?label) WHERE {
    ?ssh_uri a skos:Concept .
    OPTIONAL { 
        ?ssh_uri skos:prefLabel ?lbl . 
        FILTER(langMatches(lang(?lbl), "en")) 
    }
    MINUS { ?ssh_uri ?property [a gvp:Concept] }
    }
"""
unmapped = graph.query(query)
unmapped.serialize(destination="./data/SSH-LCSH-nomatch-AAT.csv", format="csv")

print(f"{len(unmapped)} unmapped [last run {timestamp()}]")

398 unmapped [last run 2024-04-18T11:46:00Z]


In [86]:
# display the unmapped records here for review
import pandas as pd
from IPython.display import display, HTML
df = pd.DataFrame(unmapped)
display(HTML(df.fillna("").to_html(index=False, header=False, max_rows=25)))

0,1
http://semantics.gr/authorities/SSH-LCSH/sh85116920,Salvadoran literature
http://semantics.gr/authorities/SSH-LCSH/sh85072864,Komi-Permyak literature
http://semantics.gr/authorities/SSH-LCSH/sh85057146,"Greek essays, Modern"
http://semantics.gr/authorities/SSH-LCSH/sh85048416,Finnish literature
http://semantics.gr/authorities/SSH-LCSH/sh85054380,German literature
http://semantics.gr/authorities/SSH-LCSH/sh88005294,"Rwanda, Literatures"
http://semantics.gr/authorities/SSH-LCSH/sh85086828,Mongolian literature
http://semantics.gr/authorities/SSH-LCSH/sh85119451,Sects
http://semantics.gr/authorities/SSH-LCSH/sh85028528,Colombian literature
http://semantics.gr/authorities/SSH-LCSH/sh85073564,


In [87]:
# get the generated mappings as RDF NTriples
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    CONSTRUCT { ?ssh_uri ?rel ?aat_uri }
    WHERE {
    ?ssh_uri skos:closeMatch|skos:broadMatch ?aat_uri .
    ?ssh_uri a skos:Concept . 
    ?aat_uri a gvp:Concept .
    ?ssh_uri ?rel ?aat_uri . 
    } 
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-match-AAT.nt", format="nt")

print(f"{len(results)} results [last run {timestamp()}]")

4136 results [last run 2024-04-18T11:46:01Z]
