This notebook demonstrates how we used Wikidata to derive direct mappings between SSH-LCSH vocabulary concepts and Getty AAT concepts.

In [12]:
# import required libraries
# (the "Oxigraph" rdflib store used later is provided by the oxrdflib package)
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, RDFXML

# function to run a SPARQL CONSTRUCT query against the specified endpoint
# returnFormat is the rdflib serialization used for the returned string; it supports
# "xml", "n3", "turtle", "nt", "pretty-xml", "trix", "trig", "nquads", "json-ld" and "hext"
def querySparqlEndpoint(endpointURI: str="", sparqlQuery: str="", returnFormat: str="nt"):
  sparql = SPARQLWrapper(endpointURI)
  sparql.setQuery(sparqlQuery)
  sparql.setMethod("POST")
  # request RDF/XML so that queryAndConvert() returns an rdflib Graph
  sparql.setReturnFormat(RDFXML)
  results = sparql.queryAndConvert()
  return results.serialize(format=returnFormat)

TRIPLE Vocabulary is an SSH multilingual vocabulary based on LCSH available at http://semantics.gr/authorities/vocabularies/SSH-LCSH<br>
The site reports 3375 semantic resources (concepts). Get the SSH-LCSH vocabulary data and save it as an NTriples format file.

In [44]:
# NOTE: the site was not working at the time of writing...
# the data as previously downloaded is cached here as ./data/SSH-LCSH.nt

Query the Getty vocabularies SPARQL endpoint (https://vocab.getty.edu/sparql) to retrieve Art & Architecture Thesaurus (AAT) concept URIs with their associated preferred labels (English). These labels are only used for convenient review of the mappings produced later. The query returned 116342 triples on 17/04/2024 (at 2 triples per record, that is 58171 records).

In [None]:
# Getty AAT SPARQL query to get prefLabels (English) for each Concept (saved as NTriples output)
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX aat: <http://vocab.getty.edu/aat/>
CONSTRUCT {
  ?uri skos:prefLabel ?lbl .
  ?uri a gvp:Concept .
}
#SELECT ?uri (STR(?lbl) AS ?label)
WHERE {
  ?uri a gvp:Concept; skos:inScheme aat:;
    gvp:prefLabelGVP [gvp:term ?lbl] .
  FILTER(langMatches(lang(?lbl), "en"))
}
"""
results = querySparqlEndpoint(endpointURI="https://vocab.getty.edu/sparql", sparqlQuery=query)
with open("./data/AAT-prefLabels.nt", "w") as file: 
  file.write(results)
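
The record count quoted above follows from the shape of the CONSTRUCT template, which emits two triples per concept (one skos:prefLabel and one rdf:type). A minimal sketch of that arithmetic, assuming every concept has exactly one English preferred label; the helper name `records_from_triples` is ours and is not part of the notebook:

```python
def records_from_triples(triple_count: int, triples_per_record: int = 2) -> int:
    """Derive the number of records from a triple count.

    Assumes every record contributes exactly `triples_per_record` triples,
    which holds here only if each concept has a single English prefLabel.
    """
    records, remainder = divmod(triple_count, triples_per_record)
    if remainder:
        raise ValueError("triple count is not a multiple of triples per record")
    return records


# triple count reported in the notebook for 17/04/2024
print(records_from_triples(116342))  # 58171
```

An odd triple count would raise an error here, which would be a hint that some concept carries more than one English label.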

Query the Wikidata SPARQL endpoint (https://query.wikidata.org/) to retrieve <LCSH URI> skos:closeMatch <AAT URI> mappings and save them to an NTriples format file (7937 triples returned, 17/04/2024). The mappings are generated from Wikidata records that have both LCSH and AAT identifiers. Note: Wikidata LCSH URIs are prefixed 'http://id.loc.gov/authorities/names/', but the SSH-LCSH vocabulary references LCSH URIs prefixed 'http://id.loc.gov/authorities/subjects/'. To make the two match, the query concatenates the 'subjects' prefix with the LCSH concept identifier.

In [20]:
# Wikidata SPARQL query to generate LCSH-closeMatch-AAT mappings (saved as NTriples output)
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wdtn: <http://www.wikidata.org/prop/direct-normalized/>
CONSTRUCT { ?lcsh_uri skos:closeMatch ?aat_uri }
WHERE { 
  ?wd_uri wdt:P244 ?lcsh_id; wdtn:P1014 ?aat_uri .
  BIND(IRI(CONCAT('http://id.loc.gov/authorities/subjects/', ?lcsh_id)) AS ?lcsh_uri) .
}
"""
results = querySparqlEndpoint(endpointURI="https://query.wikidata.org/sparql", sparqlQuery=query)
with open("./data/LCSH-closeMatch-AAT.nt", "w") as file: 
  file.write(results)
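
The prefix handling done by the BIND clause can be mirrored in plain Python, which may help when checking individual URIs by hand. This is only a sketch: the function names are ours, and "sh00000000" is a made-up identifier used purely for illustration.

```python
LCSH_SUBJECTS_PREFIX = "http://id.loc.gov/authorities/subjects/"


def lcsh_subject_uri(lcsh_id: str) -> str:
    """Build a 'subjects' LCSH URI from a bare identifier, as the BIND clause does."""
    return LCSH_SUBJECTS_PREFIX + lcsh_id


def names_to_subjects(uri: str) -> str:
    """Rewrite a '.../authorities/names/<id>' URI to its 'subjects' form."""
    lcsh_id = uri.rstrip("/").rsplit("/", 1)[-1]
    return lcsh_subject_uri(lcsh_id)


# "sh00000000" is a made-up identifier used only for illustration
print(lcsh_subject_uri("sh00000000"))
print(names_to_subjects("http://id.loc.gov/authorities/names/sh00000000"))
```

Both calls print the same 'subjects' URI. Identifiers that are not subject headings would presumably receive the same prefix, but the resulting URIs would simply fail to match anything in the SSH-LCSH data.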

Combine the SSH-LCSH vocabulary, the AAT data and the LCSH-closeMatch-AAT mappings into a single rdflib graph. Each SSH concept is linked to its LCSH concept by skos:exactMatch, so the LCSH-closeMatch-AAT mappings can be carried over to produce SSH-closeMatch-AAT mappings. Save the results to an NTriples RDF file. (392 mappings created, 17/04/2024)

In [21]:

graph = Graph(store="Oxigraph") # speed advantage over plain rdflib.Graph
# first time create the store:
#graph.open('./data/myRDFLibStore', create=True)

#graph = Graph()
graph.parse("./data/SSH-LCSH.nt")
graph.parse("./data/AAT-prefLabels.nt")
graph.parse("./data/LCSH-closeMatch-AAT.nt")

query = """
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
  PREFIX gvp: <http://vocab.getty.edu/ontology#>
  CONSTRUCT { ?ssh skos:closeMatch ?aat }
  WHERE {
    ?ssh a skos:Concept; skos:exactMatch ?lcsh .
    ?lcsh skos:closeMatch ?aat .
    ?aat a gvp:Concept .
  }
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-closeMatch-AAT.nt", format="nt")
# add the resultant (392) mappings to the existing graph
graph.parse("./data/SSH-LCSH-closeMatch-AAT.nt")

<Graph identifier=N66c1cf75c84241c2a89704b5a18bd50a (<class 'rdflib.graph.Graph'>)>
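
The CONSTRUCT query above is essentially a join on the shared LCSH concept. A plain-Python sketch of that join, using sets of pairs instead of an RDF graph; the identifiers are toy values invented for illustration, and the type checks (`a skos:Concept`, `a gvp:Concept`) are left out for brevity:

```python
def derive_close_matches(exact_match, close_match):
    """Compose SSH->LCSH exactMatch pairs with LCSH->AAT closeMatch pairs.

    Both arguments are sets of (subject, object) pairs; the result is the
    set of (ssh, aat) pairs reachable through a shared LCSH concept.
    """
    aat_by_lcsh = {}
    for lcsh, aat in close_match:
        aat_by_lcsh.setdefault(lcsh, set()).add(aat)
    return {
        (ssh, aat)
        for ssh, lcsh in exact_match
        for aat in aat_by_lcsh.get(lcsh, ())
    }


# toy identifiers, invented for illustration only
exact_match = {("ssh:1", "lcsh:A"), ("ssh:2", "lcsh:B"), ("ssh:3", "lcsh:C")}
close_match = {("lcsh:A", "aat:10"), ("lcsh:B", "aat:20"), ("lcsh:B", "aat:21")}

for pair in sorted(derive_close_matches(exact_match, close_match)):
    print(pair)
```

In this toy data `ssh:3` gets no mapping because its LCSH concept has no AAT counterpart, and `ssh:2` gets two because its LCSH concept has two.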

In [39]:
def countUnmappedConcepts():
    # check how many SSH-LCSH concepts do NOT yet have any mapping to AAT
    query = """
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX gvp: <http://vocab.getty.edu/ontology#>
        SELECT (COUNT(DISTINCT ?uri) AS ?counter) WHERE {
        ?uri a skos:Concept .
        MINUS { ?uri ?property [a gvp:Concept] }
        }
    """
    results = graph.query(query)
    # the aggregate query yields a single row; return its only value
    for result in results:
        return result[0]

counter = countUnmappedConcepts()
print(counter)

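# 17/04/2024 - returned 2982 (88.4%) at this stage, so 393 (11.6%) successfully mapped
# (the 665 shown below is from re-running this cell after the first broadMatch step)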

665
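
The MINUS clause in `countUnmappedConcepts` behaves like a set difference: take all concepts, then remove those with any outgoing link to an AAT concept, whatever the property. A small sketch of that logic with toy data; the function name and identifiers are invented for illustration:

```python
def count_unmapped(concepts, links, aat_concepts):
    """Count concepts with no outgoing link (of any property) to an AAT concept.

    `links` is an iterable of (subject, predicate, object) triples.
    """
    mapped = {s for s, _p, o in links if o in aat_concepts}
    return len(set(concepts) - mapped)


# toy data, invented for illustration only
concepts = {"ssh:1", "ssh:2", "ssh:3"}
aat_concepts = {"aat:10"}
links = [
    ("ssh:1", "skos:closeMatch", "aat:10"),
    ("ssh:2", "skos:broader", "ssh:1"),  # not a link to AAT, so ssh:2 stays unmapped
]
print(count_unmapped(concepts, links, aat_concepts))  # 2
```

Because the property is left unconstrained, closeMatch and broadMatch links both count as "mapped", which is why the same function can be reused after each mapping step.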


In [25]:
# create skos:broadMatch relationships for concepts
# not yet mapped, using vocabulary broader relationship
query = """
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
  PREFIX gvp: <http://vocab.getty.edu/ontology#>
  CONSTRUCT { ?uri skos:broadMatch ?aat_uri }
  WHERE {
    ?uri a skos:Concept; skos:broader ?broader_uri .
    ?broader_uri a skos:Concept; skos:closeMatch ?aat_uri .
    ?aat_uri a gvp:Concept .
    MINUS { ?uri ?property [a gvp:Concept] }
  }
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-broadMatch-AAT.nt", format="nt")
# add the resultant (3347, 17/04/2024) mappings to the existing graph
graph.parse("./data/SSH-LCSH-broadMatch-AAT.nt")

<Graph identifier=N66c1cf75c84241c2a89704b5a18bd50a (<class 'rdflib.graph.Graph'>)>
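
The broadMatch step above lets an unmapped concept inherit the AAT targets of its direct parent. The same idea in plain Python, as a sketch with invented toy identifiers (the function name is ours):

```python
def inherit_broad_matches(broader, close_match, already_mapped):
    """Give each unmapped concept a broadMatch to its parent's closeMatch targets.

    broader:        set of (concept, parent) pairs
    close_match:    dict mapping a concept to its set of AAT targets
    already_mapped: concepts that must be skipped (the role of the MINUS clause)
    """
    return {
        (concept, aat)
        for concept, parent in broader
        if concept not in already_mapped
        for aat in close_match.get(parent, ())
    }


# toy data, invented for illustration only
broader = {("ssh:2", "ssh:1"), ("ssh:3", "ssh:1"), ("ssh:4", "ssh:9")}
close_match = {"ssh:1": {"aat:10"}, "ssh:3": {"aat:30"}}

print(inherit_broad_matches(broader, close_match, already_mapped=set(close_match)))
```

Only `ssh:2` receives a broadMatch here: `ssh:3` is skipped because it is already mapped, and the parent of `ssh:4` has no closeMatch to inherit.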

In [40]:
counter = countUnmappedConcepts()
print(counter)
# 17/04/2024 - returns 665 (19.7%), so 2710 (80.3%) successfully mapped


665


In [41]:
# create skos:broadMatch relationships for concepts
# not yet mapped, using TWO steps of broader relationship
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    CONSTRUCT { ?uri skos:broadMatch ?aat_uri }
    WHERE {
    ?uri a skos:Concept; skos:broader/skos:broader ?broader_uri .
    ?broader_uri a skos:Concept; skos:closeMatch ?aat_uri .
    ?aat_uri a gvp:Concept .
    MINUS { ?uri ?property [a gvp:Concept] }
    }
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-broadbroadMatch-AAT.nt", format="nt")
# add the resultant (367, 17/04/2024) mappings to the existing graph
graph.parse("./data/SSH-LCSH-broadbroadMatch-AAT.nt")

<Graph identifier=N66c1cf75c84241c2a89704b5a18bd50a (<class 'rdflib.graph.Graph'>)>
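
The path `skos:broader/skos:broader` is a SPARQL 1.1 sequence path: it follows the broader relationship twice, so it reaches a concept's grandparents. A sketch of the same traversal for an arbitrary number of hops, using a hypothetical helper and toy identifiers:

```python
def ancestors_at(broader, concept, hops):
    """Return the concepts reached by following skos:broader exactly `hops` times.

    `broader` maps each concept to the set of its direct parents.
    """
    frontier = {concept}
    for _ in range(hops):
        frontier = {parent for node in frontier for parent in broader.get(node, ())}
    return frontier


# toy hierarchy, invented for illustration only: ssh:3 -> ssh:2 -> ssh:1
broader = {"ssh:3": {"ssh:2"}, "ssh:2": {"ssh:1"}}
print(ancestors_at(broader, "ssh:3", 1))  # {'ssh:2'}
print(ancestors_at(broader, "ssh:3", 2))  # {'ssh:1'}
```

A third hop from `ssh:3` returns an empty set, which corresponds to the concepts that remain unmapped when the hierarchy above them runs out.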

In [43]:
counter = countUnmappedConcepts()
print(counter)
# 17/04/2024 - returns 398 (11.8%), so 2977 (88.2%) successfully mapped

398
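
The percentages recorded in the comments at each stage can be reproduced from the unmapped counts and the 3375 concepts reported by the SSH-LCSH site. A short sketch; the `coverage` helper is ours, and the counts are the ones recorded in the notebook on 17/04/2024:

```python
TOTAL_CONCEPTS = 3375  # concept count reported by the SSH-LCSH site


def coverage(unmapped: int, total: int = TOTAL_CONCEPTS):
    """Return (mapped, mapped %, unmapped %), percentages rounded to one decimal."""
    mapped = total - unmapped
    return mapped, round(100 * mapped / total, 1), round(100 * unmapped / total, 1)


# unmapped counts recorded in the notebook comments (17/04/2024)
stages = [("closeMatch", 2982), ("one broader step", 665), ("two broader steps", 398)]
for stage, unmapped in stages:
    mapped, mapped_pct, unmapped_pct = coverage(unmapped)
    print(f"{stage}: {mapped} mapped ({mapped_pct}%), {unmapped} unmapped ({unmapped_pct}%)")
```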


In [None]:
# list the mappings created (for review) (TODO: not yet tested in the notebook; also output to a DataFrame)
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    SELECT DISTINCT ?ssh (str(?ssh_lbl) AS ?ssh_label) ?rel ?aat (str(?aat_lbl) AS ?aat_label)
    WHERE {
    ?ssh a skos:Concept . 
    ?aat a gvp:Concept .
    ?ssh ?rel ?aat .
    OPTIONAL { 
        ?ssh skos:prefLabel ?ssh_lbl . 
        FILTER(langMatches(lang(?ssh_lbl), "en")) 
    }
    OPTIONAL { 
        ?aat skos:prefLabel ?aat_lbl . 
        FILTER(langMatches(lang(?aat_lbl), "en")) 
    }
    }
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-matched-AAT.csv", format="csv")
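
For a quick review of the exported mappings without extra dependencies, the CSV can be tallied by relation type with the standard library. This sketch assumes the file has a header row named after the SELECT variables (ssh, ssh_label, rel, aat, aat_label); the rows below are invented stand-ins for the real file.

```python
import csv
import io
from collections import Counter


def tally_relations(csv_file):
    """Count mapping rows per relation (the `rel` column) in a results CSV."""
    return Counter(row["rel"] for row in csv.DictReader(csv_file))


# invented sample standing in for ./data/SSH-LCSH-matched-AAT.csv
SAMPLE = (
    "ssh,ssh_label,rel,aat,aat_label\n"
    "http://example.org/ssh/1,One,http://www.w3.org/2004/02/skos/core#closeMatch,http://example.org/aat/10,Ten\n"
    "http://example.org/ssh/2,Two,http://www.w3.org/2004/02/skos/core#broadMatch,http://example.org/aat/10,Ten\n"
    "http://example.org/ssh/3,Three,http://www.w3.org/2004/02/skos/core#broadMatch,http://example.org/aat/10,Ten\n"
)

for relation, count in tally_relations(io.StringIO(SAMPLE)).most_common():
    print(relation, count)
```

To run it on the real export, pass an open file handle instead, for example `open("./data/SSH-LCSH-matched-AAT.csv", newline="")`.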

In [None]:
# list the remaining records not mapped (TODO: not yet tested in the notebook; also output to a DataFrame)
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    SELECT DISTINCT ?uri (str(?lbl) AS ?label) WHERE {
    ?uri a skos:Concept .
    OPTIONAL { 
        ?uri skos:prefLabel ?lbl . 
        FILTER(langMatches(lang(?lbl), "en")) 
    }
    MINUS { ?uri ?property [a gvp:Concept] }
    }
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-nomatch-AAT.csv", format="csv")

In [None]:
# get the generated mappings as RDF NTriples (TODO: not tested in notebook yet)
query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    CONSTRUCT { ?ssh ?rel ?aat }
    WHERE {
    ?ssh a skos:Concept . 
    ?aat a gvp:Concept .
    ?ssh ?rel ?aat . 
    } 
"""
results = graph.query(query)
results.serialize(destination="./data/SSH-LCSH-match-AAT.nt", format="nt")
