<a href="https://colab.research.google.com/github/biblhertz/Datathink23_LinkedData/blob/main/BHMPI_KG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sample Project: Churches in Rome continued - Using Linked Data and Hertziana resources to enhance data

This sample project continues the one illustrated on Monday, with a few variations. Here you will learn to:

1. Query multiple Linked Data sets of the Bibliotheca Hertziana as if they were one
2. Enhance Geodata with heterogeneous sources

The basic way to identify church buildings is still the GND. However, you will see how GND IDs can also be used to query datasets that neither use GND directly nor link to it.

This notebook uses code written in the Python language; however, this is by no means prescriptive. The core of the project is the set of HTTP services, which you can query in any way you like.

## Step 1: Install dependencies

[SPARQLWrapper](https://sparqlwrapper.readthedocs.io) is a useful client library for SPARQL endpoints. You do not strictly _need_ it to query the services, but it saves you some overhead of constructing HTTP requests and unpacking the results.

In [86]:
!pip install sparqlwrapper tabulate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
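To illustrate what the client library saves you, here is a minimal sketch of the same kind of request built with the Python standard library only. The helper names `build_sparql_request` and `run_sparql` are made up for this example, and it assumes the endpoint accepts queries via HTTP GET with a `query` parameter and can return the standard SPARQL JSON results format.

```python
import json
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + '?' + urllib.parse.urlencode({'query': query})
    # Ask for the standard SPARQL JSON results format.
    headers = {'Accept': 'application/sparql-results+json'}
    return urllib.request.Request(url, headers=headers)


def run_sparql(endpoint, query, timeout=30):
    """Send the request and return the list of result bindings."""
    req = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)['results']['bindings']


# Only build the request here; call run_sparql(...) to actually execute it.
req = build_sparql_request('http://data.biblhertz.it/sparql/',
                           'SELECT * WHERE { ?s ?p ?o } LIMIT 1')
print(req.full_url)
```

The rest of the notebook sticks to SPARQLWrapper, which takes care of these details.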


## Step 2: Prepare the data

First, load the joined GeoJSON data from Monday's project on GitHub.

For convenience, also keep an index of the churches' GND IDs.

In [87]:
import requests

geo_url = 'https://raw.githubusercontent.com/biblhertz/Datathink23_MappingGND/main/joined.geojson'
response = requests.get(geo_url)
geodata = response.json()

gnd_ids = [feat['properties']['gnd'] for feat in geodata['features']]
print('{} GND IDs indexed'.format(len(gnd_ids)))

123 GND IDs indexed


We also keep track of a subset of GND IDs of churches present in Giuseppe Vasi's _Itinerario_.

In [88]:
# S. Stefano Rotondo , S. Francesco di Paola , S. Luigi dei Francesi
vasi_national_churches = [ '4302056-2' , '7563217-2', '4199094-8' ]
gndstr =  ' '.join(f'"{w}"' for w in vasi_national_churches)
gndstr

'"4302056-2" "7563217-2" "4199094-8"'
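Since these IDs are pasted into SPARQL query strings later on, it can be worth checking their shape first. The following is a small sketch with a hypothetical helper, `values_clause`. The regular expression is an assumption about what the GND IDs in this notebook look like (digits, an optional hyphen, one check character), not an official definition of the GND format.

```python
import re

# Assumed shape of a GND ID: digits, an optional hyphen, one check character.
GND_ID = re.compile(r'[0-9]+-?[0-9X]')


def values_clause(var, ids):
    """Return a SPARQL VALUES clause binding ?var to the given GND IDs."""
    bad = [i for i in ids if not GND_ID.fullmatch(i)]
    if bad:
        raise ValueError(f'Unexpected GND ID format: {bad}')
    quoted = ' '.join(f'"{i}"' for i in ids)
    return 'VALUES ?{} {{ {} }}'.format(var, quoted)


print(values_clause('gnd', ['4302056-2', '7563217-2', '4199094-8']))
```

Rejecting anything that does not look like an ID keeps stray quotes or braces out of the generated query.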

## Step 3: Exploratory queries to the datasets

The Bibliotheca Hertziana has run many data-intensive projects on churches in Rome, each with its own subset of churches of interest and its own way of identifying them. Some projects link their data to those of other Hertziana projects, others to authority records such as GND and VIAF.

However, they did not originally perform all the links in the same way. As part of creating Linked Data resources for all these projects, we are making it possible for these links to be discovered using one single predicate, `owl:sameAs` from the [OWL ontology language](https://www.w3.org/TR/owl2-overview/).

We will do most of the querying in SPARQL.

In [89]:
from SPARQLWrapper import SPARQLWrapper, JSON
from tabulate import tabulate

# Set up some data/ontology prefixes that will be useful later.
prefixes = """
PREFIX bibo:   <http://purl.org/ontology/bibo/>
PREFIX crm:    <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX crmsoc: <http://www.cidoc-crm.org/cidoc-crm/CRMsoc/>
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX frbr:   <http://purl.org/vocab/frbr/core#>
PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:   <http://www.w3.org/2004/02/skos/core#>
PREFIX rcp:    <http://data.biblhertz.it/romacommunispatria/term/>
"""

A church is identified in different ways in different projects, and each identifier carries properties that the other identifiers do not.

For example, a church identifier in **Roma Communis Patria** has information on what peoples a church was assigned to, but not on who built it: that information can be found using the **Zuccaro** identifier.

We, however, want to reduce that complexity: we don't want to worry about which identifier carries which properties, we just want to use one identifier, the GND ID.

Let's see, for example, all the identifiers by which Santo Stefano Rotondo is known in our datasets. We use an *exploratory query* to do that, and we execute it on the (provisional!) [SPARQL endpoint](http://data.biblhertz.it/sparql/) of the Bibliotheca Hertziana.

In [90]:
sparql_bhmpi = SPARQLWrapper('http://data.biblhertz.it/sparql/')

q = prefixes + """
SELECT DISTINCT ?x ?link WHERE {
  VALUES ?gnd { '""" +  vasi_national_churches[0] + """' }
  BIND ( IRI(CONCAT("https://d-nb.info/gnd/", ?gnd)) AS ?link)
  ?x (owl:sameAs|^owl:sameAs)* ?link
}"""
print(q)


PREFIX bibo:   <http://purl.org/ontology/bibo/>
PREFIX crm:    <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX crmsoc: <http://www.cidoc-crm.org/cidoc-crm/CRMsoc/>
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX frbr:   <http://purl.org/vocab/frbr/core#>
PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:   <http://www.w3.org/2004/02/skos/core#>
PREFIX rcp:    <http://data.biblhertz.it/romacommunispatria/term/>

SELECT DISTINCT ?x ?link WHERE {
  VALUES ?gnd { '4302056-2' }
  BIND ( IRI(CONCAT("https://d-nb.info/gnd/", ?gnd)) AS ?link)
  ?x (owl:sameAs|^owl:sameAs)* ?link
}


In [91]:
sparql_bhmpi.setQuery(q)
sparql_bhmpi.setReturnFormat(JSON)
# Run the query
ret = sparql_bhmpi.queryAndConvert()

# Print every URI found to be equivalent to the church
for r in ret['results']['bindings']:
  print(r['x']['value'])

https://d-nb.info/gnd/4302056-2
http://data.biblhertz.it/builtwork/lvpa/87dd8d495162c03f2ba4dad0ae3a2645
http://data.biblhertz.it/builtwork/zuccaro/100
http://data.biblhertz.it/romacommunispatria/builtwork/NB53
http://www.wikidata.org/entity/Q919456


So, there are several URIs for one church: three internal ones (starting with http://data.biblhertz.it/) and two external ones. But what properties can be used to obtain information about them? Let's find out through other exploratory queries.

The first one looks for the *outgoing* predicates (i.e. what do the data that _describe_ the church say?).

In [92]:
q = prefixes + """
SELECT DISTINCT ?p_out WHERE {
  VALUES ?gnd { '""" +  vasi_national_churches[0] + """' }
  BIND ( IRI(CONCAT("https://d-nb.info/gnd/", ?gnd)) AS ?link)
  ?x (owl:sameAs|^owl:sameAs)* ?link
  . ?x ?p_out []
} ORDER BY ?p_out"""

sparql_bhmpi.setQuery(q)
ret = sparql_bhmpi.queryAndConvert()
for r in ret['results']['bindings']:
  print(r['p_out']['value'])

http://data.biblhertz.it/romacommunispatria/term/had_or_has_Christian_denomination
http://data.biblhertz.it/term/temp/altLabels
http://data.biblhertz.it/term/temp/notes
http://www.cidoc-crm.org/cidoc-crm/108i_was_produced_by
http://www.cidoc-crm.org/cidoc-crm/P1_is_identified_by
http://www.cidoc-crm.org/cidoc-crm/P2_has_type
http://www.cidoc-crm.org/cidoc-crm/P3_has_note
http://www.cidoc-crm.org/cidoc-crm/P53_has_former_or_current_location
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/2004/02/skos/core#altLabel
http://www.w3.org/2004/02/skos/core#prefLabel


Now the same for _incoming_ predicates (i.e. what do the data that _refer to_ the church say?).



In [93]:
q = prefixes + """
SELECT DISTINCT ?p_in WHERE {
  VALUES ?gnd { '""" +  vasi_national_churches[0] + """' }
  BIND ( IRI(CONCAT("https://d-nb.info/gnd/", ?gnd)) AS ?link)
  ?x (owl:sameAs|^owl:sameAs)* ?link
  . [] ?p_in ?x
} ORDER BY ?p_in"""

sparql_bhmpi.setQuery(q)
ret = sparql_bhmpi.queryAndConvert()
for r in ret['results']['bindings']:
  print(r['p_in']['value'])

http://www.cidoc-crm.org/cidoc-crm/CRMsoc/P6_to
http://www.w3.org/2002/07/owl#sameAs


We can do the same to inspect how the linked entities (e.g. the production of a church, the social bonds) are described.
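As a sketch of how such follow-up queries could be generated, the hypothetical helper below builds the same kind of exploratory query, optionally hopping along a property path first. It only produces the query text: the result still needs the `prefixes` string prepended before it is sent to the endpoint.

```python
def predicates_query(gnd_id, path=None, incoming=False):
    """Build a query listing the predicates of a church, or of an entity linked to it.

    path: optional SPARQL property path from the church (?x) to the entity to inspect.
    """
    target, hop = '?x', ''
    if path:
        hop = f'?x {path} ?y .'
        target = '?y'
    pattern = f'[] ?p {target}' if incoming else f'{target} ?p []'
    return f"""
SELECT DISTINCT ?p WHERE {{
  VALUES ?gnd {{ "{gnd_id}" }}
  BIND ( IRI(CONCAT("https://d-nb.info/gnd/", ?gnd)) AS ?link )
  ?x (owl:sameAs|^owl:sameAs)* ?link .
  {hop}
  {pattern}
}} ORDER BY ?p"""


# Outgoing predicates of the social bonds that point to the church.
print(predicates_query('4302056-2', path='^crmsoc:P6_to'))
```

Calling it with `path='crm:108i_was_produced_by'` would list the predicates that describe the production of the church instead.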

## Step 4: Query and enhance
From the exploratory queries above, we found a few properties of interest:

* `skos:altLabel` can be used to obtain other names of the church;
* `crm:108i_was_produced_by` gives us information on how the church was built;
* `crmsoc:P6_to` (incoming) tells us that there are social bonds involving churches: these are the assignments of churches to be managed by specific (mostly ethnic) communities.

We want to enhance the GeoJSON with this information for all the churches that have these data (some will not; community assignments, for example, exist only for some churches).

In [94]:
# For all the predicates that might or might not exist for all the churches, we will use OPTIONAL
q = prefixes + """
SELECT DISTINCT ?gnd ?otherNames ?rite ?assignee ?architect WHERE {
  VALUES ?gnd { """ +  ' '.join(f'"{w}"' for w in gnd_ids) + """ }
  BIND ( IRI(CONCAT("https://d-nb.info/gnd/", ?gnd)) AS ?link)
  ?x (owl:sameAs|^owl:sameAs)* ?link
  . OPTIONAL { ?x rcp:had_or_has_Christian_denomination/skos:prefLabel ?rite }
  . OPTIONAL { ?x ^crmsoc:P6_to/crmsoc:P7_to/skos:prefLabel ?assignee }
  . OPTIONAL { ?x skos:altLabel ?otherNames }
  . OPTIONAL { ?x crm:108i_was_produced_by/crm:P01i_is_domain_of [
        crm:P02_has_range/skos:preferredLabel ?architect ;
        crm:P14.1_in_the_role_of <http://data.biblhertz.it/role/architekt> ] }
}
"""
sparql_bhmpi.setQuery(q)
ret = sparql_bhmpi.queryAndConvert()

import json  # needed for pretty-printing the results

print(json.dumps(ret['results']['bindings'], indent=2))

[
  {
    "gnd": {
      "type": "literal",
      "value": "4215847-3"
    }
  },
  {
    "gnd": {
      "type": "literal",
      "value": "4215847-3"
    },
    "architect": {
      "type": "literal",
      "value": "Carlo Stefano Fontana"
    }
  },
  {
    "gnd": {
      "type": "literal",
      "value": "7600039-4"
    }
  },
  {
    "gnd": {
      "type": "literal",
      "value": "4593099-5"
    }
  },
  {
    "gnd": {
      "type": "literal",
      "value": "4302056-2"
    }
  },
  {
    "gnd": {
      "type": "literal",
      "value": "4302056-2"
    },
    "otherNames": {
      "xml:lang": "it",
      "type": "literal",
      "value": "Santo Stefano Rotondo al Celio, Santo Stefano in Girimonte, Santo Stefano in Querquetulano"
    },
    "rite": {
      "xml:lang": "en",
      "type": "literal",
      "value": "Latin"
    },
    "assignee": {
      "xml:lang": "it",
      "type": "literal",
      "value": "ungheresi"
    }
  },
  {
    "gnd": {
      "type": "literal",
      "v

What's left to do now is to inject these new data into the GeoJSON, using the index by GND ID that we built earlier.

In [95]:
attrs = [ 'assignee', 'otherNames' , 'architect', 'rite' ]  # We're only integrating these

for r in ret['results']['bindings']:
  gnd = r['gnd']['value']
  props = geodata['features'][gnd_ids.index(gnd)]['properties']
  for a in attrs:
    if a in r :
      a_ = props.setdefault(a,[])
      if r[a]['value'] not in a_ :
        a_.append(r[a]['value'])

idx = gnd_ids.index(vasi_national_churches[0])  # The index of Santo Stefano Rotondo in the geodata
geodata['features'][idx]

{'type': 'Feature',
 'properties': {'name': 'Basilica di Santo Stefano Rotondo',
  'id': 'relation/1576101',
  'wikidata': 'Q919456',
  'image': 'http://commons.wikimedia.org/wiki/Special:FilePath/Celio%20-%20santo%20Stefano%20Rotondo%20-%20interno%20in%20restauro%2001533-4.JPG',
  'place': 'http://www.wikidata.org/entity/Q919456',
  'placeLabel': 'basilica di Santo Stefano Rotondo al Celio',
  'gnd': '4302056-2',
  'iccd': 15527,
  'image:1': 'http://commons.wikimedia.org/wiki/Special:FilePath/Celio%20-%20santo%20Stefano%20Rotondo%20-%20interno%20in%20restauro%2001533-4.JPG',
  'assignee': ['ungheresi'],
  'otherNames': ['Santo Stefano Rotondo al Celio, Santo Stefano in Girimonte, Santo Stefano in Querquetulano'],
  'rite': ['Latin']},
 'geometry': {'type': 'Polygon',
  'coordinates': [[[12.4964904, 41.8844593],
    [12.4965599, 41.8843972],
    [12.4966596, 41.8843578],
    [12.4967388, 41.88435],
    [12.4968095, 41.8843585],
    [12.4968866, 41.8843825],
    [12.4969561, 41.8844267
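The merge step above can also be written as a reusable function. The sketch below is one possible variant, not the notebook's own code: it looks features up through a dictionary keyed by GND ID instead of `list.index`, and it skips result rows whose GND ID has no matching feature. The function name `merge_bindings` and the second sample row are hypothetical. The first row reuses values shown in the output above.

```python
def merge_bindings(features, bindings, attrs):
    """Copy selected SPARQL result values into the matching GeoJSON features."""
    # Map each GND ID to the properties dict of its feature.
    by_gnd = {f['properties'].get('gnd'): f['properties'] for f in features}
    for row in bindings:
        props = by_gnd.get(row['gnd']['value'])
        if props is None:
            continue  # no feature with this GND ID
        for attr in attrs:
            if attr in row:
                values = props.setdefault(attr, [])
                if row[attr]['value'] not in values:
                    values.append(row[attr]['value'])
    return features


features = [{'type': 'Feature', 'geometry': None,
             'properties': {'name': 'Basilica di Santo Stefano Rotondo',
                            'gnd': '4302056-2'}}]
bindings = [
    {'gnd': {'type': 'literal', 'value': '4302056-2'},
     'rite': {'type': 'literal', 'value': 'Latin'},
     'assignee': {'type': 'literal', 'value': 'ungheresi'}},
    # Hypothetical row for a GND ID that is not in the GeoJSON.
    {'gnd': {'type': 'literal', 'value': '0000000-0'},
     'rite': {'type': 'literal', 'value': 'Latin'}},
]
merge_bindings(features, bindings, ['assignee', 'otherNames', 'architect', 'rite'])
print(features[0]['properties'])
```

One difference to keep in mind: if several features shared a GND ID, the dictionary would keep the last one, whereas `list.index` finds the first.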

### External data

Not all the Bibliotheca Hertziana data are currently accessible through the `data.biblhertz.it` SPARQL endpoint. The library holdings, for instance, are currently served by the Bibliotheksverbund Bayern (BVB) through the [B3Kat project](https://lod.b3kat.de). We can query and integrate those data too, knowing that the URI identifying the Hertziana as owner is `<http://lod.b3kat.de/bib/DE-Y2>` and that B3Kat uses `dct:subject` to connect to GND IDs (you can use exploratory queries to find that out).

However, there are some discrepancies in how different SPARQL endpoints work. While one would expect to retrieve books about S. Stefano Rotondo using this query:

```sparql
SELECT DISTINCT * WHERE {
  ?book dct:subject <http://d-nb.info/gnd/4302056-2>
   ; dc:title ?title
   ; frbr:exemplar/frbr:owner <http://lod.b3kat.de/bib/DE-Y2>
}
```
For reasons that remain unclear, this query has not always returned a solution. For BVB, we therefore resort to a query that performs string matching on the GND URI. Because this fallback is computationally intensive on the remote endpoint and can therefore be very slow, we demonstrate it with one church only: S. Stefano Rotondo.
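Note that the two endpoints in this notebook spell GND URIs differently: the Hertziana queries build `https://d-nb.info/gnd/...`, while the B3Kat queries use `http://d-nb.info/gnd/...`. The sketch below keeps that detail in one place. Both helper names are hypothetical, and the two-spelling filter is only an illustration of the string-matching idea: nothing here establishes that B3Kat stores both spellings.

```python
def gnd_uri(gnd_id, secure=True):
    """Return the GND URI for an ID, with an https (default) or http scheme."""
    scheme = 'https' if secure else 'http'
    return f'{scheme}://d-nb.info/gnd/{gnd_id}'


def str_filter(var, gnd_id):
    """Build a FILTER that compares ?var, as a string, with both URI spellings."""
    options = ', '.join(f'"{gnd_uri(gnd_id, s)}"' for s in (False, True))
    return f'FILTER (STR(?{var}) IN ({options}))'


print(gnd_uri('4302056-2', secure=False))
print(str_filter('y', '4302056-2'))
```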

In [96]:
sparql_bvb = SPARQLWrapper('https://lod.b3kat.de/sparql')

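# `dc:` is not declared in `prefixes`, so declare it here instead of relying on the endpoint to predefine it.
q = prefixes + f"""
PREFIX dc: <http://purl.org/dc/elements/1.1/>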
SELECT DISTINCT ?book ?title WHERE {{
  ?book dct:subject ?y
   ; dc:title ?title
   ; frbr:exemplar/frbr:owner <http://lod.b3kat.de/bib/DE-Y2>
  FILTER (STR(?y)='http://d-nb.info/gnd/{vasi_national_churches[0]}')
}}
"""

sparql_bvb.setQuery(q)
sparql_bvb.setReturnFormat(JSON)

# Execute the query
ret = sparql_bvb.queryAndConvert()
props = geodata['features'][idx]['properties']
a_ = props.setdefault('books',[])
for r in ret['results']['bindings']:
  a_.append({ 
      'ID' : r['book']['value'],
      'title' : r['title']['value']
      })

# Check the results
geodata['features'][idx]

{'type': 'Feature',
 'properties': {'name': 'Basilica di Santo Stefano Rotondo',
  'id': 'relation/1576101',
  'wikidata': 'Q919456',
  'image': 'http://commons.wikimedia.org/wiki/Special:FilePath/Celio%20-%20santo%20Stefano%20Rotondo%20-%20interno%20in%20restauro%2001533-4.JPG',
  'place': 'http://www.wikidata.org/entity/Q919456',
  'placeLabel': 'basilica di Santo Stefano Rotondo al Celio',
  'gnd': '4302056-2',
  'iccd': 15527,
  'image:1': 'http://commons.wikimedia.org/wiki/Special:FilePath/Celio%20-%20santo%20Stefano%20Rotondo%20-%20interno%20in%20restauro%2001533-4.JPG',
  'assignee': ['ungheresi'],
  'otherNames': ['Santo Stefano Rotondo al Celio, Santo Stefano in Girimonte, Santo Stefano in Querquetulano'],
  'rite': ['Latin'],
  'books': [{'ID': 'http://lod.b3kat.de/title/BV000665349',
    'title': 'S. Stefano Rotondo'},
   {'ID': 'http://lod.b3kat.de/title/BV000774157',
    'title': 'Kirchen am Lebensweg'},
   {'ID': 'http://lod.b3kat.de/title/BV013367138',
    'title': 'Sant
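Because the cell above appends to `props['books']` unconditionally, running it twice would list every book twice. A possible guard, sketched with a hypothetical helper `add_books`, is to skip book URIs that are already present. The sample rows reuse two of the titles shown in the output above.

```python
def add_books(props, bindings):
    """Append books from SPARQL bindings to props['books'], skipping known IDs."""
    books = props.setdefault('books', [])
    known = {b['ID'] for b in books}
    for row in bindings:
        book_id = row['book']['value']
        if book_id not in known:
            known.add(book_id)
            books.append({'ID': book_id, 'title': row['title']['value']})
    return books


sample = [
    {'book': {'type': 'uri', 'value': 'http://lod.b3kat.de/title/BV000665349'},
     'title': {'type': 'literal', 'value': 'S. Stefano Rotondo'}},
    {'book': {'type': 'uri', 'value': 'http://lod.b3kat.de/title/BV000774157'},
     'title': {'type': 'literal', 'value': 'Kirchen am Lebensweg'}},
]
props = {'name': 'Basilica di Santo Stefano Rotondo'}
add_books(props, sample)
add_books(props, sample)  # a second run adds nothing
print(len(props['books']))
```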