# Knowledge Exploration With SPARQL queries

This notebook demonstrates how to explore, navigate, and analyze the *Gazetteers of Scotland Knowledge Graph* (1803–1901) using SPARQL queries over a remote Fuseki endpoint. The data is modeled using the [Heritage Textual Ontology (HTO)](https://w3id.org/hto), and includes semantically enriched descriptions of places, texts, volumes, and their provenance.

The following questions are addressed using 16 targeted SPARQL queries:

1. **What RDF classes are defined in the dataset?**  
   → Query 1 lists every class from the HTO namespace that is used as an `rdf:type` in the dataset.

2. **What properties are used across the knowledge graph?**  
   → Query 2 enumerates all `hto:` properties in use.

3. **What Gazetteer series are present?**  
   → Query 3 retrieves all `hto:Series` instances and their titles.

4. **Who are the editors associated with volumes or series?**  
   → Query 4 explores the `hto:editor` property, retrieves linked `hto:Person` entities, and their names via `foaf:name`.

5. **What volumes are included, and how are they organized into series and collections?**  
   → Query 5 lists all `hto:Volume` entities with their series and parent `hto:WorkCollection`.

6. **What metadata properties describe a given volume?**  
   → Query 6 selects a sample volume and lists all associated RDF triples.

7. **How are `hto:OriginalDescription` entries structured?**  
   → Query 7 lists all properties used to describe article-level entries.

8. **What is the text and source of each article?**  
   → Query 8 retrieves full text and source documents for descriptions.

9. **What are all RDF triples for a specific article?**  
   → Query 9 drills into a selected `hto:OriginalDescription` and inspects its metadata.

10. **How is a `LocationRecord` structured?**  
    → Query 10 retrieves a sample `hto:LocationRecord` and lists all associated properties including name, description, and pages.

11. **What is the article title, text, and page range for each location record?**  
    → Query 11 aggregates key fields (name, full text, start/end pages) from each `hto:LocationRecord`.

12. **How has a specific place (e.g., "DUNDEE") been described across the corpus?**  
    → Query 12 retrieves all articles titled "DUNDEE", including their text, source volume, parent series, and publication year.

13. **What are the longest Gazetteer articles, and where do they appear?**  
    → Query 13 lists the 10 longest `hto:LocationRecord` entries by text length, showing the article title, a text excerpt, the volume and series in which the article was published, and the year of publication. This helps surface dense or historically significant entries for further analysis.

14. **Which Gazetteer articles refer to other entries, and what do they reference?**  
    → Query 14 explores internal semantic links using the `hto:refersTo` property. It returns `hto:LocationRecord` entries that refer to other records, displaying both the source and target names. This enables tracing redirects, summaries, and cross-references within the Gazetteers knowledge graph.

15. **Which article titles are reused across multiple Gazetteer entries?**  
    → Query 15 groups `hto:LocationRecord` entries by their `hto:name` and lists those names that appear in multiple records. These cases reveal reused or ambiguous place names across volumes or editions (e.g., “LOGIE”, “KIRKHILL”), useful for disambiguation or tracking editorial duplication over time.

16. **Which Gazetteer articles include alternate or variant names?**  
    → Query 16 identifies articles that contain both a primary name (`hto:name`) and one or more alternate names (`rdfs:label`), typically derived from metadata fields such as “Alternative names.” This supports fuzzy search, historical variant matching, and linguistic normalization.



All queries are executed through `SPARQLWrapper` against the remote Fuseki endpoint; a local copy of the graph can be used instead.



## Setup

Make sure **SPARQLWrapper** is installed in your Python environment.

In [2]:
!pip install SPARQLWrapper

Collecting SPARQLWrapper
  Downloading SPARQLWrapper-2.0.0-py3-none-any.whl.metadata (2.0 kB)
Collecting rdflib>=6.1.1 (from SPARQLWrapper)
  Downloading rdflib-7.1.4-py3-none-any.whl.metadata (11 kB)
Downloading SPARQLWrapper-2.0.0-py3-none-any.whl (28 kB)
Downloading rdflib-7.1.4-py3-none-any.whl (565 kB)
Installing collected packages: rdflib, SPARQLWrapper
Successfully installed SPARQLWrapper-2.0.0 rdflib-7.1.4


## Connection

Choose one of the two connection options. The remote one is recommended.

### Remote Fuseki Connection


We connect to the remote SPARQL server hosting the Gazetteers knowledge graph. The data is served via a [Fuseki SPARQL endpoint](http://query.frances-ai.com/hto_gazetteers), and includes RDF resources describing volumes, series, articles, locations, pages, and provenance information.

In [12]:
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper(
    "http://query.frances-ai.com/hto_gazetteers"
)
sparql.setReturnFormat(JSON)
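
The query cells in this notebook all follow the same pattern: set the query, convert the JSON response, and read `binding[var]['value']` for each row. A minimal sketch of a helper that wraps this pattern is given below. The names `bindings_to_rows` and `run_select` are hypothetical (they are not part of SPARQLWrapper); the helper only assumes an object that exposes `setQuery` and `queryAndConvert`, as the `sparql` object above does.

```python
def bindings_to_rows(result):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts.

    Variables left unbound in a row (for example by OPTIONAL) are absent
    from that row's dict.
    """
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def run_select(endpoint, query):
    """Run a SELECT query on a SPARQLWrapper-like object and return flat rows."""
    endpoint.setQuery(query)
    return bindings_to_rows(endpoint.queryAndConvert())
```

With the connection above, `run_select(sparql, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")` would return a list with one small dict per result row.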

### Local Connection (rdflib)

Use this option if you want to query a local copy of the Gazetteer RDF graph (e.g., `gaz.ttl`). The cell below loads the file into an in-memory `rdflib` graph; it does not start or contact a Fuseki server.

In [None]:
from rdflib import Graph, URIRef, Namespace
from rdflib.plugins.sparql import prepareQuery

# Create a new RDFLib Graph
basic_graph = Graph()

# Load the rdf file into the graph
basic_graph_file = "./gaz.ttl"
basic_graph.parse(basic_graph_file, format="turtle")


hto = Namespace("https://w3id.org/hto#")
# Print the number of "triples" in the Graph
print(f"Basic Graph has {len(basic_graph)} statements.")
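
The query cells that follow are written for the remote `sparql` object. With the local option, the same SPARQL text can be passed to the graph's own `query` method instead. The sketch below assumes that `query` returns an iterable of result rows and that each row offers `asdict()`, which is how rdflib's result rows behave to the best of my knowledge; the function name `query_local` is hypothetical.

```python
def query_local(graph, query):
    """Run a SPARQL SELECT on an rdflib-style graph and return flat rows.

    Each row becomes a {variable: string} dict, mirroring the
    binding[var]['value'] strings used with the remote endpoint.
    """
    rows = []
    for row in graph.query(query):
        # asdict() maps variable names to RDF terms and skips unbound ones.
        rows.append({str(var): str(term) for var, term in row.asdict().items()})
    return rows
```

For example, `query_local(basic_graph, "SELECT DISTINCT ?type WHERE { ?s a ?type }")` would list the classes used in the local file.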

## Query the graph

### Query 1: Explore all RDF classes (types) defined in the HTO namespace

This query retrieves all distinct RDF types (`rdf:type`) that are used in the dataset and belong to the Heritage Textual Ontology (HTO).

In RDF, the `rdf:type` predicate is used to declare the class of a resource (e.g., `hto:Volume`, `hto:Location`, `hto:OriginalDescription`). Listing these types gives us a high-level overview of the entity types that populate the knowledge graph.

This is particularly useful at the beginning of an exploration session to understand the shape and semantics of the dataset.


In [27]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT DISTINCT ?type WHERE {
  ?s a ?type .
  FILTER(STRSTARTS(STR(?type), "https://w3id.org/hto#"))
}
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"Type: {r['type']['value']}")
except Exception as e:
    print(e)



Type: https://w3id.org/hto#Location
Type: https://w3id.org/hto#Concept
Type: https://w3id.org/hto#Organization
Type: https://w3id.org/hto#SoftwareAgent
Type: https://w3id.org/hto#Series
Type: https://w3id.org/hto#Volume
Type: https://w3id.org/hto#LocationType
Type: https://w3id.org/hto#Type
Type: https://w3id.org/hto#TextQuality
Type: https://w3id.org/hto#ResourceType
Type: https://w3id.org/hto#ExternalRecord
Type: https://w3id.org/hto#WorkCollection
Type: https://w3id.org/hto#InformationResource
Type: https://w3id.org/hto#Activity
Type: https://w3id.org/hto#Person
Type: https://w3id.org/hto#LocationRecord
Type: https://w3id.org/hto#OriginalDescription
Type: https://w3id.org/hto#Page


### Query 2: Explore all properties defined in the HTO namespace

This query returns all distinct RDF properties (predicates) in the dataset that belong to the Heritage Textual Ontology (HTO).

In RDF, predicates express the relationships between resources or between a resource and a literal (e.g., `hto:title`, `hto:editor`, `hto:startsAtPage`). By listing all used properties, we gain insight into the kinds of metadata and semantic links available in the graph.

This is useful for discovering which attributes are used to describe volumes, pages, descriptions, places, people, and other entities.


In [41]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT DISTINCT ?p WHERE {
  ?s ?p ?o .
  FILTER(STRSTARTS(STR(?p), "https://w3id.org/hto#"))
}
""")

try:
    ret = sparql.queryAndConvert()
    print("All hto: properties in use:")
    for r in ret["results"]["bindings"]:
        print(f"{r['p']['value']}")
except Exception as e:
    print(e)


All hto: properties in use:
https://w3id.org/hto#hadMember
https://w3id.org/hto#name
https://w3id.org/hto#editor
https://w3id.org/hto#genre
https://w3id.org/hto#language
https://w3id.org/hto#mmsid
https://w3id.org/hto#number
https://w3id.org/hto#printedAt
https://w3id.org/hto#shelfLocator
https://w3id.org/hto#yearPublished
https://w3id.org/hto#title
https://w3id.org/hto#physicalDescription
https://w3id.org/hto#subtitle
https://w3id.org/hto#publisher
https://w3id.org/hto#hasOriginalDescription
https://w3id.org/hto#endsAtPage
https://w3id.org/hto#startsAtPage
https://w3id.org/hto#hasTextQuality
https://w3id.org/hto#text
https://w3id.org/hto#wasExtractedFrom
https://w3id.org/hto#refersTo
https://w3id.org/hto#birthYear
https://w3id.org/hto#deathYear
https://w3id.org/hto#wasMemberOf
https://w3id.org/hto#permanentURL
https://w3id.org/hto#volumeId
https://w3id.org/hto#numberOfPages
https://w3id.org/hto#hadConceptRecord
https://w3id.org/hto#hasResourceType
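
Full URIs make long listings like the one above hard to scan. As a small illustration, the sketch below shortens a URI to a prefixed name using the namespaces that appear in this notebook. The `PREFIXES` table and the `compact` function are hypothetical helpers, not part of any library.

```python
# Namespaces that occur in this notebook's queries and outputs.
PREFIXES = {
    "hto": "https://w3id.org/hto#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "prov": "http://www.w3.org/ns/prov#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def compact(uri, prefixes=PREFIXES):
    """Return a prefixed name such as 'hto:title', or the URI unchanged."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


print(compact("https://w3id.org/hto#startsAtPage"))
```

Instance URIs such as `https://w3id.org/hto/Series/...` do not start with the `hto#` namespace, so they are returned unchanged.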


### Query 3: Retrieve all Gazetteer series and their titles

This query retrieves all resources of type `hto:Series` and their corresponding titles using the `hto:title` property.

A `Series` in the HTO knowledge graph represents a multi-volume work (e.g., the *Imperial Gazetteer of Scotland* or the *Ordnance Gazetteer of Scotland*). Each series may consist of multiple volumes published across different years or editions.

This query helps establish the top-level bibliographic structure of the Gazetteers collection.


In [30]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?series ?title WHERE {
  ?series a hto:Series ;
          hto:title ?title .
}
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"Series URI: {r['series']['value']} — Title: {r['title']['value']}")
except Exception as e:
    print(e)


Series URI: https://w3id.org/hto/Series/9910440713804340 — Title: gazetteer of Scotland. [With plates and maps.]
Series URI: https://w3id.org/hto/Series/9928112733804340 — Title: imperial gazetteer of Scotland; or, Dictionary of Scottish topography, compiled from the most recent authorities, and forming a complete body of Scottish geography, physical, statistical, and historical
Series URI: https://w3id.org/hto/Series/9928151783804340 — Title: topographical dictionary of Scotland
Series URI: https://w3id.org/hto/Series/9928228793804340 — Title: Ordnance gazetteer of Scotland
Series URI: https://w3id.org/hto/Series/9930626093804340 — Title: Ordnance gazetteer of Scotland
Series URI: https://w3id.org/hto/Series/9931003343804340 — Title: gazetteer of Scotland
Series URI: https://w3id.org/hto/Series/9931344573804340 — Title: gazetteer of Scotland: containing a particular and concise description of the counties, parishes, islands, cities ... With ... map
Series URI: https://w3id.org/hto/Ser

### Query 4: Explore editorial metadata in the Gazetteers knowledge graph

This set of queries investigates how editorial contributions are modeled in the dataset using the `hto:editor` property and linked `hto:Person` entities. Editors are critical figures in shaping the content and structure of the Gazetteers.




#### 🔹 Query 4.1: Find all resources with an associated editor

This query retrieves all resources that declare an `hto:editor`, along with the URI of the editor (typically an `hto:Person`).

This helps identify which series or volumes are explicitly linked to known editors.



In [32]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?subject ?editor WHERE {
  ?subject hto:editor ?editor .
}
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"{r['subject']['value']} → {r['editor']['value']}")
except Exception as e:
    print(e)


https://w3id.org/hto/Series/9931344583804340 → https://w3id.org/hto/Person/4607874226
https://w3id.org/hto/Series/9910440713804340 → https://w3id.org/hto/Person/5247046190
https://w3id.org/hto/Series/9928151783804340 → https://w3id.org/hto/Person/7593396701
https://w3id.org/hto/Series/9928112733804340 → https://w3id.org/hto/Person/4251664498
https://w3id.org/hto/Series/9933057493804340 → https://w3id.org/hto/Person/4251664498
https://w3id.org/hto/Series/9928228793804340 → https://w3id.org/hto/Person/9594167312
https://w3id.org/hto/Series/9930626093804340 → https://w3id.org/hto/Person/9594167312
https://w3id.org/hto/Series/9931003343804340 → https://w3id.org/hto/Person/4957971131
https://w3id.org/hto/Series/9931344573804340 → https://w3id.org/hto/Person/4957971131
https://w3id.org/hto/Series/9931344933804340 → https://w3id.org/hto/Person/4957971131
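
Several editor URIs occur more than once in the output above, which suggests that one editor can be linked to several series. A short sketch for grouping the pairs by editor follows; `series_by_editor` is a hypothetical helper, and the sample pairs are copied from the printed results.

```python
from collections import defaultdict


def series_by_editor(pairs):
    """Group (series_uri, editor_uri) pairs into {editor_uri: [series_uri, ...]}."""
    grouped = defaultdict(list)
    for series_uri, editor_uri in pairs:
        grouped[editor_uri].append(series_uri)
    return dict(grouped)


# Three of the pairs printed above, copied verbatim.
pairs = [
    ("https://w3id.org/hto/Series/9931003343804340", "https://w3id.org/hto/Person/4957971131"),
    ("https://w3id.org/hto/Series/9931344573804340", "https://w3id.org/hto/Person/4957971131"),
    ("https://w3id.org/hto/Series/9928151783804340", "https://w3id.org/hto/Person/7593396701"),
]
for editor, series in series_by_editor(pairs).items():
    print(editor, len(series))
```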


#### 🔹 Query 4.2: Inspect all properties of a specific editor (Person)

Given the URI of a specific `hto:Person`, this query lists all associated properties. This typically includes:

- `rdf:type` (should be `hto:Person`)
- `foaf:name` (if available)
- `hto:birthYear`, `hto:deathYear` (if known)

This allows us to inspect how editors are semantically described in the graph.


In [36]:
person_uri = "https://w3id.org/hto/Person/4607874226"

sparql.setQuery(f"""
SELECT ?p ?o WHERE {{
  <{person_uri}> ?p ?o .
}}
""")

try:
    ret = sparql.queryAndConvert()
    print(f"All properties for {person_uri}:\n")
    for r in ret["results"]["bindings"]:
        print(f"{r['p']['value']} → {r['o']['value']}")
except Exception as e:
    print(e)


All properties for https://w3id.org/hto/Person/4607874226:

http://www.w3.org/1999/02/22-rdf-syntax-ns#type → https://w3id.org/hto#Person
http://xmlns.com/foaf/0.1/name → Scotland. [Appendix. - Descriptions, Topography & Travels.]
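
The cell above builds the query by interpolating a URI into an f-string. That is fine for a hard-coded value, but if the URI ever comes from user input or from another query, it is worth checking it first. The sketch below rejects characters that the SPARQL grammar does not permit between `<` and `>`; `iri_ref` is a hypothetical helper name.

```python
# Characters the SPARQL grammar excludes from <...> IRI references.
_FORBIDDEN = set('<>"{}|^`\\')


def iri_ref(uri):
    """Wrap a URI in angle brackets for use in a SPARQL query string.

    Raises ValueError if the URI contains a character that is not allowed
    inside <...>, which also stops a crafted URI from closing the bracket
    and appending its own query text.
    """
    for ch in uri:
        if ch in _FORBIDDEN or ord(ch) <= 0x20:
            raise ValueError(f"character {ch!r} is not allowed in a SPARQL IRI")
    return f"<{uri}>"


person_uri = "https://w3id.org/hto/Person/4607874226"
query = f"SELECT ?p ?o WHERE {{ {iri_ref(person_uri)} ?p ?o . }}"
print(query)
```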


#### 🔹 Query 4.3: List all distinct editor names

This query navigates from edited resources (`hto:editor`) to the linked `hto:Person`, then retrieves the person's name using `foaf:name`.

This gives a clean list of all editors represented in the graph by name — useful for documentation, indexing, or attribution.

Together, these queries illustrate how biographical and editorial metadata is encoded and linked across multiple entity types.

In [37]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?name WHERE {
  ?instance hto:editor ?editor .
  ?editor foaf:name ?name .
}
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    print("Editor names (via foaf:name):\n")
    for r in ret["results"]["bindings"]:
        print(f"- {r['name']['value']}")
except Exception as e:
    print(e)


Editor names (via foaf:name):

- Scotland. [Appendix. - Descriptions, Topography & Travels.]
- Chambers, William
- Lewis, Samuel
- Wilson, John Marius.
- Groome, Francis Hindes
- Scotland. [Appendix. - Descriptions, Topography and Travels.]
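
Two of the names above differ only in `&` versus `and`. If such variants should be treated as one entry, a rough comparison key can help spot them. The sketch below is one possible heuristic, not a method used by the dataset's authors, and `name_key` is a hypothetical helper.

```python
def name_key(name):
    """Build a rough comparison key for catalogue-style name strings."""
    key = name.casefold().replace("&", "and")
    # Collapse runs of whitespace, then drop trailing spaces and full stops.
    key = " ".join(key.split())
    return key.rstrip(" .")


a = "Scotland. [Appendix. - Descriptions, Topography & Travels.]"
b = "Scotland. [Appendix. - Descriptions, Topography and Travels.]"
print(name_key(a) == name_key(b))
```

A key like this is only suitable for flagging candidates for manual review; it would not distinguish two different people who share a name.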


### Query 5: Retrieve volumes and their parent series from the Gazetteers of Scotland collection

This query lists all `hto:Volume` resources that are members of a `hto:Series`, which in turn belongs to the broader `hto:WorkCollection` titled *Gazetteers of Scotland Collection*.

Each volume is returned with:
- Its title (`hto:title`)
- The title of the series it belongs to

This hierarchical query traverses three levels of structure:
1. **Collection** → `hto:WorkCollection`
2. **Series** → `hto:Series`
3. **Volume** → `hto:Volume`

The result gives a curated view of how individual volumes are organized into series and grouped within the overall collection. It is useful for bibliographic exploration and user-facing navigation interfaces.


In [28]:
sparql.setQuery("""
    PREFIX hto: <https://w3id.org/hto#>
    SELECT * WHERE {
        ?volume a hto:Volume;
            hto:title ?vol_title.
        ?series a hto:Series;
            hto:title ?series_title;
            hto:hadMember ?volume.
        ?collection a hto:WorkCollection;
            hto:name "Gazetteers of Scotland Collection";
            hto:hadMember ?series.
        }
    """
)

try:
    ret = sparql.queryAndConvert()

    for r in ret["results"]["bindings"]:
        print(f"Volume title: {r['vol_title']['value']}, in series: {r['series_title']['value']}")
except Exception as e:
    print(e)

Volume title: gazetteer of Scotland. [With plates and maps.] 1838, Volume 1, in series: gazetteer of Scotland. [With plates and maps.]
Volume title: gazetteer of Scotland. [With plates and maps.] 1838, Volume 2, in series: gazetteer of Scotland. [With plates and maps.]
Volume title: imperial gazetteer of Scotland; or, Dictionary of Scottish topography, compiled from the most recent authorities, and forming a complete body of Scottish geography, physical, statistical, and historical 1868, Volume 1, in series: imperial gazetteer of Scotland; or, Dictionary of Scottish topography, compiled from the most recent authorities, and forming a complete body of Scottish geography, physical, statistical, and historical
Volume title: imperial gazetteer of Scotland; or, Dictionary of Scottish topography, compiled from the most recent authorities, and forming a complete body of Scottish geography, physical, statistical, and historical 1868, Volume 2, in series: imperial gazetteer of Scotland; or, Dic

### Query 6: Explore metadata properties of a sample Gazetteer volume

This pair of queries is used to inspect the metadata of a single `hto:Volume` in detail.


#### 🔹 Query 6.1: Select a sample volume URI

This query selects one instance of a `hto:Volume` from the dataset. It serves as a dynamic starting point for detailed inspection of that volume's metadata.

This step ensures that we are querying an actual, existing volume in the graph.


In [39]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?volume WHERE {
  ?volume a hto:Volume .
}
LIMIT 1
""")

try:
    ret = sparql.queryAndConvert()
    if ret["results"]["bindings"]:
        volume_uri = ret["results"]["bindings"][0]["volume"]["value"]
        print("Volume URI:", volume_uri)
    else:
        print("No volumes found.")
except Exception as e:
    print(e)



Volume URI: https://w3id.org/hto/Volume/9910440713804340_97424370


#### 🔹 Query 6.2: Retrieve all properties of the selected volume

Using the volume URI obtained in the previous step (e.g., `hto:Volume/9910440713804340_97424370`), this query retrieves all RDF properties and values linked to it.

Typical metadata includes:
- `hto:title` — full title of the volume
- `hto:number` — volume number
- `hto:numberOfPages` — total page count
- `hto:permanentURL` — link to the digitized version
- `hto:wasMemberOf` — series the volume belongs to
- `hto:hadMember` — individual pages contained in the volume

Together, these queries allow for close inspection of how volume-level bibliographic and structural metadata is modeled in the HTO knowledge graph.

In [47]:
sparql.setQuery("""
SELECT ?p ?o WHERE {
  <https://w3id.org/hto/Volume/9910440713804340_97424370> ?p ?o .
}

LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    print("Properties for Volume 9910440713804340_97424370:\n")
    for r in ret["results"]["bindings"]:
        print(f"{r['p']['value']} → {r['o']['value']}")
except Exception as e:
    print(e)


Properties for Volume 9910440713804340_97424370:

http://www.w3.org/1999/02/22-rdf-syntax-ns#type → https://w3id.org/hto#Volume
http://www.w3.org/1999/02/22-rdf-syntax-ns#type → https://w3id.org/hto#WorkCollection
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_470
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_471
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_486
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_179
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_88
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_89
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_102
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_97424370_257
https://w3id.org/hto#hadMember → https://w3id.org/hto/Page/9910440713804340_
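
With `LIMIT 20`, the `hto:hadMember` rows for individual pages take up most of the output and can crowd out the other properties. One way to see the overall shape is to drop the limit and count rows per predicate. A minimal sketch, where `predicate_counts` is a hypothetical helper and the sample rows are copied from the output above:

```python
from collections import Counter


def predicate_counts(bindings):
    """Count how often each predicate occurs in a list of ?p/?o bindings."""
    return Counter(b["p"]["value"] for b in bindings)


# Three rows in the shape returned by queryAndConvert().
sample = [
    {"p": {"value": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"},
     "o": {"value": "https://w3id.org/hto#Volume"}},
    {"p": {"value": "https://w3id.org/hto#hadMember"},
     "o": {"value": "https://w3id.org/hto/Page/9910440713804340_97424370_470"}},
    {"p": {"value": "https://w3id.org/hto#hadMember"},
     "o": {"value": "https://w3id.org/hto/Page/9910440713804340_97424370_471"}},
]
for predicate, count in predicate_counts(sample).most_common():
    print(f"{count:>3}  {predicate}")
```

In the notebook, `predicate_counts(ret["results"]["bindings"])` would give the same summary for the live result.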

### Query 7: List all properties used in `hto:OriginalDescription` entries

This query retrieves all distinct RDF properties that appear in resources of type `hto:OriginalDescription`. These represent individual article-level entries extracted from the gazetteers.

By examining which properties are used on `hto:OriginalDescription`, we can understand how each entry is semantically described — including its content, provenance, and quality metadata.

Typical properties include:
- `hto:text` — the full textual content of the article
- `hto:hasTextQuality` — a quality indicator (e.g., Low, High)
- `hto:wasExtractedFrom` — the source page or document
- `prov:wasAttributedTo` — the responsible agent (e.g., MappingChange pipeline)

This query is useful for schema exploration and understanding how article-level data is structured.


In [29]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT DISTINCT ?p WHERE {
  ?desc a hto:OriginalDescription ;
        ?p ?o .
}
LIMIT 50
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"Property: {r['p']['value']}")
except Exception as e:
    print(e)



Property: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
Property: http://www.w3.org/ns/prov#wasAttributedTo
Property: https://w3id.org/hto#hasTextQuality
Property: https://w3id.org/hto#text
Property: https://w3id.org/hto#wasExtractedFrom


### Query 8: Retrieve the text of Gazetteer articles and their source documents

This query returns sample `hto:OriginalDescription` entries, showing the full article text and the `hto:InformationResource` from which it was extracted (typically an ALTO XML file or digitized page).

Each result includes:
- The URI of the article description (`?desc`)
- The full text content (`hto:text`)
- The source document or page URI (`hto:wasExtractedFrom`)

This query provides a window into the actual semantic content of the gazetteer entries, enabling inspection of OCR outputs and understanding of how content is linked to its digitized provenance.

It is especially useful for building content previews, search indexes, or validating extraction quality.


In [48]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?desc ?text ?source WHERE {
  ?desc a hto:OriginalDescription ;
        hto:text ?text ;
        hto:wasExtractedFrom ?source .
}
LIMIT 10
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"\nDescription: {r['desc']['value']}\nSource: {r['source']['value']}\nText: {r['text']['value'][:200]}...\n")
except Exception as e:
    print(e)




Description: https://w3id.org/hto/OriginalDescription/9910440713804340_97424370_1007501475_0NLS
Source: https://w3id.org/hto/InformationResource/97424370_alto_97430004_34_xml
Text: a united parish on the mainland of Orkney, of nine miles in length, with a varying breadth, lying west of Kirkwall. In its centre is the lake of S tennis or Stenhouse, which is nearly divided in two b...


Description: https://w3id.org/hto/OriginalDescription/9910440713804340_97424370_1007608998_0NLS
Source: https://w3id.org/hto/InformationResource/97424370_alto_97430016_34_xml
Text: an inlet of the sea on the south-east coast of Sutherlandshire, across the narrow neck of which there is a ferry, on the thoroughfare along the coast northwards from Dornoch....


Description: https://w3id.org/hto/OriginalDescription/9910440713804340_97424370_1016059890_0NLS
Source: https://w3id.org/hto/InformationResource/97424370_alto_97430196_34_xml
Text: FRODA, an islet on the west coast of Skye....


Description: https://w
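
The preview in the cell above slices the text at exactly 200 characters, which can cut a word in half (as in "divided in two b..."). The standard library's `textwrap.shorten` truncates at a word boundary instead and also collapses runs of whitespace. A small sketch, with `preview` as a hypothetical helper name:

```python
import textwrap


def preview(text, width=200):
    """Return a whitespace-normalised excerpt cut at a word boundary."""
    return textwrap.shorten(text, width=width, placeholder="...")


print(preview("an inlet of the sea on the south-east coast of Sutherlandshire", width=40))
```

Text that already fits within `width` is returned without the placeholder.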

### Query 9: Inspect all RDF properties of a single `hto:OriginalDescription` entry

This two-part query inspects one specific Gazetteer article by first selecting an example description and then listing all of its associated RDF triples.


#### 🔹 Query 9.1: Select a sample `OriginalDescription` URI

This query retrieves one resource of type `hto:OriginalDescription`. This URI will be used to examine all the semantic properties associated with that individual article.


In [22]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?desc ?p ?o WHERE {
  ?desc a hto:OriginalDescription ;
        ?p ?o .
}
LIMIT 1
""")

# Take the description URI from the first result row
try:
    ret = sparql.queryAndConvert()
    binding = ret["results"]["bindings"][0]
    desc_uri = binding["desc"]["value"]
    print(f"Using description URI: {desc_uri}")
except Exception as e:
    print("Could not get description URI:", e)




Using description URI: https://w3id.org/hto/OriginalDescription/9910440713804340_97424370_1007501475_0NLS
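
The cell above indexes `bindings[0]` directly, so an empty result surfaces as an `IndexError` caught by the broad `except`. An explicit check, like the one used in Query 6.1, makes the empty case easier to tell apart from a network or syntax error. A minimal sketch, with `first_value` as a hypothetical helper:

```python
def first_value(result, var):
    """Return the value of `var` in the first binding, or None if there is none."""
    bindings = result["results"]["bindings"]
    if not bindings or var not in bindings[0]:
        return None
    return bindings[0][var]["value"]


empty = {"results": {"bindings": []}}
print(first_value(empty, "desc"))
```

In the notebook this would be called as `desc_uri = first_value(ret, "desc")`, followed by a check for `None` before running Query 9.2.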


#### 🔹 Query 9.2: Retrieve all RDF properties of that description

Once a description URI is selected, this second query prints all RDF triples where that URI is the subject. This includes key metadata such as:

- `hto:text` — the full article content
- `hto:wasExtractedFrom` — the page or document source
- `hto:hasTextQuality` — quality annotation (e.g., Low, High)
- `prov:wasAttributedTo` — the agent responsible for the extraction
- Any other custom properties used in semantic modeling

Together, these queries allow you to deeply inspect the structure and provenance of individual gazetteer entries.

In [23]:
sparql.setQuery(f"""
SELECT ?p ?o WHERE {{
  <{desc_uri}> ?p ?o .
}}
""")

try:
    ret = sparql.queryAndConvert()
    print(f"\nAll properties for {desc_uri}:\n")
    for r in ret["results"]["bindings"]:
        print(f"{r['p']['value']} → {r['o']['value']}")
except Exception as e:
    print(e)



All properties for https://w3id.org/hto/OriginalDescription/9910440713804340_97424370_1007501475_0NLS:

http://www.w3.org/1999/02/22-rdf-syntax-ns#type → https://w3id.org/hto#OriginalDescription
http://www.w3.org/ns/prov#wasAttributedTo → https://github.com/francesNLP/MappingChange
https://w3id.org/hto#hasTextQuality → https://w3id.org/hto#Low
https://w3id.org/hto#text → a united parish on the mainland of Orkney, of nine miles in length, with a varying breadth, lying west of Kirkwall. In its centre is the lake of S tennis or Stenhouse, which is nearly divided in two by a narrow shallow, which can be passed over by a sort of causeway of large stones. On the western side are the famous stones of Stonnis, which are only paralleled by those of Stonehenge. Some of these are single, standing jerect in the earth. Others describe particular figures ; but the greatest number form a large circle, surrounded by a ditch. A grent number have fallen. The largest stand between the old kirk of Stenni

### Query 10: Retrieve and inspect a `hto:LocationRecord`

This two-part query focuses on exploring a `hto:LocationRecord`, which represents a semantically enriched article entry linked to a specific place.



#### 🔹 Query 10.1: Retrieve a sample `LocationRecord` URI

This query selects a single resource of type `hto:LocationRecord`. This record aggregates structured metadata about an article that refers to a geographic location.


In [42]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?record WHERE {
  ?record a hto:LocationRecord .
}
LIMIT 1
""")

try:
    ret = sparql.queryAndConvert()
    record_uri = ret["results"]["bindings"][0]["record"]["value"]
    print(f"Using LocationRecord URI: {record_uri}")
except Exception as e:
    print("Could not retrieve a LocationRecord:", e)




Using LocationRecord URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_1007501475_0


#### 🔹 Query 10.2: List all RDF properties of that `LocationRecord`

Once the URI is obtained, this query lists all RDF properties associated with the record. These typically include:

- `hto:name` — the article heading (e.g., "FIRTH AND STENNIS")
- `hto:hasOriginalDescription` — link to the full textual description
- `hto:startsAtPage` / `hto:endsAtPage` — page-level provenance
- `hto:refersToLocation` — the linked `hto:Location` resource

This structure allows rich querying of articles by location, supports place-based exploration, and connects textual content with bibliographic context.

In [43]:
sparql.setQuery(f"""
SELECT ?p ?o WHERE {{
  <{record_uri}> ?p ?o .
}}
""")

try:
    ret = sparql.queryAndConvert()
    print("All properties for the LocationRecord:\n")
    for r in ret["results"]["bindings"]:
        print(f"{r['p']['value']} → {r['o']['value']}")
except Exception as e:
    print(e)


All properties for the LocationRecord:

http://www.w3.org/1999/02/22-rdf-syntax-ns#type → https://w3id.org/hto#LocationRecord
https://w3id.org/hto#hasOriginalDescription → https://w3id.org/hto/OriginalDescription/9910440713804340_97424370_1007501475_0NLS
https://w3id.org/hto#name → FIRTH AND STENNIS
https://w3id.org/hto#endsAtPage → https://w3id.org/hto/Page/9910440713804340_97424370_470
https://w3id.org/hto#startsAtPage → https://w3id.org/hto/Page/9910440713804340_97424370_470
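
The start and end pages are returned as Page URIs that end in a numeric suffix (`..._470`). Assuming that this suffix is the page's position within the volume (the notebook does not state this explicitly), it can be used to estimate how many pages an article spans. Both function names below are hypothetical.

```python
def page_index(page_uri):
    """Return the trailing integer of a Page URI such as '..._97424370_470'."""
    suffix = page_uri.rsplit("_", 1)[-1]
    if not suffix.isdigit():
        raise ValueError(f"no numeric page suffix in {page_uri!r}")
    return int(suffix)


def page_span(start_uri, end_uri):
    """Number of pages covered, counting both the start and the end page."""
    return page_index(end_uri) - page_index(start_uri) + 1


page = "https://w3id.org/hto/Page/9910440713804340_97424370_470"
print(page_index(page), page_span(page, page))
```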



### Query 11: Retrieve article metadata including title, text, and page range

This query aggregates key metadata about Gazetteer articles modeled as `hto:LocationRecord` resources.

Each result includes:
- `hto:name` — the article title or heading (e.g., “FIRTH AND STENNIS”)
- `hto:text` — the full textual content of the article (from `hto:OriginalDescription`)
- `hto:startsAtPage` and `hto:endsAtPage` — the page span in the digitized volume

The query joins the `hto:LocationRecord` with its corresponding `hto:OriginalDescription`, providing a compact view of what each article covers, how long it is, and where it appears in the source volume.

This is useful for content previews, document navigation interfaces, or comparative analysis across editions.


In [45]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?record ?name ?text ?startPage ?endPage WHERE {
  ?record a hto:LocationRecord ;
          hto:name ?name ;
          hto:hasOriginalDescription ?desc ;
          hto:startsAtPage ?startPage ;
          hto:endsAtPage ?endPage .
  ?desc hto:text ?text .
}
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"Article title: {r['name']['value']}")
        print(f"Start page: {r['startPage']['value']}")
        print(f"End page:   {r['endPage']['value']}")
        print(f"Text: {r['text']['value'][:200]}...\n")
except Exception as e:
    print(e)



Article title: FIRTH AND STENNIS
Start page: https://w3id.org/hto/Page/9910440713804340_97424370_470
End page:   https://w3id.org/hto/Page/9910440713804340_97424370_470
Text: a united parish on the mainland of Orkney, of nine miles in length, with a varying breadth, lying west of Kirkwall. In its centre is the lake of S tennis or Stenhouse, which is nearly divided in two b...

Article title: FLEET LOCH
Start page: https://w3id.org/hto/Page/9910440713804340_97424370_471
End page:   https://w3id.org/hto/Page/9910440713804340_97424370_471
Text: an inlet of the sea on the south-east coast of Sutherlandshire, across the narrow neck of which there is a ferry, on the thoroughfare along the coast northwards from Dornoch....

Article title: FRODA
Start page: https://w3id.org/hto/Page/9910440713804340_97424370_486
End page:   https://w3id.org/hto/Page/9910440713804340_97424370_486
Text: FRODA, an islet on the west coast of Skye....

Article title: CLYNE
Start page: https://w3id.org/hto/Page/9910

### Query 12: Retrieve all Gazetteer entries titled "DUNDEE" with article text, volume, series, and year

This query returns all Gazetteer entries with the title `"DUNDEE"` from the knowledge graph, using the `hto:name` property on `hto:LocationRecord`.

For each entry, the query retrieves:
- The full article text (`hto:text`)
- Start and end pages in the source volume (`hto:startsAtPage`, `hto:endsAtPage`)
- The volume (`hto:Volume`) it belongs to, with title
- The series (`hto:Series`) the volume is part of, with title
- The year of publication, resolved from either the volume or the series (`hto:yearPublished`)

`COALESCE` returns the first bound value among its arguments, so `?year` takes the volume's publication year when one is declared and falls back to the series' year otherwise. This accommodates incomplete metadata.

This query is ideal for comparing how a single place, such as Dundee, has been described across different editions and series in the Gazetteers collection.
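The fallback behaviour of `COALESCE` can be mirrored on the client side when working with SPARQL JSON bindings. The sketch below is illustrative only: the `coalesce` helper is hypothetical, the sample rows and their years are made up, and the `"N/A"` default is a Python convenience (in SPARQL, `COALESCE` raises an error and the variable stays unbound when no argument is bound).

```python
def coalesce(binding: dict, *variables: str, default: str = "N/A") -> str:
    """Return the value of the first variable that is bound in a result row."""
    for var in variables:
        if var in binding:  # unbound OPTIONAL variables are absent from the row
            return binding[var]["value"]
    return default


# Made-up rows shaped like SPARQL JSON result bindings.
rows = [
    {"volumeYear": {"type": "literal", "value": "1838"},
     "seriesYear": {"type": "literal", "value": "1803"}},
    {"seriesYear": {"type": "literal", "value": "1868"}},
    {},
]

for row in rows:
    # Argument order decides precedence: volume first, then series.
    print(coalesce(row, "volumeYear", "seriesYear"))
```

Note that the result depends on argument order, not on which year is smaller: the first row yields the volume year even though the series year is earlier.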


In [51]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT DISTINCT ?record ?desc ?text ?startPage ?endPage ?volume ?volumeTitle ?series ?seriesTitle ?year WHERE {
  ?record a hto:LocationRecord ;
          hto:name "DUNDEE" ;
          hto:hasOriginalDescription ?desc ;
          hto:startsAtPage ?startPage ;
          hto:endsAtPage ?endPage .

  ?desc hto:text ?text .

  # Get the volume from the page
  ?volume hto:hadMember ?page .
  FILTER (?page = ?startPage || ?page = ?endPage)

  OPTIONAL { ?volume hto:title ?volumeTitle . }
  OPTIONAL { ?volume hto:wasMemberOf ?series . }
  OPTIONAL { ?series hto:title ?seriesTitle . }
  OPTIONAL { ?volume hto:yearPublished ?volumeYear . }
  OPTIONAL { ?series hto:yearPublished ?seriesYear . }

  # Coalesce year from volume or series
  BIND(COALESCE(?volumeYear, ?seriesYear) AS ?year)
}
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"📘 Record: {r['record']['value']}")
        print(f"📄 Start Page: {r['startPage']['value']} → End Page: {r['endPage']['value']}")
        print(f"📚 Volume: {r.get('volumeTitle', {}).get('value', 'N/A')}")
        print(f"📦 Series: {r.get('seriesTitle', {}).get('value', 'N/A')}")
        print(f"📅 Year: {r.get('year', {}).get('value', 'N/A')}")
        print(f"📝 Text: {r['text']['value'][:300]}...\n")
except Exception as e:
    print(e)



📘 Record: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_769119998_0
📄 Start Page: https://w3id.org/hto/Page/9910440713804340_97424370_258 → End Page: https://w3id.org/hto/Page/9910440713804340_97424370_268
📚 Volume: gazetteer of Scotland. [With plates and maps.] 1838, Volume 1
📦 Series: gazetteer of Scotland. [With plates and maps.]
📅 Year: 1838
📝 Text: DUNDEE. 229 of some high rocks close to the river, and about a quarter of a mile from the church, was erected, in early times, a tolerably secure fortress, similar to that still nearly entire, at Broughty, a few miles farther down the Tay. Little is satisfactorily known of the castle of Dundee. Afte...

📘 Record: https://w3id.org/hto/LocationRecord/9928112733804340_97459138_769119998_0
📄 Start Page: https://w3id.org/hto/Page/9928112733804340_97459138_562 → End Page: https://w3id.org/hto/Page/9928112733804340_97459138_572
📚 Volume: imperial gazetteer of Scotland; or, Dictionary of Scottish topography, compiled from the mo

### Query 13: List the longest Gazetteer articles by text length

This query retrieves the top 10 `hto:LocationRecord` entries in the Gazetteers knowledge graph, ranked by the length of their textual content (`hto:text`).

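Each result includes:
- The article title (`hto:name`)
- The URI of the location record
- The full text, from which the cell prints the character length and a 300-character excerpt
- Volume title and, where available, year of publication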

Sorting articles by character length is a useful heuristic for identifying substantial entries, such as major cities, counties, or complex place groupings, which often span multiple paragraphs or pages. These long entries are good candidates for in-depth analysis or LLM summarization.
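The same ranking can be reproduced client-side on bindings that have already been fetched, which is handy when re-ranking a cached result set without another round trip to the endpoint. This is a minimal sketch, assuming each row carries `name` and `text` variables; the `longest_articles` helper is hypothetical, and the two sample rows reuse short article texts for illustration.

```python
import heapq


def longest_articles(bindings, n=10):
    """Return the n rows with the longest text, longest first."""
    return heapq.nlargest(n, bindings, key=lambda row: len(row["text"]["value"]))


# Rows shaped like SPARQL JSON result bindings.
sample = [
    {"name": {"value": "FRODA"},
     "text": {"value": "FRODA, an islet on the west coast of Skye."}},
    {"name": {"value": "FLEET LOCH"},
     "text": {"value": "an inlet of the sea on the south-east coast of "
                        "Sutherlandshire, across the narrow neck of which "
                        "there is a ferry."}},
]

for row in longest_articles(sample, n=1):
    print(row["name"]["value"], len(row["text"]["value"]))
```

Python's `len` counts characters in the same spirit as `STRLEN`, so the two rankings should agree on the rows they share.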


In [61]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?record ?name ?text ?volumeTitle ?year WHERE {
  ?record a hto:LocationRecord ;
          hto:name ?name ;
          hto:hasOriginalDescription ?desc ;
          hto:startsAtPage ?page .

  ?desc hto:text ?text .

  ?volume hto:hadMember ?page .
  OPTIONAL { ?volume hto:title ?volumeTitle . }
  OPTIONAL { ?volume hto:yearPublished ?year . }
}
ORDER BY DESC(STRLEN(?text))
LIMIT 10
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"📍 Title: {r['name']['value']}")
        print(f"📝 Record: {r['record']['value']}")
        print(f"📚 Volume: {r.get('volumeTitle', {}).get('value', 'N/A')} ({r.get('year', {}).get('value', 'N/A')})")
        print(f"📏 Length: {len(r['text']['value'])} characters")
        print(f"🔍 Excerpt: {r['text']['value'][:300]}...\n")
except Exception as e:
    print(e)





📍 Title: EDINBURGH
📝 Record: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_5424703086_0
📚 Volume: gazetteer of Scotland. [With plates and maps.] 1838, Volume 1 (N/A)
📏 Length: 800374 characters
🔍 Excerpt: EDINBURGH. 285 monarch, held his first parliament in Edinburgh, in the year 1214, and this event served to give it still more the air of a capital and seat of supreme justice. When Alexander, in 1221, married Joan, the princess of England, he made Edinburgh the place of his residence for some time. ...

📍 Title: EDINBURGH
📝 Record: https://w3id.org/hto/LocationRecord/9928112733804340_97459138_5424703086_0
📚 Volume: imperial gazetteer of Scotland; or, Dictionary of Scottish topography, compiled from the most recent authorities, and forming a complete body of Scottish geography, physical, statistical, and historical 1868, Volume 1 (N/A)
📏 Length: 678908 characters
🔍 Excerpt: EDINBURGH. -,:;:; EDINBURGH. burgh and Glasgow railway and the terminus ol' the Nortli British ra

### Query 14: Show article-to-article references (`refersTo`) by name

This query identifies and displays semantic links between Gazetteer entries that refer to one another using the `hto:refersTo` property.

Each result shows:
- The source article title (`hto:name`) and URI (`hto:LocationRecord`)
- The referred-to article’s title and URI

By joining the `refersTo` target with its own `hto:name`, the query outputs human-readable relations such as:

> `CRAWFURDSDIKES refers to GREENOCK`

This is useful for:
- Mapping internal cross-references within the Gazetteers corpus
- Detecting redirects, summaries, or composite place descriptions
- Building link graphs or knowledge navigation tools
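As a sketch of the link-graph use case, cross-reference rows can be folded into an adjacency mapping. It assumes each result row binds the source title as `name` and the referred-to title as `targetName`; those variable names, the helper functions, and the second sample row are illustrative assumptions.

```python
from collections import defaultdict


def build_link_graph(bindings):
    """Map each source title to the sorted titles it refers to."""
    graph = defaultdict(set)
    for row in bindings:
        graph[row["name"]["value"]].add(row["targetName"]["value"])
    return {source: sorted(targets) for source, targets in graph.items()}


def inbound_counts(graph):
    """Count how many distinct source titles point at each target title."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


sample = [
    {"name": {"value": "CRAWFURDSDIKES"}, "targetName": {"value": "GREENOCK"}},
    # Hypothetical row, added only to show a target with two inbound links.
    {"name": {"value": "EXAMPLE PLACE"}, "targetName": {"value": "GREENOCK"}},
]

graph = build_link_graph(sample)
print(graph)
print(inbound_counts(graph))
```

Because titles are reused across records and editions, keying the graph on record URIs instead of titles would be the safer choice for anything beyond a quick overview.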


In [67]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?record ?name ?target ?targetName WHERE {
  ?record a hto:LocationRecord ;
          hto:name ?name ;
          hto:refersTo ?target .
  # Resolve the referred-to record to its own title
  ?target hto:name ?targetName .
}
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"🔗 {r['name']['value']} refers to {r['targetName']['value']}")
        print(f"   ↳ Source URI: {r['record']['value']}")
        print(f"   ↳ Target URI: {r['target']['value']}\n")
except Exception as e:
    print(e)




### Query 15: Identify Gazetteer article titles reused across multiple records

This query counts how many times each `hto:name` (place or article title) appears across the corpus of `hto:LocationRecord` entries.

It groups records by name and returns those names that are used more than once, showing how many distinct records share the same title.

Each result includes:
- The name/title (`hto:name`)
- The number of associated records (e.g., entries across different volumes or years)

This is essential for:
- Detecting reused or ambiguous names (e.g., "LOGIE", "KIRKHILL")
- Understanding how a place was described differently across sources
- Supporting disambiguation, temporal analysis, or cross-edition alignment
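The grouping logic can be sketched in Python for result rows that are already in memory. This is an illustrative equivalent of `GROUP BY` with `HAVING`, assuming each row binds `record` and `name`; the `repeated_names` helper and the sample URIs are hypothetical. Unlike a plain `COUNT(?record)`, it deduplicates (name, record) pairs first, so a record listed twice is counted once.

```python
from collections import Counter


def repeated_names(bindings, min_count=2):
    """Return (name, record_count) pairs for names shared by several records."""
    # Deduplicate first so each record contributes at most once per name.
    pairs = {(row["name"]["value"], row["record"]["value"]) for row in bindings}
    counts = Counter(name for name, _ in pairs)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, n) for name, n in ranked if n >= min_count]


def make_row(name, record):
    """Build a row shaped like a SPARQL JSON binding (placeholder URIs)."""
    return {"name": {"value": name}, "record": {"value": record}}


sample = [
    make_row("LOGIE", "urn:example:1"),
    make_row("LOGIE", "urn:example:2"),
    make_row("LOGIE", "urn:example:3"),
    make_row("KIRKHILL", "urn:example:4"),
    make_row("KIRKHILL", "urn:example:5"),
    make_row("KIRKHILL", "urn:example:5"),  # duplicate row, counted once
    make_row("FRODA", "urn:example:6"),
]

for name, n in repeated_names(sample):
    print(f"{name}: {n} records")
```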


In [70]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>

SELECT ?name (COUNT(?record) AS ?count) WHERE {
  ?record a hto:LocationRecord ;
          hto:name ?name .
}
GROUP BY ?name
HAVING (COUNT(?record) > 1)
ORDER BY DESC(?count)
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    print("🧭 Repeated article names across records:\n")
    for r in ret["results"]["bindings"]:
        print(f"{r['name']['value']} — {r['count']['value']} records")
except Exception as e:
    print(e)




🧭 Repeated article names across records:

KIRKHILL — 36 records
GRANGE — 34 records
LOGIE — 34 records
KINCARDINE — 32 records
KIRKMICHAEL — 31 records
MILTON — 31 records
ABBEY — 30 records
CARRON — 29 records
BANKHEAD — 26 records
NEWTON — 26 records
BENMORE — 25 records
BRIDGEND — 24 records
FLADDA — 23 records
GREENLAW — 22 records
KIRKLAND — 22 records
LADYKIRK — 22 records
NEWBIGGING — 22 records
KILBRIDE — 21 records
ABERDOUR — 20 records
KIRKTON — 20 records


### Query 16: Retrieve Gazetteer articles with alternate names

This query identifies `hto:LocationRecord` entries that include both a primary name (`hto:name`) and one or more alternate or variant names stored using the `rdfs:label` property.

Each result includes:
- The main article title (`hto:name`)
- An alternate name (`rdfs:label`) such as a historical spelling, synonym, or variant
- The URI of the Gazetteer record

These alternate names are typically extracted from metadata fields like “Alternative names” in the original digitized sources. Including them is important for:
- Enhancing place name disambiguation
- Supporting fuzzy search and variant recognition
- Preserving historical name usage and orthographic shifts across editions
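One way to put alternate names to work for variant recognition is a lookup index that maps both the primary and the alternate name to the record URI. The sketch below assumes rows that bind `record`, `name`, and `altName`; the `normalize` and `build_name_index` helpers are hypothetical, and the normalization (case folding plus whitespace collapsing) is a deliberately simple stand-in for real fuzzy matching.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(name.casefold().split())


def build_name_index(bindings):
    """Map every primary and alternate name to the set of record URIs using it."""
    index = defaultdict(set)
    for row in bindings:
        uri = row["record"]["value"]
        index[normalize(row["name"]["value"])].add(uri)
        index[normalize(row["altName"]["value"])].add(uri)
    return index


base = "https://w3id.org/hto/LocationRecord/"
sample = [
    {"record": {"value": base + "9910440713804340_97424370_1838535875_0"},
     "name": {"value": "CONAN OR CONON"}, "altName": {"value": "CONON"}},
    {"record": {"value": base + "9910440713804340_97424370_2174040574_0"},
     "name": {"value": "COLLINGTON OR COLINTON"}, "altName": {"value": "COLINTON"}},
]

index = build_name_index(sample)
print(sorted(index[normalize("Colinton")]))
```

A lookup for a variant spelling then resolves to the same record as the full title, which is the basic ingredient for variant-aware search.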


In [72]:
sparql.setQuery("""
PREFIX hto: <https://w3id.org/hto#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?record ?name ?altName WHERE {
  ?record a hto:LocationRecord ;
          hto:name ?name ;
          rdfs:label ?altName .
}
LIMIT 20
""")

try:
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        print(f"📝 {r['name']['value']} — also known as: {r['altName']['value']}")
        print(f"   ↳ URI: {r['record']['value']}\n")
except Exception as e:
    print(e)



📝 AVEN OR AVON — also known as: AVON
   ↳ URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_1052967722_0

📝 AVEN OR AVON — also known as: AVON
   ↳ URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_1052967722_2

📝 BURGHHEAD OR BURROWHEAD — also known as: BURROWHEAD
   ↳ URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_1360851783_0

📝 BONKLE OR BUNKLE AND PRESTON — also known as: BUNKLE AND PRESTON
   ↳ URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_1385500993_0

📝 CONAN OR CONON — also known as: CONON
   ↳ URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_1838535875_0

📝 BALLERNO OR BALLEDGARNO — also known as: BALLEDGARNO
   ↳ URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_2107624191_0

📝 COLLINGTON OR COLINTON — also known as: COLINTON
   ↳ URI: https://w3id.org/hto/LocationRecord/9910440713804340_97424370_2174040574_0

📝 AUCHTERGAVEN OR AUCHTERGOVAN — also known as: AUCHTERGO