## Semester paper - Making IbsenStage Interoperable

# Chapter 1.1: Introduction and Finding Gaps in Interoperability
The aim of this study is to enrich theatrical data from the IbsenStage database (https://ibsenstage.hf.uio.no/pages/browse/map) to obtain metadata that can be used with other data using RDF and linked data principles.

IbsenStage is a relational database that keeps track of all the known productions of Henrik Ibsen's plays around the world. The platform was built using MariaDB/MySQL technology and it is based on the Australian performance database AusStage (Evensen 2020, 229). AusStage focuses on the national theater scene in Australia, while IbsenStage takes a biographical and international approach, following the global performance history of Ibsen's plays from 1850 to the present (Bollen 2016, 621).

IbsenStage is an integral component of the Virtual Ibsen Centre at the Centre for Ibsen Studies (CIS) and expands upon the existing repertoire database Ibsen.net. The Centre has systematically integrated content from other areas of the Ibsen.net ecosystem into this virtual platform, in line with CIS’s mission to support research, documentation, education, and dissemination related to Henrik Ibsen.

A key feature of the database is its interactive world map, which allows users to search and visualize performance data by geographic location, play title, language, participating actors, and theatre groups, and to follow the results chronologically.

Issues of data quality emerge under closer examination: the provided resource identifiers are not linked to external authoritative sources. These IDs appear to follow an internal and arbitrary structure, limiting the potential for interoperability with other datasets. 
For example, the play _Peer Gynt_ is associated with the internal identifier _8539_ on IbsenStage (https://ibsenstage.hf.uio.no/pages/work/8539), but this ID is not aligned with the corresponding entry in an authoritative source such as Wikidata (https://www.wikidata.org/wiki/Q208094). Even though the Wikidata entry for _Peer Gynt_ references the IbsenStage page (on references in Wikidata, see Erxleben et al. 2014, 54), this link is unidirectional: IbsenStage does not indicate anywhere on its platform that the record is connected to Wikidata, nor does its internal identifier correspond to the standardized ID used by Wikidata. Because the identifiers are neither linked nor aligned, it is hard for different platforms to work together, and the data cannot easily be integrated.

# Chapter 1.2 - Research Question

Based on these limitations, the following research question is formulated:

     How can theatrical production data from IbsenStage be transformed into interoperable metadata using linked data principles? 

This primary research question is addressed through three interconnected objectives:

1) Semantic data normalization and external authority linking will establish a clean, standardized dataset by resolving title duplicates and mapping theatrical works to established identifiers in VIAF and Wikidata.
2) RDF/XML modeling using appropriate vocabularies will transform the normalized data into interoperable metadata schemas that adhere to linked data principles and support semantic querying.
3) Publication on GitHub: the exam folder is uploaded to a GitHub repository as a first attempt at publishing the data.

# Chapter 2.1 - Theoretical Framework

The transformation of theatrical production data from IbsenStage into interoperable metadata can be framed using the theoretical model proposed by Haslhofer and Klas (2010), in connection with the data quality framework of Wang and Strong (1996). Both are complemented by the tenets of FRBR (Tillett 2005).

In the context of IbsenStage, the absence of links to external authorities (e.g. Wikidata) presents a case of limited semantic interoperability, since the internal metadata is not aligned with global standards or shared URIs. Following Haslhofer and Klas, the first step involves aligning internal records, such as play titles, with authoritative external resources. This requires establishing an explicit connection between equivalent entities (e.g. _Peer Gynt_ → Wikidata:Q208094), a task that can be accomplished through systematic mappings. Once normalization is achieved, the second step involves converting the data into triples and adopting standard RDF vocabularies, including Schema.org and Dublin Core, to facilitate machine readability. In addition, a clear relationship among events, creative works, and geographic locations can be established. This aligns with Haslhofer and Klas’s third strategy: adopting shared ontologies as a mechanism for interoperability, which supports cross-system data reuse.

Addressing more complex semantic and organizational interoperability challenges, Lewis et al. (2008) define interoperability as:
    The ability of a collection of communicating entities to (a) share specified information and (b) operate on that information according to an agreed operational semantics (165)

Therefore, Linked Data principles (Berners-Lee 2006) have to be implemented, using dereferenceable URIs and explicit links to other datasets. Implementing these structural and semantic transformations allows IbsenStage to overcome the limitations of its original relational database format and facilitates integration with global systems such as Wikidata. As Haslhofer and Klas write:

    [T]he approach for achieving interoperability on the metadata instance level, when there is no agreement on value encoding schemes or other standardization mechanisms. Instance transformations are functions that operate on the content values and perform a specified operation [...] (25)


Wang and Strong's foundational 1996 study stresses "the importance of the role of systems" (p. 22). In the IbsenStage database, this role raises several challenges, especially concerning *Representational* and *Accessibility* data quality. The core problem is a lack of interoperability, as discussed in section 1.1. Crucially, none of Ibsen's major works on IbsenStage link to authoritative external sources, even though Wikidata holds work IDs with clear and unambiguous connections. IbsenStage currently lacks hypermedia links (HATEOAS, level 3 of the Richardson Maturity Model) that would validate these IDs by pointing to those external sources. As will be demonstrated later, inconsistencies between IbsenStage and Wikidata stem from failures in ID correspondence at the instance level, the lowest level of abstraction (Haslhofer and Klas 2010, p. 27). This study will not, however, delve into schema-level transformations.

A critical task for this study is to establish a *connection* between the actual name of a piece (its *manifestation*) and its original *work*. A uniform naming structure, like the one FRBR proposes for works (Tillett 2005), offers a stable conceptual anchor. This anchor allows metadata mappings and schema alignments to refer to a consistent point, which in turn facilitates consistent interpretation and integration across diverse systems. This not only ensures uniform access but also boosts semantic clarity and supports automated reasoning across repositories that might use different metadata standards or vocabularies. Equally important is enhancing the user-friendliness of the data, which directly improves its intrinsic data quality. As Wang and Strong (1996) point out:

    Intrinsic DQ includes not only accuracy and objectivity, which are evident to IS professionals, but also believability and reputation. This suggests that, contrary to the traditional development view, data consumers also view believability and reputation as an integral part of intrinsic DQ (p. 20).

Pepe et al. (2009) likewise advocate for a more connected and interoperable landscape, arguing that clear, structured relationships linking agents, artifacts, and events are essential for accurately modeling scientific work and enabling semantic interoperability (p. 573).

# Chapter 2.2 - Metadata vocabularies - Dublin Core VS. Schema.org

Here I present two vocabularies that are useful for the purpose of this exam. The first is the Dublin Core Metadata Initiative (DCMI) vocabulary [https://www.dublincore.org/specifications/dublin-core/dcmi-terms/], which is widely used for describing resources across different domains. It includes 15 fundamental elements, among them *title*, *creator*, *subject*, and *date*, which provide a base for metadata records. In addition, DCMI offers more specific, qualified terms that enable greater precision and contextual detail in metadata descriptions. For complex relationships, such as adaptations, translations, or reinterpretations of canonical works, it provides terms like `dcterms:isVersionOf`, `dcterms:source`, and `dcterms:relation`. Each term is identified by a unique URI within an RDF namespace.

The second vocabulary, Schema.org, is mostly used for web-scale descriptions. As described on its website, Schema.org's vocabularies "[...] cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model".

These two metadata schemas differ in several aspects. Dublin Core, established in 1995, is the older of the two and has a broad, general-purpose schema. Schema.org, on the other hand, was launched in 2011 with the goal of facilitating structured data markup on web pages and enabling richer search results. While Dublin Core is well-suited for general cross-domain interoperability through its simplicity, Schema.org is more detailed and more effective for enhancing online visibility and for harnessing the capabilities of the semantic web, particularly in contexts involving database-driven platforms.

Considering the information above, the choice falls on Schema.org, with a minor hybrid implementation of Dublin Core. Schema.org was made for the web and has specific types like `TheaterEvent`, `CreativeWork`, `Organization`, and `Person`. This makes it much easier to describe performances, plays, persons, and organizations in a way that is understandable to humans. It also works well with linked data tools and connects easily to outside sources like Wikidata and VIAF, which is important for making systems operate together better. Dublin Core is good for fundamental information like titles, dates, and IDs, but it is too generic to give the same amount of depth for cultural or performance data. That said, it makes sense to use both together: **Schema.org for the more detailed modeling and Dublin Core for the basic descriptive parts**.

Again, by relying on the FRBR model, one can distinguish between *Work*, *Expression*, *Manifestation*, and *Item*. Maintaining a relationship between a play's actual title ("Peer Gynt") and a uniform work identifier (e.g., `Q208094` in Wikidata) is essential for disambiguation and interoperability.

By aligning internal IbsenStage identifiers to external references, I also follow best practices in linked data design and FAIR principles—ensuring data is *Findable*, *Accessible*, *Interoperable*, and *Reusable*.

# Chapter 2.3 - The authoritative sources - VIAF and Wikidata

Wikidata and VIAF serve distinct but complementary roles as authoritative sources on the Semantic Web. 

VIAF (Virtual International Authority File), maintained by OCLC [https://www.oclc.org/en/viaf.html], is a highly controlled aggregation of standardized name authority records from national libraries worldwide. It is optimized for name disambiguation and identity management, linking multiple library identifiers (like ISNI and LCCN) using RDF and `owl:sameAs` semantics. Its strength lies in offering reliable and persistent identifiers for personal and corporate names, though it provides limited semantic depth beyond these identifiers. 

In contrast, Wikidata is an open, collaboratively edited knowledge graph hosted by the Wikimedia Foundation (Erxleben et al. 2014). It supports a wide range of entity types (people, places, works, concepts) and expresses them using rich, multilingual semantic triples. Wikidata is dynamic and flexible, with a robust ontology that supports full SPARQL querying and integration into complex Linked Data workflows. While it may lack the formal standardization of VIAF, it excels in breadth, update frequency, and semantic depth, making it especially valuable for inferencing, ontology modeling, and interdisciplinary data linking. Erxleben et al. (2014) develop RDF exports of Wikidata and describe how "Wikidata uses a uniform scheme for URIs of all entities, i.e., items and properties [...] They implement content negotiation and redirect to the most suitable data, which might be an RDF document with basic information about the entity, or the HTML page of the entity on Wikidata" (56).

Both systems are widely used and often interconnected (Wikidata includes VIAF identifiers for many entities), allowing for enhanced interoperability in bibliographic systems and semantic applications.
The choice therefore falls on Wikidata as the main URI source. The next chapter discusses how the plan outlined here is implemented.


# Chapter 3.1 - Methodology

The dataset consists of a comprehensive JSON file obtained by scraping the original CSV data provided by IbsenStage [https://ibsenstage.hf.uio.no/pages/search]. The file includes 4924 entries, covering **all performances of Ibsen’s plays in Norway** recorded in the database. This starting JSON file is called `IbsenStage_scrape.json` and is located in the folder `Ibsenstage_raw`. The naming conventions for this exam are discussed in section 3.6.

The JSON file from IbsenStage contains the following keys, in order:
- `eventname`: The name of the performance
- `eventid`: The internal ID of the event (non-interoperable)
- `first_date`: The first known date of the performance
- `workid`: The internal ID (non-interoperable) of the Ibsen play the event is based on 
- `worktitle`: The canonical title of the play (e.g. *A Doll’s House*)
- `venueid`: Internal venue ID (non-interoperable)
- `venuename`: Name of the venue
- `venuecountry`: Always "Norway" in this dataset

## 3.2 Cleaning and Normalizing the Data

Before any semantic modeling could begin, the raw data required significant cleaning and normalization.

### Title Normalization
One of the first challenges is **normalizing play titles**. Although `worktitle` consistently reflects the English name of the play (e.g. "A Doll’s House"), `eventname` often varies. The variants include alternative spellings, translations, and reinterpretations (e.g. *Nora*, *Et dukkehjem*, *Ett dockhem*, *Casa di bambola*, all pointing to the same canonical play).

The next step is to group these variants under unified representations using controlled mappings and then link them to **Wikidata IDs**. For example:

- *A Doll’s House* → `Q669694`
- *Peer Gynt* → `Q208094`
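
The controlled mapping described above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the dictionary names and the helper functions are hypothetical, and the variant list only contains the titles mentioned in this section.

```python
import unicodedata

# Hypothetical controlled mapping: normalized variant -> canonical English title.
# Only the variants named in the text are listed; a real mapping would be longer.
VARIANT_TO_CANONICAL = {
    "nora": "A Doll's House",
    "et dukkehjem": "A Doll's House",
    "ett dockhem": "A Doll's House",
    "casa di bambola": "A Doll's House",
    "a doll's house": "A Doll's House",
    "peer gynt": "Peer Gynt",
}

# Canonical title -> Wikidata QID, as listed above.
CANONICAL_TO_QID = {
    "A Doll's House": "Q669694",
    "Peer Gynt": "Q208094",
}


def normalize_key(title: str) -> str:
    """Unify apostrophes, case and whitespace so that variants compare equal."""
    text = unicodedata.normalize("NFKC", title).replace("\u2019", "'")
    return " ".join(text.casefold().split())


def resolve_title(eventname: str):
    """Return (canonical title, QID), or (None, None) for an unknown variant."""
    canonical = VARIANT_TO_CANONICAL.get(normalize_key(eventname))
    return canonical, CANONICAL_TO_QID.get(canonical)
```

Unknown variants return `(None, None)` instead of raising, so that unmatched titles can be collected and reviewed by hand.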

### Removing the country and adding the city

Secondly, once the titles are cleaned and normalized, the field `venuecountry` is removed, since it always holds the same value. In its place, a new field called `venuecity` is added. It records where a specific event was staged and helps identify the venue location, as many of the small 19th-century venues are no longer active. To map each venue to a single city in Norway, a geocoding library or service that resolves place names to their associated cities, such as Geopy or GeoNames, is necessary. This approach helps find a city based on the key `venuename`.
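
A possible shape for this step is sketched below. To keep the example runnable offline, the geocoder is passed in as a plain function. `resolve_city`, `add_city`, and `fake_lookup` are hypothetical names, and the stand-in lookup table is illustrative data rather than a result from the dataset.

```python
from typing import Callable, Optional

# Address keys tried in order; a place may be reported as city, town or village.
CITY_KEYS = ("city", "town", "village", "municipality")


def resolve_city(venuename: str,
                 lookup: Callable[[str], Optional[dict]]) -> Optional[str]:
    """Return the city for a venue, or None when the geocoder finds nothing.

    `lookup` is any function mapping a query string to an address dict (or
    None). Injecting it keeps the helper independent of a specific service.
    """
    address = lookup(f"{venuename}, Norway")
    if not address:
        return None
    for key in CITY_KEYS:
        if address.get(key):
            return address[key]
    return None


def add_city(record: dict, lookup) -> dict:
    """Copy a record, drop `venuecountry`, and add `venuecity`."""
    cleaned = {k: v for k, v in record.items() if k != "venuecountry"}
    cleaned["venuecity"] = resolve_city(record["venuename"], lookup)
    return cleaned


def fake_lookup(query: str) -> Optional[dict]:
    """Offline stand-in for a geocoding service (illustrative data only)."""
    known = {
        "Nationaltheatret Amfiscenen, Norway": {"city": "Oslo", "country": "Norway"},
    }
    return known.get(query)
```

With Geopy, `lookup` could wrap `Nominatim(user_agent=...).geocode(query, addressdetails=True)` and return the `address` entry of the resulting location's `raw` dictionary. Venues that no longer exist will often return nothing and need manual review.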

### External Identifier Linking
Even if `eventid`, `workid`, and `venueid` are usable locally, they are **not interoperable** and are thus meaningless outside IbsenStage. To align this dataset with the broader semantic web, I need to map several key fields to authority datasets:
- **Plays** → Wikidata (e.g. `"A Doll's House"` → `https://www.wikidata.org/wiki/Q669694`)
- **Venues and cities** → Wikidata where available; otherwise I fall back on city URIs
- **Author** (Henrik Ibsen) → VIAF `71378383` and Wikidata `Q36661`
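
The identifiers listed above can be turned into full URIs with a few small helpers. The sketch below is illustrative: the function names are hypothetical, the IbsenStage pattern is inferred from the page URLs quoted in this paper, and the Wikidata and VIAF base URIs are the entity-level forms (`http://www.wikidata.org/entity/` and `http://viaf.org/viaf/`) rather than the human-readable page addresses.

```python
import re

WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"
VIAF_BASE = "http://viaf.org/viaf/"
IBSENSTAGE_PAGES = "https://ibsenstage.hf.uio.no/pages/"


def wikidata_uri(qid: str) -> str:
    """Build an entity URI from a QID such as 'Q669694'."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return WIKIDATA_ENTITY + qid


def viaf_uri(viaf_id: str) -> str:
    """Build a VIAF URI from a numeric cluster ID."""
    if not viaf_id.isdigit():
        raise ValueError(f"not a VIAF ID: {viaf_id!r}")
    return VIAF_BASE + viaf_id


def ibsenstage_uri(kind: str, internal_id) -> str:
    """Build the IbsenStage page URI for an event, work or venue."""
    if kind not in {"event", "work", "venue"}:
        raise ValueError(f"unknown record kind: {kind!r}")
    return f"{IBSENSTAGE_PAGES}{kind}/{internal_id}"


# The author record with both authority links named in the list above.
AUTHOR_LINKS = {
    "name": "Henrik Ibsen",
    "sameAs": [wikidata_uri("Q36661"), viaf_uri("71378383")],
}
```

Validating the ID format before building the URI catches mapping mistakes, such as a stray title in a QID column, before they reach the RDF output.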

### Type Assignment
Each record can be semantically connected to a Schema.org type or property:
- Events as `schema:TheaterEvent`
- Plays as `schema:Play`
- Venues as `schema:EventVenue`
- First date of performance through the property `schema:firstPerformance`
- Country as `schema:Country` (not important and therefore ignored)

This step ensures that every resource is treated appropriately in RDF modeling.

## 3.3 Modeling the Data: Using Schema.org and Dublin Core

With clean, structured data in hand, I need to model the RDF representation using **Schema.org** as the primary vocabulary.

Each event in the dataset is expressed as a `schema:TheaterEvent`, which is linked to the work it performs (`schema:CreativeWork`), more specifically to a `schema:Play`, and the venue where it took place (`schema:EventVenue`), along with other metadata like date (`schema:firstPerformance`).

The model is expressed in RDF, a standard model for describing resources and facilitating data interchange on the Web [https://www.w3.org/RDF/].


Within this exam, the RDF modeling approach combines multiple vocabularies to balance semantic precision with cross-domain compatibility. I use IbsenStage URIs to identify internal entities; this namespace anchors the dataset to its original context while still allowing links to external Linked Data sources.
To handle performance dates such as "1880-01-30", I use XML Schema datatypes (`xsd:date`), which ensures the dates are treated as actual dates rather than just text.
When it comes to vocabularies, Schema.org is my primary choice for describing entities and their relationships. I also rely on Dublin Core Terms to cover important gaps, particularly in bibliographic metadata. For instance, I use `dcterms:identifier` to preserve IbsenStage’s original internal identifiers (such as `eventid`, `workid`, and `venueid`). This helps keep the RDF version closely linked to its source.

Geographical data is handled using a hierarchical strategy, as shown in `code1_IbsenStage_staged`. When authoritative venue URIs are available in Wikidata, venues are typed as `schema:EventVenue` and linked directly using `schema:sameAs`. If not, the model falls back on referencing the associated city through `schema:location`, using city-level URIs from Wikidata to preserve contextual accuracy.

To express identity relationships, `schema:sameAs` statements are used to indicate when an internal IbsenStage entity corresponds to an external authority like Wikidata. 

```
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema1: <http://schema.org/> .
# Wikidata's canonical entity URIs use the http scheme
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The event: typed date, city-level location, and link to the work performed
<https://ibsenstage.hf.uio.no/pages/event/100168> a schema1:TheaterEvent ;
    dcterms:identifier "100168" ;
    schema1:firstPerformance "2020-01-06"^^xsd:date ;
    schema1:location wd:Q585 ;
    schema1:name "Hærmennene på Helgeland" ;
    schema1:workPerformed wd:Q3285405 .

# The play: internal ID preserved, aligned with Wikidata through sameAs
<https://ibsenstage.hf.uio.no/pages/work/8534> a schema1:Play ;
    dcterms:identifier "8534" ;
    schema1:name "The Vikings at Helgeland" ;
    schema1:sameAs wd:Q3285405 .

# The venue: no Wikidata match, so the event above falls back on the city
<https://ibsenstage.hf.uio.no/pages/venue/11709> a schema1:EventVenue ;
    dcterms:identifier "11709" ;
    schema1:name "Nationaltheatret Amfiscenen" .

```



This **hybrid model** allows for both semantic richness and cross-domain compatibility, while preparing the data for semantic publication and SPARQL querying.

## 3.4 Transformation Pipeline - Implementation Workflow

The transformation pipeline consists of the following steps.

In the `staged` folder, the Jupyter notebook `code1_IbsenStage_staged` performs the following:

    1. Loading the JSON file (`IbsenStage_scrape.json`) in Python using `json` and `pandas`
    
    2. Cleaning fields

    3. Mapping (work and venue refinement): matching work and venue names to authoritative URIs. When a venue does not match a URI in Wikidata, it is connected to the URI of its city.
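
The loading and cleaning steps could look roughly like this. The sketch assumes that the scrape is a JSON array of flat objects with the keys listed in section 3.1, and that cleaning means trimming whitespace and dropping repeated `eventid` values. The function names are hypothetical.

```python
import json

import pandas as pd

# Free-text columns that are trimmed during cleaning.
TEXT_COLUMNS = ["eventname", "worktitle", "venuename"]


def load_events(path: str) -> pd.DataFrame:
    """Read the scrape (assumed to be a JSON array of flat objects)."""
    with open(path, encoding="utf-8") as handle:
        return pd.DataFrame.from_records(json.load(handle))


def clean_events(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a tidy copy: trimmed text fields, one row per eventid."""
    cleaned = frame.copy()
    for column in TEXT_COLUMNS:
        cleaned[column] = cleaned[column].str.strip()
    cleaned = cleaned.drop_duplicates(subset="eventid", keep="first")
    return cleaned.reset_index(drop=True)
```

Working on a copy leaves the raw dataframe untouched, in keeping with the read-only status of the raw folder.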

    
In the curated folder, the Jupyter notebook `code2_IbsenStage_curated` guides the final steps of the process. Here’s what it does:

    1. Generates RDF triples using the rdflib library

    2. Serializes the data into RDF/XML format

    3. Validates the RDF using the W3C RDF Validator and Linked Data browser tools

The end result is the file `ibsenstage_triplets.rdf`, which becomes available through a direct raw RDF/XML link once the folder is uploaded to GitHub.

## 3.5 Why This Matters

This modeling process turns a static, single-platform dataset into a **flexible, semantically enriched knowledge graph** that can be:
- Queried semantically (e.g. "All performances of *A Doll’s House* before 1900 in Bergen")
- Linked to external datasets such as **Wikidata**
- Shared and reused across institutions, libraries, and cultural heritage networks


## 3.6 Naming Conventions of the Files

The following naming conventions are used in the exam. The files are placed in three folders that follow a Raw → Staged → Curated structure:

1. Raw = exact scrape dump (read-only).
2. Staged = tidy dataframe after cleaning, fixes, duplicate removal, and normalization.
3. Curated = analytical products (linked data, RDF triples, statistics).

These folders include the following files:

- `IbsenStage_scrape.json`: the original JSON file scraped from IbsenStage, including the raw 4924 hits as presented on the website.
- `code1_IbsenStage_staged`: the notebook that contains the first transformations, in which the titles under `eventname` are normalized.
- `code2_IbsenStage_curated`: the notebook in which the cleaned data is linked to authoritative sources and transformed into RDF triples.

# Bibliography

- Berners-Lee, Tim. 2006. “Linked Data.” W3C Design Issues. https://www.w3.org/DesignIssues/LinkedData.html.
- Bollen, Jonathan. 2016. “Data Models for Theatre Research: People, Places, and Performance.” *Theatre Journal* 68, no. 4: 615–32. https://doi.org/10.1353/tj.2016.0109.
- Erxleben, Fredo, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny Vrandečić. 2014. “Introducing Wikidata to the Linked Data Web.” *Lecture Notes in Computer Science*: 50–65. https://doi.org/10.1007/978-3-319-11964-9_4.
- Evensen, Nina Marie. 2020. “Inheriting Digital Projects: How to Keep Ibsen Alive Online.” *CEUR Workshop Proceedings* 2612.
- Haslhofer, Bernhard, and Wolfgang Klas. 2010. “A Survey of Techniques for Achieving Metadata Interoperability.” *ACM Computing Surveys* 42, no. 2: 1–37. https://doi.org/10.1145/1667062.1667064.
- Lewis, Grace A., Edwin Morris, Soumya Simanta, and Lutz Wrage. 2008. “Why Standards Are Not Enough to Guarantee End-to-End Interoperability.” *Seventh International Conference on Composition-Based Software Systems (ICCBSS 2008)*: 164–73. https://doi.org/10.1109/iccbss.2008.25.
- Pepe, Alberto, Matthew Mayernik, Christine L. Borgman, and Herbert Van de Sompel. 2009. “From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web.” *Journal of the American Society for Information Science and Technology* 61, no. 3: 567–82. https://doi.org/10.1002/asi.21263.
- Rahm, Erhard, and Hong Hai Do. 2000. “Data Cleaning: Problems and Current Approaches.” *IEEE Data Engineering Bulletin* 23, no. 4: 3–13.
- Tillett, Barbara. 2005. “What Is FRBR? A Conceptual Model for the Bibliographic Universe.” *The Australian Library Journal* 54, no. 1: 24–30. https://doi.org/10.1080/00049670.2005.10721710.
- Wang, Richard Y., and Diane M. Strong. 1996. “Beyond Accuracy: What Data Quality Means to Data Consumers.” *Journal of Management Information Systems* 12, no. 4: 5–33. https://doi.org/10.1080/07421222.1996.11518099.