# 1. The Case for RDF: (Semantic) Data Interoperability, Reasoning and Data Validation Based on Open Standards

This presentation/notebook makes the case for RDF and demonstrates three capabilities that RDF can bring:

- (Semantic) Data Interoperability
- Reasoning, and
- Data Validation

Note: There are more capabilities to highlight, such as (querying) federated data, URI dereferencing and enabling FAIR data. For more information, feel free to contact me!

[amir.westhoff@capgemini.com](mailto:amir.westhoff@capgemini.com)

## 1.1. Data

In the folder `assets`, there are three separate files with data in three different formats:
- data.csv
- data.json
- data.xml

We'll assume that they have been developed separately, based on different schemas. 

## 1.2. Standards and Vocabularies: RDF, RDFS, OWL, SHACL

For reference, the (open) standards and vocabularies used further down in this notebook are the following:

- RDF: [https://www.w3.org/TR/rdf11-concepts/](https://www.w3.org/TR/rdf11-concepts/)
- RDFS: [https://www.w3.org/TR/rdf-schema/](https://www.w3.org/TR/rdf-schema/)
- OWL: [https://www.w3.org/TR/owl2-syntax/](https://www.w3.org/TR/owl2-syntax/)
- SHACL: [https://www.w3.org/TR/shacl/](https://www.w3.org/TR/shacl/)

Click on the links for their specifications.

## 1.3. First: RDF Basics

In [1]:
from rdflib import Graph

def parse_turtle_rdf(rdf_string: str):
    """Parses a string containing RDF in Turtle syntax."""
    graph = Graph()

    try:
        graph.parse(data=rdf_string, format="turtle")
        print("✅ RDF parsed successfully!")
        return graph
    except Exception as e:
        print(f"❌ Failed to parse RDF: {e}")

def serialize_rdf(graph: Graph, format: str):
    """Serializes and prints an RDF graph in the specified format."""
    try:
        rdf_data = graph.serialize(format=format)
        print(format)
        print(rdf_data)
    except Exception as e:
        print(f"❌ Failed to serialize RDF: {e}")


Let's say you want to express that you work at Capgemini US:

- your HTTP-URI is `http://capgemini.com/123`;
- the HTTP-URI for Capgemini is `http://capgemini.com/capgemini-us`;
- the HTTP-URI to signify the relationship between you and Capgemini is `http://capgemini.com/worksAt`
- to add your name, you reuse a property from the FOAF vocabulary: `foaf:name`

In [None]:
# Example usage

rdf_string = """

prefix cap: <http://capgemini.com/>         # you can use prefixes to shorten URIs
prefix foaf: <http://xmlns.com/foaf/0.1/>   # you can reuse classes and attributes from other vocabularies

cap:123                                     # this syntax allows not having to repeat the subject
    cap:worksAt cap:capgemini-us ;          # this triple expresses that you work at Capgemini US
    foaf:name "your-name-here" .            # this triple holds your name, reusing foaf:name

"""

rdf_graph = parse_turtle_rdf(rdf_string)

RDF has several serialization options:

In [None]:
# RDF/XML
serialize_rdf(rdf_graph, "xml")


In [None]:
# JSON-LD
serialize_rdf(rdf_graph, "json-ld")  # JSON-LD

In [None]:
# NT
serialize_rdf(rdf_graph, "nt") 

# 2. (Semantic) Data Interoperability

## 2.1. Transforming CSV

In [None]:
# Transforming csv

import pandas as pd
from rdflib import Graph, Literal, RDF, URIRef, Namespace, FOAF

# Define the CSV file path
csv_file_path = './assets/data.csv'

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(csv_file_path)

# Create an RDF graph
g = Graph()

# Define a namespace
EX = Namespace("http://example.org/")
SDO = Namespace("http://schema.org/")

# Bind namespaces to the graph
g.bind("ex", EX)
g.bind("sdo", SDO)

# Loop through each row in the DataFrame
for index, row in df.iterrows():
    
    # Check if the essential fields are filled before assignment
    if pd.notna(row['id']) and pd.notna(row['FirstName']) and pd.notna(row['LastName']):
        
        # Extract values from the row
        id = row['id']
        name = f"{row['FirstName']} {row['LastName']}"
        dob = row['DateOfBirth'] if pd.notna(row['DateOfBirth']) else None
        home = row['ComesFrom'] if pd.notna(row['ComesFrom']) else None
        instrument = row['Instrument'] if pd.notna(row['Instrument']) else None
        
        # Create RDF triples, adding only if fields are not None
        subject = URIRef(f"http://example.org/{id}")
        
        if id:
            g.add((subject, RDF.type, EX.Person))
        if name:
            g.add((subject, FOAF.name, Literal(name)))
        if dob:
            g.add((subject, SDO.birthDate, Literal(dob)))
        if home:
            g.add((subject, EX.home, Literal(home)))
        if instrument:
            g.add((subject, EX.playsInstrument, URIRef(f"{EX}{instrument}")))

# Serialize the graph to an RDF file
output_file = './assets/transformed_csv.ttl'
g.serialize(destination=output_file, format='turtle')

print(f"Graph written to {output_file}")



## 2.2. Transforming JSON

In [None]:
# Transforming json

import json
from rdflib import Graph, Literal, RDF, URIRef, Namespace, FOAF

# Path to the JSON data
file_path = './assets/data.json'

# Open the file and load the content
with open(file_path, 'r') as f:
    try:
        json_data = json.load(f)  # json.load reads directly from a file object
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        json_data = []  # fall back to an empty list so the loop below is skipped

# Create an RDF graph
g = Graph()

# Define a namespace
EX = Namespace("http://example.org/")
SDO = Namespace("http://schema.org/")

# Bind namespaces to the graph
g.bind("ex", EX)
g.bind("sdo", SDO)

# Iterate over each object in the JSON array
for person in json_data:

    # Skip records without an id, since the subject URI is built from it
    if 'id' not in person:
        continue

    # Create a unique subject URI for each person based on their id
    subject = URIRef(f"{EX}{person['id']}")
    g.add((subject, RDF.type, EX.Person))

    if 'fullName' in person:
        g.add((subject, FOAF.name, Literal(person['fullName'])))
    if 'home' in person:
        g.add((subject, EX.home, Literal(person['home'])))
    if 'playsInstrument' in person:
        g.add((subject, EX.playsInstrument, URIRef(f"{EX}{person['playsInstrument']}")))
    if 'aka' in person:
        g.add((subject, FOAF.nick, Literal(person['aka'])))

# Serialize the graph to a Turtle file
output_file = './assets/transformed_json.ttl'
g.serialize(destination=output_file, format='turtle')

print(f"Graph written to {output_file}")

## 2.3. Transforming XML

In [None]:
# Transforming xml

from lxml import etree
from rdflib import Graph, Literal, RDF, URIRef, Namespace, FOAF

# Load XML from a file
tree = etree.parse('./assets/data.xml')
root = tree.getroot()

# Create an RDF graph
g = Graph()

# Define a namespace
EX = Namespace("http://example.org/")
SDO = Namespace("http://schema.org/")

# Bind namespaces to the graph
g.bind("ex", EX)
g.bind("sdo", SDO)

# Iterate over each person in the XML
for person in root.findall('Person'):
    # Extract fields from XML
    # findtext() returns None when the child element is missing
    id = person.findtext('id')
    full_name = person.findtext('FullName')
    alias = person.findtext('Alias')
    born_in = person.findtext('BornIn')
    date_of_birth = person.findtext('DOB')
    address = person.findtext('Address')
    phone = person.findtext('Phone')
    
    # Check if essential fields are filled before creating RDF triples
    if id and full_name:
        
        # Create RDF triples, adding only if fields are not None
        subject = URIRef(f"http://example.org/{id}")
        if id:
            g.add((subject, RDF.type, EX.Person))
        if full_name:
            g.add((subject, FOAF.name, Literal(full_name)))
        if alias:
            g.add((subject, FOAF.nick, Literal(alias)))
        if born_in:
            g.add((subject, EX.home, Literal(born_in)))
        if date_of_birth:
            g.add((subject, SDO.birthDate, Literal(date_of_birth)))
        if address:
            g.add((subject, SDO.address, Literal(address)))
        if phone:
            g.add((subject, SDO.telephone, Literal(phone)))

# Serialize the graph to a Turtle file
output_file = './assets/transformed_xml.ttl'
g.serialize(destination=output_file, format='turtle')

print(f"Graph written to {output_file}")

## 2.4. Simulating and querying a SPARQL endpoint

### 2.4.1. Loading the RDF Data in a Combined Graph

In [9]:
# Loading RDF data into a Graph

import rdflib

combined_graph = rdflib.Graph()

# List of turtle files
turtle_files = [
    "./assets/transformed_csv.ttl",
    "./assets/transformed_json.ttl",
    "./assets/transformed_xml.ttl"
]

# Load each Turtle file and copy its triples into the combined graph
for file in turtle_files:
    g = rdflib.Graph()
    g.parse(file, format="turtle")
    
    # Add triples to combined graph
    for s, p, o in g:
        combined_graph.add((s, p, o))


### 2.4.2. Querying the Graph with SPARQL

In [None]:
# Querying the Graph with SPARQL

from IPython.display import display, HTML
import pandas as pd

# Define a simple SPARQL query
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
PREFIX sdo: <http://schema.org/>

SELECT ?person ?name ?nickname ?birthDate ?home ?instrument
WHERE {
    ?person a ex:Person  ;
        foaf:name ?name ;
        ex:home ?home .
    OPTIONAL { ?person foaf:nick ?nickname }
    OPTIONAL { ?person ex:playsInstrument ?instrument }
    OPTIONAL { ?person sdo:birthDate ?birthDate }
}
"""

# Execute the query
results = combined_graph.query(query)

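# Convert the results to a Pandas DataFrame
# (unbound OPTIONAL variables become empty strings instead of the text "None")
data = []
for row in results:
    data.append({str(var): "" if row[var] is None else str(row[var]) for var in row.labels})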

df = pd.DataFrame(data)

# Display the DataFrame as an HTML table
html_table = df.to_html()
display(HTML(html_table))

# 3. Reasoning

## 3.1. Loading a model/ontology

In [None]:
# A simple ontology about persons and musicians

ontology_data = """
# @prefix : <http://example.org/ontology#> .
@prefix : <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

### Ontology header
<http://example.org/>
    rdf:type owl:Ontology ;
    rdfs:comment "A simple ontology about persons and musicians." .

### Classes
:Person rdf:type owl:Class ;
        rdfs:label "Person" ;
        rdfs:comment "A class representing people." .

:Musician rdf:type owl:Class ;
          rdfs:subClassOf :Person ;
          rdfs:label "Musician" ;
          rdfs:comment "A person who can play an instrument." .

:Instrument rdf:type owl:Class ;
            rdfs:label "Instrument" ;
            rdfs:comment "A class representing musical instruments." .

### Properties
:playsInstrument rdf:type owl:ObjectProperty ;
                   rdfs:domain :Musician ;
                   rdfs:range :Instrument ;
                   rdfs:label "can play instrument" ;
                   rdfs:comment "A property indicating that a person who can play an instrument is a Musician." .

"""

# g.parse(data=ontology_data, format="turtle")
combined_graph.parse(data=ontology_data, format="turtle")


## 3.2. Performing reasoning/inference

In [None]:
import rdflib
from owlrl import DeductiveClosure, OWLRL_Semantics

# Apply OWL RL reasoning using owlrl
# DeductiveClosure(OWLRL_Semantics).expand(g)
DeductiveClosure(OWLRL_Semantics).expand(combined_graph)

# Check the inferred facts re :Musician
query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX : <http://example.org/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?name
    WHERE {
        ?person rdf:type :Musician ;
            foaf:name ?name .
    }
"""
# result = g.query(query)
result = combined_graph.query(query)

# Output the results
print("Inferred Musicians:\n")
for row in result:
    print(row[1])


# 4. Data Validation

## 4.1. Defining SHACL shapes

In [13]:
# Defining SHACL shapes

shapes_data = """
@prefix : <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Shape to validate Persons 
:PersonShape
    rdf:type sh:NodeShape ;
    sh:targetClass :Person ;
    sh:property [
        sh:path sdo:birthDate ;
        sh:minCount 1 ;
        sh:message "Every Person should have a date of birth." ;
    ] .
"""

## 4.2. Performing RDF Data Validation

In [None]:
import rdflib
from pyshacl import validate

# Step 1: Load the shapes into an rdflib graph
# (the data and the ontology are already in combined_graph)

shapes_g = rdflib.Graph()
shapes_g.parse(data=shapes_data, format="turtle")

# Step 2: Validate the graph against the SHACL shapes
conforms, results_graph, results_text = validate(combined_graph, shacl_graph=shapes_g, data_graph_format="turtle", shacl_graph_format="turtle")

# Step 3: Output the validation results
print(results_text)