[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kellm-fit/ISWC_tutorial/blob/main/session-3/hands-on/graphrag_rdf_kg_construction.ipynb)

## Prepare Notebook for Colab

In [None]:
!git clone https://github.com/kellm-fit/ISWC_tutorial.git

In [None]:
base_path = "/content/ISWC_tutorial/session-3/hands-on/"
# For local
# base_path = "./"

# Knowledge Graph Construction Overview
1. Environment Setup: Install necessary Python libraries.
2. Data Ingestion: Read Parquet files into Pandas DataFrames.
3. RDF Graph Initialization: Set up the RDF graph using rdflib.
4. Ontology Definition: Define classes and properties in RDF.
5. Mapping Nodes and Relationships to RDF: Convert each node and its relationships into RDF triples.
6. Serializing the RDF Graph: Export the RDF graph in your preferred format.
7. Example Queries (Optional): Execute SPARQL queries on the RDF graph.
8. Ontology Visualization: Render the ontology's classes and properties as an interactive graph.

## Step 1: Environment Setup
Ensure the required Python libraries are installed. We'll use pandas and pyarrow to read Parquet files, rdflib to create and manage RDF data, and networkx, pyvis, and ipysigma to visualize the ontology.

In [1]:
!pip install -q pandas==2.2.2 pyarrow==17.0.0 rdflib==7.0.0 pyvis==0.3.2 networkx==3.3 ipysigma==0.24.2
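
After installation, it can help to confirm which versions were actually picked up, since a hosted environment may ship its own copies of some packages. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name, not part of the tutorial.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages pinned in the install cell above
PACKAGES = ["pandas", "pyarrow", "rdflib", "pyvis", "networkx", "ipysigma"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```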






## Step 2: Data Ingestion

In [2]:
import pandas as pd
import os

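# Resolve the data directory relative to base_path (set in the Colab setup cell),
# so the same cell works on Colab and locally (base_path = "./")
kg_path = os.path.join(base_path, 'kg')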

# Read Parquet files
documents_df = pd.read_parquet(os.path.join(kg_path, 'create_final_documents.parquet'))
entities_df = pd.read_parquet(os.path.join(kg_path, 'create_final_entities.parquet'))
relationships_df = pd.read_parquet(os.path.join(kg_path, 'create_final_relationships.parquet'))
text_units_df = pd.read_parquet(os.path.join(kg_path, 'create_final_text_units.parquet'))
community_reports_df = pd.read_parquet(os.path.join(kg_path, 'create_final_community_reports.parquet'))
communities_df = pd.read_parquet(os.path.join(kg_path, 'create_final_communities.parquet'))

In [3]:
entities_df.head(1)

| | id | name | type | description | human_readable_id | graph_embedding | text_unit_ids | description_embedding |
|---|---|---|---|---|---|---|---|---|
| 0 | 23e483d08c6f4f3297707b38926ba21d | CAUSAL DISCOVERY | SCIENTIFIC DISCIPLINE, METHOD | Causal discovery is a process in scientific re... | 0 | | [5102919b3ad27254673948bab8a72ed8, 6b1091d4897... | [-0.004210005979984999, 0.006614891812205315, ... |
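
Before mapping, a quick check that each DataFrame exposes the columns read later avoids a `KeyError` halfway through the conversion. This is a minimal sketch on a toy DataFrame rather than the Parquet data; `missing_columns` is a hypothetical helper, and the required-column list is an assumption based on the fields the entity mapping reads.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Toy stand-in for entities_df (values are placeholders for illustration)
toy_entities = pd.DataFrame({
    "id": ["e1"],
    "name": ["CAUSAL DISCOVERY"],
    "type": ["SCIENTIFIC DISCIPLINE, METHOD"],
    "description": ["toy description"],
})

required = ["id", "name", "type", "description", "human_readable_id", "text_unit_ids"]
absent = missing_columns(toy_entities, required)
print(absent)
```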


## Step 3: RDF Graph Initialization
Initialize the RDF graph and define the namespaces it needs. We'll use a base namespace (`EX`) for the instance data and an ontology namespace (`ONTO`) for classes and properties.

In [4]:
from rdflib import Namespace, URIRef, BNode, Literal, Graph
from rdflib.namespace import RDF, RDFS, OWL, XSD, FOAF

# Initialize RDF Graph
rdf_graph = Graph()

# Define Namespaces
EX = Namespace("http://example.org/data#")
ONTO = Namespace("http://example.org/ontology#")

# Bind namespaces to prefixes for readability
rdf_graph.bind("ex", EX)
rdf_graph.bind("onto", ONTO)
rdf_graph.bind("rdf", RDF)
rdf_graph.bind("rdfs", RDFS)
rdf_graph.bind("owl", OWL)
rdf_graph.bind("foaf", FOAF)

## Step 4: Ontology Definition
Define RDF classes and properties based on the structure of generated Parquet files. This step establishes the semantic structure of the RDF graph.

In [5]:
from rdflib import Namespace, RDF, RDFS, OWL, XSD

# --- Define Classes ---
classes = [
    "Document",
    "Entity",
    "Relationship",
    "TextUnit",
    "CommunityReport",
    "Community",    
    "Finding"
]

for cls in classes:
    rdf_graph.add((ONTO[cls], RDF.type, OWL.Class))

# --- Define Properties ---

# Object Properties
object_properties = {
    "hasTextUnit": {
        "domain": "Document",
        "range": "TextUnit",
        "type": OWL.ObjectProperty
    },
    "referencesEntity": {
        "domain": "TextUnit",
        "range": "Entity",
        "type": OWL.ObjectProperty
    },
    "hasFinding": {
        "domain": "CommunityReport",
        "range": "Finding",
        "type": OWL.ObjectProperty
    },
    "isInCommunity": {
        "domain": "Entity",
        "range": "Community",
        "type": OWL.ObjectProperty
    },
    "relates": {
        "domain": "Entity",
        "range": "Entity",
        "type": OWL.ObjectProperty
    },
    "source": {
        "domain": "Relationship",
        "range": "Entity",
        "type": OWL.ObjectProperty
    },
    "target": {
        "domain": "Relationship",
        "range": "Entity",
        "type": OWL.ObjectProperty
    },
    "referencesRelationship": {
        "domain": "TextUnit",
        "range": "Relationship",
        "type": OWL.ObjectProperty
    },
    "hasCommunityReport": {
        "domain": "Community",
        "range": "CommunityReport",
        "type": OWL.ObjectProperty
    },
}

# Datatype Properties
datatype_properties = {
    "hasTitle": {
        "domain": ["Document", "Community", "CommunityReport"],
        "range": XSD.string,        
        "type": OWL.DatatypeProperty
    },
    "hasRawContent": {
        "domain": "Document",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasName": {
        "domain": "Entity",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasType": {
        "domain": ["Entity"],
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasDescription": {
        "domain": ["Entity", "Relationship"],
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasDescriptionEmbedding": {
        "domain": "Entity",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasHumanReadableId": {
        "domain": ["Entity", "Relationship"],
        "range": XSD.integer,
        "type": OWL.DatatypeProperty
    },
    "hasWeight": {
        "domain": "Relationship",
        "range": XSD.integer,
        "type": OWL.DatatypeProperty
    },
    "hasSourceName": {
        "domain": "Relationship",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasTargetName": {
        "domain": "Relationship",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasText": {
        "domain": "TextUnit",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasNToken": {
        "domain": ["TextUnit"],
        "range": XSD.integer,
        "type": OWL.DatatypeProperty
    },
    "hasLevel": {
        "domain": ["Community", "CommunityReport"],
        "range": XSD.integer,
        "type": OWL.DatatypeProperty
    },
    "hasRank": {
        "domain": ["Community", "CommunityReport", "Relationship"],
        "range": XSD.integer,
        "type": OWL.DatatypeProperty
    },
    "hasSummary": {
        "domain": ["CommunityReport", "Finding"],
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    }, 
    "hasExplanation": {
        "domain": "Finding",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasFullContent": {
        "domain": "CommunityReport",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },
    "hasFullContentJSON": {
        "domain": "CommunityReport",
        "range": XSD.string,
        "type": OWL.DatatypeProperty
    },    
}

# Add Object Properties to the Graph
for prop, details in object_properties.items():
    rdf_graph.add((ONTO[prop], RDF.type, details["type"]))
    rdf_graph.add((ONTO[prop], RDFS.domain, ONTO[details["domain"]]))
    rdf_graph.add((ONTO[prop], RDFS.range, ONTO[details["range"]]))

# Add Datatype Properties to the Graph
for prop, details in datatype_properties.items():
    rdf_graph.add((ONTO[prop], RDF.type, details["type"]))
    # Handle multiple domains
    domains = details["domain"] if isinstance(details["domain"], list) else [details["domain"]]
    for domain in domains:
        rdf_graph.add((ONTO[prop], RDFS.domain, ONTO[domain]))
    rdf_graph.add((ONTO[prop], RDFS.range, details["range"]))


## Step 5: Mapping Nodes to RDF
Convert each node type from the Parquet files into RDF triples. This involves iterating through each DataFrame and creating corresponding RDF resources with their properties.

### 5.0 Helper Function
`remove_quotes` removes leading/trailing quotes from strings.

In [6]:
def remove_quotes(string):
    return string.strip('"').strip("'")
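
A short sketch of how the helper behaves: only quote characters at the ends are removed, while quotes inside the string are kept. The second function, `remove_quotes_safe`, is a hypothetical variant (not part of the notebook) for cells that may hold `None` instead of a string.

```python
def remove_quotes(string):
    return string.strip('"').strip("'")

def remove_quotes_safe(value):
    """Hypothetical variant: pass non-string values (e.g. None) through unchanged."""
    return remove_quotes(value) if isinstance(value, str) else value

print(remove_quotes('"CAUSAL DISCOVERY"'))
print(remove_quotes("'outer \"inner\" text'"))
print(remove_quotes_safe(None))
```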

### 5.1 Documents
Each `Document` is mapped to an RDF node with its properties (title, raw content) and linked to its `TextUnits`.

In [7]:
for _, row in documents_df.iterrows():
    doc_uri = EX[f"Document_{row['id']}"]    
    rdf_graph.add((doc_uri, RDF.type, ONTO.Document))
    rdf_graph.add((doc_uri, ONTO.hasTitle, Literal(row['title'], datatype=XSD.string)))
    rdf_graph.add((doc_uri, ONTO.hasRawContent, Literal(row['raw_content'], datatype=XSD.string)))
    
    # Link to TextUnits
    for tu_id in row['text_unit_ids']:
        tu_uri = EX[f"TextUnit_{tu_id}"]        
        rdf_graph.add((doc_uri, ONTO.hasTextUnit, tu_uri))

### 5.2 Entities
`Entity` nodes are created with properties (name, type, description, etc.) and linked to relevant `TextUnits` and `Communities`.

In [8]:
entity_name_to_id = dict(zip(entities_df['name'], entities_df['id']))
for _, row in entities_df.iterrows():
    entity_uri = EX[f"Entity_{row['id']}"]    
    rdf_graph.add((entity_uri, RDF.type, ONTO.Entity))    
    rdf_graph.add((entity_uri, ONTO.hasName, Literal(remove_quotes(row['name']), datatype=XSD.string)))
    rdf_graph.add((entity_uri, ONTO.hasType, Literal(remove_quotes(row['type']), datatype=XSD.string)))
    rdf_graph.add((entity_uri, ONTO.hasDescription, Literal(remove_quotes(row['description']), datatype=XSD.string)))
    rdf_graph.add((entity_uri, ONTO.hasHumanReadableId, Literal(int(row['human_readable_id']), datatype=XSD.integer)))
    embedding_str = " ".join(map(str, row['description_embedding']))    
    rdf_graph.add((entity_uri, ONTO.hasDescriptionEmbedding, Literal(embedding_str, datatype=XSD.string)))
                
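    # Link each TextUnit to this entity. The subject/object order follows the
    # declared domain (TextUnit) and range (Entity) of referencesEntity.
    for tu_id in row['text_unit_ids']:
        tu_uri = EX[f"TextUnit_{tu_id}"]
        rdf_graph.add((tu_uri, ONTO.referencesEntity, entity_uri))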

for _, row in communities_df.iterrows():
    community_uri = EX[f"Community_{row['id']}"]
    
    # Get the list of relationship IDs associated with the community
    relationship_ids = row['relationship_ids']
    
    for rel_id in relationship_ids:
        # Find the corresponding relationship row in relationships_df
        relationship_row = relationships_df[relationships_df['id'] == rel_id].iloc[0]
        source_entity_id = entity_name_to_id[relationship_row['source']]
        target_entity_id = entity_name_to_id[relationship_row['target']]
        
        # Create URIs for source and target entities
        source_entity_uri = EX[f"Entity_{source_entity_id}"]
        target_entity_uri = EX[f"Entity_{target_entity_id}"]
        
        # Link source and target entities to the community via onto:isInCommunity
        rdf_graph.add((source_entity_uri, ONTO.isInCommunity, community_uri))
        rdf_graph.add((target_entity_uri, ONTO.isInCommunity, community_uri))
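
The community loop above filters `relationships_df` once per relationship ID, which rescans the whole table each time. A sketch of an equivalent lookup that builds a dictionary once, shown on toy data (the IDs and names below are made up for illustration, and `community_member_ids` is a hypothetical helper):

```python
import pandas as pd

relationships = pd.DataFrame({
    "id": ["r1", "r2"],
    "source": ["A", "B"],
    "target": ["B", "C"],
})
entity_name_to_id = {"A": "e1", "B": "e2", "C": "e3"}

# Build the id -> (source, target) lookup once instead of filtering per ID
endpoints_by_rel_id = {
    rel_id: (source, target)
    for rel_id, source, target in zip(
        relationships["id"], relationships["source"], relationships["target"]
    )
}

def community_member_ids(relationship_ids):
    """Collect the entity IDs touched by a community's relationships."""
    members = set()
    for rel_id in relationship_ids:
        source, target = endpoints_by_rel_id[rel_id]
        members.add(entity_name_to_id[source])
        members.add(entity_name_to_id[target])
    return members

print(sorted(community_member_ids(["r1", "r2"])))
```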


### 5.3 Relationships
`Relationship` nodes connect source and target `Entities`, carry properties (description, rank, weight, etc.), and are linked to `TextUnits`.

In [9]:
for _, row in relationships_df.iterrows():
    rel_uri = EX[f"Relationship_{row['id']}"]
    
    rdf_graph.add((rel_uri, RDF.type, ONTO.Relationship))
    rdf_graph.add((rel_uri, ONTO.hasWeight, Literal(row['weight'], datatype=XSD.integer)))        
    rdf_graph.add((rel_uri, ONTO.hasRank, Literal(row['rank'], datatype=XSD.integer))) 
    rdf_graph.add((rel_uri, ONTO.hasDescription, Literal(remove_quotes(row['description']), datatype=XSD.string)))    
    rdf_graph.add((rel_uri, ONTO.hasHumanReadableId, Literal(int(row['human_readable_id']), datatype=XSD.integer)))
    rdf_graph.add((rel_uri, ONTO.hasSourceName, Literal(remove_quotes(row['source']), datatype=XSD.string)))
    rdf_graph.add((rel_uri, ONTO.hasTargetName, Literal(remove_quotes(row['target']), datatype=XSD.string)))
    # Link to Source and Target Entities    
    source_uri = EX[f"Entity_{entity_name_to_id[row['source']]}"]
    target_uri = EX[f"Entity_{entity_name_to_id[row['target']]}"]
    rdf_graph.add((rel_uri, ONTO.source, source_uri))
    rdf_graph.add((rel_uri, ONTO.target, target_uri))
    
    rdf_graph.add((source_uri, ONTO.relates, target_uri))
    
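    # Link each TextUnit to this relationship. The subject/object order follows the
    # declared domain (TextUnit) and range (Relationship) of referencesRelationship.
    for tu_id in row['text_unit_ids']:
        tu_uri = EX[f"TextUnit_{tu_id}"]
        rdf_graph.add((tu_uri, ONTO.referencesRelationship, rel_uri))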
        


### 5.4 TextUnits
`TextUnit` nodes represent document chunks, linked back to `Documents`, `Entities`, and `Relationships`.

In [10]:
for _, row in text_units_df.iterrows():
    tu_uri = EX[f"TextUnit_{row['id']}"]
    rdf_graph.add((tu_uri, RDF.type, ONTO.TextUnit))
    rdf_graph.add((tu_uri, ONTO.hasText, Literal(row['text'], datatype=XSD.string)))
    rdf_graph.add((tu_uri, ONTO.hasNToken, Literal(row['n_tokens'], datatype=XSD.integer)))
    
    # Link to Documents
    if row['document_ids'] is not None:
        for doc_id in row['document_ids']:
            doc_uri = EX[f"Document_{doc_id}"]
            rdf_graph.add((tu_uri, ONTO.belongsToDocument, doc_uri))
    
    # Link to Entities
    if row['entity_ids'] is not None:
        for entity_id in row['entity_ids']:
            entity_uri = EX[f"Entity_{entity_id}"]
            rdf_graph.add((tu_uri, ONTO.referencesEntity, entity_uri))
    
    # Link to Relationships    
    if row['relationship_ids'] is not None:
        for rel_id in row['relationship_ids']:
            rel_uri = EX[f"Relationship_{rel_id}"]
            rdf_graph.add((tu_uri, ONTO.referencesRelationship, rel_uri))
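
The `is not None` guards matter because list-valued Parquet columns may come back as array-like objects, and a row can hold no value at all. A sketch of a small helper that normalizes such cells before looping; `iter_ids` is a hypothetical name, not part of the notebook.

```python
import math

def iter_ids(value):
    """Return the IDs in a cell as a list, treating None and NaN as empty."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    return list(value)

print(iter_ids(None))
print(iter_ids(float("nan")))
print(iter_ids(("tu1", "tu2")))
```

With such a helper, each link loop could iterate over `iter_ids(row['entity_ids'])` directly instead of wrapping the loop in a guard.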

### 5.5 Community Reports
`CommunityReport` nodes store report details (content, title, findings, etc.) and link to their respective `Communities`.

In [11]:
# Create a mapping from Community ID to Community URI
community_id_to_uri = {}

for _, row in communities_df.iterrows():
    community_id = row['id']  # Assuming 'id' is the numeric human-readable ID
    community_uri = EX[f"Community_{community_id}"]    
    community_id_to_uri[community_id] = community_uri

for _, row in community_reports_df.iterrows():
    cr_id = row['id']
    cr_uri = EX[f"CommunityReport_{cr_id}"]
    
    # Add CommunityReport as an instance of ONTO.CommunityReport
    rdf_graph.add((cr_uri, RDF.type, ONTO.CommunityReport))
    
    # Link to Community using the human-readable community ID
    community_hr_id = row['community']  # e.g., 1, 2, 3, 4
    community_uri = community_id_to_uri.get(community_hr_id)
    if community_uri:
        rdf_graph.add((cr_uri, ONTO.hasCommunity, community_uri))
    else:
        print(f"Warning: Community ID {community_hr_id} not found for CommunityReport ID {cr_id}")
    
    # Add other properties
    rdf_graph.add((cr_uri, ONTO.hasFullContent, Literal(row['full_content'], datatype=XSD.string)))
    rdf_graph.add((cr_uri, ONTO.hasLevel, Literal(row['level'], datatype=XSD.integer)))
    rdf_graph.add((cr_uri, ONTO.hasRank, Literal(row['rank'], datatype=XSD.integer)))
    rdf_graph.add((cr_uri, ONTO.hasTitle, Literal(row['title'], datatype=XSD.string)))
    rdf_graph.add((cr_uri, ONTO.hasSummary, Literal(row['summary'], datatype=XSD.string)))
    rdf_graph.add((cr_uri, ONTO.hasFullContentJSON, Literal(row['full_content_json'], datatype=XSD.string)))
    
    # Handle Findings    
    for finding in row['findings']:        
        finding_explanation = finding.get('explanation', '')
        finding_summary = finding.get('summary', '')
        
        # Represent each Finding as a blank node (no stable URI is minted)
        finding_bnode = BNode()
        rdf_graph.add((finding_bnode, RDF.type, ONTO.Finding))
        
        # Add explanation and summary
        rdf_graph.add((finding_bnode, ONTO.hasExplanation, Literal(finding_explanation, datatype=XSD.string)))
        rdf_graph.add((finding_bnode, ONTO.hasSummary, Literal(finding_summary, datatype=XSD.string)))
        
        # Link Finding to CommunityReport
        rdf_graph.add((cr_uri, ONTO.hasFinding, finding_bnode))


### 5.6 Communities
`Community` nodes group related entities and relationships, linking to `TextUnits`, `Relationships`, and `CommunityReports`.

In [12]:
for _, row in communities_df.iterrows():
    community_uri = EX[f"Community_{row['id']}"]
    rdf_graph.add((community_uri, RDF.type, ONTO.Community))
    rdf_graph.add((community_uri, ONTO.hasTitle, Literal(row['title'], datatype=XSD.string)))
    rdf_graph.add((community_uri, ONTO.hasLevel, Literal(row['level'], datatype=XSD.integer)))    
    
    # Link to Relationships
    for rel_id in row['relationship_ids']:
        rel_uri = EX[f"Relationship_{rel_id}"]
        rdf_graph.add((community_uri, ONTO.referencesRelationship, rel_uri))
        
    # Link to TextUnits
    for tu_id in row['text_unit_ids']:        
        tu_uri = EX[f"TextUnit_{tu_id}"]
        rdf_graph.add((community_uri, ONTO.referencesTextUnit, tu_uri))

# Link to CommunityReports
for _, row in community_reports_df.iterrows():
    cr_uri = EX[f"CommunityReport_{row['id']}"]
    community_number = row['community']
    community_uri = EX[f"Community_{community_number}"]
    rdf_graph.add((community_uri, ONTO.hasCommunityReport, cr_uri))


## Step 6: Serializing the RDF Graph
Once all nodes and relationships have been mapped, serialize the RDF graph into a desired format such as Turtle (.ttl), RDF/XML, or JSON-LD.

In [13]:
# Serialize as Turtle
turtle_data = rdf_graph.serialize(format='turtle')

# Save to a file
with open('global_graph.ttl', 'w', encoding="utf-8") as f:
    f.write(turtle_data)

print("RDF graph has been serialized to 'global_graph.ttl'")

RDF graph has been serialized to 'global_graph.ttl'


## Step 7: Example Queries (Optional)
This code demonstrates querying an RDF graph with `rdflib` to retrieve names and descriptions of up to three entities.


In [14]:
from rdflib import Graph

# Start from a fresh graph so the query runs against the serialized file
rdf_graph = Graph()

# Load the TTL file
rdf_graph.parse("global_graph.ttl", format="turtle")

# Define the SPARQL query: name and description of up to three 'Entity' instances
query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>    
    PREFIX onto: <http://example.org/ontology#>
    
    SELECT ?entity ?name ?description
    WHERE {
        ?entity rdf:type onto:Entity .
        ?entity onto:hasName ?name .
        ?entity onto:hasDescription ?description .
    }
    LIMIT 3
"""

# Execute the query
results = rdf_graph.query(query)

# Print the results
for entity, name, desc in results:
    print(f"Entity: {entity}, Name: {name}, Description: {desc}")

Entity: http://example.org/data#Entity_01b840a5b1f44c42a27207c492c915f1, Name: WANG, Y., Description: Wang, Y. is an author and researcher involved in structure-augmented text representation learning for efficient knowledge graph completion
Entity: http://example.org/data#Entity_01eefacdb0694a15b39358fdbab3e40a, Name: CHEN, W., Description: W. Chen is an author and researcher known for work on relational rules for knowledge graph link prediction
Entity: http://example.org/data#Entity_06a23c47a64f4a95b6b0951c9211233b, Name: KWOK, J.T., Description: Kwok, J.T. is an author and researcher involved in the development of KICGPT, a large language model for knowledge graph completion


## Step 8: Ontology Visualization
This section visualizes the ontology (classes, object properties, and datatype properties) with NetworkX, PyVis, and ipysigma.

In [15]:
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL
import networkx as nx
from pyvis.network import Network

g = Graph()

g.parse("global_graph.ttl", format="turtle")

print(f"Graph has {len(g)} triples.")

classes = set(g.subjects(RDF.type, OWL.Class))

# Extract Object and Datatype Properties
object_properties = set(g.subjects(RDF.type, OWL.ObjectProperty))
datatype_properties = set(g.subjects(RDF.type, OWL.DatatypeProperty))

# Display the results
print(f"Classes: {len(classes)}")
print(f"Object Properties: {len(object_properties)}")
print(f"Datatype Properties: {len(datatype_properties)}")

nx_graph = nx.MultiDiGraph()

# Helper function to get short names (remove prefixes)
def get_short_name(uri):
    return uri.split('#')[-1] if '#' in uri else uri.split('/')[-1]

# Add classes as nodes with short labels and color
for cls in classes:
    nx_graph.add_node(str(cls), label=get_short_name(str(cls)), color="lightblue")  # Use class short name and set color

# Add object properties as edges between their domain and range, label with property name
for prop in object_properties:
    domain = g.value(prop, RDFS.domain)
    range_ = g.value(prop, RDFS.range)
    if domain and range_:
        nx_graph.add_node(str(domain), label=get_short_name(str(domain)), color="lightgreen")  # Use short name
        nx_graph.add_node(str(range_), label=get_short_name(str(range_)), color="lightgreen")  # Use short name
        nx_graph.add_edge(str(domain), str(range_), label=get_short_name(str(prop)), color="blue")  # Short name for edge label

# Add datatype properties as edges between classes and literals, label with property name
for prop in datatype_properties:
    # Find domains and ranges for each datatype property
    domains = list(g.objects(prop, RDFS.domain))
    range_ = g.value(prop, RDFS.range)
    
    if domains and range_:
        for domain in domains:
            nx_graph.add_node(str(domain), label=get_short_name(str(domain)), color="lightgreen")  # Use short name
            nx_graph.add_node(str(range_), label=get_short_name(str(range_)), color="orange")  # Datatypes in orange
            nx_graph.add_edge(str(domain), str(range_), label=get_short_name(str(prop)), color="green")  # Short name for edge label

# Create a PyVis network for visualization
nt = Network(notebook=True, height="1000px", width="100%", directed=True)
nt.from_nx(nx_graph)

# Generate the interactive HTML file
nt.write_html("ontology_visualization.html", open_browser=True)

Graph has 36824 triples.
Classes: 7
Object Properties: 9
Datatype Properties: 18


In [16]:
from ipysigma import Sigma

sigma = Sigma(
    nx_graph,
    default_edge_type="curve",
    clickable_edges=True,
    node_border_color_from="node",
    label_font="cursive",
    height=400
)

sigma

Sigma(nx.MultiDiGraph with 9 nodes and 35 edges)
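
The reported size can be checked by hand from the Step 4 definitions. Nodes: the 7 classes plus the two datatypes in use (`xsd:string` and `xsd:integer`) give 9. Edges: each object property contributes one edge, and each datatype property contributes one edge per declared domain; because the graph is a `MultiDiGraph`, parallel edges between the same pair of nodes are all kept. The sketch below redoes that count; the dictionary simply restates the number of domains per datatype property from Step 4.

```python
# Number of declared domains per datatype property, restated from Step 4
datatype_domain_counts = {
    "hasTitle": 3, "hasRawContent": 1, "hasName": 1, "hasType": 1,
    "hasDescription": 2, "hasDescriptionEmbedding": 1, "hasHumanReadableId": 2,
    "hasWeight": 1, "hasSourceName": 1, "hasTargetName": 1, "hasText": 1,
    "hasNToken": 1, "hasLevel": 2, "hasRank": 3, "hasSummary": 2,
    "hasExplanation": 1, "hasFullContent": 1, "hasFullContentJSON": 1,
}
num_classes = 7
num_datatypes = 2           # xsd:string and xsd:integer
num_object_properties = 9   # one edge each (single domain and range)

expected_nodes = num_classes + num_datatypes
expected_edges = num_object_properties + sum(datatype_domain_counts.values())
print(expected_nodes, expected_edges)  # 9 35
```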