# Ontology Discovery with NER and Relation Extraction

This notebook demonstrates an automated approach to ontology discovery by combining:
1. **Named Entity Recognition (NER)** - Identifying entities in text using BERT
2. **Relation Extraction** - Using LLMs to discover relationships between entities

The workflow:
- Extract entities from text using a pre-trained NER model
- Use an LLM to identify meaningful relationships between entities
- Structure the output in an ontology-friendly format (entities + relations)

## Step 1: Named Entity Recognition (NER)

In this step, we:
1. Load a pre-trained BERT model fine-tuned for Named Entity Recognition
2. Process example text to identify entities (Person, Location, Organization, etc.)
3. Map the BIO tags (B-PER, I-PER, etc.) to human-readable entity types

**Model Used**: `dslim/bert-base-NER` - A BERT model fine-tuned for NER on the CoNLL-2003 dataset

**Entity Types Detected**:
- **Person**: Individual names
- **Location**: Geographic locations (cities, countries, etc.)
- **Organization**: Companies, institutions, agencies
- **Miscellaneous**: Other named entities

In [2]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
entity_mapping = {
    'B-PER': 'Person',
    'I-PER': 'Person',
    'B-ORG': 'Organization',
    'I-ORG': 'Organization',
    'B-LOC': 'Location',
    'I-LOC': 'Location',
    'B-MISC': 'Miscellaneous',
    'I-MISC': 'Miscellaneous'
}

filtered_entities = [
    {
        'entity': entity_mapping.get(result['entity'], result['entity']),
        'score': result['score'],
        'word': result['word']
    }
    for result in ner_results
]

filtered_entities

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity': 'Person', 'score': np.float32(0.9990139), 'word': 'Wolfgang'},
 {'entity': 'Location', 'score': np.float32(0.999645), 'word': 'Berlin'}]
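The pipeline above returns one dictionary per token, so a name that the tokenizer splits into word pieces (prefixed with `##`) or that spans several words comes back in fragments. The sketch below shows one way to merge such fragments into whole entity spans. The helper name `merge_bio_tokens` and the sample input are hypothetical, hand-written for illustration and not actual model output; the sketch assumes each result carries the `index` key that the pipeline attaches to every token. Passing `aggregation_strategy="simple"` to `pipeline(...)` is the built-in alternative, in which case the results use an `entity_group` key instead of `entity`.

```python
def merge_bio_tokens(token_results):
    """Merge per-token NER results into whole entity spans."""
    spans = []
    for tok in token_results:
        # "B-PER" -> "PER"; the B/I prefix only marks the position in the span
        _, _, label = tok["entity"].partition("-")
        word = tok["word"]
        prev = spans[-1] if spans else None
        continues = (
            prev is not None
            and prev["label"] == label
            and tok["index"] == prev["last_index"] + 1  # tokens must be adjacent
            and (word.startswith("##") or tok["entity"].startswith("I-"))
        )
        if continues:
            # Word pieces attach directly; a new word in the same span gets a space
            prev["word"] += word[2:] if word.startswith("##") else " " + word
            prev["scores"].append(tok["score"])
            prev["last_index"] = tok["index"]
        else:
            spans.append({
                "label": label,
                "word": word,
                "scores": [tok["score"]],
                "last_index": tok["index"],
            })
    return [
        {
            "entity": s["label"],
            "word": s["word"],
            "score": sum(s["scores"]) / len(s["scores"]),  # mean token score
        }
        for s in spans
    ]


# Hand-written sample in the shape of the pipeline output (not real model output)
sample = [
    {"entity": "B-PER", "score": 0.99, "index": 1, "word": "Wolf"},
    {"entity": "I-PER", "score": 0.97, "index": 2, "word": "##gang"},
    {"entity": "I-PER", "score": 0.98, "index": 3, "word": "Mozart"},
    {"entity": "B-LOC", "score": 0.99, "index": 7, "word": "Salzburg"},
]
merged = merge_bio_tokens(sample)
print([m["word"] for m in merged])
```

The merged spans can then be mapped to readable types with `entity_mapping`-style lookups keyed on `PER`, `LOC`, `ORG` and `MISC`.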

## Step 2: Relation Extraction with LLM

Now we use a Large Language Model (LLM) to discover relationships between the identified entities.

**Approach**:
- Send the original text and NER results to an LLM (Gemma3 via Ollama)
- The LLM is prompted to act as an ontology expert
- It is instructed to extract relationships ONLY between the entities already identified, which reduces (but cannot rule out) hallucinated entities
- Output is structured as JSON with entities and relations

**Output Format**:
```json
{
  "entities": [{"id": "...", "type": "..."}],
  "relations": [{"subject": "...", "predicate": "...", "object": "..."}]
}
```

**Key Constraints**:
- Predicates use camelCase (e.g., `livesIn`, `worksFor`)
- Entity types follow ontology classes (Person, Location, Organization)
- Relations are expressed as RDF-style triples

In [3]:
import json
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from langchain_ollama import OllamaLLM

llm = OllamaLLM(
    model="gemma3",
    temperature=0.7
)

messages = [
    SystemMessage(content="""
                    You are an expert in ontology discovery and relation extraction.

                    Given:
                    1. Input text
                    2. A list of identified entities with their types

                    Your task:
                    - Identify relationships ONLY between the given entities.
                    - Do NOT invent new entities.
                    - Use ontology-friendly predicate names (camelCase, verbs).

                    Output JSON with TWO sections:
                    1. "entities": list of entities with fields:
                    - "id": canonical name
                    - "type": ontology class (Person, Location, Organization, etc.)

                    2. "relations": list of triples with fields:
                    - "subject"
                    - "predicate"
                    - "object"

                    Rules:
                    - Use consistent ontology classes.
                    - Predicates must be reusable OWL ObjectProperties.
                    - Output ONLY valid JSON.
                    """),
    HumanMessage(content=f"""
                    Original text: {example}

                    Identified entities: {ner_results}

                    Extract the relationships between these entities based on the original text.
                    """)
]

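import re

response = llm.invoke(messages)
# Strip an optional Markdown code fence instead of slicing fixed character offsets
match = re.search(r"```(?:json)?\s*(.*?)```", response, re.DOTALL)
json_str = match.group(1) if match else response
llm_output = json.loads(json_str.strip())
llm_output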

{'entities': [{'id': 'Wolfgang', 'type': 'Person'},
  {'id': 'Berlin', 'type': 'Location'}],
 'relations': [{'subject': 'Wolfgang',
   'predicate': 'livesIn',
   'object': 'Berlin'}]}
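The prompt asks the model to stay within the given entities and to use camelCase predicates, but nothing in the notebook checks that it did. A minimal sketch of such a check follows, assuming the JSON shape shown under **Output Format**; the helper name `validate_extraction` and the camelCase pattern are illustrative choices, not part of the original notebook.

```python
import re

# Lowercase start, then optional capitalised words: livesIn, worksFor, locatedIn
CAMEL_CASE = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)*$")


def validate_extraction(llm_output, known_words):
    """Return a list of problems found in the LLM's JSON output."""
    problems = []
    entity_ids = {e["id"] for e in llm_output.get("entities", [])}

    # Every entity the LLM lists should have been found by the NER step
    for ent_id in sorted(entity_ids - set(known_words)):
        problems.append(f"entity not found by NER: {ent_id}")

    for rel in llm_output.get("relations", []):
        # Both ends of a triple must be declared entities
        for role in ("subject", "object"):
            if rel[role] not in entity_ids:
                problems.append(f"{role} is not a declared entity: {rel[role]}")
        if not CAMEL_CASE.match(rel["predicate"]):
            problems.append(f"predicate is not camelCase: {rel['predicate']}")
    return problems


good = {
    "entities": [{"id": "Wolfgang", "type": "Person"},
                 {"id": "Berlin", "type": "Location"}],
    "relations": [{"subject": "Wolfgang", "predicate": "livesIn", "object": "Berlin"}],
}
bad = {
    "entities": [{"id": "Wolfgang", "type": "Person"}],
    "relations": [{"subject": "Wolfgang", "predicate": "lives_in", "object": "Germany"}],
}
print(validate_extraction(good, {"Wolfgang", "Berlin"}))
print(validate_extraction(bad, {"Wolfgang", "Berlin"}))
```

In the notebook, `known_words` could be built from the `word` field of the NER results.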

## Workflow Summary: End-to-End Ontology Discovery Pipeline

This notebook demonstrates a complete pipeline from raw text to a queryable knowledge graph:

**Phase 1: Entity & Relation Extraction (Steps 1-2)**
- NER model identifies entities (Person, Location, Organization)
- LLM extracts meaningful relationships between entities
- Structured JSON output with typed entities and predicates

**Phase 2: Ontology Population (Steps 3-7)**
- Load pre-defined ontology schema from Protege
- Map extracted data to OWL classes and properties
- Create individuals and establish relationships
- Validate logical consistency with reasoner

**Phase 3: Graph Database Storage (Steps 8-15)**
- Extract ontology data (entities + relationships)
- Connect to Neo4j graph database
- Import nodes with semantic labels
- Create typed relationships for graph traversal

**Result**: A validated, scalable knowledge graph ready for complex queries, reasoning, and integration with downstream applications.

## Step 3: Loading Pre-existing Ontology from Protege

In this step, we load an ontology that was created and validated in Protege. This ontology contains the class hierarchy and property definitions that will be used to structure our extracted entities.

**Key Components**:
- **Owlready2**: Python library for manipulating OWL 2.0 ontologies
- **Ontology File**: Test1.rdf - created in Protege with predefined classes and properties
- **Classes**: Entity, Event, Location, Organisation, Person
- **Object Properties**: livesIn, locatedIn, participatedIn, worksFor

The loaded ontology provides the schema/structure that our NER-extracted entities will conform to.

In [4]:
from owlready2 import *
from owlready2.reasoning import sync_reasoner

onto = get_ontology("file://C:\\Users\\zohai\\OneDrive\\Documents\\Test1.rdf").load()
print("Classes:")
for c in onto.classes():
    print(c)

print("\nObject Properties:")
for p in onto.object_properties():
    print(p)

Classes:
C:\Users\zohai\OneDrive\Documents\Test1.Entity
C:\Users\zohai\OneDrive\Documents\Test1.Event
C:\Users\zohai\OneDrive\Documents\Test1.Location
C:\Users\zohai\OneDrive\Documents\Test1.Organisation
C:\Users\zohai\OneDrive\Documents\Test1.Person

Object Properties:
C:\Users\zohai\OneDrive\Documents\Test1.livesIn
C:\Users\zohai\OneDrive\Documents\Test1.locatedIn
C:\Users\zohai\OneDrive\Documents\Test1.participatedIn
C:\Users\zohai\OneDrive\Documents\Test1.worksFor


## Step 4: Mapping NER Results to Ontology Schema

To bridge the gap between raw NER output and formal ontology, we create mapping dictionaries:

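**CLASS_MAP**: Maps lowercased entity types to OWL classes
- "person", "human" → `onto.Person`
- "location", "place" → `onto.Location`
- "organization", "organisation" → `onto.Organisation` (the ontology uses the British spelling)
- "event" → `onto.Event`

**PROPERTY_MAP**: Maps lowercased relation predicates to OWL object properties
- "livesin", "residesin" → `onto.livesIn`
- "worksfor" → `onto.worksFor`
- "locatedin" → `onto.locatedIn`
- "participatedin" → `onto.participatedIn`

Keys are lowercase because the lookups in Steps 5 and 6 lowercase the LLM output first. The mapping absorbs the known variations listed here; any type or predicate that is not listed raises an error instead of being added silently.

In [5]:
# Only classes and properties that exist in Test1.rdf are mapped;
# onto.<name> evaluates to None for names the ontology does not define.
CLASS_MAP = {
    "person": onto.Person,
    "human": onto.Person,
    "location": onto.Location,
    "place": onto.Location,
    "organization": onto.Organisation,
    "organisation": onto.Organisation,
    "event": onto.Event,
}
PROPERTY_MAP = {
    "livesin": onto.livesIn,
    "residesin": onto.livesIn,
    "worksfor": onto.worksFor,
    "locatedin": onto.locatedIn,
    "participatedin": onto.participatedIn,
}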
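A mapping like this can drift out of sync with the ontology: the classes printed in Step 3 include `Organisation`, so an alias that points at `Organization` would resolve to nothing. The sketch below shows a quick consistency check using plain strings as stand-ins for the OWL classes. The names `CLASS_ALIASES`, `ONTOLOGY_CLASSES` and `resolve_class_name` are hypothetical, and the class set is copied by hand from the Step 3 output rather than read from the ontology.

```python
# Class names as printed in Step 3 (copied by hand for this sketch)
ONTOLOGY_CLASSES = {"Entity", "Event", "Location", "Organisation", "Person"}

# Lowercased LLM type -> ontology class name
CLASS_ALIASES = {
    "person": "Person",
    "human": "Person",
    "location": "Location",
    "place": "Location",
    "organization": "Organisation",
    "organisation": "Organisation",
    "event": "Event",
}


def resolve_class_name(raw_type):
    """Normalise an LLM-provided type and return the ontology class name, or None."""
    return CLASS_ALIASES.get(raw_type.strip().lower())


# Alias targets that do not exist in the ontology would show up here
missing = sorted(set(CLASS_ALIASES.values()) - ONTOLOGY_CLASSES)
print("Targets missing from ontology:", missing)
print(resolve_class_name(" Organization "))
print(resolve_class_name("Animal"))
```

Running the same kind of check against the real ontology right after building the maps would surface a misspelt class before any individuals are created.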

## Step 5: Creating Ontology Individuals

Now we instantiate individuals (instances) in the ontology based on the entities extracted by NER and returned in the LLM output.

**Process**:
1. Iterate through entities from LLM output
2. Map each entity type to the appropriate OWL class using CLASS_MAP
3. Create new individuals in the ontology
4. Cache individuals for later use in relationship creation

**Result**: Ontology populated with concrete instances (e.g., "Wolfgang" as an instance of Person class)

In [6]:
entity_cache = {}
with onto:
    for ent in llm_output["entities"]:
        name = ent["id"]
        ent_type = ent["type"].lower()

        cls = CLASS_MAP.get(ent_type)
        if cls is None:
            raise ValueError(f"Unknown entity type: {ent['type']}")

        if name not in entity_cache:
            entity_cache[name] = cls(name)

## Step 6: Adding Relationships Between Individuals

With individuals created, we now establish relationships between them using object properties.

**Process**:
1. Iterate through relations from LLM output
2. Retrieve subject and object individuals from the entity cache
3. Map predicates to OWL properties using PROPERTY_MAP
4. Establish property assertions (e.g., `Wolfgang.livesIn.append(Berlin)`)

**Result**: Ontology individuals linked by typed relationships that conform to the ontology schema

In [7]:
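for rel in llm_output["relations"]:
    subj = entity_cache.get(rel["subject"])
    obj = entity_cache.get(rel["object"])
    # Fail early if the LLM referenced an entity that was never created
    if subj is None or obj is None:
        raise ValueError(f"Relation refers to an unknown entity: {rel}")

    pred = rel["predicate"].lower()
    prop = PROPERTY_MAP.get(pred)

    if prop is None:
        raise ValueError(f"Unknown relation: {rel['predicate']}")

    # Skip assertions that already exist so re-running the cell adds no duplicates
    if obj not in prop[subj]:
        prop[subj].append(obj)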


## Step 7: Ontology Validation with Reasoner

Before deploying the ontology, we validate it using a reasoning engine to ensure logical consistency.

**Reasoner**: HermiT (via Owlready2's `sync_reasoner`)

**What it checks**:
- Class satisfiability (no contradictory class definitions)
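- Property domain and range axioms (in OWL these infer types; a clash appears only when an inferred type contradicts another axiom)
- Cardinality constraints
- Disjointness axioms

**Output**: ✅ if consistent, ❌ if the reasoner raises `OwlReadyInconsistentOntologyError`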

In [9]:
try:
    sync_reasoner()
    print("✅ Ontology is consistent")
except OwlReadyInconsistentOntologyError:
    print("❌ Ontology is inconsistent")

* Owlready2 * Running HermiT...
    java -Xmx2000M -cp c:\Users\zohai\University\Ontology Discovery\.venv\Lib\site-packages\owlready2\hermit;c:\Users\zohai\University\Ontology Discovery\.venv\Lib\site-packages\owlready2\hermit\HermiT.jar org.semanticweb.HermiT.cli.CommandLine -c -O -D -I file:///C:/Users/zohai/AppData/Local/Temp/tmp3318ot20


✅ Ontology is consistent


* Owlready2 * HermiT took 0.365386962890625 seconds
* Owlready * (NB: only changes on entities loaded in Python are shown, other changes are done but not listed)


## Step 8: Inspecting Ontology Content

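Explore the populated ontology to verify individuals and their class memberships. The listing covers every individual in the ontology, so it can include ones that do not come from the example sentence (here `OpenAI`), presumably already present in `Test1.rdf` or created in an earlier run.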

In [10]:
for ind in onto.individuals():
    print("Individual:", ind.name)
    print("  Types:", [cls.name for cls in ind.is_a])


Individual: Berlin
  Types: ['Location']
Individual: OpenAI
  Types: ['Organisation']
Individual: Wolfgang
  Types: ['Person']


## Step 9: Visualizing Relationships

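Display all relationships in human-readable triple format (subject → predicate → object). A triple that is printed twice, such as `Wolfgang --livesIn--> Berlin` below, most likely means the same assertion was appended more than once, for example by re-running Step 6.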

In [14]:
for ind in onto.individuals():
    for prop in ind.get_properties():
        for value in prop[ind]:
            print(f"{ind.name} --{prop.name}--> {value.name}")

OpenAI --locatedIn--> Berlin
Wolfgang --livesIn--> Berlin
Wolfgang --livesIn--> Berlin
Wolfgang --worksFor--> OpenAI


## Step 10: Extracting Data for Neo4j

Prepare ontology data for export to a graph database. This involves:

**Entity Types Extraction**: Create a dictionary mapping each individual to its classes for Neo4j node labels.

**Relations Extraction**: Convert OWL property assertions into a list of relationship dictionaries suitable for Neo4j import.

In [17]:
entity_types = {}

for ind in onto.individuals():
    entity_types[ind.name] = [
        cls.name for cls in ind.is_a
        if hasattr(cls, "name")
    ]
entity_types

{'Berlin': ['Location'], 'OpenAI': ['Organisation'], 'Wolfgang': ['Person']}

In [18]:
relations = []

for ind in onto.individuals():
    for prop in ind.get_properties():
        for value in prop[ind]:
            relations.append({
                "subject": ind.name,
                "predicate": prop.name,
                "object": value.name
            })
relations

[{'subject': 'OpenAI', 'predicate': 'locatedIn', 'object': 'Berlin'},
 {'subject': 'Wolfgang', 'predicate': 'livesIn', 'object': 'Berlin'},
 {'subject': 'Wolfgang', 'predicate': 'livesIn', 'object': 'Berlin'},
 {'subject': 'Wolfgang', 'predicate': 'worksFor', 'object': 'OpenAI'}]
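The list above contains the `Wolfgang livesIn Berlin` triple twice. `MERGE` in the later import step would tolerate that, but removing repeats before export keeps the data clean and saves redundant writes. A minimal order-preserving sketch follows; the helper name `dedupe_relations` is hypothetical and the sample list is the output shown above.

```python
def dedupe_relations(relations):
    """Drop repeated (subject, predicate, object) triples, keeping first occurrences."""
    seen = set()
    unique = []
    for rel in relations:
        key = (rel["subject"], rel["predicate"], rel["object"])
        if key not in seen:
            seen.add(key)
            unique.append(rel)
    return unique


relations_sample = [
    {"subject": "OpenAI", "predicate": "locatedIn", "object": "Berlin"},
    {"subject": "Wolfgang", "predicate": "livesIn", "object": "Berlin"},
    {"subject": "Wolfgang", "predicate": "livesIn", "object": "Berlin"},
    {"subject": "Wolfgang", "predicate": "worksFor", "object": "OpenAI"},
]
unique_relations = dedupe_relations(relations_sample)
print(len(unique_relations))
```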

## Step 11: Connecting to Neo4j Graph Database

Establish connection to Neo4j using the official Python driver.

**Configuration**:
- **Database**: Neo4j running locally on port 7687
- **Authentication**: Using credentials stored in environment variables (`.env` file)
- **Driver**: the official `neo4j` Python package (formerly published as `neo4j-driver`) for executing Cypher queries

Neo4j will serve as the scalable storage backend for querying and traversing the knowledge graph.

In [30]:
import os
from dotenv import load_dotenv
from neo4j import GraphDatabase
load_dotenv()
driver = GraphDatabase.driver(
    "neo4j://127.0.0.1:7687",
    auth=("neo4j", os.getenv("NEO4J_PASSWORD"))
)
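`os.getenv` returns `None` when the variable is missing, so a forgotten `.env` entry only surfaces later as an authentication failure. One option is to fail early with a clear message, as in the sketch below. The helper name `require_env` is hypothetical; only the standard library is used.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; define it in .env or the shell")
    return value


# Possible usage after load_dotenv():
# password = require_env("NEO4J_PASSWORD")
```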

## Step 12: Defining Entity Creation Function

Create a helper function to insert entities into Neo4j as nodes with appropriate labels.

**Key Features**:
- Uses `MERGE` to avoid duplicate nodes
- Applies one label per class listed in the individual's `is_a` (its asserted classes, not the full class hierarchy)
- A node receives several labels only when the individual is asserted to belong to several classes

In [31]:
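def create_entity(tx, entity_id, labels):
    # Labels cannot be passed as query parameters, so each one is backtick-quoted
    # (a backtick inside a label is escaped by doubling it)
    label_str = "".join(f":`{label.replace('`', '``')}`" for label in labels)
    tx.run(
        f"MERGE (e{label_str} {{id: $id}})",
        id=entity_id
    )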

## Step 13: Importing Entities into Neo4j

Execute the entity creation for all individuals from the ontology.

In [32]:
with driver.session() as session:
    for entity, labels in entity_types.items():
        session.execute_write(create_entity, entity, labels)

## Step 14: Defining Relationship Creation Function

Create a helper function to insert relationships between nodes as Neo4j edges.

**Features**:
- Matches both source and target nodes by ID
- Creates directed relationships using predicate as relationship type
- Upper-cases the predicate for the relationship type, so `livesIn` becomes `LIVESIN` (the Neo4j naming convention is upper snake case, e.g. `LIVES_IN`)

In [33]:
def create_relation(tx, s, p, o):
    tx.run(
        f"""
        MATCH (a {{id: $s}})
        MATCH (b {{id: $o}})
        MERGE (a)-[:{p.upper()}]->(b)
        """,
        s=s,
        o=o
    )
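Calling `.upper()` on a camelCase predicate gives `LIVESIN`, while the usual Neo4j style for relationship types is upper snake case (`LIVES_IN`). Because the type is interpolated into the query text and cannot be sent as a parameter, it is also worth rejecting anything that is not a plain identifier. The sketch below does both; the helper name `to_relationship_type` is hypothetical and not part of the original notebook.

```python
import re


def to_relationship_type(predicate):
    """Convert a camelCase predicate into an UPPER_SNAKE_CASE relationship type."""
    # Insert "_" at each boundary between a lowercase letter or digit and a capital
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", predicate)
    rel_type = snake.upper()
    # Only letters, digits and underscores are allowed, since the value ends up in query text
    if not re.fullmatch(r"[A-Z][A-Z0-9_]*", rel_type):
        raise ValueError(f"Unsafe relationship type: {predicate!r}")
    return rel_type


for name in ("livesIn", "worksFor", "locatedIn", "participatedIn"):
    print(name, "->", to_relationship_type(name))
```

In `create_relation`, the result of this helper could replace `p.upper()` when building the `MERGE` pattern.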


## Step 15: Importing Relationships into Neo4j

Execute the relationship creation for all property assertions from the ontology.

**Result**: A complete knowledge graph stored in Neo4j, ready for:
- Complex graph queries using Cypher
- Relationship traversal and pattern matching
- Graph analytics and visualization
- Integration with downstream applications

In [34]:
with driver.session() as session:
    for r in relations:
        session.execute_write(
            create_relation,
            r["subject"],
            r["predicate"],
            r["object"]
        )

# Release the connection pool once the import is finished
driver.close()
