# Knowledge Graph Schema Exploration

**Goal**: Test entity extraction and relationship identification on 5 sample paragraphs from Book 3, Chapter 4 (Rome).

**Approach**: 
- Extract ALL entities (approach B - categorize later)
- Use an OpenAI chat model with structured outputs (planned as GPT-4o; the extraction function below calls `gpt-4.1`)
- Compare with manual annotations
- Iterate on schema and prompts

## Setup

In [1]:
import json
import os
from pathlib import Path
from pydantic import BaseModel, Field
from typing import Optional
from langchain_openai import ChatOpenAI

# Set up OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY not set")

## Load Sample Paragraphs

In [2]:
# Load the 5 selected paragraphs
with open("selected_5_paragraphs.json") as f:
    paragraphs = json.load(f)

print(f"Loaded {len(paragraphs)} paragraphs\n")
for i, p in enumerate(paragraphs, 1):
    print(f"[{i}] Page {p['page']}: {p['text'][:100]}...")

Loaded 5 paragraphs

[1] Page 319: In the sixth century BC the Etruscans were installed in an important bridgehead on the south bank of...
[2] Page 346: Augustus was able to obtain the return of the Roman standards taken from Crassus and thankfully set ...
[3] Page 339: Brilliance like this was not just a matter of winning battles. Brief though Caesar’s recent visits t...
[4] Page 341: This was the end of civil war. Octavian returned to become consul. He had every card in his hand and...
[5] Page 325: The constitutional arrangements of the early republic were thus very complicated, but effective. The...
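
The later cells read `id`, `page`, and `text` from every record, so a quick shape check before extraction can catch a malformed sample file early. The following is a minimal sketch; `validate_paragraphs` and `REQUIRED_KEYS` are hypothetical names that do not appear in the notebook, and the sample records are illustrative rather than the real data.

```python
REQUIRED_KEYS = ("id", "page", "text")  # keys the later cells read from each record

def validate_paragraphs(records: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the records look usable."""
    problems = []
    for index, record in enumerate(records):
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append(f"record {index}: missing key '{key}'")
        text = record.get("text")
        if isinstance(text, str) and not text.strip():
            problems.append(f"record {index}: empty text")
    return problems

# Illustrative records only (not the real sample data)
sample = [
    {"id": "p1", "page": 319, "text": "In the sixth century BC ..."},
    {"id": "p2", "page": 346},
]
print(validate_paragraphs(sample))
```

In the notebook this could be called as `validate_paragraphs(paragraphs)` right after loading.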


## Define Schema (Pydantic Models)

In [48]:
class Entity(BaseModel):
    """Extracted entity from historical text (LLM output)."""
    name: str
    type: str  # "person", "place", "collective_entity", "event", "temporal", "cultural"
    subtype: Optional[str] = None  # place: "city", "region", "river", "sea" | collective_entity: "state", "people", "organization", "league" | temporal: "century", "year", "date", "range"
    aliases: list[str] = Field(default_factory=list)
    description: Optional[str] = None
    attributes: Optional[dict[str, str]] = None  # For titles, roles, etc.
    time_range: Optional[list[str]] = None  # [start, end]; will be populated from relationships
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)

class Relationship(BaseModel):
    """Relationship between two entities (LLM output)."""
    source_entity: str  # Entity name
    relation_type: str  # "conquered", "succeeded", "influenced-by", "ruled", "allied-with", etc.
    target_entity: str  # Entity name
    temporal_context: Optional[str] = None  # "509 BC", "6th century BC", "27 BC"
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)

class ExtractionResult(BaseModel):
    """Result of entity extraction from a paragraph."""
    entities: list[Entity]
    relationships: list[Relationship]
    paragraph_id: str

# Post-processed models (with IDs assigned)
class EntityWithId(BaseModel):
    """Entity with UUID assigned after extraction."""
    id: str  # UUID
    name: str
    type: str
    subtype: Optional[str] = None
    aliases: list[str] = Field(default_factory=list)
    description: Optional[str] = None
    attributes: Optional[dict[str, str]] = None
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)
    paragraph_id: str
    relationship_ids: list[str] = Field(default_factory=list)  # Bidirectional link

class RelationshipWithId(BaseModel):
    """Relationship with UUIDs for entities."""
    id: str  # UUID
    source_id: str  # Entity UUID
    target_id: str  # Entity UUID
    relation_type: str
    temporal_context: Optional[str] = None
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)
    paragraph_id: str

# Normalized models (after merging duplicates)
class NormalizedEntity(BaseModel):
    """Normalized entity after merging duplicates."""
    id: str  # Canonical UUID
    name: str
    type: str
    subtype: Optional[str] = None
    aliases: list[str] = Field(default_factory=list)
    description: str  # Combined descriptions
    attributes: Optional[dict[str, str]] = None
    source_paragraph_ids: list[str]
    occurrence_count: int
    merged_from_ids: list[str] = Field(default_factory=list)  # Original entity IDs that merged
    relationship_ids: list[str] = Field(default_factory=list)  # Bidirectional link

class NormalizedRelationship(BaseModel):
    """Normalized relationship with canonical entity IDs."""
    id: str  # UUID (preserved from original)
    source_id: str  # Normalized entity ID
    target_id: str  # Normalized entity ID
    relation_type: str
    temporal_context: Optional[str] = None
    confidence: float
    paragraph_id: str

## Manual Annotations (Reference)

Our manual annotations from Paragraph 1 (Etruscans and Rome's founding):

In [4]:
# Manual annotations for Paragraph 1 (abbreviated - see planning notes for full)
manual_paragraph_1 = {
    "entities": [
        {"name": "Etruscans", "type": "collective_entity", "subtype": "people"},
        {"name": "River Tiber", "type": "place", "subtype": "river", "aliases": ["Tiber"]},
        {"name": "Rome", "type": "place", "subtype": "city"},
        {"name": "Rome", "type": "collective_entity", "subtype": "state"},  # Dual nature!
        {"name": "Latins", "type": "collective_entity", "subtype": "people"},
        {"name": "Campania", "type": "place", "subtype": "region"},
        {"name": "509 BC", "type": "temporal", "subtype": "year"},
        {"name": "sixth century BC", "type": "temporal", "subtype": "century"},
        # ... more entities
    ],
    "relationships": [
        {"source": "Rome", "relation_type": "located-on", "target": "River Tiber"},
        {"source": "Rome", "relation_type": "broke-away-from", "target": "Etruscans", "temporal_context": "509 BC"},
        {"source": "Latin cities", "relation_type": "revolted-against", "target": "Etruscans", "temporal_context": "sixth century BC"},
        # ... more relationships
    ]
}

print("Manual annotations recorded for comparison")

Manual annotations recorded for comparison


## GPT-4o Extraction - Initial Prompt

In [None]:
EXTRACTION_PROMPT = """You are analyzing "The Penguin History of the World", Book 3: The Classical Age, Chapter 4 on Rome.

Extract ALL entities and relationships from the provided paragraph.

**ENTITY TYPES**:
- person: Individuals (rulers, leaders, historical figures)
- place: Geographic locations (cities, regions, rivers, etc.)
- collective_entity: Groups, states, organizations, peoples, leagues
- event: Historical events, political actions, battles, reforms
- temporal: Time references (centuries, years, dates)
- cultural: Cultural concepts, traditions, civilizations

**SUBTYPES**:
- place: "city", "region", "river", "sea"
- collective_entity: "state", "people", "organization", "league"
- temporal: "century" (single century), "year" (single year), "date" (specific date), "range" (time range)

**RELATIONSHIP TYPES** (examples - extract any you find):
- Political: "ruled", "conquered", "allied-with", "subordinated", "revolted-against", "succeeded"
- Geographic: "located-on", "located-in", "bordered-by"
- Cultural: "influenced-by", "came-from", "accessed-through"
- Temporal: "happened-in", "occurred-during"

**IMPORTANT GUIDELINES**:
1. Extract entities FROM THIS PARAGRAPH ONLY - do not use external knowledge
2. Extract relationships that are EXPLICITLY STATED in the text
3. Include aliases if the entity is referred to by multiple names (e.g., "Octavian" also called "Augustus")
4. For titles/roles, store as attributes (e.g., Caesar's "dictator for life")
5. DO NOT extract:
   - Generic unnamed groups ("his men", "the soldiers")
   - Entities mentioned only as comparisons ("like Athens")
   - Vague references without clear identity
6. Include temporal_context in relationships when dates/times are mentioned
7. Note: Some entities may have dual nature (e.g., Rome as both a city and a political state) - extract both if clear from context

Extract entities and relationships from this paragraph:

{paragraph_text}
"""

print("Extraction prompt defined")

Extraction prompt defined


## Test Extraction on Paragraph 1

In [10]:
def extract_entities_gpt4o(paragraph_text: str, paragraph_id: str) -> ExtractionResult:
    """Extract entities and relationships using GPT-4.1 with LangChain structured outputs."""
    
    # Initialize model with structured output
    model = ChatOpenAI(model="gpt-4.1", temperature=0.0)
    model_with_structure = model.with_structured_output(ExtractionResult)
    
    # Create the prompt
    system_message = "You are an expert at extracting structured historical entities and relationships from text."
    user_message = EXTRACTION_PROMPT.format(paragraph_text=paragraph_text)
    
    # Invoke with LCEL
    result = model_with_structure.invoke([
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ])
    
    result.paragraph_id = paragraph_id
    return result

# Test on first paragraph
para1 = paragraphs[0]
print(f"Extracting from Paragraph 1 (Page {para1['page']})...\n")
print(f"Text preview: {para1['text'][:200]}...\n")

result1 = extract_entities_gpt4o(para1['text'], para1['id'])

print(f"\n=== EXTRACTION RESULTS ===")
print(f"Entities extracted: {len(result1.entities)}")
print(f"Relationships extracted: {len(result1.relationships)}")

Extracting from Paragraph 1 (Page 319)...

Text preview: In the sixth century BC the Etruscans were installed in an important bridgehead on the south bank of the River Tiber. This was the site of Rome, one of a number of small cities of the Latins, an old-e...


=== EXTRACTION RESULTS ===
Entities extracted: 18
Relationships extracted: 18


## Inspect Extracted Entities

In [11]:
print("\n=== ENTITIES ===")
for i, entity in enumerate(result1.entities, 1):
    print(f"\n[{i}] {entity.name}")
    print(f"    Type: {entity.type}")
    if entity.subtype:
        print(f"    Subtype: {entity.subtype}")
    if entity.aliases:
        print(f"    Aliases: {entity.aliases}")
    if entity.description:
        print(f"    Description: {entity.description[:100]}...")
    if entity.attributes:
        print(f"    Attributes: {entity.attributes}")
    print(f"    Confidence: {entity.confidence}")


=== ENTITIES ===

[1] Etruscans
    Type: collective_entity
    Subtype: people
    Description: An ancient people installed in a bridgehead on the south bank of the River Tiber in the sixth centur...
    Confidence: 1.0

[2] River Tiber
    Type: place
    Subtype: river
    Aliases: ['Tiber']
    Description: A river in Italy, on whose south bank the Etruscans were installed....
    Confidence: 1.0

[3] Rome
    Type: place
    Subtype: city
    Description: A city on the south bank of the River Tiber, site of Etruscan installation and later a Latin city....
    Confidence: 1.0

[4] Rome
    Type: collective_entity
    Subtype: state
    Description: The political entity of Rome, which broke away from Etruscan dominion and retained Etruscan influenc...
    Confidence: 1.0

[5] Latins
    Type: collective_entity
    Subtype: people
    Description: An old-established people of the Campania, to which Rome belonged....
    Confidence: 1.0

[6] Campania
    Type: place
    Subtype: regi

## Inspect Extracted Relationships

In [12]:
print("\n=== RELATIONSHIPS ===")
for i, rel in enumerate(result1.relationships, 1):
    temporal = f" ({rel.temporal_context})" if rel.temporal_context else ""
    print(f"[{i}] {rel.source_entity} --[{rel.relation_type}]--> {rel.target_entity}{temporal}")
    print(f"    Confidence: {rel.confidence}")


=== RELATIONSHIPS ===
[1] Etruscans --[installed-in]--> bridgehead on the south bank of the River Tiber (sixth century BC)
    Confidence: 1.0
[2] bridgehead on the south bank of the River Tiber --[located-in]--> site of Rome
    Confidence: 1.0
[3] Rome --[located-on]--> south bank of the River Tiber
    Confidence: 1.0
[4] Rome --[member-of]--> Latin cities
    Confidence: 1.0
[5] Latins --[located-in]--> Campania
    Confidence: 1.0
[6] Etruscan --[influenced]--> European tradition
    Confidence: 0.8
[7] Rome --[influenced-by]--> Etruscans
    Confidence: 1.0
[8] Rome --[broke-away-from]--> Etruscan dominion (end of sixth century BC)
    Confidence: 1.0
[9] Latin cities --[revolted-against]--> Etruscan dominion (end of sixth century BC)
    Confidence: 1.0
[10] kings of Rome --[ruled]--> Rome (before 509 BC)
    Confidence: 1.0
[11] last king of Rome --[expelled-from]--> Rome (509 BC)
    Confidence: 1.0
[12] Etruscan power --[challenged-by]--> Latin peoples (about 509 BC)
    Con

## Compare with Manual Annotations

In [13]:
# Manual comparison - what did LLM miss? What did it over-extract?
print("\n=== COMPARISON NOTES ===")
print("\nManual extraction had these key entities:")
for e in manual_paragraph_1["entities"][:5]:
    print(f"  - {e['name']} ({e['type']})")

print("\nLLM extracted:")
for e in result1.entities[:5]:
    print(f"  - {e.name} ({e.type})")

print("\nQuestions to explore:")
print("- Did LLM capture the dual nature of Rome (city + political entity)?")
print("- Did LLM correctly categorize temporal entities (century vs year)?")
print("- Did LLM avoid extracting generic groups?")
print("- Are relationship types appropriate and specific enough?")


=== COMPARISON NOTES ===

Manual extraction had these key entities:
  - Etruscans (collective_entity)
  - River Tiber (place)
  - Rome (place)
  - Rome (collective_entity)
  - Latins (collective_entity)

LLM extracted:
  - Etruscans (collective_entity)
  - River Tiber (place)
  - Rome (place)
  - Rome (collective_entity)
  - Latins (collective_entity)

Questions to explore:
- Did LLM capture the dual nature of Rome (city + political entity)?
- Did LLM correctly categorize temporal entities (century vs year)?
- Did LLM avoid extracting generic groups?
- Are relationship types appropriate and specific enough?
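
The comparison above prints only the first five entities from each side, which cannot show what the LLM missed or over-extracted. A set-based comparison is one way to answer those questions. This is a sketch under assumptions: `compare_entities` is a hypothetical helper, matching is done on the lowercased name plus type, and the data below is illustrative.

```python
def compare_entities(manual: list[dict], extracted: list[dict]) -> dict:
    """Compare two entity lists on (lowercased name, type) keys."""
    def key(entity: dict) -> tuple[str, str]:
        return (entity["name"].strip().lower(), entity["type"])

    manual_keys = {key(e) for e in manual}
    extracted_keys = {key(e) for e in extracted}
    matched = manual_keys & extracted_keys
    return {
        "matched": sorted(matched),
        "missed": sorted(manual_keys - extracted_keys),  # annotated manually, not extracted
        "extra": sorted(extracted_keys - manual_keys),   # extracted, not annotated manually
        "precision": len(matched) / len(extracted_keys) if extracted_keys else 0.0,
        "recall": len(matched) / len(manual_keys) if manual_keys else 0.0,
    }

# Illustrative data, not real extraction output
manual = [
    {"name": "Rome", "type": "place"},
    {"name": "Rome", "type": "collective_entity"},
    {"name": "509 BC", "type": "temporal"},
]
extracted = [
    {"name": "Rome", "type": "place"},
    {"name": "rome", "type": "collective_entity"},
    {"name": "European tradition", "type": "cultural"},
]
report = compare_entities(manual, extracted)
print(report["missed"], report["extra"])
```

With the notebook's objects, the extracted side could be passed as `[e.model_dump() for e in result1.entities]`, the same conversion the save cell uses.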


## Save Extraction Results

In [14]:
# Save for later analysis
output = {
    "paragraph_id": para1['id'],
    "paragraph_text": para1['text'],
    "page": para1['page'],
    "entities": [e.model_dump() for e in result1.entities],
    "relationships": [r.model_dump() for r in result1.relationships]
}

with open("extraction_paragraph1_initial.json", "w") as f:
    json.dump(output, f, indent=2)

print("Saved extraction results to extraction_paragraph1_initial.json")

Saved extraction results to extraction_paragraph1_initial.json


## Iteration Space

Use the cells below to:
1. Refine the extraction prompt
2. Test on other paragraphs
3. Experiment with different entity types or relationship schemas
4. Test normalization approaches (Levenshtein, embeddings)

In [52]:
# Iteration cell - modify prompt and re-run
# TODO: Based on comparison, adjust EXTRACTION_PROMPT and re-extract

EXTRACTION_PROMPT = """You are analyzing text from "The Penguin History of the World".

Extract ALL entities and relationships from the provided paragraph.

**ENTITY TYPES**:
- person: Individuals (rulers, leaders, historical figures)
- place: Geographic locations (cities, regions, rivers, etc.)
- collective_entity: Groups, states, organizations, peoples, leagues
- event: Historical events, political actions, battles, reforms
- temporal: Time references (centuries, years, dates)
- cultural: Cultural concepts, traditions, civilizations

**SUBTYPES**:
- place: "city", "region", "river", "sea"
- collective_entity: "state", "people", "organization", "league"
- temporal: "century" (single century), "year" (single year), "date" (specific date), "range" (time range)

**RELATIONSHIP TYPES** (examples - extract any you find):
- Political: "ruled", "conquered", "allied-with", "subordinated", "revolted-against", "succeeded"
- Geographic: "located-on", "located-in", "bordered-by"
- Cultural: "influenced-by", "came-from", "accessed-through"
- Temporal: "happened-in", "occurred-during"

**IMPORTANT GUIDELINES**:
1. Extract entities FROM THIS PARAGRAPH ONLY - do not use external knowledge
2. Extract relationships that are EXPLICITLY STATED in the text
3. Include aliases if the entity is referred to by multiple names (e.g., "Octavian" also called "Augustus")
4. For titles/roles, store as attributes (e.g., Caesar's "dictator for life")
5. DO NOT extract:
   - Generic unnamed groups ("his men", "the soldiers")
   - Entities mentioned only as comparisons ("like Athens")
   - Vague references without clear identity
6. Include temporal_context in relationships when dates/times are mentioned
7. Note: Some entities may have dual nature (e.g., Rome as both a city and a political state) - extract both if clear from context
8. The extracted entities and relationships will be used to build a knowledge graph. To this end, relationships must be clearly defined between identified entities.

Extract entities and relationships from this paragraph:

{paragraph_text}
"""

In [53]:
# Test on all 5 paragraphs
all_results = []

for para in paragraphs:
    print(f"\nExtracting from page {para['page']}...")
    result = extract_entities_gpt4o(para['text'], para['id'])
    all_results.append(result)
    print(f"  Entities: {len(result.entities)}, Relationships: {len(result.relationships)}")

print(f"\nTotal extracted across all paragraphs:")
total_entities = sum(len(r.entities) for r in all_results)
total_relationships = sum(len(r.relationships) for r in all_results)
print(f"  Entities: {total_entities}")
print(f"  Relationships: {total_relationships}")


Extracting from page 319...
  Entities: 17, Relationships: 20

Extracting from page 346...
  Entities: 10, Relationships: 12

Extracting from page 339...
  Entities: 10, Relationships: 10

Extracting from page 341...
  Entities: 14, Relationships: 18

Extracting from page 325...
  Entities: 14, Relationships: 15

Total extracted across all paragraphs:
  Entities: 65
  Relationships: 75


In [55]:
print("\n=== ENTITIES ===")
for result in all_results:
    print(f"\n--- Paragraph ID: {result.paragraph_id} ---")
    for i, entity in enumerate(result.entities, 1):
        print(f"\n[{i}] {entity.name}")
        print(f"    Type: {entity.type}")
        if entity.subtype:
            print(f"    Subtype: {entity.subtype}")
        if entity.aliases:
            print(f"    Aliases: {entity.aliases}")
        if entity.description:
            print(f"    Description: {entity.description}")
        if entity.attributes:
            print(f"    Attributes: {entity.attributes}")
        print(f"    Confidence: {entity.confidence}")


=== ENTITIES ===

--- Paragraph ID: 0cd2eb74-f135-4c54-8d93-7ee75280146a ---

[1] Etruscans
    Type: collective_entity
    Subtype: people
    Description: An ancient people installed in a bridgehead on the south bank of the River Tiber in the sixth century BC.
    Confidence: 1.0

[2] River Tiber
    Type: place
    Subtype: river
    Aliases: ['Tiber']
    Description: A river in Italy, on whose south bank the Etruscans were installed.
    Confidence: 1.0

[3] Rome
    Type: place
    Subtype: city
    Description: A city on the south bank of the River Tiber, site of Etruscan installation, and a city of the Latins.
    Confidence: 1.0

[4] Latins
    Type: collective_entity
    Subtype: people
    Description: An old-established people of the Campania, associated with a number of small cities including Rome.
    Confidence: 1.0

[5] Campania
    Type: place
    Subtype: region
    Description: A region associated with the Latins.
    Confidence: 1.0

[6] European tradition
    Type

In [56]:
print("\n=== RELATIONSHIPS ===")
for result in all_results:
    print(f"\n--- Paragraph ID: {result.paragraph_id} ---")
    for i, rel in enumerate(result.relationships, 1):
        temporal = f" ({rel.temporal_context})" if rel.temporal_context else ""
        print(f"[{i}] {rel.source_entity} --[{rel.relation_type}]--> {rel.target_entity}{temporal}")
        print(f"    Confidence: {rel.confidence}")


=== RELATIONSHIPS ===

--- Paragraph ID: 0cd2eb74-f135-4c54-8d93-7ee75280146a ---
[1] Etruscans --[installed-in]--> bridgehead on the south bank of the River Tiber (sixth century BC)
    Confidence: 1.0
[2] bridgehead on the south bank of the River Tiber --[located-in]--> site of Rome
    Confidence: 1.0
[3] Rome --[located-on]--> River Tiber
    Confidence: 1.0
[4] Rome --[city-of]--> Latins
    Confidence: 1.0
[5] Latins --[located-in]--> Campania
    Confidence: 1.0
[6] Etruscan --[influenced-by]--> European tradition
    Confidence: 0.8
[7] Rome --[ruled-by]--> Etruscan dominion (until end of sixth century BC)
    Confidence: 1.0
[8] revolt of the Latin cities --[occurred-during]--> end of the sixth century BC (end of the sixth century BC)
    Confidence: 1.0
[9] Rome --[broke-away-from]--> Etruscan dominion (end of the sixth century BC)
    Confidence: 1.0
[10] Latin cities --[revolted-against]--> Etruscan dominion (end of the sixth century BC)
    Confidence: 1.0
[11] kings of R

In [22]:
# export all results
all_output = []
for result in all_results:
    output = {
        "paragraph_id": result.paragraph_id,
        "entities": [e.model_dump() for e in result.entities],
        "relationships": [r.model_dump() for r in result.relationships]
    }
    all_output.append(output)
with open("extraction_all_paragraphs.json", "w") as f:
    json.dump(all_output, f, indent=2)
print("Saved all extraction results to extraction_all_paragraphs.json")

Saved all extraction results to extraction_all_paragraphs.json
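
Once the export exists, a per-type tally gives a quick view of how the schema is being used across paragraphs. The sketch below assumes the export structure written by the cell above (a list of dicts with `entities` and `relationships`); `summarize_export` is a hypothetical helper and the example data is illustrative.

```python
from collections import Counter

def summarize_export(all_output: list[dict]) -> dict:
    """Count entity types and relation types across exported paragraphs."""
    entity_types: Counter = Counter()
    relation_types: Counter = Counter()
    for paragraph in all_output:
        entity_types.update(e["type"] for e in paragraph["entities"])
        relation_types.update(r["relation_type"] for r in paragraph["relationships"])
    return {
        "entity_types": dict(entity_types),
        "relation_types": dict(relation_types),
    }

# Illustrative export, not real extraction output
example_export = [
    {
        "paragraph_id": "p1",
        "entities": [
            {"name": "Rome", "type": "place"},
            {"name": "Latins", "type": "collective_entity"},
        ],
        "relationships": [
            {"source_entity": "Rome", "relation_type": "located-on", "target_entity": "River Tiber"},
        ],
    },
    {
        "paragraph_id": "p2",
        "entities": [
            {"name": "Augustus", "type": "person"},
            {"name": "Rome", "type": "place"},
        ],
        "relationships": [],
    },
]
summary = summarize_export(example_export)
print(summary["entity_types"])
```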


In [57]:
# Assign IDs to entities and relationships (post-processing step)
import uuid as uuid_module

def assign_ids_to_extraction_results(all_results: list[ExtractionResult]) -> tuple[list[EntityWithId], list[RelationshipWithId]]:
    """
    Assign UUIDs to entities and relationships, create bidirectional links.
    
    Processes one paragraph at a time for clarity.
    
    Returns:
        (entities_with_ids, relationships_with_ids)
    """
    all_entities_with_ids = []
    all_relationships_with_ids = []
    skipped_relationships = 0
    
    # Process each paragraph
    for result in all_results:
        paragraph_entities = {}  # entity_name -> EntityWithId for this paragraph
        paragraph_relationships = []
        
        # Step 1: Create entities with IDs for this paragraph
        for entity in result.entities:
            entity_id = str(uuid_module.uuid4())
            entity_with_id = EntityWithId(
                id=entity_id,
                name=entity.name,
                type=entity.type,
                subtype=entity.subtype,
                aliases=entity.aliases,
                description=entity.description,
                attributes=entity.attributes,
                confidence=entity.confidence,
                paragraph_id=result.paragraph_id,
                relationship_ids=[]  # Will populate as we process relationships
            )
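            # Caveat: keyed by name, so a second entity with the same name in this
            # paragraph (e.g. "Rome" as place and as state) overwrites the first.
            paragraph_entities[entity.name] = entity_with_id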
        
        # Step 2: Create relationships with IDs and build bidirectional links
        for rel in result.relationships:
            source_entity = paragraph_entities.get(rel.source_entity)
            target_entity = paragraph_entities.get(rel.target_entity)
            
            # Only create relationship if both entities exist
            if source_entity and target_entity:
                rel_id = str(uuid_module.uuid4())
                rel_with_id = RelationshipWithId(
                    id=rel_id,
                    source_id=source_entity.id,
                    target_id=target_entity.id,
                    relation_type=rel.relation_type,
                    temporal_context=rel.temporal_context,
                    confidence=rel.confidence,
                    paragraph_id=result.paragraph_id
                )
                paragraph_relationships.append(rel_with_id)
                
                # Build bidirectional links
                source_entity.relationship_ids.append(rel_id)
                target_entity.relationship_ids.append(rel_id)
            else:
                skipped_relationships += 1
                print(f"Skipped relationship: {rel.source_entity} --[{rel.relation_type}]--> {rel.target_entity} (entities not found)")
                print(f"\tsource entity: {source_entity}")
                print(f"\ttarget entity: {target_entity}")

        
        # Add to global lists
        all_entities_with_ids.extend(paragraph_entities.values())
        all_relationships_with_ids.extend(paragraph_relationships)
    
    print(f"Created {len(all_entities_with_ids)} entities with IDs")
    print(f"Created {len(all_relationships_with_ids)} relationships with IDs")
    print(f"Skipped {skipped_relationships} relationships (referenced entities not found)")
    
    return all_entities_with_ids, all_relationships_with_ids

# Assign IDs
entities_with_ids, relationships_with_ids = assign_ids_to_extraction_results(all_results)

# Verify bidirectional links
entities_with_rels = [e for e in entities_with_ids if e.relationship_ids]
print(f"\nEntities with relationships: {len(entities_with_rels)} / {len(entities_with_ids)}")
if entities_with_rels:
    print(f"Example: '{entities_with_rels[0].name}' has {len(entities_with_rels[0].relationship_ids)} relationships")

Skipped relationship: Etruscans --[installed-in]--> bridgehead on the south bank of the River Tiber (entities not found)
	source entity: id='d4e98946-f307-476b-b5e4-143e64cef032' name='Etruscans' type='collective_entity' subtype='people' aliases=[] description='An ancient people installed in a bridgehead on the south bank of the River Tiber in the sixth century BC.' attributes=None confidence=1.0 paragraph_id='0cd2eb74-f135-4c54-8d93-7ee75280146a' relationship_ids=[]
	target entity: None
Skipped relationship: bridgehead on the south bank of the River Tiber --[located-in]--> site of Rome (entities not found)
	source entity: None
	target entity: None
Skipped relationship: Etruscan --[influenced-by]--> European tradition (entities not found)
	source entity: None
	target entity: id='19cf6a98-f8d5-40b5-a4a0-e607b4147bc9' name='European tradition' type='cultural' subtype=None aliases=[] description='The broader European cultural tradition into which Etruscan influence flowed.' attributes=Non
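
The skipped relationships above all have an endpoint whose string does not exactly match any extracted entity name (for example "Etruscan" versus "Etruscans"). Listing such endpoints before assigning IDs makes the mismatch patterns easier to review. This is a sketch, not the notebook's method: `find_dangling_endpoints` is a hypothetical helper, and unlike the exact-name lookup above it matches case-insensitively and also accepts aliases.

```python
def find_dangling_endpoints(entities: list[dict], relationships: list[dict]) -> list[tuple[str, str]]:
    """Return (role, name) pairs for endpoints that match no entity name or alias."""
    known = set()
    for entity in entities:
        known.add(entity["name"].strip().lower())
        known.update(alias.strip().lower() for alias in (entity.get("aliases") or []))

    dangling = []
    for rel in relationships:
        for role in ("source_entity", "target_entity"):
            name = rel[role]
            if name.strip().lower() not in known:
                dangling.append((role, name))
    return dangling

# Illustrative data shaped like the Entity / Relationship models
entities = [
    {"name": "Etruscans", "aliases": []},
    {"name": "River Tiber", "aliases": ["Tiber"]},
    {"name": "Rome", "aliases": []},
]
relationships = [
    {"source_entity": "Rome", "relation_type": "located-on", "target_entity": "Tiber"},
    {"source_entity": "Etruscan", "relation_type": "influenced-by", "target_entity": "European tradition"},
]
print(find_dangling_endpoints(entities, relationships))
```

Endpoints reported here are candidates for either a prompt fix (ask the model to reuse entity names verbatim) or a lookup that tolerates aliases and near matches.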

## Next Steps

1. Review extraction quality
2. Identify patterns in what LLM extracts vs manual annotations
3. Refine prompt based on findings
4. Test entity normalization (Levenshtein for duplicates)
5. Build simple graph visualization with NetworkX/Plotly
6. Iterate until satisfied, then move to production script
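
Step 4 mentions Levenshtein matching for duplicates. For reference, a true edit-distance similarity can be written with the standard library alone. This is a minimal sketch; `levenshtein_distance` and `edit_similarity` are hypothetical names, and normalizing by the longer string's length is one common choice among several.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def edit_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0.0, 1.0]; 1.0 means identical strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest

print(levenshtein_distance("Etruscan", "Etruscans"))
print(round(edit_similarity("Etruscan", "Etruscans"), 3))
```

Short names are sensitive to a single edit under this measure, so any merge threshold is worth checking against real pairs from the extraction output.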

# Entity Normalization

Merge duplicate entities across all paragraphs using:

1. **Exact name matching** (case-insensitive) - DONE ✓
   
2. **Alias-based merging** - DONE ✓
   - If entity A's name appears in entity B's alias list, merge them
   - **Canonical name selection** (smart merging):
     - More frequent entity becomes canonical (e.g., "Rome" × 4 beats "Roman power" × 1)
     - If tied, shorter name wins
     - If still tied, alphabetical order
   - Iterates until no more merges possible
   
3. **Fuzzy Levenshtein matching** - DONE ✓
   - Finds pairs with high string similarity (default threshold: 0.90)
   - **Two-phase workflow**:
     - Phase 1: Run without `fuzzy_merge_pairs` → returns candidates for review
     - Phase 2: Run with `fuzzy_merge_pairs` → applies approved merges
   - Same frequency-based canonical name selection
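
The canonical name rule described above (most frequent, then shorter, then alphabetical) can be expressed as a single sort key. This is a sketch of the rule as stated, not the notebook's implementation; `pick_canonical_name` is a hypothetical helper and the counts are illustrative.

```python
def pick_canonical_name(name_counts: dict[str, int]) -> str:
    """Most frequent name wins; ties go to the shorter name, then alphabetical order."""
    return min(name_counts, key=lambda name: (-name_counts[name], len(name), name))

# Illustrative counts
print(pick_canonical_name({"Rome": 4, "Roman power": 1}))    # frequency decides
print(pick_canonical_name({"River Tiber": 1, "Tiber": 1}))   # tie broken by length
print(pick_canonical_name({"Octavian": 2, "Augustus": 2}))   # tie broken alphabetically
```

Negating the count lets `min` handle all three criteria in one pass, so the tie-breakers apply only when the earlier criteria are equal.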

In [58]:
# Entity and relationship normalization functions (ID-based)

from difflib import SequenceMatcher
from collections import defaultdict
import uuid

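def levenshtein_similarity(s1: str, s2: str) -> float:
    """Calculate a case-insensitive similarity ratio between two strings (0.0 to 1.0).

    Despite the name, this uses difflib.SequenceMatcher (longest matching
    blocks), not a true Levenshtein edit distance.
    """
    return SequenceMatcher(None, s1.lower(), s2.lower()).ratio()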

def normalize_entities_and_relationships(
    entities_with_ids: list[EntityWithId],
    relationships_with_ids: list[RelationshipWithId],
    fuzzy_threshold: float = 0.90,
    fuzzy_merge_pairs: Optional[list] = None
) -> tuple[list[NormalizedEntity], list[NormalizedRelationship], Optional[list]]:
    """
    Normalize entities by merging duplicates, update relationships to use normalized IDs.
    
    Returns:
        (normalized_entities, normalized_relationships, fuzzy_candidates or None)
    """
    
    print(f"Starting normalization with {len(entities_with_ids)} entities")
    
    # Stage 1: Group entities by exact name match (case-insensitive)
    exact_groups = defaultdict(list)
    for entity in entities_with_ids:
        key = entity.name.lower().strip()
        exact_groups[key].append(entity)
    
    print(f"Unique names after exact matching: {len(exact_groups)}")
    
    # Stage 2: Alias-based merging
    merged_groups = _alias_based_merging(exact_groups)
    print(f"Unique names after alias-based merging: {len(merged_groups)}")
    
    # Stage 3: Fuzzy matching
    fuzzy_candidates = None
    if fuzzy_merge_pairs:
        merged_groups = _apply_fuzzy_merges(merged_groups, fuzzy_merge_pairs)
        print(f"Unique names after fuzzy matching: {len(merged_groups)}")
    else:
        fuzzy_candidates = _find_fuzzy_match_candidates(merged_groups, fuzzy_threshold)
        print(f"Found {len(fuzzy_candidates)} fuzzy match candidates")
    
    # Stage 4: Create normalized entities and track ID mappings
    normalized_entities = []
    old_id_to_normalized_id = {}  # old entity ID -> normalized entity ID
    
    for name, entity_group in merged_groups.items():
        # Create normalized entity
        canonical_id = str(uuid.uuid4())
        merged_from_ids = [e.id for e in entity_group]
        
        # Aggregate data from all entities in group
        all_aliases = set()
        all_descriptions = []
        all_paragraphs = set()
        all_attributes = {}
        all_relationship_ids = set()
        base_entity = entity_group[0]
        
        for entity in entity_group:
            if entity.aliases:
                all_aliases.update(entity.aliases)
            if entity.description:
                all_descriptions.append(entity.description)
            all_paragraphs.add(entity.paragraph_id)
            if entity.attributes:
                all_attributes.update(entity.attributes)
            all_relationship_ids.update(entity.relationship_ids)
            
            # Track old ID -> normalized ID mapping
            old_id_to_normalized_id[entity.id] = canonical_id
        
        normalized_entity = NormalizedEntity(
            id=canonical_id,
            name=base_entity.name,
            type=base_entity.type,
            subtype=base_entity.subtype,
            aliases=list(all_aliases),
            description=' | '.join(all_descriptions) if all_descriptions else '',
            attributes=all_attributes if all_attributes else None,
            source_paragraph_ids=list(all_paragraphs),
            occurrence_count=len(entity_group),
            merged_from_ids=merged_from_ids,
            relationship_ids=list(all_relationship_ids)  # Will update after normalizing relationships
        )
        normalized_entities.append(normalized_entity)
    
    print(f"Created {len(normalized_entities)} normalized entities")
    
    # Stage 5: Normalize relationships (update entity IDs, rebuild bidirectional links)
    normalized_relationships = []
    normalized_entity_id_to_obj = {e.id: e for e in normalized_entities}
    
    # Reset relationship_ids on normalized entities (will rebuild)
    for entity in normalized_entities:
        entity.relationship_ids = []
    
    for rel in relationships_with_ids:
        # Map old entity IDs to normalized IDs
        norm_source_id = old_id_to_normalized_id.get(rel.source_id)
        norm_target_id = old_id_to_normalized_id.get(rel.target_id)
        
        if norm_source_id and norm_target_id:
            # Create normalized relationship
            norm_rel = NormalizedRelationship(
                id=rel.id,  # Preserve original relationship ID
                source_id=norm_source_id,
                target_id=norm_target_id,
                relation_type=rel.relation_type,
                temporal_context=rel.temporal_context,
                confidence=rel.confidence,
                paragraph_id=rel.paragraph_id
            )
            normalized_relationships.append(norm_rel)
            
            # Rebuild bidirectional links. Iterating over a set avoids adding the
            # same relationship ID twice when source and target were merged into
            # one normalized entity.
            for norm_id in {norm_source_id, norm_target_id}:
                if norm_id in normalized_entity_id_to_obj:
                    normalized_entity_id_to_obj[norm_id].relationship_ids.append(rel.id)
    
    print(f"Normalized {len(normalized_relationships)} relationships")
    print(f"Skipped {len(relationships_with_ids) - len(normalized_relationships)} relationships (entities not found)")
    
    return normalized_entities, normalized_relationships, fuzzy_candidates

def _build_alias_lookup(groups: dict) -> dict:
    """Map each lowercased alias to the name of the group that declares it."""
    alias_to_canonical = {}
    for canonical_name, entities in groups.items():
        for entity in entities:
            for alias in entity.aliases or []:
                alias_key = alias.lower().strip()
                if alias_key and alias_key != canonical_name:
                    alias_to_canonical[alias_key] = canonical_name
    return alias_to_canonical

def _alias_based_merging(exact_groups: dict) -> dict:
    """Merge entity groups if one group's name appears in another group's aliases."""
    # Copy the lists too, so extend() below does not mutate the caller's groups
    merged_groups = {name: list(entities) for name, entities in exact_groups.items()}
    
    alias_to_canonical = _build_alias_lookup(merged_groups)
    print(f"  Found {len(alias_to_canonical)} alias mappings")
    
    # Merge based on aliases (with frequency-based canonical selection)
    merged_any = True
    merge_count = 0
    
    while merged_any:
        merged_any = False
        for name in list(merged_groups.keys()):
            other_name = alias_to_canonical.get(name)
            if other_name is None or other_name == name or other_name not in merged_groups:
                continue
            
            # Frequency-based canonical selection; ties go to the shorter name
            name_count = len(merged_groups[name])
            other_count = len(merged_groups[other_name])
            
            if name_count > other_count:
                canonical, merge_from = name, other_name
            elif other_count > name_count:
                canonical, merge_from = other_name, name
            else:
                canonical = name if len(name) < len(other_name) else other_name
                merge_from = other_name if canonical == name else name
            
            merged_groups[canonical].extend(merged_groups[merge_from])
            del merged_groups[merge_from]
            merged_any = True
            merge_count += 1
            print(f"  Merged '{merge_from}' into '{canonical}'")
            
            # Group membership changed: rebuild the lookup and rescan from the start
            alias_to_canonical = _build_alias_lookup(merged_groups)
            break
    
    print(f"  Completed {merge_count} alias-based merges")
    return merged_groups

def _find_fuzzy_match_candidates(merged_groups: dict, threshold: float) -> list:
    """Find pairs of entity groups with high string similarity."""
    candidates = []
    group_names = list(merged_groups.keys())
    
    for i, name1 in enumerate(group_names):
        for name2 in group_names[i+1:]:
            similarity = levenshtein_similarity(name1, name2)
            if similarity >= threshold:
                group1 = merged_groups[name1]
                group2 = merged_groups[name2]
                entity1 = group1[0]
                entity2 = group2[0]
                
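                candidates.append({
                    'name1': name1,
                    'name2': name2,
                    'similarity': similarity,
                    'count1': len(group1),
                    'count2': len(group2),
                    'type1': entity1.type,
                    'type2': entity2.type,
                    # Subtypes are read by the candidate-review cell
                    'subtype1': entity1.subtype,
                    'subtype2': entity2.subtype,
                })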
    
    candidates.sort(key=lambda x: x['similarity'], reverse=True)
    return candidates

def _apply_fuzzy_merges(merged_groups: dict, merge_pairs: list) -> dict:
    """Apply approved fuzzy matches."""
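    # Copy the lists too, so extend() below does not mutate the caller's groups
    result = {name: list(entities) for name, entities in merged_groups.items()}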
    
    for name1, name2 in merge_pairs:
        if name1 not in result or name2 not in result:
            continue
        
        # Frequency-based canonical selection
        count1 = len(result[name1])
        count2 = len(result[name2])
        canonical = name1 if count1 >= count2 else name2
        merge_from = name2 if canonical == name1 else name1
        
        result[canonical].extend(result[merge_from])
        del result[merge_from]
        print(f"  Merged '{merge_from}' into '{canonical}' (fuzzy match)")
    
    return result

print("Entity and relationship normalization functions defined")

Entity and relationship normalization functions defined


## Test Normalization

Run normalization on all 5 paragraphs and inspect results

In [59]:
# Run normalization on entities and relationships
normalized_entities, normalized_relationships, fuzzy_candidates = normalize_entities_and_relationships(
    entities_with_ids, 
    relationships_with_ids,
    fuzzy_threshold=0.90
)

# Print statistics
print("\n" + "=" * 80)
print("NORMALIZATION STATISTICS")
print("=" * 80)
print(f"Total entities before: {len(entities_with_ids)}")
print(f"Normalized entities: {len(normalized_entities)}")
print(f"Reduction: {len(entities_with_ids) - len(normalized_entities)} entities merged")
print()
print(f"Total relationships before: {len(relationships_with_ids)}")
print(f"Normalized relationships: {len(normalized_relationships)}")
print(f"Skipped: {len(relationships_with_ids) - len(normalized_relationships)} relationships")
print()

# Display top 10 most frequently occurring entities
print("=" * 80)
print("TOP 10 MOST FREQUENT ENTITIES")
print("=" * 80)
sorted_entities = sorted(normalized_entities, key=lambda x: x.occurrence_count, reverse=True)

for i, entity in enumerate(sorted_entities[:10], 1):
    print(f"\n[{i}] {entity.name} ({entity.type})")
    print(f"    Occurrences: {entity.occurrence_count}")
    print(f"    Merged from {len(entity.merged_from_ids)} entities")
    print(f"    Relationships: {len(entity.relationship_ids)}")
    if entity.aliases:
        print(f"    Aliases: {', '.join(entity.aliases[:3])}")
    if entity.attributes:
        print(f"    Attributes: {entity.attributes}")

print("\n" + "=" * 80)

Starting normalization with 65 entities
Unique names after exact matching: 61
  Found 9 alias mappings
  Merged 'roman power' into 'rome'
  Merged 'augustus' into 'octavian'
  Completed 2 alias-based merges
Unique names after alias-based merging: 59
Found 0 fuzzy match candidates
Created 59 normalized entities
Normalized 44 relationships
Skipped 0 relationships (entities not found)

NORMALIZATION STATISTICS
Total entities before: 65
Normalized entities: 59
Reduction: 6 entities merged

Total relationships before: 44
Normalized relationships: 44
Skipped: 0 relationships

TOP 10 MOST FREQUENT ENTITIES

[1] Rome (place)
    Occurrences: 5
    Merged from 5 entities
    Relationships: 13
    Aliases: Rome

[2] Senate (collective_entity)
    Occurrences: 2
    Merged from 2 entities
    Relationships: 5
    Aliases: Roman Senate

[3] Octavian (person)
    Occurrences: 2
    Merged from 2 entities
    Relationships: 9
    Aliases: Augustus
    Attributes: {'titles_or_roles': 'consul, imperat

In [61]:
# Display sample entity mappings to verify normalization
print("=" * 80)
print("SAMPLE ENTITY MAPPINGS (first 15)")
print("=" * 80)
print("Format: (paragraph_id, entity_name) -> normalized_entity_id")
print()

for i, ((para_id, entity_name), norm_id) in enumerate(list(normalized_data['entity_mapping'].items())[:15], 1):
    # Find the normalized entity to show its canonical name
    norm_entity = next(e for e in normalized_data['normalized_entities'] if e['id'] == norm_id)
    print(f"[{i}] ({para_id[:8]}..., '{entity_name}') -> {norm_entity['name']}")
    if entity_name.lower() != norm_entity['name'].lower():
        print(f"     NOTE: Name normalized from '{entity_name}' to '{norm_entity['name']}'")

print("\n" + "=" * 80)

SAMPLE ENTITY MAPPINGS (first 15)
Format: (paragraph_id, entity_name) -> normalized_entity_id

[1] (0cd2eb74..., 'Etruscans') -> Etruscans
[2] (0cd2eb74..., 'River Tiber') -> River Tiber
[3] (0cd2eb74..., 'Rome') -> Rome
[4] (1ef9dd6f..., 'Rome') -> Rome
[5] (1fa5555b..., 'Rome') -> Rome
[6] (22caab97..., 'Rome') -> Rome
[7] (22caab97..., 'Roman power') -> Rome
     NOTE: Name normalized from 'Roman power' to 'Rome'
[8] (0cd2eb74..., 'Latins') -> Latins
[9] (0cd2eb74..., 'Latin peoples') -> Latins
     NOTE: Name normalized from 'Latin peoples' to 'Latins'
[10] (0cd2eb74..., 'Campania') -> Campania
[11] (0cd2eb74..., 'European tradition') -> European tradition
[12] (0cd2eb74..., 'revolt of the Latin cities') -> revolt of the Latin cities
[13] (0cd2eb74..., 'kings of Rome') -> kings of Rome
[14] (0cd2eb74..., 'last king of Rome') -> last king of Rome
[15] (0cd2eb74..., '509 BC') -> 509 BC



## Review Fuzzy Match Candidates

If fuzzy candidates were found, review them before applying merges

In [62]:
# Display fuzzy match candidates if any were found
if 'fuzzy_candidates' in normalized_data and normalized_data['fuzzy_candidates']:
    candidates = normalized_data['fuzzy_candidates']
    
    print("=" * 100)
    print(f"FUZZY MATCH CANDIDATES (found {len(candidates)} pairs)")
    print("=" * 100)
    print()
    
    for i, candidate in enumerate(candidates, 1):
        print(f"[{i}] Similarity: {candidate['similarity']:.3f}")
        print(f"    '{candidate['name1']}' ({candidate['type1']}/{candidate['subtype1']}) × {candidate['count1']}")
        print(f"    '{candidate['name2']}' ({candidate['type2']}/{candidate['subtype2']}) × {candidate['count2']}")
        print()
    
    print("=" * 100)
    print("\nTo apply merges, create a list of approved pairs:")
    print("approved_merges = [")
    print("    ('name1', 'name2'),")
    print("    ('name3', 'name4'),")
    print("]")
    print("\nThen re-run normalization with:")
    print("normalized_data = normalize_entities(all_results, fuzzy_merge_pairs=approved_merges)")
else:
    print("No fuzzy candidates found (or fuzzy merges already applied)")

FUZZY MATCH CANDIDATES (found 4 pairs)

[1] Similarity: 0.875
    'sixth century bc' (temporal/century) × 1
    'fifth century bc' (temporal/century) × 1

[2] Similarity: 0.800
    'kings of rome' (collective_entity/organization) × 1
    'last king of rome' (person/None) × 1

[3] Similarity: 0.759
    'roman standards' (cultural/None) × 1
    'roman calendar' (cultural/None) × 1

[4] Similarity: 0.759
    'julian calendar' (cultural/None) × 1
    'roman calendar' (cultural/None) × 1


To apply merges, create a list of approved pairs:
approved_merges = [
    ('name1', 'name2'),
    ('name3', 'name4'),
]

Then re-run normalization with:
normalized_data = normalize_entities(all_results, fuzzy_merge_pairs=approved_merges)
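
The definition of `levenshtein_similarity` is not shown in this part of the notebook, but the four scores above are consistent with an insert/delete-only ratio, `2 * LCS / (len(a) + len(b))`, where LCS is the length of the longest common subsequence. The sketch below is an assumption that reproduces those numbers in plain Python; `lcs_length` and `indel_similarity` are hypothetical names, not the notebook's own functions.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in: 2 * LCS / (len(a) + len(b)), a value in [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * lcs_length(a, b) / total


# Name pairs taken from the candidate list above
pairs = [
    ("sixth century bc", "fifth century bc"),
    ("kings of rome", "last king of rome"),
    ("roman standards", "roman calendar"),
    ("julian calendar", "roman calendar"),
]
for name1, name2 in pairs:
    print(f"{indel_similarity(name1, name2):.3f}  {name1!r} vs {name2!r}")
```

Under this definition the highest of the four scores is 0.875, so none of these pairs would reach a threshold of 0.90.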


## Apply Approved Fuzzy Merges

After reviewing candidates, apply selected merges
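
One way to speed up the review is to shortlist candidates before deciding by hand. The sketch below assumes the candidate dictionaries carry the `name1`, `name2`, `similarity`, `type1`, and `type2` keys built by `_find_fuzzy_match_candidates`; `shortlist_merge_pairs` and its parameters are hypothetical, not part of the notebook.

```python
def shortlist_merge_pairs(candidates, min_similarity=0.85, require_same_type=True):
    """Return (name1, name2) pairs worth a manual look; nothing is approved here."""
    pairs = []
    for c in candidates:
        if c["similarity"] < min_similarity:
            continue
        if require_same_type and c["type1"] != c["type2"]:
            continue
        pairs.append((c["name1"], c["name2"]))
    return pairs


# Candidates copied from the review output above
candidates = [
    {"name1": "sixth century bc", "name2": "fifth century bc",
     "similarity": 0.875, "type1": "temporal", "type2": "temporal"},
    {"name1": "kings of rome", "name2": "last king of rome",
     "similarity": 0.800, "type1": "collective_entity", "type2": "person"},
    {"name1": "roman standards", "name2": "roman calendar",
     "similarity": 0.759, "type1": "cultural", "type2": "cultural"},
    {"name1": "julian calendar", "name2": "roman calendar",
     "similarity": 0.759, "type1": "cultural", "type2": "cultural"},
]

print(shortlist_merge_pairs(candidates))
```

The only pair that survives the default filter is `('sixth century bc', 'fifth century bc')`, which names two different centuries and should not be merged. String similarity alone is not enough, which is why the merge list stays a manual decision.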

In [None]:
# Example: Apply fuzzy merges after review
# Uncomment and modify this to apply selected merges

# approved_merges = [
#     ('entity1', 'entity2'),  # Replace with actual entity names from candidates above
#     # Add more pairs as needed
# ]

# # Re-run normalization with approved merges
# normalized_data = normalize_entities(all_results, fuzzy_merge_pairs=approved_merges)

# # Display updated statistics
# print("\n" + "=" * 80)
# print("FINAL NORMALIZATION STATISTICS")
# print("=" * 80)
# print(f"Total entities extracted: {sum(len(r.entities) for r in all_results)}")
# print(f"Final normalized entities: {len(normalized_data['normalized_entities'])}")
# print(f"Total reduction: {sum(len(r.entities) for r in all_results) - len(normalized_data['normalized_entities'])} duplicates merged")

print("Ready to apply fuzzy merges - uncomment and modify the code above")

In [63]:
import networkx as nx

def build_knowledge_graph(
    normalized_entities: list[NormalizedEntity],
    normalized_relationships: list[NormalizedRelationship]
) -> nx.DiGraph:
    """
    Build a directed graph from normalized entities and relationships.
    
    Args:
        normalized_entities: List of NormalizedEntity objects
        normalized_relationships: List of NormalizedRelationship objects
    
    Returns:
        NetworkX DiGraph with entity nodes and relationship edges
    """
    G = nx.DiGraph()
    
    # Add nodes for each normalized entity
    for entity in normalized_entities:
        G.add_node(
            entity.id,
            name=entity.name,
            type=entity.type,
            subtype=entity.subtype,
            aliases=entity.aliases,
            description=entity.description,
            attributes=entity.attributes,
            occurrence_count=entity.occurrence_count,
            source_paragraphs=entity.source_paragraph_ids,
            relationship_count=len(entity.relationship_ids)
        )
    
    print(f"Added {G.number_of_nodes()} nodes to graph")
    
    # Add edges for relationships
    for rel in normalized_relationships:
        if rel.source_id in G and rel.target_id in G:
            # Check if edge already exists (from another relationship)
            if G.has_edge(rel.source_id, rel.target_id):
                # Aggregate relation types
                edge_data = G[rel.source_id][rel.target_id]
                if rel.relation_type not in edge_data.get('relation_types', []):
                    edge_data.setdefault('relation_types', []).append(rel.relation_type)
            else:
                # New edge
                G.add_edge(
                    rel.source_id,
                    rel.target_id,
                    relation_id=rel.id,
                    relation_type=rel.relation_type,
                    relation_types=[rel.relation_type],
                    temporal_context=rel.temporal_context,
                    confidence=rel.confidence,
                    source_paragraph=rel.paragraph_id
                )
    
    print(f"Added {G.number_of_edges()} edges to graph")
    
    return G

# Build the graph
kg = build_knowledge_graph(normalized_entities, normalized_relationships)

# Display graph statistics
print("\n" + "=" * 80)
print("KNOWLEDGE GRAPH STATISTICS")
print("=" * 80)
print(f"Nodes (entities): {kg.number_of_nodes()}")
print(f"Edges (relationships): {kg.number_of_edges()}")
if kg.number_of_nodes() > 0:
    print(f"Average degree: {sum(dict(kg.degree()).values()) / kg.number_of_nodes():.2f}")
    print(f"Density: {nx.density(kg):.3f}")
print()

# Top 5 most connected entities
if kg.number_of_nodes() > 0:
    degree_centrality = nx.degree_centrality(kg)
    top_entities = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
    
    print("Top 5 most connected entities:")
    for entity_id, centrality in top_entities:
        entity_data = kg.nodes[entity_id]
        print(f"  {entity_data['name']} ({entity_data['type']}): {kg.degree(entity_id)} connections")

print("=" * 80)

Added 59 nodes to graph
Added 42 edges to graph

KNOWLEDGE GRAPH STATISTICS
Nodes (entities): 59
Edges (relationships): 42
Average degree: 1.42
Density: 0.012

Top 5 most connected entities:
  Rome (place): 12 connections
  Octavian (person): 8 connections
  Julian calendar (cultural): 6 connections
  Parthia (collective_entity): 5 connections
  Senate (collective_entity): 4 connections
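
The reported statistics follow directly from the node and edge counts. As a worked check, assuming the standard definitions for a directed graph without parallel edges: density is `m / (n * (n - 1))`, and the mean of `G.degree()` (in-degree plus out-degree) is `2 * m / n`.

```python
n_nodes = 59   # nodes reported above
n_edges = 42   # edges reported above

# Directed-graph density: edges present / ordered node pairs possible
density = n_edges / (n_nodes * (n_nodes - 1))

# Each edge adds one out-degree and one in-degree, so total degree is 2 * edges
average_degree = 2 * n_edges / n_nodes

print(f"Average degree: {average_degree:.2f}")
print(f"Density: {density:.3f}")
```

Both values round to the figures printed above (1.42 and 0.012), so the graph is very sparse: roughly 1% of the possible directed edges are present.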


# Graph Construction

Build knowledge graph with NetworkX using normalized entities and relationships
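
Because an `nx.DiGraph` holds at most one edge per ordered (source, target) pair, `build_knowledge_graph` folds repeated relationships between the same two entities into a single edge with a `relation_types` list, which is why 44 relationships yield 42 edges. The sketch below mimics that aggregation with a plain dictionary; `aggregate_edges`, the entity IDs, and the relation types are made-up illustrations.

```python
def aggregate_edges(relationships):
    """Collapse (source, target, relation_type) triples into one entry per ordered pair."""
    edges = {}
    for source_id, target_id, relation_type in relationships:
        relation_types = edges.setdefault((source_id, target_id), [])
        if relation_type not in relation_types:
            relation_types.append(relation_type)
    return edges


# Hypothetical triples: two relationships share the ordered pair (octavian, rome)
relationships = [
    ("octavian", "rome", "returned-to"),
    ("octavian", "rome", "ruled"),
    ("rome", "octavian", "granted-title-to"),
]

edges = aggregate_edges(relationships)
print(len(relationships), "relationships ->", len(edges), "edges")
for (source_id, target_id), relation_types in edges.items():
    print(source_id, "->", target_id, relation_types)
```

Note that in `build_knowledge_graph` an aggregated edge keeps the `temporal_context`, `confidence`, and `relation_id` of the first relationship only. If every relationship needs to remain a distinct edge, `nx.MultiDiGraph` is the NetworkX class that permits parallel edges.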

In [64]:
import plotly.graph_objects as go

def visualize_knowledge_graph(G: nx.DiGraph, width: int = 1200, height: int = 800):
    """
    Create an interactive Plotly visualization of the knowledge graph.
    
    Args:
        G: NetworkX DiGraph
        width: Figure width in pixels
        height: Figure height in pixels
    """
    # Compute layout using spring layout
    pos = nx.spring_layout(G, k=0.5, iterations=50, seed=42)
    
    # Define colors for each entity type
    type_colors = {
        'person': '#FF6B6B',        # Red
        'place': '#4ECDC4',         # Teal
        'collective_entity': '#45B7D1',  # Blue
        'event': '#FFA07A',         # Light orange
        'temporal': '#98D8C8',      # Mint
        'cultural': '#C7CEEA',      # Lavender
    }
    
    # Create edge traces
    edge_traces = []
    for edge in G.edges():
        source_id, target_id = edge
        x0, y0 = pos[source_id]
        x1, y1 = pos[target_id]
        
        # Create edge line
        edge_trace = go.Scatter(
            x=[x0, x1, None],
            y=[y0, y1, None],
            mode='lines',
            line=dict(width=0.5, color='#888'),
            hoverinfo='none',
            showlegend=False
        )
        edge_traces.append(edge_trace)
    
    # Create node trace
    node_x = []
    node_y = []
    node_text = []
    node_colors = []
    node_sizes = []
    
    for node_id in G.nodes():
        x, y = pos[node_id]
        node_x.append(x)
        node_y.append(y)
        
        # Get node data
        node_data = G.nodes[node_id]
        name = node_data['name']
        entity_type = node_data['type']
        subtype = node_data.get('subtype', '')
        occurrence_count = node_data['occurrence_count']
        
        # Node color by type
        node_colors.append(type_colors.get(entity_type, '#CCCCCC'))
        
        # Node size by occurrence count (scaled)
        node_sizes.append(10 + (occurrence_count * 5))
        
        # Hover text
        hover_text = f"<b>{name}</b><br>"
        hover_text += f"Type: {entity_type}"
        if subtype:
            hover_text += f" ({subtype})"
        hover_text += f"<br>Occurrences: {occurrence_count}"
        hover_text += f"<br>Connections: {G.degree(node_id)}"
        
        if node_data.get('attributes'):
            hover_text += f"<br>Attributes: {node_data['attributes']}"
        
        node_text.append(hover_text)
    
    node_trace = go.Scatter(
        x=node_x,
        y=node_y,
        mode='markers+text',
        text=[G.nodes[node_id]['name'] for node_id in G.nodes()],
        textposition='top center',
        textfont=dict(size=8),
        hovertext=node_text,
        hoverinfo='text',
        marker=dict(
            size=node_sizes,
            color=node_colors,
            line=dict(width=1, color='white')
        ),
        showlegend=False
    )
    
    # Create legend traces (one per entity type)
    legend_traces = []
    for entity_type, color in type_colors.items():
        legend_trace = go.Scatter(
            x=[None],
            y=[None],
            mode='markers',
            marker=dict(size=10, color=color),
            name=entity_type,
            showlegend=True
        )
        legend_traces.append(legend_trace)
    
    # Create figure
    fig = go.Figure(
        data=edge_traces + [node_trace] + legend_traces,
        layout=go.Layout(
            title=dict(
                text="Knowledge Graph: Rome Chapter (Book 3, Chapter 4)",
                font=dict(size=20)
            ),
            showlegend=True,
            hovermode='closest',
            margin=dict(b=20, l=5, r=5, t=40),
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            width=width,
            height=height,
            plot_bgcolor='#F8F9FA'
        )
    )
    
    return fig

# Create and display the visualization
fig = visualize_knowledge_graph(kg, width=1400, height=900)
fig.show()

print("\nVisualization tips:")
print("- Hover over nodes to see entity details")
print("- Node size = occurrence count")
print("- Node color = entity type (see legend)")
print("- Drag to pan, scroll to zoom")
print("- Click legend items to show/hide entity types")


Visualization tips:
- Hover over nodes to see entity details
- Node size = occurrence count
- Node color = entity type (see legend)
- Drag to pan, scroll to zoom
- Click legend items to show/hide entity types


# Interactive Graph Visualization

Visualize the knowledge graph with Plotly for interactive exploration

In [None]:
from pyvis.network import Network

def visualize_with_pyvis(
    G: nx.DiGraph,
    normalized_entities: list[NormalizedEntity],
    normalized_relationships: list[NormalizedRelationship],
    output_file: str = "knowledge_graph.html",
    height: str = "900px",
    width: str = "100%"
):
    """
    Create interactive PyVis visualization.
    
    Args:
        G: NetworkX DiGraph
        normalized_entities: List of NormalizedEntity objects
        normalized_relationships: List of NormalizedRelationship objects
        output_file: HTML file to save
        height: Canvas height
        width: Canvas width
    """
    # Create PyVis network
    net = Network(
        height=height,
        width=width,
        directed=True,
        notebook=True,
        bgcolor="#F8F9FA",
        font_color="#333333"
    )
    
    # Configure physics for better layout
    net.set_options("""
    {
        "physics": {
            "enabled": true,
            "forceAtlas2Based": {
                "gravitationalConstant": -50,
                "centralGravity": 0.01,
                "springLength": 200,
                "springConstant": 0.08
            },
            "maxVelocity": 50,
            "solver": "forceAtlas2Based",
            "timestep": 0.35,
            "stabilization": {"iterations": 150}
        },
        "interaction": {
            "hover": true,
            "tooltipDelay": 100,
            "navigationButtons": true,
            "keyboard": true
        }
    }
    """)
    
    # Define colors for entity types
    type_colors = {
        'person': '#FF6B6B',
        'place': '#4ECDC4',
        'collective_entity': '#45B7D1',
        'event': '#FFA07A',
        'temporal': '#98D8C8',
        'cultural': '#C7CEEA',
    }
    
    # Create entity ID lookup
    entity_lookup = {e.id: e for e in normalized_entities}
    
    # Add nodes
    for entity in normalized_entities:
        color = type_colors.get(entity.type, '#CCCCCC')
        size = 15 + (entity.occurrence_count * 5)
        
        # Build title (hover text)
        title = f"<b>{entity.name}</b><br>"
        title += f"Type: {entity.type}"
        if entity.subtype:
            title += f" ({entity.subtype})"
        title += f"<br>Occurrences: {entity.occurrence_count}"
        title += f"<br>Relationships: {len(entity.relationship_ids)}"
        if entity.aliases:
            title += f"<br>Aliases: {', '.join(entity.aliases[:3])}"
        if entity.attributes:
            title += f"<br>Attributes: {entity.attributes}"
        
        net.add_node(
            entity.id,
            label=entity.name,
            title=title,
            color=color,
            size=size,
            font={'size': 14}
        )
    
    # Add edges with labels
    for rel in normalized_relationships:
        if rel.source_id in entity_lookup and rel.target_id in entity_lookup:
            # Create edge label and title
            label = rel.relation_type.replace('-', ' ').replace('_', ' ')
            title = f"{entity_lookup[rel.source_id].name} → {entity_lookup[rel.target_id].name}<br>"
            title += f"Relationship: {rel.relation_type}"
            if rel.temporal_context:
                title += f"<br>When: {rel.temporal_context}"
                label += f" ({rel.temporal_context})"
            
            net.add_edge(
                rel.source_id,
                rel.target_id,
                label=label,
                title=title,
                arrows='to',
                color={'color': '#888888', 'highlight': '#333333'},
                width=2,
                font={'size': 10, 'align': 'middle'}
            )
    
    # Save
    net.save_graph(output_file)
    print(f"✅ Interactive graph saved to: {output_file}")
    print("\nFeatures:")
    print("  • Drag nodes to rearrange")
    print("  • Click node to highlight connections")
    print("  • Hover over nodes/edges for details")
    print("  • Scroll to zoom, drag canvas to pan")
    print("  • Use navigation buttons (bottom right)")
    
    return net

# Create PyVis visualization
net = visualize_with_pyvis(kg, normalized_entities, normalized_relationships)
net.show("knowledge_graph.html")

# PyVis Visualization (Alternative)

PyVis creates an interactive HTML visualization with:
- **Drag nodes** to rearrange layout
- **Click node** to highlight its connections  
- **Hover** to see entity/relationship details
- **Edge labels** showing relationship types
- **Physics simulation** for organic layout