# Knowledge Graph Experiment v2

**Goal**: Extract entities and relationships from historical text, normalize them, and visualize the result as a knowledge graph.

**Approach**: Hybrid normalization pipeline
1. **LLM Extraction** - GPT structured outputs to extract entities + relationships per paragraph
2. **ID Assignment** - UUIDs with bidirectional entity↔relationship links
3. **Rule-Based Normalization** - Exact name + alias matching to collapse obvious duplicates
4. **Embedding Similarity** - Compute on rule-normalized entities (not raw) to find remaining merge candidates
5. **LLM Merge (incremental)** - Evaluate candidates in order of similarity, merge as we go to avoid redundant comparisons
6. **Graph Construction & Visualization** - NetworkX + PyVis

**Key insight**: Running LLM normalization on every pair of raw entities takes O(N²) LLM calls, which doesn't scale. The hybrid approach cuts this down at each stage:
- Rule-based pass collapses ~30-40% of entities (exact name + alias matches)
- Embedding similarity on the reduced set finds only the ambiguous candidates
- Incremental LLM merge skips pairs already in the same group
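
To make the scaling argument concrete, the arithmetic below counts unordered pairs before and after a rule-based pass. The entity count of 300 and the 35% collapse rate are illustrative assumptions (the rate sits in the middle of the range quoted above), not measurements from this notebook.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_raw = 300              # hypothetical number of raw extracted entities
collapse_percent = 35    # assumed share removed by the rule-based pass
n_reduced = n_raw * (100 - collapse_percent) // 100

print(pair_count(n_raw))      # 44850 pairs if every raw pair went to the LLM
print(pair_count(n_reduced))  # 18915 pairs left after the rule-based pass
```

The embedding threshold and the incremental merge then shrink the set that actually reaches the LLM further.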

In [1]:
import json
import os
import uuid as uuid_module
import warnings
from collections import defaultdict
from pathlib import Path

# Disable LangChain tracing and silence all DeprecationWarnings
# (including the Pydantic V2 ones raised via LangChain's tracer)
os.environ["LANGCHAIN_TRACING_V2"] = "false"
warnings.filterwarnings("ignore", category=DeprecationWarning)

import networkx as nx
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from pydantic import BaseModel, Field
from pyvis.network import Network
from sklearn.metrics.pairwise import cosine_similarity

CONFIG = {
    "extraction_model": "gpt-4.1-mini",
    "extraction_temperature": 0.0,
    "merge_model": "gpt-5-mini",
    "merge_temperature": 0.0,
    "embedding_model": "text-embedding-3-small",
    "similarity_threshold": 0.65,
    "max_llm_candidates": 100,
    "reasoning_effort": "minimal",  # for models that support it
}

In [2]:
# Load paragraph data

# Option A: 5-paragraph test set
# data_path = Path("selected_5_paragraphs.json")
# with open(data_path) as f:
#     paragraphs = json.load(f)

# Option B: Full chapter from database
import sys
sys.path.insert(0, str(Path.cwd().parent.parent / "src"))
from history_book.database.config.database_config import WeaviateConfig
from history_book.database.repositories.book_repository import BookRepositoryManager
config = WeaviateConfig.from_environment()
manager = BookRepositoryManager(config)
chapter_paragraphs = manager.paragraphs.find_by_chapter_index(book_index=3, chapter_index=4)
paragraphs = [
    {"id": str(p.id), "text": p.text, "page": p.page, "paragraph_index": p.paragraph_index,
     "book_index": p.book_index, "chapter_index": p.chapter_index}
    for p in chapter_paragraphs
]

# Sort by book order so adjacent paragraphs (which share entities) are processed together
paragraphs.sort(key=lambda p: (p.get("page", 0), p.get("paragraph_index", 0)))

print(f"Loaded {len(paragraphs)} paragraphs")
for i, p in enumerate(paragraphs):
    print(f"  [{i}] page {p['page']}, para {p['paragraph_index']}: {p['text'][:80]}...")

Loaded 85 paragraphs
  [0] page 317, para 0: All around the western Mediterranean shores and across wide tracts of western Eu...
  [1] page 318, para 1: Rome itself, the values it embodied and imposed, the notion of what was one day ...
  [2] page 318, para 2: It was believed to have deep roots. Romans said their city was founded by one Ro...
  [3] page 318, para 3: In spite of a rich archaeological record, with many inscriptions and much schola...
  [4] page 318, para 4: There were probably still at that time some aboriginal natives among them whose ...
  [5] page 319, para 5: In the sixth century BC the Etruscans were installed in an important bridgehead ...
  [6] page 321, para 6: Fertilization by Greek influence was perhaps its most important inheritance, but...
  [7] page 321, para 7: The Roman republic was to last for more than 450 years, and even after that its ...
  [8] page 321, para 8: Broadly speaking, the changes of republican times were symptoms and results of t...
  [9] p

## Data Models

8 models total:
- **Extraction**: `Entity`, `Relationship`, `ExtractionResult` — LLM output shapes
- **Post-extraction**: `EntityWithId`, `RelationshipWithId` — UUIDs + bidirectional links
- **Normalization**: `NormalizedEntity`, `NormalizedRelationship` — merged duplicates
- **LLM merge**: `EntityMergeDecision` — structured merge output

In [3]:
# --- Extraction models (LLM output) ---

class Entity(BaseModel):
    """Extracted entity from historical text."""
    name: str
    type: str  # person, polity, place, event
    aliases: list[str] = Field(default_factory=list)
    description: str | None = None


class Relationship(BaseModel):
    """Relationship between two entities."""
    source_entity: str  # Entity name
    relation_type: str  # ruled, conquered, fought, allied_with, succeeded, revolted_against, influenced, part_of, founded, evolved_into, participated_in
    target_entity: str  # Entity name
    temporal_context: str | None = None


class ExtractionResult(BaseModel):
    """Result of entity extraction from a paragraph."""
    entities: list[Entity]
    relationships: list[Relationship]
    paragraph_id: str


# --- Post-extraction models (with IDs) ---

class EntityWithId(BaseModel):
    """Entity with UUID assigned after extraction."""
    id: str
    name: str
    type: str
    aliases: list[str] = Field(default_factory=list)
    description: str | None = None
    paragraph_id: str
    relationship_ids: list[str] = Field(default_factory=list)


class RelationshipWithId(BaseModel):
    """Relationship with UUIDs for entities and original names preserved."""
    id: str
    source_id: str  # Entity UUID
    target_id: str  # Entity UUID
    source_entity_name: str  # Original entity name from extraction
    target_entity_name: str  # Original entity name from extraction
    relation_type: str
    temporal_context: str | None = None
    paragraph_id: str


# --- Normalized models (after merging duplicates) ---

class NormalizedEntity(BaseModel):
    """Normalized entity after merging duplicates."""
    id: str
    name: str
    type: str
    aliases: list[str] = Field(default_factory=list)
    description: str
    source_paragraph_ids: list[str]
    occurrence_count: int
    merged_from_ids: list[str] = Field(default_factory=list)
    relationship_ids: list[str] = Field(default_factory=list)


class NormalizedRelationship(BaseModel):
    """Normalized relationship with canonical entity IDs and original names."""
    id: str
    source_id: str  # Normalized entity ID
    target_id: str  # Normalized entity ID
    source_entity_name: str  # Original entity name from extraction
    target_entity_name: str  # Original entity name from extraction
    relation_type: str
    temporal_context: str | None = None
    paragraph_id: str


# --- LLM merge decision ---

class EntityMergeDecision(BaseModel):
    """LLM decision on whether two entities should be merged."""
    reasoning: str = Field(description="Brief explanation of the decision")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in the decision")
    should_merge: bool = Field(description="True if entities refer to the same historical entity")
    merged_entity: EntityWithId | None = Field(
        default=None,
        description="The merged entity if should_merge=True, otherwise None",
    )

## Extraction

Extract entities and relationships from each paragraph using structured outputs.
The prompt constrains extraction to what's explicitly stated in the text.

In [4]:
EXTRACTION_PROMPT = """You are analyzing text from "The Penguin History of the World".

Extract only the most historically significant entities and relationships from the provided paragraph. Focus on major states, leaders, regions, and pivotal events. Typically extract 3-6 entities per paragraph.

**ENTITY TYPES** (use exactly these):
- person: Major historical figures — rulers, generals, political leaders (e.g., Augustus, Caesar, Hannibal)
- polity: States, empires, peoples, organizations, political bodies (e.g., Rome, Etruscans, Senate, Roman Republic)
- place: Major cities, regions, bodies of water (e.g., Italy, Carthage, Mediterranean)
- event: Wars, revolts, reforms, conquests, pivotal moments (e.g., Punic Wars, revolt of the Latin cities)

**RELATIONSHIP TYPES** (use exactly one of these):
- ruled: A person or polity governed a place or polity
- conquered: Military takeover of a place or polity
- fought: Armed conflict without outright conquest
- allied_with: Formal alliance or cooperation
- succeeded: One leader/polity followed another in power
- revolted_against: Rebellion or uprising against authority
- influenced: Cultural, political, or intellectual impact
- part_of: Geographic or organizational membership (e.g., Sicily part_of Roman Republic)
- founded: Established or created
- evolved_into: Political transformation (e.g., Roman Republic evolved_into Roman Empire)
- participated_in: Connects actors to event entities (e.g., Rome participated_in Punic Wars)

**IMPORTANT GUIDELINES**:
1. Extract entities FROM THIS PARAGRAPH ONLY — do not use external knowledge
2. Be highly selective — only major historical actors, places, and events
3. Extract relationships that are EXPLICITLY STATED in the text
4. Include aliases if the entity is referred to by multiple names (e.g., "Octavian" also called "Augustus")
5. Do NOT extract dates or time periods as entities — instead, include them as temporal_context on relationships
6. Relationships MUST reference exact entity names from your entities list
7. Only extract entities that participate in at least one relationship

**DO NOT EXTRACT**:
- Unnamed individuals or groups ("an astronomer", "his great-uncle", "money-lenders")
- Abstract concepts ("Roman power", "political authority", "civil war" as a concept)
- Generic descriptions ("sea-going vessels", "land and water routes", "frontier provinces")
- Infrastructure or objects ("roads", "aqueducts", "temples")
- Cultural traditions or practices ("European tradition", "Greek mythology")
- Minor geographic features unless historically pivotal
- Entities mentioned only in passing or as comparisons

Extract entities and relationships from this paragraph:

{paragraph_text}
"""

In [5]:
def extract_entities(paragraph_text: str, paragraph_id: str, config: dict) -> ExtractionResult:
    """Extract entities and relationships from a paragraph using structured outputs."""
    llm = ChatOpenAI(
        model=config["extraction_model"],
        temperature=config["extraction_temperature"],
    )
    llm_with_structure = llm.with_structured_output(ExtractionResult)

    system_message = "You are an expert at extracting structured historical entities and relationships from text."
    user_message = EXTRACTION_PROMPT.format(paragraph_text=paragraph_text)

    result = llm_with_structure.invoke([
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ])

    # The model has no way of knowing the real ID, so overwrite whatever it filled in
    result.paragraph_id = paragraph_id
    return result

## Incremental Pipeline

Process paragraphs one at a time into a growing master graph:
1. **Extract** entities and relationships from the paragraph
2. **Assign IDs** and drop orphans
3. **Rule-based merge** into master (exact name + alias matching)
4. **Embedding similarity** — compare new entities vs master (not all pairwise)
5. **LLM merge** on candidates involving new entities
6. **Update master** embeddings and state

Key advantage: embedding similarity is `n_new × n_master` per paragraph, not `n_total × n_total`.

In [6]:
def assign_ids_single(result: ExtractionResult) -> tuple[list[EntityWithId], list[RelationshipWithId]]:
    """Assign UUIDs to entities and relationships from a single paragraph extraction.
    Drops orphaned entities (not referenced by any relationship)."""
    para_entities: dict[str, EntityWithId] = {}
    relationships: list[RelationshipWithId] = []
    skipped = 0

    for entity in result.entities:
        entity_id = str(uuid_module.uuid4())
        para_entities[entity.name] = EntityWithId(
            id=entity_id,
            name=entity.name,
            type=entity.type,
            aliases=entity.aliases,
            description=entity.description,
            paragraph_id=result.paragraph_id,
            relationship_ids=[],
        )

    for rel in result.relationships:
        source = para_entities.get(rel.source_entity)
        target = para_entities.get(rel.target_entity)
        if source and target:
            rel_id = str(uuid_module.uuid4())
            rel_with_id = RelationshipWithId(
                id=rel_id,
                source_id=source.id,
                target_id=target.id,
                source_entity_name=rel.source_entity,
                target_entity_name=rel.target_entity,
                relation_type=rel.relation_type,
                temporal_context=rel.temporal_context,
                paragraph_id=result.paragraph_id,
            )
            relationships.append(rel_with_id)
            source.relationship_ids.append(rel_id)
            target.relationship_ids.append(rel_id)
        else:
            skipped += 1

    all_entities = list(para_entities.values())
    orphaned = [e for e in all_entities if not e.relationship_ids]
    all_entities = [e for e in all_entities if e.relationship_ids]

    if skipped:
        print(f"    Skipped {skipped} rels (entity not found)")
    if orphaned:
        print(f"    Dropped {len(orphaned)} orphans: {', '.join(e.name for e in orphaned)}")

    return all_entities, relationships


def create_entity_text(entity: NormalizedEntity) -> str:
    """Create text representation of entity for embedding."""
    parts = [f"Name: {entity.name}", f"Type: {entity.type}"]
    if entity.description:
        parts.append(f"Description: {entity.description}")
    if entity.aliases:
        parts.append(f"Aliases: {', '.join(entity.aliases)}")
    return " | ".join(parts)
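
The orphan rule in `assign_ids_single` can be seen in isolation with plain data. This is a simplified sketch that uses name strings and tuples instead of the Pydantic models; `drop_orphans` is a hypothetical helper, not part of the notebook, and the sample relationships are made up.

```python
def drop_orphans(
    entity_names: list[str],
    relationships: list[tuple[str, str, str]],
) -> tuple[list[str], list[tuple[str, str, str]], list[str]]:
    """Keep relationships whose endpoints both exist, then keep only the
    entities referenced by at least one kept relationship."""
    known = set(entity_names)
    kept_rels = [r for r in relationships if r[0] in known and r[2] in known]
    referenced = {r[0] for r in kept_rels} | {r[2] for r in kept_rels}
    kept = [n for n in entity_names if n in referenced]
    orphans = [n for n in entity_names if n not in referenced]
    return kept, kept_rels, orphans


names = ["Rome", "Etruscans", "Roman civilization"]
rels = [
    ("Etruscans", "influenced", "Rome"),
    ("Rome", "conquered", "Carthage"),  # "Carthage" was not extracted, so this is skipped
]
kept, kept_rels, orphans = drop_orphans(names, rels)
print(kept)     # ['Rome', 'Etruscans']
print(orphans)  # ['Roman civilization']
```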

In [7]:
def merge_into_master_rule_based(
    new_entities: list[EntityWithId],
    new_relationships: list[RelationshipWithId],
    master_entities: list[NormalizedEntity],
    master_relationships: list[NormalizedRelationship],
) -> tuple[list[NormalizedEntity], list[NormalizedRelationship], list[NormalizedEntity]]:
    """
    Merge new paragraph entities into master graph using exact name + alias matching.

    NOTE: Known limitation — entity names are matched globally without context disambiguation.
    For per-chapter processing this is acceptable, but cross-chapter processing may incorrectly
    merge entities that share names across different historical contexts (e.g., "Senate").

    Returns:
        (updated_master_entities, updated_master_relationships, newly_added_entities)
    """
    # Build lookup: lowercase name/alias -> master entity
    name_to_master: dict[str, NormalizedEntity] = {}
    for me in master_entities:
        name_to_master[me.name.lower().strip()] = me
        for alias in me.aliases:
            alias_key = alias.lower().strip()
            if alias_key:
                name_to_master[alias_key] = me

    old_id_to_master_id: dict[str, str] = {}  # new entity ID -> master entity ID
    newly_added: list[NormalizedEntity] = []
    rule_merges = 0

    for entity in new_entities:
        key = entity.name.lower().strip()
        # Check name and aliases against master
        match = name_to_master.get(key)
        if not match:
            for alias in entity.aliases:
                alias_key = alias.lower().strip()
                match = name_to_master.get(alias_key)
                if match:
                    break

        if match:
            # Merge into existing master entity
            rule_merges += 1
            print(f"    Rule merge: '{entity.name}' -> '{match.name}'")
            match.aliases = list(set(match.aliases + entity.aliases + [entity.name]))
            # Remove self-referencing aliases
            match.aliases = [a for a in match.aliases if a.lower().strip() != match.name.lower().strip()]
            if entity.description:
                match.description = f"{match.description} | {entity.description}" if match.description else entity.description
            match.source_paragraph_ids = list(set(match.source_paragraph_ids + [entity.paragraph_id]))
            match.occurrence_count += 1
            match.merged_from_ids.append(entity.id)
            old_id_to_master_id[entity.id] = match.id
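            # Update lookup with the merged entity's own name and its aliases, so another
            # entity in this batch that uses one of them resolves to the same master entity
            for alias in [entity.name, *entity.aliases]:
                alias_key = alias.lower().strip()
                if alias_key:
                    name_to_master[alias_key] = match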
        else:
            # Create new master entity
            new_master = NormalizedEntity(
                id=str(uuid_module.uuid4()),
                name=entity.name,
                type=entity.type,
                aliases=entity.aliases,
                description=entity.description or "",
                source_paragraph_ids=[entity.paragraph_id],
                occurrence_count=1,
                merged_from_ids=[entity.id],
                relationship_ids=[],
            )
            master_entities.append(new_master)
            newly_added.append(new_master)
            old_id_to_master_id[entity.id] = new_master.id
            # Update lookup
            name_to_master[key] = new_master
            for alias in entity.aliases:
                alias_key = alias.lower().strip()
                if alias_key:
                    name_to_master[alias_key] = new_master

    # Remap relationships to master IDs
    master_entity_lookup = {e.id: e for e in master_entities}
    for rel in new_relationships:
        new_source = old_id_to_master_id.get(rel.source_id)
        new_target = old_id_to_master_id.get(rel.target_id)
        if new_source and new_target:
            norm_rel = NormalizedRelationship(
                id=rel.id,
                source_id=new_source,
                target_id=new_target,
                source_entity_name=rel.source_entity_name,
                target_entity_name=rel.target_entity_name,
                relation_type=rel.relation_type,
                temporal_context=rel.temporal_context,
                paragraph_id=rel.paragraph_id,
            )
            master_relationships.append(norm_rel)
            if new_source in master_entity_lookup:
                master_entity_lookup[new_source].relationship_ids.append(rel.id)
            if new_target in master_entity_lookup:
                master_entity_lookup[new_target].relationship_ids.append(rel.id)

    if rule_merges:
        print(f"    {rule_merges} rule-based merge(s)")

    return master_entities, master_relationships, newly_added

In [8]:
def find_candidates_against_master(
    new_entities: list[NormalizedEntity],
    new_embeddings: np.ndarray,
    master_entities: list[NormalizedEntity],
    master_embeddings: np.ndarray,
    threshold: float,
) -> list[dict]:
    """Find merge candidates: new entities vs existing master entities.
    Returns pairs sorted by descending similarity."""
    if len(new_embeddings) == 0 or len(master_embeddings) == 0:
        return []

    # (n_new, n_master) — NOT all pairwise
    sim_matrix = cosine_similarity(new_embeddings, master_embeddings)

    candidates = []
    for i in range(len(new_entities)):
        for j in range(len(master_entities)):
            sim = sim_matrix[i, j]
            if sim >= threshold:
                candidates.append({
                    "new_entity": new_entities[i],
                    "master_entity": master_entities[j],
                    "similarity": float(sim),
                })

    candidates.sort(key=lambda x: x["similarity"], reverse=True)
    return candidates
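
A toy run of the same thresholding idea, written in pure Python so it has no dependencies. The 2-D vectors stand in for real embeddings and are invented for illustration; `cosine` and `candidates_above` are hypothetical helpers, not the notebook's functions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidates_above(new: dict, master: dict, threshold: float) -> list[tuple[str, str, float]]:
    """Compare each new vector with each master vector (n_new x n_master)
    and return pairs at or above the threshold, most similar first."""
    pairs = [
        (new_name, master_name, cosine(new_vec, master_vec))
        for new_name, new_vec in new.items()
        for master_name, master_vec in master.items()
    ]
    pairs = [p for p in pairs if p[2] >= threshold]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


master_vecs = {"Rome": [1.0, 0.0], "Carthage": [0.0, 1.0]}
new_vecs = {"Roman Republic": [0.9, 0.1], "Hannibal": [0.3, 0.7]}

result = candidates_above(new_vecs, master_vecs, threshold=0.65)
for new_name, master_name, sim in result:
    print(f"{new_name} ~ {master_name}: {sim:.3f}")
# Roman Republic ~ Rome: 0.994
# Hannibal ~ Carthage: 0.919
```

In this invented data, "Hannibal" and "Carthage" clear the threshold even though they are different entities, which is the kind of pair the LLM step is there to reject.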

In [9]:
# LLM merge prompt: used for ambiguous candidates that pass the embedding similarity threshold

ENTITY_MERGE_PROMPT = """You are an expert historian analyzing entity mentions from "The Penguin History of the World".

Given two entities extracted from different paragraphs, determine if they refer to the SAME historical entity.
This is an entity normalization task as part of knowledge graph construction. The goal is to merge duplicate entities while maintaining distinct but related entities separately.

**Entity 1:**
Name: {entity1_name}
Type: {entity1_type}
Aliases: {entity1_aliases}
Description: {entity1_description}

**Entity 2:**
Name: {entity2_name}
Type: {entity2_type}
Aliases: {entity2_aliases}
Description: {entity2_description}

**Entity types**: person, polity, place, event

**Instructions:**
1. Determine if these refer to the SAME historical entity
    - Same here means strictly identical entities, not just similar or related.
    - Mergeable examples:
        - "Octavian" and "Augustus" (same person, different names)
        - "Roman Legions" and "Roman Army" (same organization)
        - "Roman Republic" and "Rome" (same political entity)
    - Non-mergeable examples:
        - Different people with same last name (e.g., "Julius Caesar" vs "Augustus Caesar")
        - Same place in different contexts (e.g., "Rome" the city vs "Rome" the empire)
        - Related political and geographical entities (e.g., "Roman Empire" vs "Italy")
        - Different entity types (e.g., "Punic Wars" event vs "Carthage" polity)
2. If they should be merged:
   - Choose the most canonical/common name
   - Write a consolidated description (combine key information, ~2-3 sentences)
   - Merge aliases (include both original names if not already aliases)
"""

In [10]:
def setup_merge_chain(config: dict):
    """Create LangChain chain for entity merge decisions."""
    llm = ChatOpenAI(
        model=config["merge_model"],
        temperature=config["merge_temperature"],
        reasoning_effort=config.get("reasoning_effort", None),
    )
    llm_with_structure = llm.with_structured_output(EntityMergeDecision)
    prompt = ChatPromptTemplate.from_template(ENTITY_MERGE_PROMPT)
    return prompt | llm_with_structure


def format_entity_for_prompt(entity) -> dict:
    """Format an entity (EntityWithId or NormalizedEntity) for the merge prompt."""
    aliases = entity.aliases if entity.aliases else []
    return {
        "name": entity.name,
        "type": entity.type,
        "aliases": ", ".join(aliases) if aliases else "None",
        "description": entity.description or "None",
    }


def decide_entity_merge(entity1, entity2, chain) -> EntityMergeDecision:
    """Run the merge decision chain on two entities."""
    e1 = format_entity_for_prompt(entity1)
    e2 = format_entity_for_prompt(entity2)
    inputs = {}
    for key, val in e1.items():
        inputs[f"entity1_{key}"] = val
    for key, val in e2.items():
        inputs[f"entity2_{key}"] = val
    return chain.invoke(inputs)

In [11]:
# Incremental pipeline: process paragraphs one at a time into a growing master graph

embeddings_model = OpenAIEmbeddings(model=CONFIG["embedding_model"])
merge_chain = setup_merge_chain(CONFIG)
threshold = CONFIG["similarity_threshold"]

master_entities: list[NormalizedEntity] = []
master_relationships: list[NormalizedRelationship] = []
master_embeddings: np.ndarray | None = None
master_entity_order: list[str] = []  # entity IDs matching master_embeddings rows

# Union-Find for LLM merges
uf_parent: dict[str, str] = {}
uf_representative: dict[str, NormalizedEntity] = {}


def uf_find(entity_id: str) -> str:
    """Find root of entity's group with path compression."""
    while uf_parent[entity_id] != entity_id:
        uf_parent[entity_id] = uf_parent[uf_parent[entity_id]]
        entity_id = uf_parent[entity_id]
    return entity_id


def uf_union(id1: str, id2: str, merged_entity: EntityWithId | None):
    """Union two groups. Update representative with merged entity info if provided."""
    root1, root2 = uf_find(id1), uf_find(id2)
    if root1 == root2:
        return
    uf_parent[root2] = root1
    rep, other = uf_representative[root1], uf_representative[root2]
    if merged_entity:
        updated = NormalizedEntity(
            id=rep.id, name=merged_entity.name, type=merged_entity.type,
            aliases=list(set(rep.aliases + other.aliases + (merged_entity.aliases or []))),
            description=merged_entity.description or rep.description,
            source_paragraph_ids=list(set(rep.source_paragraph_ids + other.source_paragraph_ids)),
            occurrence_count=rep.occurrence_count + other.occurrence_count,
            merged_from_ids=list(set(rep.merged_from_ids + other.merged_from_ids)),
            relationship_ids=list(set(rep.relationship_ids + other.relationship_ids)),
        )
    else:
        updated = NormalizedEntity(
            id=rep.id, name=rep.name, type=rep.type,
            aliases=list(set(rep.aliases + other.aliases)),
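            # Join only the non-empty descriptions; str.strip(" | ") strips a character
            # set (spaces and pipes), not the separator substring
            description=" | ".join(d for d in (rep.description, other.description) if d),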
            source_paragraph_ids=list(set(rep.source_paragraph_ids + other.source_paragraph_ids)),
            occurrence_count=rep.occurrence_count + other.occurrence_count,
            merged_from_ids=list(set(rep.merged_from_ids + other.merged_from_ids)),
            relationship_ids=list(set(rep.relationship_ids + other.relationship_ids)),
        )
    uf_representative[root1] = updated


all_llm_results = []
print(f"Candidate finding: cosine (threshold: {threshold})\n")

for i, para in enumerate(paragraphs):
    # 1. Extract
    result = extract_entities(para["text"], para["id"], CONFIG)

    # 2. Assign IDs + drop orphans
    entities, relationships = assign_ids_single(result)

    if not entities:
        print(f"[{i}] p{para['page']} para {para['paragraph_index']} | 0 entities after filtering")
        continue

    # 3. Rule-based merge into master
    n_before = len(master_entities)
    master_entities, master_relationships, newly_added = merge_into_master_rule_based(
        entities, relationships, master_entities, master_relationships,
    )
    n_rule = len(entities) - len(newly_added)

    # 4. Find candidates and run LLM merge
    n_candidates = 0
    n_llm_checked = 0
    n_llm_merged = 0

    if newly_added:
        # Initialize UF entries for new entities
        for ne in newly_added:
            uf_parent[ne.id] = ne.id
            uf_representative[ne.id] = ne

        # Embed new entities
        new_texts = [create_entity_text(e) for e in newly_added]
        new_embs = np.array(embeddings_model.embed_documents(new_texts))

        # Find candidates via cosine similarity against existing master
        candidates = []
        if master_embeddings is not None and len(master_entity_order) > 0:
            master_lookup = {e.id: e for e in master_entities}
            existing_entities = [master_lookup[eid] for eid in master_entity_order]
            candidates = find_candidates_against_master(
                newly_added, new_embs, existing_entities, master_embeddings, threshold,
            )

        n_candidates = len(candidates)

        # 5. LLM merge on candidates
        for c in candidates[: CONFIG["max_llm_candidates"]]:
            ne, me = c["new_entity"], c["master_entity"]

            root_new, root_master = uf_find(ne.id), uf_find(me.id)
            if root_new == root_master:
                continue

            rep_new = uf_representative[root_new]
            rep_master = uf_representative[root_master]
            decision = decide_entity_merge(rep_new, rep_master, merge_chain)
            n_llm_checked += 1

            all_llm_results.append({
                "paragraph_idx": i,
                "page": para["page"],
                "entity1_name": rep_new.name,
                "entity2_name": rep_master.name,
                "cosine_similarity": c["similarity"],
                "should_merge": decision.should_merge,
                "confidence": decision.confidence,
                "reasoning": decision.reasoning,
            })

            if decision.should_merge:
                uf_union(ne.id, me.id, decision.merged_entity)
                n_llm_merged += 1
                print(f"  ** LLM MERGE: {rep_new.name} + {rep_master.name} (cos:{c['similarity']:.3f})")

        # 6. Update master embeddings
        if master_embeddings is None:
            master_embeddings = new_embs
        else:
            master_embeddings = np.vstack([master_embeddings, new_embs])
        master_entity_order.extend(e.id for e in newly_added)

    print(
        f"[{i}] p{para['page']} para {para['paragraph_index']} | "
        f"+{len(result.entities)} ext, {len(entities)} kept | "
        f"rule: {n_rule} merged, {len(newly_added)} new | "
        f"llm: {n_candidates} cand, {n_llm_checked} checked, {n_llm_merged} merged | "
        f"master: {len(master_entities)}e {len(master_relationships)}r"
    )

print(f"\n{'='*60}")
print(f"DONE: {len(master_entities)} master entities, {len(master_relationships)} rels")
print(f"LLM calls: {len(all_llm_results)}, merges: {sum(1 for r in all_llm_results if r['should_merge'])}")

Candidate finding: cosine (threshold: 0.65)

    Skipped 5 rels (entity not found)
    Dropped 1 orphans: Roman civilization
[0] p317 para 0 | +3 ext, 2 kept | rule: 0 merged, 2 new | llm: 0 cand, 0 checked, 0 merged | master: 2e 1r
[1] p318 para 1 | 0 entities after filtering
    Rule merge: 'Rome' -> 'Rome'
    1 rule-based merge(s)
[2] p318 para 2 | +4 ext, 4 kept | rule: 1 merged, 3 new | llm: 0 cand, 0 checked, 0 merged | master: 5e 4r
    Rule merge: 'Etruscans' -> 'Etruscans'
    1 rule-based merge(s)
[3] p318 para 3 | +3 ext, 3 kept | rule: 1 merged, 2 new | llm: 0 cand, 0 checked, 0 merged | master: 7e 8r
    Rule merge: 'Etruscans' -> 'Etruscans'
    1 rule-based merge(s)
[4] p318 para 4 | +5 ext, 5 kept | rule: 1 merged, 4 new | llm: 2 cand, 2 checked, 0 merged | master: 11e 13r
    Rule merge: 'Etruscans' -> 'Etruscans'
    Rule merge: 'Rome' -> 'Rome'
    Rule merge: 'Etruria' -> 'Etruria'
    3 rule-based merge(s)
[5] p319 para 5 | +6 ext, 6 kept | rule: 3 merged, 3 new |
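
The Union-Find bookkeeping in the loop above is what lets the pipeline skip pairs that are already grouped. Below is a stripped-down sketch with string IDs only (no representatives); the `find` and `union` names and the sample IDs are illustrative.

```python
parent: dict[str, str] = {}


def find(x: str) -> str:
    """Return the root of x's group, halving the path while walking up."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(a: str, b: str) -> None:
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


for entity_id in ["octavian", "augustus", "caesar_augustus", "rome"]:
    parent[entity_id] = entity_id

union("octavian", "augustus")
union("augustus", "caesar_augustus")

# Already in one group, so an LLM call for this pair would be redundant
print(find("octavian") == find("caesar_augustus"))  # True
# Still separate, so this pair would go to the LLM if it were a candidate
print(find("octavian") == find("rome"))             # False
```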

In [12]:
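# Build final entities from Union-Find groups (apply LLM merges to master state)
# NOTE: uf_union stores a new NormalizedEntity as the group representative, so
# rule-based merges applied to a master entity after it was LLM-merged (aliases,
# occurrence_count, source_paragraph_ids) are not reflected in the final entity.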

groups: dict[str, list[str]] = defaultdict(list)
for eid in uf_parent:
    root = uf_find(eid)
    groups[root].append(eid)

final_entities: list[NormalizedEntity] = []
master_id_to_final_id: dict[str, str] = {}

for root_id, member_ids in groups.items():
    rep = uf_representative[root_id]
    final_id = str(uuid_module.uuid4())

    final_entity = NormalizedEntity(
        id=final_id, name=rep.name, type=rep.type,
        aliases=rep.aliases, description=rep.description,
        source_paragraph_ids=rep.source_paragraph_ids,
        occurrence_count=rep.occurrence_count,
        merged_from_ids=rep.merged_from_ids,
        relationship_ids=[],
    )
    final_entities.append(final_entity)
    for mid in member_ids:
        master_id_to_final_id[mid] = final_id

# Remap relationships from master IDs to final IDs
final_relationships: list[NormalizedRelationship] = []
final_entity_lookup = {e.id: e for e in final_entities}

for rel in master_relationships:
    final_source = master_id_to_final_id.get(rel.source_id)
    final_target = master_id_to_final_id.get(rel.target_id)
    if final_source and final_target:
        final_rel = NormalizedRelationship(
            id=rel.id, source_id=final_source, target_id=final_target,
            source_entity_name=rel.source_entity_name, target_entity_name=rel.target_entity_name,
            relation_type=rel.relation_type, temporal_context=rel.temporal_context,
            paragraph_id=rel.paragraph_id,
        )
        final_relationships.append(final_rel)
        if final_source in final_entity_lookup:
            final_entity_lookup[final_source].relationship_ids.append(rel.id)
        if final_target in final_entity_lookup:
            final_entity_lookup[final_target].relationship_ids.append(rel.id)

llm_merge_count = sum(1 for r in all_llm_results if r["should_merge"])
print(f"Final: {len(final_entities)} entities, {len(final_relationships)} relationships")
print(f"Pipeline: {len(paragraphs)} paragraphs -> {len(master_entities)} after rule-based -> {len(final_entities)} final ({llm_merge_count} LLM merges)")

Final: 174 entities, 389 relationships
Pipeline: 85 paragraphs -> 198 after rule-based -> 174 final (24 LLM merges)
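
The grouping step above depends on `uf_parent` and `uf_find`, which are defined earlier in the notebook. As a rough illustration of how Union-Find roots partition entity IDs into merge groups, here is a minimal standalone sketch. The names `parent`, `find`, and `union` and the IDs `e1`..`e4` are illustrative assumptions, not the notebook's own implementation.

```python
from collections import defaultdict

# Hypothetical stand-in for the notebook's uf_parent mapping.
parent: dict[str, str] = {}


def find(x: str) -> str:
    """Return the root of x, shortening the path as we walk up."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(a: str, b: str) -> None:
    """Merge the groups containing a and b; a's root becomes the root."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


# Register four made-up entity IDs, then record two merges.
for eid in ["e1", "e2", "e3", "e4"]:
    find(eid)
union("e1", "e2")
union("e2", "e3")

# Same grouping loop as the cell above: bucket every ID under its root.
groups = defaultdict(list)
for eid in parent:
    groups[find(eid)].append(eid)
groups = dict(groups)
print(groups)
```

Each root then maps to one final entity, and every member ID is redirected to that entity's new UUID, which is what `master_id_to_final_id` records.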


In [13]:
# # Export LLM normalization results for analysis
# df_llm = pd.DataFrame(all_llm_results)
# export_path = Path("llm_normalization_results.csv")
# df_llm.to_csv(export_path, index=False)
# print(f"Exported {len(df_llm)} LLM check results to {export_path}")
# print(f"  Merges: {df_llm['should_merge'].sum()}, Non-merges: {(~df_llm['should_merge']).sum()}")
# print(f"  Cosine range: {df_llm['cosine_similarity'].min():.3f} - {df_llm['cosine_similarity'].max():.3f}")

In [14]:
# Inspect final normalized entities
print("Final entities (sorted by occurrence):")
for e in sorted(final_entities, key=lambda x: x.occurrence_count, reverse=True):
    aliases_str = f" (aka {', '.join(e.aliases[:5])})" if e.aliases else ""
    merged_str = f" [merged from {len(e.merged_from_ids)}]" if len(e.merged_from_ids) > 1 else ""
    print(f"  {e.name}{aliases_str}: {e.occurrence_count} occ, {len(e.relationship_ids)} rels [{e.type}]{merged_str}")

# Show what LLM normalization caught that rule-based missed
llm_merges = [r for r in all_llm_results if r["should_merge"]]
if llm_merges:
    print(f"\nLLM caught {len(llm_merges)} additional merge(s) beyond rule-based:")
    for r in llm_merges:
        print(f"  {r['entity1_name']} + {r['entity2_name']}")

Final entities (sorted by occurrence):
  Octavian (aka princeps, Caesar, Augustus, Julius Caesar, Caesar Augustus): 17 occ, 38 rels [person] [merged from 17]
  Rome (aka People, Roman civilization, Roman, Roman culture, Ancient Rome): 14 occ, 227 rels [polity] [merged from 14]
  Roman Senate (aka Senate): 13 occ, 21 rels [polity] [merged from 13]
  Carthage (aka Carthaginians): 12 occ, 22 rels [polity] [merged from 12]
  Julius Caesar (aka Julius, Caesar): 8 occ, 30 rels [person] [merged from 8]
  Etruscans (aka Etruscan, Etruscans): 7 occ, 16 rels [polity] [merged from 7]
  Italy (aka Romanized Italy): 7 occ, 14 rels [place] [merged from 7]
  plebs (aka plebeians, Roman proletariat, Italian peasant): 6 occ, 11 rels [polity] [merged from 6]
  Gaul (aka northern France and Belgium): 5 occ, 12 rels [place] [merged from 5]
  Greek civilization (aka Greeks, Greek, Greek society, Greek inheritance, Greek background): 4 occ, 6 rels [polity] [merged from 4]
  Roman army (aka legions, Roman le

In [15]:
def build_knowledge_graph(
    entities: list[NormalizedEntity],
    relationships: list[NormalizedRelationship],
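) -> nx.DiGraph:
    """Build a directed graph from normalized entities and relationships.

    Relationships sharing the same (source, target) pair collapse into one edge:
    the edge keeps the first relationship's attributes, while `relation_types`
    and `original_name_pairs` accumulate across all of them.
    """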
    G = nx.DiGraph()

    for entity in entities:
        G.add_node(
            entity.id,
            name=entity.name,
            entity_type=entity.type,
            aliases=entity.aliases,
            description=entity.description,
            occurrence_count=entity.occurrence_count,
            merged_from_ids=entity.merged_from_ids,
            paragraph_ids=entity.source_paragraph_ids,
        )

    for rel in relationships:
        if rel.source_id in G and rel.target_id in G:
            if G.has_edge(rel.source_id, rel.target_id):
                edge_data = G[rel.source_id][rel.target_id]
                if rel.relation_type not in edge_data.get("relation_types", []):
                    edge_data.setdefault("relation_types", []).append(rel.relation_type)
                edge_data.setdefault("original_name_pairs", []).append(
                    (rel.source_entity_name, rel.target_entity_name)
                )
            else:
                G.add_edge(
                    rel.source_id,
                    rel.target_id,
                    relation_type=rel.relation_type,
                    relation_types=[rel.relation_type],
                    temporal_context=rel.temporal_context,
                    source_paragraph=rel.paragraph_id,
                    source_entity_name=rel.source_entity_name,
                    target_entity_name=rel.target_entity_name,
                    original_name_pairs=[(rel.source_entity_name, rel.target_entity_name)],
                )

    print(f"Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    return G


G = build_knowledge_graph(final_entities, final_relationships)

Graph: 174 nodes, 282 edges
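
The drop from 389 relationships to 282 edges comes from `nx.DiGraph` holding at most one edge per ordered (source, target) pair, so repeated pairs are folded into the existing edge. The plain-dictionary sketch below mimics that aggregation without NetworkX. The function name `aggregate_edges` and the sample triples are made up for illustration.

```python
def aggregate_edges(relationships):
    """Fold (source, target, relation_type) triples into one entry per ordered pair."""
    edges = {}
    for source, target, relation_type in relationships:
        types = edges.setdefault((source, target), [])
        if relation_type not in types:  # keep each relation type once per edge
            types.append(relation_type)
    return edges


# Made-up sample: three relationships in one direction, one in the other.
sample = [
    ("rome", "carthage", "fought"),
    ("rome", "carthage", "destroyed"),
    ("rome", "carthage", "fought"),
    ("carthage", "rome", "fought"),
]
edges = aggregate_edges(sample)
print(len(sample), "relationships ->", len(edges), "edges")
```

If one edge per relationship is needed instead, `nx.MultiDiGraph` keeps parallel edges, at the cost of a busier graph.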


In [16]:
# Graph stats
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
if G.number_of_nodes() > 0:
    degrees = dict(G.degree())
    print(f"Avg degree: {sum(degrees.values()) / len(degrees):.2f}")
    print(f"Density: {nx.density(G):.3f}")

    print("\nTop entities by degree:")
    for node_id, degree in sorted(degrees.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {G.nodes[node_id]['name']}: degree {degree}")

Nodes: 174
Edges: 282
Avg degree: 3.24
Density: 0.009

Top entities by degree:
  Rome: degree 125
  Octavian: degree 22
  Roman Senate: degree 14
  Carthage: degree 14
  Julius Caesar: degree 14
  Pompey: degree 12
  Etruscans: degree 11
  Italy: degree 9
  Roman army: degree 9
  plebs: degree 8
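
The summary numbers can be checked by hand. In a directed graph, `G.degree()` counts in-degree plus out-degree, so every edge contributes 2 to the degree total, and `nx.density` for a directed graph is m / (n(n - 1)). A quick arithmetic check using the node and edge counts printed above:

```python
nodes, edges = 174, 282

# Each directed edge adds one out-degree and one in-degree.
avg_degree = 2 * edges / nodes

# Directed density: edges present divided by ordered node pairs possible.
density = edges / (nodes * (nodes - 1))

print(f"Avg degree: {avg_degree:.2f}")
print(f"Density: {density:.3f}")
```

Both values match the cell output. The low density reflects a hub-and-spoke shape, with Rome alone touching 125 of the edge endpoints.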


In [17]:
TYPE_COLORS = {
    "person": "#FF6B6B",
    "polity": "#45B7D1",
    "place": "#4ECDC4",
    "event": "#FFA07A",
}


def visualize_with_pyvis(
    G: nx.DiGraph,
    entities: list[NormalizedEntity],
    relationships: list[NormalizedRelationship],
    output_file: str = "knowledge_graph_v2.html",
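) -> Network:
    """Create an interactive PyVis visualization and save it to `output_file`.

    Draws one edge per relationship, labelled with the original (pre-normalization)
    entity names. The `G` argument is currently unused; nodes and edges come from
    `entities` and `relationships`.
    """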
    net = Network(
        height="900px", width="100%", directed=True, notebook=True,
        bgcolor="#F8F9FA", font_color="#333333",
    )
    net.set_options("""
    {
        "physics": {
            "enabled": true,
            "forceAtlas2Based": {
                "gravitationalConstant": -50,
                "centralGravity": 0.01,
                "springLength": 200,
                "springConstant": 0.08
            },
            "maxVelocity": 50,
            "solver": "forceAtlas2Based",
            "timestep": 0.35,
            "stabilization": {"iterations": 150}
        },
        "interaction": {
            "hover": true, "tooltipDelay": 100,
            "navigationButtons": true, "keyboard": true
        }
    }
    """)

    entity_lookup = {e.id: e for e in entities}

    # Add nodes
    for entity in entities:
        color = TYPE_COLORS.get(entity.type, "#CCCCCC")
        size = 15 + (entity.occurrence_count * 5)

        title = f"<b>{entity.name}</b><br>"
        title += f"Type: {entity.type}"
        title += f"<br>Occurrences: {entity.occurrence_count}"
        title += f"<br>Relationships: {len(entity.relationship_ids)}"
        if entity.aliases:
            title += f"<br>Aliases: {', '.join(entity.aliases[:5])}"
        if entity.description:
            desc = entity.description[:150] + "..." if len(entity.description) > 150 else entity.description
            title += f"<br><br>{desc}"
        if len(entity.merged_from_ids) > 1:
            title += f"<br><br>Merged from {len(entity.merged_from_ids)} entities"

        net.add_node(
            entity.id, label=entity.name, title=title,
            color=color, size=size, font={"size": 14},
            borderWidth=2,
        )

    # Add edges with original entity names
    for rel in relationships:
        if rel.source_id in entity_lookup and rel.target_id in entity_lookup:
            label = f"{rel.source_entity_name} \u2192 {rel.target_entity_name}"
            label += f"\n{rel.relation_type.replace('-', ' ').replace('_', ' ')}"

            title = f"{rel.source_entity_name} \u2192 {rel.target_entity_name}<br>"
            title += f"Relationship: {rel.relation_type}<br>"
            title += f"Normalized: {entity_lookup[rel.source_id].name} \u2192 {entity_lookup[rel.target_id].name}"
            if rel.temporal_context:
                title += f"<br>When: {rel.temporal_context}"

            net.add_edge(
                rel.source_id, rel.target_id,
                label=label, title=title, arrows="to",
                color={"color": "#888888", "highlight": "#333333"},
                width=2, font={"size": 10, "align": "middle"},
            )

    net.save_graph(output_file)
    print(f"Saved visualization to {output_file}")
    return net

In [18]:
outfile = "knowledge_graph_v3_rome_2.html"
net = visualize_with_pyvis(G, final_entities, final_relationships, output_file=outfile)

Saved visualization to knowledge_graph_v3_rome_2.html
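
Node size in `visualize_with_pyvis` grows linearly with occurrence count (`15 + occurrence_count * 5`). To give a feel for the resulting range, the sketch below applies the same rule to three occurrence counts taken from the entity listing above. The helper name `node_size` and its `base` and `step` parameters are illustrative, not part of the notebook.

```python
def node_size(occurrence_count: int, base: int = 15, step: int = 5) -> int:
    """Same linear rule as the notebook: base + occurrence_count * step."""
    return base + occurrence_count * step


# Occurrence counts as printed in the final-entity listing.
sizes = {
    name: node_size(count)
    for name, count in [("Octavian", 17), ("Rome", 14), ("Gaul", 5)]
}
print(sizes)
```

With this rule the most frequent entity is drawn at 100 against a minimum of 20 for a single-occurrence entity, so a few hubs dominate the layout visually.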
