# SKGB - Semantic Knowledge Graph Builder (H100 GPU + Hugging Face Edition)

This notebook demonstrates the full **SKGB** pipeline using **Hugging Face** models on an **H100 GPU**:

**Document -> Docling Markdown -> Semantic Chunks -> itext2kg ATOM -> Knowledge Graph -> Export**

## Pipeline Flow
1. **Docling Adapter** - Converts PDF (and 17+ formats) to Markdown
2. **Chunking Adapter** - Splits Markdown by headers into semantic chunks (200-800 words)
3. **itext2kg Adapter** - Builds Knowledge Graph using ATOM with Hugging Face models
4. **Export** - Outputs JSON, CSV, GraphML, HTML visualization, and Neo4j Cypher script
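
To make step 2 more concrete, here is a minimal sketch of header-based chunking with word bounds. It is illustrative only and is not SKGB's `chunk_markdown_files`: the function name `chunk_by_headers`, the header regular expression, and the merge/split policy are all assumptions. Only the `section_title` and `content` keys follow the chunk layout used later in this notebook.

```python
import re
from typing import Dict, List

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headers(markdown: str, min_words: int = 200,
                     max_words: int = 800) -> List[Dict[str, str]]:
    """Hypothetical helper: split Markdown at headers, then enforce word bounds."""
    # 1. Split the text into (title, body) sections at every header line.
    sections = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            if body or title:
                sections.append((title, " ".join(body)))
            title, body = match.group(2).strip(), []
        elif line.strip():
            body.append(line.strip())
    if body or title:
        sections.append((title, " ".join(body)))

    chunks: List[Dict[str, str]] = []
    for title, text in sections:
        words = text.split()
        if not words:
            continue
        # 2. Merge a too-short section into the previous chunk when it fits.
        if (chunks and len(words) < min_words
                and len(chunks[-1]["content"].split()) + len(words) <= max_words):
            chunks[-1]["content"] += " " + text
            continue
        # 3. Split a too-long section into pieces of at most max_words words.
        for start in range(0, len(words), max_words):
            chunks.append({
                "section_title": title,
                "content": " ".join(words[start:start + max_words]),
            })
    return chunks

# Toy document: one normal section, one tiny section, one oversized section.
doc = ("# Intro\n" + "alpha " * 300 + "\n## Tiny\n" + "beta " * 10
       + "\n## Long\n" + "gamma " * 1000)
chunks = chunk_by_headers(doc)
print([(c["section_title"], len(c["content"].split())) for c in chunks])
```

In this sketch a short section is folded into its predecessor, so the very first chunk can still fall below `min_words` if nothing precedes it. A real implementation may handle that case differently.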

> **Runtime**: Go to *Runtime -> Change runtime type* and select **H100 GPU** for maximum performance.
> 
> **Models**: Uses `meta-llama/Llama-3.1-8B-Instruct` for LLM and `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
> 
> **Note**: You'll need a Hugging Face token with access to Llama models.

## 1. Hugging Face Authentication

You need a Hugging Face token to access gated models like Llama.
1. Get your token from https://huggingface.co/settings/tokens
2. Add it to Colab secrets or enter it below:

In [None]:
# Hugging Face Authentication
import os
from getpass import getpass

# Try to get token from Colab secrets first
try:
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
    print("Loaded HF_TOKEN from Colab secrets")
except Exception:  # Not running in Colab, or the HF_TOKEN secret is not set
    HF_TOKEN = None

# If not in secrets, prompt user
if not HF_TOKEN:
    print("Please enter your Hugging Face token:")
    print("   (Get one at https://huggingface.co/settings/tokens)")
    HF_TOKEN = getpass("HF_TOKEN: ")

os.environ['HF_TOKEN'] = HF_TOKEN

# Login to Hugging Face
!huggingface-cli login --token $HF_TOKEN

print("\nHugging Face authentication complete!")

## 2. Verify H100 GPU

Check that we have GPU access and it's an H100:

In [None]:
# Verify GPU
import torch
import subprocess

print("GPU Available:", torch.cuda.is_available())
print("GPU Count:", torch.cuda.device_count())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")

# Check with nvidia-smi
print("\n=== nvidia-smi output ===")
!nvidia-smi

## 3. Install Dependencies

Install SKGB with Hugging Face support.

**Note**: Google Colab ships with NumPy 2.0+, which causes binary incompatibility errors with many packages. We downgrade to NumPy <2.0 to avoid this. After running this cell, you **must restart the runtime** before proceeding.

In [None]:
# Clone the repository
!git clone https://github.com/edwinidrus/DynamicKGConstruction.git 2>/dev/null || echo "Already cloned"
%cd DynamicKGConstruction

# CRITICAL: Downgrade NumPy FIRST to avoid binary incompatibility with NumPy 2.0
# See: https://github.com/googlecolab/colabtools/issues/5238
print("Downgrading NumPy to <2.0 for compatibility...")
!pip install -q "numpy<2"

# Install base requirements (includes langchain pinned for itext2kg compatibility)
!pip install -q -r DynamicKGConstruction/requirements.txt

# Install Hugging Face specific dependencies
!pip install -q transformers accelerate sentence-transformers huggingface_hub
!pip install -q langchain-huggingface

# Install nest-asyncio for Jupyter/Colab event loop compatibility
!pip install -q nest-asyncio

print("\n" + "="*60)
print("Dependencies installed!")
print("="*60)
print("\nIMPORTANT: You MUST restart the runtime now!")
print("Go to: Runtime -> Restart session")
print("Then re-run cells starting from 'Hugging Face Authentication'")
print("="*60)

In [None]:
# Verify NumPy version after restart
# Run this cell AFTER restarting the runtime
import numpy as np
print(f"NumPy version: {np.__version__}")

if np.__version__.startswith("2"):
    print("\nWARNING: NumPy 2.x detected!")
    print("This may cause binary incompatibility errors.")
    print("Please re-run the install cell and restart the runtime.")
else:
    print("NumPy <2.0 confirmed - ready to proceed!")

## 4. Load Hugging Face Models

Load the LLM and embeddings models on H100 GPU:

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings

# Model configuration for H100
LLM_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # ~16GB VRAM
EMBEDDINGS_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # ~90MB

print(f"Loading LLM: {LLM_MODEL}")
print("This may take 2-3 minutes on first run...\n")

# Load LLM tokenizer and model
llm_tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL, token=os.environ['HF_TOKEN'])
llm_model = AutoModelForCausalLM.from_pretrained(
    LLM_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=os.environ['HF_TOKEN']
)

# Create text-generation pipeline
llm_pipeline = pipeline(
    "text-generation",
    model=llm_model,
    tokenizer=llm_tokenizer,
    max_new_tokens=512,
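    # Greedy decoding gives deterministic output, equivalent to the skgb default of
    # temperature 0. `temperature` itself is ignored (and triggers a warning) when sampling is off.
    do_sample=False,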
    return_full_text=False,
)

# Wrap for LangChain compatibility with itext2kg
llm = HuggingFacePipeline(pipeline=llm_pipeline)

print(f"\nLLM loaded: {LLM_MODEL}")
print(f"   Device: {llm_model.device}")

# Load Embeddings
print(f"\nLoading Embeddings: {EMBEDDINGS_MODEL}...")
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDINGS_MODEL,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}
)

print(f"\nEmbeddings loaded: {EMBEDDINGS_MODEL}")
print("\nAll models ready on H100 GPU!")
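
The `~16GB VRAM` comment in the cell above follows from simple arithmetic: weights stored in bfloat16 take 2 bytes per parameter. The sketch below reproduces that estimate, assuming a round 8 billion parameters and counting weights only (no activations, KV cache, or framework overhead). The helper name `estimate_weight_gb` is illustrative.

```python
def estimate_weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

bf16_gb = estimate_weight_gb(8e9, bytes_per_param=2)  # bfloat16, as loaded above
fp32_gb = estimate_weight_gb(8e9, bytes_per_param=4)  # float32, for comparison
print(f"bf16: {bf16_gb:.0f} GB, fp32: {fp32_gb:.0f} GB")
```

Actual usage during generation will be somewhat higher than the weights-only figure, so treat it as a lower bound when choosing a GPU.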

## 5. Apply itext2kg Patch

Patch to handle empty atomic KG lists gracefully:

In [None]:
# Apply itext2kg patch for IndexError
import functools

print("Applying itext2kg patch...")

try:
    import itext2kg.atom.atom as atom_module
    from itext2kg.atom.models.knowledge_graph import KnowledgeGraph
    
    _original_func = atom_module.Atom.parallel_atomic_merge
    
    @functools.wraps(_original_func)
    def _safe_parallel_atomic_merge(self, kgs, existing_kg=None, rel_threshold=0.7, ent_threshold=0.8, max_workers=8):
        if not kgs:
            print("Warning: No atomic KGs to merge. Returning empty KG.")
            return KnowledgeGraph()
        return _original_func(self, kgs, existing_kg, rel_threshold, ent_threshold, max_workers)
    
    atom_module.Atom.parallel_atomic_merge = _safe_parallel_atomic_merge
    print("Patch applied successfully")
    
except Exception as e:
    print(f"Patch not applied: {e}")
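
The guard-wrapper technique used in the patch above can be seen in isolation on a toy class. This sketch does not touch itext2kg. `Merger`, `merge`, and the `None` return value are made-up stand-ins chosen only to show the pattern: keep a reference to the original method, wrap it with an early return for the failing input, and assign the wrapper back onto the class.

```python
import functools

class Merger:
    """Toy stand-in for a class whose method fails on empty input."""
    def merge(self, items):
        return items[0]  # Raises IndexError when items is empty

# Keep a reference to the unpatched method so the wrapper can delegate to it.
_original_merge = Merger.merge

@functools.wraps(_original_merge)
def _safe_merge(self, items):
    # Guard the failing case; otherwise defer to the original method.
    if not items:
        return None
    return _original_merge(self, items)

# Monkey-patch at class level so every instance picks up the guard.
Merger.merge = _safe_merge

m = Merger()
empty_result = m.merge([])
first_result = m.merge(["a", "b"])
print(empty_result, first_result)
```

Because of `functools.wraps`, the patched method keeps the original name and docstring, which helps when reading tracebacks.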

## 6. Upload a PDF

Upload your own PDF or use the sample:

In [None]:
import os
from pathlib import Path

INPUT_DIR = Path("input_docs")
INPUT_DIR.mkdir(exist_ok=True)

# Option A: Upload from your computer
try:
    from google.colab import files
    print("Click below to upload a PDF:")
    uploaded = files.upload()
    for filename, data in uploaded.items():
        dest = INPUT_DIR / filename
        dest.write_bytes(data)
        print(f"Saved: {dest}")
except ImportError:
    print("Not in Colab - place PDF in input_docs/ manually")

In [None]:
# Option B: Download sample PDF (Attention Is All You Need)
SAMPLE_URL = "https://arxiv.org/pdf/1706.03762"
SAMPLE_PATH = INPUT_DIR / "attention_is_all_you_need.pdf"

if not SAMPLE_PATH.exists():
    !wget -q -O "{SAMPLE_PATH}" "{SAMPLE_URL}"
    print(f"Downloaded: {SAMPLE_PATH}")
else:
    print(f"Sample exists: {SAMPLE_PATH}")

pdfs = list(INPUT_DIR.glob("*.pdf"))
print(f"\nPDFs available: {[p.name for p in pdfs]}")

## 7. Configure and Run Pipeline

Run the SKGB pipeline with Hugging Face models:

In [None]:
# Custom Hugging Face adapter for SKGB
# Mirrors the pattern from skgb/adapters/itext2kg_adapter.py
import asyncio
import logging
from typing import Dict, List
from pathlib import Path

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

def _run(coro):
    """Run async coroutine, handling Jupyter/Colab event loop conflicts."""
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        # In Jupyter/Colab environments, there's already an event loop running
        if "asyncio.run() cannot be called" not in str(exc):
            raise
        import nest_asyncio
        nest_asyncio.apply()
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(coro)

async def _build_async_hf(
    atomic_facts_dict: Dict[str, List[str]],
    llm,
    embeddings,
    ent_threshold: float,
    rel_threshold: float,
    max_workers: int,
):
    """Build KG using itext2kg ATOM with Hugging Face models."""
    from itext2kg.atom import Atom
    from itext2kg.atom.models.knowledge_graph import KnowledgeGraph
    
    total_facts = sum(len(facts) for facts in atomic_facts_dict.values())
    logger.info(f"Building KG from {total_facts} atomic facts across {len(atomic_facts_dict)} timestamps")
    
    atom = Atom(llm_model=llm, embeddings_model=embeddings)
    
    try:
        kg = await atom.build_graph_from_different_obs_times(
            atomic_facts_with_obs_timestamps=atomic_facts_dict,
            ent_threshold=ent_threshold,
            rel_threshold=rel_threshold,
            max_workers=max_workers,
        )
    except IndexError as e:
        # Handle the itext2kg bug where parallel_atomic_merge returns current[0] on an empty list
        logger.warning(
            f"itext2kg IndexError: {e}. "
            "No valid knowledge graph entities could be extracted. Returning empty KG."
        )
        kg = KnowledgeGraph()
    except Exception as e:
        error_msg = str(e)
        if "list index out of range" in error_msg.lower():
            logger.warning(f"itext2kg failed with IndexError: {e}. Returning empty KG.")
            kg = KnowledgeGraph()
        else:
            logger.error(f"itext2kg failed with unexpected error: {e}")
            raise
    
    # Log results
    if kg and hasattr(kg, 'entities') and kg.entities:
        logger.info(f"Successfully built KG with {len(kg.entities)} entities and {len(kg.relationships)} relations")
    else:
        logger.warning("Knowledge graph is empty - no entities or relations extracted")
    
    return kg

def build_kg_hf(
    atomic_facts_dict: Dict[str, List[str]],
    llm,
    embeddings,
    ent_threshold: float = 0.8,
    rel_threshold: float = 0.7,
    max_workers: int = 4,
):
    """Build a KnowledgeGraph using itext2kg ATOM with Hugging Face models (async under the hood)."""
    try:
        return _run(
            _build_async_hf(
                atomic_facts_dict=atomic_facts_dict,
                llm=llm,
                embeddings=embeddings,
                ent_threshold=ent_threshold,
                rel_threshold=rel_threshold,
                max_workers=max_workers,
            )
        )
    except IndexError as e:
        # Catch IndexError that propagates through the event loop
        logger.warning(f"itext2kg IndexError (caught at sync level): {e}. Returning empty KG.")
        from itext2kg.atom.models.knowledge_graph import KnowledgeGraph
        return KnowledgeGraph()

print("Hugging Face adapter ready")
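
The `_run` helper above relies on `nest_asyncio` to re-enter the notebook's running event loop. As a point of comparison, the sketch below shows a different, standard-library-only approach: when a loop is already running, the coroutine is executed on a fresh loop in a worker thread. This is a swapped-in technique, not what the adapter does, and the names `run_coro_blocking` and `_demo` are hypothetical.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_coro_blocking(coro):
    """Run `coro` to completion from synchronous code and return its result."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread: the simple path works.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter/Colab): run the coroutine on a
    # fresh loop in a worker thread and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def _demo(x):
    await asyncio.sleep(0)
    return x * 2

# Called from plain synchronous code (no loop running).
result = run_coro_blocking(_demo(21))
print(result)

# Called while a loop is already running, as happens inside a notebook cell.
async def _outer():
    return run_coro_blocking(_demo(5))

nested_result = asyncio.run(_outer())
print(nested_result)
```

The thread-based variant blocks the calling loop until the coroutine finishes, and it only suits coroutines that do not depend on objects bound to the outer loop. Whether that holds for a given library is an assumption to verify before swapping approaches.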

In [None]:
# Run the pipeline
import sys
sys.path.insert(0, '/content/DynamicKGConstruction')  # Ensure module is importable

from DynamicKGConstruction.skgb.adapters.docling_adapter import docling_convert_to_markdown
from DynamicKGConstruction.skgb.adapters.chunking_adapter import chunk_markdown_files
from DynamicKGConstruction.skgb.export.file_export import export_kg_outputs, kg_to_dict
from DynamicKGConstruction.skgb.export.neo4j_export import write_neo4j_load_cypher
import datetime
import json

# Configuration
OUT_DIR = Path("skgb_output")
BUILD_DOCLING_DIR = OUT_DIR / "build_docling"
CHUNKS_DIR = OUT_DIR / "chunks_output"
KG_DIR = OUT_DIR / "kg_output"

for d in [OUT_DIR, BUILD_DOCLING_DIR, CHUNKS_DIR, KG_DIR]:
    d.mkdir(parents=True, exist_ok=True)

# Pipeline parameters (match skgb defaults)
ENT_THRESHOLD = 0.8
REL_THRESHOLD = 0.7
MAX_WORKERS = 4  # Higher values possible for H100

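# Pick the first PDF in alphabetical order; fail early with a clear message if none exists
pdfs = sorted(INPUT_DIR.glob("*.pdf"))
if not pdfs:
    raise FileNotFoundError(f"No PDF found in {INPUT_DIR}/ - run the upload or sample download cell first")
pdf_path = pdfs[0]
print(f"Processing: {pdf_path}\n")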

# Step 1: Docling - Convert PDF to Markdown
print("Step 1: Converting PDF to Markdown...")
md_paths = docling_convert_to_markdown(
    input_path=pdf_path,
    output_dir=BUILD_DOCLING_DIR,
    recursive=False,
    enable_ocr=False,
    enable_table_structure=True,
)
print(f"Created {len(md_paths)} markdown files\n")

# Step 2: Chunking - Split Markdown into semantic chunks
print("Step 2: Creating semantic chunks...")
chunks = chunk_markdown_files(
    md_paths=md_paths,
    min_chunk_words=200,
    max_chunk_words=800,
    overlap_words=0,
)

chunks_json = CHUNKS_DIR / "all_chunks.json"
chunks_json.write_text(json.dumps(chunks, indent=2), encoding="utf-8")
print(f"Created {len(chunks)} chunks\n")

# Step 3: Prepare atomic facts (format: "[Section Title] content...")
print("Step 3: Preparing atomic facts...")
t_obs = datetime.datetime.now().strftime("%Y-%m-%d")
atomic_facts_dict = {t_obs: []}
for ch in chunks:
    section = ch.get("section_title", "")
    content = ch.get("content", "")
    atomic_facts_dict[t_obs].append(f"[{section}] {content}".strip())

print(f"   Prepared {len(atomic_facts_dict[t_obs])} atomic facts for timestamp {t_obs}\n")

# Step 4: Build KG with Hugging Face models
print(f"Step 4: Building Knowledge Graph with {LLM_MODEL}...")
print("   (This may take several minutes...)\n")

kg = build_kg_hf(
    atomic_facts_dict=atomic_facts_dict,
    llm=llm,
    embeddings=embeddings,
    ent_threshold=ENT_THRESHOLD,
    rel_threshold=REL_THRESHOLD,
    max_workers=MAX_WORKERS,
)

# Get counts (use correct attributes)
entity_count = len(getattr(kg, 'entities', []) or [])
rel_count = len(getattr(kg, 'relationships', []) or [])
print(f"\nKG built: {entity_count} entities, {rel_count} relations\n")

# Step 5: Export results
print("Step 5: Exporting results...")
export_kg_outputs(
    kg=kg,
    kg_output_dir=KG_DIR,
    total_chunks=len(chunks),
    ent_threshold=ENT_THRESHOLD,
    rel_threshold=REL_THRESHOLD,
    llm_model=LLM_MODEL,
    embeddings_model=EMBEDDINGS_MODEL,
)

cypher_path = write_neo4j_load_cypher(KG_DIR)
print(f"Exported to {KG_DIR}\n")

print("=" * 60)
print("PIPELINE COMPLETE!")
print(f"   Markdown: {BUILD_DOCLING_DIR}")
print(f"   Chunks:   {chunks_json}")
print(f"   KG:       {KG_DIR}")
print(f"   Cypher:   {cypher_path}")
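
Step 3 in the cell above passes ATOM a dictionary that maps one observation timestamp to a list of strings, one per chunk, each prefixed with its section title in square brackets. The sketch below rebuilds that structure from made-up sample chunks so the shape is easy to inspect. The helper name `chunks_to_atomic_facts`, the sample text, the fixed date, and the choice to skip empty chunks are assumptions that are not part of the pipeline cell.

```python
from typing import Dict, List

def chunks_to_atomic_facts(chunks: List[dict], t_obs: str) -> Dict[str, List[str]]:
    """Hypothetical helper mirroring Step 3: one "[Section] content" string per chunk."""
    facts = []
    for ch in chunks:
        section = ch.get("section_title", "")
        content = ch.get("content", "")
        if content.strip():  # Assumption: drop chunks that contain no text
            facts.append(f"[{section}] {content}".strip())
    return {t_obs: facts}

# Made-up sample chunks using the same keys as the pipeline cell.
sample_chunks = [
    {"section_title": "Abstract", "content": "The Transformer relies on attention."},
    {"section_title": "Empty", "content": "   "},
]
facts = chunks_to_atomic_facts(sample_chunks, "2024-01-01")
print(facts)
```

Note that each "atomic fact" here is a whole chunk of up to several hundred words, so the LLM receives chunk-sized inputs rather than single-sentence facts.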

## 8. Explore Results

View the generated knowledge graph:

In [None]:
# List output files
print("Output files:")
for f in sorted(KG_DIR.rglob("*")):
    if f.is_file():
        size = f.stat().st_size
        print(f"  {f.name:40s} {size:>8,} bytes")

In [None]:
# View KG JSON
import json

kg_json_path = KG_DIR / "knowledge_graph.json"
if kg_json_path.exists():
    kg_data = json.loads(kg_json_path.read_text())
    nodes = kg_data.get("nodes", [])
    edges = kg_data.get("edges", [])

    print(f"Nodes: {len(nodes)}, Edges: {len(edges)}\n")

    print("=== Sample Nodes ===")
    for n in nodes[:10]:
        name = n.get('name', 'N/A')[:40]
        label = n.get('label', 'N/A')
        print(f"  {name:40s} ({label})")

    print("\n=== Sample Edges ===")
    for e in edges[:10]:
        src = e.get('source', 'N/A')[:25]
        rel = e.get('relation', 'N/A')[:15]
        tgt = e.get('target', 'N/A')[:25]
        print(f"  {src} --[{rel}]--> {tgt}")
else:
    print("knowledge_graph.json not found")
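
Beyond printing samples, the same `nodes`/`edges` layout read in the cell above lends itself to quick summary statistics. The sketch below counts how often each entity appears as an edge endpoint, using only the standard library. It assumes that `source` and `target` hold entity names as plain strings, as the slicing in the cell above suggests. The helper name `top_entities` and the toy graph are illustrative.

```python
from collections import Counter

def top_entities(kg_data: dict, k: int = 5):
    """Return the k names that appear most often as an edge endpoint."""
    degree = Counter()
    for edge in kg_data.get("edges", []):
        degree[edge.get("source", "N/A")] += 1
        degree[edge.get("target", "N/A")] += 1
    return degree.most_common(k)

# Toy graph in the same layout as knowledge_graph.json (values are made up).
toy_kg = {
    "nodes": [
        {"name": "A", "label": "Concept"},
        {"name": "B", "label": "Concept"},
        {"name": "C", "label": "Concept"},
    ],
    "edges": [
        {"source": "A", "relation": "uses", "target": "B"},
        {"source": "A", "relation": "uses", "target": "C"},
    ],
}
top = top_entities(toy_kg, k=1)
print(top)
```

With the real export, `top_entities(kg_data)` can be called directly on the dictionary loaded from `knowledge_graph.json`.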

In [None]:
# Display interactive visualization
from IPython.display import HTML, display

viz_path = KG_DIR / "kg_visualization.html"
if viz_path.exists():
    display(HTML(viz_path.read_text()))
else:
    print("Visualization not found")

## 9. Download Results

In [None]:
import shutil

# Zip the whole skgb_output directory
archive = shutil.make_archive("skgb_hf_results", "zip", ".", "skgb_output")
print(f"Created: {archive}")

try:
    from google.colab import files
    files.download(archive)
except Exception:  # Not running in Colab, or the browser download failed
    print(f"Download from: {archive}")