# SKGB - Semantic Knowledge Graph Builder (VS Code + Google Colab)

This notebook demonstrates the full **DynamicKGConstruction** pipeline using VS Code connected to a Google Colab runtime.

**PDF -> Docling Markdown -> Semantic Chunks -> itext2kg Knowledge Graph -> Visualization**

## Prerequisites

### Option 1: Google Colab VS Code Extension (Recommended)
1. Install the **Google Colab** extension in VS Code
2. Start a Colab runtime by opening a notebook at colab.research.google.com and connecting it
3. In VS Code: `Cmd/Ctrl + Shift + P` → "Colab: Connect to Runtime"
4. Select your running Colab instance

### Option 2: Manual Remote Jupyter
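1. Start a Jupyter server in Colab: `!jupyter notebook --NotebookApp.allow_origin='*' --port=8888`
2. Copy the connection token from the server output
3. In VS Code: Add Jupyter server → connect to `http://localhost:8888/?token=xxx` (`localhost` only works if port 8888 is forwarded from the Colab VM to your machine)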

### Runtime Settings
- Go to **Runtime → Change runtime type** → Select **T4 GPU** (recommended for faster LLM inference)
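
To confirm that the runtime actually exposes a GPU after changing the runtime type, a check along these lines may help. It is a minimal sketch that shells out to `nvidia-smi`; the helper name `list_gpus` is hypothetical and not part of SKGB.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible NVIDIA GPU, or [] if none."""
    # nvidia-smi ships with the NVIDIA driver; without it no GPU is usable here.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU visible - check the runtime type")
```

An empty list on a Colab runtime usually means the notebook is still attached to a CPU-only machine.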

## 1. Environment Detection

Detect whether the notebook is running in Colab, locally, or in VS Code through the Colab extension.

In [None]:
# Environment detection
import os
import sys

def detect_environment():
    """Detect the current environment (all checks are best-effort heuristics)."""
    env = {
        "in_colab": False,
        "in_vscode": False,
        "colab_extension": False
    }

    # Check for Colab: the google.colab package is normally only importable
    # on a Colab runtime, so a successful import is the signal.
    try:
        import google.colab  # noqa: F401
        env["in_colab"] = True
    except ImportError:
        pass

    # Check if running in VS Code: it typically sets VSCODE_* variables for
    # local kernels. A remote Colab kernel will not see them.
    env["in_vscode"] = (
        any(key.startswith("VSCODE_") for key in os.environ)
        or "VSCODE" in sys.prefix
    )

    # Check for Colab extension indicators (heuristic marker path)
    env["colab_extension"] = os.path.exists("/usr/local/colab_user")

    return env

env = detect_environment()
print(f"Environment detected:")
print(f"  In Google Colab: {env['in_colab']}")
print(f"  In VS Code:     {env['in_vscode']}")
print(f"  Colab Extension: {env['colab_extension']}")

## 2. Install Ollama

Ollama runs inside the Colab runtime. You need to install it there and pull the models you plan to use.

In [None]:
# Install Ollama (run this once per Colab session)
# !sudo apt-get install zstd  # if needed
# !curl -fsSL https://ollama.com/install.sh | sh

# Or use the Ollama that may already be available in your Colab environment
# Check if ollama is available
!which ollama || echo "Ollama not found - may need installation"

In [None]:
import subprocess
import time

# Start Ollama server in background
ollama_proc = subprocess.Popen(
    ["ollama", "serve"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
time.sleep(3)
print(f"Ollama server started (PID {ollama_proc.pid})")
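
A fixed `time.sleep(3)` can be too short on a cold runtime. As a hedged alternative, the sketch below polls the server's root URL until it responds. The helper name `wait_for_server` and its defaults are assumptions for illustration; the URL is the default Ollama address used later in this notebook.

```python
import time
import urllib.request


def wait_for_server(url="http://localhost:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, HTTP error, or timeout: not ready yet.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` right after starting `ollama serve` returns `True` once the server accepts requests, and `False` if it never comes up within the timeout.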

In [None]:
# Pull required models
# Adjust model sizes based on your Colab GPU VRAM

LLM_MODEL = "qwen2.5"  # default tag is the 7B model, ~4.7 GB (use qwen2.5:3b for less VRAM)
EMBEDDINGS_MODEL = "nomic-embed-text"  # ~274 MB

# Uncomment to pull models (only needed once)
# !ollama pull {LLM_MODEL}
# !ollama pull {EMBEDDINGS_MODEL}

# Verify models are available
# !ollama list
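
Instead of reading the `ollama list` output by eye, the required models can be compared against the JSON returned by Ollama's `GET /api/tags` endpoint. This is a minimal sketch: the helper `missing_models` is hypothetical, and the payload below is an illustrative stand-in rather than real output.

```python
def missing_models(required, tags_payload):
    """Return the names in `required` that are absent from an /api/tags payload."""
    def normalize(name):
        # A bare model name refers to its ":latest" tag.
        return name if ":" in name else f"{name}:latest"

    installed = {normalize(model["name"]) for model in tags_payload.get("models", [])}
    return [name for name in required if normalize(name) not in installed]


# Illustrative payload only; fetch the real one from the running server.
sample_payload = {"models": [{"name": "nomic-embed-text:latest"}]}
print(missing_models(["qwen2.5", "nomic-embed-text"], sample_payload))
```

With a running server, the payload can be loaded with `json.load(urllib.request.urlopen("http://localhost:11434/api/tags"))` and passed in together with `[LLM_MODEL, EMBEDDINGS_MODEL]`.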

## 3. Install DynamicKGConstruction

In [None]:
# Clone the repository (if not already present)
!git clone https://github.com/edwinidrus/DynamicKGConstruction.git 2>/dev/null || echo "Already cloned"

# Change to project directory
%cd DynamicKGConstruction

In [None]:
# Install dependencies
!pip install -q -r requirements.txt

In [None]:
# Verify SKGB imports correctly
from DynamicKGConstruction.skgb import SKGBConfig, run_pipeline
print("SKGB imported successfully")

## 4. Upload a PDF

Upload your PDF to the Colab runtime. The file will be available for processing.

In [None]:
# Configure logging for debugging
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s:%(name)s:%(message)s")
logging.getLogger("DynamicKGConstruction.skgb.adapters.itext2kg_adapter").setLevel(logging.DEBUG)
print("Logging configured - adapter debug messages will appear below.")

In [None]:
import os
from pathlib import Path

INPUT_DIR = Path("input_docs")
INPUT_DIR.mkdir(exist_ok=True)

# Check if running in Colab (including VS Code with Colab extension)
try:
    from google.colab import files
    print("Running in Colab - use the file upload button below:")
    uploaded = files.upload()
    for filename, data in uploaded.items():
        dest = INPUT_DIR / filename
        dest.write_bytes(data)
        print(f"Saved: {dest}")
except ImportError:
    print("Not in Colab - please place your PDF in 'input_docs/' manually")
    print(f"Current directory: {os.getcwd()}")

In [None]:
# Option: Download a sample PDF
SAMPLE_URL = "https://arxiv.org/pdf/1706.03762"  # Attention Is All You Need
SAMPLE_PATH = INPUT_DIR / "attention_is_all_you_need.pdf"

if not SAMPLE_PATH.exists():
    !wget -q -O "{SAMPLE_PATH}" "{SAMPLE_URL}"
    print(f"Downloaded sample PDF to {SAMPLE_PATH}")
else:
    print(f"Sample PDF already exists at {SAMPLE_PATH}")

# List all PDFs
pdfs = list(INPUT_DIR.glob("*.pdf"))
print(f"\nPDFs in {INPUT_DIR}/: {[p.name for p in pdfs]}")
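
A failed `wget -O` can leave an empty or HTML file behind, which the existence check above would then accept on the next run. A quick sanity check, sketched here with a hypothetical helper `looks_like_pdf`, is to test for the `%PDF-` signature that well-formed PDF files start with.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF signature '%PDF-'."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"
```

For example, `[p.name for p in pdfs if not looks_like_pdf(p)]` lists files in `input_docs/` that should be deleted and downloaded again.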

## 5. Configure and Run the SKGB Pipeline

In [None]:
from pathlib import Path
from DynamicKGConstruction.skgb import SKGBConfig, run_pipeline

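# Select input PDF (first match in sorted order, so the choice is deterministic)
pdf_files = sorted(Path("input_docs").glob("*.pdf"))
if not pdf_files:
    raise FileNotFoundError("No PDF found in input_docs/ - upload or download one first")
pdf_path = pdf_files[0]
print(f"Input PDF: {pdf_path}")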

# Create pipeline configuration
# Note: Ollama runs in Colab, so use localhost:11434
cfg = SKGBConfig.from_out_dir(
    "skgb_output",
    llm_model="qwen2.5",  # Default tag is the 7B model; switch to qwen2.5:3b if VRAM is tight
    embeddings_model="nomic-embed-text",
    ollama_base_url="http://localhost:11434",
    temperature=0.0,
    ent_threshold=0.8,
    rel_threshold=0.7,
    max_workers=2,        # Keep low for Colab
    min_chunk_words=200,
    max_chunk_words=800,
    overlap_words=0,
)

print(f"\nPipeline config:")
print(f"  LLM model:        {cfg.llm_model}")
print(f"  Embeddings model: {cfg.embeddings_model}")
print(f"  Ollama URL:       {cfg.ollama_base_url}")
print(f"  Output dir:       {cfg.out_dir}")
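
To build intuition for `ent_threshold` and `rel_threshold`, the sketch below shows how a similarity threshold is commonly applied when deciding whether two embedded items are the same. It assumes cosine similarity over embedding vectors; this is an illustration of the general idea, not the actual SKGB or itext2kg code, and `should_merge` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_merge(vec_a, vec_b, threshold=0.8):
    """Hypothetical rule: treat two items as duplicates at or above the threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold


# Toy 2-D vectors standing in for embeddings (illustrative values only).
print(should_merge([1.0, 0.0], [0.9, 0.1]))  # similarity is about 0.994
print(should_merge([1.0, 0.0], [0.5, 0.5]))  # similarity is about 0.707
```

Under this reading, raising a threshold keeps more near-duplicates as separate nodes or relations, while lowering it merges more aggressively.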

In [None]:
# Run the full pipeline
# This may take several minutes depending on PDF size and model

result = run_pipeline(pdf_path, cfg)

print("\n" + "=" * 60)
print("Pipeline completed!")
print(f"  Markdown dir:  {result.build_docling_dir}")
print(f"  Chunks JSON:   {result.chunks_json_path}")
print(f"  KG output dir: {result.kg_output_dir}")
print(f"  Neo4j Cypher:  {result.neo4j_cypher_path}")

## 6. Explore the Results

In [None]:
# List all output files
print("Output files:")
for f in sorted(result.kg_output_dir.rglob("*")):
    if f.is_file():
        size = f.stat().st_size
        print(f"  {f.name:40s} {size:>8,} bytes")

### 6.1 Construction Report

In [None]:
report_path = result.kg_output_dir / "construction_report.txt"
print(report_path.read_text())

### 6.2 Knowledge Graph JSON

In [None]:
import json

kg_json_path = result.kg_output_dir / "knowledge_graph.json"
kg_data = json.loads(kg_json_path.read_text())

nodes = kg_data.get("nodes", [])
edges = kg_data.get("edges", [])

print(f"Total nodes: {len(nodes)}")
print(f"Total edges: {len(edges)}")
print(f"\n--- First 10 Nodes ---")
for n in nodes[:10]:
    print(f"  {n['name']:40s}  label={n.get('label', '')}")

print(f"\n--- First 10 Edges ---")
for e in edges[:10]:
    print(f"  {e['source'][:25]:25s} --[{e['relation'][:20]}]--> {e['target'][:25]}")
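
Beyond the first ten edges, a frequency count of relation labels gives a quick feel for what the graph contains. The sketch below assumes the same `edges` / `relation` keys read above; `relation_counts` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import Counter


def relation_counts(kg_data, top_n=5):
    """Return the `top_n` most common edge relation labels as (label, count) pairs."""
    edges = kg_data.get("edges", [])
    counts = Counter(edge.get("relation", "<missing>") for edge in edges)
    return counts.most_common(top_n)


# Illustrative data in the shape used above; not real pipeline output.
sample_kg = {
    "edges": [
        {"source": "Transformer", "relation": "uses", "target": "Self-Attention"},
        {"source": "Transformer", "relation": "uses", "target": "Positional Encoding"},
        {"source": "Encoder", "relation": "part_of", "target": "Transformer"},
    ]
}
print(relation_counts(sample_kg))
```

Calling `relation_counts(kg_data)` on the loaded JSON works the same way.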

### 6.3 Nodes & Edges as DataFrames

In [None]:
import pandas as pd

df_nodes = pd.read_csv(result.kg_output_dir / "kg_nodes.csv")
df_edges = pd.read_csv(result.kg_output_dir / "kg_edges.csv")

print(f"Nodes shape: {df_nodes.shape}")
display(df_nodes.head(10))

print(f"\nEdges shape: {df_edges.shape}")
display(df_edges.head(10))

### 6.4 Interactive Knowledge Graph Visualization

In [None]:
# Display the PyVis interactive graph
from IPython.display import HTML, display

viz_path = result.kg_output_dir / "kg_visualization.html"
if viz_path.exists():
    display(HTML(viz_path.read_text()))
else:
    print("Visualization file not found.")

### 6.5 NetworkX Graph Stats

In [None]:
import networkx as nx

G = nx.read_graphml(str(result.kg_output_dir / "knowledge_graph.graphml"))

print(f"Graph type:       {type(G).__name__}")
print(f"Number of nodes:  {G.number_of_nodes()}")
print(f"Number of edges:  {G.number_of_edges()}")
print(f"Density:          {nx.density(G):.4f}")

if G.number_of_nodes() > 0:
    degree_sorted = sorted(G.degree(), key=lambda x: x[1], reverse=True)
    print(f"\nTop 10 nodes by degree:")
    for name, deg in degree_sorted[:10]:
        print(f"  {name:40s}  degree={deg}")

### 6.6 Semantic Chunks Preview

In [None]:
chunks = json.loads(result.chunks_json_path.read_text())
print(f"Total chunks: {len(chunks)}\n")

for i, ch in enumerate(chunks[:3]):
    print(f"--- Chunk {i} ---")
    print(f"  ID:      {ch.get('chunk_id', 'N/A')}")
    print(f"  Section: {ch.get('section_title', 'N/A')}")
    content = ch.get('content', '')
    print(f"  Content: {content[:300]}{'...' if len(content) > 300 else ''}")
    print()
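
To see how the chunk sizes relate to `min_chunk_words` and `max_chunk_words`, the word counts can be summarised as sketched below. It assumes each chunk stores its text under `content`, as in the preview above; whether the chunker treats the bounds as hard limits is not something this notebook states, so out-of-range chunks are only flagged, not treated as errors. `chunk_word_stats` is a hypothetical helper.

```python
def chunk_word_stats(chunks, min_words=200, max_words=800):
    """Summarise chunk word counts and flag chunks outside [min_words, max_words]."""
    counts = [len(chunk.get("content", "").split()) for chunk in chunks]
    out_of_range = [
        index for index, n in enumerate(counts) if n < min_words or n > max_words
    ]
    return {
        "total": len(counts),
        "min": min(counts, default=0),
        "max": max(counts, default=0),
        "mean": sum(counts) / len(counts) if counts else 0.0,
        "out_of_range": out_of_range,
    }
```

For example, `chunk_word_stats(chunks)` returns the indices of unusually short or long chunks under `out_of_range`, which can then be inspected individually.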

## 7. Neo4j Cypher Script

In [None]:
cypher_path = result.neo4j_cypher_path
if cypher_path.exists():
    print(cypher_path.read_text())
else:
    print("Neo4j Cypher file not generated.")

## 8. Download Results

In [None]:
import shutil

# Zip all outputs
archive_path = shutil.make_archive("skgb_results", "zip", ".", "skgb_output")
print(f"Archive created: {archive_path}")

# Try to trigger a browser download (works in the Colab web UI)
try:
    from google.colab import files
    files.download(archive_path)
    print("Download initiated - check your browser downloads")
except Exception as e:
    # ImportError outside Colab; other errors may occur without the Colab web UI
    print(f"Automatic download unavailable ({e}); find the zip at: {archive_path}")
    print("\nTo download from Colab manually:")
    print("1. Go to Colab: https://colab.research.google.com")
    print("2. Open the Files tab (folder icon on the left)")
    print("3. Right-click 'skgb_results.zip' and download")
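
Before downloading, it can be worth confirming that the archive is not empty. The following sketch uses the standard `zipfile` module; `archive_summary` is a hypothetical helper name.

```python
import zipfile


def archive_summary(zip_path):
    """Return (file_count, total_uncompressed_bytes) for a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        infos = [info for info in archive.infolist() if not info.is_dir()]
        return len(infos), sum(info.file_size for info in infos)
```

Calling `archive_summary(archive_path)` returns the number of files and their combined size; a file count of zero suggests the pipeline wrote its outputs somewhere other than `skgb_output`.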

## 9. Cleanup

Stop the Ollama server when done to free resources.

In [None]:
# Stop the Ollama server
try:
    ollama_proc.terminate()
    ollama_proc.wait()
    print("Ollama server stopped.")
except Exception as e:
    print(f"Could not stop Ollama: {e}")
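
`wait()` without a timeout can hang if the server ignores the termination signal. A slightly more defensive variant, sketched here with a hypothetical helper `stop_process`, escalates to `kill()` after a grace period.

```python
import subprocess


def stop_process(proc, timeout=10.0):
    """Terminate `proc`, escalating to kill() if it has not exited after `timeout` seconds."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

Calling `stop_process(ollama_proc)` would replace the `terminate()` / `wait()` pair above and returns the process exit code.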