# arcOS Benchmark - Causal QA with GNN + LLM

**Phases 1 and 2: Environment, Data Foundation, and Retrieval**

This notebook implements the complete arcOS benchmark pipeline:
- Graph Neural Network structural reasoning over knowledge graphs
- LLM text generation with graph-guided prompts
- Evaluation on the RoG-WebQSP question-answering dataset

**Requirements:**
- Google Colab with GPU runtime (T4 or better)
- Google Drive mounted for checkpointing
- ~10GB free space on Drive

**Architecture:**
- Dataset: RoG-WebQSP (4,700 QA pairs with Freebase subgraphs: 2,826 train / 246 validation / 1,628 test)
- Graph DB: NetworkX in-memory
- GNN: Graph Attention Network (GATv2)
- LLM: OpenRouter API (Claude 3.5 Sonnet)
- Verbalization: Hard prompts (text-based, not soft embeddings)
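
To make the hard-prompt idea concrete, here is a minimal sketch, assuming each triple is a `[head, relation, tail]` list of strings (the shape of the dataset's `graph` field). The function name `verbalize_triples` and the prompt wording are illustrative assumptions, not the repository's implementation.

```python
from typing import Iterable, Sequence


def verbalize_triples(triples: Iterable[Sequence[str]], question: str) -> str:
    """Render (head, relation, tail) triples as plain text inside a prompt."""
    # Each triple becomes one readable line; no learned (soft) embeddings involved.
    lines = [f"({h}, {r}, {t})" for h, r, t in triples]
    facts = "\n".join(lines)
    return (
        "Knowledge graph facts:\n"
        f"{facts}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = verbalize_triples(
    [["P!nk", "freebase.valuenotation.is_reviewed", "Gender"]],
    "what is the name of justin bieber brother",
)
print(prompt)
```

Because the graph reaches the LLM as ordinary text, this approach works with any hosted chat API, at the cost of prompt length growing with the number of triples.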

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# List folders in MyDrive to see what's there
!ls -la /content/drive/MyDrive/ | head -20

Mounted at /content/drive
total 1137487
-rw------- 1 root root     22759 Jun 14  2020 001374 (1).ldb
-rw------- 1 root root     22759 Jun 14  2020 001374.ldb
-rw------- 1 root root   2123232 Jun 14  2020 001377.ldb
-rw------- 1 root root   2137969 Jun 14  2020 001378.ldb
-rw------- 1 root root     96516 Jun 14  2020 001379 (1).ldb
-rw------- 1 root root     96516 Jun 14  2020 001379.ldb
-rw------- 1 root root    493826 Jun 14  2020 001394 (1).ldb
-rw------- 1 root root    493826 Jun 14  2020 001394.ldb
-rw------- 1 root root   2120864 Jun 14  2020 001395.ldb
-rw------- 1 root root   2110289 Jun 14  2020 001396.ldb
-rw------- 1 root root    153657 Jun 14  2020 001397 (1).ldb
-rw------- 1 root root    153657 Jun 14  2020 001397.ldb
-rw------- 1 root root     83340 Jun 14  2020 001412 (1).ldb
-rw------- 1 root root     83340 Jun 14  2020 001412.ldb
-rw------- 1 root root      9232 Jun 14  2020 001427 (1).ldb
-rw------- 1 root root      9232 Jun 14  2020 001427.ldb
-rw------- 1 root root  

## Cell 1: Environment Setup

Install dependencies and verify GPU availability.

In [2]:
# Clone the repository into Colab
!git clone https://github.com/ashtonalex/arcOS-benchmark-colab /content/arcOS-benchmark-colab

Cloning into '/content/arcOS-benchmark-colab'...
remote: Enumerating objects: 63, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 63 (delta 12), reused 56 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (63/63), 118.61 KiB | 4.56 MiB/s, done.
Resolving deltas: 100% (12/12), done.


In [3]:
# ============================================================================
# ENVIRONMENT SETUP WITH UV PACKAGE MANAGER
# Ensures absolute environment parity between kernel and installed packages
# ============================================================================

import sys
import os
import subprocess
from pathlib import Path

# Colab UV workaround: Clear broken constraint files
os.environ["UV_CONSTRAINT"] = ""
os.environ["UV_BUILD_CONSTRAINT"] = ""

print("="*70)
print("STEP 1: ENVIRONMENT PATH VERIFICATION")
print("="*70)

# Capture current Python executable
current_python = sys.executable
print(f"Current kernel executable: {current_python}")
print(f"Python version: {sys.version}")
# Note: sys.path[0] is the first import search path, not necessarily site-packages
print(f"Site packages: {sys.path[0] if sys.path else 'N/A'}")

# Check if uv is available
def check_uv_available():
    """Check if uv is installed and accessible."""
    try:
        result = subprocess.run(
            ['uv', '--version'],
            capture_output=True,
            text=True,
            timeout=5
        )
        return result.returncode == 0
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False

uv_available = check_uv_available()

if not uv_available:
    print("\n⚠ UV not found. Installing uv package manager...")
    %pip install -q uv
    uv_available = check_uv_available()

if uv_available:
    # Get uv version
    uv_version = subprocess.run(
        ['uv', '--version'],
        capture_output=True,
        text=True
    ).stdout.strip()
    print(f"✓ UV available: {uv_version}")
else:
    print("✗ UV installation failed. Will fall back to pip.")

print("\n" + "="*70)
print("STEP 2: PACKAGE INSTALLATION")
print("="*70)

# Define packages to install
packages = [
    "datasets",
    "networkx",
    "tqdm",
    "faiss-gpu-cu12",  # FAISS GPU build for CUDA 12 (the PyTorch index below targets CUDA 11.8)
]

# PyTorch with CUDA support
torch_packages = "torch torchvision torchaudio"
torch_index = "https://download.pytorch.org/whl/cu118"

if uv_available:
    print(f"Installing packages using UV with --python {current_python}\n")

    # Install PyTorch with CUDA
    print("Installing PyTorch with CUDA 11.8 support...")
    !uv pip install --python {current_python} {torch_packages} --index-url {torch_index}

    # Install other packages
    print("\nInstalling additional dependencies...")
    for package in packages:
        !uv pip install --python {current_python} {package}
else:
    print("Falling back to standard pip installation\n")

    # Install PyTorch with CUDA
    print("Installing PyTorch with CUDA 11.8 support...")
    %pip install -q {torch_packages} --index-url {torch_index}

    # Install other packages
    print("\nInstalling additional dependencies...")
    %pip install -q {' '.join(packages)}

print("\n" + "="*70)
print("STEP 3: INSTALLATION VERIFICATION")
print("="*70)

# Verify installed packages are in the correct location
def verify_package_location(package_name):
    """Verify package is installed in current kernel's site-packages."""
    try:
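        module = __import__(package_name)
        # Namespace packages have no __file__; treat their location as unknown
        module_file = getattr(module, '__file__', None)
        module_path = Path(module_file).parent if module_file else None

        # Check if module is in one of sys.path locations
        in_sys_path = module_path is not None and any(
            str(module_path).startswith(p) for p in sys.path if p
        )

        # Get version if available (cast to str so the :12s format spec works)
        version = str(getattr(module, '__version__', 'unknown'))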

        return {
            'installed': True,
            'version': version,
            'location': str(module_path),
            'in_sys_path': in_sys_path
        }
    except ImportError:
        return {'installed': False}

# Verify key packages
verification_packages = ['torch', 'datasets', 'networkx', 'tqdm', 'faiss'] # Added faiss for verification
print("\nVerifying installed packages:\n")

all_verified = True
for pkg in verification_packages:
    info = verify_package_location(pkg)
    if info['installed']:
        status = "✓" if info['in_sys_path'] else "⚠"
        print(f"{status} {pkg:12s} v{info['version']:12s}")
        print(f"  Location: {info['location']}")
        if not info['in_sys_path']:
            print("  WARNING: Not in sys.path!")
            all_verified = False
    else:
        print(f"✗ {pkg:12s} NOT INSTALLED")
        all_verified = False
    print()

print("="*70)
print("STEP 4: GPU VERIFICATION")
print("="*70)

import torch
gpu_available = torch.cuda.is_available()
print(f"\nGPU available: {gpu_available} {'✓' if gpu_available else '✗'}")

if gpu_available:
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("⚠ Warning: No GPU detected.")
    print("  Go to: Runtime -> Change runtime type -> Select T4 GPU")

print("\n" + "="*70)
print("ENVIRONMENT SETUP SUMMARY")
print("="*70)
print(f"Package manager: {'UV' if uv_available else 'pip'}")
print(f"Python executable: {current_python}")
print(f"All packages verified: {'✓ YES' if all_verified else '✗ NO'}")
print(f"GPU available: {'✓ YES' if gpu_available else '✗ NO'}")
print("="*70)

if not all_verified:
    print("\n⚠ WARNING: Some packages failed verification. Check output above.")
else:
    print("\n✓ Environment setup complete with full parity!")


STEP 1: ENVIRONMENT PATH VERIFICATION
Current kernel executable: /usr/bin/python3
Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Site packages: /content
✓ UV available: uv 0.9.26

STEP 2: PACKAGE INSTALLATION
Installing packages using UV with --python /usr/bin/python3

Installing PyTorch with CUDA 11.8 support...
Using Python 3.12.12 environment at: /usr
Audited 3 packages in 14ms

Installing additional dependencies...
Using Python 3.12.12 environment at: /usr
Audited 1 package in 16ms
Using Python 3.12.12 environment at: /usr
Audited 1 package in 12ms
Using Python 3.12.12 environment at: /usr
Audited 1 package in 13ms
Using Python 3.12.12 environment at: /usr
Audited 1 package in 13ms

STEP 3: INSTALLATION VERIFICATION

Verifying installed packages:

✓ torch        v2.9.0+cpu   
  Location: /usr/local/lib/py
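
The verification output lists `torch v2.9.0+cpu`, a CPU-only build, even though the install step pointed at the CUDA 11.8 index. The `Audited 3 packages` line from uv indicates the requirement was already satisfied, so nothing was reinstalled. A small check along these lines could flag the mismatch early. `torch_build_variant` is a hypothetical helper that only parses the version string, so it runs without importing torch.

```python
def torch_build_variant(version: str) -> str:
    """Return the local build tag of a PyTorch version string, e.g. 'cpu' or 'cu118'."""
    # PEP 440 local version identifiers follow a '+': '2.9.0+cpu' -> 'cpu'.
    _, sep, local = version.partition("+")
    # Wheels without a local tag do not reveal their build this way.
    return local if sep else "unknown"


variant = torch_build_variant("2.9.0+cpu")
if variant == "cpu":
    print("CPU-only PyTorch build detected; CUDA will not be available.")
```

In the notebook this would be called with `torch.__version__`; forcing a different wheel would likely require asking the installer to reinstall rather than audit.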

## Cell 2: Import Modules

Import arcOS benchmark modules from `src/` package.

In [4]:
import sys
from pathlib import Path
import os

# Auto-detect project root (works with Colab, local Jupyter, and VSCode)
# Try multiple strategies to find the project root
possible_roots = []

# Strategy 1: Check if we're in the notebooks/ folder
if Path.cwd().name == "notebooks":
    possible_roots.append(Path.cwd().parent)

# Strategy 2: Check current directory
possible_roots.append(Path.cwd())

# Strategy 3: Check parent directory (for when running from notebooks/)
possible_roots.append(Path.cwd().parent)

# Strategy 4: Check if notebook file path is available (Jupyter/VSCode)
try:
    # Try to get the notebook's directory from IPython
    from IPython import get_ipython
    ipython = get_ipython()
    if ipython and hasattr(ipython, 'user_ns'):
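        # VS Code's Jupyter extension exposes the notebook path as __vsc_ipynb_file__
        notebook_path = ipython.user_ns.get('__vsc_ipynb_file__')
        if notebook_path:
            possible_roots.append(Path(notebook_path).parent.parent)
except Exception:
    # Best-effort strategy: on any failure this candidate is simply skipped
    pass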

# Strategy 5: Colab-specific paths
possible_roots.extend([
    Path("arcOS-benchmark-colab"),
    Path("/content/arcOS-benchmark-colab"),
    Path("/content"),
])

# Remove duplicates while preserving order
seen = set()
unique_roots = []
for root in possible_roots:
    root_abs = root.resolve()
    if root_abs not in seen:
        seen.add(root_abs)
        unique_roots.append(root)

# Search for project root
project_root = None
for root in unique_roots:
    try:
        src_path = root / "src"
        if src_path.exists() and (src_path / "config.py").exists():
            project_root = root.resolve()
            break
    except (OSError, PermissionError):
        continue

if project_root is None:
    print("⚠ ERROR: Could not find src/ directory")
    print(f"Current directory: {Path.cwd()}")
    print(f"\nSearched in the following locations:")
    for i, root in enumerate(unique_roots, 1):
        try:
            abs_path = root.resolve()
            exists = root.exists()
            src_exists = (root / "src").exists() if exists else False
            print(f"  {i}. {abs_path} (exists: {exists}, has src/: {src_exists})")
        except Exception as e:
            print(f"  {i}. {root} (error: {e})")
    raise ImportError("src/ not found - check project structure")

# Add to Python path
sys.path.insert(0, str(project_root))
print(f"✓ Found project root: {project_root}")
print(f"  Current directory: {Path.cwd()}")

# Import modules
print("\nImporting modules...")
try:
    from src.config import BenchmarkConfig
    from src.utils.seeds import set_seeds
    from src.utils.checkpoints import (
        ensure_drive_mounted,
        checkpoint_exists,
        save_checkpoint,
        load_checkpoint,
        create_checkpoint_dirs,
    )
    from src.data.dataset_loader import RoGWebQSPLoader
    from src.data.graph_builder import GraphBuilder
    print("✓ All imports successful")
except ImportError as e:
    print(f"✗ Import failed: {e}")
    print(f"Project root: {project_root}")
    print(f"sys.path: {sys.path[:3]}")
    raise

✓ Found project root: /content/arcOS-benchmark-colab
  Current directory: /content

Importing modules...
✓ All imports successful


## Cell 3: Configuration

Initialize benchmark configuration with hyperparameters.

In [5]:
# Initialize configuration
config = BenchmarkConfig(
    seed=42,
    deterministic=True,
    drive_root="/content/drive/MyDrive/arcOS_benchmark",
)

# Print configuration summary
config.print_summary()

arcOS Benchmark Configuration
Seed: 42 (deterministic=True)
Dataset: rmanluo/RoG-webqsp
Drive root: /content/drive/MyDrive/arcOS_benchmark
Checkpoint dir: /content/drive/MyDrive/arcOS_benchmark/checkpoints
Results dir: /content/drive/MyDrive/arcOS_benchmark/results

--- Retrieval ---
Embedding model: sentence-transformers/all-MiniLM-L6-v2
Top-K entities: 10
PCST budget: 50

--- GNN ---
Hidden dim: 256
Num layers: 3
Num heads: 4
Pooling: attention

--- LLM ---
Model: anthropic/claude-3.5-sonnet
Provider: openrouter
Temperature: 0.0


## Cell 4: Seed Initialization

Set random seeds for reproducibility.

In [6]:
# Set seeds for reproducibility
set_seeds(config.seed, config.deterministic)

✓ Random seeds set to 42 (deterministic=True)
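
The body of `set_seeds` lives in `src/utils/seeds.py` and is not shown here. As a rough sketch of what such a helper typically does, assuming it seeds Python, NumPy, and PyTorch when they are installed (the name `set_all_seeds` is hypothetical):

```python
import os
import random


def set_all_seeds(seed: int, deterministic: bool = True) -> None:
    """Seed the common random number generators used in a training pipeline."""
    random.seed(seed)
    # Only affects subprocesses started later; the current interpreter's
    # hash randomization is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if deterministic:
            # Trade some speed for repeatable cuDNN kernel selection.
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_all_seeds(42)
first_draw = random.random()
set_all_seeds(42)
second_draw = random.random()
print(first_draw == second_draw)  # True
```

Seeding makes sampling repeatable, but some GPU operations can still be nondeterministic, so results may differ slightly across hardware.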


## Cell 5: Google Drive Setup

Mount Google Drive and create checkpoint/results directories.

In [7]:
# Mount Google Drive
drive_mounted = ensure_drive_mounted()

if drive_mounted:
    # Create checkpoint and results directories
    create_checkpoint_dirs(config.checkpoint_dir, config.results_dir)
else:
    print("⚠ Warning: Drive not mounted. Checkpointing will not work.")
    print("  Continuing with local /content/ storage (temporary)")

Mounted at /content/drive
✓ Google Drive mounted at /content/drive
✓ Checkpoint directory: /content/drive/MyDrive/arcOS_benchmark/checkpoints
✓ Results directory: /content/drive/MyDrive/arcOS_benchmark/results


## Cell 6: Dataset Loading

Load RoG-WebQSP dataset from HuggingFace with Drive caching.

In [9]:
# Initialize dataset loader
cache_dir = config.checkpoint_dir / "huggingface_cache"
loader = RoGWebQSPLoader(cache_dir=cache_dir)

# Check for cached dataset
dataset_checkpoint_path = config.get_checkpoint_path("dataset.pkl")

dataset = None # Initialize dataset to None

if checkpoint_exists(dataset_checkpoint_path):
    print("Loading dataset from checkpoint...")
    try:
        dataset = load_checkpoint(dataset_checkpoint_path, format="pickle")
        print("✓ Dataset loaded from checkpoint.")
    except FileNotFoundError as e:
        print(f"⚠ Warning: Failed to load dataset from checkpoint due to missing files: {e}")
        print("  Falling back to downloading dataset from HuggingFace...")
        # dataset stays None, so the download branch below runs

if dataset is None: # If dataset was not loaded successfully or checkpoint didn't exist
    print("Downloading dataset from HuggingFace...")
    dataset = loader.load(dataset_name=config.dataset_name)
    save_checkpoint(dataset, dataset_checkpoint_path, format="pickle")
    print("✓ Dataset downloaded and saved to checkpoint.")


# Inspect dataset schema
loader.inspect_schema(dataset, num_examples=1)

# Compute statistics
loader.compute_statistics(dataset)

# Validate split counts
split_valid = loader.validate_split_counts(
    dataset,
    expected_train=config.expected_train_size,
    expected_val=config.expected_val_size,
    expected_test=config.expected_test_size,
)


✓ HuggingFace cache directory: /content/drive/MyDrive/arcOS_benchmark/checkpoints/huggingface_cache
Loading dataset from checkpoint...
✓ Checkpoint loaded: /content/drive/MyDrive/arcOS_benchmark/checkpoints/dataset.pkl (pickle)
✓ Dataset loaded from checkpoint.

Dataset Schema Inspection
Inspecting first split: train

Fields:
  - id: Value('string')
  - question: Value('string')
  - answer: List(Value('string'))
  - q_entity: List(Value('string'))
  - a_entity: List(Value('string'))
  - graph: List(List(Value('string')))
  - choices: List(Value('null'))

✓ All expected fields present

Sample Examples (first 1):

--- Example 0 ---
ID: WebQTrn-0
Question: what is the name of justin bieber brother
Answer: ['Jaxon Bieber']
Question Entity: ['Justin Bieber']
Answer Entity: ['Jaxon Bieber']
Choices: []
Graph: 9088 triples
  Sample triple: ['P!nk', 'freebase.valuenotation.is_reviewed', 'Gender']

Dataset Statistics

--- train ---
Total examples: 2826
Graph size (triples):
  - Average: 4229.2
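
The load-or-download logic in this cell follows a general "load from checkpoint, else compute and save" pattern. Below is a minimal standalone sketch of that pattern using only the standard library. `load_or_compute` is a hypothetical helper, not part of `src.utils.checkpoints`, and the temp-file write is an added safeguard rather than something the repository is known to do.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(path: Path, compute: Callable[[], Any]) -> Any:
    """Return the pickled object at `path`, computing and saving it if needed."""
    if path.exists():
        try:
            with path.open("rb") as f:
                return pickle.load(f)
        except (pickle.UnpicklingError, EOFError):
            # A corrupt or partially written file falls through to recompute.
            pass
    value = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first so an interrupted save never leaves a
    # half-written checkpoint at the final path.
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("wb") as f:
        pickle.dump(value, f)
    tmp.replace(path)
    return value


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "checkpoints" / "demo.pkl"
    calls = []
    first = load_or_compute(ckpt, lambda: calls.append(1) or {"rows": 3})
    second = load_or_compute(ckpt, lambda: calls.append(1) or {"rows": 3})
    print(first == second, len(calls))  # True 1
```

The second call reads the file instead of recomputing, which is the behavior that saves time when a Colab session restarts.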


## Cell 7: Graph Construction

Build NetworkX graphs from dataset triples.

In [10]:
# Initialize graph builder
graph_builder = GraphBuilder(directed=config.graph_directed)

# Check for cached unified graph
unified_graph_path = config.get_checkpoint_path("unified_graph.pkl")

if checkpoint_exists(unified_graph_path):
    print("Loading unified graph from checkpoint...")
    unified_graph = load_checkpoint(unified_graph_path, format="pickle")
else:
    print("Building unified graph from training split...")
    unified_graph = graph_builder.build_unified_graph(dataset["train"])
    save_checkpoint(unified_graph, unified_graph_path, format="pickle")

# Print graph statistics
graph_builder.print_graph_info(unified_graph, name="Unified Training Graph")

# Validate graph size
graph_valid = graph_builder.validate_graph_size(
    unified_graph,
    min_nodes=config.unified_graph_min_nodes,
    min_edges=config.unified_graph_min_edges,
)

# Build sample per-example graph for demonstration
print("\nBuilding sample per-example graph...")
sample_example = dataset["train"][0]
sample_graph = graph_builder.build_from_triples(
    sample_example["graph"],
    graph_id=sample_example["id"]
)
graph_builder.print_graph_info(sample_graph, name="Sample Per-Example Graph")

✓ GraphBuilder initialized (directed=True)
Loading unified graph from checkpoint...
✓ Checkpoint loaded: /content/drive/MyDrive/arcOS_benchmark/checkpoints/unified_graph.pkl (pickle)

Unified Training Graph Information
Nodes: 1023103
Edges: 2889277
Directed: True
Density: 0.000003
Weakly connected: True

Relation Statistics:
Unique relations: 5622
Top 10 relations:
  - common.topic.notable_types: 135525
  - common.topic.notable_for: 70887
  - location.location.containedby: 67294
  - freebase.valuenotation.is_reviewed: 67203
  - people.person.profession: 44770
  - location.statistical_region.population: 42123
  - common.topic.article: 41886
  - people.person.gender: 40946
  - common.topic.webpage: 38099
  - common.webpage.topic: 38029

Degree Statistics:
Average in-degree: 2.82
Average out-degree: 2.82

Graph Size Validation
Nodes: 1023103 ✓
Edges: 2889277 ✓

✓ Graph meets size requirements

Building sample per-example graph...

Sample Per-Example Graph Information
Nodes: 1723
Edges: 82

## Cell 8: Phase 1 Validation

Automated validation of all Phase 1 success criteria.

In [11]:
print("\n" + "="*60)
print("Phase 1 Success Criteria Validation")
print("="*60)

# Collect validation results
validation_results = {
    "GPU Available": torch.cuda.is_available(),
    "All Imports Successful": True,  # If we got here, imports worked
    "Dataset Splits Valid": split_valid,
    "Unified Graph Size Valid": graph_valid,
}

# Test checkpoint round-trip
test_checkpoint_path = config.get_checkpoint_path("test_roundtrip.pkl")
test_data = {"test": "round-trip", "value": 42}
try:
    save_checkpoint(test_data, test_checkpoint_path, format="pickle")
    loaded_data = load_checkpoint(test_checkpoint_path, format="pickle")
    checkpoint_roundtrip_ok = (loaded_data == test_data)
    validation_results["Checkpoint Round-Trip"] = checkpoint_roundtrip_ok
except Exception as e:
    print(f"Checkpoint round-trip failed: {e}")
    validation_results["Checkpoint Round-Trip"] = False

# Print results
print("\nValidation Results:")
all_passed = True
for criterion, passed in validation_results.items():
    status = "✓" if passed else "✗"
    print(f"  {status} {criterion}")
    if not passed:
        all_passed = False

print("\n" + "="*60)
if all_passed:
    print("✓ PHASE 1 COMPLETE - All criteria passed!")
    print("\nReady to proceed to Phase 2: Retrieval Pipeline")
else:
    print("✗ PHASE 1 INCOMPLETE - Some criteria failed")
    print("\nPlease review failed criteria above")
print("="*60)

# Print summary statistics
print("\nPhase 1 Summary:")
print(f"  Dataset: {config.dataset_name}")
print(f"  Training examples: {len(dataset['train'])}")
print(f"  Validation examples: {len(dataset['validation'])}")
print(f"  Test examples: {len(dataset['test'])}")
print(f"  Unified graph nodes: {unified_graph.number_of_nodes()}")
print(f"  Unified graph edges: {unified_graph.number_of_edges()}")
print(f"  Checkpoints saved to: {config.checkpoint_dir}")


Phase 1 Success Criteria Validation
✓ Checkpoint saved: /content/drive/MyDrive/arcOS_benchmark/checkpoints/test_roundtrip.pkl (pickle)
✓ Checkpoint loaded: /content/drive/MyDrive/arcOS_benchmark/checkpoints/test_roundtrip.pkl (pickle)

Validation Results:
  ✗ GPU Available
  ✓ All Imports Successful
  ✗ Dataset Splits Valid
  ✓ Unified Graph Size Valid
  ✓ Checkpoint Round-Trip

✗ PHASE 1 INCOMPLETE - Some criteria failed

Please review failed criteria above

Phase 1 Summary:
  Dataset: rmanluo/RoG-webqsp
  Training examples: 2826
  Validation examples: 246
  Test examples: 1628
  Unified graph nodes: 1023103
  Unified graph edges: 2889277
  Checkpoints saved to: /content/drive/MyDrive/arcOS_benchmark/checkpoints
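
The run above marks `Dataset Splits Valid` as failed without saying which split is off. A more diagnostic check might look like the sketch below. `check_split_sizes` is a hypothetical helper, and the expected numbers are made up for illustration, since the real values in `config.expected_*_size` are not shown.

```python
from typing import Dict, List


def check_split_sizes(observed: Dict[str, int], expected: Dict[str, int]) -> List[str]:
    """Return human-readable mismatches between observed and expected split sizes."""
    problems = []
    for split, want in expected.items():
        got = observed.get(split)
        if got is None:
            problems.append(f"{split}: missing split")
        elif got != want:
            problems.append(f"{split}: expected {want}, got {got}")
    return problems


# Observed sizes are taken from the Phase 1 summary output.
observed = {"train": 2826, "validation": 246, "test": 1628}
# Hypothetical expectation that disagrees on one split.
expected = {"train": 2826, "validation": 246, "test": 1634}
problems = check_split_sizes(observed, expected)
print(problems)  # ['test: expected 1634, got 1628']
```

Returning a list of messages instead of a single boolean lets the validation cell print the exact split that needs attention.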


## Cell 9: Build Retrieval Pipeline

Initialize retrieval components (embeddings, FAISS index, PCST solver).

In [12]:
print("=" * 60)
print("PHASE 2: RETRIEVAL PIPELINE")
print("=" * 60)

from src.retrieval import Retriever

# Build retriever (uses checkpoints if available)
retriever = Retriever.build_from_checkpoint_or_new(
    config=config,
    unified_graph=unified_graph  # From Phase 1 Cell 7
)

print("\n✓ Retrieval pipeline initialized")
print(f"  - Entity embeddings: {len(retriever.entity_index)} entities")
print(f"  - Top-K: {config.top_k_entities}")
print(f"  - PCST budget: {config.pcst_budget} nodes")

PHASE 2: RETRIEVAL PIPELINE
BUILDING RETRIEVAL PIPELINE

[1/4] Initializing text embedder...
⚠ CUDA not available, falling back to CPU for embeddings


BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✓ Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
  - Device: cpu
  - Embedding dimension: 384

[2/4] Loading/computing entity embeddings...
Loading cached embeddings from entity_embeddings.pkl
✓ Checkpoint loaded: /content/drive/MyDrive/arcOS_benchmark/checkpoints/entity_embeddings.pkl (pickle)
✓ Loaded 1023103 entity embeddings

[3/4] Loading/computing relation embeddings...
Loading cached relation embeddings from relation_embeddings.pkl
✓ Checkpoint loaded: /content/drive/MyDrive/arcOS_benchmark/checkpoints/relation_embeddings.pkl (pickle)
✓ Loaded 5622 relation embeddings

[4/4] Loading/building FAISS index...
Loading cached FAISS index from faiss_index.bin
✓ Loaded FAISS index from /content/drive/MyDrive/arcOS_benchmark/checkpoints/faiss_index.bin
  - 1023103 entities indexed

Initializing PCST solver...
✓ PCST solver ready (budget: 50 nodes)

✓ RETRIEVAL PIPELINE READY

✓ Retrieval pipeline initialized
  - Entity embeddings: 1023103 entities
  - Top-K: 10
  - PCST 
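
Conceptually, the entity-retrieval step finds the Top-K entities whose embeddings are closest to the question embedding. The sketch below shows that idea as a brute-force cosine-similarity search in pure Python. It is a stand-in for illustration, not the FAISS API, and whether the real index uses cosine or L2 distance is an assumption. The 2-dimensional vectors are toy values; the actual embeddings have 384 dimensions.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_entities(
    query: Sequence[float],
    entity_vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k entities with the highest cosine similarity to the query."""
    q = normalize(query)
    scored = (
        (name, sum(a * b for a, b in zip(q, normalize(vec))))
        for name, vec in entity_vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings, invented for this example.
entities = {
    "Cleveland": [1.0, 0.0],
    "Ohio": [0.8, 0.6],
    "Gender": [0.0, 1.0],
}
results = top_k_entities([1.0, 0.1], entities, k=2)
print([name for name, _ in results])  # ['Cleveland', 'Ohio']
```

A brute-force scan like this is linear in the number of entities, which is why an index structure is used for the roughly one million entities in the unified graph.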

## Cell 10: Retrieval Validation

Test retrieval pipeline on 10 validation examples.

In [15]:
!uv pip install pcst-fast

Using Python 3.12.12 environment at: /usr
Resolved 2 packages in 222ms
Prepared 2 packages in 87ms
Installed 2 packages in 8ms
 + pcst-fast==1.0.10
 + pybind11==3.0.1


In [16]:
print("\n" + "=" * 60)
print("RETRIEVAL VALIDATION (10 examples)")
print("=" * 60)

# Use first 10 validation examples
val_examples = list(dataset["validation"].select(range(10)))

hit_count = 0
total_time_ms = 0
subgraph_sizes = []

for i, example in enumerate(val_examples):
    question = example["question"]
    answer_entities = example.get("a_entity", [])
    if isinstance(answer_entities, str):
        answer_entities = [answer_entities]

    # Retrieve subgraph
    result = retriever.retrieve(question)

    # Check if answer entity in subgraph
    subgraph_nodes = set(result.subgraph.nodes())
    hit = any(ans in subgraph_nodes for ans in answer_entities)

    if hit:
        hit_count += 1

    total_time_ms += result.retrieval_time_ms
    subgraph_sizes.append(result.num_nodes)

    # Print example
    print(f"\n[{i+1}/{len(val_examples)}] Q: {question[:60]}...")
    print(f"  Answer entities: {answer_entities}")
    print(f"  Subgraph: {result.num_nodes} nodes, {result.num_edges} edges")
    print(f"  Hit: {'✓' if hit else '✗'}")
    print(f"  Time: {result.retrieval_time_ms:.1f}ms")

# Summary metrics
hit_rate = hit_count / len(val_examples) * 100
avg_time = total_time_ms / len(val_examples)
avg_size = sum(subgraph_sizes) / len(subgraph_sizes)

print("\n" + "=" * 60)
print("VALIDATION SUMMARY")
print("=" * 60)
print(f"Hit rate: {hit_rate:.1f}% ({hit_count}/{len(val_examples)})")
print(f"Avg retrieval time: {avg_time:.1f}ms")
print(f"Avg subgraph size: {avg_size:.1f} nodes")
print(f"Max subgraph size: {max(subgraph_sizes)} nodes")


RETRIEVAL VALIDATION (10 examples)

[1/10] Q: how old is sacha baron cohen...
  Answer entities: ['1971-10-13']
  Subgraph: 1 nodes, 1 edges
  Hit: ✗
  Time: 29313.7ms

[2/10] Q: what time zone am i in cleveland ohio...
  Answer entities: ['Eastern Time Zone']
  Subgraph: 1 nodes, 1 edges
  Hit: ✗
  Time: 31709.9ms


KeyboardInterrupt: 
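
The hit metric in this cell is an exact string match between answer entities and subgraph node names. A standalone sketch of that computation follows. The helper names are hypothetical and the node sets are invented for illustration. It also shows one way to avoid a division by zero when the loop is interrupted before any example completes.

```python
from typing import Iterable, List, Set


def answer_hit(answer_entities: Iterable[str], subgraph_nodes: Set[str]) -> bool:
    """True if any answer entity appears verbatim among the subgraph's nodes."""
    return any(ans in subgraph_nodes for ans in answer_entities)


def hit_rate(hits: List[bool]) -> float:
    """Percentage of hits; 0.0 for an empty list instead of dividing by zero."""
    return 100.0 * sum(hits) / len(hits) if hits else 0.0


# Invented node sets; the answers mirror the two examples printed above.
hits = [
    answer_hit(["Eastern Time Zone"], {"Cleveland", "Eastern Time Zone"}),
    answer_hit(["1971-10-13"], {"Sacha Baron Cohen"}),
]
print(f"{hit_rate(hits):.1f}%")  # 50.0%
```

Exact matching means a literal answer such as a date only counts as a hit if that literal is itself a node in the retrieved subgraph.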

## Cell 11: Phase 2 Success Criteria

Validate Phase 2 completion criteria.

In [18]:
import networkx as nx

print("\n" + "=" * 60)
print("PHASE 2 SUCCESS CRITERIA")
print("=" * 60)

# Criterion 1: Retrieval speed < 1 second
speed_pass = avg_time < 1000  # ms
print(f"[{'✓' if speed_pass else '✗'}] Retrieval completes in <1 second: {avg_time:.1f}ms")

# Criterion 2: Hit rate >= 60%
hit_pass = hit_rate >= 60.0
print(f"[{'✓' if hit_pass else '✗'}] Subgraph contains answer entity ≥60%: {hit_rate:.1f}%")

# Criterion 3: Sampled subgraphs are connected
# (only the first 5 examples are checked, and each check re-runs retrieval)
all_connected = all(
    nx.is_weakly_connected(retriever.retrieve(example["question"]).subgraph)
    for example in val_examples[:5]
)
print(f"[{'✓' if all_connected else '✗'}] Sampled subgraphs (first 5) are connected")

# Criterion 4: Subgraph size respects budget
size_pass = max(subgraph_sizes) <= config.pcst_budget
print(f"[{'✓' if size_pass else '✗'}] Subgraph size ≤ budget ({max(subgraph_sizes)} ≤ {config.pcst_budget})")

# Overall pass
all_pass = speed_pass and hit_pass and all_connected and size_pass
print("\n" + "=" * 60)
if all_pass:
    print("✓ PHASE 2 COMPLETE - All criteria met!")
    print("\nReady to proceed to Phase 3: GNN Encoder")
else:
    print("⚠ PHASE 2 INCOMPLETE - Review failed criteria above")
print("=" * 60)


PHASE 2 SUCCESS CRITERIA


NameError: name 'avg_time' is not defined
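
The `NameError` arises because Cell 10 was interrupted inside its loop, before the summary lines that define `avg_time` had run. One way to make this cell fail with a clearer message is to check for the required names first. The sketch below uses a plain dictionary as a stand-in for the notebook's `globals()`, and `missing_names` is a hypothetical helper.

```python
from typing import Iterable, List, Mapping


def missing_names(required: Iterable[str], namespace: Mapping[str, object]) -> List[str]:
    """Return the required variable names that are absent from the namespace."""
    return [name for name in required if name not in namespace]


required = ["avg_time", "hit_rate", "subgraph_sizes"]
# Stand-in for globals(): pretend only hit_rate was defined.
namespace = {"hit_rate": 0.0}
missing = missing_names(required, namespace)
if missing:
    print(f"Re-run Cell 10 first; undefined: {', '.join(missing)}")
```

In the notebook, passing `globals()` as the namespace and raising an error when `missing` is non-empty would stop the cell before any criterion is evaluated on stale or absent metrics.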