# InsightSpike-AI: Large Scale RAG Experiment (Colab)

This notebook demonstrates how to run a large-scale RAG experiment (e.g., 10,000 nodes, 1,000 queries) using the InsightSpike-AI framework.

- **Goal**: Evaluate geDIG on a larger dataset with real embeddings.
- **Target**: 1,000 queries, 10,000 documents, SentenceTransformers embeddings.

## 1. Setup Environment
We need `sentence-transformers` (which pulls in `torch` as a dependency) and `scikit-learn` for this experiment.

In [None]:
!git clone https://github.com/miyauchikazuyoshi/InsightSpike-AI.git
%cd InsightSpike-AI
!pip install -e ".[dev]"
!pip install sentence-transformers scikit-learn
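
Before moving on, it can help to confirm that the installs above succeeded. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of InsightSpike-AI, and the list of import names is an assumption based on the packages installed in this cell.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level import names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names (not pip names) for the packages this notebook relies on.
required = ["torch", "sentence_transformers", "sklearn"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is reported as missing, re-run the install cell before continuing.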

## 2. Generate Large Dataset (10k Nodes, 1k Queries)
Since the repository only contains small samples, we will generate a synthetic dataset with:
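- **10,000 Documents**: 10 per query line, pooled across all lines to build the corpus.
- **1,000 Queries**: Each with a unique ground truth.
- **Format**: JSON Lines, one entry per line with `query`, `ground_truth`, and `documents` fields, compatible with `dataset.py`.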

In [None]:
import json
import os

output_file = "experiments/exp2to4_lite/data/synthetic_10k.jsonl"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

total_docs = 10000
total_queries = 1000
docs_per_query = total_docs // total_queries  # 10 docs per query line

print(f"Generating {total_queries} queries and {total_docs} documents...")

with open(output_file, "w", encoding="utf-8") as f:
    doc_counter = 0
    for q_idx in range(total_queries):
        # Generate docs_per_query documents for this query line
        batch_docs = []
        for _ in range(docs_per_query):
            doc_id = f"doc_{doc_counter}"
            # Simple synthetic text
            text = f"This is the content of document {doc_counter}. It contains information relevant to query {q_idx} if selected as ground truth."
            metadata = {"id": doc_id, "source": "synthetic"}
            batch_docs.append({"id": doc_id, "text": text, "metadata": metadata})
            doc_counter += 1
        
        # Pick the first one as 'relevant' for this query
        target_doc = batch_docs[0]
        query_text = f"What is the content of document {target_doc['id']}?"
        ground_truth = target_doc['text']
        
        entry = {
            "query": query_text,
            "ground_truth": ground_truth,
            "documents": batch_docs  # These will be added to the global corpus
        }
        f.write(json.dumps(entry) + "\n")

print(f"Created {output_file}")
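
A quick way to verify the generated file is to read it back and count what it contains. The sketch below assumes only the schema written by the cell above (`query`, `ground_truth`, `documents` with `id` and `text`); the function name `summarize_dataset` is hypothetical and not part of the repository.

```python
import json


def summarize_dataset(path):
    """Count queries and documents in a JSONL file using this notebook's schema."""
    n_queries = 0
    n_docs = 0
    doc_ids = set()
    missing_ground_truth = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            n_queries += 1
            texts = set()
            for doc in entry["documents"]:
                n_docs += 1
                doc_ids.add(doc["id"])
                texts.add(doc["text"])
            # The ground truth should be the text of one of the entry's documents.
            if entry["ground_truth"] not in texts:
                missing_ground_truth += 1
    return {
        "queries": n_queries,
        "documents": n_docs,
        "unique_doc_ids": len(doc_ids),
        "missing_ground_truth": missing_ground_truth,
    }
```

Calling `summarize_dataset(output_file)` on the file generated above should report 1,000 queries, 10,000 documents, 10,000 unique document IDs, and no missing ground truths.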

## 3. Create Configuration
We will create a custom configuration file `large_scale_config.yaml` that targets the full dataset.

In [None]:
%%writefile large_scale_config.yaml
experiment:
  name: exp23_large_scale_10k
  output_dir: experiments/exp2to4_lite/results
  seed: 42
  target_ag_rate: 0.08
  target_dg_rate: 0.04

dataset:
  # Use the synthetic 10k dataset
  path: experiments/exp2to4_lite/data/synthetic_10k.jsonl
  max_queries: null  # Run all 1000 queries

embedding:
  model: sentence-transformers/all-MiniLM-L6-v2
  normalize: true
  cache_dir: experiments/exp2to4_lite/cache

retrieval:
  top_k: 10
  bm25_weight: 0.5
  embedding_weight: 0.5
  expansion_hops: 1

gedig:
  lambda: 0.6
  use_multihop: true
  max_hops: 3
  decay_factor: 0.7
  sp_beta: 0.2
  theta_ag: 2.0
  theta_dg: 0.05
  ig_mode: raw
  spike_mode: and

psz:
  acceptance_threshold: 0.6
  fmr_threshold: 0.02
  latency_p50_threshold_ms: 200

baselines:
  - name: static_rag
    type: static
  - name: gedig_ag_dg
    type: gedig

logging:
  save_step_logs: false
  save_memory_snapshots: false
  snapshot_interval: 200
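
A long run is cheaper to debug if obvious configuration mistakes are caught first. The following is a minimal sketch of such a check, operating on an already-parsed config dict. The function `check_config` is hypothetical and not part of InsightSpike-AI, the section list simply mirrors the YAML above, and the `[0, 1]` range for the retrieval weights is an assumption rather than a documented constraint of the framework.

```python
import os

# Top-level sections used by large_scale_config.yaml in this notebook.
EXPECTED_SECTIONS = ("experiment", "dataset", "embedding", "retrieval",
                     "gedig", "psz", "baselines", "logging")


def check_config(cfg, check_paths=True):
    """Return a list of problems found in a parsed config dict (empty if none)."""
    problems = []
    for section in EXPECTED_SECTIONS:
        if section not in cfg:
            problems.append(f"missing section: {section}")

    # Assumption: retrieval weights are meant to lie in [0, 1].
    retrieval = cfg.get("retrieval") or {}
    for key in ("bm25_weight", "embedding_weight"):
        value = retrieval.get(key)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"retrieval.{key} should be a number in [0, 1], got {value!r}")

    # The dataset file should exist before the run starts.
    path = (cfg.get("dataset") or {}).get("path")
    if not path:
        problems.append("dataset.path is not set")
    elif check_paths and not os.path.exists(path):
        problems.append(f"dataset file not found: {path}")
    return problems
```

Assuming PyYAML is available in the runtime, the dict can be obtained with `yaml.safe_load` on the opened `large_scale_config.yaml` and passed to `check_config`; an empty list means none of these checks failed.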

## 4. Run Experiment
Execute the experiment using the custom config.

In [None]:
!python -m experiments.exp2to4_lite.src.run_experiment --config large_scale_config.yaml
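
Once the run finishes, it is useful to see what was actually written. The sketch below lists every file under the `output_dir` from the config together with its size. The helper `list_result_files` is hypothetical, and no particular output file names are assumed, since they depend on the framework.

```python
from pathlib import Path


def list_result_files(output_dir="experiments/exp2to4_lite/results"):
    """Return sorted (relative_path, size_in_bytes) pairs for files under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return []  # nothing written yet, or wrong working directory
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


for rel_path, size in list_result_files():
    print(f"{size:>12,d}  {rel_path}")
```

An empty listing usually means either the run failed or the notebook's working directory is not the repository root.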

## 5. Download Results (Base64 Fallback)
If the Colab file browser and direct download both fail, run this cell.
It prints the results archive as one long Base64 string. Copy that string into a local file named `results_base64.txt`, then decode it with any Base64 decoder to recover `results.zip`.

In [None]:
import base64
import os

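# Create the archive if it is missing (the relative path assumes the working directory is the repository root).
# An existing /content/results.zip is reused as-is, so delete it first to archive fresh results.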
if not os.path.exists('/content/results.zip'):
    os.system('zip -r /content/results.zip experiments/exp2to4_lite/results')

with open('/content/results.zip', 'rb') as f:
    data = f.read()
    b64_data = base64.b64encode(data).decode('utf-8')

print("COPY THE STRING BELOW (BETWEEN THE SEPARATORS) AND SAVE TO 'results_base64.txt' LOCALLY:")
print("="*80)
print(b64_data)
print("="*80)
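
To turn the copied string back into an archive on a local machine, a small script along these lines should be enough. This is a sketch, not a script shipped with the repository: the function name `decode_results` and the default file names are assumptions chosen to match the instructions above. It tolerates line wrapping from copy and paste, and it skips the `=` separator lines if they were copied along with the payload.

```python
import base64
from pathlib import Path


def decode_results(src="results_base64.txt", dst="results.zip"):
    """Decode a Base64 text file back into the original binary archive."""
    lines = Path(src).read_text(encoding="ascii").splitlines()
    kept = []
    for line in lines:
        stripped = line.strip()
        # Skip separator lines made only of '=' (real padding is at most 2 chars).
        if len(stripped) > 2 and set(stripped) == {"="}:
            continue
        kept.append(stripped)
    payload = "".join(kept)
    # validate=True raises binascii.Error if stray characters remain.
    data = base64.b64decode(payload, validate=True)
    Path(dst).write_bytes(data)
    return len(data)


if __name__ == "__main__" and Path("results_base64.txt").exists():
    size = decode_results()
    print(f"Wrote results.zip ({size} bytes)")
```

Run it from the directory that contains `results_base64.txt`, then unzip the resulting `results.zip` as usual.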