# Setup Text Viewer

**View the actual text** for:
1. **Redacted paper context** (what AI saw)
2. **Ground truth** (what was redacted - AI did NOT see)
3. **Non-agentic generated setup** (single-shot)
4. **Agentic generated setup** (multi-draft)

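The sections below show each in turn, followed by the agentic exploration tree, a side-by-side comparison, an overlap analysis, and the raw JSON.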

In [7]:
import json
from pathlib import Path
from IPython.display import display, Markdown
import textwrap

## Configuration

In [8]:
# CONFIGURE: Your run directory
RUN_DIR = Path("../../runs/20251117_222251_full_eval")

# Load summary to see available papers
with open(RUN_DIR / "summary.json") as f:
    summary = json.load(f)

print("Available papers:")
for i, result in enumerate(summary['results']):
    print(f"{i}: {result['paper_id'][:80]}")

Available papers:
0: Understanding Scaling Laws with Statistical and Ap_e411a237
1: Scaling transformer neural networks for skillful a_f9bbc835


In [9]:
# CONFIGURE: Select paper index
PAPER_INDEX = 0

paper_id = summary['results'][PAPER_INDEX]['paper_id']
paper_dir = RUN_DIR / paper_id

print(f"Selected: {paper_id}")

Selected: Understanding Scaling Laws with Statistical and Ap_e411a237


## Load Data

In [10]:
# Load JSON files
with open(paper_dir / "paper_context.json") as f:
    paper_context = json.load(f)

with open(paper_dir / "non_agentic_setup.json") as f:
    non_agentic = json.load(f)

with open(paper_dir / "agentic_setup.json") as f:
    agentic = json.load(f)

# Load ground truth if available
ground_truth = None
gt_path = paper_dir / "ground_truth.json"
if gt_path.exists():
    with open(gt_path) as f:
        ground_truth = json.load(f)

# Extract setups
non_agentic_setup = non_agentic['setup']
if 'ai_setup' in non_agentic_setup:
    non_agentic_setup = non_agentic_setup['ai_setup']

agentic_setup = agentic['setup']
if 'ai_setup' in agentic_setup:
    agentic_setup = agentic_setup['ai_setup']

print("✓ Data loaded")
print(f"  Ground truth available: {'✓' if ground_truth and ground_truth.get('has_experiments') else '✗'}")

✓ Data loaded
  Ground truth available: ✓
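
The load cell above checks for `ground_truth.json` but assumes the other three files exist. The following is a minimal sketch of a loader that treats any file as optional and applies the same `ai_setup` unwrapping. The helper names `load_json_or_none` and `unwrap_setup` are hypothetical and do not appear in the original notebook.

```python
import json
from pathlib import Path


def load_json_or_none(path):
    """Return parsed JSON from `path`, or None if the file does not exist."""
    path = Path(path)
    if not path.exists():
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def unwrap_setup(record):
    """Return the setup dict, unwrapping a nested 'ai_setup' key if present."""
    if not record:
        return {}
    setup = record.get("setup", {})
    if isinstance(setup, dict) and "ai_setup" in setup:
        return setup["ai_setup"]
    return setup
```

Under these assumptions, `unwrap_setup(load_json_or_none(paper_dir / "agentic_setup.json"))` would yield an empty dict rather than raising when the file is missing, so the later `.get('baselines', [])` calls still work.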


In [11]:
# Helper function for formatting
def format_component(components, component_name):
    """Format a list of components (dicts or plain values) as numbered text."""
    if not components:
        return f"No {component_name} found"
    
    lines = []
    for i, comp in enumerate(components, 1):
        if isinstance(comp, dict):
            name = comp.get('name', 'Unnamed')
            desc = comp.get('description', '')
            rationale = comp.get('rationale', '')
            
            lines.append(f"{i}. {name}")
            if desc:
                lines.append(f"   Description: {desc}")
            if rationale:
                lines.append(f"   Rationale: {rationale}")
        else:
            lines.append(f"{i}. {comp}")
        lines.append("")  # Blank line
    
    return "\n".join(lines)

---

## 1. REDACTED PAPER CONTEXT

**This is what the AI saw** (experimental sections were removed):

In [12]:
print("="*80)
print("REDACTED PAPER CONTEXT (What AI Saw)")
print("="*80)
print()

print("TITLE:")
print(textwrap.fill(paper_context.get('title', 'N/A'), width=80))
print()

print("="*80)
print("ABSTRACT:")
print("="*80)
print(textwrap.fill(paper_context.get('abstract', 'N/A'), width=80))
print()

print("="*80)
print("RESEARCH QUESTION:")
print("="*80)
print(textwrap.fill(paper_context.get('research_question', 'N/A'), width=80))
print()

print("="*80)
print("DOMAIN:")
print("="*80)
print(paper_context.get('domain', 'N/A'))
print()

print("="*80)
print("METHOD DESCRIPTION:")
print("="*80)
print(textwrap.fill(paper_context.get('method_description', 'N/A'), width=80))
print()

print("="*80)
print("NOTE: Experimental sections (experiments, results, evaluation) were REDACTED")
print("The AI never saw the actual baselines, metrics, or datasets used.")
print("="*80)

REDACTED PAPER CONTEXT (What AI Saw)

TITLE:
Unknown Title

ABSTRACT:
  When training deep neural networks, a model’s generalization error is often
observed to follow a power scaling law dependent both on the model size and the
data size. Perhaps the best known example of such scaling laws are for
transformer-based large language models (**LLMs**), where networks with
billions of parameters are trained on trillions of tokens of text. Yet, despite
sustained widespread interest, a rigorous understanding of why transformer
scaling laws exist is still missing. To answer this question, we establish novel
statistical estimation and mathematical approximation theories for transformers
when the input data are concentrated on a low-dimensional manifold. Our theory
predicts a power law between the generalization error and both the training data
size and the network size for transformers, where the power depends on the
intrinsic dimension _d_ of the training data. Notably, the constructed model


---

## 2. GROUND TRUTH (What AI Did NOT See)

**This was REDACTED** - extracted from experimental sections that were removed:

In [13]:
if ground_truth and ground_truth.get('has_experiments'):
    print("="*80)
    print("GROUND TRUTH - WHAT WAS REDACTED (AI DID NOT SEE THIS)")
    print("="*80)
    print()
    print("These experimental details were removed before AI saw the paper.")
    print("This is what the authors actually used in their experiments.")
    print()

    print("="*80)
    print("ACTUAL BASELINES USED:")
    print("─"*80)
    baselines = ground_truth.get('baselines', [])
    if baselines:
        print(format_component(baselines, 'baselines'))
    else:
        print("No baselines extracted (paper may not have explicit baseline comparisons)")
    print()

    print("="*80)
    print("ACTUAL METRICS USED:")
    print("─"*80)
    metrics = ground_truth.get('metrics', [])
    if metrics:
        print(format_component(metrics, 'metrics'))
    else:
        print("No metrics extracted")
    print()

    print("="*80)
    print("ACTUAL DATASETS USED:")
    print("─"*80)
    datasets = ground_truth.get('datasets', [])
    if datasets:
        print(format_component(datasets, 'datasets'))
    else:
        print("No datasets extracted")
    print()

    print("="*80)
    print("EXTRACTION METADATA:")
    print("─"*80)
    print(f"Confidence: {str(ground_truth.get('extraction_confidence') or 'unknown').upper()}")
    if ground_truth.get('notes'):
        print("\nNotes:")
        print(textwrap.fill(ground_truth['notes'], width=80))
    print()

    print("="*80)
    print("NOTE: This ground truth was extracted from experimental sections")
    print("      that were REDACTED before the AI saw the paper.")
    print("="*80)
else:
    print("="*80)
    print("NO GROUND TRUTH AVAILABLE")
    print("="*80)
    print()
    print("Either no experiments were found in the paper, or ground truth")
    print("extraction failed. The AI still generated setups based on the")
    print("research context alone.")
    print("="*80)

GROUND TRUTH - WHAT WAS REDACTED (AI DID NOT SEE THIS)

These experimental details were removed before AI saw the paper.
This is what the authors actually used in their experiments.

ACTUAL BASELINES USED:
────────────────────────────────────────────────────────────────────────────────
No baselines extracted (paper may not have explicit baseline comparisons)

ACTUAL METRICS USED:
────────────────────────────────────────────────────────────────────────────────
1. Scaling Exponent (α_D)
   Description: Measures the predicted scaling law exponent for data size.

2. Scaling Exponent (α_N)
   Description: Measures the predicted scaling law exponent for model size.


ACTUAL DATASETS USED:
────────────────────────────────────────────────────────────────────────────────
1. Gokaslan et al. dataset
   Description: Natural language dataset used for pretraining small LLMs.

2. Eldan and Li dataset
   Description: Natural language dataset used for pretraining small LLMs.

3. Kocetkov et al. dataset

---

## 3. NON-AGENTIC GENERATED SETUP

**Single-shot generation** (1 LLM call):

In [14]:
print("="*80)
print("NON-AGENTIC GENERATED SETUP")
print("="*80)
print()

print("BASELINES:")
print("─"*80)
print(format_component(non_agentic_setup.get('baselines', []), 'baselines'))

print("="*80)
print("METRICS:")
print("─"*80)
print(format_component(non_agentic_setup.get('metrics', []), 'metrics'))

print("="*80)
print("DATASETS:")
print("─"*80)
print(format_component(non_agentic_setup.get('datasets', []), 'datasets'))

print("="*80)
print("EXPERIMENTAL PROTOCOL:")
print("─"*80)
protocol = non_agentic_setup.get('experimental_protocol', {})
if isinstance(protocol, dict):
    for key, value in protocol.items():
        print(f"{key.replace('_', ' ').title()}: {value}")
else:
    print(protocol if protocol else "No protocol specified")
print()

print("="*80)
print("METADATA:")
print("─"*80)
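metadata = non_agentic.get('metadata', {})
# Apply the float format only to numeric values; formatting the 'N/A' fallback
# with :.2f or :.3f would raise ValueError when a key is missing.
time_elapsed = metadata.get('time_elapsed')
quality_score = metadata.get('quality_score')
print(f"Time elapsed: {time_elapsed:.2f}s" if isinstance(time_elapsed, (int, float)) else "Time elapsed: N/A")
print(f"LLM calls: {metadata.get('llm_calls', 'N/A')}")
print(f"Quality score: {quality_score:.3f}" if isinstance(quality_score, (int, float)) else "Quality score: N/A")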
print("="*80)

NON-AGENTIC GENERATED SETUP

BASELINES:
────────────────────────────────────────────────────────────────────────────────
1. Random Initialization Transformer
   Description: A transformer model initialized randomly with no pretraining.
   Rationale: Provides an essential comparison to understand the impact of training on scaling laws.

2. Pretrained Large Transformer
   Description: A pretrained transformer model (e.g., GPT-3) with billions of parameters.
   Rationale: Represents the current state-of-the-art in large transformer models; useful for benchmarking against theoretical predictions.

3. Low-Dimensional Embedded Transformer
   Description: A transformer trained on a dataset artificially constrained to a lower-dimensional manifold.
   Rationale: Directly tests the paper's hypothesis about the relationship between intrinsic data dimension and scaling laws.

4. Shallow Transformer
   Description: A shallow transformer architecture with logarithmic depth.
   Rationale: Tests the p

---

## 4. AGENTIC GENERATED SETUP

**Multi-draft generation with selection** (3-5 LLM calls):

In [15]:
print("="*80)
print("AGENTIC GENERATED SETUP")
print("="*80)
print()

print("BASELINES:")
print("─"*80)
print(format_component(agentic_setup.get('baselines', []), 'baselines'))

print("="*80)
print("METRICS:")
print("─"*80)
print(format_component(agentic_setup.get('metrics', []), 'metrics'))

print("="*80)
print("DATASETS:")
print("─"*80)
print(format_component(agentic_setup.get('datasets', []), 'datasets'))

print("="*80)
print("EXPERIMENTAL PROTOCOL:")
print("─"*80)
protocol = agentic_setup.get('experimental_protocol', {})
if isinstance(protocol, dict):
    for key, value in protocol.items():
        print(f"{key.replace('_', ' ').title()}: {value}")
else:
    print(protocol if protocol else "No protocol specified")
print()

print("="*80)
print("METADATA:")
print("─"*80)
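metadata = agentic.get('metadata', {})
# Apply the float format only to numeric values; formatting the 'N/A' fallback
# with :.2f or :.3f would raise ValueError when a key is missing.
time_elapsed = metadata.get('time_elapsed')
best_score = metadata.get('best_score')
print(f"Time elapsed: {time_elapsed:.2f}s" if isinstance(time_elapsed, (int, float)) else "Time elapsed: N/A")
print(f"LLM calls: {metadata.get('llm_calls', 'N/A')}")
print(f"Best score: {best_score:.3f}" if isinstance(best_score, (int, float)) else "Best score: N/A")
print(f"Drafts explored: {metadata.get('drafts_explored', 'N/A')}")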
print("="*80)

AGENTIC GENERATED SETUP

BASELINES:
────────────────────────────────────────────────────────────────────────────────
1. Standard Transformer Scaling
   Description: Evaluate transformer-based models following established scaling laws.
   Rationale: This serves as a direct comparison to gauge the validity of the proposed theoretical framework against standard practices.

2. Shallow Transformer with Fixed Intrinsic Dimension
   Description: Implement a shallow transformer with logarithmic depth in the intrinsic dimension as proposed in the paper.
   Rationale: This aligns with the paper's hypothesis and validates the novel approach.

3. Random Baseline Model
   Description: Train a random neural network of comparable size without any scaling laws assumptions.
   Rationale: This baseline provides a measure of performance without scaling law considerations, serving as a control.

4. Empirical Scaling Laws from LLMs
   Description: Use empirical scaling laws derived from training large LLMs

---

## 5. AGENTIC EXPLORATION TREE

**All drafts explored** by the agentic approach:

In [16]:
if 'exploration_tree' in agentic:
    print("="*80)
    print("AGENTIC EXPLORATION TREE")
    print("="*80)
    print()
    
    for draft in agentic['exploration_tree']:
        draft_id = draft['draft_id']
        score = draft['score']
        is_best = draft.get('is_best', False)
        
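        # Keep the separating space inside the marker so unselected drafts
        # do not print a trailing space.
        marker = " ⭐ SELECTED" if is_best else ""
        print(f"Draft {draft_id}: Score = {score:.3f}{marker}")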
        
        # Show brief summary of this draft
        draft_setup = draft.get('setup', {})
        if 'ai_setup' in draft_setup:
            draft_setup = draft_setup['ai_setup']
        
        num_baselines = len(draft_setup.get('baselines', []))
        num_metrics = len(draft_setup.get('metrics', []))
        num_datasets = len(draft_setup.get('datasets', []))
        
        print(f"  Components: {num_baselines} baselines, {num_metrics} metrics, {num_datasets} datasets")
        print()
    
    print("="*80)
else:
    print("No exploration tree available")

AGENTIC EXPLORATION TREE

Draft 0: Score = 0.935 
  Components: 4 baselines, 3 metrics, 3 datasets

Draft 1: Score = 0.955 
  Components: 6 baselines, 3 metrics, 3 datasets

Draft 2: Score = 0.975 ⭐ SELECTED
  Components: 4 baselines, 4 metrics, 3 datasets
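
The tree above marks one draft as selected. As a rough sketch, the selected draft can be recovered from a list of draft records by looking for the `is_best` flag. The field names `draft_id`, `score`, and `is_best` follow the cell above; the fallback to the highest score when no draft is flagged is an assumption, and `pick_selected_draft` is a hypothetical helper. The sample records only mirror the printed scores.

```python
def pick_selected_draft(drafts):
    """Return the draft flagged is_best, else the highest-scoring draft, else None."""
    if not drafts:
        return None
    flagged = [d for d in drafts if d.get("is_best")]
    if flagged:
        return flagged[0]
    # Assumed fallback: no flag present, so take the best score.
    return max(drafts, key=lambda d: d.get("score", float("-inf")))


# Sample records mirroring the scores printed above.
drafts = [
    {"draft_id": 0, "score": 0.935},
    {"draft_id": 1, "score": 0.955},
    {"draft_id": 2, "score": 0.975, "is_best": True},
]
selected = pick_selected_draft(drafts)
print(f"Draft {selected['draft_id']}: Score = {selected['score']:.3f}")
```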



---

## 6. SIDE-BY-SIDE COMPARISON

**Three-way comparison** of Non-Agentic vs Agentic vs Ground Truth:

In [None]:
def extract_names(components):
    """Extract just the names from components."""
    names = []
    for comp in components:
        if isinstance(comp, dict):
            names.append(comp.get('name', 'Unnamed'))
        else:
            names.append(str(comp))
    return names

# Extract names from AI-generated setups
non_agentic_baselines = extract_names(non_agentic_setup.get('baselines', []))
agentic_baselines = extract_names(agentic_setup.get('baselines', []))

non_agentic_metrics = extract_names(non_agentic_setup.get('metrics', []))
agentic_metrics = extract_names(agentic_setup.get('metrics', []))

non_agentic_datasets = extract_names(non_agentic_setup.get('datasets', []))
agentic_datasets = extract_names(agentic_setup.get('datasets', []))

# Extract names from ground truth
gt_baselines = []
gt_metrics = []
gt_datasets = []

if ground_truth and ground_truth.get('has_experiments'):
    gt_baselines = extract_names(ground_truth.get('baselines', []))
    gt_metrics = extract_names(ground_truth.get('metrics', []))
    gt_datasets = extract_names(ground_truth.get('datasets', []))

print("="*120)
print("THREE-WAY SIDE-BY-SIDE COMPARISON")
print("="*120)
print()

# Baselines
print("BASELINES:")
print("─"*120)
max_baselines = max(len(non_agentic_baselines), len(agentic_baselines), len(gt_baselines))
print(f"{'Non-Agentic':<35} | {'Agentic':<35} | {'Ground Truth (Actual)':<35}")
print("─"*120)
for i in range(max_baselines):
    na = non_agentic_baselines[i] if i < len(non_agentic_baselines) else ""
    ag = agentic_baselines[i] if i < len(agentic_baselines) else ""
    gt = gt_baselines[i] if i < len(gt_baselines) else ""
    print(f"{na:<35} | {ag:<35} | {gt:<35}")
print()

# Metrics
print("METRICS:")
print("─"*120)
max_metrics = max(len(non_agentic_metrics), len(agentic_metrics), len(gt_metrics))
print(f"{'Non-Agentic':<35} | {'Agentic':<35} | {'Ground Truth (Actual)':<35}")
print("─"*120)
for i in range(max_metrics):
    na = non_agentic_metrics[i] if i < len(non_agentic_metrics) else ""
    ag = agentic_metrics[i] if i < len(agentic_metrics) else ""
    gt = gt_metrics[i] if i < len(gt_metrics) else ""
    print(f"{na:<35} | {ag:<35} | {gt:<35}")
print()

# Datasets
print("DATASETS:")
print("─"*120)
max_datasets = max(len(non_agentic_datasets), len(agentic_datasets), len(gt_datasets))
print(f"{'Non-Agentic':<35} | {'Agentic':<35} | {'Ground Truth (Actual)':<35}")
print("─"*120)
for i in range(max_datasets):
    na = non_agentic_datasets[i] if i < len(non_agentic_datasets) else ""
    ag = agentic_datasets[i] if i < len(agentic_datasets) else ""
    gt = gt_datasets[i] if i < len(gt_datasets) else ""
    print(f"{na:<35} | {ag:<35} | {gt:<35}")
print()

print("="*120)
print()

# Summary stats
print("SUMMARY:")
print(f"  Non-Agentic: {len(non_agentic_baselines)} baselines, {len(non_agentic_metrics)} metrics, {len(non_agentic_datasets)} datasets")
print(f"  Agentic:     {len(agentic_baselines)} baselines, {len(agentic_metrics)} metrics, {len(agentic_datasets)} datasets")
print(f"  Ground Truth: {len(gt_baselines)} baselines, {len(gt_metrics)} metrics, {len(gt_datasets)} datasets")
print("="*120)
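
The three table loops above differ only in their inputs, so they could be folded into one helper. The sketch below uses `itertools.zip_longest` to pad the shorter columns. `format_three_way` is a hypothetical name, and truncating names to the column width is a change from the cell above, added here because a name such as "Low-Dimensional Embedded Transformer" is longer than 35 characters and would otherwise push the columns out of alignment.

```python
from itertools import zip_longest


def format_three_way(title, non_agentic, agentic, ground_truth, width=35):
    """Return a three-column table as a string; short columns are padded with blanks."""
    rule = "─" * (3 * width + 6)
    header = " | ".join(
        f"{label:<{width}}"
        for label in ("Non-Agentic", "Agentic", "Ground Truth (Actual)")
    )
    lines = [f"{title}:", rule, header, rule]
    for row in zip_longest(non_agentic, agentic, ground_truth, fillvalue=""):
        # Truncate long names so every row has the same width.
        cells = [str(cell)[:width] for cell in row]
        lines.append(" | ".join(f"{cell:<{width}}" for cell in cells))
    return "\n".join(lines)


# Baseline names taken from the outputs shown earlier; ground truth had none.
print(format_three_way(
    "BASELINES",
    ["Random Initialization Transformer", "Pretrained Large Transformer",
     "Low-Dimensional Embedded Transformer", "Shallow Transformer"],
    ["Standard Transformer Scaling",
     "Shallow Transformer with Fixed Intrinsic Dimension",
     "Random Baseline Model", "Empirical Scaling Laws from LLMs"],
    [],
))
```

Calling the helper once each for baselines, metrics, and datasets would replace the three hand-written loops.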

---

## 7. OVERLAP ANALYSIS

**What's shared vs unique** between the two approaches (names are compared by exact match after lowercasing and trimming whitespace):

In [18]:
def normalize(name):
    """Normalize for comparison."""
    return name.lower().strip()

def find_overlap(list1, list2):
    """Find shared and unique items."""
    set1 = {normalize(x) for x in list1}
    set2 = {normalize(x) for x in list2}
    
    shared = set1 & set2
    only_1 = set1 - set2
    only_2 = set2 - set1
    
    return sorted(shared), sorted(only_1), sorted(only_2)

print("="*80)
print("OVERLAP ANALYSIS")
print("="*80)
print()

# Baselines
shared_b, only_na_b, only_ag_b = find_overlap(non_agentic_baselines, agentic_baselines)
print("BASELINES:")
print(f"  Shared: {len(shared_b)}")
if shared_b:
    for item in shared_b:
        print(f"    - {item}")
print(f"  Only Non-Agentic: {len(only_na_b)}")
if only_na_b:
    for item in only_na_b:
        print(f"    - {item}")
print(f"  Only Agentic: {len(only_ag_b)}")
if only_ag_b:
    for item in only_ag_b:
        print(f"    - {item}")
print()

# Metrics
shared_m, only_na_m, only_ag_m = find_overlap(non_agentic_metrics, agentic_metrics)
print("METRICS:")
print(f"  Shared: {len(shared_m)}")
if shared_m:
    for item in shared_m:
        print(f"    - {item}")
print(f"  Only Non-Agentic: {len(only_na_m)}")
if only_na_m:
    for item in only_na_m:
        print(f"    - {item}")
print(f"  Only Agentic: {len(only_ag_m)}")
if only_ag_m:
    for item in only_ag_m:
        print(f"    - {item}")
print()

# Datasets
shared_d, only_na_d, only_ag_d = find_overlap(non_agentic_datasets, agentic_datasets)
print("DATASETS:")
print(f"  Shared: {len(shared_d)}")
if shared_d:
    for item in shared_d:
        print(f"    - {item}")
print(f"  Only Non-Agentic: {len(only_na_d)}")
if only_na_d:
    for item in only_na_d:
        print(f"    - {item}")
print(f"  Only Agentic: {len(only_ag_d)}")
if only_ag_d:
    for item in only_ag_d:
        print(f"    - {item}")

print("="*80)

OVERLAP ANALYSIS

BASELINES:
  Shared: 0
  Only Non-Agentic: 4
    - low-dimensional embedded transformer
    - pretrained large transformer
    - random initialization transformer
    - shallow transformer
  Only Agentic: 4
    - empirical scaling laws from llms
    - random baseline model
    - shallow transformer with fixed intrinsic dimension
    - standard transformer scaling

METRICS:
  Shared: 0
  Only Non-Agentic: 4
    - data size efficiency
    - mean squared error (mse)
    - model size efficiency
    - r-squared (r²)
  Only Agentic: 4
    - model complexity measure
    - power law fit accuracy
    - squared regression error (l_sq)
    - training efficiency

DATASETS:
  Shared: 0
  Only Non-Agentic: 4
    - cifar-10
    - imagenet
    - synthetic low-dimensional data
    - wikitext-103
  Only Agentic: 3
    - mnist with dimensionality reduction
    - openwebtext-subset
    - synthetic low-dimensional manifold dataset


---

## 8. FULL JSON VIEW

**Raw JSON** for detailed inspection:

In [19]:
print("NON-AGENTIC SETUP (Full JSON):")
print(json.dumps(non_agentic_setup, indent=2))

NON-AGENTIC SETUP (Full JSON):
{
  "baselines": [
    {
      "name": "Random Initialization Transformer",
      "description": "A transformer model initialized randomly with no pretraining.",
      "rationale": "Provides an essential comparison to understand the impact of training on scaling laws.",
      "implementation_complexity": "simple"
    },
    {
      "name": "Pretrained Large Transformer",
      "description": "A pretrained transformer model (e.g., GPT-3) with billions of parameters.",
      "rationale": "Represents the current state-of-the-art in large transformer models; useful for benchmarking against theoretical predictions.",
      "implementation_complexity": "complex"
    },
    {
      "name": "Low-Dimensional Embedded Transformer",
      "description": "A transformer trained on a dataset artificially constrained to a lower-dimensional manifold.",
      "rationale": "Directly tests the paper's hypothesis about the relationship between intrinsic data dimension and sc

In [20]:
print("AGENTIC SETUP (Full JSON):")
print(json.dumps(agentic_setup, indent=2))

AGENTIC SETUP (Full JSON):
{
  "baselines": [
    {
      "name": "Standard Transformer Scaling",
      "description": "Evaluate transformer-based models following established scaling laws.",
      "rationale": "This serves as a direct comparison to gauge the validity of the proposed theoretical framework against standard practices.",
      "implementation_complexity": "moderate"
    },
    {
      "name": "Shallow Transformer with Fixed Intrinsic Dimension",
      "description": "Implement a shallow transformer with logarithmic depth in the intrinsic dimension as proposed in the paper.",
      "rationale": "This aligns with the paper's hypothesis and validates the novel approach.",
      "implementation_complexity": "complex"
    },
    {
      "name": "Random Baseline Model",
      "description": "Train a random neural network of comparable size without any scaling laws assumptions.",
      "rationale": "This baseline provides a measure of performance without scaling law consideratio