Phase 0: Establish canonical paper-faithful baselines and lock reference metrics #1

@georgepullen

Purpose

Create a reproducible paper-faithful baseline package that all downstream tickets must compare against.

Mandatory Reading (blocking)

Before writing any code, the agent must read the following and summarize them in the first issue comment:

  • reports/NL_IMPLEMENTATION_ORACLE.md (sections: 3, 4.1-4.7, 5, 6.1-6.4)
  • reports/paper/NL-print.extracted.clean.txt (Eq. 21-24, 28-31 and Table 1 text)
  • docs/PAPER_COMPLIANCE.md

The first comment must include:

  1. A 5-10 bullet summary of the current implementation state.
  2. The exact list of code paths used for baseline runs.
  3. Confirmed gaps that this ticket leaves unchanged.

Required Code Anchors

  • src/nested_learning/training.py
  • src/nested_learning/model.py
  • configs/pilot_paper_faithful.yaml
  • configs/pilot_selfmod_paper_faithful.yaml
  • scripts/eval/zeroshot.py
  • scripts/eval/niah.py
  • scripts/eval/continual.py
  • scripts/eval/passkey.py
  • scripts/eval/pg19_perplexity.py

Scope

  • Define canonical baseline run commands for:
    • hope_selfmod paper-faithful
    • hope_attention comparison
  • Produce a frozen baseline artifact bundle with checksums and eval outputs.
  • Add a baseline table under reports/ (short, with explicit command provenance).
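One way to produce the checksums for the frozen bundle is sketched below. It is a minimal illustration, not the repository's method: the function names (`sha256_of`, `build_manifest`, `write_manifest`) and the JSON manifest format are assumptions, and the ticket does not say whether scripts/checkpoint/verify.py expects a particular layout.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints are not read whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: Path) -> dict[str, str]:
    """Map each file's bundle-relative POSIX path to its SHA-256 hex digest."""
    files = sorted(p for p in bundle_dir.rglob("*") if p.is_file())
    return {p.relative_to(bundle_dir).as_posix(): sha256_of(p) for p in files}


def write_manifest(bundle_dir: Path, manifest_path: Path) -> dict[str, str]:
    """Write the manifest as sorted JSON (hypothetical format) and return it.

    Keep manifest_path outside bundle_dir; otherwise a second run would hash
    the manifest itself and change the result.
    """
    manifest = build_manifest(bundle_dir)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    return manifest
```

Sorting paths and JSON keys keeps the manifest byte-stable across runs, so the manifest itself can be diffed or hashed when the bundle is re-verified.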

Runbook

  • uv run python train.py --config-name pilot_selfmod_paper_faithful train.steps=2000
  • uv run python train.py --config-name pilot_paper_faithful model.block_variant=hope_attention train.steps=2000
  • Run the full eval suite for both checkpoints.
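To keep command provenance exact, the two training commands above can be held as argv lists and rendered back to strings for the report. This is a sketch only: the dictionary keys and helper names are assumptions, and the eval commands are left out because the ticket does not give their flags.

```python
import shlex
import subprocess

# The two training commands, copied from the Runbook as argv lists.
# The dictionary keys are illustrative labels, not identifiers from the repo.
BASELINE_COMMANDS = {
    "hope_selfmod": [
        "uv", "run", "python", "train.py",
        "--config-name", "pilot_selfmod_paper_faithful",
        "train.steps=2000",
    ],
    "hope_attention": [
        "uv", "run", "python", "train.py",
        "--config-name", "pilot_paper_faithful",
        "model.block_variant=hope_attention",
        "train.steps=2000",
    ],
}


def command_string(name: str) -> str:
    """Return the shell-quoted command line to paste into the report."""
    return shlex.join(BASELINE_COMMANDS[name])


def run_baseline(name: str) -> int:
    """Run one baseline and return its exit code (no shell is involved)."""
    return subprocess.run(BASELINE_COMMANDS[name], check=False).returncode
```

Because the report string and the executed argv come from the same list, the documented command cannot drift from the one that actually ran.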

Deliverables

  • Baseline artifact bundle under artifacts/.
  • Eval JSON set under eval/ for both baselines.
  • Report file documenting the exact commands, hashes, and metrics.
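The report's baseline table could be rendered from collected rows roughly as follows. The column names and the pipe-delimited layout are assumptions; the ticket only requires that commands, hashes, and metrics appear. Missing cells are shown as "n/a" instead of being filled in.

```python
def render_baseline_table(rows: list[dict[str, str]], columns: list[str]) -> str:
    """Render rows as a pipe-delimited table with a header and separator line.

    Any column a row lacks is rendered as "n/a" so gaps stay visible.
    """
    header = "| " + " | ".join(columns) + " |"
    separator = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "n/a")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, separator, *body])


if __name__ == "__main__":
    # Placeholder row: the hash is a stand-in, not a real value.
    example = [{"baseline": "hope_selfmod", "checkpoint_sha256": "<sha256>"}]
    print(render_baseline_table(example, ["baseline", "checkpoint_sha256", "command"]))
```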

Acceptance Criteria

  • No NaN/Inf in training logs.
  • All eval outputs exist for both baselines.
  • scripts/checkpoint/verify.py passes on all published checkpoints.
  • First issue comment includes mandatory reading summary.
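The first two criteria can be checked mechanically. The sketch below assumes things the ticket does not state: that training logs are JSON Lines with numeric fields, and that eval outputs live at eval/<baseline>/<task>.json, with task names taken from the script names under scripts/eval/. Adjust both to the repository's real layout.

```python
import json
import math
from pathlib import Path

# Hypothetical names: tasks mirror the scripts under scripts/eval/, and the
# baseline labels mirror the two runs in the Runbook.
EVAL_TASKS = ["zeroshot", "niah", "continual", "passkey", "pg19_perplexity"]
BASELINES = ["hope_selfmod", "hope_attention"]


def non_finite_records(log_lines):
    """Return (line_number, key) for every NaN/Inf float in a JSONL log.

    Python's json module parses the bare tokens NaN, Infinity and -Infinity
    into floats by default, so they can be detected after loading.
    """
    bad = []
    for number, line in enumerate(log_lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            if isinstance(value, float) and not math.isfinite(value):
                bad.append((number, key))
    return bad


def missing_eval_outputs(eval_root: Path):
    """List expected eval JSON files (assumed layout) that do not exist."""
    expected = [
        eval_root / baseline / f"{task}.json"
        for baseline in BASELINES
        for task in EVAL_TASKS
    ]
    return [path for path in expected if not path.is_file()]
```

Both helpers return lists of offenders instead of a bare pass/fail, so a failing gate can report exactly which log line or eval file is at fault.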

Metadata

Assignees

No one assigned

Labels

  • enhancement (New feature or request)
  • execution-board (Execution board ticket set for paper alignment)
  • phase-0 (Phase 0: baseline lock and instrumentation)
  • quality-gate (Has explicit acceptance criteria and test gates)
