What this is
A direction-setting issue, not a roadmap checklist. Items here are research trajectories ClawLoop is actively exploring. Shape will change as we learn.
The question
Agent learning plateaus when the training distribution stops surprising the agent. Real deployments produce a firehose of traces — successes, failures, tool calls, reward signals — that describe the world the agent actually lives in. The open question is: can we use that trace data to synthesize environments, world models, and curricula that keep the agent learning past where a fixed benchmark tops out?
Concrete research threads
- Failure-driven env synthesis. Cluster real failures into an error taxonomy, generate targeted tasks that exercise those failure modes, and measure whether targeted training closes the gap faster than random sampling.
- Curriculum from traces. Order synthesized tasks by difficulty inferred from observed reward distributions, not hand-tuned schedules. Compare against fixed curricula and random sampling.
- World-model distillation. Learn approximate environment dynamics from traces so learners can train against simulations when live envs are expensive, slow, or irreversible. What's the fidelity floor before transfer breaks down?
- Coverage metrics. Synthesized envs are only useful if they explore regions the real distribution under-samples. Need a measurement story before we claim any benefit.
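To make the curriculum-from-traces thread concrete, here is a minimal sketch of difficulty ordering inferred from observed rewards. The trace schema (`task_id -> list of rewards in [0, 1]`) and the task names are hypothetical, and mean reward is just one stand-in for difficulty; nothing here reflects an existing ClawLoop implementation.

```python
from statistics import mean

def order_by_difficulty(traces: dict[str, list[float]]) -> list[str]:
    """Order synthesized tasks easiest-first.

    Difficulty is inferred from observed reward: tasks where the agent
    already scores well come early; low-reward tasks come later. This
    replaces a hand-tuned schedule with a data-derived one.
    """
    # 1 - mean reward: higher value = harder task (hypothetical proxy).
    difficulty = {task: 1.0 - mean(rewards) for task, rewards in traces.items()}
    return sorted(difficulty, key=difficulty.get)

# Hypothetical trace data, e.g. success-rate rewards per synthesized task.
traces = {
    "parse_json": [0.9, 1.0, 0.8],
    "multi_step_refund": [0.1, 0.0, 0.2],
    "lookup_order": [0.6, 0.5, 0.7],
}
```

A real version would need to handle non-stationarity (difficulty drifts as the learner improves) and low-sample tasks, which is exactly where this thread compares against fixed curricula and random sampling.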
Why this is separate from learner tuning
The companion issue (#54) is about tuning how learners learn. This one is about tuning what they learn from. Different objectives, different data diets, different literature — keeping them separate lets each be evaluated on its own terms. They eventually co-evolve: harder envs drive better learners, better learners surface new failure modes, new failures seed the next round of envs.
Prior art worth reading
- PAIRED and the broader minimax-regret / unsupervised environment design (UED) literature.
- Open-Ended Learning (POET, PLR, ACCEL).
- Synthetic data / self-play work in LLM-agent settings.
Related
Engage
Comment with papers, critiques, or pointers. If you want to collaborate, reach out.