Parameterized personality architecture and Letta evals for testing how agent behavior changes across models.
The most important result in this repo is not that personality matters. It's that stronger models can regress under heavier personality structure.
From evals/results.md:
| Form | Auto/Letta | M2.5 | M2.7 |
|---|---|---|---|
| Stealth | 0.73 | 0.77 | 0.82 |
| Compressed | 0.80 | 0.70 | 0.75 |
| Full | 0.63 | 0.70 | 0.67 |
Stealth improved monotonically with stronger models. Full did not.
That points to a more interesting failure mode than "long prompts bad": instruction hierarchy conflict.
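The monotonicity claim can be checked directly against the table. The sketch below copies the scores from the table above and assumes the columns are ordered from weakest to strongest model; the helper names are illustrative and are not part of the repo.

```python
# Scores copied from the results table; column order is assumed to run
# from weakest to strongest model.
MODELS = ["Auto/Letta", "M2.5", "M2.7"]
SCORES = {
    "Stealth": [0.73, 0.77, 0.82],
    "Compressed": [0.80, 0.70, 0.75],
    "Full": [0.63, 0.70, 0.67],
}


def is_monotonic_increasing(values):
    """True if each score is strictly higher than the one before it."""
    return all(b > a for a, b in zip(values, values[1:]))


def regressions(values, models=MODELS):
    """List (weaker_model, stronger_model, delta) for every drop in score."""
    return [
        (models[i], models[i + 1], round(values[i + 1] - values[i], 2))
        for i in range(len(values) - 1)
        if values[i + 1] < values[i]
    ]


for form, values in SCORES.items():
    print(form, is_monotonic_increasing(values), regressions(values))
```

Only Stealth passes the monotonic check. Full drops by 0.03 from M2.5 to M2.7, and Compressed drops by 0.10 from Auto/Letta to M2.5.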
constitution.json ← canonical behavioral source
↓
personality/ ← parameter schema, profiles, lexicons, render templates
↓
generated/ ← rendered forms, system overlays, candidate payloads
↓
forms/ ← legacy synced forms
↓
evals/ ← benchmark data, result artifacts, slot specs
↓
search/ ← candidate generation, runners, reports
Personality is data, not prose.
The repo is moving toward:
- semantic parameters
- deterministic rendering
- model-specific evaluation
- static eval slots instead of disposable eval agents
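As a rough illustration of "semantic parameters" plus "deterministic rendering", the sketch below renders a small profile through a fixed template and fingerprints it. The parameter names, template text, and function names are all hypothetical; the actual schema lives in personality/ and may look quite different.

```python
import hashlib
import json

# Hypothetical profile: parameter names and values are illustrative only,
# not the schema used in personality/.
PROFILE = {"verbosity": "low", "warmth": 0.3, "scope": "strict"}

# Hypothetical render template keyed by parameter name.
TEMPLATE = (
    "Verbosity: {verbosity}\n"
    "Warmth: {warmth:.1f}\n"
    "Scope handling: {scope}\n"
)


def render(profile: dict) -> str:
    """Render a profile to text; output depends only on parameter values."""
    return TEMPLATE.format(**profile)


def fingerprint(profile: dict) -> str:
    """Stable short hash of a profile; sorted keys make key order irrelevant."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# The same parameters in a different key order render and hash identically.
reordered = {"scope": "strict", "warmth": 0.3, "verbosity": "low"}
assert render(PROFILE) == render(reordered)
assert fingerprint(PROFILE) == fingerprint(reordered)
print(render(PROFILE))
```

Because the rendered text is a pure function of the parameters, a change in eval scores can be traced to a parameter change rather than to incidental prose edits.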
git clone https://github.com/ameno-/leda-agents
cd leda-agents
python3 scripts/render_profiles.py --sync-legacy
cat evals/results.md

- constitution.json: behavioral source of truth
- personality/profiles/base/: baseline personality profiles
- generated/: rendered outputs from parameterized profiles
- evals/results.md: current cross-model result summary
- evals/rubric.txt: grader rubric
- search/run_experiment.py: static-slot experiment runner scaffold
- Stealth scales with stronger models
- Compressed is not universally best
- Full can regress on stronger models
- Scope-respect is the main regression surface
- Eval environment contamination is real — fixture isolation matters
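One way to picture fixture isolation is shown below. The sketch assumes that contamination means state leaking between eval runs through shared files, which this README does not spell out, and `isolated_fixture` is a hypothetical helper rather than something the repo provides. Each run receives a fresh copy of the fixture directory, which is discarded afterwards.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_fixture(source_dir):
    """Copy fixtures into a fresh temp dir and delete it when the run ends."""
    with tempfile.TemporaryDirectory(prefix="eval-fixture-") as tmp:
        workdir = Path(tmp) / "fixture"
        shutil.copytree(source_dir, workdir)
        yield workdir


# Demo with a throwaway fixture directory (hypothetical contents).
with tempfile.TemporaryDirectory() as src:
    Path(src, "notes.txt").write_text("pristine")

    # Run 1 modifies its copy and leaves a stray file behind.
    with isolated_fixture(src) as run1:
        (run1 / "notes.txt").write_text("modified by run 1")
        (run1 / "scratch.txt").write_text("leftover")

    # Run 2 starts from the original fixture and sees none of that.
    with isolated_fixture(src) as run2:
        seen = sorted(p.name for p in run2.iterdir())
        content = (run2 / "notes.txt").read_text()

print(seen, content)
```

The second run sees only the original file with its original contents, so a score difference between runs cannot come from files left behind by an earlier run.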
Most prompt tuning work still treats personality as prose. This repo treats it as a system:
- source
- rendering
- evaluation
- regression detection
That makes the failures legible.
MIT