Evolution system rework: scoreboard-ready signatures, archetype parity, and unprompted-surface specialization #26

@theognis1002

Problem Statement

The evolution system in docs/evolution.md and crates/core/src/evolution/ has a set of foundational issues that surface as user-visible bugs and as ceiling-limits on what the feature can become:

  • Stage 2 → Final takes 1–2+ years to reach because the level gate is Lvl.99 on an n^1.8 curve. Engagement collapses well before users get there. Once Final is reached, the level cap at 99 means there's no further progression for long-term users.
  • The "permanent name" + "drifting archetype" rules contradict each other. A user who pivots from Ops to Marketing keeps an "Ops Sentinel" name forever, while the system simultaneously injects Archetype: Marketer into the prompt. The agent's identity and behavior diverge.
  • Generalists can't evolve. The Stage 1 → 2 dominance gate (top archetype ≥ 1.3× runner-up) locks out the modal indie/founder user who legitimately splits time between two domains.
  • 4 of 11 archetypes (Communicator, Creator, Caretaker, Merchant) are nearly unreachable because they have no keyword sets in classification.rs. The 11-archetype surface promises parity the implementation doesn't deliver.
  • Plan mode users earn ~⅓ the XP of Execute-mode peers doing equivalent intellectual work, because creation events (the +2/+3 XP source) are blocked in Plan mode. The mode is penalized for being thoughtful.
  • Per-turn <evolution_context> injection makes the claim that the agent is "an Ops specialist" hollow. Two lines of XML don't measurably change in-task behavior, but they risk miscasting the agent for cross-domain tasks. The doc claim "defaults to infrastructure-oriented solutions" is not supported by the mechanism.
  • Scoring weight (35% lifetime / 65% recent) locks new archetypes out for 5+ months during legitimate career pivots. Users complain the system "isn't watching them."
  • Correction-rate gate (< 20% over 14 days) has no minimum-events floor. A returning user with 4 events and 1 correction is gate-blocked at 25%, despite normal behavior.
  • HMAC anti-cheat is the wrong primitive for the future. A planned "Borg Scoreboard" needs asymmetric signatures with per-install identity; HMAC's symmetric key (necessarily living in the binary) provides no defense beyond deterring casual tampering.

Solution

Rework the evolution system so that:

  1. Progression to and within Final stage is alive — Stage 2→3 gate drops to Lvl.80; Final has no level cap; post-99 levels arrive at a steady tempo with milestone celebrations every 25 levels.
  2. Identity tracks behavior — the evolution name is no longer permanent; it re-mints when archetype drifts stably for ≥14 days. Hybrid users get a hybrid LLM-generated name.
  3. All 11 archetypes are reachable — keyword sets added for Communicator, Creator, Caretaker, Merchant.
  4. Plan mode is rewarded — one plan emission per session earns full creation XP, classified by plan content.
  5. Evolution drives unprompted surfaces only — heartbeat seed, memory consolidation, skill-loading priority (tiebreaker, never override task), status/ambient/share-card. Per-turn <evolution_context> injection is removed. In-task behavior is always best-effort, not archetype-tilted.
  6. Anti-cheat moves to Ed25519 signatures with per-install keypair stored in the OS keychain — sets up future scoreboard-grade integrity without committing to server infrastructure today.
  7. Calibration fixes — 20/80 recent-weighted scoring; correction-rate floor of 50 events with Deferred state; uniform 1.3× dominance threshold across heartbeat/memory/skills with neutral fallback when sub-threshold.

User Stories

  1. As a Borg user reaching Stage 1 Lvl.99 in my first week, I want my evolution celebration to fire as a deliberate early-payoff hook so that I learn the system exists before disengaging.
  2. As an active long-term user past Stage 2 Lvl.80, I want to evolve to Final without grinding to Lvl.99 first, so that the gate matches the mastery I've actually demonstrated.
  3. As a Final-stage user, I want my level to keep climbing past 99 so that I have an ongoing reason to engage.
  4. As a Final-stage user past Lvl.99, I want milestone celebrations every 25 levels so that progression has rhythm rather than being a slow number tick.
  5. As a Final-stage user past Lvl.99, I want the ambient header to keep showing Ascending mood so that the agent feels active, not "stable / done."
  6. As a user pivoting careers, I want my evolution name to re-mint after stable archetype drift (≥14 days) so that the agent's identity reflects who I am now, not who I was at first evolution.
  7. As a generalist founder doing both engineering and marketing, I want to evolve through Stage 1→2 without being blocked by a dominance gate, so that my balanced behavior isn't penalized.
  8. As a generalist user, I want my evolution name to be a meaningful hybrid (e.g., "Builder-Marketer Hybrid") so that the system reflects my real working pattern.
  9. As a household-tooling-heavy user, I want Caretaker actions (home automation, wellness, household rhythms) to register and accumulate XP so that "Caretaker Borg" is a reachable identity.
  10. As a commerce-heavy user, I want Merchant actions (Stripe, inventory, P&L work) to register so that "Merchant Borg" is a reachable identity.
  11. As a content-creator user, I want Creator actions (writing, publishing, design tooling) to register beyond just write_memory so that creative work counts.
  12. As a communications-heavy user, I want shell-based messaging tools (mailx, signal-cli, etc.) to register as Communicator so that classification matches behavior.
  13. As a Plan-mode user, I want each session that produces a plan to earn creation-equivalent XP so that thoughtful work isn't penalized in evolution speed.
  14. As a Plan-mode user, I want the plan I emit to be classified by its content (not the tool that emitted it) so that planning a Kubernetes migration earns Ops XP, not generic XP.
  15. As a Plan-mode-then-execute user, I don't want to double-dip XP on the same intent — one plan per session, regardless of how many revisions.
  16. As a user asking my "Ops-specialized" Borg about marketing, I want the agent to perform at peak ability for that task — its archetype should not deprioritize task-relevant skills or instructions.
  17. As an Ops-specialized user, I want my heartbeat check-ins to surface Ops-relevant topics (deploy/CI/reliability) instead of generic productivity.
  18. As a Marketer-specialized user, I want my heartbeat check-ins to surface campaign/funnel/audience topics.
  19. As a hybrid (no dominant archetype) user, I want heartbeat check-ins to fall back to neutral "general productivity" content rather than pretending I'm specialized.
  20. As an Ops-specialized user, I want nightly memory consolidation to preserve incident notes and infrastructure observations preferentially.
  21. As an Ops-specialized user, I want my default skill-loading priority to surface Docker/K8s/Terraform skills first, while still loading task-relevant skills regardless of my archetype.
  22. As any user, I want the share card / /status / ambient header to display my evolution name and level so that identity is visible and shareable.
  23. As a future scoreboard participant, I want my events to be cryptographically signed by a key only my install holds, so that submissions are attributable and harder to forge.
  24. As a user who reinstalls Borg, I want the keychain-backed keypair to regenerate cleanly so that the system continues working (re-registration handled by the future scoreboard service).
  25. As a user who pivots career and pushes hard in the new archetype, I want the dominance shift to register within roughly two months (not five+).
  26. As a returning user after a vacation, I don't want a single bad day to disqualify me from Stage 2→3 evolution because of a small-sample correction-rate spike.
  27. As any user, I want the evolution gate readiness display to honestly show Deferred (rather than Failed) when I lack enough vitals events for the rate gate to be statistically meaningful.
  28. As a heavy-aligned-archetype user, I want my consistent specialization to earn a small bonus, but not so much that early dominance locks in irreversibly.
  29. As a developer reading docs/evolution.md, I want the documentation claim to match the implementation — no statements like "defaults to infrastructure-oriented solutions" if the mechanism doesn't deliver that.
  30. As a developer maintaining the evolution system, I want pure-logic modules (scoring, XP curve, plan emission) extracted and unit-tested so that calibration changes don't require integration tests to validate.
  31. As a developer, I want signature handling isolated behind a stable interface so that swapping HMAC → Ed25519 (and later, possibly server-side signing) doesn't ripple through the codebase.

Implementation Decisions

Stage and level changes

  • Stage 2 → Stage 3 gate: level requirement drops from Lvl.99 to Lvl.80. Lvl.99 in Stage 2 remains as a level_99_evolved milestone but is no longer a hard gate.
  • Final stage: level cap removed. Levels continue past Lvl.99 indefinitely.
  • Post-99 XP curve in Final is piecewise:
    • n ≤ 99: existing 80 + floor(n^1.8)
    • n > 99: 80 + floor(99^1.8) + (n - 99) * 50
  • The piecewise transition is intentionally not C¹-smooth; the +50/level rate is what calibrates the post-99 cadence.
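The piecewise curve above can be sketched as a pure function. This is a minimal illustration, not the code in evolution/xp.rs: the names `base_curve` and `final_stage_xp` are hypothetical, and the source does not say whether the formula gives a per-level requirement or a cumulative total, so the sketch only evaluates the formula as written.

```rust
/// Level at which the Final-stage curve switches from the power law to the linear tail.
const PIVOT_LEVEL: u32 = 99;
/// Linear increment per level past the pivot (the "+50/level" rate).
const POST_PIVOT_STEP: u64 = 50;

/// Existing curve: 80 + floor(n^1.8).
fn base_curve(level: u32) -> u64 {
    80 + f64::from(level).powf(1.8).floor() as u64
}

/// Piecewise Final-stage curve (hypothetical name):
/// n <= 99 uses the existing curve, n > 99 adds 50 per level on top of the Lvl.99 value.
fn final_stage_xp(level: u32) -> u64 {
    if level <= PIVOT_LEVEL {
        base_curve(level)
    } else {
        base_curve(PIVOT_LEVEL) + u64::from(level - PIVOT_LEVEL) * POST_PIVOT_STEP
    }
}
```

Under these assumptions, `final_stage_xp(100) - final_stage_xp(99)` is exactly 50, while the last pre-pivot step (98 to 99) is larger, which is the kink the bullet describes.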

Archetype scoring and naming

  • Scoring weight changes: effective_score = lifetime * 0.20 + recent_30d * 0.80 (was 0.35/0.65).
  • Dominance gate for Stage 1→2 evolution is removed. Hybrid users evolve normally.
  • LLM name generator receives top-2 archetypes with their scores. The prompt asks the LLM to mint a hybrid name when scores are close, a specialized name when one dominates clearly.
  • Name re-mint on stable archetype drift: when the dominant archetype changes and remains stable for ≥14 days, a new name is generated within the same stage. Implemented as a new evolution event variant with rename metadata, or a new archetype_renamed event type. Ledger preserves the full name history.
  • Fallback (no dominant archetype): an archetype is "dominant" only when its effective_score ≥ 1.3× runner-up. When no archetype clears the threshold, surfaces fall back to neutral.
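The 20/80 blend and the 1.3× dominance rule above could be expressed as two pure functions along these lines. This is a sketch under assumptions: `ArchetypeTotals`, `effective_score`, and `dominant_archetype` are illustrative names, and returning `None` stands in for the "neutral fallback" case.

```rust
/// Weights from the reworked formula (previously 0.35 / 0.65).
const LIFETIME_WEIGHT: f64 = 0.20;
const RECENT_WEIGHT: f64 = 0.80;
/// An archetype is dominant only at >= 1.3x the runner-up.
const DOMINANCE_RATIO: f64 = 1.3;

/// Per-archetype totals; the struct and field names are illustrative.
#[derive(Debug, Clone, Copy)]
struct ArchetypeTotals {
    name: &'static str,
    lifetime: f64,
    recent_30d: f64,
}

/// effective_score = lifetime * 0.20 + recent_30d * 0.80
fn effective_score(totals: &ArchetypeTotals) -> f64 {
    totals.lifetime * LIFETIME_WEIGHT + totals.recent_30d * RECENT_WEIGHT
}

/// Returns the dominant archetype, or `None` when surfaces should fall back to neutral.
fn dominant_archetype(all: &[ArchetypeTotals]) -> Option<&'static str> {
    let mut ranked: Vec<(&'static str, f64)> =
        all.iter().map(|t| (t.name, effective_score(t))).collect();
    // Highest effective score first.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));

    let (top_name, top_score) = *ranked.first()?;
    if top_score <= 0.0 {
        return None; // no activity at all, so nothing can dominate
    }
    let runner_up = ranked.get(1).map_or(0.0, |r| r.1);
    (top_score >= DOMINANCE_RATIO * runner_up).then_some(top_name)
}
```

For example, a user with equal lifetime totals in two archetypes but recent activity of 100 versus 80 scores 100 versus 84, which is below the 1.3× ratio, so the function returns `None` and surfaces stay neutral.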

Archetype classification expansion

  • New keyword sets in classification.rs for Communicator (mailx, mutt, signal-cli, telegram-cli, etc.), Creator (pandoc, mdbook, hugo, jekyll, obsidian, ffmpeg, etc.), Caretaker (homeassistant, philips-hue, nest, roomba, ifttt, oura, fitbit, grocery, meal-plan, etc.), Merchant (stripe, shopify, woocommerce, quickbooks, paypal, invoice, inventory, etsy, p&l, etc.).
  • All four archetypes added to the keyword_sets array iterated by classify_shell_command.
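As a rough illustration of the keyword sets above, the sketch below matches whole command tokens against the four new lists. It is not the contents of classification.rs: the `Archetype` enum is reduced to the four new variants, the `KEYWORD_SETS` constant and the token-splitting rule are assumptions, and the "etc." entries from the source are left out rather than guessed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Archetype {
    Communicator,
    Creator,
    Caretaker,
    Merchant,
}

/// Keywords copied from the list above; first matching set wins (an assumption).
const KEYWORD_SETS: &[(Archetype, &[&str])] = &[
    (Archetype::Communicator, &["mailx", "mutt", "signal-cli", "telegram-cli"]),
    (Archetype::Creator, &["pandoc", "mdbook", "hugo", "jekyll", "obsidian", "ffmpeg"]),
    (
        Archetype::Caretaker,
        &["homeassistant", "philips-hue", "nest", "roomba", "ifttt", "oura", "fitbit", "grocery", "meal-plan"],
    ),
    (
        Archetype::Merchant,
        &["stripe", "shopify", "woocommerce", "quickbooks", "paypal", "invoice", "inventory", "etsy", "p&l"],
    ),
];

/// Classifies a shell command by whole-token keyword match.
/// Splitting on whitespace and '/' keeps "nest" from matching "nested"
/// while still catching "/usr/bin/ffmpeg".
fn classify_shell_command(command: &str) -> Option<Archetype> {
    let lowered = command.to_lowercase();
    let tokens: Vec<&str> = lowered
        .split(|c: char| c.is_whitespace() || c == '/')
        .filter(|t| !t.is_empty())
        .collect();

    KEYWORD_SETS
        .iter()
        .find(|(_, words)| tokens.iter().any(|t| words.contains(t)))
        .map(|(archetype, _)| *archetype)
}
```

Whole-token matching is a design choice made here to avoid false positives on short keywords such as "nest"; the real classifier may use substring matching instead.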

Plan-mode XP rewards

  • New event source plan_emission awards full creation XP (+2 base, +1 aligned bonus).
  • One plan_emission event per session_id, enforced at write time via an existence check (SELECT EXISTS(SELECT 1 FROM evolution_events WHERE source = 'plan_emission' AND session_id = ?)).
  • Plan content is classified using the same keyword/tool/LLM path as classify_shell_command, fed the plan text as input.
  • Adds session_id column (TEXT, nullable, indexed) to evolution_events schema.
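The one-plan-per-session rule above can be sketched with an in-memory set standing in for the database existence check. The type names `PlanEmissionLedger` and `PlanEmissionOutcome` are hypothetical; the XP values (+2 base, +1 aligned bonus) come from the bullets above.

```rust
use std::collections::HashSet;

const CREATION_BASE_XP: u32 = 2;
const ALIGNED_BONUS_XP: u32 = 1;

#[derive(Debug, PartialEq, Eq)]
enum PlanEmissionOutcome {
    /// First plan of the session: XP is awarded.
    Awarded { xp: u32 },
    /// A plan_emission event already exists for this session_id.
    AlreadyAwarded,
}

/// In-memory stand-in for the evolution_events existence check (assumption).
#[derive(Default)]
struct PlanEmissionLedger {
    seen_sessions: HashSet<String>,
}

impl PlanEmissionLedger {
    /// Records a plan emission, rejecting repeats within the same session.
    fn record(&mut self, session_id: &str, aligned: bool) -> PlanEmissionOutcome {
        // `insert` returns false when the session_id was already present.
        if !self.seen_sessions.insert(session_id.to_owned()) {
            return PlanEmissionOutcome::AlreadyAwarded;
        }
        let bonus = if aligned { ALIGNED_BONUS_XP } else { 0 };
        PlanEmissionOutcome::Awarded { xp: CREATION_BASE_XP + bonus }
    }
}
```

Revising a plan several times in one session hits the `AlreadyAwarded` branch on every attempt after the first, so revisions cannot be used to farm XP.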

Anti-cheat: signatures over HMAC

  • HMAC chain replaced with Ed25519 signature chain.
  • Per-install keypair: private key stored in OS keychain (macOS via security framework, Linux via libsecret), public key in a new device_keys table in SQLite.
  • New evolution::signature module replaces hmac.rs. Stable interface: sign(prev_sig, payload) → Signature, verify(prev_sig, payload, sig) → bool, current_pubkey() → PublicKey.
  • New evolution::keychain module abstracts platform-specific key storage. Mockable via trait for tests.
  • evolution_events schema rename: hmac → signature, prev_hmac → prev_signature. Adds pubkey_id foreign key to device_keys.
  • Server-side enforcement (streaming submission, temporal sanity) is out of scope — deferred until a scoreboard product is concretely planned.
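The trait-mockable key storage described above might look roughly like the following. This is a sketch of the abstraction only: `KeyStore`, `InMemoryKeyStore`, and `load_or_create` are hypothetical names, the key is treated as opaque bytes, and real Ed25519 key generation and the macOS/libsecret backends are deliberately left out (the generator is passed in as a closure).

```rust
/// Platform-agnostic private-key storage (hypothetical trait).
/// Production backends would wrap the OS keychain; tests use the in-memory mock.
trait KeyStore {
    fn load(&self) -> Option<Vec<u8>>;
    fn store(&mut self, secret: &[u8]);
}

/// Test double that keeps the secret in memory.
#[derive(Default)]
struct InMemoryKeyStore {
    secret: Option<Vec<u8>>,
}

impl KeyStore for InMemoryKeyStore {
    fn load(&self) -> Option<Vec<u8>> {
        self.secret.clone()
    }

    fn store(&mut self, secret: &[u8]) {
        self.secret = Some(secret.to_vec());
    }
}

/// Returns the stored private key, generating and persisting one on first use.
/// `generate` stands in for real key generation.
fn load_or_create<S: KeyStore>(store: &mut S, generate: impl FnOnce() -> Vec<u8>) -> Vec<u8> {
    if let Some(existing) = store.load() {
        return existing;
    }
    let fresh = generate();
    store.store(&fresh);
    fresh
}
```

With this shape, loading from an existing entry always yields the same key material, and a reinstall (an empty store) simply generates a new one.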

Behavior surfaces (where archetype actually pulls)

  • Per-turn <evolution_context> injection is removed. EvolutionHook::BeforeAgentStart and BeforeLlmCall no longer inject archetype/level into the LLM prompt. The agent's in-task behavior is unaffected by archetype.
  • Heartbeat seed: dominant archetype (1.3×-clear) drives topic selection; sub-threshold falls back to "general productivity" check-in.
  • Memory consolidation: nightly/weekly tasks weight archetype-relevant content when dominant is 1.3×-clear; otherwise consolidate by recency only.
  • Skill loading priority: archetype-relevant skills get a +1.0 priority boost as a tiebreaker under token budget pressure, only when dominant is 1.3×-clear. Task-relevant skills always load regardless of archetype.
  • Status/ambient/share-card surfaces: continue to display archetype name and level — pure UX/identity, no behavioral change.
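To make the skill-loading rule above concrete, here is one possible reading of it. Everything beyond the +1.0 boost and the "task-relevant skills always load" rule is an assumption: the `Skill` fields, the priority scale, and the greedy fill of the remaining token budget are illustrative only.

```rust
/// Boost applied to archetype-relevant skills when dominance is 1.3x-clear.
const ARCHETYPE_BOOST: f64 = 1.0;

/// Illustrative skill descriptor; field names are hypothetical.
#[derive(Debug, Clone)]
struct Skill {
    name: &'static str,
    base_priority: f64,
    token_cost: u32,
    task_relevant: bool,
    archetype_relevant: bool,
}

fn effective_priority(skill: &Skill, dominance_clear: bool) -> f64 {
    let boost = if dominance_clear && skill.archetype_relevant {
        ARCHETYPE_BOOST
    } else {
        0.0
    };
    skill.base_priority + boost
}

/// Task-relevant skills load unconditionally; the rest compete for the
/// remaining budget in priority order, where the archetype boost can break ties.
fn select_skills(skills: &[Skill], token_budget: u32, dominance_clear: bool) -> Vec<&'static str> {
    let mut loaded = Vec::new();
    let mut remaining = token_budget;

    for skill in skills.iter().filter(|s| s.task_relevant) {
        loaded.push(skill.name);
        remaining = remaining.saturating_sub(skill.token_cost);
    }

    let mut optional: Vec<&Skill> = skills.iter().filter(|s| !s.task_relevant).collect();
    optional.sort_by(|a, b| {
        effective_priority(b, dominance_clear).total_cmp(&effective_priority(a, dominance_clear))
    });

    for skill in optional {
        if skill.token_cost <= remaining {
            loaded.push(skill.name);
            remaining -= skill.token_cost;
        }
    }
    loaded
}
```

The archetype flag never appears in the first loop, which is how the sketch keeps archetype from overriding the task: it only reorders the optional skills that compete for leftover budget.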

Correction-rate gate fix

  • Minimum-events floor of 50 vitals events in the 14-day window. Sub-floor users see GateState::Deferred rather than Failed.
  • Status surfaces show "correction rate gate: deferred (need N more events for signal)" rather than blocking with a false-positive failure.
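The gate behaviour above can be sketched as a small pure function. `GateState::Deferred` and `Failed` are named in the text; the `Passed` variant, the `events_needed` field, and the function name are assumptions made for illustration.

```rust
/// Minimum vitals events in the 14-day window before the rate is meaningful.
const MIN_EVENTS: u32 = 50;
/// The gate passes only when the correction rate is strictly below 20%.
const MAX_RATE_PERCENT: u64 = 20;

#[derive(Debug, PartialEq, Eq)]
enum GateState {
    Passed,
    Failed,
    /// Too few events to judge; carries the "need N more events" count.
    Deferred { events_needed: u32 },
}

/// Evaluates the correction-rate gate over the 14-day window.
fn correction_rate_gate(events: u32, corrections: u32) -> GateState {
    if events < MIN_EVENTS {
        return GateState::Deferred { events_needed: MIN_EVENTS - events };
    }
    // Integer form of corrections / events < 20%, avoiding float rounding.
    if u64::from(corrections) * 100 < u64::from(events) * MAX_RATE_PERCENT {
        GateState::Passed
    } else {
        GateState::Failed
    }
}
```

The returning-user case from the problem statement (4 events, 1 correction) now yields `Deferred { events_needed: 46 }` instead of a failure at 25%.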

Milestones

  • Post-99 milestones every 25 levels in Final stage: level_125_final, level_150_final, level_175_final, ... extending indefinitely.
  • Mood::Ascending rule extended: fires when (stage == Final && level > 99) in addition to the existing rule (Lvl.99 at non-final stage + bond ≥ 30).
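A minimal sketch of the two rules above, assuming a three-variant `Stage` enum (Base, Evolved, Final) and integer bond values; the function names `final_milestone` and `is_ascending` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Base,
    Evolved,
    Final,
}

/// Returns the milestone id for post-99 Final levels: 125, 150, 175, ...
fn final_milestone(stage: Stage, level: u32) -> Option<String> {
    if stage == Stage::Final && level >= 125 && level % 25 == 0 {
        Some(format!("level_{level}_final"))
    } else {
        None
    }
}

/// Extended Mood::Ascending rule: the existing condition OR Final past Lvl.99.
fn is_ascending(stage: Stage, level: u32, bond: u32) -> bool {
    let existing_rule = stage != Stage::Final && level == 99 && bond >= 30;
    let final_past_99 = stage == Stage::Final && level > 99;
    existing_rule || final_past_99
}
```

Because the milestone id is derived from the level, the list extends indefinitely without a table of hard-coded names.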

Aligned bonus magnitude

  • Aligned XP bonus stays at +1 XP (unchanged). If runaway specialization is observed post-launch, the dominance threshold is raised from 1.3× to 1.5× rather than reducing the bonus.

Module structure

Modified modules: evolution/xp.rs, evolution/mod.rs, evolution/classification.rs, evolution/milestones.rs, evolution/replay.rs, evolution/helpers.rs (mood rule), heartbeat scheduler, memory consolidation tasks, skill loader, db.rs migration, EvolutionHook.

New deep modules:

  • evolution::signature — Ed25519 sign/verify with stable interface
  • evolution::keychain — platform-abstracted private-key storage (trait-mockable)
  • evolution::scorer — pure-function archetype scoring (20/80 weighting, dominance threshold, fallback)
  • evolution::plan_emission — per-session cap + plan-content classification

Renamed/replaced: evolution/hmac.rs → evolution/signature.rs.

Schema changes

  • New migration adds:
    • Rename evolution_events.hmac → signature, prev_hmac → prev_signature
    • New column: evolution_events.session_id TEXT NULL, indexed
    • New column: evolution_events.pubkey_id INTEGER NULL referencing device_keys.id
    • New table: device_keys (id INTEGER PRIMARY KEY, public_key BLOB NOT NULL, created_at INTEGER NOT NULL)

Documentation

  • docs/evolution.md rewritten to:
    • Reflect the Lvl.80 Stage 2→3 gate
    • Describe the piecewise post-99 curve
    • Document name re-mint on archetype drift
    • List all 11 archetypes with full keyword sets
    • Replace the "defaults to infrastructure-oriented solutions" claim with an honest description: archetype influences proactive surfaces (heartbeat, memory, skill priority) but does not tilt in-task behavior
    • Document the Deferred gate state and the 50-event floor
    • Update the scoring formula to 20/80
    • Document Ed25519 signatures replacing HMAC

Testing Decisions

What makes a good test (per project CLAUDE.md)

  • Tests must exercise real code paths and assert on observable outcomes
  • No tautological tests, no smoke-only is_ok() tests, no over-mocked assertions on canned data, no near-duplicate per-variant tests
  • Bug fixes need regression tests that fail before the fix and pass after
  • New features need happy-path + at least one edge case

Modules tested in this PRD scope

  • evolution::scorer — table-driven tests over weighted blends (20/80), dominance ratio computation, fallback when ratio < 1.3×, archetype-shift detection over multi-event sequences. Tests both lifetime-heavy and recent-heavy realistic distributions.
  • evolution::xp_curve — boundary tests at the Lvl.99 → 100 piecewise transition for Final; verify Base/Evolved still cap at 99; verify post-99 levels cost +50 XP each.
  • evolution::plan_emission — one-plan-per-session enforcement: second plan_emission event for the same session_id must be rejected. Plan-content classification: a plan text containing "deploy kubernetes" classifies to Ops, "campaign funnel" to Marketer, etc.
  • evolution::signature — sign/verify roundtrip; chain-break detection (modified payload fails verification); replay rejects events whose signature doesn't match. Key generation determinism (regenerating from same keychain entry returns same keypair).
  • Gate logic in evolution/mod.rs: Deferred state for sub-50-event correction-rate gate; all four Stage 1→2 gates pass simultaneously without dominance gate; Stage 2→3 gate uses Lvl.80 not Lvl.99; correction rate gate fails at 25%, passes at 18%, defers at 4-event window.

Modules not unit-tested in this scope

  • Heartbeat seeding integration, memory consolidation integration, skill-loading priority — these are wiring changes best covered by an end-to-end smoke test once the wires are physically connected, not by unit tests.

Prior art

  • crates/core/src/evolution/xp.rs — already has minimal table-driven tests; the post-99 extension follows the same pattern
  • crates/core/src/vitals.rs — event-sourced replay with pure-function scoring is the same pattern as the new scorer module
  • crates/apply-patch/ — deep-module pattern with extensive unit tests over a stable interface; signature module mirrors this shape

Out of Scope

  • Server-side scoreboard infrastructure. No event submission endpoint, no streaming, no server-side rate enforcement, no OAuth identity binding. The signature scheme prepares the schema for these features but does not deliver them.
  • Server-hosted agent loop. Borg remains local-first. The "scoreboard mode runs server-side" architecture is explicitly rejected for this product.
  • Per-archetype LLM prompt fragments. The "behavioral specialization via prompt content" approach is rejected in favor of unprompted-surface specialization. No per-archetype 100–200 token instruction blocks.
  • Adaptive scoring weights that vary based on event density. The 20/80 weight is a static constant; adaptive weighting is a future possibility but not in this scope.
  • Path-based apply_patch archetype routing (e.g., .md → Creator, .rs → Builder). Considered and rejected; the new keyword sets and plan-emission classification cover the documentation-heavy user case adequately.
  • Bayesian smoothed correction rate. The 50-event floor with Deferred state is the chosen approach; Bayesian smoothing was rejected as user-illegible.
  • Vitals/bond threshold recalibration. The existing values (≥30 / ≥20 / ≥55) remain unchanged in this PRD; recalibration awaits real-user telemetry.
  • Rate-limit recalibration. The existing per-source / per-event-type rate limits are unchanged; observation drives any future tuning.
  • Stage 1 celebration timing UX. Already deferred via pending_celebrations outbox; revisit only if post-launch shows it firing too eagerly mid-onboarding.

Further Notes

  • Migration ordering: the schema migration must run before the first signature event would be written. Existing HMAC-signed events should be either re-signed during a migration replay step or grandfathered as "legacy" with a flag indicating they predate the signature scheme. Design choice deferred to implementer; both options are valid.
  • Privacy posture for the future scoreboard: even when (if) the scoreboard ships, event submission must be explicit opt-in and submit only metadata (event types, timestamps, archetype labels, XP deltas) — never tool inputs/outputs, memory contents, or session text. This invariant should be documented when the scoreboard is designed.
  • "Aligned bonus may cause runaway specialization" risk: monitored, not pre-fixed. If observed, the fix is to raise the dominance threshold from 1.3× to 1.5× across heartbeat/memory/skill-priority surfaces — a single-number tweak — not to restructure the bonus.
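  • The piecewise XP curve has a deliberate kink at Lvl.99→100: the curve stays continuous, but its slope drops from about +71 XP/level (the derivative of n^1.8 at n = 99 is 1.8 × 99^0.8 ≈ 71.1) to +50 XP/level, so levels cheapen slightly. If users complain about the visible kink, the alternative is a C¹-continuous slope of about +71 XP/level post-99 (slower cadence, no kink). Defer this decision to post-launch observation.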
  • The Pokémon framing of the design is preserved — three permanent stages, evolution celebrations, named identities — but with the honest amendments that (a) Final stage allows continued growth past 99, and (b) names update with stable behavior change rather than freezing at first evolution.
