Actor model evolution patterns — lazy state migration vs projection-driven splits #500

## Background

Discussion seeded by #498 (Adopt AgentKind + Kind Registry for runtime actor
identity). #498 covers the **identity layer** — kind ≠ CLR type. This issue
covers the **state / event layer**: when a refactor changes business model
shape (split, merge, re-key, schema upgrade), what is the prescribed pattern,
and what infrastructure does each pattern need?

The trigger was a proposal to ship a generic
`IActorMigration { Migrate(old) → new }` interface that runs lazily on
`OnActivateAsync`. That pattern is **partially correct** — it is the right
answer for narrow within-actor upgrades and the wrong answer for cross-actor
splits / merges / re-keying. We need the matrix written down before we start
adding interfaces.

## Decision matrix

| Evolution type | Example | Correct mechanism | New abstraction needed? |
|---|---|---|---|
| Cross-assembly CLR rename | `ChannelRuntime.X` → `Scheduled.X` | `[LegacyAgentKind]` | No (covered by #498) |
| Proto field add / remove | `reserved 8,10,11,12` on a state proto | proto3 + `[LegacyProtoFullName]` | No |
| Within-actor logic upgrade / state recompute | derived field algorithm changes | lazy `IActorStateMigration` | **Yes — narrow, deferred until first real case** |
| Within-actor field hygiene | trim, case-fold, ID normalization | same | **same** |
| Actor split (one → many) | `SkillRunner` → `SkillDefinition` + `SkillExecution` | projection-pipeline + bootstrap-from-projection + retire | **Yes — bootstrap port (separate issue)** |
| Actor merge (many → one) | re-aggregation | same | same |
| Actor identity re-keying | `skill-runner-{user}-{name}` → `skill-definition-{team}-{name}` | `IActorRedirectSpec` | **Yes — separate issue** |
| **Event semantic change** | `UserUpdated` semantics shift over time | new TypeUrl, never mutate existing event | No — doctrine only |
| Hard schema break (no compat path) | rare — disaster scenarios | offline tooling + dual-write + cutover | No (out of band) |

## Where lazy on-activation migration is right (narrow, real)

`IActorStateMigration` is the right tool when the migration is **fully within
one actor's boundary**:

- The actor's persisted state proto needs a field recomputed (e.g., a derived
  field's algorithm changed).
- One-time hygiene: ID format normalization, case folding, trimming.
- Internal counter rebuild from existing state.

These work because:

- Cost is paid on first activation per actor — cold actors stay un-migrated.
- Orleans single-activation gives serial execution per actor.
- No cross-actor coordination is needed.
- Idempotency comes from a `state_schema_version` field carried in the runtime
  envelope (`RuntimeActorIdentity.state_schema_version` from #498): a state
  already at the target version is never migrated again.

**Doctrinal test for "is this a lazy-migration case":** the new state can be
derived from the old state alone, in pure code, without re-reading any
events. If you need events, it's a projection rebuild, not a lazy migration.

### Proposed contract (sketch — not landing now)

```csharp
public interface IActorStateMigration<TState>
{
    int FromStateVersion { get; }
    int ToStateVersion { get; }
    TState Apply(TState state);
}
```
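
For illustration only, a field-hygiene migration under this contract might look like the sketch below. `SkillState` and `NormalizeSkillIdMigration` are hypothetical names, and a plain record stands in for a generated proto class. The point is the doctrinal test: the new state is computed from the old state alone.

```csharp
using System;

// Contract repeated from the sketch above so the example compiles on its own.
public interface IActorStateMigration<TState>
{
    int FromStateVersion { get; }
    int ToStateVersion { get; }
    TState Apply(TState state);
}

// Hypothetical state type; a real business state would be a generated proto class.
public sealed record SkillState(string SkillId, string DisplayName);

// Field hygiene, v1 -> v2: trim and case-fold the id, trim the display name.
// The result depends only on the input state, so Apply is pure and idempotent.
public sealed class NormalizeSkillIdMigration : IActorStateMigration<SkillState>
{
    public int FromStateVersion => 1;
    public int ToStateVersion => 2;

    public SkillState Apply(SkillState state) =>
        state with
        {
            SkillId = state.SkillId.Trim().ToLowerInvariant(),
            DisplayName = state.DisplayName.Trim(),
        };
}
```

Applying the migration a second time to its own output returns an equal record, which is the idempotence property the constraints below ask for.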

Run from `RuntimeActorGrain.OnActivateAsync` after state load:

1. Read `Identity.state_schema_version` from the runtime envelope.
2. While there is a registered migration with `FromStateVersion == current`,
   apply it; advance.
3. Persist the new state with the new `state_schema_version` before
   processing any command.
4. If migration throws, fail activation explicitly — do not swallow.
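
The steps above can be sketched as a small, framework-free chain runner. This is a sketch under assumptions, not the planned implementation: `MigrationChain`, `DelegateMigration`, and the `targetVersion` parameter are hypothetical, and the sketch reads "skipping is forbidden" as "each migration advances exactly one version". Persisting the result (step 3) is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IActorStateMigration<TState>
{
    int FromStateVersion { get; }
    int ToStateVersion { get; }
    TState Apply(TState state);
}

// Hypothetical helper so a usage example needs no dedicated migration classes.
public sealed class DelegateMigration<TState> : IActorStateMigration<TState>
{
    private readonly Func<TState, TState> _apply;

    public DelegateMigration(int fromVersion, Func<TState, TState> apply)
    {
        FromStateVersion = fromVersion;
        _apply = apply;
    }

    public int FromStateVersion { get; }
    public int ToStateVersion => FromStateVersion + 1;
    public TState Apply(TState state) => _apply(state);
}

public static class MigrationChain
{
    // Walks the chain from currentVersion up to targetVersion and returns the
    // migrated state together with the version the caller should persist.
    public static (TState State, int Version) Run<TState>(
        TState state, int currentVersion, int targetVersion,
        IEnumerable<IActorStateMigration<TState>> migrations)
    {
        // ToDictionary throws on duplicate keys, so two migrations registered
        // for the same source version fail fast.
        var byFrom = migrations.ToDictionary(m => m.FromStateVersion);

        while (currentVersion < targetVersion)
        {
            // A stale version with no registered migration is a hard failure.
            if (!byFrom.TryGetValue(currentVersion, out var migration))
                throw new InvalidOperationException(
                    $"No migration registered from state version {currentVersion}.");
            if (migration.ToStateVersion != currentVersion + 1)
                throw new InvalidOperationException(
                    $"Migration from {currentVersion} must target {currentVersion + 1}.");

            state = migration.Apply(state); // exceptions propagate to the caller
            currentVersion = migration.ToStateVersion;
        }
        return (state, currentVersion);
    }
}
```

Because nothing is caught inside `Run`, a throwing migration or a gap in the chain surfaces to the activation path, which can then fail activation explicitly.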

**Constraints (locked at the contract level, enforced by CI guard):**

- **Pure function** of input state. No I/O, no other-actor calls, no
  random / time-dependent inputs.
- **Idempotent** — applying twice must yield the same result.
- **Total** — must not throw on any well-formed historical state.
- Migrations form a chain (`v1→v2`, `v2→v3`); skipping is forbidden.
- **Zero-dependency constructor**: implementations may not depend on
  `IServiceProvider`, any `IClient*`, any `*Async*` service, `ITimeService`,
  `IRandom`, or anything that performs I/O. CI guard scans constructor
  parameters of `IActorStateMigration` implementations and fails the build
  on violations. This is the structural defense against drift toward a
  "general-purpose data transformation framework".

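One possible shape for the constructor guard is a reflection check run from a test project. The sketch below takes the strictest reading of "zero-dependency constructor" and flags every constructor parameter, rather than matching the specific forbidden types listed above. `MigrationConstructorGuard`, `CleanMigration`, and `LeakyMigration` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public interface IActorStateMigration<TState>
{
    int FromStateVersion { get; }
    int ToStateVersion { get; }
    TState Apply(TState state);
}

public static class MigrationConstructorGuard
{
    // Returns one message per offending constructor parameter; empty means pass.
    public static IReadOnlyList<string> FindViolations(IEnumerable<Type> candidates)
    {
        const BindingFlags AllInstanceCtors =
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance;

        var violations = new List<string>();
        foreach (var type in candidates.Where(IsMigration))
        foreach (var ctor in type.GetConstructors(AllInstanceCtors))
        foreach (var parameter in ctor.GetParameters())
        {
            violations.Add(
                $"{type.FullName}: constructor parameter '{parameter.Name}' " +
                $"of type {parameter.ParameterType.Name} is not allowed");
        }
        return violations;
    }

    // True for concrete classes implementing any closed IActorStateMigration<T>.
    private static bool IsMigration(Type type) =>
        type.IsClass && !type.IsAbstract &&
        type.GetInterfaces().Any(i =>
            i.IsGenericType &&
            i.GetGenericTypeDefinition() == typeof(IActorStateMigration<>));
}

// Passes the guard: only the implicit parameterless constructor.
public sealed class CleanMigration : IActorStateMigration<string>
{
    public int FromStateVersion => 1;
    public int ToStateVersion => 2;
    public string Apply(string state) => state.Trim();
}

// Fails the guard: the constructor takes a dependency.
public sealed class LeakyMigration : IActorStateMigration<string>
{
    public LeakyMigration(IServiceProvider services) { }
    public int FromStateVersion => 2;
    public int ToStateVersion => 3;
    public string Apply(string state) => state;
}
```

A CI test could pass in the types of the assemblies that contain migrations and fail the build when the returned list is non-empty.
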
### `state_schema_version` placement (resolved)

Resolved via companion ADR `actor-state-version-placement` (co-issued with
#498):

- Lives on the **runtime envelope** (`RuntimeActorIdentity.state_schema_version`
  per #498), **not** on business state protos.
- Business state protos remain pure domain artifacts. Migration concern does
  not leak into them.
- Migration registration keys on `(state_proto_descriptor, from, to)`;
  runtime reads version from the envelope.

### YAGNI: the interface is deferred

Lazy migration applies to exactly two rows of the matrix, and no concrete
case drives either today. Per CLAUDE.md ("Don't design for hypothetical
future requirements" / "an abstraction that can be abused is an unfinished design"):

- **This issue ships doctrine + matrix + ADRs — not the interface.**
- `IActorStateMigration<TState>` is sketched here for future reference.
- The first real within-actor migration case implements the interface
  alongside its concrete migration. Until then, no empty foundation.

This avoids the slippery slope toward a "general-purpose data transformation
framework" (the explicit non-goal below).

## Where lazy migration is wrong — use the projection pipeline

The lazy on-activation interface **cannot safely** support:

- **Actor split (one → many)**: actor A would have to spawn / initialize A' / A''
  during its own activation. A would be mutating another actor's authoritative
  state, violating the "single source of truth" rule.
- **Actor merge (many → one)**: needs reads across multiple actor streams
  during a single activation — outside any one actor's boundary.
- **Identity re-keying**: requires global awareness that the same business
  fact moved key — not solvable from inside one activation.
- **Mixed-version safety**: migration that mutates state on activation breaks
  pods running older code that still expect the pre-migration shape.

For all of these, the architecturally correct path is **projection-pipeline
driven migration**, using infrastructure that already exists *plus one new
capability* (bootstrap-from-projection — see below):

1. A's committed events are already in the projection main pipeline
   (per `docs/canon/event-sourcing.md`: "committed domain events must be observable").
2. Stand up new projection consumers for A' / A'' that consume A's committed
   events and materialize A' / A'' state into a dedicated readmodel.
3. New actor A' / A'' bootstraps its initial state from that readmodel via
   `IActorBootstrapPort` (one-time import on first activation), then becomes
   authoritative.
4. Write commands progressively cut over from A to A' / A''. A keeps running
   as source-of-truth during the transition window.
5. A retires using #495's mechanism (keyed on `AgentKind` from #498) once
   read paths are migrated.

This is the "Strangler Fig" pattern at the actor level. It is gradual,
distributed-safe, replayable, and reversible.
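
Steps 2 and 3 can be illustrated with an in-memory sketch. Everything here is hypothetical: the event and readmodel records, the projection class, and in particular the shape of `IActorBootstrapPort`, whose real contract is deferred to a separate issue. The sketch only shows the intended flow: a projection folds A's committed events into a readmodel, and the new actor imports from it once and is authoritative afterwards.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical committed events of the old actor A.
public abstract record SkillRunnerEvent(string ActorId);
public sealed record SkillConfigured(string ActorId, string Prompt) : SkillRunnerEvent(ActorId);
public sealed record SkillExecuted(string ActorId, string RunId) : SkillRunnerEvent(ActorId);

// Hypothetical readmodel row materialized for the new actor A'.
public sealed record SkillDefinitionReadModel(string Prompt);

// Hypothetical port shape; the real bootstrap contract is not defined yet.
public interface IActorBootstrapPort
{
    SkillDefinitionReadModel? TryImport(string sourceActorId);
}

// Step 2: a projection consumer that folds A's committed events into the readmodel.
public sealed class SkillDefinitionProjection : IActorBootstrapPort
{
    private readonly Dictionary<string, SkillDefinitionReadModel> _rows = new();

    public void Handle(SkillRunnerEvent committed)
    {
        // Only definition-related events feed this readmodel; executions are ignored.
        if (committed is SkillConfigured configured)
            _rows[configured.ActorId] = new SkillDefinitionReadModel(configured.Prompt);
    }

    public SkillDefinitionReadModel? TryImport(string sourceActorId) =>
        _rows.TryGetValue(sourceActorId, out var row) ? row : null;
}

// Step 3: the new actor imports once on first activation, then owns its state.
public sealed class SkillDefinitionActor
{
    private readonly IActorBootstrapPort _bootstrap;

    public SkillDefinitionActor(IActorBootstrapPort bootstrap) => _bootstrap = bootstrap;

    public string? Prompt { get; private set; }
    public bool Bootstrapped { get; private set; }

    public void Activate(string sourceActorId)
    {
        if (Bootstrapped) return; // already authoritative; never re-import
        Prompt = _bootstrap.TryImport(sourceActorId)?.Prompt;
        Bootstrapped = true;
    }
}
```

Later changes to the readmodel do not flow back into the actor after the first activation, which matches "one-time import, then becomes authoritative".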

### Missing infrastructure: bootstrap-from-projection

The split / merge cookbook glosses over "stand up new projection consumers for A'
and A''": `RuntimeActorGrain` today only initializes from its own
persisted state slot (`AgentStateSnapshot`). There is no contract for
**a new actor to bootstrap from projected state derived from another actor's
events**. Without this, the strangler-fig pattern at the actor level cannot
work end-to-end.

**This bootstrap contract is filed as a separate prerequisite issue.** The
split / merge / re-key cookbook in this issue remains documentation-only
until that issue lands. Operational deliverables (worked example, CI gates
that depend on the cookbook) wait on it.

### Re-keying: separate spec, not extension of retired-actor spec

`IRetiredActorSpec` retires a kind ("this kind no longer exists"). Re-keying
preserves the kind but moves the actor id. Different semantics; conflating
them pollutes the #495 contract.

Re-keying gets its own:

```csharp
public interface IActorRedirectSpec
{
    string SpecId { get; }
    IAsyncEnumerable<RedirectTarget> DiscoverAsync(IServiceProvider services, CancellationToken ct);
}

public sealed record RedirectTarget(
    string FromKind, string FromActorId,
    string ToKind, string ToActorId);
```

Same hosted-service entrypoint as `IRetiredActorSpec`, executed once at
startup, idempotent. Filed as a separate issue when the first concrete
re-keying case arrives.
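
As a rough illustration, a spec for a user-scoped to team-scoped id move could look like the sketch below. `SkillDefinitionRekeySpec`, `RedirectTable`, the kind string, and the id formats are hypothetical. The list handed to the constructor stands in for whatever discovery a real spec would perform through `services`.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public sealed record RedirectTarget(
    string FromKind, string FromActorId,
    string ToKind, string ToActorId);

public interface IActorRedirectSpec
{
    string SpecId { get; }
    IAsyncEnumerable<RedirectTarget> DiscoverAsync(IServiceProvider services, CancellationToken ct);
}

// Hypothetical spec: same kind, actor id moves from user scope to team scope.
public sealed class SkillDefinitionRekeySpec : IActorRedirectSpec
{
    private const string Kind = "skill-definition";
    private readonly IReadOnlyList<(string User, string Team, string Name)> _known;

    public SkillDefinitionRekeySpec(IReadOnlyList<(string User, string Team, string Name)> known) =>
        _known = known;

    public string SpecId => "skill-definition-user-to-team";

    public async IAsyncEnumerable<RedirectTarget> DiscoverAsync(
        IServiceProvider services, [EnumeratorCancellation] CancellationToken ct)
    {
        foreach (var (user, team, name) in _known)
        {
            ct.ThrowIfCancellationRequested();
            yield return new RedirectTarget(
                Kind, $"{Kind}-{user}-{name}",
                Kind, $"{Kind}-{team}-{name}");
            await Task.Yield(); // placeholder for real asynchronous discovery
        }
    }
}

public static class RedirectTable
{
    // Collects redirects keyed by old identity. Re-running a spec overwrites
    // entries with identical values, so applying it twice is harmless.
    public static async Task<Dictionary<(string Kind, string ActorId), RedirectTarget>> BuildAsync(
        IActorRedirectSpec spec, IServiceProvider services, CancellationToken ct)
    {
        var table = new Dictionary<(string Kind, string ActorId), RedirectTarget>();
        await foreach (var target in spec.DiscoverAsync(services, ct))
            table[(target.FromKind, target.FromActorId)] = target;
        return table;
    }
}
```

Keying the table on the old `(kind, actor id)` pair is one way to get the idempotence the hosted-service entrypoint needs.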

## Doctrine: events are append-only, semantics are immutable

The matrix's "event semantic change" row is doctrine, not infrastructure.
Concrete rules (recorded in ADR `event-immutability-policy`):

- A committed event's TypeUrl pins its semantics forever.
- New semantics → new event type with a new TypeUrl. Old type stays for
  history; projectors handle both during the transition window.
- Adding optional fields to an event proto is permitted (proto3 evolution
  rules). Adding fields whose absence implies a different semantic is
  forbidden — that is a semantic change, not a shape change.
- Backfilling history by replaying old events under new semantic assumptions
  is forbidden in normal operation. If a projection has to be rebuilt, it
  rebuilds *under the original semantics of each event*.

This row exists in the matrix because the most common silent failure mode
in event-sourced systems is "we tweaked what `UserUpdated` means" — the
matrix should refuse that path explicitly.
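
A projector that honors these rules dispatches on TypeUrl and gives each event type its own fixed meaning. The sketch below is illustrative only: the TypeUrl strings, `CommittedEvent`, `UserView`, and `UserProjector` are hypothetical, and string payloads stand in for real proto messages.

```csharp
using System;

// Hypothetical envelope: a committed event identified by its TypeUrl.
public sealed record CommittedEvent(string TypeUrl, string Payload);

public sealed record UserView(string DisplayName, string Email);

public static class UserProjector
{
    // Original event type. Its meaning is pinned: the payload is the display name.
    public const string UserUpdatedV1 = "type.googleapis.com/example.UserUpdated";

    // New semantics get a new TypeUrl instead of reinterpreting UserUpdated.
    public const string UserEmailChanged = "type.googleapis.com/example.UserEmailChanged";

    // Each event is applied under the semantics of its own type, so a rebuild
    // over old history produces the same view as the original projection run.
    public static UserView Apply(UserView view, CommittedEvent committed) =>
        committed.TypeUrl switch
        {
            UserUpdatedV1 => view with { DisplayName = committed.Payload },
            UserEmailChanged => view with { Email = committed.Payload },
            _ => view, // event types this projector does not consume
        };
}
```

During the transition window both branches stay in place, and the old branch is never edited to mean something new.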

## Open design questions (resolved or deferred)

- ~~**Where does `state_schema_version` live?**~~: Resolved: runtime envelope
  (`RuntimeActorIdentity.state_schema_version`), not business state proto.
  See companion ADR.
- ~~**Failure mode when no migration is registered for a stale
  `state_schema_version`**~~: Resolved: fail activation hard. Silent data
  drift is worse than visible startup error.
- ~~**Re-keying mechanism**~~: Resolved: separate `IActorRedirectSpec`,
  separate issue. Not a flavor of retire and not a flavor of state migration.
- **Migration registration shape**: Deferred until first concrete case.
  Default position: attribute-based
  (`[StateMigration(typeof(SkillDefinitionState), from: 1, to: 2)]`) for
  discoverability; DI registration acceptable for tests.
- **Projection-driven split protocol**: Cookbook documented in
  `docs/canon/projection-driven-actor-split.md`; the operational deliverable
  is blocked on the bootstrap-from-projection issue.

## Deliverables

- [ ] `docs/canon/actor-model-evolution.md` — the matrix above plus a
      one-paragraph example for each cell. Ship first.
- [ ] ADR `docs/adr/NNNN-actor-evolution-pattern-decision-matrix.md` — the
      decision matrix locked, supersedes any ad-hoc framing.
- [ ] ADR `docs/adr/NNNN-actor-state-version-placement.md` — co-issued with
      #498; locks placement on runtime envelope.
- [ ] ADR `docs/adr/NNNN-event-immutability-policy.md` — events append-only,
      semantics immutable; new TypeUrl on semantic change.
- [ ] CI / review skill check: every refactor PR that deletes / renames /
      moves an actor type or `*State` proto must declare in the description
      which row of the matrix it falls under.
- [ ] `docs/canon/projection-driven-actor-split.md` — split / merge cookbook
      with one worked example end-to-end (write commands cut-over phases,
      retire timing, projection consumer wiring, bootstrap port import).
      **Blocks on bootstrap-from-projection issue landing.**
- [ ] `IActorStateMigration<TState>` interface — **deferred** to first real
      case. Sketch retained in this issue for reference; CI guard
      (zero-dependency constructor) lands together with the interface, not
      before.
- [ ] Worked split-cookbook exemplar (e.g., a hypothetical `Foo` →
      `FooConfig` + `FooRun`). Blocks on bootstrap port.

## Relationship to other issues

- **#498** (AgentKind identity): prerequisite. `state_schema_version` lives
  in the `RuntimeActorIdentity` sub-message landed in #498 Phase 1.
  Cross-assembly rename and identity-only refactors collapse to kind-alias
  and never touch this issue.
- **#495** (retired-actor cleanup): the `Retire` half of any split / merge.
  Re-keying is **not** covered by #495; it gets a separate `IActorRedirectSpec`.
- **#497** (`SkillRunner` split): currently blocks on #498. Its split is
  **not** a state-migration case — `SkillExecutionGAgent` is brand-new and
  session-scoped; no historical execution data needs migration. If we later
  want to back-fill historical execution actors from `SkillRunner` event
  history, that becomes the worked example for the split cookbook (and
  exercises bootstrap-from-projection).
- **(new) Bootstrap-from-projection contract**: prerequisite for the split /
  merge / re-key rows of this matrix to be operational. To be filed
  separately.
- **(new) `IActorRedirectSpec`**: prerequisite for the re-keying row. To be
  filed separately when first concrete case appears.

## Non-goals

- A general-purpose "data transformation framework". The lazy migration
  interface is intentionally narrow — purity is enforced by CI guard on
  constructor dependencies.
- Replacing proto3 / `[LegacyProtoFullName]` for payload codec compatibility.
  Those layers stay untouched.
- Online schema migration tooling for state stores other than Aevatar's own
  event store + actor state.
- Mutating event semantics in place. Always new TypeUrl; old type retained
  for history.

