Actor model evolution patterns — lazy state migration vs projection-driven splits #500

@eanzhao

Background

Discussion seeded by #498 (Adopt AgentKind + Kind Registry for runtime actor
identity). #498 covers the identity layer — kind ≠ CLR type. This issue
covers the state / event layer: when a refactor changes business model
shape (split, merge, re-key, schema upgrade), what is the prescribed pattern,
and what infrastructure does each pattern need?

The trigger was a proposal to ship a generic
IActorMigration { Migrate(old) → new } interface that runs lazily on
OnActivateAsync. That pattern is partially correct — it is the right
answer for narrow within-actor upgrades and the wrong answer for cross-actor
splits / merges / re-keying. We need the matrix written down before we start
adding interfaces.

Decision matrix

| Evolution type | Example | Correct mechanism | New abstraction needed? |
| --- | --- | --- | --- |
| Cross-assembly CLR rename | ChannelRuntime.X → Scheduled.X | [LegacyAgentKind] | No (covered by #498) |
| Proto field add / remove | reserved 8,10,11,12 on a state proto | proto3 + [LegacyProtoFullName] | No |
| Within-actor logic upgrade / state recompute | derived field algorithm changes | lazy IActorStateMigration | Yes (narrow, deferred until first real case) |
| Within-actor field hygiene | trim, case-fold, ID normalization | same | same |
| Actor split (one → many) | SkillRunner → SkillDefinition + SkillExecution | projection-pipeline + bootstrap-from-projection + retire | Yes: bootstrap port (separate issue) |
| Actor merge (many → one) | re-aggregation | same | same |
| Actor identity re-keying | skill-runner-{user}-{name} → skill-definition-{team}-{name} | IActorRedirectSpec | Yes (separate issue) |
| Event semantic change | UserUpdated semantics shift over time | new TypeUrl, never mutate existing event | No (doctrine only) |
| Hard schema break (no compat path) | rare (disaster scenarios) | offline tooling + dual-write + cutover | No (out of band) |

Where lazy on-activation migration is right (narrow, real)

IActorStateMigration is the right tool when the migration is fully within
one actor's boundary:

  • The actor's persisted state proto needs a field recomputed (e.g., a derived
    field's algorithm changed).
  • One-time hygiene: ID format normalization, case folding, trimming.
  • Internal counter rebuild from existing state.

These work because:

  • Cost is paid on first activation per actor — cold actors stay un-migrated.
  • Orleans single-activation gives serial execution per actor.
  • No cross-actor coordination is needed.
  • Idempotent via a state_schema_version field carried in the runtime
    envelope (RuntimeActorIdentity.state_schema_version from #498).

Doctrinal test for "is this a lazy-migration case": the new state can be
derived from the old state alone, in pure code, without re-reading any
events. If you need events, it's a projection rebuild, not a lazy migration.

Proposed contract (sketch — not landing now)

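public interface IActorStateMigration<TState>
{
    // Schema version this migration accepts as input.
    int FromStateVersion { get; }
    // Schema version of the state returned by Apply.
    int ToStateVersion { get; }
    // Pure, total transformation of the loaded state: no I/O, no clock, no randomness.
    TState Apply(TState state);
}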
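
As a minimal sketch of what an implementation of this contract might look like for the field-hygiene row, assume a hypothetical SkillDefinitionState record with SkillId and DisplayName fields. Neither the record nor the migration class is defined in this issue, and a real state would be a generated proto message. The interface is repeated so the example compiles on its own.

```csharp
using System;

// Sketch of the proposed contract, repeated so the example is self-contained.
public interface IActorStateMigration<TState>
{
    int FromStateVersion { get; }
    int ToStateVersion { get; }
    TState Apply(TState state);
}

// Hypothetical state shape (assumption); a real state would be a proto message.
public sealed record SkillDefinitionState(string SkillId, string DisplayName);

// One-time hygiene: trim and lower-case the id, trim the display name.
// The new state depends only on the old state, so it passes the doctrinal test.
public sealed class NormalizeSkillIdMigration : IActorStateMigration<SkillDefinitionState>
{
    public int FromStateVersion => 1;
    public int ToStateVersion => 2;

    public SkillDefinitionState Apply(SkillDefinitionState state) =>
        state with
        {
            SkillId = state.SkillId.Trim().ToLowerInvariant(),
            DisplayName = state.DisplayName.Trim(),
        };
}
```

Applying the migration to its own output returns an equal record, which is the idempotency property the constraints ask for.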

Run from RuntimeActorGrain.OnActivateAsync after state load:

  1. Read Identity.state_schema_version from the runtime envelope.
  2. While there is a registered migration with FromStateVersion == current,
    apply it; advance.
  3. Persist the new state with the new state_schema_version before
    processing any command.
  4. If migration throws, fail activation explicitly — do not swallow.
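
The activation steps above could be sketched roughly as follows. This is not the RuntimeActorGrain implementation: MigrationRunner, DelegateMigration, and the targetVersion parameter are hypothetical names introduced here, and persisting the result (step 3) is left to the caller. The runner assumes the code knows which version it expects, so a missing link in the chain surfaces as an exception that fails activation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IActorStateMigration<TState>
{
    int FromStateVersion { get; }
    int ToStateVersion { get; }
    TState Apply(TState state);
}

// Hypothetical helper so a migration can be declared inline.
public sealed record DelegateMigration<TState>(
    int FromStateVersion, int ToStateVersion, Func<TState, TState> Transform)
    : IActorStateMigration<TState>
{
    public TState Apply(TState state) => Transform(state);
}

public static class MigrationRunner
{
    // Walks the chain from currentVersion up to targetVersion (an assumed input).
    // Throws when a link is missing or skips a version; exceptions thrown by a
    // migration propagate, so activation fails explicitly.
    public static (TState State, int Version) Run<TState>(
        TState state,
        int currentVersion,
        int targetVersion,
        IEnumerable<IActorStateMigration<TState>> migrations)
    {
        var byFrom = migrations.ToDictionary(m => m.FromStateVersion);
        while (currentVersion < targetVersion)
        {
            if (!byFrom.TryGetValue(currentVersion, out var step))
                throw new InvalidOperationException(
                    $"No migration registered from state version {currentVersion}.");
            if (step.ToStateVersion != currentVersion + 1)
                throw new InvalidOperationException(
                    $"Migration from {currentVersion} must target {currentVersion + 1}.");
            state = step.Apply(state);
            currentVersion = step.ToStateVersion;
        }
        return (state, currentVersion);
    }
}
```

The caller would persist the returned state together with the returned version before handling any command.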

Constraints (locked at the contract level, enforced by CI guard):

  • Pure function of input state. No I/O, no other-actor calls, no
    random / time-dependent inputs.
  • Idempotent — applying twice must yield the same result.
  • Total — must not throw on any well-formed historical state.
  • Migrations form a chain (v1→v2, v2→v3); skipping is forbidden.
  • Zero-dependency constructor: implementations may not depend on
    IServiceProvider, any IClient*, any *Async* service, ITimeService,
    IRandom, or anything that performs I/O. CI guard scans constructor
    parameters of IActorStateMigration implementations and fails the build
    on violations. This is the structural defense against drift toward a
    "general-purpose data transformation framework".

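One way the zero-dependency constructor guard might be written is a reflection scan that runs as a test in CI. The sketch below is an assumption about shape only: MigrationPurityGuard is a hypothetical name, and matching forbidden dependencies by type name mirrors the list above but cannot detect "anything that performs I/O" in general.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class MigrationPurityGuard
{
    // Name-based denylist taken from the constraint list (assumption: names only).
    public static bool IsForbidden(Type parameterType)
    {
        var name = parameterType.Name;
        return (name is "IServiceProvider" or "ITimeService" or "IRandom")
            || name.StartsWith("IClient", StringComparison.Ordinal)
            || name.Contains("Async", StringComparison.Ordinal);
    }

    // migrationInterface is the open generic contract,
    // for example typeof(IActorStateMigration<>).
    public static IReadOnlyList<string> FindViolations(Assembly assembly, Type migrationInterface)
    {
        var violations = new List<string>();
        var implementations = assembly.GetTypes()
            .Where(t => t.IsClass && !t.IsAbstract)
            .Where(t => t.GetInterfaces().Any(i =>
                i.IsGenericType && i.GetGenericTypeDefinition() == migrationInterface));

        foreach (var type in implementations)
        {
            foreach (var ctor in type.GetConstructors())
            {
                foreach (var p in ctor.GetParameters().Where(x => IsForbidden(x.ParameterType)))
                {
                    violations.Add(
                        $"{type.FullName}: forbidden constructor parameter '{p.Name}' ({p.ParameterType.Name})");
                }
            }
        }
        return violations;
    }
}
```

A CI test would call FindViolations for each assembly that contains migrations and fail the build when the returned list is not empty.
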
state_schema_version placement (resolved)

Resolved via companion ADR actor-state-version-placement (co-issued with
#498):

  • Lives on the runtime envelope (RuntimeActorIdentity.state_schema_version
    per #498), not on business state protos.
  • Business state protos remain pure domain artifacts. Migration concern does
    not leak into them.
  • Migration registration keys on (state_proto_descriptor, from, to);
    runtime reads version from the envelope.
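
To illustrate the keying described above, here is a rough sketch of a registry keyed on (state proto, from, to). MigrationKey and MigrationRegistry are hypothetical names, the proto descriptor is represented by its full name as a plain string, and states are passed as object to keep the example short. The current version is supplied by the caller, standing in for the value read from the runtime envelope.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key: the state proto descriptor is identified by its full name.
public readonly record struct MigrationKey(string StateProtoFullName, int From, int To);

public sealed class MigrationRegistry
{
    private readonly Dictionary<MigrationKey, Func<object, object>> _steps = new();

    public void Register(MigrationKey key, Func<object, object> apply)
    {
        // Chain rule: each migration advances exactly one version.
        if (key.To != key.From + 1)
            throw new ArgumentException("Migrations must advance exactly one version.", nameof(key));
        if (!_steps.TryAdd(key, apply))
            throw new InvalidOperationException($"Duplicate migration for {key}.");
    }

    // currentVersion comes from the envelope; the business state carries no version.
    public bool TryFind(string stateProtoFullName, int currentVersion, out Func<object, object> apply) =>
        _steps.TryGetValue(
            new MigrationKey(stateProtoFullName, currentVersion, currentVersion + 1), out apply);
}
```

Because the version is a lookup argument rather than a field on the state, the business state type never has to know that migrations exist.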

YAGNI: the interface is deferred

Lazy-migration applies to exactly two row types in the matrix and there is
no concrete case driving either today. Per CLAUDE.md ("Don't design for
hypothetical future requirements" / "an abstraction that can be abused is an
unfinished design"):

  • This issue ships doctrine + matrix + ADRs — not the interface.
  • IActorStateMigration<TState> is sketched here for future reference.
  • The first real within-actor migration case implements the interface
    alongside its concrete migration. Until then, no empty foundation.

This avoids the slippery slope toward a "general-purpose data transformation
framework" (the explicit non-goal below).

Where lazy migration is wrong — use the projection pipeline

The lazy on-activation interface cannot safely support:

  • Actor split (one → many): actor A would have to spawn / initialize A' and
    A'' during its own activation. A would then be mutating another actor's
    authoritative state, violating the single-source-of-truth rule.
  • Actor merge (many → one): needs reads across multiple actor streams
    during a single activation — outside any one actor's boundary.
  • Identity re-keying: requires global awareness that the same business
    fact moved key — not solvable from inside one activation.
  • Mixed-version safety: migration that mutates state on activation breaks
    pods running older code that still expect the pre-migration shape.

For all of these, the architecturally correct path is projection-pipeline
driven migration, using infrastructure that already exists plus one new
capability (bootstrap-from-projection, see below):

  1. A's committed events are already in the projection main pipeline
    (per docs/canon/event-sourcing.md: "committed domain events must be
    observable").
  2. Stand up new projection consumers for A' / A'' that consume A's committed
    events and materialize A' / A'' state into a dedicated readmodel.
  3. New actor A' / A'' bootstraps its initial state from that readmodel via
    IActorBootstrapPort (one-time import on first activation), then becomes
    authoritative.
  4. Write commands progressively cut over from A to A' / A''. A keeps running
    as source-of-truth during the transition window.
  5. A retires using the mechanism from #495 (Fix retired ChannelRuntime
    startup cleanup), keyed on AgentKind from #498, once read paths are
    migrated.

This is the "Strangler Fig" pattern at the actor level. It is gradual,
distributed-safe, replayable, and reversible.
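
Step 2 of the list above might look roughly like the following for a hypothetical Foo → FooConfig + FooRun split. All event and view types here are invented for illustration, and the in-memory dictionaries stand in for a real readmodel store. The projector only consumes Foo's committed events and never touches Foo's own state.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical committed events of the original actor Foo.
public abstract record FooEvent(string FooId);
public sealed record FooConfigured(string FooId, string Schedule) : FooEvent(FooId);
public sealed record FooRunCompleted(string FooId, DateTimeOffset At) : FooEvent(FooId);

// Hypothetical read models that FooConfig / FooRun would later bootstrap from.
public sealed record FooConfigView(string Schedule);
public sealed record FooRunView(int CompletedRuns, DateTimeOffset? LastRunAt);

public sealed class FooSplitProjector
{
    public Dictionary<string, FooConfigView> Configs { get; } = new();
    public Dictionary<string, FooRunView> Runs { get; } = new();

    // Folds one committed event into the matching read model.
    // Assumes each event is delivered once and in commit order.
    public void Handle(FooEvent e)
    {
        switch (e)
        {
            case FooConfigured c:
                Configs[c.FooId] = new FooConfigView(c.Schedule);
                break;
            case FooRunCompleted r:
                var prev = Runs.GetValueOrDefault(r.FooId, new FooRunView(0, null));
                Runs[r.FooId] = new FooRunView(prev.CompletedRuns + 1, r.At);
                break;
        }
    }
}
```

Because the views are derived purely from committed events, they can be dropped and rebuilt by replaying the stream, which is what makes this path replayable and reversible.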

Missing infrastructure: bootstrap-from-projection

The split / merge cookbook glosses over "stand up new projection consumers for
A' and A''", but RuntimeActorGrain today only initializes from its own
persisted state slot (AgentStateSnapshot). There is no contract for a new
actor to bootstrap from projected state derived from another actor's events.
Without this, the strangler-fig pattern at the actor level cannot work
end-to-end.

This bootstrap contract is filed as a separate prerequisite issue. The
split / merge / re-key cookbook in this issue remains documentation-only
until that issue lands. Operational deliverables (worked example, CI gates
that depend on the cookbook) wait on it.

Re-keying: separate spec, not extension of retired-actor spec

IRetiredActorSpec retires a kind ("this kind no longer exists"). Re-keying
preserves the kind but moves the actor id. Different semantics; conflating
them pollutes the #495 contract.

Re-keying gets its own:

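using System;
using System.Collections.Generic;
using System.Threading;

public interface IActorRedirectSpec
{
    // Identifier of this spec.
    string SpecId { get; }
    // Streams every (from, to) pair that should be redirected.
    IAsyncEnumerable<RedirectTarget> DiscoverAsync(IServiceProvider services, CancellationToken ct);
}

// One redirect: the actor at (FromKind, FromActorId) moves to (ToKind, ToActorId).
public sealed record RedirectTarget(
    string FromKind, string FromActorId,
    string ToKind, string ToActorId);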

Same hosted-service entrypoint as IRetiredActorSpec, executed once at
startup, idempotent. Filed as a separate issue when the first concrete
re-keying case arrives.
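
For illustration only, an implementation of this spec for the re-keying row of the matrix might look like the sketch below. SkillDefinitionTeamRekeySpec, the "skill-definition" kind value, and the in-memory list of known actors are assumptions; a real spec would presumably discover targets through the supplied services.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public sealed record RedirectTarget(
    string FromKind, string FromActorId,
    string ToKind, string ToActorId);

public interface IActorRedirectSpec
{
    string SpecId { get; }
    IAsyncEnumerable<RedirectTarget> DiscoverAsync(IServiceProvider services, CancellationToken ct);
}

// Hypothetical spec: moves skill definitions from a per-user id to a per-team id.
public sealed class SkillDefinitionTeamRekeySpec : IActorRedirectSpec
{
    private readonly IReadOnlyList<(string User, string Team, string Name)> _known;

    public SkillDefinitionTeamRekeySpec(IReadOnlyList<(string User, string Team, string Name)> known) =>
        _known = known;

    public string SpecId => "skill-definition-team-rekey";

    public async IAsyncEnumerable<RedirectTarget> DiscoverAsync(
        IServiceProvider services, [EnumeratorCancellation] CancellationToken ct)
    {
        foreach (var (user, team, name) in _known)
        {
            ct.ThrowIfCancellationRequested();
            await Task.Yield(); // stand-in for a real asynchronous lookup
            // The kind is preserved; only the actor id changes.
            yield return new RedirectTarget(
                "skill-definition", $"skill-runner-{user}-{name}",
                "skill-definition", $"skill-definition-{team}-{name}");
        }
    }
}
```

Discovery is deterministic for a given input list, so running it again at the next startup yields the same targets.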

Doctrine: events are append-only, semantics are immutable

The matrix's "event semantic change" row is doctrine, not infrastructure.
Concrete rules (recorded in ADR event-immutability-policy):

  • A committed event's TypeUrl pins its semantics forever.
  • New semantics → new event type with a new TypeUrl. Old type stays for
    history; projectors handle both during the transition window.
  • Adding optional fields to an event proto is permitted (proto3 evolution
    rules). Adding fields whose absence implies a different semantic is
    forbidden — that is a semantic change, not a shape change.
  • Backfilling history by replaying old events under new semantic assumptions
    is forbidden in normal operation. If a projection has to be rebuilt, it
    rebuilds under the original semantics of each event.

This row exists in the matrix because the most common silent failure mode
in event-sourced systems is "we tweaked what UserUpdated means" — the
matrix should refuse that path explicitly.
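
A projector that handles both the old and the new event type during the transition window could be sketched like this. The type URLs, the CommittedEvent envelope, and the payload formats are all hypothetical; the point is only that each TypeUrl keeps its own interpretation and the old branch is never edited.

```csharp
using System;

// Hypothetical envelope: a TypeUrl plus an already-decoded payload.
public sealed record CommittedEvent(string TypeUrl, string Payload);

public sealed class UserNameProjector
{
    // Hypothetical type URLs. The V2 type exists because the meaning changed.
    public const string UserUpdated = "type.example.com/demo.UserUpdated";
    public const string UserUpdatedV2 = "type.example.com/demo.UserUpdatedV2";

    public string DisplayName { get; private set; } = "";

    public void Handle(CommittedEvent e)
    {
        switch (e.TypeUrl)
        {
            case UserUpdated:
                // Original semantics: the payload is the full display name.
                DisplayName = e.Payload;
                break;
            case UserUpdatedV2:
                // New semantics: the payload is "given|family".
                var parts = e.Payload.Split('|');
                DisplayName = parts.Length == 2 ? $"{parts[0]} {parts[1]}" : e.Payload;
                break;
            default:
                // Event types this projector does not know are ignored.
                break;
        }
    }
}
```

A rebuild replays the full history through the same switch, so old events are still read under their original semantics.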

Open design questions (resolved or deferred)

  • Where does state_schema_version live? Resolved: runtime envelope
    (RuntimeActorIdentity.state_schema_version), not business state proto.
    See companion ADR.
  • Failure mode when no migration is registered for a stale
    state_schema_version: Resolved: fail activation hard. Silent data
    drift is worse than a visible startup error.
  • Re-keying mechanism: Resolved: separate IActorRedirectSpec,
    separate issue. Not a flavor of retire and not a flavor of state migration.
  • Migration registration shape: Deferred until first concrete case.
    Default position: attribute-based
    ([StateMigration(typeof(SkillDefinitionState), from: 1, to: 2)]) for
    discoverability; DI registration acceptable for tests.
  • Projection-driven split protocol: Cookbook documented in
    docs/canon/projection-driven-actor-split.md — but operational deliverable
    blocks on bootstrap-from-projection issue.

Deliverables

  • docs/canon/actor-model-evolution.md — the matrix above plus a
    one-paragraph example for each cell. Ship first.
  • ADR docs/adr/NNNN-actor-evolution-pattern-decision-matrix.md — the
    decision matrix locked, supersedes any ad-hoc framing.
  • ADR docs/adr/NNNN-actor-state-version-placement.md, co-issued with
    #498; locks placement on the runtime envelope.
  • ADR docs/adr/NNNN-event-immutability-policy.md — events append-only,
    semantics immutable; new TypeUrl on semantic change.
  • CI / review skill check: every refactor PR that deletes / renames /
    moves an actor type or *State proto must declare in the description
    which row of the matrix it falls under.
  • docs/canon/projection-driven-actor-split.md — split / merge cookbook
    with one worked example end-to-end (write commands cut-over phases,
    retire timing, projection consumer wiring, bootstrap port import).
    Blocks on bootstrap-from-projection issue landing.
  • IActorStateMigration<TState> interface — deferred to first real
    case. Sketch retained in this issue for reference; CI guard
    (zero-dependency constructor) lands together with the interface, not
    before.
  • Worked split-cookbook exemplar (e.g., a hypothetical Foo →
    FooConfig + FooRun). Blocks on bootstrap port.

Relationship to other issues

Non-goals

  • A general-purpose "data transformation framework". The lazy migration
    interface is intentionally narrow — purity is enforced by CI guard on
    constructor dependencies.
  • Replacing proto3 / [LegacyProtoFullName] for payload codec compatibility.
    Those layers stay untouched.
  • Online schema migration tooling for state stores other than Aevatar's own
    event store + actor state.
  • Mutating event semantics in place. Always new TypeUrl; old type retained
    for history.
