Task Summary
Each agent in the agent service is a stateful in-memory object with no durable backing store.
In agent-service/src/server.ts the entire fleet lives in a process-local map:
const agentStore = new Map<string, TexeraAgent>();
let agentCounter = 0;
All agent state (the ReAct step tree, the current HEAD/checkout pointer, per-agent settings, delegate config, and the cached operator-result state) is held in TexeraAgent instance fields. The only thing persisted today is the workflow content, which is pushed back to the dashboard backend (persistWorkflow, debounced by 500 ms). The agent itself and its conversation/step history are not persisted.
Consequences:
Before: restart / crash / redeploy -> agentStore is empty -> all agents + history lost
        second replica -> separate agentStore -> agent not found
After:  restart / crash / redeploy -> agents rehydrated from store
        any replica -> shared store -> agent reachable
This also blocks horizontal scaling: a request routed to a replica that does not hold the agent fails with "Agent not found".
Introduce a persistence layer (e.g. database or Redis) for agent records and their step history, with load-on-demand / rehydration on startup, so agents survive restarts and can be served by more than one replica.
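A minimal sketch of what load-on-demand could look like, under stated assumptions. AgentSnapshot, StepRecord, AgentRepository, InMemoryAgentRepository, and AgentRegistry are hypothetical names introduced here for illustration; none of them exist in agent-service, and the snapshot fields only approximate the state listed above. The in-memory repository is a stand-in so the sketch runs; a real adapter would talk to a database or Redis.

```typescript
// Hypothetical step record; the real ReAct step shape may differ.
interface StepRecord {
  id: string;
  parentId: string | null;
  content: string;
}

// Hypothetical serializable view of one agent (assumed fields, not TexeraAgent's).
interface AgentSnapshot {
  id: string;
  headStepId: string | null; // current HEAD/checkout pointer
  steps: StepRecord[];       // flattened step tree
  settings: Record<string, unknown>;
}

// Hypothetical storage contract; a database or Redis adapter would implement it.
interface AgentRepository {
  save(snapshot: AgentSnapshot): Promise<void>;
  load(id: string): Promise<AgentSnapshot | undefined>;
}

// Stand-in store so the sketch is runnable. It is NOT durable.
class InMemoryAgentRepository implements AgentRepository {
  private rows = new Map<string, string>();

  async save(snapshot: AgentSnapshot): Promise<void> {
    // Serialize to JSON, as a real store would, so callers never share references.
    this.rows.set(snapshot.id, JSON.stringify(snapshot));
  }

  async load(id: string): Promise<AgentSnapshot | undefined> {
    const raw = this.rows.get(id);
    return raw === undefined ? undefined : (JSON.parse(raw) as AgentSnapshot);
  }
}

// Per-replica registry: a local cache in front of the shared repository.
class AgentRegistry {
  private cache = new Map<string, AgentSnapshot>();
  private repo: AgentRepository;

  constructor(repo: AgentRepository) {
    this.repo = repo;
  }

  // Persist a new agent and keep it in the local cache.
  async create(snapshot: AgentSnapshot): Promise<void> {
    await this.repo.save(snapshot);
    this.cache.set(snapshot.id, snapshot);
  }

  // Load on demand: a cache miss falls through to the shared store.
  async get(id: string): Promise<AgentSnapshot | undefined> {
    const cached = this.cache.get(id);
    if (cached !== undefined) return cached;
    const loaded = await this.repo.load(id);
    if (loaded !== undefined) this.cache.set(id, loaded);
    return loaded;
  }

  // Append a step, move HEAD to it, and write the agent through to the store.
  async appendStep(id: string, step: StepRecord): Promise<boolean> {
    const agent = await this.get(id);
    if (agent === undefined) return false;
    agent.steps.push(step);
    agent.headStepId = step.id;
    await this.repo.save(agent);
    return true;
  }
}
```

With this shape, a second AgentRegistry built over the same repository can resolve an agent it never created, which is the "any replica -> shared store -> agent reachable" case. The sketch does not handle cache staleness when two replicas write the same agent; that would need versioning, locking, or routing each agent to a single owner.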
Task Type