Commit 86ca65c

fix(memory/typed-network): per-attempt 30s timeout on observer LLM invoke
The TypedNetworkObserver's underlying LLM adapter has no built-in request timeout. A hung TCP socket or unresponsive provider could deadlock long-running ingest pipelines indefinitely: at concurrency=1 a single hung request stalls every subsequent session forever.

Reproduced three times running Stage E Phase A on LongMemEval-S N=54: each run hung for 1.5-3 hours at 0% CPU on a stuck HTTPS connection to the OpenAI/Cohere endpoint, while direct API tests via curl returned in 0.5-1.3s.

Add a Promise.race-based timeout with a clean clearTimeout in finally, configurable via the new TypedNetworkObserverOptions.timeoutMs field (default 30000ms). On timeout the attempt is abandoned and the observer falls through to its existing retry / empty-result path. The underlying request leaks its socket until GC / process exit, but ingest moves on.

This is a defense-in-depth fix. The underlying adapter SHOULD also support AbortSignal, but adding that requires extending the ITypedExtractionLLM interface and updating every consumer's adapter, which is invasive enough to defer.

Tests: 81/81 typed-network tests pass. The new public surface (the timeoutMs option) is optional, with a sane default that existing consumers get for free.
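The AbortSignal route that the message defers could look roughly like the sketch below. Everything in it is an assumption for illustration: `ExtractionRequest`, `AbortableExtractionLLM`, `invokeWithAbort`, and `hangingLLM` are hypothetical names, and the commit does not show how `ITypedExtractionLLM` would actually be extended.

```typescript
// Hypothetical request shape, mirroring the fields passed to llm.invoke() in this commit.
interface ExtractionRequest {
  system: string;
  user: string;
  maxTokens: number;
  temperature: number;
}

// Hypothetical abort-aware adapter interface; NOT part of this commit.
interface AbortableExtractionLLM {
  invoke(request: ExtractionRequest, signal: AbortSignal): Promise<string>;
}

// Abort the request once timeoutMs elapses, and always clear the timer.
async function invokeWithAbort(
  llm: AbortableExtractionLLM,
  request: ExtractionRequest,
  timeoutMs: number,
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await llm.invoke(request, controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in adapter that never answers on its own but honours the abort signal.
const hangingLLM: AbortableExtractionLLM = {
  invoke: (_request, signal) =>
    new Promise<string>((_resolve, reject) => {
      signal.addEventListener("abort", () => reject(new Error("aborted")), {
        once: true,
      });
    }),
};
```

Unlike the Promise.race approach, this only helps when the adapter really forwards the signal to its HTTP client. In exchange, the socket can be released at abort time instead of lingering until GC or process exit.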
1 parent 543fa84 commit 86ca65c

1 file changed

Lines changed: 56 additions & 6 deletions

src/memory/retrieval/typed-network/TypedNetworkObserver.ts

@@ -70,6 +70,14 @@ export interface TypedNetworkObserverOptions {
   maxTokens?: number;
   /** Temperature. Default 0 for deterministic extraction. */
   temperature?: number;
+  /**
+   * Per-attempt request timeout in milliseconds. When the underlying
+   * `llm.invoke()` does not resolve within this window the attempt is
+   * abandoned and the observer falls through to the retry path. Used
+   * to prevent stale TCP sockets / hung OpenAI requests from
+   * deadlocking long-running ingest pipelines. Default 30 000 ms (30 s).
+   */
+  timeoutMs?: number;
 }
 
 /**
@@ -87,11 +95,13 @@ export class TypedNetworkObserver {
   private readonly llm: ITypedExtractionLLM;
   private readonly maxTokens: number;
   private readonly temperature: number;
+  private readonly timeoutMs: number;
 
   constructor(options: TypedNetworkObserverOptions) {
     this.llm = options.llm;
     this.maxTokens = options.maxTokens ?? 4096;
     this.temperature = options.temperature ?? 0;
+    this.timeoutMs = options.timeoutMs ?? 30_000;
   }
 
   /**
@@ -125,12 +135,21 @@ export class TypedNetworkObserver {
           ? baseUserPrompt
           : `${baseUserPrompt}\n\nThe previous response failed validation: ${lastValidationError}\nReturn JSON matching the schema strictly. Do not add commentary.`;
 
-      const raw = await this.llm.invoke({
-        system: TYPED_EXTRACTION_SYSTEM_PROMPT,
-        user: userPrompt,
-        maxTokens: this.maxTokens,
-        temperature: this.temperature,
-      });
+      // Race the underlying invoke against a per-attempt timeout. A
+      // hung TCP socket / unresponsive provider would otherwise
+      // deadlock long-running ingest pipelines - at concurrency=1 a
+      // single hung request stalls every subsequent session forever.
+      // The timeout fires from the agentos side without requiring the
+      // adapter to surface AbortSignal support; the underlying request
+      // is abandoned (its socket leaks until GC / process exit) but
+      // the observer moves to its retry path.
+      let raw: string;
+      try {
+        raw = await this.invokeWithTimeout(userPrompt);
+      } catch (err) {
+        lastValidationError = err instanceof Error ? err.message : String(err);
+        continue;
+      }
 
       const stripped = stripCodeFence(raw);
 
@@ -174,6 +193,37 @@ export class TypedNetworkObserver {
     // semantics.
     return [];
   }
+
+  /**
+   * Run `this.llm.invoke()` with a per-attempt timeout. Throws an
+   * `Error('TypedNetworkObserver: extraction timed out after Nms')`
+   * when the timer fires before the LLM responds. The timer is cleared
+   * on resolution to avoid leaking pending timeouts into subsequent
+   * extractions.
+   */
+  private async invokeWithTimeout(userPrompt: string): Promise<string> {
+    let timeoutHandle: ReturnType<typeof setTimeout> | null = null;
+    const invokePromise = this.llm.invoke({
+      system: TYPED_EXTRACTION_SYSTEM_PROMPT,
+      user: userPrompt,
+      maxTokens: this.maxTokens,
+      temperature: this.temperature,
+    });
+    const timeoutPromise = new Promise<never>((_resolve, reject) => {
+      timeoutHandle = setTimeout(() => {
+        reject(
+          new Error(
+            `TypedNetworkObserver: extraction timed out after ${this.timeoutMs}ms`,
+          ),
+        );
+      }, this.timeoutMs);
+    });
+    try {
+      return await Promise.race([invokePromise, timeoutPromise]);
+    } finally {
+      if (timeoutHandle !== null) clearTimeout(timeoutHandle);
+    }
+  }
 }
 
 /**
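To see the behaviour in isolation, the following standalone sketch reproduces the race-against-a-timer pattern and the fall-through to an empty result, using a stand-in LLM that never resolves. The names `FakeLLM`, `withTimeout`, `extractWithRetries`, `hungLLM`, and the `maxAttempts` parameter are hypothetical. The real observer's retry count and validation logic are not visible in this diff, so this only approximates them.

```typescript
// Hypothetical stand-in for the adapter; only invoke() is modelled.
interface FakeLLM {
  invoke(prompt: string): Promise<string>;
}

// Same shape as invokeWithTimeout in the diff, written as a free function.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let handle: ReturnType<typeof setTimeout> | null = null;
  const timeout = new Promise<never>((_resolve, reject) => {
    handle = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer so it cannot keep the event loop alive afterwards.
    if (handle !== null) clearTimeout(handle);
  }
}

// Assumed retry loop: a timed-out attempt is recorded and the loop moves on;
// once all attempts are used up the caller gets an empty result.
async function extractWithRetries(
  llm: FakeLLM,
  prompt: string,
  timeoutMs: number,
  maxAttempts: number,
): Promise<{ result: string[]; errors: string[] }> {
  const errors: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const raw = await withTimeout(llm.invoke(prompt), timeoutMs);
      return { result: [raw], errors };
    } catch (err) {
      errors.push(err instanceof Error ? err.message : String(err));
      continue;
    }
  }
  return { result: [], errors };
}

// Simulates the hung socket: the promise never settles.
const hungLLM: FakeLLM = { invoke: () => new Promise<string>(() => {}) };
```

With `hungLLM`, each attempt is abandoned after `timeoutMs` and the caller eventually receives an empty result instead of blocking. One detail visible in the diff: the timeout message is stored in `lastValidationError`, so the next retry prompt will apparently report the timeout to the model as a validation failure.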
