feat: implement resource exhaustion triggers for agent monitoring#16
Conversation
Implement per-agent resource tracking to detect and prevent runaway agents
that consume excessive resources without producing meaningful deliverables.
## Features
- **Per-Agent Tracking**: Track files accessed, API calls, subtasks spawned, tokens consumed
- **Phase Progression**: `normal` → `warning` → `intervention` → `termination`
- **Configurable Thresholds**: Set limits for each resource type
- **Pause/Resume Control**: Automatically pause agents exceeding thresholds
- **Deliverable Checkpoints**: Reset time-based tracking when agents produce results
- **Slack Notifications**: Alert on warnings, interventions, and terminations
- **Prometheus Metrics**: Full observability with counters, gauges, histograms
## Implementation
- New `ResourceExhaustionService` following `DriftDetectionService` pattern
- Database tables: `agent_resource_metrics`, `agent_deliverable_checkpoints`, `resource_exhaustion_events`
- REST API endpoints for resource monitoring and control
- Integration with spawner for automatic tracking
- 70 unit tests with 100% line coverage, 95.53% branch coverage
## Configuration
```json
{
"resourceExhaustion": {
"enabled": true,
"thresholds": {
"maxFilesAccessed": 50,
"maxApiCalls": 100,
"maxSubtasksSpawned": 20,
"maxTimeWithoutDeliverableMs": 1800000,
"maxTokensConsumed": 500000
},
"warningThresholdPercent": 0.7,
"autoTerminate": false,
"pauseOnIntervention": true
}
}
```
Closes #4
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR introduces a comprehensive resource exhaustion monitoring system that tracks per-agent metrics (files accessed, API calls, tokens consumed, time without deliverables) with configurable thresholds triggering warning/intervention/termination phases. It includes lifecycle management (pause/resume), deliverable checkpointing, Slack notifications, and REST APIs for monitoring and control. Changes
Sequence Diagram(s)sequenceDiagram
participant Agent as Agent Spawner
participant Service as Resource<br/>Exhaustion Service
participant Metrics as Prometheus<br/>Metrics
participant Store as SQLiteStore
participant Slack as Slack<br/>Integration
Agent->>Service: initializeAgent(agentId)
Service->>Store: Load existing metrics from DB
Service->>Service: Create in-memory tracking
Agent->>Service: recordApiCall(tokens)
Service->>Metrics: Update token counters
Service->>Store: Persist metrics
Agent->>Service: evaluateAgent()
Service->>Service: Compare metrics vs thresholds
alt Threshold exceeded
Service->>Service: Transition phase (normal→warning)
Service->>Metrics: Update phase gauge
Service->>Store: Save exhaustion event
Service->>Slack: sendResourceWarning()
Slack-->>Slack: Format and send message
else Severe threshold exceeded
Service->>Service: Transition to intervention
Service->>Service: pauseAgent() if configured
Service->>Slack: sendResourceIntervention()
end
Agent->>Service: recordDeliverable(checkpoint)
Service->>Store: createDeliverableCheckpoint()
Service->>Service: Reset to normal phase
sequenceDiagram
participant Client as REST Client
participant Routes as Agent Routes
participant Service as Resource<br/>Exhaustion Service
participant Store as SQLiteStore
participant Memory as Memory<br/>Manager
Client->>Routes: POST /api/v1/agents/:id/pause
Routes->>Routes: Validate resourceExhaustion.enabled
Routes->>Service: pauseAgent(agentId, reason)
Service->>Store: Update pause state in metrics
Service->>Service: Register pauseAgent callback
Routes->>Routes: Emit websocket event
Routes-->>Client: { paused: true, timestamp }
Client->>Routes: GET /api/v1/agents/:id/resources
Routes->>Service: getResourceMetrics(agentId)
Service->>Store: getAgentResourceMetrics(agentId)
Store-->>Service: Return metrics object
Routes->>Memory: Load agent data for context
Routes-->>Client: { filesAccessed, apiCalls, tokens, phase }
Client->>Routes: POST /api/v1/agents/:id/deliverable
Routes->>Routes: Validate type, config enabled
Routes->>Service: recordDeliverable(checkpoint)
Service->>Store: createDeliverableCheckpoint()
Service->>Service: Update lastDeliverableAt, reset phase
Routes-->>Client: { id, agentId, type, createdAt }
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/monitoring/resource-exhaustion-service.ts`:
- Around line 260-265: The code uses a non-null assertion on triggeredBy when
calling handlePhaseTransition (see variables triggeredBy, newPhase,
metrics.phase, handlePhaseTransition), which can lead to passing null/undefined
in edge cases; update the conditional logic to ensure triggeredBy is set (or
provide a safe default) before invoking handlePhaseTransition, e.g., only call
handlePhaseTransition when newPhase !== metrics.phase AND triggeredBy is
non-null (or compute a fallback reason), then continue to assign metrics.phase =
newPhase and call updateMetrics(agentId, metrics) as before.
- Around line 337-340: The isAgentPaused method incorrectly returns true when
metrics is undefined because metrics?.pausedAt !== null yields true for
undefined; update isAgentPaused to first ensure metrics exists and only then
check pausedAt (e.g., use an explicit existence check like metrics != null &&
metrics.pausedAt !== null, or metrics?.pausedAt != null) so that non-existent
agents return false; refer to isAgentPaused, this.metricsCache, and the pausedAt
field when making the change.
- Around line 539-542: The call to pauseAgent inside handlePhaseTransition is
not awaited even though pauseAgent returns Promise<boolean>, so make
handlePhaseTransition async and await this.pauseAgent(agentId, `Resource
threshold exceeded: ${triggeredBy}`) and propagate or handle its result,
updating the handlePhaseTransition signature and any callers as needed; if you
intentionally want fire-and-forget, instead append a .catch(...) to the
pauseAgent(...) call to log or handle errors so failures aren't swallowed
(reference functions: handlePhaseTransition and pauseAgent).
🧹 Nitpick comments (12)
src/web/routes/agents.ts (3)
346-355: Inconsistent error response pattern.Lines 350 and 354 handle errors differently: line 350 uses
sendJson(res, { error: ... }, 500)while line 354 usessendError(res, 500, message). ThesendErrorfunction produces a standardized response with{success: false, error: message}, whereassendJsonwith an error object produces{success: true, data: {error: ...}}which is misleading for error responses.Suggested fix for consistency
try { const success = await pauseAgent(agentId, reason); if (success) { agentEvents.emit('agent:paused', { id: agentId, reason }); sendJson(res, { paused: true, reason }); } else { - sendJson(res, { error: 'Failed to pause agent' }, 500); + sendError(res, 500, 'Failed to pause agent'); } } catch (error) {
376-382: Inconsistent error response pattern.Same issue here: line 381 uses
sendJson(res, { error: ... }, 400)instead ofsendError. This produces{success: true, data: {error: ...}}which incorrectly indicates success.Suggested fix for consistency
try { const success = resumeAgent(agentId); if (success) { agentEvents.emit('agent:resumed', { id: agentId }); sendJson(res, { resumed: true }); } else { - sendJson(res, { error: 'Agent was not paused or failed to resume' }, 400); + sendError(res, 400, 'Agent was not paused or failed to resume'); } } catch (error) {
242-245: Consider usingsendErrorfor all error responses in new endpoints.Multiple places in the new endpoints use
sendJson(res, { error: ... }, statusCode)which produces responses like{success: true, data: {error: ...}, timestamp: ...}. This is semantically incorrect for error conditions. Consider usingsendErrorconsistently for all error responses to produce proper{success: false, error: ...}responses.Affected lines: 243-244, 256-257, 296-297, 340-341, 371-372.
Also applies to: 255-258, 295-298, 339-342, 370-373
tests/unit/resource-exhaustion.test.ts (3)
746-748: Weak assertion does not verify actual behavior.The assertion
expect(true).toBe(true)only confirms no exception was thrown. Consider verifying observable state, such as confirming the metrics remain unchanged or the agent wasn't inadvertently created.♻️ Suggested improvement
- // No crash expected - expect(true).toBe(true); + // Metrics should remain unchanged or null since cache was cleared + const metrics = service.getAgentMetrics('agent-1'); + // Verify the operation was a no-op (metrics still null or unchanged) + expect(metrics).toBeNull();
619-639: Time-based test relies onsetTimeoutwhich can be flaky in CI.This test uses real time delays (150ms) to verify threshold detection. While acceptable for unit tests, consider using fake timers (
vi.useFakeTimers()) for more reliable and faster tests.♻️ Suggested improvement using fake timers
it('should detect time without deliverable threshold', () => { + vi.useFakeTimers(); const service = new ResourceExhaustionService( store, createConfig({ maxTimeWithoutDeliverableMs: 100, // Very short for testing warningThresholdPercent: 0.5, pauseOnIntervention: false, }) ); service.initializeAgent('agent-1', 'coder'); - // Wait for threshold to be exceeded - return new Promise<void>((resolve) => { - setTimeout(() => { - const phase = service.evaluateAgent('agent-1'); - expect(phase).toBe('intervention'); - resolve(); - }, 150); - }); + // Advance time past threshold + vi.advanceTimersByTime(150); + + const phase = service.evaluateAgent('agent-1'); + expect(phase).toBe('intervention'); + + vi.useRealTimers(); });
1148-1167: Time-based test for checkpoint ordering relies on real delay.Similar to the earlier time-based test, this uses a 10ms delay to ensure different timestamps. Consider using fake timers or manually setting timestamps in the test data for deterministic behavior.
src/monitoring/resource-exhaustion-service.ts (4)
345-357:terminateAgenthardcodestriggeredByasmaxTimeWithoutDeliverableMs.The termination may be triggered for various reasons (e.g., manual termination, different threshold breaches), but the event always logs
maxTimeWithoutDeliverableMsas the trigger. Consider acceptingtriggeredByas a parameter.♻️ Suggested improvement
-terminateAgent(agentId: string, reason: string): boolean { +terminateAgent( + agentId: string, + reason: string, + triggeredBy: keyof ResourceThresholds = 'maxTimeWithoutDeliverableMs' +): boolean { const metrics = this.metricsCache.get(agentId); if (!metrics) return false; // Trigger termination phase transition - this.handlePhaseTransition(agentId, metrics.phase, 'termination', 'maxTimeWithoutDeliverableMs'); + this.handlePhaseTransition(agentId, metrics.phase, 'termination', triggeredBy); metrics.phase = 'termination'; ...
584-591: Config comparison usesJSON.stringifyfor thresholds.While functional,
JSON.stringifycomparison is sensitive to property ordering and can be expensive for frequent calls. Consider a dedicated comparison function for clarity and reliability.♻️ Suggested improvement
function thresholdsEqual(a: ResourceThresholds, b: ResourceThresholds): boolean { return ( a.maxFilesAccessed === b.maxFilesAccessed && a.maxApiCalls === b.maxApiCalls && a.maxSubtasksSpawned === b.maxSubtasksSpawned && a.maxTimeWithoutDeliverableMs === b.maxTimeWithoutDeliverableMs && a.maxTokensConsumed === b.maxTokensConsumed ); }
362-366:waitForResumecreates a promise that may never resolve.If
waitForResumeis called but the agent is never paused or resumed, the promise remains pending indefinitely. Consider adding a timeout or cancellation mechanism for robustness.
414-424: Background monitoring interval should useunref()to not block Node.js exit.If this service is running when the process wants to exit gracefully, the interval may keep the event loop alive. Consider calling
unref()on the interval.♻️ Suggested improvement
start(): void { if (!this.isEnabled() || this.checkInterval) return; this.checkInterval = setInterval(() => { this.checkAllAgents(); }, this.config.checkIntervalMs); + + // Don't block process exit + this.checkInterval.unref(); log.info('Resource exhaustion monitoring started', { checkIntervalMs: this.config.checkIntervalMs, }); }src/memory/sqlite-store.ts (2)
2425-2462: Consider using SQL aggregation for better performance.The
getResourceExhaustionMetricsmethod fetches all events and aggregates in memory. For large event tables, this could be slow and memory-intensive. Consider using SQLGROUP BYandCOUNTfor better performance.♻️ SQL-based aggregation approach
getResourceExhaustionMetrics(since?: Date): { totalEvents: number; warningCount: number; interventionCount: number; terminationCount: number; byAgent: Record<string, number>; } { let baseCondition = '1=1'; const params: number[] = []; if (since) { baseCondition = 'created_at >= ?'; params.push(since.getTime()); } // Get counts by action_taken const countQuery = ` SELECT COUNT(*) as total, SUM(CASE WHEN action_taken = 'warned' THEN 1 ELSE 0 END) as warning_count, SUM(CASE WHEN action_taken = 'paused' THEN 1 ELSE 0 END) as intervention_count, SUM(CASE WHEN action_taken = 'terminated' THEN 1 ELSE 0 END) as termination_count FROM resource_exhaustion_events WHERE ${baseCondition} `; const counts = this.db.prepare(countQuery).get(...params) as { total: number; warning_count: number; intervention_count: number; termination_count: number; }; // Get counts by agent const byAgentQuery = ` SELECT agent_id, COUNT(*) as count FROM resource_exhaustion_events WHERE ${baseCondition} GROUP BY agent_id `; const agentRows = this.db.prepare(byAgentQuery).all(...params) as Array<{ agent_id: string; count: number; }>; const byAgent: Record<string, number> = {}; for (const row of agentRows) { byAgent[row.agent_id] = row.count; } return { totalEvents: counts.total, warningCount: counts.warning_count, interventionCount: counts.intervention_count, terminationCount: counts.termination_count, byAgent, }; }
2493-2516: Mutating the parsed JSON object could cause subtle issues.The transformer mutates
metricsafter parsing from JSON. While functional, this pattern can be confusing. Consider creating a new object with the converted dates instead.♻️ Immutable transformation approach
private rowToResourceExhaustionEvent(row: ResourceExhaustionEventRow): ResourceExhaustionEvent { - const metrics = JSON.parse(row.metrics) as AgentResourceMetrics; - // Convert date strings back to Date objects - metrics.startedAt = new Date(metrics.startedAt); - metrics.lastActivityAt = new Date(metrics.lastActivityAt); - if (metrics.lastDeliverableAt) { - metrics.lastDeliverableAt = new Date(metrics.lastDeliverableAt); - } - if (metrics.pausedAt) { - metrics.pausedAt = new Date(metrics.pausedAt); - } + const rawMetrics = JSON.parse(row.metrics) as Record<string, unknown>; + const metrics: AgentResourceMetrics = { + agentId: rawMetrics.agentId as string, + filesRead: rawMetrics.filesRead as number, + filesWritten: rawMetrics.filesWritten as number, + filesModified: rawMetrics.filesModified as number, + apiCallsCount: rawMetrics.apiCallsCount as number, + subtasksSpawned: rawMetrics.subtasksSpawned as number, + tokensConsumed: rawMetrics.tokensConsumed as number, + startedAt: new Date(rawMetrics.startedAt as string | number), + lastDeliverableAt: rawMetrics.lastDeliverableAt + ? new Date(rawMetrics.lastDeliverableAt as string | number) + : null, + lastActivityAt: new Date(rawMetrics.lastActivityAt as string | number), + phase: rawMetrics.phase as ResourceExhaustionPhase, + pausedAt: rawMetrics.pausedAt + ? new Date(rawMetrics.pausedAt as string | number) + : null, + pauseReason: rawMetrics.pauseReason as string | null, + }; return { id: row.id, ... }; }
| // Handle phase transition | ||
| if (newPhase !== metrics.phase) { | ||
| this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy!); | ||
| metrics.phase = newPhase; | ||
| this.updateMetrics(agentId, metrics); | ||
| } |
There was a problem hiding this comment.
Non-null assertion on triggeredBy may cause issues.
If newPhase stays 'normal' throughout the loop (all ratios below warning threshold), triggeredBy remains null, yet the code calls handlePhaseTransition with triggeredBy!. While the if (newPhase !== metrics.phase) condition may prevent this in normal→normal cases, edge cases (like phase already being warning/intervention) could trigger unexpected behavior.
🛠️ Suggested defensive fix
// Handle phase transition
- if (newPhase !== metrics.phase) {
- this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy!);
+ if (newPhase !== metrics.phase && triggeredBy !== null) {
+ this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy);
metrics.phase = newPhase;
this.updateMetrics(agentId, metrics);
}🤖 Prompt for AI Agents
In `@src/monitoring/resource-exhaustion-service.ts` around lines 260 - 265, The
code uses a non-null assertion on triggeredBy when calling handlePhaseTransition
(see variables triggeredBy, newPhase, metrics.phase, handlePhaseTransition),
which can lead to passing null/undefined in edge cases; update the conditional
logic to ensure triggeredBy is set (or provide a safe default) before invoking
handlePhaseTransition, e.g., only call handlePhaseTransition when newPhase !==
metrics.phase AND triggeredBy is non-null (or compute a fallback reason), then
continue to assign metrics.phase = newPhase and call updateMetrics(agentId,
metrics) as before.
| isAgentPaused(agentId: string): boolean { | ||
| const metrics = this.metricsCache.get(agentId); | ||
| return metrics?.pausedAt !== null; | ||
| } |
There was a problem hiding this comment.
isAgentPaused returns incorrect value when agent doesn't exist.
When metrics is undefined (agent not found), metrics?.pausedAt !== null evaluates to undefined !== null, which is true. This means isAgentPaused returns true for non-existent agents.
🐛 Proposed fix
isAgentPaused(agentId: string): boolean {
const metrics = this.metricsCache.get(agentId);
- return metrics?.pausedAt !== null;
+ return metrics?.pausedAt != null;
}Or more explicitly:
isAgentPaused(agentId: string): boolean {
const metrics = this.metricsCache.get(agentId);
- return metrics?.pausedAt !== null;
+ return metrics !== undefined && metrics.pausedAt !== null;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| isAgentPaused(agentId: string): boolean { | |
| const metrics = this.metricsCache.get(agentId); | |
| return metrics?.pausedAt !== null; | |
| } | |
| isAgentPaused(agentId: string): boolean { | |
| const metrics = this.metricsCache.get(agentId); | |
| return metrics?.pausedAt != null; | |
| } |
| isAgentPaused(agentId: string): boolean { | |
| const metrics = this.metricsCache.get(agentId); | |
| return metrics?.pausedAt !== null; | |
| } | |
| isAgentPaused(agentId: string): boolean { | |
| const metrics = this.metricsCache.get(agentId); | |
| return metrics !== undefined && metrics.pausedAt !== null; | |
| } |
🤖 Prompt for AI Agents
In `@src/monitoring/resource-exhaustion-service.ts` around lines 337 - 340, The
isAgentPaused method incorrectly returns true when metrics is undefined because
metrics?.pausedAt !== null yields true for undefined; update isAgentPaused to
first ensure metrics exists and only then check pausedAt (e.g., use an explicit
existence check like metrics != null && metrics.pausedAt !== null, or
metrics?.pausedAt != null) so that non-existent agents return false; refer to
isAgentPaused, this.metricsCache, and the pausedAt field when making the change.
| // Auto-pause if configured | ||
| if (this.config.pauseOnIntervention) { | ||
| this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`); | ||
| } |
There was a problem hiding this comment.
pauseAgent is async but not awaited.
pauseAgent returns a Promise<boolean> but is called without await inside handlePhaseTransition. This means the pause operation may not complete before the function returns, and any errors from pauseAgent won't be caught.
🛠️ Options to address
Option 1: Make handlePhaseTransition async and await the call:
-private handlePhaseTransition(
+private async handlePhaseTransition(
...
): void {
...
- if (this.config.pauseOnIntervention) {
- this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
+ if (this.config.pauseOnIntervention) {
+ await this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
}Option 2: If fire-and-forget is intentional, add a .catch() to handle potential errors:
if (this.config.pauseOnIntervention) {
- this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
+ this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`)
+ .catch(err => log.error('Failed to auto-pause agent', { agentId, error: err }));
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Auto-pause if configured | |
| if (this.config.pauseOnIntervention) { | |
| this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`); | |
| } | |
| // Auto-pause if configured | |
| if (this.config.pauseOnIntervention) { | |
| this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`) | |
| .catch(err => log.error('Failed to auto-pause agent', { agentId, error: err })); | |
| } |
🤖 Prompt for AI Agents
In `@src/monitoring/resource-exhaustion-service.ts` around lines 539 - 542, The
call to pauseAgent inside handlePhaseTransition is not awaited even though
pauseAgent returns Promise<boolean>, so make handlePhaseTransition async and
await this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`)
and propagate or handle its result, updating the handlePhaseTransition signature
and any callers as needed; if you intentionally want fire-and-forget, instead
append a .catch(...) to the pauseAgent(...) call to log or handle errors so
failures aren't swallowed (reference functions: handlePhaseTransition and
pauseAgent).
Summary
ResourceExhaustionServicefollowing establishedDriftDetectionServicepatternFeatures
normal→warning→intervention→terminationNew Files
src/monitoring/resource-exhaustion-service.ts- Core service (622 lines)tests/unit/resource-exhaustion.test.ts- Unit tests (1251 lines, 70 tests)Modified Files
src/types.ts- New type definitionssrc/utils/config.ts- Zod schemas for configurationsrc/memory/sqlite-store.ts- Database tables and CRUD methodssrc/agents/spawner.ts- Integration with agent lifecyclesrc/agents/index.ts- New exports (pauseAgent, resumeAgent, isAgentPaused)src/integrations/slack.ts- Notification methodssrc/monitoring/metrics.ts- Prometheus metricssrc/web/routes/agents.ts- REST API endpointssrc/web/routes/system.ts- System resources endpointREADME.md,docs/API.md,docs/ARCHITECTURE.md,docs/DATA.md- DocumentationTest plan
npm run build)npm run lint)resourceExhaustion.enabled: true/api/v1/agents/:id/resourcesendpointCloses #4
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.