
feat: implement resource exhaustion triggers for agent monitoring#16

Merged
blackms merged 1 commit into
mainfrom
feat/resource-exhaustion-monitoring
Jan 27, 2026


@blackms blackms commented Jan 27, 2026

Summary

  • Implement per-agent resource tracking to detect and prevent runaway agents consuming excessive resources without producing meaningful deliverables
  • Add ResourceExhaustionService following established DriftDetectionService pattern
  • Database persistence for crash recovery with in-memory caching for performance
  • Full REST API for resource monitoring and agent control (pause/resume)
  • Comprehensive documentation sync across README, API.md, ARCHITECTURE.md, and DATA.md

Features

  • Per-Agent Tracking: Track files accessed, API calls, subtasks spawned, tokens consumed
  • Phase Progression: normal → warning → intervention → termination
  • Configurable Thresholds: Set limits for each resource type
  • Pause/Resume Control: Automatically pause agents exceeding thresholds
  • Deliverable Checkpoints: Reset time-based tracking when agents produce results
  • Slack Notifications: Alert on warnings, interventions, and terminations
  • Prometheus Metrics: Full observability with counters, gauges, histograms

New Files

  • src/monitoring/resource-exhaustion-service.ts - Core service (622 lines)
  • tests/unit/resource-exhaustion.test.ts - Unit tests (1251 lines, 70 tests)

Modified Files

  • src/types.ts - New type definitions
  • src/utils/config.ts - Zod schemas for configuration
  • src/memory/sqlite-store.ts - Database tables and CRUD methods
  • src/agents/spawner.ts - Integration with agent lifecycle
  • src/agents/index.ts - New exports (pauseAgent, resumeAgent, isAgentPaused)
  • src/integrations/slack.ts - Notification methods
  • src/monitoring/metrics.ts - Prometheus metrics
  • src/web/routes/agents.ts - REST API endpoints
  • src/web/routes/system.ts - System resources endpoint
  • README.md, docs/API.md, docs/ARCHITECTURE.md, docs/DATA.md - Documentation

Test plan

  • All 70 new unit tests pass
  • 100% line coverage, 95.53% branch coverage
  • Build passes (npm run build)
  • Lint passes (npm run lint)
  • Manual testing with resourceExhaustion.enabled: true
  • Verify Slack notifications (if configured)
  • Verify /api/v1/agents/:id/resources endpoint

Closes #4

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Resource Exhaustion Monitoring: Track per-agent resource usage with configurable thresholds and automatic phase transitions (normal, warning, intervention, termination).
    • Agent Control: Pause/resume agents and record deliverable checkpoints to reset metrics and prevent resource exhaustion.
    • System Dashboard: New endpoints to view real-time resource metrics and enforcement status across all agents.
    • Slack Notifications: Get alerted on resource warnings and interventions.
  • Documentation

    • Added comprehensive resource exhaustion configuration and monitoring guides.
    • Updated system architecture and data models to reflect monitoring capabilities.
    • Added new API endpoints documentation for resource management.


Implement per-agent resource tracking to detect and prevent runaway agents
that consume excessive resources without producing meaningful deliverables.

## Features

- **Per-Agent Tracking**: Track files accessed, API calls, subtasks spawned, tokens consumed
- **Phase Progression**: `normal` → `warning` → `intervention` → `termination`
- **Configurable Thresholds**: Set limits for each resource type
- **Pause/Resume Control**: Automatically pause agents exceeding thresholds
- **Deliverable Checkpoints**: Reset time-based tracking when agents produce results
- **Slack Notifications**: Alert on warnings, interventions, and terminations
- **Prometheus Metrics**: Full observability with counters, gauges, histograms
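
A minimal sketch of how a deliverable checkpoint could reset the time-based tracking: idle time is measured from the last deliverable when one exists, otherwise from agent start. The field names `startedAt` and `lastDeliverableAt` appear in this PR's metrics type, but the `TimeTracking` interface and the `timeWithoutDeliverableMs` helper are hypothetical and may not match the service's internals.

```typescript
// Hypothetical subset of the per-agent metrics; only the two timestamps matter here.
interface TimeTracking {
  startedAt: Date;
  lastDeliverableAt: Date | null;
}

// Elapsed time since the most recent deliverable, or since start if none exists yet.
function timeWithoutDeliverableMs(m: TimeTracking, now: Date): number {
  const since = m.lastDeliverableAt ?? m.startedAt;
  return now.getTime() - since.getTime();
}

const tracking: TimeTracking = { startedAt: new Date(0), lastDeliverableAt: null };

// 20 minutes after start, with no deliverable recorded yet.
const beforeCheckpoint = timeWithoutDeliverableMs(tracking, new Date(20 * 60000));

// A checkpoint recorded at the 20 minute mark restarts the clock from that point.
tracking.lastDeliverableAt = new Date(20 * 60000);
const afterCheckpoint = timeWithoutDeliverableMs(tracking, new Date(25 * 60000));

console.log(beforeCheckpoint, afterCheckpoint);
```

Under this assumption, only the time-based measurement restarts; counters such as files accessed or tokens consumed are left alone.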

## Implementation

- New `ResourceExhaustionService` following `DriftDetectionService` pattern
- Database tables: `agent_resource_metrics`, `agent_deliverable_checkpoints`, `resource_exhaustion_events`
- REST API endpoints for resource monitoring and control
- Integration with spawner for automatic tracking
- 70 unit tests with 100% line coverage, 95.53% branch coverage

## Configuration

```json
{
  "resourceExhaustion": {
    "enabled": true,
    "thresholds": {
      "maxFilesAccessed": 50,
      "maxApiCalls": 100,
      "maxSubtasksSpawned": 20,
      "maxTimeWithoutDeliverableMs": 1800000,
      "maxTokensConsumed": 500000
    },
    "warningThresholdPercent": 0.7,
    "autoTerminate": false,
    "pauseOnIntervention": true
  }
}
```
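
The configuration above can be read as a per-resource ratio check. The sketch below is an illustration under stated assumptions, not the service's actual logic: it assumes a usage-to-limit ratio of at least `warningThresholdPercent` means `warning` and a ratio of at least 1 means `intervention`. The `classify` function and `Usage` type are hypothetical, and termination is deliberately left out because the PR text does not say when it fires.

```typescript
type Phase = "normal" | "warning" | "intervention";

// Threshold keys copied from the configuration example.
interface Thresholds {
  maxFilesAccessed: number;
  maxApiCalls: number;
  maxSubtasksSpawned: number;
  maxTimeWithoutDeliverableMs: number;
  maxTokensConsumed: number;
}

// Hypothetical: current usage keyed by the threshold it is compared against.
type Usage = Record<keyof Thresholds, number>;

// Assumed rule: ratio >= 1 is intervention, ratio >= warningPercent is warning.
function classify(
  usage: Usage,
  limits: Thresholds,
  warningPercent: number
): { phase: Phase; triggeredBy: keyof Thresholds | null } {
  let phase: Phase = "normal";
  let triggeredBy: keyof Thresholds | null = null;
  for (const key of Object.keys(limits) as Array<keyof Thresholds>) {
    const ratio = usage[key] / limits[key];
    if (ratio >= 1) {
      return { phase: "intervention", triggeredBy: key };
    }
    if (ratio >= warningPercent && phase === "normal") {
      phase = "warning";
      triggeredBy = key;
    }
  }
  return { phase, triggeredBy };
}

const limits: Thresholds = {
  maxFilesAccessed: 50,
  maxApiCalls: 100,
  maxSubtasksSpawned: 20,
  maxTimeWithoutDeliverableMs: 1800000,
  maxTokensConsumed: 500000,
};

const usage: Usage = {
  maxFilesAccessed: 10,
  maxApiCalls: 75,
  maxSubtasksSpawned: 2,
  maxTimeWithoutDeliverableMs: 60000,
  maxTokensConsumed: 1000,
};

console.log(classify(usage, limits, 0.7));
```

With these sample numbers, 75 of 100 API calls is a ratio of 0.75, which crosses the 0.7 warning level while every other resource stays below it.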

Closes #4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai Bot commented Jan 27, 2026

Walkthrough

This PR introduces a comprehensive resource exhaustion monitoring system that tracks per-agent metrics (files accessed, API calls, tokens consumed, time without deliverables) with configurable thresholds triggering warning/intervention/termination phases. It includes lifecycle management (pause/resume), deliverable checkpointing, Slack notifications, and REST APIs for monitoring and control.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation**: `README.md`, `docs/API.md`, `docs/ARCHITECTURE.md`, `docs/DATA.md` | Added comprehensive documentation for resource exhaustion monitoring: feature overview, REST API endpoints, architecture diagrams reflecting new Monitoring Layer (41 tools total), and SQL schema/TypeScript interfaces for resource metrics, deliverable checkpoints, and exhaustion events. |
| **Type Definitions & Configuration**: `src/types.ts`, `src/utils/config.ts` | Introduced `ResourceExhaustionConfig` interface and related types (Phase, Action, DeliverableType, Thresholds, Metrics, Event, Checkpoint). Extended `AgentStackConfig` and `SlackConfig` with new optional fields. Added corresponding Zod schemas for validation with sensible defaults. |
| **Core Service Implementation**: `src/monitoring/resource-exhaustion-service.ts`, `src/monitoring/metrics.ts` | Implemented `ResourceExhaustionService` with per-agent tracking, phase management (normal → warning → intervention → termination), auto-pause/resume, deliverable checkpoint recording, persistent event logging, and background monitoring loop. Added three new Prometheus counters for exhaustion events and four histograms for agent activity. |
| **Agent Lifecycle Integration**: `src/agents/spawner.ts`, `src/agents/index.ts` | Added resource tracking initialization on agent spawn/persist and cleanup on stop. Integrated resource checks before agent execution with pause/intervention handling. Exported `pauseAgent`, `resumeAgent`, `isAgentPaused` public APIs. |
| **Persistence Layer**: `src/memory/sqlite-store.ts` | Added 11 new methods supporting agent resource metrics (save/get/list/delete), deliverable checkpoints (create/get/list/delete), and exhaustion events (save/get/summarize) with corresponding SQL tables and row-to-object mappers. |
| **External Integrations**: `src/integrations/slack.ts` | Added three new Slack notification methods: `sendResourceWarning`, `sendResourceIntervention`, `sendResourceTermination` with formatted block messages. |
| **REST API Endpoints**: `src/web/routes/agents.ts`, `src/web/routes/system.ts` | Added four agent-specific endpoints (`GET /resources`, `POST /deliverable`, `POST /pause`, `POST /resume`) and one system endpoint (`GET /api/v1/system/resources`) with resource exhaustion config checks, error handling, and ISO timestamp formatting. |
| **Test Suite**: `tests/unit/resource-exhaustion.test.ts` | Added 1251 lines of comprehensive Vitest coverage for `ResourceExhaustionService` and `SQLiteStore`: phase transitions, pause/resume, deliverable tracking, threshold evaluation, database persistence, singleton management, and end-to-end integration scenarios. |
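
For orientation, the endpoints listed above could be addressed as in the following sketch. It only builds request descriptors and performs no network I/O. The paths come from this PR; the `ApiRequest` type, the helper names, and the request body fields `reason` and `type` are assumptions inferred from the review and may not match the real handlers.

```typescript
// Hypothetical request descriptor; nothing here performs network I/O.
interface ApiRequest {
  method: "GET" | "POST";
  path: string;
  body?: Record<string, unknown>;
}

// Builds /api/v1/agents/:id/<action>, escaping the agent id.
const agentPath = (agentId: string, action: string): string =>
  `/api/v1/agents/${encodeURIComponent(agentId)}/${action}`;

const getAgentResources = (agentId: string): ApiRequest => ({
  method: "GET",
  path: agentPath(agentId, "resources"),
});

// Assumed body field: reason.
const pauseRequest = (agentId: string, reason: string): ApiRequest => ({
  method: "POST",
  path: agentPath(agentId, "pause"),
  body: { reason },
});

const resumeRequest = (agentId: string): ApiRequest => ({
  method: "POST",
  path: agentPath(agentId, "resume"),
});

// Assumed body field: type (the deliverable type validated by the route).
const deliverableRequest = (agentId: string, type: string): ApiRequest => ({
  method: "POST",
  path: agentPath(agentId, "deliverable"),
  body: { type },
});

const getSystemResources = (): ApiRequest => ({
  method: "GET",
  path: "/api/v1/system/resources",
});

console.log(pauseRequest("agent-1", "manual pause"));
```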

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent Spawner
    participant Service as Resource<br/>Exhaustion Service
    participant Metrics as Prometheus<br/>Metrics
    participant Store as SQLiteStore
    participant Slack as Slack<br/>Integration

    Agent->>Service: initializeAgent(agentId)
    Service->>Store: Load existing metrics from DB
    Service->>Service: Create in-memory tracking

    Agent->>Service: recordApiCall(tokens)
    Service->>Metrics: Update token counters
    Service->>Store: Persist metrics

    Agent->>Service: evaluateAgent()
    Service->>Service: Compare metrics vs thresholds
    alt Threshold exceeded
        Service->>Service: Transition phase (normal→warning)
        Service->>Metrics: Update phase gauge
        Service->>Store: Save exhaustion event
        Service->>Slack: sendResourceWarning()
        Slack-->>Slack: Format and send message
    else Severe threshold exceeded
        Service->>Service: Transition to intervention
        Service->>Service: pauseAgent() if configured
        Service->>Slack: sendResourceIntervention()
    end

    Agent->>Service: recordDeliverable(checkpoint)
    Service->>Store: createDeliverableCheckpoint()
    Service->>Service: Reset to normal phase
sequenceDiagram
    participant Client as REST Client
    participant Routes as Agent Routes
    participant Service as Resource<br/>Exhaustion Service
    participant Store as SQLiteStore
    participant Memory as Memory<br/>Manager

    Client->>Routes: POST /api/v1/agents/:id/pause
    Routes->>Routes: Validate resourceExhaustion.enabled
    Routes->>Service: pauseAgent(agentId, reason)
    Service->>Store: Update pause state in metrics
    Service->>Service: Register pauseAgent callback
    Routes->>Routes: Emit websocket event
    Routes-->>Client: { paused: true, timestamp }

    Client->>Routes: GET /api/v1/agents/:id/resources
    Routes->>Service: getResourceMetrics(agentId)
    Service->>Store: getAgentResourceMetrics(agentId)
    Store-->>Service: Return metrics object
    Routes->>Memory: Load agent data for context
    Routes-->>Client: { filesAccessed, apiCalls, tokens, phase }

    Client->>Routes: POST /api/v1/agents/:id/deliverable
    Routes->>Routes: Validate type, config enabled
    Routes->>Service: recordDeliverable(checkpoint)
    Service->>Store: createDeliverableCheckpoint()
    Service->>Service: Update lastDeliverableAt, reset phase
    Routes-->>Client: { id, agentId, type, createdAt }

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #2: Extends monitoring surface (Prometheus metrics, health checks) and touches SQLiteStore persistence and Slack integration similarly; both involve new monitoring infrastructure and agent instrumentation.
  • PR #10: Modifies src/agents/spawner.ts lifecycle hooks (spawn/stop) for agent identity/deactivation; overlaps with this PR's resource tracking initialization in the same functions.

Poem

🐰 Hops of monitoring, bounds of care,
Tracking tokens through the air,
When resources run deep and wide,
We pause the agents, side by side,
Deliverables light the way,
No more runaway loops today! 🥕✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: implement resource exhaustion triggers for agent monitoring' directly and clearly summarizes the main change: implementing resource exhaustion detection and monitoring for agents. |
| Linked Issues check | ✅ Passed | The PR implementation comprehensively addresses all required capabilities from issue #4: per-agent metrics tracking (files, API calls, subtasks, time, tokens), configurable thresholds with defaults, phase progression (warning/intervention/termination), deliverable checkpoints that reset time tracking, Slack notifications, and Prometheus metrics. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing resource exhaustion monitoring as specified in issue #4. Documentation updates, configuration schemas, API endpoints, database persistence, and tests all support the core feature without introducing unrelated modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/monitoring/resource-exhaustion-service.ts`:
- Around line 260-265: The code uses a non-null assertion on triggeredBy when
calling handlePhaseTransition (see variables triggeredBy, newPhase,
metrics.phase, handlePhaseTransition), which can lead to passing null/undefined
in edge cases; update the conditional logic to ensure triggeredBy is set (or
provide a safe default) before invoking handlePhaseTransition, e.g., only call
handlePhaseTransition when newPhase !== metrics.phase AND triggeredBy is
non-null (or compute a fallback reason), then continue to assign metrics.phase =
newPhase and call updateMetrics(agentId, metrics) as before.
- Around line 337-340: The isAgentPaused method incorrectly returns true when
metrics is undefined because metrics?.pausedAt !== null yields true for
undefined; update isAgentPaused to first ensure metrics exists and only then
check pausedAt (e.g., use an explicit existence check like metrics != null &&
metrics.pausedAt !== null, or metrics?.pausedAt != null) so that non-existent
agents return false; refer to isAgentPaused, this.metricsCache, and the pausedAt
field when making the change.
- Around line 539-542: The call to pauseAgent inside handlePhaseTransition is
not awaited even though pauseAgent returns Promise<boolean>, so make
handlePhaseTransition async and await this.pauseAgent(agentId, `Resource
threshold exceeded: ${triggeredBy}`) and propagate or handle its result,
updating the handlePhaseTransition signature and any callers as needed; if you
intentionally want fire-and-forget, instead append a .catch(...) to the
pauseAgent(...) call to log or handle errors so failures aren't swallowed
(reference functions: handlePhaseTransition and pauseAgent).
🧹 Nitpick comments (12)
src/web/routes/agents.ts (3)

346-355: Inconsistent error response pattern.

Lines 350 and 354 handle errors differently: line 350 uses sendJson(res, { error: ... }, 500) while line 354 uses sendError(res, 500, message). The sendError function produces a standardized response with {success: false, error: message}, whereas sendJson with an error object produces {success: true, data: {error: ...}} which is misleading for error responses.

Suggested fix for consistency
     try {
       const success = await pauseAgent(agentId, reason);
       if (success) {
         agentEvents.emit('agent:paused', { id: agentId, reason });
         sendJson(res, { paused: true, reason });
       } else {
-        sendJson(res, { error: 'Failed to pause agent' }, 500);
+        sendError(res, 500, 'Failed to pause agent');
       }
     } catch (error) {

376-382: Inconsistent error response pattern.

Same issue here: line 381 uses sendJson(res, { error: ... }, 400) instead of sendError. This produces {success: true, data: {error: ...}} which incorrectly indicates success.

Suggested fix for consistency
     try {
       const success = resumeAgent(agentId);
       if (success) {
         agentEvents.emit('agent:resumed', { id: agentId });
         sendJson(res, { resumed: true });
       } else {
-        sendJson(res, { error: 'Agent was not paused or failed to resume' }, 400);
+        sendError(res, 400, 'Agent was not paused or failed to resume');
       }
     } catch (error) {

242-245: Consider using sendError for all error responses in new endpoints.

Multiple places in the new endpoints use sendJson(res, { error: ... }, statusCode) which produces responses like {success: true, data: {error: ...}, timestamp: ...}. This is semantically incorrect for error conditions. Consider using sendError consistently for all error responses to produce proper {success: false, error: ...} responses.

Affected lines: 243-244, 256-257, 296-297, 340-341, 371-372.

Also applies to: 255-258, 295-298, 339-342, 370-373
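
To make the difference concrete, here is a small sketch of the two response envelopes as described in the comments above. The `successEnvelope` and `errorEnvelope` helpers are hypothetical stand-ins written for illustration; they are not the repository's `sendJson` and `sendError`, whose exact output may differ.

```typescript
// Hypothetical envelope types matching the shapes quoted in the review.
interface SuccessEnvelope<T> {
  success: true;
  data: T;
  timestamp: string;
}

interface ErrorEnvelope {
  success: false;
  error: string;
}

// Stand-in for what a sendJson-style helper wraps around its payload.
function successEnvelope<T>(data: T): SuccessEnvelope<T> {
  return { success: true, data, timestamp: new Date().toISOString() };
}

// Stand-in for what a sendError-style helper produces.
function errorEnvelope(message: string): ErrorEnvelope {
  return { success: false, error: message };
}

// Passing an error object through the success path still reports success: true.
const misleading = successEnvelope({ error: "Failed to pause agent" });
const correct = errorEnvelope("Failed to pause agent");

console.log(misleading.success, correct.success);
```

A client that branches on `success` would treat the first response as a successful call even though the HTTP status is 500, which is the inconsistency flagged here.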

tests/unit/resource-exhaustion.test.ts (3)

746-748: Weak assertion does not verify actual behavior.

The assertion expect(true).toBe(true) only confirms no exception was thrown. Consider verifying observable state, such as confirming the metrics remain unchanged or the agent wasn't inadvertently created.

♻️ Suggested improvement
-      // No crash expected
-      expect(true).toBe(true);
+      // Metrics should remain unchanged or null since cache was cleared
+      const metrics = service.getAgentMetrics('agent-1');
+      // Verify the operation was a no-op (metrics still null or unchanged)
+      expect(metrics).toBeNull();

619-639: Time-based test relies on setTimeout which can be flaky in CI.

This test uses real time delays (150ms) to verify threshold detection. While acceptable for unit tests, consider using fake timers (vi.useFakeTimers()) for more reliable and faster tests.

♻️ Suggested improvement using fake timers
 it('should detect time without deliverable threshold', () => {
+  vi.useFakeTimers();
   const service = new ResourceExhaustionService(
     store,
     createConfig({
       maxTimeWithoutDeliverableMs: 100, // Very short for testing
       warningThresholdPercent: 0.5,
       pauseOnIntervention: false,
     })
   );
   service.initializeAgent('agent-1', 'coder');

-  // Wait for threshold to be exceeded
-  return new Promise<void>((resolve) => {
-    setTimeout(() => {
-      const phase = service.evaluateAgent('agent-1');
-      expect(phase).toBe('intervention');
-      resolve();
-    }, 150);
-  });
+  // Advance time past threshold
+  vi.advanceTimersByTime(150);
+  
+  const phase = service.evaluateAgent('agent-1');
+  expect(phase).toBe('intervention');
+  
+  vi.useRealTimers();
 });

1148-1167: Time-based test for checkpoint ordering relies on real delay.

Similar to the earlier time-based test, this uses a 10ms delay to ensure different timestamps. Consider using fake timers or manually setting timestamps in the test data for deterministic behavior.
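
One way to remove the real delay is to inject the clock, so each checkpoint gets a distinct timestamp without waiting. The sketch below is illustrative only: `Checkpoint`, `CheckpointLog`, and the newest-first ordering are assumptions, not the shapes or ordering used by `SQLiteStore`.

```typescript
// Hypothetical checkpoint shape; the real type lives in src/types.ts and may differ.
interface Checkpoint {
  id: string;
  createdAt: Date;
}

// Hypothetical in-memory log that takes its clock as a constructor argument.
class CheckpointLog {
  private items: Checkpoint[] = [];
  private now: () => Date;

  constructor(now: () => Date) {
    this.now = now;
  }

  record(id: string): void {
    this.items.push({ id, createdAt: this.now() });
  }

  // Assumed ordering: most recent checkpoint first.
  newestFirst(): Checkpoint[] {
    return [...this.items].sort(
      (a, b) => b.createdAt.getTime() - a.createdAt.getTime()
    );
  }
}

// Deterministic clock: every call advances by one second.
let tick = 0;
const checkpointLog = new CheckpointLog(() => new Date(1000 * ++tick));

checkpointLog.record("first");
checkpointLog.record("second");

const order = checkpointLog.newestFirst().map((c) => c.id);
console.log(order);
```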

src/monitoring/resource-exhaustion-service.ts (4)

345-357: terminateAgent hardcodes triggeredBy as maxTimeWithoutDeliverableMs.

The termination may be triggered for various reasons (e.g., manual termination, different threshold breaches), but the event always logs maxTimeWithoutDeliverableMs as the trigger. Consider accepting triggeredBy as a parameter.

♻️ Suggested improvement
-terminateAgent(agentId: string, reason: string): boolean {
+terminateAgent(
+  agentId: string,
+  reason: string,
+  triggeredBy: keyof ResourceThresholds = 'maxTimeWithoutDeliverableMs'
+): boolean {
   const metrics = this.metricsCache.get(agentId);
   if (!metrics) return false;

   // Trigger termination phase transition
-  this.handlePhaseTransition(agentId, metrics.phase, 'termination', 'maxTimeWithoutDeliverableMs');
+  this.handlePhaseTransition(agentId, metrics.phase, 'termination', triggeredBy);
   metrics.phase = 'termination';
   ...

584-591: Config comparison uses JSON.stringify for thresholds.

While functional, JSON.stringify comparison is sensitive to property ordering and can be expensive for frequent calls. Consider a dedicated comparison function for clarity and reliability.

♻️ Suggested improvement
function thresholdsEqual(a: ResourceThresholds, b: ResourceThresholds): boolean {
  return (
    a.maxFilesAccessed === b.maxFilesAccessed &&
    a.maxApiCalls === b.maxApiCalls &&
    a.maxSubtasksSpawned === b.maxSubtasksSpawned &&
    a.maxTimeWithoutDeliverableMs === b.maxTimeWithoutDeliverableMs &&
    a.maxTokensConsumed === b.maxTokensConsumed
  );
}
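
The ordering sensitivity is easy to reproduce. In this short sketch, two objects with identical values but different key insertion order serialize differently, while a field-by-field check treats them as equal. The object literals are illustrative and use only two of the threshold keys.

```typescript
// Same values, different key insertion order.
const first = { maxApiCalls: 100, maxFilesAccessed: 50 };
const second = { maxFilesAccessed: 50, maxApiCalls: 100 };

// JSON.stringify emits keys in insertion order, so the strings differ.
const sameByJson = JSON.stringify(first) === JSON.stringify(second);

// Comparing field by field ignores ordering.
const sameByField =
  first.maxApiCalls === second.maxApiCalls &&
  first.maxFilesAccessed === second.maxFilesAccessed;

console.log(sameByJson, sameByField);
```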

362-366: waitForResume creates a promise that may never resolve.

If waitForResume is called but the agent is never paused or resumed, the promise remains pending indefinitely. Consider adding a timeout or cancellation mechanism for robustness.
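
A timeout could be layered on with `Promise.race`, as in this sketch. The `withTimeout` helper is hypothetical and not part of the PR; it rejects if the wrapped promise has not settled within `ms` milliseconds and clears its timer either way.

```typescript
// Hypothetical helper: rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Stand-in for a waitForResume() call on an agent that is never resumed.
const neverResumes = new Promise<void>(() => {});

withTimeout(neverResumes, 20).catch((err: Error) => {
  console.log(err.message);
});
```

The caller then decides what a timeout means, for example treating it as "still paused" instead of waiting forever.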


414-424: Background monitoring interval should use unref() to not block Node.js exit.

If this service is running when the process wants to exit gracefully, the interval may keep the event loop alive. Consider calling unref() on the interval.

♻️ Suggested improvement
 start(): void {
   if (!this.isEnabled() || this.checkInterval) return;

   this.checkInterval = setInterval(() => {
     this.checkAllAgents();
   }, this.config.checkIntervalMs);
+  
+  // Don't block process exit
+  this.checkInterval.unref();

   log.info('Resource exhaustion monitoring started', {
     checkIntervalMs: this.config.checkIntervalMs,
   });
 }
src/memory/sqlite-store.ts (2)

2425-2462: Consider using SQL aggregation for better performance.

The getResourceExhaustionMetrics method fetches all events and aggregates in memory. For large event tables, this could be slow and memory-intensive. Consider using SQL GROUP BY and COUNT for better performance.

♻️ SQL-based aggregation approach
getResourceExhaustionMetrics(since?: Date): {
  totalEvents: number;
  warningCount: number;
  interventionCount: number;
  terminationCount: number;
  byAgent: Record<string, number>;
} {
  let baseCondition = '1=1';
  const params: number[] = [];

  if (since) {
    baseCondition = 'created_at >= ?';
    params.push(since.getTime());
  }

  // Get counts by action_taken
  const countQuery = `
    SELECT 
      COUNT(*) as total,
      COALESCE(SUM(CASE WHEN action_taken = 'warned' THEN 1 ELSE 0 END), 0) as warning_count,
      COALESCE(SUM(CASE WHEN action_taken = 'paused' THEN 1 ELSE 0 END), 0) as intervention_count,
      COALESCE(SUM(CASE WHEN action_taken = 'terminated' THEN 1 ELSE 0 END), 0) as termination_count
    FROM resource_exhaustion_events
    WHERE ${baseCondition}
  `;
  const counts = this.db.prepare(countQuery).get(...params) as {
    total: number;
    warning_count: number;
    intervention_count: number;
    termination_count: number;
  };

  // Get counts by agent
  const byAgentQuery = `
    SELECT agent_id, COUNT(*) as count
    FROM resource_exhaustion_events
    WHERE ${baseCondition}
    GROUP BY agent_id
  `;
  const agentRows = this.db.prepare(byAgentQuery).all(...params) as Array<{
    agent_id: string;
    count: number;
  }>;

  const byAgent: Record<string, number> = {};
  for (const row of agentRows) {
    byAgent[row.agent_id] = row.count;
  }

  return {
    totalEvents: counts.total,
    warningCount: counts.warning_count,
    interventionCount: counts.intervention_count,
    terminationCount: counts.termination_count,
    byAgent,
  };
}

2493-2516: Mutating the parsed JSON object could cause subtle issues.

The transformer mutates metrics after parsing from JSON. While functional, this pattern can be confusing. Consider creating a new object with the converted dates instead.

♻️ Immutable transformation approach
 private rowToResourceExhaustionEvent(row: ResourceExhaustionEventRow): ResourceExhaustionEvent {
-  const metrics = JSON.parse(row.metrics) as AgentResourceMetrics;
-  // Convert date strings back to Date objects
-  metrics.startedAt = new Date(metrics.startedAt);
-  metrics.lastActivityAt = new Date(metrics.lastActivityAt);
-  if (metrics.lastDeliverableAt) {
-    metrics.lastDeliverableAt = new Date(metrics.lastDeliverableAt);
-  }
-  if (metrics.pausedAt) {
-    metrics.pausedAt = new Date(metrics.pausedAt);
-  }
+  const rawMetrics = JSON.parse(row.metrics) as Record<string, unknown>;
+  const metrics: AgentResourceMetrics = {
+    agentId: rawMetrics.agentId as string,
+    filesRead: rawMetrics.filesRead as number,
+    filesWritten: rawMetrics.filesWritten as number,
+    filesModified: rawMetrics.filesModified as number,
+    apiCallsCount: rawMetrics.apiCallsCount as number,
+    subtasksSpawned: rawMetrics.subtasksSpawned as number,
+    tokensConsumed: rawMetrics.tokensConsumed as number,
+    startedAt: new Date(rawMetrics.startedAt as string | number),
+    lastDeliverableAt: rawMetrics.lastDeliverableAt 
+      ? new Date(rawMetrics.lastDeliverableAt as string | number) 
+      : null,
+    lastActivityAt: new Date(rawMetrics.lastActivityAt as string | number),
+    phase: rawMetrics.phase as ResourceExhaustionPhase,
+    pausedAt: rawMetrics.pausedAt 
+      ? new Date(rawMetrics.pausedAt as string | number) 
+      : null,
+    pauseReason: rawMetrics.pauseReason as string | null,
+  };

   return {
     id: row.id,
     ...
   };
 }

Comment on lines +260 to +265
// Handle phase transition
if (newPhase !== metrics.phase) {
this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy!);
metrics.phase = newPhase;
this.updateMetrics(agentId, metrics);
}

⚠️ Potential issue | 🟡 Minor

Non-null assertion on triggeredBy may cause issues.

If newPhase stays 'normal' throughout the loop (all ratios below warning threshold), triggeredBy remains null, yet the code calls handlePhaseTransition with triggeredBy!. While the if (newPhase !== metrics.phase) condition may prevent this in normal→normal cases, edge cases (like phase already being warning/intervention) could trigger unexpected behavior.

🛠️ Suggested defensive fix
     // Handle phase transition
-    if (newPhase !== metrics.phase) {
-      this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy!);
+    if (newPhase !== metrics.phase && triggeredBy !== null) {
+      this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy);
       metrics.phase = newPhase;
       this.updateMetrics(agentId, metrics);
     }

Comment on lines +337 to +340
isAgentPaused(agentId: string): boolean {
const metrics = this.metricsCache.get(agentId);
return metrics?.pausedAt !== null;
}

⚠️ Potential issue | 🟠 Major

isAgentPaused returns incorrect value when agent doesn't exist.

When metrics is undefined (agent not found), metrics?.pausedAt !== null evaluates to undefined !== null, which is true. This means isAgentPaused returns true for non-existent agents.

🐛 Proposed fix
 isAgentPaused(agentId: string): boolean {
   const metrics = this.metricsCache.get(agentId);
-  return metrics?.pausedAt !== null;
+  return metrics?.pausedAt != null;
 }

Or more explicitly:

 isAgentPaused(agentId: string): boolean {
   const metrics = this.metricsCache.get(agentId);
-  return metrics?.pausedAt !== null;
+  return metrics !== undefined && metrics.pausedAt !== null;
 }
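
The behaviour can be checked in isolation. This sketch uses a plain `Map` as a stand-in for `metricsCache` and a reduced, hypothetical `Metrics` type; it compares the strict check from the PR with the loose `!= null` check for a running, a paused, and an unknown agent.

```typescript
// Reduced, hypothetical metrics shape; only pausedAt matters for this check.
interface Metrics {
  pausedAt: Date | null;
}

const cache = new Map<string, Metrics>();
cache.set("running", { pausedAt: null });
cache.set("paused", { pausedAt: new Date(0) });

// Original check: for an unknown agent this is `undefined !== null`, which is true.
const strictCheck = (id: string): boolean => cache.get(id)?.pausedAt !== null;

// Loose check: `!= null` treats both null and undefined as "not paused".
const looseCheck = (id: string): boolean => cache.get(id)?.pausedAt != null;

for (const id of ["running", "paused", "missing"]) {
  console.log(id, strictCheck(id), looseCheck(id));
}
```

Both checks agree for agents present in the cache and differ only for the unknown id.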

Comment on lines +539 to +542
// Auto-pause if configured
if (this.config.pauseOnIntervention) {
this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
}

⚠️ Potential issue | 🟡 Minor

pauseAgent is async but not awaited.

pauseAgent returns a Promise<boolean> but is called without await inside handlePhaseTransition. This means the pause operation may not complete before the function returns, and any errors from pauseAgent won't be caught.

🛠️ Options to address

Option 1: Make handlePhaseTransition async and await the call:

-private handlePhaseTransition(
+private async handlePhaseTransition(
   ...
-): void {
+): Promise<void> {
   ...
       if (this.config.pauseOnIntervention) {
-        this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
+        await this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
       }

Option 2: If fire-and-forget is intentional, add a .catch() to handle potential errors:

       if (this.config.pauseOnIntervention) {
-        this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
+        this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`)
+          .catch(err => log.error('Failed to auto-pause agent', { agentId, error: err }));
       }

@blackms blackms merged commit becca2c into main Jan 27, 2026
6 checks passed
@blackms blackms deleted the feat/resource-exhaustion-monitoring branch January 27, 2026 20:42