feat: implement resource exhaustion triggers for agent monitoring by blackms · Pull Request #16 · blackms/aistack

blackms · 2026-01-27T19:42:37Z

Summary

Implement per-agent resource tracking to detect and prevent runaway agents consuming excessive resources without producing meaningful deliverables
Add ResourceExhaustionService following established DriftDetectionService pattern
Database persistence for crash recovery with in-memory caching for performance
Full REST API for resource monitoring and agent control (pause/resume)
Comprehensive documentation sync across README, API.md, ARCHITECTURE.md, and DATA.md

Features

Per-Agent Tracking: Track files accessed, API calls, subtasks spawned, tokens consumed
Phase Progression: normal → warning → intervention → termination
Configurable Thresholds: Set limits for each resource type
Pause/Resume Control: Automatically pause agents exceeding thresholds
Deliverable Checkpoints: Reset time-based tracking when agents produce results
Slack Notifications: Alert on warnings, interventions, and terminations
Prometheus Metrics: Full observability with counters, gauges, histograms
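
The phase progression can be illustrated with a minimal sketch. The function name, the rule that intervention starts at 100% of a limit, and the handling of non-positive limits are assumptions made for illustration; only the phase labels come from this description. Termination is left out because the description does not say what triggers it.

```typescript
// Hypothetical sketch; only the phase labels come from the PR description.
type Phase = 'normal' | 'warning' | 'intervention' | 'termination';

/**
 * Map a usage/limit pair to a phase.
 * Assumption: warning starts at `warningFraction` of the limit and
 * intervention starts once the limit itself is reached.
 */
function phaseForUsage(used: number, limit: number, warningFraction = 0.7): Phase {
  if (limit <= 0) return 'normal'; // assumption: a non-positive limit means "no limit"
  const ratio = used / limit;
  if (ratio >= 1) return 'intervention';
  if (ratio >= warningFraction) return 'warning';
  return 'normal';
}

const examples: Array<[number, number]> = [[10, 100], [70, 100], [150, 100]];
for (const [used, limit] of examples) {
  console.log(`${used}/${limit} -> ${phaseForUsage(used, limit)}`);
}
```

Keeping the mapping a pure function of usage and limit makes each threshold easy to test in isolation.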

New Files

src/monitoring/resource-exhaustion-service.ts - Core service (622 lines)
tests/unit/resource-exhaustion.test.ts - Unit tests (1251 lines, 70 tests)

Modified Files

src/types.ts - New type definitions
src/utils/config.ts - Zod schemas for configuration
src/memory/sqlite-store.ts - Database tables and CRUD methods
src/agents/spawner.ts - Integration with agent lifecycle
src/agents/index.ts - New exports (pauseAgent, resumeAgent, isAgentPaused)
src/integrations/slack.ts - Notification methods
src/monitoring/metrics.ts - Prometheus metrics
src/web/routes/agents.ts - REST API endpoints
src/web/routes/system.ts - System resources endpoint
README.md, docs/API.md, docs/ARCHITECTURE.md, docs/DATA.md - Documentation

Test plan

All 70 new unit tests pass
100% line coverage, 95.53% branch coverage
Build passes (npm run build)
Lint passes (npm run lint)
Manual testing with resourceExhaustion.enabled: true
Verify Slack notifications (if configured)
Verify /api/v1/agents/:id/resources endpoint

Closes #4

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Resource Exhaustion Monitoring: Track per-agent resource usage with configurable thresholds and automatic phase transitions (normal, warning, intervention, termination).
- Agent Control: Pause/resume agents and record deliverable checkpoints to reset metrics and prevent resource exhaustion.
- System Dashboard: New endpoints to view real-time resource metrics and enforcement status across all agents.
- Slack Notifications: Get alerted on resource warnings and interventions.
Documentation
- Added comprehensive resource exhaustion configuration and monitoring guides.
- Updated system architecture and data models to reflect monitoring capabilities.
- Added new API endpoints documentation for resource management.

Implement per-agent resource tracking to detect and prevent runaway agents that consume excessive resources without producing meaningful deliverables.

## Features

- **Per-Agent Tracking**: Track files accessed, API calls, subtasks spawned, tokens consumed
- **Phase Progression**: `normal` → `warning` → `intervention` → `termination`
- **Configurable Thresholds**: Set limits for each resource type
- **Pause/Resume Control**: Automatically pause agents exceeding thresholds
- **Deliverable Checkpoints**: Reset time-based tracking when agents produce results
- **Slack Notifications**: Alert on warnings, interventions, and terminations
- **Prometheus Metrics**: Full observability with counters, gauges, histograms

## Implementation

- New `ResourceExhaustionService` following `DriftDetectionService` pattern
- Database tables: `agent_resource_metrics`, `agent_deliverable_checkpoints`, `resource_exhaustion_events`
- REST API endpoints for resource monitoring and control
- Integration with spawner for automatic tracking
- 70 unit tests with 100% line coverage, 95.53% branch coverage

## Configuration

```json
{
  "resourceExhaustion": {
    "enabled": true,
    "thresholds": {
      "maxFilesAccessed": 50,
      "maxApiCalls": 100,
      "maxSubtasksSpawned": 20,
      "maxTimeWithoutDeliverableMs": 1800000,
      "maxTokensConsumed": 500000
    },
    "warningThresholdPercent": 0.7,
    "autoTerminate": false,
    "pauseOnIntervention": true
  }
}
```

Closes #4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
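
As a rough illustration of how this configuration might be typed and merged with partial overrides, here is a minimal sketch. The interface shapes mirror only the fields in the example JSON; `EXAMPLE_CONFIG`, `ConfigOverrides`, and `resolveConfig` are hypothetical names, and treating the example values as defaults is an assumption (the PR itself validates configuration with Zod schemas, which this sketch does not reproduce).

```typescript
interface ResourceThresholds {
  maxFilesAccessed: number;
  maxApiCalls: number;
  maxSubtasksSpawned: number;
  maxTimeWithoutDeliverableMs: number;
  maxTokensConsumed: number;
}

interface ResourceExhaustionConfig {
  enabled: boolean;
  thresholds: ResourceThresholds;
  warningThresholdPercent: number;
  autoTerminate: boolean;
  pauseOnIntervention: boolean;
}

// Values copied from the example configuration; using them as defaults is an assumption.
const EXAMPLE_CONFIG: ResourceExhaustionConfig = {
  enabled: true,
  thresholds: {
    maxFilesAccessed: 50,
    maxApiCalls: 100,
    maxSubtasksSpawned: 20,
    maxTimeWithoutDeliverableMs: 1_800_000, // 30 minutes
    maxTokensConsumed: 500_000,
  },
  warningThresholdPercent: 0.7,
  autoTerminate: false,
  pauseOnIntervention: true,
};

// Hypothetical override shape: every field optional, thresholds merged key by key.
type ConfigOverrides = Partial<Omit<ResourceExhaustionConfig, 'thresholds'>> & {
  thresholds?: Partial<ResourceThresholds>;
};

function resolveConfig(overrides: ConfigOverrides = {}): ResourceExhaustionConfig {
  return {
    ...EXAMPLE_CONFIG,
    ...overrides,
    thresholds: { ...EXAMPLE_CONFIG.thresholds, ...overrides.thresholds },
  };
}

console.log(resolveConfig({ thresholds: { maxApiCalls: 10 } }).thresholds);
```

Merging `thresholds` separately means an override of one limit does not discard the other four.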

coderabbitai · 2026-01-27T19:42:58Z

Walkthrough

This PR introduces a comprehensive resource exhaustion monitoring system that tracks per-agent metrics (files accessed, API calls, tokens consumed, time without deliverables) with configurable thresholds triggering warning/intervention/termination phases. It includes lifecycle management (pause/resume), deliverable checkpointing, Slack notifications, and REST APIs for monitoring and control.

Changes

Cohort / File(s)	Summary
Documentation `README.md`, `docs/API.md`, `docs/ARCHITECTURE.md`, `docs/DATA.md`	Added comprehensive documentation for resource exhaustion monitoring: feature overview, REST API endpoints, architecture diagrams reflecting new Monitoring Layer (41 tools total), and SQL schema/TypeScript interfaces for resource metrics, deliverable checkpoints, and exhaustion events.
Type Definitions & Configuration `src/types.ts`, `src/utils/config.ts`	Introduced ResourceExhaustionConfig interface and related types (Phase, Action, DeliverableType, Thresholds, Metrics, Event, Checkpoint). Extended AgentStackConfig and SlackConfig with new optional fields. Added corresponding Zod schemas for validation with sensible defaults.
Core Service Implementation `src/monitoring/resource-exhaustion-service.ts`, `src/monitoring/metrics.ts`	Implemented ResourceExhaustionService with per-agent tracking, phase management (normal→warning→intervention→termination), auto-pause/resume, deliverable checkpoint recording, persistent event logging, and background monitoring loop. Added three new Prometheus counters for exhaustion events and four histograms for agent activity.
Agent Lifecycle Integration `src/agents/spawner.ts`, `src/agents/index.ts`	Added resource tracking initialization on agent spawn/persist and cleanup on stop. Integrated resource checks before agent execution with pause/intervention handling. Exported pauseAgent, resumeAgent, isAgentPaused public APIs.
Persistence Layer `src/memory/sqlite-store.ts`	Added 11 new methods supporting agent resource metrics (save/get/list/delete), deliverable checkpoints (create/get/list/delete), and exhaustion events (save/get/summarize) with corresponding SQL tables and row-to-object mappers.
External Integrations `src/integrations/slack.ts`	Added three new Slack notification methods: sendResourceWarning, sendResourceIntervention, sendResourceTermination with formatted block messages.
REST API Endpoints `src/web/routes/agents.ts`, `src/web/routes/system.ts`	Added four agent-specific endpoints (GET /resources, POST /deliverable, POST /pause, POST /resume) and one system endpoint (GET /api/v1/system/resources) with resource exhaustion config checks, error handling, and ISO timestamp formatting.
Test Suite `tests/unit/resource-exhaustion.test.ts`	Added 1251 lines of comprehensive Vitest coverage for ResourceExhaustionService and SQLiteStore: phase transitions, pause/resume, deliverable tracking, threshold evaluation, database persistence, singleton management, and end-to-end integration scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent Spawner
    participant Service as Resource<br/>Exhaustion Service
    participant Metrics as Prometheus<br/>Metrics
    participant Store as SQLiteStore
    participant Slack as Slack<br/>Integration

    Agent->>Service: initializeAgent(agentId)
    Service->>Store: Load existing metrics from DB
    Service->>Service: Create in-memory tracking

    Agent->>Service: recordApiCall(tokens)
    Service->>Metrics: Update token counters
    Service->>Store: Persist metrics

    Agent->>Service: evaluateAgent()
    Service->>Service: Compare metrics vs thresholds
    alt Threshold exceeded
        Service->>Service: Transition phase (normal→warning)
        Service->>Metrics: Update phase gauge
        Service->>Store: Save exhaustion event
        Service->>Slack: sendResourceWarning()
        Slack-->>Slack: Format and send message
    else Severe threshold exceeded
        Service->>Service: Transition to intervention
        Service->>Service: pauseAgent() if configured
        Service->>Slack: sendResourceIntervention()
    end

    Agent->>Service: recordDeliverable(checkpoint)
    Service->>Store: createDeliverableCheckpoint()
    Service->>Service: Reset to normal phase
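
The last step of the diagram, where a deliverable resets time-based tracking, might look like the sketch below. The `startedAt` and `lastDeliverableAt` field names appear elsewhere in this review; the helper functions and the rule that the clock starts at agent start when no deliverable exists are assumptions for illustration.

```typescript
interface TimeTracking {
  startedAt: Date;
  lastDeliverableAt: Date | null;
}

// Assumption: the "time without deliverable" clock starts at the most recent
// deliverable, or at agent start if none has been recorded yet.
function msWithoutDeliverable(metrics: TimeTracking, now: Date): number {
  const since = metrics.lastDeliverableAt ?? metrics.startedAt;
  return now.getTime() - since.getTime();
}

// Hypothetical helper: recording a deliverable only moves the reference point.
function recordDeliverable(metrics: TimeTracking, at: Date): TimeTracking {
  return { ...metrics, lastDeliverableAt: at };
}

const start = new Date('2026-01-27T12:00:00Z');
const now = new Date('2026-01-27T12:20:00Z');
const before: TimeTracking = { startedAt: start, lastDeliverableAt: null };
const after = recordDeliverable(before, new Date('2026-01-27T12:15:00Z'));

console.log(msWithoutDeliverable(before, now)); // 1200000 (20 minutes)
console.log(msWithoutDeliverable(after, now)); // 300000 (5 minutes)
```

Passing `now` in explicitly keeps the calculation deterministic, which also sidesteps the real-time delays flagged in the test nitpicks.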

sequenceDiagram
    participant Client as REST Client
    participant Routes as Agent Routes
    participant Service as Resource<br/>Exhaustion Service
    participant Store as SQLiteStore
    participant Memory as Memory<br/>Manager

    Client->>Routes: POST /api/v1/agents/:id/pause
    Routes->>Routes: Validate resourceExhaustion.enabled
    Routes->>Service: pauseAgent(agentId, reason)
    Service->>Store: Update pause state in metrics
    Service->>Service: Register pauseAgent callback
    Routes->>Routes: Emit websocket event
    Routes-->>Client: { paused: true, timestamp }

    Client->>Routes: GET /api/v1/agents/:id/resources
    Routes->>Service: getResourceMetrics(agentId)
    Service->>Store: getAgentResourceMetrics(agentId)
    Store-->>Service: Return metrics object
    Routes->>Memory: Load agent data for context
    Routes-->>Client: { filesAccessed, apiCalls, tokens, phase }

    Client->>Routes: POST /api/v1/agents/:id/deliverable
    Routes->>Routes: Validate type, config enabled
    Routes->>Service: recordDeliverable(checkpoint)
    Service->>Store: createDeliverableCheckpoint()
    Service->>Service: Update lastDeliverableAt, reset phase
    Routes-->>Client: { id, agentId, type, createdAt }
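
For orientation, the endpoints in the diagram can be collected into a small request builder. This is a hypothetical client-side sketch: the paths for `resources`, `pause`, `deliverable`, and the system endpoint are taken from this PR, the full `resume` path is inferred from the "POST /resume" entry in the changes table, and the request body fields (`reason`, `type`) are assumptions.

```typescript
interface RequestSpec {
  method: 'GET' | 'POST';
  path: string;
  body?: Record<string, unknown>;
}

const agentPath = (agentId: string, action: string): string =>
  `/api/v1/agents/${encodeURIComponent(agentId)}/${action}`;

// Hypothetical builder; body field names are assumptions, not confirmed by the PR.
const resourceApi = {
  getResources: (agentId: string): RequestSpec => ({
    method: 'GET',
    path: agentPath(agentId, 'resources'),
  }),
  pause: (agentId: string, reason: string): RequestSpec => ({
    method: 'POST',
    path: agentPath(agentId, 'pause'),
    body: { reason },
  }),
  resume: (agentId: string): RequestSpec => ({
    method: 'POST',
    path: agentPath(agentId, 'resume'),
  }),
  recordDeliverable: (agentId: string, type: string): RequestSpec => ({
    method: 'POST',
    path: agentPath(agentId, 'deliverable'),
    body: { type },
  }),
  systemResources: (): RequestSpec => ({
    method: 'GET',
    path: '/api/v1/system/resources',
  }),
};

console.log(resourceApi.pause('agent-1', 'manual review'));
```

Each builder returns a plain description of the request, so it can be handed to whatever HTTP client the caller already uses.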

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

PR #2: Extends monitoring surface (Prometheus metrics, health checks) and touches SQLiteStore persistence and Slack integration similarly; both involve new monitoring infrastructure and agent instrumentation.
PR #10: Modifies src/agents/spawner.ts lifecycle hooks (spawn/stop) for agent identity/deactivation; overlaps with this PR's resource tracking initialization in the same functions.

Poem

🐰 Hops of monitoring, bounds of care,
Tracking tokens through the air,
When resources run deep and wide,
We pause the agents, side by side,
Deliverables light the way,
No more runaway loops today! 🥕✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: implement resource exhaustion triggers for agent monitoring' directly and clearly summarizes the main change—implementing resource exhaustion detection and monitoring for agents.
Linked Issues check	✅ Passed	The PR implementation comprehensively addresses all required capabilities from issue `#4`: per-agent metrics tracking (files, API calls, subtasks, time, tokens), configurable thresholds with defaults, phase progression (warning/intervention/termination), deliverable checkpoints that reset time tracking, Slack notifications, and Prometheus metrics.
Out of Scope Changes check	✅ Passed	All changes are directly related to implementing resource exhaustion monitoring as specified in issue `#4`. Documentation updates, configuration schemas, API endpoints, database persistence, and tests all support the core feature without introducing unrelated modifications.
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@src/monitoring/resource-exhaustion-service.ts`:
- Around line 260-265: The code uses a non-null assertion on triggeredBy when
calling handlePhaseTransition (see variables triggeredBy, newPhase,
metrics.phase, handlePhaseTransition), which can lead to passing null/undefined
in edge cases; update the conditional logic to ensure triggeredBy is set (or
provide a safe default) before invoking handlePhaseTransition, e.g., only call
handlePhaseTransition when newPhase !== metrics.phase AND triggeredBy is
non-null (or compute a fallback reason), then continue to assign metrics.phase =
newPhase and call updateMetrics(agentId, metrics) as before.
- Around line 337-340: The isAgentPaused method incorrectly returns true when
metrics is undefined because metrics?.pausedAt !== null yields true for
undefined; update isAgentPaused to first ensure metrics exists and only then
check pausedAt (e.g., use an explicit existence check like metrics != null &&
metrics.pausedAt !== null, or metrics?.pausedAt != null) so that non-existent
agents return false; refer to isAgentPaused, this.metricsCache, and the pausedAt
field when making the change.
- Around line 539-542: The call to pauseAgent inside handlePhaseTransition is
not awaited even though pauseAgent returns Promise<boolean>, so make
handlePhaseTransition async and await this.pauseAgent(agentId, `Resource
threshold exceeded: ${triggeredBy}`) and propagate or handle its result,
updating the handlePhaseTransition signature and any callers as needed; if you
intentionally want fire-and-forget, instead append a .catch(...) to the
pauseAgent(...) call to log or handle errors so failures aren't swallowed
(reference functions: handlePhaseTransition and pauseAgent).

🧹 Nitpick comments (12)

src/web/routes/agents.ts (3)
346-355: Inconsistent error response pattern.

Lines 350 and 354 handle errors differently: line 350 uses sendJson(res, { error: ... }, 500) while line 354 uses sendError(res, 500, message). The sendError function produces a standardized response with {success: false, error: message}, whereas sendJson with an error object produces {success: true, data: {error: ...}} which is misleading for error responses.
Suggested fix for consistency
     try {
       const success = await pauseAgent(agentId, reason);
       if (success) {
         agentEvents.emit('agent:paused', { id: agentId, reason });
         sendJson(res, { paused: true, reason });
       } else {
-        sendJson(res, { error: 'Failed to pause agent' }, 500);
+        sendError(res, 500, 'Failed to pause agent');
       }
     } catch (error) {
376-382: Inconsistent error response pattern.

Same issue here: line 381 uses sendJson(res, { error: ... }, 400) instead of sendError. This produces {success: true, data: {error: ...}} which incorrectly indicates success.
Suggested fix for consistency
     try {
       const success = resumeAgent(agentId);
       if (success) {
         agentEvents.emit('agent:resumed', { id: agentId });
         sendJson(res, { resumed: true });
       } else {
-        sendJson(res, { error: 'Agent was not paused or failed to resume' }, 400);
+        sendError(res, 400, 'Agent was not paused or failed to resume');
       }
     } catch (error) {
242-245: Consider using sendError for all error responses in new endpoints.

Multiple places in the new endpoints use sendJson(res, { error: ... }, statusCode) which produces responses like {success: true, data: {error: ...}, timestamp: ...}. This is semantically incorrect for error conditions. Consider using sendError consistently for all error responses to produce proper {success: false, error: ...} responses.

Affected lines: 243-244, 256-257, 296-297, 340-341, 371-372.

Also applies to: 255-258, 295-298, 339-342, 370-373
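
To make the envelope difference concrete, here is a minimal sketch of the two response bodies as described in the comments above. `successBody`, `errorBody`, and `summarize` are hypothetical stand-ins, not the project's `sendJson` and `sendError` helpers, and HTTP status handling is omitted.

```typescript
interface SuccessEnvelope<T> {
  success: true;
  data: T;
  timestamp: string;
}

interface ErrorEnvelope {
  success: false;
  error: string;
}

// Stand-in for the body shape attributed to sendJson in the review.
function successBody<T>(data: T, now: Date = new Date()): SuccessEnvelope<T> {
  return { success: true, data, timestamp: now.toISOString() };
}

// Stand-in for the body shape attributed to sendError in the review.
function errorBody(message: string): ErrorEnvelope {
  return { success: false, error: message };
}

// A client that branches on `success`, as the envelope invites it to.
function summarize(body: SuccessEnvelope<unknown> | ErrorEnvelope): string {
  return body.success ? 'ok' : `failed: ${body.error}`;
}

const misleading = successBody({ error: 'Failed to pause agent' });
const correct = errorBody('Failed to pause agent');

console.log(summarize(misleading)); // ok
console.log(summarize(correct)); // failed: Failed to pause agent
```

A client that relies on the `success` flag never sees the failure in the first case, which is why the nitpick asks for the error envelope on every error path.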
tests/unit/resource-exhaustion.test.ts (3)
746-748: Weak assertion does not verify actual behavior.

The assertion expect(true).toBe(true) only confirms no exception was thrown. Consider verifying observable state, such as confirming the metrics remain unchanged or the agent wasn't inadvertently created.
♻️ Suggested improvement
-      // No crash expected
-      expect(true).toBe(true);
+      // Metrics should remain unchanged or null since cache was cleared
+      const metrics = service.getAgentMetrics('agent-1');
+      // Verify the operation was a no-op (metrics still null or unchanged)
+      expect(metrics).toBeNull();
619-639: Time-based test relies on setTimeout which can be flaky in CI.

This test uses real time delays (150ms) to verify threshold detection. While acceptable for unit tests, consider using fake timers (vi.useFakeTimers()) for more reliable and faster tests.
♻️ Suggested improvement using fake timers
 it('should detect time without deliverable threshold', () => {
+  vi.useFakeTimers();
   const service = new ResourceExhaustionService(
     store,
     createConfig({
       maxTimeWithoutDeliverableMs: 100, // Very short for testing
       warningThresholdPercent: 0.5,
       pauseOnIntervention: false,
     })
   );
   service.initializeAgent('agent-1', 'coder');

-  // Wait for threshold to be exceeded
-  return new Promise<void>((resolve) => {
-    setTimeout(() => {
-      const phase = service.evaluateAgent('agent-1');
-      expect(phase).toBe('intervention');
-      resolve();
-    }, 150);
-  });
+  // Advance time past threshold
+  vi.advanceTimersByTime(150);
+  
+  const phase = service.evaluateAgent('agent-1');
+  expect(phase).toBe('intervention');
+  
+  vi.useRealTimers();
 });
1148-1167: Time-based test for checkpoint ordering relies on real delay.

Similar to the earlier time-based test, this uses a 10ms delay to ensure different timestamps. Consider using fake timers or manually setting timestamps in the test data for deterministic behavior.
src/monitoring/resource-exhaustion-service.ts (4)
345-357: terminateAgent hardcodes triggeredBy as maxTimeWithoutDeliverableMs.

The termination may be triggered for various reasons (e.g., manual termination, different threshold breaches), but the event always logs maxTimeWithoutDeliverableMs as the trigger. Consider accepting triggeredBy as a parameter.
♻️ Suggested improvement
-terminateAgent(agentId: string, reason: string): boolean {
+terminateAgent(
+  agentId: string,
+  reason: string,
+  triggeredBy: keyof ResourceThresholds = 'maxTimeWithoutDeliverableMs'
+): boolean {
   const metrics = this.metricsCache.get(agentId);
   if (!metrics) return false;

   // Trigger termination phase transition
-  this.handlePhaseTransition(agentId, metrics.phase, 'termination', 'maxTimeWithoutDeliverableMs');
+  this.handlePhaseTransition(agentId, metrics.phase, 'termination', triggeredBy);
   metrics.phase = 'termination';
   ...
584-591: Config comparison uses JSON.stringify for thresholds.

While functional, JSON.stringify comparison is sensitive to property ordering and can be expensive for frequent calls. Consider a dedicated comparison function for clarity and reliability.
♻️ Suggested improvement
function thresholdsEqual(a: ResourceThresholds, b: ResourceThresholds): boolean {
  return (
    a.maxFilesAccessed === b.maxFilesAccessed &&
    a.maxApiCalls === b.maxApiCalls &&
    a.maxSubtasksSpawned === b.maxSubtasksSpawned &&
    a.maxTimeWithoutDeliverableMs === b.maxTimeWithoutDeliverableMs &&
    a.maxTokensConsumed === b.maxTokensConsumed
  );
}
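
The ordering sensitivity is easy to reproduce in isolation. The sketch below uses a reduced two-field shape (an assumption made to keep the example short) and compares the same values written in a different key order.

```typescript
interface TwoThresholds {
  maxFilesAccessed: number;
  maxApiCalls: number;
}

// Same values, different key insertion order.
const first: TwoThresholds = { maxApiCalls: 100, maxFilesAccessed: 50 };
const second: TwoThresholds = { maxFilesAccessed: 50, maxApiCalls: 100 };

function equalByStringify(a: TwoThresholds, b: TwoThresholds): boolean {
  // JSON.stringify serializes keys in insertion order
  return JSON.stringify(a) === JSON.stringify(b);
}

function equalByFields(a: TwoThresholds, b: TwoThresholds): boolean {
  return a.maxFilesAccessed === b.maxFilesAccessed && a.maxApiCalls === b.maxApiCalls;
}

console.log(equalByStringify(first, second)); // false
console.log(equalByFields(first, second)); // true
```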
362-366: waitForResume creates a promise that may never resolve.

If waitForResume is called but the agent is never paused or resumed, the promise remains pending indefinitely. Consider adding a timeout or cancellation mechanism for robustness.

414-424: Background monitoring interval should use unref() to not block Node.js exit.

If this service is running when the process wants to exit gracefully, the interval may keep the event loop alive. Consider calling unref() on the interval.
♻️ Suggested improvement
 start(): void {
   if (!this.isEnabled() || this.checkInterval) return;

   this.checkInterval = setInterval(() => {
     this.checkAllAgents();
   }, this.config.checkIntervalMs);
+  
+  // Don't block process exit
+  this.checkInterval.unref();

   log.info('Resource exhaustion monitoring started', {
     checkIntervalMs: this.config.checkIntervalMs,
   });
 }
src/memory/sqlite-store.ts (2)
2425-2462: Consider using SQL aggregation for better performance.

The getResourceExhaustionMetrics method fetches all events and aggregates in memory. For large event tables, this could be slow and memory-intensive. Consider using SQL GROUP BY and COUNT for better performance.
♻️ SQL-based aggregation approach
getResourceExhaustionMetrics(since?: Date): {
  totalEvents: number;
  warningCount: number;
  interventionCount: number;
  terminationCount: number;
  byAgent: Record<string, number>;
} {
  let baseCondition = '1=1';
  const params: number[] = [];

  if (since) {
    baseCondition = 'created_at >= ?';
    params.push(since.getTime());
  }

  // Get counts by action_taken
  const countQuery = `
    SELECT 
      COUNT(*) as total,
      SUM(CASE WHEN action_taken = 'warned' THEN 1 ELSE 0 END) as warning_count,
      SUM(CASE WHEN action_taken = 'paused' THEN 1 ELSE 0 END) as intervention_count,
      SUM(CASE WHEN action_taken = 'terminated' THEN 1 ELSE 0 END) as termination_count
    FROM resource_exhaustion_events
    WHERE ${baseCondition}
  `;
  // SUM() returns NULL when no rows match, so those columns are nullable
  const counts = this.db.prepare(countQuery).get(...params) as {
    total: number;
    warning_count: number | null;
    intervention_count: number | null;
    termination_count: number | null;
  };

  // Get counts by agent
  const byAgentQuery = `
    SELECT agent_id, COUNT(*) as count
    FROM resource_exhaustion_events
    WHERE ${baseCondition}
    GROUP BY agent_id
  `;
  const agentRows = this.db.prepare(byAgentQuery).all(...params) as Array<{
    agent_id: string;
    count: number;
  }>;

  const byAgent: Record<string, number> = {};
  for (const row of agentRows) {
    byAgent[row.agent_id] = row.count;
  }

  return {
    totalEvents: counts.total,
    // Fall back to 0 for the nullable SUM() columns
    warningCount: counts.warning_count ?? 0,
    interventionCount: counts.intervention_count ?? 0,
    terminationCount: counts.termination_count ?? 0,
    byAgent,
  };
}
2493-2516: Mutating the parsed JSON object could cause subtle issues.

The transformer mutates metrics after parsing from JSON. While functional, this pattern can be confusing. Consider creating a new object with the converted dates instead.
♻️ Immutable transformation approach
 private rowToResourceExhaustionEvent(row: ResourceExhaustionEventRow): ResourceExhaustionEvent {
-  const metrics = JSON.parse(row.metrics) as AgentResourceMetrics;
-  // Convert date strings back to Date objects
-  metrics.startedAt = new Date(metrics.startedAt);
-  metrics.lastActivityAt = new Date(metrics.lastActivityAt);
-  if (metrics.lastDeliverableAt) {
-    metrics.lastDeliverableAt = new Date(metrics.lastDeliverableAt);
-  }
-  if (metrics.pausedAt) {
-    metrics.pausedAt = new Date(metrics.pausedAt);
-  }
+  const rawMetrics = JSON.parse(row.metrics) as Record<string, unknown>;
+  const metrics: AgentResourceMetrics = {
+    agentId: rawMetrics.agentId as string,
+    filesRead: rawMetrics.filesRead as number,
+    filesWritten: rawMetrics.filesWritten as number,
+    filesModified: rawMetrics.filesModified as number,
+    apiCallsCount: rawMetrics.apiCallsCount as number,
+    subtasksSpawned: rawMetrics.subtasksSpawned as number,
+    tokensConsumed: rawMetrics.tokensConsumed as number,
+    startedAt: new Date(rawMetrics.startedAt as string | number),
+    lastDeliverableAt: rawMetrics.lastDeliverableAt 
+      ? new Date(rawMetrics.lastDeliverableAt as string | number) 
+      : null,
+    lastActivityAt: new Date(rawMetrics.lastActivityAt as string | number),
+    phase: rawMetrics.phase as ResourceExhaustionPhase,
+    pausedAt: rawMetrics.pausedAt 
+      ? new Date(rawMetrics.pausedAt as string | number) 
+      : null,
+    pauseReason: rawMetrics.pauseReason as string | null,
+  };

   return {
     id: row.id,
     ...
   };
 }

coderabbitai · 2026-01-27T19:47:18Z

+    // Handle phase transition
+    if (newPhase !== metrics.phase) {
+      this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy!);
+      metrics.phase = newPhase;
+      this.updateMetrics(agentId, metrics);
+    }


⚠️ Potential issue | 🟡 Minor

Non-null assertion on triggeredBy may cause issues.

If every ratio stays below the warning threshold, newPhase remains 'normal' and triggeredBy remains null. The if (newPhase !== metrics.phase) guard covers the normal→normal case, but when the stored phase is already warning or intervention the two phases differ, so handlePhaseTransition is called with triggeredBy! even though the value is null.

🛠️ Suggested defensive fix

     // Handle phase transition
-    if (newPhase !== metrics.phase) {
-      this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy!);
+    if (newPhase !== metrics.phase && triggeredBy !== null) {
+      this.handlePhaseTransition(agentId, metrics.phase, newPhase, triggeredBy);
       metrics.phase = newPhase;
       this.updateMetrics(agentId, metrics);
     }
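
The edge case can be reproduced with a small stand-alone sketch. The `evaluate` loop below is a hypothetical reconstruction, not the service's code: it assumes ratios are usage divided by limit, that warning starts at `warningFraction`, and that intervention starts at 1.0.

```typescript
type Phase = 'normal' | 'warning' | 'intervention';

interface Evaluation {
  phase: Phase;
  triggeredBy: string | null;
}

const SEVERITY: Record<Phase, number> = { normal: 0, warning: 1, intervention: 2 };

// Hypothetical evaluation loop: keep the most severe phase and remember which threshold caused it.
function evaluate(ratios: Record<string, number>, warningFraction: number): Evaluation {
  let phase: Phase = 'normal';
  let triggeredBy: string | null = null;
  for (const name of Object.keys(ratios)) {
    const ratio = ratios[name];
    const candidate: Phase =
      ratio >= 1 ? 'intervention' : ratio >= warningFraction ? 'warning' : 'normal';
    if (SEVERITY[candidate] > SEVERITY[phase]) {
      phase = candidate;
      triggeredBy = name;
    }
  }
  return { phase, triggeredBy };
}

function describeTransition(previous: Phase, next: Evaluation): string {
  if (next.phase === previous) return 'no change';
  if (next.triggeredBy === null) return `${previous} -> ${next.phase} with no trigger`;
  return `${previous} -> ${next.phase} triggered by ${next.triggeredBy}`;
}

// An agent that was in 'warning' and whose usage ratios have dropped again.
console.log(describeTransition('warning', evaluate({ maxApiCalls: 0.2 }, 0.7)));
// An agent crossing a limit for the first time.
console.log(describeTransition('normal', evaluate({ maxApiCalls: 1.2, maxTokensConsumed: 0.8 }, 0.7)));
```

In this sketch the only transition without a trigger is the one back to normal, so a guard on `triggeredBy !== null` would also skip that downgrade; whether that is desirable depends on how the service is meant to recover.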

coderabbitai · 2026-01-27T19:47:18Z

+  isAgentPaused(agentId: string): boolean {
+    const metrics = this.metricsCache.get(agentId);
+    return metrics?.pausedAt !== null;
+  }


⚠️ Potential issue | 🟠 Major

isAgentPaused returns incorrect value when agent doesn't exist.

When metrics is undefined (agent not found), metrics?.pausedAt !== null evaluates to undefined !== null, which is true. This means isAgentPaused returns true for non-existent agents.

🐛 Proposed fix

 isAgentPaused(agentId: string): boolean {
   const metrics = this.metricsCache.get(agentId);
-  return metrics?.pausedAt !== null;
+  return metrics?.pausedAt != null;
 }

Or more explicitly:

 isAgentPaused(agentId: string): boolean {
   const metrics = this.metricsCache.get(agentId);
-  return metrics?.pausedAt !== null;
+  return metrics !== undefined && metrics.pausedAt !== null;
 }
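
The difference between the two comparisons can be checked in isolation. In this sketch a plain object stands in for the service's cache, and the `strictCheck` and `looseCheck` names are hypothetical.

```typescript
interface PauseState {
  pausedAt: Date | null;
}

// Stand-in for the metrics cache; 'missing' is deliberately absent.
const cacheStandIn: Record<string, PauseState | undefined> = {
  running: { pausedAt: null },
  paused: { pausedAt: new Date(0) },
};

// Original comparison: undefined !== null is true, so unknown agents look paused.
const strictCheck = (id: string): boolean => cacheStandIn[id]?.pausedAt !== null;

// Proposed comparison: != null treats both null and undefined as "not paused".
const looseCheck = (id: string): boolean => cacheStandIn[id]?.pausedAt != null;

for (const id of ['running', 'paused', 'missing']) {
  console.log(id, strictCheck(id), looseCheck(id));
}
// running false false
// paused true true
// missing true false
```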

coderabbitai · 2026-01-27T19:47:18Z

+      // Auto-pause if configured
+      if (this.config.pauseOnIntervention) {
+        this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
+      }


⚠️ Potential issue | 🟡 Minor

pauseAgent is async but not awaited.

pauseAgent returns a Promise<boolean> but is called without await inside handlePhaseTransition. This means the pause operation may not complete before the function returns, and any errors from pauseAgent won't be caught.

🛠️ Options to address

Option 1: Make handlePhaseTransition async and await the call:

-private handlePhaseTransition(
+private async handlePhaseTransition(
   ...
-): void {
+): Promise<void> {
   ...
   if (this.config.pauseOnIntervention) {
-    this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
+    await this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
   }

Option 2: If fire-and-forget is intentional, add a .catch() to handle potential errors:

 if (this.config.pauseOnIntervention) {
-  this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`);
+  this.pauseAgent(agentId, `Resource threshold exceeded: ${triggeredBy}`)
+    .catch(err => log.error('Failed to auto-pause agent', { agentId, error: err }));
 }

coderabbitai Bot reviewed Jan 27, 2026

blackms merged commit becca2c into main Jan 27, 2026
6 checks passed

blackms deleted the feat/resource-exhaustion-monitoring branch January 27, 2026 20:42

coderabbitai Bot mentioned this pull request Jan 28, 2026

fix: prevent memory contamination between agents in different sessions #17

Merged

3 tasks

This was referenced May 28, 2026

[AIG-633] Durable execution checkpointer (M1-5) #22

Merged

[AIG-649] Multi-tenancy + workspace isolation (M3-21) #37

Merged

[AIG-632] Add OpenTelemetry tracing #45

Merged

blackms merged 1 commit into main from feat/resource-exhaustion-monitoring
