Skip to content

fix: rate-limit notification retries for non-existent windows#26

Merged
UserAd merged 2 commits intomainfrom
stuck-in-cycle
Jan 15, 2026
Merged

fix: rate-limit notification retries for non-existent windows#26
UserAd merged 2 commits intomainfrom
stuck-in-cycle

Conversation

@UserAd
Copy link
Owner

@UserAd UserAd commented Jan 15, 2026

Summary

  • Fix infinite retry loop when mailman tries to notify non-existent tmux windows
  • Check window existence before attempting notification (cleaner approach)
  • If window doesn't exist, skip entirely - no "failed" errors in logs
  • Keep rate-limiting as fallback for other notification failures

Test plan

  • Added TestStatelessNotification_WindowNotExists to verify window check works
  • Updated TestStatelessNotification_NotifyFailure to verify rate-limiting on other failures
  • All existing tests pass with go test -race ./...
  • Manual test: send message to non-existent window, verify logs show "window does not exist" (not "failed")

When notification failed for a stateless agent (e.g., tmux window doesn't
exist), the agent was not marked as notified. This caused mailman to retry
every cycle, creating an infinite loop of failed notifications.

Now notification failures also mark the agent as notified, rate-limiting
retries to once per interval (60 seconds). The retry will stop entirely
if the message gets read.
@coderabbitai
Copy link

coderabbitai bot commented Jan 15, 2026

📝 Walkthrough

Walkthrough

When stateless agent notification fails, the code now marks the agent as notified via StatelessTracker to rate-limit retry attempts, preventing immediate retries. The successful notification path remains unchanged. This failure-handling branch complements existing behavior to avoid repeated notification attempts.

Changes

Cohort / File(s) Summary
Loop notification handling
internal/daemon/loop.go
Added StatelessTracker.MarkNotified(mailboxRecipient) call in failure path of CheckAndNotifyWithNotifier to rate-limit retries on notification failures (+4 lines)
Loop notification tests
internal/daemon/loop_test.go
Updated StatlessNotification test (T033) to verify agent is marked as notified on failure, enforcing rate-limited retry behavior; test now validates that immediate retries are blocked and subsequent attempts occur only after tracker interval elapses (+20/-7 lines)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: rate-limit notification retries for non-existent windows' directly and clearly describes the main change—implementing rate-limiting for notification retries in a specific failure scenario.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings


📜 Recent review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e7c67d and b596843.

📒 Files selected for processing (2)
  • internal/daemon/loop.go
  • internal/daemon/loop_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Store messages in JSONL format in .agentmail/mailboxes/<recipient>.jsonl with file locking for atomic operations
Use only Go standard library (os/exec for tmux, encoding/json for JSONL) without external dependencies
Build system must pass go vet ./... without errors
Build system must pass go fmt ./... without errors
Build system must pass govulncheck ./... without errors
Build system must pass gosec ./... without errors
Use JSONL file format in .agentmail/ directory for message storage and state management
Use standard library only (os, filepath, syscall, encoding/json, time, sync, os/signal, bufio, fmt, io) for Go implementation

Files:

  • internal/daemon/loop.go
  • internal/daemon/loop_test.go
**/*_test.go

📄 CodeRabbit inference engine (CLAUDE.md)

Build system must pass go test -v -race ./... without errors

Files:

  • internal/daemon/loop_test.go
🧬 Code graph analysis (1)
internal/daemon/loop_test.go (1)
internal/daemon/loop.go (1)
  • CheckAndNotifyWithNotifier (136-276)
🔇 Additional comments (2)
internal/daemon/loop.go (1)

252-258: LGTM! Rate-limiting on failure prevents infinite retry loops.

The change correctly marks the stateless agent as notified even when the notification fails. This ensures that subsequent notification cycles won't immediately retry, rate-limiting attempts to once per StatelessNotifyInterval (60 seconds). The approach is appropriate because:

  1. If the window doesn't exist, repeated immediate retries are wasteful
  2. The window might become available later, so retries after the interval are still attempted
  3. If the mailbox is cleared (message read), retries stop entirely as there are no unread messages

The log message suffix "(after failure)" clearly distinguishes this path from successful notifications.

internal/daemon/loop_test.go (1)

1316-1376: LGTM! Test correctly validates the rate-limiting behavior on notification failure.

The updated test thoroughly verifies the new failure-handling semantics:

  1. First call (lines 1341-1348): Notification fails, notifyAttempts = 1
  2. Rate-limit check (lines 1350-1354): Confirms ShouldNotify returns false after failure
  3. Immediate retry blocked (lines 1356-1363): Second call doesn't trigger notify due to rate-limiting
  4. Interval elapsed (lines 1365-1366): Waits 60ms (> 50ms interval)
  5. Retry allowed (lines 1368-1375): Third call succeeds, notifyAttempts = 2

The test interval (50ms) and sleep duration (60ms) provide adequate margin for timing reliability. The mock notifier correctly returns success on retry (notifyAttempts > 1), validating that recovery is possible after the rate-limit window.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Instead of letting notification fail for non-existent windows and then
rate-limiting, check upfront if the tmux window exists. This provides
cleaner logs (no "failed" errors for expected cases) and avoids
unnecessary notification attempts.

Added WindowCheckerFunc to make the check mockable in tests.
@UserAd UserAd merged commit eb86928 into main Jan 15, 2026
2 checks passed
@UserAd UserAd deleted the stuck-in-cycle branch January 15, 2026 05:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant