fix(round-tables): split-brain, error codes, expiry notification, TTL purge#19
fix(round-tables): split-brain, error codes, expiry notification, TTL purge#19
Conversation
…expiry notification, TTL purge - Fix split-brain: only store participants that successfully enrolled in the backing ADMP group (enrolledParticipants vs uniqueParticipants) - Fail-fast when zero participants enroll; clean up orphaned group - Add makeError() helper so routes use error.statusCode, not string-matching - Use nullish coalescing (??) for timeout_minutes default (0 was falsy) - Require integer timeout_minutes; reject floats (e.g. 1.5) - Notify participants via inbox on expireStale() so they know the session ended - Add purgeStaleRoundTables() to memory.js and purgeStale() service method to prevent unbounded storage growth; hooked into cleanup loop with configurable ROUND_TABLE_PURGE_TTL_MS env var (default: 7 days) - Add 5 missing tests: facilitator speak, expireStale lifecycle, dedup, goal validation, non-integer timeout validation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Code ReviewGood follow-up to PR 18. The split-brain fix is the most important change here and it is well-implemented. Several smaller issues worth discussing below. What works well
Bugs / Correctness1. Facilitator not notified on expiry ( The expiry notification loop iterates Suggested fix: const toNotify = [rt.facilitator, ...rt.participants];
for (const participantId of toNotify) { ... }2. The new env var uses the same old pattern this PR was fixing: const ROUND_TABLE_PURGE_TTL_MS = parseInt(process.env.ROUND_TABLE_PURGE_TTL_MS) || 7 * 24 * 60 * 60 * 1000;
3. for (const rt of tables) {
if (rt.expires_at && ...) {
await storage.updateRoundTable(rt.id, { status: 'expired' }); // no try-catch: throws -> loop aborts
// notifications and group cleanup ...
expired++;
}
}If Issues / Concerns4. Silent partial enrollment - no API signal ( When some but not all participants fail to enroll, the session is created with a reduced participant list and the response looks normal. The caller has no way to know an agent was dropped without comparing the request to the response. Consider adding a 5. Group settings: { max_members: uniqueParticipants.length + 1, ... }This is calculated from the requested participant count, not the enrolled count. If 3 of 5 participants fail to enroll, the group allows 6 members but only 4 are present. Depending on how 6.
Test gaps7. Zero-enrollment case not tested The PR adds the guard for 8. Expiry test does not assert inbox delivery The Minor / Nits
Summary
The split-brain fix and error code cleanup are solid and worth merging. Two items to prioritize before merge: facilitator not notified on expiry (1), and the unguarded |
Greptile SummaryAddresses second review feedback from PR #18 with several critical fixes and improvements to round table session management. Key improvements:
Test coverage: Confidence Score: 5/5
|
| Filename | Overview |
|---|---|
| src/services/round-table.service.js | Split-brain fix, error code refactor with makeError helper, expiry notifications, and purge method added |
| src/routes/round-tables.js | Simplified error handling, added integer validation for timeout_minutes, fixed nullish coalescing |
| src/server.test.js | Added 5 comprehensive tests covering facilitator access, expiry lifecycle, deduplication, and validation |
Sequence Diagram
sequenceDiagram
participant Cleanup as Cleanup Job
participant RTS as RoundTableService
participant Storage as Storage
participant Inbox as InboxService
participant Group as GroupService
participant P1 as Participant 1
participant P2 as Participant 2
Cleanup->>RTS: expireStale()
RTS->>Storage: listRoundTables({status: 'open'})
Storage-->>RTS: [round tables]
loop For each expired table
RTS->>Storage: updateRoundTable(id, {status: 'expired'})
Storage-->>RTS: updated
RTS->>Inbox: send(to: P1, type: 'notification')
Inbox-->>P1: Expiry notification
RTS->>Inbox: send(to: P2, type: 'notification')
Inbox-->>P2: Expiry notification
RTS->>Group: delete(group_id)
Group-->>RTS: deleted
end
RTS-->>Cleanup: expired count
Last reviewed commit: e401137
…resilient expiry loop, excluded_participants, max_members fix - Notify facilitator AND participants on expiry (facilitator was excluded) - Add version:1.0 + id to all inboxService.send envelopes (required fields) - Wrap per-record expireStale in try-catch so one failure does not abort rest - Return excluded_participants in create response so callers know who was dropped - Align group max_members to enrolledParticipants.length+1 after partial enrollment via groupService.update() (prevents unauthorized join slots) - Fix parseInt(env) || default -> NaN-safe pattern for ROUND_TABLE_PURGE_TTL_MS - Use rt.expires_at (not new Date()) as canonical timestamp in expiry notifications - Remove unnecessary return await in purgeStale - Strengthen expiry test: assert inbox delivery to both participant and facilitator - Add zero-enrollment test: all-ghost participants returns 400 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code ReviewThis is a well-executed follow-up. The core fixes are all present and most of the previously raised issues have been addressed. A few items remain worth discussing before merge. What's well done
Bugs / Correctness1. Silent group deletion swallows errors on zero-enrollment path if (enrolledParticipants.length === 0) {
try { await groupService.delete(group.id, facilitator); } catch (_) {}
throw makeError(...);
}An orphaned group is a real correctness problem — it occupies a slot and may block future sessions with the same name. Yet a failure here produces no log entry. The rest of the expiry and cleanup paths use try {
await groupService.delete(group.id, facilitator);
} catch (err) {
logger.warn({ groupId: group.id, err: err.message }, '[RoundTable] Could not clean up group on zero-enrollment');
}2.
Design / Behaviour3.
4. try {
await groupService.update(group.id, facilitator, { settings: { max_members: enrolledParticipants.length + 1, ... } });
} catch (err) {
logger.warn(..., '[RoundTable] Could not update group max_members after partial enrollment');
}If this update fails, the group retains the original upper bound based on Test gaps5. Zero-enrollment test doesn't verify group cleanup The test description says "leaves no orphaned records", but the test body only asserts the HTTP status and error message. The whole point of the Nits
SummaryThe critical correctness bugs from the earlier review are fixed. The two items I'd resolve before merging are:
The design questions around Generated with Claude Code |
…oll log, test robustness - Add startup assertion for purgeStaleRoundTables in storage/index.js so any backend missing the method fails fast at startup instead of at runtime - Log groupService.delete failure on zero-enrollment path (was silently swallowed; inconsistent with every other error path in the service) - Fix zero-enrollment test: use groupService.listForAgent for backend-agnostic group cleanup assertion (storage.groups Map only exists in memory backend) - Use unique ghost agent IDs in zero-enrollment test to prevent collision with agents registered by other tests - Import groupService in test file for direct service-level assertions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review (Third Pass)Two prior reviews have been addressed well. This pass focuses on what's changed since the last review and a few remaining items. Confirmed fixed from previous reviews
Good incremental progress on each iteration. Remaining issues1. No test for partial enrollment (some succeed, some fail) The split-brain fix's most important scenario — partial enrollment — is untested. The tests cover the all-succeed case (normal create) and the all-fail case (zero-enrollment), but nothing in between. A test with one valid and one invalid participant would confirm that:
Without this test, the core behavioral change of this PR (deriving 2. If the 3.
Nits
Summary
The critical correctness issues from the previous reviews are all addressed. The one item worth resolving before merge is the missing test for partial enrollment — it's the primary behavioral guarantee this PR adds. The other items are lower priority but worth tracking. Generated with Claude Code |
…pe, encodeURIComponent in test - Add partial enrollment test: verifies split-brain prevention is exercised (one valid + one ghost participant; asserts participants=[valid], excluded_participants=[ghost], group max_members=2) - Only include excluded_participants in create response when non-empty, keeping create and GET response shapes consistent on the happy path - Fix group ID URL encoding in test (group:// IDs contain slashes) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review — PR #19Good set of targeted fixes. The split-brain correction is the most important change and is done well. Notes below, roughly by severity. CorrectnessExpiry notification sender identity ( Notifications are sent with // Current: facilitator notifies themselves
from: rt.facilitator,
to: rt.facilitator, // looks like a self-message
The
API Contract / Client Impact
The create response conditionally includes
Consider one of:
Missing Docs UpdatePer
PerformanceSequential notification loop in for (const recipientId of toNotify) {
try {
await inboxService.send(...) // sequential
} catch ...
}For a table with 20 participants, this sends notifications one at a time. Since each send is independent, await Promise.allSettled(toNotify.map(async (recipientId) => {
try {
await inboxService.send(...)
} catch (err) {
logger.warn(...)
}
}));Low / MinorStorage startup assertion may break rolling deployments The new assertion in if (typeof _storage.purgeStaleRoundTables !== 'function') {
throw new Error('storage: purgeStaleRoundTables is missing ...');
}This crashes the process if the storage adapter doesn't implement the new method. On a rolling deploy, new server instances will fail to start until the storage adapter (e.g. a Redis adapter) is also updated. Operator runbook should call this out, or the error message should include upgrade steps.
The integer check now exists in both the route handler and
For expired tables, the purge uses Strengths
Summary: The core fixes are correct and well-tested. The main asks before merge are: (1) clarify/fix the notification sender identity, (2) add a comment on |
…cs update - Replace sequential notification loop in expireStale with Promise.allSettled for parallel delivery to all recipients (up to 21 for a full session) - Add inline comments explaining: - why from=rt.facilitator is used for expiry notifications (logical author) - why facilitator receives a self-addressed copy (documented intent) - why verify_signature:false is required (server holds no private keys) - why rt.expires_at is used as timestamp (canonical, not processing time) - Update docs/AGENT-GUIDE.md: add Round Tables endpoint table and behavior notes documenting excluded_participants, expiry notifications, partial enrollment, storage growth/purge, and ROUND_TABLE_PURGE_TTL_MS env var Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review — PR #19Previous review items are well addressed. Here is a focused pass on the current diff. What was fixed from prior reviews
Remaining items[MEDIUM] Facilitator-in-participants edge case — no guard If the caller includes the facilitator's own agent ID in the participants list, groupService.addMember will attempt to add the owner as a member of their own group. This likely throws (already a member / already owner), which would land the facilitator in excludedParticipants. Two failure modes:
Suggested fix in the route or service: [MEDIUM] Storage startup assertion blocks rolling deployments The new assertion in storage/index.js crashes the process if purgeStaleRoundTables is missing: On a rolling deploy, new application instances will refuse to start until every storage adapter is updated. Operators using Redis or custom adapters will hit a hard failure. The error message points to memory.js which helps, but the README or a CHANGELOG entry should explicitly call this out as a breaking change for custom adapter implementors. [LOW] _purgeTtlParsed is an unusual module-level name The leading underscore convention typically signals an unused/private variable and may trigger linter warnings. An IIFE is cleaner: [LOW] timeout_minutes validation duplicated in route and service The integer check (!Number.isInteger) now exists in both the route handler and roundTableService.create. As defense-in-depth this is harmless, but the two error messages will drift if the valid range changes. Since makeError now propagates structured status codes cleanly, the route could drop its copy and rely on catching the service error. [LOW] Partial enrollment test does not verify speak access post-create The test confirms that only the enrolled participant appears in rt.participants and that max_members is aligned — both good. It does not verify:
These are the behavioral guarantees the split-brain fix is meant to provide. A couple of extra assertions in the existing test would close that gap. NitpurgeStaleRoundTables uses rt.resolved_at || rt.expires_at as the cutoff, which is the correct semantic (canonical close time, not processing time). A brief comment in memory.js would help custom adapter authors understand why expires_at is used instead of Date.now(). Summary
The core split-brain fix is correct and well-tested. The two medium items are worth addressing before merge: the facilitator edge case is a real correctness hole, and the deploy note is an operator safety concern. The low items can follow in a subsequent cleanup pass. Generated with Claude Code |
…TTL, rolling deploy note - Add FACILITATOR_IN_PARTICIPANTS validation: reject creates where the facilitator is listed as a participant (addMember would throw 'already owner', landing the facilitator in excluded_participants or aborting on zero-enrollment path) - Replace _purgeTtlParsed temp var with IIFE to avoid underscore-private convention - Add BREAKING CHANGE comment to storage/index.js purgeStaleRoundTables assertion so operators deploying custom adapters see upgrade instructions at the assertion site - Add canonical-close-time comment to memory.js purgeStaleRoundTables explaining why resolved_at/expires_at is used instead of Date.now() - Strengthen partial enrollment test: assert enrolled agent can speak (201) and a non-enrolled registered agent gets 403 (not just unregistered ghost) - Add facilitator-in-participants test: verify 400 + FACILITATOR_IN_PARTICIPANTS error Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review — PR #19 (Final Pass)This is a well-executed series of fixes. All medium and high priority items from prior review rounds are resolved. Notes below on what is confirmed fixed and what remains. Fixed from previous reviews
Remaining[LOW] Duplicate The integer check now lives in both [LOW] Single call site in [LOW] Breaking change docs for custom adapters
Summary
Merge assessment: Ready to merge. The remaining items are all low-priority nits. The core behavioral changes (split-brain prevention, expiry notifications, TTL purge) are correct, well-guarded, and tested. Generated with Claude Code |
Documents the Round Table feature shipped in PR #19 (bf46539): - API-REFERENCE.md: full Round Tables section (5 endpoints, request/response shapes, role semantics, error examples, excluded_participants behaviour) - ERROR-CODES.md: new Round Table Errors section (13 error codes) - llms.txt: Round Tables endpoint block, ROUND_TABLE_PURGE_TTL_MS env var, error codes for AI agent consumption - docs-generator.json: round-tables.js and round-table.service.js added to sources so future /docs-generator runs include them AGENT-GUIDE.md already documented Round Tables during the feature PR review. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: add Round Tables to API reference, error codes, and llms.txt Documents the Round Table feature shipped in PR #19 (bf46539): - API-REFERENCE.md: full Round Tables section (5 endpoints, request/response shapes, role semantics, error examples, excluded_participants behaviour) - ERROR-CODES.md: new Round Table Errors section (13 error codes) - llms.txt: Round Tables endpoint block, ROUND_TABLE_PURGE_TTL_MS env var, error codes for AI agent consumption - docs-generator.json: round-tables.js and round-table.service.js added to sources so future /docs-generator runs include them AGENT-GUIDE.md already documented Round Tables during the feature PR review. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(docs): correct Round Table status value, speak access, and RESOLVE_FAILED HTTP codes - llms.txt: ?status=active → ?status=open (active is not a valid status) - API-REFERENCE.md: facilitator CAN speak (_requireParticipant allows it) - ERROR-CODES.md: RESOLVE_FAILED HTTP range is 400/403/404/409 not 403/404 (409 when session is already resolved/expired via _getOpen) Addresses self-review on PR #20. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(docs): cycle 2 — llms.txt under limit, status value, RESOLVE_FAILED codes - llms.txt: compress to 246 lines (limit 250); consolidate RT error codes, compact envelope example and CLI section - llms.txt: ?status=open not ?status=active (wrong status name from cycle 1) - llms.txt: RESOLVE_FAILED updated to 400/403/404/409 - API-REFERENCE.md: facilitator CAN speak (fix incorrect claim) - ERROR-CODES.md: RESOLVE_FAILED HTTP column now 400/403/404/409 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up to PR #18. Addresses issues from the second claude review that was not seen before merge.
Changes
Tests added (5 new)
Results
153 tests total, 138 pass, 15 fail (same 15 pre-existing failures as before).
Generated with Claude Code