fix(round-tables): split-brain, error codes, expiry notification, TTL purge #19

Merged
dundas merged 6 commits into main from fix/round-table-followup
Feb 28, 2026
Conversation

@dundas (Owner) commented Feb 28, 2026

Follow-up to PR #18. Addresses issues from the second Claude review, which was not seen before merge.

Changes

  • [HIGH] Split-brain fix: only enrolled participants are stored in rt.participants
  • [MEDIUM] Nullish coalescing: timeout_minutes || 30 -> timeout_minutes ?? 30
  • [MEDIUM] Integer-only timeout: floats (e.g. 1.5) now return 400
  • [MEDIUM] Error codes: replaced string-matching getErrorStatusCode with makeError(msg, statusCode) helper
  • [MEDIUM] Expiry notifications: expireStale() notifies participants via inbox
  • [LOW] TTL purge: added purgeStaleRoundTables() + ROUND_TABLE_PURGE_TTL_MS env var (default 7 days)
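The two timeout changes above can be illustrated with a minimal sketch. The helper name `resolveTimeoutMinutes` and the lower bound of 1 are assumptions made for illustration; the PR only states that `??` replaces `||` and that non-integers return 400.

```javascript
// Hypothetical helper; the name and the "at least 1" range check are
// assumptions, not the PR's actual code.
function resolveTimeoutMinutes(timeoutMinutes) {
  // `??` falls back only on null/undefined, so an explicit 0 reaches the
  // validation below instead of silently becoming 30.
  const value = timeoutMinutes ?? 30;
  if (!Number.isInteger(value) || value < 1) {
    const err = new Error('timeout_minutes must be a positive integer');
    err.statusCode = 400;
    throw err;
  }
  return value;
}

console.log(resolveTimeoutMinutes(undefined)); // 30
console.log(0 || 30); // 30 (old behavior: 0 is falsy, so the default wins)
console.log(0 ?? 30); // 0  (new behavior: 0 is kept)
```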

Tests added (5 new)

  • facilitator can speak in their own session
  • expireStale lifecycle (backdated expiry via storage, verify 409 on late speak)
  • duplicate participants are deduplicated
  • missing goal returns 400
  • non-integer timeout_minutes returns 400

Results

153 tests total, 138 pass, 15 fail (same 15 pre-existing failures as before).

Generated with Claude Code

…expiry notification, TTL purge

- Fix split-brain: only store participants that successfully enrolled in
  the backing ADMP group (enrolledParticipants vs uniqueParticipants)
- Fail-fast when zero participants enroll; clean up orphaned group
- Add makeError() helper so routes use error.statusCode, not string-matching
- Use nullish coalescing (??) for timeout_minutes default (0 was falsy)
- Require integer timeout_minutes; reject floats (e.g. 1.5)
- Notify participants via inbox on expireStale() so they know the session ended
- Add purgeStaleRoundTables() to memory.js and purgeStale() service method
  to prevent unbounded storage growth; hooked into cleanup loop with
  configurable ROUND_TABLE_PURGE_TTL_MS env var (default: 7 days)
- Add 5 missing tests: facilitator speak, expireStale lifecycle, dedup,
  goal validation, non-integer timeout validation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude bot commented Feb 28, 2026

Code Review

Good follow-up to PR 18. The split-brain fix is the most important change here and it is well-implemented. Several smaller issues worth discussing below.


What works well

  • makeError helper - clean replacement for the brittle string-matching getErrorStatusCode. Structured error codes are much more robust.
  • Nullish coalescing for timeout_minutes - correct fix; the old timeout_minutes || 30 would have silently defaulted to 30 if the caller passed 0.
  • Split-brain prevention - building enrolledParticipants from actual successful addMember calls rather than the input list is the right approach. The guard for enrolledParticipants.length === 0 with group cleanup is a good defensive measure.
  • Expiry notifications - good addition; participants should not have to poll to learn a session died.
  • Test coverage - the 5 new tests are well-scoped and the backdating trick in the expiry test is a clean pattern.

Bugs / Correctness

1. Facilitator not notified on expiry (round-table.service.js line 237)

The expiry notification loop iterates rt.participants, which by definition excludes the facilitator. The facilitator is equally interested in knowing their session expired.

Suggested fix:

const toNotify = [rt.facilitator, ...rt.participants];
for (const participantId of toNotify) { ... }

2. parseInt(env) || default is inconsistent with the PR's own fix (server.js line 39)

The new env var uses the same old pattern this PR was fixing:

const ROUND_TABLE_PURGE_TTL_MS = parseInt(process.env.ROUND_TABLE_PURGE_TTL_MS) || 7 * 24 * 60 * 60 * 1000;

parseInt returns NaN for an unset env var, and NaN || default works correctly. But if an operator sets ROUND_TABLE_PURGE_TTL_MS=0 intending "purge immediately", they silently get 7 days instead. Not critical given 0ms TTL is an unlikely value, but inconsistent with the PR's stated motivation.
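A minimal sketch of a NaN-safe alternative, assuming a small reusable helper (`parseIntEnv` is a hypothetical name, not something in the PR):

```javascript
// Hypothetical helper; not part of the PR.
function parseIntEnv(raw, fallback) {
  const parsed = Number.parseInt(raw, 10);
  // Fall back only when the value is missing or not numeric, so "0" stays 0.
  return Number.isNaN(parsed) ? fallback : parsed;
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;
console.log(parseIntEnv(undefined, SEVEN_DAYS_MS)); // 604800000
console.log(parseIntEnv('0', SEVEN_DAYS_MS));       // 0
console.log(parseInt('0', 10) || SEVEN_DAYS_MS);    // 604800000 (the pitfall)
```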

3. storage.updateRoundTable in expireStale is outside try-catch (round-table.service.js line 233)

for (const rt of tables) {
  if (rt.expires_at && ...) {
    await storage.updateRoundTable(rt.id, { status: 'expired' });  // no try-catch: throws -> loop aborts
    // notifications and group cleanup ...
    expired++;
  }
}

If updateRoundTable throws on one record, all subsequent entries in the batch are skipped. They remain open so they will be picked up again next cycle, but any mid-batch failure silently drops the rest of that run. Wrapping the inner block in a per-record try-catch would make expiry resilient to single-record storage failures.
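A rough sketch of the per-record guard, under stated assumptions: `storage` and `logger` are injected stand-ins, the expiry condition is a guess at the elided check above, and the notification and group-cleanup steps are omitted.

```javascript
// Sketch only; not the service's actual implementation.
async function expireStale(storage, logger, now = Date.now()) {
  const tables = await storage.listRoundTables({ status: 'open' });
  let expired = 0;
  for (const rt of tables) {
    // Assumed condition: skip tables with no expiry or a future expiry.
    if (!rt.expires_at || new Date(rt.expires_at).getTime() > now) continue;
    try {
      await storage.updateRoundTable(rt.id, { status: 'expired' });
      expired++;
    } catch (err) {
      // One bad record no longer aborts the batch; it stays open and is
      // picked up again on the next cleanup cycle.
      logger.warn(`expiry failed for ${rt.id}: ${err.message}`);
    }
  }
  return expired;
}
```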


Issues / Concerns

4. Silent partial enrollment - no API signal (round-table.service.js lines 63-66)

When some but not all participants fail to enroll, the session is created with a reduced participant list and the response looks normal. The caller has no way to know an agent was dropped without comparing the request to the response. Consider adding a warnings or excluded_participants field to the response, or at minimum documenting this behavior in docs/AGENT-GUIDE.md.

5. Group max_members set to pre-enrollment count (round-table.service.js line 53)

settings: { max_members: uniqueParticipants.length + 1, ... }

This is calculated from the requested participant count, not the enrolled count. If 3 of 5 participants fail to enroll, the group allows 6 members but only 3 are present (2 enrolled participants plus the facilitator). Depending on how max_members is enforced, this could allow additional unauthorized agents to join later. The split-brain fix is the right time to also align this to enrolledParticipants.length + 1 (set after the enrollment loop completes).
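The relationship between enrollment outcomes and the group cap can be sketched as follows. `enrollParticipants` and the injected `addMember` callback are hypothetical stand-ins for the service's enrollment loop and the group service call.

```javascript
// Sketch only: `addMember` is an injected stand-in for the group service.
async function enrollParticipants(requested, addMember) {
  const unique = [...new Set(requested)]; // drop duplicate IDs
  const enrolled = [];
  const excluded = [];
  for (const id of unique) {
    try {
      await addMember(id);
      enrolled.push(id);
    } catch {
      excluded.push(id); // enrollment failed; do not store this participant
    }
  }
  // +1 accounts for the facilitator, who owns the group.
  return { enrolled, excluded, maxMembers: enrolled.length + 1 };
}
```

Deriving `maxMembers` from `enrolled` after the loop keeps the cap and the stored participant list in agreement.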

6. purgeStaleRoundTables not in the documented storage interface contract

memory.js now implements purgeStaleRoundTables, but custom backend authors following the storage/index.js comment ("implement the same interface as memory.js") will not see it documented or asserted. If STORAGE_BACKEND=mech is in use, the mech adapter likely lacks this method and the cleanup job will throw a runtime error. A startup assertion (similar to the existing createAgent guard in storage/index.js) or a comment on the new method would help operators building custom adapters.
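One way such a startup check could be generalized, as a sketch. `assertStorageInterface` is a hypothetical helper, and the method list passed to it is illustrative rather than the full contract.

```javascript
// Hypothetical generalization of the startup guard; not the repo's code.
function assertStorageInterface(storage, requiredMethods) {
  const missing = requiredMethods.filter((name) => typeof storage[name] !== 'function');
  if (missing.length > 0) {
    throw new Error(`storage backend is missing required methods: ${missing.join(', ')}`);
  }
}

// Usage: fails at startup instead of inside the cleanup loop.
const incompleteBackend = { createAgent() {} };
try {
  assertStorageInterface(incompleteBackend, ['createAgent', 'purgeStaleRoundTables']);
} catch (err) {
  console.log(err.message);
}
```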


Test gaps

7. Zero-enrollment case not tested

The PR adds the guard for enrolledParticipants.length === 0, but no test covers this path. A test with all-invalid participant IDs should verify: 400 response, no round-table record created, no orphaned group left behind.

8. Expiry test does not assert inbox delivery

The expireStale test confirms rt.status === 'expired' and that late speak returns 409, but does not verify that participant inboxes received the expiry notification. Given notifications are the primary new behavior added to expireStale, asserting inbox delivery would give the test more signal.


Minor / Nits

  • getErrorStatusCode in round-tables.js is now a one-liner (return error.statusCode || 400). It could be inlined at call sites, but fine as-is.
  • expiredAt in expireStale is captured after updateRoundTable, so it may differ slightly from the logical close time. rt.expires_at is a more semantically precise timestamp for the notification body since that is when the session logically closed.
  • purgeStale in the service is just return await storage.purgeStaleRoundTables(...). The await on a direct return is unnecessary (though harmless).

Summary

| Area | Status |
| --- | --- |
| Split-brain fix | Correct and well-guarded |
| Error codes | Clean improvement |
| Integer timeout validation | Route + service both validate (defense in depth) |
| Nullish coalescing | Correct |
| Expiry notifications | Good, but facilitator is excluded (bug 1) |
| TTL purge | Good, but storage interface contract gap (issue 6) |
| Tests | Good, two gaps noted above |

The split-brain fix and error code cleanup are solid and worth merging. Two items to prioritize before merge: facilitator not notified on expiry (1), and the unguarded updateRoundTable in the expiry loop (3). The max_members mismatch (5) is also cleanest to fix while the enrollment loop is already being touched.


greptile-apps bot commented Feb 28, 2026

Greptile Summary

Addresses second review feedback from PR #18 with several critical fixes and improvements to round table session management.

Key improvements:

  • Split-brain prevention: Only successfully enrolled participants are stored in rt.participants, preventing desync with backing group membership
  • Cleaner error handling: Replaced fragile string-matching in getErrorStatusCode with a makeError(msg, statusCode) helper that attaches status codes directly to errors
  • Integer timeout validation: Added Number.isInteger() check to reject fractional timeout values
  • Nullish coalescing: Changed timeout_minutes || 30 to timeout_minutes ?? 30 for correct handling of explicit 0 (though validation prevents this)
  • Expiry notifications: expireStale() now notifies all participants via inbox when sessions timeout
  • TTL purge: New purgeStale() method with configurable ROUND_TABLE_PURGE_TTL_MS (default 7 days) prevents unbounded storage growth

Test coverage:
All 5 new tests pass, covering facilitator access, expiry lifecycle with backdated expiry verification, deduplication, and validation edge cases.

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • All changes are well-implemented with proper error handling, defensive coding (e.g., rt.participants || []), and comprehensive test coverage. The split-brain fix is a critical improvement that prevents data inconsistency. Code quality is high with clear logic and appropriate fallbacks.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/services/round-table.service.js | Split-brain fix, error code refactor with makeError helper, expiry notifications, and purge method added |
| src/routes/round-tables.js | Simplified error handling, added integer validation for timeout_minutes, fixed nullish coalescing |
| src/server.test.js | Added 5 comprehensive tests covering facilitator access, expiry lifecycle, deduplication, and validation |

Sequence Diagram

sequenceDiagram
    participant Cleanup as Cleanup Job
    participant RTS as RoundTableService
    participant Storage as Storage
    participant Inbox as InboxService
    participant Group as GroupService
    participant P1 as Participant 1
    participant P2 as Participant 2

    Cleanup->>RTS: expireStale()
    RTS->>Storage: listRoundTables({status: 'open'})
    Storage-->>RTS: [round tables]
    
    loop For each expired table
        RTS->>Storage: updateRoundTable(id, {status: 'expired'})
        Storage-->>RTS: updated
        
        RTS->>Inbox: send(to: P1, type: 'notification')
        Inbox-->>P1: Expiry notification
        
        RTS->>Inbox: send(to: P2, type: 'notification')
        Inbox-->>P2: Expiry notification
        
        RTS->>Group: delete(group_id)
        Group-->>RTS: deleted
    end
    
    RTS-->>Cleanup: expired count

Last reviewed commit: e401137

…resilient expiry loop, excluded_participants, max_members fix

- Notify facilitator AND participants on expiry (facilitator was excluded)
- Add version:1.0 + id to all inboxService.send envelopes (required fields)
- Wrap per-record expireStale in try-catch so one failure does not abort rest
- Return excluded_participants in create response so callers know who was dropped
- Align group max_members to enrolledParticipants.length+1 after partial enrollment
  via groupService.update() (prevents unauthorized join slots)
- Fix parseInt(env) || default -> NaN-safe pattern for ROUND_TABLE_PURGE_TTL_MS
- Use rt.expires_at (not new Date()) as canonical timestamp in expiry notifications
- Remove unnecessary return await in purgeStale
- Strengthen expiry test: assert inbox delivery to both participant and facilitator
- Add zero-enrollment test: all-ghost participants returns 400

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude bot commented Feb 28, 2026

Code Review

This is a well-executed follow-up. The core fixes are all present and most of the previously raised issues have been addressed. A few items remain worth discussing before merge.


What's well done

  • makeError helper — clean, consistent replacement for the string-matching getErrorStatusCode. Every throw in the service now uses it uniformly.
  • Split-brain fix — deriving enrolledParticipants from actual addMember successes is the correct approach. The zero-enrollment guard with group cleanup is good defensive programming.
  • expireStale resilience — wrapping each record in its own try-catch so one storage failure doesn't abort the whole batch. The "will retry next cycle" log message communicates intent clearly.
  • ROUND_TABLE_PURGE_TTL_MS parsing — correct use of Number.isNaN instead of ||; internally consistent with the PR's stated motivation.
  • Expiry notifications: [rt.facilitator, ...rt.participants] correctly includes the facilitator. Using rt.expires_at as the canonical timestamp is sensible.
  • Tests — the backdating pattern in the expiry test is clean, inbox delivery is now asserted, and the zero-enrollment case is covered.

Bugs / Correctness

1. Silent group deletion swallows errors on zero-enrollment path

if (enrolledParticipants.length === 0) {
  try { await groupService.delete(group.id, facilitator); } catch (_) {}
  throw makeError(...);
}

An orphaned group is a real correctness problem — it occupies a slot and may block future sessions with the same name. Yet a failure here produces no log entry. The rest of the expiry and cleanup paths use logger.warn; this should too:

try {
  await groupService.delete(group.id, facilitator);
} catch (err) {
  logger.warn({ groupId: group.id, err: err.message }, '[RoundTable] Could not clean up group on zero-enrollment');
}

2. purgeStaleRoundTables is not in the storage interface contract

memory.js gains purgeStaleRoundTables, but there appears to be no corresponding update to storage/index.js or the adapter documentation. Operators running STORAGE_BACKEND=mech (or any custom backend) will hit a runtime TypeError: storage.purgeStaleRoundTables is not a function in the cleanup loop. A startup existence check (parallel to the createAgent guard) or a documented no-op stub would prevent a silent production failure.


Design / Behaviour

3. excluded_participants is only present in the create response

create returns { ...rt, excluded_participants: [...] }, but the stored rt object does not carry this field. A client that captures the id and later fetches via GET will see a different shape. This is arguably intentional, but the discrepancy is surprising and undocumented. Options:

  • Store excluded_participants in the record so GET returns it too, or
  • Document the field as create-only in docs/AGENT-GUIDE.md.

4. max_members realignment failure is silently tolerated

try {
  await groupService.update(group.id, facilitator, { settings: { max_members: enrolledParticipants.length + 1, ... } });
} catch (err) {
  logger.warn(..., '[RoundTable] Could not update group max_members after partial enrollment');
}

If this update fails, the group retains the original upper bound based on uniqueParticipants.length and can accept more members than were enrolled. Whether this is exploitable depends on how the group service enforces membership, but the round table's own participant guard won't catch it. Worth deciding whether this should fail the create or at least surface the mismatch in the API response.


Test gaps

5. Zero-enrollment test doesn't verify group cleanup

The test description says "leaves no orphaned records", but the test body only asserts the HTTP status and error message. The whole point of the groupService.delete call is to prevent a dangling group, so it's worth asserting — e.g., that a subsequent GET on the group returns 404.


Nits

  • CLEANUP_INTERVAL_MS inconsistency (pre-existing): parseInt(env) || 60000 is now inconsistent with the new Number.isNaN pattern on line 38. Not introduced by this PR, but worth a follow-up ticket.
  • Self-notification on expiry: When the facilitator is in toNotify, the envelope will have from: rt.facilitator, to: rt.facilitator. Not wrong, but an agent filtering its inbox by from could inadvertently drop it. Documenting this or using a system-level sender identity (system@admp) would be cleaner.
  • Fragile agent IDs in zero-enrollment test: 'ghost-agent-1' and 'ghost-agent-2' are not unique across the test suite. If another test registers an agent with this ID before this test runs, the participant would actually enroll and the test would not exercise the intended path. Suffixing with a timestamp or random value is safer.
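For the last nit, a small sketch of a collision-resistant test ID. `uniqueAgentId` is a hypothetical helper; a timestamp plus a random suffix is one of several reasonable choices.

```javascript
// Hypothetical test helper: timestamp plus a random base-36 suffix.
function uniqueAgentId(prefix) {
  const suffix = Math.random().toString(36).slice(2, 10);
  return `${prefix}-${Date.now()}-${suffix}`;
}

const ghostA = uniqueAgentId('ghost-agent');
const ghostB = uniqueAgentId('ghost-agent');
console.log(ghostA !== ghostB); // true, barring an astronomically unlikely collision
```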

Summary

The critical correctness bugs from the earlier review are fixed. The two items I'd resolve before merging are:

  1. Log the group deletion failure on zero-enrollment — the silent swallow is inconsistent with every other error path.
  2. Add purgeStaleRoundTables to the storage interface contract — without this, any non-memory backend will fail at runtime in the cleanup loop.

The design questions around excluded_participants and max_members realignment are worth a follow-up but don't block merge if time is short.

Generated with Claude Code

…oll log, test robustness

- Add startup assertion for purgeStaleRoundTables in storage/index.js so any
  backend missing the method fails fast at startup instead of at runtime
- Log groupService.delete failure on zero-enrollment path (was silently swallowed;
  inconsistent with every other error path in the service)
- Fix zero-enrollment test: use groupService.listForAgent for backend-agnostic
  group cleanup assertion (storage.groups Map only exists in memory backend)
- Use unique ghost agent IDs in zero-enrollment test to prevent collision with
  agents registered by other tests
- Import groupService in test file for direct service-level assertions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude bot commented Feb 28, 2026

Code Review (Third Pass)

Two prior reviews have been addressed well. This pass focuses on what's changed since the last review and a few remaining items.


Confirmed fixed from previous reviews

  • Zero-enrollment group deletion now logs via logger.warn — consistent with all other error paths.
  • purgeStaleRoundTables startup assertion added to storage/index.js alongside the existing createAgent guard — operators with custom backends will get a clear crash at startup rather than a runtime TypeError.
  • Zero-enrollment test now verifies no orphaned groups remain (groupsAfter === groupsBefore) and uses unique participant IDs via Date.now() + random suffix.

Good incremental progress on each iteration.


Remaining issues

1. No test for partial enrollment (some succeed, some fail)

The split-brain fix's most important scenario — partial enrollment — is untested. The tests cover the all-succeed case (normal create) and the all-fail case (zero-enrollment), but nothing in between.

A test with one valid and one invalid participant would confirm that:

  • The response participants array contains only the enrolled agent
  • excluded_participants contains the dropped agent
  • The backing group has the correct max_members (enrolled + 1, not requested + 1)

Without this test, the core behavioral change of this PR (deriving enrolledParticipants from actual addMember outcomes) is only indirectly exercised.

2. max_members realignment failure is silently tolerated (carried from previous review)

If the groupService.update call to realign max_members after partial enrollment fails, the group retains a higher cap than the actual enrollment. The code logs a warn and continues, but the round table record is then stored with a mismatch between rt.participants.length and the group's max_members. Whether this allows additional unauthorized agents to join depends on how the group service enforces membership, but the round table's own _requireParticipant check won't catch it at the group level. This is not a new issue but remains open.

3. excluded_participants is always present in the create response

return { ...rt, excluded_participants: excludedParticipants } returns excluded_participants: [] even when no participants were excluded. Callers must handle this field unconditionally. The field is useful when non-empty, but returning it as an empty array in the happy path adds a shape difference between the create response and the stored record (which omits it). A minor API contract inconsistency worth either documenting or resolving by only including the field when non-empty.
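If the "only when non-empty" option is chosen, a conditional spread is one compact way to express it. The function name `buildCreateResponse` is hypothetical; only the field name comes from the PR.

```javascript
// Sketch: include excluded_participants only when someone was dropped.
function buildCreateResponse(rt, excludedParticipants) {
  return {
    ...rt,
    ...(excludedParticipants.length > 0
      ? { excluded_participants: excludedParticipants }
      : {}),
  };
}

console.log(buildCreateResponse({ id: 'rt-1' }, []));
// { id: 'rt-1' }
console.log(buildCreateResponse({ id: 'rt-1' }, ['ghost-1']));
// { id: 'rt-1', excluded_participants: [ 'ghost-1' ] }
```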


Nits

  • CLEANUP_INTERVAL_MS inconsistency (pre-existing): parseInt(env) || 60000 is now inconsistent with the Number.isNaN pattern on the new ROUND_TABLE_PURGE_TTL_MS line. Not introduced here, but worth a follow-up.
  • Self-notification on expiry: The from: rt.facilitator, to: rt.facilitator envelope is harmless but slightly odd; agents filtering by sender could inadvertently suppress it. A system sender identity (system@admp or similar) would be cleaner for automated notifications.

Summary

| Item | Status |
| --- | --- |
| logger.warn on zero-enrollment cleanup | Fixed |
| storage interface assertion | Fixed |
| Zero-enrollment test + group check | Fixed |
| Partial enrollment test | Missing |
| max_members realignment failure handling | Open |
| excluded_participants API shape | Design question |

The critical correctness issues from the previous reviews are all addressed. The one item worth resolving before merge is the missing test for partial enrollment — it's the primary behavioral guarantee this PR adds. The other items are lower priority but worth tracking.

Generated with Claude Code

…pe, encodeURIComponent in test

- Add partial enrollment test: verifies split-brain prevention is exercised
  (one valid + one ghost participant; asserts participants=[valid],
  excluded_participants=[ghost], group max_members=2)
- Only include excluded_participants in create response when non-empty,
  keeping create and GET response shapes consistent on the happy path
- Fix group ID URL encoding in test (group:// IDs contain slashes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude bot commented Feb 28, 2026

Code Review — PR #19

Good set of targeted fixes. The split-brain correction is the most important change and is done well. Notes below, roughly by severity.


Correctness

Expiry notification sender identity (round-table.service.js, expireStale)

Notifications are sent with from: rt.facilitator, which means the facilitator receives a message that appears to be from themselves. This is semantically odd and could confuse agents that inspect the from field to decide how to handle a notification. A system identity (e.g. 'system' or the server's own DID) would be cleaner. At minimum it should be documented.

// Current: facilitator notifies themselves
from: rt.facilitator,
to: rt.facilitator,   // looks like a self-message

verify_signature: false in expiry notifications

The { verify_signature: false } option bypasses signature verification when sending inbox messages on behalf of the facilitator. The server doesn't hold participants' private keys so this is probably necessary, but:

  1. There is no comment explaining why this bypass is needed.
  2. If inboxService.send has validation gating on the from field, this bypass could allow spoofing any sender in pathological cases. A brief inline comment clarifying the intent would help reviewers.

API Contract / Client Impact

excluded_participants is optional in create response, absent from GET

The create response conditionally includes excluded_participants when non-empty; the GET endpoint never returns it. This creates two inconsistencies:

  • Clients that want to know which agents were excluded after the fact cannot retrieve that information via GET.
  • Partial enrollment is a recoverable state, not an error — clients shouldn't have to treat an optional field as a signal.

Consider one of:

  • Always return excluded_participants: [] from create (so clients can rely on the field existing).
  • Persist excluded_participants on the rt record and expose it from GET as well.
  • Document in docs/AGENT-GUIDE.md that this field only appears at create-time and why.

Missing Docs Update

Per CLAUDE.md, behavior changes must be documented in docs/AGENT-GUIDE.md. This PR introduces three significant behavioral additions with no docs update:

  • Expiry notifications sent to facilitator and participants on timeout
  • excluded_participants in the create response
  • ROUND_TABLE_PURGE_TTL_MS env var and purge behavior

Performance

Sequential notification loop in expireStale

for (const recipientId of toNotify) {
  try {
    await inboxService.send(...)  // sequential
  } catch ...
}

For a table with 20 participants, this sends notifications one at a time. Since each send is independent, Promise.allSettled would be a straight drop-in and would parallelize the work:

await Promise.allSettled(toNotify.map(async (recipientId) => {
  try {
    await inboxService.send(...)
  } catch (err) {
    logger.warn(...)
  }
}));
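A runnable variant of the same idea, as a sketch. Here `send` is an injected stand-in for the inbox call, and rejections are collected from the `Promise.allSettled` results instead of being caught inside each callback; this is an alternative design, not what the PR does.

```javascript
// Sketch: deliver in parallel and report which recipients failed.
async function notifyAll(recipients, send) {
  // The async wrapper turns synchronous throws into rejections too.
  const results = await Promise.allSettled(recipients.map(async (id) => send(id)));
  return results
    .map((result, i) => ({ id: recipients[i], result }))
    .filter(({ result }) => result.status === 'rejected')
    .map(({ id, result }) => ({ id, reason: String(result.reason?.message ?? result.reason) }));
}
```

Returning the failures lets the caller decide whether to log, retry, or ignore them, while one failed delivery still cannot block the others.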

Low / Minor

Storage startup assertion may break rolling deployments

The new assertion in storage/index.js:

if (typeof _storage.purgeStaleRoundTables !== 'function') {
  throw new Error('storage: purgeStaleRoundTables is missing ...');
}

This crashes the process if the storage adapter doesn't implement the new method. On a rolling deploy, new server instances will fail to start until the storage adapter (e.g. a Redis adapter) is also updated. Operator runbook should call this out, or the error message should include upgrade steps.

timeout_minutes validation is duplicated

The integer check now exists in both the route handler and roundTableService.create. This is fine as defense-in-depth, but the duplicate logic means if the allowed range changes in the service, the route error message will diverge. Consider keeping validation in one layer (the service, since it's the authority) and letting the route rely on the structured error from makeError.

purgeStaleRoundTables uses expires_at as close time

For expired tables, the purge uses rt.expires_at as the start of the retention window, not the time at which expiration was actually processed. This is the correct semantic choice (use canonical close time, not processing time), but if the cleanup job is significantly delayed, a record may be purged sooner after being marked expired than the TTL suggests. Worth a brief comment in purgeStaleRoundTables.


Strengths

  • The split-brain fix is solid: partial enrollment tracked, excluded_participants returned, zero-enrollment cleaned up, group max_members aligned.
  • makeError is a clear improvement over string-matching in getErrorStatusCode.
  • Failure isolation in expireStale (one expired record can't abort the rest) is robust.
  • Test coverage is good — the backdated expiry test and zero-enrollment cleanup test are particularly valuable.
  • Storage startup assertion pattern is good practice.

Summary: The core fixes are correct and well-tested. The main asks before merge are: (1) clarify/fix the notification sender identity, (2) add a comment on verify_signature: false, (3) update docs/AGENT-GUIDE.md. The API contract point on excluded_participants is worth discussing but not necessarily a blocker.

…cs update

- Replace sequential notification loop in expireStale with Promise.allSettled
  for parallel delivery to all recipients (up to 21 for a full session)
- Add inline comments explaining:
  - why from=rt.facilitator is used for expiry notifications (logical author)
  - why facilitator receives a self-addressed copy (documented intent)
  - why verify_signature:false is required (server holds no private keys)
  - why rt.expires_at is used as timestamp (canonical, not processing time)
- Update docs/AGENT-GUIDE.md: add Round Tables endpoint table and behavior notes
  documenting excluded_participants, expiry notifications, partial enrollment,
  storage growth/purge, and ROUND_TABLE_PURGE_TTL_MS env var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude bot commented Feb 28, 2026

Code Review — PR #19

Previous review items are well addressed. Here is a focused pass on the current diff.


What was fixed from prior reviews

  • Facilitator now included in expiry notifications via [rt.facilitator, ...rt.participants]
  • parseInt || default replaced with Number.isNaN guard for ROUND_TABLE_PURGE_TTL_MS
  • Sequential notification loop replaced with Promise.allSettled
  • docs/AGENT-GUIDE.md updated with expiry notification, partial enrollment, purge, and excluded_participants behavior
  • verify_signature: false has an explanatory comment
  • excluded_participants is omitted (not returned as empty array) on the happy path
  • Partial enrollment test added, covers max_members alignment

Remaining items

[MEDIUM] Facilitator-in-participants edge case — no guard

If the caller includes the facilitator's own agent ID in the participants list, groupService.addMember will attempt to add the owner as a member of their own group. This likely throws (already a member / already owner), which would land the facilitator in excludedParticipants. Two failure modes:

  1. All other participants also fail → session aborts with 400 'No participants could be enrolled' even though the facilitator is a valid agent.
  2. Some participants succeed → excluded_participants unexpectedly contains the facilitator's own ID, which is confusing to clients.

Suggested fix in the route or service:

if (participants.includes(facilitatorId)) {
  throw makeError('facilitator cannot be listed as a participant', 400);
}

[MEDIUM] Storage startup assertion blocks rolling deployments

The new assertion in storage/index.js crashes the process if purgeStaleRoundTables is missing:

if (typeof _storage.purgeStaleRoundTables !== 'function') {
  throw new Error('storage: purgeStaleRoundTables is missing...');
}

On a rolling deploy, new application instances will refuse to start until every storage adapter is updated. Operators using Redis or custom adapters will hit a hard failure. The error message points to memory.js which helps, but the README or a CHANGELOG entry should explicitly call this out as a breaking change for custom adapter implementors.

[LOW] _purgeTtlParsed is an unusual module-level name

const _purgeTtlParsed = parseInt(process.env.ROUND_TABLE_PURGE_TTL_MS);
const ROUND_TABLE_PURGE_TTL_MS = Number.isNaN(_purgeTtlParsed) ? 7 * 24 * 60 * 60 * 1000 : _purgeTtlParsed;

The leading underscore convention typically signals an unused/private variable and may trigger linter warnings. An IIFE is cleaner:

const ROUND_TABLE_PURGE_TTL_MS = (() => {
  const parsed = parseInt(process.env.ROUND_TABLE_PURGE_TTL_MS);
  return Number.isNaN(parsed) ? 7 * 24 * 60 * 60 * 1000 : parsed;
})();

[LOW] timeout_minutes validation duplicated in route and service

The integer check (!Number.isInteger) now exists in both the route handler and roundTableService.create. As defense-in-depth this is harmless, but the two error messages will drift if the valid range changes. Since makeError now propagates structured status codes cleanly, the route could drop its copy and rely on catching the service error.

[LOW] Partial enrollment test does not verify speak access post-create

The test confirms that only the enrolled participant appears in rt.participants and that max_members is aligned — both good. It does not verify:

  1. The enrolled participant can POST /speak and get 201
  2. The ghost agent attempting to speak gets 403

These are the behavioral guarantees the split-brain fix is meant to provide. A couple of extra assertions in the existing test would close that gap.
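
The two missing assertions can be illustrated with a small in-memory model. This is only a sketch of the shape of the checks: `createRoundTable`, `speak`, and the `enroll` callback are hypothetical stand-ins, and the real test would exercise the HTTP routes rather than these functions.

```javascript
// Hypothetical in-memory model of the create/speak flow. It mirrors the
// behavior described in this PR (only enrolled participants are stored,
// and the facilitator may speak) but is not the project's actual code.
function createRoundTable({ facilitator, participants, enroll }) {
  const unique = [...new Set(participants)]; // deduplicate first
  const enrolled = unique.filter((id) => enroll(id));
  const excluded = unique.filter((id) => !enrolled.includes(id));
  return { facilitator, participants: enrolled, excluded_participants: excluded };
}

function speak(rt, agentId) {
  const allowed = agentId === rt.facilitator || rt.participants.includes(agentId);
  return { status: allowed ? 201 : 403 };
}

// Partial enrollment: 'ghost' fails to enroll in the backing group.
const rt = createRoundTable({
  facilitator: 'fac',
  participants: ['alice', 'ghost', 'alice'],
  enroll: (id) => id !== 'ghost',
});

const enrolledStatus = speak(rt, 'alice').status; // 201
const ghostStatus = speak(rt, 'ghost').status;    // 403
```

Asserting on both statuses, not just on the contents of `rt.participants`, is what ties the stored list to the access check.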


Nit

purgeStaleRoundTables uses rt.resolved_at || rt.expires_at as the cutoff, which is the correct semantic (canonical close time, not processing time). A brief comment in memory.js would help custom adapter authors understand why expires_at is used instead of Date.now().
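
For custom adapter authors, the cutoff rule might look roughly like the sketch below. The function signature, the `Map`-based store, the epoch-millisecond timestamps, and the `status` values are all assumptions made for illustration; the actual memory.js implementation may differ.

```javascript
// Hypothetical sketch of a purge pass; not the actual memory.js code.
// Assumptions: records live in a Map keyed by id, timestamps are epoch
// milliseconds, and `status` is 'open', 'resolved', or 'expired'.
function purgeStaleRoundTables(store, ttlMs, now = Date.now()) {
  let purged = 0;
  for (const [id, rt] of store) {
    if (rt.status === 'open') continue; // never purge live sessions

    // Canonical close time: when the session was resolved, or when it was
    // due to expire, rather than the moment a sweep happened to process it.
    const closedAt = rt.resolved_at || rt.expires_at;

    if (closedAt && now - closedAt > ttlMs) {
      store.delete(id); // deleting the current entry while iterating a Map is safe
      purged += 1;
    }
  }
  return purged;
}
```

Measuring age from the close time means a record that expired long ago is purged on schedule even if the expiry sweep only reached it recently.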


Summary

| Item | Status |
| --- | --- |
| Facilitator expiry notification | Fixed |
| parseInt NaN handling | Fixed |
| Parallel notifications via Promise.allSettled | Fixed |
| AGENT-GUIDE.md docs | Updated |
| verify_signature comment | Added |
| Partial enrollment test + max_members check | Added |
| Facilitator-in-participants guard | Open |
| Rolling deploy note for storage assertion | Open |
| _purgeTtlParsed naming | Nit |
| Duplicate timeout_minutes validation | Nit |

The core split-brain fix is correct and well-tested. The two medium items are worth addressing before merge: the facilitator edge case is a real correctness hole, and the deploy note is an operator safety concern. The low items can follow in a subsequent cleanup pass.

Generated with Claude Code

…TTL, rolling deploy note

- Add FACILITATOR_IN_PARTICIPANTS validation: reject creates where the facilitator
  is listed as a participant (addMember would throw 'already owner', landing the
  facilitator in excluded_participants or aborting on zero-enrollment path)
- Replace _purgeTtlParsed temp var with IIFE to avoid underscore-private convention
- Add BREAKING CHANGE comment to storage/index.js purgeStaleRoundTables assertion
  so operators deploying custom adapters see upgrade instructions at the assertion site
- Add canonical-close-time comment to memory.js purgeStaleRoundTables explaining
  why resolved_at/expires_at is used instead of Date.now()
- Strengthen partial enrollment test: assert enrolled agent can speak (201) and
  a non-enrolled registered agent gets 403 (not just unregistered ghost)
- Add facilitator-in-participants test: verify 400 + FACILITATOR_IN_PARTICIPANTS error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
claude bot commented Feb 28, 2026

Code Review — PR #19 (Final Pass)

This is a well-executed series of fixes. All medium and high priority items from prior review rounds are resolved. Notes below on what is confirmed fixed and what remains.


Fixed from previous reviews

  • Facilitator-in-participants guard — route-level check (FACILITATOR_IN_PARTICIPANTS) and matching test added. This closes the edge case where the facilitator's own ID in the participants list could cause confusing behavior or an unexpected entry in excluded_participants.
  • IIFE for ROUND_TABLE_PURGE_TTL_MS — cleaner than the intermediate _purgeTtlParsed variable; avoids potential linter warnings on the underscore prefix.
  • Partial enrollment test — now includes post-create speak access verification (enrolled participant gets 201, non-enrolled agent gets 403) alongside the max_members alignment check. This is the behavioral guarantee the split-brain fix exists to provide and was the last meaningful test gap.
  • purgeStale service method — returns directly without a redundant await on the already-returned promise.
  • expireStale resilience — per-record try/catch is in place so a single storage failure does not abort the rest of the batch.
  • Promise.allSettled for expiry notifications — correct; a 20-participant table no longer sends notifications serially.
  • [rt.facilitator, ...rt.participants] — facilitator now correctly included in expiry notifications.
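
The last three items combine into a pattern that can be sketched as follows. The `storage.listOpenRoundTables`, `storage.markExpired`, and `inbox.send` interfaces, and the message shape, are hypothetical stand-ins rather than the project's real modules.

```javascript
// Hypothetical sketch of the expiry sweep described above. The storage and
// inbox objects are assumed interfaces, not the project's actual modules.
async function expireStale(storage, inbox, now = Date.now()) {
  const expired = [];
  for (const rt of await storage.listOpenRoundTables()) {
    if (rt.expires_at > now) continue; // still live

    try {
      await storage.markExpired(rt.id);
    } catch (error) {
      // One failing record must not abort the rest of the batch.
      continue;
    }

    // Notify the facilitator and every participant concurrently. The async
    // wrapper turns a synchronous throw in send() into a rejection, so
    // Promise.allSettled always sees one outcome per recipient.
    const recipients = [rt.facilitator, ...rt.participants];
    const results = await Promise.allSettled(
      recipients.map(async (agentId) =>
        inbox.send(agentId, { type: 'round_table_expired', round_table_id: rt.id })
      )
    );

    expired.push({
      id: rt.id,
      notified: results.filter((r) => r.status === 'fulfilled').length,
    });
  }
  return expired;
}
```

With `Promise.allSettled`, a failed delivery to one agent is recorded as a rejected result instead of rejecting the whole batch, which is why it suits best-effort notifications better than `Promise.all`.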

Remaining

[LOW] Duplicate timeout_minutes validation

The integer check now lives in both src/routes/round-tables.js:47 and src/services/round-table.service.js:37. The route error message is "timeout_minutes must be an integer"; the service message is "timeout_minutes must be an integer between 1 and 10080 (7 days)". Since makeError propagates status codes cleanly through the route's getErrorStatusCode, the route could drop its copy and rely on the richer service error. As-is, the messages will silently diverge if the valid range changes. Defense-in-depth is a valid reason to keep both — worth a brief comment if intentional.
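
If the route copy were dropped, the single-source version could look something like this sketch. `resolveTimeoutMinutes` and `toHttpResponse` are hypothetical names, the body of `makeError` is a guess at its shape, and the 201 success status is illustrative; only the error message, the 30-minute default, and the 1 to 10080 range come from this PR.

```javascript
// Hypothetical sketch; the real makeError implementation may differ.
function makeError(message, statusCode) {
  const error = new Error(message);
  error.statusCode = statusCode;
  return error;
}

const MAX_TIMEOUT_MINUTES = 10080; // 7 days

// Service-side check: the only place the rule and its message are defined.
function resolveTimeoutMinutes(timeout_minutes) {
  const value = timeout_minutes ?? 30; // default applies only to null/undefined
  if (!Number.isInteger(value) || value < 1 || value > MAX_TIMEOUT_MINUTES) {
    throw makeError(
      `timeout_minutes must be an integer between 1 and ${MAX_TIMEOUT_MINUTES} (7 days)`,
      400
    );
  }
  return value;
}

// Route-side: no duplicate check, just map the structured error to a response.
function toHttpResponse(fn) {
  try {
    return { status: 201, body: fn() };
  } catch (error) {
    return { status: error.statusCode || 400, body: { error: error.message } };
  }
}
```

Here a float such as 1.5 and an explicit 0 both surface as a 400 carrying the service's message, so there is only one message to update if the valid range changes.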

[LOW] getErrorStatusCode is now a one-liner

Single call site in round-tables.js. Could be inlined as res.status(error.statusCode || 400), but fine as-is.

[LOW] Breaking change docs for custom adapters

storage/index.js crashes clearly with a helpful error message, and AGENT-GUIDE.md now documents the purge behavior and env var. README.md has no mention of the storage adapter interface change. Operators on rolling deploys with custom backends won't find it until startup fails. A one-liner in the README upgrade notes or a CHANGELOG entry would close the gap.


Summary

| Area | Status |
| --- | --- |
| Split-brain fix | Fixed and well-tested |
| makeError error codes | Clean |
| Facilitator-in-participants guard | Fixed |
| Expiry notifications (facilitator + participants) | Fixed |
| Promise.allSettled for notifications | Fixed |
| ROUND_TABLE_PURGE_TTL_MS IIFE parsing | Fixed |
| Storage startup assertion | Added with helpful message |
| Partial enrollment test + speak access | Complete |
| Zero-enrollment cleanup + test | Complete |
| Duplicate timeout validation | Low: defensible, but messages diverge |
| Breaking change docs for adapters | Low: error message helps, no README note |

Merge assessment: Ready to merge. The remaining items are all low-priority nits. The core behavioral changes (split-brain prevention, expiry notifications, TTL purge) are correct, well-guarded, and tested.

Generated with Claude Code

@dundas dundas merged commit bf46539 into main Feb 28, 2026
2 checks passed
@dundas dundas deleted the fix/round-table-followup branch February 28, 2026 21:08
dundas added a commit that referenced this pull request Mar 1, 2026
Documents the Round Table feature shipped in PR #19 (bf46539):
- API-REFERENCE.md: full Round Tables section (5 endpoints, request/response
  shapes, role semantics, error examples, excluded_participants behaviour)
- ERROR-CODES.md: new Round Table Errors section (13 error codes)
- llms.txt: Round Tables endpoint block, ROUND_TABLE_PURGE_TTL_MS env var,
  error codes for AI agent consumption
- docs-generator.json: round-tables.js and round-table.service.js added
  to sources so future /docs-generator runs include them

AGENT-GUIDE.md already documented Round Tables during the feature PR review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dundas added a commit that referenced this pull request Mar 1, 2026
* docs: add Round Tables to API reference, error codes, and llms.txt

* fix(docs): correct Round Table status value, speak access, and RESOLVE_FAILED HTTP codes

- llms.txt: ?status=active → ?status=open (active is not a valid status)
- API-REFERENCE.md: facilitator CAN speak (_requireParticipant allows it)
- ERROR-CODES.md: RESOLVE_FAILED HTTP range is 400/403/404/409 not 403/404
  (409 when session is already resolved/expired via _getOpen)

Addresses self-review on PR #20.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(docs): cycle 2 — llms.txt under limit, status value, RESOLVE_FAILED codes

- llms.txt: compress to 246 lines (limit 250); consolidate RT error codes,
  compact envelope example and CLI section
- llms.txt: ?status=open not ?status=active (wrong status name from cycle 1)
- llms.txt: RESOLVE_FAILED updated to 400/403/404/409
- API-REFERENCE.md: facilitator CAN speak (fix incorrect claim)
- ERROR-CODES.md: RESOLVE_FAILED HTTP column now 400/403/404/409

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>