TOTP verification blocks asyncio event loop with synchronous subprocess calls #123

@dcellison

Summary

All three file-access functions in totp.py use synchronous subprocess.run() to call sudo cat and sudo tee. These are called from async handlers in bot.py, blocking the asyncio event loop for the duration of each subprocess. A single TOTP verification can spawn up to three sequential sudo calls, each with a 5-second timeout, meaning the event loop can be blocked for up to 15 seconds during a single verify_code() invocation.

Blocking calls

Three private functions perform synchronous subprocess calls:

  • _read_secret() (line 49): subprocess.run(["sudo", "-n", "cat", TOTP_SECRET_PATH], timeout=5) - reads the TOTP secret
  • _read_attempts() (line 73): subprocess.run(["sudo", "-n", "cat", TOTP_ATTEMPTS_PATH], timeout=5) - reads rate-limiting state
  • _write_attempts() (line 105): subprocess.run(["sudo", "-n", "tee", TOTP_ATTEMPTS_PATH], timeout=5) - persists rate-limiting state

These are wrapped by the public API that bot.py calls:

| Public function | Subprocess calls | Max block time |
|---|---|---|
| is_totp_configured() | _read_secret() x1 | 5s (cached after first True) |
| get_lockout_remaining() | _read_attempts() x1 | 5s |
| get_failure_count() | _read_attempts() x1 | 5s |
| verify_code() | _read_attempts() + _read_secret() + _write_attempts() | 15s |

Call sites in bot.py

During a TOTP verification flow in handle_message(), the following synchronous calls execute on the event loop:

  1. Line 1627: is_totp_configured() - usually cached, but blocks on first call
  2. Line 1658: get_lockout_remaining() - one sudo call
  3. Line 1669: verify_code() - three sequential sudo calls
  4. Line 1676: get_lockout_remaining() - one sudo call (on failure path)
  5. Line 1683: get_failure_count() - one sudo call (on failure path)

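A failed verification attempt runs steps 1-5. Summing the per-step counts above gives up to 7 sudo calls (1 + 1 + 3 + 1 + 1) and a worst-case block time of 35 seconds, or 6 calls and 30 seconds once is_totp_configured() is cached. The success path (steps 1-3) is up to 5 calls and 25 seconds, or 4 calls and 20 seconds with the cache warm.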

Impact

While subprocess.run() is executing, the entire asyncio event loop is frozen. No other coroutines can make progress:

  • All users are affected, not just the one authenticating. Other users' messages, webhook deliveries, health checks, and cron job dispatches all stall.
  • Under normal conditions, sudo -n completes in milliseconds and the blocking is negligible. But if sudo hangs (misconfigured sudoers, PAM issue, NFS-mounted /etc, disk I/O pressure), the event loop freezes for up to the 5-second timeout per call.
  • The 5-second timeout per call is a hard cap, but even sub-second blocking is problematic in principle. Synchronous I/O in an async event loop violates the cooperative scheduling contract and can cause cascading latency spikes for all concurrent operations.
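To make the freeze concrete, here is a small self-contained sketch. It does not touch totp.py or sudo: time.sleep() stands in for a slow subprocess.run(), and every name in it (slow_sync_call, count_ticks_during, blocking_handler, offloaded_handler) is hypothetical. A background "ticker" coroutine plays the role of the other users' traffic; it makes no progress at all while the blocking handler runs, but keeps ticking when the same call is pushed to a worker thread with asyncio.to_thread() (Python 3.9+).

```python
import asyncio
import time


def slow_sync_call(duration: float) -> None:
    """Stand-in for a slow subprocess.run(); blocks the calling thread."""
    time.sleep(duration)


async def blocking_handler() -> None:
    # Same shape as the current code path: sync call made directly in a coroutine.
    slow_sync_call(0.2)


async def offloaded_handler() -> None:
    # Stopgap: run the sync call in a worker thread so the loop stays free.
    await asyncio.to_thread(slow_sync_call, 0.2)


async def count_ticks_during(handler) -> int:
    """Run handler() while a ticker coroutine counts how often it got scheduled."""
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0)  # let the ticker reach its first sleep
    await handler()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ticks


if __name__ == "__main__":
    print("ticks while blocked:  ", asyncio.run(count_ticks_during(blocking_handler)))
    print("ticks while offloaded:", asyncio.run(count_ticks_during(offloaded_handler)))
```

The blocking handler yields zero ticks because the loop never regains control until the call returns. The offloaded handler yields several, since only the worker thread is waiting.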

Why sudo is involved

The TOTP secret and rate-limiting state live in root-owned files under /etc/kai/ (mode 0600). The bot runs as user kai and accesses them via sudoers-authorized sudo -n cat and sudo -n tee commands. This is a deliberate security boundary: the kai user (and inner Claude) cannot directly read the secret or tamper with the lockout state. The subprocess-based access is correct from a security standpoint; the issue is that it uses the synchronous subprocess API instead of the async one.
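One possible shape for the async equivalent is sketched below, assuming the fix keeps the same sudo -n cat / sudo -n tee commands and swaps only the subprocess API. The helper run_command() and the wrapper read_secret() are hypothetical names, not functions from totp.py; the demo at the bottom uses the Python interpreter as a stand-in command so it runs without sudo.

```python
import asyncio
import sys
from typing import Optional, Sequence


async def run_command(
    argv: Sequence[str],
    stdin_data: Optional[bytes] = None,
    timeout: float = 5.0,
) -> Optional[bytes]:
    """Run argv without blocking the event loop.

    Returns stdout on exit code 0, or None on timeout or non-zero exit.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdin=asyncio.subprocess.PIPE if stdin_data is not None else asyncio.subprocess.DEVNULL,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        # The loop keeps serving other coroutines while we wait here.
        stdout, _stderr = await asyncio.wait_for(proc.communicate(stdin_data), timeout)
    except asyncio.TimeoutError:
        # Keep the same per-call cap as today, but clean up the child process.
        proc.kill()
        await proc.wait()
        return None
    return stdout if proc.returncode == 0 else None


async def read_secret(path: str) -> Optional[str]:
    """Hypothetical async counterpart of _read_secret(); path is supplied by the caller."""
    out = await run_command(["sudo", "-n", "cat", path])
    return out.decode().strip() if out is not None else None


if __name__ == "__main__":
    # Stand-in command so the sketch can be tried without sudo.
    demo = asyncio.run(run_command([sys.executable, "-c", "print('ok')"]))
    print(demo)
```

A write path would pass the serialized state as stdin_data to a sudo -n tee command in the same way. Callers in bot.py would then need to await these functions, so the public API in totp.py would become async as well.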

Severity

MEDIUM - only affects deployments with TOTP enabled (opt-in). Under normal conditions each sudo call completes in milliseconds and the blocking goes unnoticed. But it degrades poorly: any sudo slowdown (disk, PAM, sudoers misconfiguration) blocks all bot operations for all users for up to 5 seconds per call, with nothing surfaced to operators when a timeout occurs and no graceful degradation.

Labels: bug, enhancement