Skip to content

daemon: log file freezes silently at 20 MB cap; tcpPortUsed.check ETIMEDOUT becomes unhandledRejection #37

@Rinse12

Description

@Rinse12

Three related bugs were diagnosed on a long-running production daemon (new-plebbit host, bitsocial container, image ghcr.io/bitsocialnet/bitsocial-cli:latest v0.19.49, up 11 days, healthy). Symptoms reported: bitsocial logs is stuck — no new output despite the daemon clearly being alive (RPC reachable, healthcheck green).

Bug 1 — silent 20 MB cap permanently disables file logging

src/cli/commands/daemon.ts lines 121–134:

const logFile = fs.createWriteStream(logFilePath, { flags: "a" });
const isLogFileOverLimit = () => logFile.bytesWritten > 20000000; // 20mb

const writeTimestampedLine = (text: string, stream: "stdout" | "stderr") => {
    if (isLogFileOverLimit()) return;   // ← drops all further writes for the rest of the daemon's life
    ...
    logFile.write(timestamped);
};

Once logFile.bytesWritten exceeds 20,000,000, every subsequent write is silently dropped for the lifetime of the daemon process. There is no in-process rotation (_getNewLogfileByEvacuatingOldLogsIfNeeded only runs at startup).

On the affected container this happened on 2026-04-25 14:38:48 after ~28 hours of operation, ~10 days ago. The log file is exactly 20,022,081 bytes, daemon PID 7 still holds fd 20 open (pos: 20022081, flags: 02102001). bitsocial logs -f is correctly tailing a file that no longer changes.

What filled it up: a flood of pkc-js:base-client-manager:resolveAuthorNamesInBackground:error debug entries (~1.5 KB each) for failing .eth resolutions.

Fix

Replace the silent hard-cap with front-truncate rotation: when bytesWritten exceeds 20 MB, drop the oldest portion of the file (keep the most recent ~15 MB), reopen the stream in append mode (resetting bytesWritten), and continue writing. Always keep the file around 20 MB max.

Bug 2 — stderr override silently swallows debug output when file logger fails

Same file, lines 157–164:

process.stderr.write = (...args) => {
    // Only write stderr to the log file, not to the terminal.
    writeTimestampedLine(asString(args[0]).trimStart(), "stderr");
    return true;
};

process.stderr.write was overridden to ONLY route to the file logger. After Bug 1 trips, all stderr output from JS code disappears entirely (not in file, not in docker logs) — only errors hand-routed via writeErrorToTerminal (uncaughtException/unhandledRejection handlers) survive because they call the captured original stderrWrite.

Fix

If file write fails or the stream is closed, fall back to writing to the original stderr. With Bug 1 fixed (file rotation always writes), the fallback is mostly defensive, but it ensures we never silently lose error output.

Bug 3 — tcpPortUsed.check rejection in keepKuboUpInterval becomes unhandledRejection

Same file, lines 533–545:

keepKuboUpInterval = setInterval(async () => {
    if (mainProcessExited) return;
    const isRpcPortTaken = await tcpPortUsed.check(Number(pkcRpcUrl.port), pkcRpcUrl.hostname);
    try {
        ...
    } catch (error) {
        log.trace(`keepKuboUp error (will retry): ${...}`);
    }
    await createOrConnectRpc();
}, 5000);

The await tcpPortUsed.check(...) is outside the try/catch. tcp-port-used rejects with anything other than ECONNREFUSED (e.g., ETIMEDOUT). When the connect attempt experiences a transient network hiccup and the kernel SYN retry exhausts (~127s), the async setInterval callback rejects → unhandledRejection. The daemon's handler then prints the bare error to stderr/docker logs:

Error: connect ETIMEDOUT 127.0.0.1:9138
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1637:16)

Observed on the affected container: 18 occurrences over ~5 hours on 2026-05-04, then it stopped (transient host networking issue). No daemon crash because the runtime tolerates these — but they pollute logs and indicate a real bug.

Fix

Move await tcpPortUsed.check(...) inside the try/catch. Treat its rejection as "port unknown — retry next tick", same as keepKuboUp errors are already handled.

Diagnostics summary (affected daemon)

  • Container: bitsocial, started 2026-04-24T10:20:58Z, PID namespaces verified
  • Log file: /logs/bitsocial/bitsocial_cli_daemon_2026-04-24T10-21-02.253Z.log — 20,022,081 bytes, last write 2026-04-25T14:38:48Z
  • RPC: listening on [::]:9138 (IPv6 dual-stack, also reachable via 127.0.0.1)
  • Process tree: PID 1 tini, PID 7 daemon (still alive), PID 51 kubo, PID 45489 stuck bitsocial logs -f (correctly tailing a frozen file)
  • Healthcheck: passing (uses IPv6-mapped localhost connect)

Plan

  1. Add regression tests (unit-level, exercising the file-logger module in isolation and the keepKuboUp tick body)
  2. Fix Bug 1 (rotation)
  3. Fix Bug 2 (stderr fallback)
  4. Fix Bug 3 (try/catch around tcpPortUsed.check)
  5. Build + run full test suite

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions