
Two latent bugs in daemon startup surfaced while debugging #46 (#48)

@Rinse12


While investigating the macOS CI failure on PR #47 (root-caused as a stderr-vs-exit race in `_spawnAsync`, fixed there), I bumped into two larger latent bugs in the daemon startup path. They were dormant on master because the race that exposed them was previously off the critical path, but they will bite again the next time something throws during kubo startup. Filing here so they don't get lost.

1. `startKuboNode` uses `new Promise(async (resolve, reject) => { ... })`

`src/ipfs/startIpfs.ts:207`.

The `Promise` constructor settles its promise only through the `resolve`/`reject` callbacks it hands to the executor (or a synchronous throw from the executor). An `async` executor never throws synchronously: any throw or rejected `await` inside it becomes a rejection of the async function's own return promise, which the constructor discards and nobody awaits, so the outer promise stays pending forever. As a result, every uncaught throw inside `startKuboNode` (the inner `throw new Error("Failed to call ipfs init" + …)`, the migration error rethrow, anything else added later) surfaces as an `unhandledRejection` instead of a real error from `await keepKuboUp()`. That is why PR #47's symptom was an `[unhandledRejection]` log plus a silent exit code 0 instead of a visible startup failure.

Fix: rewrite `startKuboNode` as a plain `async` function (or use a deferred-then-wire pattern). Then throws propagate cleanly to the outer awaiter.
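A minimal sketch of both options combined, assuming startup can be modeled as `init`, `migrate`, then waiting for the spawned daemon to signal readiness. `KuboSteps`, `createDeferred`, `waitForReady`, and the `"ready"`/`"error"`/`"exit"` event names are hypothetical stand-ins, not the real `startIpfs.ts` code. The only code inside a `Promise` constructor is the deferred's resolve/reject capture; everything that can throw lives in a plain `async` function.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-ins for the real steps in src/ipfs/startIpfs.ts.
interface KuboSteps {
  init: () => Promise<void>;
  migrate: () => Promise<void>;
  // Hypothetical: returns an emitter of "ready", "error" and "exit" (e.g. a wrapper around the kubo child).
  spawnDaemon: () => EventEmitter;
}

// Deferred: the executor only captures resolve/reject, so no application code can throw inside it.
function createDeferred<T>() {
  let resolve!: (value: T) => void;
  let reject!: (reason: unknown) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Wire events to the deferred: settle exactly once on ready, error, or early exit, then detach.
function waitForReady(source: EventEmitter): Promise<void> {
  const deferred = createDeferred<void>();
  const cleanup = () => {
    source.off("ready", onReady);
    source.off("error", onError);
    source.off("exit", onExit);
  };
  const onReady = () => {
    cleanup();
    deferred.resolve();
  };
  const onError = (err: unknown) => {
    cleanup();
    deferred.reject(err);
  };
  const onExit = (code: number | null) => {
    cleanup();
    deferred.reject(new Error(`kubo exited before it was ready (code ${code})`));
  };
  source.on("ready", onReady);
  source.on("error", onError);
  source.on("exit", onExit);
  return deferred.promise;
}

// Plain async function: a throw or rejected await anywhere rejects the returned promise,
// so the caller's `await` sees the real error instead of hanging.
async function startKuboNode(steps: KuboSteps): Promise<void> {
  await steps.init();
  await steps.migrate();
  await waitForReady(steps.spawnDaemon());
}
```

With this shape, a failing `init` rejects `startKuboNode` directly, and a daemon that exits before signalling readiness rejects with a descriptive error rather than leaving the caller pending.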

Verified the antipattern on Node v22.22.0:
```
$ node -e 'process.on("unhandledRejection", e => console.error("[u]", e.message));
new Promise(async () => { throw new Error("x"); })
.then(() => console.log("resolved")).catch(e => console.log("caught:", e.message));
setTimeout(() => console.log("still pending"), 200);'
[u] x
still pending
```

2. `daemon.ts` `run()` sets up the keepalive interval and exit hook after the initial `keepKuboUp`/`createOrConnectRpc` awaits

In `src/cli/commands/daemon.ts`, `setInterval(... runKeepKuboUpTick ...)` and `asyncExitHook(...)` are registered after the initial `await keepKuboUp()` / `await createOrConnectRpc()`. A pending promise does not keep Node's event loop alive, so if those initial awaits hang (e.g. via bug 1 above), nothing is holding the process open yet and Node exits cleanly with code 0. The user sees a "daemon" process that started fine, printed PKC options, then vanished without an error.

Fix: either move the keepalive/exit-hook registration before the initial awaits, or restructure `run()` so it only returns once the daemon is fully wired and at least one error path can surface the failure.
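A minimal sketch of the first option, assuming the daemon's dependencies can be injected. `DaemonDeps`, `registerExitHook` (standing in for `asyncExitHook`), the `startupDone` guard, and `tickMs` are hypothetical; the real `run()` may be shaped differently. The interval and exit hook are registered before any `await`, and a startup failure clears the interval and rethrows so the CLI can exit non-zero instead of hanging or exiting 0.

```typescript
// Hypothetical stand-ins for keepKuboUp(), createOrConnectRpc(), runKeepKuboUpTick
// and asyncExitHook in src/cli/commands/daemon.ts.
interface DaemonDeps {
  keepKuboUp: () => Promise<void>;
  createOrConnectRpc: () => Promise<void>;
  runKeepKuboUpTick: () => Promise<void>;
  registerExitHook: (onExit: () => Promise<void>) => void;
  tickMs: number;
}

// Returns a stop function that clears the keepalive interval.
async function run(deps: DaemonDeps): Promise<() => void> {
  let startupDone = false;

  // Registered first: the active interval keeps the event loop alive, so Node does not
  // exit with code 0 while the initial awaits below are still in flight.
  const interval = setInterval(() => {
    if (!startupDone) return; // hypothetical guard: don't race the initial keepKuboUp()
    deps.runKeepKuboUpTick().catch((err) => console.error("keepKuboUp tick failed:", err));
  }, deps.tickMs);
  const stop = () => clearInterval(interval);

  // Registered before any await, so shutdown cleanup runs even if startup is interrupted.
  deps.registerExitHook(async () => stop());

  try {
    await deps.keepKuboUp();
    await deps.createOrConnectRpc();
  } catch (err) {
    // Surface the startup failure instead of leaving a half-wired daemon running.
    stop();
    throw err;
  }

  startupDone = true;
  return stop;
}
```

Registering the interval first trades a silent exit for a process that stays up while startup is stalled; that trade only pays off together with the fix for bug 1, which turns a startup throw into a rejection of `run()` that the `catch` above can surface.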

Severity / urgency

Neither bug is triggered on master right now, but both made PR #47 much harder to debug than it should have been. Fixing bug 1 first would automatically improve the debugging experience for any future failure in this area.
