Skip to content

fix(ext/node): implement dns Resolver cancel() and timeout support#33580

Merged
bartlomieju merged 9 commits intomainfrom
fix/dns-cancel-and-timeout
Apr 28, 2026
Merged

fix(ext/node): implement dns Resolver cancel() and timeout support#33580
bartlomieju merged 9 commits intomainfrom
fix/dns-cancel-and-timeout

Conversation

@bartlomieju
Copy link
Copy Markdown
Member

Summary

  • Implements ChannelWrap.prototype.cancel() which aborts all pending DNS queries with ECANCELLED error code.
  • Wires up the timeout option in the Resolver constructor to race DNS queries against a timer, returning ETIMEOUT on expiry.
  • Passes timeout to the underlying hickory DNS resolver so queries complete promptly instead of running with hickory's default 5s timeout.
  • Enables 4 Node.js compat tests: parallel/test-dns-cancel-reverse-lookup.js, parallel/test-dns-channel-cancel.js, parallel/test-dns-channel-cancel-promise.js, parallel/test-dns-channel-timeout.js

Test plan

  • ./x test-compat test-dns-channel — all 3 pass
  • ./x test-compat test-dns-cancel-reverse-lookup passes

Implements `ChannelWrap.prototype.cancel()` which aborts all pending DNS
queries with `ECANCELLED` error code. Also wires up the `timeout` option
in the Resolver constructor to race DNS queries against a timer, returning
`ETIMEOUT` on expiry. Passes timeout to the underlying hickory resolver
so the DNS query completes promptly.

Enables 4 more Node.js compat tests:
- parallel/test-dns-cancel-reverse-lookup.js
- parallel/test-dns-channel-cancel.js
- parallel/test-dns-channel-cancel-promise.js
- parallel/test-dns-channel-timeout.js
@bartlomieju
Copy link
Copy Markdown
Member Author

Note: test-dns-cancel-reverse-lookup.js requires getHostByAddr (dns.reverse) from #33579. This PR should land after #33579.

- Handle IPv4-mapped IPv6 addresses (e.g. ::ffff:1.2.3.4) in
  getHostByAddr by converting the dotted-quad tail to two 16-bit
  hex groups before building the ip6.arpa reverse name.
- Replace dead code guard in assert_success_err(0) with debug_assert.
These tests involve DNS queries to non-responsive local UDP servers.
After the JS-level timeout fires, the underlying hickory DNS resolver
keeps the process alive until its internal timeout expires (~15s total).
The Linux CI test runner has a 10s timeout, causing flaky failures.
Disabled on Linux until the underlying op can be properly cancelled.
Copy link
Copy Markdown
Contributor

@fibibot fibibot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The substantive logic looks right — cancel() walks #pendingQueries and dispatches ECANCELLED to each pending req.oncomplete, the per-query add/has-and-delete pattern around every queryX correctly suppresses late-arriving completions after cancel, and the Promise.race-against-setTimeout pattern surfaces ETIMEOUT when the resolver runs long. The Rust-side opts.timeout = Duration::from_millis(ms.max(1)) plumbing into hickory is the right place to actually bound the underlying query. Glad to see the assert_success_err debug_assert! cleanup made it in here too — exactly the simplification I suggested on #33579.

CI is red on lint though (3 platforms, format-only — the opts.timeout = line in ext/net/ops.rs needs to wrap onto a single line). Per my own gate I shouldn't APPROVE while CI is red — going COMMENT until that lands.

Three substantive notes inline (none blocking the format fix; the format itself is the only CI blocker).

Comment thread ext/net/ops.rs Outdated
opts.edns0 = true;
}
if let Some(ms) = timeout_ms {
opts.timeout =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format blocker — rustfmt wants this on one line (current diff has opts.timeout = and Duration::from_millis(...) on separate lines):

opts.timeout = std::time::Duration::from_millis(ms.max(1));

Fails on all three lint platforms: https://github.com/denoland/deno/actions/runs/25009734436/job/73242045807. ./tools/format.js will auto-fix.

Separate substantive note while we're here: forcing opts.attempts = 1 whenever timeout_ms is set diverges from Node's Resolver({ timeout, tries }) contract — Node treats tries and timeout as independent (timeout per attempt, multiplied by tries). With this PR, any caller passing both gets a single attempt regardless of tries. The four target tests don't exercise the tries > 1 + timeout combo so they pass, but a faithful Node port would respect both. Worth a TODO comment here at minimum.

Comment thread ext/node/polyfills/internal_binding/cares_wrap.ts Outdated
Comment thread ext/net/ops.rs Outdated
Replace Promise.race timeout approach with deno_core cancel handles.
When the timeout fires, the cancel handle is closed which aborts the
underlying hickory DNS resolver future immediately, allowing the
process to exit promptly instead of waiting ~15s for hickory's
internal timeout.

Remove the timeout_ms pass-through to hickory resolver opts since
cancellation handles this properly now.

Re-enable all 4 DNS cancel/timeout tests on Linux.
Copy link
Copy Markdown
Contributor

@fibibot fibibot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cancel-handle approach in e4a548fd is much cleaner than my clearTimeout suggestion — using core.createCancelHandle() + core.tryClose(cancelRid) actually aborts the underlying op rather than just dropping the result, and the Deno.errors.Interrupted"ETIMEOUT" mapping in the catch is the right shape. Also nice: dropped the timeout_ms field from the public ResolveDnsOption struct entirely (addressed my soft-API concern), and the EXPECTED_VIOLATIONS bump to 3 accounts for the new core.tryClose calls.

But CI went red — test node_compat debug linux-aarch64 is failing on test-dns-cancel-reverse-lookup.js (flaky on runs 0+1, then 10002ms timeout, then fail). Detail inline.

Also worth checking: the linux: false skip was removed for all four enrolled tests, but the cancel tests don't exercise the timeout cancel-handle path — they call resolver.cancel() (no timeout set), which means cancelRid is undefined and the underlying op runs to hickory's natural timeout. The user-facing callback fires ECANCELLED via the cancel() loop, but the still-running op keeps the test's event loop alive. That's likely what's producing the 10s hang.

cancel() {
notImplemented("cares.ChannelWrap.prototype.cancel");
for (const req of this.#pendingQueries) {
req.oncomplete("ECANCELLED", []);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancel() only walks #pendingQueries and fires req.oncomplete("ECANCELLED", []) synchronously — it doesn't actually abort the underlying op_dns_resolve. For the timeout path the cancel handle does abort the op, but here (no timeout set) the op stays alive until hickory's own timeout (~5s default × attempts) fires.

Failure pattern in CI on linux-aarch64 (https://github.com/denoland/deno/actions/runs/25011070179/job/73247390033):

[Warning] node_compat::parallel::test-dns-cancel-reverse-lookup.js was flaky on run 0
[Warning] ... was flaky on run 1
test test-dns-cancel-reverse-lookup.js has been running for more than 60 seconds
test test-dns-cancel-reverse-lookup.js ... fail (10002ms)

The user-facing callback fires ECANCELLED immediately, but the orphaned op_dns_resolve keeps the test's event loop alive until hickory times out. The 10002ms pretty clearly aligns with hickory's default 5s × 2 attempts.

Fix: extend the cancel-handle pattern to all queries, not just the timeout path. Roughly — give every #query call a cancelRid up front, store it on the req (or in a parallel map), and have cancel() core.tryClose each one in addition to (or instead of) firing oncomplete. Something like:

#cancelRids: Map<QueryReqWrap, number> = new Map();

// in #query, always create a cancelRid:
const cancelRid = core.createCancelHandle();
this.#cancelRids.set(req, cancelRid);  // populated by caller

// cancel():
for (const [req, rid] of this.#cancelRids) {
  core.tryClose(rid);
  req.oncomplete("ECANCELLED", []);
}
this.#cancelRids.clear();
this.#pendingQueries.clear();

The op should then return Deno.errors.Interrupted and the existing catch in #query will short-circuit, so the test's event loop can drain.

(There may be additional issues if req.oncomplete("ECANCELLED", []) doesn't get mapped through dnsException to set .code/.syscall/.hostname correctly in the test's assertion — worth verifying that string code passes through the existing dispatch, since dnsException is typically built around integer error codes.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already addressed in the current code -- cancel() now walks #cancelRids and calls core.tryClose(rid) on each one (lines 748-751), in addition to firing oncomplete("ECANCELLED", []) on pending queries. Every #query call creates a cancel handle via core.createCancelHandle() (line 305), stores it in #cancelRids (line 306), and cleans it up on completion (lines 312, 339). So orphaned op_dns_resolve calls are properly aborted and the event loop can drain.

"parallel/test-dns-cancel-reverse-lookup.js": {},
"parallel/test-dns-channel-cancel.js": {},
"parallel/test-dns-channel-cancel-promise.js": {},
"parallel/test-dns-channel-timeout.js": {},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing linux: false from these four entries is the right end-state, but at least test-dns-cancel-reverse-lookup.js is still failing on linux-aarch64 in this commit (10s hang → fail per the inline above). Either restore the platform skip for that one entry temporarily, or block this PR on the underlying fix. The other three (channel-cancel, channel-cancel-promise, channel-timeout) appear to pass — channel-timeout shows up green in the test log (test-dns-channel-timeout.js ... ok (141ms)).

Copy link
Copy Markdown
Contributor

@lunadogbot lunadogbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- read the cancel/timeout flow end-to-end:

  • #pendingQueries: Set<QueryReqWrap> is added/checked-then-deleted at every query* entrypoint and inside every .then(...). The cancel() iterates and calls req.oncomplete('ECANCELLED', []), then clear()s. The double-fire race is handled correctly: a late-resolving promise hits if (!this.#pendingQueries.has(req)) return after cancel cleared the set, so the original oncomplete is skipped.
  • The timeout path sets up a core.createCancelHandle() and a setTimeout that calls core.tryClose(cancelRid). The op throws Deno.errors.Interrupted, the catch maps it to code = 'ETIMEOUT'. The finally clears the timer and closes the rid in either order (timer-fired-first vs op-completed-first), and the cancelRid !== undefined guard prevents a double-close.
  • The new code type is string | number. Verified against dnsException in internal/errors.ts:376 -- it explicitly handles both: if typeof code === 'number' it does getSystemErrorName, otherwise the string code passes through to ex.code. So 'ECANCELLED' and 'ETIMEOUT' produce the right shape on the user-facing error.
  • The IPv4-mapped IPv6 expansion at line 678+ for reverse-lookup correctly turns ::ffff:1.2.3.4 into the two 16-bit hex groups.

One tiny note: in ext/node/ops/dns.rs, swapping if code == 0 { return DnsError::Io(...) } for debug_assert!(code != 0, ...) is a strict improvement in debug builds, but in release builds the unreachable case becomes a silent fall-through to whatever match code returns next (probably an arm that doesn't match -> generic UNKNOWN). Worth either a comment confirming the caller really can't pass 0 (the function name assert_success_err suggests that's the contract), or unreachable!() instead of debug_assert!() so release builds at least panic with a clear message rather than producing a misleading mapping. Not blocking.

- Track cancel handles (cancelRids) for each DNS query so cancel() can
  abort in-flight op_dns_resolve operations via core.tryClose
- Use a short hickory resolver timeout (1s, 1 attempt) when a cancel
  handle is present, so background connection tasks clean up quickly
  after cancellation instead of hanging for the default 5s×2 attempts
@bartlomieju bartlomieju enabled auto-merge (squash) April 28, 2026 09:20
@bartlomieju bartlomieju merged commit d24270c into main Apr 28, 2026
136 checks passed
@bartlomieju bartlomieju deleted the fix/dns-cancel-and-timeout branch April 28, 2026 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants