Skip to content

fuzz: improve coverage with new serializers target and stronger invariant checks#1104

Merged
anonrig merged 11 commits intomainfrom
claude/improve-fuzzing-coverage-qRMnO
Mar 28, 2026
Merged

fuzz: improve coverage with new serializers target and stronger invariant checks#1104
anonrig merged 11 commits intomainfrom
claude/improve-fuzzing-coverage-qRMnO

Conversation

@anonrig
Copy link
Copy Markdown
Member

@anonrig anonrig commented Mar 28, 2026

Add a new fuzz/serializers.cc target that directly exercises code paths
that were only reachable indirectly before:

  • ada::serializers::ipv4() and ada::serializers::ipv6() with arbitrary
    uint64_t / uint16_t[8] inputs, including a round-trip invariant check for
    find_longest_sequence_of_ipv6_pieces.
  • ada::checkers::try_parse_ipv4_fast() with a serialisation round-trip
    assertion (parse → serialise → re-parse must give the same address).
  • All six ada::unicode::percent_encode() character-set variants, plus the
    template <false>/<true> forms, with a decode round-trip check.
  • Integration tests that embed the synthesised addresses in real URLs and
    verify href idempotency.

Strengthen existing targets with new invariant-based checks:

parse.cc

  • FDP-controlled sequential setter testing: the fuzzer now picks an
    arbitrary sequence of setter/value pairs and asserts that ada::url and
    ada::url_aggregator stay in sync (same href) after every step.
  • Re-parse idempotency: parse(href).get_href() == href is asserted for
    every successfully parsed URL.
  • url_search_params round-trip via URL: parses the fuzz input as a query
    string, mutates via search-params, and verifies serialisation idempotency
    before setting it back on the URL.

can_parse.cc

  • Adds href round-trip: if parse(source) succeeds, can_parse(href) must
    be true and parse(href).get_href() must equal href.

url_search_params.cc

  • Adds serialisation idempotency: url_search_params(sp.to_string()).to_string() == sp.to_string() is checked for both constructed and mutated params.
  • Adds URL round-trip integration test.

idna.cc

  • Adds to_ascii stability: to_ascii(to_ascii(x)) == to_ascii(x).
  • Adds to_unicode stability.
  • Adds long-domain (≤270 char) inputs to exercise DNS-length boundary checks.
  • Adds Punycode decode→encode→decode round-trip with equality assertion.

url.dict

  • Adds hex/octal IPv4 forms, IPv4-mapped IPv6, Windows/UNC file paths,
    data URIs with content types, tricky percent sequences (overlong UTF-8,
    double-encoding, C0 controls), opaque-host URLs, and scheme case variants.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef

claude added 2 commits March 28, 2026 00:50
…iant checks

Add a new `fuzz/serializers.cc` target that directly exercises code paths
that were only reachable indirectly before:
- `ada::serializers::ipv4()` and `ada::serializers::ipv6()` with arbitrary
  uint64_t / uint16_t[8] inputs, including a round-trip invariant check for
  `find_longest_sequence_of_ipv6_pieces`.
- `ada::checkers::try_parse_ipv4_fast()` with a serialisation round-trip
  assertion (parse → serialise → re-parse must give the same address).
- All six `ada::unicode::percent_encode()` character-set variants, plus the
  template `<false>`/`<true>` forms, with a decode round-trip check.
- Integration tests that embed the synthesised addresses in real URLs and
  verify href idempotency.

Strengthen existing targets with new invariant-based checks:

`parse.cc`
- FDP-controlled sequential setter testing: the fuzzer now picks an
  arbitrary sequence of setter/value pairs and asserts that `ada::url` and
  `ada::url_aggregator` stay in sync (same href) after every step.
- Re-parse idempotency: `parse(href).get_href() == href` is asserted for
  every successfully parsed URL.
- `url_search_params` round-trip via URL: parses the fuzz input as a query
  string, mutates via search-params, and verifies serialisation idempotency
  before setting it back on the URL.

`can_parse.cc`
- Adds href round-trip: if `parse(source)` succeeds, `can_parse(href)` must
  be true and `parse(href).get_href()` must equal `href`.

`url_search_params.cc`
- Adds serialisation idempotency: `url_search_params(sp.to_string()).to_string()
  == sp.to_string()` is checked for both constructed and mutated params.
- Adds URL round-trip integration test.

`idna.cc`
- Adds `to_ascii` stability: `to_ascii(to_ascii(x)) == to_ascii(x)`.
- Adds `to_unicode` stability.
- Adds long-domain (≤270 char) inputs to exercise DNS-length boundary checks.
- Adds Punycode decode→encode→decode round-trip with equality assertion.

`url.dict`
- Adds hex/octal IPv4 forms, IPv4-mapped IPv6, Windows/UNC file paths,
  data URIs with content types, tricky percent sequences (overlong UTF-8,
  double-encoding, C0 controls), opaque-host URLs, and scheme case variants.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
…rage

url_pattern.cc
- Add exec/test consistency invariant: exec() finds a match iff test()
  returns true, checked for both no-base and with-base variants, and for
  url_pattern_input (init-based) inputs. An abort is triggered on
  any disagreement.
- Introduce exercise_match_result() helper that also walks the previously
  untested url_pattern_result::inputs vector.
- Add testing for url_pattern_init::base_url, a field that was previously
  never exercised by any fuzzer (tested with a fixed and a fuzzed base).
- Add random bitmask field selection so every subset (0..255) of the eight
  url_pattern_init component fields is reachable, covering combinations the
  previous single-field and all-fields cases missed.

ada_c.c
- Add C-API re-parse idempotency check: after all setter mutations the URL's
  final href is re-parsed via ada_parse() and compared byte-for-byte; an
  abort is triggered if parsing fails or produces a different href.

serializers.cc
- Add direct coverage of internal utility functions unreachable through the
  URL parsing pipeline alone:
  - ada::unicode::has_tabs_or_newline()
  - ada::checkers::is_ipv4() with cross-check: if is_ipv4 reports true,
    embedding the string as an http:// host must be accepted by the parser
  - ada::checkers::path_signature()
  - ada::checkers::is_windows_drive_letter() and
    is_normalized_windows_drive_letter() with a consistency assertion
    (normalised ⊆ drive-letter)
  - ada::unicode::to_lower_ascii() (in-place, length invariant asserted)
  - ada::unicode::contains_forbidden_domain_code_point()
  - ada::unicode::contains_forbidden_domain_code_point_or_upper()
  - ada::checkers::verify_dns_length()

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 17.94872% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.60%. Comparing base (3b9bacb) to head (a616e41).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/ada_idna.cpp 8.57% 26 Missing and 6 partials ⚠️

❌ Your patch status has failed because the patch coverage (17.94%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
- Coverage   59.86%   59.60%   -0.27%     
==========================================
  Files          37       37              
  Lines        5811     5815       +4     
  Branches     2846     2848       +2     
==========================================
- Hits         3479     3466      -13     
- Misses        537      569      +32     
+ Partials     1795     1780      -15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude added 5 commits March 28, 2026 01:37
tl::expected<T, E> cannot be compared to std::nullopt_t; use .has_value()
instead. This was caught by the OSS-Fuzz CI build.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
…ents

- Run clang-format --style=Google on all fuzz files to fix CI violations
- Add fuzz/unicode.cc: new fuzzer for 15+ low-level unicode/checker/helper
  utilities (per-char classification, string helpers, prune_hash, trim,
  remove_ascii_tab_or_newline, delimiter finders, to_ascii internal API)
  with consistency invariants (forbidden_host⊆forbidden_domain,
  ascii_digit⊆ascii_hex_digit, lowercase_hex⊆ascii_hex_digit, etc.)
- Add fuzz/unicode.options pointing at url.dict
- Add unicode.cc compilation to fuzz/build.sh
- fuzz/parse.cc: assert clear_port/clear_search/clear_hash postconditions
- fuzz/url_search_params.cc: assert has(k) and get(k) consistency during
  range-for iteration; verify has(k,v)→has(k) invariant
- fuzz/ada_c.c: validate ada_get_scheme_type ∈ [0,6],
  ada_get_host_type ∈ [0,2], and all ada_url_components offsets ≤
  href.length when not omitted

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
ada_get_components() returns const ada_url_components*, not a value.
Use pointer variable + -> dereference for comps, and keep . access
for the ada_string and ada_version_components value types.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
ada_url_components.port stores the port NUMBER (0-65535), not an
offset into the href string. Checking port > href.length gives false
positives for any URL whose port number exceeds the href length (e.g.
http://example.com:8080/). Remove the erroneous port bounds check.

The other offset fields (protocol_end, username_end, host_start,
host_end, pathname_start, search_start, hash_start) are genuine byte
offsets into the href and their bounds checks are correct.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Both functions had an early-exit check `input.starts_with("xn--")` that
was intended to prevent double-encoded ACE labels (a decoded Punycode
label that itself starts with "xn--"). However the check was incorrect:
Punycode always outputs all ASCII (basic) code points first in the
encoded stream, regardless of their position in the original label. A
mixed label like "CJK-chars xn--" encodes to content starting with
"xn--" even though the decoded label starts with a non-ASCII character.

Consequence: ada accepted such URLs (e.g. http://[cjk]xn--./), produced
a normalized href with the host ACE-encoded as "xn--xn---...", then
failed to re-parse that href -- a self-consistency violation.

Fix: remove the premature starts_with check from punycode_to_utf32 and
instead add a post-decode check that rejects labels whose decoded form
actually starts with U+0078 U+006E U+002D U+002D ("xn--"). This
correctly handles the dangerous double-encoding case while allowing
valid mixed labels.

verify_punycode is simplified to delegate to punycode_to_utf32, which
carries the correct check, rather than duplicating the logic with the
same bug.

Fixes round-trip idempotency: ada_parse(href) now succeeds when href was
produced by ada itself, for all tested inputs.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Mar 28, 2026

Merging this PR will not alter performance

✅ 27 untouched benchmarks
⏩ 4 skipped benchmarks1


Comparing claude/improve-fuzzing-coverage-qRMnO (a616e41) with main (3b9bacb)

Open in CodSpeed

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

claude added 4 commits March 28, 2026 13:11
The previous fix for the nested-ACE-label security check
(whatwg/url#803) made verify_punycode
delegate to punycode_to_utf32, which allocates a std::u32string.
This regressed percent-encode / IDNA performance because:

  1. verify_punycode now heap-allocates on every call.
  2. The to_unicode call site called verify_punycode then immediately
     called punycode_to_utf32 on the same input -- two full decodes.

Fix both issues:

* Restore verify_punycode to be fully allocation-free: the new
  implementation mirrors punycode_to_utf32's decode loop but only
  tracks the first 4 code points (via a stack-allocated uint32_t[4]),
  which is all that is needed to check the "xn--" prefix invariant.

* Collapse the nested verify_punycode + punycode_to_utf32 calls at the
  to_unicode call site into a single punycode_to_utf32 call.  Both
  outer/inner else-branches did the same thing (append original label),
  so the two levels reduce to one.

Also add a regression test that exercises the mixed-label case that
triggered the original bug (a label like U+33FF U+33FD xn-- that
encodes to Punycode content starting with "xn--" even though the
decoded label starts with a non-ASCII code point).

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Both url::parse_host and url_aggregator::parse_host have fast paths
for pure decimal IPv4 and simple ASCII domains that returned true
without setting is_valid=true. If a prior setter call left is_valid=false
(e.g. set_host with a forbidden code point), a subsequent set_host
taking the fast path would leave is_valid=false, causing parse_port
to skip updating the port silently — diverging url and url_aggregator
state and triggering the sequential-setter href-mismatch abort in
fuzz/parse.cc:534.

Fix by explicitly setting is_valid=true before returning from each
successful fast path in both implementations.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Adds set_host_fast_path_restores_is_valid typed test covering the bug
where a failed set_host (leaving is_valid=false) followed by a
successful set_host taking the fast path would leave is_valid=false,
causing parse_port to silently skip updating the port.

https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
@anonrig anonrig merged commit 5737046 into main Mar 28, 2026
54 of 55 checks passed
@anonrig anonrig deleted the claude/improve-fuzzing-coverage-qRMnO branch March 28, 2026 23:10
anonrig pushed a commit to ada-url/idna that referenced this pull request Mar 28, 2026
Applies relevant changes from ada-url/ada#1104:

- punycode_to_utf32: remove premature starts_with("xn--") early-exit that
  incorrectly rejected valid punycode whose payload begins with "xn--"
  (e.g. a mixed label like U+33FF U+33FD followed by ASCII "xn--"). Replace
  with a post-decode check that rejects labels whose *decoded* form starts
  with U'x' U'n' U'-' U'-', which correctly identifies double-encoded ACE.

- verify_punycode: rewrite to be allocation-free by tracking only the first
  4 decoded code points using a stack array, simulating the insert-at-position
  operation just for those slots. This avoids heap allocation while preserving
  the correctness of the double-ACE detection.

- to_unicode: collapse the redundant verify_punycode + punycode_to_utf32
  calls into a single punycode_to_utf32 call, eliminating the double-decode
  overhead.

Adds a regression test covering the mixed-label case and verifying
to_ascii idempotency.

Fixes: whatwg/url#803

https://claude.ai/code/session_01C5dm8PKRZ7y7a8aTp53zBQ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants