fuzz: improve coverage with new serializers target and stronger invariant checks#1104
fuzz: improve coverage with new serializers target and stronger invariant checks#1104
Conversation
…iant checks Add a new `fuzz/serializers.cc` target that directly exercises code paths that were only reachable indirectly before: - `ada::serializers::ipv4()` and `ada::serializers::ipv6()` with arbitrary uint64_t / uint16_t[8] inputs, including a round-trip invariant check for `find_longest_sequence_of_ipv6_pieces`. - `ada::checkers::try_parse_ipv4_fast()` with a serialisation round-trip assertion (parse → serialise → re-parse must give the same address). - All six `ada::unicode::percent_encode()` character-set variants, plus the template `<false>`/`<true>` forms, with a decode round-trip check. - Integration tests that embed the synthesised addresses in real URLs and verify href idempotency. Strengthen existing targets with new invariant-based checks: `parse.cc` - FDP-controlled sequential setter testing: the fuzzer now picks an arbitrary sequence of setter/value pairs and asserts that `ada::url` and `ada::url_aggregator` stay in sync (same href) after every step. - Re-parse idempotency: `parse(href).get_href() == href` is asserted for every successfully parsed URL. - `url_search_params` round-trip via URL: parses the fuzz input as a query string, mutates via search-params, and verifies serialisation idempotency before setting it back on the URL. `can_parse.cc` - Adds href round-trip: if `parse(source)` succeeds, `can_parse(href)` must be true and `parse(href).get_href()` must equal `href`. `url_search_params.cc` - Adds serialisation idempotency: `url_search_params(sp.to_string()).to_string() == sp.to_string()` is checked for both constructed and mutated params. - Adds URL round-trip integration test. `idna.cc` - Adds `to_ascii` stability: `to_ascii(to_ascii(x)) == to_ascii(x)`. - Adds `to_unicode` stability. - Adds long-domain (≤270 char) inputs to exercise DNS-length boundary checks. - Adds Punycode decode→encode→decode round-trip with equality assertion. `url.dict` - Adds hex/octal IPv4 forms, IPv4-mapped IPv6, Windows/UNC file paths, data URIs with content types, tricky percent sequences (overlong UTF-8, double-encoding, C0 controls), opaque-host URLs, and scheme case variants. https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
…rage
url_pattern.cc
- Add exec/test consistency invariant: exec() finds a match iff test()
returns true, checked for both no-base and with-base variants, and for
url_pattern_input (init-based) inputs. An abort is triggered on
any disagreement.
- Introduce exercise_match_result() helper that also walks the previously
untested url_pattern_result::inputs vector.
- Add testing for url_pattern_init::base_url, a field that was previously
never exercised by any fuzzer (tested with a fixed and a fuzzed base).
- Add random bitmask field selection so every subset (0..255) of the eight
url_pattern_init component fields is reachable, covering combinations the
previous single-field and all-fields cases missed.
ada_c.c
- Add C-API re-parse idempotency check: after all setter mutations the URL's
final href is re-parsed via ada_parse() and compared byte-for-byte; an
abort is triggered if parsing fails or produces a different href.
serializers.cc
- Add direct coverage of internal utility functions unreachable through the
URL parsing pipeline alone:
- ada::unicode::has_tabs_or_newline()
- ada::checkers::is_ipv4() with cross-check: if is_ipv4 reports true,
embedding the string as an http:// host must be accepted by the parser
- ada::checkers::path_signature()
- ada::checkers::is_windows_drive_letter() and
is_normalized_windows_drive_letter() with a consistency assertion
(normalised ⊆ drive-letter)
- ada::unicode::to_lower_ascii() (in-place, length invariant asserted)
- ada::unicode::contains_forbidden_domain_code_point()
- ada::unicode::contains_forbidden_domain_code_point_or_upper()
- ada::checkers::verify_dns_length()
https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (17.94%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #1104 +/- ##
==========================================
- Coverage 59.86% 59.60% -0.27%
==========================================
Files 37 37
Lines 5811 5815 +4
Branches 2846 2848 +2
==========================================
- Hits 3479 3466 -13
- Misses 537 569 +32
+ Partials 1795 1780 -15 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
tl::expected<T, E> cannot be compared to std::nullopt_t; use .has_value() instead. This was caught by the OSS-Fuzz CI build. https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
…ents - Run clang-format --style=Google on all fuzz files to fix CI violations - Add fuzz/unicode.cc: new fuzzer for 15+ low-level unicode/checker/helper utilities (per-char classification, string helpers, prune_hash, trim, remove_ascii_tab_or_newline, delimiter finders, to_ascii internal API) with consistency invariants (forbidden_host⊆forbidden_domain, ascii_digit⊆ascii_hex_digit, lowercase_hex⊆ascii_hex_digit, etc.) - Add fuzz/unicode.options pointing at url.dict - Add unicode.cc compilation to fuzz/build.sh - fuzz/parse.cc: assert clear_port/clear_search/clear_hash postconditions - fuzz/url_search_params.cc: assert has(k) and get(k) consistency during range-for iteration; verify has(k,v)→has(k) invariant - fuzz/ada_c.c: validate ada_get_scheme_type ∈ [0,6], ada_get_host_type ∈ [0,2], and all ada_url_components offsets ≤ href.length when not omitted https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
ada_get_components() returns const ada_url_components*, not a value. Use pointer variable + -> dereference for comps, and keep . access for the ada_string and ada_version_components value types. https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
ada_url_components.port stores the port NUMBER (0-65535), not an offset into the href string. Checking port > href.length gives false positives for any URL whose port number exceeds the href length (e.g. http://example.com:8080/). Remove the erroneous port bounds check. The other offset fields (protocol_end, username_end, host_start, host_end, pathname_start, search_start, hash_start) are genuine byte offsets into the href and their bounds checks are correct. https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Both functions had an early-exit check `input.starts_with("xn--")` that
was intended to prevent double-encoded ACE labels (a decoded Punycode
label that itself starts with "xn--"). However the check was incorrect:
Punycode always outputs all ASCII (basic) code points first in the
encoded stream, regardless of their position in the original label. A
mixed label like "CJK-chars xn--" encodes to content starting with
"xn--" even though the decoded label starts with a non-ASCII character.
Consequence: ada accepted such URLs (e.g. http://[cjk]xn--./), produced
a normalized href with the host ACE-encoded as "xn--xn---...", then
failed to re-parse that href -- a self-consistency violation.
Fix: remove the premature starts_with check from punycode_to_utf32 and
instead add a post-decode check that rejects labels whose decoded form
actually starts with U+0078 U+006E U+002D U+002D ("xn--"). This
correctly handles the dangerous double-encoding case while allowing
valid mixed labels.
verify_punycode is simplified to delegate to punycode_to_utf32, which
carries the correct check, rather than duplicating the logic with the
same bug.
Fixes round-trip idempotency: ada_parse(href) now succeeds when href was
produced by ada itself, for all tested inputs.
https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Merging this PR will not alter performance
Comparing Footnotes
|
The previous fix for the nested-ACE-label security check (whatwg/url#803) made verify_punycode delegate to punycode_to_utf32, which allocates a std::u32string. This regressed percent-encode / IDNA performance because: 1. verify_punycode now heap-allocates on every call. 2. The to_unicode call site called verify_punycode then immediately called punycode_to_utf32 on the same input -- two full decodes. Fix both issues: * Restore verify_punycode to be fully allocation-free: the new implementation mirrors punycode_to_utf32's decode loop but only tracks the first 4 code points (via a stack-allocated uint32_t[4]), which is all that is needed to check the "xn--" prefix invariant. * Collapse the nested verify_punycode + punycode_to_utf32 calls at the to_unicode call site into a single punycode_to_utf32 call. Both outer/inner else-branches did the same thing (append original label), so the two levels reduce to one. Also add a regression test that exercises the mixed-label case that triggered the original bug (a label like U+33FF U+33FD xn-- that encodes to Punycode content starting with "xn--" even though the decoded label starts with a non-ASCII code point). https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Both url::parse_host and url_aggregator::parse_host have fast paths for pure decimal IPv4 and simple ASCII domains that returned true without setting is_valid=true. If a prior setter call left is_valid=false (e.g. set_host with a forbidden code point), a subsequent set_host taking the fast path would leave is_valid=false, causing parse_port to skip updating the port silently — diverging url and url_aggregator state and triggering the sequential-setter href-mismatch abort in fuzz/parse.cc:534. Fix by explicitly setting is_valid=true before returning from each successful fast path in both implementations. https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Adds set_host_fast_path_restores_is_valid typed test covering the bug where a failed set_host (leaving is_valid=false) followed by a successful set_host taking the fast path would leave is_valid=false, causing parse_port to silently skip updating the port. https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef
Applies relevant changes from ada-url/ada#1104: - punycode_to_utf32: remove premature starts_with("xn--") early-exit that incorrectly rejected valid punycode whose payload begins with "xn--" (e.g. a mixed label like U+33FF U+33FD followed by ASCII "xn--"). Replace with a post-decode check that rejects labels whose *decoded* form starts with U'x' U'n' U'-' U'-', which correctly identifies double-encoded ACE. - verify_punycode: rewrite to be allocation-free by tracking only the first 4 decoded code points using a stack array, simulating the insert-at-position operation just for those slots. This avoids heap allocation while preserving the correctness of the double-ACE detection. - to_unicode: collapse the redundant verify_punycode + punycode_to_utf32 calls into a single punycode_to_utf32 call, eliminating the double-decode overhead. Adds a regression test covering the mixed-label case and verifying to_ascii idempotency. Fixes: whatwg/url#803 https://claude.ai/code/session_01C5dm8PKRZ7y7a8aTp53zBQ
Add a new
fuzz/serializers.cctarget that directly exercises code pathsthat were only reachable indirectly before:
ada::serializers::ipv4()andada::serializers::ipv6()with arbitraryuint64_t / uint16_t[8] inputs, including a round-trip invariant check for
find_longest_sequence_of_ipv6_pieces.ada::checkers::try_parse_ipv4_fast()with a serialisation round-tripassertion (parse → serialise → re-parse must give the same address).
ada::unicode::percent_encode()character-set variants, plus thetemplate
<false>/<true>forms, with a decode round-trip check.verify href idempotency.
Strengthen existing targets with new invariant-based checks:
parse.ccarbitrary sequence of setter/value pairs and asserts that
ada::urlandada::url_aggregatorstay in sync (same href) after every step.parse(href).get_href() == hrefis asserted forevery successfully parsed URL.
url_search_paramsround-trip via URL: parses the fuzz input as a querystring, mutates via search-params, and verifies serialisation idempotency
before setting it back on the URL.
can_parse.ccparse(source)succeeds,can_parse(href)mustbe true and
parse(href).get_href()must equalhref.url_search_params.ccurl_search_params(sp.to_string()).to_string() == sp.to_string()is checked for both constructed and mutated params.idna.ccto_asciistability:to_ascii(to_ascii(x)) == to_ascii(x).to_unicodestability.url.dictdata URIs with content types, tricky percent sequences (overlong UTF-8,
double-encoding, C0 controls), opaque-host URLs, and scheme case variants.
https://claude.ai/code/session_018zxmHQjAxYzQMdPzWzZjef