
fix!: normalize_url preserves path, query, and fragment case #2021

Closed
anxkhn wants to merge 1 commit into apify:master from anxkhn:fix/normalize-url-preserve-path-case

Conversation

anxkhn (Contributor) commented Jul 5, 2026

normalize_url lower-cased the entire URL, including the path and query,
even though only the scheme and host are case-insensitive (RFC 3986).

  • In normalize_url (src/crawlee/_utils/requests.py), the final line returned
    str(yarl_new_url).lower(), so the path, query, and fragment were folded to
    lower case along with the scheme and host.
  • compute_unique_key uses the normalized URL as the default unique_key, so any
    two URLs that differ only in the case of their path or query (for example
    /Product/ABC vs /product/abc, or ?token=SeCrEt vs ?token=secret) produced
    the same key. They were treated as duplicates and silently deduplicated, dropping
    case-distinct pages with no log or statistic. This also contradicted the
    function's own docstring, which only promised scheme/host lower-casing.
  • Fix: drop the trailing whole-string .lower(). yarl already lower-cases the
    scheme and host when it parses the URL while preserving the case of the path,
    query, and fragment, so removing the extra .lower() yields RFC 3986 compliant
    normalization and matches the behavior of normalizeUrl in Crawlee for
    JavaScript. The docstring and a comment were corrected to match.
-    return str(yarl_new_url).lower()
+    return str(yarl_new_url)
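
The collision described above can be reproduced with a minimal standard-library sketch. It does not use yarl or Crawlee's actual normalize_url; both helper names are hypothetical stand-ins, and the second one assumes URLs without a userinfo part (lower-casing the whole netloc would otherwise also fold a username or password).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_whole_lower(url: str) -> str:
    """Stand-in for the old behavior: fold the entire URL to lower case."""
    return url.lower()


def normalize_scheme_host_only(url: str) -> str:
    """Stand-in for the fixed behavior: lower-case only the scheme and host.

    Assumption: the URL carries no userinfo, so lower-casing the netloc
    only affects the host (and leaves any port digits unchanged).
    """
    parts = urlsplit(url)
    return urlunsplit(
        (
            parts.scheme.lower(),
            parts.netloc.lower(),
            parts.path,      # case preserved
            parts.query,     # case preserved
            parts.fragment,  # case preserved
        )
    )


url_a = "HTTPS://Example.COM/Product/ABC?token=SeCrEt"
url_b = "https://example.com/product/abc?token=secret"

# Old behavior: both URLs fold to the same string, so they would share a key.
old_collides = normalize_whole_lower(url_a) == normalize_whole_lower(url_b)

# Fixed behavior: the two URLs stay distinct.
new_collides = normalize_scheme_host_only(url_a) == normalize_scheme_host_only(url_b)
```

With these inputs the whole-string variant treats the two URLs as identical, while the scheme/host-only variant keeps them apart.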

This is a breaking change: default unique keys for URLs with mixed-case paths or
queries now differ from previous releases, so such requests are no longer
deduplicated together, and keys persisted by older versions will not match newly
computed ones. Callers who want the old behavior can pass an explicit unique_key,
or lower-case URLs via transform_request_function before enqueuing.
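
As a rough illustration of what keeping the old behavior means for deduplication, the sketch below compares two keying strategies over a small URL list. It is standard-library only and does not call Crawlee; default_style_key, legacy_style_key, and dedupe are hypothetical names, and the legacy key only approximates the old default by lower-casing the whole string. In Crawlee, a string like the legacy key would be supplied as the explicit unique_key mentioned above.

```python
from collections.abc import Callable
from urllib.parse import urlsplit, urlunsplit


def default_style_key(url: str) -> str:
    """Hypothetical stand-in for the new default: scheme/host lower-cased only.

    Assumption: no userinfo in the URL, so netloc.lower() only touches the host.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


def legacy_style_key(url: str) -> str:
    """Hypothetical opt-in key approximating the old whole-URL lower-casing."""
    return default_style_key(url).lower()


def dedupe(urls: list[str], key_fn: Callable[[str], str]) -> list[str]:
    """Keep the first URL seen for each key, mimicking key-based deduplication."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        key = key_fn(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


urls = [
    "https://example.com/Product/ABC",
    "https://example.com/product/abc",
]

kept_with_default = dedupe(urls, default_style_key)  # both URLs survive
kept_with_legacy = dedupe(urls, legacy_style_key)    # second URL is dropped
```

The same comparison also shows why keys persisted by older versions would not match newly computed ones for mixed-case URLs: the two strategies produce different strings for the first URL.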

Issues

Testing

  • uv run pytest tests/unit/_utils/test_requests.py - 19 passed.
  • Updated the one stale parametrized expectation that asserted the buggy query
    lower-casing (HTTPS://EXAMPLE.COM/?KEY=VALUE now normalizes to
    https://example.com/?KEY=VALUE, scheme/host still lower-cased) and renamed its
    id to lowercase_scheme_host_only.
  • Added parametrized cases preserve_path_case and preserve_query_case, plus two
    regression tests, test_normalize_url_preserves_case_distinct_paths_and_queries
    and test_compute_unique_key_preserves_case_distinct_paths_and_queries, that
    assert case-distinct paths/queries no longer collide at both the helper and the
    unique_key level. Both fail on the old .lower() and pass after the fix.
  • ruff check + ruff format --check on the changed files: clean. ty
    (type-check) adds no new diagnostics.

Checklist

  • CI passed

Targeting 2.0 (call out in the PR / discussion, not hidden)

The maintainers deliberately milestoned #2008 to 2.0 and labeled it a breaking
change, because it changes default unique keys and breaks compatibility with queues
persisted by older versions. There is currently no separate v2 branch upstream
(all work, including breaking items batched for 2.0, lands on master, currently
v1.8.0) and no docs/upgrading/upgrading_to_v2.md yet.

The issue also asks to "document the change loudly in the upgrading guide." Since
that guide does not exist yet, the PR opening comment should say the change is
2.0-targeted and offer to add the upgrading-guide entry once the v2 guide is
created (or ask the maintainers whether they want that note added in this PR).
Do not present this as a drop-in, non-breaking bugfix.


Suggested PR opening comment (the note to post with the PR)

This targets the 2.0 milestone, per #2008: it changes the default unique_key
for URLs whose path or query differs only in case, so it is a breaking change
(marked fix! with a BREAKING CHANGE: footer). The core change is dropping a
whole-URL .lower() in normalize_url so only the scheme and host are
lower-cased, matching RFC 3986 and Crawlee for JavaScript.

The issue also asks to document this in the upgrading guide. I did not see a v2
upgrading guide in the repo yet; happy to add an entry there once it exists, or
to include a note here if you prefer. Let me know which you'd like.

normalize_url lowercased the entire URL, including the path and query,
even though only the scheme and host are case-insensitive per RFC 3986.
Because compute_unique_key uses the normalized URL as the default
unique_key, any two URLs that differ only in path or query casing (for
example /Product/ABC vs /product/abc, or ?token=SeCrEt vs ?token=secret)
collided and were silently deduplicated, dropping case-distinct pages
with no log or statistic.

yarl already lower-cases the scheme and host on parse while preserving
the case of the path, query, and fragment, so dropping the trailing
whole-string .lower() yields RFC 3986 compliant normalization and
matches the behavior of Crawlee for JavaScript.

BREAKING CHANGE: default unique keys for URLs with mixed-case paths or
queries now differ from previous releases, so such requests are no
longer deduplicated together and keys persisted by older versions will
not match newly computed ones. Pass an explicit unique_key (or lowercase
URLs via transform_request_function before enqueuing) to keep the old
behavior.

Closes apify#2008

vdusek (Collaborator) commented Jul 5, 2026

Hi @anxkhn, thanks for the PR. However, this issue is currently milestoned for v2 because the fix would introduce breaking changes to how requests are represented in storages.

Because of that, this needs to be handled carefully as part of a major version update, along with an appropriate upgrade guide. I'm going to close this for now, since v2 is still planned for the future rather than being actively targeted at the moment.

vdusek closed this Jul 5, 2026

Development

Successfully merging this pull request may close these issues.

normalize_url lowercases the entire URL, silently deduplicating case-distinct pages
