fix!: normalize_url preserves path, query, and fragment case#2021
Closed
anxkhn wants to merge 1 commit into
Closed
fix!: normalize_url preserves path, query, and fragment case#2021anxkhn wants to merge 1 commit into
anxkhn wants to merge 1 commit into
Conversation
normalize_url lowercased the entire URL, including the path and query, even though only the scheme and host are case-insensitive per RFC 3986. Because compute_unique_key uses the normalized URL as the default unique_key, any two URLs that differ only in path or query casing (for example /Product/ABC vs /product/abc, or ?token=SeCrEt vs ?token=secret) collided and were silently deduplicated, dropping case-distinct pages with no log or statistic. yarl already lower-cases the scheme and host on parse while preserving the case of the path, query, and fragment, so dropping the trailing whole-string .lower() yields RFC 3986 compliant normalization and matches the behavior of Crawlee for JavaScript. BREAKING CHANGE: default unique keys for URLs with mixed-case paths or queries now differ from previous releases, so such requests are no longer deduplicated together and keys persisted by older versions will not match newly computed ones. Pass an explicit unique_key (or lowercase URLs via transform_request_function before enqueuing) to keep the old behavior. Closes apify#2008
Collaborator
|
Hi @anxkhn, thanks for the PR. However, this issue is currently milestoned for v2 because the fix would introduce breaking changes to how requests are represented in storages. Because of that, this needs to be handled carefully as part of a major version update, along with an appropriate upgrade guide. I'm going to close this for now, since v2 is still planned for the future rather than being actively targeted at the moment. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
normalize_urllower-cased the entire URL, including the path and query,even though only the scheme and host are case-insensitive (RFC 3986).
normalize_url(src/crawlee/_utils/requests.py), the final line returnedstr(yarl_new_url).lower(), so the path, query, and fragment were folded tolower case along with the scheme and host.
compute_unique_keyuses the normalized URL as the defaultunique_key, so anytwo URLs that differ only in the case of their path or query (for example
/Product/ABCvs/product/abc, or?token=SeCrEtvs?token=secret) producedthe same key. They were treated as duplicates and silently deduplicated, dropping
case-distinct pages with no log or statistic. This also contradicted the
function's own docstring, which only promised scheme/host lower-casing.
.lower().yarlalready lower-cases thescheme and host when it parses the URL while preserving the case of the path,
query, and fragment, so removing the extra
.lower()yields RFC 3986 compliantnormalization and matches the behavior of
normalizeUrlin Crawlee forJavaScript. The docstring and a comment were corrected to match.
This is a breaking change: default unique keys for URLs with mixed-case paths or
queries now differ from previous releases, so such requests are no longer
deduplicated together, and keys persisted by older versions will not match newly
computed ones. Callers who want the old behavior can pass an explicit
unique_key,or lower-case URLs via
transform_request_functionbefore enqueuing.Issues
normalize_urllowercases the entire URL, silently deduplicating case-distinct pages #2008Testing
uv run pytest tests/unit/_utils/test_requests.py- 19 passed.lower-casing (
HTTPS://EXAMPLE.COM/?KEY=VALUEnow normalizes tohttps://example.com/?KEY=VALUE, scheme/host still lower-cased) and renamed itsid to
lowercase_scheme_host_only.preserve_path_caseandpreserve_query_case, plus tworegression tests,
test_normalize_url_preserves_case_distinct_paths_and_queriesand
test_compute_unique_key_preserves_case_distinct_paths_and_queries, thatassert case-distinct paths/queries no longer collide at both the helper and the
unique_keylevel. Both fail on the old.lower()and pass after the fix.ruff check+ruff format --checkon the changed files: clean.ty(type-check) adds no new diagnostics.
Checklist
Targeting 2.0 (call out in the PR / discussion, not hidden)
The maintainers deliberately milestoned #2008 to 2.0 and labeled it a breaking
change, because it changes default unique keys and breaks compatibility with queues
persisted by older versions. There is currently no separate v2 branch upstream
(all work, including breaking items batched for 2.0, lands on
master, currentlyv1.8.0) and no
docs/upgrading/upgrading_to_v2.mdyet.The issue also asks to "document the change loudly in the upgrading guide." Since
that guide does not exist yet, the PR opening comment should say the change is
2.0-targeted and offer to add the upgrading-guide entry once the v2 guide is
created (or ask the maintainers whether they want that note added in this PR).
Do not present this as a drop-in, non-breaking bugfix.
Suggested PR opening comment (the note to post with the PR)