Fixed #32702 -- Don't decode escaped URL fragments. #14275
Conversation
Hello @vshih! Thank you for your contribution 💪 As it's your first contribution, be sure to check out the patch review checklist. If you're fixing a ticket from Trac, make sure to set the "Has patch" flag and include a link to this PR in the ticket! If you have any design or process questions, you can ask in the Django forum. Welcome aboard ⛵️!
@vshih Thanks for this patch 👍 Please create a new ticket in Trac and follow our bug reporting guidelines. All bugfixes require an accepted ticket.
Ticket created: https://code.djangoproject.com/ticket/32702#ticket
Hi @vshih - Thanks for this.
I left a couple of comments but it's a tricky one so I need to think it through more. There's discussion on ticket-22267 which is related.
django/utils/html.py
Outdated
Why move this line? There's no behaviour change, is there? Please revert this.
This orders the processing of URL components to match the order that they appear, left to right. I can revert if you really want but I think this improves readability.
Possibly, but it makes the commit here less clear. Please do revert. (We could assess whether there's a readability improvement as a separate change, but it's probably not worth it for me.)
django/utils/html.py
Outdated
So the question is whether it's safe to do this. (From security or regressions POV.)
Hey @apollo13, as you have bandwidth, can I ask you to cast your eagle eye over this? Without digging through the whole history it's hard to say whether the proposal here is benign or (I suspect) not. I'm happy to do that digging, but you may be able to point out issues that are obvious to you more quickly. (If it's no clearer to you, say so and I'll do the digging. 🙂) Thanks!
Adding |
@apollo13 I don't know for sure as I am only skimming, but this seems indicative:
from https://datatracker.ietf.org/doc/html/rfc3986#section-3.5
@vshih I do not think that this is indicative. The way I read it is that this means that for instance you cannot assume that the fragment is a json string (format), because that depends on the media type… In the exact link you sent, a fragment is defined as follows:
The way I read this is that a fragment consists of zero or more characters out of (pchar, "/" and "?"). "/" and "?" are rather clear, so let's see what pchar is:
":" & "@" are rather clear,
So, one way is to pct-encode characters; the only other valid ones (aside from the special ones already listed) are ALPHA & DIGIT. Looking at https://datatracker.ietf.org/doc/html/rfc3986#section-2.3, ALPHA & DIGIT seem to be the ranges %41-%5A, %61-%7A, %30-%39, that is, the ASCII letters and digits. As such I think that we do indeed have to percent-encode. So while the format of a fragment is arbitrary, the encoding is not. (Encoding is how strings/bytes are represented in the fragment, and format is how the represented fragment transports meaning to the application; i.e. the first character could be a "command" identifier and the following characters are then arguments to that command. This is what the RFC means when it talks about arbitrary format, imo.) Does that make any sense?
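For readers following the grammar argument, here is a minimal sketch of that reading of RFC 3986: `fragment = *( pchar / "/" / "?" )`, with `pchar = unreserved / pct-encoded / sub-delims / ":" / "@"`. The function name `is_valid_fragment` is hypothetical and is not part of Django or of this PR; the character sets are built from the hex ranges quoted above plus the RFC's `-._~` and sub-delims.

```python
import string

# ALPHA and DIGIT, built from the hex ranges mentioned in the comment above.
ALPHA = {chr(c) for c in range(0x41, 0x5B)} | {chr(c) for c in range(0x61, 0x7B)}
DIGIT = {chr(c) for c in range(0x30, 0x3A)}

# RFC 3986 section 2.3 (unreserved) and section 2.2 (sub-delims).
UNRESERVED = ALPHA | DIGIT | set("-._~")
SUB_DELIMS = set("!$&'()*+,;=")

# Literal characters allowed in a fragment: pchar literals plus "/" and "?".
FRAGMENT_LITERALS = UNRESERVED | SUB_DELIMS | set(":@") | set("/?")
HEXDIG = set(string.hexdigits)


def is_valid_fragment(fragment: str) -> bool:
    """Hypothetical helper: True if `fragment` matches the RFC 3986 grammar."""
    i = 0
    while i < len(fragment):
        ch = fragment[i]
        if ch == "%":
            # A pct-encoded triplet needs exactly two hex digits after "%".
            digits = fragment[i + 1:i + 3]
            if len(digits) != 2 or not set(digits) <= HEXDIG:
                return False
            i += 3
        elif ch in FRAGMENT_LITERALS:
            i += 1
        else:
            return False
    return True
```

Under this reading, both `next=https%3A%2F%2Fexample2.com` and its decoded form `next=https://example2.com` are grammatically valid fragments, which is why the grammar alone does not settle which of the two `urlize()` should emit.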
@apollo13 Okay, I understand what you're saying. What is If that is all, then I can proceed with implementing that. But if it is supposed to somehow detect when the fragment is already percent-encoded, that's when things get really complicated. |
That is a very good question. Generally those utils try to unquote and requote the URL (the use Line 230 in 127fd92
I've implemented the fragment quoting by detecting first whether it is needed, and skipping if not. The unquoting/re-quoting approach fails if a site is expecting an encoded parameter in its fragment:
after unquoting becomes
then, after re-quoting, remains the same, because ":" and "/" are considered safe. Also I rebased.
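The round trip described here can be reproduced with `urllib.parse` alone. This is a sketch under assumptions: the safe set below is modelled on the RFC 3986 gen-delims and sub-delims plus "~" and is not copied from Django's source, and `round_trip` is an illustrative name. The example fragment is the one from the ticket description.

```python
from urllib.parse import quote, unquote

# Assumed safe set: RFC 3986 gen-delims + sub-delims + "~".
SAFE = ":/?#[]@" + "!$&'()*+,;=" + "~"


def round_trip(segment: str) -> str:
    """Decode percent-escapes, then re-encode anything outside SAFE."""
    return quote(unquote(segment), safe=SAFE)


fragment = "next=https%3A%2F%2Fexample2.com"
decoded = unquote(fragment)          # "next=https://example2.com"
round_tripped = round_trip(fragment)  # ":" and "/" are safe, so they stay decoded

print(decoded)
print(round_tripped)
```

Because ":" and "/" are in the safe set, re-quoting does not restore `%3A` and `%2F`, so the round trip is lossy for this input even though the result is still a well-formed fragment.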
Which is correct, no? After all, those characters are safe and don't need to be quoted. Which brings me to the main question: what does this PR offer (aside from more complexity) that didn't work before (ignoring broken apps that don't seem to adhere to the spec)?
No, the current behavior is incorrect. These are valid URLs. The current approach unquotes/re-quotes in an attempt to account for already-encoded URLs. But this is incorrect because This PR is a better approach - detect whether the fragment contains anything out of spec and only quote it if it does. I suspect this approach is probably a better way to treat the URL as a whole too. |
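A minimal sketch of the "detect, then quote only if needed" idea follows. It is not the code from this PR: the name `quote_fragment_if_needed`, the regular expression, and the choice of safe characters are all assumptions made for illustration.

```python
import re
from urllib.parse import quote

# A fragment is left alone if it already consists solely of RFC 3986
# fragment characters and well-formed percent-escapes.
_VALID_FRAGMENT = re.compile(
    r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*"
)


def quote_fragment_if_needed(fragment: str) -> str:
    """Hypothetical helper: return the fragment unchanged when it is in spec."""
    if _VALID_FRAGMENT.fullmatch(fragment):
        return fragment
    # Out-of-spec input: quote everything except characters the grammar allows
    # literally. "%" is deliberately not safe, so a fragment that mixes raw
    # and already-escaped characters gets its "%" double-encoded.
    return quote(fragment, safe="!$&'()*+,;=:@/?")


print(quote_fragment_if_needed("next=https%3A%2F%2Fexample2.com"))
print(quote_fragment_if_needed("section one"))
```

The limitation noted in the comment is the complication raised earlier in the thread: once a fragment mixes raw and percent-encoded characters, there is no reliable way to tell which "%" signs are escapes.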
Can you show an example of those? While it is true that quote is not the exact inverse of unquote (which it doesn't have to be btw), the final URL would imo still be properly quoted. Taking only the tests from your PR and applying them on main I get the following failure:
This test seems to suggest that the current result of |
Just because the final URL is "valid" does not mean that server will accept it as equivalent. Here is the real-world example that I simplified:
This is a console of Amazon Web Services. The fragment contains a URL representing the selected queue in the console interface. After the unquote/re-quote transformation, the result is garbled and no longer works correctly.
A server should never see a fragment. It is handled solely on the client. If they don't consider it equivalent, I am leaning towards calling it a bug on their side (in this case AWS). Taking your argument further, if we were to merge this patch, we would have to apply the same to query parts, the URL path, etc. That said, I am not sure that the increased complexity is worth it to handle edge cases like this (or even broken apps). What would be the next steps, @carltongibson? Discussion on the ML to get more input?
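The point that the fragment stays on the client can be illustrated with `urllib.parse.urlsplit`. This is a sketch: `request_target` is a hypothetical helper that mimics how an HTTP client typically builds the origin-form request target from the path and query only.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Hypothetical helper: the path and query a client would put on the request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target  # parts.fragment is intentionally never included


url = "https://example.com/home#next=https%3A%2F%2Fexample2.com"
print(request_target(url))     # path only; the fragment is dropped
print(urlsplit(url).fragment)  # urlsplit leaves the fragment undecoded
```

Whatever interprets the fragment is therefore client-side code, typically JavaScript reading `location.hash`, and that code decides whether the encoded and decoded forms are treated as equivalent.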
Thanks for the input both. I'm struggling to see the benefit here. This seems telling:
And FWIW the kind of URL Django has outputted for years. I'm reluctant (to say the least) to add the additional logic without full consideration, which means justification from a spec, and consensus on the mailing list that this is the correct change. (This is hard to reason about in the abstract: I think a discussion beginning with the real-world examples is better/easier. "Here's how this comes up", "Here's a possible workaround", "Here's the suggested change" is much easier to get traction on than just "Here's the suggested change" with a small number of test cases. Often someone can provide a "Here's a possible workaround" that's good enough.) I hope that makes sense.
ticket-32702
Currently urlize() will unquote then quote the fragment component of URLs. This transformation can be problematic, for example if the fragment contains a %-encoded URL: https://example.com/home#next=https%3A%2F%2Fexample2.com
This results in:
Note how the generated href has its fragment decoded. Because the formatting of the fragment is completely arbitrary and site-dependent, I suggest that the fragment should not be altered at all and simply rendered as-is.
Previous related PRs:
Related Trac: