Fixed #32702 -- Don't decode escaped URL fragments. #14275

vshih · 2021-04-17T01:18:16Z

Currently urlize() will unquote then quote the fragment component of URLs. This transformation can be problematic - for example if it contains a %-encoded URL:

https://example.com/home#next=https%3A%2F%2Fexample2.com

This results in:

<a href="https://example.com/home#next=https://example2.com">https://example.com/home#next=https%3A%2F%2Fexample2.com</a>

Note how the generated href has its fragment decoded.

Because the formatting for the fragment is completely arbitrary and site-dependent, I suggest that the fragment should not be altered at all and simply rendered as-is.

Previous related PRs:

Related Trac:

https://code.djangoproject.com/ticket/9655

github-actions · 2021-04-17T01:25:11Z

Hello @vshih! Thank you for your contribution 💪

As it's your first contribution be sure to check out the patch review checklist.

If you're fixing a ticket from Trac make sure to set the "Has patch" flag and include a link to this PR in the ticket!

If you have any design or process questions then you can ask in the Django forum.

Welcome aboard ⛵️!

felixxm · 2021-04-30T08:51:20Z

@vshih Thanks for this patch 👍 Please create a new ticket in Trac and follow our bug reporting guidelines. All bugfixes require an accepted ticket.

vshih · 2021-04-30T23:40:40Z

Ticket created - https://code.djangoproject.com/ticket/32702#ticket.

carltongibson

Hi @vshih - Thanks for this.

I left a couple of comments but it's a tricky one so I need to think it through more. There's discussion on ticket-22267 which is related.

carltongibson · 2021-05-11T09:02:12Z

django/utils/html.py

Why move this line? There's no behaviour change no? I think revert this please.

This orders the processing of URL components to match the order that they appear, left to right. I can revert if you really want but I think this improves readability.

Possibly, but it makes the commit here less clear. Please do revert. (We could assess whether there's a readability improvement as a separate change, but it's probably not worth it for me.)

carltongibson · 2021-05-11T09:02:58Z

django/utils/html.py

So the question is whether it's safe to do this. (From security or regressions POV.)

carltongibson · 2021-05-12T06:30:38Z

Hey @apollo13 — as you have bandwidth, can I ask you to consider this with your eagle-eye?

Without digging through the whole history it's hard to say if the proposal here is benign or (I suspect) not. Happy to do that but thinking you're maybe able to point out obvious (to you) issues more quickly.

(If it's no clearer to you, say and I'll do the digging. 🙂)

Thanks!

apollo13 · 2021-05-12T07:19:21Z

Adding #öäü to the current URL (this github PR) and executing window.location.hash in the JS devtools yields "#%C3%A4%C3%B6%C3%BC". Adding that the browser will decode it for display in the URL-bar. So I am not exactly sure if it is true that the fragment is technically arbitrary. Can you link us to the RFCs sections supporting your statements @vshih

vshih · 2021-05-14T21:56:18Z

@apollo13 I don't know for sure as I am only skimming, but this seems indicative:

... The fragment's format and resolution is therefore
dependent on the media type [RFC2046] of a potentially retrieved
representation, even though such a retrieval is only performed if the
URI is dereferenced. If no such representation exists, then the
semantics of the fragment are considered unknown and are effectively
unconstrained.

from https://datatracker.ietf.org/doc/html/rfc3986#section-3.5

apollo13 · 2021-05-17T10:18:19Z

@vshih I do not think that this is indicative. The way I read it is that this means that for instance you cannot assume that the fragment is a json string (format), because that depends on the media type…

In the exact link you sent, a fragment is defined as follows:

  fragment    = *( pchar / "/" / "?" )

The way I read this is that a fragment consists of one or more characters out of (pchar, / and ?). "/" and "?" are rather clear, lets see what pchar is:

 pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

":" & "@" are rather clear, sub-delims is also just a explicit list of characters. So this leaves us with unreserved and pct-encoded:

  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
  pct-encoded   = "%" HEXDIG HEXDIG

So, one way is to pct-encode characters, the only other valid ones (aside from the special ones already listed) are ALPHA & DIGIT.

Looking at https://datatracker.ietf.org/doc/html/rfc3986#section-2.3 ALPHA & DIGITS seems to be the ranges %41-%5A, %61-%7A, %30-%39 -- essentially printable ASCII characters.

As such I think that we indeed have to percent encode. So while the format of a fragment is arbitrary, the encoding is not. (Encoding is how strings/bytes are represented in the fragment and format is how the represented fragment transports meaning to the application; ie the first character could be a "command" identifier and the following characters are then arguments to that command. This is what the RFC means when it talks about arbitrary format imo). Does that make any sense?

vshih · 2021-05-17T18:49:26Z

@apollo13 Okay, I understand what you're saying. What is urlize's responsibility then? To percent-encode anything in the fragment that should be encoded, but isn't?

If that is all, then I can proceed with implementing that. But if it is supposed to somehow detect when the fragment is already percent-encoded, that's when things get really complicated.

apollo13 · 2021-05-19T07:11:34Z

That is a very good question. Generally those utils try to unquote and requote the URL (the use smart_urlquote internally). So yes it should already automatically handle a fragment that is already encoded.

django/django/utils/html.py

Line 230 in 127fd92

fragment = unquote_quote(fragment)

-- does that help?

vshih · 2021-06-26T00:12:20Z

I've implemented the fragment quoting by detecting first whether it is needed, and skipping if not.

The unquoting/re-quoting approach fails if a site is expecting an encoded parameter in its fragment:

example.com/home#next=https%3A%2F%2Fexample2.com

after unquoting becomes

example.com/home#next=https://example2.com

then, after re-quoting, remains the same, because ":" and "/" are considered safe.

Also I rebased.

apollo13 · 2021-06-26T08:54:41Z

then, after re-quoting, remains the same, because ":" and "/" are considered safe.

Which is correct, no? After all those characters are safe and don't need to be quoted. Which brings me to the main question: What does this PR offer (aside from more complexity) that didn't work before (ignoring broken apps that don't seem to adhere to the spec)

vshih · 2021-06-26T22:14:16Z

No, the current behavior is incorrect. These are valid URLs. The current approach unquotes/re-quotes in an attempt to account for already-encoded URLs. But this is incorrect because quote is not an exact inverse of unquote for valid data. The result is that valid URLs are getting corrupted.

This PR is a better approach - detect whether the fragment contains anything out of spec and only quote it if it does. I suspect this approach is probably a better way to treat the URL as a whole too.

apollo13 · 2021-06-27T11:41:40Z

But this is incorrect because quote is not an exact inverse of unquote for valid data. The result is that valid URLs are getting corrupted.

Can you show an example of those? While it is true that quote is not the exact inverse of unquote (which it doesn't have to be btw), the final URL would imo still be properly quoted. Taking only the tests from your PR and applying them on main I get the following failure:

AssertionError: 'Esca[62 chars]=http://example.com/admin">https://example.com[52 chars]</a>' != 'Esca[62 chars]=http%3A%2F%2Fexample.com%2Fadmin">https://exa[60 chars]</a>'
- Escaped fragment - <a href="https://example.com/login?u=1#redirect=http://example.com/admin">https://example.com/login?u=1#redirect=http%3A%2F%2Fexample.com%2Fadmin</a>
?                                                                        ^^^           ^
+ Escaped fragment - <a href="https://example.com/login?u=1#redirect=http%3A%2F%2Fexample.com%2Fadmin">https://example.com/login?u=1#redirect=http%3A%2F%2Fexample.com%2Fadmin</a>
?                                                                        ^^^^^^^^^           ^^^

This test seems to suggest that the current result of https://example.com/login?u=1#redirect=http://example.com/admin is incorrect which is simply not the case. This is a perfectly fine and valid URL.

vshih · 2021-06-27T16:47:04Z

Just because the final URL is "valid" does not mean that server will accept it as equivalent.

Here is the real-world example that I simplified:

https://console.aws.amazon.com/sqs/v2/home?region=us-east-1#/queues/https%3A%2F%2Fsqs.us-east-1.amazonaws.com%2F12345%2Fqueue-name

This is a console of Amazon Web Services. The fragment contains a URL representing the selected queue in the console interface. After the unquote/re-quote transformation, the result is garbled and no longer works correctly.

apollo13 · 2021-06-28T07:14:48Z

Just because the final URL is "valid" does not mean that server will accept it as equivalent.

A server should never see a fragment. This is solely handled on the client. If they don't consider it equivalent I am leaning towards calling it a bug on their side. (In this case AWS)

Taking your argument further, if we were to merge this patch, we would have to apply the same for query parts and url path etc… That said I am not sure that the increased complexity is worth it to handle edgecases like this (or even broken apps).

What would be the next steps @carltongibson? Discussion on the ML to get more input?

carltongibson · 2021-07-07T06:33:13Z

Thanks for the input both.

I'm struggling to see the benefit here. This seems telling:

This test seems to suggest that the current result of https://example.com/login?u=1#redirect=http://example.com/admin is incorrect which is simply not the case. This is a perfectly fine and valid URL.

And FWIW the kind of URL Django has outputted for years.

I'm reluctant (to say the least) to add the additional logic without full consideration, which means justification from a spec, and consensus on the mailing list that this is the correct change.

(This is hard to reason about in the abstract: I think the discussion beginning with the real-world examples is better/easier. Here's how this comes up, Here's a possible workaround, Here's the suggested change is much easier to get traction on than just Here's the suggested change, with a small number of test-cases. Often someone can provide Here's a possible workaround that's good enough.)

I hope that makes sense.

carltongibson changed the title ~~Don't decode escaped URL fragments.~~ Fixed #32702 -- Don't decode escaped URL fragments. May 11, 2021

carltongibson reviewed May 11, 2021

View reviewed changes

carltongibson requested a review from apollo13 May 12, 2021 06:26

vshih added 3 commits June 25, 2021 12:23

Don't decode escaped URL fragments.

3452cbe

Revert processing order change.

cc3f8d5

Detect whether fragment needs quoting.

c15fbbd

vshih force-pushed the main branch from 55a9942 to c15fbbd Compare June 26, 2021 00:10

carltongibson closed this Jul 7, 2021

Uh oh!

Fixed #32702 -- Don't decode escaped URL fragments. #14275

Fixed #32702 -- Don't decode escaped URL fragments. #14275

Uh oh!

Conversation

vshih commented Apr 17, 2021 • edited by carltongibson Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Apr 17, 2021

Uh oh!

felixxm commented Apr 30, 2021

Uh oh!

vshih commented Apr 30, 2021

Uh oh!

carltongibson left a comment

Choose a reason for hiding this comment

Uh oh!

carltongibson May 11, 2021

Choose a reason for hiding this comment

Uh oh!

vshih May 11, 2021

Choose a reason for hiding this comment

Uh oh!

carltongibson May 12, 2021

Choose a reason for hiding this comment

Uh oh!

carltongibson May 11, 2021

Choose a reason for hiding this comment

Uh oh!

carltongibson commented May 12, 2021

Uh oh!

apollo13 commented May 12, 2021

Uh oh!

vshih commented May 14, 2021

Uh oh!

apollo13 commented May 17, 2021

Uh oh!

vshih commented May 17, 2021

Uh oh!

apollo13 commented May 19, 2021

Uh oh!

vshih commented Jun 26, 2021

Uh oh!

apollo13 commented Jun 26, 2021

Uh oh!

vshih commented Jun 26, 2021

Uh oh!

apollo13 commented Jun 27, 2021

Uh oh!

vshih commented Jun 27, 2021

Uh oh!

apollo13 commented Jun 28, 2021

Uh oh!

carltongibson commented Jul 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vshih commented Apr 17, 2021 •

edited by carltongibson

Loading

carltongibson commented Jul 7, 2021 •

edited

Loading