Skip to content

Conversation

vshih
Copy link

@vshih vshih commented Apr 17, 2021

ticket-32702

Currently urlize() will unquote then quote the fragment component of URLs. This transformation can be problematic - for example if it contains a %-encoded URL:

https://example.com/home#next=https%3A%2F%2Fexample2.com

This results in:

<a href="https://example.com/home#next=https://example2.com">https://example.com/home#next=https%3A%2F%2Fexample2.com</a>

Note how the generated href has its fragment decoded.

Because the formatting for the fragment is completely arbitrary and site-dependent, I suggest that the fragment should not be altered at all and simply rendered as-is.

Previous related PRs:

Related Trac:

@github-actions
Copy link

Hello @vshih! Thank you for your contribution 💪

As it's your first contribution be sure to check out the patch review checklist.

If you're fixing a ticket from Trac make sure to set the "Has patch" flag and include a link to this PR in the ticket!

If you have any design or process questions then you can ask in the Django forum.

Welcome aboard ⛵️!

@felixxm
Copy link
Member

felixxm commented Apr 30, 2021

@vshih Thanks for this patch 👍 Please create a new ticket in Trac and follow our bug reporting guidelines. All bugfixes require an accepted ticket.

@vshih
Copy link
Author

vshih commented Apr 30, 2021

Ticket created - https://code.djangoproject.com/ticket/32702#ticket.

@carltongibson carltongibson changed the title Don't decode escaped URL fragments. Fixed #32702 -- Don't decode escaped URL fragments. May 11, 2021
Copy link
Member

@carltongibson carltongibson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vshih - Thanks for this.

I left a couple of comments but it's a tricky one so I need to think it through more. There's discussion on ticket-22267 which is related.

Comment on lines 221 to 222
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move this line? There's no behaviour change no? I think revert this please.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This orders the processing of URL components to match the order that they appear, left to right. I can revert if you really want but I think this improves readability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but it makes the commit here less clear. Please do revert. (We could assess whether there's a readability improvement as a separate change, but it's probably not worth it for me.)

Comment on lines 230 to 231
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the question is whether it's safe to do this. (From security or regressions POV.)

@carltongibson carltongibson requested a review from apollo13 May 12, 2021 06:26
@carltongibson
Copy link
Member

Hey @apollo13 — as you have bandwidth, can I ask you to consider this with your eagle-eye?

Without digging through the whole history it's hard to say if the proposal here is benign or (I suspect) not. Happy to do that but thinking you're maybe able to point out obvious (to you) issues more quickly.

(If it's no clearer to you, say and I'll do the digging. 🙂)

Thanks!

@apollo13
Copy link
Member

Adding #öäü to the current URL (this github PR) and executing window.location.hash in the JS devtools yields "#%C3%A4%C3%B6%C3%BC". Adding that the browser will decode it for display in the URL-bar. So I am not exactly sure if it is true that the fragment is technically arbitrary. Can you link us to the RFCs sections supporting your statements @vshih

@vshih
Copy link
Author

vshih commented May 14, 2021

@apollo13 I don't know for sure as I am only skimming, but this seems indicative:

... The fragment's format and resolution is therefore
dependent on the media type [RFC2046] of a potentially retrieved
representation, even though such a retrieval is only performed if the
URI is dereferenced. If no such representation exists, then the
semantics of the fragment are considered unknown and are effectively
unconstrained.

from https://datatracker.ietf.org/doc/html/rfc3986#section-3.5

@apollo13
Copy link
Member

@vshih I do not think that this is indicative. The way I read it is that this means that for instance you cannot assume that the fragment is a json string (format), because that depends on the media type…

In the exact link you sent, a fragment is defined as follows:

  fragment    = *( pchar / "/" / "?" )

The way I read this is that a fragment consists of one or more characters out of (pchar, / and ?). "/" and "?" are rather clear, lets see what pchar is:

 pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

":" & "@" are rather clear, sub-delims is also just a explicit list of characters. So this leaves us with unreserved and pct-encoded:

  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
  pct-encoded   = "%" HEXDIG HEXDIG

So, one way is to pct-encode characters, the only other valid ones (aside from the special ones already listed) are ALPHA & DIGIT.

Looking at https://datatracker.ietf.org/doc/html/rfc3986#section-2.3 ALPHA & DIGITS seems to be the ranges %41-%5A, %61-%7A, %30-%39 -- essentially printable ASCII characters.

As such I think that we indeed have to percent encode. So while the format of a fragment is arbitrary, the encoding is not. (Encoding is how strings/bytes are represented in the fragment and format is how the represented fragment transports meaning to the application; ie the first character could be a "command" identifier and the following characters are then arguments to that command. This is what the RFC means when it talks about arbitrary format imo). Does that make any sense?

@vshih
Copy link
Author

vshih commented May 17, 2021

@apollo13 Okay, I understand what you're saying. What is urlize's responsibility then? To percent-encode anything in the fragment that should be encoded, but isn't?

If that is all, then I can proceed with implementing that. But if it is supposed to somehow detect when the fragment is already percent-encoded, that's when things get really complicated.

@apollo13
Copy link
Member

That is a very good question. Generally those utils try to unquote and requote the URL (the use smart_urlquote internally). So yes it should already automatically handle a fragment that is already encoded.

fragment = unquote_quote(fragment)
-- does that help?

@vshih
Copy link
Author

vshih commented Jun 26, 2021

I've implemented the fragment quoting by detecting first whether it is needed, and skipping if not.

The unquoting/re-quoting approach fails if a site is expecting an encoded parameter in its fragment:

example.com/home#next=https%3A%2F%2Fexample2.com

after unquoting becomes

example.com/home#next=https://example2.com

then, after re-quoting, remains the same, because ":" and "/" are considered safe.

Also I rebased.

@apollo13
Copy link
Member

then, after re-quoting, remains the same, because ":" and "/" are considered safe.

Which is correct, no? After all those characters are safe and don't need to be quoted. Which brings me to the main question: What does this PR offer (aside from more complexity) that didn't work before (ignoring broken apps that don't seem to adhere to the spec)

@vshih
Copy link
Author

vshih commented Jun 26, 2021

No, the current behavior is incorrect. These are valid URLs. The current approach unquotes/re-quotes in an attempt to account for already-encoded URLs. But this is incorrect because quote is not an exact inverse of unquote for valid data. The result is that valid URLs are getting corrupted.

This PR is a better approach - detect whether the fragment contains anything out of spec and only quote it if it does. I suspect this approach is probably a better way to treat the URL as a whole too.

@apollo13
Copy link
Member

But this is incorrect because quote is not an exact inverse of unquote for valid data. The result is that valid URLs are getting corrupted.

Can you show an example of those? While it is true that quote is not the exact inverse of unquote (which it doesn't have to be btw), the final URL would imo still be properly quoted. Taking only the tests from your PR and applying them on main I get the following failure:

AssertionError: 'Esca[62 chars]=http://example.com/admin">https://example.com[52 chars]</a>' != 'Esca[62 chars]=http%3A%2F%2Fexample.com%2Fadmin">https://exa[60 chars]</a>'
- Escaped fragment - <a href="https://example.com/login?u=1#redirect=http://example.com/admin">https://example.com/login?u=1#redirect=http%3A%2F%2Fexample.com%2Fadmin</a>
?                                                                        ^^^           ^
+ Escaped fragment - <a href="https://example.com/login?u=1#redirect=http%3A%2F%2Fexample.com%2Fadmin">https://example.com/login?u=1#redirect=http%3A%2F%2Fexample.com%2Fadmin</a>
?                                                                        ^^^^^^^^^           ^^^

This test seems to suggest that the current result of https://example.com/login?u=1#redirect=http://example.com/admin is incorrect which is simply not the case. This is a perfectly fine and valid URL.

@vshih
Copy link
Author

vshih commented Jun 27, 2021

Just because the final URL is "valid" does not mean that server will accept it as equivalent.

Here is the real-world example that I simplified:

https://console.aws.amazon.com/sqs/v2/home?region=us-east-1#/queues/https%3A%2F%2Fsqs.us-east-1.amazonaws.com%2F12345%2Fqueue-name

This is a console of Amazon Web Services. The fragment contains a URL representing the selected queue in the console interface. After the unquote/re-quote transformation, the result is garbled and no longer works correctly.

@apollo13
Copy link
Member

Just because the final URL is "valid" does not mean that server will accept it as equivalent.

A server should never see a fragment. This is solely handled on the client. If they don't consider it equivalent I am leaning towards calling it a bug on their side. (In this case AWS)

Taking your argument further, if we were to merge this patch, we would have to apply the same for query parts and url path etc… That said I am not sure that the increased complexity is worth it to handle edgecases like this (or even broken apps).

What would be the next steps @carltongibson? Discussion on the ML to get more input?

@carltongibson
Copy link
Member

carltongibson commented Jul 7, 2021

Thanks for the input both.

I'm struggling to see the benefit here. This seems telling:

This test seems to suggest that the current result of https://example.com/login?u=1#redirect=http://example.com/admin is incorrect which is simply not the case. This is a perfectly fine and valid URL.

And FWIW the kind of URL Django has outputted for years.

I'm reluctant (to say the least) to add the additional logic without full consideration, which means justification from a spec, and consensus on the mailing list that this is the correct change.

(This is hard to reason about in the abstract: I think the discussion beginning with the real-world examples is better/easier. Here's how this comes up, Here's a possible workaround, Here's the suggested change is much easier to get traction on than just Here's the suggested change, with a small number of test-cases. Often someone can provide Here's a possible workaround that's good enough.)

I hope that makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants