Wanted: correct URI escaping #23

Closed
technoweenie opened this Issue Jan 7, 2012 · 13 comments

Comments

Projects
None yet
5 participants

Look at the difference:

>> EscapeUtils.escape_uri 'a/ +b'
=> "a/%20+b"
>> EscapeUtils.escape_url 'a/ +b'
=> "a%2F+%2Bb"

IMO both are wrong. Also, the similarity in the two method names is ridiculous. This is what I'd like to see:

>> EscapeUtils.escape_full_uri 'a/ +b'
=> "a/%20%2Bb"
>> EscapeUtils.escape_uri_path 'a/ +b'
=> "a%2F%20%2Bb"

This is closer to how javascript works:

> encodeURI('a/ +b')
'a/%20+b'
> encodeURIComponent('a/ +b')
'a%2F%20%2Bb'

I like to escape + to be safe, because it's not entirely clear if you mean a + or a space. If you agree, I'll happily hack it up and send you a pull. If not, I'll do it anyway and force my fork in GitHub :) :)

Collaborator

vmg commented Jan 7, 2012

escape_url and escape_uri are named and implemented following the behavior of the Ruby standard library. I don't have any opinions on which behavior is saner, that's up to @brianmario.

@technoweenie: don't change anything in escape_utils, send a PR against tanoku/houdini and I'll backport this to escape_utils and other relevant places.

Owner

brianmario commented Jan 7, 2012

Just for reference:

>> URI.escape 'a +b'
=> "a%20+b"
>> URI.escape 'a/ +b'
=> "a/%20+b"
>> Rack::Utils.escape 'a/ +b'
=> "a%2F+%2Bb"
>> Rack::Utils.escape 'a +b'
=> "a+%2Bb"
Owner

brianmario commented Jan 7, 2012

Oh and

>> CGI.escape 'a +b'
=> "a+%2Bb"
>> CGI.escape 'a/ +b'
=> "a%2F+%2Bb"
Owner

brianmario commented Jan 7, 2012

Also this is interesting https://github.com/rack/rack/blob/master/lib/rack/utils.rb#L28-33 - especially the part that says "Strictly speaking this is true URI escaping." because according to RFC 3986 I think the + character is in the "reserved" list (to not be escaped)?

It's easier if you always encodes spaces as %20, and + as %2B. I realize this could break people that rely on ruby behavior, which is why I suggested completely different names.

Is this fixed in any version of escape_utils? There's a bug in gollum related to incorrect URI escaping.

I ended up porting V8's encodeURIComponent, although the downside is a pure ruby solution is slow. It'd be nice to have the option of correct URI escaping (matching JavaScript) even if it isn't the default.

For correct JavaScript escaping one has to fork both houdini and escape_utils? This issue has been open for 9 months and is causing bugs in gollum. I'll submit a pull request if it'll be merged.

Contributor

ptoomey3 commented Mar 24, 2015

Bump. It seems like there have a been a few more issues opened up related to this:

The issue raised in #31 is particularly troublesome:

q = "plus+plus"
escaped = EscapeUtils.escape_url(q)
EscapeUtils.unescape_url(escaped) == q
=> false

Surely things should round trip unambiguously. Given that escape_url seems to form encode values we can look to the CGI module to see what we would expect:

CGI.unescape("a%2bb")
=> "a+b"

EscapeUtils.unescape_url("a%2bb")
=> "a b"

Regarding @technoweenie's original point, I think he is correct in that there is currently no API to encode a component of a path correctly. Though @brianmario went through the other libraries above, I'll repeat his experiment with a bit of context for each:

URI

Overall it is just not clear what context escape is meant to be used in.

>> URI.escape 'a +b'
=> "a%20+b"

At first glance it looks like that encoding is appropriate for a path component (where + does not require encoding). But, then you do:

>> URI.escape 'a/ +b'
=> "a/%20+b"

Doh, that isn't useful as a path component. It seems like it is trying to interpret the string as a full URI and escape only the "special characters" for each URI component. This is analogous to what JavaScript's encodeURI does. But, what if I want to send a / in the path? JavaScript has encodeURIComponent to handle this, but the URI module has no such equivalent method.

CGI

CGI seems to be pretty easy to reason about.

>> CGI.escape 'a +b'
=> "a+%2Bb"
>> CGI.escape 'a/ +b'
=> "a%2F+%2Bb"

It looks like CGI simply encodes things appropriately for query/form parameter encoding. The CGI module has no awareness/support for encoding a path component. We can't use the CGI module for encoding path components because any space will get encoded as a +, which is not legit for a path component.

Rack

This one seems to do more or less what I would expect:

>> Rack::Utils.escape 'a/ +b'
=> "a%2F+%2Bb"
>> Rack::Utils.escape 'a +b'
=> "a+%2Bb"

The above works exactly like CGI.escape and is appropriate for query parameter encoding. But, unlike CGI, Rack also supports an escape_path method that works as follows:

>> Rack::Utils.escape_path 'a/ +b'
=> "a%2F%20%2Bb"
>> Rack::Utils.escape_path 'a +b'
=> "a%20%2Bb"

Technically you might hope for the + to have been left alone, but encoding it as %2B is perfectly fine since it shouldn't break any browser behavior and it is unambiguously decodable.

EscapeUtils

EscapeUtils#escape_url seems to work like CGI#escape and is fine for encoding query parameters. And, it looks like EscapeUtils.escape_uri works more or less like JavaScript's encodeURI:

encodeURI('http://google.com/foo bar/ +b?yada=blah bar')
"http://google.com/foo%20bar/%20+b?yada=blah%20bar"

EscapeUtils.escape_uri('http://google.com/foo bar/ +b?yada=blah bar')
=> "http://google.com/foo%20bar/%20+b?yada=blah%20bar"

This is fine, but it is missing the JavaScript analogue of encodeURIComponent. So, maybe it would be best to simply add one additional API: EscapeUtils#escape_uri_component that models encodeURIComponent (which seems more or less the same as Rack's Rack::Util::escape_path):

encodeURIComponent('a/ +b')
"a%2F%20%2Bb"

Rack::Utils.escape_path('a/ +b')
=> "a%2F%20%2Bb"

@brianmario - Do you have any objections to the following two changes?:

  • Make EscapeUtils#unescape_url match the behavior of CGI.unescape so that things can roundtrip.
  • Add EscapeUtils#escape_uri_component/EscapeUtils#unescape_uri_component that can be used to correctly escape a component of a path?

@technoweenie - Would that satisfy what you were looking for?

Owner

brianmario commented Mar 24, 2015

I don't have any objections to those changes. This stuff has been a huge pain because the implementations and specifications don't always match up and depending on the context a caller might expect different things from the same APIs.

I think your proposal of adding a new API is safe though. Then at least we won't be messing with what people may or may not be expecting from the existing API.

Unfortunately I'm probably not going to have any time in the near future to work on this. How are your C skills? ;) Also, this may actually be an issue with https://github.com/vmg/houdini which is embedded in this gem but we're just wrapping that.

Contributor

ptoomey3 commented Mar 24, 2015

@vmg - Does is make sense that the logic in https://github.com/brianmario/escape_utils/blob/a1a8f32ba79323d38ba007127c9b1096950cd437/ext/escape_utils/houdini_uri_u.c#L45-L49 is incorrect? We don't want to convert all + to spaces after you are done percent decoding, as this would preclude any literal + from appearing in the output. So, it seems like we can just move that logic above (working on the raw buffer rather than the gh_buf), before the percent decoding is done, eh?

Collaborator

vmg commented Mar 24, 2015

I'm... I really don't recall what that logic was meant to accomplish, it's been more than 2 years since I wrote that. If you tell me exactly what behavior you want, I'll implement it for you. That I can do. :)

Contributor

ptoomey3 commented Mar 24, 2015

@vmg - 🙇 if you could make the following changes/additions to Houdini then I'll take care of updating EscapeUtils (the easy part 😄 )

  • during unescape_url you should substitute + for space before you do any percent decoding. Doing it after makes it such that a percent encoded plus (%2b) is first decoded to a +, but then that is translated to a space. In the end, the current code makes it so that a + can never end up in unescaped output.
  • add a new function escape_uri_component. This works exactly the same as escape_url, but encodes space as %20 instead of +. I guess technically we also need an unescape_uri_component, though I think the implementation will be exactly the same as unescape_uri.

vmg self-assigned this Mar 24, 2015

brianmario closed this in #59 Mar 25, 2015

Collaborator

vmg commented Mar 25, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment