Look at the difference:
>> EscapeUtils.escape_uri 'a/ +b'
>> EscapeUtils.escape_url 'a/ +b'
IMO both are wrong. Also, the similarity in the two method names is ridiculous. This is what I'd like to see:
>> EscapeUtils.escape_full_uri 'a/ +b'
>> EscapeUtils.escape_uri_path 'a/ +b'
> encodeURI('a/ +b')
> encodeURIComponent('a/ +b')
I like to escape + to be safe, because it's not entirely clear if you mean a + or a space. If you agree, I'll happily hack it up and send you a pull. If not, I'll do it anyway and force my fork in GitHub :) :)
escape_url and escape_uri are named and implemented following the behavior of the Ruby standard library. I don't have any opinions on which behavior is saner, that's up to @brianmario.
@technoweenie: don't change anything in escape_utils, send a PR against tanoku/houdini and I'll backport this to escape_utils and other relevant places.
Just for reference:
>> URI.escape 'a +b'
>> URI.escape 'a/ +b'
>> Rack::Utils.escape 'a/ +b'
>> Rack::Utils.escape 'a +b'
>> CGI.escape 'a +b'
>> CGI.escape 'a/ +b'
Also this is interesting https://github.com/rack/rack/blob/master/lib/rack/utils.rb#L28-33 - especially the part that says "Strictly speaking this is true URI escaping." because according to RFC 3986 I think the + character is in the "reserved" list (to not be escaped)?
It's easier if you always encodes spaces as %20, and + as %2B. I realize this could break people that rely on ruby behavior, which is why I suggested completely different names.
Is this fixed in any version of escape_utils? There's a bug in gollum related to incorrect URI escaping.
Bump. It seems like there have a been a few more issues opened up related to this:
The issue raised in #31 is particularly troublesome:
q = "plus+plus"
escaped = EscapeUtils.escape_url(q)
EscapeUtils.unescape_url(escaped) == q
Surely things should round trip unambiguously. Given that escape_url seems to form encode values we can look to the CGI module to see what we would expect:
=> "a b"
Regarding @technoweenie's original point, I think he is correct in that there is currently no API to encode a component of a path correctly. Though @brianmario went through the other libraries above, I'll repeat his experiment with a bit of context for each:
Overall it is just not clear what context escape is meant to be used in.
>> URI.escape 'a +b'
At first glance it looks like that encoding is appropriate for a path component (where + does not require encoding). But, then you do:
>> URI.escape 'a/ +b'
CGI seems to be pretty easy to reason about.
It looks like CGI simply encodes things appropriately for query/form parameter encoding. The CGI module has no awareness/support for encoding a path component. We can't use the CGI module for encoding path components because any space will get encoded as a +, which is not legit for a path component.
This one seems to do more or less what I would expect:
The above works exactly like CGI.escape and is appropriate for query parameter encoding. But, unlike CGI, Rack also supports an escape_path method that works as follows:
>> Rack::Utils.escape_path 'a/ +b'
>> Rack::Utils.escape_path 'a +b'
Technically you might hope for the + to have been left alone, but encoding it as %2B is perfectly fine since it shouldn't break any browser behavior and it is unambiguously decodable.
encodeURI('http://google.com/foo bar/ +b?yada=blah bar')
EscapeUtils.escape_uri('http://google.com/foo bar/ +b?yada=blah bar')
@brianmario - Do you have any objections to the following two changes?:
@technoweenie - Would that satisfy what you were looking for?
I don't have any objections to those changes. This stuff has been a huge pain because the implementations and specifications don't always match up and depending on the context a caller might expect different things from the same APIs.
I think your proposal of adding a new API is safe though. Then at least we won't be messing with what people may or may not be expecting from the existing API.
Unfortunately I'm probably not going to have any time in the near future to work on this. How are your C skills? ;) Also, this may actually be an issue with https://github.com/vmg/houdini which is embedded in this gem but we're just wrapping that.
@vmg - Does is make sense that the logic in https://github.com/brianmario/escape_utils/blob/a1a8f32ba79323d38ba007127c9b1096950cd437/ext/escape_utils/houdini_uri_u.c#L45-L49 is incorrect? We don't want to convert all + to spaces after you are done percent decoding, as this would preclude any literal + from appearing in the output. So, it seems like we can just move that logic above (working on the raw buffer rather than the gh_buf), before the percent decoding is done, eh?
I'm... I really don't recall what that logic was meant to accomplish, it's been more than 2 years since I wrote that. If you tell me exactly what behavior you want, I'll implement it for you. That I can do. :)
@vmg - 🙇 if you could make the following changes/additions to Houdini then I'll take care of updating EscapeUtils (the easy part 😄 )