New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow urlencoded '+' characters in _scheduler/jobs/<job_id> field #825

Closed
nickva opened this Issue Sep 18, 2017 · 9 comments

Comments

Projects
None yet
2 participants
@nickva
Contributor

nickva commented Sep 18, 2017

Since replication IDs could have + characters in them (they might look like 2abc+create_target+continuous for example), the ID has to be URL encoded when specified as part of a URI. Otherwise '+' signs will be decoded into spaces and on the server side it will look like 2abc create_target continuous.

As a convenience it is possible to unambiguously accept both versions, and thus make URL encoding on the client optional. In other words, client could specify 2abc%2Bcreate_target%2Bcontinuous or or 2abc+create_target+continuous. On the server the former will be decoded as 2abc+create_target+continuous and the latter - 2abc create_target continuous. Then spaces can be turned into +-s since a replication ID generated by server will never have space characters in it.

@eiri

This comment has been minimized.

Show comment
Hide comment
@eiri

eiri Sep 18, 2017

Member

I'm against of breaking with URI RFC for convenience and in my experience encoding ambiguity in general leads to nothing but mess and confusion. If replication ID scheme is hard to work with we probably should just change that scheme.

Member

eiri commented Sep 18, 2017

I'm against of breaking with URI RFC for convenience and in my experience encoding ambiguity in general leads to nothing but mess and confusion. If replication ID scheme is hard to work with we probably should just change that scheme.

@nickva

This comment has been minimized.

Show comment
Hide comment
@nickva

nickva Sep 18, 2017

Contributor

@eiri how does this break the URI RFC?

Contributor

nickva commented Sep 18, 2017

@eiri how does this break the URI RFC?

@eiri

This comment has been minimized.

Show comment
Hide comment
@eiri

eiri Sep 18, 2017

Member

Plus is sub-delim reserved character. RFC's stating:

If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

It seems to me that "unambiguously accept both versions, and thus make URL encoding on the client optional" is a suggestion to do exact opposite of that, or do you mean something different here?

Member

eiri commented Sep 18, 2017

Plus is sub-delim reserved character. RFC's stating:

If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

It seems to me that "unambiguously accept both versions, and thus make URL encoding on the client optional" is a suggestion to do exact opposite of that, or do you mean something different here?

@nickva

This comment has been minimized.

Show comment
Hide comment
@nickva

nickva Sep 18, 2017

Contributor

+ can be considered a sub-delim because it separates the base replication ID from the extensions. For example replication checkpoints use only the base part but not the extension. So parsing out bits of the ID is practical and probably already used by user code out there.

In other words replication ID is really a compound ID and we just accept both the + delimited version or the space delimited one.

Contributor

nickva commented Sep 18, 2017

+ can be considered a sub-delim because it separates the base replication ID from the extensions. For example replication checkpoints use only the base part but not the extension. So parsing out bits of the ID is practical and probably already used by user code out there.

In other words replication ID is really a compound ID and we just accept both the + delimited version or the space delimited one.

@eiri

This comment has been minimized.

Show comment
Hide comment
@eiri

eiri Sep 18, 2017

Member

You really got me confused here with what exactly you are proposing. At first I've read this as a suggestion to take + in URL as a literal + symbol, hence my comment on breaking RFC. Now, after rereading, I see that you are suggesting to make replicator's URI into not unique resource allocator, i.e. make "2abc%2Bcreate_target%2Bcontinuous", "2abc+create_target+continuous" (and also "2abc%20create_target%20continuous" by accident, right?) to point to the same document.

This still feels wrong, to be honest. Interchangeable delimiters in resource identification already sounds as unnecessary complication and confusion with encoding\representation of such things (imagine whole number of possible mixes: ""2abc+create%20target%2Bcontinuous"`) seems problematic. Again, if it is only about convenience then let's change that plus to something not requiring encoding.

Also, anecdotical, reading around gave me a notion that in web's URL + requires encoding only in query, not in path, but I can't find RFC to back this up.

Member

eiri commented Sep 18, 2017

You really got me confused here with what exactly you are proposing. At first I've read this as a suggestion to take + in URL as a literal + symbol, hence my comment on breaking RFC. Now, after rereading, I see that you are suggesting to make replicator's URI into not unique resource allocator, i.e. make "2abc%2Bcreate_target%2Bcontinuous", "2abc+create_target+continuous" (and also "2abc%20create_target%20continuous" by accident, right?) to point to the same document.

This still feels wrong, to be honest. Interchangeable delimiters in resource identification already sounds as unnecessary complication and confusion with encoding\representation of such things (imagine whole number of possible mixes: ""2abc+create%20target%2Bcontinuous"`) seems problematic. Again, if it is only about convenience then let's change that plus to something not requiring encoding.

Also, anecdotical, reading around gave me a notion that in web's URL + requires encoding only in query, not in path, but I can't find RFC to back this up.

@nickva

This comment has been minimized.

Show comment
Hide comment
@nickva

nickva Sep 18, 2017

Contributor

""2abc+create%20target%2Bcontinuous"`

Replication ID format is not arbitrary though. This is possible exactly because 2abc+create%20target%2Bcontinuous is not valid replication ID.

Changing the plus sign might work, however since the ID is compound (made of parts) there are probably user written scripts would split the ID. For example if they inspect the checkpoints documents where only the base part is used.

Contributor

nickva commented Sep 18, 2017

""2abc+create%20target%2Bcontinuous"`

Replication ID format is not arbitrary though. This is possible exactly because 2abc+create%20target%2Bcontinuous is not valid replication ID.

Changing the plus sign might work, however since the ID is compound (made of parts) there are probably user written scripts would split the ID. For example if they inspect the checkpoints documents where only the base part is used.

@eiri

This comment has been minimized.

Show comment
Hide comment
@eiri

eiri Sep 18, 2017

Member

2bc+create%20target%2Bcontinuous is not valid replication ID, but valid representation of replication ID if we'd go with your suggestion of treating both + and spaces as a valid delimiter.

Member

eiri commented Sep 18, 2017

2bc+create%20target%2Bcontinuous is not valid replication ID, but valid representation of replication ID if we'd go with your suggestion of treating both + and spaces as a valid delimiter.

@nickva

This comment has been minimized.

Show comment
Hide comment
@nickva

nickva Sep 18, 2017

Contributor

It would be a different encoding of the same resource. Is encoding the same as representation, I not sure. For example is document %61 the same as document a.

Also, anecdotical, reading around gave me a notion that in web's URL + requires encoding only in query, not in path, but I can't find RFC to back this up.

Yap, but mochiweb decodes + in path as a space, probably because it reuses the code used for application/x-www-form-urlencoded. So if we switched to a different server the code the converting + to space would never be called.

In other words, this is mostly about bypassing a mochiweb peculiarity. Note in https://tools.ietf.org/html/rfc3986#section-3.3 that the path can contain sub-delims (which include +).

segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                    ; non-zero-length segment without any colon ":"

But mochiweb doesn't pass + to the application correctly there and decodes it to a space.

Contributor

nickva commented Sep 18, 2017

It would be a different encoding of the same resource. Is encoding the same as representation, I not sure. For example is document %61 the same as document a.

Also, anecdotical, reading around gave me a notion that in web's URL + requires encoding only in query, not in path, but I can't find RFC to back this up.

Yap, but mochiweb decodes + in path as a space, probably because it reuses the code used for application/x-www-form-urlencoded. So if we switched to a different server the code the converting + to space would never be called.

In other words, this is mostly about bypassing a mochiweb peculiarity. Note in https://tools.ietf.org/html/rfc3986#section-3.3 that the path can contain sub-delims (which include +).

segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                    ; non-zero-length segment without any colon ":"

But mochiweb doesn't pass + to the application correctly there and decodes it to a space.

@eiri

This comment has been minimized.

Show comment
Hide comment
@eiri

eiri Sep 18, 2017

Member

Right. And you are suggesting to fix this by, depending on perspective, using non-standard encoder which treats +, %2B and %20 all as the same "plusOrSpace" symbol, or, from an other angle, by making replication ID into non-unique identificator from http clients point of view.

Well, if you see all the consequences and still think this is a good idea I don't know if I can convince you otherwise, but read this saga as an illustration how messy and confusing hacking URL encoding could get.

Member

eiri commented Sep 18, 2017

Right. And you are suggesting to fix this by, depending on perspective, using non-standard encoder which treats +, %2B and %20 all as the same "plusOrSpace" symbol, or, from an other angle, by making replication ID into non-unique identificator from http clients point of view.

Well, if you see all the consequences and still think this is a good idea I don't know if I can convince you otherwise, but read this saga as an illustration how messy and confusing hacking URL encoding could get.

nickva added a commit to cloudant/couchdb that referenced this issue Sep 19, 2017

Fix replication ID parsing in URL paths
Previously users had to URL encode replication IDs when using
`_scheduler/jobs/<job_id>` endpoint because Mochiweb incorrectly decoded the
`+` character from URL path. So users were forced to encode so that the
replicator would correctly receive a `+` after Mochiweb parsing.

`+` is decoded as ` ` (space) probably because in query strings that's a valid
application/x-www-form-urlencoded encoding, but that decoding is not meant for
decoding URL paths, only query strings.

Notice RFC 3986 https://tools.ietf.org/html/rfc3986#section-2.2

`+` is a `sub-delim` (term from RFC)  and in the path component it can be used
unquoted as a delimiter.

https://tools.ietf.org/html/rfc3986#section-3.3

Indeed, the replication ID is a compound ID and `+` is a valid delimiter
which separates the base part from the extensions.

For more details see also:

perwendel/spark#490

https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

Fixes #825

@nickva nickva closed this in #828 Sep 19, 2017

nickva added a commit that referenced this issue Sep 19, 2017

Fix replication ID parsing in URL paths
Previously users had to URL encode replication IDs when using
`_scheduler/jobs/<job_id>` endpoint because Mochiweb incorrectly decoded the
`+` character from URL path. So users were forced to encode so that the
replicator would correctly receive a `+` after Mochiweb parsing.

`+` is decoded as ` ` (space) probably because in query strings that's a valid
application/x-www-form-urlencoded encoding, but that decoding is not meant for
decoding URL paths, only query strings.

Notice RFC 3986 https://tools.ietf.org/html/rfc3986#section-2.2

`+` is a `sub-delim` (term from RFC)  and in the path component it can be used
unquoted as a delimiter.

https://tools.ietf.org/html/rfc3986#section-3.3

Indeed, the replication ID is a compound ID and `+` is a valid delimiter
which separates the base part from the extensions.

For more details see also:

perwendel/spark#490

https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

Fixes #825

wohali added a commit that referenced this issue Oct 19, 2017

Fix replication ID parsing in URL paths
Previously users had to URL encode replication IDs when using
`_scheduler/jobs/<job_id>` endpoint because Mochiweb incorrectly decoded the
`+` character from URL path. So users were forced to encode so that the
replicator would correctly receive a `+` after Mochiweb parsing.

`+` is decoded as ` ` (space) probably because in query strings that's a valid
application/x-www-form-urlencoded encoding, but that decoding is not meant for
decoding URL paths, only query strings.

Notice RFC 3986 https://tools.ietf.org/html/rfc3986#section-2.2

`+` is a `sub-delim` (term from RFC)  and in the path component it can be used
unquoted as a delimiter.

https://tools.ietf.org/html/rfc3986#section-3.3

Indeed, the replication ID is a compound ID and `+` is a valid delimiter
which separates the base part from the extensions.

For more details see also:

perwendel/spark#490

https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

Fixes #825

willholley added a commit to willholley/couchdb that referenced this issue May 22, 2018

Fix replication ID parsing in URL paths
Previously users had to URL encode replication IDs when using
`_scheduler/jobs/<job_id>` endpoint because Mochiweb incorrectly decoded the
`+` character from URL path. So users were forced to encode so that the
replicator would correctly receive a `+` after Mochiweb parsing.

`+` is decoded as ` ` (space) probably because in query strings that's a valid
application/x-www-form-urlencoded encoding, but that decoding is not meant for
decoding URL paths, only query strings.

Notice RFC 3986 https://tools.ietf.org/html/rfc3986#section-2.2

`+` is a `sub-delim` (term from RFC)  and in the path component it can be used
unquoted as a delimiter.

https://tools.ietf.org/html/rfc3986#section-3.3

Indeed, the replication ID is a compound ID and `+` is a valid delimiter
which separates the base part from the extensions.

For more details see also:

perwendel/spark#490

https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

Fixes #825
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment