-
-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: return to standard RFC 3986 URLs #581
Comments
I'm still reviewing this, but one thing immediately caught my eye:
My understanding is that this is not what the standard says. If So in other words, URLs and paths do have the same semantics. Both examples would import the same file if you hosted the directory tree locally vs remotely. |
I made a mistake, my example should have been (I have updated my original post above, so Gabriel's comment might look confusing now.) |
I think this is fine, with two caveats:
|
Agreed on both points 👍 I’ll draft something, though it might take me a few weeks. |
@philandstuff thanks for the great writeup! Given that I'm the author of #205 and that I think #585 should be able to represent all RFC3986-valid URLs as a Dhall datastructure, I'm very much in favor of this 👍 |
@Gabriel439 after thinking about this quite a bit more, I finally understand what were saying here:
...and you're absolutely correct. The quoting or otherwise of the local import doesn't make a difference. I've updated my original post to clarify this. I am still in support of this change, and I'm working on step 1 above as documented, but I concede I was mistaken about one of my arguments. |
This is part 1 of the proposal in #581. This changes the chaining operation so that a parent URL chained to a child local import is now defined in terms of the RFC3986 section 5 URL resolution algorithm. I have erred on the side of being too verbose rather than not verbose enough. As part of this, I have started to use the notation `URL` (and `URL₀`, `URL₁`, etc) to stand in for any generic URL import. I think as things progress on #581, URLs can be treated more opaquely and the current `https://authority₀ directory₀ file₀` notation can be removed. As part of this, I have also changed the internal representation of a URL import to be a literal URL, rather than the parsed and decoded components of a URL. This was necessary, because without this change we would do percent-encoding in two places: when converting a local import to a relative reference, and when fetching a URL. This meant that we could end up double-percent-encoding a URL. For example, if we chain `https://example.com/foo` on to the local path `./" "`, we would first convert the local path to `./%20`, resolve the relative reference to `https://example.com/%20`, and then when fetching the URL, percent-encode it again to `https://example.com/%2520`. This is incorrect behaviour. After this change, we only perform percent-encoding either a) when parsing a URL containing quoted path components; or b) when converting a local import to a relative reference. The same data cannot go through both processes, so there is now no risk of double-percent-encoding. Further, we never perform percent-decoding any more. In particular: - the binary representation of a URL is now the CBOR-specified encoding of a URL (see RFC7049 section 2.4.4.3) - quoted path components in URLs should be percent-encoded at parse time and converted to a valid URL This commit is a breaking change; however, its breakages are somewhat esoteric. The binary format has changed, but it won't affect cached expressions or semantic hashes, because both are computed on fully-resolved expressions, which don't contain imports.
This is part 1 of the proposal in #581. This changes the chaining operation so that a parent URL chained to a child local import is now defined in terms of the RFC3986 section 5 URL resolution algorithm.
@philandstuff: Actually, I still think the gist of what you were saying was correct even if your initial example was off. I'll explain what convinced me that this change was necessary. Under the current standard, these two URLs are not the same:
... but these two paths are the same:
... which is inconsistent. So my view was that as long as we treat URL path components the same as file path components we will inevitably run into issues because of that mismatch in how we treat path components. That's what convinced me that we should treat URLs as opaque values like you proposed. |
If we're really worried about the fact that quoting uses different escaping mechanisms based on context, we could just ban quoted segments in URLs.
I go back and forth on if I like them anyway -- and I assume using the RFC grammar would require not allowing them also -- so that might be a useful incremental change?
|
Yeah, the quoted URL thing is my main concern (hence it was number 1 on my list above). I haven’t implemented it in dhall-golang. |
* Use RFC3986 section 5 URL resolution algorithm This is part 1 of the proposal in #581. This changes the chaining operation so that a parent URL chained to a child local import is now defined in terms of the RFC3986 section 5 URL resolution algorithm. * use `. path₁ file₁` syntax for relative references
Mostly reverts #593. I realised I said I'd keep judgment rules for RFC3986 resolution [1], and then I didn't. So I set about writing some judgment rules and realised... the old judgment rules were exactly what was required to implement RFC3986. I think the key thing from #593 that I think is worth keeping is a note about how the chain-remote-with-local rules implement the RFC 3986 algorithm, and some advice to take care when using a URL resolution library to get the percent-encoding correct. [1]: #581 (comment)
This expands the URL syntax to allow all RFC3986-compliant URLs. This means that, in particular: * empty path segments are now permitted * an entirely empty path is also permitted (though it is defined to be equivalent to the path `/`, which has a single empty segment) * paths may contain the characters '(', ')' and ',' * strictly, any path segment must be entirely `pchar`s or `path-characters`, so some combinations are still disallowed, such as "foo,bar^baz" because ',' is not a `path-character` and '^' is not a `pchar`. As this purely expands the space of strings which can be parsed as URLs, this is a mostly-not-breaking change; no existing URLs will fail to parse. However, there are some dhall programs whose interpretation will change, because a URL will consume more characters than it previously was able to. For example http://example.com/foo//bar was previously parsed as http://example.com/foo ⫽ bar but is now parsed as a single URL. Another example is [ http://example.com/foo, bar ] which previously was a list containing two items `http://example.com/foo` and `bar`, and is now a list with a single item: the result of applying `http://example.com/foo,` to `bar`. A final awkward example is: (http://example.com/foo) which previously parsed as `http://example.com/foo`, but now fails to parse (in a PEG grammar, at least) because the final ')' character is greedily consumed by the URL parser and then we have nothing to close the leading parenthesis. The last two examples could be avoided if we disallow the characters '(', ')' and ',' from the `pchar` rule. This would get us closer to standard URL syntax but avoid some tricky pitfalls of parsing. My preference is to go with the standard `pchar` rule despite the edge cases. Fixes #205. See also #581.
Mostly reverts #593. I realised I said I'd keep judgment rules for RFC3986 resolution [1], and then I didn't. So I set about writing some judgment rules and realised... the old judgment rules were exactly what was required to implement RFC3986. I think the key thing from #593 that I think is worth keeping is a note about how the chain-remote-with-local rules implement the RFC 3986 algorithm, and some advice to take care when using a URL resolution library to get the percent-encoding correct. [1]: #581 (comment)
This expands the URL syntax to allow all RFC3986-compliant URLs. This means that, in particular: * empty path segments are now permitted * an entirely empty path is also permitted (though it is defined to be equivalent to the path `/`, which has a single empty segment) * paths may contain the characters '(', ')' and ',' * strictly, any path segment must be entirely `pchar`s or `path-characters`, so some combinations are still disallowed, such as "foo,bar^baz" because ',' is not a `path-character` and '^' is not a `pchar`. As this purely expands the space of strings which can be parsed as URLs, this is a mostly-not-breaking change; no existing URLs will fail to parse. However, there are some dhall programs whose interpretation will change, because a URL will consume more characters than it previously was able to. For example http://example.com/foo//bar was previously parsed as http://example.com/foo ⫽ bar but is now parsed as a single URL. Another example is [ http://example.com/foo, bar ] which previously was a list containing two items `http://example.com/foo` and `bar`, and is now a list with a single item: the result of applying `http://example.com/foo,` to `bar`. A final awkward example is: (http://example.com/foo) which previously parsed as `http://example.com/foo`, but now fails to parse (in a PEG grammar, at least) because the final ')' character is greedily consumed by the URL parser and then we have nothing to close the leading parenthesis. The last two examples could be avoided if we disallow the characters '(', ')' and ',' from the `pchar` rule. This would get us closer to standard URL syntax but avoid some tricky pitfalls of parsing. My preference is to go with the standard `pchar` rule despite the edge cases. Fixes #205. See also #581.
* Allow all RFC3986-compliant URLs This expands the URL syntax to allow all RFC3986-compliant URLs. This means that, in particular: * empty path segments are now permitted * an entirely empty path is also permitted (though it is defined to be equivalent to the path `/`, which has a single empty segment) * paths may contain the characters '(', ')' and ',' * strictly, any path segment must be entirely `pchar`s or `path-characters`, so some combinations are still disallowed, such as "foo,bar^baz" because ',' is not a `path-character` and '^' is not a `pchar`. As this purely expands the space of strings which can be parsed as URLs, this is a mostly-not-breaking change; no existing URLs will fail to parse. However, there are some dhall programs whose interpretation will change, because a URL will consume more characters than it previously was able to. For example http://example.com/foo//bar was previously parsed as http://example.com/foo ⫽ bar but is now parsed as a single URL. Another example is [ http://example.com/foo, bar ] which previously was a list containing two items `http://example.com/foo` and `bar`, and is now a list with a single item: the result of applying `http://example.com/foo,` to `bar`. A final awkward example is: (http://example.com/foo) which previously parsed as `http://example.com/foo`, but now fails to parse (in a PEG grammar, at least) because the final ')' character is greedily consumed by the URL parser and then we have nothing to close the leading parenthesis. The last two examples could be avoided if we disallow the characters '(', ')' and ',' from the `pchar` rule. This would get us closer to standard URL syntax but avoid some tricky pitfalls of parsing. My preference is to go with the standard `pchar` rule despite the edge cases. Fixes #205. See also #581. * Remove (), from sub-delims This is to avoid the problems mentioned in the previous commit with parsing URLs containing these characters. * explain why we disallow (), in URLs
Currently, the standard requires implementations to URL-decode unquoted path segments. I think this is a mistake. In particular, the URLs http://example.com/foo:bar and http://example.com/foo%3Abar are potentially different URLs, even though `:` is URL-encoded as `%3A`. They are different because `:` is a reserved character, which means that servers may interpret the `:` as a special delimiter rather than the literal colon character. An example of this kind of processing is the "matrix parameters" in http://example.com/foo;bar=baz/quux -- matrix parameters are paremeters on a path component, similar to how a query string contains parameters on the whole URL. Matrix parameters use the reserved characters `;` and `=` as delimiter syntax. If you want to use these characters within a matrix parameter name or value, you need to percent-encode them. This demonstrates why `;` and `%3B` may not be equivalent in path segments. By requiring implementations to URL-decode path segments, we unintentionally fold distinct URLs into the same representation. This commit fixes that by changing the internal representation to be the original (encoded) text, and by requiring *quoted* path segments to be URL-*encoded* to match. This changes the binary representation for URIs, but the binary representation for URIs is rarely used because it cannot appear in a resolved and normalized expression. Reserved characters are defined here: https://tools.ietf.org/html/rfc3986#section-2.2 Matrix URIs: https://www.w3.org/DesignIssues/MatrixURIs.html Related to #581 and dhall-lang/dhall-haskell#1109.
Currently, the standard requires implementations to URL-decode unquoted path segments. I think this is a mistake. In particular, the URLs http://example.com/foo:bar and http://example.com/foo%3Abar are potentially different URLs, even though `:` is URL-encoded as `%3A`. They are different because `:` is a reserved character, which means that servers may interpret the `:` as a special delimiter rather than the literal colon character. An example of this kind of processing is the "matrix parameters" in http://example.com/foo;bar=baz/quux -- matrix parameters are paremeters on a path component, similar to how a query string contains parameters on the whole URL. Matrix parameters use the reserved characters `;` and `=` as delimiter syntax. If you want to use these characters within a matrix parameter name or value, you need to percent-encode them. This demonstrates why `;` and `%3B` may not be equivalent in path segments. By requiring implementations to URL-decode path segments, we unintentionally fold distinct URLs into the same representation. This commit fixes that by changing the internal representation to be the original (encoded) text, and by requiring *quoted* path segments to be URL-*encoded* to match. This changes the binary representation for URIs, but the binary representation for URIs is rarely used because it cannot appear in a resolved and normalized expression. Reserved characters are defined here: https://tools.ietf.org/html/rfc3986#section-2.2 Matrix URIs: https://www.w3.org/DesignIssues/MatrixURIs.html Related to #581 and dhall-lang/dhall-haskell#1109.
Do we have a roadmap for the removal of old URLs from the standard ? |
Thanks for the prod. The short answer is: no, but we should make one. |
This should allow us to close #581, finally.
urgh, github was too clever. This shouldn't be closed until we actually remove quoted URLs from the standard |
I plan to work on the final bit of this - removing non-RFC3986 URLs - this week. |
Fixes #581. This is the last step in the deprecation plan documented here: https://docs.dhall-lang.org/howtos/migrations/Deprecation-of-quoted-paths-in-URLs.html This is a breaking change, but `dhall format` has been rewriting URLs to remove quoted paths for months now. I also renamed the `url-path` rule to `path-abempty` for greater consistency with upstream RFC 3986.
Fixes #581. This is the last step in the deprecation plan documented here: https://docs.dhall-lang.org/howtos/migrations/Deprecation-of-quoted-paths-in-URLs.html This is a breaking change, but `dhall format` has been rewriting URLs to remove quoted paths for months now. I also renamed the `url-path` rule to `path-abempty` for greater consistency with upstream RFC 3986.
Fixes #2467 After dhall-lang/dhall-lang#581 the standard requires implementations to not do anything special with respect to URL encoding. In other words, a conforming implementation: - Should not URL-encode path components when parsing imports from Dhall source code - Should not URL-encode path components when resolving those same imports The user can still specify an import that uses URL encoded path components, but the implementation does not give them any special treatment. It just blindly forwards those path components undisturbed. Before this change, the Haskell implementation of Dhall was correctly handling the first part (not url encoding path components at parse time) but was still incorrectly handling the second part (because it would URL encode path components at import resolution time). This change fixes that.
Fixes #2467 After dhall-lang/dhall-lang#581 the standard requires implementations to not do anything special with respect to URL encoding. In other words, a conforming implementation: - Should not URL-encode path components when parsing imports from Dhall source code - Should not URL-encode path components when resolving those same imports The user can still specify an import that uses URL encoded path components, but the implementation does not give them any special treatment. It just blindly forwards those path components undisturbed. Before this change, the Haskell implementation of Dhall was correctly handling the first part (not url encoding path components at parse time) but was still incorrectly handling the second part (because it would URL encode path components at import resolution time). This change fixes that.
Fixes dhall-lang#2467 After dhall-lang/dhall-lang#581 the standard requires implementations to not do anything special with respect to URL encoding. In other words, a conforming implementation: - Should not URL-encode path components when parsing imports from Dhall source code - Should not URL-encode path components when resolving those same imports The user can still specify an import that uses URL encoded path components, but the implementation does not give them any special treatment. It just blindly forwards those path components undisturbed. Before this change, the Haskell implementation of Dhall was correctly handling the first part (not url encoding path components at parse time) but was still incorrectly handling the second part (because it would URL encode path components at import resolution time). This change fixes that.
Fixes dhall-lang#2467 After dhall-lang/dhall-lang#581 the standard requires implementations to not do anything special with respect to URL encoding. In other words, a conforming implementation: - Should not URL-encode path components when parsing imports from Dhall source code - Should not URL-encode path components when resolving those same imports The user can still specify an import that uses URL encoded path components, but the implementation does not give them any special treatment. It just blindly forwards those path components undisturbed. Before this change, the Haskell implementation of Dhall was correctly handling the first part (not url encoding path components at parse time) but was still incorrectly handling the second part (because it would URL encode path components at import resolution time). This change fixes that.
I brought the checklist up to the top, for visibility:
imports in terms of RFC3986 relative reference resolution; (Use RFC3986 section 5 URL resolution algorithm #593, Simpler way of incorporating RFC3986 resolution #603)
dhall.abnf
to allow any RFC3986-compliant URL, in addition to thecurrently-allowed URLs with path components. This will need some care, but
it will not be a breaking change because we are only allowing new URLs, not
removing or changing the meaning of existing URLs. (It is helpful here that
the double quote character
"
is not a valid character in RFC3986). (Allow all RFC3986-compliant URLs #604)dhall format
to prefer percent-encoded URLs over quoted-path URLs (dhall format
should prefer percent-encoded URL paths instead of quoted paths dhall-haskell#1109 and [dhall format] Prefer unquoted URLs dhall-haskell#1235)language
Original proposal below:
Before #127, Dhall accepted standard URLs as defined by RFC 3986. #127 changed
URL paths to have the same syntax as local paths. The stated reason for this
seems to be that we want to be able to host the same set of Dhall files on local
disk or remote server, and have them work the same.
#205 shows that some confusion arises as a result of not allowing all standard URLs.
I see two main problems with the current solution:
First, it means URLs in Dhall are surprisingly different from URLs everywhere
else. As an example, I particularly dislike how
dhall format
convertspercent-encoded URLs (a thing I understand and expect from other contexts) to
quoted paths (a thing I do not understand in a URL context, because it is not
valid anywhere else). I find it a jarring experience, and it dents my confidence
about whether I know how to format URLs the way Dhall expects, and whether
the URLs I put into Dhall mean what I think they mean.
Second, it assumes that URL path segments and local path segments have the same
semantics, when they do not. In particular, percent-encoding means something in
URLs but does not mean anything in local paths. To see this, imagine a file
foo.dhall
containing a local import./foo%20bar.dhall
. Iffoo.dhall
isa local file, this will open a file whose literal name is
foo%20bar.dhall
; butif
foo.dhall
is a remote file, it will make a web request forhttps://.../foo%20bar.dhall
which will open a file whose literal name isfoo bar.dhall
.This means that, in some cases, moving a set of files from local to remote (or vice versa) will break imports.EDIT: this is incorrect, see Gabriel's comment below
(There is also a very minor point that we do not currently accept some valid
URLs, beyond the types listed at the top of this issue. As an example,
http://example.com/(foo) is a valid URL that Dhall will not parse.)
In my view, the correct solution to this is to define a way to turn a local
import into a relative URI, and resolve it against the appropriate base URI.
Resolving a relative reference is defined in RFC3986 section 5; we should reuse
that prior art rather than reinvent it ourselves. Languages should already have
support for resolving relative URIs. (Indeed, this is how dhall-golang
implements import chaining; it does not naively implement the judgment rules in
the spec, but rather calls the Go standard library method
net.url/URL#ResolveReference()
).In the above example, to chain the import
./foo%20bar.dhall
onto a remoteimport, we would first convert the path to a relative reference
./foo%2520bar.dhall
(where the percent character has itself beenpercent-encoded), resolve that reference with the appropriate base URI, make the
network request, and the server would then correctly interpret the
foo%2520bar.dhall
path segment as a request for a file calledfoo%20bar.dhall
. Similarly, to chain the import./"foo bar.dhall"
, we wouldconvert to a relative reference
./foo%20bar.dhall
, resolve relative to a baseURI, make the request, and the server would correctly interpret the path segment
as a request for the file
foo bar.dhall
.Once we have done this, the syntax of remote URLs does not have to match local
paths at all. Then we can return to the happy familiarity of true
RFC3986-compliant URLs.
My proposal is thus:
imports in terms of RFC3986 relative reference resolution;
dhall.abnf
to allow any RFC3986-compliant URL, in addition to thecurrently-allowed URLs with path components. This will need some care, but
it will not be a breaking change because we are only allowing new URLs, not
removing or changing the meaning of existing URLs. (It is helpful here that
the double quote character
"
is not a valid character in RFC3986).dhall format
to prefer percent-encoded URLs over quoted-path URLslanguage
At this point, dhall URL syntax will just be standard URL syntax; but dhall
files will still be relocatable between local and remote without breaking
things. I think this is the best of all worlds.
The text was updated successfully, but these errors were encountered: