Proposal: return to standard RFC 3986 URLs #581

philandstuff · 2019-06-06T20:18:00Z

I brought the checklist up to the top, for visibility:

1. change the import semantics to define chaining local imports onto remote imports in terms of RFC3986 relative reference resolution; (Use RFC3986 section 5 URL resolution algorithm #593, Simpler way of incorporating RFC3986 resolution #603)
2. change dhall.abnf to allow any RFC3986-compliant URL, in addition to the currently-allowed URLs with path components. This will need some care, but it will not be a breaking change because we are only allowing new URLs, not removing or changing the meaning of existing URLs. (It is helpful here that the double quote character " is not a valid character in RFC3986). (Allow all RFC3986-compliant URLs #604)
3. change dhall format to prefer percent-encoded URLs over quoted-path URLs (dhall format should prefer percent-encoded URL paths instead of quoted paths dhall-haskell#1109 and [dhall format] Prefer unquoted URLs dhall-haskell#1235)
4. deprecate non-RFC3986 URLs (including quoted-path URLs) (Add deprecation plan for URLs with quoted paths #913)
5. after a suitable deprecation period, remove non-RFC3986 URLs from the language

Original proposal below:

Before #127, Dhall accepted standard URLs as defined by RFC 3986. #127 changed
URL paths to have the same syntax as local paths. The stated reason for this
seems to be that we want to be able to host the same set of Dhall files on local
disk or remote server, and have them work the same.

#205 shows that some confusion arises as a result of not allowing all standard URLs.

I see two main problems with the current solution:

First, it means URLs in Dhall are surprisingly different from URLs everywhere
else. As an example, I particularly dislike how dhall format converts
percent-encoded URLs (a thing I understand and expect from other contexts) to
quoted paths (a thing I do not understand in a URL context, because it is not
valid anywhere else). I find it a jarring experience, and it dents my confidence
about whether I know how to format URLs the way Dhall expects, and whether
the URLs I put into Dhall mean what I think they mean.

Second, it assumes that URL path segments and local path segments have the same
semantics, when they do not. In particular, percent-encoding means something in
URLs but does not mean anything in local paths. To see this, imagine a file
foo.dhall containing a local import ./foo%20bar.dhall. If foo.dhall is
a local file, this will open a file whose literal name is foo%20bar.dhall; but
if foo.dhall is a remote file, it will make a web request for
https://.../foo%20bar.dhall which will open a file whose literal name is foo bar.dhall.

~~This means that, in some cases, moving a set of files from local to remote (or vice versa) will break imports.~~

EDIT: this is incorrect, see Gabriel's comment below

(There is also a very minor point that we do not currently accept some valid
URLs, beyond the types listed at the top of this issue. As an example,
http://example.com/(foo) is a valid URL that Dhall will not parse.)

In my view, the correct solution to this is to define a way to turn a local
import into a relative URI, and resolve it against the appropriate base URI.
Resolving a relative reference is defined in RFC3986 section 5; we should reuse
that prior art rather than reinvent it ourselves. Languages should already have
support for resolving relative URIs. (Indeed, this is how dhall-golang
implements import chaining; it does not naively implement the judgment rules in
the spec, but rather calls the Go standard library method
`URL.ResolveReference()` from the `net/url` package).
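As a rough illustration of leaning on a standard library for this, here is the Python analogue of that approach, using `urllib.parse.urljoin` rather than the Go method mentioned above. The base URI and the expected results are the "normal examples" from RFC 3986 section 5.4; the helper name `resolve` is made up for this sketch.

```python
from urllib.parse import urljoin

# Base URI used by the examples in RFC 3986 section 5.4.
BASE = "http://a/b/c/d;p?q"


def resolve(reference, base=BASE):
    """Resolve a relative reference against a base URI (hypothetical helper)."""
    return urljoin(base, reference)


# Relative reference -> target URI, as listed in RFC 3986 section 5.4.1.
examples = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "../g": "http://a/b/g",
    "../../g": "http://a/g",
}

for reference, expected in examples.items():
    print(f"{reference!r:12} -> {resolve(reference)} (RFC says {expected})")
```

The point is only that "chain a local import onto a remote parent" can be phrased as "resolve a relative reference", which most languages already ship.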

In the above example, to chain the import ./foo%20bar.dhall onto a remote
import, we would first convert the path to a relative reference
./foo%2520bar.dhall (where the percent character has itself been
percent-encoded), resolve that reference with the appropriate base URI, make the
network request, and the server would then correctly interpret the
foo%2520bar.dhall path segment as a request for a file called
foo%20bar.dhall. Similarly, to chain the import ./"foo bar.dhall", we would
convert to a relative reference ./foo%20bar.dhall, resolve relative to a base
URI, make the request, and the server would correctly interpret the path segment
as a request for the file foo bar.dhall.
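The two conversions described above can be sketched in Python as follows. This is a minimal sketch, not what any Dhall implementation does verbatim: the function names and the parent URL `https://example.com/pkg/foo.dhall` are assumptions made for the example.

```python
from urllib.parse import quote, unquote, urljoin


def local_to_relative_reference(file_name):
    """Turn a literal local file name into a './' relative reference.

    safe="" makes quote() percent-encode '%' and '/' as well, so the
    literal name survives the round trip through a URL.
    """
    return "./" + quote(file_name, safe="")


def chain(parent_url, file_name):
    """Chain a local import onto a remote parent (hypothetical helper)."""
    return urljoin(parent_url, local_to_relative_reference(file_name))


PARENT = "https://example.com/pkg/foo.dhall"  # assumed parent import

# ./foo%20bar.dhall: the '%' is itself percent-encoded.
url_a = chain(PARENT, "foo%20bar.dhall")
# ./"foo bar.dhall": the space is percent-encoded.
url_b = chain(PARENT, "foo bar.dhall")

print(url_a)  # https://example.com/pkg/foo%2520bar.dhall
print(url_b)  # https://example.com/pkg/foo%20bar.dhall

# What a server would see after decoding the final path segment:
print(unquote(url_a.rsplit("/", 1)[1]))  # foo%20bar.dhall
print(unquote(url_b.rsplit("/", 1)[1]))  # foo bar.dhall
```

In both cases the decoded segment equals the literal local file name, which is what keeps a directory tree relocatable between disk and server.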

Once we have done this, the syntax of remote URLs does not have to match local
paths at all. Then we can return to the happy familiarity of true
RFC3986-compliant URLs.

My proposal is thus:

1. change the import semantics to define chaining local imports onto remote imports in terms of RFC3986 relative reference resolution;
2. change dhall.abnf to allow any RFC3986-compliant URL, in addition to the currently-allowed URLs with path components. This will need some care, but it will not be a breaking change because we are only allowing new URLs, not removing or changing the meaning of existing URLs. (It is helpful here that the double quote character " is not a valid character in RFC3986).
3. change dhall format to prefer percent-encoded URLs over quoted-path URLs
4. deprecate non-RFC3986 URLs (including quoted-path URLs)
5. after a suitable deprecation period, remove non-RFC3986 URLs from the language

At this point, dhall URL syntax will just be standard URL syntax; but dhall
files will still be relocatable between local and remote without breaking
things. I think this is the best of all worlds.


Gabriella439 · 2019-06-06T20:25:36Z

I'm still reviewing this, but one thing immediately caught my eye:

> Second, it assumes that URL path segments and local path segments have the same
> semantics, when they do not. In particular, percent-encoding means something in
> URLs but does not mean anything in local paths. To see this, imagine a file
> foo.dhall containing a local import ./"foo%20bar.dhall". If foo.dhall is
> a local file, this will open a file whose literal name is foo%20bar.dhall; but
> if foo.dhall is a remote file, it will make a web request for
> https://.../foo%20bar.dhall which will open a file whose literal name is foo bar.dhall.

My understanding is that this is not what the standard says. If foo.dhall were a remote file it would transitively import https://…/foo%2520bar.dhall (i.e. it would percent-encode foo%20bar.dhall and then fetch that).

See: https://github.com/dhall-lang/dhall-lang/blob/6e67b2202eeb333a99556350979e37216b86510b/standard/imports.md#quoted-paths

So in other words, URLs and paths do have the same semantics. Both examples would import the same file if you hosted the directory tree locally vs remotely.

philandstuff · 2019-06-06T20:32:29Z

I made a mistake, my example should have been ./foo%20bar.dhall, not ./"foo%20bar.dhall". You are correct in your statement; but the unquoted path still has the problem I describe.

(I have updated my original post above, so Gabriel's comment might look confusing now.)

Gabriella439 · 2019-06-07T01:34:00Z

I think this is fine, with two caveats:

- We might want to specify the RFC3986 logic for making filepaths relative to URLs using the same judgment syntax that we use for everything else (mainly to ease implementation for languages that don't have library support for this)
- We should still disallow fragments (See: Why allow fragments in import URIs? #403)
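For a sense of what an implementation without library support would have to spell out, here is a sketch of the path-level part of RFC 3986 section 5.2 in Python: the `remove_dot_segments` routine from section 5.2.4 plus the merge step from section 5.2.3. It is written from the RFC text, not from the Dhall judgments, and the `resolve_path` name and the example paths are assumptions.

```python
def remove_dot_segments(path):
    """Remove '.' and '..' segments, following RFC 3986 section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = "/" + path[3:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = "/" + path[4:]
            if output:
                output.pop()  # drop the last segment already emitted
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading '/', if any) to output.
            next_slash = path.find("/", 1)
            if next_slash == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:next_slash])
                path = path[next_slash:]
    return "".join(output)


def resolve_path(base_path, reference_path):
    """Merge a relative path onto a base path, then normalise it.

    Hypothetical helper covering only the 'relative path, no scheme,
    no authority' branch of the RFC algorithm.
    """
    merged = base_path[: base_path.rfind("/") + 1] + reference_path
    return remove_dot_segments(merged)


print(resolve_path("/pkg/foo.dhall", "./bar.dhall"))  # /pkg/bar.dhall
print(resolve_path("/a/b/c.dhall", "../x.dhall"))     # /a/x.dhall
```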

philandstuff · 2019-06-07T05:43:43Z

Agreed on both points 👍

I’ll draft something, though it might take me a few weeks.

f-f · 2019-06-10T15:04:32Z

@philandstuff thanks for the great writeup! Given that I'm the author of #205 and that I think #585 should be able to represent all RFC3986-valid URLs as a Dhall datastructure, I'm very much in favor of this 👍

philandstuff · 2019-06-15T13:00:15Z

@Gabriel439 after thinking about this quite a bit more, I finally understand what you were saying here:

> My understanding is that this is not what the standard says. If foo.dhall were a remote file it would transitively import https://…/foo%2520bar.dhall (i.e. it would percent-encode foo%20bar.dhall and then fetch that).
>
> See: https://github.com/dhall-lang/dhall-lang/blob/6e67b2202eeb333a99556350979e37216b86510b/standard/imports.md#quoted-paths
>
> So in other words, URLs and paths do have the same semantics. Both examples would import the same file if you hosted the directory tree locally vs remotely.

...and you're absolutely correct. The quoting or otherwise of the local import doesn't make a difference. I've updated my original post to clarify this.

I am still in support of this change, and I'm working on step 1 above as documented, but I concede I was mistaken about one of my arguments.

This is part 1 of the proposal in #581. This changes the chaining operation so that a parent URL chained to a child local import is now defined in terms of the RFC3986 section 5 URL resolution algorithm. I have erred on the side of being too verbose rather than not verbose enough.

As part of this, I have started to use the notation `URL` (and `URL₀`, `URL₁`, etc) to stand in for any generic URL import. I think as things progress on #581, URLs can be treated more opaquely and the current `https://authority₀ directory₀ file₀` notation can be removed.

As part of this, I have also changed the internal representation of a URL import to be a literal URL, rather than the parsed and decoded components of a URL. This was necessary, because without this change we would do percent-encoding in two places: when converting a local import to a relative reference, and when fetching a URL. This meant that we could end up double-percent-encoding a URL. For example, if we chain `https://example.com/foo` on to the local path `./" "`, we would first convert the local path to `./%20`, resolve the relative reference to `https://example.com/%20`, and then when fetching the URL, percent-encode it again to `https://example.com/%2520`. This is incorrect behaviour.

After this change, we only perform percent-encoding either a) when parsing a URL containing quoted path components; or b) when converting a local import to a relative reference. The same data cannot go through both processes, so there is now no risk of double-percent-encoding. Further, we never perform percent-decoding any more.

In particular:

- the binary representation of a URL is now the CBOR-specified encoding of a URL (see RFC7049 section 2.4.4.3)
- quoted path components in URLs should be percent-encoded at parse time and converted to a valid URL

This commit is a breaking change; however, its breakages are somewhat esoteric. The binary format has changed, but it won't affect cached expressions or semantic hashes, because both are computed on fully-resolved expressions, which don't contain imports.
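The double-encoding hazard described in this commit message can be reproduced in a few lines of Python. This is only a model of the two behaviours, under the assumption that "percent-encode at fetch time" means re-encoding the final path segment; the function names are invented for the sketch.

```python
from urllib.parse import quote, urljoin


def encode_segment(segment):
    """Percent-encode one path segment, including any literal '%'."""
    return quote(segment, safe="")


# Chain https://example.com/foo onto the local path ./" " (a single space).
relative = "./" + encode_segment(" ")                    # ./%20
resolved = urljoin("https://example.com/foo", relative)  # .../%20


def fetch_url_with_reencoding(url):
    """Model of the problem: encode the last segment again before fetching."""
    prefix, last = url.rsplit("/", 1)
    return prefix + "/" + encode_segment(last)


def fetch_url_opaque(url):
    """Model of the fix: the stored URL is already final, fetch it as-is."""
    return url


print(fetch_url_with_reencoding(resolved))  # https://example.com/%2520
print(fetch_url_opaque(resolved))           # https://example.com/%20
```

Keeping the URL as literal text means there is exactly one place where a given piece of data gets encoded.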


Gabriella439 · 2019-06-18T01:44:54Z

@philandstuff: Actually, I still think the gist of what you were saying was correct even if your initial example was off. I'll explain what convinced me that this change was necessary.

Under the current standard, these two URLs are not the same:

https://example.com/foo%20bar.dhall
https://example.com/"foo%20bar.dhall"

... but these two paths are the same:

./foo%20bar.dhall
./"foo%20bar.dhall"

... which is inconsistent. So my view was that as long as we treat URL path components the same as file path components we will inevitably run into issues because of that mismatch in how we treat path components. That's what convinced me that we should treat URLs as opaque values like you proposed.
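One way to picture the mismatch is with a toy model in Python. The rules below are an assumption pieced together from this thread (unquoted URL segments are percent-decoded, quoted ones are taken literally, and local path segments are always literal); they are not copied from the standard, and both function names are hypothetical.

```python
from urllib.parse import unquote


def old_url_segment(text, quoted):
    """Assumed pre-change rule: unquoted URL segments are percent-decoded."""
    return text if quoted else unquote(text)


def local_path_segment(text, quoted):
    """Local paths: quoting does not change the meaning of the segment."""
    return text


segment = "foo%20bar.dhall"

# The two URL forms denote different files ...
print(old_url_segment(segment, quoted=False))  # foo bar.dhall
print(old_url_segment(segment, quoted=True))   # foo%20bar.dhall

# ... while the two local path forms denote the same file.
print(local_path_segment(segment, quoted=False))  # foo%20bar.dhall
print(local_path_segment(segment, quoted=True))   # foo%20bar.dhall
```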

singpolyma · 2019-06-18T01:49:56Z

If we're really worried about the fact that quoting uses different escaping mechanisms based on context, we could just ban quoted segments in URLs. I go back and forth on if I like them anyway -- and I assume using the RFC grammar would require not allowing them also -- so that might be a useful incremental change?

philandstuff · 2019-06-18T05:37:59Z

Yeah, the quoted URL thing is my main concern (hence it was number 1 on my list above). I haven’t implemented it in dhall-golang.

* Use RFC3986 section 5 URL resolution algorithm

  This is part 1 of the proposal in #581. This changes the chaining operation so that a parent URL chained to a child local import is now defined in terms of the RFC3986 section 5 URL resolution algorithm.
* use `. path₁ file₁` syntax for relative references

Mostly reverts #593. I realised I said I'd keep judgment rules for RFC3986 resolution [1], and then I didn't. So I set about writing some judgment rules and realised... the old judgment rules were exactly what was required to implement RFC3986. The key thing from #593 that I think is worth keeping is a note about how the chain-remote-with-local rules implement the RFC 3986 algorithm, and some advice to take care when using a URL resolution library to get the percent-encoding correct. [1]: #581 (comment)

This expands the URL syntax to allow all RFC3986-compliant URLs. This means that, in particular:

* empty path segments are now permitted
* an entirely empty path is also permitted (though it is defined to be equivalent to the path `/`, which has a single empty segment)
* paths may contain the characters '(', ')' and ','
* strictly, any path segment must be entirely `pchar`s or `path-characters`, so some combinations are still disallowed, such as "foo,bar^baz" because ',' is not a `path-character` and '^' is not a `pchar`.

As this purely expands the space of strings which can be parsed as URLs, this is a mostly-not-breaking change; no existing URLs will fail to parse. However, there are some dhall programs whose interpretation will change, because a URL will consume more characters than it previously was able to. For example `http://example.com/foo//bar` was previously parsed as `http://example.com/foo ⫽ bar` but is now parsed as a single URL.

Another example is `[ http://example.com/foo, bar ]` which previously was a list containing two items `http://example.com/foo` and `bar`, and is now a list with a single item: the result of applying `http://example.com/foo,` to `bar`.

A final awkward example is `(http://example.com/foo)` which previously parsed as `http://example.com/foo`, but now fails to parse (in a PEG grammar, at least) because the final ')' character is greedily consumed by the URL parser and then we have nothing to close the leading parenthesis.

The last two examples could be avoided if we disallow the characters '(', ')' and ',' from the `pchar` rule. This would get us closer to standard URL syntax but avoid some tricky pitfalls of parsing. My preference is to go with the standard `pchar` rule despite the edge cases.

Fixes #205. See also #581.


* Allow all RFC3986-compliant URLs

  (Same commit message as quoted above, beginning "This expands the URL syntax to allow all RFC3986-compliant URLs" and ending "Fixes #205. See also #581.")
* Remove (), from sub-delims

  This is to avoid the problems mentioned in the previous commit with parsing URLs containing these characters.
* explain why we disallow (), in URLs
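The greedy-consumption pitfalls that motivated removing `(`, `)` and `,` can be shown with two regular expressions in Python. These character classes are heavily simplified stand-ins chosen for the sketch, not the rules in dhall.abnf, and the helper name `url_prefix` is hypothetical.

```python
import re

# Simplified "URL character" set: unreserved characters, '%', ':', '@', '/',
# and the sub-delims other than '(', ')' and ','.
COMMON = r"A-Za-z0-9\-._~%!$&'*+;=:@/"

# A URL is modelled as the scheme followed by a greedy run of allowed chars.
WITH_PARENS_COMMA = re.compile(r"https?://[" + COMMON + r"(),]*")
WITHOUT_PARENS_COMMA = re.compile(r"https?://[" + COMMON + r"]*")


def url_prefix(pattern, source):
    """Return the text a greedy URL rule would consume from the source."""
    match = pattern.search(source)
    return match.group(0) if match else None


for source in ["(http://example.com/foo)", "[ http://example.com/foo, bar ]"]:
    print(source)
    print("  with (),   :", url_prefix(WITH_PARENS_COMMA, source))
    print("  without (),:", url_prefix(WITHOUT_PARENS_COMMA, source))
```

With the full set, the URL swallows the closing `)` or the `,` separator; with the trimmed set it stops at `http://example.com/foo` in both inputs.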

Currently, the standard requires implementations to URL-decode unquoted path segments. I think this is a mistake. In particular, the URLs http://example.com/foo:bar and http://example.com/foo%3Abar are potentially different URLs, even though `:` is URL-encoded as `%3A`. They are different because `:` is a reserved character, which means that servers may interpret the `:` as a special delimiter rather than the literal colon character.

An example of this kind of processing is the "matrix parameters" in http://example.com/foo;bar=baz/quux -- matrix parameters are parameters on a path component, similar to how a query string contains parameters on the whole URL. Matrix parameters use the reserved characters `;` and `=` as delimiter syntax. If you want to use these characters within a matrix parameter name or value, you need to percent-encode them. This demonstrates why `;` and `%3B` may not be equivalent in path segments.

By requiring implementations to URL-decode path segments, we unintentionally fold distinct URLs into the same representation. This commit fixes that by changing the internal representation to be the original (encoded) text, and by requiring *quoted* path segments to be URL-*encoded* to match.

This changes the binary representation for URIs, but the binary representation for URIs is rarely used because it cannot appear in a resolved and normalized expression.

Reserved characters are defined here: https://tools.ietf.org/html/rfc3986#section-2.2

Matrix URIs: https://www.w3.org/DesignIssues/MatrixURIs.html

Related to #581 and dhall-lang/dhall-haskell#1109.
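To make the `;` versus `%3B` point concrete, here is a small Python sketch of how a server might read matrix parameters from one path segment. The parser is an assumption written for this example (real servers differ), and the segment `foo;bar=a%3Bb` is made up.

```python
from urllib.parse import unquote


def parse_matrix_segment(segment):
    """Split a segment on the reserved ';' and '=' first, then decode.

    Hypothetical server-side parser: delimiters are recognised before
    percent-decoding, so an encoded ';' stays part of the value.
    """
    name, *params = segment.split(";")
    parsed = {}
    for param in params:
        key, _, value = param.partition("=")
        parsed[unquote(key)] = unquote(value)
    return unquote(name), parsed


segment = "foo;bar=a%3Bb"

# Parsing the encoded text keeps "a;b" as one value.
print(parse_matrix_segment(segment))           # ('foo', {'bar': 'a;b'})

# Decoding first (what a client would effectively do by URL-decoding the
# segment and sending the result) changes what the server sees.
print(parse_matrix_segment(unquote(segment)))  # ('foo', {'bar': 'a', 'b': ''})
```

The two results differ, which is why storing the original encoded text is the safer representation.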

Nadrieril · 2020-02-09T13:14:51Z

Do we have a roadmap for the removal of old URLs from the standard?

philandstuff · 2020-02-09T13:19:24Z

Thanks for the prod. The short answer is: no, but we should make one.

This should allow us to close #581, finally.

philandstuff · 2020-02-18T18:21:57Z

urgh, github was too clever. This shouldn't be closed until we actually remove quoted URLs from the standard

philandstuff · 2020-05-11T13:47:16Z

I plan to work on the final bit of this - removing non-RFC3986 URLs - this week.

Fixes #581. This is the last step in the deprecation plan documented here: https://docs.dhall-lang.org/howtos/migrations/Deprecation-of-quoted-paths-in-URLs.html This is a breaking change, but `dhall format` has been rewriting URLs to remove quoted paths for months now. I also renamed the `url-path` rule to `path-abempty` for greater consistency with upstream RFC 3986.

Fixes #2467

After dhall-lang/dhall-lang#581 the standard requires implementations to not do anything special with respect to URL encoding. In other words, a conforming implementation:

- Should not URL-encode path components when parsing imports from Dhall source code
- Should not URL-encode path components when resolving those same imports

The user can still specify an import that uses URL encoded path components, but the implementation does not give them any special treatment. It just blindly forwards those path components undisturbed.

Before this change, the Haskell implementation of Dhall was correctly handling the first part (not url encoding path components at parse time) but was still incorrectly handling the second part (because it would URL encode path components at import resolution time). This change fixes that.
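The difference between forwarding a path untouched and encoding it at resolution time can be modelled in Python as below. This is not the Haskell implementation's code; both function names and the example URL are assumptions for illustration.

```python
from urllib.parse import quote, urlsplit


def request_path_verbatim(url):
    """Forward the path exactly as written in the import."""
    return urlsplit(url).path


def request_path_reencoded(url):
    """Model of the bug: encode the path again at resolution time."""
    return quote(urlsplit(url).path, safe="/")


url = "https://example.com/foo%20bar.dhall"  # user already percent-encoded

print(request_path_verbatim(url))   # /foo%20bar.dhall
print(request_path_reencoded(url))  # /foo%2520bar.dhall
```

Only the verbatim form requests the resource the user actually wrote.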


philandstuff mentioned this issue Jun 7, 2019

add a bunch of binary decode tests #582

Merged

philandstuff mentioned this issue Jun 11, 2019

Implement importing paths as Location #585

Merged

philandstuff mentioned this issue Jun 15, 2019

Use RFC3986 section 5 URL resolution algorithm #591

Closed

philandstuff mentioned this issue Jun 16, 2019

Use RFC3986 section 5 URL resolution algorithm #593

Merged

philandstuff mentioned this issue Jun 24, 2019

Simpler way of incorporating RFC3986 resolution #603

Merged

philandstuff mentioned this issue Jun 24, 2019

Allow all RFC3986-compliant URLs #604

Merged

philandstuff mentioned this issue Jul 14, 2019

dhall format should prefer percent-encoded URL paths instead of quoted paths dhall-lang/dhall-haskell#1109

Closed

This was referenced Jul 19, 2019

Allow only valid HTTP(S) reg-names dhall-lang/dhall-haskell#1124

Merged

potPourriB.dhallb: Fix encoding of query string #655

Merged

Nadrieril mentioned this issue Aug 6, 2019

Canonicalization of imports Nadrieril/dhall-rust#76

Closed

philandstuff mentioned this issue Aug 13, 2019

Don't URL-decode path segments #704

Merged

philandstuff mentioned this issue Feb 9, 2020

Version 13.0.0 → 14.0.0 #908

Merged

philandstuff added a commit that referenced this issue Feb 15, 2020

Add deprecation plan for URLs with quoted paths

3ac7898

This should allow us to close #581, finally.

This was referenced Feb 15, 2020

Add deprecation plan for URLs with quoted paths #913

Merged

Document how to deprecate features #912

Closed

philandstuff closed this as completed in 68c8aa8 Feb 18, 2020

philandstuff reopened this Feb 18, 2020

Nadrieril mentioned this issue Mar 4, 2020

Parse RFC3986 URLs Nadrieril/dhall-rust#136

Closed

philandstuff self-assigned this May 11, 2020

philandstuff mentioned this issue May 12, 2020

Remove the ability to quote paths in URLs #1005

Merged

philandstuff closed this as completed in #1005 May 15, 2020

Gabriella439 mentioned this issue Apr 8, 2023

Problem with space character in URL import. dhall-lang/dhall-haskell#2467

Closed

Gabriella439 mentioned this issue Apr 8, 2023

Don't URL encode path components dhall-lang/dhall-haskell#2505

Merged
