Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: return to standard RFC 3986 URLs #581

Open
philandstuff opened this issue Jun 6, 2019 · 9 comments

Comments

@philandstuff
Copy link
Collaborator

@philandstuff philandstuff commented Jun 6, 2019

I brought the checklist up to the top, for visibility:

  • change the import semantics to define chaining local imports onto remote
    imports in terms of RFC3986 relative reference resolution; (#593, #603)
  • change dhall.abnf to allow any RFC3986-compliant URL, in addition to the
    currently-allowed URLs with path components. This will need some care, but
    it will not be a breaking change because we are only allowing new URLs, not
    removing or changing the meaning of existing URLs. (It is helpful here that
    the double quote character " is not a valid character in RFC3986). (#604)
  • change dhall format to prefer percent-encoded URLs over quoted-path URLs (dhall-lang/dhall-haskell#1109 and dhall-lang/dhall-haskell#1235)
  • deprecate non-RFC3986 URLs (including quoted-path URLs)
  • after a suitable deprecation period, remove non-RFC3986 URLs from the
    language

Original proposal below:


Before #127, Dhall accepted standard URLs as defined by RFC 3986. #127 changed
URL paths to have the same syntax as local paths. The stated reason for this
seems to be that we want to be able to host the same set of Dhall files on local
disk or remote server, and have them work the same.

#205 shows that some confusion arises as a result of not allowing all standard URLs.

I see two main problems with the current solution:

First, it means URLs in Dhall are surprisingly different from URLs everywhere
else. As an example, I particularly dislike how dhall format converts
percent-encoded URLs (a thing I understand and expect from other contexts) to
quoted paths (a thing I do not understand in a URL context, because it is not
valid anywhere else). I find it a jarring experience, and it dents my confidence
about whether I know how to format URLs the way Dhall expects, and whether
the URLs I put into Dhall mean what I think they mean.

Second, it assumes that URL path segments and local path segments have the same
semantics, when they do not. In particular, percent-encoding means something in
URLs but does not mean anything in local paths. To see this, imagine a file
foo.dhall containing a local import ./foo%20bar.dhall. If foo.dhall is
a local file, this will open a file whose literal name is foo%20bar.dhall; but
if foo.dhall is a remote file, it will make a web request for
https://.../foo%20bar.dhall which will open a file whose literal name is foo bar.dhall.

This means that, in some cases, moving a set of files from local to remote (or vice versa) will break imports.

EDIT: this is incorrect, see Gabriel's comment below

(There is also a very minor point that we do not currently accept some valid
URLs, beyond the types listed at the top of this issue. As an example,
http://example.com/(foo) is a valid URL that Dhall will not parse.)

In my view, the correct solution to this is to define a way to turn a local
import into a relative URI, and resolve it against the appropriate base URI.
Resolving a relative reference is defined in RFC3986 section 5; we should reuse
that prior art rather than reinvent it ourselves. Languages should already have
support for resolving relative URIs. (Indeed, this is how dhall-golang
implements import chaining; it does not naively implement the judgment rules in
the spec, but rather calls the Go standard library method
net.url/URL#ResolveReference()).

In the above example, to chain the import ./foo%20bar.dhall onto a remote
import, we would first convert the path to a relative reference
./foo%2520bar.dhall (where the percent character has itself been
percent-encoded), resolve that reference with the appropriate base URI, make the
network request, and the server would then correctly interpret the
foo%2520bar.dhall path segment as a request for a file called
foo%20bar.dhall. Similarly, to chain the import ./"foo bar.dhall", we would
convert to a relative reference ./foo%20bar.dhall, resolve relative to a base
URI, make the request, and the server would correctly interpret the path segment
as a request for the file foo bar.dhall.

Once we have done this, the syntax of remote URLs does not have to match local
paths at all. Then we can return to the happy familiarity of true
RFC3986-compliant URLs.

My proposal is thus:

  1. change the import semantics to define chaining local imports onto remote
    imports in terms of RFC3986 relative reference resolution;
  2. change dhall.abnf to allow any RFC3986-compliant URL, in addition to the
    currently-allowed URLs with path components. This will need some care, but
    it will not be a breaking change because we are only allowing new URLs, not
    removing or changing the meaning of existing URLs. (It is helpful here that
    the double quote character " is not a valid character in RFC3986).
  3. change dhall format to prefer percent-encoded URLs over quoted-path URLs
  4. deprecate non-RFC3986 URLs (including quoted-path URLs)
  5. after a suitable deprecation period, remove non-RFC3986 URLs from the
    language

At this point, dhall URL syntax will just be standard URL syntax; but dhall
files will still be relocatable between local and remote without breaking
things. I think this is the best of all worlds.

@Gabriel439

This comment has been minimized.

Copy link
Contributor

@Gabriel439 Gabriel439 commented Jun 6, 2019

I'm still reviewing this, but one thing immediately caught my eye:

Second, it assumes that URL path segments and local path segments have the same
semantics, when they do not. In particular, percent-encoding means something in
URLs but does not mean anything in local paths. To see this, imagine a file
foo.dhall containing a local import ./"foo%20bar.dhall". If foo.dhall is
a local file, this will open a file whose literal name is foo%20bar.dhall; but
if foo.dhall is a remote file, it will make a web request for
https://.../foo%20bar.dhall which will open a file whose literal name is foo bar.dhall.

My understanding is that this is not what the standard says. If foo.dhall were a remote file it would transitively import https://…/foo%2520bar.dhall (i.e. it would percent-encode foo%20bar.dhall and then fetch that).

See: https://github.com/dhall-lang/dhall-lang/blob/6e67b2202eeb333a99556350979e37216b86510b/standard/imports.md#quoted-paths

So in other words, URLs and paths do have the same semantics. Both examples would import the same file if you hosted the directory tree locally vs remotely.

@philandstuff

This comment has been minimized.

Copy link
Collaborator Author

@philandstuff philandstuff commented Jun 6, 2019

I made a mistake, my example should have been ./foo%20bar.dhall, not ./"foo%20bar.dhall". You are correct in your statement; but the unquoted path still has the problem I describe.

(I have updated my original post above, so Gabriel's comment might look confusing now.)

@Gabriel439

This comment has been minimized.

Copy link
Contributor

@Gabriel439 Gabriel439 commented Jun 7, 2019

I think this is fine, with two caveats:

  • We might want to specify the RFC3986 logic for making filepaths relative to URLs using the same judgment syntax that we use for everything else (mainly to ease implementation for languages that don't have library support for this)
  • We should still disallow fragments (See: #403)
@philandstuff

This comment has been minimized.

Copy link
Collaborator Author

@philandstuff philandstuff commented Jun 7, 2019

Agreed on both points 👍

I’ll draft something, though it might take me a few weeks.

@f-f

This comment has been minimized.

Copy link
Member

@f-f f-f commented Jun 10, 2019

@philandstuff thanks for the great writeup! Given that I'm the author of #205 and that I think #585 should be able to represent all RFC3986-valid URLs as a Dhall datastructure, I'm very much in favor of this 👍

@philandstuff

This comment has been minimized.

Copy link
Collaborator Author

@philandstuff philandstuff commented Jun 15, 2019

@Gabriel439 after thinking about this quite a bit more, I finally understand what were saying here:

My understanding is that this is not what the standard says. If foo.dhall were a remote file it would transitively import https://…/foo%2520bar.dhall (i.e. it would percent-encode foo%20bar.dhall and then fetch that).

See: https://github.com/dhall-lang/dhall-lang/blob/6e67b2202eeb333a99556350979e37216b86510b/standard/imports.md#quoted-paths

So in other words, URLs and paths do have the same semantics. Both examples would import the same file if you hosted the directory tree locally vs remotely.

...and you're absolutely correct. The quoting or otherwise of the local import doesn't make a difference. I've updated my original post to clarify this.

I am still in support of this change, and I'm working on step 1 above as documented, but I concede I was mistaken about one of my arguments.

philandstuff added a commit that referenced this issue Jun 15, 2019
This is part 1 of the proposal in #581.  This changes the chaining
operation so that a parent URL chained to a child local import is now
defined in terms of the RFC3986 section 5 URL resolution algorithm.

I have erred on the side of being too verbose rather than not verbose
enough.

As part of this, I have started to use the notation `URL` (and `URL₀`,
`URL₁`, etc) to stand in for any generic URL import.  I think as things
progress on #581, URLs can be treated more opaquely and the current
`https://authority₀ directory₀ file₀` notation can be removed.

As part of this, I have also changed the internal representation of a
URL import to be a literal URL, rather than the parsed and decoded
components of a URL.  This was necessary, because without this change we
would do percent-encoding in two places: when converting a local import
to a relative reference, and when fetching a URL.  This meant that we
could end up double-percent-encoding a URL.  For example, if we chain
`https://example.com/foo` on to the local path `./" "`, we would first
convert the local path to `./%20`, resolve the relative reference to
`https://example.com/%20`, and then when fetching the URL,
percent-encode it again to `https://example.com/%2520`.  This is
incorrect behaviour.

After this change, we only perform percent-encoding either a) when
parsing a URL containing quoted path components; or b) when converting a
local import to a relative reference.  The same data cannot go through
both processes, so there is now no risk of double-percent-encoding.
Further, we never perform percent-decoding any more.

In particular:

 - the binary representation of a URL is now the CBOR-specified encoding
   of a URL (see RFC7049 section 2.4.4.3)

 - quoted path components in URLs should be percent-encoded at parse
   time and converted to a valid URL

This commit is a breaking change; however, its breakages are somewhat
esoteric.  The binary format has changed, but it won't affect cached
expressions or semantic hashes, because both are computed on
fully-resolved expressions, which don't contain imports.
philandstuff added a commit that referenced this issue Jun 16, 2019
This is part 1 of the proposal in #581.  This changes the chaining
operation so that a parent URL chained to a child local import is now
defined in terms of the RFC3986 section 5 URL resolution algorithm.
@Gabriel439

This comment has been minimized.

Copy link
Contributor

@Gabriel439 Gabriel439 commented Jun 18, 2019

@philandstuff: Actually, I still think the gist of what you were saying was correct even if your initial example was off. I'll explain what convinced me that this change was necessary.

Under the current standard, these two URLs are not the same:

  • https://example.com/foo%20bar.dhall
  • https://example.com/"foo%20bar.dhall"

... but these two paths are the same:

  • ./foo%20bar.dhall
  • ./"foo%20bar.dhall"

... which is inconsistent. So my view was that as long as we treat URL path components the same as file path components we will inevitably run into issues because of that mismatch in how we treat path components. That's what convinced me that we should treat URLs as opaque values like you proposed.

@singpolyma

This comment has been minimized.

Copy link
Collaborator

@singpolyma singpolyma commented Jun 18, 2019

@philandstuff

This comment has been minimized.

Copy link
Collaborator Author

@philandstuff philandstuff commented Jun 18, 2019

Yeah, the quoted URL thing is my main concern (hence it was number 1 on my list above). I haven’t implemented it in dhall-golang.

philandstuff added a commit that referenced this issue Jun 20, 2019
* Use RFC3986 section 5 URL resolution algorithm

This is part 1 of the proposal in #581.  This changes the chaining
operation so that a parent URL chained to a child local import is now
defined in terms of the RFC3986 section 5 URL resolution algorithm.

* use `. path₁ file₁` syntax for relative references
philandstuff added a commit that referenced this issue Jun 24, 2019
Mostly reverts #593.

I realised I said I'd keep judgment rules for RFC3986 resolution [1],
and then I didn't.  So I set about writing some judgment rules and
realised... the old judgment rules were exactly what was required to
implement RFC3986.

I think the key thing from #593 that I think is worth keeping is a note
about how the chain-remote-with-local rules implement the RFC 3986
algorithm, and some advice to take care when using a URL resolution
library to get the percent-encoding correct.

[1]: #581 (comment)
philandstuff added a commit that referenced this issue Jun 24, 2019
This expands the URL syntax to allow all RFC3986-compliant URLs.  This
means that, in particular:

 * empty path segments are now permitted
 * an entirely empty path is also permitted (though it is defined to be
   equivalent to the path `/`, which has a single empty segment)
 * paths may contain the characters '(', ')' and ','
   * strictly, any path segment must be entirely `pchar`s or
     `path-characters`, so some combinations are still disallowed, such
     as "foo,bar^baz" because ',' is not a `path-character` and '^' is
     not a `pchar`.

As this purely expands the space of strings which can be parsed as URLs,
this is a mostly-not-breaking change; no existing URLs will fail to
parse.  However, there are some dhall programs whose interpretation will
change, because a URL will consume more characters than it previously
was able to.  For example

    http://example.com/foo//bar

was previously parsed as

    http://example.com/foo ⫽ bar

but is now parsed as a single URL.  Another example is

    [ http://example.com/foo, bar ]

which previously was a list containing two items
`http://example.com/foo` and `bar`, and is now a list with a single
item: the result of applying `http://example.com/foo,` to `bar`.

A final awkward example is:

    (http://example.com/foo)

which previously parsed as `http://example.com/foo`, but now fails to
parse (in a PEG grammar, at least) because the final ')' character is
greedily consumed by the URL parser and then we have nothing to close
the leading parenthesis.

The last two examples could be avoided if we disallow the characters
'(', ')' and ',' from the `pchar` rule.  This would get us closer to
standard URL syntax but avoid some tricky pitfalls of parsing.  My
preference is to go with the standard `pchar` rule despite the edge
cases.

Fixes #205.  See also #581.
philandstuff added a commit that referenced this issue Jun 27, 2019
Mostly reverts #593.

I realised I said I'd keep judgment rules for RFC3986 resolution [1],
and then I didn't.  So I set about writing some judgment rules and
realised... the old judgment rules were exactly what was required to
implement RFC3986.

I think the key thing from #593 that I think is worth keeping is a note
about how the chain-remote-with-local rules implement the RFC 3986
algorithm, and some advice to take care when using a URL resolution
library to get the percent-encoding correct.

[1]: #581 (comment)
philandstuff added a commit that referenced this issue Jun 30, 2019
This expands the URL syntax to allow all RFC3986-compliant URLs.  This
means that, in particular:

 * empty path segments are now permitted
 * an entirely empty path is also permitted (though it is defined to be
   equivalent to the path `/`, which has a single empty segment)
 * paths may contain the characters '(', ')' and ','
   * strictly, any path segment must be entirely `pchar`s or
     `path-characters`, so some combinations are still disallowed, such
     as "foo,bar^baz" because ',' is not a `path-character` and '^' is
     not a `pchar`.

As this purely expands the space of strings which can be parsed as URLs,
this is a mostly-not-breaking change; no existing URLs will fail to
parse.  However, there are some dhall programs whose interpretation will
change, because a URL will consume more characters than it previously
was able to.  For example

    http://example.com/foo//bar

was previously parsed as

    http://example.com/foo ⫽ bar

but is now parsed as a single URL.  Another example is

    [ http://example.com/foo, bar ]

which previously was a list containing two items
`http://example.com/foo` and `bar`, and is now a list with a single
item: the result of applying `http://example.com/foo,` to `bar`.

A final awkward example is:

    (http://example.com/foo)

which previously parsed as `http://example.com/foo`, but now fails to
parse (in a PEG grammar, at least) because the final ')' character is
greedily consumed by the URL parser and then we have nothing to close
the leading parenthesis.

The last two examples could be avoided if we disallow the characters
'(', ')' and ',' from the `pchar` rule.  This would get us closer to
standard URL syntax but avoid some tricky pitfalls of parsing.  My
preference is to go with the standard `pchar` rule despite the edge
cases.

Fixes #205.  See also #581.
philandstuff added a commit that referenced this issue Jul 1, 2019
* Allow all RFC3986-compliant URLs

This expands the URL syntax to allow all RFC3986-compliant URLs.  This
means that, in particular:

 * empty path segments are now permitted
 * an entirely empty path is also permitted (though it is defined to be
   equivalent to the path `/`, which has a single empty segment)
 * paths may contain the characters '(', ')' and ','
   * strictly, any path segment must be entirely `pchar`s or
     `path-characters`, so some combinations are still disallowed, such
     as "foo,bar^baz" because ',' is not a `path-character` and '^' is
     not a `pchar`.

As this purely expands the space of strings which can be parsed as URLs,
this is a mostly-not-breaking change; no existing URLs will fail to
parse.  However, there are some dhall programs whose interpretation will
change, because a URL will consume more characters than it previously
was able to.  For example

    http://example.com/foo//bar

was previously parsed as

    http://example.com/foo ⫽ bar

but is now parsed as a single URL.  Another example is

    [ http://example.com/foo, bar ]

which previously was a list containing two items
`http://example.com/foo` and `bar`, and is now a list with a single
item: the result of applying `http://example.com/foo,` to `bar`.

A final awkward example is:

    (http://example.com/foo)

which previously parsed as `http://example.com/foo`, but now fails to
parse (in a PEG grammar, at least) because the final ')' character is
greedily consumed by the URL parser and then we have nothing to close
the leading parenthesis.

The last two examples could be avoided if we disallow the characters
'(', ')' and ',' from the `pchar` rule.  This would get us closer to
standard URL syntax but avoid some tricky pitfalls of parsing.  My
preference is to go with the standard `pchar` rule despite the edge
cases.

Fixes #205.  See also #581.

* Remove (), from sub-delims

This is to avoid the problems mentioned in the previous commit with
parsing URLs containing these characters.

* explain why we disallow (), in URLs
philandstuff added a commit that referenced this issue Aug 13, 2019
Currently, the standard requires implementations to URL-decode unquoted
path segments.  I think this is a mistake.

In particular, the URLs http://example.com/foo:bar and
http://example.com/foo%3Abar are potentially different URLs, even though
`:` is URL-encoded as `%3A`.  They are different because `:` is a
reserved character, which means that servers may interpret the `:` as a
special delimiter rather than the literal colon character.

An example of this kind of processing is the "matrix parameters" in
http://example.com/foo;bar=baz/quux -- matrix parameters are paremeters
on a path component, similar to how a query string contains parameters
on the whole URL.  Matrix parameters use the reserved characters `;` and
`=` as delimiter syntax.  If you want to use these characters within a
matrix parameter name or value, you need to percent-encode them.  This
demonstrates why `;` and `%3B` may not be equivalent in path segments.

By requiring implementations to URL-decode path segments, we
unintentionally fold distinct URLs into the same representation.

This commit fixes that by changing the internal representation to be the
original (encoded) text, and by requiring *quoted* path segments to be
URL-*encoded* to match.  This changes the binary representation for
URIs, but the binary representation for URIs is rarely used because it
cannot appear in a resolved and normalized expression.

Reserved characters are defined here: https://tools.ietf.org/html/rfc3986#section-2.2

Matrix URIs: https://www.w3.org/DesignIssues/MatrixURIs.html

Related to #581 and dhall-lang/dhall-haskell#1109.
philandstuff added a commit that referenced this issue Aug 21, 2019
Currently, the standard requires implementations to URL-decode unquoted
path segments.  I think this is a mistake.

In particular, the URLs http://example.com/foo:bar and
http://example.com/foo%3Abar are potentially different URLs, even though
`:` is URL-encoded as `%3A`.  They are different because `:` is a
reserved character, which means that servers may interpret the `:` as a
special delimiter rather than the literal colon character.

An example of this kind of processing is the "matrix parameters" in
http://example.com/foo;bar=baz/quux -- matrix parameters are paremeters
on a path component, similar to how a query string contains parameters
on the whole URL.  Matrix parameters use the reserved characters `;` and
`=` as delimiter syntax.  If you want to use these characters within a
matrix parameter name or value, you need to percent-encode them.  This
demonstrates why `;` and `%3B` may not be equivalent in path segments.

By requiring implementations to URL-decode path segments, we
unintentionally fold distinct URLs into the same representation.

This commit fixes that by changing the internal representation to be the
original (encoded) text, and by requiring *quoted* path segments to be
URL-*encoded* to match.  This changes the binary representation for
URIs, but the binary representation for URIs is rarely used because it
cannot appear in a resolved and normalized expression.

Reserved characters are defined here: https://tools.ietf.org/html/rfc3986#section-2.2

Matrix URIs: https://www.w3.org/DesignIssues/MatrixURIs.html

Related to #581 and dhall-lang/dhall-haskell#1109.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.