Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional tests related to pathToFileURL #248

Closed
wants to merge 7 commits into from
Closed

Conversation

lemire
Copy link
Member

@lemire lemire commented Feb 28, 2023

Node.js has a distinct pathToFileURL function. This PR adds some related tests.

@lemire lemire requested a review from anonrig February 28, 2023 19:52
@lemire
Copy link
Member Author

lemire commented Feb 28, 2023

@anonrig

The node documentation says that this is incorrect...

> var u = new URL('/foo#1', 'file:');
> u
URL {
  href: 'file:///foo#1',
  origin: 'null',
  protocol: 'file:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '/foo',
  search: '',
  searchParams: URLSearchParams {},
  hash: '#1'
}

Well, it is a matter of intent. If you intention that the path be "/foo#1", then you can use a setter:

> var u = new URL('file:')
> u.pathname = "/foo#1"
'/foo#1'
> u
URL {
  href: 'file:///foo%231',
  origin: 'null',
  protocol: 'file:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '/foo%231',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}

@lemire
Copy link
Member Author

lemire commented Feb 28, 2023

The claim in the node documentation that the following is incorrect...

> new URL('/some/path%.c', 'file:');
URL {
  href: 'file:///some/path%.c',
  origin: 'null',
  protocol: 'file:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '/some/path%.c',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}

... I don't understand. This looks correct.

Now, the issue is whether this should be allowed...

> var u = new URL('file:')
> u.pathname = "/some/path%.c"
'/some/path%.c'
> u
URL {
  href: 'file:///some/path%.c',
  origin: 'null',
  protocol: 'file:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '/some/path%.c',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}

It looks fine to me but depending on how you read the spec... it is a validation error, suggesting that one ought to log an error.

* This function ensures that path is resolved absolutely,
* and that the URL control characters are correctly encoded when converting into a File URL.
*
* new URL('/foo#1', 'file:'); // Incorrect: file:///foo#1
Copy link
Member

@anonrig anonrig Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Safari both this example and new URL("file:///foo#1") has a href value of file:///foo#1. In both of them #1 is considered a hash.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whereas in let u = new URL('file:///test'); u.pathname = "/foo#1"; foo#1 is considered as query and percent encoded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anonrig Right. So if you have a path that is "/foo#1", then it needs to be percent encoded. You can either do it yourself, or use ada:

     ada::result url = ada::parse("file://");
     if(!url->set_pathname("/foo#1")) { return false; }
     if(url->get_href() != "file:///foo%231") { return false; }

So there is no need for a special function (e.g., pathToFileURL) in this instance. Just use the path setter from ada.

Correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path setter would do it. I also assume that path setter changes % to %25 right? Since it's the other requirement of the function. If yes, we don't have to do a for loop in here: https://github.com/nodejs/node/blob/main/src/node_url.cc#L332

* new URL('/foo#1', 'file:'); // Incorrect: file:///foo#1
* pathToFileURL('/foo#1'); // Correct: file:///foo%231 (POSIX)
*
* new URL('/some/path%.c', 'file:'); // Incorrect: file:///some/path%.c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safari returns "file:///some/path%.c" for href for this value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Firefox too...
I get file:///some/path%.c. It does not try to percent encode the '%' character.

* new URL('/some/path%.c', 'file:'); // Incorrect: file:///some/path%.c
* pathToFileURL('/some/path%.c'); // Correct: file:///some/path%25.c (POSIX)
*
* It is unclear why node states that 'file:///some/path%.c' is incorrect. Any path part of URL should
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This is the commit that added the function: nodejs/node@eef072f

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing is done on the C++ side too: https://github.com/nodejs/node/blob/main/src/node_url.cc#L332

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@lemire lemire Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. So that's your code here... why do you change '%' to '%25'?
Screenshot 2023-02-28 at 6 40 59 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a refactor on the Ada migration pull request. Originally it was added in this commit: nodejs/node@cf340fe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Stack Overflow answer makes sense, but I don't see how it helps us or Node. If the user has a path with '%' in it, what is Node going to do ? Percent encoding it does not make it go away.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it percent encodes it, and later decodes it. There are 2 functions in Node:

pathToFileURL and fileURLToPath

@lemire lemire changed the title Potential bug in set_pathname Additional tests related to pathToFileURL Feb 28, 2023
@lemire
Copy link
Member Author

lemire commented Feb 28, 2023

@guybedford

Hello Guy. Can you help us understand better the node pathToFileURL function? It is currently implemented in JavaScript within Node. Node relies on the C++ URL parser ada. For performance reasons, it seems that we could simply ask Node to call the path setter from ada.

However, before changing anything, it is best to understand what is already present (Chesterton's fence).

Specifically...

  1. In a path, the '%' character is not in the percent-escape set for paths. Even so, both URLs 'file:///some/path%.c' and 'file:///some/path%25.c', once percent decoded will map to the same string ('file:///some/path%.c'), no ? There is some reference to 'POSIX' in the Node documentation but it is unclear to me what it refers to in this context. Certainly, '%' may belong to a path... under POSIX systems or others... that is fine... but it does not look like an argument to percent encode it per se. (Answer provided below: What if we have %20 as a file name ?)
  2. The '#' character, when it belongs to a path, needs to be percent encoded, but this might be best achieved by a path setter (as per the url.spec.whatwg.org standard) or manually if one prefers... In this sense, the Node documentation is clear enough in this case, but it seems like the right concept would be a path setter... no?

In effect, Guy, why don't we just use a standard-compliant path setter ?

(I am asking honestly, I am not being argumentative.)

@lemire
Copy link
Member Author

lemire commented Mar 1, 2023

@jasnell We are wondering about the pathToFileURL in Node. If you have any insights, please share.

@guybedford
Copy link

guybedford commented Mar 2, 2023

@lemire

  1. We support percent decoding in Node.js when importing paths in JavaScript. So import './Some%20File.js' will resolve into the path /base/Some File.js. This matches expectations with servers. But if you had a file on the file system explicitly called /base/Some%20File.js we still need a way to load that path so we support import './Some%25%20File.js' here in Node.js. When creating the inverse function pathToFileURL, in order to support a transitive canonical inversion we therefore encode percent encodings.
  2. The same argument / use case roughly applies for # in paths.

@lemire
Copy link
Member Author

lemire commented Mar 2, 2023

Thanks @guybedford. That's useful.

@TimothyGu can you help?

Here what the path setter in ada does right now...

    ada::result url = ada::parse("file://");
    url->set_pathname(....)
    url->get_href()....
input path href get_pathname
/foo#1 file:///foo%231 /foo%231
/base/love?.js file:///base/love%3F.js /love%3F.js
/base/love#.js  file:///base/love%23.js /love%23.js
/base/Some%20File.js file:///base/Some%20File.js /Some%20File.js

It seems to me (but I could be wrong) that all we need for ada->set_pathname to be a replacement for pathToFileURL is the handing of the case where the input has a percent followed by two hex characters (e.g., %20). If you allow the string "%20" in a path (and it certainly allowed in many systems) then you need to percent encode the '%' character (map it to '%25'). I don't see anything in the specification about that right now.

Another issue that you can see is that get_pathname does not even attempt to round-trip: it does not do the percent decoding. This seems to follow the spec... https://url.spec.whatwg.org/#url-path-serializer

But if we had something like get_pathname_decoded, we could round trip and allow systems such as Node to quickly retrieve the path. It would also make testing possible: we could have a routine that inputs all sorts of valid paths and we could make sure that ada encodes and decodes them.

It seems to be related to these issues...

  1. Should we unescape characters in path? whatwg/url#606
  2. Percent encoding does not round-trip whatwg/url#523

Motivation: Pushing these computations in ada (instead of leaving them to JavaScript) could improve the performance.

@TimothyGu
Copy link

Not entirely sure what the question here is.

I think "well-formed" file URLs should not have non-percent-encoding %s (like ab%xy.txt should be percent-encoded when it's part of a URL).

Also, you should be careful about UNC paths on Windows, which is a combination of hostname, port, and pathname.

@lemire
Copy link
Member Author

lemire commented Mar 2, 2023

I think "well-formed" file URLs should not have non-percent-encoding %s (like ab%xy.txt should be percent-encoded when it's part of a URL).

Not entirely sure what the question here is.

Let me make it precise. The spec. provides us with 'pathname setters' (search 'pathname setter' in https://url.spec.whatwg.org/#url-parsing). The 'pathname setters' will do percent encoding for you (see below) based on a percent-encoding set.

What do we think it should do in the following cases:

   ada::result url = ada::parse("file://");
   url->set_pathname("/base/Some%20File.js")
   url->get_href() == ?

E.g., is it file:///base/Some%20File.js or file:///base/Some%2520File.js?

   ada::result url = ada::parse("file://");
   url->set_pathname("/base/ab%xy.txt")
   url->get_href() == ?

Do we get file:///base/ab%xy.txt or something else? It seems clear in this instance that there is a validation error at play, but validation errors do not terminate the parsing and the specification does not say how they are handled.

That is my question in short.

Let me elaborate.

The path percent-encode set is U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>), U+003F (?), U+0060 (`), U+007B ({), and U+007D (}), C0 controls: range U+0000 NULL to U+001F INFORMATION SEPARATOR ONE, inclusive and all code points greater than U+007E (~).

The U+0025 (%) is not part of the path percent-encode set. I agree that the spec. says that ab%xy.txt is invalid, but in that context, for the parser, invalid means "you may issue a warning, or log an issue, but you must continue". That is, there is no indication that the parser should reject inputs with ab%xy.txt in it. Now, it could percent encode it, but where does it says so?

It is interesting to look into why it is a problem. We have that URL-path-segment string must be one of the following: zero or more URL units excluding U+002F (/) and U+003F (?). In turn, the URL units are URL code points (URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters) and percent-encoded bytes (A percent-encoded byte is U+0025 (%), followed by two ASCII hex digits.)

Here is what the spec says about that puzzle...

"Of the possible values for the percentEncodeSet argument only two end up encoding U+0025 (%) and thus give “roundtripable data”: component percent-encode set and application/x-www-form-urlencoded percent-encode set. The other values for the percentEncodeSet argument — which happen to be used by the URL parser — leave U+0025 (%) untouched and as such it needs to be percent-encoded first in order to be properly represented."

It specifically says that the parser leaves % untouched !!!

The component percent-encode set is the path percent-encode set plus U+0024 ($) to U+0026 (&), inclusive, U+002B (+), and U+002C (,), t) and U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+0040 (@), U+005B ([) to U+005E (^), inclusive, and U+007C (|).

See how it adds U+0025? So it is deliberately the case that U+0025 is to be left untouched.

The specification provides an example...

percent encoded decoded
%25%s%1G %%s%1G

So clearly, it suggests that in some cases, '%' is not itself percent encoded.

Reference:
https://url.spec.whatwg.org/#url-parsing

@TimothyGu
Copy link

That is my question in short.

Here's what Node.js v18 (pre-Ada) does, and also what the spec expects:

> u = new URL('file://'); u.pathname = '/base/Some%20File.js'; u.href
'file:///base/Some%20File.js'
> u = new URL('file://'); u.pathname = '/base/Some%xyFile.js'; u.href
'file:///base/Some%xyFile.js'

This behavior is consistent with the parser itself:

> u = new URL('file:///base/Some%20File.js'); u.href
'file:///base/Some%20File.js'
> u = new URL('file:///base/Some%xyFile.js'); u.href
'file:///base/Some%xyFile.js'

Note, the second URL is, as you note, "invalid". In practice, the "valid" equivalent would have Some%25xyFile.js – though I don't believe the WHATWG spec defines this semantic equivalence anywhere (there's a long line of issues about defining "normalization" in the WHATWG issue tracker).


Now, the spec is only binding for the JavaScript API that Node.js and browsers. Whether or not you think this is a good idea for Ada's C++ API is up to you – but I think it's quite reasonable to stick with the JS API.

However, it's important to note its limitations: it would be a mistake to call set_pathname() on a regular, unprocessed file system path, and expect the resulting URL to refer to that file path. It might be beneficial nevertheless to have a separate function encode_and_set_pathname for that use case.

@lemire
Copy link
Member Author

lemire commented Mar 8, 2023

it would be a mistake to call set_pathname() on a regular, unprocessed file system path, and expect the resulting URL to refer to that file path.

That's the key point.

@karwa
Copy link

karwa commented Mar 8, 2023

File path <-> URL conversion is not standardised. See whatwg/fetch#1338

POSIX style paths are fairly straightforward. Windows paths are more complex.

For my own URL library (a Swift implementation of the WHATWG standard), I created a fairly comprehensive test suite based on experimenting with the platform's native UrlCreateFromPath function, Microsoft's Edge browser, and other implementations.

The implementation is here. I'm not sure how easy it is to follow if you're not used to Swift, but I've tried to document everything thoroughly.

I'd love for everybody to come together to precisely document how this conversion should work.

@lemire
Copy link
Member Author

lemire commented Mar 8, 2023

@karwa I agree that it is a very nice idea to try to standardize the functions by joining forces. I code in Swift (proof: 1 and 2).

@lemire
Copy link
Member Author

lemire commented Mar 29, 2023

Closing this PR in favour of issue #309

@lemire lemire closed this Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants