Conversation

B4nan (Member) commented Mar 16, 2023

This deprecates the custom `parseUrl` function and uses `new URL` for normalization instead, which helps with some edge cases like having an `@` symbol inside the query string.

Closes apify/crawlee#1831
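To illustrate the edge case (the URL below is a made-up example in the spirit of the linked issue, not taken from it), the WHATWG `new URL` parser keeps an `@` inside the query string intact, where a naive regex-based parser could mistake it for the userinfo separator:

```javascript
// Hypothetical Google-Maps-style URL with '@' in the query string.
const url = 'https://www.google.com/maps/search/?api=1&query=restaurant@50.08,14.42';

const parsed = new URL(url);
console.log(parsed.hostname);                  // 'www.google.com'
console.log(parsed.searchParams.get('query')); // 'restaurant@50.08,14.42'
```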

github-actions bot added the t-tooling label (Issues with this label are in the ownership of the tooling team.) on Mar 16, 2023
mtrunkat (Member) commented:
I am adding @fnesveda and @drobnikj as this is then used in the Request Queue

metalwarrior665 (Member) left a comment:

This could break some stuff. If `normalizeUrl` returns `null`, does it fail or just use the URL as-is?

B4nan (Member, author) commented Mar 16, 2023

> This could break some stuff. If `normalizeUrl` returns `null`, does it fail or just use the URL as-is?

It returns `null` for invalid URLs, just like the previous implementation. The only change I ended up making in the tests was this:

https://github.com/apify/apify-shared-js/pull/365/files#diff-06a8e4c66772cfbd82d489c1b08ca03565556256143f99876e6edfcf4480a65dR530

```js
let urlObj;

try {
    urlObj = new URL(url.replace(/ +/g, ''));
} catch {
    return null; // parsing failed: treat as an invalid URL
}
```
A reviewer (Member) commented:

Why replace `url.trim()` with this regex? They're not equal.
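To make the difference concrete (hypothetical URL): `trim()` only strips leading and trailing whitespace, while `replace(/ +/g, '')` removes every run of spaces, including ones inside the query string:

```javascript
const url = '  http://example.com/?q=hotel prague  ';

console.log(url.trim());             // 'http://example.com/?q=hotel prague'
console.log(url.replace(/ +/g, '')); // 'http://example.com/?q=hotelprague'
```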

Another reviewer (Member) commented:

Yeah, removing whitespace from the URL seems wrong. It can be a legit non-encoded part of the query params (`?q=hotel prague`).

B4nan (author) replied Mar 17, 2023:

Because the old `parseUrl` implementation accepts invalid URLs that contain spaces all over the place, this gets rid of them so it is possible to parse those weird URLs.

https://github.com/apify/apify-shared-js/blob/master/test/utilities.client.test.ts#L544-L547

Side note: `new URL` behaves differently in the browser and in Node.js in this regard. A browser can parse `new URL('http: // test # fragment')` but Node.js can't. Even the browser cannot parse `new URL('http :// test # fragment')` (note the colon position), which is what the test is doing.
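A sketch of the Node.js side of this (browser behaviour varies by engine, so only the Node.js case is shown): the space inside the scheme makes the URL unparseable, so `new URL` throws:

```javascript
// In Node.js, a space inside the scheme makes the whole URL invalid.
let failed = false;
try {
    new URL('http :// test # fragment');
} catch (err) {
    failed = true; // TypeError: Invalid URL
}
console.log(failed); // true
```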

B4nan (author) replied:

> Yeah, removing whitespace from the URL seems wrong. It can be a legit non-encoded part of the query params (`?q=hotel prague`).

Well, this is not a valid URL; spaces need to be encoded to `%20`.

I will be more than happy to remove that, but the behaviour will change: those malformed URLs will result in `null` while they were working before. That is why I used this approach.

B4nan (author) replied:

> Well, this is not a valid URL; spaces need to be encoded to `%20`.

Actually, in the query it should be escaped to `+`, not `%20`.

But it's a good point, this is another difference from the old implementation.

An alternative is removing the spaces only from inside the protocol part via a regexp, as that is what breaks `new URL`. We will still get different results, as `new URL` (correctly) encodes the spaces to `%20` everywhere except the query, where it uses (again, correctly) the `+` character. Will give that a shot.
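For reference, both encodings show up depending on which serializer you ask (hypothetical URL): the WHATWG URL serializer percent-encodes spaces in the `href`, while the `URLSearchParams` form-urlencoded serializer uses `+`:

```javascript
const u = new URL('http://example.com/a b?q=hotel prague');

console.log(u.href);                    // 'http://example.com/a%20b?q=hotel%20prague'
console.log(u.searchParams.toString()); // 'q=hotel+prague'
```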

I think the important question here is where and for what we use this method. Crawlee seems to be using it only for computing `req.uniqueKey`, and for that it seems fine to just drop the spaces; we don't need the result to be the same URL, as we basically use it as a one-way hashing function.
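A minimal sketch of a `new URL`-based normalization usable as such a one-way key, assuming the "return `null` on invalid input" contract described above (this is not the actual apify-shared implementation, which does more, e.g. lowercasing and query-param filtering):

```javascript
// Simplified sketch of a `new URL`-based normalizeUrl (hypothetical).
function normalizeUrl(url, keepFragment = false) {
    let urlObj;
    try {
        urlObj = new URL(url.trim());
    } catch {
        return null; // invalid URL, mirroring the old parseUrl contract
    }
    if (!keepFragment) urlObj.hash = '';
    urlObj.searchParams.sort(); // stable key regardless of param order
    return urlObj.href;
}

console.log(normalizeUrl('  http://example.com/?b=2&a=1#frag  '));
// 'http://example.com/?a=1&b=2'
console.log(normalizeUrl('not a url')); // null
```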

B4nan (author) replied:

Or can I? Well, sure I can wipe the spaces before the `#` symbol.

But it all feels wrong, bending the correct implementation of `new URL` to support our weird test case, which imho might have been written based on that weird implementation rather than an actual problem :]

metalwarrior665 (Member) replied Mar 17, 2023:

Not sure if I'm following well :) but naively it would make sense to me that we follow what `new URL` does. Spaces in the origin are invalid, and spaces in the pathname and search are encoded to `%20`. And we just trim the start and the end, so people aren't punished for stray whitespace there.

B4nan (author) replied Mar 17, 2023:

The whole discussion here is basically about this test:

```js
it('should trim all parts of URL', () => {
    expect(normalizeUrl('    http    ://     test    # fragment   ')).toEqual('http://test');
    expect(normalizeUrl('    http   ://     test    # fragment   ', true)).toEqual('http://test#fragment');
});
```

https://github.com/apify/apify-shared-js/blob/master/test/utilities.client.test.ts#L544-L547

It won't work with `new URL` and will result in `null`.

If that is fine with us, I am all in for changing the assertions and call it a day.

(This is basically why I asked so many people/elders for a review, as it changes the behavior a bit, making it stricter if you will.)

metalwarrior665 (Member) replied:

Thanks. Then I can't really contribute, since I have no idea why we stripped invalid spaces inside the protocol. I see blame on @mtrunkat :D

B4nan (author) replied:

Yeah, that's why I added him to the reviewers, but you know, it's 5-year-old code, and I can see it was probably copied from some Stack Overflow post (I would do the same :D).

B4nan (author) commented Mar 17, 2023

OK, so I reverted the regexp to `trim()` as it was and commented out those two assertions. Let's test in the wild; if we see some problems, I can hack up a regexp that removes such spaces, or worst case we can always revert to the custom URL parsing and try to fix the original issue with `@`.

B4nan merged commit 55c69bc into master on Mar 20, 2023
B4nan deleted the norm-url branch on March 20, 2023, 12:19

Linked issue: Wrong uniqueKey creation for Google Maps URL with "@" (apify/crawlee#1831)