Conversation

B4nan (Member) commented Mar 16, 2023

This deprecates the custom `parseUrl` function and uses `new URL` for normalization instead, which helps with some edge cases like having an `@` symbol inside the query string.

Closes apify/crawlee#1831
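To illustrate the edge case (the URL below is a made-up example in the spirit of the linked issue, not taken from it), the WHATWG `new URL` parser keeps an `@` inside the query string intact, where a naive regex-based parser could mistake it for the userinfo separator:

```javascript
// Hypothetical Google-Maps-style URL with '@' in the query string.
const url = 'https://www.google.com/maps/search/?api=1&query=restaurant@50.08,14.42';

const parsed = new URL(url);
console.log(parsed.hostname);                  // 'www.google.com'
console.log(parsed.searchParams.get('query')); // 'restaurant@50.08,14.42'
```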

github-actions bot added the t-tooling label (Issues with this label are in the ownership of the tooling team.) on Mar 16, 2023
mtrunkat (Member) commented:
I am adding @fnesveda and @drobnikj as this is then used in the Request Queue

metalwarrior665 (Member) left a comment:

This could break some stuff. If `normalizeUrl` returns `null`, does it fail or just use the URL as-is?

B4nan (Member, author) commented Mar 16, 2023

> This could break some stuff. If `normalizeUrl` returns `null`, does it fail or just use the URL as-is?

It returns `null` for invalid URLs, just like the previous implementation. The only change I ended up making in the tests was this:

https://github.com/apify/apify-shared-js/pull/365/files#diff-06a8e4c66772cfbd82d489c1b08ca03565556256143f99876e6edfcf4480a65dR530

```js
let urlObj;

try {
    urlObj = new URL(url.replace(/ +/g, ''));
} catch {
    return null; // parsing failed: treat as an invalid URL
}
```
A reviewer (Member) commented:

Why replace `url.trim()` with this regex? They're not equal.
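To make the difference concrete (hypothetical URL): `trim()` only strips leading and trailing whitespace, while `replace(/ +/g, '')` removes every run of spaces, including ones inside the query string:

```javascript
const url = '  http://example.com/?q=hotel prague  ';

console.log(url.trim());             // 'http://example.com/?q=hotel prague'
console.log(url.replace(/ +/g, '')); // 'http://example.com/?q=hotelprague'
```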

Another reviewer (Member) commented:

Yeah, removing whitespace from the URL seems wrong. It can be a legit non-encoded part of the query params (`?q=hotel prague`).

B4nan (author) replied Mar 17, 2023:

Because the old `parseUrl` implementation accepts invalid URLs that contain spaces all over the place, this gets rid of them so it is possible to parse those weird URLs.

https://github.com/apify/apify-shared-js/blob/master/test/utilities.client.test.ts#L544-L547

Side note: `new URL` behaves differently in the browser and in Node.js in this regard. A browser can parse `new URL('http: // test # fragment')` but Node.js can't. Even the browser cannot parse `new URL('http :// test # fragment')` (note the colon position), which is what the test is doing.
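A sketch of the Node.js side of this (browser behaviour varies by engine, so only the Node.js case is shown): the space inside the scheme makes the URL unparseable, so `new URL` throws:

```javascript
// In Node.js, a space inside the scheme makes the whole URL invalid.
let failed = false;
try {
    new URL('http :// test # fragment');
} catch (err) {
    failed = true; // TypeError: Invalid URL
}
console.log(failed); // true
```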

B4nan (author) replied:

> Yeah, removing whitespace from the URL seems wrong. It can be a legit non-encoded part of the query params (`?q=hotel prague`).

Well, this is not a valid URL; spaces need to be encoded to `%20`.

I will be more than happy to remove that, but the behaviour will change: those malformed URLs will result in `null` while they were working before. That is why I used this approach.

B4nan (author) replied:

> Well, this is not a valid URL; spaces need to be encoded to `%20`.

Actually, in the query it should be escaped to `+`, not `%20`.

But it's a good point, this is another difference from the old implementation.

An alternative is removing the spaces only from inside the protocol part via a regexp, as that is what breaks `new URL`. We will still get different results, as `new URL` (correctly) encodes the spaces to `%20` everywhere except the query, where it uses (again, correctly) the `+` character. Will give that a shot.
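For reference, both encodings show up depending on which serializer you ask (hypothetical URL): the WHATWG URL serializer percent-encodes spaces in the `href`, while the `URLSearchParams` form-urlencoded serializer uses `+`:

```javascript
const u = new URL('http://example.com/a b?q=hotel prague');

console.log(u.href);                    // 'http://example.com/a%20b?q=hotel%20prague'
console.log(u.searchParams.toString()); // 'q=hotel+prague'
```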

I think the important question here is where and for what we use this method. Crawlee seems to be using it only for computing `req.uniqueKey`, and for that it seems fine to just drop the spaces; we don't need the result to be the same URL, as we basically use it as a one-way hashing function.
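A minimal sketch of a `new URL`-based normalization usable as such a one-way key, assuming the "return `null` on invalid input" contract described above (this is not the actual apify-shared implementation, which does more, e.g. lowercasing and query-param filtering):

```javascript
// Simplified sketch of a `new URL`-based normalizeUrl (hypothetical).
function normalizeUrl(url, keepFragment = false) {
    let urlObj;
    try {
        urlObj = new URL(url.trim());
    } catch {
        return null; // invalid URL, mirroring the old parseUrl contract
    }
    if (!keepFragment) urlObj.hash = '';
    urlObj.searchParams.sort(); // stable key regardless of param order
    return urlObj.href;
}

console.log(normalizeUrl('  http://example.com/?b=2&a=1#frag  '));
// 'http://example.com/?a=1&b=2'
console.log(normalizeUrl('not a url')); // null
```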

B4nan (author) replied:

Or can I? Well, sure I can wipe the spaces before the `#` symbol.

But it all feels wrong, bending the correct implementation of `new URL` to support our weird test case, which imho might have been written based on that weird implementation rather than an actual problem :]

metalwarrior665 (Member) replied Mar 17, 2023:

Not sure if I'm following well :) but naively it would make sense to me that we follow what `new URL` does. Spaces in the origin are invalid, and spaces in the pathname and search are encoded to `%20`. And we just trim the start and the end, so people aren't punished for stray whitespace there.

B4nan (author) replied Mar 17, 2023:

The whole discussion here is basically about this test:

```js
it('should trim all parts of URL', () => {
    expect(normalizeUrl('    http    ://     test    # fragment   ')).toEqual('http://test');
    expect(normalizeUrl('    http   ://     test    # fragment   ', true)).toEqual('http://test#fragment');
});
```

https://github.com/apify/apify-shared-js/blob/master/test/utilities.client.test.ts#L544-L547

It won't work with `new URL` and will result in `null`.

If that is fine with us, I am all in for changing the assertions and call it a day.

(This is basically why I asked so many people/elders for a review, as it changes the behavior a bit, making it stricter if you will.)

metalwarrior665 (Member) replied:

Thanks. Then I can't really contribute, since I have no idea why we stripped invalid spaces inside the protocol. I see blame on @mtrunkat :D

B4nan (author) replied:

Yeah, that's why I added him to the reviewers, but you know, it's 5-year-old code, and I can see it was probably copied from some Stack Overflow post (I would do the same :D).

B4nan (author) commented Mar 17, 2023

OK, so I reverted the regexp to `trim()` as it was and commented out those two assertions. Let's test in the wild; if we see some problems, I can hack up a regexp that removes such spaces, or worst case we can always revert to the custom URL parsing and try to fix the original issue with `@`.

B4nan merged commit 55c69bc into master on Mar 20, 2023
B4nan deleted the norm-url branch on March 20, 2023, 12:19

Linked issue: Wrong uniqueKey creation for Google Maps URL with "@" (apify/crawlee#1831)