-
Notifications
You must be signed in to change notification settings - Fork 11
fix: use new URL for URL normalization
#365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This deprecates the custom `parseUrl` function and uses `new URL` for normalization instead, which helps with some edge cases like having `@` symbol inside the query string. Closes apify/crawlee#1831
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could break some stuff. If the normalizeUrl returns null, does it fail or just uses the URL as is?
It returns |
| let urlObj; | ||
|
|
||
| try { | ||
| urlObj = new URL(url.replace(/ +/g, '')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why replace url.trim() with this regex? They're not equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, removing whitespaces from the URL seems wrong. They can be legit non-encoded parts of the query params (?q=hotel prague)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the old parseUrl implementation accepts invalid URLs that contain spaces all over the place, this gets rid of them so it is possible to part those weird URLs.
https://github.com/apify/apify-shared-js/blob/master/test/utilities.client.test.ts#L544-L547
side note: new URL behaves differently in browser and in nodejs in this regard, browser can parse new URL('http: // test # fragment') but nodejs can't. even browser cannot parse new URL('http :// test # fragment') (note the colon position), which is what the test is doing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, removing whitespaces from the URL seems wrong. They can be legit non-encoded parts of the query params (?q=hotel prague)
Well, this is not a valid URL, spaces need to be encoded to %20.
I will be more than happy to remove that, but the behaviour will change - those malformed URLs will result in null while they were working before - that is why I used this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is not a valid URL, spaces need to be encoded to %20.
Actually, in query it should be escaped to +, not %20.
But it's a good point, this is another difference from the old implementation.
An alternative to this is removing the spaces only from inside the protocol part via regexp, as that is what breaks new URL. We will still have different results, as new URL (correctly) encodes the spaces to %20 everywhere except query, where it uses (again, correctly) the + character. Will give that a shot.
I think the important question here is where and for what we use this method. Because crawlee seems to be using it only for computing req.uniqueKey and for that is seems to be fine to just drop spaces - we don't need it to be the same URL, as we basically use it as a one-way hashing function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or can I? Well sure I can wipe spaces before # symbol.
But it all feels wrong, bending the correct implementation of new URL to support our weird test case - which imho might have been written based on that weird implementation rather than actual problem :]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I'm following well :) but naively for me it would make sense that we follow what new URL does. Spaces in origin are invalid, spaces in pathname and search are encoded to %20. And we just trim from before start and after end so it doesn't punish if people forget whitespace there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole discussion here is basically about this test:
it('should trim all parts of URL', () => {
expect(normalizeUrl(' http :// test # fragment ')).toEqual('http://test');
expect(normalizeUrl(' http :// test # fragment ', true)).toEqual('http://test#fragment');
});
https://github.com/apify/apify-shared-js/blob/master/test/utilities.client.test.ts#L544-L547
It won't work with new URL and result in null.
If that is fine with us, I am all in for changing the assertions and call it a day.
(this is basically why I asked for a review of so many people/elders, as it changes the behavior a bit, making it more strict if you want)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, then I cannot really contribute since I have no idea why we stripped invalid spaces inside the protocol. I see blame on @mtrunkat :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's why I added him to the reviewers, but you know, it's 5 years old code, and I can see it was probably copied from some stackoverflow post (I would do the same :D).
|
Ok, so I reverted the regexp to |
This deprecates the custom
parseUrlfunction and usesnew URLfor normalization instead, which helps with some edge cases like having@symbol inside the query string.Closes apify/crawlee#1831