
fix: ignore invalid URLs in enqueueLinks in browser crawlers #1803

Merged (3 commits, Mar 6, 2023)
Conversation

robertm-97 (Contributor) commented Feb 27, 2023

Fixes: #1802

@robertm-97 changed the title from "Fixed enqueuing invalid URLs throwing Invalid URL error" to "Fix enqueuing invalid URLs throwing Invalid URL error" Feb 27, 2023
@robertm-97 changed the title to "Fix enqueuing invalid URLs throws Invalid URL error" Feb 27, 2023
@robertm-97 changed the title to "fix: enqueuing invalid URLs throws Invalid URL error" Feb 27, 2023
robertm-97 (Contributor, Author)

@mnmkng @vladfrangu @B4nan could anybody take a look at this? I'm happy with making code changes if needed.
Thanks in advance.

mnmkng (Member) commented Mar 3, 2023

Thanks for the PR. I'm not sure we want to silently ignore invalid URLs. Maybe this could be an option of enqueueLinks? A throwOnInvalidURLs flag that users could optionally set to false?

I would prefer not to ignore invalid URLs, because it usually points to some issue or inconsistency in the crawlers. Otherwise we could be missing data and not even know about it.
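A minimal sketch of how the proposed option might behave. The throwOnInvalidURLs name comes from the comment above; the resolveUrls helper and its signature are illustrative, not Crawlee's actual API:

```typescript
// Illustrative sketch only: `throwOnInvalidURLs` was merely proposed in this
// thread, and `resolveUrls` is a hypothetical helper, not part of Crawlee.
interface ResolveOptions {
    throwOnInvalidURLs?: boolean; // proposed option, defaulting to true
}

function resolveUrls(hrefs: string[], baseUrl: string, options: ResolveOptions = {}): string[] {
    const { throwOnInvalidURLs = true } = options;
    const resolved: string[] = [];
    for (const href of hrefs) {
        try {
            resolved.push(new URL(href, baseUrl).href);
        } catch {
            if (throwOnInvalidURLs) {
                // Surface the offending URL instead of a bare "Invalid URL".
                throw new Error(`Invalid URL "${href}" (base: "${baseUrl}")`);
            }
            // Otherwise skip the invalid entry and continue.
        }
    }
    return resolved;
}
```

With the flag set to false, invalid hrefs are dropped; with the default, the caller still gets an error, but one that names the URL that failed.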

metalwarrior665 (Member) commented Mar 3, 2023

@B4nan At least instead of just crashing with TypeError [ERR_INVALID_URL]: Invalid URL, we should expand the error message with the actual URL and the context where it crashed (since the stack trace is sometimes tricky). There are more places in Crawlee and the Apify SDK where this happens, so we should take a broader look.
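One way to enrich the message, sketched here with a hypothetical helper name and context parameter (neither is an existing Crawlee or Apify SDK API):

```typescript
// Hypothetical sketch: `toAbsoluteUrl` and its `context` parameter are
// illustrative, not an existing Crawlee/Apify SDK function.
function toAbsoluteUrl(href: string, baseUrl: string, context: string): string {
    try {
        return new URL(href, baseUrl).href;
    } catch {
        // Include the offending URL and the call site in the message, so the
        // failure is diagnosable even when the stack trace is unhelpful.
        throw new Error(`Invalid URL "${href}" (base: "${baseUrl}") while ${context}`);
    }
}
```

The caller would pass a short description such as "enqueuing links" so the log line identifies both the bad URL and where it came from.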

robertm-97 (Contributor, Author)

I agree it should not be silently ignored. I can add a new option throwOnInvalidURLs that defaults to true, and also add better error handling when this happens.
Thanks

B4nan (Member) commented Mar 6, 2023

Sounds good to me!

By the way, it also feels weird that your PR changes only the browser crawler; we should ensure the other crawlers work the same, so this should probably be implemented at a lower level.

robertm-97 (Contributor, Author)

Other crawlers use the try/catch approach, which was also the inspiration for this PR, but they also filter out invalid URLs from the array.

cheerio-crawler:

    try {
        return (new URL(href, baseUrl)).href;
    } catch {
        return undefined;
    }

jsdom-crawler (identical):

    try {
        return (new URL(href, baseUrl)).href;
    } catch {
        return undefined;
    }

So if we add the new throwOnInvalidURLs option, it needs to be added to the other crawlers as well.

B4nan (Member) commented Mar 6, 2023

Right, I thought we already did this somewhere. In that case, we should probably align the behavior.

robertm-97 (Contributor, Author) commented Mar 6, 2023

I aligned the enqueueLinks behavior across crawlers and also added a helper function for this to avoid duplicated code.
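Such a shared helper could look roughly like the snippets quoted earlier, extracted into one place. This is a sketch; the tryAbsoluteURL name is a guess, and the merged commit may name it differently:

```typescript
// Sketch of a shared helper in the spirit of the one described above;
// the name `tryAbsoluteURL` is hypothetical.
function tryAbsoluteURL(href: string, baseUrl: string): string | undefined {
    try {
        return new URL(href, baseUrl).href;
    } catch {
        return undefined; // invalid URLs are filtered out by the caller
    }
}

// Each crawler then maps its hrefs through the helper and drops invalid ones:
const urls = ['/page', 'https://exa mple.com']
    .map((href) => tryAbsoluteURL(href, 'https://example.com'))
    .filter((url): url is string => url !== undefined);
```

Centralizing the try/catch keeps cheerio-crawler, jsdom-crawler, and the browser crawlers consistent, which is the alignment discussed above.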

@robertm-97 requested a review from B4nan, Mar 6, 2023 09:46
@B4nan changed the title from "fix: enqueuing invalid URLs throws Invalid URL error" to "fix: ignore invalid URLs in enqueueLinks in browser crawlers" Mar 6, 2023
@B4nan merged commit 5ac336c into apify:master, Mar 6, 2023
Development

Successfully merging this pull request may close these issues.

Websites containing invalid URLs throw an error when enqueuing links
4 participants