feat: retryOnBlocked detects blocked webpage #1956
Conversation
Force-pushed from f571217 to 197cd18
Let's add some tests for this so we are notified when things break. Maybe an e2e test, as we want to check that the protection is properly detected on a real-world site. A rough sketch of what such a test could look like follows.
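A hypothetical sketch of such an e2e test, assuming a vitest-style runner and a `PlaywrightCrawler` with the new `retryOnBlocked` option; the target URL is a placeholder, not a real protected site:

```ts
// Hypothetical e2e sketch (vitest-style); the URL is a placeholder and the
// exact wiring is an assumption, not code from this PR.
import { PlaywrightCrawler } from 'crawlee';
import { expect, test } from 'vitest';

test('retryOnBlocked retires the session when an antibot page is detected', async () => {
    const errorMessages: string[] = [];

    const crawler = new PlaywrightCrawler({
        retryOnBlocked: true,
        maxRequestRetries: 1,
        requestHandler: async () => {},
        failedRequestHandler: async ({ request }) => {
            // Collect the errors recorded for the failed request.
            errorMessages.push(...request.errorMessages);
        },
    });

    // Replace with a real-world site known to serve an antibot challenge.
    await crawler.run(['https://example.com/protected']);

    expect(errorMessages.some((m) => m.includes('Antibot protection detected'))).toBe(true);
});
```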
    if (this.retryOnBlocked) {
        if (await this.isGettingBlocked(crawlingContext)) {
            session?.retire();
            throw new Error('Antibot protection detected, the session has been retired.');
        }
    }
Maybe we should throw `RetryRequestError`? But that could end up with infinite retries. Maybe it would be better to dynamically increase `request.maxRetries` instead, with some cap, e.g. 10.

I'm not sure how easy it is to get around those blocking errors just by picking a new session/proxy. It sounds safer not to count this toward the retry limit.
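To make the capped-retry idea concrete, here is a rough sketch; the per-request `maxRetries` override and the cap value are taken from the suggestion above and are assumptions, not the PR's actual code:

```ts
// Rough sketch of the capped-retry idea, assuming a per-request
// `maxRetries` override as proposed in the review; not the PR's actual code.
const BLOCKED_MAX_RETRIES = 10; // cap suggested in the review

if (this.retryOnBlocked && await this.isGettingBlocked(crawlingContext)) {
    const { request, session } = crawlingContext;
    session?.retire();

    // Grant blocked requests a larger retry budget, but never exceed the cap,
    // so a permanently blocked URL cannot retry forever.
    request.maxRetries = Math.min(
        (request.maxRetries ?? this.maxRequestRetries) + 1,
        BLOCKED_MAX_RETRIES,
    );

    throw new Error('Antibot protection detected, the session has been retired.');
}
```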
One last nit (plus the e2e test).
Force-pushed from 3e59336 to 28784f0
The current design implements the `retryOnBlocked` feature in (HTTP | Browser)Crawler. Maybe there is a nicer OOP way to do this? Also, it currently uses a very simple (but reasonably robust) way of detecting blocking: CSS selectors matching the Cloudflare and Google Search antibot pages. We might want to extend this?
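For illustration, selector-based detection can be as small as the sketch below; the selectors and the Playwright page type are illustrative assumptions, not necessarily the exact list this PR ships:

```ts
// Illustrative sketch of selector-based blocking detection; the selectors
// below are examples, not necessarily the ones used in the PR.
import type { Page } from 'playwright';

const BLOCKED_SELECTORS = [
    '#cf-challenge-running',    // Cloudflare "checking your browser" page (example)
    'iframe[src*="recaptcha"]', // Google reCAPTCHA interstitial (example)
];

async function isGettingBlocked(page: Page): Promise<boolean> {
    for (const selector of BLOCKED_SELECTORS) {
        // page.$() resolves to null when no element matches the selector.
        if (await page.$(selector) !== null) {
            return true;
        }
    }
    return false;
}
```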