
feat: retryOnBlocked detects blocked webpage #1956

Merged
merged 6 commits into master on Jul 19, 2023

Conversation

@barjin (Contributor) commented on Jun 26, 2023

The current design implements the retryOnBlocked feature in (HTTP | Browser)Crawler. Maybe there is a nicer OOP way to do this?

Also, it currently uses a very simple (but reasonably robust) way of detecting blocking: CSS selectors for the Cloudflare and Google Search antibot features. We might want to extend this?
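The selector-based detection described above can be sketched roughly as follows. This is illustrative only, not the actual Crawlee implementation: the selector strings and the `isGettingBlocked` / `matches` names are assumptions, and the real selector list lives elsewhere in the codebase.

```typescript
// Hypothetical CSS selectors commonly associated with antibot challenge pages.
// These are illustrative examples, not the real list used by Crawlee.
const CLOUDFLARE_CHALLENGE_SELECTORS = ['#cf-wrapper', '.cf-browser-verification'];
const GOOGLE_SORRY_SELECTORS = ['#recaptcha', 'form#captcha-form'];

// Returns true when any known "blocked" selector matches the loaded page.
// `matches` abstracts over the DOM query layer (cheerio, Playwright's page.$, etc.),
// so the same check works for both HTTP and browser crawlers.
function isGettingBlocked(matches: (selector: string) => boolean): boolean {
    return [...CLOUDFLARE_CHALLENGE_SELECTORS, ...GOOGLE_SORRY_SELECTORS]
        .some((selector) => matches(selector));
}
```

Abstracting the DOM query behind a predicate is one way to share the detection logic between HttpCrawler (cheerio) and BrowserCrawler (Playwright/Puppeteer) without duplicating the selector list.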

@B4nan (Member) left a comment

Let's add some tests for this so we are notified when things break. Maybe an e2e test, as we want to check that the protection is properly detected on a real-world site.

Resolved review threads (outdated):
packages/basic-crawler/src/internals/basic-crawler.ts
packages/utils/src/internals/blocked.ts
if (this.retryOnBlocked) {
    if (await this.isGettingBlocked(crawlingContext)) {
        session?.retire();
        throw new Error('Antibot protection detected, the session has been retired.');
    }
}
@B4nan (Member) commented on the snippet above:

Maybe we should throw RetryRequestError? But that could end up with infinite retries. It might be better to dynamically increase request.maxRetries instead, with some maximum, e.g. 10.

I'm not sure how easy it is to get around these blocking errors just by picking a new session/proxy; it sounds safer not to count this toward the retry limit.
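The suggestion above (extend the retry budget on a block instead of consuming a regular retry, with a hard cap) could look roughly like this. The cap of 10 follows the comment; the function and field names are illustrative assumptions, not Crawlee's actual API.

```typescript
// Hard cap on session rotations, as suggested in the review (e.g. 10).
const MAX_SESSION_ROTATIONS = 10;

// Hypothetical minimal shape of a request object for this sketch.
interface RequestLike {
    retryCount: number;
    maxRetries?: number;
}

// When a block is detected, bump request.maxRetries by one so the blocked
// attempt does not consume a "real" retry. Returns false once the cap is
// reached, signalling the crawler to give up on this request.
function extendRetriesOnBlock(request: RequestLike, defaultMaxRetries: number): boolean {
    const current = request.maxRetries ?? defaultMaxRetries;
    if (current >= MAX_SESSION_ROTATIONS) return false; // cap reached, give up
    request.maxRetries = current + 1;
    return true;
}
```

The trade-off versus throwing a plain RetryRequestError is exactly the one raised in the comment: the cap prevents infinite retries while still keeping blocked attempts out of the normal retry budget.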

Resolved review threads (outdated):
packages/http-crawler/src/internals/http-crawler.ts
packages/basic-crawler/src/internals/basic-crawler.ts
@B4nan (Member) left a comment

One last nit

(plus the e2e test)

Resolved review thread (outdated): packages/basic-crawler/src/internals/basic-crawler.ts
@barjin barjin merged commit 766fa9b into master Jul 19, 2023
7 checks passed
@barjin barjin deleted the feat/retryOnBlocked branch July 19, 2023 12:04