
feat: retryOnBlocked detects blocked webpage #1956

Merged
merged 6 commits into master on Jul 19, 2023

Conversation

@barjin (Contributor) commented on Jun 26, 2023

The current design implements the retryOnBlocked feature in (HTTP | Browser)Crawler. Maybe there is a nicer OOP way to do this?

Also, it currently uses a very simple (but reasonably robust) way of detecting blocking: CSS selectors for the Cloudflare and Google Search antibot features. We might want to extend this?
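The selector-based detection described above can be sketched roughly as follows. This is illustrative only, not the actual Crawlee implementation: the selector strings and the `isGettingBlocked` / `matches` names are assumptions, and the real selector list lives elsewhere in the codebase.

```typescript
// Hypothetical CSS selectors commonly associated with antibot challenge pages.
// These are illustrative examples, not the real list used by Crawlee.
const CLOUDFLARE_CHALLENGE_SELECTORS = ['#cf-wrapper', '.cf-browser-verification'];
const GOOGLE_SORRY_SELECTORS = ['#recaptcha', 'form#captcha-form'];

// Returns true when any known "blocked" selector matches the loaded page.
// `matches` abstracts over the DOM query layer (cheerio, Playwright's page.$, etc.),
// so the same check works for both HTTP and browser crawlers.
function isGettingBlocked(matches: (selector: string) => boolean): boolean {
    return [...CLOUDFLARE_CHALLENGE_SELECTORS, ...GOOGLE_SORRY_SELECTORS]
        .some((selector) => matches(selector));
}
```

Abstracting the DOM query behind a predicate is one way to share the detection logic between HttpCrawler (cheerio) and BrowserCrawler (Playwright/Puppeteer) without duplicating the selector list.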

@B4nan (Member) left a comment

Let's add some tests for this so we are notified when things break. Maybe an e2e test, as we want to check that the protection is properly detected on a real-world site.

Resolved review threads (outdated):
packages/basic-crawler/src/internals/basic-crawler.ts
packages/utils/src/internals/blocked.ts
if (this.retryOnBlocked) {
    if (await this.isGettingBlocked(crawlingContext)) {
        session?.retire();
        throw new Error('Antibot protection detected, the session has been retired.');
    }
}
@B4nan (Member) commented on the snippet above:

Maybe we should throw RetryRequestError? But that could end up with infinite retries. It might be better to dynamically increase request.maxRetries instead, with some maximum, e.g. 10.

I'm not sure how easy it is to get around these blocking errors just by picking a new session/proxy; it sounds safer not to count this toward the retry limit.
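The suggestion above (extend the retry budget on a block instead of consuming a regular retry, with a hard cap) could look roughly like this. The cap of 10 follows the comment; the function and field names are illustrative assumptions, not Crawlee's actual API.

```typescript
// Hard cap on session rotations, as suggested in the review (e.g. 10).
const MAX_SESSION_ROTATIONS = 10;

// Hypothetical minimal shape of a request object for this sketch.
interface RequestLike {
    retryCount: number;
    maxRetries?: number;
}

// When a block is detected, bump request.maxRetries by one so the blocked
// attempt does not consume a "real" retry. Returns false once the cap is
// reached, signalling the crawler to give up on this request.
function extendRetriesOnBlock(request: RequestLike, defaultMaxRetries: number): boolean {
    const current = request.maxRetries ?? defaultMaxRetries;
    if (current >= MAX_SESSION_ROTATIONS) return false; // cap reached, give up
    request.maxRetries = current + 1;
    return true;
}
```

The trade-off versus throwing a plain RetryRequestError is exactly the one raised in the comment: the cap prevents infinite retries while still keeping blocked attempts out of the normal retry budget.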

Resolved review threads (outdated):
packages/http-crawler/src/internals/http-crawler.ts
packages/basic-crawler/src/internals/basic-crawler.ts
@B4nan (Member) left a comment

One last nit

(plus the e2e test)

Resolved review thread (outdated): packages/basic-crawler/src/internals/basic-crawler.ts
@barjin barjin merged commit 766fa9b into master Jul 19, 2023
7 checks passed
@barjin barjin deleted the feat/retryOnBlocked branch July 19, 2023 12:04