fix: EnqueueStrategy.All erroring with links using unsupported protocols (#2389)

This changes `EnqueueStrategy.All` to filter out non-http and non-https
URLs (`mailto:` links were causing the crawler to error).

Let me know if there's a better fix or if you want me to change
something.

Thanks!


```
Request failed and reached maximum retries. Error: Received one or more errors
    at _ArrayValidator.handle (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17)
    at _ArrayValidator.parse (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2)
    at RequestQueueClient.batchAddRequests (/path/to/project/node_modules/@crawlee/src/resource-clients/request-queue.ts:340:36)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:238:46)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:304:22)
    at attemptToAddToQueueAndAddAnyUnprocessed (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:302:42)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:319:37)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:309:22)
    at enqueueLinks (/path/to/project/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:384:2)
    at browserCrawlerEnqueueLinks (/path/to/project/node_modules/@crawlee/src/internals/browser-crawler.ts:777:21)
```
stefansundin committed May 15, 2024
1 parent 12210bd commit 8db3908
Showing 2 changed files with 2 additions and 1 deletion.
2 changes: 1 addition & 1 deletion packages/basic-crawler/src/internals/basic-crawler.ts
@@ -1620,7 +1620,7 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
         }
         case EnqueueStrategy.All:
         default: {
-            return true;
+            return baseUrl.protocol === 'http:' || baseUrl.protocol === 'https:';
         }
     }
 }
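
To illustrate the kind of check this hunk introduces, here is a minimal sketch of protocol-based filtering with the standard `URL` class. The helper `isHttpOrHttps` and the sample links are hypothetical and not part of Crawlee; the sketch only assumes that a parsed URL exposes its scheme through the `protocol` property (including the trailing colon), as in the diff above.

```typescript
// Hypothetical helper (not Crawlee code): accepts only http: and https: URLs,
// mirroring the comparison added in the diff above.
function isHttpOrHttps(href: string): boolean {
    let url: URL;
    try {
        // The URL constructor throws a TypeError for strings that are not absolute URLs.
        url = new URL(href);
    } catch {
        return false;
    }
    // `protocol` includes the trailing colon, e.g. 'mailto:' or 'https:'.
    return url.protocol === 'http:' || url.protocol === 'https:';
}

// Illustrative sample links, including the `mailto:` case from the commit message.
const sampleLinks: string[] = [
    'https://example.com/docs',
    'http://example.com/',
    'mailto:someone@example.com',
    'tel:+15555550100',
    'not a url',
];

// Only the first two entries pass the filter.
const enqueueable: string[] = sampleLinks.filter(isHttpOrHttps);
```

Links such as `mailto:` parse successfully as URLs, so a filter has to look at the scheme explicitly instead of relying on parse failures.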
1 change: 1 addition & 0 deletions packages/core/src/enqueue_links/enqueue_links.ts
@@ -353,6 +353,7 @@ export async function enqueueLinks(options: SetRequired<EnqueueLinksOptions, 're
         }
         case EnqueueStrategy.All:
         default:
+            enqueueStrategyPatterns.push({ glob: `http{s,}://**` });
             break;
     }
 }
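
The glob added above relies on brace expansion: `{s,}` has two alternatives, `s` and the empty string, so the pattern covers both `https://` and `http://`. The sketch below shows that expansion by hand. `expandFirstBraceGroup` is a hypothetical helper written for illustration, not the glob library's implementation, and the regular expression is only a rough equivalent that assumes `**` matches the rest of the URL.

```typescript
// Hypothetical helper (not a library API): expands the first `{a,b,...}` group
// in a pattern into one string per alternative.
function expandFirstBraceGroup(pattern: string): string[] {
    const match = /\{([^{}]*)\}/.exec(pattern);
    if (!match) return [pattern];
    const before = pattern.slice(0, match.index);
    const after = pattern.slice(match.index + match[0].length);
    // An empty alternative (as in `{s,}`) simply contributes nothing.
    return match[1].split(',').map((alternative) => `${before}${alternative}${after}`);
}

const expandedPatterns: string[] = expandFirstBraceGroup('http{s,}://**');
// expandedPatterns is ['https://**', 'http://**']

// Rough regex equivalent, under the assumption that `**` matches any remaining characters.
const httpOrHttpsPattern = /^https?:\/\/.*/;

const matchesWeb: boolean = httpOrHttpsPattern.test('https://example.com/a/b');
const matchesMailto: boolean = httpOrHttpsPattern.test('mailto:someone@example.com');
```

Under these assumptions a `mailto:` link matches neither expanded pattern, so it would not reach the request queue.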