Make HttpCrawler error codes configurable #1711

Closed
corford opened this issue Dec 8, 2022 · 0 comments · Fixed by #2035
Labels
feature Issues that represent new features or improvements to existing features.

corford commented Dec 8, 2022

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Feature

It would be great if the status codes _parseResponse() throws on were configurable.

Motivation

Some sites we scrape return 500 (instead of 404) for no longer available product pages, giving us no ability to work with the response via the requestHandler.

Ideal solution or implementation, and any additional constraints

Either a way to provide an explicit list of codes that should throw an error; or a way to provide an ignore list of error codes that should not be treated as errors.

Alternative solutions or implementations

AFAIK, currently the only way to get responses with status codes >= 500 treated as normal responses is to tamper with the status code in a postNavigation() hook.

corford added the feature label Dec 8, 2022
corford changed the title from "Make error codes that _parseResponse() throws on configurable" to "Make the error codes _parseResponse() throws on configurable" Dec 8, 2022
corford changed the title from "Make the error codes _parseResponse() throws on configurable" to "Make HttpCrawler error codes configurable" Mar 22, 2023
B4nan pushed a commit that referenced this issue Aug 18, 2023
This commit introduces two new optional properties to `CheerioCrawler`
and `HttpCrawler`, allowing for finer control over how HTTP error status
codes are handled:

1. `ignoreHttpErrorStatusCodes`: An array of HTTP response status codes
that should be excluded from being considered as errors. By default,
error consideration is triggered for status codes >= 500.

2. `additionalHttpErrorStatusCodes`: An array of extra HTTP response
status codes that should be treated as errors. By default, error
consideration is triggered for status codes >= 500.

These options provide flexibility in specifying which HTTP response
codes are treated as errors and which are ignored during the crawling
process.

Closes #1711
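
To make the interaction of the two options concrete, here is a minimal sketch of the decision the commit message describes. It is not Crawlee's actual implementation: the `isHttpErrorStatus` helper and the `ErrorStatusOptions` interface are hypothetical names, and the rule that an ignored code wins when it appears in both lists is an assumption. Only the two option names and the default threshold of >= 500 come from the commit message.

```typescript
// Hypothetical sketch, NOT Crawlee's real code. Only the two option names
// and the ">= 500" default are taken from the commit message above.
interface ErrorStatusOptions {
    // Status codes that should not be treated as errors.
    ignoreHttpErrorStatusCodes?: number[];
    // Extra status codes that should be treated as errors.
    additionalHttpErrorStatusCodes?: number[];
}

function isHttpErrorStatus(statusCode: number, options: ErrorStatusOptions = {}): boolean {
    const ignored = new Set(options.ignoreHttpErrorStatusCodes ?? []);
    const additional = new Set(options.additionalHttpErrorStatusCodes ?? []);

    // Assumption: an explicitly ignored code is never an error,
    // even if it is also listed as an additional error code.
    if (ignored.has(statusCode)) return false;
    if (additional.has(statusCode)) return true;

    // Default behaviour described in the commit message.
    return statusCode >= 500;
}

// Example: a site that returns 500 for removed product pages,
// while 404 should be treated as an error.
const options: ErrorStatusOptions = {
    ignoreHttpErrorStatusCodes: [500],
    additionalHttpErrorStatusCodes: [404],
};

console.log(isHttpErrorStatus(500, options)); // false
console.log(isHttpErrorStatus(404, options)); // true
console.log(isHttpErrorStatus(503, options)); // true
console.log(isHttpErrorStatus(200, options)); // false
```

With a rule like this, the motivating case from the issue (a 500 returned for a product page that is no longer available) would reach the requestHandler instead of throwing, while other 5xx codes would still be treated as errors.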