
HttpCrawler fails request without retries on 403 response without any Content-Type #1994

Closed · 1 task done

mvolfik opened this issue Jul 18, 2023 · 3 comments · Fixed by #2176
Labels: bug (Something isn't working.) · t-tooling (Issues with this label are in the ownership of the tooling team.)
mvolfik (Contributor) commented Jul 18, 2023

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Run the example code below. The endpoint returns a 403 response with just these headers:

HTTP/2 403 
content-length: 0
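
For context, a response like this can be reproduced with a minimal Node server (the server and port below are my own illustration, not part of the original setup):

import { createServer } from "node:http";

// Respond 403 with an empty body; Node adds no Content-Type header unless
// one is set explicitly, mimicking https://ftp.mvolfik.com/403-no-content-type.
createServer((_req, res) => {
    res.writeHead(403, { "content-length": "0" });
    res.end();
}).listen(8080);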

Crawlee fills in the default Content-Type application/octet-stream and fails the request in this check:

if (!this.supportedMimeTypes.has(type) && !this.supportedMimeTypes.has('*/*') && statusCode! < 500) {
    request.noRetry = true;
    throw new Error(`Resource ${request.url} served Content-Type ${type}, `
        + `but only ${Array.from(this.supportedMimeTypes).join(', ')} are allowed. Skipping resource.`);
}

ERROR HttpCrawler: Request failed and reached maximum retries. Error: Resource https://ftp.mvolfik.com/403-no-content-type served Content-Type application/octet-stream, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json are allowed. Skipping resource.
    at HttpCrawler._abortDownloadOfBody (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:541:19)
    at HttpCrawler.postNavigationHooks (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:242:45)
    at HttpCrawler._executeHooks (/tmp/amogus/node_modules/@crawlee/basic/internals/basic-crawler.js:900:23)
    at HttpCrawler._handleNavigation (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:337:20)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async HttpCrawler._runRequestHandler (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:287:13)
    at async wrap (/tmp/amogus/node_modules/@apify/timeout/index.js:52:21) {"id":"ey5j1V00zpDYPcb","url":"https://ftp.mvolfik.com/403-no-content-type","method":"GET","uniqueKey":"https://ftp.mvolfik.com/403-no-content-type"}
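
The application/octet-stream default comes from the Content-Type parsing fallback when the header is missing entirely; roughly like this (the names and the content-type dependency here are assumptions, not the actual Crawlee source):

import contentType from "content-type";

// With no Content-Type header at all, fall back to the generic binary type,
// which is not in HttpCrawler's default supportedMimeTypes -- hence the
// non-retryable failure in the check quoted above.
function parseContentTypeHeader(header: string | undefined): string {
    if (!header) return "application/octet-stream";
    return contentType.parse(header).type;
}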

Expected behavior: perform a standard request retry on the 403 response.

Code sample

import { HttpCrawler } from "@crawlee/http";
const crawler = new HttpCrawler({requestHandler() {}});
await crawler.run(["https://ftp.mvolfik.com/403-no-content-type"]);
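
Until this is fixed, one workaround might be to allow all MIME types through the existing additionalMimeTypes option, which makes the supportedMimeTypes.has('*/*') branch of the quoted check pass (a sketch; whether the 403 is then actually retried depends on the session and blocked-status handling):

import { HttpCrawler } from "@crawlee/http";

const crawler = new HttpCrawler({
    // '*/*' is merged into supportedMimeTypes, so the MIME-type check no
    // longer marks the request as non-retryable.
    additionalMimeTypes: ["*/*"],
    requestHandler() {},
});
await crawler.run(["https://ftp.mvolfik.com/403-no-content-type"]);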

Package version

3.4.1

Node.js version

18.16.1

Operating system

Linux

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

mvolfik added the bug and t-tooling labels on Jul 18, 2023
foxt451 (Collaborator) commented Aug 24, 2023

Discussed this with @mvolfik; here is his message:

> Also, what currently happens when we get a 403 response with a disallowed content type? For example, if some server returned all of its 403 blocked responses as image/jpeg, which isn't allowed in the crawler, but a retry with a new proxy would get a 200 with HTML as usual? This bug report might actually apply to that scenario as well, not sure.

So the purpose, I guess, is to still block unsupported response types, but retry them, because the block might be temporary.
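
In code terms, one way to get that behavior would be to keep the throw in the check quoted above but stop setting request.noRetry, so the usual maxRequestRetries logic (possibly with a new proxy or session) kicks in. A sketch of the idea only, not necessarily what #2176 ended up implementing:

if (!this.supportedMimeTypes.has(type) && !this.supportedMimeTypes.has('*/*') && statusCode! < 500) {
    // No request.noRetry = true here: the request still fails with a clear
    // error, but Crawlee is free to retry it (e.g. through another proxy).
    throw new Error(`Resource ${request.url} served Content-Type ${type}, `
        + `but only ${Array.from(this.supportedMimeTypes).join(', ')} are allowed.`);
}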

mvolfik (Contributor, Author) commented Nov 8, 2023

Just ran into this again with Yelp. @foxt451, did you do any work on this, or can I take over?

foxt451 (Collaborator) commented Nov 8, 2023

> Just ran into this again with Yelp. @foxt451, did you do any work on this, or can I take over?

Hi, nope, not that I remember. You can take it.
