throw new Error(`Resource ${request.url} served Content-Type ${type}, `
    + `but only ${Array.from(this.supportedMimeTypes).join(', ')} are allowed. Skipping resource.`);
ERROR HttpCrawler: Request failed and reached maximum retries. Error: Resource https://ftp.mvolfik.com/403-no-content-type served Content-Type application/octet-stream, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json are allowed. Skipping resource.
at HttpCrawler._abortDownloadOfBody (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:541:19)
at HttpCrawler.postNavigationHooks (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:242:45)
at HttpCrawler._executeHooks (/tmp/amogus/node_modules/@crawlee/basic/internals/basic-crawler.js:900:23)
at HttpCrawler._handleNavigation (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:337:20)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async HttpCrawler._runRequestHandler (/tmp/amogus/node_modules/@crawlee/http/internals/http-crawler.js:287:13)
at async wrap (/tmp/amogus/node_modules/@apify/timeout/index.js:52:21) {"id":"ey5j1V00zpDYPcb","url":"https://ftp.mvolfik.com/403-no-content-type","method":"GET","uniqueKey":"https://ftp.mvolfik.com/403-no-content-type"}
Expected behavior: do a standard request retry on 403 response.
Discussed this with @mvolfik; here is his message: Also, what currently happens when we get a 403 response with a disallowed content type? For example, a server might return all of its 403 blocked responses as image/jpeg, which the crawler does not allow, whereas retrying the request through a new proxy would get a 200 with HTML as usual. This bug report might apply to that scenario as well; not sure.
So I guess the goal is to keep blocking unsupported response types but still retry them, because the block might be temporary.
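A minimal sketch of the ordering suggested above: look at the status code first, and only apply the content-type allowlist to responses that are not treated as blocked. This is not Crawlee's actual implementation. `decideAction`, `Action`, and `BLOCKED_STATUS_CODES` are hypothetical names, and the set of blocked status codes is an assumption made for illustration. The allowed MIME types are copied from the error message in this report.

```typescript
// Hypothetical illustration only; none of these names are part of the Crawlee API.
type Action = 'process' | 'retry' | 'skip';

// MIME types listed as allowed in the error message from this report.
const SUPPORTED_MIME_TYPES = new Set([
    'text/html',
    'text/xml',
    'application/xhtml+xml',
    'application/xml',
    'application/json',
]);

// Assumption: status codes that indicate blocking and should trigger a retry.
const BLOCKED_STATUS_CODES = new Set([401, 403, 429]);

// Fallback used when the server sends no Content-Type header, as described in the report.
const DEFAULT_CONTENT_TYPE = 'application/octet-stream';

function decideAction(statusCode: number, contentType?: string): Action {
    // Check the status code first, so a blocked response is retried
    // regardless of the content type it was served with.
    if (BLOCKED_STATUS_CODES.has(statusCode)) return 'retry';

    // Strip parameters such as "; charset=utf-8" before comparing.
    const [rawType = ''] = (contentType ?? DEFAULT_CONTENT_TYPE).split(';');
    const mimeType = rawType.trim().toLowerCase();

    // Unsupported types on non-blocked responses are still skipped.
    return SUPPORTED_MIME_TYPES.has(mimeType) ? 'process' : 'skip';
}
```

With this ordering, `decideAction(403, undefined)` and `decideAction(403, 'image/jpeg')` both yield `'retry'`, while `decideAction(200, 'image/jpeg')` still yields `'skip'`.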
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/http (HttpCrawler)
Issue description
Run the example code. The endpoint returns 403 with just
Crawlee fills in the default Content-Type application/octet-stream and fails the request at
crawlee/packages/http-crawler/src/internals/http-crawler.ts
Lines 744 to 747 in d453f9c
Code sample
Package version
3.4.1
Node.js version
18.16.1
Operating system
Linux
Apify platform
I have tested this on the next release
No response
Other context
No response