The new (and amazing!) crawl_headers function seems quite aggressive.
Although it is very fast, I noticed that quite a few URLs fail, returning an error code in the 'status' column, even though they load fine in a browser.
Is there a way to throttle or retry?
Yes. Since the function uses only the HEAD method, it is extremely fast, and therefore sends many requests to the server in a short time. You would typically get a 429 "Too Many Requests" status code, asking you to slow down. This does not mean that there is anything wrong with the page.
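To confirm that rate limiting is what you're seeing, you can tally the status codes in the crawl output. A minimal sketch, assuming the headers were written to `"output_file.jl"` (the jsonlines file that `crawl_headers` produces) and that pandas is available:

```python
import pandas as pd

# Read the jsonlines output produced by crawl_headers
headers_df = pd.read_json("output_file.jl", lines=True)

# Count responses per status code; a large number of 429s
# means the server is asking you to slow down
print(headers_df["status"].value_counts())
```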
- **Retries:** the crawler already runs retries for each URL by default, and this can be modified using the `custom_settings` parameter.
- **Concurrent requests:** if you are getting a 429 error, you can slow down your crawling, also by using `custom_settings`.
You can use any combination of the following custom settings to manage your crawling speed as needed:
```python
import advertools as adv

adv.crawl_headers(
    URL_LIST,
    "output_file.jl",
    custom_settings={
        "RETRY_TIMES": 10,  # defaults to 2
        "CONCURRENT_REQUESTS": 1,  # defaults to 16
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # defaults to 8
        "DOWNLOAD_DELAY": 0.75,  # defaults to 0
    },
)
```
There are also some SEO crawling scripts and recipes that you might be interested in. Also, check out Scrapy's settings documentation for more details on what each of these settings means.
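If you'd rather not tune the delay by hand, Scrapy's AutoThrottle extension can adjust it dynamically based on server response times. A minimal sketch, assuming the same `URL_LIST` as above and that these standard Scrapy settings are passed through `custom_settings` in the same way:

```python
import advertools as adv

adv.crawl_headers(
    URL_LIST,
    "output_file.jl",
    custom_settings={
        "AUTOTHROTTLE_ENABLED": True,             # defaults to False
        "AUTOTHROTTLE_START_DELAY": 2,            # initial download delay in seconds
        "AUTOTHROTTLE_MAX_DELAY": 30,             # cap on the delay under high latency
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,   # average parallel requests per server
    },
)
```

With AutoThrottle enabled, Scrapy slows down automatically when the server responds slowly and speeds back up when it recovers, which is often gentler than a fixed `DOWNLOAD_DELAY`.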