
crawl_headers is too 'aggressive', leads to wrong status codes (e.g. retry/max_reached) #226

Closed
alexhopes opened this issue Jul 28, 2022 · 2 comments


@alexhopes

The new (and amazing!) crawl_headers function seems quite aggressive.

Although it is very fast, I noticed that quite a few URLs fail and respond with an error code in column 'status' (although they load quite fine in a browser).

Is there a way to throttle or retry?

@eliasdabbas
Owner

Thanks @alexhopes
Glad you liked it!

Yes, since the function only uses the HEAD method, it is extremely fast, and therefore sends many requests to the server in a short time. You would typically get a 429 "Too Many Requests" status code, asking you to slow down. This does not mean that there is anything wrong with the page itself.

  • Retries: the crawler already runs retries for each URL by default, and this can be modified using the custom_settings parameter.

  • Concurrent requests: if you are getting 429 errors, you can also slow down your crawling using custom_settings.

You can use any combination of the following custom settings to manage your crawling speed as needed:

import advertools as adv

adv.crawl_headers(URL_LIST, "output_file.jl",
    custom_settings={
        "RETRY_TIMES": 10,  # defaults to 2
        "CONCURRENT_REQUESTS": 1,  # defaults to 16
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # defaults to 8
        "DOWNLOAD_DELAY": 0.75,  # defaults to 0
    })

There are also some SEO crawling scripts and recipes that you might be interested in. Also, check out Scrapy's settings for more details on what each of them means.
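As a quick sanity check after re-crawling with slower settings, you can count the status codes in the output file (crawl_headers writes one JSON object per line, with the 'status' column you mentioned). Here is a minimal sketch using made-up sample records standing in for a real crawl output:

```python
import json
from collections import Counter

# Simulate a small crawl output file (jsonlines, one record per URL).
# These records are made up for illustration; a real file would come
# from the crawl_headers call above.
records = [
    {"url": "https://example.com/a", "status": 200},
    {"url": "https://example.com/b", "status": 429},
    {"url": "https://example.com/c", "status": 200},
]
with open("output_file.jl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Count status codes: a large share of 429s means you should tighten
# DOWNLOAD_DELAY / CONCURRENT_REQUESTS further and crawl again.
with open("output_file.jl") as f:
    statuses = Counter(json.loads(line)["status"] for line in f)

print(statuses)  # Counter({200: 2, 429: 1})
```

If the 429s disappear on the re-crawl, the original errors were rate limiting rather than genuinely broken pages.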

Hope that helps.

@alexhopes
Author

Amazing, that really helped. Thank you for being so kind and properly explaining everything!
