
crawl_headers is too 'aggressive', leads to wrong status codes (e.g. retry/max_reached) #226

Closed
alexhopes opened this issue Jul 28, 2022 · 2 comments


@alexhopes

The new (and amazing!) crawl_headers function seems quite aggressive.

Although it is very fast, I noticed that quite a few URLs fail and respond with an error code in column 'status' (although they load quite fine in a browser).

Is there a way to throttle or retry?

@eliasdabbas
Owner

Thanks @alexhopes
Glad you liked it!

Yes, since the function only uses the HEAD method, it is extremely fast, and therefore sends many requests to the server in a short time. You would typically get a 429 "Too Many Requests" status code, asking you to slow down. This does not mean that there is anything wrong with the page itself.

  • Retries: the crawler already runs retries for each URL by default, and this can be modified using the custom_settings parameter.

  • Concurrent requests: if you are getting 429 errors, you can also slow down your crawling using custom_settings.

You can use any combination of the following custom settings to manage your crawling speed as needed:

import advertools as adv

adv.crawl_headers(URL_LIST, "output_file.jl",
    custom_settings={
        "RETRY_TIMES": 10,  # defaults to 2
        "CONCURRENT_REQUESTS": 1,  # defaults to 16
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # defaults to 8
        "DOWNLOAD_DELAY": 0.75,  # defaults to 0
    })

There are also some SEO crawling scripts and recipes that you might be interested in. Also, check out Scrapy's settings for more details on what each of them means.
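As a quick sanity check after re-crawling with slower settings, you can count the status codes in the output file (crawl_headers writes one JSON object per line, with the 'status' column you mentioned). Here is a minimal sketch using made-up sample records standing in for a real crawl output:

```python
import json
from collections import Counter

# Simulate a small crawl output file (jsonlines, one record per URL).
# These records are made up for illustration; a real file would come
# from the crawl_headers call above.
records = [
    {"url": "https://example.com/a", "status": 200},
    {"url": "https://example.com/b", "status": 429},
    {"url": "https://example.com/c", "status": 200},
]
with open("output_file.jl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Count status codes: a large share of 429s means you should tighten
# DOWNLOAD_DELAY / CONCURRENT_REQUESTS further and crawl again.
with open("output_file.jl") as f:
    statuses = Counter(json.loads(line)["status"] for line in f)

print(statuses)  # Counter({200: 2, 429: 1})
```

If the 429s disappear on the re-crawl, the original errors were rate limiting rather than genuinely broken pages.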

Hope that helps.

@alexhopes
Author

Amazing, that really helped. Thank you for being so kind and properly explaining everything!
