Discussion: Store state for fault-tolerance crawler #75
Comments
You can use a custom pipeline to update your database tables with your scraped data. When your spiders start, the starting URLs can then be queried from the database. You can also use an Ecto-based job queue like Honeydew to poll your database for things to be scraped and start up the relevant spiders accordingly, or use Mnesia to persist state.
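For illustration, a minimal sketch of such a pipeline, assuming a hypothetical Ecto repo `MyCrawler.Repo` and a `MyCrawler.Page` schema with a unique index on `:url` (neither is part of Crawly); depending on your Crawly version, the callback is `run/2` or `run/3`:

```elixir
defmodule MyCrawler.Pipelines.PersistItem do
  # Hypothetical custom pipeline: upserts each scraped item into Postgres
  # via Ecto so results survive a crash. MyCrawler.Repo and MyCrawler.Page
  # are assumptions, not Crawly modules.
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Upsert keyed on URL so re-crawled pages do not create duplicates.
    %MyCrawler.Page{}
    |> MyCrawler.Page.changeset(item)
    |> MyCrawler.Repo.insert(on_conflict: :replace_all, conflict_target: :url)

    {item, state}
  end
end
```

The pipeline is then enabled by adding the module to the `:pipelines` list in your Crawly config.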
@Ziinc Thanks again. Will do soon.
Pitching in to mention CubQ as a way to implement durable queues in an embedded database. Full disclosure: I am the author of the library. I created it mostly for embedded software scenarios, but I think it would fit well for keeping a crawler queue too.
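Roughly as in the CubQ README, a queue is persisted in an embedded CubDB database, so pending requests survive restarts; the data directory and queue name below are arbitrary examples:

```elixir
# Start the embedded CubDB store and a durable queue on top of it.
{:ok, db} = CubDB.start_link(data_dir: "crawler/queue_data")
{:ok, queue} = CubQ.start_link(db: db, queue: :requests)

# Enqueued entries are written to disk before being handed out again.
:ok = CubQ.enqueue(queue, "https://example.com/page/1")
CubQ.dequeue(queue)
# => {:ok, "https://example.com/page/1"}
```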
For improved fault tolerance, using persistent queues for requests and scraped items would definitely be good. However, this would add a dependency, and it would be hard to argue for CubQ (backed by CubDB) over more established queue libraries backed by Mnesia.
I don't think there is a strong case for improving fault tolerance and stability now, while other features remain to be implemented. Perhaps we can reopen this in the future when there is a v1.
I think Postgres would be a good way to store the spider state, so that if the system crashes the crawl can continue from where it stopped.
Are there any recommendations or suggestions on how to implement this in my current Crawly project?
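For concreteness, something like this sketch is what I have in mind. `MyCrawler.Repo` and the `pages` table (with `url`/`status` columns) are hypothetical, and depending on the Crawly version the callback may be `init/0` or `init/1`:

```elixir
defmodule MyCrawler.ResumableSpider do
  # Hypothetical sketch: on (re)start, seed start_urls from rows still
  # marked pending in Postgres, so a crashed crawl resumes where it
  # stopped. The table and repo are assumptions, not Crawly APIs.
  @behaviour Crawly.Spider

  import Ecto.Query

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    # Schemaless query against the "pages" table for unfinished URLs.
    pending =
      from(p in "pages", where: p.status == "pending", select: p.url)
      |> MyCrawler.Repo.all()

    [start_urls: pending]
  end

  @impl Crawly.Spider
  def parse_item(_response) do
    # Parse the page and emit items/follow-up requests as usual; a
    # persistence pipeline would flip each row's status to "done".
    %Crawly.ParsedItem{items: [], requests: []}
  end
end
```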