Discussion: Store state for fault-tolerance crawler #75

Closed
ghost opened this issue Apr 1, 2020 · 5 comments

Comments


ghost commented Apr 1, 2020

I think Postgres would be a good way to store the spider state so that, if the system crashes, it can continue crawling from where it stopped.

Are there any recommendations or suggestions on how to implement this in my current Crawly project?

@Ziinc Ziinc changed the title Store state for fault-tolerance crawler Discussion: Store state for fault-tolerance crawler Apr 1, 2020
Collaborator

Ziinc commented Apr 1, 2020

You can use a custom pipeline to update your database tables with your scraped data.

When your spiders are started, your starting URLs can then be queried from the database.

You can use an Ecto-based job queue like Honeydew to poll your database for things to be scraped and start up the relevant spiders accordingly. Alternatively, you can use Mnesia to persist state.
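
To make the idea above concrete, here is a minimal, language-agnostic sketch of a crawl-state table. It is written in Python with the standard-library `sqlite3` module standing in for Postgres, and it is not Crawly's API. The table name `crawl_state` and the helpers `init_db`, `enqueue`, `mark_done`, and `pending_urls` are hypothetical names chosen for illustration. In a Crawly project, the `mark_done` step would presumably live in a custom pipeline, and `pending_urls` would feed the spider's starting URLs.

```python
import sqlite3


def init_db(conn):
    # One row per URL; status is 'pending' until the page has been scraped.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawl_state ("
        "url TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()


def enqueue(conn, url):
    # INSERT OR IGNORE leaves an existing row untouched, so a URL that is
    # already marked 'done' is not reset to 'pending'.
    conn.execute("INSERT OR IGNORE INTO crawl_state (url) VALUES (?)", (url,))
    conn.commit()


def mark_done(conn, url):
    # Called once the scraped item for this URL has been stored.
    conn.execute("UPDATE crawl_state SET status = 'done' WHERE url = ?", (url,))
    conn.commit()


def pending_urls(conn):
    # After a restart, these are the URLs the spider should start from.
    rows = conn.execute(
        "SELECT url FROM crawl_state WHERE status = 'pending' ORDER BY rowid"
    )
    return [row[0] for row in rows]
```

On startup, the crawler would call `pending_urls` and resume from whatever was left unfinished, rather than from a hard-coded list of start URLs.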

Author

ghost commented Apr 3, 2020

@Ziinc Thanks again. Will do soon.

@Ziinc Ziinc closed this as completed Apr 13, 2020

lucaong commented Apr 16, 2020

Pitching in to mention CubQ as a way to implement durable queues in an embedded database. Full disclosure: I am the author of the library. I created it mostly for embedded software scenarios, but I think it would also be a good fit for keeping a crawler queue.

Collaborator

Ziinc commented Apr 19, 2020

For improved fault tolerance, using persistent queues for requests/ScrapedItems would definitely be good.

However, this would involve an additional dependency, and it would be hard to argue for CubQ (backed by CubDB) over other more established queue libraries backed by Mnesia.
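
For readers unfamiliar with the idea, a persistent queue with acknowledgements can be sketched roughly as follows. This is a generic Python illustration on top of `sqlite3`, not the API of CubQ, Mnesia, or Crawly. The class `DurableQueue` and its methods are hypothetical names. The key point is that a popped request stays on disk until it is acknowledged, so a crash between `pop` and `ack` does not lose it.

```python
import sqlite3


class DurableQueue:
    """FIFO queue stored in an SQLite file; items survive process restarts."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "in_flight INTEGER NOT NULL DEFAULT 0)"
        )
        # Items popped but never acknowledged by a previous (crashed) run
        # are made available again.
        self.conn.execute("UPDATE queue SET in_flight = 0")
        self.conn.commit()

    def push(self, payload):
        self.conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
        self.conn.commit()

    def pop(self):
        # Return the oldest available item as (id, payload), or None if empty.
        row = self.conn.execute(
            "SELECT id, payload FROM queue WHERE in_flight = 0 "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.conn.execute("UPDATE queue SET in_flight = 1 WHERE id = ?", (row[0],))
        self.conn.commit()
        return row

    def ack(self, item_id):
        # Only an acknowledged item is removed for good.
        self.conn.execute("DELETE FROM queue WHERE id = ?", (item_id,))
        self.conn.commit()

    def close(self):
        self.conn.close()
```

The sketch assumes a single consumer process; with several workers, the select-then-update in `pop` would need to be made atomic.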

Collaborator

Ziinc commented Apr 19, 2020

I don't think there is a strong case for improving fault tolerance and stability right now, while other features still need to be implemented.

Perhaps in the future, we can reopen this when there is a v1.
