Discussion: Store state for fault-tolerance crawler #75
Comments
You can use a custom pipeline to update your database tables with your scraped data. When your spiders start, the starting URLs can then be queried from the database. You can also use an Ecto-based job queue like Honeydew to poll your database for things to be scraped and start up the relevant spiders accordingly, or use Mnesia to persist state.
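For illustration, a minimal sketch of such a pipeline, assuming a hypothetical Ecto repo `MyCrawler.Repo` and a `MyCrawler.Page` schema with a unique index on `:url` (neither is part of Crawly); depending on your Crawly version, the callback is `run/2` or `run/3`:

```elixir
defmodule MyCrawler.Pipelines.PersistItem do
  # Hypothetical custom pipeline: upserts each scraped item into Postgres
  # via Ecto so results survive a crash. MyCrawler.Repo and MyCrawler.Page
  # are assumptions, not Crawly modules.
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Upsert keyed on URL so re-crawled pages do not create duplicates.
    %MyCrawler.Page{}
    |> MyCrawler.Page.changeset(item)
    |> MyCrawler.Repo.insert(on_conflict: :replace_all, conflict_target: :url)

    {item, state}
  end
end
```

The pipeline is then enabled by adding the module to the `:pipelines` list in your Crawly config.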
@Ziinc Thanks again. Will do soon.
Pitching in to mention CubQ as a way to implement durable queues in an embedded database. Full disclosure: I am the author of the library. I created it mostly for embedded software scenarios, but I think it would fit well for keeping a crawler queue too.
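Roughly as in the CubQ README, a queue is persisted in an embedded CubDB database, so pending requests survive restarts; the data directory and queue name below are arbitrary examples:

```elixir
# Start the embedded CubDB store and a durable queue on top of it.
{:ok, db} = CubDB.start_link(data_dir: "crawler/queue_data")
{:ok, queue} = CubQ.start_link(db: db, queue: :requests)

# Enqueued entries are written to disk before being handed out again.
:ok = CubQ.enqueue(queue, "https://example.com/page/1")
CubQ.dequeue(queue)
# => {:ok, "https://example.com/page/1"}
```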
For improved fault tolerance, using persistent queues for requests and scraped items would definitely be good. However, this would add a dependency, and it would be hard to argue for CubQ (backed by CubDB) over more established queue libraries backed by Mnesia.
I don't think there is a strong case for improving fault tolerance and stability now, while other features remain to be implemented. Perhaps we can reopen this in the future when there is a v1.
I think Postgres would be a good way to store the spider state, so that if the system crashes the crawl can continue from where it stopped.
Are there any recommendations or suggestions on how to implement this in my current Crawly project?
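For concreteness, something like this sketch is what I have in mind. `MyCrawler.Repo` and the `pages` table (with `url`/`status` columns) are hypothetical, and depending on the Crawly version the callback may be `init/0` or `init/1`:

```elixir
defmodule MyCrawler.ResumableSpider do
  # Hypothetical sketch: on (re)start, seed start_urls from rows still
  # marked pending in Postgres, so a crashed crawl resumes where it
  # stopped. The table and repo are assumptions, not Crawly APIs.
  @behaviour Crawly.Spider

  import Ecto.Query

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    # Schemaless query against the "pages" table for unfinished URLs.
    pending =
      from(p in "pages", where: p.status == "pending", select: p.url)
      |> MyCrawler.Repo.all()

    [start_urls: pending]
  end

  @impl Crawly.Spider
  def parse_item(_response) do
    # Parse the page and emit items/follow-up requests as usual; a
    # persistence pipeline would flip each row's status to "done".
    %Crawly.ParsedItem{items: [], requests: []}
  end
end
```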