We believe that web scraping is a process. Extracting the first few data items may seem easy, but reliable data delivery takes more effort, and a process that supports it.
Our aim is to provide you with the following services:
- Schedule (start and stop) your spiders in the cloud
- View running jobs (performance based analysis)
- View and validate scraped items for quality assurance and data analysis purposes.
- View individual items and compare them with the actual website.
You can find setup examples here
At a high level, you need to:
- Add the SendToUI pipeline to your list of item pipelines (before any encoder pipelines):
{Crawly.Pipelines.Experimental.SendToUI, ui_node: :"ui@127.0.0.1"}
- Set up an Erlang cluster so that Crawly nodes can find the CrawlyUI node. The example above uses the erlang-node-discovery application for this task, but any alternative would also work. To set up erlang-node-discovery:
- add the following dependency to the deps section of mix.exs:
{:erlang_node_discovery, git: "https://github.com/oltarasenko/erlang-node-discovery"}
- add the following lines to the config.exs:
config :erlang_node_discovery,
  hosts: ["127.0.0.1", "crawlyui.com"],
  node_ports: [
    {:ui, 0}
  ]
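Putting the steps above together, a worker's config.exs might look roughly like the sketch below. It is only an illustration under stated assumptions: the validation fields, the output folder, and the choice of surrounding pipelines are hypothetical, and the `:erlang_node_discovery` config key is assumed to match the dependency's application name. The only part taken from the steps above is the position of SendToUI before the encoder and the discovery settings themselves.

```elixir
# config/config.exs: illustrative sketch, not the project's shipped configuration
import Config

config :crawly,
  pipelines: [
    # Hypothetical validation and de-duplication steps; adjust fields to your items.
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
    # SendToUI is placed before any encoder pipeline, as required above.
    {Crawly.Pipelines.Experimental.SendToUI, ui_node: :"ui@127.0.0.1"},
    # Encoder and writer pipelines come last (folder and extension are examples).
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "jl"}
  ]

# Assumption: discovery settings live under the dependency's application name.
config :erlang_node_discovery,
  hosts: ["127.0.0.1", "crawlyui.com"],
  node_ports: [
    {:ui, 0}
  ]
```

To check that the cluster is formed, you can run `Node.ping(:"ui@127.0.0.1")` from an IEx session on a worker node. It returns `:pong` when the UI node is reachable and `:pang` otherwise.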
CrawlyUI ships with a Docker Compose setup that brings up the UI, worker, and database nodes, so everything is ready for testing with just one command.
To try it:
- clone the crawly_ui repo:
git clone git@github.com:oltarasenko/crawly_ui.git
- build the UI and worker nodes:
docker-compose build
- apply migrations:
docker-compose run ui bash -c "/crawlyui/bin/ec eval \"CrawlyUI.ReleaseTasks.migrate\""
- run it all:
docker-compose up
A live demo is available as well. However, it might be a bit unstable due to our continuous release process. Please give it a try and let us know what you think.
One of the cool features of CrawlyUI is the item browser, which lets you compare extracted data with the target website loaded in an iframe. However, since sites may block iframes, a browser extension that ignores X-Frame-Options headers can be used as a workaround. For example: Chrome extension