A project of Artificial Informer Labs.
AutoScrape is an automated scraper of structured data from interactive web pages. Point it at a site and it will crawl the pages, search for forms, and extract structured data. No brittle, site-specific programming necessary.
This is an implementation of the web scraping framework described in the paper "Robust Web Scraping in the Public Interest with AutoScrape," published in the Proceedings of the Computation + Journalism Symposium 2019.
Currently there are two methods of running AutoScrape:
- as a local CLI python script
- as a containerized system via the API
Installation and running instructions are provided for both below.
Setup for Standalone Local CLI
You need to have geckodriver installed. You can do that here:
Version 0.23.0 is recommended as of November 2018, along with Firefox version >= 63.
If you prefer to use Chrome, you will need the ChromeDriver (we've tested using v2.41). It can be found in your distribution's package manager or here:
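Before moving on, a quick sanity check is to ask the driver for its version on the command line (shown here for geckodriver; the same idea works for chromedriver):

```shell
# Print the installed geckodriver version, or a hint if it is not on PATH.
if command -v geckodriver >/dev/null 2>&1; then
    geckodriver --version | head -n 1
else
    echo "geckodriver not found on PATH"
fi
```

If the version line prints, Selenium will be able to find the driver.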
Installing the remaining Python dependencies can be done using pip or pipenv:
Pip Install Method
Next, set up your Python virtual environment (Python 3.6 required) and install the Python dependencies:
pip install -r requirements.txt
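For the pip route, a typical end-to-end setup looks like this (the virtualenv name `venv` is just an example):

```shell
# Create a Python 3 virtual environment, activate it, and install the
# project's dependencies from requirements.txt.
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
```

Remember to re-activate the virtualenv (`. venv/bin/activate`) in each new shell session before running the scraper.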
AutoScrape also supports pipenv. You can install required dependencies by running:
Running Standalone Scraper
Environment Test Crawler
You can run a test to ensure your webdriver is set up correctly by running the test crawler:
./scrape.py --show-browser [SITE_URL]
The test crawler will do a depth-first, click-only crawl of an entire website. It will not interact with forms or POST data. Data will be saved to
./autoscrape-data/ (the default output directory).
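After a crawl finishes, a short script like the following gives a quick summary of what was saved. The flat count-by-extension view is just an illustration; the exact layout under the output directory is AutoScrape's own:

```python
from collections import Counter
from pathlib import Path

def summarize_output(output_dir="./autoscrape-data"):
    """Count saved files by extension under the AutoScrape output directory."""
    counts = Counter(
        p.suffix or "(no extension)"
        for p in Path(output_dir).rglob("*")
        if p.is_file()
    )
    return dict(counts)

if __name__ == "__main__":
    for ext, n in sorted(summarize_output().items()):
        print(f"{ext}: {n}")
```

Running it after the test crawl above should show mostly HTML pages, plus any assets the crawler downloaded.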
Manual Config-Based Scraper
AutoScrape has a manually controlled mode, similar to wget, except that it uses interactive browser capabilities and can input data into search forms, follow "next page"-type buttons, etc. This functionality can be used either as a standalone crawler/scraper or as a method for building a training set for the automated scrapers.
AutoScrape manual-mode full options:
Setup Containerized API Version
AutoScrape can also be run as a containerized cluster environment, where scrapes can be triggered and stopped via API calls and data can be streamed back to the server.
docker-compose build --pull
docker-compose up -t0 --abort-on-container-exit
This will build the containers and launch an API server
running on local port 5000. More information about the API calls
can be found in
If you have make installed, you can simply run
NOTE: This is a work in progress prototype that will likely be removed once AutoScrape is integrated into CJ Workbench.