
# Web Crawler

A configurable, multi-threaded crawler that crawls `*.gov.si` sites by default.

## Setup

1. Install the requirements:

   ```
   pip install -r requirements.txt
   ```

2. Set up the database with the `crawldb.sql` script.
3. Edit `settings.py` to specify the driver location, database credentials, and defaults.
4. Start the crawler:

   ```
   python -m crawler.Crawler
   ```
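The repository's actual `settings.py` is not reproduced here, but based on the fields named in the steps above it might look roughly like this. All variable names and values below are illustrative assumptions, not the project's real settings:

```python
# Hypothetical sketch of settings.py -- every name and value here is an
# assumption inferred from the setup steps, not the project's actual file.

# Location of the WebDriver binary used to render pages
WEB_DRIVER_LOCATION = "/usr/local/bin/chromedriver"

# Credentials for the database created by crawldb.sql
DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "crawldb"
DB_USER = "crawler"
DB_PASSWORD = "change-me"

# Crawl defaults
DEFAULT_WORKERS = 4
SEED_URLS = ["https://www.gov.si/"]
ALLOWED_DOMAIN = ".gov.si"
```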

Optionally, you can specify the number of workers with the `-n` flag; the default number of workers is configured in `settings.py`.

```
python -m crawler.Crawler -n 6
```
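A minimal sketch of how a `-n` flag like the one above could be parsed with `argparse`. The `-n` flag itself comes from the example; the `--workers` long form and the fallback default are assumptions, not the project's actual code:

```python
import argparse

# Assumed fallback; in the real project this would come from settings.py
DEFAULT_WORKERS = 4

def parse_args(argv=None):
    """Parse crawler command-line arguments (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Configurable multi-threaded crawler"
    )
    parser.add_argument(
        "-n", "--workers",
        type=int,
        default=DEFAULT_WORKERS,
        help="number of worker threads (default: %(default)s)",
    )
    return parser.parse_args(argv)

args = parse_args(["-n", "6"])
print(args.workers)  # 6
```

Passing `argv=None` makes `argparse` read `sys.argv`, so the same function works both in tests and when invoked from the command line.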

## Database backup

The database is large, so we provide an external link to the backup file [1 GB].

## Report and analysis

The report is available here.

The code used to produce the plots and additional analysis is available in this Jupyter notebook.