
# Web Crawler

A configurable, multi-threaded crawler that crawls `*.gov.si` sites by default.

## Setup

1. Install the requirements:

   ```
   pip install -r requirements.txt
   ```

2. Set up the database with the `crawldb.sql` script.
3. Edit `settings.py` to specify the driver location, database credentials, and defaults.
4. Start the crawler:

   ```
   python -m crawler.Crawler
   ```
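The repository's actual `settings.py` is not reproduced here, but based on the fields named in the steps above it might look roughly like this. All variable names and values below are illustrative assumptions, not the project's real settings:

```python
# Hypothetical sketch of settings.py -- every name and value here is an
# assumption inferred from the setup steps, not the project's actual file.

# Location of the WebDriver binary used to render pages
WEB_DRIVER_LOCATION = "/usr/local/bin/chromedriver"

# Credentials for the database created by crawldb.sql
DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "crawldb"
DB_USER = "crawler"
DB_PASSWORD = "change-me"

# Crawl defaults
DEFAULT_WORKERS = 4
SEED_URLS = ["https://www.gov.si/"]
ALLOWED_DOMAIN = ".gov.si"
```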

Optionally, you can specify the number of workers with the `-n` flag; the default number of workers is configured in `settings.py`.

```
python -m crawler.Crawler -n 6
```
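A minimal sketch of how a `-n` flag like the one above could be parsed with `argparse`. The `-n` flag itself comes from the example; the `--workers` long form and the fallback default are assumptions, not the project's actual code:

```python
import argparse

# Assumed fallback; in the real project this would come from settings.py
DEFAULT_WORKERS = 4

def parse_args(argv=None):
    """Parse crawler command-line arguments (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Configurable multi-threaded crawler"
    )
    parser.add_argument(
        "-n", "--workers",
        type=int,
        default=DEFAULT_WORKERS,
        help="number of worker threads (default: %(default)s)",
    )
    return parser.parse_args(argv)

args = parse_args(["-n", "6"])
print(args.workers)  # 6
```

Passing `argv=None` makes `argparse` read `sys.argv`, so the same function works both in tests and when invoked from the command line.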

## Database backup

The database is large, so we provide an external link to the backup file [1 GB].

## Report and analysis

The report is available here.

The code used to produce the plots and additional analysis is available in this Jupyter notebook.