Configurable and multi-threaded crawler that crawls *.gov.si
sites by default.
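For orientation, the default scope restriction amounts to a host-suffix check along these lines (a minimal sketch; the helper name `in_default_scope` and its logic are illustrative assumptions, not the crawler's actual code):

```python
# Sketch of the default *.gov.si scope check; the helper name and logic
# are illustrative assumptions, not the crawler's actual code.
from urllib.parse import urlparse

def in_default_scope(url: str, suffix: str = "gov.si") -> bool:
    """Return True if the URL's host is gov.si or a *.gov.si subdomain."""
    host = urlparse(url).hostname or ""
    return host == suffix or host.endswith("." + suffix)
```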
- Install the requirements:

  ```
  pip install -r requirements.txt
  ```

- Set up the database with the `crawldb.sql` script.
- Edit `settings.py` to specify the driver location, database credentials, and defaults (a sketch of the expected layout follows this list).
- Start the crawler:

  ```
  python -m crawler.Crawler
  ```
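For reference, `settings.py` is expected to hold values along these lines (the variable names and values below are assumptions based on the description above, not the actual file):

```python
# Hypothetical settings.py layout; all names and values are assumptions.
WEB_DRIVER_LOCATION = "/usr/local/bin/chromedriver"  # path to the browser driver

DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "crawldb"
DB_USER = "crawler"
DB_PASSWORD = "secret"

DEFAULT_WORKERS = 4  # used when -n is not passed on the command line
```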
Optionally, you can specify the number of workers with the `-n` flag; the default number of workers is configured in `settings.py`.

```
python -m crawler.Crawler -n 6
```
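For orientation, the `-n` flag could plausibly be wired to a thread pool like this (a sketch assuming the workers are plain threads; apart from the `-n` flag itself, all names here are illustrative):

```python
# Sketch of how -n could map to worker threads; everything except the
# -n flag itself is an illustrative assumption.
import argparse
from concurrent.futures import ThreadPoolExecutor

DEFAULT_WORKERS = 4  # stand-in for the value configured in settings.py

def crawl_worker(worker_id: int) -> None:
    # Placeholder: a real worker would repeatedly take a URL from the
    # shared frontier, fetch it, and write the results to the database.
    print(f"worker {worker_id} started")

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the crawler")
    parser.add_argument("-n", type=int, default=DEFAULT_WORKERS,
                        help="number of worker threads")
    args = parser.parse_args()
    with ThreadPoolExecutor(max_workers=args.n) as pool:
        for i in range(args.n):
            pool.submit(crawl_worker, i)

if __name__ == "__main__":
    main()
```

A thread pool is a natural fit here because crawling is I/O-bound: the workers spend most of their time waiting on network responses rather than competing for the CPU.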
The database is large, so we include an external link to the backup file [1 GB].
The report is available here.
The code used to produce the plots and additional analysis is available in this Jupyter notebook.