public bad code that crawls tor for terrible homemade spaghetti analytics
.vscode
alembic
containers
spidercommon
tests
tools
torspider
.dockerignore
.editorconfig
.envrc.example
.gitignore
.gitmodules
Dockerfile
LICENCE.md
README.md
alembic.ini
docker-compose.production.yml
init_db.py
requirements.txt
scrapy.cfg
setup.py

torspider

It does things that crawl Tor.

The initial ideas were inspired by terrible jokes on Discord about Tor analytics. Much of the crawling code leans on this crawler, which saved reinventing the wheel.

Licence is AGPL.

Notes

  • docker-compose run --rm spider python init_db.py initialises the DB.
  • docker-compose up --scale spider=4 -d starts four spider containers in the background for some nice multispider crawling.
  • Rebloom is a required Redis module for duplicate URL filtering.
  • POSTGRES_URL is assumed to point at a bouncer that does its own connection pooling, such as pgbouncer or pgpool.
  • The database MUST be Postgres, because the code relies on Postgres-specific features.
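
To give a feel for what the Rebloom note means, here is a minimal in-memory sketch of Bloom-filter duplicate URL filtering in plain Python. It is not the code this repository uses: the `BloomFilter` class, the `should_crawl` helper, and the size and hash-count defaults are all hypothetical and chosen only for illustration. In a real deployment the filter would live in Redis, where the module's `BF.ADD` command plays roughly the role of `add` below (it reports whether the item was newly inserted).

```python
import hashlib


class BloomFilter:
    """Tiny in-memory Bloom filter, for illustration only (hypothetical helper)."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        # size_bits and num_hashes are assumed defaults, not tuned values.
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        """Insert item; return True if at least one of its bits was not yet set."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def should_crawl(seen, url):
    """Return True the first time a URL is offered, False for repeats."""
    return seen.add(url)
```

A spider would call `should_crawl(seen, url)` before scheduling a request and skip the URL when it returns False. The trade-off is the usual one for Bloom filters: memory use stays fixed, but a small fraction of never-seen URLs can be reported as duplicates and skipped.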