This project implements a dummy distributed web crawler that counts words, built as a Map/Reduce pipeline on top of Celery and Redis. It showcases:
- DDD: Domain-Driven Design
- TDD: Test-Driven Development
- micro-service architecture
- queue + workers
- logs
- metrics (Prometheus + Grafana)
- Celery Flower
- CLI
- Makefile
- dependency management with Poetry
- type annotations
- tests with pytest (Celery tasks, service layer)
- Docker
- Docker Compose
Start the stack with:

```shell
$ docker compose up -d
```
The Celery workers are then ready to pick up and process tasks. The Celery scheduler can also be configured to trigger tasks periodically.
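As a hedged sketch of what that scheduler configuration could look like, here is a minimal Celery beat schedule; the module path, task name, and arguments are assumptions, not this project's actual names:

```python
# Minimal sketch of a Celery beat schedule (app name, task name,
# and arguments are assumptions, not this project's actual names).
from celery import Celery
from celery.schedules import crontab

app = Celery("crawler", broker="redis://redis:6379/0", backend="redis://redis:6379/0")

app.conf.beat_schedule = {
    "hourly-crawl": {
        "task": "crawler.tasks.crawl",  # hypothetical task name
        "schedule": crontab(minute=0),  # top of every hour
        "args": ("bicycle",),           # example word to count
    },
}
```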
On top of the Celery-related containers, some extra containers are provided for:

- monitoring
  - Celery Flower
  - Grafana
  - Prometheus
- infrastructure
  - Redis: used as the broker for Celery as well as its result backend
  - nginx: used as a reverse proxy to access:
| service | URL |
|---|---|
| flower | http://localhost/flower |
| grafana | http://localhost |
Trigger a crawl, or schedule one, from the CLI:

```shell
$ docker compose exec crawler python cli/crawl.py
$ docker compose exec crawler python cli/schedule_task.py schedule --when asap
```
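For illustration, `cli/crawl.py` might look roughly like the following sketch; the Celery app import path and task name are assumptions about this project's layout:

```python
# Hypothetical sketch of cli/crawl.py; the import path and task name
# are assumptions, not this project's actual code.
import argparse

from crawler.celery_app import app  # assumed module path


def main() -> None:
    parser = argparse.ArgumentParser(description="Trigger a distributed crawl")
    parser.add_argument("--word", default="bicycle", help="word to count")
    args = parser.parse_args()
    # send_task only enqueues the job on the broker; a worker executes it
    result = app.send_task("crawler.tasks.crawl", args=(args.word,))
    print(f"queued task {result.id}")


if __name__ == "__main__":
    main()
```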
To quickly run all of the commands above, a Makefile is available with a `rebuild` target:

```shell
$ make rebuild
```
The word count is then available in Redis.
Run the test suite with:

```shell
$ make tests
```
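As noted in the feature list, the tests cover the Celery tasks and the service layer. A minimal sketch of a task test, assuming a hypothetical `count_words` task and using Celery's eager mode:

```python
# Hypothetical test sketch: eager mode runs the task in-process,
# so no broker or worker is needed. All names are assumptions.
from crawler.celery_app import app      # assumed module path
from crawler.tasks import count_words   # assumed task


def test_count_words_counts_occurrences():
    app.conf.task_always_eager = True  # execute synchronously
    result = count_words.delay("bicycle", "a bicycle is a bicycle")
    assert result.get() == 2
```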
The project skeleton was bootstrapped with Poetry:

```shell
$ poetry new crawler
```

Access a shell in the project's virtual environment with:

```shell
$ poetry shell
```
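Dependencies declared in `pyproject.toml` are then installed with:

```shell
$ poetry install
```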
The counting follows a Map/Reduce-like design on top of Celery and Redis: map tasks count the word on each fetched page and store a per-URL partial result, and a reduce step aggregates those partials into a single total.
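A hedged sketch of that shape using a Celery chord: one map task per URL counts the word on that page and stores a partial result, and the reduce callback sums the partials into the aggregated `count::<word>` key shown below. The key naming mirrors the Redis output; the task bodies, names, and connection details are assumptions:

```python
# Sketch of the Map/Reduce flow with a Celery chord.
# Task bodies, names, and connection details are assumptions.
import urllib.request

import redis
from celery import Celery, chord

app = Celery("crawler", broker="redis://redis:6379/0", backend="redis://redis:6379/0")
store = redis.Redis(host="redis", port=6379)


@app.task
def count_on_page(word: str, url: str) -> int:
    # map step: fetch the page and count occurrences of the word
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    count = html.lower().count(word.lower())
    store.set(f"count::{word}::{url}", count)  # per-URL partial result
    return count


@app.task
def total(counts: list[int], word: str) -> int:
    # reduce step: sum the partial counts into the aggregate key
    aggregate = sum(counts)
    store.set(f"count::{word}", aggregate)
    return aggregate


def crawl(word: str, urls: list[str]):
    # fan out one map task per URL, then reduce via the chord callback
    return chord(count_on_page.s(word, url) for url in urls)(total.s(word))
```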
Available in Redis, one key per (word, URL) pair plus an aggregated total per word:

```shell
$ redis-cli
127.0.0.1:6379> KEYS count::*
1) "count::bicycle::https://demo.example.ai"
2) "count::bicycle"
3) "count::bicycle::https://www.def.org"
4) "count::bicycle::https://www.wikipedia.com"
5) "count::bicycle::https://blog.abc.com"
127.0.0.1:6379> GET count::bicycle
"221"
```
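The same values can be read programmatically, for example with redis-py; a sketch, with the host and keys taken from the session above:

```python
# Sketch: read the counts back with redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
print(r.get("count::bicycle"))      # aggregated total, e.g. "221"
print(r.keys("count::bicycle::*"))  # per-URL partial counts
```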