A multi-service backend designed to implement and evaluate distributed-system patterns end-to-end. Four Python services run behind a Traefik gateway, communicating over HTTP/gRPC and a RabbitMQ event bus with a DLQ, backed by a complete observability stack — logs, traces, and metrics — wired through OpenTelemetry and cross-correlated in Grafana.
Scope: Designed as an architectural sandbox project focused on core implementation. Known constraints, deliberate omissions, and planned enhancements are documented at the bottom.
flowchart TB
client[Client]
subgraph gateway[Gateway]
traefik[Traefik<br/>JWT ForwardAuth middleware<br/>gates protected routes]
end
subgraph apps[Application services]
user[user-service<br/>auth, users, purchases]
order[order-service<br/>order publisher]
inv[inventory-service<br/>order consumer + stock]
product[product-service<br/>gRPC product catalog]
end
subgraph data[Data + messaging]
pg[(Postgres)]
rmq[(RabbitMQ<br/>order_events + DLQ)]
end
subgraph obs[Observability]
jaeger[(Jaeger<br/>traces)]
loki[(Loki<br/>logs)]
prom[(Prometheus<br/>metrics)]
graf[Grafana<br/>unified UI]
end
client -->|HTTP| traefik
traefik -->|gated by jwt-auth| user
traefik -->|gated by jwt-auth| order
user -->|gRPC| product
user --> pg
order -->|publish OrderPlaced| rmq
rmq -->|consume| inv
user -.->|spans / metrics / logs| obs
order -.->|logs only| obs
inv -.->|logs only| obs
jaeger --> graf
loki --> graf
prom --> graf
style gateway fill:#cfe8c4
style apps fill:#fff3cc
style data fill:#ffe0b3
style obs fill:#eef6fb
Solid arrows are runtime data flow; dotted arrows are telemetry. user-service is the most fully instrumented (traces, metrics, structured logs with trace-id correlation); the other services emit logs only — full OTel wiring is a planned follow-up.
Some design choices, with the rationale.
JWT verification lives in one place: only user-service holds the signing secret. Traefik calls user-service's /auth/verify endpoint on every protected request before forwarding, then injects X-User-Id and X-User-Email headers onto the request that reaches the backend. Downstream services don't import a JWT library and never see tokens.
The alternative — per-service JWT verification — was rejected because it spreads the secret across every service and makes rotation a multi-deploy ordeal.
order-service publishes OrderPlaced events to a durable order_events queue. inventory-service consumes them. Any message a consumer rejects without requeueing (e.g. on a deserialization error) routes to order_events.dead via a dead-letter exchange, so failures are inspectable instead of silently dropped.
The full topology (DLX + DLQ + main queue arguments) is declared by a single helper in shared-events, called by both publisher and consumer from their lifespans.
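A minimal sketch of what such a topology description could look like, kept free of broker calls so it runs anywhere. The dataclass names and the `order_events.dlx` exchange name are assumptions; only `order_events` and `order_events.dead` come from this README.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueueSpec:
    name: str
    durable: bool = True
    arguments: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Topology:
    dead_letter_exchange: str
    dead_letter_queue: QueueSpec
    main_queue: QueueSpec


def order_events_topology() -> Topology:
    """Describe the main queue, the DLX, and the DLQ in one place."""
    dlx = "order_events.dlx"  # hypothetical exchange name
    dead = QueueSpec(name="order_events.dead")
    main = QueueSpec(
        name="order_events",
        arguments={
            # Standard RabbitMQ queue arguments: rejected (requeue=False)
            # messages are republished to this exchange with this routing key.
            "x-dead-letter-exchange": dlx,
            "x-dead-letter-routing-key": dead.name,
        },
    )
    return Topology(dead_letter_exchange=dlx, dead_letter_queue=dead, main_queue=main)
```

With aio_pika, a helper would presumably pass these values to `channel.declare_exchange(...)`, `channel.declare_queue(name, durable=..., arguments=...)`, and `queue.bind(...)`. Declaring from both sides is safe because RabbitMQ treats a redeclaration with identical arguments as a no-op and rejects one with different arguments, so drift between publisher and consumer surfaces at startup.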
Queue names, exchange names, event-type strings, and the topology helper live in libs/shared-events/, a uv workspace member depended on by both publisher and consumer. Renaming a queue is a single edit that breaks both sides at import time — never silently at runtime.
| Pillar | Backend | Where |
|---|---|---|
| Traces | Jaeger | OTel SDK → OTLP/gRPC → Jaeger container |
| Logs | Loki | stdout → Promtail (Docker socket discovery) → Loki |
| Metrics | Prometheus | OTel MeterProvider + PrometheusMetricReader exposing /metrics |
A TraceContextFilter stamps trace_id and span_id onto every Python log record. Loki picks them up via Promtail. Grafana's Loki datasource has a derivedFields rule that turns trace_id=<hex> into a clickable link that pivots to the matching trace in Jaeger (a migration to Tempo is planned).
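The filter-plus-regex mechanism can be sketched with the standard library alone. This is an approximation, not the project's class: the span ids are supplied by an injected callable so the example runs without the OTel SDK, and the log format and regex are assumptions.

```python
import io
import logging
import re
from typing import Callable

SpanIds = tuple[int, int]  # (trace_id, span_id) as integers

LOG_FORMAT = "%(levelname)s %(message)s trace_id=%(trace_id)s span_id=%(span_id)s"
# Same idea as a Grafana derived field's matcherRegex: capture the hex trace id.
TRACE_ID_PATTERN = re.compile(r"trace_id=([0-9a-f]{32})")


class TraceContextFilter(logging.Filter):
    """Attach trace_id / span_id attributes to every record passing through."""

    def __init__(self, get_span_ids: Callable[[], SpanIds]) -> None:
        super().__init__()
        # In a real service this would read the current OTel span context,
        # e.g. opentelemetry.trace.get_current_span().get_span_context().
        self._get_span_ids = get_span_ids

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = self._get_span_ids()
        record.trace_id = format(trace_id, "032x")  # 128-bit id as 32 hex chars
        record.span_id = format(span_id, "016x")    # 64-bit id as 16 hex chars
        return True  # never drop the record, only enrich it


def render_log_line(message: str, span_ids: SpanIds) -> str:
    """Log one message through the filter and return the formatted line."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    handler.addFilter(TraceContextFilter(lambda: span_ids))
    logger = logging.getLogger("trace_context_demo")
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.info(message)
    return stream.getvalue().strip()
```

Calling `TRACE_ID_PATTERN.search(render_log_line("login ok", (0xABC, 0x1)))` yields a match whose first group is the zero-padded trace id, which is the value a derived field would substitute into the trace link.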
Each service has a two-stage Dockerfile that copies only the relevant pyproject.toml files (root + service + shared library) before uv sync --frozen --no-dev, then the full source. The dependency-install layer is cached aggressively and only re-runs when actual dependencies change.
Two services have test suites using FastAPI's TestClient against in-memory SQLite (user-service) and an AsyncMock-chain replacement of aio_pika (order-service). Tests cover happy paths, security guards (IDOR via X-User-Id header verification), error paths, and message-contract assertions against the shared enums — so a renamed event type breaks the tests immediately.
| Layer | Tech | Why |
|---|---|---|
| Language | Python 3.13 | |
| Web framework | FastAPI | Native async, automatic OpenAPI docs |
| ORM | SQLModel | Pydantic + SQLAlchemy in one |
| Database | Postgres 16 | Real DB for the user service |
| Inter-service RPC | gRPC | user-service → product-service catalog lookup |
| Message broker | RabbitMQ | order-service → inventory-service via durable queue + DLQ |
| Gateway | Traefik v3 | Single ingress, ForwardAuth middleware for JWT gating |
| Auth | JWT (HS256) + bcrypt | pyjwt + passlib[bcrypt] |
| Tracing | OpenTelemetry → Jaeger | FastAPIInstrumentor for HTTP spans |
| Logs | Promtail → Loki | Auto-discovery via Docker socket |
| Metrics | OpenTelemetry → Prometheus | OTel-native, PrometheusMetricReader |
| Visualization | Grafana | Auto-provisioned datasources for Loki + Prometheus |
| Package management | uv with workspace members | Monorepo: 4 services + 1 shared library |
Requirements: Docker, uv.
cp .env.example .env
docker compose up -d --build
Smoke test the JWT auth + event flow:
# Sign up
curl -X POST http://localhost/auth/signup \
-H 'content-type: application/json' \
-d '{"name":"alice","email":"a@b.com","password":"hunter2"}'
# Log in and capture the token
TOKEN=$(curl -s -X POST http://localhost/auth/login \
-H 'content-type: application/json' \
-d '{"email":"a@b.com","password":"hunter2"}' \
| python -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
# Hit a protected route — Traefik's jwt-auth middleware gates this
curl http://localhost/users/ -H "Authorization: Bearer $TOKEN"
# Place an order — published to RabbitMQ, consumed by inventory-service
curl -X POST 'http://localhost/orders/?product_id=99' \
-H "Authorization: Bearer $TOKEN"
# Watch inventory drop
curl http://localhost/inventory/99 -H "Authorization: Bearer $TOKEN"
Once docker compose up is running, these UIs are available:
| URL | Service | Login |
|---|---|---|
| http://localhost:16686 | Jaeger (traces) | — |
| http://localhost:3001 | Grafana (logs + metrics + trace pivot) | admin / admin |
| http://localhost:9090 | Prometheus (raw PromQL) | — |
| http://localhost:15672 | RabbitMQ management | guest / guest |
| http://localhost:8080 | Traefik dashboard | — |
agora/
├── compose.yaml # all services + observability stack
├── prometheus.yml # Prometheus scrape config
├── promtail-config.yaml # Promtail Docker socket discovery
├── grafana/
│ └── provisioning/
│ └── datasources/ # auto-provisioned Loki + Prometheus datasources
├── libs/
│ └── shared-events/ # workspace package: queue + event-type contract
│ └── shared_events/
│ └── __init__.py # StrEnums + topology helper
├── services/
│ ├── user-service/ # FastAPI: auth, users, purchases (gRPC client)
│ ├── product-service/ # gRPC server: product catalog
│ ├── order-service/ # FastAPI: publishes OrderPlaced
│ └── inventory-service/ # FastAPI: consumes OrderPlaced + stock endpoint
├── protos/ # shared .proto schemas
└── pyproject.toml # uv workspace root
# user-service: 25 tests (jwt_utils unit + auth/users integration)
uv run --directory services/user-service pytest
# order-service: 7 tests (publish behavior + IDOR guard + topology assertions)
uv run --directory services/order-service pytest
The user-service suite uses in-memory SQLite via FastAPI dependency-override. The order-service suite mocks aio_pika.connect_robust and the channel/exchange chain with AsyncMock; see services/order-service/conftest.py for the pattern.
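For readers without the repo open, the AsyncMock chain works roughly as below. This is a simplified stand-in, not the contents of conftest.py: the connect function is injected rather than patched, the body is published as raw bytes instead of an `aio_pika.Message`, and the URL and payload fields are made up.

```python
import asyncio
import json
from unittest.mock import AsyncMock


async def publish_order_placed(connect, product_id: int, user_id: str) -> None:
    """Publisher under test; `connect` stands in for aio_pika.connect_robust."""
    connection = await connect("amqp://guest:guest@rabbitmq/")  # hypothetical URL
    channel = await connection.channel()
    body = json.dumps(
        {"type": "OrderPlaced", "product_id": product_id, "user_id": user_id}
    ).encode()
    await channel.default_exchange.publish(body, routing_key="order_events")
    await connection.close()


def make_connect_mock() -> tuple[AsyncMock, AsyncMock]:
    """Build connect -> connection -> channel -> exchange as awaitable mocks."""
    exchange = AsyncMock()
    channel = AsyncMock()
    channel.default_exchange = exchange
    connection = AsyncMock()
    connection.channel.return_value = channel  # `await connection.channel()`
    connect = AsyncMock(return_value=connection)  # `await connect(url)`
    return connect, exchange


def run_publish_test() -> dict:
    connect, exchange = make_connect_mock()
    asyncio.run(publish_order_placed(connect, product_id=99, user_id="42"))
    exchange.publish.assert_awaited_once()
    args, kwargs = exchange.publish.await_args
    assert kwargs["routing_key"] == "order_events"
    return json.loads(args[0])
```

No broker is needed: every awaited call resolves immediately, and the assertions inspect what would have been sent.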
Things I knew about and chose not to do, with the reasoning. Listing these is part of the point — knowing the gap matters more than filling every gap on a learning project.
| Not done | Why not (yet) |
|---|---|
| CI pipeline | Tests run locally with uv run pytest. A GitHub Actions workflow is incoming. |
| Persistent volumes for Jaeger / Loki / Prometheus | Telemetry data is intentionally ephemeral in dev. Production deployments would mount volumes. |
| Retry-with-backoff before dead-letter | Current DLQ is "fail once → dead-letter." A retry queue with TTL is the next pattern; not yet implemented. |
| OTel instrumentation in order-service / inventory-service | Only user-service is fully instrumented. |
| Cross-service trace propagation through RabbitMQ message headers | The OTel aio_pika instrumentation supports this; once the other services are OTel-aware, one trace spanning publisher → broker → consumer becomes possible. |
| Alertmanager / alert rules | Prometheus is collecting; no alerting layer yet because this is not a production codebase. |