Agora

A multi-service backend designed to implement and evaluate distributed-system patterns end-to-end. Four Python services run behind a Traefik gateway, communicating over HTTP/gRPC and a RabbitMQ event bus with a DLQ, backed by a complete observability stack — logs, traces, and metrics — wired through OpenTelemetry and cross-correlated in Grafana.

Scope: Designed as an architectural sandbox project focused on core implementation. Known constraints, deliberate omissions, and planned enhancements are documented at the bottom.


Architecture

flowchart TB
  client[Client]

  subgraph gateway[Gateway]
    traefik[Traefik<br/>JWT ForwardAuth middleware<br/>gates protected routes]
  end

  subgraph apps[Application services]
    user[user-service<br/>auth, users, purchases]
    order[order-service<br/>order publisher]
    inv[inventory-service<br/>order consumer + stock]
    product[product-service<br/>gRPC product catalog]
  end

  subgraph data[Data + messaging]
    pg[(Postgres)]
    rmq[(RabbitMQ<br/>order_events + DLQ)]
  end

  subgraph obs[Observability]
    jaeger[(Jaeger<br/>traces)]
    loki[(Loki<br/>logs)]
    prom[(Prometheus<br/>metrics)]
    graf[Grafana<br/>unified UI]
  end

  client -->|HTTP| traefik
  traefik -->|gated by jwt-auth| user
  traefik -->|gated by jwt-auth| order
  user -->|gRPC| product
  user --> pg
  order -->|publish OrderPlaced| rmq
  rmq -->|consume| inv

  user -.->|spans / metrics / logs| obs
  order -.->|logs only| obs
  inv -.->|logs only| obs

  jaeger --> graf
  loki --> graf
  prom --> graf

  style gateway fill:#cfe8c4
  style apps fill:#fff3cc
  style data fill:#ffe0b3
  style obs fill:#eef6fb

Solid arrows are runtime data flow; dotted arrows are telemetry. user-service is the most fully instrumented (traces, metrics, structured logs with trace-id correlation); the other services emit logs only — full OTel wiring is a planned follow-up.


Highlights

Some design choices, with the rationale.

Gateway-side authentication via Traefik ForwardAuth

JWT verification lives in one place — only user-service holds the signing secret. Traefik calls user-service/auth/verify on every protected request before forwarding, then injects X-User-Id and X-User-Email headers onto the request that reaches the backend. Downstream services don't import a JWT library and don't see tokens.

The alternative — per-service JWT verification — was rejected because it spreads the secret across every service and makes rotation a multi-deploy ordeal.
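To make the flow concrete, here is a minimal, framework-free sketch of what a ForwardAuth verify target does. It is not the code in user-service: the function names, the `sub` and `email` claim names, the dev secret, and the hand-rolled HS256 check are all assumptions for illustration. The project itself uses pyjwt, and a real handler would also validate `exp`.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"dev-only-secret"  # hypothetical; only user-service would hold this


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    # Re-add the padding that compact JWTs strip.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes = SECRET) -> str:
    """Build a compact HS256 JWT (illustrative stand-in for pyjwt's encode)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_forward_auth(request_headers: dict, secret: bytes = SECRET):
    """Return (status, response_headers) the way a ForwardAuth target would."""
    scheme, _, token = request_headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or token.count(".") != 2:
        return 401, {}
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    try:
        header = json.loads(_b64url_decode(header_b64))
        supplied = _b64url_decode(sig_b64)
    except ValueError:
        return 401, {}
    # Pin the algorithm, then compare signatures in constant time.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return 401, {}
    if not hmac.compare_digest(supplied, expected):
        return 401, {}
    claims = json.loads(_b64url_decode(payload_b64))
    # Identity headers for the gateway to copy onto the forwarded request.
    return 200, {"X-User-Id": str(claims["sub"]), "X-User-Email": claims["email"]}
```

A 2xx response lets the request through; anything else is returned to the client. On the Traefik side, the headers to copy from the auth response onto the forwarded request are listed in the middleware's `authResponseHeaders` option.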

Event-driven with a dead-letter queue

order-service publishes OrderPlaced events to a durable order_events queue. inventory-service consumes them. Any message a consumer rejects without requeueing (e.g. on a deserialization error) routes to order_events.dead via a dead-letter exchange, so failures are inspectable instead of silently dropped.

The full topology (DLX + DLQ + main queue arguments) is declared by a single helper in shared-events, called by both publisher and consumer from their lifespans.
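As a rough sketch of what such a helper might declare, the snippet below builds the queue specifications as plain data. The queue names come from this README; the exchange name `order_events.dlx`, the `QueueSpec` type, and `build_topology` are hypothetical. The `x-dead-letter-*` keys are standard RabbitMQ queue arguments.

```python
from dataclasses import dataclass, field

MAIN_QUEUE = "order_events"       # named in this README
DEAD_QUEUE = "order_events.dead"  # named in this README
DLX_NAME = "order_events.dlx"     # hypothetical exchange name


@dataclass(frozen=True)
class QueueSpec:
    name: str
    durable: bool = True
    arguments: dict = field(default_factory=dict)


def build_topology() -> list[QueueSpec]:
    """Describe the dead queue first, then the main queue that points at it."""
    dead = QueueSpec(DEAD_QUEUE)
    main = QueueSpec(
        MAIN_QUEUE,
        arguments={
            # Rejected (requeue=False) messages are republished to this exchange...
            "x-dead-letter-exchange": DLX_NAME,
            # ...with this routing key, which the dead queue is bound to.
            "x-dead-letter-routing-key": DEAD_QUEUE,
        },
    )
    return [dead, main]
```

With aio_pika, each spec would map onto something like `channel.declare_queue(spec.name, durable=spec.durable, arguments=spec.arguments)`, plus a binding from the dead queue to the DLX. Declaring from one helper matters because RabbitMQ rejects a redeclaration whose arguments differ from the existing queue (a `PRECONDITION_FAILED` channel error), so publisher and consumer have to agree exactly.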

Shared events workspace package

Queue names, exchange names, event-type strings, and the topology helper live in libs/shared-events/, a uv workspace member depended on by both publisher and consumer. Renaming a queue is a single edit that breaks both sides at import time — never silently at runtime.

Full observability with log-to-trace correlation

| Pillar | Backend | Where |
| --- | --- | --- |
| Traces | Jaeger | OTel SDK → OTLP/gRPC → Jaeger container |
| Logs | Loki | stdout → Promtail (Docker socket discovery) → Loki |
| Metrics | Prometheus | OTel MeterProvider + PrometheusMetricReader exposing /metrics |

A TraceContextFilter stamps trace_id and span_id onto every Python log record. Loki picks them up via Promtail. Grafana's Loki datasource has a derivedFields rule that turns trace_id=<hex> into a clickable link to the matching trace in Jaeger. A migration from Jaeger to Tempo is planned.
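The filter idea can be sketched with the standard library alone. This is an illustration, not the project's class: the `get_ids` callable and `build_logger` helper are hypothetical, and in a real service `get_ids` would wrap OpenTelemetry's `trace.get_current_span().get_span_context()` and return its `trace_id` and `span_id` integers.

```python
import io
import logging
from typing import Callable, Tuple


class TraceContextFilter(logging.Filter):
    """Attach hex-formatted trace and span ids to every log record."""

    def __init__(self, get_ids: Callable[[], Tuple[int, int]]):
        super().__init__()
        self._get_ids = get_ids  # injected so the sketch runs without OTel

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = self._get_ids()
        record.trace_id = format(trace_id, "032x")  # 128-bit id, 32 hex chars
        record.span_id = format(span_id, "016x")    # 64-bit id, 16 hex chars
        return True  # never drop the record, only enrich it


def build_logger(name: str, stream, get_ids) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
        )
    )
    # Handler-level filters run before formatting, so the fields always exist.
    handler.addFilter(TraceContextFilter(get_ids))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Emitting the id as a literal `trace_id=<hex>` token in the line is what gives a Grafana derived-field regex something stable to match.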

Multi-stage Docker builds with uv workspace

Each service has a two-stage Dockerfile that copies only the relevant pyproject.toml files (root + service + shared library) before uv sync --frozen --no-dev, then the full source. The dependency-install layer is cached aggressively and only re-runs when actual dependencies change.

Tests with async mocking patterns

Two services have test suites using FastAPI's TestClient against in-memory SQLite (user-service) and an AsyncMock-chain replacement of aio_pika (order-service). Tests cover happy paths, security guards (IDOR via X-User-Id header verification), error paths, and message-contract assertions against the shared enums — so a renamed event type breaks the tests immediately.
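The AsyncMock-chain idea looks roughly like this. The publisher function, its signature, the broker URL, and the message shape are hypothetical stand-ins rather than order-service code; in particular it publishes raw bytes for brevity, whereas aio_pika expects a `Message` object.

```python
import asyncio
import json
from unittest.mock import AsyncMock


async def publish_order_placed(connect, product_id: int, user_id: int) -> None:
    """Hypothetical publisher: connect -> channel -> default exchange -> publish."""
    connection = await connect("amqp://guest:guest@rabbitmq/")
    channel = await connection.channel()
    body = json.dumps(
        {"type": "OrderPlaced", "product_id": product_id, "user_id": user_id}
    ).encode()
    await channel.default_exchange.publish(body, routing_key="order_events")
    await connection.close()


def make_connect_mock():
    """Build a mock chain that can stand in for aio_pika.connect_robust."""
    exchange = AsyncMock()
    channel = AsyncMock()
    channel.default_exchange = exchange           # attribute, not a coroutine
    connection = AsyncMock()
    connection.channel.return_value = channel     # result of `await connection.channel()`
    connect = AsyncMock(return_value=connection)  # result of `await connect(...)`
    return connect, exchange


# Usage: drive the publisher against the mock and inspect what was published.
connect, exchange = make_connect_mock()
asyncio.run(publish_order_placed(connect, product_id=99, user_id=7))
args, kwargs = exchange.publish.await_args
print(kwargs["routing_key"], json.loads(args[0])["type"])
```

Returning the innermost `exchange` mock alongside `connect` lets a test assert on the published body and routing key without a broker running.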


Stack

| Layer | Tech | Why |
| --- | --- | --- |
| Language | Python 3.13 | |
| Web framework | FastAPI | Native async, automatic OpenAPI docs |
| ORM | SQLModel | Pydantic + SQLAlchemy in one |
| Database | Postgres 16 | Real DB for the user service |
| Inter-service RPC | gRPC | user-service → product-service catalog lookup |
| Message broker | RabbitMQ | order-service → inventory-service via durable queue + DLQ |
| Gateway | Traefik v3 | Single ingress, ForwardAuth middleware for JWT gating |
| Auth | JWT (HS256) + bcrypt | pyjwt + passlib[bcrypt] |
| Tracing | OpenTelemetry → Jaeger | FastAPIInstrumentor for HTTP spans |
| Logs | Promtail → Loki | Auto-discovery via Docker socket |
| Metrics | OpenTelemetry → Prometheus | OTel-native, PrometheusMetricReader |
| Visualization | Grafana | Auto-provisioned datasources for Loki + Prometheus |
| Package management | uv with workspace members | Monorepo: 4 services + 1 shared library |

Getting started

Requirements: Docker, uv.

cp .env.example .env

docker compose up -d --build

Smoke test the JWT auth + event flow:

# Sign up
curl -X POST http://localhost/auth/signup \
  -H 'content-type: application/json' \
  -d '{"name":"alice","email":"a@b.com","password":"hunter2"}'

# Log in and capture the token
TOKEN=$(curl -s -X POST http://localhost/auth/login \
  -H 'content-type: application/json' \
  -d '{"email":"a@b.com","password":"hunter2"}' \
  | python -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

# Hit a protected route — Traefik's jwt-auth middleware gates this
curl http://localhost/users/ -H "Authorization: Bearer $TOKEN"

# Place an order — published to RabbitMQ, consumed by inventory-service
curl -X POST 'http://localhost/orders/?product_id=99' \
  -H "Authorization: Bearer $TOKEN"

# Watch inventory drop
curl http://localhost/inventory/99 -H "Authorization: Bearer $TOKEN"

Observability tour

Once docker compose up is running, these UIs are available:

| URL | Service | Login |
| --- | --- | --- |
| http://localhost:16686 | Jaeger (traces) | |
| http://localhost:3001 | Grafana (logs + metrics + trace pivot) | admin / admin |
| http://localhost:9090 | Prometheus (raw PromQL) | |
| http://localhost:15672 | RabbitMQ management | guest / guest |
| http://localhost:8080 | Traefik dashboard | |

Project layout

agora/
├── compose.yaml              # all services + observability stack
├── prometheus.yml            # Prometheus scrape config
├── promtail-config.yaml      # Promtail Docker socket discovery
├── grafana/
│   └── provisioning/
│       └── datasources/      # auto-provisioned Loki + Prometheus datasources
├── libs/
│   └── shared-events/        # workspace package: queue + event-type contract
│       └── shared_events/
│           └── __init__.py   # StrEnums + topology helper
├── services/
│   ├── user-service/         # FastAPI: auth, users, purchases (gRPC client)
│   ├── product-service/      # gRPC server: product catalog
│   ├── order-service/        # FastAPI: publishes OrderPlaced
│   └── inventory-service/    # FastAPI: consumes OrderPlaced + stock endpoint
├── protos/                   # shared .proto schemas
└── pyproject.toml            # uv workspace root

Tests

# user-service: 25 tests (jwt_utils unit + auth/users integration)
uv run --directory services/user-service pytest

# order-service: 7 tests (publish behavior + IDOR guard + topology assertions)
uv run --directory services/order-service pytest

The user-service suite uses in-memory SQLite via FastAPI dependency-override. The order-service suite mocks aio_pika.connect_robust and the channel/exchange chain with AsyncMock — see services/order-service/conftest.py for the pattern.


Deliberate omissions

Things I knew about and chose not to do, with the reasoning. Listing these is part of the point — knowing the gap matters more than filling every gap on a learning project.

| Not done | Why not (yet) |
| --- | --- |
| CI pipeline | Tests run locally with uv run pytest. A GitHub Actions workflow is incoming. |
| Persistent volumes for Jaeger / Loki / Prometheus | Telemetry data is intentionally ephemeral in dev. Production deployments would mount volumes. |
| Retry-with-backoff before dead-letter | Current DLQ is "fail once → dead-letter." A retry queue with TTL is the next pattern; not yet implemented. |
| OTel instrumentation in order-service / inventory-service | Only user-service is fully instrumented. |
| Cross-service trace propagation through RabbitMQ message headers | The OTel aio_pika instrumentation supports this; once the other services are OTel-aware, one trace spanning publisher → broker → consumer becomes possible. |
| Alertmanager / alert rules | Prometheus is collecting; no alerting layer yet because this is not a production codebase. |
