Tolerex is a fault-tolerant distributed message storage system
developed as part of the System Programming course.
Leader-based architecture | Configurable replication factor | Crash-tolerant reads | Disk persistence | Secure gRPC communication
Quick Start | Architecture | Test Scenarios | Project Structure
- Overview
- Technology Stack
- Quick Start
- Monitoring
- Service Ports
- Configuration
- Architecture
- Test Client
- Operational Tips
- Troubleshooting
- Performance Notes
- Project Structure
- Testing and Cleanup
- Test Scenarios
- Roadmap and References
Core capabilities:
- Replication and fault tolerance
- Load-balanced write placement
- Crash-aware read fallback
- Disk-backed persistence
- Secure inter-node communication with mTLS
Client interaction uses a text protocol called HaToKuSe.
Current scope:
- Single leader process (no leader election yet)
- Strong focus on local/dev cluster workflows
- Course-project oriented architecture with clear, inspectable code paths
| Layer | Technology | Purpose |
|---|---|---|
| Language | Go 1.24+ | Core implementation |
| RPC | gRPC + Protobuf | Leader-member communication |
| Transport Security | mTLS | Mutual node authentication |
| Metrics | Prometheus | Time-series collection |
| Visualization | Grafana | Operational dashboarding |
| Local Orchestration | Docker Compose | Repeatable local deploy |
- Go 1.24+
- Protocol Buffers compiler (`protoc`) for regenerating `.proto` files (optional)
- OpenSSL for certificate generation (certificates are already included)
- Docker and Docker Compose for monitoring (optional)
```
git clone https://github.com/YasinEnginExpert/Tolerex.git
cd Tolerex
go mod tidy
go build ./cmd/leader
go build ./cmd/client
go build ./cmd/member
go build ./cmd/launcher
```

Run the launcher:

```
go run ./cmd/launcher/main.go
```

The launcher starts the leader and members in separate terminals.
Stress clients are optional (set client count to 0 to skip).
Recommended values: 6 members and 4 stress clients.
Note: enter these values explicitly in the launcher prompts.
Leader:

```powershell
$env:LEADER_GRPC_PORT='5555'
$env:LEADER_METRICS_PORT='9090'
go run ./cmd/leader/main.go
```

Member:

```powershell
$env:LEADER_ADDR='localhost:5555'
$env:MEMBER_ADDR='localhost:5556'
go run ./cmd/member/main.go -port=5556 -metrics=9092 -io=buffered
```

Client:

```powershell
go run ./cmd/client/main.go
go run ./cmd/client/main.go -measure
go run ./cmd/client/main.go -measure -csv > measured/client.csv
go run ./cmd/client/main.go -measure -csv -id client-1 -offset 0
go run ./cmd/client/main.go -measure -csv -id client-2 -offset 100000
```

Start the Docker cluster:

```
docker compose -f deploy/docker-compose.cluster.yml up -d --build
# or: make deploy-cluster-up
```

Stop:

```
docker compose -f deploy/docker-compose.cluster.yml down
# or: make deploy-cluster-down
```

After cluster startup, run clients from the host (example):
```
make stress-client-30m
```

30-minute continuous measured stress client:

```
make stress-client-30m
```

30-minute continuous measured stress clients (parallel):

```
make stress-clients-30m
```

Quick interactive commands:

```
SET 1 hello
GET 1
1000 SET load_test
QUIT
```
Quick health check after startup:
```
curl http://localhost:9090/metrics
```

If member metrics are enabled, also check:

```
curl http://localhost:9092/metrics
```

Monitoring-only mode (leader/member on host):

```
docker compose -f deploy/docker-compose.yml up -d
# or: make deploy-monitoring-up
```

Full Docker cluster mode (Leader + Members + Monitoring):

```
docker compose -f deploy/docker-compose.cluster.yml up -d --build
# or: make deploy-cluster-up
```

Note: the deploy Compose files intentionally do not run client containers. Run clients separately (for example, `make stress-client-30m`).
Access:

- Prometheus targets: http://localhost:9091/targets
- Grafana UI: http://localhost:3000 (default `admin`/`admin`)
- Import dashboard: `deploy/grafana_dashboard.json`
- Detailed deploy guide: `deploy/manual.md`
- Datasource/dashboard provisioning is automatic on startup (no manual datasource setup required)
The dashboard now includes:
- SRE summary KPIs (throughput, error rate, p95, p99)
- Throughput/error breakdown by method and status
- Instance-level throughput and scrape health
- Stored message totals per member instance
- Top methods by load for bottleneck analysis
| Service | Default Port | Notes |
|---|---|---|
| Leader gRPC | 5555 | Member registration and replication path |
| Leader TCP | 6666 | Client command protocol (HaToKuSe) |
| Leader Metrics | 9090 | Prometheus scrape endpoint |
| Member-1 gRPC | 5556 | Storage RPC |
| Member-2 gRPC | 5557 | Storage RPC |
| Member-3 gRPC | 5558 | Storage RPC |
| Member-4 gRPC | 5559 | Storage RPC |
| Member-5 gRPC | 5560 | Storage RPC |
| Member-6 gRPC | 5561 | Storage RPC |
| Member Metrics | 9092-9097 | Prometheus scrape endpoints |
| Prometheus UI | 9091 | Exposed from container port 9090 |
| Grafana UI | 3000 | Dashboard and alerting UI |
| Variable | Example | Description |
|---|---|---|
| `LEADER_GRPC_PORT` | `5555` | Leader gRPC port |
| `LEADER_TCP_PORT` | `6666` | Leader TCP command port |
| `LEADER_TCP_BIND` | `127.0.0.1` or `0.0.0.0` | Leader TCP bind host |
| `LEADER_METRICS_PORT` | `9090` | Leader metrics port |
| `LEADER_ADDR` | `localhost:5555` | Member -> Leader registration/heartbeat target |
| `MEMBER_ADDR` | `localhost:5556` | Member self address |
| `MEMBER_GRPC_PORT` | `5556` | Default member gRPC port |
| `MEMBER_METRICS_PORT` | `9092` | Default member metrics port |
| `MEMBER_IO_MODE` | `buffered` or `unbuffered` | Member disk write mode |
| `MEMBER_EXPECTED_LEADER_CN` | `leader` | Expected leader certificate CN for member-side authorization |
| `BALANCER_STRATEGY` | `least_loaded` or `p2c` | Replica selection strategy |
| `TOLERANCE_CONF` | `config/tolerance.conf` | Tolerance file path |
| `TOLEREX_TEST_MODE` | `1` or `0` | Test mode toggle |
| `TOLEREX_BASE_DIR` | `D:\Tolerex` | Optional base directory override for launcher |
- `least_loaded`: sort-based selection, practical complexity O(N log N).
- `p2c`: uses a sparse candidate pool and removes only winners; with a small fixed tolerance, selection cost is near constant in practice.
Roles:
- Client: sends `SET` and `GET` to the leader over TCP.
- Leader: membership, load balancing, replication coordination, metadata/state management.
- Member: stores and retrieves message data on disk.
Request flow:
- Client sends request to leader.
- Leader resolves tolerance and picks replicas.
- Leader replicates via gRPC.
- Members persist and ACK.
- Leader returns final result.
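Steps 3-4 (parallel replication with ACKs) can be sketched like this; `storeOnMember` is a hypothetical stand-in for the project's real gRPC Store call over mTLS:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// storeOnMember is a hypothetical stand-in for the gRPC Store RPC.
func storeOnMember(addr string, id int, payload string) error {
	if addr == "" {
		return errors.New("unreachable member")
	}
	return nil // pretend the member persisted and ACKed
}

// replicate fans the write out to all chosen replicas in parallel and
// succeeds only if every replica ACKs.
func replicate(replicas []string, id int, payload string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(replicas))
	for _, addr := range replicas {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			if err := storeOnMember(addr, id, payload); err != nil {
				errs <- fmt.Errorf("%s: %w", addr, err)
			}
		}(addr)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err // report the first failure
	}
	return nil
}

func main() {
	err := replicate([]string{"localhost:5556", "localhost:5557"}, 1, "hello")
	fmt.Println("replicated:", err == nil) // replicated: true
}
```

Fanning out concurrently means replication latency tracks the slowest replica rather than the sum of all replicas.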
Read behavior:
- Leader first tries known replica list from metadata.
- If metadata is missing, leader can fall back to scanning alive members.
- If no replica returns data, the client gets `NOT_FOUND`.
System interaction map:
```mermaid
flowchart LR
    C[Client] -->|TCP SET/GET| L[Leader]
    L -->|gRPC Store/Retrieve| M1[Member 1]
    L -->|gRPC Store/Retrieve| M2[Member 2]
    L -->|gRPC Store/Retrieve| M3[Member 3]
    L -->|/metrics| P[Prometheus]
    M1 -->|/metrics| P
    M2 -->|/metrics| P
    M3 -->|/metrics| P
    P --> G[Grafana]
```
Main flags:
```
-addr string   Leader TCP address (default "127.0.0.1:6666")
-measure       Enable RTT measurement
-csv           Emit CSV to stdout (requires -measure)
-id string     Client ID (default "client")
-offset int    Starting ID offset
-flush int     Flush interval (default 1000)
-stress int    Stress duration in minutes
```
CSV format:
```
Timestamp,ClientID,Operation,Count,Bytes,RTT_us,Mode,Status,PayloadSize
```

Common command cookbook:
| Goal | Command |
|---|---|
| Single write | `SET 100 hello` |
| Single read | `GET 100` |
| Bulk write | `1000 SET payload` |
| Graceful exit | `QUIT` |
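The measured CSV output can be post-processed for latency analysis. A sketch that computes the p95 RTT per operation, assuming the documented column order (`RTT_us` in column 6) and using inline sample data:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// p95 returns the 95th-percentile value of a sample (nearest-rank method).
func p95(xs []int) int {
	sort.Ints(xs)
	idx := (95*len(xs) + 99) / 100 // ceil(0.95 * n)
	return xs[idx-1]
}

func main() {
	// Hypothetical measured output matching the documented column order.
	data := `Timestamp,ClientID,Operation,Count,Bytes,RTT_us,Mode,Status,PayloadSize
1,client-1,SET,1,64,120,buffered,OK,64
2,client-1,SET,1,64,150,buffered,OK,64
3,client-1,GET,1,64,90,buffered,OK,64
4,client-1,SET,1,64,400,buffered,OK,64`

	rows, _ := csv.NewReader(strings.NewReader(data)).ReadAll()
	rtts := map[string][]int{}
	for _, row := range rows[1:] { // skip the header row
		us, _ := strconv.Atoi(row[5]) // RTT_us column
		rtts[row[2]] = append(rtts[row[2]], us)
	}
	for op, xs := range rtts {
		fmt.Printf("%s p95=%dus\n", op, p95(xs))
	}
}
```

For a real run, read `measured/client.csv` instead of the inline string.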
- For realistic throughput testing, run the client without `-measure` first.
- Use `-measure -csv` for latency analysis, not peak TPS measurement.
- For a sustained 30-minute measured run, use `make stress-client-30m`.
- For higher-pressure tests (parallel clients), use `make stress-clients-30m`.
- Keep the member `-io` mode explicit (`buffered` or `unbuffered`) in performance comparisons.
- Prefer `make deploy-cluster-up` for consistent local demo environments.
Leader starts but client cannot connect:
- Check `LEADER_TCP_BIND` and `LEADER_TCP_PORT`.
- Verify no other process is using port `6666`.
- Try `Test-NetConnection 127.0.0.1 -Port 6666` in PowerShell.
Member fails to register:
- Confirm `LEADER_ADDR` is reachable from the member process.
- Validate mTLS cert/key files under `config/tls`.
- If using custom certificates, set `MEMBER_EXPECTED_LEADER_CN`.
Grafana has no data:
- Open the Prometheus targets page: http://localhost:9091/targets.
- Ensure scrape targets are `UP`.
- Verify that leader/member `/metrics` endpoints return data.
Implemented optimizations:
- Persistent gRPC connection reuse in leader
- Parallel fan-out replication for store path
- Configurable buffered/unbuffered storage writes in member
Useful benchmark command:
```
go test -run '^$' -bench 'Benchmark(WriteMessage_|LeastLoaded_|P2C_)' ./internal/storage ./internal/server/balancer -benchmem
```
```
|-- cmd/
|   |-- client/      # test client
|   |-- launcher/    # local cluster launcher
|   |-- leader/      # leader bootstrap
|   |-- member/      # member bootstrap
|-- config/
|   |-- tls/
|   `-- tolerance.conf
|-- deploy/
|   |-- docker-compose.yml
|   |-- docker-compose.cluster.yml
|   |-- manual.md
|   |-- prometheus.local.yml
|   |-- prometheus.cluster.yml
|   `-- grafana_dashboard.json
|-- docs/
|   |-- testing.md
|   |-- roadmap.md
|   `-- references.md
|-- internal/
|   |-- logger/
|   |-- metrics/
|   |-- middleware/
|   |-- security/
|   |-- server/
|   `-- storage/
|-- proto/
|-- scripts/
|   |-- run-tests.ps1
|   |-- run-stress-client.ps1
|   |-- run-stress-clients.ps1
|   |-- clean-test-artifacts.ps1
|   |-- clean-docker-artifacts.ps1
|   `-- clean-runtime-artifacts.ps1
|-- test_data/
|-- Makefile
`-- README.md
```
Use either the scripts or the make targets.

Run tests:

```
make test
make test-race
make bench
make test-all
```

Deploy helpers:

```
make deploy-monitoring-up
make deploy-monitoring-down
make deploy-cluster-up
make deploy-cluster-down
make deploy-cluster-status
make deploy-cluster-logs
make stress-client-30m
make stress-client-30m-unbuffered
make stress-clients-30m
make stress-clients-30m-unbuffered
```

Cleanup artifacts:

```
make clean-test-artifacts
make clean-runtime-artifacts
make clean-docker-artifacts
make clean-all-artifacts
```

Details are in `docs/testing.md`.
Demo videos:
- [Test 1: Initial system validation](https://youtu.be/kz0HX8aq4wQ)
- [Test 2: Disk-based single-node storage](https://youtu.be/mqYZ8ZRT5D4)
- [Test 3: gRPC message model](https://youtu.be/evnN6bgofg8)
- [Test 4: Distributed logging with tolerance 1-2](https://youtu.be/-CHNPo6JEkc)
- [Test 6: General fault tolerance and load balancing](https://youtu.be/TSHtgNh90gI)
- [Test 7: Crash and recovery behavior](https://youtu.be/3mGIgtAFrmg)
Moved to dedicated docs:
- Roadmap: `docs/roadmap.md`
- References: `docs/references.md`