# Sentinel-Ops AI

An autonomous, self-healing AI infrastructure platform: it detects LLM/API outages and automatically reroutes inference traffic to backup providers in real time.
## Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                         Sentinel-Ops AI                          │
│                                                                  │
│  ┌──────────────┐      ┌─────────────────┐     ┌───────────────┐ │
│  │   FastAPI    │      │    Failover     │     │    Health     │ │
│  │   Gateway    │─────▶│     Engine      │────▶│    Monitor    │ │
│  │              │      │ (Circuit Break) │     │ (Background)  │ │
│  └──────┬───────┘      └────────┬────────┘     └───────┬───────┘ │
│         │                       │                      │         │
│         │              ┌────────▼────────┐             │         │
│         │              │    Provider     │             │         │
│         │              │    Registry     │             │         │
│         │              │                 │             │         │
│         │              │  ┌───────────┐  │             │         │
│         │              │  │  OpenAI   │  │◀────────────┘         │
│         │              │  │ (Primary) │  │                       │
│         │              │  └───────────┘  │                       │
│         │              │  ┌───────────┐  │                       │
│         │              │  │  Ollama   │  │                       │
│         │              │  │ (Fallback)│  │                       │
│         │              │  └───────────┘  │                       │
│         │              └─────────────────┘                       │
│         │                                                        │
│  ┌──────▼───────┐      ┌─────────────────┐                       │
│  │  WebSocket   │      │      Redis      │                       │
│  │  Event Bus   │      │  (Incidents +   │                       │
│  │              │      │  Event Cache)   │                       │
│  └──────────────┘      └─────────────────┘                       │
└──────────────────────────────────────────────────────────────────┘
```
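In outline: a request enters the gateway, and the failover engine walks the provider chain, skipping any provider whose circuit breaker is open and retrying transient failures with jittered exponential back-off. Below is a minimal sketch of that loop; the provider interface and all names are assumptions for illustration, not the repo's actual internals:

```python
import asyncio
import random


async def route_with_failover(chain, request, max_retries: int = 2):
    """Illustrative routing loop; provider/breaker attribute names are assumed."""
    tried = []
    for provider in chain:
        if not provider.breaker.allow_request():  # skip quarantined providers
            tried.append(provider.name)
            continue
        for attempt in range(max_retries + 1):
            try:
                response = await provider.complete(request)
                provider.breaker.record_success()
                return response, tried  # tried == the failover_chain so far
            except Exception:
                provider.breaker.record_failure()
                if attempt < max_retries:
                    # Exponential back-off with full jitter: 0.5s, 1s, 2s ... caps
                    await asyncio.sleep(random.uniform(0, 0.5 * 2 ** attempt))
        tried.append(provider.name)  # provider exhausted; fall through to next
    raise RuntimeError(f"all providers failed; chain tried: {tried}")
```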
## Design Patterns

| Pattern | Implementation |
|---|---|
| Circuit Breaker | 3-state (CLOSED/OPEN/HALF_OPEN) per provider |
| Failover Chain | OpenAI → Ollama → (extensible) |
| Retry Strategy | Exponential back-off with jitter |
| Event Streaming | WebSocket fan-out via async broadcast |
| Incident Storage | Redis list (ring-buffer, 500 events max) |
| Observability | Structured JSON logs + per-provider metrics |
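To make the circuit-breaker row concrete, here is a minimal sketch of a per-provider 3-state breaker using the documented defaults (3 failures to open, 30 s before a half-open probe). Class and method names are illustrative, not the repo's actual API:

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal operation, requests flow through
    OPEN = "open"            # provider quarantined, requests skip it
    HALF_OPEN = "half_open"  # recovery window: admit a single probe request


class CircuitBreaker:
    """Illustrative 3-state breaker; names are assumptions, not the repo's API."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            # After the recovery timeout, transition to HALF_OPEN and probe.
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN
                return True
            return False
        return True  # CLOSED and HALF_OPEN both admit the request

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        # A failed half-open probe, or too many failures, (re)opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()
```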
## Project Structure

```
backend/
├── app/
│   ├── api/
│   │   ├── chat.py            # POST /api/chat
│   │   ├── providers.py       # GET /api/providers/status
│   │   ├── incidents.py       # GET /api/incidents
│   │   ├── metrics.py         # GET /api/metrics
│   │   ├── health.py          # GET /health
│   │   └── middleware.py      # Tracing, rate limiting, security headers
│   ├── core/
│   │   ├── config.py          # Pydantic settings (env vars)
│   │   ├── logging.py         # Structlog JSON logger
│   │   ├── redis.py           # Async Redis pool + helpers
│   │   └── circuit_breaker.py # 3-state circuit breaker
│   ├── providers/
│   │   ├── base.py            # Abstract BaseProvider interface
│   │   ├── openai_provider.py # OpenAI implementation
│   │   ├── ollama_provider.py # Ollama local implementation
│   │   └── registry.py        # Provider registry + chain
│   ├── services/
│   │   ├── failover_engine.py  # Core routing + failover logic
│   │   ├── incident_service.py # Incident persistence + broadcast
│   │   └── metrics_service.py  # Rolling metrics aggregation
│   ├── monitoring/
│   │   └── health_monitor.py  # Async background health checker
│   ├── websocket/
│   │   ├── manager.py         # Connection pool + broadcast
│   │   └── router.py          # WS /ws/system-events endpoint
│   ├── models/
│   │   └── schemas.py         # All Pydantic v2 domain models
│   └── app_factory.py         # FastAPI app factory + lifespan
├── tests/
│   └── test_sentinel.py       # Unit + integration tests
├── main.py                    # Uvicorn entrypoint
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env.example
```
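The contract in `app/providers/base.py` is what new backends plug into; per the extension guide later in this document, it centres on `complete()` and `health_check()`. A hedged sketch of what that interface could look like (signatures are assumptions):

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseProvider(ABC):
    """Illustrative interface; the actual signatures live in app/providers/base.py."""

    name: str  # e.g. "openai", "ollama"

    @abstractmethod
    async def complete(self, messages: list[dict[str, Any]],
                       max_tokens: int, temperature: float) -> str:
        """Run one chat completion and return the response text."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Cheap liveness probe used by the background health monitor."""
```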
## Quick Start (Docker)

```bash
# 1. Clone and enter the project
cd sentinel-ops/backend
# 2. Configure environment
cp .env.example .env
# Edit .env — set OPENAI_API_KEY at minimum
# 3. Start Ollama locally (for fallback)
ollama pull llama3.2
# 4. Launch
docker compose up --build
```

The API is live at http://localhost:8000, with interactive docs at http://localhost:8000/docs.
## Local Development

```bash
# Prerequisites: Python 3.12+, Redis, Ollama
# 1. Install dependencies
pip install -r requirements.txt
# 2. Configure
cp .env.example .env
# Set OPENAI_API_KEY, APP_ENV=development
# 3. Start Redis
redis-server
# 4. Start Ollama
ollama serve
ollama pull llama3.2
# 5. Run
python main.py
# or
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

## API Usage

### Chat completion

Route a prompt through the AI failover engine:

```bash
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Explain circuit breakers in distributed systems."}
],
"max_tokens": 512,
"temperature": 0.7
}'
```

Response:

```json
{
"trace_id": "a1b2c3d4-...",
"provider_used": "openai",
"model": "gpt-4o-mini",
"response_text": "A circuit breaker is ...",
"latency_ms": 843.2,
"status": "success",
"failover_occurred": false,
"failover_chain": [],
"tokens_used": 187
}
```

When OpenAI is down, failover is triggered and the response reflects it:

```json
{
"provider_used": "ollama",
"failover_occurred": true,
"failover_chain": ["openai"]
}
```
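The same endpoint from Python, via httpx (already a project dependency); a minimal sketch assuming a local server:

```python
import httpx

resp = httpx.post(
    "http://localhost:8000/api/chat",
    json={
        "messages": [{"role": "user", "content": "Ping?"}],
        "max_tokens": 64,
        "temperature": 0.2,
    },
    timeout=30.0,  # generous: a failover chain can add latency
)
resp.raise_for_status()
body = resp.json()
print(body["provider_used"], body["latency_ms"], body["failover_occurred"])
```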
"openai": {
"status": "healthy",
"latency_ms": 412.3,
"success_rate_pct": 99.1,
"circuit_breaker": { "state": "closed" }
},
"ollama": {
"status": "healthy",
"latency_ms": 1204.7,
"circuit_breaker": { "state": "closed" }
}
}curl "http://localhost:8000/api/incidents?limit=20&type=failover_triggered"curl http://localhost:8000/api/metrics{
"total_requests": 1042,
"total_successes": 1038,
"total_failures": 4,
"total_failovers": 2,
"avg_latency_ms": 523.1,
"active_provider": "openai",
"uptime_seconds": 3601.0
}
```
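Under the hood, the patterns table describes incident storage as a capped Redis list. With redis-py's asyncio client, such a ring buffer is typically an LPUSH followed by LTRIM; the key name and event shape below are assumptions:

```python
import json

import redis.asyncio as redis

MAX_INCIDENTS = 500  # matches the documented ring-buffer cap


async def record_incident(r: redis.Redis, event: dict) -> None:
    # Newest-first list, trimmed so it never exceeds MAX_INCIDENTS entries.
    await r.lpush("incidents", json.dumps(event))
    await r.ltrim("incidents", 0, MAX_INCIDENTS - 1)
```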
### On-demand health probe

Trigger an on-demand health check:

```bash
curl -X POST http://localhost:8000/api/providers/openai/probe
```

### Live event stream (WebSocket)

Connect from any WebSocket client:

```js
const ws = new WebSocket("ws://localhost:8000/ws/system-events");
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log(data.event_type, data.payload);
};
```

Event types:
- `provider_status` — health update for one provider
- `incident` — new incident recorded (outage, failover, recovery)
- `metrics` — aggregated system metrics (every health-check cycle)
- `heartbeat` — keep-alive ping every 30s
- `system` — connection lifecycle messages
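Outside the browser, the same stream can be consumed from Python. A sketch using the third-party `websockets` package, which is not part of the documented stack:

```python
import asyncio
import json

import websockets  # assumption: pip install websockets


async def watch():
    async with websockets.connect("ws://localhost:8000/ws/system-events") as ws:
        async for raw in ws:  # iterate incoming messages until disconnect
            event = json.loads(raw)
            print(event["event_type"], event.get("payload"))


asyncio.run(watch())
```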
## Testing

```bash
# Run all tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=app --cov-report=term-missing
```
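For a feel of the test style, a minimal endpoint test might look like the sketch below; it assumes `app_factory` exposes a `create_app()` function, which is inferred from the project tree, not confirmed:

```python
# Hedged sketch; create_app() is an assumed factory name.
from fastapi.testclient import TestClient

from app.app_factory import create_app


def test_health_returns_200():
    with TestClient(create_app()) as client:  # context manager runs lifespan
        assert client.get("/health").status_code == 200
```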
## Adding a New Provider

- Create `app/providers/gemini_provider.py` extending `BaseProvider`
- Implement `complete()` and `health_check()`
- Register it in `app/providers/registry.py`:

```python
from app.providers.gemini_provider import GeminiProvider

registry.register(GeminiProvider(), position=2)
```

The failover engine will automatically include it in the chain.
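A hedged skeleton of the new provider to make the steps concrete; everything here except the documented `complete()`/`health_check()` names is an assumption about the repo's internals:

```python
from app.providers.base import BaseProvider


class GeminiProvider(BaseProvider):
    """Skeleton only; method bodies and attribute names are assumptions."""

    name = "gemini"

    async def complete(self, messages, max_tokens, temperature) -> str:
        # Call the Gemini API here and return the generated text.
        raise NotImplementedError

    async def health_check(self) -> bool:
        # Cheap liveness probe, e.g. a model-list request; True if reachable.
        raise NotImplementedError
```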
## Configuration

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | OpenAI API key |
| `OPENAI_MODEL` | `gpt-4o-mini` | Model to use |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `llama3.2` | Local model name |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection string |
| `CIRCUIT_BREAKER_FAILURE_THRESHOLD` | `3` | Failures before circuit opens |
| `CIRCUIT_BREAKER_RECOVERY_TIMEOUT` | `30` | Seconds before half-open probe |
| `HEALTH_CHECK_INTERVAL_SECONDS` | `15` | Background monitoring frequency |
| `RATE_LIMIT_REQUESTS` | `100` | Max requests per window |
| `RATE_LIMIT_WINDOW_SECONDS` | `60` | Rate limit sliding window |
| `APP_ENV` | `production` | development / staging / production |
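Taken together, a `.env` for local development might look like this; values are the documented defaults, and the API key is a placeholder:

```bash
OPENAI_API_KEY=sk-...   # required; placeholder shown
OPENAI_MODEL=gpt-4o-mini
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2
REDIS_URL=redis://localhost:6379/0
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30
HEALTH_CHECK_INTERVAL_SECONDS=15
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW_SECONDS=60
APP_ENV=development
```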
## Tech Stack

- FastAPI — async web framework
- Pydantic v2 — data validation and settings
- httpx — async HTTP client for provider calls
- redis-py (async) — event storage and pub/sub
- structlog — structured JSON logging
- Docker + Compose — containerised deployment