🛡️ Sentinel-Ops AI

Autonomous Self-Healing AI Infrastructure Platform
Detects LLM/API outages and automatically reroutes inference traffic to backup providers in real time.


Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                         Sentinel-Ops AI                         │
│                                                                 │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────┐   │
│  │  FastAPI     │    │  Failover       │    │  Health       │   │
│  │  Gateway     │───▶│  Engine         │───▶│  Monitor      │   │
│  │              │    │ (Circuit Break) │    │ (Background)  │   │
│  └──────┬───────┘    └────────┬────────┘    └───────┬───────┘   │
│         │                     │                     │           │
│         │            ┌────────▼────────┐            │           │
│         │            │  Provider       │            │           │
│         │            │  Registry       │            │           │
│         │            │                 │            │           │
│         │            │  ┌───────────┐  │            │           │
│         │            │  │  OpenAI   │  │◀───────────┘           │
│         │            │  │ (Primary) │  │                        │
│         │            │  └───────────┘  │                        │
│         │            │  ┌───────────┐  │                        │
│         │            │  │  Ollama   │  │                        │
│         │            │  │ (Fallback)│  │                        │
│         │            │  └───────────┘  │                        │
│         │            └─────────────────┘                        │
│         │                                                       │
│  ┌──────▼───────┐    ┌─────────────────┐                        │
│  │  WebSocket   │    │  Redis          │                        │
│  │  Event Bus   │    │  (Incidents +   │                        │
│  │              │    │   Event Cache)  │                        │
│  └──────────────┘    └─────────────────┘                        │
└─────────────────────────────────────────────────────────────────┘

Key Design Patterns

Pattern            Implementation
Circuit Breaker    3-state (CLOSED/OPEN/HALF_OPEN) per provider
Failover Chain     OpenAI → Ollama → (extensible)
Retry Strategy     Exponential back-off with jitter
Event Streaming    WebSocket fan-out via async broadcast
Incident Storage   Redis list (ring-buffer, 500 events max)
Observability      Structured JSON logs + per-provider metrics
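
To make the first pattern concrete, here is a minimal sketch of a 3-state breaker. It is illustrative only, not the code in app/core/circuit_breaker.py; the defaults (3 failures, 30 s recovery) match the environment variables listed below.

import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal operation: calls pass through
    OPEN = "open"            # provider assumed down: calls short-circuit
    HALF_OPEN = "half_open"  # recovery window: one probe call allowed

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = State.CLOSED
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            # After the recovery timeout, allow a single probe through.
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN
                return True
            return False
        return True  # CLOSED or HALF_OPEN

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()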

Folder Structure

backend/
├── app/
│   ├── api/
│   │   ├── chat.py           # POST /api/chat
│   │   ├── providers.py      # GET /api/providers/status
│   │   ├── incidents.py      # GET /api/incidents
│   │   ├── metrics.py        # GET /api/metrics
│   │   ├── health.py         # GET /health
│   │   └── middleware.py     # Tracing, rate limiting, security headers
│   ├── core/
│   │   ├── config.py         # Pydantic settings (env vars)
│   │   ├── logging.py        # Structlog JSON logger
│   │   ├── redis.py          # Async Redis pool + helpers
│   │   └── circuit_breaker.py # 3-state circuit breaker
│   ├── providers/
│   │   ├── base.py           # Abstract BaseProvider interface
│   │   ├── openai_provider.py # OpenAI implementation
│   │   ├── ollama_provider.py # Ollama local implementation
│   │   └── registry.py       # Provider registry + chain
│   ├── services/
│   │   ├── failover_engine.py # Core routing + failover logic
│   │   ├── incident_service.py # Incident persistence + broadcast
│   │   └── metrics_service.py # Rolling metrics aggregation
│   ├── monitoring/
│   │   └── health_monitor.py # Async background health checker
│   ├── websocket/
│   │   ├── manager.py        # Connection pool + broadcast
│   │   └── router.py         # WS /ws/system-events endpoint
│   ├── models/
│   │   └── schemas.py        # All Pydantic v2 domain models
│   └── app_factory.py        # FastAPI app factory + lifespan
├── tests/
│   └── test_sentinel.py      # Unit + integration tests
├── main.py                   # Uvicorn entrypoint
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env.example

Quick Start

Option A — Docker Compose (recommended)

# 1. Clone the repository, then enter the backend directory
cd sentinel-ops/backend

# 2. Configure environment
cp .env.example .env
# Edit .env — set OPENAI_API_KEY at minimum

# 3. Pull the fallback model (requires Ollama running locally)
ollama pull llama3.2

# 4. Launch
docker compose up --build

The API is now live at http://localhost:8000
Interactive docs at http://localhost:8000/docs


Option B — Local Development

# Prerequisites: Python 3.12+, Redis, Ollama

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Set OPENAI_API_KEY, APP_ENV=development

# 3. Start Redis
redis-server

# 4. Start Ollama
ollama serve
ollama pull llama3.2

# 5. Run
python main.py
# or
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
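
Either way, a quick smoke test once the server is up (the endpoint is defined in app/api/health.py):

curl http://localhost:8000/health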

API Reference

POST /api/chat

Route a prompt through the AI failover engine.

curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain circuit breakers in distributed systems."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

Response:

{
  "trace_id": "a1b2c3d4-...",
  "provider_used": "openai",
  "model": "gpt-4o-mini",
  "response_text": "A circuit breaker is ...",
  "latency_ms": 843.2,
  "status": "success",
  "failover_occurred": false,
  "failover_chain": [],
  "tokens_used": 187
}

When OpenAI is down, the same request fails over automatically:

{
  "provider_used": "ollama",
  "failover_occurred": true,
  "failover_chain": ["openai"]
}
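
The same call from Python, using httpx (already a project dependency); the fields read from the response are the ones shown above:

import httpx

resp = httpx.post(
    "http://localhost:8000/api/chat",
    json={
        "messages": [{"role": "user", "content": "Explain circuit breakers."}],
        "max_tokens": 512,
        "temperature": 0.7,
    },
    timeout=60.0,  # generous timeout: a failover may retry several providers
)
data = resp.json()
print(data["provider_used"], data["latency_ms"], data["failover_occurred"])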

GET /api/providers/status

curl http://localhost:8000/api/providers/status
{
  "openai": {
    "status": "healthy",
    "latency_ms": 412.3,
    "success_rate_pct": 99.1,
    "circuit_breaker": { "state": "closed" }
  },
  "ollama": {
    "status": "healthy",
    "latency_ms": 1204.7,
    "circuit_breaker": { "state": "closed" }
  }
}

GET /api/incidents

curl "http://localhost:8000/api/incidents?limit=20&type=failover_triggered"

GET /api/metrics

curl http://localhost:8000/api/metrics
{
  "total_requests": 1042,
  "total_successes": 1038,
  "total_failures": 4,
  "total_failovers": 2,
  "avg_latency_ms": 523.1,
  "active_provider": "openai",
  "uptime_seconds": 3601.0
}

POST /api/providers/{name}/probe

Trigger an on-demand health check:

curl -X POST http://localhost:8000/api/providers/openai/probe

WS /ws/system-events

Connect from any WebSocket client:

const ws = new WebSocket("ws://localhost:8000/ws/system-events");

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.event_type, data.payload);
};
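
Or from Python, assuming the third-party websockets package (not in requirements.txt, so pip install websockets first):

import asyncio
import json

import websockets

async def listen():
    async with websockets.connect("ws://localhost:8000/ws/system-events") as ws:
        async for raw in ws:  # iterate incoming messages until the socket closes
            event = json.loads(raw)
            print(event["event_type"], event.get("payload"))

asyncio.run(listen())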

Event types:

  • provider_status — health update for one provider
  • incident — new incident recorded (outage, failover, recovery)
  • metrics — aggregated system metrics (every health-check cycle)
  • heartbeat — keep-alive ping every 30s
  • system — connection lifecycle messages

Testing

# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=app --cov-report=term-missing

Adding a New Provider

  1. Create app/providers/gemini_provider.py extending BaseProvider
  2. Implement complete() and health_check()
  3. Register it in app/providers/registry.py:
from app.providers.gemini_provider import GeminiProvider
registry.register(GeminiProvider(), position=2)

The failover engine will automatically include it in the chain.
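
For illustration, a skeletal provider class; the method signatures below are assumptions based on the contract described above, so check app/providers/base.py for the real interface:

from app.providers.base import BaseProvider

class GeminiProvider(BaseProvider):
    name = "gemini"

    async def complete(self, messages: list[dict], **kwargs) -> str:
        # Call the Gemini API here (e.g. via httpx) and return the completion.
        raise NotImplementedError

    async def health_check(self) -> bool:
        # Lightweight probe; return True when the API answers.
        raise NotImplementedError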


Environment Variables

Variable                             Default                     Description
OPENAI_API_KEY                       (required)                  OpenAI API key
OPENAI_MODEL                         gpt-4o-mini                 Model to use
OLLAMA_BASE_URL                      http://localhost:11434      Ollama server URL
OLLAMA_MODEL                         llama3.2                    Local model name
REDIS_URL                            redis://localhost:6379/0    Redis connection string
CIRCUIT_BREAKER_FAILURE_THRESHOLD    3                           Failures before the circuit opens
CIRCUIT_BREAKER_RECOVERY_TIMEOUT     30                          Seconds before the half-open probe
HEALTH_CHECK_INTERVAL_SECONDS        15                          Background monitoring frequency
RATE_LIMIT_REQUESTS                  100                         Max requests per window
RATE_LIMIT_WINDOW_SECONDS            60                          Rate-limit sliding window
APP_ENV                              production                  development / staging / production
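
For reference, a development-oriented starter .env assembled from the defaults above (begin with cp .env.example .env; only the API key has no default, and APP_ENV is switched to development as in the local setup):

OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o-mini
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2
REDIS_URL=redis://localhost:6379/0
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30
HEALTH_CHECK_INTERVAL_SECONDS=15
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW_SECONDS=60
APP_ENV=development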

Built With

  • FastAPI — async web framework
  • Pydantic v2 — data validation and settings
  • httpx — async HTTP client for provider calls
  • redis-py (async) — event storage and pub/sub
  • structlog — structured JSON logging
  • Docker + Compose — containerised deployment
