OrionAI is a scalable, production-grade agent orchestration system that routes user queries through a structured four-stage AI pipeline, backed by session memory, retry logic, structured JSON outputs, per-agent latency tracking, and a live monitoring dashboard.
```
┌────────────────────────────────────────────────────────────────────┐
│                         User / HTTP Client                         │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │ POST /run { query, session_id }
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                       FastAPI Layer (:8000)                        │
│  • Request validation (Pydantic v2)                                │
│  • Async request handling                                          │
│  • Global exception handler                                        │
│  • Request timing middleware                                       │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                        Orchestrator Engine                         │
│  • Cache check (skip pipeline if duplicate query)                  │
│  • Drives the 4-agent pipeline                                     │
│  • Manages Executor→Critic retry loop (max 3)                      │
│  • Tracks per-agent and total latency                              │
│  • Writes structured step records to memory                        │
│  • Records metrics for the live dashboard                          │
└───┬───────────────┬───────────────┬───────────────┬────────────────┘
    │               │               │               │
    ▼               ▼               ▼               ▼
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Planner  │    │Retriever │    │ Executor │    │  Critic  │
│  Agent   │──▶ │  Agent   │──▶ │  Agent   │──▶ │  Agent   │
│          │    │2-tier:   │    │concurrent│    │scores &  │
│Steps plan│    │corpus +  │    │steps +   │    │retries;  │
│          │    │LLM synth │    │synthesis │    │approves /│
│          │    │          │    │          │    │rejects   │
└──────────┘    └──────────┘    └──────────┘    └────┬─────┘
                                                     │ rejected?
                                                     ▼
                                          ┌──────────────────┐
                                          │   Retry Loop     │
                                          │   (feedback →    │
                                          │   next attempt)  │
                                          └──────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│                        Infrastructure Layer                        │
│  ┌─────────────────────┐  ┌──────────────┐  ┌─────────────────┐    │
│  │ Memory Store        │  │ TTL Cache    │  │ Metrics Store   │    │
│  │ (Redis / in-memory) │  │ (in-process) │  │ (rolling 200)   │    │
│  └─────────────────────┘  └──────────────┘  └─────────────────┘    │
│  ┌─────────────────────┐  ┌──────────────────────────────────┐     │
│  │ LLM Client          │  │ Structured JSON Logger           │     │
│  │ (OpenAI / Mock)     │  │ (stdout + rotating file)         │     │
│  └─────────────────────┘  └──────────────────────────────────┘     │
└────────────────────────────────────────────────────────────────────┘

              ┌──────────────────────────────────────────────┐
              │      Live Dashboard   GET /dashboard         │
              │  Charts: Latency · Retries · Confidence      │
              │  Agent latency bars · Request history table  │
              └──────────────────────────────────────────────┘
```
| Agent | Role | Output Schema |
|---|---|---|
| Planner | Decomposes the query into 2–6 ordered, typed steps | `{ steps: [{id, description, type}] }` |
| Retriever | Fetches context (keyword corpus first, LLM synthesis fallback) | `{ context, sources: [{title, relevance}] }` |
| Executor | Runs each step concurrently via LLM, then synthesises a final answer | `{ results: [{step_id, output}], combined }` |
| Critic | Scores the answer on 4 dimensions (relevance · accuracy · completeness · clarity), approves or rejects | `{ approved, confidence, feedback, scores }` |
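The Executor's concurrent step execution can be sketched with `asyncio` (a simplified stand-in for `agents/executor.py`; the helper names here are illustrative, not the real API):

```python
import asyncio

async def run_step(step: dict) -> dict:
    # Stand-in for one LLM call per plan step (mocked here).
    await asyncio.sleep(0)  # placeholder for network latency
    return {"step_id": step["id"], "output": f"result for: {step['description']}"}

async def execute_plan(steps: list[dict]) -> dict:
    # Fan out: all steps run concurrently; gather preserves input order.
    results = await asyncio.gather(*(run_step(s) for s in steps))
    combined = " | ".join(r["output"] for r in results)
    return {"results": list(results), "combined": combined}

plan = [
    {"id": 1, "description": "outline the architecture", "type": "analysis"},
    {"id": 2, "description": "explain self-attention", "type": "synthesis"},
]
answer = asyncio.run(execute_plan(plan))
```

Because `asyncio.gather` preserves input order, the `results` array lines up with the plan's step ids even though the calls overlap in time.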
The Critic Agent gates the Executor's output. If confidence < threshold (default 0.70):

- The Critic's `feedback` is passed back to the Executor as context.
- The Executor retries (same plan, same context, improved prompt).
- This repeats up to `MAX_RETRIES` times (default 3).
- After exhausting retries, the last answer is always returned.
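The Critic-gated retry loop can be sketched in a few lines (a simplified stand-in for `orchestrator/engine.py`; function names are illustrative):

```python
def run_with_critic(executor, critic, max_retries: int = 3, threshold: float = 0.70):
    # Retry with feedback until the Critic approves or the budget is spent.
    feedback, answer = None, None
    for attempt in range(max_retries + 1):
        answer = executor(feedback)
        verdict = critic(answer)
        if verdict["confidence"] >= threshold:
            return answer, attempt          # attempt == number of retries used
        feedback = verdict["feedback"]      # folded into the next attempt's prompt
    return answer, max_retries              # budget exhausted: last answer still returned

# Toy executor/critic pair: the first draft is rejected, the revision passes.
def executor(feedback):
    return "draft" if feedback is None else "improved draft"

def critic(answer):
    ok = answer == "improved draft"
    return {"confidence": 0.9 if ok else 0.4, "feedback": "add detail"}

answer, retries = run_with_critic(executor, critic)
```

Note the loop never raises on rejection: even a low-confidence answer is returned once the budget is spent, matching the "last answer is always returned" rule above.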
```
CortexOps/
├── main.py                  ← FastAPI app factory + entry point
├── requirements.txt
├── .env.example
│
├── api/
│   ├── models.py            ← Pydantic request/response models
│   ├── routes.py            ← POST /run, GET /health, /session/* endpoints
│   └── metrics_routes.py    ← GET /metrics/summary, /history, /dashboard
│
├── agents/
│   ├── base_agent.py        ← Abstract base + AgentResult envelope
│   ├── planner.py           ← PlannerAgent
│   ├── retriever.py         ← RetrieverAgent (2-tier)
│   ├── executor.py          ← ExecutorAgent (concurrent steps + synthesis)
│   └── critic.py            ← CriticAgent (4-dimension scoring)
│
├── orchestrator/
│   └── engine.py            ← OrchestratorEngine (pipeline + retry loop)
│
├── memory/
│   └── store.py             ← MemoryStore (Redis / in-memory fallback)
│
├── utils/
│   ├── inference.py         ← LLMClient (OpenAI async + mock mode)
│   ├── logger.py            ← StructuredLogger (JSON, rotating file)
│   ├── cache.py             ← SimpleCache (TTL, thread-safe)
│   └── metrics.py           ← MetricsStore (rolling 200-request window)
│
├── config/
│   └── settings.py          ← Pydantic BaseSettings (config-driven)
│
├── dashboard/
│   └── index.html           ← Live Latency & Retry Dashboard (SPA)
│
└── logs/
    └── orion.log            ← JSON-structured rotating log file
```
```bash
cd CortexOps
python -m venv .venv

# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

pip install -r requirements.txt
```

```bash
cp .env.example .env
# Edit .env and set OPENAI_API_KEY
```

No API key? Leave `OPENAI_API_KEY=sk-placeholder`; the system runs in Mock Mode using deterministic responses, so the full pipeline executes without any real LLM calls.

```bash
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

Then open http://localhost:8000/dashboard.
Request

```json
{
  "query": "Explain how transformer models work",
  "session_id": "user-session-001"
}
```

Response

```json
{
  "request_id": "a1b2c3d4-...",
  "session_id": "user-session-001",
  "query": "Explain how transformer models work",
  "steps": [
    {"id": 1, "description": "Understand the architecture of transformers", "type": "analysis"},
    {"id": 2, "description": "Retrieve context on self-attention mechanisms", "type": "retrieval"},
    {"id": 3, "description": "Synthesise a comprehensive explanation", "type": "synthesis"}
  ],
  "final_answer": "Transformer models use self-attention to weigh the importance of each token...",
  "retries": 1,
  "latency": 2.847,
  "critic_confidence": 0.88,
  "critic_feedback": "Clear and well-structured. Covers key concepts.",
  "agent_latencies_ms": {
    "planner": 312.4,
    "retriever": 48.1,
    "executor_attempt_0": 1402.7,
    "critic_attempt_0": 289.8,
    "executor_attempt_1": 689.4,
    "critic_attempt_1": 104.6
  },
  "from_cache": false,
  "error": null
}
```

Other endpoints:

- `GET /health`: Returns `{ "status": "ok", "version": "1.0.0", "memory_backend": "redis", "cache_enabled": true }`.
- `GET /dashboard`: Opens the live Latency & Retry Dashboard in the browser.
- `GET /metrics/summary`: Returns aggregate KPIs: avg/P95 latency, retry rate, critic confidence, error rate, per-agent averages.
- `GET /metrics/history`: Returns the rolling history of the last N requests for timeline charts.
- `GET /session/*`: Retrieves all memory stored for a session (steps, outputs, status).
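A minimal client sketch using only the standard library (it assumes the server from the quick-start is listening on localhost:8000, so the actual network call is left commented out):

```python
import json
import urllib.request

def build_run_request(query: str, session_id: str,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    # Assemble the POST /run call; the body matches the request schema above.
    body = json.dumps({"query": query, "session_id": session_id}).encode()
    return urllib.request.Request(
        f"{base_url}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_run_request("Explain how transformer models work", "user-session-001")
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["final_answer"], result["critic_confidence"])
```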
| Widget | Description |
|---|---|
| KPI Cards | Total requests, avg/P95 latency, avg retries, critic confidence, error rate |
| Latency Timeline | Line chart of end-to-end seconds per request (last 50) |
| Retry Distribution | Bar chart: how many requests needed 0/1/2/3 retries |
| Confidence Timeline | Critic score per request with approval threshold line |
| Agent Latency Bars | Horizontal bars showing avg ms per agent |
| Request History | Table with query, latency, retries, confidence, cache status, request ID |
Auto-refreshes every 5 seconds. Refresh button available for manual sync.
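The KPI numbers above can be derived from a rolling window of recent requests. This is a simplified stand-in for `utils/metrics.py` (names and the nearest-rank P95 are illustrative, not the actual implementation):

```python
from collections import deque
from statistics import mean

class RollingMetrics:
    # Fixed-size window of recent request records; old entries fall off automatically.
    def __init__(self, window: int = 200):
        self.records = deque(maxlen=window)

    def add(self, latency_s: float, retries: int, confidence: float) -> None:
        self.records.append(
            {"latency": latency_s, "retries": retries, "confidence": confidence}
        )

    def summary(self) -> dict:
        lats = sorted(r["latency"] for r in self.records)
        p95 = lats[max(0, int(0.95 * len(lats)) - 1)]  # simple nearest-rank P95
        return {
            "requests": len(lats),
            "avg_latency": mean(lats),
            "p95_latency": p95,
            "avg_retries": mean(r["retries"] for r in self.records),
            "avg_confidence": mean(r["confidence"] for r in self.records),
        }

m = RollingMetrics(window=200)
for i in range(100):
    m.add(latency_s=1.0 + i * 0.01, retries=i % 2, confidence=0.8)
stats = m.summary()
```

A bounded `deque(maxlen=...)` keeps memory constant no matter how long the server runs, which is why a rolling window is preferred over an unbounded log for live dashboards.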
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | `sk-placeholder` | OpenAI key (mock mode if placeholder) |
| `MODEL_NAME` | `gpt-4o-mini` | Any OpenAI chat model |
| `MAX_RETRIES` | `3` | Critic-triggered retry budget |
| `CRITIC_CONFIDENCE_THRESHOLD` | `0.70` | Min score to approve without retry |
| `REDIS_URL` | `redis://localhost:6379/0` | Falls back to in-memory if unavailable |
| `CACHE_TTL_SECONDS` | `300` | Deduplication window (seconds) |
| `LOG_LEVEL` | `INFO` | DEBUG / INFO / WARNING / ERROR |
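The `CACHE_TTL_SECONDS` deduplication window can be pictured as a small thread-safe TTL cache. This is a simplified stand-in for `utils/cache.py`, not the actual implementation:

```python
import threading
import time

class SimpleTTLCache:
    # Thread-safe TTL cache: repeat queries inside the window skip the pipeline.
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if time.monotonic() >= expires_at:
                del self._data[key]   # lazy eviction of expired entries
                return None
            return value

    def set(self, key, value) -> None:
        with self._lock:
            self._data[key] = (time.monotonic() + self.ttl, value)

cache = SimpleTTLCache(ttl_seconds=0.05)   # tiny TTL just for the demo
cache.set("q:transformers", {"final_answer": "..."})
hit = cache.get("q:transformers")          # fresh entry: served from cache
time.sleep(0.06)
miss = cache.get("q:transformers")         # past the TTL: treated as a miss
```

Using `time.monotonic()` rather than `time.time()` keeps expiry correct even if the system clock is adjusted.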
To replace OpenAI with a local model (vLLM, Ollama, etc.), edit only `utils/inference.py`:

```python
# Replace the OpenAI call in LLMClient.chat_complete() with:
response = await your_local_client.generate(prompt=user_prompt)
```

Everything else (agents, orchestrator, memory, dashboard) remains unchanged.
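A sketch of what such a client can look like, with the mock path kept so the pipeline still runs offline. Class and method names are illustrative, not the actual `utils/inference.py`:

```python
import asyncio
import hashlib

class LocalLLMClient:
    # One async method hides the backend choice from the agents.
    def __init__(self, api_key: str = "sk-placeholder"):
        self.mock = api_key.startswith("sk-placeholder")

    async def chat_complete(self, user_prompt: str) -> str:
        if self.mock:
            # Deterministic mock: same prompt, same reply; no network needed.
            tag = hashlib.sha256(user_prompt.encode()).hexdigest()[:8]
            return f"[mock:{tag}] {user_prompt}"
        # Real backend goes here, e.g.:
        # return await your_local_client.generate(prompt=user_prompt)
        raise NotImplementedError("plug in vLLM / Ollama / OpenAI here")

client = LocalLLMClient()
reply = asyncio.run(client.chat_complete("Explain how transformer models work"))
```

Keeping the mock deterministic (a hash of the prompt, not a random string) makes cached responses and tests reproducible.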
Modern AI applications need more than a single LLM call. OrionAI demonstrates:
- Reliability: the Critic/retry loop catches low-quality outputs before they reach the user.
- Observability: every step is logged as structured JSON and visualized on the live dashboard.
- Modularity: each agent is independent; swap any one without touching the others.
- Production patterns: async throughout, config-driven, graceful Redis fallback, request deduplication.
- Extensibility: drop in a real vector DB for the Retriever, or swap the LLM backend with one file change.
We are actively improving OrionAI. Contributions are welcome!
- Add Redis-based caching layer for inference optimization
- Implement critic agent retry mechanism
- Add latency tracking and monitoring
- Multi-agent consensus mechanism
- Streaming responses (token-level)
- Distributed worker system
- Improve logging and debugging tools
- Add unit tests for agents