coderleeon/OrionAI

OrionAI — Production Multi-Agent Inference Engine

OrionAI is a scalable, production-grade agent orchestration system that routes user queries through a structured four-stage AI pipeline with memory, retry logic, structured JSON outputs, latency tracking, and a live monitoring dashboard.


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                      User / HTTP Client                          │
└────────────────────────────────┬─────────────────────────────────┘
                                 │  POST /run  { query, session_id }
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                    FastAPI Layer  (:8000)                        │
│  • Request validation (Pydantic v2)                              │
│  • Async request handling                                        │
│  • Global exception handler                                      │
│  • Request timing middleware                                     │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                   Orchestrator Engine                            │
│  • Cache check (skip pipeline if duplicate query)                │
│  • Drives the 4-agent pipeline                                   │
│  • Manages Executor→Critic retry loop (max 3)                    │
│  • Tracks per-agent and total latency                            │
│  • Writes structured step records to memory                      │
│  • Records metrics for the live dashboard                        │
└──┬──────────────┬──────────────┬──────────────┬──────────────────┘
   │              │              │              │
   ▼              ▼              ▼              ▼
┌──────┐     ┌─────────┐   ┌─────────┐   ┌──────────┐
│Plann-│     │Retriever│   │Executor │   │  Critic  │
│  er  │────▶│  Agent  │──▶│  Agent  │──▶│  Agent   │
│Agent │     │         │   │(concurr-│   │(scores,  │
│      │     │2-tier:  │   │ent step │   │ retries) │
│Steps │     │corpus + │   │ exec +  │   │approves  │
│plan  │     │LLM synth│   │synthesis│   │or rejects│
└──────┘     └─────────┘   └─────────┘   └────┬─────┘
                                              │  rejected?
                                              ▼
                                    ┌──────────────────┐
                                    │  Retry Loop      │
                                    │  (feedback →     │
                                    │   next attempt)  │
                                    └──────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  Infrastructure Layer                                            │
│  ┌──────────────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │  Memory Store        │  │   TTL Cache  │  │  Metrics Store │  │
│  │  (Redis / in-memory) │  │  (in-process)│  │  (rolling 200) │  │
│  └──────────────────────┘  └──────────────┘  └────────────────┘  │
│  ┌──────────────────────┐  ┌──────────────────────────────────┐  │
│  │  LLM Client          │  │  Structured JSON Logger          │  │
│  │  (OpenAI / Mock)     │  │  (stdout + rotating file)        │  │
│  └──────────────────────┘  └──────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  Live Dashboard  GET /dashboard             │
│  Charts: Latency · Retries · Confidence     │
│  Agent latency bars · Request history table │
└─────────────────────────────────────────────┘

Agents

Planner    Decomposes the query into 2–6 ordered, typed steps.
           Output: { steps: [{id, description, type}] }
Retriever  Fetches context (keyword corpus first, LLM synthesis fallback).
           Output: { context, sources: [{title, relevance}] }
Executor   Runs each step concurrently via LLM, then synthesises a final answer.
           Output: { results: [{step_id, output}], combined }
Critic     Scores the answer on 4 dimensions (relevance · accuracy · completeness · clarity) and approves or rejects.
           Output: { approved, confidence, feedback, scores }

Retry Logic

The Critic Agent gates the Executor's output. If confidence < threshold (default 0.70):

  1. The Critic's feedback is passed back to the Executor as context.
  2. The Executor retries (same plan, same context, improved prompt).
  3. This repeats up to MAX_RETRIES times (default 3).
  4. After exhausting retries, the last answer is always returned.
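The loop above can be sketched in a few lines of plain Python (the function and agent names here are illustrative stand-ins, not the project's actual API; the real Executor and Critic are async agents):

```python
MAX_RETRIES = 3                  # matches the default retry budget
CONFIDENCE_THRESHOLD = 0.70      # matches the default approval threshold

def run_with_critic(execute, critique):
    """Critic-gated retry loop: execute(feedback) produces an answer,
    critique(answer) returns (confidence, feedback). The most recent
    answer is returned even if the Critic never approves."""
    feedback, answer, confidence = None, None, 0.0
    for attempt in range(MAX_RETRIES + 1):
        answer = execute(feedback)              # feedback refines the next prompt
        confidence, feedback = critique(answer)
        if confidence >= CONFIDENCE_THRESHOLD:  # approved: stop retrying
            break
    return answer, confidence, attempt

# Toy agents: the first attempt is rejected, the second approved.
responses = [("draft answer", 0.55), ("improved answer", 0.88)]
state = {"i": 0}

def execute(feedback):
    answer, _ = responses[state["i"]]
    state["i"] += 1
    return answer

def critique(answer):
    return dict(responses)[answer], "tighten the explanation"

answer, confidence, attempts = run_with_critic(execute, critique)
```

Here the second attempt clears the 0.70 threshold, so the loop stops after one retry.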

Project Structure

OrionAI/
├── main.py                    ← FastAPI app factory + entry point
├── requirements.txt
├── .env.example
│
├── api/
│   ├── models.py              ← Pydantic request/response models
│   ├── routes.py              ← POST /run, GET /health, /session/* endpoints
│   └── metrics_routes.py      ← GET /metrics/summary, /history, /dashboard
│
├── agents/
│   ├── base_agent.py          ← Abstract base + AgentResult envelope
│   ├── planner.py             ← PlannerAgent
│   ├── retriever.py           ← RetrieverAgent (2-tier)
│   ├── executor.py            ← ExecutorAgent (concurrent steps + synthesis)
│   └── critic.py              ← CriticAgent (4-dimension scoring)
│
├── orchestrator/
│   └── engine.py              ← OrchestratorEngine (pipeline + retry loop)
│
├── memory/
│   └── store.py               ← MemoryStore (Redis / in-memory fallback)
│
├── utils/
│   ├── inference.py           ← LLMClient (OpenAI async + mock mode)
│   ├── logger.py              ← StructuredLogger (JSON, rotating file)
│   ├── cache.py               ← SimpleCache (TTL, thread-safe)
│   └── metrics.py             ← MetricsStore (rolling 200-request window)
│
├── config/
│   └── settings.py            ← Pydantic BaseSettings (config-driven)
│
├── dashboard/
│   └── index.html             ← Live Latency & Retry Dashboard (SPA)
│
└── logs/
    └── orion.log              ← JSON-structured rotating log file

Setup

1. Clone & install

git clone https://github.com/coderleeon/OrionAI.git
cd OrionAI
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

pip install -r requirements.txt

2. Configure environment

cp .env.example .env
# Edit .env and set OPENAI_API_KEY

No API key? Leave OPENAI_API_KEY=sk-placeholder — the system runs in Mock Mode using deterministic responses, so the full pipeline executes without any real LLM calls.

3. Start the server

uvicorn main:app --reload --host 0.0.0.0 --port 8000

4. Open the dashboard

http://localhost:8000/dashboard

API Reference

POST /run — Run the inference pipeline

Request

{
  "query": "Explain how transformer models work",
  "session_id": "user-session-001"
}

Response

{
  "request_id": "a1b2c3d4-...",
  "session_id": "user-session-001",
  "query": "Explain how transformer models work",
  "steps": [
    {"id": 1, "description": "Understand the architecture of transformers", "type": "analysis"},
    {"id": 2, "description": "Retrieve context on self-attention mechanisms", "type": "retrieval"},
    {"id": 3, "description": "Synthesise a comprehensive explanation", "type": "synthesis"}
  ],
  "final_answer": "Transformer models use self-attention to weigh the importance of each token...",
  "retries": 1,
  "latency": 2.847,
  "critic_confidence": 0.88,
  "critic_feedback": "Clear and well-structured. Covers key concepts.",
  "agent_latencies_ms": {
    "planner": 312.4,
    "retriever": 48.1,
    "executor_attempt_0": 1402.7,
    "critic_attempt_0": 289.8,
    "executor_attempt_1": 689.4,
    "critic_attempt_1": 104.6
  },
  "from_cache": false,
  "error": null
}
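As an example, the request above can be issued from Python with nothing but the standard library (the `build_run_request` helper is illustrative, not part of the project):

```python
import json
import urllib.request

def build_run_request(query, session_id, base_url="http://localhost:8000"):
    """Build the POST /run request shown above (stdlib-only sketch)."""
    payload = {"query": query, "session_id": session_id}
    return urllib.request.Request(
        f"{base_url}/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_run_request("Explain how transformer models work", "user-session-001")

# Sending it requires the server from the Setup section to be running:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["final_answer"], result["critic_confidence"])
```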

GET /health

{ "status": "ok", "version": "1.0.0", "memory_backend": "redis", "cache_enabled": true }

GET /dashboard

Opens the live Latency & Retry Dashboard in the browser.

GET /metrics/summary

Returns aggregate KPIs: avg/P95 latency, retry rate, critic confidence, error rate, per-agent averages.
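P95 here is the 95th-percentile end-to-end latency over the rolling request window. The project's exact percentile method isn't documented; a nearest-rank version would look like:

```python
import math

def p95(latencies):
    """95th-percentile via the nearest-rank method (illustrative sketch)."""
    if not latencies:
        return 0.0
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))   # 1-indexed nearest rank
    return ordered[rank - 1]

window = [0.8, 1.1, 0.9, 4.2, 1.0]          # end-to-end seconds per request
slowest = p95(window)                        # a single outlier dominates P95
```

Unlike the average, P95 surfaces the slow tail that users actually feel.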

GET /metrics/history?limit=50

Returns the rolling history of the last N requests for timeline charts.

GET /session/{session_id}

Retrieve all memory stored for a session (steps, outputs, status).


Dashboard Features

Widget               Description
KPI Cards            Total requests, avg/P95 latency, avg retries, critic confidence, error rate
Latency Timeline     Line chart of end-to-end seconds per request (last 50)
Retry Distribution   Bar chart: how many requests needed 0/1/2/3 retries
Confidence Timeline  Critic score per request with approval threshold line
Agent Latency Bars   Horizontal bars showing avg ms per agent
Request History      Table with query, latency, retries, confidence, cache status, request ID

Auto-refreshes every 5 seconds. Refresh button available for manual sync.


Advanced Configuration

Variable                     Default                    Description
OPENAI_API_KEY               sk-placeholder             OpenAI key (mock mode if placeholder)
MODEL_NAME                   gpt-4o-mini                Any OpenAI chat model
MAX_RETRIES                  3                          Critic-triggered retry budget
CRITIC_CONFIDENCE_THRESHOLD  0.70                       Min score to approve without retry
REDIS_URL                    redis://localhost:6379/0   Falls back to in-memory if unavailable
CACHE_TTL_SECONDS            300                        Deduplication window (seconds)
LOG_LEVEL                    INFO                       DEBUG / INFO / WARNING / ERROR
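The resolution order is the usual one: an environment value wins, otherwise the default applies. A minimal stdlib sketch of that behaviour (the real config/settings.py uses Pydantic BaseSettings; `load_settings` is illustrative):

```python
DEFAULTS = {
    "OPENAI_API_KEY": "sk-placeholder",
    "MODEL_NAME": "gpt-4o-mini",
    "MAX_RETRIES": "3",
    "CRITIC_CONFIDENCE_THRESHOLD": "0.70",
    "REDIS_URL": "redis://localhost:6379/0",
    "CACHE_TTL_SECONDS": "300",
    "LOG_LEVEL": "INFO",
}

def load_settings(env):
    """Resolve configuration from an environment mapping, applying the
    documented defaults and coercing numeric values."""
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        **raw,
        "MAX_RETRIES": int(raw["MAX_RETRIES"]),
        "CRITIC_CONFIDENCE_THRESHOLD": float(raw["CRITIC_CONFIDENCE_THRESHOLD"]),
        "CACHE_TTL_SECONDS": int(raw["CACHE_TTL_SECONDS"]),
    }

cfg = load_settings({"MAX_RETRIES": "5"})    # pass os.environ in real use
mock_mode = cfg["OPENAI_API_KEY"] == "sk-placeholder"
```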

Swapping the LLM

To replace OpenAI with a local model (vLLM, Ollama, etc.), edit only utils/inference.py:

# Replace the OpenAI call in LLMClient.chat_complete() with:
response = await your_local_client.generate(prompt=user_prompt)

Everything else — agents, orchestrator, memory, dashboard — remains unchanged.
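For instance, an Ollama-backed replacement could be sketched as below. The /api/generate endpoint with a {model, prompt, stream} body and a "response" field in the reply is Ollama's documented non-streaming API; the helper names are hypothetical, and the sketch is synchronous for brevity where the real LLMClient is async:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default port

def build_ollama_request(prompt, model="llama3"):
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat_complete(prompt, model="llama3"):
    # Requires a running `ollama serve`; with stream=False the reply is a
    # single JSON object whose "response" field holds the completion.
    with urllib.request.urlopen(build_ollama_request(prompt, model)) as resp:
        return json.load(resp)["response"]
```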


Why This System Matters

Modern AI applications need more than a single LLM call. OrionAI demonstrates:

  • Reliability — The Critic/retry loop catches low-quality outputs before they reach the user.
  • Observability — Every step is logged as structured JSON and visualized on the live dashboard.
  • Modularity — Each agent is independent; swap any one without touching the others.
  • Production patterns — Async throughout, config-driven, graceful Redis fallback, request deduplication.
  • Extensibility — Drop in a real vector DB for the Retriever, or swap the LLM backend with one file change.

Roadmap

We are actively improving OrionAI. Contributions are welcome!

Core Improvements

  • Add Redis-based caching layer for inference optimization
  • Harden the Critic retry mechanism
  • Extend latency tracking and monitoring

Advanced Features

  • Multi-agent consensus mechanism
  • Streaming responses (token-level)
  • Distributed worker system

Developer Experience

  • Improve logging and debugging tools
  • Add unit tests for agents

About

OrionAI is a production-grade multi-agent inference engine that orchestrates planning, retrieval, execution, and self-correction workflows using LLMs, enabling reliable and scalable AI system behavior beyond simple prompt-response pipelines.
