
🔭 AI Observability Stack

Self-contained LLM monitoring system. No external services. No cloud accounts. Runs from python quickstart.py.


The Problem

"Which operation in our pipeline consumed 40% of this month's LLM budget? Which user generated the 5 most expensive calls this week? Is our quality degrading — or did it just get worse on Tuesdays?"

LLM observability is about more than just logging requests or counting tokens. It's about measuring whether an AI system is behaving correctly, safely, and consistently over time. Most tools require cloud accounts, external APIs, or Docker stacks. This project does it all locally with SQLite and Streamlit.


What It Does

  • Cost tracking: Per-call, per-model, per-feature, per-user. USD + MYR
  • Latency monitoring: P50/P75/P95/P99 percentiles, trend over time
  • Quality drift detection: Rolling quality score trend, automatic alerts
  • Error rate monitoring: Failure rate tracking with spike detection
  • Anomaly detection: Background thread alerts on cost spikes, latency spikes, quality drops
  • Slack alerts: Optional webhook integration; works out of the box without it
  • @observe decorator: Wrap any LLM call with one line
  • Malaysian pricing: MYR cost display built in (not just USD)
  • Zero dependencies for core: Just SQLite + stdlib; Streamlit only for the dashboard

Quick Start

git clone https://github.com/aliyaalias19/ai-observability-stack
cd ai-observability-stack
pip install -r requirements.txt

# Verify setup (< 5 seconds)
python quickstart.py

# Generate 14 days of realistic demo data
python demo/generate_demo_data.py

# Launch dashboard
streamlit run dashboard/app.py

Integration — 1 Line

from observers.decorator import observe
from openai import OpenAI

openai_client = OpenAI()   # reads OPENAI_API_KEY from the environment

@observe(feature="islamic_finance_bot", model="gpt-4o")
def chat(prompt: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

The decorator automatically captures:

  • Latency (wall clock time of the function)
  • Model name (from response object)
  • Token counts (from response object)
  • Cost in USD and MYR (calculated from token counts)
  • Feature tag, user ID, environment
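
Conceptually, @observe behaves like a timing decorator. The sketch below is illustrative only, not the shipped observers/decorator.py code: it captures just latency and errors, and prints the trace instead of writing to SQLite so it can run standalone.

# Illustrative sketch only; the real implementation lives in observers/decorator.py.
import functools
import time

def observe_sketch(feature: str, model: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                latency_ms = (time.perf_counter() - start) * 1000
                # The real decorator hands this trace to the TraceCollector (SQLite)
                # and also extracts model name, token counts, and cost from the response.
                print({"feature": feature, "model": model,
                       "latency_ms": round(latency_ms, 1), "error": error})
        return wrapper
    return decorator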

Manual Instrumentation

from observers.decorator import ObservabilityContext

with ObservabilityContext(
    model="claude-sonnet-4-6",
    prompt=user_question,
    feature="rag_pipeline",
    user_id=user_id,
) as ctx:
    docs = retriever.search(user_question)
    response = llm.generate(user_question, docs)

    ctx.set_response(response.text)
    ctx.set_tokens(response.usage.input, response.usage.output)
    ctx.set_quality(ragas_score)   # Add quality evaluation

Anomaly Detection

import os

from collectors.trace_collector import TraceCollector
from detectors.anomaly_detector import AnomalyDetector, AnomalyConfig
from alerts.alert_manager import AlertManager

collector = TraceCollector("production.db")
alerts = AlertManager(slack_webhook=os.environ.get("SLACK_WEBHOOK_URL"))  # Slack is optional; console alerts stay on

detector = AnomalyDetector(
    collector=collector,
    config=AnomalyConfig(
        single_call_cost_alert=0.50,    # Alert if one call > $0.50
        daily_cost_alert=50.00,         # Alert if daily spend > $50
        latency_alert_ms=5000,          # Alert if P95 > 5s
        quality_drop_threshold=0.70,    # Alert if avg quality < 0.70
        error_rate_alert=0.10,          # Alert if error rate > 10%
    ),
    on_anomaly=alerts.send,
)
detector.start()   # Background thread, non-blocking

Dashboard Screenshots

The dashboard shows:

  • KPI cards: Total calls, cost (USD + MYR), P95 latency, quality score, error rate
  • Cost trend: Daily spend over time with model breakdown
  • Latency chart: P50/P75/P95/P99 bar chart with threshold indicators
  • Quality trend: Rolling average with drift detection
  • Model breakdown: Which model costs most
  • Feature breakdown: Which feature uses most resources
  • Live trace table: Recent LLM calls with full context
  • Alerts panel: Active anomaly alerts
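
For intuition, a KPI like P95 latency can be computed straight from the SQLite file with the stdlib. This is a hypothetical sketch: the table and column names ("traces", "latency_ms") are assumptions, not the project's actual schema.

import sqlite3
import statistics

def p95_latency_ms(db_path: str = "production.db") -> float:
    # Pull all recorded latencies; a real dashboard would filter by time window.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT latency_ms FROM traces").fetchall()
    finally:
        conn.close()
    latencies = [r[0] for r in rows if r[0] is not None]
    if len(latencies) < 2:
        return latencies[0] if latencies else 0.0
    # quantiles(..., n=20) returns 19 cut points; index 18 is the 95th percentile.
    return statistics.quantiles(latencies, n=20)[18]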

Architecture

┌────────────────────────────────────────────────────────┐
│                    YOUR LLM APPLICATION                │
│                                                        │
│  @observe(feature="chatbot")                           │
│  def chat(prompt) → response                           │
│                          ↓                             │
│  ┌────────────────────────────────────────────────┐   │
│  │           TRACE COLLECTOR (SQLite)              │   │
│  │  • Buffers traces in memory                    │   │
│  │  • Flushes to traces.db every 100 calls        │   │
│  │  • Thread-safe writes                          │   │
│  │  • Auto-calculates cost (USD + MYR)            │   │
│  └────────────────────────────────────────────────┘   │
│                          ↓                             │
│  ┌────────────────────────────────────────────────┐   │
│  │        ANOMALY DETECTOR (background thread)    │   │
│  │  • Polls every 60s                             │   │
│  │  • Cost spike detection                        │   │
│  │  • Latency P95 monitoring                      │   │
│  │  • Quality drift detection                     │   │
│  │  • Error rate tracking                         │   │
│  └────────────────────────────────────────────────┘   │
│                          ↓                             │
│  ┌────────────────────────────────────────────────┐   │
│  │           ALERT MANAGER                        │   │
│  │  • Console (always on)                         │   │
│  │  • Slack (optional webhook)                    │   │
│  │  • Alert deduplication (30min cooldown)        │   │
│  └────────────────────────────────────────────────┘   │
│                                                        │
│  ┌────────────────────────────────────────────────┐   │
│  │        STREAMLIT DASHBOARD                     │   │
│  │  • Real-time KPI cards                         │   │
│  │  • Cost, latency, quality, error charts        │   │
│  │  • Live trace table                            │   │
│  │  • Alerts panel                                │   │
│  └────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────┘
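
The collector's buffer-and-flush behaviour can be pictured with a few lines of stdlib Python. This is a simplified stand-in, not the real collectors/trace_collector.py; only the flush size (100 calls) and the thread-safety requirement come from the diagram above.

import sqlite3
import threading

class BufferedTraceWriter:
    """Simplified stand-in for the collector's buffer-and-flush behaviour."""

    def __init__(self, db_path: str = "traces.db", flush_every: int = 100):
        self.db_path = db_path
        self.flush_every = flush_every
        self._buffer = []
        self._lock = threading.Lock()  # thread-safe writes
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:
                conn.execute("CREATE TABLE IF NOT EXISTS traces "
                             "(model TEXT, latency_ms REAL, cost_usd REAL)")
        finally:
            conn.close()

    def record(self, model: str, latency_ms: float, cost_usd: float) -> None:
        with self._lock:
            self._buffer.append((model, latency_ms, cost_usd))
            if len(self._buffer) >= self.flush_every:  # flush every N calls
                self._flush_locked()

    def _flush_locked(self) -> None:
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:  # commits the batch as one transaction
                conn.executemany("INSERT INTO traces VALUES (?, ?, ?)", self._buffer)
            self._buffer.clear()
        finally:
            conn.close()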

LLM Pricing (Built-in)

Includes pricing for all major providers as of May 2026: OpenAI (GPT-4o, o1, GPT-3.5) · Anthropic (Claude Opus, Sonnet, Haiku) · Google (Gemini 1.5/2.0) · Meta (Llama 3.1) · Mistral

All costs shown in USD and Malaysian Ringgit (MYR) — unique to this project.

Update pricing in the LLM_PRICING dict in collectors/trace_collector.py.
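
The cost math itself is simple. The sketch below shows the general shape only; the dict structure, the per-token prices, and the USD-to-MYR rate are illustrative placeholders, not the values shipped in collectors/trace_collector.py.

# Illustrative shape only; real prices and rates live in collectors/trace_collector.py.
# Prices below are per 1M tokens, and 4.70 is a placeholder exchange rate, not a live one.
LLM_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}
USD_TO_MYR = 4.70

def call_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    rates = LLM_PRICING[model]
    usd = (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
    return {"usd": round(usd, 6), "myr": round(usd * USD_TO_MYR, 6)}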


Incidents This Stack Catches

These incidents are simulated in the demo data, but they reflect real production failure modes:

  1. Cost spike (Day 3): 15% of calls used GPT-4o with 3,000–8,000 token prompts. Anomaly detector flagged within 1 check cycle. Root cause: missing context truncation.

  2. Quality drift (Last 4 days): Quality dropped from 0.82 → 0.65 after a prompt change. Detected by the 3-day rolling-average check (sketched after this list). Required a prompt rollback.

  3. Error rate spike (Day 2): 20% error rate flagged as critical. Root cause: API rate limit hit during traffic spike. Auto-alert triggered in under 60 seconds.
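
The rolling-average check behind the quality-drift incident can be sketched in a few lines. The function name and signature below are hypothetical; only the 3-day window and the 0.70 threshold come from the scenarios above and the AnomalyConfig example.

# Hypothetical sketch of a rolling-average quality drift check.
def quality_drifted(daily_scores: list[float], window: int = 3,
                    threshold: float = 0.70) -> bool:
    """daily_scores: mean quality score per day, oldest first."""
    if len(daily_scores) < window:
        return False
    rolling_avg = sum(daily_scores[-window:]) / window
    return rolling_avg < threshold  # e.g. a slide from 0.82 toward 0.65 trips the 0.70 threshold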


Citation

@misc{alias2026observability,
  title  = {AI Observability Stack: Self-Hosted LLM Monitoring},
  author = {Alias, Aliya},
  year   = {2026},
  url    = {https://github.com/aliyaalias19/ai-observability-stack}
}

👤 About

Built by Aliya Alias — AI Engineer, Kuala Lumpur. MSc Artificial Intelligence, University of Malaya.

LinkedIn GitHub

MIT License.
