
🔭 AI Observability Stack

Self-contained LLM monitoring system. No external services. No cloud accounts. Runs from python quickstart.py.


The Problem

"Which operation in our pipeline consumed 40% of this month's LLM budget? Which user generated the 5 most expensive calls this week? Is our quality degrading — or did it just get worse on Tuesdays?"

LLM observability is about more than just logging requests or counting tokens. It's about measuring whether an AI system is behaving correctly, safely, and consistently over time. Most tools require cloud accounts, external APIs, or Docker stacks. This project does it all locally with SQLite and Streamlit.


What It Does

  • Cost tracking: Per-call, per-model, per-feature, per-user. USD + MYR
  • Latency monitoring: P50/P75/P95/P99 percentiles, trend over time
  • Quality drift detection: Rolling quality score trend, automatic alerts
  • Error rate monitoring: Failure rate tracking with spike detection
  • Anomaly detection: Background thread alerts on cost spikes, latency spikes, quality drops
  • Slack alerts: Optional webhook integration; works out of the box without it
  • @observe decorator: Wrap any LLM call with one line
  • Malaysian pricing: MYR cost display built in (not just USD)
  • Zero dependencies for core: Just SQLite + stdlib; Streamlit only for the dashboard

Quick Start

git clone https://github.com/aliyaalias19/ai-observability-stack
cd ai-observability-stack
pip install -r requirements.txt

# Verify setup (< 5 seconds)
python quickstart.py

# Generate 14 days of realistic demo data
python demo/generate_demo_data.py

# Launch dashboard
streamlit run dashboard/app.py

Integration — 1 Line

from observers.decorator import observe
from openai import OpenAI

openai_client = OpenAI()   # reads OPENAI_API_KEY from the environment

@observe(feature="islamic_finance_bot", model="gpt-4o")
def chat(prompt: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

The decorator automatically captures:

  • Latency (wall clock time of the function)
  • Model name (from response object)
  • Token counts (from response object)
  • Cost in USD and MYR (calculated from token counts)
  • Feature tag, user ID, environment
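
Conceptually, @observe behaves like a timing decorator. The sketch below is illustrative only, not the shipped observers/decorator.py code: it captures just latency and errors, and prints the trace instead of writing to SQLite so it can run standalone.

# Illustrative sketch only; the real implementation lives in observers/decorator.py.
import functools
import time

def observe_sketch(feature: str, model: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                latency_ms = (time.perf_counter() - start) * 1000
                # The real decorator hands this trace to the TraceCollector (SQLite)
                # and also extracts model name, token counts, and cost from the response.
                print({"feature": feature, "model": model,
                       "latency_ms": round(latency_ms, 1), "error": error})
        return wrapper
    return decorator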

Manual Instrumentation

from observers.decorator import ObservabilityContext

with ObservabilityContext(
    model="claude-sonnet-4-6",
    prompt=user_question,
    feature="rag_pipeline",
    user_id=user_id,
) as ctx:
    docs = retriever.search(user_question)
    response = llm.generate(user_question, docs)

    ctx.set_response(response.text)
    ctx.set_tokens(response.usage.input, response.usage.output)
    ctx.set_quality(ragas_score)   # Add quality evaluation

Anomaly Detection

import os

from collectors.trace_collector import TraceCollector
from detectors.anomaly_detector import AnomalyDetector, AnomalyConfig
from alerts.alert_manager import AlertManager

collector = TraceCollector("production.db")
alerts = AlertManager(slack_webhook=os.environ.get("SLACK_WEBHOOK_URL"))  # Slack is optional; console alerts stay on

detector = AnomalyDetector(
    collector=collector,
    config=AnomalyConfig(
        single_call_cost_alert=0.50,    # Alert if one call > $0.50
        daily_cost_alert=50.00,         # Alert if daily spend > $50
        latency_alert_ms=5000,          # Alert if P95 > 5s
        quality_drop_threshold=0.70,    # Alert if avg quality < 0.70
        error_rate_alert=0.10,          # Alert if error rate > 10%
    ),
    on_anomaly=alerts.send,
)
detector.start()   # Background thread, non-blocking

Dashboard Screenshots

The dashboard shows:

  • KPI cards: Total calls, cost (USD + MYR), P95 latency, quality score, error rate
  • Cost trend: Daily spend over time with model breakdown
  • Latency chart: P50/P75/P95/P99 bar chart with threshold indicators
  • Quality trend: Rolling average with drift detection
  • Model breakdown: Which model costs most
  • Feature breakdown: Which feature uses most resources
  • Live trace table: Recent LLM calls with full context
  • Alerts panel: Active anomaly alerts
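
For intuition, a KPI like P95 latency can be computed straight from the SQLite file with the stdlib. This is a hypothetical sketch: the table and column names ("traces", "latency_ms") are assumptions, not the project's actual schema.

import sqlite3
import statistics

def p95_latency_ms(db_path: str = "production.db") -> float:
    # Pull all recorded latencies; a real dashboard would filter by time window.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT latency_ms FROM traces").fetchall()
    finally:
        conn.close()
    latencies = [r[0] for r in rows if r[0] is not None]
    if len(latencies) < 2:
        return latencies[0] if latencies else 0.0
    # quantiles(..., n=20) returns 19 cut points; index 18 is the 95th percentile.
    return statistics.quantiles(latencies, n=20)[18]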

Architecture

┌────────────────────────────────────────────────────────┐
│                    YOUR LLM APPLICATION                │
│                                                        │
│  @observe(feature="chatbot")                           │
│  def chat(prompt) → response                           │
│                          ↓                             │
│  ┌────────────────────────────────────────────────┐   │
│  │           TRACE COLLECTOR (SQLite)              │   │
│  │  • Buffers traces in memory                    │   │
│  │  • Flushes to traces.db every 100 calls        │   │
│  │  • Thread-safe writes                          │   │
│  │  • Auto-calculates cost (USD + MYR)            │   │
│  └────────────────────────────────────────────────┘   │
│                          ↓                             │
│  ┌────────────────────────────────────────────────┐   │
│  │        ANOMALY DETECTOR (background thread)    │   │
│  │  • Polls every 60s                             │   │
│  │  • Cost spike detection                        │   │
│  │  • Latency P95 monitoring                      │   │
│  │  • Quality drift detection                     │   │
│  │  • Error rate tracking                         │   │
│  └────────────────────────────────────────────────┘   │
│                          ↓                             │
│  ┌────────────────────────────────────────────────┐   │
│  │           ALERT MANAGER                        │   │
│  │  • Console (always on)                         │   │
│  │  • Slack (optional webhook)                    │   │
│  │  • Alert deduplication (30min cooldown)        │   │
│  └────────────────────────────────────────────────┘   │
│                                                        │
│  ┌────────────────────────────────────────────────┐   │
│  │        STREAMLIT DASHBOARD                     │   │
│  │  • Real-time KPI cards                         │   │
│  │  • Cost, latency, quality, error charts        │   │
│  │  • Live trace table                            │   │
│  │  • Alerts panel                                │   │
│  └────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────┘
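
The collector's buffer-and-flush behaviour can be pictured with a few lines of stdlib Python. This is a simplified stand-in, not the real collectors/trace_collector.py; only the flush size (100 calls) and the thread-safety requirement come from the diagram above.

import sqlite3
import threading

class BufferedTraceWriter:
    """Simplified stand-in for the collector's buffer-and-flush behaviour."""

    def __init__(self, db_path: str = "traces.db", flush_every: int = 100):
        self.db_path = db_path
        self.flush_every = flush_every
        self._buffer = []
        self._lock = threading.Lock()  # thread-safe writes
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:
                conn.execute("CREATE TABLE IF NOT EXISTS traces "
                             "(model TEXT, latency_ms REAL, cost_usd REAL)")
        finally:
            conn.close()

    def record(self, model: str, latency_ms: float, cost_usd: float) -> None:
        with self._lock:
            self._buffer.append((model, latency_ms, cost_usd))
            if len(self._buffer) >= self.flush_every:  # flush every N calls
                self._flush_locked()

    def _flush_locked(self) -> None:
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:  # commits the batch as one transaction
                conn.executemany("INSERT INTO traces VALUES (?, ?, ?)", self._buffer)
            self._buffer.clear()
        finally:
            conn.close()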

LLM Pricing (Built-in)

Includes pricing for all major providers as of May 2026: OpenAI (GPT-4o, o1, GPT-3.5) · Anthropic (Claude Opus, Sonnet, Haiku) · Google (Gemini 1.5/2.0) · Meta (Llama 3.1) · Mistral

All costs shown in USD and Malaysian Ringgit (MYR) — unique to this project.

Update pricing in the LLM_PRICING dict in collectors/trace_collector.py.
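
The cost math itself is simple. The sketch below shows the general shape only; the dict structure, the per-token prices, and the USD-to-MYR rate are illustrative placeholders, not the values shipped in collectors/trace_collector.py.

# Illustrative shape only; real prices and rates live in collectors/trace_collector.py.
# Prices below are per 1M tokens, and 4.70 is a placeholder exchange rate, not a live one.
LLM_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}
USD_TO_MYR = 4.70

def call_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    rates = LLM_PRICING[model]
    usd = (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
    return {"usd": round(usd, 6), "myr": round(usd * USD_TO_MYR, 6)}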


Incidents This Stack Catches

These incidents are simulated in the demo data, but they reflect real production failure modes:

  1. Cost spike (Day 3): 15% of calls used GPT-4o with 3,000–8,000 token prompts. Anomaly detector flagged within 1 check cycle. Root cause: missing context truncation.

  2. Quality drift (Last 4 days): Quality dropped from 0.82 → 0.65 after a prompt change. Detected by the 3-day rolling-average check (sketched after this list). Required a prompt rollback.

  3. Error rate spike (Day 2): 20% error rate flagged as critical. Root cause: API rate limit hit during traffic spike. Auto-alert triggered in under 60 seconds.
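
The rolling-average check behind the quality-drift incident can be sketched in a few lines. The function name and signature below are hypothetical; only the 3-day window and the 0.70 threshold come from the scenarios above and the AnomalyConfig example.

# Hypothetical sketch of a rolling-average quality drift check.
def quality_drifted(daily_scores: list[float], window: int = 3,
                    threshold: float = 0.70) -> bool:
    """daily_scores: mean quality score per day, oldest first."""
    if len(daily_scores) < window:
        return False
    rolling_avg = sum(daily_scores[-window:]) / window
    return rolling_avg < threshold  # e.g. a slide from 0.82 toward 0.65 trips the 0.70 threshold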


Citation

@misc{alias2026observability,
  title  = {AI Observability Stack: Self-Hosted LLM Monitoring},
  author = {Alias, Aliya},
  year   = {2026},
  url    = {https://github.com/aliyaalias19/ai-observability-stack}
}

👤 About

Built by Aliya Alias — AI Engineer, Kuala Lumpur. MSc Artificial Intelligence, University of Malaya.

LinkedIn GitHub

MIT License.
