Self-contained LLM monitoring system. No external services, no cloud accounts. Runs from `python quickstart.py`.
"Which operation in our pipeline consumed 40% of this month's LLM budget? Which user generated the 5 most expensive calls this week? Is our quality degrading — or did it just get worse on Tuesdays?"
LLM observability is about more than just logging requests or counting tokens. It's about measuring whether an AI system is behaving correctly, safely, and consistently over time. Most tools require cloud accounts, external APIs, or Docker stacks. This project does it all locally with SQLite and Streamlit.
| Feature | Detail |
|---|---|
| Cost tracking | Per-call, per-model, per-feature, per-user. USD + MYR |
| Latency monitoring | P50/P75/P95/P99 percentiles, trend over time |
| Quality drift detection | Rolling quality score trend, automatic alerts |
| Error rate monitoring | Failure rate tracking with spike detection |
| Anomaly detection | Background thread alerts on cost spikes, latency spikes, quality drops |
| Slack alerts | Optional webhook integration. Works out of the box without it |
| @observe decorator | Wrap any LLM call with one line |
| Malaysian pricing | MYR cost display built in (not just USD) |
| Zero dependencies for core | Just SQLite + stdlib. Streamlit only for dashboard |
```bash
git clone https://github.com/aliyaalias19/ai-observability-stack
cd ai-observability-stack
pip install -r requirements.txt

# Verify setup (< 5 seconds)
python quickstart.py

# Generate 14 days of realistic demo data
python demo/generate_demo_data.py

# Launch dashboard
streamlit run dashboard/app.py
```

```python
from observers.decorator import observe

@observe(feature="islamic_finance_bot", model="gpt-4o")
def chat(prompt: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```

The decorator automatically captures:
- Latency (wall clock time of the function)
- Model name (from response object)
- Token counts (from response object)
- Cost in USD and MYR (calculated from token counts)
- Feature tag, user ID, environment
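
Under the hood, a decorator like this only needs to time the wrapped call and record metadata on the way out. A minimal sketch of the idea — this is a hypothetical illustration, not the actual code in `observers/decorator.py`, and it prints the record instead of writing to SQLite:

```python
import functools
import time

def observe(feature: str, model: str):
    """Hypothetical sketch of an @observe-style decorator."""
    def wrapper(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                latency_ms = (time.perf_counter() - start) * 1000
                # The real collector would buffer this record and flush
                # it to SQLite; here we just print it.
                print({"feature": feature, "model": model,
                       "latency_ms": round(latency_ms, 1), "error": error})
        return inner
    return wrapper

@observe(feature="demo", model="gpt-4o")
def echo(prompt: str) -> str:
    return prompt.upper()

echo("hello")  # returns "HELLO" and prints one trace record
```

The `try/finally` shape matters: the trace is recorded even when the wrapped call raises, which is what makes error-rate tracking possible.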
```python
from observers.decorator import ObservabilityContext

with ObservabilityContext(
    model="claude-sonnet-4-6",
    prompt=user_question,
    feature="rag_pipeline",
    user_id=user_id,
) as ctx:
    docs = retriever.search(user_question)
    response = llm.generate(user_question, docs)
    ctx.set_response(response.text)
    ctx.set_tokens(response.usage.input, response.usage.output)
    ctx.set_quality(ragas_score)  # Add quality evaluation
```

```python
import os

from collectors.trace_collector import TraceCollector
from detectors.anomaly_detector import AnomalyDetector, AnomalyConfig
from alerts.alert_manager import AlertManager

collector = TraceCollector("production.db")
alerts = AlertManager(slack_webhook=os.environ["SLACK_WEBHOOK_URL"])

detector = AnomalyDetector(
    collector=collector,
    config=AnomalyConfig(
        single_call_cost_alert=0.50,   # Alert if one call > $0.50
        daily_cost_alert=50.00,        # Alert if daily spend > $50
        latency_alert_ms=5000,         # Alert if P95 > 5 s
        quality_drop_threshold=0.70,   # Alert if avg quality < 0.70
        error_rate_alert=0.10,         # Alert if error rate > 10%
    ),
    on_anomaly=alerts.send,
)

detector.start()  # Background thread, non-blocking
```

The dashboard shows:
- KPI cards: Total calls, cost (USD + MYR), P95 latency, quality score, error rate
- Cost trend: Daily spend over time with model breakdown
- Latency chart: P50/P75/P95/P99 bar chart with threshold indicators
- Quality trend: Rolling average with drift detection
- Model breakdown: Which model costs most
- Feature breakdown: Which feature uses most resources
- Live trace table: Recent LLM calls with full context
- Alerts panel: Active anomaly alerts
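
Because everything lands in a plain SQLite file, the dashboard's breakdowns are just SQL aggregations, and you can query the trace store directly. A minimal sketch, assuming a `traces` table with `feature` and `cost_usd` columns (the actual schema lives in `collectors/trace_collector.py`; an in-memory table stands in for `traces.db` here):

```python
import sqlite3

# Stand-in for the real traces.db with an assumed minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traces (feature TEXT, cost_usd REAL)")
conn.executemany("INSERT INTO traces VALUES (?, ?)", [
    ("chatbot", 0.12), ("rag_pipeline", 0.30), ("chatbot", 0.08),
])

# Cost per feature, most expensive first.
rows = conn.execute(
    "SELECT feature, ROUND(SUM(cost_usd), 2) AS total "
    "FROM traces GROUP BY feature ORDER BY total DESC"
).fetchall()
print(rows)  # rag_pipeline first, chatbot second
```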
```
┌────────────────────────────────────────────────────────┐
│ YOUR LLM APPLICATION │
│ │
│ @observe(feature="chatbot") │
│ def chat(prompt) → response │
│ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ TRACE COLLECTOR (SQLite) │ │
│ │ • Buffers traces in memory │ │
│ │ • Flushes to traces.db every 100 calls │ │
│ │ • Thread-safe writes │ │
│ │ • Auto-calculates cost (USD + MYR) │ │
│ └────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ ANOMALY DETECTOR (background thread) │ │
│ │ • Polls every 60s │ │
│ │ • Cost spike detection │ │
│ │ • Latency P95 monitoring │ │
│ │ • Quality drift detection │ │
│ │ • Error rate tracking │ │
│ └────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────┐ │
│ │ ALERT MANAGER │ │
│ │ • Console (always on) │ │
│ │ • Slack (optional webhook) │ │
│ │ • Alert deduplication (30min cooldown) │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ STREAMLIT DASHBOARD │ │
│ │ • Real-time KPI cards │ │
│ │ • Cost, latency, quality, error charts │ │
│ │ • Live trace table │ │
│ │ • Alerts panel │ │
│ └────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
```
Includes pricing for all major providers as of May 2026: OpenAI (GPT-4o, o1, GPT-3.5) · Anthropic (Claude Opus, Sonnet, Haiku) · Google (Gemini 1.5/2.0) · Meta (Llama 3.1) · Mistral
All costs shown in USD and Malaysian Ringgit (MYR) — unique to this project.
Update pricing in collectors/trace_collector.py → LLM_PRICING dict.
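
The shape of that pricing table and the USD→MYR conversion might look like the sketch below. The per-token rates and the exchange rate here are illustrative placeholders, not the values shipped in `collectors/trace_collector.py`:

```python
# Illustrative pricing table: model -> (USD per 1M input tokens,
#                                       USD per 1M output tokens)
LLM_PRICING = {
    "gpt-4o": (2.50, 10.00),
    "claude-haiku": (0.80, 4.00),
}
USD_TO_MYR = 4.70  # placeholder rate; the real rate lives in the collector

def call_cost(model: str, input_tokens: int, output_tokens: int):
    """Return (usd, myr) cost for one call, from token counts."""
    in_rate, out_rate = LLM_PRICING[model]
    usd = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    return round(usd, 6), round(usd * USD_TO_MYR, 6)

print(call_cost("gpt-4o", 1_000, 500))
```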
Simulated in demo data — real production scenarios:

- Cost spike (Day 3): 15% of calls used GPT-4o with 3,000–8,000 token prompts. The anomaly detector flagged it within one check cycle. Root cause: missing context truncation.
- Quality drift (Last 4 days): Quality dropped from 0.82 → 0.65 after a prompt change. Detected by the 3-day rolling-average check. Required a prompt rollback.
- Error rate spike (Day 2): A 20% error rate was flagged as critical. Root cause: API rate limit hit during a traffic spike. The auto-alert triggered in under 60 seconds.
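
The rolling-average check behind the quality-drift scenario reduces to comparing the mean of the last few daily scores against a threshold. A minimal sketch (hypothetical; the real check lives in `detectors/anomaly_detector.py`):

```python
def quality_drift(scores: list[float], window: int = 3,
                  threshold: float = 0.70) -> bool:
    """Flag drift when the rolling mean of the last `window` daily
    quality scores falls below `threshold`."""
    if len(scores) < window:
        return False  # not enough history yet
    rolling = sum(scores[-window:]) / window
    return rolling < threshold

daily_quality = [0.82, 0.81, 0.83, 0.74, 0.68, 0.65]
print(quality_drift(daily_quality))  # True: mean of last 3 days ≈ 0.69
```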
```bibtex
@misc{alias2026observability,
  title  = {AI Observability Stack: Self-Hosted LLM Monitoring},
  author = {Alias, Aliya},
  year   = {2026},
  url    = {https://github.com/aliyaalias19/ai-observability-stack}
}
```

Built by Aliya Alias — AI Engineer, Kuala Lumpur. MSc Artificial Intelligence, University of Malaya.

MIT License.