A flexible hallucination detection system for LLM responses that works both with and without RAG (Retrieval Augmented Generation). Monitor truth quality, latency, and throughput for production LLM systems.
- Dual-mode operation: Works with or without retrieved documents
- Multi-signal detection: Combines semantic drift, uncertainty analysis, and factual checking
- Explainable scores: Returns detailed breakdown of all metrics
- Latency tracking: Measures end-to-end evaluation time
- Throughput monitoring: Calculates requests per second (single-run or batch)
- Full observability: Datadog-style metrics for LLM systems
- OpenAI-powered: Uses embeddings and GPT-4o-mini for evaluation
- Thread-safe: Concurrent throughput tracking with locks
- Simple API: Single function call with optional parameters
```
User Prompt + Response
        ↓
[ ⏱️ Latency Timer Start ]
        ↓
[ Embedding Drift Check ]    ← always active
[ Uncertainty Analysis ]     ← always active
[ Evidence Entailment ]      ← only if retrieved_docs
[ Factual Self-Check LLM ]   ← fallback when no evidence
        ↓
Weighted Fusion (0.4 factual + 0.4 drift + 0.2 uncertainty)
        ↓
[ ⏱️ Latency Timer End ]
[ 📊 Throughput Calculation ]
        ↓
→ truth metrics + reliability metrics
```
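The flow above can be sketched in plain Python. Everything here is illustrative: the helper detectors (`embedding_drift`, `evidence_entailment`, `factual_self_check`) and their placeholder scores are stand-ins for the SDK's OpenAI-backed implementations, and the keyword-based `uncertainty_score` is a simplification:

```python
import time

# Stand-in detectors -- the real SDK uses OpenAI embeddings and GPT-4o-mini.
def embedding_drift(prompt, response):
    return 0.2  # placeholder cosine distance between embeddings

def uncertainty_score(text):
    hedges = ("maybe", "probably", "might", "not sure")
    hits = sum(h in text.lower() for h in hedges)
    return min(1.0, hits / 3)

def evidence_entailment(response, docs):
    return 0.5  # placeholder entailment score against retrieved docs

def factual_self_check(prompt, response):
    return 0.8  # placeholder LLM self-check score

def evaluate_sketch(prompt, response, retrieved_docs=None):
    start = time.perf_counter()                    # latency timer start
    drift = embedding_drift(prompt, response)      # always active
    uncert = uncertainty_score(response)           # always active
    if retrieved_docs:                             # RAG mode
        factual = evidence_entailment(response, retrieved_docs)
        mode = "retrieved-doc entailment"
    else:                                          # no-evidence fallback
        factual = factual_self_check(prompt, response)
        mode = "self-check"
    # weighted fusion: 0.4 factual + 0.4 drift + 0.2 uncertainty
    halluc_prob = round(0.4 * (1 - factual) + 0.4 * drift + 0.2 * uncert, 3)
    latency = time.perf_counter() - start          # latency timer end
    return {
        "mode": mode,
        "hallucination_probability": halluc_prob,
        "hallucinated": halluc_prob > 0.45,
        "latency_sec": round(latency, 3),
    }
```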
```bash
pip install agentops-client
```

That's it! You're ready to start monitoring your LLM agents.
Create a `.env` file or set the environment variable:

```bash
export OPENAI_API_KEY=your_openai_api_key_here
```

Or in Python:

```python
import os
os.environ['OPENAI_API_KEY'] = 'your_openai_api_key_here'
```

If you want to contribute or modify the code:
```bash
git clone https://github.com/ezazahamad2003/agentops.git
cd agentops
pip install -e .
```

```python
from agentops import AgentOps

# Initialize the SDK
ops = AgentOps()

# Evaluate a single response
result = ops.evaluate(
    prompt="Who discovered penicillin?",
    response="Penicillin was discovered by Alexander Fleming in 1928."
)

print(f"Hallucinated: {result['hallucinated']}")
print(f"Latency: {result['latency_sec']}s")
```

Connect to our production backend for automatic metrics storage and analytics:
```python
from agentops import AgentOps

# Initialize with production backend
ops = AgentOps(
    api_key="your_api_key",  # Get from /register endpoint
    api_url="https://agentops-api-1081133763032.us-central1.run.app"
)

# Evaluations are automatically uploaded to the backend
result = ops.evaluate(
    prompt="What is AI?",
    response="AI is artificial intelligence...",
    model_name="gpt-4o-mini"
)
# Data is stored in Supabase for analytics! ✅
```

Get Your API Key:

```bash
curl -X POST https://agentops-api-1081133763032.us-central1.run.app/register \
  -H "Content-Type: application/json" \
  -d '{"name":"my_agent"}'
```

```python
from agentops import AgentOps

ops = AgentOps()

# RAG Mode evaluation
result = ops.evaluate(
    prompt="What are the side effects of aspirin?",
    response="Aspirin causes stomach upset, nausea, and heartburn.",
    retrieved_docs=[
        "Common side effects include stomach upset and nausea.",
        "Some people may experience allergic reactions."
    ]
)
print(result)
```

```python
from agentops import AgentOps

ops = AgentOps()

# Start a monitoring session
ops.start_session()

# Run multiple evaluations
for prompt, response in your_test_cases:
    result = ops.evaluate(prompt, response)
    print(f"Latency: {result['latency_sec']}s")

# Get session statistics
stats = ops.end_session()
print(f"Total evaluations: {stats['total_evaluations']}")
print(f"Average throughput: {stats['throughput_qps']} req/sec")
```

```python
from agentops import AgentOps

with AgentOps() as ops:
    result = ops.evaluate(prompt, response)
# Session automatically closed after block
```

```python
from agentops import detect_hallucination

# Direct function call (lower-level API)
result = detect_hallucination(prompt, response, retrieved_docs)
print(result)
```

```python
{
    # Truth Metrics
    "semantic_drift": 0.22,              # 0-1: semantic distance from prompt
    "uncertainty": 0.0,                  # 0-1: uncertainty language score
    "factual_support": 0.52,             # 0-1: factual grounding score
    "mode": "retrieved-doc entailment",  # or "self-check"
    "hallucination_probability": 0.57,   # 0-1: overall score
    "hallucinated": True,                # True if probability > 0.45

    # Reliability Metrics 📊
    "latency_sec": 2.34,                 # End-to-end evaluation time in seconds
    "throughput_qps": 0.427              # Requests per second (queries per second)
}
```

| Mode | retrieved_docs | Truth Checks | Reliability Metrics |
|---|---|---|---|
| RAG mode | List of chunks | Semantic drift + entailment (evidence-based factuality) | Latency + Throughput (tracked) |
| No-RAG mode | None | Semantic drift + uncertainty + factual self-check (LLM) | Latency + Throughput (tracked) |
- What: End-to-end time from request to response
- Why: Shows model responsiveness and performance degradation
- Unit: Seconds (rounded to 3 decimal places)
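A minimal sketch of this kind of end-to-end timer, using Python's `time.perf_counter` (the wrapped call below is a stand-in, not the SDK's real detector):

```python
import time

def timed(fn, *args, **kwargs):
    """Run any evaluation call and report its wall-clock latency in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_sec = round(time.perf_counter() - start, 3)  # 3 decimal places
    return result, latency_sec

# Example with a stand-in workload:
result, latency = timed(lambda: sum(range(1000)))
```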
- What: Number of evaluations processed per second
- Why: Measures system capacity and parallel efficiency
- Unit: Queries per second (QPS)
- Modes:
  - Single-run (`track_throughput=False`): `throughput = 1 / latency`
  - Batch mode (`track_throughput=True`): `throughput = total_evaluations / total_time`
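Both formulas, plus the lock-guarded cumulative tracking mentioned in the feature list, can be sketched as follows. This is an illustrative `ThroughputTracker`, not the SDK's actual internals:

```python
import threading

class ThroughputTracker:
    """Cumulative throughput tracking; a lock keeps concurrent updates safe."""
    def __init__(self):
        self._lock = threading.Lock()
        self.total_evaluations = 0
        self.total_time_sec = 0.0

    def record(self, latency_sec):
        with self._lock:
            self.total_evaluations += 1
            self.total_time_sec += latency_sec

    def stats(self):
        with self._lock:
            qps = (self.total_evaluations / self.total_time_sec
                   if self.total_time_sec > 0 else 0.0)
            return {
                "total_evaluations": self.total_evaluations,
                "total_time_sec": round(self.total_time_sec, 3),
                "throughput_qps": round(qps, 3),
            }

# Single-run mode: throughput is just the reciprocal of one latency.
single_run_qps = round(1 / 2.34, 3)

# Batch mode: cumulative evaluations over cumulative time.
tracker = ThroughputTracker()
for latency in (2.34, 1.80, 2.10):
    tracker.record(latency)
```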
Run the test suite:
```bash
pytest test_detector.py -v
```

Run example scenarios:

```bash
python examples.py
```

- Semantic Drift (weight: 0.4)
  - Measures cosine distance between prompt and response embeddings
  - High drift = response is semantically distant from the question
- Uncertainty (weight: 0.2)
  - Detects uncertainty language: "maybe", "probably", "might", etc.
  - Higher score = more uncertain language
- Factual Support (weight: 0.4)
  - RAG mode: Entailment check against retrieved docs
  - No-RAG mode: LLM self-check for factual accuracy
- Hallucination threshold: 0.45
- Scores above 0.45 are flagged as potential hallucinations
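As a rough illustration of how a keyword-based uncertainty signal can work (the hedge list and the ×3 scaling here are assumptions for the sketch, not the SDK's exact implementation):

```python
import re

# Hypothetical hedge vocabulary -- the real detector's list may differ.
HEDGE_WORDS = {"maybe", "probably", "might", "possibly", "perhaps",
               "unsure", "unclear", "guess", "likely"}

def uncertainty_score(text):
    """Score hedging language in [0, 1]: fraction of hedge words, scaled."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(w in HEDGE_WORDS for w in words)
    # Scale so a handful of hedges saturates the score at 1.0.
    return min(1.0, round(3 * hits / len(words), 3))

confident = uncertainty_score("Alexander Graham Bell invented the telephone.")
hedged = uncertainty_score("Maybe it's probably sunny, I'm not sure.")
```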
Edit the fusion weights in `detector_flexible.py`:

```python
halluc_prob = round(0.4 * (1 - factual) + 0.4 * drift + 0.2 * uncert, 3)
#                   ^^^                   ^^^           ^^^
#                   factual weight        drift         uncertainty
```

Change the threshold in the return statement:

```python
"hallucinated": halluc_prob > 0.45  # Change 0.45 to desired threshold
```

```python
# Single-run mode (throughput = 1/latency)
result = detect_hallucination(prompt, response, track_throughput=False)

# Batch mode (cumulative tracking)
result = detect_hallucination(prompt, response, track_throughput=True)

# Reset cumulative tracker
from detector_flexible import reset_throughput_tracker
reset_throughput_tracker()

# Get current stats
from detector_flexible import get_throughput_stats
stats = get_throughput_stats()
# Returns: {'total_evaluations': int, 'total_time_sec': float, 'throughput_qps': float}
```

```python
prompt = "What are Ozempic side effects?"
docs = ["Common: nausea, vomiting", "Rare: pancreatitis"]
response = "Causes nausea and heart palpitations"  # ⚠️ heart palpitations not in docs

result = detect_hallucination(prompt, response, docs)
# High hallucination probability due to unsupported claim
```

```python
prompt = "Who invented the telephone?"
response = "Alexander Graham Bell invented the telephone."

result = detect_hallucination(prompt, response)
# Low hallucination probability - factually correct
```

```python
prompt = "What's the weather like?"
response = "Maybe it's probably sunny, I'm not sure."

result = detect_hallucination(prompt, response)
# High uncertainty score detected
```

Initialize the AgentOps SDK client.
Parameters:
- `api_key` (str, optional): API key for future cloud features
- `track_throughput` (bool, default=True): Enable cumulative throughput tracking
Methods:
Evaluate an agent's response for hallucinations and reliability.
Returns: dict with truth and reliability metrics
Get current cumulative statistics.
Returns: {'total_evaluations': int, 'total_time_sec': float, 'throughput_qps': float}
Reset throughput tracker for new session.
Start a new monitoring session with fresh metrics.
End current session and return final statistics.
Context Manager Support:
```python
with AgentOps() as ops:
    result = ops.evaluate(prompt, response)
```

Low-level detection function with reliability metrics.
Parameters:
- `prompt` (str): Original user question/prompt
- `response` (str): LLM's generated response
- `retrieved_docs` (list[str], optional): Retrieved evidence chunks for RAG mode
- `track_throughput` (bool, default=True): Enable cumulative throughput tracking
Returns:
dict: Detection results with truth metrics and reliability metrics
- `reset_throughput_tracker()`: Reset cumulative throughput counters
- `get_throughput_stats()`: Get current throughput statistics
- `uncertainty_score(text)`: Calculate uncertainty language score
- Dual-mode hallucination detection
- Semantic drift, uncertainty, factual support
- Comprehensive test suite
- Latency tracking
- Throughput calculation (single-run and batch)
- Thread-safe cumulative tracking
- FastAPI endpoint for HTTP access
- Supabase/database logging
- AgentOps SDK for automatic instrumentation
- Visual dashboard (metrics over time)
- Sentence-level breakdown (flag specific hallucinated sentences)
- Custom model support (non-OpenAI)
- Async/concurrent evaluation
- Performance optimization for large-scale deployment
- Alerting on anomalies (latency spikes, hallucination rate)
MIT License - feel free to use in your projects!
For comprehensive guides and additional documentation, see the docs/ directory:
- SDK Guide - Complete SDK integration guide
- PyPI Publishing Guide - Publishing guide
- Project Summary - Full project overview
- Changelog - Version history and updates
Contributions welcome! Please test your changes with the test suite before submitting.
Built with ❤️ using OpenAI APIs