Open source LLM evaluation framework with cost + latency + hallucination metrics #2730

vignesh2027 · 2026-06-08T05:39:06Z

Hey DeepEval community!

Love what DeepEval is doing with LLM-as-judge evaluation. I built an open source framework that takes a complementary approach focused on production metrics.

Key difference from DeepEval: No LLM-as-judge needed. Everything runs locally or from real API responses.

5 metrics tracked simultaneously:

Accuracy: 4-strategy cascade (exact, normalized, multiple-choice, fuzzy Levenshtein)
Latency: p50/p75/p90/p95/p99 from real async API calls
Cost per 1K tokens: from actual token counts, not estimates
Hallucination Rate: linguistic signal analysis (hedging/uncertainty/grounding signals)
Reasoning Quality: chain-of-thought depth score 1-10
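
To make the accuracy cascade concrete, here is a minimal sketch of how four strategies could be tried in order, cheapest first. It is an illustration only and not the framework's actual code: the function names, the normalization rules, the A-D choice extraction, and the 0.85 fuzzy threshold are all assumptions.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed rules)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def extract_choice(text: str):
    """Return the first standalone letter A-D in the text, or None."""
    match = re.search(r"\b([A-D])\b", text.upper())
    return match.group(1) if match else None


def match_answer(prediction: str, reference: str, fuzzy_threshold: float = 0.85):
    """Return the name of the first strategy that matches, or None."""
    # 1. Exact match after trimming surrounding whitespace.
    if prediction.strip() == reference.strip():
        return "exact"
    # 2. Match after normalization.
    pred_norm, ref_norm = normalize(prediction), normalize(reference)
    if pred_norm == ref_norm:
        return "normalized"
    # 3. Multiple choice: only applies when the reference is a single letter A-D.
    ref_letter = reference.strip().upper()
    if re.fullmatch(r"[A-D]", ref_letter) and extract_choice(prediction) == ref_letter:
        return "mc"
    # 4. Fuzzy: Levenshtein similarity scaled to [0, 1].
    longest = max(len(pred_norm), len(ref_norm))
    if longest and 1 - levenshtein(pred_norm, ref_norm) / longest >= fuzzy_threshold:
        return "fuzzy"
    return None
```

Returning the strategy name rather than a plain boolean makes it possible to report how many answers were only accepted by the looser strategies, which is worth watching when the fuzzy threshold is tuned.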

One command benchmark:

pip install llm-evaluation-framework
llm-eval compare --models gpt-4o-mini --models gemini/gemini-1.5-flash --benchmark mmlu --samples 100

Works via LiteLLM, so any model that LiteLLM supports can be benchmarked.

Live demo (no API key needed): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

71 tests, 82% coverage. The two approaches complement each other well: DeepEval for semantic/behavioral evaluation, this for production metrics. Feedback welcome!

rehan243 · 2026-06-12T09:13:57Z

I'm not sure I'd frame it as "no LLM-as-judge needed". Your framework and DeepEval look more complementary than mutually exclusive: on the production LLM evals I've worked on, we needed both semantic evaluation and production metrics. Your 4-strategy cascade for accuracy is really interesting, though. We implemented something similar for our RAG pipeline evals, but stuck with just exact and fuzzy Levenshtein matching for simplicity.

The llm-eval compare command is slick, and LiteLLM is a great abstraction layer; I've used it for model comparisons too. One thing that might be useful is support for custom eval datasets beyond MMLU. We had to roll our own dataset loader for our specific use case.

Here's a snippet from our code that might be relevant:

import numpy as np

def evaluate_latency(responses):
    # Each response is a dict with a 'latency' value; returns p50/p75/p90/p95/p99.
    latencies = [r['latency'] for r in responses]
    return np.percentile(latencies, [50, 75, 90, 95, 99])

We used this to track p99 latency across our inference cluster. Curious to hear if you've considered adding more advanced latency analysis features?
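
As one possible direction for that, here is a rough sketch that times concurrent async calls and adds a tail ratio (p99/p50) to the percentile summary. It is standard-library only; `fake_model_call`, `measure`, `summarize`, and the concurrency limit are hypothetical names and values for illustration, not part of either framework.

```python
import asyncio
import time


def percentile(values, q):
    """Linear-interpolation percentile, the rule np.percentile uses by default."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize(latencies):
    """Percentiles plus the maximum and a p99/p50 tail ratio."""
    summary = {f"p{q}": percentile(latencies, q) for q in (50, 75, 90, 95, 99)}
    summary["max"] = max(latencies)
    # A large ratio means a few slow requests dominate the tail.
    summary["tail_ratio"] = summary["p99"] / summary["p50"] if summary["p50"] else float("inf")
    return summary


async def fake_model_call(prompt):
    # Hypothetical stand-in for a real API request.
    await asyncio.sleep(0.001 * len(prompt))


async def measure(prompts, concurrency=4):
    """Time each call individually while capping the number in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def timed(prompt):
        async with semaphore:
            # Start the clock after acquiring the slot so queueing is excluded.
            start = time.perf_counter()
            await fake_model_call(prompt)
            return time.perf_counter() - start

    return await asyncio.gather(*(timed(p) for p in prompts))


# Example usage:
# latencies = asyncio.run(measure(["hello", "a longer prompt"] * 10))
# print(summarize(latencies))
```

Starting the timer inside the semaphore measures per-request latency only; moving it outside would include time spent waiting for a slot, which is closer to what a client under load actually experiences.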
