Open source LLM evaluation framework with cost + latency + hallucination metrics #2730
Replies: 1 comment
-
|
Not sure I buy the "no LLM-as-judge needed" claim — turns out, your framework and DeepEval are more complementary than mutually exclusive. I've worked on production LLM evals where we needed both semantic evaluation and production metrics. Your 4-strategy cascade for accuracy is really interesting, though. We implemented something similar for our RAG pipeline evals, but we stuck with just exact and fuzzy Levenshtein for simplicity. The Here's a snippet from our code that might be relevant: def evaluate_latency(responses):
latencies = [r['latency'] for r in responses]
return np.percentile(latencies, [50, 75, 90, 95, 99])We used this to track p99 latency across our inference cluster. Curious to hear if you've considered adding more advanced latency analysis features? |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Hey DeepEval community!
Love what DeepEval is doing with LLM-as-judge evaluation. I built an open source framework that takes a complementary approach focused on production metrics.
Key difference from DeepEval: No LLM-as-judge needed. Everything runs locally or from real API responses.
5 metrics tracked simultaneously:
One command benchmark:
Works via LiteLLM so any model is supported.
Live demo (no API key needed): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
71 tests, 82% coverage. The two approaches complement well: DeepEval for semantic/behavioral evaluation, this for production metrics. Feedback welcome!
Beta Was this translation helpful? Give feedback.
All reactions