---
title: "Stop Vibe-Checking: Real-World Lessons on LLM Evals"
author: "Safouane Chergui"
date: "2025-11-19"
format: html
toc: true
toc-location: body
toc-depth: 4
categories: [LLM, Evaluation]
---

## Introduction

Deploying LLM-powered systems in production is the easy part. The hard part? Making sure they're actually working.

I've been deploying LLM-powered systems in production at many companies and across different industries for almost 3 years now. Each time, I encountered the same critical challenge: **how do you truly evaluate whether your LLM is performing well instead of relying only on vibe checks?**

If you're struggling to move beyond "it looks good to me" when evaluating your LLM applications, this blog post is for you.

<br><br>
<div align="center">
<img src="./assets/front_image.jpg" width="80%" style="display: block; margin: 0 auto;">
</div>
<br>

This blog post contains lessons learnt through hands-on experience. They come mostly from putting into practice what I have learnt from deploying real-life LLM applications, talking with peers, taking courses, and reading blog posts.

Apart from my personal experience, the [course](https://maven.com/parlance-labs/evals) and blogs on LLM evaluation by [Hamel Husain](https://www.linkedin.com/in/hamelhusain/) & [Shreya Shankar](https://www.linkedin.com/in/shrshnk/) have been of great help to me, and many of the techniques I discuss here are inspired by their work.

Each of the lessons below will tackle a specific part of building and evaluating real LLM applications.

# Pre-lesson

I can't emphasize this enough, and even though you've probably heard it a million times before, I have to say it: **KNOW YOUR PRODUCT AND YOUR USERS**.

If you don't know your users and your product well, you won't be able to understand the different ways they'll interact with your system, the different types of queries they'll make, and the different ways they can express the same intent. This is going to be a major obstacle whether you want to create synthetic data, to create evaluation metrics, or to interpret the results of your evaluations.

Now that this is said, let's dive into the lessons!

# Lesson 1: I'll vibe-check my app

Here's a common scenario I've encountered many times: a company builds a RAG system over its product documentation, and naturally wants to evaluate and improve it. But there's a pattern I keep seeing: companies expect AI to "just work" out of the box.

This expectation shows up most clearly in evaluation practices. Companies often rely on "vibe checks": manually asking a dozen questions and eyeballing whether the system's answers seem reasonable.

This is a terrible way to evaluate your LLM applications for multiple reasons:

- The queries you're going to ask are biased towards what you think the system should be able to do, or towards the specific range of queries that you expect the final user to enter. You will likely miss most of the cases and failure modes that you didn't think about.
From first-hand experience, users will usually surprise you with the way they write their queries and the types of queries they enter. You should expect them to write as if they were in a rush 😅

- Not evaluating your app in a systematic and/or scalable way leads to a "whack-a-mole" situation (as Hamel so beautifully puts it), where you fix one failure by changing a prompt only to have another one pop up somewhere else.
This will lead to frustration from your stakeholders, frustration from the team working on the app, and maybe even to a lack of trust in LLMs within your organization.

<div align="center">
<img src="https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExdm55dDYxNHd5eHdmYzlnZmJnMTg4OHRvb2x4ZWQ4bHVnZzQ2am9waiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/ebITvSXYKNvRm/giphy.gif" width="20%" style="display: block; margin: 0 auto;">
</div>
<br>

# Lesson 2: I don't have any data to test my application, where do I start?

Once you have developed your LLM-powered app, you are faced with a **"Which comes first, the chicken or the egg?"** dilemma:

- You don't have real user queries because you haven't deployed your app yet
- You can't deploy your app yet because you haven't tested it with real user queries yet

## Sub-lesson 1: Real data is better than synthetic data

You can almost always get some pseudo-real data. If you can't get access to beta users, ask your teammates to test the system. They will very probably have biases of their own, but at least you will get data that is not completely synthetic and that varies in its characteristics because it comes from different people.

## Sub-lesson 2: The bad way to create synthetic data
Real data is always better than synthetic data. But hey, if you really can't get any real data, then synthetic data is the way to go.

The mistake most people make when creating synthetic data is to ask an LLM to generate queries similar to what they expect users to enter. This is also a bad idea, as the generated queries will be biased towards what you think users will enter and will likely miss many failure modes.
Most importantly, the generated queries will likely be "too good" and not representative of real user queries (messy, misspelled, incomplete...).

I've trained a retriever in the past on synthetic data generated this way. While the performance on queries resembling the synthetic ones was really good, the performance on real user queries was really bad. The gap between synthetic data and real user data was just too big.

## Sub-lesson 3: The good way to create synthetic data

*Think of dimensions of variability of user queries:*

An approach that I have learnt from Hamel Husain & Shreya Shankar is to first think about the different dimensions of variability in user queries for your specific application.
For example, if you're building a RAG system over technical product documentation, you can think about several dimensions of variability, such as:

- User type (user, admin, developer, etc.)
- Intent (seeking information, troubleshooting, feature requests, etc.)
- User expertise level (beginner, intermediate, expert, etc.)
- Query length (short queries, long queries, etc.)
- Query complexity (simple queries, complex queries with multiple sub-questions, etc.)
- Query style (formal, informal, typos, etc.)
- etc.

As a query always depends on the context of the application and the persona of the users (a wink 😉 to the pre-lesson above), each dimension should be really specific to your application, not a generic dimension that someone else has used in another context.

Then, for each dimension, you can brainstorm different values that the dimension can take (or delegate the task of brainstorming values of some dimensions to an LLM). For example, for the "user expertise level" dimension, you can have the values: "beginner", "intermediate", "expert" as shown above.

Once the list of dimensions and their possible values is ready, you can start combining them to create tuples that will represent different synthetic queries.

Here are some examples of tuples representing combinations of the dimensions mentioned above.

```
("end_user", "troubleshoot", "beginner", "short", "simple", "typos")
("developer", "integration_info", "expert", "long", "complex", "formal")
("admin", "permissions_help", "intermediate", "short", "simple", "informal")
("end_user", "feature_discovery", "beginner", "short", "simple", "incomplete")
("support_engineer", "root_cause_analysis", "expert", "long", "complex", "dense")
("end_user", "account_status", "beginner", "short", "simple", "mixed_case")
("developer", "performance_optimization", "expert", "medium", "complex", "typos")
("admin", "audit_logging", "intermediate", "medium", "moderate", "formal")
("end_user", "error_meaning", "beginner", "short", "simple", "abbreviations")
```
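The combination step can be sketched in a few lines of Python. This is only a minimal illustration: the dimension names and values in `DIMENSIONS` are hypothetical placeholders, and `sample_tuples` is a helper name made up for this example. It enumerates every combination with `itertools.product` and draws a reproducible random sample from them.

```python
import itertools
import random

# Hypothetical dimensions and values: replace them with ones specific to your application.
DIMENSIONS = {
    "user_type": ["end_user", "admin", "developer"],
    "intent": ["seek_information", "troubleshoot", "feature_request"],
    "expertise": ["beginner", "intermediate", "expert"],
    "length": ["short", "long"],
    "complexity": ["simple", "complex"],
    "style": ["formal", "informal", "typos"],
}


def sample_tuples(dimensions, k, seed=0):
    """Return up to k distinct combinations, with one value per dimension."""
    # Cartesian product of all value lists, in the order the dimensions are declared.
    all_combinations = list(itertools.product(*dimensions.values()))
    # A seeded generator keeps the sample reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(all_combinations, k=min(k, len(all_combinations)))


tuples = sample_tuples(DIMENSIONS, k=9)
for combination in tuples:
    print(combination)
```

Sampling instead of keeping every combination keeps the set small enough to read. Some sampled tuples may describe users that don't exist for your product, so it is probably worth reviewing them by hand before generating queries from them.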

Each tuple becomes a prompt seed you can use to generate multiple queries 🚀
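As an illustration of what "prompt seed" can mean in practice, here is a minimal sketch that turns one tuple into a query-generation prompt. The function name `build_generation_prompt`, the dimension labels, the default product description, and the prompt wording are all assumptions made for this example, not a prescribed template. The resulting string would then be sent to whichever LLM you use.

```python
# Hypothetical labels, in the same order as the values inside each tuple.
DIMENSION_NAMES = (
    "user type",
    "intent",
    "expertise level",
    "query length",
    "query complexity",
    "query style",
)


def build_generation_prompt(seed_tuple, n_queries=3,
                            product="a technical product documentation assistant"):
    """Build a prompt asking an LLM for n_queries queries that match one tuple."""
    if len(seed_tuple) != len(DIMENSION_NAMES):
        raise ValueError("seed_tuple must contain one value per dimension")
    # One "- name: value" line per dimension.
    traits = "\n".join(
        f"- {name}: {value}" for name, value in zip(DIMENSION_NAMES, seed_tuple)
    )
    return (
        f"You are simulating a user of {product}.\n"
        f"Write {n_queries} different queries this user could realistically type.\n"
        f"The user and the queries have these characteristics:\n{traits}\n"
        "Return one query per line, with no numbering and no commentary."
    )


prompt = build_generation_prompt(
    ("end_user", "troubleshoot", "beginner", "short", "simple", "typos")
)
print(prompt)
```

Keeping the tuple next to each generated query makes it possible to later break down evaluation results by dimension.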

And there you have it: a systematic way to create synthetic data that covers a wide range of possible user queries for your specific application.

# Lesson 3: Have a systematic way to identify failure modes

When you start evaluating your LLM applications, you need to have a systematic way to identify failure modes.

# Lesson 4: Off-the-shelf evals don't work

# Lesson 5: The evaluation interface matters