# RAG Evaluation Without Ground Truth: Experiments

**Goal:** Try out evaluation frameworks that need neither `expected_answer` nor `ground_truth_context`.

## Problem

The current system depends on:
- ❌ `expected_answer`: not available in practice
- ❌ `ground_truth_context`: not available, because the RAG pipeline retrieves its contexts automatically

## Solution

Test 3 frameworks:
1. **DeepEval**: LLM-as-Judge (no ground truth mode)
2. **RAGAS**: RAG-specific metrics
3. **OpenRAG-Eval** (optional): research approach

## Workflow

1. Load test cases
2. Generate answers from RAG API
3. Evaluate with each framework
4. Meta-evaluate: compare consistency across frameworks
5. Human-in-the-loop validation
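
Step 4 of the workflow can be sketched as a rank correlation between the per-question scores of two frameworks. The snippet below is a minimal, standard-library illustration and not the notebook's actual meta-evaluation: the function names and the example scores are hypothetical, and Spearman correlation is just one reasonable choice of consistency measure.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one framework gives constant scores
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-question scores from two frameworks (not real results).
framework_a = [0.9, 0.2, 0.7, 0.4]
framework_b = [0.8, 0.1, 0.9, 0.3]
rho = spearman(framework_a, framework_b)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the two frameworks rank the answers similarly; questions where they disagree strongly are natural candidates for the human review in step 5. With SciPy installed, `scipy.stats.spearmanr` computes the same statistic.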

---

## 📦 Setup & Installation

Install the required libraries.

In [1]:
!pip install ragas datasets langchain langchain-community langchain-openai
!pip install deepeval pandas matplotlib seaborn scipy



In [2]:
# Install packages if needed (uncomment)
# !pip install ragas datasets langchain langchain-community langchain-openai
# !pip install deepeval pandas matplotlib seaborn scipy

import json
import os
import sys
import warnings
from typing import Any, Dict, List

import numpy as np
import pandas as pd
from dotenv import load_dotenv

# Suppress warnings
warnings.filterwarnings('ignore')

# Add the current directory to the import path
sys.path.append(os.path.abspath('.'))

# Load .env and check that the API key is set
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is not set in the .env file")

print("✅ Setup complete!")

✅ Setup complete!


## 1️⃣ Load Test Cases

Load and explore the test cases from `data/testcase/factual_testcase/results.json`.

In [3]:
# Load test cases
with open('../data/testcase/factual_testcase/results.json', 'r', encoding='utf-8-sig') as f:
    results = json.load(f)

print(f"📊 Loaded {len(results)} test cases\n")

# Display the first two test cases
print("Example test cases:")
for testcase in results[:2]:
    print(f"ID: {testcase['question_id']}")
    print(f"Question: {testcase['question']}")

📊 Loaded 18 test cases

Example test cases:
ID: F001
Question: Ngành Trí tuệ nhân tạo yêu cầu tối thiểu bao nhiêu tín chỉ để tốt nghiệp?
ID: F002
Question: Thời gian đào tạo ngành CNTT kéo dài bao nhiêu năm?
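
Before evaluating, it can help to check that every loaded record carries the fields the later cells read (`question_id`, `question`, `answer`, `contexts`). The helper below is a hypothetical sketch, not part of the original notebook, and the sample records are made up for illustration.

```python
def find_invalid_records(records):
    """Return (index, reason) pairs for records the later cells cannot use."""
    problems = []
    for i, rec in enumerate(records):
        # These fields must be present and non-empty.
        for field in ("question_id", "question", "answer"):
            if not rec.get(field):
                problems.append((i, f"missing {field}"))
        # contexts must be a non-empty list of retrieved chunks.
        contexts = rec.get("contexts")
        if not isinstance(contexts, list) or not contexts:
            problems.append((i, "no retrieved contexts"))
    return problems


# Hypothetical records that mirror the fields used in this notebook.
sample = [
    {"question_id": "F001", "question": "q1", "answer": "a1",
     "contexts": [{"title": "t", "content": "c"}]},
    {"question_id": "F002", "question": "q2", "answer": "", "contexts": []},
]
problems = find_invalid_records(sample)
for index, reason in problems:
    print(f"record {index}: {reason}")
```

Running `find_invalid_records(results)` right after loading would surface empty answers or missing contexts before any judge calls are made.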


## 3️⃣ RAGAS Evaluation (No Ground Truth)

RAGAS offers metrics that do not need `expected_answer`:
- **Faithfulness**: is the answer faithful to the retrieved context?
- **Answer Relevancy**: is the answer relevant to the question?

Both of these metrics are reference-free!
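
One way to make "reference-free" concrete is to list which dataset columns each metric reads and keep only the metrics whose columns are available. The table below is an assumption based on the usual RAGAS metric definitions, so it should be checked against the documentation of the installed version. The helper name `usable_metrics` is hypothetical.

```python
# Columns each metric reads. This table is an assumption based on the usual
# RAGAS definitions; verify it against the documentation of your version.
REQUIRED_COLUMNS = {
    "Faithfulness": {"question", "answer", "contexts"},
    "AnswerRelevancy": {"question", "answer"},
    "ContextRelevance": {"question", "contexts"},
    "AnswerCorrectness": {"question", "answer", "ground_truth"},
    "ContextRecall": {"question", "contexts", "ground_truth"},
}


def usable_metrics(available_columns):
    """Return the metric names whose required columns are all available."""
    available = set(available_columns)
    return sorted(name for name, needed in REQUIRED_COLUMNS.items() if needed <= available)


# Without ground truth, only the reference-free metrics remain.
reference_free = usable_metrics(["question", "answer", "contexts"])
print(reference_free)
```

Under this table, dropping the `ground_truth` column rules out `AnswerCorrectness` and `ContextRecall`, which is why those two cannot be part of a strictly ground-truth-free evaluation.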

In [4]:
! pip install langchain_openai




In [5]:
# Set up RAGAS with OpenAI models through the LangChain wrappers
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Judge LLM and embedding model used by the RAGAS metrics
openai_llm = ChatOpenAI(model="gpt-4o-mini")
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

ragas_llm = LangchainLLMWrapper(openai_llm)
ragas_embeddings = LangchainEmbeddingsWrapper(openai_embeddings)

In [6]:
import json
from ragas import evaluate
from datasets import Dataset

file_path = "../data/testcase/factual_testcase/results.json"

with open(file_path, "r", encoding="utf-8-sig") as f:
    results = json.load(f)

# Keep only the cases that have a non-empty answer
valid_results = [r for r in results if r.get("answer")]

# Normalize contexts: RAGAS expects contexts as list[str], but we keep the title too
def normalize_contexts(ctx_list):
    if not ctx_list:
        return []
    # Return list[str] in the format [title:{title}|content:{content}]
    return [f"[title:{c.get('title','')}|content:{c.get('content','')}]" for c in ctx_list]


ragas_data = {
    "question": [r["question"] for r in valid_results],
    "answer": [r["answer"] for r in valid_results],
    "contexts": [normalize_contexts(r.get("contexts")) for r in valid_results],
    # 🔥 Ground truth is included so the reference-based metrics
    # (AnswerCorrectness, ContextRecall) can run for comparison
    "ground_truth": [r.get("ground_truth", "") for r in valid_results],
}

ragas_dataset = Dataset.from_dict(ragas_data)

# Print each question with its contexts, answer, and ground truth
for r in valid_results:
    print(f"Question: {r['question']}\n")

    print("Contexts:")
    for c in normalize_contexts(r.get("contexts")):
        print(f"- {c}\n")  # blank line after each context

    print(f"Answer: {r['answer']}\n")
    print(f"Ground truth: {r.get('ground_truth', '')}\n")

    print("="*50 + "\n")  # separator between questions

Question: Ngành Trí tuệ nhân tạo yêu cầu tối thiểu bao nhiêu tín chỉ để tốt nghiệp?

Contexts:
- [title:Hệ Thông Thông Tin - Chương Trình Tiên Tiến - 4\. Quy định đào tạo, điều kiện tốt nghiệp - Điều kiện tốt nghiệp|content:\- Công nhận tốt nghiệp: Sinh viên đã tích lũy tối thiểu là 130 tín chỉ, đã hoàn thành các môn học bắt buộc đối với ngành Hệ thống thông tin Chương trình đào tạo tiên tiến, trình độ Anh văn đạt yêu cầu theo quy định của Trường dành riêng cho chương trình tiên tiến.
\- Sinh viên phải đáp ứng đủ các tiêu chuẩn khác theo Quy chế đào tạo hiện hành.
\- Sinh viên tốt nghiệp được cấp bằng: Cử nhân Hệ thống thông tin – chương trình đào tạo tiên tiến.]

- [title:Trí Tuệ Nhân Tạo - 5\. CHƯƠNG TRÌNH ĐÀO TẠO - 5.4 Khối kiến thức tốt nghiệp|content:● Sinh v

In [7]:
# ---------------------------------------------------
# 2️⃣ Import the metrics
# ---------------------------------------------------

from ragas.metrics import AnswerCorrectness, AnswerRelevancy, Faithfulness, ContextRelevance, ContextRecall
import pandas as pd

# AnswerCorrectness and ContextRecall compare against `ground_truth`;
# the other three metrics are reference-free.
metrics_dict = {
    "AnswerCorrectness": AnswerCorrectness(llm=ragas_llm),
    "AnswerRelevancy": AnswerRelevancy(embeddings=ragas_embeddings),
    "Faithfulness": Faithfulness(llm=ragas_llm),
    "ContextRelevance": ContextRelevance(llm=ragas_llm),
    "ContextRecall": ContextRecall()
}



In [12]:
summary_scores = {}

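# Run each metric in its own evaluate() call and keep the per-sample scores
for name, metric in metrics_dict.items():
    print(f"\n🚀 Running metric: {name} ...")

    result = evaluate(
        dataset=ragas_dataset,
        metrics=[metric],
        llm=ragas_llm,
        embeddings=ragas_embeddings,
        batch_size=5,
        show_progress=True
    )

    # _scores_dict is a private attribute of the result object and may change
    # between RAGAS versions; here it holds the list of per-sample scores.
    score = list(result._scores_dict.values())[0]
    summary_scores[name] = score

# One row whose cells each hold the full list of per-sample scores
df_scores = pd.DataFrame([summary_scores])
print("\n✅ Summary scores preview:")
print(df_scores)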



🚀 Running metric: AnswerCorrectness ...

Evaluating: 100%|██████████| 18/18 [03:53<00:00, 13.00s/it]

🚀 Running metric: AnswerRelevancy ...

Evaluating: 100%|██████████| 18/18 [01:40<00:00,  5.56s/it]

🚀 Running metric: Faithfulness ...

Evaluating: 100%|██████████| 18/18 [01:34<00:00,  5.25s/it]

🚀 Running metric: ContextRelevance ...

Evaluating: 100%|██████████| 18/18 [00:41<00:00,  2.29s/it]

🚀 Running metric: ContextRecall ...

Evaluating: 100%|██████████| 18/18 [00:50<00:00,  2.83s/it]

✅ Summary scores preview:
                                   AnswerCorrectness  \
0  [0.5296319994975731, 0.08293431489069471, 0.19...   

                                     AnswerRelevancy  \
0  [0.31958526800349046, 0.0, 0.3757968320058209,...   

                                        Faithfulness  \
0  [0.8571428571428571, 0.0, 1.0, 0.0, 1.0, 0.666...   

                                    ContextRelevance  \
0  [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.25, 1.0, 0.75...   

                                       ContextRecall  
0  [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, ...  
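
The preview above packs each metric's 18 scores into a single cell, which makes individual questions hard to inspect. A possible alternative layout, sketched here with a hypothetical helper and made-up scores, is one row per question with one column per metric.

```python
def to_rows(question_ids, scores_by_metric):
    """Turn {metric: [score per question]} into one dict per question."""
    for name, scores in scores_by_metric.items():
        if len(scores) != len(question_ids):
            raise ValueError(
                f"{name}: expected {len(question_ids)} scores, got {len(scores)}"
            )
    return [
        {"question_id": qid, **{name: scores[i] for name, scores in scores_by_metric.items()}}
        for i, qid in enumerate(question_ids)
    ]


# Hypothetical scores for two questions (not the results reported above).
rows = to_rows(
    ["F001", "F002"],
    {"Faithfulness": [0.9, 0.1], "AnswerRelevancy": [0.8, 0.0]},
)
for row in rows:
    print(row)
```

In this notebook the ids would come from `valid_results` (the filtered list the dataset was built from), and `pd.DataFrame(rows)` would give a table that can be sorted or filtered per question.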


In [13]:
import numpy as np

# Dict that stores the mean score of each metric
avg_scores = {}

for col in df_scores.columns:
    avg_scores[col] = np.mean(df_scores[col][0])  # df_scores[col][0] is the list of per-sample scores

# Print the results
print("\n✅ Average score for each metric:")
for metric, avg in avg_scores.items():
    print(f"{metric}: {avg:.4f}")

✅ Average score for each metric:
AnswerCorrectness: 0.2388
AnswerRelevancy: 0.1244
Faithfulness: 0.5142
ContextRelevance: 0.6528
ContextRecall: 0.5370
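
A plain mean turns into NaN as soon as one per-sample score is NaN, which may happen if a judge call fails for a sample. The sketch below is a hypothetical helper with a made-up score list, not part of the original notebook; it skips missing scores and reports how many were skipped.

```python
import math


def nan_safe_mean(scores):
    """Average the numeric scores, skipping None and NaN; also count the skipped ones."""
    kept = [s for s in scores if s is not None and not math.isnan(s)]
    skipped = len(scores) - len(kept)
    avg = sum(kept) / len(kept) if kept else float("nan")
    return avg, skipped


# Hypothetical score list with one failed sample.
avg, skipped = nan_safe_mean([1.0, float("nan"), 0.5])
print(f"mean={avg:.4f}, skipped={skipped}")
```

Reporting the skipped count alongside the mean keeps a metric that failed on many samples from looking comparable to one that scored all 18. NumPy's `np.nanmean` provides the NaN-skipping average without the count.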
