# RAG Evaluation Without Ground Truth: Experiments

**Goal:** Try out evaluation frameworks that do not require `expected_answer` or `ground_truth_context`.

## Problem

The current system depends on:
- ❌ `expected_answer`: not available in practice
- ❌ `ground_truth_context`: not available, because the RAG system retrieves context automatically

## Solution

Test 3 frameworks:
1. **DeepEval**: LLM-as-Judge (no-ground-truth mode)
2. **RAGAS**: RAG-specific metrics
3. **OpenRAG-Eval** (optional): research approach

## Workflow

1. Load test cases
2. Generate answers from RAG API
3. Evaluate with each framework
4. Meta-evaluate: compare consistency across frameworks
5. Human-in-the-loop validation
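
One way to make step 4 concrete is to rank-correlate the per-question scores that two frameworks assign. The sketch below is only an illustration under assumptions: the score lists are made-up placeholders, not results from this notebook, and Spearman correlation is one possible consistency measure among several. It uses only the standard library so it runs without `scipy`.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-question scores from two frameworks (not real results).
deepeval_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
ragas_scores = [0.8, 0.5, 0.9, 0.1, 0.6]
rho = spearman(deepeval_scores, ragas_scores)
print(f"Spearman rho = {rho:.3f}")
```

A value near 1 would suggest the two frameworks order the test cases similarly, while a value near 0 would be a reason to send more cases to the human review in step 5.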

---

## 📦 Setup & Installation

Install the required libraries.

In [11]:
!pip install ragas datasets langchain langchain-community langchain-openai
!pip install deepeval pandas matplotlib seaborn scipy python-dotenv



In [12]:
# Install packages if needed (uncomment)
# !pip install ragas datasets langchain langchain-community langchain-openai
# !pip install deepeval pandas matplotlib seaborn scipy python-dotenv

import json
import os
import sys
import warnings
from typing import Any, Dict, List

import numpy as np
import pandas as pd
from dotenv import load_dotenv

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Add the current directory (assumed to be the project root) to the import path
sys.path.append(os.path.abspath('.'))

# Load .env and check that the API key is set
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is not set in the .env file")

print("✅ Setup complete!")

✅ Setup complete!


## 1️⃣ Load Test Cases

Load and explore the test cases: reference answers from `../data/pipeline/query.json` and the pipeline results from the RAGAS evaluation JSON file.

In [13]:
import json
from datasets import Dataset

# 1. Load the ground-truth map (question -> reference answer)
with open('../data/pipeline/query.json', 'r', encoding='utf-8') as f:
    standard_data = json.load(f)

# Normalize keys by stripping surrounding whitespace (case is left unchanged)
gt_map = {item['original'].strip(): item['answer'] for item in standard_data}

# 2. Load the pipeline results from your JSON file
# Change this path to match the file you are using
file_path = '../data/pipeline/ragas_evaluation_hybrid_20260111_192121.json'
with open(file_path, 'r', encoding='utf-8-sig') as f:
    results_data = json.load(f)

ragas_input = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

# Handle results_data being either a list or a single dict
entries = results_data if isinstance(results_data, list) else [results_data]

for item in entries:
    # Access the list of results (depends on the structure of your file)
    res_list = item.get('results', []) if isinstance(item, dict) else []

    for r in res_list:
        q_raw = r.get('question', '').strip()

        # Keep only questions that have a reference answer
        if q_raw in gt_map:
            ragas_input["question"].append(q_raw)
            ragas_input["answer"].append(r.get('answer', ''))
            ragas_input["contexts"].append(r.get('contexts', []))
            ragas_input["ground_truth"].append(gt_map[q_raw])

ragas_dataset = Dataset.from_dict(ragas_input)
print(f"✅ Prepared {len(ragas_dataset)} test cases.")

if len(ragas_dataset) == 0:
    print("⚠️ WARNING: No test cases were found. Check the 'question' key in the JSON file.")

# Show one sample if there is data
if len(ragas_dataset) > 0:
    # Index 20 is an arbitrary sample; clamp it so a small dataset does not raise IndexError
    sample_idx = min(20, len(ragas_dataset) - 1)
    print("--- SAMPLE DATA CHECK ---")
    # Print the question
    print(f"❓ Question: {ragas_input['question'][sample_idx]}")

    # Print the answer from the RAG pipeline
    print(f"🤖 Answer: {ragas_input['answer'][sample_idx]}")

    # Print the ground truth for comparison
    print(f"🎯 Ground Truth: {ragas_input['ground_truth'][sample_idx]}")

    # Print the number of retrieved contexts
    print(f"📄 Contexts count: {len(ragas_input['contexts'][sample_idx])}")
    print("-" * 40)
else:
    print("❌ No data to display. Check the question-matching logic.")

✅ Prepared 35 test cases.
--- SAMPLE DATA CHECK ---
❓ Question: What should the structure of a graduation project look like?
🤖 Answer: Based on the information provided, the structure of a graduation project is as follows:

*   **INTRODUCTION:** Present the reasons for choosing the topic, and the purpose, subject, and scope of the research.
*   **OVERVIEW:** Analyze and evaluate existing research directions by domestic and international authors related to the topic; state the problems that remain; identify the problems the project needs to focus on and solve.
*   **EXPERIMENTAL OR THEORETICAL RESEARCH:** Present the theoretical basis, reasoning, scientific hypotheses, and research methods used in the project.
*   **PRESENTATION, EVALUATION, AND DISCUSSION OF RESULTS:** Briefly describe the research work [output truncated]
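
The loader above matches questions by exact string equality after `.strip()`, so a question that differs only in letter case, internal spacing, or Unicode composition (common with Vietnamese diacritics) is dropped without notice. The sketch below shows one more tolerant way to build the lookup key. `normalize_question` is a hypothetical helper and the two toy records are invented for illustration; they only mirror the `original`, `answer`, and `question` fields used in this notebook.

```python
import unicodedata


def normalize_question(text: str) -> str:
    """Hypothetical helper: NFC-normalize, casefold, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.casefold().split())


# Toy stand-ins for the entries in query.json and the pipeline results.
standard_data = [{"original": "What is  RAG? ", "answer": "Retrieval-augmented generation."}]
results = [{"question": "what is RAG?", "answer": "A retrieval-based approach.", "contexts": ["..."]}]

gt_map = {normalize_question(item["original"]): item["answer"] for item in standard_data}

# Track unmatched questions explicitly instead of dropping them silently.
matched, unmatched = [], []
for r in results:
    key = normalize_question(r.get("question", ""))
    (matched if key in gt_map else unmatched).append(r)

print(f"matched={len(matched)} unmatched={len(unmatched)}")
```

Keeping the `unmatched` list makes it easy to see which questions fell out of the evaluation set and why.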

## 3️⃣ RAGAS Evaluation (No Ground Truth)

RAGAS provides metrics that do not need `expected_answer`:
- **Faithfulness**: Is the answer faithful to the retrieved context?
- **Answer Relevancy**: Is the answer relevant to the question?

Both of these metrics are reference-free!
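
To see why Faithfulness needs no reference answer, it helps to look at its shape: the answer is broken into claims, and the score is the fraction of claims supported by the retrieved context. The sketch below illustrates only that ratio. In RAGAS both the claim extraction and the support verdict come from an LLM; here `toy_judge` is a deliberately naive substring check and the claims and context are invented, so treat it as an illustration rather than the library's implementation.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of answer claims that the judge finds supported by the contexts."""
    if not claims:
        return float("nan")  # nothing to verify
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def toy_judge(claim: str, contexts: List[str]) -> bool:
    """Stand-in for an LLM verdict: naive case-insensitive substring match."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)


# Invented example data.
contexts = ["The thesis has an introduction. The thesis has an overview chapter."]
claims = [
    "the thesis has an introduction",
    "the thesis has an overview chapter",
    "the thesis needs 200 pages",
]
score = faithfulness_score(claims, contexts, toy_judge)
print(f"faithfulness = {score:.2f}")
```

Only the question, the answer, and the retrieved contexts are involved, which is what makes the metric usable when no `expected_answer` exists.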

In [14]:
! pip install langchain_openai




In [15]:
# Set up RAGAS with OpenAI models via LangChain
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize the judge LLM and the embedding model for RAGAS
openai_llm = ChatOpenAI(model="gpt-4o-mini")
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

ragas_llm = LangchainLLMWrapper(openai_llm)
ragas_embeddings = LangchainEmbeddingsWrapper(openai_embeddings)

In [None]:
# ---------------------------------------------------
# 2️⃣ Import the async-compatible metrics
# ---------------------------------------------------

from ragas import evaluate
from ragas.metrics import AnswerCorrectness, AnswerRelevancy, Faithfulness, ContextRelevance, ContextRecall
import pandas as pd

# Note: AnswerCorrectness and ContextRecall compare against the `ground_truth` column,
# so only the remaining metrics here are reference-free.
metrics_dict = {
    "AnswerCorrectness": AnswerCorrectness(llm=ragas_llm),
    "AnswerRelevancy": AnswerRelevancy(embeddings=ragas_embeddings),
    "Faithfulness": Faithfulness(llm=ragas_llm),
    "ContextRelevance": ContextRelevance(llm=ragas_llm),
    "ContextRecall": ContextRecall()
}

# Per-question score lists, keyed by metric name
summary_scores = {}

# Evaluate one metric at a time
for name, metric in metrics_dict.items():
    print(f"\n🚀 Running metric: {name} ...")
    
    result = evaluate(
        dataset=ragas_dataset,
        metrics=[metric],
        llm=ragas_llm,
        embeddings=ragas_embeddings,
        batch_size=5,
        show_progress=True
    )
    
    # `_scores_dict` is a private attribute and may change between RAGAS versions
    score = list(result._scores_dict.values())[0]
    summary_scores[name] = score

df_scores = pd.DataFrame([summary_scores])
print("\n✅ Summary scores preview:")
print(df_scores)



🚀 Running metric: AnswerCorrectness ...


Evaluating:   0%|          | 0/35 [00:00<?, ?it/s]Exception raised in Job[4]: TimeoutError()
Exception raised in Job[2]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Evaluating:  14%|█▍        | 5/35 [03:03<18:21, 36.71s/it]Exception raised in Job[7]: TimeoutError()
Exception raised in Job[5]: TimeoutError()
Exception raised in Job[8]: TimeoutError()
Exception raised in Job[6]: TimeoutError()
Exception raised in Job[9]: TimeoutError()
Evaluating:  29%|██▊       | 10/35 [06:54<17:35, 42.24s/it]
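
The `TimeoutError` lines above indicate that individual jobs exceeded their time limit. RAGAS has its own run configuration for this (`RunConfig`, passed to `evaluate` as `run_config`), and the installed version's documentation is the place to check its exact fields. The sketch below is independent of RAGAS and only illustrates the general pattern with `asyncio`: cap the number of concurrent jobs, retry on timeout, and record NaN when a job never finishes. `make_job` is a hypothetical stand-in for a single LLM-judged metric call, and the delays and timeout are artificially small.

```python
import asyncio
import math


async def run_with_retry(job, *, timeout: float, retries: int, backoff: float = 0.0):
    """Await job() with a per-attempt timeout; return NaN after all attempts fail."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(job(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt < retries:
                # Exponential backoff before the next attempt.
                await asyncio.sleep(backoff * (2 ** attempt))
    return math.nan


async def run_all(jobs, *, max_concurrency: int, timeout: float, retries: int):
    """Run jobs with at most max_concurrency in flight, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with semaphore:
            return await run_with_retry(job, timeout=timeout, retries=retries)

    return await asyncio.gather(*(guarded(job) for job in jobs))


def make_job(delay: float, value: float):
    """Hypothetical stand-in for one LLM-judged metric call."""
    async def job():
        await asyncio.sleep(delay)
        return value
    return job


# The second job is slower than the timeout, so it ends up as NaN.
jobs = [make_job(0.01, 0.8), make_job(1.0, 0.5), make_job(0.01, 0.3)]
scores = asyncio.run(run_all(jobs, max_concurrency=2, timeout=0.1, retries=1))
print(scores)
```

Inside a notebook, where an event loop is already running, `await run_all(...)` would replace the `asyncio.run(...)` call.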

In [None]:
import numpy as np

# Dict holding the mean score per metric
avg_scores = {}

for col in df_scores.columns:
    # df_scores[col].iloc[0] is the list of per-question scores;
    # nanmean skips any NaN entries (for example, from failed jobs)
    avg_scores[col] = np.nanmean(df_scores[col].iloc[0])

# Print the results
print("\n✅ Average score for each metric:")
for metric, avg in avg_scores.items():
    print(f"{metric}: {avg:.4f}")



✅ Average score for each metric:
AnswerCorrectness: 0.3381
AnswerRelevancy: 0.3603
Faithfulness: 0.3103
ContextRelevance: 0.6357
ContextRecall: 0.4000
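
An average on its own hides how many test cases actually contributed to it, which matters when some jobs time out. The sketch below reports the mean together with the number of skipped entries. It assumes that a failed job is recorded as NaN, and the score lists are invented placeholders rather than this notebook's results.

```python
import math
from typing import Dict, List, Tuple


def nan_aware_mean(scores: List[float]) -> Tuple[float, int]:
    """Return (mean of valid scores, number of NaN scores that were skipped)."""
    valid = [s for s in scores if not math.isnan(s)]
    skipped = len(scores) - len(valid)
    mean = sum(valid) / len(valid) if valid else math.nan
    return mean, skipped


# Hypothetical per-question scores; NaN marks a job that failed or timed out.
example_scores: Dict[str, List[float]] = {
    "Faithfulness": [1.0, math.nan, 0.5, 0.0],
    "AnswerRelevancy": [0.4, 0.6, math.nan, math.nan],
}

for metric, scores in example_scores.items():
    mean, skipped = nan_aware_mean(scores)
    print(f"{metric}: {mean:.4f} ({skipped} of {len(scores)} skipped)")
```

Reporting the skipped count next to each mean makes it clear when two metrics are being compared on different subsets of the test cases.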


In [None]:
import pandas as pd
import csv

output_path = "hybrid_scores.csv"

# Export df_scores with UTF-8 encoding and quoting
df_scores.to_csv(
    output_path,               # output file name
    index=False,               # do not write the index column
    encoding="utf-8-sig",      # utf-8-sig lets Excel read Vietnamese text correctly
    quoting=csv.QUOTE_ALL      # wrap every value in double quotes
)

print(f"✅ df_scores saved to {output_path} with UTF-8 and quoting")


✅ df_scores saved to hybrid_scores.csv with UTF-8 and quoting
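
The export above stores each metric's whole score list in a single CSV cell, so the list has to be parsed back out of a string later. An alternative is to write one row per test case from the start. The sketch below shows that layout with the standard `csv` module; `write_long_format` and the `case_index` column are hypothetical names, and the scores are invented. In this notebook the input would be the `summary_scores` dict.

```python
import csv
import io
from typing import Dict, List


def write_long_format(scores: Dict[str, List[float]], handle) -> None:
    """Write one CSV row per test case and one column per metric."""
    metrics = list(scores)
    n_rows = max((len(v) for v in scores.values()), default=0)
    writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["case_index", *metrics])
    for i in range(n_rows):
        row = [i]
        for metric in metrics:
            values = scores[metric]
            # Pad with an empty cell if a metric has fewer scores.
            row.append(float(values[i]) if i < len(values) else "")
        writer.writerow(row)


# Hypothetical scores used only to show the resulting layout.
example_scores = {"Faithfulness": [1.0, 0.0], "AnswerRelevancy": [0.33, 0.76]}
buffer = io.StringIO()
write_long_format(example_scores, buffer)
print(buffer.getvalue())
```

Converting each value with `float()` also means NumPy scalars are written as plain numbers rather than as `np.float64(...)` strings.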


In [None]:
import pandas as pd
import ast
import csv

# 1. Read the file (utf-8-sig matches the encoding used when it was written)
df = pd.read_csv("hybrid_scores.csv", encoding="utf-8-sig")

# 2. Clean-up helper (same logic as before, with added error handling)
def clean_and_parse(s):
    """Parse a stringified list of scores; return [] if it cannot be parsed."""
    if isinstance(s, list):
        return s
    if not isinstance(s, str) or s.lower() == 'nan':
        return []
    # Strip the np.float64(...) wrappers so the string becomes a plain list literal
    s = s.replace("np.float64(", "").replace(")", "")
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []

# 3. Explode each column independently to avoid index errors
exploded_cols = {}
for col in df.columns:
    # Take the first row, parse it into a list, then convert it to a Series
    data_list = clean_and_parse(df[col].iloc[0])
    exploded_cols[col] = pd.Series(data_list)

# Build a new DataFrame from the exploded columns
df_flat = pd.DataFrame(exploded_cols)

# 4. Export to CSV
output_file = "hybrid_results_flat.csv"
df_flat.to_csv(
    output_file, 
    index=False, 
    encoding="utf-8-sig",
    quoting=csv.QUOTE_NONNUMERIC
)

print(f"✅ Done! File '{output_file}' is ready.")
print(df_flat)

✅ Done! File 'naive_results_flat.csv' is ready.
    AnswerCorrectness  AnswerRelevancy  Faithfulness  ContextRelevance  \
0            0.161569         0.338033      0.166667              0.75   
1            0.165271         0.306052      0.000000              1.00   
2            0.715650         0.337067      0.000000              1.00   
3            0.414763         0.761386      0.000000              0.00   
4            0.773079         0.334505      1.000000              1.00   
5            0.158986         0.295671      1.000000              1.00   
6            0.175921         0.325610      1.000000              1.00   
7            0.711161         0.182434      0.000000              0.25   
8            0.166561         0.474354      0.000000              0.25   
9            0.138070         0.373547      0.000000              0.75   
10           0.172681         0.269540      0.357143              0.75   
11           0.775811         0.257310  