<a href="https://colab.research.google.com/github/duper203/RAG_Techniques_with_upstage/blob/main/upstage/23_evaluation_deep_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Evaluation of RAG Systems using deepeval

**Note:** This notebook does not use any Upstage product.

In [None]:
! pip3 install -qU deepeval langchain-upstage langchain langchain-community faiss-cpu sentence_transformers

In [3]:
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

In [7]:
correctness_metric = GEval(
    name="Correctness",
    model="gpt-4o",
    # Fields of the test case that the judge model is allowed to compare.
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    evaluation_steps=[
        "Determine whether the actual output is factually correct based on the expected output."
    ],
)

gt_answer = "Madrid is the capital of Spain."
pred_answer = "MadriD."

test_case_correctness = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output=gt_answer,
    actual_output=pred_answer,
)

correctness_metric.measure(test_case_correctness)
print(correctness_metric.score)

0.16304948566525715


In [8]:
question = "what is 3+3?"
context = ["6"]
generated_answer = "6"

faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=False
)

test_case = LLMTestCase(
    input=question,
    actual_output=generated_answer,
    retrieval_context=context,
)

faithfulness_metric.measure(test_case)
print(faithfulness_metric.score)
print(faithfulness_metric.reason)

1
None


In [9]:
actual_output = "then go somewhere else."
retrieval_context = ["this is a test context","mike is a cat","if the shoes don't fit, then go somewhere else."]
gt_answer = "if the shoes don't fit, then go somewhere else."

relevance_metric = ContextualRelevancyMetric(
    threshold=1,
    model="gpt-4",
    include_reason=True
)
relevance_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context,
    expected_output=gt_answer,
)

relevance_metric.measure(relevance_test_case)
print(relevance_metric.score)
print(relevance_metric.reason)

0.3333333333333333
The score is 0.33 because the majority of the retrieval context, including statements like 'this is a test context' and 'mike is a cat', was found to be irrelevant to the input about shoes fitting. However, the statement 'if the shoes don't fit, then go somewhere else' was identified as relevant, contributing to the score.
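
The reported 0.33 is consistent with the score being the fraction of retrieved statements judged relevant to the input: one of three here. The sketch below reproduces that arithmetic in plain Python. It is an illustration under that assumption, not deepeval's implementation; the `contextual_relevancy` helper and the hard-coded verdicts are hypothetical stand-ins for what the judge model decides.

```python
def contextual_relevancy(verdicts):
    """Fraction of retrieved statements marked relevant (assumed scoring rule)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts mirroring the reason string above:
# "this is a test context" -> irrelevant
# "mike is a cat"          -> irrelevant
# "if the shoes don't fit, then go somewhere else." -> relevant
verdicts = [False, False, True]

score = contextual_relevancy(verdicts)
print(score)  # 0.3333333333333333

# With threshold=1, every retrieved statement must be relevant to pass.
passed = score >= 1
print(passed)  # False
```

Under this reading, `threshold=1` is a strict setting: a single irrelevant chunk in the retrieval context is enough to fail the test case.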


In [10]:
new_test_case = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output="Madrid is the capital of Spain.",
    actual_output="MadriD.",
    retrieval_context=["Madrid is the capital of Spain."]
)

In [11]:
evaluate(
    test_cases=[relevance_test_case, new_test_case],
    metrics=[correctness_metric, faithfulness_metric, relevance_metric]
)

Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Correctness (GEval) (score: 0.162045571714766, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output 'MadriD.' is a misspelled and incomplete version of the expected output 'Madrid is the capital of Spain.', error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4, reason: None, error: None)
  - ✅ Contextual Relevancy (score: 1.0, threshold: 1.0, strict: False, evaluation model: gpt-4, reason: The score is 1.00 because the retrieval context accurately provides the information asked in the input, confirming that 'Madrid is the capital of Spain.', error: None)

For test case:

  - input: What is the capital of Spain?
  - actual output: MadriD.
  - expected output: Madrid is the capital of Spain.
  - context: None
  - retrieval context: ['Madrid is the capital of Spain.']


Metrics Summary

  - ✅ Correctness (GEval) (score: 0.5763010506497754, threshold: 0.5, strict: False, evaluation model: g




EvaluationResult(test_results=[TestResult(success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.162045571714766, reason="The actual output 'MadriD.' is a misspelled and incomplete version of the expected output 'Madrid is the capital of Spain.'", strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.000985, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]'), MetricData(name='Faithfulness', threshold=0.7, success=True, score=1.0, reason=None, strict_mode=False, evaluation_model='gpt-4', error=None, evaluation_cost=0.01245, verbose_logs='Truths (limit=None):\n[\n    "Madrid is the capital of Spain."\n] \n \nClaims:\n[] \n \nVerdicts:\n[]'), MetricData(name='Contextual Relevancy', threshold=1.0, success=True, score=1.0, reason="The score is 1.00 because the retrieval context accurately provides the information ask
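
The summary for the "capital of Spain" test case can be read as a simple rule: a metric passes when its score is at least its threshold, and the test case passes only if every metric does. The sketch below replays the three scores printed above under that assumption. `MetricResult` and `case_passes` are hypothetical names used for illustration; they are not part of deepeval.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one row of the metrics summary."""
    name: str
    score: float
    threshold: float

    @property
    def success(self) -> bool:
        # Assumed rule: pass when the score reaches the threshold.
        return self.score >= self.threshold


def case_passes(results):
    """A test case passes only if all of its metrics pass."""
    return all(r.success for r in results)


# Scores and thresholds copied from the summary above.
results = [
    MetricResult("Correctness (GEval)", 0.162045571714766, 0.5),
    MetricResult("Faithfulness", 1.0, 0.7),
    MetricResult("Contextual Relevancy", 1.0, 1.0),
]

for r in results:
    print(f"{r.name}: {'pass' if r.success else 'fail'}")

print(case_passes(results))  # False
```

This matches `success=False` on the `TestResult` above: Correctness alone falls below its threshold, which is enough to fail the whole case.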

In [12]:
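def create_deep_eval_test_cases(questions, gt_answers, generated_answers, retrieved_documents):
    """Build one LLMTestCase per question.

    The four lists are paired by position; each element of
    retrieved_documents should itself be a list of strings.
    """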
    return [
        LLMTestCase(
            input=question,
            expected_output=gt_answer,
            actual_output=generated_answer,
            retrieval_context=retrieved_document
        )
        for question, gt_answer, generated_answer, retrieved_document in zip(
            questions, gt_answers, generated_answers, retrieved_documents
        )
    ]
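
Because the helper pairs its inputs with `zip`, lists of unequal length are silently truncated to the shortest one, and a bare string passed as a retrieved document would not be the list of strings that `retrieval_context` expects. A minimal sketch of a pre-check follows. `validate_rag_inputs` is a hypothetical helper, not part of deepeval, and the sample data is made up for illustration.

```python
def validate_rag_inputs(questions, gt_answers, generated_answers, retrieved_documents):
    """Raise early if the inputs cannot be paired one-to-one (hypothetical helper)."""
    lengths = {
        "questions": len(questions),
        "gt_answers": len(gt_answers),
        "generated_answers": len(generated_answers),
        "retrieved_documents": len(retrieved_documents),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Input lists differ in length: {lengths}")

    for i, docs in enumerate(retrieved_documents):
        # A bare string is iterable, so reject it explicitly.
        if isinstance(docs, str) or not all(isinstance(d, str) for d in docs):
            raise TypeError(f"retrieved_documents[{i}] must be a list of strings")


# Illustrative data only.
questions = ["What is the capital of Spain?"]
gt_answers = ["Madrid is the capital of Spain."]
generated_answers = ["Madrid."]
retrieved_documents = [["Madrid is the capital of Spain."]]

validate_rag_inputs(questions, gt_answers, generated_answers, retrieved_documents)
print("inputs look consistent")
```

Calling this before `create_deep_eval_test_cases` turns a silently shortened evaluation set into an explicit error. On Python 3.10 or later, `zip(..., strict=True)` offers the same length check inside the comprehension itself.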