## Evaluate RAG Quality
##### Evaluates the app by running an experiment in LangSmith with the following metrics:
- Accuracy - Is the answer correct according to the ground-truth answer?
- Recall - Are the retrieved documents relevant to the question?
- Truthfulness - Did the response stay within the retrieved documents, or did it hallucinate?

Do not add code to this notebook to run regular RAG inference, or the traces may be logged under the wrong tracing project name. Use inference_tester.ipynb instead.


In [1]:
from dotenv import load_dotenv
import os, sys
import streamlit as st

load_dotenv('/Users/drew_wilkins/Drews_Files/Drew/Python/VSCode/.env')

# Add the parent directory to sys.path so you can import your modules from a subdirectory
sys.path.append(os.path.abspath('..'))

import rag
from rag import CONFIG

In [2]:
# Configure LangSmith if you also want the traces
os.environ["LANGCHAIN_API_KEY"] = st.secrets["LANGCHAIN_API_KEY"]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "langchain_evaluator.ipynb on ASK main/local"
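
Because the cell above sets the tracing project globally, any other code run in this kernel inherits it. One way to limit the scope is a small context manager that restores the previous value on exit. This is a minimal sketch: `tracing_project` is a hypothetical helper, not part of LangSmith, and whether the LangSmith client picks up a value changed mid-session is an assumption to verify.

```python
import os
from contextlib import contextmanager


@contextmanager
def tracing_project(name: str):
    """Temporarily set LANGCHAIN_PROJECT, restoring the previous value on exit."""
    previous = os.environ.get("LANGCHAIN_PROJECT")
    os.environ["LANGCHAIN_PROJECT"] = name
    try:
        yield
    finally:
        # Put the environment back exactly as it was found.
        if previous is None:
            os.environ.pop("LANGCHAIN_PROJECT", None)
        else:
            os.environ["LANGCHAIN_PROJECT"] = previous


# Only code inside the block sees the evaluator's project name.
with tracing_project("langchain_evaluator.ipynb on ASK main/local"):
    inside = os.environ["LANGCHAIN_PROJECT"]
```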

In [3]:
from langsmith.evaluation import evaluate
from langsmith import Client
from langchain import hub
from langchain_openai import ChatOpenAI
from langsmith import traceable

client = Client()

eval_model = "gpt-4o-mini"

### Set up the Accuracy Evaluator

In [4]:
grade_prompt_accuracy = hub.pull("cot_qa_drew")


def accuracy_evaluator(run, example) -> dict:
    """
    A simple evaluator for detecting generation accuracy
    """

    # Inputs to the evaluator from the eval set
    question = example.inputs["question"]
    reference = example.outputs["ground_truth_answer"]

    # Input to the evaluator from the RAG output
    prediction = run.outputs["answer"]

    llm = ChatOpenAI(model=eval_model, temperature=0, stream_usage=True)

    # Define the grader
    answer_grader = grade_prompt_accuracy | llm

    # Get the score by passing the populated prompt to the grader.
    # The prompt template takes "query", "ground_truth_answer", and "answer" as inputs.
    grader_response = answer_grader.invoke({
        "query": question,
        "ground_truth_answer": reference,
        "answer": prediction,
    })

    correctness = grader_response["correctness"]
    explanation = grader_response["explanation"]

    return {
        "key": "Accuracy",
        "score": correctness,  # Numerical score expected by the evaluator
        "value": "Correct" if correctness == 1 else "Incorrect",  # Optional categorical value
        "comment": explanation,  # Additional metadata
    }
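
The feedback dictionary returned above can be checked offline, without calling an LLM or LangSmith. The sketch below is an illustration only: `build_accuracy_feedback` and `fake_grader` are hypothetical names, the stub objects mimic just the attributes the evaluator reads, and the exact-match grader is a stand-in for the real prompt-based grader.

```python
from types import SimpleNamespace


def build_accuracy_feedback(correctness: int, explanation: str) -> dict:
    """Package a grader verdict in the feedback shape used by accuracy_evaluator."""
    return {
        "key": "Accuracy",
        "score": correctness,
        "value": "Correct" if correctness == 1 else "Incorrect",
        "comment": explanation,
    }


def fake_grader(inputs: dict) -> dict:
    """Hypothetical stand-in for the LLM grader: case-insensitive exact match."""
    match = inputs["answer"].strip().lower() == inputs["ground_truth_answer"].strip().lower()
    return {
        "correctness": int(match),
        "explanation": "exact match" if match else "answers differ",
    }


# Stub objects exposing only .inputs and .outputs, as the evaluator expects.
example = SimpleNamespace(
    inputs={"question": "What is 2 + 2?"},
    outputs={"ground_truth_answer": "4"},
)
run = SimpleNamespace(outputs={"answer": "4"})

verdict = fake_grader({
    "query": example.inputs["question"],
    "ground_truth_answer": example.outputs["ground_truth_answer"],
    "answer": run.outputs["answer"],
})
feedback = build_accuracy_feedback(verdict["correctness"], verdict["explanation"])
print(feedback)
```

Swapping the stub grader for the real chain leaves the packaging step unchanged, which makes the return shape easy to test on its own.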

### Set up the Recall Evaluator

In [5]:
from langsmith.evaluation import LangChainStringEvaluator


recall_evaluator = LangChainStringEvaluator(
    "score_string",
    config={
        "criteria": {
            "Recall": """The Assistant's Answer is a set of documents retrieved from a vectorstore. The input is a question used for retrieval. You will score whether the Assistant's Answer (retrieved documents) are relevant to the input question. The score should be between 0 and 10. A score of [[0]] means that the Assistant answer contains documents that are not at all relevant to the input question. A score of [[5]] means that the Assistant answer contains some documents that are relevant to the input question. A score of [[10]] means that all of the Assistant answer documents are relevant to the input question. If the user's question is unclear or seems to be based on a misunderstanding or typo, you should give a score of [[0]].""",
        },
        "normalize_by": 10,
        "llm": ChatOpenAI(model=eval_model, temperature=0, stream_usage=True),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["context"],
        "input": example.inputs["question"],
    }
)

This chain was only tested with GPT-4. Performance may be significantly worse with other models.
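
The `"normalize_by": 10` setting means the 0 to 10 rating requested in the criteria ends up as a 0 to 1 feedback score. The sketch below illustrates that arithmetic only. `parse_normalized_score` is a hypothetical helper, not LangChain's own parser, which may be stricter about which ratings it accepts.

```python
import re
from typing import Optional

# Matches a double-bracketed integer rating such as [[7]].
_DOUBLE_BRACKETS = re.compile(r"\[\[(\d+)\]\]")


def parse_normalized_score(text: str, normalize_by: float = 10) -> Optional[float]:
    """Extract the first [[n]] rating from grader text and divide it by normalize_by."""
    match = _DOUBLE_BRACKETS.search(text)
    if match is None:
        return None  # No rating found in the grader's reply
    return int(match.group(1)) / normalize_by


print(parse_normalized_score("Some documents are on topic. Rating: [[5]]"))    # 0.5
print(parse_normalized_score("All documents are relevant. Rating: [[10]]"))    # 1.0
```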


### Set up the Truthfulness Evaluator

In [6]:
grade_prompt_truthfulness = hub.pull("langchain-ai/rag-answer-hallucination")


def hallucination_evaluator(run, example) -> dict:
    """
    A simple evaluator for detecting generation hallucinations
    """

    # Inputs to Evaluator from Eval set
    input_question = example.inputs["question"]

    # Inputs to Evaluator from RAG output
    contexts = run.outputs["context"]
    prediction = run.outputs["answer"]

    # LLM grader (other model options: gpt-4-turbo, gpt-4o-mini)
    llm = ChatOpenAI(model=eval_model, temperature=0, stream_usage=True)

    # Structured prompt
    answer_grader = grade_prompt_truthfulness | llm

    # Get score by passing populated prompt to the evaluator
    # The evaluator template expects "documents" and "student_answer" as inputs
    # The evaluator returns "Score" (int) and "Explanation" (str) as output
    grader_response = answer_grader.invoke({"documents": contexts,
                                            "student_answer": prediction})
    score = grader_response["Score"]
    explanation = grader_response["Explanation"]

    return {"key": "Truthfulness", "score": score, "comment": explanation}

### Configure your Evaluation

In [7]:
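dataset_name = "ASK-groundtruth-v2"
# Other dataset names: ASK-groundtruth_v1, initial_EDA

# Default: evaluate every example in the dataset
data = dataset_name

# Override: evaluate only the "1_question" split (this may no longer be needed)
data = client.list_examples(dataset_name=dataset_name, splits=["1_question"])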


experiment_prefix = "ASK_Eval-cost-Eval-llm-gpt-4o-mini"

experiment_description = "Testing cost one last time before conducting the baseline test using gpt-4o-mini for Eval and gpt-3.5-turbo-16k for RAG. This will run an eval over a single question 1x. \n\nNAMING CONVENTION\nAppName_TestMetrics_TestVariables \nExample: ASK_ART_llm-gpt-4o-mini\nTest metrics are ART = Accuracy, Recall, Truthfulness. Test Variable is gpt-4o-mini which we will compare against some other llm. Other examples of TestMetrics could be Eval_cost, App_cost, App_time, etc."
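
The naming convention in the description (AppName_TestMetrics_TestVariables) can be applied with a small helper so that prefixes stay consistent across experiments. This is a minimal sketch; `make_experiment_prefix` is a hypothetical function, not part of LangSmith.

```python
def make_experiment_prefix(app_name: str, test_metrics: str, test_variables: str) -> str:
    """Join the three naming components with underscores: AppName_TestMetrics_TestVariables."""
    parts = [app_name, test_metrics, test_variables]
    for part in parts:
        if not part:
            raise ValueError("every naming component must be non-empty")
    return "_".join(parts)


# Reproduces the example given in the experiment description.
print(make_experiment_prefix("ASK", "ART", "llm-gpt-4o-mini"))  # ASK_ART_llm-gpt-4o-mini
```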

### Run the Evaluation
 [OpenAI API pricing is here](https://openai.com/api/pricing/)

In [8]:
evaluate(
    # Each example's input is a single-field dict; rag.rag() expects a plain string,
    # so unwrap the "question" field before calling it.
    lambda inputs: rag.rag(inputs["question"]),
    data=data,
    evaluators=[accuracy_evaluator, recall_evaluator, hallucination_evaluator],
    experiment_prefix=experiment_prefix,
    description=experiment_description,
    num_repetitions=1,
    metadata=CONFIG,
)  # type: ignore  # Suppresses a type-checker error

View the evaluation results for experiment: 'ASK_Eval-cost-Eval-llm-gpt-4o-mini-dd697f2c' at:
https://smith.langchain.com/o/3941ecea-6957-508c-9f4f-08ed62dc7d61/datasets/0b24ff94-f4f0-4197-89f3-765f835936c9/compare?selectedSessions=6bc892da-bf53-403d-aca5-22a5912a4182






| Field | Value |
| --- | --- |
| Unnamed: 0 | 0 |
| inputs.question | How is harassment generally defined, and who i... |
| outputs.user_question | How is harassment generally defined, and who i... |
| outputs.enriched_question | How is harassment generally defined, and who i... |
| outputs.context | `[page_content='COMDTINST M16790.1G \n \n \n \n...` |
| outputs.answer | Harassment is generally defined as unwelcome a... |
| outputs.llm_sources | [COMDTINST M16790.1G, COMDTINST M5350.4E] |
| error | |
| reference.ground_truth_answer | Harassment is generally defined as unwelcome a... |
| reference.ground_truth_sources | [] |
| feedback.Accuracy | 1 |
| feedback.score_string:Recall | 1.0 |
| feedback.Truthfulness | 1 |
| execution_time | 4.155902 |
| example_id | a3326a11-5c24-4ceb-a04e-5bb708dd9b38 |
| id | 551bda83-8ba3-45c6-bb42-3af637d525fb |
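
With more than one example, the three feedback columns in the results above can be averaged to get one number per metric. The sketch below uses only the standard library and assumes the results were exported to CSV with those column names. `summarize_feedback` is a hypothetical helper, and the two-row sample is made up for illustration, not real experiment output.

```python
import csv
import io
from statistics import mean

FEEDBACK_COLUMNS = [
    "feedback.Accuracy",
    "feedback.score_string:Recall",
    "feedback.Truthfulness",
]


def summarize_feedback(csv_text: str) -> dict:
    """Return the mean of each feedback column, skipping blank cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in FEEDBACK_COLUMNS:
        values = [float(row[column]) for row in rows if row.get(column)]
        summary[column] = mean(values) if values else None
    return summary


# Made-up sample rows (not real results), using the column names from the export.
sample = """inputs.question,feedback.Accuracy,feedback.score_string:Recall,feedback.Truthfulness,execution_time
Q1,1,1.0,1,4.2
Q2,0,0.5,1,3.8
"""
print(summarize_feedback(sample))
```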
