## Evaluate RAG Quality
##### Evaluates the app by running an experiment in LangSmith  
Do not add code to this notebook to run regular RAG inference, or the traces may be logged under the wrong tracing project name. Use inference_tester.ipynb instead.

Tests:
- Accuracy (COT Answer Accuracy)
- Recall: How many of the relevant documents were retrieved?
- Precision: How well did the response answer the question given the retrieved documents?
- Truthfulness: Did the response stray from the documents or hallucinate?


In [1]:
# %pip install pip --upgrade

In [2]:
from dotenv import load_dotenv
import os, sys

load_dotenv('/Users/drew_wilkins/Drews_Files/Drew/Python/VSCode/.env')

# Add the parent directory to sys.path so you can import your modules from a subdirectory
sys.path.append(os.path.abspath('..'))

import rag
from rag import CONFIG

In [3]:
# Config LangSmith if you also want the traces
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "langchain_evaluator.ipynb on ASK main/local"

In [4]:
from langsmith.evaluation import evaluate
from langsmith import Client

client = Client()

### Set up the Accuracy Evaluator

In [5]:
from langsmith.evaluation import LangChainStringEvaluator
from langchain_openai import ChatOpenAI


def prepare_cot_qa_data(run, example):
    '''
    Create a dictionary for the evaluator to use.

    run is the LangSmith run that recorded the RAG function's output
    example is the example from the dataset
    '''
    return {
        "input": example.inputs["question"],
        "reference": example.outputs["ground_truth_answer"],
        "prediction": run.outputs["answer"],
    }


# Initialize the LLM with the desired model and temperature
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# cot_qa uses the CotQAEvalChain class which uses the prompt template here: https://smith.langchain.com/hub/wfh/cot_qa
accuracy_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": llm},
    prepare_data=prepare_cot_qa_data,
)
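
The mapping can be sanity-checked without calling LangSmith or an LLM. The sketch below repeats the function so it runs on its own and uses `SimpleNamespace` objects as hypothetical stand-ins for the real run and example; only the `inputs` and `outputs` attributes used above are mimicked, and the sample strings are made up.

```python
from types import SimpleNamespace


def prepare_cot_qa_data(run, example):
    """Same mapping as the cell above, repeated so this sketch is self-contained."""
    return {
        "input": example.inputs["question"],
        "reference": example.outputs["ground_truth_answer"],
        "prediction": run.outputs["answer"],
    }


# Hypothetical stand-ins for the LangSmith run and example objects.
fake_example = SimpleNamespace(
    inputs={"question": "What is the question?"},
    outputs={"ground_truth_answer": "The reference answer."},
)
fake_run = SimpleNamespace(outputs={"answer": "The model's answer."})

mapped = prepare_cot_qa_data(fake_run, fake_example)
print(mapped)
```

If the dataset or the RAG output uses different field names, a `KeyError` here is cheaper to hit than one in the middle of an experiment.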

### Set up the Recall Evaluator

In [6]:
from langsmith.evaluation import LangChainStringEvaluator


recall_evaluator = LangChainStringEvaluator(
    "score_string",
    config={
        "criteria": {
            "Recall": """The Assistant's Answer is a set of documents retrieved from a vectorstore. The input is a question used for retrieval. You will score whether the Assistant's Answer (retrieved documents) are relevant to the input question. The score should be between 0 and 10. A score of [[0]] means that the Assistant answer contains documents that are not at all relevant to the input question. A score of [[5]] means that the Assistant answer contains some documents that are relevant to the input question. A score of [[10]] means that all of the Assistant answer documents are relevant to the input question. If the user\'s question is unclear or seems to be based on a misunderstanding or typo, you should give a score of [[0]].""",
        },
        "normalize_by": 10,
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["context"],
        "input": example.inputs["question"],
    }
)

This chain was only tested with GPT-4. Performance may be significantly worse with other models.
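
The prompt asks the grader for a double-bracketed rating from 0 to 10, and `normalize_by` divides that rating so the recorded score falls between 0 and 1. As a rough illustration only (this is not LangChain's own parser; the function name and regular expression are assumptions), the extraction and scaling could look like this:

```python
import re
from typing import Optional


def parse_normalized_score(grader_text: str, normalize_by: float = 10) -> Optional[float]:
    """Hypothetical helper: find the first [[N]] rating in the text and scale it."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", grader_text)
    if match is None:
        return None  # the grader did not follow the requested format
    return float(match.group(1)) / normalize_by


print(parse_normalized_score("Some documents are on topic. Rating: [[5]]"))
```

With `normalize_by` set to 10, a raw rating of 5 is recorded as 0.5 and a raw rating of 10 as 1.0.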


### Set up the Truthfulness Evaluator

In [7]:
from langchain import hub
from langchain_openai import ChatOpenAI


# Prompt to grade Truthfulness. It's a Langchain object with two inputs: "documents", "student_answer"
grade_prompt_truthfulness = prompt = hub.pull(
    "langchain-ai/rag-answer-hallucination")


def hallucination_evaluator(run, example) -> dict:
    """
    A simple evaluator for detecting generation hallucinations
    """

    # RAG inputs
    input_question = example.inputs["question"]
    contexts = run.outputs["context"]
    prediction = run.outputs["answer"]

    # LLM grader
    # Other candidate models: gpt-4-turbo, gpt-4o-mini
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    # Structured prompt
    answer_grader = grade_prompt_truthfulness | llm

    # Get score by passing populated prompt to the evaluator
    score = answer_grader.invoke({"documents": contexts,
                                  "student_answer": prediction})
    score = score["Score"]

    return {"key": "Truthfulness", "score": score}
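
A custom evaluator like this one is a plain function that receives the run and the example and returns a dictionary with a `key` and a `score`. The sketch below keeps that shape but swaps the LLM grader for a hypothetical injected callable (`fake_grader`, a toy substring check rather than the hub prompt), so the plumbing can be checked offline. It also assumes the context is a list of plain strings, which may differ from what the RAG function returns.

```python
from types import SimpleNamespace
from typing import Callable


def make_truthfulness_evaluator(grader: Callable[[dict], dict]):
    """Build an evaluator around any grader that returns {"Score": ...}."""
    def evaluator(run, example) -> dict:
        result = grader({
            "documents": run.outputs["context"],
            "student_answer": run.outputs["answer"],
        })
        return {"key": "Truthfulness", "score": result["Score"]}
    return evaluator


def fake_grader(inputs: dict) -> dict:
    """Toy stand-in for the LLM: 1 if the answer appears verbatim in the documents, else 0."""
    documents = " ".join(inputs["documents"])
    return {"Score": int(inputs["student_answer"] in documents)}


evaluator = make_truthfulness_evaluator(fake_grader)
run = SimpleNamespace(outputs={"context": ["The sky is blue."], "answer": "The sky is blue."})
result = evaluator(run, example=None)
print(result)
```

Injecting the grader also avoids constructing a new `ChatOpenAI` client on every call, which the function above does each time it is invoked.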

### Configure your Evaluation

In [8]:

dataset_name = "ASK-groundtruth-v2"
# Alternatives: ASK-groundtruth_v1, initial_EDA

split_name = "1_question"

# Evaluate only the examples in the chosen split.
# To run over the whole dataset instead, use: data = dataset_name
data = client.list_examples(dataset_name=dataset_name, splits=[split_name])


experiment_prefix = "ASK_ART_eval-llm-gpt-4o-mini"

experiment_description = "Testing cost using gpt-4o-mini for Eval and gpt-4-turbo for RAG. This will run an eval over a single question 1x. Naming convention: AppName-TestType-TestVariables. ART stands for Accuracy, Recall, Truthfulness. oai = OpenAI model. accuracy, recall, truthfulness are the test variables."
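
To keep experiment names consistent across runs, the prefix can be assembled from its parts instead of typed by hand. The helper below is hypothetical (its name and arguments are not part of the notebook) and simply mirrors the shape of the prefix string used above.

```python
def build_experiment_prefix(app_name: str, test_type: str, *test_variables: str) -> str:
    """Hypothetical helper: join the parts as AppName_TestType_var1-var2-..."""
    return "_".join([app_name, test_type, "-".join(test_variables)])


prefix = build_experiment_prefix("ASK", "ART", "eval-llm", "gpt-4o-mini")
print(prefix)
```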

### Run the Evaluation
 [OpenAI API pricing is here](https://openai.com/api/pricing/)

In [9]:
def target_function(input: dict):
    '''Maps the example input, a single-field dictionary, to the RAG function under test, which accepts a string.'''
    return rag.rag(input["question"])


evaluate(
    target_function,
    data=data,
    evaluators=[accuracy_evaluator, recall_evaluator,
                hallucination_evaluator],
    experiment_prefix=experiment_prefix,
    num_repetitions=1,
    metadata=CONFIG,
)  # type: ignore    # This suppresses a type-checker error

View the evaluation results for experiment: 'ASK_ART_eval-llm-gpt-4o-mini-4bf178af' at:
https://smith.langchain.com/o/3941ecea-6957-508c-9f4f-08ed62dc7d61/datasets/0b24ff94-f4f0-4197-89f3-765f835936c9/compare?selectedSessions=cb135756-eff8-4189-b5bd-e66472b0e76a





| Field | Value |
|---|---|
| Unnamed: 0 | 0 |
| inputs.question | How is harassment generally defined, and who i... |
| outputs.user_question | How is harassment generally defined, and who i... |
| outputs.enriched_question | How is harassment generally defined, and who i... |
| outputs.context | [page_content='COMDTINST M16790.1G \n \n \n \n... |
| outputs.answer | Harassment is generally defined as unwelcome a... |
| outputs.llm_sources | [COMDTINST M16790.1G Section B Anti-Discrimina... |
| error | |
| reference.ground_truth_answer | Harassment is generally defined as unwelcome a... |
| reference.ground_truth_sources | [] |
| feedback.COT Contextual Accuracy | 1 |
| feedback.score_string:Recall | 1.0 |
| feedback.Truthfulness | 1 |
| execution_time | 25.385029 |
| example_id | a3326a11-5c24-4ceb-a04e-5bb708dd9b38 |
| id | 4548cb6d-68a0-41fd-be24-6fb030ca8e75 |
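
When an experiment covers more than one example, the feedback columns can be averaged to get one number per metric. This is a minimal sketch, assuming the results are available as a list of row dictionaries keyed by the `feedback.*` column names shown above; the second row is made up for illustration and is not a real result.

```python
from statistics import mean

rows = [
    {
        "feedback.COT Contextual Accuracy": 1,
        "feedback.score_string:Recall": 1.0,
        "feedback.Truthfulness": 1,
    },
    # Made-up second row, for illustration only.
    {
        "feedback.COT Contextual Accuracy": 0,
        "feedback.score_string:Recall": 0.5,
        "feedback.Truthfulness": 1,
    },
]


def average_feedback(rows):
    """Average every feedback.* column, skipping rows where a score is missing."""
    keys = [key for key in rows[0] if key.startswith("feedback.")]
    return {
        key: mean(row[key] for row in rows if row.get(key) is not None)
        for key in keys
    }


summary = average_feedback(rows)
for metric, value in summary.items():
    print(metric, value)
```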
