# LLM Evaluation Example on a Q&A with RAG scenario
This is an end-to-end example of how to scale evaluation for a Q&A with RAG scenario.
This notebook has the following parts:
1. Set up the LLM app
2. Create evaluation dataset
3. Define metrics to use
4. Evaluate the app
5. Optional: Use LangSmith & Langfuse to track evaluation and trace

In [1]:
%reload_ext autoreload
%autoreload 2

import dotenv
dotenv.load_dotenv("../.env", override=True)


True

## Step 1: Set up the LLM app
We're using a [Q&A example from LangChain](https://python.langchain.com/docs/use_cases/question_answering/quickstart) that answers questions over a [blog post](https://lilianweng.github.io/posts/2023-06-23-agent/).

In [2]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from operator import itemgetter

In [3]:
# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the blog.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"retrieved_docs": retriever, "context": retriever | format_docs, "question": RunnablePassthrough()}
    | RunnablePassthrough()
    | {"answer": prompt | llm | StrOutputParser(), "context": itemgetter("context")}
)
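To make the data flow concrete, the sketch below shows what `format_docs` hands to the prompt. `FakeDoc` is a hypothetical stand-in for LangChain's `Document` (only the `page_content` attribute is modeled), so the snippet runs without any API calls.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document; only page_content is modeled."""
    page_content: str


def format_docs(docs):
    # Same logic as in the notebook: join chunk texts with a blank line between them.
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [FakeDoc("Chunk about planning."), FakeDoc("Chunk about memory.")]
context = format_docs(retrieved)
print(context)
```

Because the chunks are merged into a single string, the individual chunk boundaries are no longer visible downstream. Metrics that compare chunks one by one need the list of chunk texts instead.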

In [4]:
print(prompt)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]


In [5]:
rag_chain.invoke("What is Task Decomposition?")

{'answer': 'Task Decomposition is a technique used to break down complex tasks into smaller and simpler steps. This approach helps agents to plan and execute tasks more efficiently by dividing them into manageable subtasks. Task decomposition can be achieved through various methods such as prompting with specific instructions or utilizing human inputs.',
 'context': 'Fig. 1. Overview of a LLM-powered autonomous agent system.\nComponent One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\n\nTree of Thoughts

## Step 2: Create evaluation dataset

### Take the documents split created earlier

In [6]:
splits[:5]

[Document(page_content='LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n

In [7]:
print(len(splits))

66
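The chunk count follows from the `chunk_size=1000` and `chunk_overlap=200` settings used earlier. As a rough illustration of what overlap means, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and lines), and the function name `sliding_window_split` is hypothetical.

```python
def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_window_split("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, which is the same idea that lets a sentence straddling a chunk boundary still appear whole in at least one chunk.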


### Set up a chain that asks an LLM to generate question-answer pairs. We'll use GPT-4 here for higher-quality Q&A generation

In [8]:
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI


# Define your desired data structure.
class QAExample(BaseModel):
    question: str = Field(description="question relevant to the given input")
    answer: str = Field(description="answer to the question")

    # You can add custom validation logic easily with Pydantic.
    @validator("question")
    def question_ends_with_question_mark(cls, field):
        # endswith() also handles an empty string, where field[-1] would raise IndexError.
        if not field.endswith("?"):
            raise ValueError("Badly formed question!")
        return field


# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=QAExample)

prompt = PromptTemplate(
    template="Given the following text, generate a set of question and answer about an information contained in the text.\n{format_instructions}\nText:\n```\n{text}\n```\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0)

gen_qa_chain = prompt | gpt4_llm | parser
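Conceptually, the parser step turns the model's JSON text into a validated object and rejects malformed output. The sketch below mimics that behavior with the standard library only. `parse_qa` and the dataclass are hypothetical stand-ins, not how `PydanticOutputParser` is implemented.

```python
import json
from dataclasses import dataclass


@dataclass
class QAExample:
    question: str
    answer: str

    def __post_init__(self):
        # Mirror the notebook's validator: a question must end with "?".
        if not self.question.endswith("?"):
            raise ValueError("Badly formed question!")


def parse_qa(raw: str) -> QAExample:
    """Parse a JSON string with 'question' and 'answer' keys into a QAExample."""
    data = json.loads(raw)
    return QAExample(question=data["question"], answer=data["answer"])


ok = parse_qa('{"question": "What is CoT?", "answer": "Chain of thought."}')
print(ok)

try:
    parse_qa('{"question": "Tell me about CoT", "answer": "..."}')
except ValueError as err:
    print("rejected:", err)
```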



In [9]:
gen_qa_chain.invoke({"text": splits[0].page_content})

QAExample(question='What is the core controller of the autonomous agents discussed in the text?', answer='LLM (large language model)')

### This looks good, so let's run the chain over every chunk

In [10]:
gen_qa = []

for split in splits:
    gen_qa.append(gen_qa_chain.invoke({"text": split.page_content}))
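Because each call depends on the model producing parseable output, a single malformed response would abort this loop. One option, sketched below under the assumption that failures surface as exceptions, is to record failures and continue. `generate_all` and `fake_generate` are hypothetical names; in the notebook you would pass something like `lambda text: gen_qa_chain.invoke({"text": text})`.

```python
def generate_all(generate, texts):
    """Call `generate` on every text; keep positions aligned and record failures."""
    results, failures = [], []
    for i, text in enumerate(texts):
        try:
            results.append(generate(text))
        except Exception as err:  # broad on purpose: parsing and API errors alike
            results.append(None)  # placeholder keeps results[i] aligned with texts[i]
            failures.append((i, repr(err)))
    return results, failures


def fake_generate(text):
    # Hypothetical generator that fails on empty input, standing in for the LLM chain.
    if not text:
        raise ValueError("Badly formed question!")
    return f"Q about: {text}?"


results, failures = generate_all(fake_generate, ["planning", "", "memory"])
print(results)
print(failures)
```

Keeping a `None` placeholder rather than skipping the failed item matters when results are later paired with their source chunks by position.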

In [None]:
gen_qa[:5]

[QAExample(question='What is the core controller of the autonomous agents discussed in the text?', answer='LLM (large language model)'),
 QAExample(question='What is considered as utilizing the short-term memory of the model?', answer='In-context learning, as seen in Prompt Engineering, utilizes the short-term memory of the model.'),
 QAExample(question='What is the purpose of the Chain of Thought (CoT) prompting technique according to Wei et al. 2022?', answer="The purpose of the Chain of Thought (CoT) prompting technique is to enhance model performance on complex tasks by instructing the model to 'think step by step', which allows it to utilize more test-time computation to decompose hard tasks into smaller and simpler steps, thereby transforming big tasks into multiple manageable tasks and shedding light into an interpretation of the model’s thinking process."),
 QAExample(question='What does the Tree of Thoughts (Yao et al. 2023) extend and what is its main contribution?', answer='

### Let's also generate some questions whose answer is not in any part of the text

In [None]:
prompt = PromptTemplate(
    template="Given the following text, generate a question about information not contained in the text, with the answer confirming that the information is not included.\n{format_instructions}\nText:\n```\n{text}\n```\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Increase the temperature to get more diverse questions and answers.
gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0.7)

gen_qa_chain = prompt | gpt4_llm | parser

In [None]:
gen_qa_chain.invoke({"text": docs[0].page_content})

QAExample(question='Does the text provide any information on the use of LLM-powered autonomous agents in space exploration?', answer='No, the text does not mention the use of LLM-powered autonomous agents in space exploration.')

In [None]:
gen_qa_chain.invoke({"text": docs[0].page_content})

QAExample(question='What specific instances demonstrate the use of long-term memory by LLM-powered autonomous agents in real-world applications?', answer='The text does not provide specific real-world application examples demonstrating the use of long-term memory by LLM-powered autonomous agents.')

In [None]:
gen_qa_no_answer = []

for i in range(10):
    gen_qa_no_answer.append(gen_qa_chain.invoke({"text": docs[0].page_content}))
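Since every call uses the same input text with a non-zero temperature, nothing prevents the model from returning the same or a near-identical question twice. A minimal de-duplication sketch follows, assuming that comparing case- and whitespace-normalized question text is sufficient. The helper name and the sample items are hypothetical.

```python
def dedupe_by_question(examples):
    """Keep the first example for each normalized question, preserving order."""
    seen = set()
    unique = []
    for ex in examples:
        # Normalize case and collapse runs of whitespace before comparing.
        key = " ".join(ex["question"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


examples = [
    {"question": "Does the text mention robots?", "answer": "No."},
    {"question": "does the text  mention robots?", "answer": "No, it does not."},
    {"question": "Does the text mention drones?", "answer": "No."},
]
unique = dedupe_by_question(examples)
print(len(unique))
```

With `QAExample` objects, the same idea applies after converting each item with `.dict()`.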

In [None]:
gen_qa_no_answer

[QAExample(question='Does the text provide information on the specific algorithms used for short-term memory optimization?', answer='No, the text does not provide information on the specific algorithms used for short-term memory optimization.'),
 QAExample(question='Does the text provide any specific examples of the types of APIs used in the HuggingGPT framework?', answer='No, the text does not provide specific examples of the types of APIs used in the HuggingGPT framework.'),
 QAExample(question="What are the specific names of the 'External APIs' mentioned in the section about Tool Use?", answer="The specific names of the 'External APIs' are not mentioned in the text."),
 QAExample(question='Does the text provide information on the specific programming languages used in the development of the mentioned autonomous agents?', answer='No, the text does not specify which programming languages were used in the development of the autonomous agents.'),
 QAExample(question='Does the text provi

### Let's put this in a dataframe

In [None]:
import pandas as pd

gen_qa_lst = []

# Answerable questions: pair each generated Q&A with the chunk it came from.
for qa, split in zip(gen_qa, splits):
    qa_dict = qa.dict()
    qa_dict["ground_truth_context"] = split.page_content
    gen_qa_lst.append(qa_dict)

# Unanswerable questions: there is no ground-truth chunk.
for qa in gen_qa_no_answer:
    qa_dict = qa.dict()
    qa_dict["ground_truth_context"] = None
    gen_qa_lst.append(qa_dict)

gen_dataset = pd.DataFrame(gen_qa_lst)
gen_dataset.rename(columns={"answer": "ground_truth"}, inplace=True)
gen_dataset

| | question | ground_truth | ground_truth_context |
|---|---|---|---|
| 0 | What is the core controller of the autonomous ... | LLM (large language model) | LLM Powered Autonomous Agents\n \nDate: Jun... |
| 1 | What is considered as utilizing the short-term... | In-context learning, as seen in Prompt Enginee... | Memory\n\nShort-term memory: I would consider ... |
| 2 | What is the purpose of the Chain of Thought (C... | The purpose of the Chain of Thought (CoT) prom... | Fig. 1. Overview of a LLM-powered autonomous a... |
| 3 | What does the Tree of Thoughts (Yao et al. 202... | The Tree of Thoughts extends CoT (Chain of Tho... | Tree of Thoughts (Yao et al. 2023) extends CoT... |
| 4 | What is the distinct approach called that invo... | LLM+P (Liu et al. 2023) | Another quite distinct approach, LLM+P (Liu et... |
| ... | ... | ... | ... |
| 71 | Does the text mention any specific improvement... | No, the text does not mention any specific imp... | None |
| 72 | What specific details are provided about the t... | The text does not provide specific details abo... | None |
| 73 | Does the text provide any information on the s... | No, the text does not provide information on t... | None |
| 74 | Does the text provide specific examples of how... | No, the text does not provide specific example... | None |


## Step 3: Define metrics to use
Evaluate the retrieved context:
- context_correctness: is the retrieved context the same as the dataset's ground-truth context? (Exact match; ROUGE-L can be added as a softer measure.)
- context_precision: how relevant is the retrieved context to the question?

Evaluate the answer:
- faithfulness: does the answer use information from the context?
- answer_correctness: does the answer match the ground truth, both factually and semantically?

To compute these metrics, we'll use Hugging Face's Evaluate and ragas libraries.
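The retrieval-side checks can be sketched in plain Python as below. This assumes the retrieved context is available as a list of chunk strings. The function names are illustrative, and `rouge_l_f1` is a simplified whitespace-token version of ROUGE-L, not the implementation behind `evaluate.load('rouge')`, which applies its own tokenization.

```python
def context_correctness(ground_truth, retrieved):
    """True if the ground-truth chunk appears verbatim among the retrieved chunks."""
    return ground_truth is not None and ground_truth in retrieved


def context_rank(ground_truth, retrieved):
    """0-based position of the ground-truth chunk, or None if it was not retrieved."""
    if ground_truth is None or ground_truth not in retrieved:
        return None
    return retrieved.index(ground_truth)


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 from the longest common subsequence of whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    # Classic dynamic-programming table for LCS length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


retrieved = ["chunk about memory", "chunk about planning"]
print(context_correctness("chunk about planning", retrieved))
print(context_rank("chunk about planning", retrieved))
print(rouge_l_f1("chunk about memory", "chunk about planning"))
```

Exact match is strict: a chunk that differs by one character counts as a miss, which is why a softer overlap score such as ROUGE-L is a useful companion.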


## Step 4: Evaluate the application

### First, let's run the application on the question examples

In [61]:
rag_chain.invoke(gen_dataset.iloc[0].question)

{'answer': "The core controller of the autonomous agents discussed in the text is the LLM (large language model). It functions as the agent's brain and is complemented by components such as planning and memory. The LLM helps in task decomposition and planning ahead for complex tasks.",
 'retrieved_docs': [Document(page_content='LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent bre

In [177]:
gen_dataset["answer"] = None
gen_dataset["contexts"] = None

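for idx, item in gen_dataset.iterrows():
    result = rag_chain.invoke(item.question)
    gen_dataset.at[idx, "answer"] = result["answer"]
    # ragas and the custom metrics below expect a list of chunk strings,
    # while the chain's "context" is one joined string, so query the retriever for the chunks.
    gen_dataset.at[idx, "contexts"] = [doc.page_content for doc in retriever.invoke(item.question)]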



In [65]:
gen_dataset

| | question | ground_truth | ground_truth_context | answer | contexts |
|---|---|---|---|---|---|
| 0 | What is the core controller of the autonomous ... | LLM (large language model) | LLM Powered Autonomous Agents\n \nDate: Jun... | The core controller of the autonomous agents d... | [LLM Powered Autonomous Agents\n \nDate: Ju... |
| 1 | What is considered as utilizing the short-term... | In-context learning, as seen in Prompt Enginee... | Memory\n\nShort-term memory: I would consider ... | Utilizing the short-term memory of the model i... | [Memory\n\nShort-term memory: I would consider... |
| 2 | What is the purpose of the Chain of Thought (C... | The purpose of the Chain of Thought (CoT) prom... | Fig. 1. Overview of a LLM-powered autonomous a... | The purpose of the Chain of Thought (CoT) prom... | [Tree of Thoughts (Yao et al. 2023) extends Co... |
| 3 | What does the Tree of Thoughts (Yao et al. 202... | The Tree of Thoughts extends CoT (Chain of Tho... | Tree of Thoughts (Yao et al. 2023) extends CoT... | The Tree of Thoughts (Yao et al. 2023) extends... | [Tree of Thoughts (Yao et al. 2023) extends Co... |
| 4 | What is the distinct approach called that invo... | LLM+P (Liu et al. 2023) | Another quite distinct approach, LLM+P (Liu et... | The distinct approach involving relying on an ... | [Another quite distinct approach, LLM+P (Liu e... |
| ... | ... | ... | ... | ... | ... |
| 71 | What is the name of the author who developed t... | The text does not include the name of the auth... | None | The author who developed the Reflexion framewo... | [Fig. 2. Examples of reasoning trajectories f... |
| 72 | What are the specific examples of tasks that L... | The text does not provide specific examples of... | None | LLM-powered autonomous agents have been succes... | [LLM Powered Autonomous Agents\n \nDate: Ju... |
| 73 | What are the specific details about the extern... | The specific details about the external APIs u... | None | The LLM-powered autonomous agents use the API-... | [Reliability of natural language interface: Cu... |
| 74 | What specific challenges did the AutoGPT proje... | The text does not provide specific details on ... | None | The AutoGPT project faced challenges with the ... | [Reliability of natural language interface: Cu... |


In [81]:
gen_dataset.iloc[-10:][["question", "answer"]].values

array([['Does the text provide information on the specific algorithms used for short-term memory optimization?',
        'The text does not provide specific information on the algorithms used for short-term memory optimization.'],
       ['Does the text provide any specific examples of the types of APIs used in the HuggingGPT framework?',
        'The text does not provide specific examples of the types of APIs used in the HuggingGPT framework.'],
       ["What are the specific names of the 'External APIs' mentioned in the section about Tool Use?",
        "The specific names of the 'External APIs' mentioned in the section about Tool Use are API-Bank, TALM, and Toolformer. These APIs are used to augment language models with external tools and improve their capabilities in various tasks. The selection of APIs includes search engines, calculator, calendar queries, smart home control, schedule management, health data management, and more."],
       ['Does the text provide information on t

### Run the evaluation with ragas (this might take a few minutes)

In [135]:
from datasets import Dataset
from ragas import evaluate as ragas_evaluate
from ragas.metrics import context_precision, faithfulness, answer_correctness
import evaluate

# Setting up
gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0)
rouge = evaluate.load('rouge')

# We'll be tracking the following metrics:
# - context_correctness: whether the ground_truth_context is in the retrieved context
# - ground_truth_context_rank: rank of the ground_truth_context in the retrieved context
# - context_rougel_score: ROUGE-L score between the ground_truth_context and the top retrieved context
# - context_precision: precision of the retrieved context (calculated w/ragas)
# - faithfulness: faithfulness of the answer (calculated w/ragas)
# - answer_correctness: whether the answer is correct (calculated w/ragas)

results_lst = []
 
for idx, row in gen_dataset.iloc[:2].iterrows(): # Subsetting to make it go faster
    context_correctness = row["ground_truth_context"] in row["contexts"]

    if context_correctness:
        ground_truth_context_rank = row["contexts"].index(row["ground_truth_context"])
    else:
        ground_truth_context_rank = None

    context_rougel_score = rouge.compute(predictions=[row["contexts"][0]], references=[row["ground_truth_context"]])["rougeL"]

    custom_eval_results = {
        "context_correctness": context_correctness,
        "ground_truth_context_rank": ground_truth_context_rank,
        "context_rougel_score": context_rougel_score,
    }

    # Format row into datasets, which ragas takes as inputs
    row_dataset = Dataset.from_pandas(row.to_frame().T)
    # Ragas aggregates scores across the whole dataset by default, so we pass one row at a time to get per-row results.
    # Passing all metrics in a single call is faster than calling evaluate once per metric.
    ragas_eval_results = ragas_evaluate(row_dataset, metrics=[context_precision, faithfulness, answer_correctness], llm=gpt4_llm)
    
    results_lst.append(custom_eval_results | ragas_eval_results)


results_df = pd.DataFrame(results_lst)




In [132]:
results_df

| | question | ground_truth | ground_truth_context | answer | contexts |
|---|---|---|---|---|---|
| 0 | What is the core controller of the autonomous ... | LLM (large language model) | LLM Powered Autonomous Agents\n \nDate: Jun... | The core controller of the autonomous agents d... | [LLM Powered Autonomous Agents\n \nDate: Ju... |
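Once per-row results are collected, a per-metric summary is usually what you compare across runs. A minimal sketch follows, assuming each row is a dict mapping a metric name to a number, a boolean, or `None`. The helper name is hypothetical and the rows are made-up placeholder values, not results from this notebook.

```python
from statistics import mean


def summarize(rows):
    """Average each metric across rows, skipping None and counting contributing rows."""
    summary = {}
    metric_names = {name for row in rows for name in row}
    for name in sorted(metric_names):
        # Booleans are converted to 1.0 / 0.0 so they average into a hit rate.
        values = [float(row[name]) for row in rows if row.get(name) is not None]
        summary[name] = {"mean": mean(values) if values else None, "n": len(values)}
    return summary


# Placeholder rows for illustration only.
rows = [
    {"context_correctness": True, "ground_truth_context_rank": 0, "faithfulness": 1.0},
    {"context_correctness": False, "ground_truth_context_rank": None, "faithfulness": 0.5},
]
summary = summarize(rows)
print(summary)
```

Reporting `n` next to each mean makes it visible when a metric, such as the rank, was only defined for a subset of rows.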


## Step 5 (Optional): Use LangSmith & Langfuse to track evaluation and trace

### 5a: LangSmith


#### Tracing
Because we are already using LangChain, tracing only requires setting the following environment variables along with the API key.

In [None]:
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "LLM Eval Workshop"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# If you haven't set it in your ".env" file, you can set your API key here.
# os.environ["LANGCHAIN_API_KEY"] = "ls__..."
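Before running the chain, it can be useful to verify that the variables are actually present in the environment. A small sketch follows. The helper name `missing_env_vars` is hypothetical, and the variable names are the ones used above.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


required = ["LANGCHAIN_TRACING_V2", "LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Not set:", ", ".join(missing))
```

The optional `env` argument exists only so the check can be exercised against a plain dict instead of the real process environment.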

##### Other ways to log traces to LangSmith
If you're not using LangChain, don't worry! There are other ways of using LangSmith, which you can find [here](https://docs.smith.langchain.com/tracing/faq/logging_and_viewing#logging-traces).

For non-LangChain apps, we find adding the `traceable` decorator to be the easiest way to log traces.

In [None]:
from langsmith import traceable
from langsmith.wrappers import wrap_openai
import openai

@traceable(run_type="retriever", name="Retrieve Context")
def retrieve_docs(question: str) -> str:
    docs = retriever.get_relevant_documents(question)
    return format_docs(docs)

# LangSmith also provides a wrapper for the OpenAI client; alternatively, use @traceable as above.
client = wrap_openai(openai.Client())

@traceable(name="RAG Pipeline Trace")
def rag_pipeline(question: str):
    context = retrieve_docs(question)
    messages = [
        { "role": "system", "content": "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise." },
        { "role": "user", "content": f"Question: {question} \nContext: {context} \nAnswer:"}
    ]
    chat_completion = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages
    )
    return chat_completion.choices[0].message.content

rag_pipeline("What is Task Decomposition?")



'Task decomposition involves breaking down complex tasks into smaller and simpler steps to make them more manageable and easier to tackle. It can be done through techniques like Chain of Thought and Tree of Thoughts which help models think step by step or explore multiple reasoning possibilities. Task decomposition can be facilitated by using language models with specific prompts, task-specific instructions, or human inputs for guidance.'
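To give an intuition for what a tracing decorator does, here is a toy version that records each call's name, inputs, output, and latency in an in-memory list. It is a conceptual sketch only: `simple_trace` and `TRACE_LOG` are hypothetical names, and LangSmith's `traceable` does considerably more, such as nesting runs and sending them to the LangSmith backend.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory sink; LangSmith sends runs to its backend instead


def simple_trace(name):
    """Decorator factory that records inputs, output, and latency of each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = func(*args, **kwargs)
            TRACE_LOG.append({
                "name": name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "seconds": time.perf_counter() - start,
            })
            return output
        return wrapper
    return decorator


@simple_trace(name="Retrieve Context")
def retrieve_docs(question: str) -> str:
    # Placeholder body; the notebook's version queries the retriever.
    return f"(context for: {question})"


retrieve_docs("What is Task Decomposition?")
print(TRACE_LOG[0]["name"], TRACE_LOG[0]["output"])
```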

#### Uploading evaluation to LangSmith 

##### First, we need to register our eval dataset with LangSmith

In [None]:
gen_dataset["question"].to_list()

['What is the core controller of the autonomous agents discussed in the text?',
 'What is considered as utilizing the short-term memory of the model?',
 'What is the purpose of the Chain of Thought (CoT) prompting technique according to Wei et al. 2022?',
 'What does the Tree of Thoughts (Yao et al. 2023) extend and what is its main contribution?',
 'What is the distinct approach called that involves relying on an external classical planner for long-horizon planning?',
 'What does ReAct integrate within LLM according to Yao et al. 2023?',
 'What are the examples of reasoning trajectories mentioned for knowledge-intensive and decision-making tasks?',
 'What does the heuristic function in the Reflexion framework do?',
 'What is a more common failure in AlfWorld according to the experiments?',
 'What does Chain of Hindsight (CoH) use to encourage model improvement?',
 "What is the purpose of adding a regularization term in CoH's approach?",
 'What is the main idea behind CoH as described 

In [None]:
from langsmith import Client

client = Client()
dataset_name = "RAG QA Dataset"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Dataset to test out QA with RAG.",
)

client.create_examples(
    inputs=[{"question": q} for q in gen_dataset["question"]],
    outputs=[{"answer": row["ground_truth"], "context": row["ground_truth_context"]} for _, row in gen_dataset.iterrows()],
    dataset_id=dataset.id,
)

##### Then, we evaluate our app
First, let's set up our custom evaluators.

The `ragas.langchain` import is currently broken; the issue is tracked [here](https://github.com/explodinggradients/ragas/issues/571). Until it is resolved, the ragas evaluator chains are left commented out in this example.

In [None]:
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith.evaluation import EvaluationResult, run_evaluator
# from ragas.langchain.evalchain import RagasEvaluatorChain

@run_evaluator
def context_correctness(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs.get("context")
    retrieved_contexts = run.outputs.get("output")["context"] or []
    if ground_truth_context is None and retrieved_contexts != []:
        return EvaluationResult(key="context_correctness", score=False)
    else:
        return EvaluationResult(key="context_correctness", score=ground_truth_context in retrieved_contexts)
    
@run_evaluator
def context_rank(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs.get("context")
    if ground_truth_context is None:
        return EvaluationResult(key="context_rank", score=None)
    else:
        retrieved_contexts = run.outputs.get("output")["context"] or []
        if ground_truth_context not in retrieved_contexts:
            return EvaluationResult(key="context_rank", score=None)
        else:
            return EvaluationResult(key="context_rank", score=retrieved_contexts.index(ground_truth_context))

@run_evaluator
def context_rougel_score(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs.get("context") or ""
    top_retrieved_context = run.outputs.get("output")["context"][0] or ""
    rouge = evaluate.load('rouge')
    return EvaluationResult(key="context_rougel_score", score=rouge.compute(predictions=[top_retrieved_context], references=[ground_truth_context])["rougeL"])


# # create evaluation chains
# context_prec_chain = RagasEvaluatorChain(metric=context_precision)
# faithfulness_chain = RagasEvaluatorChain(metric=faithfulness)
# answer_correctness_chain = RagasEvaluatorChain(metric=answer_correctness)

Next, let's set up the eval config.

In [None]:
eval_config = RunEvalConfig(
    custom_evaluators=[
        context_correctness,
        context_rank,
        context_rougel_score,
        # context_prec_chain,
        # faithfulness_chain,
        # answer_correctness_chain,
    ],
    # You can also use a prebuilt evaluator
    # by providing a name or RunEvalConfig.<configured evaluator>
    evaluators=[
        # You can specify an evaluator by name/enum.
        RunEvalConfig.Criteria("harmfulness"),
        # And also define your own custom LLM evaluator.
        RunEvalConfig.Criteria(
            {
                "helpfulness": "Are the answers helpful and provide new information to the user?"
            }
        ),
    ],
)


Run the evaluation. The call below is synchronous; it can also be run asynchronously.

In [201]:

client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=itemgetter("question") | rag_chain,
    evaluation=eval_config,
    verbose=True,
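    # The project name must not already exist; reusing one raises a 409 Conflict ("Session already exists").
    project_name="LLM Eval Workshop",
    # Any experiment metadata can be specified here
    project_metadata={"version": "0.0.1"},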
)

HTTPError: [Errno 409 Client Error: Conflict for url: https://api.smith.langchain.com/sessions] {"detail":"Session already exists."}
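The 409 Conflict above indicates that a project (session) named "LLM Eval Workshop" already exists from an earlier run. One way to avoid it, sketched here with a hypothetical helper, is to append a timestamp and a short random suffix so that every run gets its own project name.

```python
import uuid
from datetime import datetime, timezone


def unique_project_name(base: str) -> str:
    """Append a UTC timestamp and a short random suffix to a base project name."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{base} {stamp}-{uuid.uuid4().hex[:6]}"


name_a = unique_project_name("LLM Eval Workshop")
name_b = unique_project_name("LLM Eval Workshop")
print(name_a)
```

The result can then be passed as `project_name=unique_project_name("LLM Eval Workshop")`, which also keeps runs sortable by time in the UI.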

### 5b: Langfuse

#### Set your keys if you didn't put them in your ".env" file

In [157]:
import os
 
# Get keys for your project from https://cloud.langfuse.com.
# Uncomment and fill these in only if they are not already set in your ".env" file.
# Never commit real keys to a shared notebook.
# os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
# os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."

In [158]:
# import
from langfuse import Langfuse
import openai
 
# init
langfuse = Langfuse()

#### Create dataset

In [160]:
langfuse.create_dataset(name="RAG QA Dataset")

# Upload to Langfuse
for _, row in gen_dataset.iterrows():
    langfuse.create_dataset_item(
        dataset_name="RAG QA Dataset",
        # any python object or value
        input=row["question"],
        # any python object or value, optional
        expected_output={
            "answer": row["ground_truth"],
            "context": row["ground_truth_context"],
        },
    )

#### Set up custom evaluators

In [179]:
def context_correctness(output, expected_output):
    ground_truth_context = expected_output["context"]
    retrieved_contexts = output["context"] or []
    if ground_truth_context is None and retrieved_contexts != []:
        return False
    else:
        return ground_truth_context in retrieved_contexts

def context_rank(output, expected_output):
    ground_truth_context = expected_output["context"]
    retrieved_contexts = output["context"] or []
    if not ground_truth_context or ground_truth_context not in retrieved_contexts:
        return -1
    else:
        return retrieved_contexts.index(ground_truth_context)

def context_rougel_score(output, expected_output):
    # Assumes output["context"] is a list of chunk strings; on a plain string, [0] is only its first character.
    top_retrieved_context = output["context"][0] or ""
    ground_truth_context = expected_output["context"] or ""
    rouge = evaluate.load('rouge')
    return rouge.compute(predictions=[top_retrieved_context], references=[ground_truth_context])["rougeL"]


#### Run evaluation

In [174]:
from datetime import datetime
 
def run_my_custom_llm_app(input):
    generation_start_time = datetime.now()

    out = rag_chain.invoke(input)

    langfuse_generation = langfuse.generation(
        name="rag-chain-qa",
        input=input,
        output=out,
        model="gpt-3.5-turbo",
        start_time=generation_start_time,
        end_time=datetime.now(),
    )

    return out, langfuse_generation

In [180]:
dataset = langfuse.get_dataset("RAG QA Dataset")

for item in dataset.items:
    completion, langfuse_generation = run_my_custom_llm_app(item.input)

    item.link(langfuse_generation, "Exp 1")

    langfuse_generation.score(
        name="context_correctness",
        value=context_correctness(completion, item.expected_output)
        )
    langfuse_generation.score(
        name="context_rank",
        value=context_rank(completion, item.expected_output)
        )
    langfuse_generation.score(
        name="context_rougel_score",
        value=context_rougel_score(completion, item.expected_output)
        )
    


## Last step: Clean up the vector store

In [None]:
# cleanup
vectorstore.delete_collection()