# Step 2: Create evaluation dataset

Next, we're going to create an evaluation dataset for our app.

## Run the earlier notebook

First, we run the code from the previous notebook. The cell below does this for us, so we don't need to go back to that notebook.

In [1]:
%run 01-llm-app-setup.ipynb

## 1. Take the document chunks created earlier

These are the chunks of documents in our RAG system.

In [2]:
splits[:5]

[Document(page_content='LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n

In [3]:
print(len(splits))

66


## 2. Set up a chain to ask the LLM to create question and answer pairs

We'll use GPT-4 here to get good-quality Q&A pairs. Each pair is generated from a single chunk of text.

In [4]:
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI


# Define your desired data structure.
class QAExample(BaseModel):
    question: str = Field(description="question relevant to the given input")
    answer: str = Field(description="answer to the question")

    # You can add custom validation logic easily with Pydantic.
    @validator("question")
    def question_ends_with_question_mark(cls, field):
        # endswith() also handles an empty string, where field[-1] would raise IndexError.
        if not field.endswith("?"):
            raise ValueError("Badly formed question!")
        return field


# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=QAExample)

prompt = PromptTemplate(
    template="Given the following text, generate a question and answer pair about information contained in the text.\n{format_instructions}\nText:\n```\n{text}\n```\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Like before, you can replace this with a different LLM
gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0)

gen_qa_chain = prompt | gpt4_llm | parser
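
To make the last stage of the chain less opaque, here is a rough, standard-library-only sketch of what the parsing step amounts to: the format instructions ask the model for a JSON object, and the parser turns that text into a validated object. This is not LangChain's implementation; `QARecord` and `parse_qa` are hypothetical names used only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class QARecord:
    """Hypothetical stand-in for the QAExample model defined above."""
    question: str
    answer: str


def parse_qa(raw: str) -> QARecord:
    """Extract a JSON object from raw model text and validate it."""
    # Keep only the outermost {...} span, since models often add text around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output")
    data = json.loads(raw[start : end + 1])

    # Same rule as the validator above: a question must end with "?".
    question = str(data["question"]).strip()
    if not question.endswith("?"):
        raise ValueError("Badly formed question!")
    return QARecord(question=question, answer=str(data["answer"]))


# Made-up model output, used only to exercise the function.
raw_output = 'Sure, here it is: {"question": "What is the core controller?", "answer": "LLM"}'
parsed = parse_qa(raw_output)
print(parsed)
```

A response whose question does not end in `?` raises `ValueError` here, which mirrors how a failed validation stops the real chain.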



In [9]:
print(splits[0].page_content)

LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory


In [5]:
gen_qa_chain.invoke({"text": splits[0].page_content})

QAExample(question='What is the core controller of the autonomous agents discussed in the text?', answer='LLM (large language model)')

## 3. Generate the dataset

The output above looks good, so let's run the chain over all 66 chunks.

In [10]:
gen_qa = []

for split in splits:
    gen_qa.append(gen_qa_chain.invoke({"text": split.page_content}))
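
A single malformed response (for example, a question that fails the `?` validator) raises an exception and aborts the whole loop. One possible way to guard against that is a small retry helper, sketched below. `invoke_with_retries` is a hypothetical helper, not part of LangChain, and `flaky_invoke` is a stand-in for `gen_qa_chain.invoke` so the example runs without an API key. Catching every `Exception` is a simplification for the sketch.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def invoke_with_retries(
    invoke: Callable[[dict], T], payload: dict, max_attempts: int = 3
) -> Optional[T]:
    """Call invoke(payload) up to max_attempts times; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(payload)
        except Exception as exc:  # assumption: any failure is worth retrying
            print(f"attempt {attempt} failed: {exc}")
    return None


# Stand-in for gen_qa_chain.invoke: fails twice, then succeeds.
calls = {"count": 0}


def flaky_invoke(payload: dict) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("Badly formed question!")
    return f"question about: {payload['text']}"


result = invoke_with_retries(flaky_invoke, {"text": "chunk 0"})

# A call that never succeeds yields None instead of stopping the run.
always_failing = invoke_with_retries(lambda payload: 1 / 0, {"text": "chunk 1"}, max_attempts=2)
```

In the notebook this would be called as `invoke_with_retries(gen_qa_chain.invoke, {"text": split.page_content})`. If a chunk still fails, keep the `None` in the list (or store the chunk alongside its result) so the generated pairs stay aligned with `splits`, and skip the `None` entries when building the dataset. With `temperature=0`, a retry is more likely to help with transient API errors than with a consistently malformed answer.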

In [11]:
gen_qa[:5]

[QAExample(question='What is the core controller of the autonomous agents discussed in the text?', answer='LLM (large language model)'),
 QAExample(question='What is considered as utilizing the short-term memory of the model?', answer='In-context learning, as seen in Prompt Engineering, utilizes the short-term memory of the model.'),
 QAExample(question='What is the purpose of the Chain of Thought (CoT) prompting technique according to Wei et al. 2022?', answer="The purpose of the Chain of Thought (CoT) prompting technique is to enhance model performance on complex tasks by instructing the model to 'think step by step', which allows it to utilize more test-time computation to decompose hard tasks into smaller and simpler steps, thereby transforming big tasks into multiple manageable tasks and shedding light into an interpretation of the model’s thinking process."),
 QAExample(question='What does the Tree of Thoughts (Yao et al. 2023) extend and what new approach does it introduce?', an

## 4. Generate negative samples

Let's also generate some questions whose answers are not in any part of the text.

In [12]:
prompt = PromptTemplate(
    template="Given the following text, generate a question about information not contained in the text, with the answer confirming that the information is not included.\n{format_instructions}\nText:\n```\n{text}\n```\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Increase the temperature to get more diverse questions and answers.
gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0.7)

gen_qa_chain = prompt | gpt4_llm | parser

Let's check that the generated questions are varied.

In [13]:
for _ in range(2):
    print(gen_qa_chain.invoke({"text": docs[0].page_content}))

question='Does the text mention any specific challenges related to the energy consumption of LLM-powered autonomous agents?' answer='No, the text does not mention specific challenges related to the energy consumption of LLM-powered autonomous agents.'
question='Does the text provide any specific examples of real-world applications where LLM-powered autonomous agents have been successfully deployed?' answer='No, the text does not provide specific examples of real-world applications where LLM-powered autonomous agents have been successfully deployed.'


This looks good, so let's run it a few more times.

In [14]:
gen_qa_no_answer = []

for _ in range(10):
    gen_qa_no_answer.append(gen_qa_chain.invoke({"text": docs[0].page_content}))

In [15]:
gen_qa_no_answer

[QAExample(question='What specific challenges do LLMs face when interpreting and generating natural language instructions for complex, multi-step tasks?', answer='This information is not included in the text.'),
 QAExample(question='Does the text provide any specific examples of how LLM-powered autonomous agents have been integrated into commercial products or services?', answer='No, the text does not provide specific examples of integration into commercial products or services.'),
 QAExample(question='Does the text provide any specific examples of real-world applications outside of experimental or conceptual frameworks?', answer='No, the text does not provide specific examples of real-world applications outside of experimental or conceptual frameworks.'),
 QAExample(question='Does the text provide any specific examples of LLM-powered autonomous agents being used in educational settings, such as tutoring or personalized learning?', answer='No, the text does not provide examples of LLM-
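
With a temperature of 0.7 and the same input text on every call, some of the generated questions can end up nearly identical. A minimal sketch of one way to filter them uses word-overlap (Jaccard) similarity. The 0.8 threshold and the function names are assumptions chosen for illustration, not values from this tutorial, and the sample questions are made up.

```python
import re
from typing import Iterable, List, Set


def token_set(question: str) -> Set[str]:
    """Lowercase the question and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union|."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_near_duplicates(questions: Iterable[str], threshold: float = 0.8) -> List[str]:
    """Keep a question only if it is below the threshold against every question kept so far."""
    kept: List[str] = []
    for question in questions:
        if all(jaccard(question, other) < threshold for other in kept):
            kept.append(question)
    return kept


# Made-up examples: the second differs from the first only by punctuation.
sample_questions = [
    "Does the text mention energy consumption?",
    "Does the text mention energy consumption",
    "What programming languages are used?",
]
unique_questions = filter_near_duplicates(sample_questions)
print(unique_questions)
```

In the notebook, the input would be `[qa.question for qa in gen_qa_no_answer]`.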

## 5. Save the datasets

Let's put everything in a DataFrame and save it to a CSV file so we don't need to rerun the generation code.

In [16]:
import pandas as pd

gen_qa_lst = []

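# Pair each generated Q&A with the chunk it was generated from.
for qa, split in zip(gen_qa, splits):
    qa_dict = qa.dict()
    qa_dict["ground_truth_context"] = split.page_content
    gen_qa_lst.append(qa_dict)

# Negative samples have no supporting context, so the field is left empty.
for qa in gen_qa_no_answer:
    qa_dict = qa.dict()
    qa_dict["ground_truth_context"] = ""
    gen_qa_lst.append(qa_dict)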

gen_dataset = pd.DataFrame(gen_qa_lst)
gen_dataset.rename(columns={"answer": "ground_truth"}, inplace=True)
gen_dataset

|  | question | ground_truth | ground_truth_context |
|---|---|---|---|
| 0 | What is the core controller of the autonomous ... | LLM (large language model) | LLM Powered Autonomous Agents\n \nDate: Jun... |
| 1 | What is considered as utilizing the short-term... | In-context learning, as seen in Prompt Enginee... | Memory\n\nShort-term memory: I would consider ... |
| 2 | What is the purpose of the Chain of Thought (C... | The purpose of the Chain of Thought (CoT) prom... | Fig. 1. Overview of a LLM-powered autonomous a... |
| 3 | What does the Tree of Thoughts (Yao et al. 202... | Tree of Thoughts extends CoT by exploring mult... | Tree of Thoughts (Yao et al. 2023) extends CoT... |
| 4 | What is the distinct approach called that invo... | LLM+P (Liu et al. 2023) | Another quite distinct approach, LLM+P (Liu et... |
| ... | ... | ... | ... |
| 71 | Does the text provide any specific examples of... | No, the text does not provide specific example... |  |
| 72 | What are the specific programming languages us... | The text does not specify which programming la... |  |
| 73 | Does the text provide any specific examples of... | No, the text does not provide specific example... |  |
| 74 | Does the text provide any specific examples of... | No, the text does not provide specific example... |  |


In [17]:
gen_dataset.to_csv("generated_qa.csv", index=False)
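
One thing to watch when loading this file later: the negative samples have an empty `ground_truth_context`, and `pd.read_csv` reads empty fields as `NaN` by default. The sketch below uses a small in-memory CSV with made-up rows (not the real dataset) to show the difference that `keep_default_na=False` makes.

```python
import io

import pandas as pd

# Two made-up rows: one positive sample and one negative sample with no context.
rows = [
    {"question": "What is the core controller?", "ground_truth": "LLM", "ground_truth_context": "some chunk"},
    {"question": "Does the text mention X?", "ground_truth": "No.", "ground_truth_context": ""},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)

# Default read: the empty context comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
default_is_nan = bool(default_read["ground_truth_context"].isna().iloc[1])

# With keep_default_na=False the empty string is preserved.
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)
kept_value = kept_read["ground_truth_context"].iloc[1]

print(default_is_nan, repr(kept_value))
```

Which option fits depends on the downstream code: `isna()` is a convenient way to select the negative samples, while empty strings are safer if the context is later passed to string operations.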