# Step 2: Create an evaluation dataset

## Run the earlier notebook

In [1]:
%run 01-llm-app-setup.ipynb

## Take the document splits created earlier

In [2]:
splits[:5]

[Document(page_content='LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n

In [3]:
print(len(splits))

66


## Set up a chain that asks the LLM to create a question and answer
We'll use GPT-4 (`gpt-4-turbo-preview`) here to get good-quality question-answer pairs.

In [4]:
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI


# Define your desired data structure.
class QAExample(BaseModel):
    question: str = Field(description="question relevant to the given input")
    answer: str = Field(description="answer to the question")

    # You can add custom validation logic easily with Pydantic.
    @validator("question")
    def question_ends_with_question_mark(cls, field):
        # endswith() also handles an empty string, where field[-1] would raise IndexError.
        if not field.endswith("?"):
            raise ValueError("Badly formed question!")
        return field


# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=QAExample)

prompt = PromptTemplate(
    template="Given the following text, generate a question and answer pair about a piece of information contained in the text.\n{format_instructions}\nText:\n```\n{text}\n```\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0)

gen_qa_chain = prompt | gpt4_llm | parser



In [5]:
gen_qa_chain.invoke({"text": splits[0].page_content})

QAExample(question='What is the core controller of the autonomous agents discussed in the text?', answer='LLM (large language model)')
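
To make explicit what the parser step at the end of the chain is responsible for, here is a minimal standard-library sketch of the same idea: pull a JSON object out of a model reply and reject questions that do not end with a question mark. This is not LangChain's implementation. `QAPair` and `parse_qa_reply` are hypothetical names used only for illustration, and the sample reply is made up.

```python
import json
from dataclasses import dataclass


@dataclass
class QAPair:
    """Hypothetical stand-in for the QAExample model above."""

    question: str
    answer: str


def parse_qa_reply(reply: str) -> QAPair:
    """Parse the JSON object spanning the first '{' to the last '}' in a reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in reply")
    data = json.loads(reply[start : end + 1])
    question = str(data["question"]).strip()
    answer = str(data["answer"]).strip()
    # Same rule as the validator above: a question must end with "?".
    if not question.endswith("?"):
        raise ValueError("Badly formed question!")
    return QAPair(question=question, answer=answer)


# Made-up reply with some chatter around the JSON payload.
sample_reply = 'Sure, here it is: {"question": "What is the core controller?", "answer": "An LLM"}'
parsed = parse_qa_reply(sample_reply)
print(parsed)
```

In the real chain, `PydanticOutputParser` plays this role and also supplies the `{format_instructions}` text that tells the model which JSON shape to produce.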

## This looks good, so let's run it over each chunk

In [6]:
gen_qa = []

for split in splits:
    gen_qa.append(gen_qa_chain.invoke({"text": split.page_content}))
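
A single malformed reply (for example, a question without a trailing `?`) would likely raise and abort the loop above. One hedged way to make the run more robust is to retry each chunk a few times and record the ones that still fail. The helper below is a sketch, not part of LangChain: `invoke_with_retries` is a hypothetical name, and it accepts any callable so it can be tried without an API key.

```python
from typing import Any, Callable, List, Optional, Tuple


def invoke_with_retries(
    invoke: Callable[[str], Any],
    texts: List[str],
    max_attempts: int = 3,
) -> Tuple[List[Optional[Any]], List[int]]:
    """Call `invoke` on each text, retrying on any exception.

    Returns the results (None where every attempt failed) and the indices
    of the texts that failed, so results stay aligned with the inputs.
    """
    results: List[Optional[Any]] = []
    failed: List[int] = []
    for index, text in enumerate(texts):
        result = None
        for _ in range(max_attempts):
            try:
                result = invoke(text)
                break
            except Exception:
                continue
        else:
            # The inner loop never hit `break`, so every attempt failed.
            failed.append(index)
        results.append(result)
    return results, failed


# Demo with a fake generator that fails on its first call for each text.
_seen = set()


def flaky_generator(text: str) -> str:
    if text not in _seen:
        _seen.add(text)
        raise ValueError("simulated parsing failure")
    return f"Q&A for {text}"


demo_results, demo_failed = invoke_with_retries(flaky_generator, ["chunk a", "chunk b"])
print(demo_results, demo_failed)
```

With the chain defined earlier, the call would look like `invoke_with_retries(lambda t: gen_qa_chain.invoke({"text": t}), [s.page_content for s in splits])`. Any `None` entries, together with their chunks, would need to be filtered out before building the dataframe.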

In [7]:
gen_qa[:5]

[QAExample(question='What is the core controller of the autonomous agents discussed in the text?', answer='LLM (large language model)'),
 QAExample(question='What is considered as utilizing the short-term memory of the model?', answer='In-context learning, as seen in Prompt Engineering, utilizes the short-term memory of the model.'),
 QAExample(question='What is the standard prompting technique mentioned for enhancing model performance on complex tasks?', answer='Chain of thought (CoT; Wei et al. 2022)'),
 QAExample(question='What does the Tree of Thoughts (Yao et al. 2023) extend and what is its primary methodology?', answer='Tree of Thoughts extends CoT by exploring multiple reasoning possibilities at each step, decomposing the problem into multiple thought steps, and generating multiple thoughts per step to create a tree structure.'),
 QAExample(question='What is the LLM+P approach described by Liu et al. 2023?', answer='The LLM+P approach involves relying on an external classical p

## Let's also make some questions whose answers are not in any part of the text

In [8]:
prompt = PromptTemplate(
    template="Given the following text, generate a question about information not contained in the text, with the answer confirming that the information is not included.\n{format_instructions}\nText:\n```\n{text}\n```\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Increase the temperature to get more diverse questions and answers.
gpt4_llm = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0.7)

gen_qa_chain = prompt | gpt4_llm | parser

Check that the generated questions are varied:

In [9]:
# docs[0] is assumed to be the full document loaded in the earlier notebook.
for _ in range(2):
    print(gen_qa_chain.invoke({"text": docs[0].page_content}))

question='What are the specific environmental impacts mentioned in relation to the use of LLM-powered autonomous agents?' answer='The text does not mention any specific environmental impacts related to the use of LLM-powered autonomous agents.'
question='Does the text provide any specific examples of real-world applications where LLM-powered autonomous agents have been deployed?' answer='No, the text does not provide specific examples of real-world applications where LLM-powered autonomous agents have been deployed.'


This looks good. Let's run it a few more times.

In [10]:
gen_qa_no_answer = []

for _ in range(10):
    gen_qa_no_answer.append(gen_qa_chain.invoke({"text": docs[0].page_content}))

In [11]:
gen_qa_no_answer

[QAExample(question='Does the text provide any comparison of LLM-powered autonomous agents with human intelligence in terms of adaptability and creativity?', answer='No, the text does not provide a comparison of LLM-powered autonomous agents with human intelligence in terms of adaptability and creativity.'),
 QAExample(question='Does the text provide specific examples of external APIs used by MRKL systems?', answer='No, the text does not provide specific examples of external APIs used by MRKL systems.'),
 QAExample(question='What is the name of the author who developed the Reflexion framework mentioned in the text?', answer='The text does not provide the name of the author who developed the Reflexion framework.'),
 QAExample(question='What is the name of the author who developed the Reflexion framework?', answer='The name of the author who developed the Reflexion framework is not included in the text.'),
 QAExample(question='What specific methods were used to evaluate the performance o
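
Two of the questions above ask for the author of the Reflexion framework in almost the same words, so sampling at temperature 0.7 does not guarantee distinct questions. A hedged way to catch this is to drop near-duplicates before building the dataset. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function name `drop_near_duplicates` and the 0.8 threshold are assumptions chosen for illustration, not tuned values.

```python
from difflib import SequenceMatcher
from typing import List


def drop_near_duplicates(questions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first occurrence of each question and skip later ones whose
    similarity ratio to an already kept question is at or above `threshold`."""
    kept: List[str] = []
    for question in questions:
        normalized = question.lower().strip()
        is_duplicate = any(
            SequenceMatcher(None, normalized, other.lower().strip()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(question)
    return kept


# These three questions are copied from the output above.
sample_questions = [
    "What is the name of the author who developed the Reflexion framework mentioned in the text?",
    "What is the name of the author who developed the Reflexion framework?",
    "Does the text provide specific examples of external APIs used by MRKL systems?",
]
unique_questions = drop_near_duplicates(sample_questions)
print(unique_questions)
```

Character-level similarity is a blunt tool: it catches rewordings like the pair above but not paraphrases that use different vocabulary, so a quick manual review is still worthwhile.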

## Let's put this in a dataframe

In [12]:
import pandas as pd

gen_qa_lst = []

# Pair each generated Q&A with the chunk it was generated from.
for qa, split in zip(gen_qa, splits):
    qa_dict = qa.dict()
    qa_dict["ground_truth_context"] = split.page_content
    gen_qa_lst.append(qa_dict)

# Unanswerable questions have no supporting context.
for qa in gen_qa_no_answer:
    qa_dict = qa.dict()
    qa_dict["ground_truth_context"] = ""
    gen_qa_lst.append(qa_dict)

gen_dataset = pd.DataFrame(gen_qa_lst)
gen_dataset.rename(columns={"answer": "ground_truth"}, inplace=True)
gen_dataset

| | question | ground_truth | ground_truth_context |
|---|---|---|---|
| 0 | What is the core controller of the autonomous ... | LLM (large language model) | LLM Powered Autonomous Agents\n \nDate: Jun... |
| 1 | What is considered as utilizing the short-term... | In-context learning, as seen in Prompt Enginee... | Memory\n\nShort-term memory: I would consider ... |
| 2 | What is the standard prompting technique menti... | Chain of thought (CoT; Wei et al. 2022) | Fig. 1. Overview of a LLM-powered autonomous a... |
| 3 | What does the Tree of Thoughts (Yao et al. 202... | Tree of Thoughts extends CoT by exploring mult... | Tree of Thoughts (Yao et al. 2023) extends CoT... |
| 4 | What is the LLM+P approach described by Liu et... | The LLM+P approach involves relying on an exte... | Another quite distinct approach, LLM+P (Liu et... |
| ... | ... | ... | ... |
| 71 | What are the specific challenges faced by LLM-... | The text does not provide details on the speci... | |
| 72 | Does the text provide specific examples of API... | No, the text does not provide specific example... | |
| 73 | Does the text provide any specific examples of... | No, the text does not provide any examples of ... | |
| 74 | Does the text provide any information on the s... | No, the text does not specify which programmin... | |


In [13]:
gen_dataset.to_csv("generated_qa.csv", index=False)
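
Before using the saved file, it can help to confirm how many rows are answerable versus deliberately unanswerable. The sketch below uses only the standard library and runs on a small in-memory sample with made-up rows. `summarize_dataset` is a hypothetical helper, and it assumes the three column names written above, with an empty `ground_truth_context` marking an unanswerable question.

```python
import csv
import io
from typing import Dict, Iterable


def summarize_dataset(rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count rows with and without a ground-truth context."""
    counts = {"answerable": 0, "unanswerable": 0}
    for row in rows:
        # An empty context marks a question whose answer is not in the text.
        if (row.get("ground_truth_context") or "").strip():
            counts["answerable"] += 1
        else:
            counts["unanswerable"] += 1
    return counts


# Made-up sample in the same column layout as generated_qa.csv.
sample_csv = io.StringIO(
    "question,ground_truth,ground_truth_context\n"
    "What is X?,X is a thing,Some chunk text about X\n"
    'Does the text mention Y?,"No, it does not.",\n'
)
summary = summarize_dataset(csv.DictReader(sample_csv))
print(summary)
```

For the real file, open `generated_qa.csv` with `newline=""` and pass a `csv.DictReader` over it instead of the sample. If the file is loaded with pandas instead, note that `pd.read_csv` reads empty contexts back as `NaN` by default; passing `keep_default_na=False` keeps them as empty strings.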