# Lesson 4: Building a Multi-Document Agent

## Setup

In [1]:
from helper import get_openai_api_key
OPENAI_API_KEY = get_openai_api_key()

In [2]:
import nest_asyncio
nest_asyncio.apply()

## 1. Set up an agent over 3 papers

**Note**: The PDF files are included with this lesson. To access these papers, go to the `File` menu and select `Open...`.

In [6]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=hSyW5go0v8",
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "selfrag.pdf",
]

In [7]:
from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]
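The loop relies on `get_doc_tools` from the lesson's `utils` module, whose source is not shown here. Judging from the function names in the agent logs (`vector_tool_longlora`, `summary_tool_selfrag`), it appears to name its two tools after the file stem. The sketch below uses hypothetical stand-ins (`Tool`, `fake_get_doc_tools`) that reproduce only that naming and the shape of `paper_to_tools_dict`; it builds no index and calls no LLM.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Tool:
    """Hypothetical stand-in for a LlamaIndex tool: only a name and a description."""
    name: str
    description: str


def fake_get_doc_tools(file_path: str, name: str) -> tuple[Tool, Tool]:
    """Hypothetical helper mimicking the tool names seen in the agent logs."""
    # The vector-tool description is invented; the summary one mirrors the ToolMetadata output.
    vector_tool = Tool(f"vector_tool_{name}", f"Useful for specific questions related to {name}")
    summary_tool = Tool(f"summary_tool_{name}", f"Useful for summarization questions related to {name}")
    return vector_tool, summary_tool


papers = ["metagpt.pdf", "longlora.pdf", "selfrag.pdf"]

paper_to_tools_dict = {}
for paper in papers:
    # Path("longlora.pdf").stem == "longlora"
    vector_tool, summary_tool = fake_get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

for paper, tools in paper_to_tools_dict.items():
    print(paper, [t.name for t in tools])
```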

Getting tools for paper: metagpt.pdf
Getting tools for paper: longlora.pdf
Getting tools for paper: selfrag.pdf


In [8]:
initial_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]
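The nested comprehension flattens the per-paper lists into one list of tools; its `for` clauses read left to right like nested loops. A minimal check, using plain strings as hypothetical placeholders for the tool objects:

```python
# Hypothetical tool names standing in for the real tool objects.
paper_to_tools_dict = {
    "metagpt.pdf": ["vector_tool_metagpt", "summary_tool_metagpt"],
    "longlora.pdf": ["vector_tool_longlora", "summary_tool_longlora"],
    "selfrag.pdf": ["vector_tool_selfrag", "summary_tool_selfrag"],
}
papers = list(paper_to_tools_dict)

# Outer loop over papers, inner loop over each paper's tools.
flat = [t for paper in papers for t in paper_to_tools_dict[paper]]

# The same thing written as explicit nested loops.
flat_loops = []
for paper in papers:
    for t in paper_to_tools_dict[paper]:
        flat_loops.append(t)

assert flat == flat_loops
print(len(flat))  # two tools per paper, three papers -> 6
```

This is why `len(initial_tools)` below is 6.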

In [9]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

In [10]:
len(initial_tools)

6

In [11]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    initial_tools, 
    llm=llm, 
    verbose=True
)
agent = AgentRunner(agent_worker)
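`FunctionCallingAgentWorker` depends on the LLM's native function-calling API: the model replies with a tool name plus JSON arguments, the worker runs that tool, and the output goes back to the model until it produces a final answer, while `AgentRunner` keeps the conversation memory. The sketch below imitates only the dispatch step visible in the verbose log, with a hard-coded list of "model" calls instead of a real LLM. `run_tool_calls` and the two tool functions are hypothetical; this is not LlamaIndex's implementation.

```python
import json
from typing import Callable


def vector_tool_longlora(query: str) -> str:
    # Hypothetical tool: a real vector tool would query an index over longlora.pdf.
    return f"[retrieved passages for: {query}]"


def summary_tool_longlora(input: str) -> str:
    # Parameter is named `input` to match the log's args; it shadows the builtin only locally.
    return f"[summary for: {input}]"


TOOLS: dict[str, Callable[..., str]] = {
    "vector_tool_longlora": vector_tool_longlora,
    "summary_tool_longlora": summary_tool_longlora,
}


def run_tool_calls(calls: list[tuple[str, str]]) -> list[str]:
    """Execute (tool_name, json_args) pairs and print them like the verbose log."""
    outputs = []
    for name, raw_args in calls:
        print("=== Calling Function ===")
        print(f"Calling function: {name} with args: {raw_args}")
        result = TOOLS[name](**json.loads(raw_args))  # JSON args become keyword arguments
        print("=== Function Output ===")
        print(result)
        outputs.append(result)  # a real agent would add this to chat memory for the next LLM turn
    return outputs


# Calls a model might emit for the LongLoRA question (hard-coded here).
outputs = run_tool_calls([
    ("vector_tool_longlora", '{"query": "evaluation dataset"}'),
    ("vector_tool_longlora", '{"query": "evaluation results"}'),
])
```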

In [12]:
response = agent.query(
    "Tell me about the evaluation dataset used in LongLoRA, "
    "and then tell me about the evaluation results"
)

Added user message to memory: Tell me about the evaluation dataset used in LongLoRA, and then tell me about the evaluation results
=== Calling Function ===
Calling function: vector_tool_longlora with args: {"query": "evaluation dataset"}
=== Function Output ===
PG19 test split
=== Calling Function ===
Calling function: vector_tool_longlora with args: {"query": "evaluation results"}
=== Function Output ===
The evaluation results show that the models achieve better perplexity with longer context sizes. Increasing the context window size leads to improved perplexity values. Additionally, the models are fine-tuned on different context lengths, such as 100k, 65536, and 32768, and achieve promising results on these extremely large settings. However, there is some perplexity degradation observed on small context sizes for the extended models, which is a known limitation of Position Interpolation.
=== LLM Response ===
The evaluation dataset used in LongLoRA is the PG19 test split. 

Regarding 

In [13]:
response = agent.query("Give me a summary of both Self-RAG and LongLoRA")
print(str(response))

Added user message to memory: Give me a summary of both Self-RAG and LongLoRA
=== Calling Function ===
Calling function: summary_tool_selfrag with args: {"input": "Self-RAG"}
=== Function Output ===
Self-RAG is a framework designed to improve the quality and accuracy of large language models by incorporating retrieval on demand and self-reflection mechanisms. It involves training a single arbitrary LM to retrieve, generate, and evaluate text passages and its own outputs using special tokens known as reflection tokens. This approach has shown superior performance compared to LLMs with more parameters or traditional retrieval-augmented generation methods across a range of tasks. The system evaluates text outputs based on these reflection tokens to assess the need for retrieval, relevance of evidence, supportiveness of the response, and overall utility of the generated text. By training a Critic LM to predict reflection tokens and a Generator LM to produce text based on these predictions,

## 2. Set up an agent over 11 papers

### Download 11 ICLR papers

In [14]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=LzPWWPAdY4",
    "https://openreview.net/pdf?id=VTF8yNQM66",
    "https://openreview.net/pdf?id=hSyW5go0v8",
    "https://openreview.net/pdf?id=9WD9KwssyT",
    "https://openreview.net/pdf?id=yV6fD7LYkF",
    "https://openreview.net/pdf?id=hnrB5YHoYu",
    "https://openreview.net/pdf?id=WbWtOYIzIK",
    "https://openreview.net/pdf?id=c5pwL0Soay",
    "https://openreview.net/pdf?id=TpD2aG1h0D"
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "loftq.pdf",
    "swebench.pdf",
    "selfrag.pdf",
    "zipformer.pdf",
    "values.pdf",
    "finetune_fair_diffusion.pdf",
    "knowledge_card.pdf",
    "metra.pdf",
    "vr_mcl.pdf"
]

To download the papers yourself, uncomment and run the following code:

    # for url, paper in zip(urls, papers):
    #     !wget "{url}" -O "{paper}"

**Note**: The PDF files are included with this lesson. To access these papers, go to the `File` menu and select `Open...`.
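The `!wget` line only works inside Jupyter (the `!` shell escape) and on systems with `wget` installed. A standard-library alternative is sketched below with `urllib.request.urlretrieve`, assuming the OpenReview URLs serve the PDF directly to a plain HTTP client. `download_papers` and its `fetch` parameter are hypothetical names; skipping files that already exist is a choice of this sketch, since the lesson already ships the PDFs.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def download_papers(
    urls: list[str],
    papers: list[str],
    dest_dir: str = ".",
    fetch: Callable[[str, str], object] = urlretrieve,
) -> list[str]:
    """Download each URL to its paired filename, skipping files that already exist."""
    # zip() silently truncates mismatched lists, so check the pairing explicitly.
    if len(urls) != len(papers):
        raise ValueError("urls and papers must have the same length")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for url, paper in zip(urls, papers):
        target = dest / paper
        if target.exists():
            continue  # already present, e.g. shipped with the lesson
        fetch(url, str(target))
        downloaded.append(paper)
    return downloaded
```

Usage: `download_papers(urls, papers)`. The `fetch` argument exists so the function can be exercised without network access.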

In [16]:
from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

Getting tools for paper: metagpt.pdf
Getting tools for paper: longlora.pdf
Getting tools for paper: loftq.pdf
Getting tools for paper: swebench.pdf
Getting tools for paper: selfrag.pdf
Getting tools for paper: zipformer.pdf
Getting tools for paper: values.pdf
Getting tools for paper: finetune_fair_diffusion.pdf
Getting tools for paper: knowledge_card.pdf
Getting tools for paper: metra.pdf
Getting tools for paper: vr_mcl.pdf


### Extend the Agent with Tool Retrieval

In [17]:
all_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

In [18]:
# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)

In [19]:
obj_retriever = obj_index.as_retriever(similarity_top_k=3)

In [20]:
tools = obj_retriever.retrieve(
    "Tell me about the eval dataset used in MetaGPT and SWE-Bench"
)

In [24]:
tools[1].metadata

ToolMetadata(description='Useful for summarization questions related to metra', name='summary_tool_metra', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
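`ObjectIndex` embeds a text form of each tool and the retriever returns the `similarity_top_k` tools closest to the query; the output above shows that this can surface an unrelated tool (`summary_tool_metra`). The sketch below illustrates the top-k idea with bag-of-words cosine similarity as a deliberately simple stand-in for embeddings. `retrieve_tools` and the tool descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and drop punctuation, so "SWE-Bench" becomes "swebench" and "MetaGPT" becomes "metagpt".
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tools(query: str, tool_texts: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the names of the top_k tools whose text is most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), name) for name, text in tool_texts.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep insertion order
    return [name for _, name in scored[:top_k]]


# Hypothetical tool descriptions modeled on the ToolMetadata output above.
tool_texts = {
    "vector_tool_metagpt": "Useful for specific questions related to metagpt",
    "summary_tool_metagpt": "Useful for summarization questions related to metagpt",
    "vector_tool_swebench": "Useful for specific questions related to swebench",
    "summary_tool_swebench": "Useful for summarization questions related to swebench",
    "vector_tool_metra": "Useful for specific questions related to metra",
    "summary_tool_metra": "Useful for summarization questions related to metra",
}

print(retrieve_tools("Tell me about the eval dataset used in MetaGPT and SWE-Bench", tool_texts))
```

Four tools tie here, so `top_k=3` cuts off `summary_tool_swebench` even though it is relevant. A larger `similarity_top_k` improves recall at the cost of sending more tool schemas to the LLM on every step.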

In [25]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm, 
    system_prompt=""" \
You are an agent designed to answer queries over a set of given papers.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

""",
    verbose=True
)
agent = AgentRunner(agent_worker)
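The `system_prompt` string uses backslash line continuations inside a triple-quoted string: a backslash immediately before a newline removes both characters. The resulting prompt therefore begins with a space and ends with a single newline. The check below reproduces the string and compares it with an equivalent prompt without the stray whitespace; `clean_prompt` is a hypothetical tidier alternative, not something the lesson uses.

```python
system_prompt = """ \
You are an agent designed to answer queries over a set of given papers.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

"""

# repr() makes the leading space and trailing newline visible.
print(repr(system_prompt))

# An equivalent prompt built from implicitly concatenated string literals.
clean_prompt = (
    "You are an agent designed to answer queries over a set of given papers.\n"
    "Please always use the tools provided to answer a question. "
    "Do not rely on prior knowledge."
)
assert system_prompt.strip() == clean_prompt
```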

In [26]:
response = agent.query(
    "Tell me about the evaluation dataset used "
    "in MetaGPT and compare it against SWE-Bench"
)
print(str(response))

Added user message to memory: Tell me about the evaluation dataset used in MetaGPT and compare it against SWE-Bench
=== Calling Function ===
Calling function: summary_tool_metagpt with args: {"input": "evaluation dataset used in MetaGPT"}
=== Function Output ===
The evaluation dataset used in MetaGPT is the SoftwareDev dataset, which consists of 70 diverse software development tasks. The dataset includes tasks such as creating games like Snake, Brick Breaker, and Flappy Bird, developing programs for Excel data processing and CRUD management, as well as tasks like creating a music transcriber, custom press releases, Gomoku game, and a weather dashboard. The dataset covers a wide range of software development tasks to assess the performance of MetaGPT in generating code for various applications.
=== Calling Function ===
Calling function: summary_tool_swebench with args: {"input": "evaluation dataset used in SWE-Bench"}
=== Function Output ===
The evaluation dataset used in SWE-Bench is c

In [29]:
response = agent.query(
    "Compare and contrast the LoRA papers (LongLoRA, LoftQ). "
    "Analyze the approach in each paper first. "
)

Added user message to memory: Summarize each of the papers. 
=== Calling Function ===
Calling function: summary_tool_selfrag with args: {"input": "Summarize the paper titled 'Selfrag: A Self-supervised Framework for Rare Disease Gene Prioritization'"}
=== Function Output ===
The paper introduces a self-supervised framework called SELF-RAG for prioritizing genes related to rare diseases. It utilizes self-supervised learning to effectively prioritize genes associated with rare diseases, aiming to enhance accuracy and efficiency in identifying these genes. The framework enhances the quality and factuality of large language models through retrieval on demand and self-reflection, significantly outperforming existing models on various tasks by improving factuality, citation accuracy, and overall performance.
=== Calling Function ===
Calling function: summary_tool_knowledge_card with args: {"input": "Summarize the paper titled 'Knowledge Card: A Novel Representation Learning Framework for Bio

Retrying llama_index.llms.openai.base.OpenAI._achat in 0.8165571065358295 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-rt3P1puAd3ylJozP7R0e7ggI on tokens per min (TPM): Limit 60000, Used 44532, Requested 15738. Please try again in 270ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.


=== Function Output ===
The paper titled 'SWeBench: A Benchmarking Toolkit for Semantic Web Reasoners' presents a benchmarking toolkit specifically created to evaluate the performance of semantic web reasoners. It details the process of constructing the benchmark, which involves scraping pull requests from top PyPI libraries and converting them into task instances for validation through execution. The toolkit is designed to be easily expandable to new programming languages and code domains, allowing for continuous updates to reflect advancements in language models trained on recent source code. Additionally, the paper discusses the methodology used to assess semantic web reasoners, including retrieval details, inference settings, and prompt template examples. It also touches on the societal impact of machine-automated software engineering, emphasizing the importance of ensuring AI-generated code aligns with human intents and discussing safety measures. The toolkit offers a standardized

KeyboardInterrupt: 
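The log shows the OpenAI client hitting the tokens-per-minute limit (60,000 TPM, with 44,532 used and 15,738 requested) and retrying after about 0.8 seconds before the cell was interrupted. Retrying with a growing, randomized delay is the usual pattern for HTTP 429 responses. The sketch below shows capped exponential backoff with jitter; `call_with_backoff` and `RateLimitError` are hypothetical stand-ins, not the retry logic LlamaIndex or the OpenAI SDK actually use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the OpenAI client's HTTP 429 error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with capped exponential backoff and jitter."""
    attempt = 0
    while True:
        try:
            return fn()
        except RateLimitError:
            if attempt >= max_retries:
                raise  # give up and surface the 429 to the caller
            # Double the wait each attempt; the random factor in [1, 2) spreads out
            # retries from concurrent requests so they do not collide again.
            delay = min(max_delay, base_delay * 2 ** attempt * (1 + random.random()))
            print(f"Retrying in {delay:.2f} seconds as it raised RateLimitError")
            sleep(delay)
            attempt += 1
```

Backoff only helps once the per-minute budget frees up. A query that fans out to many summary tools, like the one above, sends every tool output back to the model, so asking about fewer papers per query (as in the next cell) also keeps requests under the limit.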

In [30]:
response = agent.query(
    "Summarize the Metra paper. "
)

Added user message to memory: Summarize the Metra paper. 
=== Calling Function ===
Calling function: summary_tool_metra with args: {"input": "Metra paper"}
=== Function Output ===
The METRA paper introduces an unsupervised reinforcement learning method that focuses on discovering diverse and useful behaviors in locomotion and manipulation environments. It emphasizes learning a compact latent skill space connected to the state space through a temporal distance metric. The paper details the methodology, experiments, and comparisons with other unsupervised RL methods, showcasing success in discovering locomotion behaviors in various environments like Quadruped and Humanoid. Additionally, it discusses theoretical connections to principal component analysis (PCA) and other unsupervised skill learning methods like DIAYN, DADS, and CIC, along with providing implementation specifics and experimental results in different tasks and environments.
=== LLM Response ===
The METRA paper introduces an
