In [34]:
from settings import (
    COMPLETIONS_MODEL,
    API_EXCHANGE_VERSION,
    API_BASE_URL,
    
    EMBEDDINGS_MODEL,
    EMBEDDINGS_BASE_URL,
    TOKEN_ID
)
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

In [35]:
import nest_asyncio
nest_asyncio.apply()

In [36]:
### Setting up llm and embed model


llm = AzureOpenAI(
    engine=COMPLETIONS_MODEL,
    api_key=TOKEN_ID,
    api_version=API_EXCHANGE_VERSION,
    azure_endpoint=f"{API_BASE_URL}/api"
)

embed_model = AzureOpenAIEmbedding(
    engine=EMBEDDINGS_MODEL,
    api_key=TOKEN_ID,
    api_version=API_EXCHANGE_VERSION,
    azure_endpoint=f"{EMBEDDINGS_BASE_URL}/api"
)

### 1. Setup an agent over 3 papers

In [37]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=hSyW5go0v8",
]

papers = [
    "data/metagpt.pdf",
    "data/longlora.pdf",
    "data/selfrag.pdf",
]

In [38]:
from utils import get_doc_tools
from pathlib import Path

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

paper_to_tools_dict = {}

for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

Getting tools for paper: data/metagpt.pdf
Getting tools for paper: data/longlora.pdf
Getting tools for paper: data/selfrag.pdf


In [39]:
initial_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

In [40]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

In [41]:
agent_worker = FunctionCallingAgentWorker.from_tools(
    initial_tools,
    llm=llm,
    verbose=True
)
agent = AgentRunner(agent_worker)

In [42]:
response = agent.query(
    "Tell me about the evaluation dataset used in Self-RAG,"
    "and then tell me about the evaluation results"
)

Added user message to memory: Tell me about the evaluation dataset used in Self-RAG,and then tell me about the evaluation results
=== Calling Function ===
Calling function: vector_tool_selfrag with args: {"query": "evaluation dataset used in Self-RAG"}
=== Function Output ===
The evaluation dataset used in Self-RAG is not mentioned in the given context information.
=== Calling Function ===
Calling function: summary_tool_selfrag with args: {"input": "Self-RAG is a novel approach to generate question answering models that can reason over multiple passages. The model is evaluated on the Natural Questions (NQ) dataset and achieves state-of-the-art results on the leaderboard. Self-RAG is shown to outperform previous models on the NQ dataset, especially on long and complex questions."}
=== Function Output ===
True. According to multiple sources, Self-RAG is a novel approach to generating question answering models that can reason over multiple passages. The model has been evaluated on the Nat

In [43]:
response = agent.query("Give me a summary of both Self-RAG and LongLoRA methods")
print(str(response))

Added user message to memory: Give me a summary of both Self-RAG and LongLoRA methods
=== Calling Function ===
Calling function: summary_tool_selfrag with args: {"input": "Self-RAG is a method for generating responses in conversational AI systems. It combines the strengths of retrieval-based and generative models by using a retriever to select relevant passages and a generator to generate a response based on the selected passages. Self-RAG also incorporates self-training, where the model is fine-tuned using its own generated responses as additional training data. This helps improve the model's performance over time. Self-RAG has been shown to achieve state-of-the-art results on various conversational AI benchmarks."}
=== Function Output ===
Self-RAG is a method that combines retrieval-based and generative models to generate responses in conversational AI systems. It uses a retriever to select relevant passages and a generator to produce a response based on the selected passages. Additi

### Setting up an agent over 11 papers 

In [44]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=LzPWWPAdY4",
    "https://openreview.net/pdf?id=VTF8yNQM66",
    "https://openreview.net/pdf?id=hSyW5go0v8",
    "https://openreview.net/pdf?id=9WD9KwssyT",
    "https://openreview.net/pdf?id=yV6fD7LYkF",
    "https://openreview.net/pdf?id=hnrB5YHoYu",
    "https://openreview.net/pdf?id=WbWtOYIzIK",
    "https://openreview.net/pdf?id=c5pwL0Soay",
    "https://openreview.net/pdf?id=TpD2aG1h0D"
]

papers = [
    "data/metagpt.pdf",
    "data/longlora.pdf",
    "data/loftq.pdf",
    "data/swebench.pdf",
    "data/selfrag.pdf",
    "data/zipformer.pdf",
    "data/values.pdf",
    "data/finetune_fair_diffusion.pdf",
    "data/knowledge_card.pdf",
    "data/metra.pdf",
    "data/vr_mcl.pdf"
]

In [45]:
from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}

for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

Getting tools for paper: data/metagpt.pdf
Getting tools for paper: data/longlora.pdf
Getting tools for paper: data/loftq.pdf
Getting tools for paper: data/swebench.pdf
Getting tools for paper: data/selfrag.pdf
Getting tools for paper: data/zipformer.pdf
Getting tools for paper: data/values.pdf
Getting tools for paper: data/finetune_fair_diffusion.pdf
Getting tools for paper: data/knowledge_card.pdf
Getting tools for paper: data/metra.pdf
Getting tools for paper: data/vr_mcl.pdf


In [46]:
all_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

In [47]:
len(all_tools)

22

### Extend the Agent with Tool Retrieval

- Using retrieval to get the relevant tool - if the number of tools is too large, we'd like to retrieve a relevant tool, if there are multiple tools, then it's not possible to select the best one

In [48]:
# Defining an object index and retriever over the tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex
)

In [49]:
obj_retriever = obj_index.as_retriever(similarity_top_k=3)

In [50]:
tools = obj_retriever.retrieve("Tell me about the eval dataset used in MetaGPT and SWE-Bench")

In [51]:
tools[0].metadata

ToolMetadata(description='Use ONLY IF you want to get a holistic summary of metagpt. Do NOT use if you have specific questions over metagpt.', name='summary_tool_metagpt', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)

In [52]:
tools[1].metadata

ToolMetadata(description='Use ONLY IF you want to get a holistic summary of swebench. Do NOT use if you have specific questions over swebench.', name='summary_tool_swebench', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)

In [53]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm,
    system_prompt = """
        You are and agent designed to answer queries over a set of given papers.
        Please alweys use the tools provided to answer a question. Do not rely on prior knowledge.
    """,
    verbose=True
)

In [54]:
agent = AgentRunner(agent_worker)

In [55]:
response = agent.query(
    "Tell me about the evaluation dataset used in MetaGPT and compare it against SWE-Bench"
)

Added user message to memory: Tell me about the evaluation dataset used in MetaGPT and compare it against SWE-Bench
=== Calling Function ===
Calling function: summary_tool_metagpt with args: {"input": "Evaluation dataset used in MetaGPT"}
=== Function Output ===
The evaluation dataset used in MetaGPT includes the HumanEval benchmark, the MBPP benchmark, and a self-generated SoftwareDev dataset. The HumanEval benchmark consists of 164 handwritten programming tasks, while the MBPP benchmark consists of 427 Python tasks. The SoftwareDev dataset comprises 70 representative examples of software development tasks with diverse scopes, such as mini-games, image processing algorithms, and data visualization. These datasets were used to evaluate the functional accuracy, executability, cost, code statistics, productivity, and human revision cost of the generated code.
=== Calling Function ===
Calling function: summary_tool_swebench with args: {"input": "Evaluation dataset used in SWE-Bench"}
=== 

In [56]:
response = agent.query("Compare and contrast the LoRA papers (LongLoRA, LoftQ). Analyze the approach in each paper first.")

Added user message to memory: Compare and contrast the LoRA papers (LongLoRA, LoftQ). Analyze the approach in each paper first.
=== LLM Response ===
The LongLoRA and LoftQ papers both propose approaches for generating long-form text using language models. Let's analyze the approach in each paper separately:

1. LongLoRA:
The LongLoRA paper introduces a method for generating long documents by extending the capabilities of the GPT-3 language model. It addresses the limitation of GPT-3, which has a maximum token limit of 4096, by splitting the input into chunks and generating the output in a hierarchical manner. The approach involves two steps: chunking and stitching. In the chunking step, the input document is divided into smaller chunks, each within the token limit. Then, in the stitching step, the chunks are processed sequentially, and the outputs are concatenated to form the final long document. LongLoRA also introduces a novel technique called "contextual chunking" to ensure coherenc

### TODO: What is the size of the memory buffer? - How many messages?