<a href="https://colab.research.google.com/github/girijesh-ai/llamaIndex-projects/blob/main/Recursive_Retriever_Document_Agents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/query_engine/recursive_retriever_agents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recursive Retriever + Document Agents

This guide shows how to combine recursive retrieval and "document agents" for advanced decision making over heterogeneous documents.

Two observations motivate this approach:
- Decoupling retrieval embeddings from chunk-based synthesis. Fetching documents by their summaries often returns more relevant context for a query than matching against raw chunks. Recursive retrieval supports this directly.
- Within a document, users may need to perform tasks beyond fact-based question answering. We introduce "document agents": agents that have access to both vector search and summary tools for a given document.
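
The first point can be illustrated with a minimal, self-contained sketch. It is not LlamaIndex code: the `bow` and `cosine` helpers are a toy bag-of-words stand-in for a real embedding model, and the `docs` entries are hypothetical placeholders. What it tries to show is that the text used for matching (the summary) can differ from the text handed to synthesis (the full document).

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a toy stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical data: the summary is matched, the full text is returned.
docs = [
    {"summary": "sports teams and stadiums in boston",
     "full_text": "<full Boston article>"},
    {"summary": "history and geography of seattle",
     "full_text": "<full Seattle article>"},
]


def retrieve_by_summary(query):
    """Score the query against summaries only, then return the linked full text."""
    query_vec = bow(query)
    best = max(docs, key=lambda d: cosine(query_vec, bow(d["summary"])))
    return best["full_text"]


print(retrieve_by_summary("boston sports teams"))
```

Because scoring never touches `full_text`, the summaries can be tuned for retrieval quality without changing what the synthesis step receives.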

### Setup and Download Data

In this section, we'll define imports and then download Wikipedia articles about different cities. Each article is stored separately.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
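# The imports below use the pre-0.10 package layout, so pin the version this notebook was run with.
!pip install "llama-index==0.8.53.post3"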


In [None]:
from llama_index import (
    VectorStoreIndex,
    SummaryIndex,
    SimpleKeywordTableIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.schema import IndexNode
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.llms import OpenAI

import os

import openai

# Read the key from the environment instead of hardcoding a secret in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

from IPython.display import display, HTML

In [None]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

In [None]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

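    # Create the data directory on first use
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # Write as UTF-8 so non-ASCII characters in the article survive on any platform
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)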
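
The download loop assumes every title resolves to a page that carries an `extract` field. A slightly more defensive variant might factor the parsing into a helper, as sketched below. `extract_text` is a hypothetical name, not part of the notebook, and the sketch assumes that a page without an `extract` key should be treated as an error rather than written out as an empty file.

```python
def extract_text(response: dict) -> str:
    """Return the plain-text extract from a parsed Wikipedia API response.

    Hypothetical helper: raises ValueError when the response has no pages
    or when the first page has no "extract" field.
    """
    pages = response.get("query", {}).get("pages", {})
    if not pages:
        raise ValueError("Response contains no pages")
    # The notebook requests one title at a time, so take the first page.
    page = next(iter(pages.values()))
    if "extract" not in page:
        raise ValueError(f"No extract found for {page.get('title')!r}")
    return page["extract"]


# Usage with a hand-written response of the same shape the loop expects.
sample = {"query": {"pages": {"123": {"title": "Boston", "extract": "Boston is..."}}}}
print(extract_text(sample))
```

Inside the loop, `wiki_text = extract_text(response)` would replace the two lines that index into `response["query"]["pages"]`.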

In [None]:
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()

#### Define LLM + Service Context

In [None]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)



## Build Document Agent for each Document

In this section we define "document agents" for each document.

First we define both a vector index (for semantic search) and summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function calling agent.

This document agent can dynamically choose to perform semantic search or summarization within a given document.

We create a separate document agent for each city.
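
To make the tool choice concrete, here is a toy sketch of the decision a document agent faces. In the notebook the choice is made by the LLM through function calling, based on each tool's description. The rule-based `pick_tool` function and its cue words are hypothetical stand-ins used only for illustration.

```python
def pick_tool(query: str) -> str:
    """Hypothetical rule-based stand-in for the LLM's function-calling decision."""
    # Assumed cue words; a real agent reads the tool descriptions instead.
    summary_cues = ("summary", "summarize", "overview")
    if any(cue in query.lower() for cue in summary_cues):
        return "summary_tool"
    # Default to semantic search for specific, fact-style questions.
    return "vector_tool"


for q in [
    "Tell me about the sports teams in Boston",
    "Give me a summary on all the positive aspects of Chicago",
]:
    print(q, "->", pick_tool(q))
```

An LLM-driven agent is far more flexible than keyword rules, which is why the tool descriptions passed to `ToolMetadata` matter: they are the only signal the model has for choosing between the two tools.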

In [None]:
from llama_index.agent import OpenAIAgent

# Build agents dictionary
agents = {}

for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
        city_docs[wiki_title], service_context=service_context
    )
    # build summary index
    summary_index = SummaryIndex.from_documents(
        city_docs[wiki_title], service_context=service_context
    )
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    "Useful for retrieving specific context from"
                    f" {wiki_title}"
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    f"Useful for summarization questions related to {wiki_title}"
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )

    agents[wiki_title] = agent

## Build Recursive Retriever over these Agents

Now we define a set of summary nodes, where each node links to the corresponding Wikipedia city article. We then define a `RecursiveRetriever` on top of these Nodes to route queries down to a given node, which will in turn route it to the relevant document agent.

We finally define a full query engine combining `RecursiveRetriever` into a `RetrieverQueryEngine`.
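
The routing described above can be sketched without LlamaIndex as a two-step lookup. This is an illustrative toy, not the library's implementation: `ToyIndexNode`, `overlap`, and `recursive_query` are hypothetical names, word overlap stands in for embedding similarity, and plain functions stand in for the document agents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToyIndexNode:
    text: str      # summary text that is matched against the query
    index_id: str  # key of the downstream engine this node links to


def overlap(query: str, text: str) -> int:
    """Count shared lowercase words; a toy stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_query(
    query: str,
    nodes: List[ToyIndexNode],
    engines: Dict[str, Callable[[str], str]],
) -> str:
    # Step 1: top-1 retrieval over the summary nodes.
    best = max(nodes, key=lambda n: overlap(query, n.text))
    # Step 2: follow the node's link and pass the same query to that engine.
    return engines[best.index_id](query)


cities = ["Toronto", "Boston"]
nodes = [ToyIndexNode(text=f"facts about {c}", index_id=c) for c in cities]
# Each "engine" is a placeholder that tags the query with its city name.
engines = {c: (lambda q, c=c: f"[{c} agent] {q}") for c in cities}

print(recursive_query("sports teams in Boston", nodes, engines))
```

The key design point is that a node's `index_id` must match a key in the engine dictionary, which mirrors how each `IndexNode` in the notebook is expected to line up with an entry in `agents`.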

In [None]:
# define top-level nodes
nodes = []
for wiki_title in wiki_titles:
    # define index node that links to these agents
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        " this index if you need to lookup specific facts about"
        f" {wiki_title}.\nDo not use this index if you want to analyze"
        " multiple cities."
    )
    node = IndexNode(text=wiki_summary, index_id=wiki_title)
    nodes.append(node)

In [None]:
# define top-level retriever
vector_index = VectorStoreIndex(nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)

In [None]:
# define recursive retriever
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

In [None]:
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=agents,
    verbose=True,
)

#### Define Full Query Engine

This query engine uses the recursive retriever + response synthesis module to synthesize a response.

In [None]:
query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    service_context=service_context,
)

## Running Example Queries

In [None]:
# should use Boston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Boston")

Retrieving with query id None: Tell me about the sports teams in Boston
Retrieved node with id, entering: Boston
Retrieving with query id Boston: Tell me about the sports teams in Boston
Got response: Boston is home to several professional sports teams across different leagues. Here are some of the notable sports teams in Boston:

1. Boston Red Sox (MLB): The Boston Red Sox are one of the oldest and most successful baseball teams in Major League Baseball (MLB). They have won multiple World Series championships, including their most recent victory in 2018.

2. New England Patriots (NFL): The New England Patriots are a highly successful American football team in the National Football League (NFL). They have won multiple Super Bowl championships, with their most recent victory in 2018.

3. Boston Celtics (NBA): The Boston Celtics are one of the most successful basketball teams in the National Basketball Association (NBA). They have won

In [None]:
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

In [None]:
# should use Houston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Houston and Boston")

Retrieving with query id None: Tell me about the sports teams in Houston and Boston
Retrieved node with id, entering: Houston
Retrieving with query id Houston: Tell me about the sports teams in Houston and Boston
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "sports teams in Houston and Boston"
}
Got output: Houston has sports teams for every major professional league except the National Hockey League. The Houston Astros are a Major League Baseball expansion team formed in 1962, the Houston Rockets are a National Basketball Association franchise, the Houston Texans are a National Football League expansion team formed in 2002, the Houston Dynamo is a Major League Soccer franchise, the Houston Dash team plays in the National Women's Soccer League, the Houston SaberCats are a rugby team that plays in Major League Rugby, and Houston is one of eight cities to have an XFL team, the Houston Roughnecks.
=== Calling F

In [None]:
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

In [None]:
response = query_engine.query(
    "Give me a summary on all the positive aspects of Chicago"
)

Retrieving with query id None: Give me a summary on all the positive aspects of Chicago
Retrieved node with id, entering: Chicago
Retrieving with query id Chicago: Give me a summary on all the positive aspects of Chicago
=== Calling Function ===
Calling function: summary_tool with args: {
  "input": "positive aspects of Chicago"
}
Got output: Chicago is a city that offers a vibrant arts scene and has made significant contributions to various forms of art, including visual arts, literature, film, theater, comedy, food, dance, and music. It is home to prestigious institutions such as the Art Institute of Chicago, the Chicago Symphony Orchestra, and the Lyric Opera of Chicago. Additionally, Chicago is a prominent center for finance, culture, commerce, industry, education, technology, telecommunications, and transportation. With a diverse economy, no single industry dominates more than 14% of the workforce. The city also attracts millions of vis

In [None]:
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))