<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/agent/multi_document_agents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Document Agents

In this guide, you learn how to set up an agent that can effectively answer different types of questions over a larger set of documents.

These questions include the following:

- QA over a specific doc
- QA comparing different docs
- Summaries over a specific doc
- Comparing summaries between different docs

We do this with the following architecture:

- set up a "document agent" over each Document: each doc agent can do QA/summarization within its doc
- set up a top-level agent over this set of document agents. It performs tool retrieval and then does chain-of-thought (CoT) reasoning over the retrieved tools to answer a question.
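
As a rough, library-free illustration of this two-level layout (not the LlamaIndex implementation), the sketch below models each document agent as a plain Python callable and the top-level agent as a function that first selects relevant agents and then calls them. The names `make_doc_agent` and `top_level_query`, the title-matching rule, and the toy texts are assumptions made for illustration; in the notebook itself, both steps are driven by an LLM.

```python
from typing import Callable, Dict, List

DocAgent = Callable[[str], str]


def make_doc_agent(title: str, text: str) -> DocAgent:
    """Return a hypothetical per-document agent that only sees its own text."""

    def agent(query: str) -> str:
        # A real document agent would call an LLM with tools; here we just
        # echo the document so the control flow is easy to follow.
        return f"[{title}] {text}"

    return agent


def top_level_query(agents: Dict[str, DocAgent], query: str) -> List[str]:
    """Step 1: select agents (tool retrieval). Step 2: call each selected agent."""
    selected = [title for title in agents if title.lower() in query.lower()]
    return [agents[title](query) for title in selected]


# Toy documents (placeholders, not real article text).
agents = {
    "Boston": make_doc_agent("Boston", "toy text about Boston"),
    "Houston": make_doc_agent("Houston", "toy text about Houston"),
    "Tokyo": make_doc_agent("Tokyo", "toy text about Tokyo"),
}

answers = top_level_query(agents, "Compare the demographics of Houston and Boston")
print(answers)
```

Only the Boston and Houston agents are called here; the Tokyo agent never sees the query, which is the point of retrieving tools before using them.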

## Setup and Download Data

In this section, we'll define imports and then download Wikipedia articles about different cities. Each article is stored separately.

We load in 18 cities. This is not quite at the level of "hundreds" of documents, but it's still large enough to warrant some top-level document retrieval!

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index


In [None]:
!pip install openai



In [None]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "Your API Key goes here"

In [None]:
from llama_index import (
    VectorStoreIndex,
    SummaryIndex,
    SimpleKeywordTableIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.schema import IndexNode
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.llms import OpenAI

In [None]:
wiki_titles = [
    "Toronto",
    "Seattle",
    "Chicago",
    "Boston",
    "Houston",
    "Tokyo",
    "Berlin",
    "Lisbon",
    "Paris",
    "London",
    "Atlanta",
    "Munich",
    "Shanghai",
    "Beijing",
    "Copenhagen",
    "Moscow",
    "Cairo",
    "Karachi",
]

In [None]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    # Create the output directory if needed, then save the article text.
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)

In [None]:
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()

Define LLM + Service Context

In [None]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

## Building Multi-Document Agents

In this section we show you how to construct the multi-document agent. We first build a document agent for each document, and then define the top-level parent agent with an object index.

### Build Document Agent for each Document

In this section we define "document agents" for each document.

We define both a vector index (for semantic search) and summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function calling agent.

This document agent can dynamically choose to perform semantic search or summarization within a given document.

We create a separate document agent for each city.
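
To make the difference between the two tools concrete, here is a minimal sketch that uses no LlamaIndex code. The toy `vector_tool` keeps only the best-matching chunks, while the toy `summary_tool` reads every chunk. The function names mirror the tool names used in this notebook, but the word-overlap scoring and the keyword rule in `choose_tool` are assumptions standing in for embeddings and for the LLM's function-calling decision, which may choose differently in practice.

```python
from typing import List


def vector_tool(chunks: List[str], query: str, top_k: int = 2) -> List[str]:
    """Toy semantic search: rank chunks by word overlap and keep the top_k."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def summary_tool(chunks: List[str]) -> List[str]:
    """Toy summarization input: every chunk is read, none are filtered out."""
    return list(chunks)


def choose_tool(query: str) -> str:
    """Hypothetical rule standing in for the LLM's function-calling decision."""
    return "summary_tool" if "summary" in query.lower() else "vector_tool"


# Toy chunks (placeholders for the nodes of one city document).
chunks = [
    "history of the city",
    "sports teams and stadiums",
    "arts and culture scene",
    "climate and geography",
]

print(choose_tool("arts and culture"), vector_tool(chunks, "arts and culture", top_k=1))
print(choose_tool("Give me a summary of the city"), len(summary_tool(chunks)))
```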

In [None]:
from llama_index.agent import OpenAIAgent
from llama_index import load_index_from_storage, StorageContext
from llama_index.node_parser import SentenceSplitter
import os

node_parser = SentenceSplitter()

# Build agents dictionary
agents = {}
query_engines = {}

# this is for the baseline
all_nodes = []

for idx, wiki_title in enumerate(wiki_titles):
    nodes = node_parser.get_nodes_from_documents(city_docs[wiki_title])
    all_nodes.extend(nodes)

    if not os.path.exists(f"./data/{wiki_title}"):
        # build vector index
        vector_index = VectorStoreIndex(nodes, service_context=service_context)
        vector_index.storage_context.persist(
            persist_dir=f"./data/{wiki_title}"
        )
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=f"./data/{wiki_title}"),
            service_context=service_context,
        )

    # build summary index
    summary_index = SummaryIndex(nodes, service_context=service_context)
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    summary_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    "Useful for questions related to specific aspects of"
                    f" {wiki_title} (e.g. the history, arts and culture,"
                    " sports, demographics, or more)."
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for any requests that require a holistic summary"
                    f" of EVERYTHING about {wiki_title}. For questions about"
                    " more specific sections, please use the vector_tool."
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-4")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
        system_prompt=f"""\
You are a specialized agent designed to answer queries about {wiki_title}.
You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
""",
    )

    agents[wiki_title] = agent
    query_engines[wiki_title] = vector_index.as_query_engine(
        similarity_top_k=2
    )

### Build Retriever-Enabled OpenAI Agent

We build a top-level agent that can orchestrate across the different document agents to answer any user query.

This agent takes in all document agents as tools. This specific agent, `FnRetrieverOpenAIAgent`, performs tool retrieval before tool use (unlike a default agent, which tries to put all tools in the prompt).

Here we use a top-k retriever, but we encourage you to customize the tool retriever method!
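
For intuition about what a top-k tool retriever does, the following sketch ranks tool descriptions against a query and keeps the best `top_k`. It is a simplified stand-in, not the LlamaIndex retriever: `embed`, `cosine`, and `retrieve_tools` are hypothetical helpers, and the bag-of-words vectors are an assumption replacing a real embedding model.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_tools(descriptions: Dict[str, str], query: str, top_k: int = 3) -> List[str]:
    """Return the names of the top_k tools whose descriptions best match the query."""
    query_vec = embed(query)
    ranked = sorted(
        descriptions,
        key=lambda name: cosine(query_vec, embed(descriptions[name])),
        reverse=True,
    )
    return ranked[:top_k]


# Toy tool descriptions, loosely modeled on the per-city tools in this notebook.
descriptions = {
    f"tool_{city}": f"Use this tool to answer questions about {city}"
    for city in ["Boston", "Houston", "Chicago", "Tokyo"]
}

selected = retrieve_tools(descriptions, "demographics of Houston and Chicago", top_k=2)
print(selected)
```

Swapping `embed` for a different scoring function is one simple way to experiment with a custom retrieval strategy before wiring it into the agent.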


In [None]:
# define tool for each document agent
all_tools = []
for wiki_title in wiki_titles:
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        f" this tool if you want to answer any questions about {wiki_title}.\n"
    )
    doc_tool = QueryEngineTool(
        query_engine=agents[wiki_title],
        metadata=ToolMetadata(
            name=f"tool_{wiki_title}",
            description=wiki_summary,
        ),
    )
    all_tools.append(doc_tool)

In [None]:
# define an "object" index and retriever over these tools
from llama_index import VectorStoreIndex
from llama_index.objects import ObjectIndex, SimpleToolNodeMapping

tool_mapping = SimpleToolNodeMapping.from_objects(all_tools)
obj_index = ObjectIndex.from_objects(
    all_tools,
    tool_mapping,
    VectorStoreIndex,
)

In [None]:
from llama_index.agent import FnRetrieverOpenAIAgent

top_agent = FnRetrieverOpenAIAgent.from_retriever(
    obj_index.as_retriever(similarity_top_k=3),
    system_prompt="""\
You are an agent designed to answer queries about a set of given cities.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\
""",
    verbose=True,
)

### Define Baseline Vector Store Index

As a point of comparison, we define a "naive" RAG pipeline which dumps all docs into a single vector index collection.

We set `similarity_top_k=4`.
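
One limitation of a single flat index can be sketched with toy data: when all chunks compete in one pool, the chunks of one city can crowd out another city in a comparison query. The snippet below is an illustrative assumption (word-overlap scoring, made-up chunk texts, and a hypothetical `flat_top_k` helper), not a measurement of the real baseline.

```python
from collections import Counter
from typing import List, Tuple

Chunk = Tuple[str, str]  # (city, text)


def flat_top_k(chunks: List[Chunk], query: str, top_k: int = 4) -> List[Chunk]:
    """Rank every chunk from every city in one pool and keep the top_k."""
    query_words = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk[1].lower().split())),
        reverse=True,
    )[:top_k]


# Toy pool: five Houston chunks that mention demographics, one Chicago chunk.
pool = [("Houston", f"houston demographics section {i}") for i in range(5)]
pool.append(("Chicago", "chicago population overview"))

hits = flat_top_k(pool, "compare houston demographics with chicago demographics")
print(Counter(city for city, _ in hits))
```

In this toy pool all four retrieved chunks come from Houston, so a model answering from them would have no Chicago context to compare against.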

In [None]:
base_index = VectorStoreIndex(all_nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=4)

## Running Example Queries

Let's run some example queries, ranging from QA / summaries over a single document to QA / summarization over multiple documents.

In [None]:
# should use Boston agent -> vector tool
response = top_agent.query("Tell me about the arts and culture in Boston")

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: tool_Boston with args: {
  "input": "arts and culture"
}
Added user message to memory: arts and culture
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "arts and culture"
}
Got output: Boston has a rich arts and culture scene. The city is home to several art museums and galleries, including the Museum of Fine Arts and the Isabella Stewart Gardner Museum. The Institute of Contemporary Art is also located in Boston's Seaport District. The city has a vibrant theater district, with several theaters such as the Cutler Majestic Theatre, Citi Performing Arts Center, the Colonial Theater, and the Orpheum Theatre. Boston is known for its annual events, including the Boston Early Music Festival, the Boston Arts Festival, and the Italian summer feasts in the North End. The city also has a strong music culture, with the Boston Symphony Orchestra being one of the "Big Five" American orchestras

In [None]:
print(response)

Boston has a vibrant arts and culture scene with several renowned art museums and galleries. The city is home to the Museum of Fine Arts and the Isabella Stewart Gardner Museum, both of which showcase a wide range of artistic works. The Institute of Contemporary Art, located in the Seaport District, is another popular destination for contemporary art enthusiasts.

Boston's theater district is also a hub for performing arts. The district is home to several theaters, including the Cutler Majestic Theatre, Citi Performing Arts Center, Colonial Theater, and Orpheum Theatre. These venues host a variety of performances, including Broadway shows, musicals, and plays.

The city is known for its annual events that celebrate arts and culture. The Boston Early Music Festival showcases early music performances and attracts musicians and enthusiasts from around the world. The Boston Arts Festival is a multi-day event that features local artists, musicians, and performers. Additionally, the North En

In [None]:
# baseline
response = base_query_engine.query(
    "Tell me about the arts and culture in Boston"
)
print(str(response))

Boston has a rich arts and culture scene. The city is known for its literary culture, with famous writers like Ralph Waldo Emerson, Henry David Thoreau, and Nathaniel Hawthorne having roots in Boston. The Old Corner Bookstore is considered the "cradle of American literature" and the place where these writers met. The Boston Public Library, founded in 1852, was the first free library in the United States. Today, Boston's literary culture thrives with the presence of many universities and the annual Boston Book Festival.

Music is also highly valued in Boston, with the Boston Symphony Orchestra being one of the "Big Five" American orchestras. Symphony Hall, home to the Boston Symphony Orchestra, is considered one of the top venues for classical music in the world. The city is also home to the Boston Pops Orchestra and the Boston Youth Symphony Orchestra, the largest youth orchestra in the nation. Other performing arts organizations in Boston include the Boston Ballet, Boston Lyric Opera 

In [None]:
# should use Houston agent -> vector tool
response = top_agent.query(
    "Give me a summary of all the positive aspects of Houston"
)

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: tool_Houston with args: {
  "input": "positive aspects"
}
Added user message to memory: positive aspects
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "positive aspects"
}
Got output: Houston has several positive aspects. It is recognized worldwide for its energy industry, particularly for oil and natural gas. The city is also known for its biomedical research and aeronautics. In addition, Houston is a growing hub for technology startup firms and has a fast-growing technology sector. The city's economy is diverse and includes major technology and software companies. Houston is also a global city and a top U.S. market for exports. The Houston area has a strong petrochemical industry and is home to the Port of Houston, which ranks first in international commerce in the United States. The city's gross domestic product (GDP) is one of the largest in the United States and larger than

In [None]:
print(response)

Houston has numerous positive aspects that make it a vibrant and dynamic city. 

1. Economy: Houston is globally recognized for its energy industry, particularly oil and natural gas. It's also known for its biomedical research and aeronautics. The city is a growing hub for technology startups and has a fast-growing technology sector. 

2. Diversity: The city's economy is diverse, housing major technology and software companies. Houston is a global city and a top U.S. market for exports. 

3. Petrochemical Industry: The Houston area has a strong petrochemical industry and is home to the Port of Houston, which ranks first in international commerce in the United States. 

4. GDP: The city's gross domestic product (GDP) is one of the largest in the United States and larger than several countries. 

5. Education: The University of Houston System has a significant impact on the local economy, attracting new funds and generating jobs. 

6. Quality of Life: Houston has been recognized for its 

In [None]:
# baseline
response = base_query_engine.query(
    "Give me a summary of all the positive aspects of Houston"
)
print(str(response))

Houston has several positive aspects. It has experienced significant growth and economic success, with a thriving energy industry, biomedical research, and aeronautics. The city is recognized as a global city and is a top market for exports in the United States. Houston is also known for its diverse culture and large international community. It has a vibrant arts and theater scene, with numerous performing arts organizations and museums. The city offers a variety of recreational opportunities, including parks, green spaces, and tourist attractions like Space Center Houston. Additionally, Houston has been recognized for its food and restaurant culture, and it has been ranked highly in various lists for its technological innovation, job creation, and quality of life.


In [None]:
# baseline: the response doesn't quite match the sources...
response.source_nodes[1].get_content()

'== Economy ==\n\nHouston is recognized worldwide for its energy industry—particularly for oil and natural gas—as well as for biomedical research and aeronautics. Renewable energy sources—wind and solar—are also growing economic bases in the city, and the City Government purchases 90% of its annual 1 TWh power mostly from wind, and some from solar. The city has also been a growing hub for technology startup firms and is the fastest growing sector of the city\'s economy. Major technology and software companies within Greater Houston include Crown Castle, KBR, FlightAware, Cybersoft, Houston Wire & Cable, and HostGator. Aylo, Go Daddy, and ByteDance have offices in the Houston area. On April 4, 2022, Hewlett Packard Enterprise relocated its global headquarters from California to the Greater Houston area. The Houston Ship Channel is also a large part of Houston\'s economic base.\nBecause of these strengths, Houston is designated as a global city by the Globalization and World Cities Study

In [None]:
response = top_agent.query(
    "Tell the demographics of Houston, and then compare that with the"
    " demographics of Chicago"
)

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: tool_Houston with args: {
  "input": "demographics"
}
Added user message to memory: demographics
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "demographics"
}
Got output: Houston has a population of 2,304,580 according to the 2020 U.S. census. In 2017, the estimated population was 2,312,717, and in 2018 it was 2,325,502. The city has a diverse demographic makeup, with a significant number of undocumented immigrants residing in the Houston area, comprising nearly 9% of the city's metropolitan population in 2017. The age distribution in Houston shows a significant number of residents under 15 and between the ages of 20 to 34. The median age of the city is 33.4. The city has a mix of homeowners and renters, with an estimated 42.3% of Houstonians owning housing units. The median household income in 2019 was $52,338, and 20.1% of Houstonians lived at or below the poverty line.

Got 

In [None]:
print(response)

Houston has a population of 2,304,580, while Chicago has a population of under 2.7 million. Houston is known for its diverse demographic makeup, with a significant number of undocumented immigrants. In terms of age distribution, Houston has a large number of residents under 15 and between the ages of 20 to 34, with a median age of 33.4. In Chicago, the largest racial or ethnic groups are non-Hispanic White, Blacks, and Hispanics. Additionally, Chicago has a significant LGBTQ population, with an estimated 7.5% of the adult population identifying as LGBTQ.


In [None]:
# baseline
response = base_query_engine.query(
    "Tell the demographics of Houston, and then compare that with the"
    " demographics of Chicago"
)
print(str(response))

Houston is a majority-minority city with a diverse population. According to the 2019 U.S. Census Bureau data, the demographics of Houston are as follows: non-Hispanic whites make up 23.3% of the population, Hispanics and Latino Americans make up 45.8%, Blacks or African Americans make up 22.4%, and Asian Americans make up 6.5%. The largest Hispanic or Latino American ethnic group in Houston is Mexican Americans, comprising 31.6% of the population.

In comparison, the demographics of Chicago, based on the 2019 U.S. Census Bureau data, are as follows: non-Hispanic whites make up 32.7% of the population, Hispanics and Latino Americans make up 29.9%, Blacks or African Americans make up 29.0%, and Asian Americans make up 7.6%. The largest Hispanic or Latino American ethnic group in Chicago is Mexican Americans, comprising 21.0% of the population.

Overall, both Houston and Chicago have diverse populations with significant Hispanic and Latino American, Black or African American, and Asian Am

In [None]:
# baseline: the response tells you nothing about Chicago...
response.source_nodes[3].get_content()

"=== Early 21st century ===\nHouston has continued to grow into the 21st century, with the population increasing 15.7% from 2000 to 2022.Oil & gas have continued to fuel Houston's economic growth, with major oil companies including Phillips 66, ConocoPhillips, Occidental Petroleum, Halliburton, and ExxonMobil having their headquarters in the Houston area. In 2001, Enron Corporation, a Houston company with $100 billion in revenue, became engulfed in an accounting scandal which bankrupted the company in 2001. Health care has emerged as a major industry in Houston. The Texas Medical Center is now the largest medical complex in the world and employs over 120,000 people.Three new sports stadiums opened downtown in the first decade of the 21st century. In 2000, the Houston Astros opened their new baseball stadium, Minute Maid Park, in downtown adjacent to the old Union Station. The Houston Texans were formed in 2002 as an NFL expansion team, replacing the Houston Oilers, which had left the c

In [None]:
response = top_agent.query(
    "Tell me the differences between Shanghai and Beijing in terms of history"
    " and current economy"
)

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: tool_Shanghai with args: {
  "input": "history"
}
Added user message to memory: history
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "history"
}
Got output: Shanghai has a rich history that dates back to the 19th century. During this time, the city gained international attention due to its economic and trade potential. It became one of the five treaty ports for international trade after the First Opium War. Shanghai experienced various conflicts and attacks, including the Taiping Rebellion and the First Sino-Japanese War. In the early 20th century, Shanghai became a major commercial and financial hub, attracting people from all over the world. However, the city also faced challenges during the Cultural Revolution. Since then, Shanghai has undergone significant economic reforms and development, becoming a global city and a major contributor to China's tax revenue.

Got output: S

In [None]:
print(str(response))

In terms of history, both Shanghai and Beijing have rich and significant pasts. Shanghai gained international attention in the 19th century as a major trade port and has experienced conflicts and developments throughout the years. Beijing, on the other hand, has been the capital of China for several dynasties and has witnessed significant events such as the Cultural Revolution. 

In terms of current economy, Shanghai is known for its thriving economy, particularly in finance and innovation sectors. It serves as a national hub for commerce, trade, and transportation and is one of the fastest-growing cities globally. Key industries in Shanghai include retail, finance, IT, real estate, and manufacturing. 

Beijing's economy is primarily driven by the tertiary sector, with a focus on services such as professional services, information technology, and commercial real estate. It also has high-end economic output zones that contribute to its growth. Beijing is considered one of the top major 

In [None]:
# baseline
response = base_query_engine.query(
    "Tell me the differences between Shanghai and Beijing in terms of history"
    " and current economy"
)
print(str(response))

NameError: name 'base_query_engine' is not defined

In [None]:
# baseline
response = base_query_engine.query(
    "List ten differences between Toronto and Boston"
)
print(str(response))

NameError: name 'base_query_engine' is not defined