# Comparing Methods for Structured Retrieval (Auto-Retrieval vs. Recursive Retrieval)

In a naive RAG system, the set of input documents is chunked, embedded, and dumped into a vector database collection. Retrieval then just fetches the top-k chunks by embedding similarity.

This can fail if the set of documents is large - raw chunks are hard to disambiguate from one another, and nothing guarantees that retrieval is restricted to the documents that actually contain the relevant context.
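To make the failure mode concrete, here is a minimal sketch of naive top-k retrieval in plain Python. The three-dimensional "embeddings" and the chunk texts are hypothetical placeholders (a real system would get vectors from an embedding model), and `naive_top_k` is an illustrative helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings: the first axis loosely stands for "childhood" content.
chunks = [
    {"doc": "Michael Jordan", "text": "childhood passage", "embedding": [0.9, 0.1, 0.0]},
    {"doc": "Elon Musk", "text": "childhood passage", "embedding": [0.8, 0.2, 0.1]},
    {"doc": "Rihanna", "text": "music career passage", "embedding": [0.1, 0.9, 0.2]},
]


def naive_top_k(query_embedding, chunks, k=2):
    """Rank every chunk by similarity to the query, ignoring which document it came from."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_embedding, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Stands in for the embedding of "childhood of a sports celebrity".
query_embedding = [1.0, 0.0, 0.0]
top = naive_top_k(query_embedding, chunks, k=2)
print([c["doc"] for c in top])
```

Both childhood passages come back, even though only one of them belongs to a sports celebrity: similarity alone cannot tell the documents apart.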

In this guide we explore **structured retrieval** - more advanced query algorithms that take advantage of structure within your documents for higher-precision retrieval. We compare the following two methods:

- **Metadata Filters + Auto-Retrieval**: Tag each document with the right set of metadata. At query time, use auto-retrieval to infer metadata filters and pass the query string through for semantic search.
- **Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval**: Embed document summaries and map each summary to the set of raw chunks for its document. At query time, use recursive retrieval to fetch summaries first and then the raw chunks of the matching documents.

In [77]:
import nest_asyncio

nest_asyncio.apply()

In [1]:
import logging
import sys
from llama_index import SimpleDirectoryReader, ListIndex, ServiceContext

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [58]:
wiki_titles = ["Michael Jordan", "Elon Musk", "Richard Branson", "Rihanna"]
wiki_metadatas = {
    "Michael Jordan": {
        "category": "Sports",
        "country": "United States",
    },
    "Elon Musk": {
        "category": "Business",
        "country": "United States",
    },
    "Richard Branson": {
        "category": "Business",
        "country": "UK",
    },
    "Rihanna": {
        "category": "Music",
        "country": "Barbados",
    },
}

In [59]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

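    # create the output directory if needed (no error if it already exists)
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # write as UTF-8 so non-ASCII characters in the article text are preserved
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)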

In [60]:
# Load all wiki documents
docs_dict = {}
for wiki_title in wiki_titles:
    doc = SimpleDirectoryReader(input_files=[f"data/{wiki_title}.txt"]).load_data()[0]

    doc.metadata.update(wiki_metadatas[wiki_title])
    docs_dict[wiki_title] = doc

In [61]:
from llama_index.llms import OpenAI
from llama_index.callbacks import LlamaDebugHandler, CallbackManager


llm = OpenAI("gpt-4")
callback_manager = CallbackManager([LlamaDebugHandler()])
service_context = ServiceContext.from_defaults(
    llm=llm, callback_manager=callback_manager, chunk_size=256
)

## Metadata Filters + Auto-Retrieval

In this approach, we tag each Document with metadata (category, country) and store the documents in a Weaviate vector database.

At retrieval time, we then perform "auto-retrieval" to infer the relevant set of metadata filters.
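The retrieval step can be sketched in plain Python as "filter on metadata first, then rank what is left". This is only an illustration under stated assumptions: the chunk texts are placeholders, word overlap stands in for embedding similarity, and the query string and filters are hard-coded where a real auto-retriever would have an LLM infer them. `matches`, `word_overlap`, and `filtered_retrieve` are hypothetical helper names.

```python
# Placeholder chunks tagged with the same metadata keys used in this guide.
chunks = [
    {"doc": "Michael Jordan", "text": "early life and childhood",
     "metadata": {"category": "Sports", "country": "United States"}},
    {"doc": "Elon Musk", "text": "childhood and education",
     "metadata": {"category": "Business", "country": "United States"}},
    {"doc": "Richard Branson", "text": "childhood and early business career",
     "metadata": {"category": "Business", "country": "UK"}},
    {"doc": "Richard Branson", "text": "airline and space ventures",
     "metadata": {"category": "Business", "country": "UK"}},
    {"doc": "Rihanna", "text": "music career",
     "metadata": {"category": "Music", "country": "Barbados"}},
]


def matches(metadata, filters):
    """True when every filter key is present in the metadata with an equal value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def word_overlap(query, text):
    """Toy relevance score: number of distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def filtered_retrieve(query_str, filters, chunks, top_k=2):
    """Keep only chunks that satisfy the filters, then rank them against the query."""
    candidates = [c for c in chunks if matches(c["metadata"], filters)]
    ranked = sorted(candidates, key=lambda c: word_overlap(query_str, c["text"]), reverse=True)
    return ranked[:top_k]


# Assumed output of the inference step for "Tell me about the childhood of a UK billionaire".
inferred_query_str = "childhood of a billionaire"
inferred_filters = {"country": "UK"}

results = filtered_retrieve(inferred_query_str, inferred_filters, chunks, top_k=1)
print([(c["doc"], c["text"]) for c in results])
```

Because the filter removes every non-UK chunk before ranking, the other childhood passages can no longer compete with the intended one.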

In [105]:
## Setup Weaviate
import weaviate

# cloud
resource_owner_config = weaviate.AuthClientPassword(
    username="username",
    password="password",
)
client = weaviate.Client(
    "https://llamaindex-test-ul4sgpxc.weaviate.network",
    auth_client_secret=resource_owner_config,
)

In [106]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores import WeaviateVectorStore
from IPython.display import Markdown, display

In [107]:
# drop items from collection first
client.schema.delete_class("LlamaIndex")

In [108]:
from llama_index.storage.storage_context import StorageContext

# If you want to load the index later, be sure to give it a name!
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="LlamaIndex")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# NOTE: you may also choose to define an index_name manually.
# index_name = "test_prefix"
# vector_store = WeaviateVectorStore(weaviate_client=client, index_name=index_name)

In [109]:
# validate that the schema was created
class_schema = client.schema.get("LlamaIndex")
display(class_schema)

{'class': 'LlamaIndex',
 'description': 'Class for LlamaIndex',
 'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},
  'cleanupIntervalSeconds': 60,
  'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},
 'multiTenancyConfig': {'enabled': False},
 'properties': [{'dataType': ['text'],
   'description': 'Text property',
   'indexFilterable': True,
   'indexSearchable': True,
   'name': 'text',
   'tokenization': 'whitespace'},
  {'dataType': ['text'],
   'description': 'The ref_doc_id of the Node',
   'indexFilterable': True,
   'indexSearchable': True,
   'name': 'ref_doc_id',
   'tokenization': 'whitespace'},
  {'dataType': ['text'],
   'description': 'node_info (in JSON)',
   'indexFilterable': True,
   'indexSearchable': True,
   'name': 'node_info',
   'tokenization': 'whitespace'},
  {'dataType': ['text'],
   'description': 'The relationships of the node (in JSON)',
   'indexFilterable': True,
   'indexSearchable': True,
   'name': 'relationships',
   'tokeniza

In [110]:
index = VectorStoreIndex(
    [], storage_context=storage_context, service_context=service_context
)

# add documents to index
for wiki_title in wiki_titles:
    index.insert(docs_dict[wiki_title])

In [68]:
from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo


vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description="Category of the celebrity, one of [Sports, Entertainment, Business, Music]",
        ),
        MetadataInfo(
            name="country",
            type="str",
            description="Country of the celebrity, one of [United States, UK, Barbados]",
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    service_context=service_context,
    max_top_k=10000,
)

In [69]:
# NOTE: asking for "top k to 10000" in the query is a hack to return all matching data.
# Right now auto-retrieval always returns a fixed top-k; there's a TODO to allow
# top-k to be None (fetch all data), at which point the LLM could in principle
# infer a None top-k value.
nodes = retriever.retrieve(
    "Tell me about a celebrity from the United States, set top k to 10000"
)

INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using query str: celebrity
Using query str: celebrity
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using filters: {'country': 'United States'}
Using filters: {'country': 'United States'}
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using top_k: 10000
Using top_k: 10000


In [70]:
print(f"Number of nodes: {len(nodes)}")
for node in nodes:
    print(node.node.get_content())

Number of nodes: 124
The Super Bowl commercial inspired the 1996 live action/animated film Space Jam, which starred Jordan and Bugs in a fictional story set during the former's first retirement from basketball.They have subsequently appeared together in several commercials for MCI.Jordan also made an appearance in the music video for Michael Jackson's "Jam" (1992).Since 2008, Jordan's yearly income from the endorsements is estimated to be over $40 million.In addition, when Jordan's power at the ticket gates was at its highest point, the Bulls regularly sold out both their home and road games.Due to this, Jordan set records in player salary by signing annual contracts worth in excess of US$30 million per season.An academic study found that Jordan's first NBA comeback resulted in an increase in the market capitalization of his client firms of more than $1 billion.Most of Jordan's endorsement deals, including his first deal with Nike, were engineered by his agent, David Falk.Jordan has de

In [71]:
nodes = retriever.retrieve(
    "Tell me about the childhood of a popular sports celebrity in the United States"
)
for node in nodes:
    print(node.node.get_content())

INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using query str: childhood of a popular sports celebrity
Using query str: childhood of a popular sports celebrity
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using filters: {'category': 'Sports', 'country': 'United States'}
Using filters: {'category': 'Sports', 'country': 'United States'}
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using top_k: 2
Using top_k: 2
Knafel claimed Jordan promised her $5 million for remaining silent and agreeing not to file a paternity suit after Knafel learned she was pregnant in 1991; a DNA test showed Jordan was not the father of the child.Jordan proposed to his longtime girlfriend, Cuban-American model Yvette Prieto, on Christmas 2011, and they were married on April 27, 2013, at Bethesda-by-the-Sea Episcopal Church.It was announced on November 30, 2013, that the two were expecting their first child together.

In [72]:
nodes = retriever.retrieve(
    "Tell me about the college life of a billionaire who started at company at the age of 16"
)
for node in nodes:
    print(node.node.get_content())

INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using query str: college life of a billionaire who started at company at the age of 16
Using query str: college life of a billionaire who started at company at the age of 16
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using filters: {}
Using filters: {}
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using top_k: 2
Using top_k: 2
He reportedly hosted large, ticketed house parties to help pay for tuition, and wrote a business plan for an electronic book-scanning service similar to Google Books.In 1994, Musk held two internships in Silicon Valley: one at energy storage startup Pinnacle Research Institute, which investigated electrolytic ultracapacitors for energy storage, and another at Palo Alto–based startup Rocket Science Games.In 1995, he was accepted to a PhD program in materials science at Stanford University.However, Musk decided to join 

In [73]:
nodes = retriever.retrieve("Tell me about the childhood of a UK billionaire")
for node in nodes:
    print(node.node.get_content())

INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using query str: childhood of a billionaire
Using query str: childhood of a billionaire
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using filters: {'country': 'UK'}
Using filters: {'country': 'UK'}
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using top_k: 2
Using top_k: 2
Branson has also talked openly about having ADHD.Branson's parents were supportive of his endeavours from an early age.His mother was an entrepreneur; one of her most successful ventures was building and selling wooden tissue boxes and wastepaper bins.In London, he started off squatting from 1967 to 1968.Branson is an atheist.He said in a 2011 interview with CNN's Piers Morgan that he believes in evolution and the importance of humanitarian efforts but not in the existence of God."I would love to believe," he said."It's very comforting to believe".


== Early business care

## Build Recursive Retriever over Document Summaries
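Before building this with real indices, the two-stage idea can be sketched in plain Python: match the query against one summary per document, then search only the raw chunks of the winning document. Everything here is a hypothetical stand-in: the summaries and chunk texts are placeholders, word overlap replaces embedding similarity, and `recursive_retrieve` is an illustrative helper rather than a library function.

```python
def word_overlap(query, text):
    """Toy relevance score: number of distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Stage-one index: one short placeholder summary per document.
summaries = {
    "Michael Jordan": "basketball player and sports celebrity",
    "Richard Branson": "billionaire entrepreneur and business founder",
}

# Stage-two indices: the raw chunks of each document, keyed by the same document id.
chunks_by_doc = {
    "Michael Jordan": ["childhood and early life", "books and media appearances"],
    "Richard Branson": ["childhood and school years", "airline and space ventures"],
}


def recursive_retrieve(query, summaries, chunks_by_doc):
    """Pick the best-matching summary, then the best chunk inside that document only."""
    doc_id = max(summaries, key=lambda d: word_overlap(query, summaries[d]))
    chunk = max(chunks_by_doc[doc_id], key=lambda text: word_overlap(query, text))
    return doc_id, chunk


doc_id, chunk = recursive_retrieve(
    "childhood of a basketball player", summaries, chunks_by_doc
)
print(doc_id, "->", chunk)
```

The summary lookup plays the role of document selection, so the childhood passage of the other document is never considered in the second stage.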

In [87]:
from llama_index.schema import IndexNode

In [98]:
# define top-level nodes and vector retrievers
nodes = []
vector_query_engines = {}
vector_retrievers = {}

for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
        [docs_dict[wiki_title]], service_context=service_context
    )
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    vector_query_engines[wiki_title] = vector_query_engine
    vector_retrievers[wiki_title] = vector_index.as_retriever()

    # save summaries
    out_path = Path("summaries") / f"{wiki_title}.txt"
    if not out_path.exists():
        # use LLM-generated summary
        list_index = ListIndex.from_documents(
            [docs_dict[wiki_title]], service_context=service_context
        )

        summarizer = list_index.as_query_engine(response_mode="tree_summarize")
        response = await summarizer.aquery(f"Give me a summary of {wiki_title}")

        wiki_summary = response.response
        Path("summaries").mkdir(exist_ok=True)
        with open(out_path, "w") as fp:
            fp.write(wiki_summary)
    else:
        with open(out_path, "r") as fp:
            wiki_summary = fp.read()

    print(f"**Summary for {wiki_title}: {wiki_summary}")
    node = IndexNode(text=wiki_summary, index_id=wiki_title)
    nodes.append(node)

**Summary for Michael Jordan: Michael Jordan, often referred to as MJ, is a retired professional basketball player from the United States who is widely considered one of the greatest players in the history of the sport. He played 15 seasons in the NBA, primarily with the Chicago Bulls, and won six NBA championships. His individual accolades include six NBA Finals MVP awards, ten NBA scoring titles, five NBA MVP awards, and fourteen NBA All-Star Game selections. He also holds the NBA records for career regular season scoring average and career playoff scoring average. Jordan briefly retired to play Minor League Baseball, but returned to lead the Bulls to three more championships. He was twice inducted into the Naismith Memorial Basketball Hall of Fame. 

After retiring, Jordan became a successful businessman, part-owner and head of basketball operations for the Charlotte Hornets, and owner of 23XI Racing in the NASCAR Cup Series. He has also made significant contributions to charitable 

In [99]:
# define top-level retriever
top_vector_index = VectorStoreIndex(nodes)
top_vector_retriever = top_vector_index.as_retriever(similarity_top_k=1)

In [100]:
# define recursive retriever
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

In [102]:
# note: you could instead pass `vector_query_engines` as `query_engine_dict`
# to recurse into per-document query engines rather than retrievers
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": top_vector_retriever, **vector_retrievers},
    # query_engine_dict=vector_query_engines,
    verbose=True,
)

In [103]:
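# the top-level retriever picks one summary (similarity_top_k=1),
# then recurses into that document's own retriever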
nodes = recursive_retriever.retrieve("Tell me about a celebrity from the United States")
for node in nodes:
    print(node.node.get_content())

[36;1m[1;3mRetrieving with query id None: Tell me about a celebrity from the United States
[0m[38;5;200m[1;3mRetrieved node with id, entering: Michael Jordan
[0m[36;1m[1;3mRetrieving with query id Michael Jordan: Tell me about a celebrity from the United States
[0m[38;5;200m[1;3mRetrieving text node: He was interviewed at three homes associated with the production and did not want cameras in his home or on his plane, as according to director Jason Hehir "there are certain aspects of his life that he wants to keep private".Jordan granted rapper Travis Scott permission to film a music video for his single "Franchise" at his home in Highland Park, Illinois.Jordan appeared in the 2022 miniseries The Captain, which follows the life and career of Derek Jeter.


=== Books ===
Jordan has authored several books focusing on his life, basketball career, and world view.

Rare Air: Michael on Michael, with Mark Vancil and Walter Iooss (Harper San Francisco, 1993).
I Can't Accept Not Tryi

In [104]:
nodes = recursive_retriever.retrieve(
    "Tell me about the childhood of a billionaire who started at company at the age of 16"
)
for node in nodes:
    print(node.node.get_content())

[36;1m[1;3mRetrieving with query id None: Tell me about the childhood of a billionaire who started at company at the age of 16
[0m[38;5;200m[1;3mRetrieved node with id, entering: Richard Branson
[0m[36;1m[1;3mRetrieving with query id Richard Branson: Tell me about the childhood of a billionaire who started at company at the age of 16
[0m[38;5;200m[1;3mRetrieving text node: Branson has also talked openly about having ADHD.Branson's parents were supportive of his endeavours from an early age.His mother was an entrepreneur; one of her most successful ventures was building and selling wooden tissue boxes and wastepaper bins.In London, he started off squatting from 1967 to 1968.Branson is an atheist.He said in a 2011 interview with CNN's Piers Morgan that he believes in evolution and the importance of humanitarian efforts but not in the existence of God."I would love to believe," he said."It's very comforting to believe".


== Early business career ==
After failed attempts to gro