# Retrievers and Node Post-Processors

In this notebook, we customize our existing retrieval process using the `HierarchicalNodeParser`, the `AutoMergingRetriever`, 
and a custom node post-processor that caps the number of tokens sent to the LLM.

## Setup

In [3]:
import openai
import os
import sys
sys.path.append(os.path.join(os.getcwd(), '..'))

os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
openai.api_key = os.environ['OPENAI_API_KEY']

In [4]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI

# Use local embeddings + gpt-3.5-turbo-16k
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-16k", max_tokens=512, temperature=0.1),
    embed_model="local:BAAI/bge-base-en"
)

set_global_service_context(service_context)



## Node Parsing + Retrieval

Previously, we used a custom markdown loader to load chunks from our markdown documentation. However, since then, advancements have been made in llama-index that may provide more relevant retrieval. Specifically, we will use the `HierarchicalNodeParser`, which parses nodes into several chunk sizes.

The idea here is that, during retrieval, if a majority of a parent chunk's child chunks are retrieved, we return the larger parent chunk instead.

To support this, we can modify our loading code as shown below:

### Loading Helper Function

In [5]:
from llama_index import SimpleDirectoryReader, Document
from llama_index.node_parser import HierarchicalNodeParser, SimpleNodeParser, get_leaf_nodes
from llama_index.schema import MetadataMode
from llama_docs_bot.markdown_docs_reader import MarkdownDocsReader


def load_markdown_docs(filepath, hierarchical=True):
    """Load markdown docs from a directory, excluding all other file types."""
    loader = SimpleDirectoryReader(
        input_dir=filepath, 
        required_exts=[".md"],
        file_extractor={".md": MarkdownDocsReader()},
        recursive=True
    )

    documents = loader.load_data()

    if hierarchical:
        # combine all documents into one
        documents = [
            Document(text="\n\n".join(
                    document.get_content(metadata_mode=MetadataMode.ALL) 
                    for document in documents
                )
            )
        ]

        # chunk into 2 levels: 1536-token parents, each split into roughly 3 children of 512
        # majority means 2 of 3 children are retrieved before using the parent
        large_chunk_size = 1536
        node_parser = HierarchicalNodeParser.from_defaults(
            chunk_sizes=[
                large_chunk_size, 
                large_chunk_size // 3,
            ]
        )

        nodes = node_parser.get_nodes_from_documents(documents)
        return nodes, get_leaf_nodes(nodes)
    else:
        node_parser = SimpleNodeParser.from_defaults()
        nodes = node_parser.get_nodes_from_documents(documents)
        return nodes

Here, we parse each directory into a single giant document, and then chunk it into a two-level hierarchy of 1536 and 1536 // 3 (512) tokens.
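
To make the two-level layout concrete, here is a minimal pure-Python sketch of hierarchical chunking using the sizes from `load_markdown_docs` (1536 and 1536 // 3). The `Chunk` class and the `split`, `build_hierarchy`, and `leaf_chunks` helpers are hypothetical stand-ins, not llama-index APIs, and the sketch cuts a plain token list at fixed offsets, ignoring the sentence boundaries and overlap a real parser would likely respect.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Chunk:
    """Hypothetical stand-in for a parsed node (not a llama-index class)."""
    tokens: List[str]
    parent: Optional["Chunk"] = field(default=None, repr=False)
    children: List["Chunk"] = field(default_factory=list, repr=False)


def split(tokens: List[str], size: int) -> List[List[str]]:
    """Cut a token list into consecutive pieces of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def build_hierarchy(tokens: List[str], chunk_sizes: List[int]) -> List[Chunk]:
    """Return the chunks of every level, each parent listed before its children."""
    all_chunks: List[Chunk] = []

    def recurse(piece, level, parent):
        for part in split(piece, chunk_sizes[level]):
            chunk = Chunk(tokens=part, parent=parent)
            if parent is not None:
                parent.children.append(chunk)
            all_chunks.append(chunk)
            # Keep splitting until the smallest chunk size has been applied.
            if level + 1 < len(chunk_sizes):
                recurse(part, level + 1, chunk)

    recurse(tokens, 0, None)
    return all_chunks


def leaf_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Chunks without children are the ones that would be embedded."""
    return [chunk for chunk in chunks if not chunk.children]


large_chunk_size = 1536
tokens = [f"tok{i}" for i in range(2 * large_chunk_size)]  # toy "document"
chunks = build_hierarchy(tokens, [large_chunk_size, large_chunk_size // 3])
leaves = leaf_chunks(chunks)
print(len(chunks), len(leaves))  # 8 6
```

Each 1536-token parent ends up with three 512-token children, which is presumably why `load_markdown_docs` returns both the full node list and the leaf nodes: the leaves are what get embedded, while the parents need to stay available for lookup.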

This means if 2 of 3 child chunks are retrieved, the `AutoMergingRetriever` will replace the nodes with the larger parent chunk.
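
The merging rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the `AutoMergingRetriever` implementation: the `auto_merge` function, the dictionary-based parent/child maps, and the `threshold` of 0.5 are all assumptions chosen so that "2 of 3" triggers a merge and "1 of 3" does not.

```python
from collections import defaultdict
from typing import Dict, List


def auto_merge(
    retrieved: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    threshold: float = 0.5,  # assumed ratio; 2/3 passes, 1/3 does not
) -> List[str]:
    """Swap retrieved siblings for their parent when enough of them were hit."""
    # Group the retrieved node ids by parent.
    hits = defaultdict(list)
    for node_id in retrieved:
        parent = parent_of.get(node_id)
        if parent is not None:
            hits[parent].append(node_id)

    # A parent is merged when more than `threshold` of its children were retrieved.
    merged_parents = {
        parent
        for parent, kids in hits.items()
        if len(kids) / len(children_of[parent]) > threshold
    }

    # Rebuild the result in retrieval order, emitting each merged parent once.
    result: List[str] = []
    for node_id in retrieved:
        parent = parent_of.get(node_id)
        if parent in merged_parents:
            if parent not in result:
                result.append(parent)
        else:
            result.append(node_id)
    return result


children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f"]}
parent_of = {kid: parent for parent, kids in children_of.items() for kid in kids}

print(auto_merge(["a", "d", "b"], parent_of, children_of))  # ['P1', 'd']
```

In the example, `a` and `b` cover two of the three children of `P1`, so they are replaced by `P1`, while the lone hit `d` is returned unchanged.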

Now, for auto-merging to work properly, we need to set the top-k higher. However, we still want to avoid sending too much text to the LLM, for the sake of latency. So, later in this notebook, we also introduce a custom node post-processor to limit the amount of text returned after merging.

### Load/Create Query Engines

Let's write a function to build our query engine tools next.


In [6]:
from llama_index import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import AutoMergingRetriever
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.storage.docstore import SimpleDocumentStore


def get_query_engine_tool(directory, description, hierarchical=True, postprocessors=None):
    try:
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./data_{os.path.basename(directory)}"
        )
        index = load_index_from_storage(storage_context)

        if hierarchical:
            retriever = AutoMergingRetriever(
                index.as_retriever(similarity_top_k=12),  # same top-k as when the index is first built
                storage_context=storage_context
            )
        else:
            retriever = index.as_retriever(similarity_top_k=12)
    except Exception:
        # No usable persisted index (e.g. first run): build one from the docs.
        if hierarchical:
            nodes, leaf_nodes = load_markdown_docs(directory, hierarchical=hierarchical)

            docstore = SimpleDocumentStore()
            docstore.add_documents(nodes)
            storage_context = StorageContext.from_defaults(docstore=docstore)

            index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
            index.storage_context.persist(persist_dir=f"./data_{os.path.basename(directory)}")

            retriever = AutoMergingRetriever(
                index.as_retriever(similarity_top_k=12), 
                storage_context=storage_context
            )

        else:
            nodes = load_markdown_docs(directory, hierarchical=hierarchical)
            index = VectorStoreIndex(nodes)
            index.storage_context.persist(persist_dir=f"./data_{os.path.basename(directory)}")

            retriever = index.as_retriever(similarity_top_k=12)

    query_engine = RetrieverQueryEngine.from_args(
        retriever,
        node_postprocessors=postprocessors or [],
    )

    return QueryEngineTool(query_engine=query_engine, metadata=ToolMetadata(name=directory, description=description))
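
One detail of `get_query_engine_tool` is worth spelling out: the persist directory is derived only from the last path component of `directory`, not from the `hierarchical` flag. The sketch below isolates that naming rule; `persist_dir_for` is a hypothetical helper that mirrors the f-string used above and is not part of the notebook's code.

```python
import os


def persist_dir_for(directory: str) -> str:
    """Hypothetical helper mirroring the f-string in get_query_engine_tool."""
    return f"./data_{os.path.basename(directory)}"


print(persist_dir_for("../docs/core_modules/query_modules"))  # ./data_query_modules
```

Because a hierarchical and a non-hierarchical index for the same directory map to the same folder, switching `hierarchical` for a directory means clearing the cached data first. Note also that a trailing slash would make `os.path.basename` return an empty string, so paths are best passed without one, as they are in this notebook.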

### Compare retrievers

You'll notice we included some code to enable/disable the hierarchical node parsing. Let's quickly compare the results.

In [7]:
hierarchical_engine = get_query_engine_tool(
    "../docs/core_modules/query_modules",
    "Useful for information on various query engines and retrievers, and anything related to querying data.",
    hierarchical=True, 
).query_engine

In [8]:
!rm -rf data_*

In [9]:
base_engine = get_query_engine_tool(
    "../docs/core_modules/query_modules",
    "Useful for information on various query engines and retrievers, and anything related to querying data.",
    hierarchical=False, 
).query_engine

In [10]:
from llama_index import QueryBundle
hierarchical_nodes = hierarchical_engine.retrieve(QueryBundle("How do I setup a query engine?"))
base_nodes = base_engine.retrieve(QueryBundle("How do I setup a query engine?"))

In [11]:
from llama_index.utils import globals_helper
from llama_index.schema import MetadataMode

print("\n--- Hierarchical ---\n")
print('\n---\n'.join([node.node.text for node in hierarchical_nodes]))

total_length = 0
for node in hierarchical_nodes:
    total_length += len(globals_helper.tokenizer(node.node.get_content(metadata_mode=MetadataMode.LLM)))
print(f"Total length: {total_length}")


--- Hierarchical ---

File Name: ./docs/core_modules/query_modules/query_engine/usage_pattern.md
Content Type: text
Header Path: Usage Pattern/Configuring a Query Engine/High-Level API
Links:

You can directly build and configure a query engine from an index in 1 line of code:

File Name: ./docs/core_modules/query_modules/query_engine/usage_pattern.md
Content Type: text
Header Path: Usage Pattern/Configuring a Query Engine/High-Level API
Links:

query_engine = index.as_query_engine(
    response_mode='tree_summarize',
    verbose=True,
)

File Name: ./docs/core_modules/query_modules/query_engine/usage_pattern.md
Content Type: text
Header Path: Usage Pattern/Configuring a Query Engine/High-Level API
Links:

> Note: While the high-level API optimizes for ease-of-use, it does *NOT* expose full range of configurability.See **Response Modes** for a full list of response modes and what they do.File Name: ./docs/core_modules/query_modules/query_engine/usage_pattern.md
Content Type: text
Head

In [12]:
print("\n--- Base ---\n")
print('\n---\n'.join([node.node.text for node in base_nodes]))

total_length = 0
for node in base_nodes:
    total_length += len(globals_helper.tokenizer(node.node.get_content(metadata_mode=MetadataMode.LLM)))
print(f"Total length: {total_length}")


--- Base ---

Query engine is a generic interface that allows you to ask question over your data.

A query engine takes in a natural language query, and returns a rich response.
It is most often (but not always) built on one or many Indices via Retrievers.
You can compose multiple query engines to achieve more advanced capability.
---
You can use the low-level composition API if you need more granular control.
Concretely speaking, you would explicitly construct a `QueryEngine` object instead of calling `index.as_query_engine(...)`.
> Note: You may need to look at API references or example notebooks.
---
To enable streaming, you need to use an LLM that supports streaming.
Right now, streaming is supported by `OpenAI`, `HuggingFaceLLM`, and most LangChain LLMs (via `LangChainLLM`).

Configure query engine to use streaming:

If you are using the high-level API, set `streaming=True` when building a query engine.
---
vector_query_engine = vector_index.as_query_engine()
vector_query_engine 

As you can see, the hierarchical query engine seems to return better text, but there is also a LOT of text.

If not enough nodes are merged in the retriever, we can end up with a lot of text, due to setting the top-k so high.

So, let's write a custom node-postprocessor to make sure this doesn't happen!

## Custom Node Post-Processor

Here, we use a very basic approach to approximate token counts. We keep nodes, in order, until adding the next one would exceed our token limit.
The nodes are already pre-sorted, so we don't have to worry about similarity scores here.

In [13]:
from typing import Callable, Optional

from llama_index.utils import globals_helper
from llama_index.schema import MetadataMode

class LimitRetrievedNodesLength:
    """Keep retrieved nodes, in order, until the token limit would be exceeded."""

    def __init__(self, limit: int = 3000, tokenizer: Optional[Callable] = None):
        self._tokenizer = tokenizer or globals_helper.tokenizer
        self.limit = limit

    def postprocess_nodes(self, nodes, query_bundle):
        included_nodes = []
        current_length = 0

        for node in nodes:
            current_length += len(self._tokenizer(node.node.get_content(metadata_mode=MetadataMode.LLM)))
            if current_length > self.limit:
                break
            included_nodes.append(node)

        return included_nodes
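
To see how this cut-off behaves, here is a small standalone sketch of the same loop over plain strings. The `limit_by_tokens` function and the whitespace tokenizer (`str.split`) are assumptions made for illustration; the class above counts tokens with the llama-index global tokenizer instead.

```python
from typing import Callable, List


def limit_by_tokens(
    texts: List[str],
    limit: int,
    tokenizer: Callable[[str], list] = str.split,  # crude whitespace tokenizer
) -> List[str]:
    """Return the longest prefix of `texts` whose total token count fits `limit`."""
    included: List[str] = []
    current_length = 0
    for text in texts:
        current_length += len(tokenizer(text))
        if current_length > limit:
            break  # stop at the first text that overflows the budget
        included.append(text)
    return included


texts = ["one two three", "four five six seven", "eight"]
print(limit_by_tokens(texts, limit=5))  # ['one two three']
```

Note that the loop stops at the first overflowing item rather than skipping it: in the example, `"eight"` would still fit in the remaining budget, but it is dropped because it ranks below the item that overflowed. This keeps the result a strict prefix of the ranked list.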

In [14]:
!rm -rf data_*



In [15]:
query_engine = get_query_engine_tool(
    "../docs/core_modules/query_modules",
    "Useful for information on various query engines and retrievers, and anything related to querying data.",
    hierarchical=True,
    postprocessors=[LimitRetrievedNodesLength(limit=3000)]
).query_engine

In [16]:
hierarchical_nodes = query_engine.retrieve(QueryBundle("How do I setup a query engine?"))
total_length = 0
for node in hierarchical_nodes:
    total_length += len(globals_helper.tokenizer(node.node.get_content(metadata_mode=MetadataMode.LLM)))
print(f"Total length: {total_length}")

Total length: 2971


## Final Query Engine

With our functions set up, we can load/create our indexes and create our final query engine across our documentation.

In [60]:
import nest_asyncio
nest_asyncio.apply()

from llama_index.query_engine import SubQuestionQueryEngine, RouterQueryEngine

# Here we define the directories we want to index, as well as a description for each
# NOTE: these descriptions are hand-written based on my understanding. We could have also
# used an LLM to write these, maybe a future experiment.
docs_directories = {
    "../docs/community": "Useful for information on community integrations with other libraries, vector dbs, and frameworks.", 
    "../docs/core_modules/agent_modules": "Useful for information on data agents and tools for data agents.", 
    "../docs/core_modules/data_modules": "Useful for information on data, storage, indexing, and data processing modules.",
    "../docs/core_modules/model_modules": "Useful for information on LLMs, embedding models, and prompts.",
    "../docs/core_modules/query_modules": "Useful for information on various query engines and retrievers, and anything related to querying data.",
    "../docs/core_modules/supporting_modules": "Useful for information on supporting modules, like callbacks, evaluators, and other supporting modules.",
    "../docs/getting_started": "Useful for information on getting started with LlamaIndex.", 
    "../docs/development": "Useful for information on contributing to LlamaIndex development.",
}

# Build query engine tools
query_engine_tools = [
    get_query_engine_tool(
        directory, 
        description, 
        hierarchical=True, 
        postprocessors=[LimitRetrievedNodesLength(limit=3000)]
    ) for directory, description in docs_directories.items()
]

# build top-level router -- this will route to multiple sub-indexes and aggregate results
# query_engine = SubQuestionQueryEngine.from_defaults(
#     query_engine_tools=query_engine_tools,
#     service_context=service_context,
#     verbose=False
# )

query_engine = RouterQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    select_multi=True,
)

In [61]:
from llama_index.response.notebook_utils import display_response
response = query_engine.query("How do I setup a ChromaDB Vector Store? Give me a code sample please.")
display_response(response)

**`Final Response:`** To setup a ChromaDB Vector Store, you can use the following code sample:

```python
import chromadb
from llama_index.vector_stores import ChromaVectorStore

chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("quickstart")

vector_store = ChromaVectorStore(
    chroma_collection=chroma_collection,
)
```

This code imports the necessary libraries, creates a ChromaDB client, and then creates a collection called "quickstart". Finally, it initializes a ChromaVectorStore using the created collection.

In [62]:
response = query_engine.query("How can I customize Document objects?")
display_response(response)

**`Final Response:`** To customize Document objects, you can include useful metadata using the `metadata` dictionary on each document. Additionally, you can customize the embedding metadata text by setting the `excluded_embed_metadata_keys` attribute to exclude specific metadata keys from being included in the embedding model. You can also customize the format of the metadata using attributes such as `metadata_separator` and `metadata_template`. Furthermore, you can pass in a service context to specific parts of the pipeline to override the default configuration. This allows you to set different components such as the LLM, embedding model, node parser, and prompt helper according to your requirements, thereby tailoring the behavior of the Document objects to suit your needs.

In [63]:
response = query_engine.query("How can I customize Document metadata?")
display_response(response)

**`Final Response:`** You can customize Document metadata in a few ways. 

First, you can exclude specific metadata keys from being visible to the LLM (Language Model) by using the `excluded_llm_metadata_keys` attribute. This allows you to exclude certain metadata from being read by the LLM during response synthesis.

Second, you can exclude metadata keys from being visible to the embedding model by using the `excluded_embed_metadata_keys` attribute. This is useful if you don't want certain text to bias the embeddings.

Additionally, you can customize the format of the metadata using the following attributes:
- `metadata_seperator`: controls the separator between each key/value pair of the metadata.
- `metadata_template`: controls how each key/value pair is formatted.
- `text_template`: controls how the metadata is joined with the text content of the document.

You can set the metadata dictionary in the document constructor or after the document is created. You can also set the filename automatically using the `SimpleDirectoryReader` and `file_metadata` hook.

Overall, customizing Document metadata allows you to control what metadata is visible to the LLM and embedding model, as well as the format of the metadata.

## Conclusion

Here, we covered a ton of concepts:
- Node Parsing and Retrievers, specifically the `AutoMergingRetriever` and `HierarchicalNodeParser`
- Node post-processors and custom node-postprocessing
- Setting up a `RouterQueryEngine`

The full code is available in the `llama_docs_bot` folder in the repo!