
<a href="https://colab.research.google.com/github/edumunozsala/llamaindex-RAG-techniques/blob/main/auto-merging-retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Auto Merging Retrieval

The idea is similar to the Sentence Window Retriever: search over more granular pieces of information, then extend the context window before feeding that context to an LLM for reasoning. Documents are split into smaller child chunks that refer to larger parent chunks.
During retrieval, the smaller chunks are fetched first. If more than n of the top-k retrieved chunks are linked to the same parent node (a larger chunk), the context fed to the LLM is replaced by that parent node. This works like automatically merging a few retrieved chunks into a larger parent chunk, hence the method name.

In this notebook, we showcase the `AutoMergingRetriever`, which looks at a set of leaf nodes and recursively “merges” subsets of leaf nodes that reference a parent node beyond a given threshold. This allows us to consolidate potentially disparate, smaller contexts into a larger context that might help synthesis.
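
To make the merging rule concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function `auto_merge`, the `ratio_threshold` parameter, and the toy hierarchy are hypothetical names introduced for illustration, and the sketch assumes a parent replaces its children once more than half of them have been retrieved.

```python
from collections import defaultdict

# Toy hierarchy (hypothetical): root R -> parents A, B -> four leaves each
children_of = {
    "R": ["A", "B"],
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}


def auto_merge(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved siblings by their parent when enough of them were retrieved.

    The loop repeats until nothing changes, so a merged parent can itself be
    merged into its own parent (the "recursive" part of the description).
    """
    current = list(dict.fromkeys(retrieved_ids))  # de-duplicate, keep order
    changed = True
    while changed:
        changed = False
        # Group the nodes currently in the result by their parent
        by_parent = defaultdict(list)
        for node_id in current:
            parent = parent_of.get(node_id)
            if parent is not None:
                by_parent[parent].append(node_id)
        for parent, kids in by_parent.items():
            ratio = len(kids) / len(children_of[parent])
            if ratio > ratio_threshold:
                # Drop the children and keep the larger parent instead
                current = [n for n in current if n not in kids]
                if parent not in current:
                    current.append(parent)
                changed = True
    return current


print(auto_merge(["a1", "a2", "a3", "b1"], parent_of, children_of))
# ['b1', 'A']
print(auto_merge(["a1", "a2", "a3", "b1", "b2", "b3"], parent_of, children_of))
# ['R']
```

In the first call, three of the four children of `A` were retrieved, so they collapse into `A`, while the single hit under `B` stays a leaf. In the second call both parents are merged, and then the two parents merge again into the root.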

You can define this hierarchy yourself over a set of documents, or you can use the `HierarchicalNodeParser`, a text parser that takes in a candidate set of documents and outputs an entire hierarchy of nodes, from coarse to fine.

Original code from LlamaIndex: https://docs.llamaindex.ai/en/latest/examples/retrievers/auto_merging_retriever.html#

In [14]:
import os
import openai

from llama_index import ServiceContext, VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.llms import OpenAI

from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes, get_root_nodes
from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

Load the API Keys:

In [1]:
from dotenv import load_dotenv

# Load the environment variables
load_dotenv()

True

## Setup


If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

### Load the data

In [2]:
from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('./data/Attention is all you need.pdf'))

In [3]:
documents

[Document(id_='9f20d3c9-0b6f-4f5a-93da-e011de57707d', embedding=None, metadata={'page_label': '1', 'file_name': 'Attention is all you need.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='215a4e4cfd187b7f1951b4796bf528de8e3c7a090794c107967fc03077f7dc5c', text='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propos

## Auto Merging Retrieval Setup

### Parse Chunk Hierarchy from Text, Load into Storage

In this section we make use of the HierarchicalNodeParser. This will output a hierarchy of nodes, from top-level nodes with bigger chunk sizes to child nodes with smaller chunk sizes, where each child node has a parent node with a bigger chunk size.

By default, the hierarchy is:

1st level: chunk size 2048

2nd level: chunk size 512

3rd level: chunk size 128

We then load these nodes into storage. The leaf nodes are indexed and retrieved via a vector store - these are the nodes that will first be directly retrieved via similarity search. The other nodes will be retrieved from a docstore.
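
The parent/child structure described above can be illustrated with a small self-contained sketch. It is only an approximation under stated assumptions: chunk sizes are counted in characters, there is no overlap, and sentence boundaries are ignored, unlike the tokenizer-based splitting the library performs. The names `Chunk`, `build_hierarchy`, `leaf_chunks`, and `root_chunks` are hypothetical stand-ins, not LlamaIndex APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: int
    text: str
    level: int                        # 0 = coarsest level
    parent_id: Optional[int] = None   # None for top-level chunks
    child_ids: List[int] = field(default_factory=list)


def build_hierarchy(text: str, chunk_sizes: List[int]) -> List[Chunk]:
    """Split text into chunks, then re-split every chunk at the next, smaller size."""
    chunks: List[Chunk] = []

    def split(segment: str, level: int, parent_id: Optional[int]) -> None:
        size = chunk_sizes[level]
        for start in range(0, len(segment), size):
            piece = segment[start:start + size]
            chunk = Chunk(len(chunks), piece, level, parent_id)
            chunks.append(chunk)  # chunk_id equals its index in `chunks`
            if parent_id is not None:
                chunks[parent_id].child_ids.append(chunk.chunk_id)
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, chunk.chunk_id)

    split(text, 0, None)
    return chunks


def leaf_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Chunks without children: the ones that would go into the vector index."""
    return [c for c in chunks if not c.child_ids]


def root_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Chunks without a parent: the coarsest level."""
    return [c for c in chunks if c.parent_id is None]


chunks = build_hierarchy("x" * 4096, [2048, 512, 128])
print(len(root_chunks(chunks)))  # 2
print(len(leaf_chunks(chunks)))  # 32
print(len(chunks))               # 42 = 2 + 8 + 32
```

Every chunk, at every level, is kept in one list (the role the docstore plays in the notebook), while only the leaves would be embedded and searched.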


In [5]:
# create the hierarchical node parser w/ default settings
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)

In [6]:
nodes = node_parser.get_nodes_from_documents(documents)
print(len(nodes))

168


Here we import a simple helper function for fetching “leaf” nodes within a node list. These are nodes that don’t have children of their own.

In [7]:
leaf_nodes = get_leaf_nodes(nodes)
print(leaf_nodes[10].text)
print(len(leaf_nodes))

Recent work has achieved
signiﬁcant improvements in computational efﬁciency through factorization tricks [ 21] and conditional
computation [ 32], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [ 2,19].
126


We can also inspect the root nodes:

In [9]:
root_nodes = get_root_nodes(nodes)
print(root_nodes[5].text)
print(len(root_nodes))

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. nis the sequence length, dis the representation dimension, kis the kernel
size of convolutions and rthe size of the neighborhood in restricted self-attention.
Layer Type Complexity per Layer Sequential Maximum Path Length
Operations
Self-Attention O(n2·d) O(1) O(1)
Recurrent O(n·d2) O(n) O(n)
Convolutional O(k·n·d2)O(1) O(logk(n))
Self-Attention (restricted) O(r·n·d)O(1) O(n/r)
tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed. There are many choices of positional encodings,
learned and ﬁxed [9].
In this work, we use sine and cosine functions of different frequencies:
PE(pos,2i)=sin(pos/100002i/d model)
PE(pos,2i+1)=cos(pos/100002i/d model)
whereposis the position and iis 

## Building the index

In [12]:
# Define the LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
# Create the context definition
auto_merging_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser,
)
# Create the storage context
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)



In [13]:
# Build the Vector index
automerging_index = VectorStoreIndex(
    leaf_nodes, storage_context=storage_context, service_context=auto_merging_context
)

automerging_index.storage_context.persist(persist_dir="./merging_index")

The next section creates the vector index and saves it to disk or, if it's already been created, the vector index is read from disk.

In [17]:
if not os.path.exists("./merging_index"):
    storage_context = StorageContext.from_defaults()
    storage_context.docstore.add_documents(nodes)

    automerging_index = VectorStoreIndex(
            leaf_nodes,
            storage_context=storage_context,
            service_context=auto_merging_context
        )

    automerging_index.storage_context.persist(persist_dir="./merging_index")
else:
    automerging_index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./merging_index"),
        service_context=auto_merging_context
    )


## Defining the retriever



Here, we define the retriever and a post-processor that reranks the retrieved nodes.

In [22]:
automerging_retriever = automerging_index.as_retriever(
    similarity_top_k=6
)

retriever = AutoMergingRetriever(
    automerging_retriever, 
    automerging_index.storage_context, 
    verbose=True
)

rerank = SentenceTransformerRerank(top_n=6, model="BAAI/bge-reranker-base")

auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)

query_engine = RetrieverQueryEngine.from_args(retriever)
base_query_engine = RetrieverQueryEngine.from_args(automerging_retriever)

Let's check how it works:

In [23]:
# query_str = "What were some lessons learned from red-teaming?"
# query_str = "Can you tell me about the key concepts for safety finetuning"
query_str = (
    "How are transformers related to convolutional neural networks?"
)

nodes = retriever.retrieve(query_str)
base_nodes = automerging_retriever.retrieve(query_str)

In [24]:
print(len(nodes))
print(len(base_nodes))

6
6


Inspect the nodes

In [18]:
from llama_index.response.notebook_utils import display_source_node

for node in nodes:
    display_source_node(node, source_length=10000)

**Node ID:** 8490da0d-b7ef-4bb4-b881-57438ed24893<br>**Similarity:** 0.7504335093880229<br>**Text:** In all but a few cases [ 27], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for signiﬁcantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.<br>

**Node ID:** 12f31f51-48d5-4658-b74f-02f574af14dc<br>**Similarity:** 0.7464125792384666<br>**Text:** 7 Conclusion
In this work, we presented the Transformer, the ﬁrst sequence transduction model based entirely on
attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with
multi-headed self-attention.
For translation tasks, the Transformer can be trained signiﬁcantly faster than architectures based
on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks, we achieve a new state of the art.<br>

**Node ID:** 8f9e0572-88c0-495d-b215-97d5db06633e<br>**Similarity:** 0.742476441706261<br>**Text:** To the best of our knowledge, however, the Transformer is the ﬁrst transduction model relying
entirely on self-attention to compute representations of its input and output without using sequence-
aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [17, 18] and [9].
3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [ 5,2,35].<br>

**Node ID:** 0f62036f-d767-4b88-9122-e101478b015f<br>**Similarity:** 0.7345469413593934<br>**Text:** See Figure 2.
3.3 Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.
FFN(x) = max(0,xW 1+b1)W2+b2 (2)
While the linear transformations are the same across different positions, they use different parameters
from layer to layer.<br>

**Node ID:** 97b5295a-18c7-4256-8b76-b13da7f333f8<br>**Similarity:** 0.7339010441162862<br>**Text:** We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU.<br>

**Node ID:** e7eb96c6-f092-4455-9616-11adcadfbd2f<br>**Similarity:** 0.7298574313030675<br>**Text:** The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
2<br>

Inspect the base nodes

In [19]:
for node in base_nodes:
    display_source_node(node, source_length=10000)

**Node ID:** 8490da0d-b7ef-4bb4-b881-57438ed24893<br>**Similarity:** 0.7504335093880229<br>**Text:** In all but a few cases [ 27], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for signiﬁcantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.<br>

**Node ID:** 12f31f51-48d5-4658-b74f-02f574af14dc<br>**Similarity:** 0.7464125792384666<br>**Text:** 7 Conclusion
In this work, we presented the Transformer, the ﬁrst sequence transduction model based entirely on
attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with
multi-headed self-attention.
For translation tasks, the Transformer can be trained signiﬁcantly faster than architectures based
on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks, we achieve a new state of the art.<br>

**Node ID:** 8f9e0572-88c0-495d-b215-97d5db06633e<br>**Similarity:** 0.742476441706261<br>**Text:** To the best of our knowledge, however, the Transformer is the ﬁrst transduction model relying
entirely on self-attention to compute representations of its input and output without using sequence-
aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [17, 18] and [9].
3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [ 5,2,35].<br>

**Node ID:** 0f62036f-d767-4b88-9122-e101478b015f<br>**Similarity:** 0.7345469413593934<br>**Text:** See Figure 2.
3.3 Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.
FFN(x) = max(0,xW 1+b1)W2+b2 (2)
While the linear transformations are the same across different positions, they use different parameters
from layer to layer.<br>

**Node ID:** 97b5295a-18c7-4256-8b76-b13da7f333f8<br>**Similarity:** 0.7339010441162862<br>**Text:** We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU.<br>

**Node ID:** e7eb96c6-f092-4455-9616-11adcadfbd2f<br>**Similarity:** 0.7298574313030675<br>**Text:** The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
2<br>
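
Both listings above contain the same six node IDs, so for this query no group of leaves was replaced by a parent. A small helper can make that comparison explicit. This is a sketch on plain ID lists; `diff_retrievals` is a hypothetical name, and with the objects returned by the retrievers the IDs would first have to be collected, for example via `n.node.node_id` (assumed attribute path).

```python
def diff_retrievals(base_ids, merged_ids):
    """Report which base nodes disappeared and which new nodes took their place."""
    base, merged = set(base_ids), set(merged_ids)
    return {
        "replaced_leaves": sorted(base - merged),  # leaves dropped by merging
        "new_parents": sorted(merged - base),      # parents that were added
    }


# Node IDs copied from the two listings above
ids = [
    "8490da0d-b7ef-4bb4-b881-57438ed24893",
    "12f31f51-48d5-4658-b74f-02f574af14dc",
    "8f9e0572-88c0-495d-b215-97d5db06633e",
    "0f62036f-d767-4b88-9122-e101478b015f",
    "97b5295a-18c7-4256-8b76-b13da7f333f8",
    "e7eb96c6-f092-4455-9616-11adcadfbd2f",
]

print(diff_retrievals(base_ids=ids, merged_ids=ids))
# {'replaced_leaves': [], 'new_parents': []}
```

Two empty lists mean that the auto-merging retriever returned exactly what the base retriever returned.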

## Run the Query engine


Now we can run the query engine that combines the auto-merging retriever with the reranker post-processor.

In [25]:
response = auto_merging_engine.query(query_str)

In [26]:
from llama_index.response.notebook_utils import display_response

display_response(response)

**`Final Response:`** Transformers are related to convolutional neural networks (CNNs) in that they both are used for neural sequence transduction tasks. However, transformers differ from CNNs in their architecture. Transformers rely entirely on self-attention to compute representations of their input and output, without using sequence-aligned recurrent neural networks (RNNs) or convolution. This makes transformers more parallelizable and allows for faster training compared to architectures based on recurrent or convolutional layers.

We can also check the base retriever:

In [27]:
response = base_query_engine.query(query_str)
display_response(response)

**`Final Response:`** Transformers are not directly related to convolutional neural networks (CNNs). The Transformer model architecture, as described in the context, is based solely on attention mechanisms and does not use recurrence or convolutions. While CNNs are commonly used in various tasks, such as image recognition, the Transformer model is specifically designed for sequence transduction tasks, such as machine translation. Therefore, Transformers and CNNs are different types of neural network architectures with different design principles and applications.

## Refactor the code

In the deeplearning.ai course "building-eval-advanced-rag-llamaindex", a reranker is added to the post-processing stage to improve the retrieved context. In this section we refactor the code above into two reusable functions that include the reranker.

In [28]:

def build_automerging_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="merging_index",
    chunk_sizes=None,
):
    chunk_sizes = chunk_sizes or [2048, 512, 128]
    node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
    nodes = node_parser.get_nodes_from_documents(documents)
    leaf_nodes = get_leaf_nodes(nodes)
    merging_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
    )
    if not os.path.exists(save_dir):
        # Build the docstore only when the index is created for the first time
        storage_context = StorageContext.from_defaults()
        storage_context.docstore.add_documents(nodes)
        automerging_index = VectorStoreIndex(
            leaf_nodes, storage_context=storage_context, service_context=merging_context
        )
        automerging_index.storage_context.persist(persist_dir=save_dir)
    else:
        automerging_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=save_dir),
            service_context=merging_context,
        )
    return automerging_index


def get_automerging_query_engine(
    automerging_index,
    similarity_top_k=12,
    rerank_top_n=6,
):
    base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
    retriever = AutoMergingRetriever(
        base_retriever, automerging_index.storage_context, verbose=True
    )
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )
    auto_merging_engine = RetrieverQueryEngine.from_args(
        retriever, node_postprocessors=[rerank]
    )
    return auto_merging_engine
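
The two parameters `similarity_top_k` and `rerank_top_n` describe a two-stage selection: a cheap similarity search builds a shortlist, and a more precise scorer picks the final nodes from it. The sketch below illustrates only that pattern with made-up scores. `retrieve_then_rerank` and the two score tables are hypothetical, and no embedding or reranker model is involved.

```python
def retrieve_then_rerank(candidates, first_stage_score, rerank_score,
                         similarity_top_k=12, rerank_top_n=6):
    """Keep the top-k by a cheap score, then the top-n of those by a precise score."""
    shortlist = sorted(candidates, key=first_stage_score, reverse=True)[:similarity_top_k]
    return sorted(shortlist, key=rerank_score, reverse=True)[:rerank_top_n]


# Made-up scores for five passages
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.1}      # e.g. embedding similarity
precise = {"a": 0.2, "b": 0.95, "c": 0.5, "d": 0.99, "e": 1.0}  # e.g. reranker relevance

result = retrieve_then_rerank(
    list(cheap), cheap.get, precise.get, similarity_top_k=4, rerank_top_n=2
)
print(result)  # ['d', 'b']
```

Passage `e` has the best reranker score but never reaches the second stage, which is why a larger `similarity_top_k` than `rerank_top_n` can be useful. When both values are equal, the reranker can reorder the shortlist but cannot filter it.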

Now we can call these functions to create the auto-merging index and query engine, and then run a query.

In [29]:
# Create the index and the retriever
index = build_automerging_index(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    save_dir="./merging_index",
)
# Get the query engine
query_engine = get_automerging_query_engine(index, similarity_top_k=6)

In [30]:
response = query_engine.query(
   "How are transformers related to convolutional neural networks?"
)

In [31]:

display_response(response)

**`Final Response:`** Transformers are related to convolutional neural networks (CNNs) in that they both are used for sequence transduction tasks. However, transformers differ from CNNs in their architecture. Transformers rely entirely on self-attention to compute representations of their input and output, without using sequence-aligned recurrent neural networks (RNNs) or convolution. This makes transformers more parallelizable and allows for faster training compared to architectures based on recurrent or convolutional layers.