# Advanced RAG - Auto Merging Retrieval
* Notebook by Adam Lang
* Date: 3/24/2025

# Overview
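* In this notebook we will implement the Advanced RAG technique "Auto-Merging".
* Auto-merging retrieval splits a document into a hierarchy of chunks, retrieves the small leaf chunks, and replaces them with their larger parent chunk when enough sibling leaves are retrieved together.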

# Architecture
* We will use these technical dependencies to build this:

1. Data
  * To get data, we will crawl a web page using the `llama-index-readers-web` package.

2. Embeddings
  * We will use Hugging Face open source embeddings via the llama-index library.

3. LLM
  * We will use the llama-index LLM integrations to call open source models hosted by GROQ.

4. APIs
  * We will use llama-index.

# Install Dependencies

In [None]:
%%capture
!pip install llama-index llama-index-readers-web

In [None]:
%%capture
!pip install llama-index-embeddings-huggingface

In [None]:
%%capture
#!pip install llama-index-llms-openai
!pip install llama-index-llms-groq ## use groq instead of open ai

# Load Data
* We will crawl a webpage and extract the information using the llama-index `SimpleWebPageReader`
* This will allow us to load the webpage as a Document.

In [None]:
## imports to load webpage
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display # to display in notebook
import os

In [None]:
## load webpage as Document
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)

In [None]:
## view docs
documents

[Document(id_='http://paulgraham.com/worked.html', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='![](https://s.turbifycdn.com/aah/paulgraham/essays-5.gif)|\n![](https://sep.turbifycdn.com/ca/Img/trans_1x1.gif)|\n[![](https://s.turbifycdn.com/aah/paulgraham/essays-6.gif)](index.html)  \n  \n| ![What I Worked On](https://s.turbifycdn.com/aah/paulgraham/what-i-worked-\non-4.gif)  \n  \nFebruary 2021  \n  \nBefore college the two main things I worked on, outside of school, were\nwriting and programming. I didn\'t write essays. I wrote what beginning writers\nwere supposed to write then, and probably still are: short stories. My stories\nwere awful. They had hardly any plot, just characters with strong feelings,\nwhich I imagined made them deep.  \n  \nThe first programs I tried writing were on the IBM 140

# Setup LLM & Embedding Models

## LLM Setup
* We will use an open source LLM (Llama 3 70B) served through the GROQ API.
* Here are the llama-index docs for GROQ: https://docs.llamaindex.ai/en/stable/examples/llm/groq/

In [None]:
## imports for the LLM and API key handling
#from llama_index.llms.openai import OpenAI
from llama_index.llms.groq import Groq
from getpass import getpass
import os

#OPENAI_API_KEY = getpass("Enter your Open AI key: ")
GROQ_API_KEY = getpass("Enter your GROQ API key: ")

Enter your GROQ API key: ··········


In [None]:
## set Open AI env variable
#os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

## setup GROQ API env variable
os.environ['GROQ_API_KEY'] = GROQ_API_KEY
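
When re-running the notebook it can be convenient to reuse a key that is already exported in the environment and only prompt as a fallback. Below is a minimal sketch of that idea. `resolve_api_key` is a hypothetical helper written for this notebook; it is not part of llama-index or the GROQ client.

```python
import os
from getpass import getpass
from typing import Callable


def resolve_api_key(env_var: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return the key from the environment if present, otherwise prompt for it.

    Hypothetical helper for this notebook, not a library function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive (hidden) prompt.
        key = prompt(f"Enter your {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    # Export it so libraries that read the environment can find it.
    os.environ[env_var] = key
    return key
```

In this notebook it could be called as `GROQ_API_KEY = resolve_api_key("GROQ_API_KEY")`.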

In [None]:
## initialize the LLM model from llama-index
#llm = OpenAI(model='gpt-4o-mini-2024-07-18',temperature=0.0)

llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)

## Embeddings setup
* This is the model I will use: BAAI/bge-base-en-v1.5
* Model card: https://huggingface.co/BAAI/bge-base-en-v1.5

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

## load embedding model of choice
model_embed = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5",
                                   max_length=512)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5



INFO:sentence_transformers.SentenceTransformer:2 prompts are loaded, with the keys: ['query', 'text']


# Implementing Auto-Merging
* Here we will use llama-index to implement Auto-Merging

## Build the Hierarchical Node Parser
* We will use llama-index to do this.
* Full docs here: https://llamaindexxx.readthedocs.io/en/latest/api/llama_index.core.node_parser.HierarchicalNodeParser.html
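
To make the parent/child structure concrete, here is a small standard-library sketch of hierarchical chunking. It is a simplification under stated assumptions: `ToyNode`, `build_hierarchy`, and `toy_leaf_nodes` are hypothetical names, the "tokens" are placeholder strings, and chunks are cut at fixed offsets. The library's parser splits along sentence boundaries, so real node counts will differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyNode:
    node_id: int
    tokens: List[str]
    parent_id: Optional[int] = None
    child_ids: List[int] = field(default_factory=list)


def build_hierarchy(tokens: List[str], chunk_sizes: List[int]) -> List[ToyNode]:
    """Split tokens into nested chunks, largest chunk size first."""
    nodes: List[ToyNode] = []

    def split(span: List[str], level: int, parent: Optional[ToyNode]) -> None:
        size = chunk_sizes[level]
        for start in range(0, len(span), size):
            node = ToyNode(
                node_id=len(nodes),
                tokens=span[start:start + size],
                parent_id=None if parent is None else parent.node_id,
            )
            nodes.append(node)
            if parent is not None:
                parent.child_ids.append(node.node_id)
            # Recurse into the next, smaller chunk size if there is one.
            if level + 1 < len(chunk_sizes):
                split(node.tokens, level + 1, node)

    split(tokens, 0, None)
    return nodes


def toy_leaf_nodes(nodes: List[ToyNode]) -> List[ToyNode]:
    """Leaf nodes are the ones without children."""
    return [n for n in nodes if not n.child_ids]


toy_tokens = [f"tok{i}" for i in range(1024)]
toy_nodes = build_hierarchy(toy_tokens, chunk_sizes=[512, 256, 128])
toy_leaves = toy_leaf_nodes(toy_nodes)
print(len(toy_nodes), len(toy_leaves))  # prints: 14 8
```

With 1024 toy tokens this gives 2 nodes of 512, 4 nodes of 256, and 8 leaf nodes of 128.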

In [None]:
## imports
from llama_index.core.node_parser import HierarchicalNodeParser
from llama_index.core.node_parser import get_leaf_nodes

In [None]:
512/2

256.0

In [None]:
256/2

128.0

In [None]:
## 1. initialize chunk sizes: Parent --> Child --> Grandchild (leaf)
## Each chunk size is 1/2 the size of its parent
## chunk sizes are measured in tokens
chunk_sizes = [512, 256, 128]

## 2. Create Node parser
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)

## 3. create nodes
nodes = node_parser.get_nodes_from_documents(documents)

## 4. leaf nodes
leaf_nodes = get_leaf_nodes(nodes)

In [None]:
## number of leaf nodes
len(leaf_nodes)

229

# Examine Leaf and Parent nodes
* Note: leaf nodes do not have any children.

In [None]:
## leaf node
print(leaf_nodes[50].text)

In everyday
life it would be distracting to notice every leaf on every bush. But when you
have to paint something, you have to look more closely, and when you do
there's a lot to see. You can still be noticing new things after days of
trying to paint something people usually take for granted, just as you can
after days of trying to write an essay about something people usually take for
granted.  
  
This is not the only way to paint. I'm not 100% sure it's even a good way to
paint.


In [None]:
## build a lookup table: node id --> node
nodes_by_id = {node.node_id: node for node in nodes}

## get parent node
parent_node = nodes_by_id[leaf_nodes[50].parent_node.node_id]
print(parent_node.text)

[4]  
  
I liked painting still lives because I was curious about what I was seeing. In
everyday life, we aren't consciously aware of much we're seeing. Most visual
perception is handled by low-level processes that merely tell your brain
"that's a water droplet" without telling you details like where the lightest
and darkest points are, or "that's a bush" without telling you the shape and
position of every leaf. This is a feature of brains, not a bug. In everyday
life it would be distracting to notice every leaf on every bush. But when you
have to paint something, you have to look more closely, and when you do
there's a lot to see. You can still be noticing new things after days of
trying to paint something people usually take for granted, just as you can
after days of trying to write an essay about something people usually take for
granted.  
  
This is not the only way to paint. I'm not 100% sure it's even a good way to
paint. But it seemed a good enough bet to be worth trying.  
  


Summary
* We can see that the parent node contains the full text of the leaf node plus the surrounding context, so it is much bigger than the leaf node.
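
The containment can also be checked programmatically rather than by eye. Below is a minimal sketch using shortened excerpts of the two outputs printed above. `normalize` and `leaf_within_parent` are hypothetical helper names.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not affect the comparison."""
    return " ".join(text.split())


def leaf_within_parent(leaf_text: str, parent_text: str) -> bool:
    """True if the leaf text appears verbatim (ignoring whitespace) in the parent."""
    return normalize(leaf_text) in normalize(parent_text)


# Shortened excerpts of the leaf and parent outputs shown above.
leaf_excerpt = (
    "This is not the only way to paint. I'm not 100% sure it's even a good way to\n"
    "paint."
)
parent_excerpt = (
    "granted.  \n  \n"
    "This is not the only way to paint. I'm not 100% sure it's even a good way to\n"
    "paint. But it seemed a good enough bet to be worth trying."
)
print(leaf_within_parent(leaf_excerpt, parent_excerpt))  # prints: True
```

On the notebook objects the same check would be `leaf_within_parent(leaf_nodes[50].text, parent_node.text)`.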

# Important Steps
1. Service Context Creation
2. Storage Context Creation
3. Vector Store Index Creation

In [None]:
## llama index imports
from llama_index.core import Settings ## Same as `ServiceContext` in legacy llama-index
from llama_index.core import VectorStoreIndex, StorageContext

## 1. Set Service context for auto merging using Settings
Settings.llm = llm
Settings.embed_model = model_embed
Settings.node_parser = node_parser


* All nodes (parents and leaves) are stored in the DocStore (StorageContext) so that parent nodes can be looked up later.
* The index is created from the leaf nodes, so only the leaf nodes need to be embedded (VectorStoreIndex).
* The hierarchical relationships between PARENT and LEAF nodes are created when the document is parsed, and the retriever uses them during the auto-merging process.
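
The merging step itself can be sketched with plain dictionaries. This is a simplified, single-pass illustration and not the library's implementation. `merge_into_parents` and its arguments are hypothetical names, and `ratio_threshold` is an assumed tuning knob. As far as I know, `AutoMergingRetriever` applies a comparable ratio test through its `simple_ratio_thresh` argument and repeats it up the hierarchy.

```python
from collections import defaultdict
from typing import Dict, List, Set


def merge_into_parents(
    retrieved: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    ratio_threshold: float = 0.5,
) -> List[str]:
    """Replace retrieved siblings with their parent when enough of them were retrieved."""
    # Count which children of each parent were retrieved.
    hits: Dict[str, Set[str]] = defaultdict(set)
    for node_id in retrieved:
        parent_id = parent_of.get(node_id)
        if parent_id is not None:
            hits[parent_id].add(node_id)

    merged: List[str] = []
    for node_id in retrieved:
        parent_id = parent_of.get(node_id)
        if (
            parent_id is not None
            and len(hits[parent_id]) / len(children_of[parent_id]) > ratio_threshold
        ):
            # Enough siblings were retrieved: return the parent once instead.
            if parent_id not in merged:
                merged.append(parent_id)
        else:
            merged.append(node_id)
    return merged


toy_children = {"P1": ["a", "b"], "P2": ["c", "d"]}
toy_parents = {child: parent for parent, kids in toy_children.items() for child in kids}
print(merge_into_parents(["a", "c", "b"], toy_parents, toy_children))  # prints: ['P1', 'c']
```

Both children of `P1` were retrieved, so they are replaced by `P1`. Only one of the two children of `P2` was retrieved, which does not exceed the threshold, so `c` is returned unchanged.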

In [None]:
## 2. DocStore creation using all nodes
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

In [None]:
## 3. VectorStore Index Creation
auto_merging_index = VectorStoreIndex(leaf_nodes,
                                      storage_context=storage_context,
                                      )


In [None]:
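## setup parameters
re_rank = True          ## apply the reranker after merging
similarity_top_k = 12   ## number of leaf nodes fetched by the base retriever
rerank_top_n = 4        ## number of nodes kept after reranking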
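
The two numeric parameters describe a two-stage pipeline: a cheap first-stage search keeps the top `similarity_top_k` candidates, then a more expensive reranker rescores only those and keeps the top `rerank_top_n`. Below is a minimal sketch with made-up scores. `retrieve_then_rerank` and the toy score tables are hypothetical and stand in for the vector search and the cross-encoder.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_then_rerank(
    first_stage_scores: Dict[str, float],
    rerank_score: Callable[[str], float],
    top_k: int,
    top_n: int,
) -> List[Tuple[str, float]]:
    # Stage 1: keep the top_k candidates by the cheap first-stage score.
    candidates = sorted(first_stage_scores, key=first_stage_scores.get, reverse=True)[:top_k]
    # Stage 2: rescore only those candidates and keep the best top_n.
    rescored = [(node_id, rerank_score(node_id)) for node_id in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


# Made-up scores for illustration only.
toy_similarity = {"n1": 0.9, "n2": 0.8, "n3": 0.7, "n4": 0.1}
toy_rerank = {"n1": 0.2, "n2": 0.95, "n3": 0.6, "n4": 0.99}

print(retrieve_then_rerank(toy_similarity, toy_rerank.get, top_k=3, top_n=2))
# prints: [('n2', 0.95), ('n3', 0.6)]
```

Note that `n4` has the best rerank score but is never considered, because it did not survive the first stage. This is why `similarity_top_k` is set larger than `rerank_top_n`.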

In [None]:
## imports for auto merging
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SentenceTransformerRerank

## Setup Retrievers

In [None]:
## base retriever
base_retriever = auto_merging_index.as_retriever(similarity_top_k=similarity_top_k)

## automerging retriever
retriever = AutoMergingRetriever(base_retriever,
                                 auto_merging_index.storage_context,
                                 verbose=True)


## setup logic for retrieve -- merge -- rerank
if re_rank:
  rerank = SentenceTransformerRerank(top_n=rerank_top_n,
                                     model="BAAI/bge-reranker-base")
  ## note: the keyword is `node_postprocessors` (plural)
  auto_merging_engine = RetrieverQueryEngine.from_args(retriever,
                                                       llm=llm,
                                                       node_postprocessors=[rerank])

else:
  auto_merging_engine = RetrieverQueryEngine.from_args(retriever,
                                                       llm=llm)



In [32]:
## Create auto merging retriever function
def auto_merge_retriever(auto_merging_engine, query):
  """
  Run a query through the auto-merging query engine.

  Returns the answer text and the list of source chunks used as context.
  """
  merging_response = auto_merging_engine.query(query)

  ## store the text of each source node in a list
  context = []
  for node in merging_response.source_nodes:
    context.append(node.node.get_content())

  ## extract the answer text from the response object
  response = merging_response.response

  return response, context

# Test Queries

In [31]:
## query 1
query = "Where is he from?"

In [33]:
## get response, context
response, context = auto_merge_retriever(auto_merging_engine, query)


INFO:llama_index.core.retrievers.auto_merging_retriever:> Merging 2 nodes into parent node.
> Parent node id: d09eacc7-723d-4b69-bcae-d81c1b3cc628.
> Parent node text: It's called Yorkville, and that
was my new home. Now I was a New York artist  in the strictly te...

INFO:llama_index.core.retrievers.auto_merging_retriever:> Merging 2 nodes into parent node.
> Parent node id: bb5111cc-12d2-49c1-86d0-6fd7f271a361.
> Parent node text: If he even knew about the strange classes I was taking, he never said
anything.  
  
So now I was...

INFO:llama_index.core.retrievers.auto_merging_retriever:> Merging 2 nodes into parent node.
> Parent node id: 1c3e6571-e1e1-4065-b69d-ce3b99a7a5bd.
> Parent node text: It was exciting for a while. Painting started to go better. I
experimented with a new kind of sti...




INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


In [34]:
## show response
response

'He is from England, as he is a British citizen by birth.'

In [35]:
## context for response
context

["In the summer of 2016 we moved to England. We wanted our kids to see what it\nwas like living in another country, and since I was a British citizen by\nbirth, that seemed the obvious choice. We only meant to stay for a year, but\nwe liked it so much that we still live there. So most of Bel was written in\nEngland.  \n  \nIn the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp,\nit's a spec rather than an implementation, although like McCarthy's Lisp it's\na spec expressed as code.",
 "So in 1993 I\ndropped out. I hung around Providence for a bit, and then my college friend\nNancy Parmet did me a big favor. A rent-controlled apartment in a building her\nmother owned in New York was becoming vacant. Did I want it? It wasn't much\nmore than my current place, and New York was supposed to be where the artists\nwere. So yes, I wanted it! [7]  \n  \nAsterix comics begin by zooming in on a tiny corner of Roman Gaul that turns\nout not to be controlled by the Romans.",
 "

Summary
* We can see that by auto-merging child chunks into their parent chunks, we get a succinct response from the LLM.