# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

In [1]:
# NOTE: An OpenAI API key must be set here for application initialization, even if not in use.
# If you're not utilizing OpenAI models, assign a placeholder string (e.g., "not_used").
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"


In [None]:
# Cinderella story defined in sample.txt
with open('demo/sample.txt', 'r') as file:
    text = file.read()

print(text[:100])

1) **Building**: RAPTOR recursively embeds, clusters, and summarizes chunks of text to construct a tree with varying levels of summarization from the bottom up. You can create a tree from the text in 'sample.txt' using `RA.add_documents(text)`.

2) **Querying**: At inference time, the RAPTOR model retrieves information from this tree, integrating data across lengthy documents at different abstraction levels. You can perform queries on the tree with `RA.answer_question`.
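The building step described above can be sketched with toy components. This is a minimal illustration, not RAPTOR's implementation: `embed`, `cluster`, `summarize`, and `build_tree` are hypothetical helpers, word counts stand in for a real embedding model, greedy pairing stands in for the clustering algorithm, and truncation stands in for an LLM summary.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(nodes, size=2):
    """Toy clustering: group each seed node with its most similar unassigned nodes."""
    remaining = list(nodes)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(key=lambda n: cosine(embed(seed), embed(n)), reverse=True)
        clusters.append([seed] + remaining[: size - 1])
        remaining = remaining[size - 1 :]
    return clusters


def summarize(texts, max_words=12):
    """Toy summary: the first max_words words of the joined texts (stand-in for an LLM)."""
    return " ".join(" ".join(texts).split()[:max_words])


def build_tree(chunks, max_layers=5):
    """Return a list of layers: layers[0] holds the leaf chunks, later layers hold summaries."""
    layers = [list(chunks)]
    # Stop once a single root remains or max_layers summary layers have been added.
    while len(layers[-1]) > 1 and len(layers) <= max_layers:
        layers.append([summarize(group) for group in cluster(layers[-1])])
    return layers


layers = build_tree(["a b", "a c", "x y", "x z"])
print([len(layer) for layer in layers])
```

Each pass embeds the current layer, groups similar nodes, and replaces each group with one summary node, so the layers shrink until a single root remains.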

### Building the tree

In [2]:
from raptor import RetrievalAugmentation 

2025-04-19 21:13:08,404 - Loading faiss with AVX2 support.
2025-04-19 21:13:08,555 - Successfully loaded faiss with AVX2 support.


In [None]:
RA = RetrievalAugmentation()

# construct the tree
RA.add_documents(text)

### Querying from the tree

```python
question = "your question here"  # any question
RA.answer_question(question)
```

In [None]:
question = "How did Cinderella reach her happy ending?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)

In [None]:
# Save the tree by calling RA.save("path/to/save")
SAVE_PATH = "demo/cinderella"
RA.save(SAVE_PATH)

In [None]:
# Load the saved tree by passing its path to RetrievalAugmentation

RA = RetrievalAugmentation(tree=SAVE_PATH)

answer = RA.answer_question(question=question)
print("Answer: ", answer)

## Using other Open Source Models for Summarization/QA/Embeddings

To use other models such as Llama or Mistral, define your own model classes by extending RAPTOR's base classes, then pass them to RAPTOR through `RetrievalAugmentationConfig`.
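Before loading a large model, it can help to check the wiring with lightweight stand-ins. The sketch below mirrors the three method names used in this notebook (`summarize`, `answer_question`, `create_embedding`). The class names `EchoSummarizationModel`, `EchoQAModel`, and `HashEmbeddingModel` are hypothetical, and the classes are plain Python so the block runs on its own; to plug them into RAPTOR they would presumably need to extend `BaseSummarizationModel`, `BaseQAModel`, and `BaseEmbeddingModel`, as the real models below do.

```python
import zlib


class EchoSummarizationModel:
    """Hypothetical stand-in: 'summarizes' by keeping the first max_tokens words."""

    def summarize(self, context, max_tokens=150):
        return " ".join(context.split()[:max_tokens])


class EchoQAModel:
    """Hypothetical stand-in: echoes the question with an excerpt of the context."""

    def answer_question(self, context, question):
        return f"Q: {question}\nContext excerpt: {context[:200]}"


class HashEmbeddingModel:
    """Hypothetical stand-in: hashes each word into a fixed-size count vector."""

    def __init__(self, dim=16):
        self.dim = dim

    def create_embedding(self, text):
        vec = [0.0] * self.dim
        for word in text.lower().split():
            # crc32 is deterministic across runs, unlike Python's built-in hash()
            vec[zlib.crc32(word.encode("utf-8")) % self.dim] += 1.0
        return vec


print(EchoSummarizationModel().summarize("Cinderella lost her glass slipper", max_tokens=3))
print(len(HashEmbeddingModel().create_embedding("glass slipper")))
```

Stubs like these produce meaningless summaries and embeddings, so they are only useful for smoke-testing the pipeline shape, not for judging retrieval quality.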

In [2]:
import torch
from raptor import BaseSummarizationModel, BaseQAModel, BaseEmbeddingModel, RetrievalAugmentationConfig
from transformers import AutoTokenizer, pipeline
from raptor import RetrievalAugmentation 

2025-04-18 21:49:34,402 - Loading faiss with AVX2 support.
2025-04-18 21:49:34,416 - Successfully loaded faiss with AVX2 support.


In [None]:
from huggingface_hub import login

# Replace 'your_hf_token_here' with your actual token string
login(token="your_hf_token_here")

Model_Name = "meta-llama/Llama-2-7b-chat-hf"


In [4]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    BitsAndBytesConfig,
)

_MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

# One‑time 4‑bit quantisation setup.
_bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                 # activate 4‑bit quant
    bnb_4bit_quant_type="nf4",         # better accuracy
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,    # second‑level quant for even less VRAM
)

print("⏳ Loading Llama-2-7B-chat in 4-bit …")
_tokenizer = AutoTokenizer.from_pretrained(_MODEL_NAME, use_fast=False)
_model     = AutoModelForCausalLM.from_pretrained(
    _MODEL_NAME,
    quantization_config=_bnb_cfg,
    device_map="auto",                 # lets Accelerate place layers on GPU/CPU as needed
    trust_remote_code=True,
)
print("✅ Model loaded in 4-bit NF4")

# ——————————————————————————————————————————————
# Summarisation wrapper
# ——————————————————————————————————————————————
class SummarizationModel(BaseSummarizationModel):
    def __init__(self):
        # reuse global objects so we don’t duplicate VRAM
        self.tokenizer = _tokenizer
        self._pipeline = pipeline(
            "text-generation",
            model=_model,
            tokenizer=self.tokenizer,
            device_map="auto",          # pipeline picks correct device
        )

    def summarize(self, context: str, max_tokens: int = 150) -> str:
        messages = [
            {
                "role": "user",
                "content": f"Write a concise, information-dense summary of the following:\n{context}",
            }
        ]
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        outputs = self._pipeline(
            prompt,
            max_new_tokens=max_tokens,
            do_sample=True,             # temperature/top_k/top_p only apply when sampling is on
            temperature=0.7,
            top_k=50,
            top_p=0.95,
        )
        return outputs[0]["generated_text"][len(prompt) :].strip()

# ——————————————————————————————————————————————
# Question‑answering wrapper
# ——————————————————————————————————————————————
class QAModel(BaseQAModel):
    def __init__(self):
        self.tokenizer = _tokenizer
        self._pipeline = pipeline(
            "text-generation",
            model=_model,
            tokenizer=self.tokenizer,
            device_map="auto",
        )

    def answer_question(self, context: str, question: str) -> str:
        messages = [
            {
                "role": "user",
                "content": (
                    f"Context:\n{context}\n\n"
                    f"Question: {question}\n\n"
                    "Answer as thoroughly as possible:"
                ),
            }
        ]
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        outputs = self._pipeline(
            prompt,
            max_new_tokens=256,
            do_sample=True,             # temperature/top_k/top_p only apply when sampling is on
            temperature=0.7,
            top_k=50,
            top_p=0.95,
        )
        return outputs[0]["generated_text"][len(prompt) :].strip()


⏳ Loading Llama-2-7B-chat in 4-bit …
✅ Model loaded in 4-bit NF4


2025-04-18 21:49:38,240 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.88s/it]
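To see why the 4-bit setup in the cell above matters, here is a back-of-the-envelope estimate of weight memory. It assumes a round figure of 7 billion parameters and counts only the raw weights; quantization metadata, activations, and the KV cache add overhead that this sketch ignores. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Raw weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1024**3


params = 7e9  # assumption: round figure for a "7B" model

print(f"bf16 weights:  {weight_memory_gib(params, 16):.1f} GiB")
print(f"4-bit weights: {weight_memory_gib(params, 4):.1f} GiB")
```

Under these assumptions the weights drop from roughly 13 GiB to a little over 3 GiB, which is what makes a 7B model practical on a single consumer GPU.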

In [5]:
# from transformers import AutoTokenizer, pipeline
# import torch

# # You can define your own Summarization model by extending the base Summarization Class. 
# class SummarizationModel(BaseSummarizationModel):
#     def __init__(self, model_name="meta-llama/Llama-2-7b-chat-hf"):
#         # Set use_fast=False to avoid AttributeError with apply_chat_template
#         self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
#         self.summarization_pipeline = pipeline(
#             "text-generation",
#             model=model_name,
#             tokenizer=self.tokenizer,
#             model_kwargs={"torch_dtype": torch.bfloat16},
#             device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
#             trust_remote_code=True,
#         )

#     def summarize(self, context, max_tokens=150):
#         messages = [
#             {"role": "user", "content": f"Write a summary of the following, including as many key details as possible: {context}:"}
#         ]
#         prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

#         outputs = self.summarization_pipeline(
#             prompt,
#             max_new_tokens=max_tokens,
#             do_sample=True,
#             temperature=0.7,
#             top_k=50,
#             top_p=0.95
#         )
#         summary = outputs[0]["generated_text"][len(prompt):].strip()
#         return summary


In [6]:
# class QAModel(BaseQAModel):
#     def __init__(self, model_name="meta-llama/Llama-2-7b-chat-hf"):
#         self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
#         self.qa_pipeline = pipeline(
#             "text-generation",
#             model=model_name,
#             tokenizer=self.tokenizer,
#             model_kwargs={"torch_dtype": torch.bfloat16},
#             device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
#             trust_remote_code=True,
#         )

#     def answer_question(self, context, question):
#         messages = [
#             {"role": "user", "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}"}
#         ]
#         prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

#         outputs = self.qa_pipeline(
#             prompt,
#             max_new_tokens=256,
#             do_sample=True,
#             temperature=0.7,
#             top_k=50,
#             top_p=0.95
#         )
#         answer = outputs[0]["generated_text"][len(prompt):].strip()
#         return answer

In [7]:
from sentence_transformers import SentenceTransformer
class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)
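The vectors returned by `create_embedding` are typically compared with cosine similarity. A minimal pure-Python sketch of that comparison follows; `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the short lists stand in for real embedding vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_by_similarity(query, docs))
```

Because cosine similarity ignores vector length, a long chunk and a short chunk about the same topic can still score as close neighbours.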


In [8]:
RAC = RetrievalAugmentationConfig(summarization_model=SummarizationModel(), qa_model=QAModel(), embedding_model=SBertEmbeddingModel())

2025-04-18 21:49:42,367 - Load pretrained SentenceTransformer: sentence-transformers/multi-qa-mpnet-base-cos-v1
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
2025-04-18 21:49:45,455 - Use pytorch device_name: cuda


In [9]:
RA = RetrievalAugmentation(config=RAC)

2025-04-18 21:49:45,557 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <__main__.SummarizationModel object at 0x7fac3869d820>
            Embedding Models: {'EMB': <__main__.SBertEmbeddingModel object at 0x7fad39807ee0>}
            Cluster Embedding Model: EMB
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2025-04-18 21:49:45,558 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization
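The `Max Tokens: 100` setting in the log above caps the size of each leaf chunk. As a rough illustration of what such a cap implies, the sketch below packs whole sentences into chunks of at most `max_tokens` tokens. It is a toy: `split_into_chunks` is a hypothetical helper, whitespace-separated words stand in for the `cl100k_base` tokens named in the log, and RAPTOR's own splitting logic may differ.

```python
import re


def split_into_chunks(text, max_tokens=100):
    """Pack whole sentences into chunks of at most max_tokens whitespace tokens.

    A single sentence longer than max_tokens is kept intact as its own
    oversized chunk rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


for chunk in split_into_chunks("A b c. D e f. G h i.", max_tokens=6):
    print(repr(chunk))
```

Splitting on sentence boundaries keeps each leaf readable on its own, which matters because leaves are later summarized and may be returned verbatim as context.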

In [10]:
with open('demo/sample.txt', 'r') as file:
    text = file.read()
    
RA.add_documents(text)

2025-04-18 21:49:45,615 - Creating Leaf Nodes

Batches: 100%|██████████| 1/1 [00:01<00:00,  1.55s/it]

In [11]:
question = "How did Cinderella reach her happy ending?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)

2025-04-18 21:50:50,916 - Using collapsed_tree
Batches: 100%|██████████| 1/1 [00:00<00:00, 18.16it/s]


Answer:  Cinderella reached her happy ending through a series of events that involved the help of her Fairy Godmother, a magical transformation, and a chance encounter with the king's son at the ball. Here are the key events that led to her happy ending:

1. Cinderella's Fairy Godmother appears: Cinderella's Fairy Godmother appears to her in the garden and offers to help her attend the king's son's ball.
2. Transformation: The Fairy Godmother transforms a pumpkin into a beautiful golden carriage, mice into horses, and a rat into a coachman. She also gives Cinderella a beautiful dress and slippers.
3. Ball: Cinderella attends the ball with the help of her Fairy Godmother and dances with the king's son.
4. Midnight: As the clock strikes midnight, Cinderella must leave the ball before her stepfamily discovers she is not who she seems to be. She leaves behind one of her glass slippers.
5. Search for the missing bride: The king's son searches for the girl who left behind one of her
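The `Using collapsed_tree` log line above suggests that retrieval treats nodes from every layer as one flat pool rather than walking the tree level by level. A toy sketch of that idea, under stated assumptions: `overlap_score` and `collapsed_tree_retrieve` are hypothetical helpers, word overlap stands in for embedding similarity, and whitespace words stand in for tokens.

```python
def overlap_score(query, text):
    """Jaccard word overlap between query and text (stand-in for embedding similarity)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def collapsed_tree_retrieve(layers, query, top_k=5, max_tokens=50):
    """Flatten all layers, rank every node against the query, and keep the
    best nodes until top_k nodes or the max_tokens budget is reached."""
    all_nodes = [node for layer in layers for node in layer]
    ranked = sorted(all_nodes, key=lambda n: overlap_score(query, n), reverse=True)
    selected, used = [], 0
    for node in ranked[:top_k]:
        n = len(node.split())
        if used + n > max_tokens:
            break
        selected.append(node)
        used += n
    return selected


layers = [
    ["the prince found the slipper", "the mice became horses"],  # leaf chunks
    ["cinderella married the prince"],                           # summary layer
]
print(collapsed_tree_retrieve(layers, "who married the prince", top_k=2))
```

Because leaves and summaries compete in the same ranking, a broad question can pull in a high-level summary while a detailed question can pull in a specific leaf.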


In [2]:
from raptor.qa_pipeline import RAPTORLLM

with open('demo/sample.txt', 'r') as file:
    text = file.read()
    

raptor = RAPTORLLM()
raptor.index_corpus([text])   # one‑time

print(raptor.answer("How did Cinderella reach her happy ending?"))

2025-04-20 05:00:56,585 - Loading faiss with AVX2 support.
2025-04-20 05:00:56,752 - Successfully loaded faiss with AVX2 support.
2025-04-20 05:01:02,762 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 2/2 [01:56<00:00, 58.17s/it]
2025-04-20 05:02:59,419 - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
2025-04-20 05:03:00,601 - Use pytorch device_name: cuda
2025-04-20 05:03:01,393 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2025-04-20 05:03:01,629 - Use pytorch device_name: cuda
2025-04-20 05:03:01,793 - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
2025-04-20 05:03:02,000 - Use pytorch device_name: cuda
2025-04-20 05:03:02,039 - Successfully initialized TreeBuilder with Config {'token

⏳ Loading Llama-2-7B-chat in 4-bit …


Using the provided contexts, Cinderella reached her happy ending by:

Context 1: attending the festival with the help of a fairy godmother, who appeared to her in a dream and gave her a magical makeover.

Context 2: persevering through difficult circumstances and staying true to her values.

Context 3: attending the festival with her parents and step-sisters, despite her step-mother's refusal to let her go.
