<a href="https://colab.research.google.com/github/akashmathur-2212/LLMs-playground/blob/main/LlamaIndex-applications%20/Advanced-RAG/parent_child_document_retriever_metadata_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parent-Child Document Retriever with Metadata Extraction

This notebook shows how you can use recursive retrieval to traverse node relationships and fetch nodes based on "references".

When you first perform retrieval, you may want to retrieve the reference as opposed to the raw text. You can have multiple references point to the same node.

In this guide we explore two uses of node references:
- **Chunk references**: Smaller chunks of different sizes referring to a bigger parent chunk
- **Metadata references**: Summaries + Generated Questions referring to a bigger chunk
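
To make the idea of a "reference" concrete before any library code appears, here is a minimal sketch in plain Python. `ToyNode` and `resolve` are hypothetical names used only for illustration; they are not LlamaIndex classes. The sketch shows two different references leading to the same parent node, which is then returned once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    node_id: str
    text: str
    # When set, this node is a reference that points at another node's id.
    index_id: Optional[str] = None


def resolve(hits, node_dict):
    """Follow references and drop duplicate targets, keeping first-hit order."""
    resolved, seen = [], set()
    for hit in hits:
        target_id = hit.index_id or hit.node_id
        if target_id not in seen:
            seen.add(target_id)
            resolved.append(node_dict[target_id])
    return resolved


parent = ToyNode("node-0", "full text of a large chunk")
child = ToyNode("c-1", "a small slice of node-0", index_id="node-0")
question = ToyNode("q-1", "a generated question about node-0", index_id="node-0")
node_dict = {n.node_id: n for n in (parent, child, question)}

# Two different references were retrieved, but both lead to the same parent.
resolved = resolve([child, question], node_dict)
print([n.node_id for n in resolved])
```

The small nodes are what gets matched against the query; the parent is what gets handed to the LLM for synthesis.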

In [4]:
!pip install -qqq llama-index llama-hub langchain openai accelerate==0.21.0 bitsandbytes==0.40.2 transformers sentence_transformers InstructorEmbedding chromadb

## Setup

1. In this section we will work with the QLoRA paper and create an initial set of nodes (chunk size 1024).
2. We will use the open-source LLM [`zephyr-7b-alpha`](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and the embedding model [`hkunlp/instructor-large`](https://huggingface.co/hkunlp/instructor-large).

In [6]:
import json
import torch
from pathlib import Path

# transformers
from transformers import BitsAndBytesConfig

# llama_index
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM
from llama_index import download_loader, Document, VectorStoreIndex, ServiceContext
from llama_index.node_parser import SentenceSplitter
from llama_index.schema import IndexNode
from langchain.embeddings import HuggingFaceInstructEmbeddings
from llama_index.response.notebook_utils import display_source_node
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

# Metadata Extraction
from llama_index.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)

# db
import chromadb

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load Data

In [7]:
PDFReader = download_loader("PDFReader")
loader = PDFReader()
docs = loader.load_data(file=Path("./QLoRa.pdf"))

In [8]:
docs[0]

Document(id_='60349a1c-cb4e-42ef-9db1-e8cbdfc4c0dc', embedding=None, metadata={'page_label': '1', 'file_name': 'QLoRa.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='d5842650b3ec5f81bc436b0f768188fd29d9b209f1e08a472ec049643bba89d9', text='QL ORA: Efficient Finetuning of Quantized LLMs\nTim Dettmers∗Artidoro Pagnoni∗Ari Holtzman\nLuke Zettlemoyer\nUniversity of Washington\n{dettmers,artidoro,ahai,lsz}@cs.washington.edu\nAbstract\nWe present QLORA, an efficient finetuning approach that reduces memory us-\nage enough to finetune a 65B parameter model on a single 48GB GPU while\npreserving full 16-bit finetuning task performance. QLORAbackpropagates gradi-\nents through a frozen, 4-bit quantized pretrained language model into Low Rank\nAdapters (LoRA). Our best model family, which we name Guanaco , outperforms\nall previous openly released models on the Vicuna benchmark, reaching 99.3%\nof the performance level of ChatGPT while only requiring 

In [9]:
docs[0].get_content()

'QL ORA: Efficient Finetuning of Quantized LLMs\nTim Dettmers∗Artidoro Pagnoni∗Ari Holtzman\nLuke Zettlemoyer\nUniversity of Washington\n{dettmers,artidoro,ahai,lsz}@cs.washington.edu\nAbstract\nWe present QLORA, an efficient finetuning approach that reduces memory us-\nage enough to finetune a 65B parameter model on a single 48GB GPU while\npreserving full 16-bit finetuning task performance. QLORAbackpropagates gradi-\nents through a frozen, 4-bit quantized pretrained language model into Low Rank\nAdapters (LoRA). Our best model family, which we name Guanaco , outperforms\nall previous openly released models on the Vicuna benchmark, reaching 99.3%\nof the performance level of ChatGPT while only requiring 24 hours of finetuning\non a single GPU. QLORAintroduces a number of innovations to save memory\nwithout sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that\nis information theoretically optimal for normally distributed weights (b) Double\nQuantization to reduce

In [10]:
# combine all the text
doc_text = "\n\n".join([d.get_content() for d in docs])
documents = [Document(text=doc_text)]

# Chunking

In [12]:
node_parser = SentenceSplitter(chunk_size=1024)

In [13]:
base_nodes = node_parser.get_nodes_from_documents(documents)
# assign deterministic node ids so they stay the same across runs
for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{idx}"
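
As a rough intuition for what the chunking step does, the sketch below greedily packs whole sentences into chunks under a size budget and then assigns ids in the same `node-{idx}` style. It is a simplification under stated assumptions: it counts whitespace-separated words rather than tokens and has no chunk overlap, and `pack_sentences` is a hypothetical helper, not how `SentenceSplitter` is implemented.

```python
import re


def pack_sentences(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
toy_chunks = pack_sentences(text, max_words=5)
toy_ids = [f"node-{i}" for i, _ in enumerate(toy_chunks)]
print(list(zip(toy_ids, toy_chunks)))
```

Note that in this sketch a single sentence longer than the budget is kept whole as an oversized chunk.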

In [66]:
# print the node ids corresponding to all the chunks
for node in base_nodes:
  print(node.id_)

node-0
node-1
node-2
node-3
node-4
node-5
node-6
node-7
node-8
node-9
node-10
node-11
node-12
node-13
node-14
node-15
node-16
node-17
node-18
node-19
node-20
node-21
node-22
node-23
node-24
node-25
node-26
node-27
node-28
node-29
node-30


So, the entire document is divided into 31 nodes (`node-0` through `node-30`).

# LLM (`zephyr-7b-alpha`)

In [16]:
from google.colab import userdata

# Hugging Face API token read from Colab secrets (note: it is not passed to the model loader below)
hf_token = userdata.get('hf_token')

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


def messages_to_prompt(messages):
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}</s>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}</s>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}</s>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|system|>\n"):
    prompt = "<|system|>\n</s>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"

  return prompt


llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
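
To see the prompt string that this chat formatting produces, here is a standalone sketch of the same logic as `messages_to_prompt`. It runs without loading the model; `ToyMessage` is a hypothetical stand-in for the library's chat message type, and `to_zephyr_prompt` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class ToyMessage:
    role: str
    content: str


def to_zephyr_prompt(messages):
    """Same formatting logic as messages_to_prompt, on stand-in messages."""
    prompt = ""
    for m in messages:
        if m.role in ("system", "user", "assistant"):
            prompt += f"<|{m.role}|>\n{m.content}</s>\n"
    # Ensure the prompt starts with a (possibly empty) system turn.
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    # End with an open assistant turn for the model to complete.
    return prompt + "<|assistant|>\n"


prompt = to_zephyr_prompt([ToyMessage("user", "What is QLoRA?")])
print(prompt)
```

For a single user message, the result has the same shape as the `query_wrapper_prompt` template passed to `HuggingFaceLLM`: an empty system turn, the user turn, then an open assistant turn.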

# Embedding (`hkunlp/instructor-large`)

In [17]:
embed_model = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}
)

load INSTRUCTOR_Transformer
max_seq_length  512


In [18]:
# set your ServiceContext for all the next steps
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)

## Baseline Retriever

Define a baseline retriever that simply fetches the top-k raw text nodes by embedding similarity.

In [19]:
base_index = VectorStoreIndex(base_nodes, service_context=service_context)
base_retriever = base_index.as_retriever(similarity_top_k=2)
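
To make "top-k by embedding similarity" concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the score and uses made-up 2-D vectors; `cosine`, `top_k`, and `toy_vecs` are illustrative names, not LlamaIndex APIs. The real index does the same kind of ranking over the Instructor embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, chosen only for illustration.
toy_vecs = {
    "node-a": [1.0, 0.0],
    "node-b": [0.6, 0.8],
    "node-c": [0.0, 1.0],
}
hits = top_k([1.0, 0.0], toy_vecs, k=2)
print(hits)
```

With `similarity_top_k=2`, only the two best-scoring nodes reach the synthesis step, so a relevant passage that ranks third is never seen by the LLM.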

In [20]:
retrievals = base_retriever.retrieve(
    "Can you tell me about the Paged Optimizers?"
)

In [21]:
for n in retrievals:
    display_source_node(n, source_length=1500)

**Node ID:** node-5<br>**Similarity:** 0.8639193985783434<br>**Text:** On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows:
YBF16=XBF16doubleDequant (cFP32
1, ck-bit
2,WNF4) +XBF16LBF16
1LBF16
2, (5)
where doubleDequant (·)is defined as:
doubleDequant (cFP32
1, ck-bit
2,Wk-bit) =dequant (dequant (cFP32
1, ck-bit
2),W4bit) =WBF16,(6)
We use NF4 for Wand FP8 for c2. We use a blocksize of 64 for Wfor higher quantization precision
and a blocksize of 256 for c2to conserve memory.
For parameter updates only the gradient with respect to the error for the adapters weights∂E
∂Liare
needed, and not for 4-bit weights∂E
∂W. However, the calculation of∂E
∂Lientails the calculation of∂X
∂W
which proceeds via equation (5) with dequantization from st...<br>

**Node ID:** node-6<br>**Similarity:** 0.8246144101148492<br>**Text:** We provide more details in the results section for each particular setup to make the results more
readable. Full details in Appendix A.
QLoRA-AllQLoRA-FFN
QLoRA-AttentionAlpaca (ours)
Stanford-Alpaca
Model6061626364RougeL
bits
4
16
Figure 2: RougeL for LLaMA 7B models on the
Alpaca dataset. Each point represents a run with a
different random seed. We improve on the Stanford
Alpaca fully finetuned default hyperparameters to
construct a strong 16-bit baseline for comparisons.
Using LoRA on all transformer layers is critical to
match 16-bit performance.While paged optimizers are critical to do 33B/65B
QLORAtuning on a single 24/48GB GPU, we do
not provide hard measurements for Paged Optimiz-
ers since the paging only occurs when processing
mini-batches with long sequence lengths, which is
rare. We do, however, perform an analysis of the
runtime of paged optimizers for 65B models on
48GB GPUs and find that with a batch size of 16,
paged optimizers provide the same training speed
as regular optimizers. Future work should measure
and characterize under what circumstances slow-
downs occur from the paging process.
Default LoRA hyperparameters do not match 16-
bit performance When using the standard prac-
tice of applying LoRA to query and value attention
projection matrices [ 28], we are not able to replicate
full finetuning performance for large base models.
As shown in Figure 2 for LLaMA 7B finetuning on
Alpaca, we find that the most critical LoRA hyper-
parameter is how many L...<br>

In [22]:
query_engine_base = RetrieverQueryEngine.from_args(
    base_retriever, service_context=service_context
)

In [23]:
response = query_engine_base.query(
    "Can you tell me about the Paged Optimizers?"
)
print(str(response))



Paged Optimizers are a feature used in QLoRA to allocate paged memory for optimizer states. This allows for automatic page-to-page transfers between CPU and GPU memory for error-free GPU processing when the GPU runs out of memory. This feature works like regular memory paging between CPU RAM and disk. By using this feature, QLoRA can significantly reduce the required memory for finetuning models. However, the paging process only occurs when processing mini-batches with long sequence lengths, which is rare. The default LoRA hyperparameters do not match 16-bit performance, and the most critical LoRA hyperparameter is the number of LoRA adapters used in total. LoRA on all linear transformer block layers is required to match full finetuning performance. NormalFloat data type significantly improves bit-for-bit accuracy gains compared to regular 4-bit Floats, and double quantization allows for a more fine-grained control over the memory footprint to fit models of certain size into certain GP

## Chunk References: Smaller Child Chunks Referring to Bigger Parent Chunk

Now, we will build smaller chunks that will point to their bigger parent chunks.

During query-time, we retrieve smaller chunks, but we follow references to bigger chunks. This allows us to have more context for synthesis.

In [24]:
sub_chunk_sizes = [256, 512]
sub_node_parsers = [SentenceSplitter(chunk_size=c) for c in sub_chunk_sizes]

all_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    # also add the original (parent) node, as an IndexNode that references itself
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)

In [25]:
all_nodes_dict = {n.node_id: n for n in all_nodes}

In [68]:
all_nodes_dict.keys()

dict_keys(['node-0', 'node-1', 'node-2', 'node-3', 'node-4', 'node-5', 'node-6', 'node-7', 'node-8', 'node-9', 'node-10', 'node-11', 'node-12', 'node-13', 'node-14', 'node-15', 'node-16', 'node-17', 'node-18', 'node-19', 'node-20', 'node-21', 'node-22', 'node-23', 'node-24', 'node-25', 'node-26', 'node-27', 'node-28', 'node-29', 'node-30', 'aff4cd7a-114d-4ae2-a50e-ca5d30afc9ce', '756bcf23-c37a-4ab6-b1ce-1e0a87992845', '98453e39-b13b-499e-8cc9-81b9be409191', '6cb1624d-b628-4f2e-901c-d0fb5d495707', '641b49f2-0494-49c3-a013-c220a29a19f0', 'e5ecbc64-8b29-4e34-83ef-aa5162180da0', 'a9293f16-2787-4199-939c-22bfca2b9958', 'db2f9d01-62f9-40d3-88a6-fb62dd8a832e', 'e2f9f48f-b4a0-4d19-8317-bcd60d88aefd', '970443fa-3aea-4d06-adb7-dcd6651bb6de', '907b2e47-825c-44af-a2cf-8c41c89e3960', '95d2bf7e-938c-44e8-9e0b-99462268880a', '3689fecb-8967-4928-b194-c82dc1a86736', 'c9c94976-987c-4100-8a6f-573099e4c4bc', 'aaece5c8-2df4-4b68-8fd7-b8db966eb6cf', 'f093581b-8178-499c-81b5-2750d0ac2b0d', '77e14b4b-5b72-4ae

In [72]:
all_nodes_dict['aff4cd7a-114d-4ae2-a50e-ca5d30afc9ce']

IndexNode(id_='aff4cd7a-114d-4ae2-a50e-ca5d30afc9ce', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='25b08fc5a0708dcadb67fb32123722c3e726452fc10252d1ea1179f3fc6e20c6', text='Question: How does QLORA, an efficient finetuning approach, reduce memory usage while preserving full 16-bit finetuning task performance for LLMs? What innovations does it introduce to achieve this?', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='node-0')

In [87]:
# all_nodes_list = list(all_nodes_dict.keys())
# index_id = [x for x in all_nodes_list if "node-" not in x]
# for id in index_id:
#   print(f"{id} ---> {all_nodes_dict[id].index_id}")
#   print("-"*40, end="\n")

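Each original text chunk (a `TextNode`, for example `node-0`) now has several smaller `IndexNode`s associated with it. Every one of these smaller nodes refers back to its larger chunk through `index_id`, which holds the node ID of that larger chunk. (The example output above has a higher cell number because it was captured after the metadata section further down had run, so its text is a generated question rather than a sub-chunk; the `index_id='node-0'` reference works the same way.)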
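
The parent/child bookkeeping can be sketched without the library. The example below is a simplification: it splits on whitespace-separated words instead of using a sentence-aware splitter, and `split_words` and `build_references` are hypothetical helpers. It does mirror the structure of the loop above, with every sub-chunk size applied to every parent, followed by the parent itself.

```python
def split_words(text, size):
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_references(parents, sub_sizes):
    """Return {text, index_id} records: children first, then the parent itself."""
    records = []
    for parent_id, text in parents.items():
        for size in sub_sizes:
            for piece in split_words(text, size):
                records.append({"text": piece, "index_id": parent_id})
        # The parent is also indexed, referencing itself.
        records.append({"text": text, "index_id": parent_id})
    return records


toy_parents = {"node-0": "a b c d e f g h"}
records = build_references(toy_parents, sub_sizes=[2, 4])
for r in records:
    print(r)
```

Here one 8-word parent yields four 2-word children, two 4-word children, and the parent record, all carrying `index_id` equal to `node-0`.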

## Create Index from these smaller chunks (IndexNode)

In [28]:
vector_index_chunk = VectorStoreIndex(
    all_nodes, service_context=service_context
)

In [29]:
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)

When we perform retrieval, we want to match on the reference and then return the node it points to, as opposed to the raw text of the reference. Multiple references can point to the same node.

In [30]:
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)

In [31]:
nodes = retriever_chunk.retrieve(
    "Can you tell me about the Paged Optimizers?"
)
for node in nodes:
    display_source_node(node, source_length=2000)

Retrieving with query id None: Can you tell me about the Paged Optimizers?
Retrieved node with id, entering: node-4
Retrieving with query id node-4: Can you tell me about the Paged Optimizers?
Retrieved node with id, entering: node-5
Retrieving with query id node-5: Can you tell me about the Paged Optimizers?

**Node ID:** node-4<br>**Similarity:** 0.8935850984244503<br>**Text:** For our data type, we
set the arbitrary range [−1,1]. As such, both the quantiles for the data type and the neural network
weights need to be normalized into this range.
The information theoretically optimal data type for zero-mean normal distributions with arbitrary
standard deviations σin the range [−1,1]is computed as follows: (1) estimate the 2k+ 1quantiles
of a theoretical N(0,1)distribution to obtain a k-bit quantile quantization data type for normal distri-
butions, (2) take this data type and normalize its values into the [−1,1]range, (3) quantize an input
weight tensor by normalizing it into the [−1,1]range through absolute maximum rescaling.
Once the weight range and data type range match, we can quantize as usual. Step (3) is equivalent to
rescaling the standard deviation of the weight tensor to match the standard deviation of the k-bit data
type. More formally, we estimate the 2kvalues qiof the data type as follows:
qi=1
2
QXi
2k+ 1
+QXi+ 1
2k+ 1
, (4)
where QX(·)is the quantile function of the standard normal distribution N(0,1). A problem for
a symmetric k-bit quantization is that this approach does not have an exact representation of zero,
which is an important property to quantize padding and other zero-valued elements with no error. To
4

ensure a discrete zeropoint of 0and to use all 2kbits for a k-bit datatype, we create an asymmetric
data type by estimating the quantiles qiof two ranges qi:2k−1for the negative part and 2k−1+ 1for
the positive part and then we unify these sets of qiand remove one of the two zeros that occurs in both
sets. We term the resulting data type that has equal expected number of values in each quantization bin
k-bit NormalFloat (NFk), since the data type is information-theoretically optimal for zero-centered
normally distributed data. The exact values of this data type can be found in Appendix E.
Double Quantization We introduce Double Quantization (DQ), the process of quantizing the
quantization constants for add...<br>

**Node ID:** node-5<br>**Similarity:** 0.8929982170962448<br>**Text:** On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows:
YBF16=XBF16doubleDequant (cFP32
1, ck-bit
2,WNF4) +XBF16LBF16
1LBF16
2, (5)
where doubleDequant (·)is defined as:
doubleDequant (cFP32
1, ck-bit
2,Wk-bit) =dequant (dequant (cFP32
1, ck-bit
2),W4bit) =WBF16,(6)
We use NF4 for Wand FP8 for c2. We use a blocksize of 64 for Wfor higher quantization precision
and a blocksize of 256 for c2to conserve memory.
For parameter updates only the gradient with respect to the error for the adapters weights∂E
∂Liare
needed, and not for 4-bit weights∂E
∂W. However, the calculation of∂E
∂Lientails the calculation of∂X
∂W
which proceeds via equation (5) with dequantization from storage WNF4to computation data type
WBF16to calculate the derivative∂X
∂Win BFloat16 precision.
To summarize, QLORAhas one storage data type (usually 4-bit NormalFloat) and a computation
data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type
to perform the forward and backward pass, but we only compute weight gradients for the LoRA
parameters which use 16-bit BrainFloat.
4 QLoRA vs. Standard Finetuning
We have discussed how QLoRA works and how it can signi...<br>

In [32]:
query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk, service_context=service_context
)

In [33]:
response = query_engine_chunk.query(
    "Can you tell me about the Paged Optimizers?"
)
print(str(response))

Retrieving with query id None: Can you tell me about the Paged Optimizers?
Retrieved node with id, entering: node-4
Retrieving with query id node-4: Can you tell me about the Paged Optimizers?
Retrieved node with id, entering: node-5
Retrieving with query id node-5: Can you tell me about the Paged Optimizers?

Yes, according to the given context information, Paged Optimizers are a feature used in the QLoRA model to allocate paged memory for optimizer states. This feature automatically transfers memory between the CPU and GPU when the GPU runs out of memory, allowing for error-free GPU processing. This is done by using the NVIDIA unified memory feature, which works like regular memory paging between CPU RAM and the disk. This allows for more efficient use of memory during the optimization process.


## Metadata References: Summaries + Generated Questions referring to a bigger chunk

Now, we will add some additional context that references the source node.

This additional context includes summaries as well as generated questions. Because of limited compute, this notebook extracts only questions; you can uncomment the `SummaryExtractor` line to extract summaries as well.

During query-time, we retrieve smaller chunks, but we follow references to bigger chunks. This allows us to have more context for synthesis.

In [35]:
import nest_asyncio

nest_asyncio.apply()

In [36]:
extractors = [
    # SummaryExtractor(summaries=["self"], llm=llm, show_progress=True),
    QuestionsAnsweredExtractor(questions=1, llm=llm, show_progress=True),
]

In [38]:
# run metadata extractor across base nodes, get back dictionaries
metadata_dicts = []
for extractor in extractors:
    metadata_dicts.extend(extractor.extract(base_nodes))

100%|██████████| 31/31 [04:24<00:00,  8.52s/it]  


In [40]:
# all_nodes consists of the source nodes plus one question IndexNode per source node
import copy

all_nodes = copy.deepcopy(base_nodes)
for idx, d in enumerate(metadata_dicts):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"],
        index_id=base_nodes[idx].node_id,
    )
    # inode_s = IndexNode(
    #     text=d["section_summary"], index_id=base_nodes[idx].node_id)
    all_nodes.extend([inode_q]) #, inode_s
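
One thing to watch if both extractors are enabled: the extraction loop above extends a single flat list. Assuming each `extract` call returns one dict per node, `metadata_dicts` would then hold two entries per node, and indexing `base_nodes[idx]` would run past the end. A possible fix, sketched here in plain Python with placeholder values, is to merge the per-extractor results into one dict per node first. `merge_per_node` is a hypothetical helper; the key names are the ones used in the cells above.

```python
def merge_per_node(per_extractor_dicts):
    """Merge [[dict per node] per extractor] into one dict per node."""
    merged = []
    for node_dicts in zip(*per_extractor_dicts):
        combined = {}
        for d in node_dicts:
            combined.update(d)
        merged.append(combined)
    return merged


# Placeholder values; real ones would come from each extractor's extract() call.
toy_summaries = [{"section_summary": "S0"}, {"section_summary": "S1"}]
toy_questions = [
    {"questions_this_excerpt_can_answer": "Q0"},
    {"questions_this_excerpt_can_answer": "Q1"},
]
merged = merge_per_node([toy_summaries, toy_questions])
print(merged[0])
```

With one merged dict per node, position `idx` lines up with `base_nodes[idx]` again, and both keys are available when building the question and summary `IndexNode`s.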

In [41]:
all_nodes_dict = {n.node_id: n for n in all_nodes}

In [42]:
vector_index_metadata = VectorStoreIndex(all_nodes, service_context=service_context)
vector_retriever_metadata = vector_index_metadata.as_retriever(similarity_top_k=2)

In [43]:
retriever_metadata = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=True,
)

In [44]:
nodes = retriever_metadata.retrieve(
    "Can you tell me about the Paged Optimizers?"
)
for node in nodes:
    display_source_node(node, source_length=2000)

Retrieving with query id None: Can you tell me about the Paged Optimizers?
Retrieving text node: On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows

**Node ID:** node-5<br>**Similarity:** 0.8639193985783434<br>**Text:** On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows:
YBF16=XBF16doubleDequant (cFP32
1, ck-bit
2,WNF4) +XBF16LBF16
1LBF16
2, (5)
where doubleDequant (·)is defined as:
doubleDequant (cFP32
1, ck-bit
2,Wk-bit) =dequant (dequant (cFP32
1, ck-bit
2),W4bit) =WBF16,(6)
We use NF4 for Wand FP8 for c2. We use a blocksize of 64 for Wfor higher quantization precision
and a blocksize of 256 for c2to conserve memory.
For parameter updates only the gradient with respect to the error for the adapters weights∂E
∂Liare
needed, and not for 4-bit weights∂E
∂W. However, the calculation of∂E
∂Lientails the calculation of∂X
∂W
which proceeds via equation (5) with dequantization from storage WNF4to computation data type
WBF16to calculate the derivative∂X
∂Win BFloat16 precision.
To summarize, QLORAhas one storage data type (usually 4-bit NormalFloat) and a computation
data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type
to perform the forward and backward pass, but we only compute weight gradients for the LoRA
parameters which use 16-bit BrainFloat.
4 QLoRA vs. Standard Finetuning
We have discussed how QLoRA works and how it can signi...<br>

**Node ID:** node-0<br>**Similarity:** 0.840191491716247<br>**Text:** QL ORA: Efficient Finetuning of Quantized LLMs
Tim Dettmers∗Artidoro Pagnoni∗Ari Holtzman
Luke Zettlemoyer
University of Washington
{dettmers,artidoro,ahai,lsz}@cs.washington.edu
Abstract
We present QLORA, an efficient finetuning approach that reduces memory us-
age enough to finetune a 65B parameter model on a single 48GB GPU while
preserving full 16-bit finetuning task performance. QLORAbackpropagates gradi-
ents through a frozen, 4-bit quantized pretrained language model into Low Rank
Adapters (LoRA). Our best model family, which we name Guanaco , outperforms
all previous openly released models on the Vicuna benchmark, reaching 99.3%
of the performance level of ChatGPT while only requiring 24 hours of finetuning
on a single GPU. QLORAintroduces a number of innovations to save memory
without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that
is information theoretically optimal for normally distributed weights (b) Double
Quantization to reduce the average memory footprint by quantizing the quantization
constants, and (c) Paged Optimizers to manage memory spikes. We use QLORA
to finetune more than 1,000 models, providing a detailed analysis of instruction
following and chatbot performance across 8 instruction datasets, multiple model
types (LLaMA, T5), and model scales that would be infeasible to run with regular
finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA
finetuning on a small high-quality dataset leads to state-of-the-art results, even
when using smaller models than the previous SoTA. We provide a detailed analysis
of chatbot performance based on both human and GPT-4 evaluations showing that
GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Fur-
thermore, we find that current chatbot benchmarks are not trustworthy to accurately
evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates
where Guanaco fails compared to ChatGPT. We release all of our models ...<br>

In [45]:
query_engine_metadata = RetrieverQueryEngine.from_args(
    retriever_metadata, service_context=service_context
)

In [46]:
response = query_engine_metadata.query(
    "Can you tell me about the Paged Optimizers?"
)
print(str(response))

Retrieving with query id None: Can you tell me about the Paged Optimizers?
Retrieving text node: On average, for a blocksize of 64, this quantization reduces the memory footprint per
parameter from 32/64 = 0 .5bits, to 8/64 + 32 /(64·256) = 0 .127bits, a reduction of 0.373 bits
per parameter.
Paged Optimizers use the NVIDIA unified memory3feature wich does automatic page-to-page
transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU
occasionally runs out-of-memory. The feature works like regular memory paging between CPU RAM
and the disk. We use this feature to allocate paged memory for the optimizer states which are then
automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU
memory when the memory is needed in the optimizer update step.
QL ORA.Using the components described above, we define QLORAfor a single linear layer in
the quantized base model with a single LoRA adapter as follows



Paged Optimizers are a feature used in QLORA to manage memory spikes during finetuning. This feature automatically pages memory between the CPU and GPU for error-free GPU processing in scenarios where the GPU occasionally runs out-of-memory. In QLORA, the optimizer states are allocated paged memory, which are then automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU memory when the memory is needed in the optimizer update step. This feature helps to reduce the memory footprint of finetuning large language models, making it possible to finetune a quantized 4-bit model without any performance degradation.


# Storage

Now, let's persist the embeddings in Chroma DB so they can be retrieved later.

In [59]:
# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("QLoRa_knowledge_database")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
vector_index_metadata_db = VectorStoreIndex(
    all_nodes,
    storage_context=storage_context,
    service_context=service_context
)

# Compare the Outputs

In [88]:
baseline_retriver = "Paged Optimizers are a feature used in QLoRA to allocate paged memory for optimizer states. This allows for automatic page-to-page transfers between CPU and GPU memory for error-free GPU processing when the GPU runs out of memory. This feature works like regular memory paging between CPU RAM and disk. By using this feature, QLoRA can significantly reduce the required memory for finetuning models. However, the paging process only occurs when processing mini-batches with long sequence lengths, which is rare. The default LoRA hyperparameters do not match 16-bit performance, and the most critical LoRA hyperparameter is the number of LoRA adapters used in total. LoRA on all linear transformer block layers is required to match full finetuning performance. NormalFloat data type significantly improves bit-for-bit accuracy gains compared to regular 4-bit Floats, and double quantization allows for a more fine-grained control over the memory footprint to fit models of certain size into certain GPUs."

chunk_reference_retriver = "Yes, according to the given context information, Paged Optimizers are a feature used in the QLoRA model to allocate paged memory for optimizer states. This feature automatically transfers memory between the CPU and GPU when the GPU runs out of memory, allowing for error-free GPU processing. This is done by using the NVIDIA unified memory feature, which works like regular memory paging between CPU RAM and the disk. This allows for more efficient use of memory during the optimization process."

metadata_retriver = "Paged Optimizers are a feature used in QLORA to manage memory spikes during finetuning. This feature automatically pages memory between the CPU and GPU for error-free GPU processing in scenarios where the GPU occasionally runs out-of-memory. In QLORA, the optimizer states are allocated paged memory, which are then automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU memory when the memory is needed in the optimizer update step. This feature helps to reduce the memory footprint of finetuning large language models, making it possible to finetune a quantized 4-bit model without any performance degradation."

In [89]:
import pandas as pd

# one row, one column per retriever, for a side-by-side view
df = pd.DataFrame(
    {
        "baseline_retriver": [baseline_retriver],
        "chunk_reference_retriver": [chunk_reference_retriver],
        "metadata_retriver": [metadata_retriver],
    }
)
df

| | baseline_retriver | chunk_reference_retriver | metadata_retriver |
|---|---|---|---|
| 0 | Paged Optimizers are a feature used in QLoRA t... | Yes, according to the given context informatio... | Paged Optimizers are a feature used in QLORA t... |


# END