# First version: Load PDF, get embeddings, query them

This is based on the workflow shown here: https://medium.com/geekculture/automating-pdf-interaction-with-langchain-and-chatgpt-e723337f26a6. However, it has been heavily altered from the version shown there. Most notably, this version uses 🤗 models instead of OpenAI models for both text generation and embeddings. It also adds a map-reduce version, as well as a version for working with multiple papers simultaneously, with citations that point to the specific pages used as sources.

### Load paper

In [1]:
!curl -o gptq.pdf https://arxiv.org/pdf/2210.17323.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  509k  100  509k    0     0   703k      0 --:--:-- --:--:-- --:--:--  702k


In [2]:
from langchain.document_loaders import PyPDFLoader  # for loading the PDF
from langchain.chains import ChatVectorDBChain  # for chatting with the PDF

# Define a pprint function
import textwrap
def pprint(s, width=70):
    print(textwrap.fill(s, width))

In [3]:
pdf_path = "./gptq.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
print(pages[9].page_content)

standard PTQ benchmarks, in the same setup as (Frantar et al., 2022). As can be seen in Table 1,
GPTQ performs on par at 4-bit, and slightly worse than the most accurate methods at 3-bit. At the
same time, it signiﬁcantly outperforms AdaQuant, the fastest amongst prior PTQ methods. Further,
we compare against the full greedy OBQ method on two smaller language models: BERT-base (De-
vlin et al., 2019) and OPT-125M. The results are shown in Appendix Table 8. At 4 bits, both methods
perform similarly, and for 3 bits, GPTQ surprisingly performs slightly better. We suspect that this
is because some of the additional heuristics used by OBQ, such as early outlier rounding, might
require careful adjustments for optimal performance on non-vision models. Overall, GPTQ appears
to be competitive with state-of-the-art post-training methods for smaller models, while taking only
<1minute rather than ≈1hour. This enables scaling to much larger models.
Runtime. Next we measure the full model quantizati

### Get embeddings of chunks of the paper

We'll create a vector store, which will contain an embedding for each of the documents in our corpus.
We do this so that we can query the resulting vector store.
Whatever topic we're interested in, we can write a natural-language query about it.
That query is then converted to an embedding, and we find the documents that are closest to it in the embedding space.
In theory, those documents should be the ones most relevant to our query.
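
To make the idea concrete, here is a minimal, self-contained sketch of nearest-neighbour search in an embedding space. It is only an illustration: the `embed` function is a toy bag-of-words stand-in (an assumption for this example, not the 🤗 embedding model used in this notebook), and `toy_similarity_search` is a hypothetical helper, not part of LangChain or FAISS.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (NOT a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def toy_similarity_search(query, documents, k=2):
    """Return the k documents whose toy embeddings are closest to the query's."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

corpus = [
    "quantization reduces the number of bits used for model weights",
    "the runtime of the full model was measured on a single gpu",
    "low rank adapters add few trainable parameters",
]
top = toy_similarity_search(
    "how many bits are used for weights after quantization", corpus, k=1
)
print(top[0])
```

A real vector store does the same thing in spirit, but with dense vectors from a learned model and an index that avoids comparing the query against every document.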

In [4]:
# Make a vector store of embeddings for each doc, or load one that was saved earlier
from utils.vector_store_tools import generate_vector_store, load_saved_vector_store
import os

username = os.environ.get('USER')
cache_loc = os.path.join('/','scratch',username,'hf_cache')

saved_already = True

if saved_already:
    db = load_saved_vector_store(cache_loc)
else:
    db = generate_vector_store(pages, cache_loc)
    db.save_local(os.path.join("data","faiss_index"))

In [9]:
# Test out the db
query_docs = db.similarity_search(query = 'What is quantization of model parameters?')

print(f'{len(query_docs)} DOCUMENTS RETURNED.\n')
print(f'FIRST DOCUMENT CONTENT: \n{query_docs[0].page_content}')
print(f"\nSOURCE: \n{query_docs[0].metadata['source']}, page {query_docs[0].metadata['page']}")

4 DOCUMENTS RETURNED.

FIRST DOCUMENT CONTENT: 
Figure 1: Different finetuning methods and their memory requirements. QLORAimproves over LoRA by
quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.
2 Background
Block-wise k-bit Quantization Quantization is the process of discretizing an input from a rep-
resentation that holds more information to a representation with less information. It often means
taking a data type with more bits and converting it to fewer bits, for example from 32-bit floats to
8-bit Integers. To ensure that the entire range of the low-bit data type is used, the input data type is
commonly rescaled into the target data type range through normalization by the absolute maximum
of the input elements, which are usually structured as a tensor. For example, quantizing a 32-bit
Floating Point (FP32) tensor into a Int8 tensor with range [−127,127]:
XInt8=round127
absmax (XFP32)XFP32
=round (cFP32·XFP32), (1)
where cis t

### Load an LLM that will answer questions about the paper

We need a smart AI chatbot to answer our questions about the paper using the documents in our vector store.
Let's use LLaMA-2 7B -- it's pretty fast on this hardware.
If you have money and aren't too concerned about privacy, you could use GPT-3 or GPT-4.

In [10]:
# Load model
from utils import load_model
llm, _, cache_loc = load_model.load_model(model_id="meta-llama/Llama-2-7b-chat-hf", max_length=4096)

Okay, using /scratch/cehrett/hf_cache for huggingface cache. Models will be stored there.
Huggingface API key loaded.
Loading model


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading tokenizer
Instantiating pipeline
Instantiating HuggingFacePipeline


In [11]:
# Let's test out the LLM
test_output = llm(prompt="""\
You are a SilenceBot. Whatever the user says to you, you respond only with: "SILENCE!!!!" 

User: Hi SilenceBot, can I talk to you for a minute?
""")
print(test_output)

SilenceBot: SILENCE!!!


### Define an object that will query the embeddings to get context for a question to the LLM
The `ChatVectorDBChain` object is a wrapper around our LLM that provides a way of interacting with it.
When we use this chain, the LLM sees a prompt containing our question.
But the prompt also contains passages from the paper we loaded.
Which passages does the LLM get to see? The ones most relevant to our question, as measured by how close their embeddings are to the embedding of the question.

So, when we pass a question to the chain:
* The question is converted to an embedding
* The embedding of our question is used to find the most similar documents in our vector store
* The question is then shown to our LLM, along with the most similar documents, which the LLM uses as context to answer our question.
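
The last step above can be sketched in plain Python. This is only an illustration of the "stuff the retrieved passages into one prompt" idea. The function name `build_stuffed_prompt` and the template wording are assumptions made for this example, not the prompt that `ChatVectorDBChain` actually uses internally.

```python
def build_stuffed_prompt(question, passages):
    """Paste retrieved passages together and append the question.

    Hypothetical template: the real chain uses its own prompt wording.
    """
    context = "\n\n".join(passages)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

retrieved = [
    "Quantization converts a data type with more bits to one with fewer bits.",
    "GPTQ quantizes weights layer by layer.",
]
prompt = build_stuffed_prompt("What is quantization?", retrieved)
print(prompt)
```

Because every retrieved passage is pasted in whole, the prompt grows with the number and size of the passages, which matters for models with a limited context length.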

In [35]:
pdf_qa = ChatVectorDBChain.from_llm(llm,
                                    db, 
                                    return_source_documents=True)


query = 'What is "quantization of model parameters"? What does that mean?'
result = pdf_qa({"question": query, "chat_history": []})  # no earlier turns, so the history is an empty list


print("Answer:")
pprint(result["answer"])
print("Source pages:\n"+",".join([str(page.metadata['page']) for page in result['source_documents']]))

Answer:
 When the authors write "quantization", they are referring
specifically to the process of representing model parameters using
fewer bits than the standard floating point representation. They use
the phrase "parameter quantization" to distinguish this meaning from
other possible interpretations... Of course, there are many other ways
to "quantify" a neural network (e.g., quantifying gradients used
during training, etc.), but if the text is talking about reducing the
precision of model weights themselves, then this is what they mean.
Source pages:
2,2,1,3


## Now make a map-reduce version

The above version is simple and fast, but it has limitations. 
* It doesn't use the specific prompt format LLaMA-2 expects.
* For context, the model just gets big chunks of the paper, pasted together. Lots of that context is probably irrelevant or distracting.
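
For reference, here is a minimal sketch of the single-turn chat layout that LLaMA-2 chat models were trained on, with the system message wrapped in `<<SYS>>` tags inside an `[INST]` block. The helper name `llama2_chat_prompt` is hypothetical, and the sketch assumes the tokenizer adds the beginning-of-sequence token itself.

```python
def llama2_chat_prompt(system_message, user_message):
    """Build a single-turn LLaMA-2 chat prompt (hypothetical helper).

    Assumes the tokenizer prepends the beginning-of-sequence token.
    """
    return (
        f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

example_prompt = llama2_chat_prompt(
    "Answer using only the provided context.",
    "What is quantization of model parameters?",
)
print(example_prompt)
```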

So, instead of the one-step Q&A used above, let's define a more nuanced map-reduce approach. In the map step, the model looks at just one document in the corpus at a time, and outputs a response about just that document.
The map step gets applied to lots of documents (one at a time).
In the reduce step, the model looks at all the outputs it produced during the map stage, and uses all those outputs combined as context to produce a final output.
The prompt used for the map step is different from the one used for the reduce step.

So, when we pass a question to this chain:
* The question is converted to an embedding
* The embedding of our question is used to find the most similar documents in our vector store
* For each of those most similar documents:
    * The document is shown to the LLM, along with our question. The model is instructed to summarize all info in the document relevant to our question.
* There is now a summary of each of the documents similar to our query.
* Those summaries are pasted together and shown to the LLM along with our question.
* The LLM produces a final answer to our question using the summarized documents as context.

In [36]:
# Define map and reduce functions
from tqdm import tqdm
from langchain.vectorstores import FAISS

def map_fn(doc, query, verbose=False):
    """Map step: summarize the parts of one document that are relevant to the query."""
    # LLaMA-2 chat format: system message inside <<SYS>> tags, all wrapped in [INST] ... [/INST]
    prompt = f"""\
[INST] <<SYS>>
From the following part of an academic paper, summarize all information that is relevant to the question "{query}".
<</SYS>>

DOCUMENT:
{doc.page_content} [/INST]\
"""

    output = llm(prompt)
    if verbose:
        print(f'DOCUMENT:\n{doc.page_content}\n\nSUMMARY:\n{output}')
    return output

def reduce_fn(mapped_outputs, query):
    """Reduce step: answer the query using the map-step summaries as context."""
    # Separate the per-document summaries with a visible delimiter
    context = "\n#########\n".join(mapped_outputs)

    prompt = f"""\
[INST] <<SYS>>
Based on the below context, which is summaries of parts of academic papers, answer the question: "{query}".
<</SYS>>

CONTEXT:
{context} [/INST]\
"""

    return llm(prompt)

In [59]:
# Try out the map-reduce version
query = "What is quantization of model parameters?"

from utils.vector_store_tools import get_relevant_docs
docs = get_relevant_docs(query, db)

# Use tqdm for list comprehension progress bar
verbose = False
mapped_outputs = [map_fn(doc, query, verbose=verbose) for doc in tqdm(docs, desc="Mapping documents")]

final_answer = reduce_fn(mapped_outputs, query)

print("FINAL ANSWER:")
print(final_answer)
print("\nSource pages:\n" + ",".join([str(page.metadata['page']) for page in docs]))

Mapping documents: 100%|██████████| 5/5 [03:22<00:00, 40.40s/it]


FINAL ANSWER:
  Quantization of model parameters refers to the process of representing a set of model parameters using fewer bits than their original size. This technique is commonly used to reduce the memory requirement for training deep neural networks, especially when dealing with large models like transformers. The most common way to perform quantization is by dividing the weight tensor into chunks, assigning a unique quantization constant to each chunk, and then scaling the standard deviations of the weight tensor to match the standard deviations of the k-bit data type. Another approach is to allow individual weights to move freely during training, so that they can adaptively adjust their discrete values in response to changing error gradients. Techniques like optimal brain quantization (OBQ) and GPTQ have been proposed recently to improve the efficiency of training large language models.

Source pages:
2,2,4,2,3


## Now make a version that loads multiple papers and cites them
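
With several papers in one vector store, each source needs to be cited by file as well as page. Below is a small sketch of one way to turn retrieved sources into a citation list. The helper name `format_citations` is hypothetical, and it assumes the `page` metadata is zero-based (which is my understanding of how `PyPDFLoader` numbers pages), so it adds 1 to get human-readable page numbers. It also drops duplicate citations, since several chunks can come from the same page.

```python
def format_citations(sources):
    """Format (source_path, zero_based_page) pairs as unique citation lines.

    Assumption: page numbers are zero-based, so 1 is added for display.
    """
    seen = set()
    lines = []
    for source, page in sources:
        key = (source, page)
        if key in seen:
            continue  # skip chunks that come from an already-cited page
        seen.add(key)
        lines.append(f"{source}, page {page + 1}")
    return "\n".join(lines)

example_sources = [("gptq.pdf", 2), ("gptq.pdf", 2), ("qlora.pdf", 1)]
citations = format_citations(example_sources)
print(citations)
```

With LangChain documents, the input pairs could be built as `[(d.metadata['source'], d.metadata['page']) for d in docs]`.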

In [37]:
!curl -o gptq.pdf https://arxiv.org/pdf/2210.17323.pdf
!curl -o qlora.pdf https://arxiv.org/pdf/2305.14314.pdf
!curl -o lora.pdf https://arxiv.org/pdf/2106.09685.pdf

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  509k  100  509k    0     0   676k      0 --:--:-- --:--:-- --:--:--  676k
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1040k  100 1040k    0     0  1469k 

In [38]:
pdf_paths = ["gptq.pdf", "qlora.pdf", "lora.pdf"]
pages = []
for pdf_path in pdf_paths:
    loader = PyPDFLoader(pdf_path)
    pages += loader.load_and_split()

In [39]:
saved_already = True

if saved_already:
    db = load_saved_vector_store(cache_loc)
else:
    db = generate_vector_store(pages, cache_loc)
    db.save_local(os.path.join("data","faiss_index"))

In [41]:
# Simple version
pdf_qa = ChatVectorDBChain.from_llm(llm,
                                    db, 
                                    return_source_documents=True)

# Define the query before running the chain
query = 'Please explain what "quantization of model parameters" is. What do they mean by that?'
result = pdf_qa({"question": query, "chat_history": []})
print("Answer:")
pprint(result["answer"])
print("Sources:\n" + "\n".join(f"{page.metadata['source']}, page {page.metadata['page']}"
                               for page in result['source_documents']))

Answer:
Quantization can refer to the procedure of mapping continuous signals
or numbers to shorter strings of binary digits for implementation on
computers. In this setting, continuous quantities represented by
floating-point numbers may lose some of their resolution when
converted to integers due to roundoff mistakes. By choosing an
appropriate nonzero value called a "quantization reference" or simply
"quant," some fraction of the integers might be assigned the same code
word as the ones closest to the origanal quantity within a given
measurement interval (or boxcar); otherwise, there would be
"aliasing." Aliasing could occur with quantization unless properly
controlled via the selection of appropriate sample intervals and
oversampling rates depending upon desired accuracy levels attainable
within available computing hardware capabilities. It also finds
applications across various disciplines such as signal processing,
embedded systems design, robotics kinematics, image processing,
t