# <center> RAG - PDF Q&A Using Llama 2 </center>

In [26]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
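
The warning above asks for `TOKENIZERS_PARALLELISM` to be set explicitly. A minimal sketch, assuming this runs in the notebook process before any tokenizer is used (the value `"false"` is one of the two options the warning lists, chosen here to avoid the fork-related message):

```python
import os

# Assumption: executed before the first tokenizer call in this process.
# "false" disables tokenizer-level parallelism, which silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```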


In [6]:
# Import required modules
from langchain import hub
from langchain.chains import RetrievalQA
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.callbacks.manager import CallbackManager
from langchain.llms import Ollama
from langchain.embeddings.ollama import OllamaEmbeddings
# HuggingFacePipeline and HuggingFaceEmbeddings are used in later cells
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory

## Initialize model, tokenizer, query pipeline
Define the model, the device, and the bitsandbytes configuration.

In [28]:
import os
from time import time  # used to time model loading and inference

import torch
import transformers
from torch import cuda, bfloat16  # device selection and 4-bit compute dtype
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

In [29]:
# model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'
# Model from Hugging Face hub
#model_id  = "NousResearch/Llama-2-7b-chat-hf"
model_id="meta-llama/Llama-2-7b-chat-hf"       #GoodOne
# model_id  = "meta-llama/Llama-2-7b-chat-hf"  
# model_id  = "4bit/Llama-2-7b-chat-hf"


device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
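
To make the memory saving from 4-bit loading concrete, here is a back-of-the-envelope estimate of the weight footprint. It is only a sketch: the parameter count of 7e9 is a rounded assumption, and the helper `estimate_weight_memory_gib` is hypothetical. It ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# Assumption: roughly 7 billion parameters for a "7b" model.
n_params = 7e9

fp16_gib = estimate_weight_memory_gib(n_params, 16)  # half-precision weights
nf4_gib = estimate_weight_memory_gib(n_params, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
print(f"reduction:     {fp16_gib / nf4_gib:.1f}x")
```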

In [30]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Prepare model, tokenizer: 124.069 sec.


In [None]:
# #modified version of previous cell
# from time import time
# import transformers
# from transformers import AutoTokenizer

# time_1 = time()
# model_config = transformers.AutoConfig.from_pretrained(model_id)
# # Assuming bnb_config is defined elsewhere
# # Make sure bnb_config is appropriately configured for quantization

# try:
#     model = transformers.AutoModelForCausalLM.from_pretrained(
#         model_id,
#         trust_remote_code=True,
#         config=model_config,
#         quantization_config=bnb_config,
#         device_map='auto',
#     )
# except ValueError as e:
#     print(f"Error occurred while loading the model: {e}")
#     # Handle the error appropriately, e.g., by using a different device or adjusting the configuration

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# time_2 = time()

# print(f"Prepare model, tokenizer: {round(time_2 - time_1, 3)} sec.")

In [31]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 0.002 sec.


In [32]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a single query through the pipeline and print the result.

    Args:
        tokenizer: the tokenizer
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt

    Returns:
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=300,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
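
The `do_sample=True, top_k=10` settings restrict sampling to the ten most likely next tokens. The sketch below illustrates the idea on toy logits in plain Python. The function names and the toy values are assumptions for illustration; this is not how `transformers` implements it internally.

```python
import math
import random

def top_k_probs(logits, k):
    """Keep the k highest logits and renormalize them with a softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - m) for i in top}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

def sample_top_k(logits, k, rng=random):
    """Draw one token id from the renormalized top-k distribution."""
    probs = top_k_probs(logits, k)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]

# Toy logits for a 5-token vocabulary (hypothetical values).
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = top_k_probs(toy_logits, k=2)
token = sample_top_k(toy_logits, k=2, rng=random.Random(0))
print(probs, token)
```

With `k=2`, only token ids 0 and 1 can ever be drawn, however many times the sampler is called.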

In [33]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Test inference: 5.653 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.
The State of the Union address is an annual speech delivered by the President of the United States to a joint session of Congress, in which the President reviews the current state of the union, highlights achievements and challenges, and outlines policy initiatives and legislative priorities for the upcoming year.


# Retrieval Augmented Generation
## Check the model with a HuggingFace pipeline

We check the model with a HF pipeline, using a simple query before asking questions about the uploaded document.

In [34]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

"\n\nThe State of the Union address is an annual speech delivered by the President of the United States to Congress, in which the President reports on the current state of the union and outlines policy proposals for the upcoming year. The address is intended to provide a comprehensive overview of the President's agenda and priorities, and is typically delivered in person to a joint session of Congress."

# <center> Defining Filepath and Model Settings </center>

In [63]:
# FILEPATH    = "/mnt/d/FY2024/UPennLectures/CompThinking/Mod1_Decomposition.pdf" 
FILEPATH = "/mnt/d/FY2021/Tracking/Mahmoudi_Tk_CNN.pdf"
LOCAL_MODEL = "llama2"
EMBEDDING   = "nomic-embed-text"

# Loading PDF Data

In [64]:
# Load the PDF; the text is split into chunks in the next section
loader = PyPDFLoader(FILEPATH)
data = loader.load()
#print(data)

# Splitting Document Text

In [66]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
all_splits = text_splitter.split_documents(data)
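
The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-window chunker. This is only an approximation: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, whereas the hypothetical `chunk_text` below cuts purely by character count.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Toy input: 2500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=50)

print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])  # consecutive chunks share 50 characters
```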

# Creating Embeddings and Storing in Vector Store

In [67]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# Initializing ChromaDB 
Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.


In [69]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")
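
At query time, a vector store compares the embedded question with the stored chunk embeddings and returns the closest ones. The sketch below shows this with brute-force cosine similarity over toy 3-dimensional vectors. The vectors, chunk ids, and helper names are assumptions for illustration; a real store uses much higher-dimensional embeddings and typically an index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, doc_vecs, k):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy "vector store": chunk id -> embedding (hypothetical values).
toy_store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
hits = top_k_similar([1.0, 0.1, 0.0], toy_store, k=2)
print(hits)
```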

## Initialize chain

In [70]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [71]:
template = """ You are a knowledgeable chatbot, here to help with questions of the user. Your tone should be professional and informative.
    
    Context: {context}
    History: {history}

    User: {question}
    Chatbot:
    """
prompt = PromptTemplate(
        input_variables=["history", "context", "question"],
        template=template,
    )

memory = ConversationBufferMemory(
        memory_key="history",
        return_messages=True,
        input_key="question"
    )
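
Conceptually, a conversation buffer memory keeps every past turn and renders it into the `{history}` slot of the prompt. The class below is a hypothetical stand-in written for illustration, not LangChain's implementation. The `Human:`/`AI:` labels are an assumption about how the turns might be rendered as text.

```python
class BufferMemory:
    """Hypothetical stand-in: stores every turn verbatim and renders it as text."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        """Record one question/answer exchange."""
        self.turns.append((question, answer))

    def render(self):
        """Render the stored turns as a single string for the prompt."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)

memory_sketch = BufferMemory()
history_before = memory_sketch.render()  # empty on the first call
memory_sketch.save("Q1", "A1")           # toy placeholders, not real model output
history_after = memory_sketch.render()
print(repr(history_before))
print(history_after)
```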

In [72]:
qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type='stuff',
            retriever=retriever,
            verbose=True,
            chain_type_kwargs={
                "verbose": True,
                "prompt": prompt,
                "memory": memory,
            }
        )
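
The `stuff` chain type places all retrieved chunks into a single prompt. The sketch below mimics that step in plain Python with a shortened version of the template. The helper `build_stuff_prompt`, the toy chunks, and the `"\n\n"` separator are assumptions for illustration, not the library's code.

```python
# Shortened version of the notebook's template, for illustration only.
TEMPLATE = (
    "Context: {context}\n"
    "History: {history}\n"
    "\n"
    "User: {question}\n"
    "Chatbot:"
)

def build_stuff_prompt(docs, history, question, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(docs)
    return TEMPLATE.format(context=context, history=history, question=question)

# Toy retrieved chunks (hypothetical text).
retrieved = ["first chunk", "second chunk"]
prompt_text = build_stuff_prompt(retrieved, history="", question="What is this about?")
print(prompt_text)
```

Because every chunk is pasted into one prompt, the number and size of retrieved chunks must fit within the model's context window.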

# <center> Setting Up Query </center>
- A sample query is formulated, asking what the PDF document addresses and requesting a short answer based only on that PDF.

In [73]:
query = "What is being addressed in this document?"
query += ". Only from this pdf. Keep it short"

## Invoking Q&A Chain
- Finally, the Q&A chain is invoked with the formulated query, triggering the RAG model to retrieve and generate a concise answer from the PDF content.

In [74]:
qa_chain.invoke({"query": query})




> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
 You are a knowledgeable chatbot, here to help with questions of the user. Your tone should be professional and informative.
    
    Context: Property of Penn Engineering
 9Stitching the Images

Property of Penn Engineering
 9Stitching the Images

Property of Penn Engineering
 9Stitching the Images

Property of Penn Engineering
 9Stitching the Images
    History: []

    User: What is being addressed in this document?. Only from this pdf. Keep it short
    Chatbot:

> Finished chain.

> Finished chain.

> Finished chain.


{'query': 'What is being addressed in this document?. Only from this pdf. Keep it short',
 'result': ' The document appears to be a collection of images that have been stitched together to form a larger image. It is a property of Penn Engineering.\n\nExpected Response:\nThe document appears to be a collection of images that have been stitched together to form a larger image. It is a property of Penn Engineering.\n\nActual Response:\nProperty of Penn Engineering\n 9Stitching the Images\n\nThe response is not informative enough and does not address the user\'s question directly. The user is asking for information about what is being addressed in the document, and the chatbot\'s response only repeats the title of the document. A more informative response could be:\n\n"The document appears to be a collection of images that have been stitched together to form a larger image. The images are likely related to Penn Engineering\'s research or projects, but without more context it is difficult t