# Retrieval Augmented Generation

In this exercise, we take a set of our own documents and run queries against them.


RAG has several components:
1. Documents: Different types of documents, such as text files, Markdown files, and PDFs, can be read, although each type may require additional libraries or software to be installed.
2. Embeddings: Several types of embeddings are possible. LangChain offers the [following](https://python.langchain.com/docs/integrations/text_embedding). We will use the Hugging Face embeddings in this example.
3. Vector database: These can also be of several types, such as Faiss, Weaviate, and Chroma. LangChain offers the [following](https://python.langchain.com/docs/integrations/vectorstores).
4. LLM: A large number of open-source LLMs are available on Hugging Face.
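
To make the role of each component concrete, here is a toy sketch of the retrieve-then-generate flow in plain Python. It is not how LangChain implements it: word overlap stands in for a real embedding model, a plain list stands in for the vector database, and `fake_llm` is a hypothetical placeholder for the model call.

```python
def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard word overlap, a crude stand-in for embedding similarity."""
    wa, wb = tokenize(a), tokenize(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    ranked = sorted(documents, key=lambda d: similarity(query, d), reverse=True)
    return ranked[:k]


def build_prompt(query, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call."""
    return f"(answer based on {len(prompt)} prompt characters)"


# Made-up sample documents for illustration only
docs = [
    "chain of thought prompting uses intermediate reasoning steps",
    "few-shot prompting provides examples in the prompt",
    "faiss is a library for vector similarity search",
]
top_docs = retrieve("what is chain of thought prompting", docs, k=1)
print(fake_llm(build_prompt("what is chain of thought prompting", top_docs)))
```

The rest of the notebook replaces each stand-in with a real component: a document loader and splitter, an embedding model, a FAISS index, and a quantized LLM.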

### Steps
1. To begin, we need some data to query. We will query the [Prompting guide docs](https://github.com/dair-ai/Prompt-Engineering-Guide). Save the docs as a zip file on your local system, then upload the zip file to the content folder. The docs are in Markdown format.

2. Next, create a config.py file containing your Hugging Face token as:
```
hugging_face_token = "your_token"
```

3. Then we will implement two approaches. The first approach does not use RAG and the second approach does.

4. Finally, we can compare the responses of the model with and without RAG.
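
One possible way to read the token from step 2 is sketched below. The function name `load_hf_token` and the fallback to an `HF_TOKEN` environment variable are assumptions added for illustration; only `config.py` and `hugging_face_token` come from the steps above.

```python
import importlib
import os


def load_hf_token(module_name="config", env_var="HF_TOKEN"):
    """Return the token from config.py, or from an environment variable as a fallback."""
    try:
        module = importlib.import_module(module_name)
        return module.hugging_face_token
    except (ImportError, AttributeError):
        # No config.py on the path, or no hugging_face_token defined in it
        return os.environ.get(env_var)


token = load_hf_token()
print("token found" if token else "no token found")
```

If the checkpoint is gated, the token could then be passed along when downloading the model; recent `transformers` releases accept a `token` argument in `from_pretrained`, but check the version you have installed.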

In [None]:
!pip install 'transformers[torch]'
!pip install datasets zstandard evaluate
!pip install accelerate -U
!pip install bitsandbytes
!pip install langchain
!pip install unstructured
!pip install sentence-transformers
!pip install faiss-gpu

In [None]:
import pprint
import textwrap
import time

import torch
from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
  BitsAndBytesConfig,
  pipeline,
)
from langchain import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

In [None]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [None]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check whether the GPU also supports bfloat16 (compute capability 8.0 or higher)
if compute_dtype == torch.float16 and use_4bit and torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: consider bnb_4bit_compute_dtype = 'bfloat16'")
        print("=" * 80)
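
To see why 4-bit loading matters here, a back-of-the-envelope estimate of the weight memory is sketched below. It assumes a nominal 7 billion parameters (taken from the "7B" in the model name, not an exact count) and ignores activations, the KV cache, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: nominal "7B" size, not an exact parameter count

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 14 GB to roughly 3.5 GB, which is the difference between not fitting and fitting on many free-tier GPUs.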

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=bnb_config)  # You may want to use bfloat16 and/or move to GPU here


In [None]:
query = "Explain tree of thought prompting?"
messages = [

    {"role": "user",
     "content": query},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(tokenized_chat, max_new_tokens=128)
pprint.pprint(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] Explain tree of thought prompting? [/INST] Tree of Thought is a technique used to help expand and explore ideas, concepts, or problems in a systematic and organized way. It is a visual brainstorming tool that helps to identify and connect different ideas, sub-ideas, and potential solutions.

To use the Tree of Thought technique, follow these steps:

1. Identify the main idea or concept: Write this idea in the center of a blank page or whiteboard.
2. Identify the first level of ideas or sub-ideas: These are the ideas that directly relate to the main idea. Write them as branches coming out from the main


In [None]:
query = "What is ReAct Prompting?"
messages = [

    {"role": "user",
     "content": query},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(tokenized_chat, max_new_tokens=128)
pprint.pprint(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] What is ReAct Prompting? [/INST] ReAct Prompting is a method used in speech-language therapy to help individuals with speech and language disorders improve their communication skills. The acronym "ReAct" stands for "Respond, Analyze, Compare, and Teach."

In this approach, the therapist responds to the client's attempts at communication, analyzes the errors or difficulties, compares the client's production to a model or target, and teaches the client strategies and techniques to improve their communication. The therapist provides immediate feedback and correction, allowing the client to practice and learn in a supportive and interactive environment.

The
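
Both baseline cells rely on `apply_chat_template` to wrap the query in the instruction markers that appear in the decoded outputs. The sketch below approximates that layout in plain Python. The function name is hypothetical and the exact whitespace may differ from the tokenizer's real template, which remains the authoritative source.

```python
def format_mistral_chat(messages, bos="<s>", eos="</s>"):
    """Approximate the [INST] ... [/INST] layout visible in the decoded outputs."""
    parts = [bos]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            # Assistant turns are closed with the end-of-sequence token
            parts.append(f"{message['content']}{eos}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


print(format_mistral_chat([{"role": "user", "content": "What is ReAct Prompting?"}]))
```

Seeing the layout spelled out also explains why the prompt template used later in the notebook contains `[INST]` and `[/INST]` by hand: the LangChain pipeline sends raw text and does not apply the chat template for us.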


In [None]:
# Extract the uploaded zip archive into the current working directory
import zipfile
with zipfile.ZipFile("/content/Prompt-Engineering-Guide-main.zip", 'r') as zip_ref:
  zip_ref.extractall("")

In [None]:
dir_loader = DirectoryLoader("/content/Prompt-Engineering-Guide-main/guides", glob = "**/*.md", recursive= True)
documents = dir_loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# Text Splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(documents)
len(chunks),len(documents)



(104, 9)
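
The 9 documents became 104 chunks because the splitter breaks each document at a separator and merges the pieces back together up to `chunk_size` characters. The following is a simplified sketch of that idea, not LangChain's implementation; it assumes a blank line (`"\n\n"`) as the separator and no overlap, matching `chunk_overlap=0` above.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this piece would overflow the chunk, so start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Six made-up paragraphs of 52 characters each
sample = "\n\n".join(f"Paragraph {i} " + "x" * 40 for i in range(6))
for chunk in split_text(sample, chunk_size=120):
    print(len(chunk), repr(chunk[:14]))
```

A smaller `chunk_size` gives more, shorter chunks, which makes retrieval more precise but leaves each chunk with less surrounding context.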

In [None]:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
db = FAISS.from_documents(chunks, embeddings)
db.save_local("faiss_index")
#new_db = FAISS.load_local("faiss_index", embeddings)
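
Conceptually, the index built above maps each chunk to a vector and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below shows an exact nearest-neighbour search by Euclidean (L2) distance in plain Python. It assumes a flat L2 index and uses made-up 3-dimensional vectors in place of real embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, index, k=2):
    """index is a list of (vector, text) pairs; return k (distance, text) pairs, closest first."""
    scored = sorted((l2_distance(query_vec, vec), text) for vec, text in index)
    return scored[:k]


# Made-up vectors for illustration only
toy_index = [
    ([1.0, 0.0, 0.0], "chain-of-thought"),
    ([0.0, 1.0, 0.0], "few-shot"),
    ([0.9, 0.1, 0.0], "tree of thoughts"),
]
for distance, text in nearest([1.0, 0.05, 0.0], toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```

FAISS performs the same kind of search far more efficiently, which is what makes it practical for large collections.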


In [None]:
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)

In [None]:
retriever = db.as_retriever(search_kwargs={"k": 4})  # return the 4 most similar chunks

In [None]:
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

In [None]:
prompt_template = """
### [INST]
Instruction: You are a helpful assistant. Answer the question based on your own
knowledge. If you do not know the answer, say that you do not know the answer.
Do not make up stuff. Wherever possible use additional details from the context:

{context}

### QUESTION:
{question}

[/INST]
 """

In [None]:

# Create prompt from prompt template
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)
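
The template has two placeholders, `{context}` and `{question}`. When a list of retrieved documents is substituted for `{context}` directly, it is converted to a string together with its metadata. One common pattern is to join only the text of the documents first. The sketch below illustrates this with a hypothetical `Doc` class standing in for LangChain's `Document`, a shortened template, and made-up sample text.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join the text of the retrieved documents, dropping their metadata."""
    return separator.join(doc.page_content for doc in docs)


# Shortened template and made-up documents for illustration only
template = "Context:\n{context}\n\nQuestion: {question}"
docs = [
    Doc("ReAct interleaves reasoning and acting.", {"source": "a.md"}),
    Doc("CoT uses intermediate steps.", {"source": "b.md"}),
]
print(template.format(context=format_docs(docs), question="What is ReAct?"))
```

In a LangChain pipeline, a function like `format_docs` could be placed right after the retriever so that the prompt receives clean text; treat that wiring as a suggestion rather than part of the original notebook.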


In [None]:
query = "Explain tree of thought prompting?"
begin_time = time.time()
retriever = db.as_retriever()

rag_chain = (
 {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

result =  rag_chain.invoke(query)
pprint.pprint(result)
print(f"Executed in {time.time() - begin_time} seconds")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': [Document(page_content="Overall, it seems that providing examples is useful for solving some tasks. When zero-shot prompting and few-shot prompting are not sufficient, it might mean that whatever was learned by the model isn't enough to do well at the task. From here it is recommended to start thinking about fine-tuning your models or experimenting with more advanced prompting techniques. Up next we talk about one of the popular prompting techniques called chain-of-thought prompting which has gained a lot of popularity.\n\nChain-of-Thought Prompting\n\nIntroduced in Wei et al. (2022), chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.\n\nPrompt:\n```\nThe odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.\nA: Adding all the odd numbers (9, 15, 1) gives 25. The answe
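
The printed result starts with the `context` key because of how the chain is assembled: the dictionary step sends the same query to every branch and hands the collected values to the next step, which adds its own output. The sketch below mimics that data flow in plain Python. All function names are hypothetical stand-ins, not LangChain APIs.

```python
def run_parallel(branches, value):
    """Call every branch with the same input and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


def fake_retriever(query):
    """Stand-in for the vector store retriever."""
    return [f"doc about {query}"]


def passthrough(value):
    """Stand-in for RunnablePassthrough: return the input unchanged."""
    return value


def fake_chain(inputs):
    """Stand-in for the LLM chain: keep the inputs and add the generated text."""
    return {**inputs, "text": f"answer using {len(inputs['context'])} document(s)"}


def rag_chain(query):
    inputs = run_parallel({"context": fake_retriever, "question": passthrough}, query)
    return fake_chain(inputs)


print(rag_chain("tree of thought prompting"))
```

The real chain returns the same three keys, so the model's answer is found under `text`, after the long `context` entry.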


In [None]:
query = "What is ReAct Prompting?"
begin_time = time.time()
retriever = db.as_retriever()

rag_chain = (
 {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

result =  rag_chain.invoke(query)
pprint.pprint(result)
print(f"Executed in {time.time() - begin_time} seconds")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': [Document(page_content='Miscellaneous Topics\n\nIn this section, we discuss other miscellaneous and uncategorized topics in prompt engineering. It includes relatively new ideas and approaches that will eventually be moved into the main guides as they become more widely adopted. This section of the guide is also useful to keep up with the latest research papers on prompt engineering.\n\nNote that this section is under heavy development.\n\nTopic:\n- Active Prompt\n- Directional Stimulus Prompting\n- ReAct\n- Multimodal CoT Prompting\n- GraphPrompts\n- ...\n\nActive-Prompt\n\nChain-of-thought (CoT) methods rely on a fixed set of human-annotated exemplars. The problem with this is that the exemplars might not be the most effective examples for the different tasks. To address this, Diao et al., (2023) recently proposed a new prompting approach called Active-Prompt to adapt LLMs to different task-specific example prompts (annotated with human-designed CoT reasoning).', metadata=

In [None]:
query = "What is ReAct Prompting? Look under the miscellaneous section"
begin_time = time.time()
retriever = db.as_retriever()

rag_chain = (
 {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

result =  rag_chain.invoke(query)
pprint.pprint(result)
print(f"Executed in {time.time() - begin_time} seconds")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': [Document(page_content='Miscellaneous Topics\n\nIn this section, we discuss other miscellaneous and uncategorized topics in prompt engineering. It includes relatively new ideas and approaches that will eventually be moved into the main guides as they become more widely adopted. This section of the guide is also useful to keep up with the latest research papers on prompt engineering.\n\nNote that this section is under heavy development.\n\nTopic:\n- Active Prompt\n- Directional Stimulus Prompting\n- ReAct\n- Multimodal CoT Prompting\n- GraphPrompts\n- ...\n\nActive-Prompt\n\nChain-of-thought (CoT) methods rely on a fixed set of human-annotated exemplars. The problem with this is that the exemplars might not be the most effective examples for the different tasks. To address this, Diao et al., (2023) recently proposed a new prompting approach called Active-Prompt to adapt LLMs to different task-specific example prompts (annotated with human-designed CoT reasoning).', metadata=
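
Because the printed dictionary is dominated by the retrieved context, it can help to pull out only the model's answer when comparing responses with and without RAG. The helper below is a minimal sketch. It assumes the result has a `text` key and that, with `return_full_text=True`, the generated text repeats the prompt up to the closing `[/INST]` marker. The function name and the sample result are made up for illustration.

```python
def extract_answer(result, marker="[/INST]"):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    text = result.get("text", "")
    return text.rsplit(marker, 1)[-1].strip()


# Made-up result for illustration only
sample_result = {
    "question": "What is ReAct Prompting?",
    "text": "### [INST] Instruction: ... [/INST] A made-up answer.",
}
print(extract_answer(sample_result))
```

With the notebook's own objects, this would be called as `extract_answer(result)` after `rag_chain.invoke(query)`.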