# What is Retrieval Augmented Generation (RAG)?

- RAG combines the general language understanding of a Large Language Model (LLM) with specific sets of information that are not included in its training data.

- RAG improves the accuracy and reliability of generative AI models by supplying the information that the model should base its answers on.
- While general-purpose LLMs are useful for responding to general prompts quickly, they may fall short for users who need answers drawn from specific documents.

In simple terms, RAG works like this:
- First, it searches through a vast amount of existing text data to find relevant information related to a specific question or topic.
- Then, it uses this retrieved information to generate new text or responses that are contextually relevant and coherent.
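
The two steps above can be sketched in plain Python. This is a toy illustration, not how LlamaIndex works internally: the word-overlap scoring, the `retrieve` and `build_prompt` helpers, and the example passages are all assumptions made up for this sketch. A real pipeline would use embeddings for retrieval and send the assembled prompt to an LLM.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Step 1: rank passages by how many words they share with the question."""
    question_words = Counter(tokenize(question))

    def overlap(passage: str) -> int:
        # Multiset intersection counts the words present in both texts
        return sum((question_words & Counter(tokenize(passage))).values())

    return sorted(passages, key=overlap, reverse=True)[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Step 2: place the retrieved text in front of the question for the LLM."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"


# Hypothetical mini "document collection"
passages = [
    "The cafeteria opens at 8 am on weekdays.",
    "The library closes at 6 pm on Fridays.",
]
question = "When does the library close on Fridays?"

top = retrieve(question, passages)
prompt = build_prompt(question, top)
print(prompt)
```

The printed prompt is what would be handed to the language model, so the model answers from the retrieved passage rather than from its training data alone.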

# Install packages

In [1]:
!pip install transformers
!pip install torch

#https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
#https://gpt-index.readthedocs.io/en/latest/getting_started/reading.html
!pip install llama-index
!pip install huggingface_hub
!pip install accelerate
!pip install pypdf


# Import packages

In [5]:
import torch
from llama_index.prompts import PromptTemplate
from llama_index import ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.chat_engine import SimpleChatEngine
#For RAG
from llama_index import VectorStoreIndex, SimpleDirectoryReader

import accelerate
from huggingface_hub.hf_api import HfFolder

# Set up the Llama2 model

In [6]:
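#Save your Hugging Face token to use the Llama 2 model
#You must first request access to the model -> https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
#NOTE: hf_token is not defined in this notebook; set it to your own access token before running this cell
HfFolder.save_token(hf_token)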

In [7]:
SYSTEM_PROMPT = """You are an answer bot that answers questions based on the given input. 
Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Answer to the point and keep it short.
- Do not guess. 
"""

In [8]:
query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
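
To see what the wrapper produces, the sketch below fills the same template string with plain `str.format`. It stands in for `PromptTemplate` only for illustration; the shortened `system_prompt` and the sample question are placeholders, and the real class may handle formatting details differently.

```python
# Shortened stand-in for the notebook's SYSTEM_PROMPT (illustration only)
system_prompt = "You are an answer bot.\n"

# Same layout as the template passed to PromptTemplate above
template = "[INST]<<SYS>>\n" + system_prompt + "<</SYS>>\n\n{query_str}[/INST] "

# Substitute the user's question into the {query_str} placeholder
final_prompt = template.format(query_str="What is SAS Visual Text Analytics?")
print(final_prompt)
```

The system rules sit between the `<<SYS>>` markers and the question sits just before `[/INST]`, which is the layout the Llama 2 chat models expect.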

In [9]:
model = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=1000,
    generate_kwargs={#"temperature": 0.3, 
        "do_sample": False},
    system_prompt=SYSTEM_PROMPT,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    #stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    #model_kwargs={"torch_dtype": torch.float16}
)

In [10]:
#Select embeddings to use
#https://huggingface.co/BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [11]:
#Set service context
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=model,
    embed_model=embed_model
)
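
The `chunk_size` setting controls how large each piece of a document is before it is embedded. The sketch below shows the general idea with a hypothetical `split_into_chunks` helper that works on a list of words. This is a simplification: LlamaIndex measures chunks in tokens and its splitter, as far as can be assumed here, also tries to respect sentence boundaries.

```python
def split_into_chunks(
    words: list[str], chunk_size: int, overlap: int = 0
) -> list[list[str]]:
    """Split a word list into windows of chunk_size words sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Toy input: ten placeholder "words"
words = [f"w{i}" for i in range(10)]
chunks = split_into_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks make retrieval more precise but give the model less surrounding context per chunk; larger chunks do the opposite.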

# Define documents used for RAG and create an index

In [12]:
#Place the documents used for RAG into a folder
#In this case, there are 2 PDF documents in the "Testdocs" folder, located next to this notebook
documents = SimpleDirectoryReader("Testdocs").load_data()

In [13]:
#NOTE: This reuses the service_context defined in the previous section
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
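
Conceptually, a vector index stores one embedding per chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The sketch below illustrates that lookup with cosine similarity over hand-written 3-dimensional vectors. The `cosine_similarity` and `top_k_chunks` helpers and all vectors are made up for illustration; real embeddings come from the embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk embeddings
chunk_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
query_vector = [1.0, 0.1, 0.0]

results = top_k_chunks(query_vector, chunk_vectors, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```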

# Use the query engine to get answers based on provided documents

In [14]:
query_engine = index.as_query_engine()
RAG_response = query_engine.query("What is SAS Visual Text Analytics?")



In [15]:
RAG_response

Response(response='Based on the given context information, SAS Visual Text Analytics is a web-based text analytics application that uses context to provide a comprehensive solution to the challenge of identifying and categorizing key textual data. It is a part of SAS Viya and combines the visual programming flow of SAS Text Miner with the rules-based linguistic methods of categorization and concept extraction in SAS Contextual Analysis. The application allows users to identify key textual data in their document collections, build concept and categorization models, and remove meaningless textual data.', source_nodes=[NodeWithScore(node=TextNode(id_='62b548cc-3620-41f5-b2bd-cc98e29a335e', embedding=None, metadata={'page_label': '5', 'file_name': 'sasvta84.pdf', 'file_path': 'Testdocs/sasvta84.pdf', 'file_type': 'application/pdf', 'file_size': 2210066, 'creation_date': '2023-11-13', 'last_modified_date': '2023-11-13', 'last_accessed_date': '2024-02-08'}, excluded_embed_metadata_keys=['fil
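
The raw `Response` repr above is hard to read. A small helper can print only the answer and where it came from. The sketch below is a hypothetical `summarize_response` function that relies only on the attributes visible in the output above (`response`, `source_nodes`, `node.metadata`) plus the `score` of each `NodeWithScore`. It is demonstrated on stand-in objects built with `SimpleNamespace` so it runs without loading the model.

```python
from types import SimpleNamespace


def summarize_response(response) -> str:
    """Format the answer text followed by one line per source chunk."""
    lines = [response.response.strip(), "", "Sources:"]
    for item in response.source_nodes:
        metadata = item.node.metadata
        score = f"{item.score:.3f}" if item.score is not None else "n/a"
        lines.append(
            f"- {metadata.get('file_name', '?')}, "
            f"page {metadata.get('page_label', '?')} (score {score})"
        )
    return "\n".join(lines)


# Stand-in object mimicking the shape of a query response (made-up values)
fake_response = SimpleNamespace(
    response="Example answer.",
    source_nodes=[
        SimpleNamespace(
            score=0.5,
            node=SimpleNamespace(
                metadata={"file_name": "example.pdf", "page_label": "1"}
            ),
        )
    ],
)

summary = summarize_response(fake_response)
print(summary)
```

In the notebook, the same helper could be called as `print(summarize_response(RAG_response))`, assuming the response object exposes those attributes in the installed version.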

In [12]:
RAG_response2 = query_engine.query("What languages are supported?")



In [13]:
RAG_response2

Response(response='Sure, I\'d be happy to help! Based on the context information provided, the answer to the query "What languages are supported?" is:\n\nSAS Visual Text Analytics 8.4 supports the following languages:\n\n1. Arabic\n2. Chinese (Simplified and Traditional)\n3. Croatian\n4. Czech\n5. Danish\n6. Dutch\n7. English\n8. Farsi\n9. Finnish\n10. French\n11. German\n12. Greek\n13. Hebrew\n14. Hindi\n15. Hungarian\n16. Indonesian\n17. Italian\n18. Japanese\n19. Kazakh\n20. Korean\n21. Norwegian (Bokmal and Nynorsk)\n22. Polish\n23. Portuguese\n24. Romanian\n25. Russian\n26. Slovak\n27. Slovene\n28. Spanish\n29. Swedish\n30. Tagalog\n31. Thai\n32. Turkish\n33. Vietnamese\n\nI hope this helps! Let me know if you have any other questions.', source_nodes=[NodeWithScore(node=TextNode(id_='9e7817bb-d77a-4211-8d2b-2a6c15709b57', embedding=None, metadata={'page_label': '11', 'file_name': 'sasvta84.pdf', 'file_path': 'Testdocs/sasvta84.pdf', 'file_type': 'application/pdf', 'file_size': 221