## RAG

<b> Challenges with current LLMs </b>

LLMs are central to building intelligent bots that can answer complex user queries. </br>
To stay relevant, an LLM needs access to current information, cross-referenced against authoritative knowledge sources. </br>
Because training and fine-tuning LLMs is costly and compute-intensive (requiring substantial resources and time), updating them with newer information is challenging.

Note: LLMs have a training cut-off date and know nothing about events after it.

Further, many industrial solutions deal with proprietary information that cannot be released openly because of privacy concerns, so a publicly trained LLM has never seen it.

<b> Retrieval Augmented Generation </b> (RAG) is one approach to addressing these challenges: relevant information is retrieved from an external knowledge base at query time and passed to the LLM as context.

Let us learn how to build a RAG pipeline.

![image.png](attachment:fe613ee6-e0dc-47f1-b7ef-fbc4d3c22085.png)

## Building blocks of a RAG pipeline

1. External Knowledge Base --> Document Loader
2. Processing --> Chunking & Vector Stores
3. Retrieval --> fetching relevant chunks
4. Generation component --> LLMs
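
The four building blocks can be sketched end to end in plain Python. This is a toy illustration, not how LlamaIndex implements them: the knowledge base is a hard-coded string, relevance is scored by simple word overlap instead of embeddings, and `generate` is a hypothetical stand-in that only returns the prompt an LLM would receive.

```python
import re

def load_document():
    # 1. External knowledge base: a hard-coded stand-in for a loaded file
    return (
        "RAG retrieves relevant text before generation.\n\n"
        "Vector stores hold embeddings of document chunks.\n\n"
        "LLMs have a training cut-off date."
    )

def split_into_chunks(text):
    # 2. Processing: split on blank lines (real pipelines also embed each chunk)
    return [part.strip() for part in text.split("\n\n") if part.strip()]

def words(text):
    # Lower-cased set of alphanumeric tokens
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(chunks, query, top_k=1):
    # 3. Retrieval: rank chunks by how many query words they share
    ranked = sorted(chunks, key=lambda c: len(words(c) & words(query)), reverse=True)
    return ranked[:top_k]

def generate(context_chunks, query):
    # 4. Generation: a real pipeline would send this prompt to an LLM
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = split_into_chunks(load_document())
query = "What do vector stores hold?"
top_chunks = retrieve(chunks, query)
prompt = generate(top_chunks, query)
print(prompt)
```

The rest of the notebook replaces each toy stage with a real component: a document loader, an embedding-backed index, and a Hugging Face LLM.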


In [None]:
!pip install llama-index


In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

In [None]:
# SimpleDirectoryReader is the simplest way to load local files into LlamaIndex (formerly GPT Index)
# https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/

documents = SimpleDirectoryReader(input_files=["xyz.txt"]).load_data()

In [None]:
documents

[Document(id_='2ccc9be3-aae1-4b6f-a819-da42221056e3', embedding=None, metadata={'file_path': 'xyz.txt', 'file_name': 'xyz.txt', 'file_type': 'text/plain', 'file_size': 75393, 'creation_date': '2024-07-14', 'last_modified_date': '2024-07-14'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\r\n\r\nWhat I Worked On\r\n\r\nFebruary 2021\r\n\r\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\r\n\r\nThe first programs I tried writing were on the IBM 1401 that our school d

In [None]:
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-huggingface


In [None]:
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import PromptTemplate
import torch




In [None]:
llama2_7b_chat = "meta-llama/Llama-2-7b-chat-hf"
phi3_instruct = "microsoft/Phi-3-mini-4k-instruct"

SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner, based on the given source documents. Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

# Llama 2 chat format: the system block is closed with <</SYS>>, not a second <<SYS>>
query_wrapper_prompt = PromptTemplate(
    "[INST] <<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str} [/INST]"
)

# Phi-3 chat format: each turn (system and user) is closed with <|end|>
query_wrapper_prompt_phi3 = PromptTemplate(
    "<|system|>\n" + SYSTEM_PROMPT + "<|end|>\n<|user|>\n{query_str}<|end|>\n<|assistant|>\n"
)
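
To see what the model actually receives, it can help to fill a template by hand. The sketch below mimics the substitution with Python's built-in `str.format` rather than `PromptTemplate`, and uses a shortened, hypothetical system prompt; the layout follows the Phi-3 chat convention of closing each turn with `<|end|>`.

```python
# Shortened, hypothetical system prompt for illustration only
DEMO_SYSTEM_PROMPT = "Answer using only the given source documents."

# Same placeholder name ({query_str}) as the notebook's templates
demo_template = (
    "<|system|>\n" + DEMO_SYSTEM_PROMPT + "<|end|>\n"
    "<|user|>\n{query_str}<|end|>\n"
    "<|assistant|>"
)

def fill_template(template, query):
    # str.format replaces the {query_str} placeholder with the user's question
    return template.format(query_str=query)

filled = fill_template(demo_template, "What did I do before college?")
print(filled)
```

Printing the filled prompt like this is a quick way to spot a missing or misplaced special token before loading a multi-gigabyte model.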

In [None]:
#llm = HuggingFaceLLM(
#    context_window=4096,
#    max_new_tokens=2048,
#    generate_kwargs={"temperature":0.0, "do_sample":False},
#    query_wrapper_prompt=query_wrapper_prompt,
#    tokenizer_name=llama2_7b_chat,
#    model_name=llama2_7b_chat,
#    device_map="auto",
#    model_kwargs={"torch_dtype":torch.float16}
#)

llm_phi3 = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature":0.0, "top_p":None, "do_sample":False},
    query_wrapper_prompt = query_wrapper_prompt_phi3,
    tokenizer_name=phi3_instruct,
    model_name=phi3_instruct,
    device_map="auto",
    model_kwargs={"torch_dtype":torch.float16}
)
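
With `do_sample` set to `False`, generation is greedy: at every step the highest-scoring token is chosen, so the same prompt yields the same output and the `temperature` value plays no role. The toy sketch below contrasts that with temperature sampling over a made-up four-token vocabulary; the logits and token names are invented for illustration and do not come from any model.

```python
import math
import random

# Made-up scores for the next token (illustrative only)
vocab = ["writing", "painting", "programming", "cooking"]
logits = [2.0, 0.5, 1.5, -1.0]

def greedy_pick(logits):
    # Greedy decoding: always take the index of the largest logit
    return max(range(len(logits)), key=lambda i: logits[i])

def softmax(logits, temperature=1.0):
    # Higher temperature flattens the distribution, lower sharpens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_pick(logits, temperature, rng):
    # Sampling: draw an index in proportion to the softmax probabilities
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

greedy_token = vocab[greedy_pick(logits)]
rng = random.Random(0)
sampled_tokens = [vocab[sample_pick(logits, 1.0, rng)] for _ in range(5)]
print(greedy_token)
print(sampled_tokens)
```

For a question-answering pipeline over source documents, deterministic output is usually the more convenient choice, since repeated runs can be compared directly.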

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model_name = "BAAI/bge-small-en-v1.5"

embed_model = HuggingFaceEmbedding(
    model_name=embedding_model_name
)


In [None]:
from llama_index.core import Settings

Settings.llm = llm_phi3 #specify the llm to use
Settings.embed_model = embed_model #specify the embedding model

In [None]:
# VectorStoreIndex takes care of chunking the documents and embedding each chunk as a vector

index = VectorStoreIndex.from_documents(documents)
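
Chunking decides what a retrievable unit of text looks like. As a rough illustration only, the sketch below splits text into fixed-size word windows that overlap; `chunk_size` and `overlap` are hypothetical parameters counted in words, and LlamaIndex's own splitters work differently in detail.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(1, 21))  # w1 ... w20
pieces = chunk_words(sample, chunk_size=8, overlap=2)
for piece in pieces:
    print(piece)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some words twice.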

In [None]:
# Create a query engine on top of the index
query_engine = index.as_query_engine()
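
Roughly speaking, when a query comes in, the query engine embeds it, looks up the most similar chunks in the index, and passes them to the LLM as context. The lookup step can be sketched with cosine similarity over toy vectors. The chunk texts, the three-dimensional embeddings, and the `top_k` parameter below are invented for illustration; a real embedding model produces vectors with hundreds of dimensions.

```python
import heapq
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, top_k=2):
    # Score every chunk, then keep the top_k highest-scoring ones
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunk_vecs.items()]
    return heapq.nlargest(top_k, scored)

# Toy index: chunk text -> made-up embedding
toy_index = {
    "I wrote short stories.": [0.9, 0.1, 0.0],
    "I programmed on an IBM 1401.": [0.1, 0.9, 0.1],
    "I studied painting.": [0.0, 0.2, 0.9],
}
toy_query_vec = [0.2, 0.8, 0.0]  # made-up embedding of a programming question

results = top_k_chunks(toy_query_vec, toy_index, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a single document; dedicated vector stores exist to make the same lookup fast over millions of chunks.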

In [None]:
query = input()

response = query_engine.query(query)
print(response)

What did I do before college?




Before college, you were involved in writing and programming. You wrote short stories that were not very good, and you attempted to program on an IBM 1401 computer. You also built your first computer, a Heathkit, and started programming simple games, a program to predict how high your model rockets would fly, and a word processor for your father to write a book. Later, you got a TRS-80 computer and continued programming, writing simple games, and developing a word processor. You also took an interest in philosophy and AI, which eventually led you to switch your focus from philosophy to AI.


### RAG application

In [None]:
!pip install gradio


In [None]:
import gradio as gr
import os
from tkinter import filedialog, Tk

In [None]:
def select_directory():
    root = Tk()
    root.attributes("-topmost", True)
    folder_path = filedialog.askdirectory()
    root.destroy()
    return folder_path

In [None]:
#select_directory()

In [None]:
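def RAG_inference(data_path, query):
    """Index the file at `data_path` and answer `query` against it."""
    # data_path = select_directory()
    # input_files expects file paths, so check for a file rather than any existing path
    if not os.path.isfile(data_path):
        return "Incorrect file path!"

    documents = SimpleDirectoryReader(input_files=[data_path]).load_data()
    index = VectorStoreIndex.from_documents(documents, show_progress=True)
    query_engine = index.as_query_engine()

    response = query_engine.query(query)
    return str(response)  # return plain text for the Gradio text output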
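
As written, `RAG_inference` rebuilds the index on every call, so each question re-parses and re-embeds the document. One possible remedy is to cache the query engine per path. The sketch below shows only the caching pattern; `build_query_engine` is a hypothetical stand-in that returns a plain function instead of a real LlamaIndex query engine, so the example runs without any model.

```python
build_calls = []    # records how often the expensive step runs
_engine_cache = {}  # maps data_path -> query engine

def build_query_engine(data_path):
    # Hypothetical stand-in for loading, indexing, and creating a query engine
    build_calls.append(data_path)
    return lambda query: f"[answer from {data_path}] {query}"

def get_query_engine(data_path):
    # Build once per path, then reuse the cached engine
    if data_path not in _engine_cache:
        _engine_cache[data_path] = build_query_engine(data_path)
    return _engine_cache[data_path]

def cached_rag_inference(data_path, query):
    engine = get_query_engine(data_path)
    return engine(query)

first = cached_rag_inference("xyz.txt", "What did I do before college?")
second = cached_rag_inference("xyz.txt", "What did I work on?")
print(first)
print(len(build_calls))
```

A cache keyed only on the path will not notice when the file changes on disk; including the file's modification time in the key would be one way to handle that.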

In [None]:
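demo = gr.Interface(
    fn=RAG_inference,
    # Labelled text boxes, in the same order as RAG_inference's arguments
    inputs=[gr.Textbox(label="File path"), gr.Textbox(label="Query")],
    outputs=gr.Textbox(label="Response"),
)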

demo.launch(inline=False, debug=True)
demo.close()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://903e1bddee37211ee7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)



You are not running the flash-attention implementation, expect numerical differences.



Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://903e1bddee37211ee7.gradio.live
Closing server running on port: 7860
