<center><h1> Retrieval-Augmented Generation (RAG) enabled GenAI Application with LlamaIndex</h1>

<h3><b>Varad V. Deshmukh</b></h3>

_Data Scientist_ - _Machine Learning / MLOps Engineer_ - _AI Prompt Engineer_</center>

> ### **What is RAG?**

__Large Language Models__ (LLMs) are trained on an enormous corpus of data, e.g. Wikipedia pages, software/hardware technical documentation, blogs, etc. However, they are not trained on our personal data. One solution to this issue is to fine-tune the model on our data: starting from the pre-trained weights, training continues on our own examples so that some or all of the model's parameters are updated. Some variants also add a few small trainable layers while keeping the original weights frozen. This is a promising approach, but it has certain drawbacks - 

1. Training an LLM is expensive.
2. Due to the high costs associated with training them, it is very difficult to continually update them with the latest data.
3. Observability is lacking, meaning that we have little means to inspect how the model arrived at a response.

__Retrieval-Augmented Generation__ (RAG) is an alternative, transformative paradigm that works around these problems. Instead of asking the model to generate the response directly, RAG first retrieves information from our data sources, adds it to the context of our query and then asks the model to answer the query based on the enriched prompt. RAG overcomes all three weaknesses of the fine-tuning approach -

1. There is no training involved, so it is comparatively cheap.
2. Data is fetched only when you ask for it, so responses reflect the most recently indexed data.
3. We can see the documents from which the response was drawn, so the answer is easier to verify.

RAG adds our data - stored in varied formats such as text documents, PDFs, images, videos, audio files, etc. - to the data LLMs already have access to. Our data is loaded and prepared for queries, i.e. 'indexed'. Almost always, this entails converting it to vector embeddings, which are numerical representations of the data concerned. User queries, i.e. the questions that we want the model to respond to, generally in the form of prompts, act on this indexed data. The RAG approach filters the data down to the most relevant context, i.e. it chooses which document sources are the most relevant to the question at hand. This context and our query then go to the LLM in the form of a prompt, to which the model generates a response.
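
The retrieve-then-augment flow described above can be sketched in plain Python. This is only a toy illustration: the bag-of-words `embed` function stands in for a real embedding model, the three documents are made up, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical and not part of LlamaIndex.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a real pipeline uses learned dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_docs):
    """Combine the retrieved context and the user query into one prompt."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )


# made-up documents, purely for illustration
docs = [
    "The office cafeteria opens at 8 am on weekdays.",
    "Quarterly sales reports are due on the first Monday of each quarter.",
    "The VPN client must be updated every six months.",
]
query = "When does the cafeteria open?"
top = retrieve(query, docs, top_k=1)
prompt = build_prompt(query, top)
print(prompt)  # this enriched prompt is what would be sent to the LLM
```

The LLM itself is never retrained here; only the prompt changes, which is why the approach stays cheap and easy to update.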

> ### **Stages within RAG**

There are five key stages within RAG, which are to be incorporated into any LLM application that we build -

1. __Loading__ : getting our data from where it lives - text files, PDFs, a website, a database or an API - into the pipeline

2. __Indexing__ : converting the data into a format suitable for querying, which almost always is a vector embedding, incorporating the semantics of our data as well as the necessary metadata

3. __Storing__ : storing the indexed data, i.e. the embeddings, in a vector database to avoid having to re-index it

4. __Querying__ : prompting the RAG-enabled model to answer a specific user query, to which it returns a context-aware response, along with the citations to the source documents

5. __Evaluation__ : checking the efficacy of the RAG pipeline and objectively measuring how accurate, faithful and fast the model responses are
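
As a rough illustration of the evaluation stage, the sketch below scores the retrieval step with two common metrics, hit rate and mean reciprocal rank (MRR). The choice of metrics, the function names and the document IDs are assumptions made for this example; the notebook itself does not prescribe a particular evaluation method.

```python
def hit_rate(retrieved, expected):
    """Fraction of queries whose expected document appears among the retrieved ones."""
    hits = sum(1 for got, want in zip(retrieved, expected) if want in got)
    return hits / len(expected)


def mean_reciprocal_rank(retrieved, expected):
    """Average of 1/rank of the expected document (0 when it was not retrieved)."""
    total = 0.0
    for got, want in zip(retrieved, expected):
        if want in got:
            total += 1.0 / (got.index(want) + 1)
    return total / len(expected)


# hypothetical results: for each query, the ranked IDs the retriever returned
retrieved = [
    ["doc_a", "doc_b"],  # expected doc_a found at rank 1
    ["doc_c", "doc_a"],  # expected doc_a found at rank 2
    ["doc_b", "doc_d"],  # expected doc_c not retrieved
]
expected = ["doc_a", "doc_a", "doc_c"]

print("hit rate:", hit_rate(retrieved, expected))
print("MRR:", mean_reciprocal_rank(retrieved, expected))
```

Metrics like these only cover retrieval quality; judging how faithful the generated answer is to the retrieved context needs a separate check.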

In [None]:
# import the necessary modules
from llama_index import (
    ServiceContext,
    OpenAIEmbedding,
    PromptHelper,
    SimpleDirectoryReader,
    VectorStoreIndex
)
from llama_index.llms import OpenAI
from llama_index.text_splitter import SentenceSplitter
import tiktoken
import os

In [None]:
# setting the OpenAI API key
# replace with your own API key
os.environ['OPENAI_API_KEY'] = 'your_api_key_here'

In [None]:
# load documents
document_directory = '/Users/varad/Desktop/rck/data/'
documents = SimpleDirectoryReader(input_dir=document_directory).load_data()

In [None]:
# instantiating the LLM and the embedding model
llm = OpenAI(
  model='gpt-3.5-turbo',
  temperature=0,
  max_tokens=256
)
embed_model = OpenAIEmbedding()

In [None]:
# configure the prompt helper, which manages the token budget of the prompt
prompt_helper = PromptHelper(
  context_window=4096, 
  num_output=256, 
  chunk_overlap_ratio=0.1, 
  chunk_size_limit=None
)
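
The token arithmetic behind these settings can be sketched as follows. This is a simplified illustration, not the library's implementation: the function names are hypothetical, and the prompt size of 100 tokens is an assumed figure.

```python
def available_context_tokens(context_window, num_output, prompt_tokens):
    """Tokens left for retrieved context after reserving room for the
    prompt template/question and for the model's answer."""
    remaining = context_window - num_output - prompt_tokens
    if remaining < 0:
        raise ValueError("prompt and reserved output exceed the context window")
    return remaining


def per_chunk_budget(available, num_chunks):
    """Even split of the available tokens across the retrieved chunks."""
    return available // num_chunks


# values from the cell above, plus an assumed 100-token prompt template
available = available_context_tokens(
    context_window=4096, num_output=256, prompt_tokens=100
)
budget = per_chunk_budget(available, num_chunks=4)
print(available, budget)
```

If the retrieved chunks do not fit within this budget, they have to be truncated or repacked before being sent to the model.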

In [None]:
# split documents into chunks (nodes) of up to 1024 tokens, with a 20-token overlap
text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
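
To show what `chunk_size` and `chunk_overlap` mean, here is a minimal sliding-window sketch. It is a simplification under stated assumptions: it splits a list of pre-tokenised words and ignores sentence boundaries, whereas `SentenceSplitter` counts tokens and tries to keep sentences intact. The function name is hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size, where each window repeats
    the last chunk_overlap tokens of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary is less likely to lose its context in both chunks.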

In [None]:
# setting the service context
service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  text_splitter=text_splitter,
  prompt_helper=prompt_helper
)

In [None]:
# storing the vector embeddings into a vector store
index = VectorStoreIndex.from_documents(
    documents, 
    service_context = service_context
    )
# save the index (including the embeddings) to disk; the default location is ./storage
index.storage_context.persist()
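
The point of persisting is to avoid paying for re-indexing on every run. The "load if present, otherwise build and save" pattern is sketched below in plain Python. Everything here is a stand-in: `build_index` fakes the expensive embedding step, the JSON file is a hypothetical storage format, and none of these names are LlamaIndex APIs.

```python
import json
import os
import tempfile

build_calls = 0  # counts how often the expensive step actually runs


def build_index(docs):
    """Stand-in for the costly embedding step (fake vectors from word lengths)."""
    global build_calls
    build_calls += 1
    return {str(i): [float(len(w)) for w in doc.split()] for i, doc in enumerate(docs)}


def load_or_build(docs, path):
    """Reuse a previously saved index if one exists, otherwise build and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    index = build_index(docs)
    with open(path, "w") as f:
        json.dump(index, f)
    return index


docs = ["alpha beta", "gamma"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    first = load_or_build(docs, path)   # builds the index and writes it to disk
    second = load_or_build(docs, path)  # reads it back without rebuilding

print(build_calls)
```

A real pipeline would also need to detect when the source documents have changed, which this sketch does not attempt.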

In [None]:
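# setting up the query engine; similarity_top_k sets how many chunks are retrieved per query
query_engine = index.as_query_engine(
    service_context=service_context,
    similarity_top_k=4
)

In [None]:
# function for questioning the model
def ask_the_model(prompt):
    # query() takes only the query itself; retrieval settings live on the query engine
    response = query_engine.query(prompt)
    # return the response object so the caller can print it and inspect its sources
    return response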

In [None]:
# describe the question
prompt = '''
Your Prompt here.
'''

response = ask_the_model(prompt)
print(str(response))

In [None]:
# get the citations from the relevant source documents
print(response.get_formatted_sources())

**Thank you for going through the notebook. If you liked it, please give it a star!**