# Build Your First RAG System

A RAG pipeline built with LlamaIndex has five stages, which this notebook covers in order:

1. Data Ingestion
2. Indexing
3. Retrieval
4. Response Synthesis
5. Querying

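## Install Required Packages

Install the required packages by running the commands below in Anaconda Prompt (Windows) or a terminal (Linux or macOS). The Gemini integrations and `python-dotenv` used later in this notebook are separate packages, so install them as well.

pip install llama-index
pip install llama-index-llms-gemini llama-index-embeddings-gemini python-dotenv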
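
Before moving on, it can help to confirm that the packages are importable from the environment the notebook runs in. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the list of module names are assumptions for illustration, so adjust them to your setup.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level modules imported by this notebook (assumed names; adjust as needed)
required = ["llama_index", "dotenv"]
print("Missing modules:", missing_packages(required))
```

An empty list means every module was found; any name that is printed still needs to be installed with `pip`.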

## Environment Variables

It is recommended to store API keys in a `.env` file, separate from the code.
Follow the steps below.

1. Create a text file named `.env`.
2. Enter your API key in this format: OPENAI_API_KEY='sk-e8943u9ru4982............'
3. Save and close the file.

The Gemini models used later in this notebook read their key from `GOOGLE_API_KEY`, which can be stored in the same file using the same format.

Then, as shown below, pass the path of the `.env` file to the `load_dotenv` function.
This loads every key stored in the `.env` file into the process environment.

## Start

In [5]:
import os

In [6]:
from dotenv import load_dotenv, find_dotenv

In [7]:
load_dotenv('/home/santhosh/Projects/courses/Pinnacle/.env')

True

This setup keeps our API key out of the source code and makes it easy to reconfigure. Always keep your `.env` file private and exclude it from version control.
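
A quick way to confirm that a key was loaded, without printing it in full, is sketched below. The helpers `require_env` and `mask` and the variable `DEMO_API_KEY` are hypothetical names used only for this illustration; they are not part of `python-dotenv`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that your .env file was loaded")
    return value


def mask(secret, visible=4):
    """Hide all but the last few characters of a secret so it can be logged safely."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Demo with a placeholder variable (hypothetical name and value, not a real key)
os.environ["DEMO_API_KEY"] = "abc123456789"
print(mask(require_env("DEMO_API_KEY")))
```

In the notebook itself, you would call `require_env` with the name of the key stored in your `.env` file right after `load_dotenv`.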


# Stage 1: Data Ingestion

## Data Loaders


We start by loading the data from a PDF file. For this, we will use the SimpleDirectoryReader class from LlamaIndex.

In [8]:
from llama_index.core import SimpleDirectoryReader

In [9]:
documents = SimpleDirectoryReader(input_files=['data/transformers.pdf']).load_data()

We can then check the type of the `documents` variable and the total number of pages read from the PDF:

In [10]:
# Check the datatype and length of the loaded documents
type(documents)

list

In [11]:
# total number of pages read from the PDF
len(documents)

15

To understand the structure of the loaded documents, let's retrieve the first document, which corresponds to the first page of the PDF:


In [13]:
# Retrieve the first document (essentially the first page in the PDF)
documents[0]

Document(id_='b86af52d-d691-49b8-ba64-a48619f5e861', embedding=None, metadata={'page_label': '1', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-11', 'last_modified_date': '2024-03-27'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkore

We can also access specific attributes of the document, such as its ID and metadata:

In [14]:
# Get the ID of the first document
documents[0].id_

'b86af52d-d691-49b8-ba64-a48619f5e861'

In [15]:
documents[0].doc_id

'b86af52d-d691-49b8-ba64-a48619f5e861'

In [16]:
# Get the metadata of the first document
documents[0].metadata

{'page_label': '1',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-06-11',
 'last_modified_date': '2024-03-27'}

In [17]:
# Get the text content of the first document
print(documents[0].text)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exp
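
Because each page is loaded as a separate document, it is often useful to summarize what was ingested per file. The sketch below assumes only that every document exposes a `metadata` dictionary and a `text` string, as shown in the outputs above. The helper `summarize_documents` and the stand-in objects are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_documents(documents):
    """Return {file_name: (page_count, total_characters)} for the loaded documents."""
    pages = defaultdict(int)
    chars = defaultdict(int)
    for doc in documents:
        name = doc.metadata.get("file_name", "<unknown>")
        pages[name] += 1
        chars[name] += len(doc.text)
    return {name: (pages[name], chars[name]) for name in pages}


# Stand-in objects with made-up contents, mimicking the attributes used above
docs = [
    SimpleNamespace(metadata={"file_name": "transformers.pdf", "page_label": "1"},
                    text="Attention Is All You Need"),
    SimpleNamespace(metadata={"file_name": "transformers.pdf", "page_label": "2"},
                    text="Introduction"),
]
print(summarize_documents(docs))
```

Passing the real `documents` list from the cells above should work the same way, since the function only reads `metadata` and `text`.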

## Embedding Model

Next, we set up the embedding model that converts document text into vectors. This notebook uses Google's Gemini embedding model through LlamaIndex; the OpenAI import is left commented out as an alternative.

In [18]:
# Embedding Model
# from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.gemini import GeminiEmbedding

In [19]:
# Initialize the embedding model
embed_model = GeminiEmbedding(model="models/embedding-004")
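
An embedding model maps text to a vector of numbers, and texts with similar meaning are expected to end up close together. A common way to measure closeness is cosine similarity. The sketch below uses hand-made three-dimensional vectors (hypothetical values, far smaller than real embeddings) purely to illustrate the comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of three short texts
attention = [0.9, 0.1, 0.0]
self_attention = [0.8, 0.2, 0.1]
cooking = [0.0, 0.1, 0.9]

print(cosine_similarity(attention, self_attention))  # related texts: closer to 1
print(cosine_similarity(attention, cooking))         # unrelated texts: closer to 0
```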

## LLM

Similarly, let's set up our large language model (LLM):

In [20]:
# LLM
from llama_index.llms.gemini import Gemini

In [21]:
# Initialize the large language model
llm = Gemini(model= "models/gemini-1.5-pro")

# Stage 2: Indexing

In [22]:
# Indexing
from llama_index.core import VectorStoreIndex

Here, we use the `VectorStoreIndex` class to create an index from the loaded documents. We pass the documents and the embedding model to the `from_documents` method; the LLM is not needed until we generate answers.

In [23]:
# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
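
Before embedding, an index typically splits each document into smaller chunks (LlamaIndex calls them nodes). The following is a simplified sketch of that idea using a word-based sliding window with overlap. It is not LlamaIndex's actual splitter, and the function name and default sizes are assumptions chosen for illustration.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated text.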

# Stage 3: Retrieval

Next, we set up a retriever to query our indexed documents. This allows us to retrieve the chunks most relevant to a query.

In [24]:
# Setting up the Index as Retriever
retriever = index.as_retriever()

The `as_retriever` method converts our index into a retriever, and the `retrieve` method allows us to query the index.

In [25]:
# Retrieve information based on the query "What is self attention?"
retrieved_nodes = retriever.retrieve("What is self attention?")
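
Conceptually, a vector retriever embeds the query, scores every stored chunk against it, and returns the highest-scoring ones. The sketch below illustrates that top-k step with a toy word-count embedding; a real embedding model returns dense vectors, and the function names and sample chunks here are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, best match first."""
    query_vec = embed(query)
    scored = [(similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    "self attention relates positions of a single sequence",
    "the encoder is a stack of identical layers",
    "positional encodings inject order information",
]
for score, chunk in retrieve("what is self attention", chunks):
    print(f"{score:.2f}  {chunk}")
```

In LlamaIndex, the number of nodes returned can be adjusted with the `similarity_top_k` argument of `index.as_retriever`.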

We can check the metadata of the retrieved nodes to see where the information came from. The metadata includes the page label, file name, file path, file type, file size, and file dates.

In [26]:
# Get the metadata of the first retrieved node
retrieved_nodes[0].metadata

{'page_label': '3',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-06-11',
 'last_modified_date': '2024-03-27'}

Let's access the ID of the first retrieved node, which uniquely identifies it:

In [27]:
# Access the ID of the first retrieved node
retrieved_nodes[0].id_

'ba8231ab-b8e1-477a-886b-3f398d8e906f'

Similarly, we can access the `node_id` attribute, which holds the same value here:

In [28]:
# Access the node_id of the first retrieved node
retrieved_nodes[0].node_id

'ba8231ab-b8e1-477a-886b-3f398d8e906f'

Next, let's explore the `node` attribute of the retrieved node. This attribute contains a `TextNode` object, which holds the chunk's text content, its metadata, and its relationship to the source document.

In [29]:
# Access the full node object of the first retrieved node
retrieved_nodes[0].node

TextNode(id_='ba8231ab-b8e1-477a-886b-3f398d8e906f', embedding=None, metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-11', 'last_modified_date': '2024-03-27'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='626873e3-e19f-40a4-8aba-9b15b18f60ac', node_type='4', metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-11', 'last_modified_date': '2024-03-27'}, hash='59da0fd64ffadd3f3662ce4cfa8739e1690641ef7c04bbe6cc6c1bd1a6c5cd7f')}, metadata_template='{key}: 

We can also extract and inspect the text content of this node to understand the retrieved information better:

In [30]:
# Access the text content of the first retrieved node
print(retrieved_nodes[0].text)

Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6identical layers. 

In [31]:
retrieved_nodes[1].metadata

{'page_label': '6',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-06-11',
 'last_modified_date': '2024-03-27'}

In [32]:
print(retrieved_nodes[1].text)

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. n is the sequence length, d is the representation dimension, k is the kernel
size of convolutions and r the size of the neighborhood in restricted self-attention.
Layer Type Complexity per Layer Sequential Maximum Path Length
Operations
Self-Attention O(n2 · d) O(1) O(1)
Recurrent O(n · d2) O(n) O(n)
Convolutional O(k · n · d2) O(1) O(logk(n))
Self-Attention (restricted) O(r · n · d) O(1) O(n/r)
3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the
order of the sequence, we must inject some information about the relative or absolute position of the
tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed. Ther
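
Rather than inspecting retrieved nodes one attribute at a time, a small helper can print a one-line summary per hit. The sketch below assumes each retrieved item exposes `metadata`, `text`, and a `score` attribute; the helper name `describe_hits`, the stand-in objects, and their scores are hypothetical.

```python
from types import SimpleNamespace


def describe_hits(nodes, preview_chars=60):
    """Return one summary line per retrieved node: rank, page, score, and a text preview."""
    lines = []
    for rank, hit in enumerate(nodes, start=1):
        page = hit.metadata.get("page_label", "?")
        score = f"{hit.score:.3f}" if hit.score is not None else "n/a"
        preview = " ".join(hit.text.split())[:preview_chars]  # collapse line breaks
        lines.append(f"{rank}. page {page} (score {score}): {preview}")
    return lines


# Stand-in objects with made-up scores, mimicking the attributes used above
hits = [
    SimpleNamespace(metadata={"page_label": "3"}, score=0.5,
                    text="Figure 1:\nThe Transformer"),
    SimpleNamespace(metadata={"page_label": "6"}, score=None,
                    text="Table 1: Maximum path lengths"),
]
print("\n".join(describe_hits(hits)))
```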

# Stage 4: Response Synthesis


The retrieved nodes now need to be turned into an answer by our large language model (LLM). For this, we use the `get_response_synthesizer` function:

In [33]:
from llama_index.core import get_response_synthesizer

Here, the `get_response_synthesizer` function takes our LLM as an argument and returns a synthesizer object that will help generate coherent responses to our queries.

In [34]:
# Initialize the response synthesizer with the LLM
response_synthesizer = get_response_synthesizer(llm=llm)
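
At its core, response synthesis combines the retrieved chunks and the question into a prompt for the LLM. The sketch below shows one simple way to build such a prompt. The wording of the template and the function name `build_prompt` are assumptions for illustration, not the prompt LlamaIndex actually uses.

```python
def build_prompt(query, context_chunks):
    """Combine numbered context chunks and a question into a single prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is self attention?",
    ["The first is a multi-head self-attention mechanism.", "Self-Attention O(n2 · d) O(1) O(1)"],
)
print(prompt)
```

Instructing the model to rely only on the supplied context is what allows it to report that the retrieved text does not contain the answer, instead of guessing.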

# Stage 5: Query Engine

Next, we set up a query engine. This engine will allow us to query our indexed documents and receive synthesized responses from the LLM:

In [35]:
# Create a query engine using the index, LLM, and response synthesizer
query_engine = index.as_query_engine(llm=llm, response_synthesizer=response_synthesizer)

We use the `as_query_engine` method from our index object to create a query engine, passing the LLM and response synthesizer as arguments.
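
A query engine can be thought of as a retriever and a response synthesizer chained together. The sketch below shows that composition with stub functions in place of real retrieval and a real LLM call. The classes `ToyQueryEngine` and `ToyResponse` are hypothetical and only mirror the `response` and `source_nodes` attributes used in this notebook.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyResponse:
    response: str
    source_nodes: List[str]


@dataclass
class ToyQueryEngine:
    retrieve: Callable[[str], List[str]]          # question -> relevant chunks
    synthesize: Callable[[str, List[str]], str]   # question + chunks -> answer

    def query(self, question: str) -> ToyResponse:
        nodes = self.retrieve(question)             # step 1: fetch context
        answer = self.synthesize(question, nodes)   # step 2: generate an answer from it
        return ToyResponse(response=answer, source_nodes=nodes)


# Stubs standing in for a real retriever and a real LLM call
engine = ToyQueryEngine(
    retrieve=lambda question: ["chunk about self-attention"],
    synthesize=lambda question, nodes: f"Answer based on {len(nodes)} chunk(s).",
)
result = engine.query("What is self attention?")
print(result.response)
```

Keeping the source nodes on the response object is what makes it possible to trace an answer back to the pages it came from.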

With our query engine ready, we can now query the LLM using natural language:


In [36]:
# Query the LLM using the query engine
response = query_engine.query("What is self attention?")  

In this command, we send the question "What is self attention?" to the query engine and store the result in the `response` variable.

To view the response generated by the LLM, we can access the `response` attribute:


In [37]:
# View the response from the LLM
response.response 

"Self-attention maps a query and a set of key-value pairs to an output vector.  This output is a weighted sum of the values, where each value's weight is determined by the relationship between the query and the corresponding key.\n"

This returns the synthesized answer to our query.

We can further analyze the response by checking its length and inspecting the source nodes used to generate it. The commands below return the length of the response in characters and the number of source nodes, respectively.

In [38]:
# Check the length of the response
len(response.response) # number of characters in the response

229

In [39]:
# Check the number of source nodes
len(response.source_nodes)  # list of 2 nodes

2

In [40]:
# Access the ID of the first source node
response.source_nodes[0].id_

'ba8231ab-b8e1-477a-886b-3f398d8e906f'

In [41]:
# Access the metadata of the first source node
response.source_nodes[0].metadata

{'page_label': '3',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-06-11',
 'last_modified_date': '2024-03-27'}

In [42]:
response.source_nodes[1].id_

'6d671358-4685-4d1c-aeb8-49c284384ee6'

In [43]:
response.source_nodes[1].metadata

{'page_label': '6',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-06-11',
 'last_modified_date': '2024-03-27'}
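
The source-node metadata can be condensed into a short citation to display next to an answer. The sketch below groups page labels by file name. The helper `cite_sources` and the stand-in objects are hypothetical; the only assumption about real source nodes is that they expose a `metadata` dictionary with the keys shown above.

```python
from types import SimpleNamespace


def cite_sources(source_nodes):
    """Return a citation string such as 'file.pdf (pages 3, 6)' built from node metadata."""
    pages_by_file = {}
    for node in source_nodes:
        name = node.metadata.get("file_name", "<unknown>")
        page = node.metadata.get("page_label", "?")
        pages = pages_by_file.setdefault(name, [])
        if page not in pages:  # skip repeated pages from the same file
            pages.append(page)
    return "; ".join(
        f"{name} (pages {', '.join(pages)})" for name, pages in pages_by_file.items()
    )


# Stand-in objects mirroring the metadata of the two source nodes above
sources = [
    SimpleNamespace(metadata={"file_name": "transformers.pdf", "page_label": "3"}),
    SimpleNamespace(metadata={"file_name": "transformers.pdf", "page_label": "6"}),
]
print(cite_sources(sources))
```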

# End to End RAG Pipeline

In this final section, we will integrate everything we have learned to create a complete end-to-end Retrieval-Augmented Generation (RAG) pipeline. This pipeline will read documents, index them, and allow us to query the indexed data using a large language model (LLM).

Let's walk through the entire process step by step:

- First, we import the necessary classes and load our documents. `SimpleDirectoryReader` reads every document in the `data` directory and stores the result in the `documents` variable.

- Next, we need a large language model (LLM) and an embedding model. For this demonstration, we assume both were initialized earlier and are available as `llm` and `embed_model`.

- With our documents and models ready, we create an index with the `VectorStoreIndex` class. The embedding model turns each document chunk into a vector so that relevant chunks can be retrieved efficiently.

- We then create a query engine from the index and the LLM, which lets us query the indexed documents in natural language.

- Finally, we use the query engine to ask a question. In this example, we ask about the different types of Transformer models.

- The `query` method retrieves the relevant chunks from the index, passes them to the LLM together with the question, and returns the synthesized response, which we print to the console.




In [44]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load data from the specified directory
documents = SimpleDirectoryReader("data").load_data()

# `llm` and `embed_model` were initialized in the earlier cells and are reused here

# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Create a query engine from the index and LLM
query_engine = index.as_query_engine(llm=llm)

# Query the LLM and print the response
print(query_engine.query("What are the different types of Transformer Models?").response)

The text describes two sizes of Transformer model: a base model and a big model.  Variations of the base model were also tested.



In [45]:
print(query_engine.query("Why do we need positional encodings in transformer?").response)

The provided text discusses the architecture of the Transformer model, including attention mechanisms, feed-forward networks, and embeddings.  It does not contain information about positional encodings.



In [49]:
print(query_engine.query("What are Encoder and Decoder blocks in transformer?").response)

The encoder and decoder are key components of the Transformer model architecture. The encoder is made up of a stack of six identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. Residual connections are employed around each of the two sub-layers, followed by layer normalization. 

The decoder, like the encoder, is composed of a stack of six identical layers. However, in addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Residual connections are also used around each of the sub-layers in the decoder, followed by layer normalization. The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions, ensuring that the predictions for a given position can depend only on the known outputs at p

In [50]:
query = "If I want to generate document embeddings, then which type of Transformer Architecture I must choose?"
print(query_engine.query(query).response)

The Transformer architecture you should choose for generating document embeddings is the Encoder part of the Transformer model. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, which can be used as document embeddings. It is composed of a stack of identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.


In [51]:
query = """If I want to generate document embeddings, 
then which type of Transformer Architecture I must choose among Encoders, Decoders or Encoder-Decoder?"""

print(query_engine.query(query).response)

To generate document embeddings, you should choose the Encoder part of the Transformer Architecture. The Encoder maps an input sequence of symbol representations to a sequence of continuous representations, which can be used as document embeddings.
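
When trying several questions in a row, as in the cells above, a small loop keeps the answers together for comparison. The sketch below assumes only that the engine has a `query` method returning an object with a `response` string. The helper `ask_all` and the `EchoEngine` stub are hypothetical and are used so the example runs without an API key.

```python
from types import SimpleNamespace


def ask_all(query_engine, questions):
    """Run each question through the engine and return {question: answer text}."""
    answers = {}
    for question in questions:
        result = query_engine.query(question)
        answers[question] = result.response.strip()
    return answers


class EchoEngine:
    """Stub engine that returns a canned answer instead of calling an LLM."""

    def query(self, question):
        return SimpleNamespace(response=f"(stub answer to: {question})\n")


questions = [
    "What is self attention?",
    "Why do we need positional encodings in transformer?",
]
for question, answer in ask_all(EchoEngine(), questions).items():
    print(f"Q: {question}\nA: {answer}\n")
```

Replacing `EchoEngine()` with the `query_engine` created earlier should give the same structure with real answers.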


By following these steps, we have created a functional end-to-end RAG pipeline. This pipeline can ingest documents, index them, and answer natural language queries using LlamaIndex together with Google's Gemini models. This demonstrates the practical application of RAG systems in extracting and synthesizing information from documents.
