### NOTES:

Typical challenges without RAG
1. LLM knowledge is frozen at its last training update
2. Factual errors
3. The model might not have the context needed for the question

So external knowledge needs to be added. Options:
1. Fine-tuning
-- pretrained model + last layer retrained using proprietary data
-- expensive
-- needs expertise

2. In-context learning
-- using prompts to give context and examples
-- limited by the max token limit
-- needs memory for persistent session storage


3. RAG
-- LLM + information retrieval
-- load docs
-- indexing & storage
-- create chunks -> keeps the size of each piece manageable
-- embeddings -> numeric transformation of each chunk
-- indexing -> helps with efficient retrieval
-- create a vector store -> stores the chunks and embeddings for efficient retrieval
-- retrieve info -> embed the user query, match it against the vector store, and fetch the top k most similar chunks; these become the context for the LLM prompt
-- synthesize the response -> top k relevant chunks (context) + user query are passed to the LLM to get an answer
-- query engine -> provides the output through the generator LLM
-- evaluation
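
The retrieve and synthesize steps above can be sketched without any framework. This is a toy illustration only: `embed` is a bag-of-words stand-in for a real embedding model, the three chunks are made-up sample text, and the function names are hypothetical, not part of LlamaIndex.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    """Combine the retrieved chunks and the user query into one LLM prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

# Made-up sample chunks standing in for the indexed document
chunks = [
    "self attention relates different positions of a single sequence",
    "the encoder is a stack of six identical layers",
    "positional encodings inject information about token order",
]
top = retrieve("what is self attention", chunks, k=1)
prompt = build_prompt("what is self attention", top)
print(prompt)
```

A real pipeline swaps `embed` for an embedding model and the list scan for a vector store lookup, but the flow (embed query, rank, take top k, build prompt) stays the same.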


## RAG Notes

### RAG
Data comes from 
-- API
-- Raw file
-- Vector store
-- Database

Needs 
-- framework like LlamaIndex or LangChain
-- embedding models
-- vector store
-- LLM

### LlamaIndex
-- single framework to build LLM search + retrieval (Q&A) type applications




In [10]:
#STEPS
#1. Ingest
#2. Index
#3. Retrieve
#4. Response Synthesizer
#5. Query
# Install step, disabled here. To run it, put these lines in their own cell;
# %%capture is a cell magic and must be the first line of that cell.
# %%capture
# !pip install llama-index
# !pip install python-dotenv


In [22]:
import os
import tqdm
from pathlib import Path
from dotenv import load_dotenv



load_dotenv(r'C:\Users\myohollc\Documents\llamaindex\av_RAG_LlamaIndex\.env')

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

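# load_dotenv already exports these; re-export only when a value exists,
# since assigning None to os.environ raises a TypeError
if OPENAI_API_KEY:
    os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
if GOOGLE_API_KEY:
    os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY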


In [2]:
# os.getenv returns None (not '') for a missing variable, so test truthiness
if not GOOGLE_API_KEY:
    print('key not loaded')
else:
    print('key loaded')

if not OPENAI_API_KEY:
    print('key not loaded')
else:
    print('key loaded')

key loaded
key loaded
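
A stricter alternative to printing a message is to fail early when a key is missing. The helper below is a minimal sketch; `require_env` is a hypothetical name, not something provided by python-dotenv or LlamaIndex.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file path")
    return value

# Example use after load_dotenv(...):
# GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")
```

Raising here stops the notebook at the setup cell instead of at the first API call much later.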


In [11]:
### Step 0: Data Ingestion

In [8]:
###
from llama_index.core import SimpleDirectoryReader

## all pages get loaded as a list, where each element is a page

documents = SimpleDirectoryReader(input_files=["./data/transformers.pdf"]).load_data()

In [19]:
type(documents)
documents[0]
print(documents[0])
documents[0].text
documents[0].metadata
documents[0].id_


Doc ID: 117276a3-f08f-4df0-9dec-f83113e3b303
Text: Provided proper attribution is provided, Google hereby grants
permission to reproduce the tables and figures in this paper solely
for use in journalistic or scholarly works. Attention Is All You Need
Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google
Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com
Jakob Usz...


'117276a3-f08f-4df0-9dec-f83113e3b303'

In [None]:
### Step 1a: Embedding Model

In [23]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.gemini import GeminiEmbedding

In [34]:
#openai

#embed_model = OpenAIEmbedding(model_name="text-embedding-3-large", title="this is a openai document")
#example usage
#embeddings = embed_model.get_text_embedding("OpenAI Embeddings")

#gemini
embed_model = GeminiEmbedding(model_name="models/embedding-001", title="this is a gemini document")

# example usage
embeddings = embed_model.get_text_embedding("Google Gemini Embeddings.")

In [30]:
type(embeddings)
embeddings[0:3]
len(embeddings)

768
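
The embedding is a plain list of 768 floats. Two such vectors are usually compared with cosine similarity; the sketch below shows that calculation on short made-up vectors. How a particular vector store scores matches may differ, so treat this as an illustration of the idea only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors; real embeddings here would have 768 entries
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]
print(cosine_similarity(v1, v2), cosine_similarity(v1, v3))
```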

In [None]:
### Step 1b: LLM

In [None]:
### openAI
'''
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-3.5-turbo",
    # api_key="some key",  # uses OPENAI_API_KEY env var by default
)
'''


In [35]:

### gemini
from llama_index.llms.gemini import Gemini

llm = Gemini(
    model="models/gemini-1.5-flash",
    # api_key="some key",  # uses GOOGLE_API_KEY env var by default
)

In [None]:
### Step 2: Indexing

In [36]:
from llama_index.core import VectorStoreIndex
# creates a datastore that has documents and the embeddings

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [55]:
type(index)

llama_index.core.indices.vector_store.base.VectorStoreIndex
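
Before embedding, documents are split into smaller chunks. The sketch below shows the general idea of fixed-size chunks with overlap. It splits on whitespace words for simplicity; the splitter LlamaIndex actually uses works on sentences and tokens, and `chunk_words` with its default sizes is a hypothetical helper, not a library function.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a chunk boundary from being cut off in both neighbours.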

In [None]:
### Step 3: Retrieval

In [41]:
# Set up index as a retriever to be able to query the index and get the responses, mainly top k chunks.
retriever = index.as_retriever()

In [42]:
retrieved_nodes = retriever.retrieve("What is self attention?")

In [43]:
retrieved_nodes

[NodeWithScore(node=TextNode(id_='652dd1e8-fd86-4dc9-9bad-c10e25070c75', embedding=None, metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='355044f3-c36a-411a-9c76-b52520e731ad', node_type='4', metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, hash='a251912bdd0963f1b37603da43eecebc172b05ad372017b68ce0564976555f8d')}, meta

In [53]:
type(retrieved_nodes)
len(retrieved_nodes)
retrieved_nodes[0].metadata
print(retrieved_nodes[0].text)

Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6identical layers. 

In [54]:
print(retrieved_nodes[1].text)

[Figure 4 token labels, printed one token per line: the panel title "Input-Input Layer5" appears twice, each time followed by two copies of the sentence
"The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>"]
Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:
Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5
and 6. Note that the attentions are very sharp for this word.
14
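
The second retrieved chunk is mostly figure labels rather than prose, so it adds little useful context. One possible way to screen such chunks before they reach the LLM is sketched below. `ScoredChunk`, `looks_like_prose`, `filter_chunks`, and the thresholds are all hypothetical; the sample texts and scores are made up.

```python
from dataclasses import dataclass

@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved node: its text and similarity score."""
    text: str
    score: float

def looks_like_prose(text, min_words_per_line=3.0):
    """Heuristic: figure or table residue tends to have very few words per line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    words = sum(len(ln.split()) for ln in lines)
    return words / len(lines) >= min_words_per_line

def filter_chunks(chunks, min_score=0.0):
    """Keep chunks that meet the score cutoff and read like prose."""
    return [c for c in chunks if c.score >= min_score and looks_like_prose(c.text)]

prose = ScoredChunk("The encoder is composed of a stack of N = 6 identical layers.\nEach layer has two sub-layers.", 0.8)
junk = ScoredChunk("The\nLaw\nwill\nnever\nbe\nperfect", 0.7)
kept = filter_chunks([prose, junk], min_score=0.5)
print([c.text[:20] for c in kept])
```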


In [56]:
retrieved_nodes[0].node

TextNode(id_='652dd1e8-fd86-4dc9-9bad-c10e25070c75', embedding=None, metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='355044f3-c36a-411a-9c76-b52520e731ad', node_type='4', metadata={'page_label': '3', 'file_name': 'transformers.pdf', 'file_path': 'data\\transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-06-27', 'last_modified_date': '2025-02-15'}, hash='a251912bdd0963f1b37603da43eecebc172b05ad372017b68ce0564976555f8d')}, metadata_template='{key}

In [57]:
### Step 4: Response Synthesis

In [58]:
from llama_index.core import get_response_synthesizer

# the response synthesizer controls how the retrieved chunks and the query are turned into an LLM answer
response_synthesizer = get_response_synthesizer(llm=llm)

In [60]:
type(response_synthesizer)

llama_index.core.response_synthesizers.compact_and_refine.CompactAndRefine
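
The class name suggests a two-part strategy: pack the retrieved chunks into as few prompts as possible, then refine the answer across those prompts. The sketch below illustrates that general idea only. It is not LlamaIndex's implementation, which works with token counts and prompt templates; `compact`, `compact_and_refine`, the character budget, and the `llm` callable are assumptions made for illustration.

```python
def compact(chunks, max_chars):
    """Greedily pack chunks into as few context blocks as fit the character budget."""
    blocks, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        if len(candidate) <= max_chars or not current:
            current = candidate  # fits, or is a single oversized chunk kept on its own
        else:
            blocks.append(current)
            current = chunk
    if current:
        blocks.append(current)
    return blocks

def compact_and_refine(query, chunks, llm, max_chars=2000):
    """Answer from the first block, then ask the LLM to refine with each later block."""
    answer = None
    for block in compact(chunks, max_chars):
        if answer is None:
            prompt = f"Context:\n{block}\n\nQuestion: {query}\nAnswer:"
        else:
            prompt = (f"Question: {query}\nExisting answer: {answer}\n"
                      f"New context:\n{block}\nRefine the answer if the new context helps.")
        answer = llm(prompt)  # llm is any callable mapping a prompt string to a reply string
    return answer

print(compact(["aaaa", "bbbb", "cccc"], max_chars=10))
```

Packing first keeps the number of LLM calls low; refining afterwards lets context that did not fit in the first prompt still influence the answer.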

In [61]:
### Step 5: Query Engine

In [62]:
# this is used to query the indexed document and receive synthesized responses from the LLM
# it takes the index and the response synthesizer
query_engine = index.as_query_engine(llm=llm, response_synthesizer=response_synthesizer)

In [63]:
response = query_engine.query("What is self attention?")

In [70]:
print(response.response)
print(len(response.response))

Self-attention is a mechanism that maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.  The output is calculated as a weighted sum.  In the decoder stack, it's modified to prevent positions from attending to subsequent positions.

293


In [74]:
len(response.source_nodes)
response.source_nodes[0].text

'Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer\nitself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512.\nDecoder: The decoder is also composed of a stack of N = 6ident
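
Each source node carries metadata such as `file_name` and `page_label`, which can be turned into a short citation list shown next to the answer. A minimal sketch, assuming the metadata dicts have been collected first (for example with `[n.metadata for n in response.source_nodes]`); `format_sources` is a hypothetical helper and the sample dicts below are illustrative values, not actual output.

```python
def format_sources(nodes_metadata):
    """Turn per-node metadata dicts into a deduplicated, order-preserving list of citations."""
    seen, lines = set(), []
    for meta in nodes_metadata:
        key = (meta.get("file_name", "unknown file"), meta.get("page_label", "?"))
        if key not in seen:
            seen.add(key)
            lines.append(f"{key[0]}, page {key[1]}")
    return lines

# Illustrative metadata in the same shape as the node metadata printed earlier
sample_metadata = [
    {"file_name": "transformers.pdf", "page_label": "3"},
    {"file_name": "transformers.pdf", "page_label": "5"},
    {"file_name": "transformers.pdf", "page_label": "3"},
]
print(format_sources(sample_metadata))
```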

In [75]:
print(query_engine.query("Why do we need positional encodings?").response)

To utilize the order of a sequence in a model without recurrence or convolution, positional encodings are added to the input embeddings.  This injects information about the relative or absolute position of tokens in the sequence.

