# Retrievers in LlamaIndex
* Notebook by Adam Lang
* Date: 3/21/2024
* We will review the different types of retrievers you can use in LlamaIndex.

# Question: How can we retrieve the top k relevant chunks of text/data?
* Using different types of retrievers!

# Different Modes of Retrievers
1. Vector Store Index Retriever
  * Stores nodes as an index => fetches the top k most similar nodes from the index
  * Passes the top k nodes => response synthesis module
2. Summary Index Retriever
  * Stores nodes as a sequential chain => response synthesis module
  * Can also use a Summary Index LLM Retriever or a Summary Index Embedding Retriever
3. Keyword Table Retriever
  * Extracts keywords from the query and uses them to look up matching nodes
      * Two types are: 1) simple regex, 2) GPT
  * Sends to response synthesis module
4. Document Summary Index Retriever
  * Extracts summary of each document node
  * Fetches top k most relevant summaries
  * Retrieval modes: a) LLM retrieval, b) Embedding retrieval



In [1]:
# install llama-index
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.10.20-py3-none-any.whl (5.6 kB)
Collecting llama-index-agent-openai<0.2.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.1.6-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.11-py3-none-any.whl (26 kB)
Collecting llama-index-core<0.11.0,>=0.10.20 (from llama-index)
  Downloading llama_index_core-0.10.21.post1-py3-none-any.whl (15.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.4/15.4 MB 26.1 MB/s eta 0:00:00
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.7-py3-none-any.whl (6.0 kB)
Collecting llama-index-indices-managed-llama-cloud<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.1.4-py3-none-any.whl (6.6 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downlo

In [2]:
## set up the OpenAI connection
import os
os.environ["OPENAI_API_KEY"] = '<your_key>'

### Download a dataset

In [3]:
!mkdir data
!wget 'https://raw.githubusercontent.com/aravindpai/Speech-Recognition/1882379d3152c8cd830d74e677be4dd161d024ea/transformers.pdf' -O 'data/transformers.pdf'

--2024-03-21 20:04:52--  https://raw.githubusercontent.com/aravindpai/Speech-Recognition/1882379d3152c8cd830d74e677be4dd161d024ea/transformers.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/octet-stream]
Saving to: ‘data/transformers.pdf’


2024-03-21 20:04:52 (51.1 MB/s) - ‘data/transformers.pdf’ saved [2215244/2215244]



In [4]:
## load the PDF with the LlamaIndex PDFReader
from pathlib import Path
# download_loader is deprecated in llama-index 0.10; import the reader directly
# (it ships in the llama-index-readers-file package, installed with llama-index)
from llama_index.readers.file import PDFReader

# create the loader and read the PDF into Document objects
loader = PDFReader()
documents = loader.load_data(file=Path('./data/transformers.pdf'))


# 1. Vector Store Index

In [6]:
# create index
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)

In [7]:
# create retriever
retriever = index.as_retriever(similarity_top_k=3)

In [8]:
# retrieve the nodes most relevant to the query
nodes = retriever.retrieve("What is the use of positional encodings?")

In [9]:
# print length of nodes
len(nodes)

3

In [11]:
# let's take a look at the similarity scores
for node in nodes:
  print("Node Id:", node.id_)
  print("Metadata:", node.metadata)
  print("Score:",node.get_score())
  print("-------------------------------")

Node Id: ec484554-969e-4723-a164-9968ac75a902
Metadata: {'page_label': '6', 'file_name': '/content/data/transformers.pdf'}
Score: 0.785858985613765
-------------------------------
Node Id: 4e8d2dd0-69c2-473d-b01e-806abc646a0d
Metadata: {'page_label': '5', 'file_name': '/content/data/transformers.pdf'}
Score: 0.7789498802510322
-------------------------------
Node Id: c9461bb2-2b72-4fe0-83a0-093d57b3eacb
Metadata: {'page_label': '11', 'file_name': '/content/data/transformers.pdf'}
Score: 0.7771674913721881
-------------------------------
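
The scores above decrease from the first node to the last, which is what a top k similarity search produces. Below is a minimal sketch of that ranking step, assuming cosine similarity over embedding vectors. The helper `top_k_by_cosine` and the tiny 3-dimensional vectors are hypothetical and only illustrate the idea; this is not LlamaIndex's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, node_vecs, k):
    """Return the k (node_id, score) pairs with the highest similarity, best first."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real embedding vectors are much longer.
node_vecs = {
    "page_5": [0.9, 0.1, 0.0],
    "page_6": [1.0, 0.0, 0.0],
    "page_9": [0.0, 1.0, 0.0],
    "page_11": [0.7, 0.7, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k_by_cosine(query_vec, node_vecs, k=3)
for node_id, score in results:
    print(node_id, round(score, 3))
```

Raising `k` here plays the same role as raising `similarity_top_k` in the retriever: more nodes come back, each with a lower score than the one before it.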


In [12]:
# print index 0 node
nodes[0].text

'Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations\nfor different layer types. nis the sequence length, dis the representation dimension, kis the kernel\nsize of convolutions and rthe size of the neighborhood in restricted self-attention.\nLayer Type Complexity per Layer Sequential Maximum Path Length\nOperations\nSelf-Attention O(n2·d) O(1) O(1)\nRecurrent O(n·d2) O(n) O(n)\nConvolutional O(k·n·d2) O(1) O(logk(n))\nSelf-Attention (restricted) O(r·n·d) O(1) O(n/r)\n3.5 Positional Encoding\nSince our model contains no recurrence and no convolution, in order for the model to make use of the\norder of the sequence, we must inject some information about the relative or absolute position of the\ntokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the\nbottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel\nas the embeddings, so that the two can be summed. There

# 2. Summary Index

In [13]:
# import library and create index
from llama_index.core import SummaryIndex

# create index
index = SummaryIndex.from_documents(documents)
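
To make the "sequential chain" idea concrete, here is a toy sketch of a summary-style index, assuming its default retrieval simply returns every stored chunk in order and leaves the filtering to a later step. `ToySummaryIndex` and `retrieve_all` are hypothetical names for illustration, not part of LlamaIndex.

```python
class ToySummaryIndex:
    """Keeps text chunks in insertion order, like a simple sequential list."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def retrieve_all(self, query):
        # The query is ignored here: every chunk is returned, in stored order.
        return list(self.chunks)


toy_index = ToySummaryIndex(["page 1 text", "page 2 text", "page 3 text"])
retrieved = toy_index.retrieve_all("What is the use of positional encodings?")
print(len(retrieved))
```

Because nothing is filtered at this stage, the LLM and embedding retrievers in the next two subsections are what narrow the list down.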

## 2.1 Summary Index - "LLM Retriever"

In [17]:
# create LLM retriever
retriever = index.as_retriever(retriever_mode='llm',
                              choice_batch_size=5)
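
As I understand it, `choice_batch_size` sets how many nodes are shown to the LLM in a single call, so the node list is processed in batches. The sketch below imitates that loop with a stub in place of the LLM. `stub_judge` (a simple word-overlap check) and `llm_style_retrieve` are hypothetical stand-ins, and the sample chunks are made up.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def stub_judge(query, batch):
    """Stand-in for the LLM call: keep chunks that share a word with the query."""
    query_words = set(query.lower().split())
    return [chunk for chunk in batch if query_words & set(chunk.lower().split())]


def llm_style_retrieve(query, chunks, choice_batch_size):
    """Judge the chunks batch by batch and collect the ones marked relevant."""
    selected = []
    calls = 0
    for batch in batched(chunks, choice_batch_size):
        calls += 1  # one (simulated) LLM call per batch
        selected.extend(stub_judge(query, batch))
    return selected, calls


chunks = [
    "positional encodings inject order information",
    "multi-head attention runs in parallel",
    "the encoder has six layers",
    "sinusoidal positional encodings use sine and cosine",
    "training used eight GPUs",
    "label smoothing hurts perplexity",
    "dropout is applied to sublayer outputs",
]
selected, calls = llm_style_retrieve("positional encodings", chunks, choice_batch_size=5)
print(calls, selected)
```

A smaller batch size means more calls with shorter prompts; a larger one means fewer calls with longer prompts.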

In [18]:
# retrieve the nodes the LLM judges relevant to the query
nodes = retriever.retrieve("What is the use of positional encodings?")

In [19]:
# let's take a look at the retrieved nodes
for node in nodes:
  print("Node Id:", node.id_)
  print("Metadata:", node.metadata)
  # print("Score:",node.get_score())
  print("-------------------------------")

Node Id: 43648bc2-4877-4deb-9bfb-0b96b3942f85
Metadata: {'page_label': '3', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
Node Id: 49b44fe9-c0b2-4ea7-bee9-26145f339c41
Metadata: {'page_label': '5', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
Node Id: 1e924c5c-5685-4bf7-81e6-5cdb1ba8369e
Metadata: {'page_label': '6', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
Node Id: 17703b07-da7a-45f5-a972-135f435eb61e
Metadata: {'page_label': '7', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
Node Id: 8612648e-cd23-4c53-a8b6-ec46a9f0b751
Metadata: {'page_label': '13', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
Node Id: fef834e5-57e1-4090-a9cd-3a50e802d73c
Metadata: {'page_label': '14', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
Node Id: 773d20bd-0b8f-4934-b5e6-687058c64f45
Metadata: {'page_label

## 2.2 Summary Index - "Embedding Retriever"

In [20]:
# create embedding retriever (choice_batch_size only applies to the LLM retriever)
retriever = index.as_retriever(retriever_mode='embedding',
                               similarity_top_k=5)

In [21]:
# retrieve the nodes most relevant to the query
nodes = retriever.retrieve("What is the use of positional encodings?")

In [22]:
# print len of nodes
len(nodes)

5

In [23]:
# let's take a look at the similarity scores
for node in nodes:
  print("Node Id:", node.id_)
  print("Metadata:", node.metadata)
  print("Score:",node.get_score())
  print("-------------------------------")

Node Id: 1e924c5c-5685-4bf7-81e6-5cdb1ba8369e
Metadata: {'page_label': '6', 'file_name': '/content/data/transformers.pdf'}
Score: 0.7861306606889289
-------------------------------
Node Id: 49b44fe9-c0b2-4ea7-bee9-26145f339c41
Metadata: {'page_label': '5', 'file_name': '/content/data/transformers.pdf'}
Score: 0.7789498802510322
-------------------------------
Node Id: bc3c826d-198d-46a0-be64-b534438d5676
Metadata: {'page_label': '11', 'file_name': '/content/data/transformers.pdf'}
Score: 0.7771674913721881
-------------------------------
Node Id: ad2a91b3-798b-436e-a5c0-58ba926aa84e
Metadata: {'page_label': '2', 'file_name': '/content/data/transformers.pdf'}
Score: 0.7765658837135382
-------------------------------
Node Id: 8612648e-cd23-4c53-a8b6-ec46a9f0b751
Metadata: {'page_label': '13', 'file_name': '/content/data/transformers.pdf'}
Score: 0.7726808049026301
-------------------------------


# 3. Keyword Table Index

## 3.1 Keyword Table Simple Retriever

In [24]:
# import keyword table index
from llama_index.core import KeywordTableIndex

In [25]:
# create table index
keyword_table_index = KeywordTableIndex.from_documents(
    documents,
    show_progress=True
)

Parsing nodes:   0%|          | 0/15 [00:00<?, ?it/s]

Extracting keywords from nodes:   0%|          | 0/15 [00:00<?, ?it/s]

In [26]:
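# create retriever; the keyword is retriever_mode (response_mode would be ignored
# and the default LLM-based keyword retriever used instead)
retriever = keyword_table_index.as_retriever(retriever_mode='simple')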

In [27]:
# retrieve the nodes that match the query keywords
nodes = retriever.retrieve("What is the use of positional encodings?")

In [28]:
# print len nodes
len(nodes)

2

In [29]:
# let's take a look at the retrieved nodes (keyword matches carry no similarity score)
for node in nodes:
  print("Node Id:", node.id_)
  print("Metadata:", node.metadata)
  # print("Score:",node.get_score())
  print("-------------------------------")

Node Id: 642f5cfe-3f1f-44ad-8e29-9cd2656c197a
Metadata: {'page_label': '9', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
Node Id: a4ab42ba-a9aa-4b92-95a5-efbcd21cd1e1
Metadata: {'page_label': '6', 'file_name': '/content/data/transformers.pdf'}
-------------------------------
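
The keyword lookup can be pictured as a small inverted table from keyword to node ids. The sketch below is a toy version assuming regex tokenization and a tiny hand-written stopword list. `extract_keywords`, `build_keyword_table`, `retrieve_by_keywords`, and the sample texts are all hypothetical, not LlamaIndex's own code.

```python
import re
from collections import defaultdict

# Tiny hypothetical stopword list, just enough for this example.
STOPWORDS = {"what", "is", "the", "of", "a", "an", "to", "in", "and", "use", "are"}


def extract_keywords(text):
    """Lowercase word tokens found by a regex, minus the stopwords."""
    return {tok for tok in re.findall(r"\w+", text.lower()) if tok not in STOPWORDS}


def build_keyword_table(nodes):
    """Map each keyword to the set of node ids whose text contains it."""
    table = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def retrieve_by_keywords(query, table):
    """Rank node ids by how many query keywords they match (ties by id)."""
    counts = defaultdict(int)
    for keyword in extract_keywords(query):
        for node_id in table.get(keyword, ()):
            counts[node_id] += 1
    return sorted(counts, key=lambda node_id: (-counts[node_id], node_id))


nodes = {
    "page_6": "Positional encodings are added to the input embeddings.",
    "page_9": "We also experimented with learned positional embeddings.",
    "page_4": "Scaled dot-product attention uses queries, keys and values.",
}
table = build_keyword_table(nodes)
matches = retrieve_by_keywords("What is the use of positional encodings?", table)
print(matches)
```

Nodes that share no keyword with the query never appear in the result, which is why this kind of retriever can return fewer nodes than an embedding search.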


# 4. Document Summary Index

In [30]:
# import document summary index
from llama_index.core import DocumentSummaryIndex
from llama_index.core import get_response_synthesizer

In [31]:
# create response_synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize"
)

# create doc_summary_index
doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer,
    show_progress=True
)

Parsing nodes:   0%|          | 0/15 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/15 [00:00<?, ?it/s]

current doc id: 492a1949-a4e8-4d93-874c-afee4fbd4a27
current doc id: b3687d67-8b89-43c3-bb65-257b786c6d41
current doc id: 09641019-a351-439c-8a5f-8e576b62d80c
current doc id: ed4d1c0c-100b-47bd-b0ce-f11b4abd8fbe
current doc id: f4f2b136-ff83-405b-b789-db979f93c614
current doc id: 0e0e6a0d-fb58-4947-9ea5-074ff0e6028c
current doc id: 9546c625-1458-4863-a0c3-f81a9851e925
current doc id: f445b44f-1b5a-4916-8680-3359296b1bcf
current doc id: 09268535-20d5-4158-9962-c8509f2c275d
current doc id: 7322255b-5078-4aa4-90f2-5ed178bfa4a4
current doc id: 487963ff-c82d-41eb-a33d-e982ac4e6474
current doc id: 22c5b787-71ff-4566-b89b-2adcfc69c6a4
current doc id: 04a5ec18-e536-4803-a078-80c393852c8f
current doc id: 3e2a7a46-4e24-453e-be23-a9d3ef827218
current doc id: b6e2464b-9881-4149-9852-664d2ca40f7f


Generating embeddings:   0%|          | 0/15 [00:00<?, ?it/s]

## 4.1 Document Summary Index - "LLM Retriever"

In [32]:
# create retriever
retriever = doc_summary_index.as_retriever(retriever_mode='llm',
                                           choice_batch_size=3)

In [33]:
# retrieve the nodes whose document summaries the LLM judges relevant
nodes = retriever.retrieve("What is the use of positional encodings?")

In [34]:
# print len of nodes
len(nodes)

1

In [36]:
# view output
for node in nodes:
  print("Node Id:", node.id_)
  print("Metadata:", node.metadata)
  print("-------------------------------")

Node Id: eb42f9c6-694a-4d79-9997-4f4f659374e1
Metadata: {'page_label': '3', 'file_name': '/content/data/transformers.pdf'}
-------------------------------


## 4.2 Document Summary Index - "Embedding Retriever"

In [37]:
# create retriever
retriever = doc_summary_index.as_retriever(retriever_mode="embedding",
                                           similarity_top_k=5)

In [38]:
# retrieve the nodes whose document summaries are most similar to the query
nodes = retriever.retrieve("What is the use of positional encodings?")

In [39]:
# view output
for node in nodes:
  print("Node Id:", node.id_)
  print("Metadata:", node.metadata)
  print("-----------------------------")

Node Id: 189150e8-e3d9-4513-8e74-cf061aae016b
Metadata: {'page_label': '9', 'file_name': '/content/data/transformers.pdf'}
-----------------------------
Node Id: e9582568-ca4a-4945-9d49-868ad93401ff
Metadata: {'page_label': '5', 'file_name': '/content/data/transformers.pdf'}
-----------------------------
Node Id: c79b02eb-b33c-4590-a145-4b2ea4ec6e49
Metadata: {'page_label': '7', 'file_name': '/content/data/transformers.pdf'}
-----------------------------
Node Id: e13611a9-97da-40c9-98bf-394017bfd099
Metadata: {'page_label': '10', 'file_name': '/content/data/transformers.pdf'}
-----------------------------
Node Id: eb42f9c6-694a-4d79-9997-4f4f659374e1
Metadata: {'page_label': '3', 'file_name': '/content/data/transformers.pdf'}
-----------------------------
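
The key idea in this section is that the query is matched against each document's summary, and the chunks of the best-matching documents are returned. Here is a toy sketch of that two-step lookup, assuming bag-of-words cosine similarity as a stand-in for real embeddings. `retrieve_via_summaries` and the sample summaries and chunks are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens; a crude stand-in for an embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_via_summaries(query, docs, top_k):
    """Rank documents by summary similarity, then return the chunks of the top_k documents."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        docs,
        key=lambda doc_id: cosine(query_vec, bag_of_words(docs[doc_id]["summary"])),
        reverse=True,
    )
    hits = []
    for doc_id in ranked[:top_k]:
        hits.extend(docs[doc_id]["chunks"])
    return hits


# Made-up summaries and chunk labels for illustration only.
toy_docs = {
    "doc_a": {
        "summary": "Describes positional encodings built from sine and cosine functions",
        "chunks": ["chunk a1", "chunk a2"],
    },
    "doc_b": {
        "summary": "Reports training hardware, schedule, and optimizer settings",
        "chunks": ["chunk b1"],
    },
    "doc_c": {
        "summary": "Compares self-attention with recurrent and convolutional layers",
        "chunks": ["chunk c1"],
    },
}

hits = retrieve_via_summaries("positional encodings", toy_docs, top_k=1)
print(hits)
```

Note that the chunk text itself is never compared with the query in this sketch; only the summaries decide which documents are selected.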
