# Using LLM-powered retrieval and reranking (Titan LLM + Bedrock Titan embedding)

### Context

LLM-driven retrieval can return more relevant documents than embedding-based retrieval, but at the cost of higher latency and expense. We will show that running embedding-based retrieval first, followed by a second LLM-based stage that reranks the candidates, offers a balanced solution.

The past several months have seen a surge in "Develop a chatbot using your data" applications, a trend facilitated by frameworks such as LlamaIndex and LangChain. Many of these applications rely on a standard approach known as retrieval augmented generation (RAG):

1) A vector store is employed to store unstructured documents (knowledge corpus).
2) When presented with a query, a retrieval model is utilized to fetch relevant documents from the corpus, followed by a synthesis model that generates a response.
3) The retrieval model retrieves the top-k documents based on the similarity of their embeddings to the query. Note that top-k embedding-based semantic search has existed for over a decade and does not involve an LLM.

The utilization of embedding-based retrieval offers numerous advantages:

* Dot product calculations are swift and don't necessitate model invocations during query processing.
* Although not flawless, embeddings can effectively capture the semantics of documents and queries. There's a subset of queries for which embedding-based retrieval yields highly relevant outcomes.
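
To make the first point concrete, here is a minimal sketch of top-k retrieval by dot product. It uses hand-written two-dimensional toy vectors rather than real Titan embeddings, and the function names (`dot`, `top_k_by_dot_product`) are illustrative, not part of LlamaIndex or Bedrock.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k highest-scoring documents."""
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for precomputed document embeddings (assumption).
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [0.8, 0.6]

top2 = top_k_by_dot_product(query_vec, doc_vecs, k=2)
print(top2)  # document 1 ranks first, then document 0
```

The document vectors are computed once at indexing time, so the only work at query time is embedding the query and scoring it against the stored vectors.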

However, embedding-based retrieval can exhibit imprecision and return irrelevant context for the query due to various factors. This subsequently diminishes the quality of the overall RAG system, irrespective of the LLM's quality.

Addressing this challenge is not novel; existing information retrieval and recommendation systems have adopted a two-stage approach. The initial stage employs embedding-based retrieval with a higher top-k value to maximize recall while accepting a lower precision. Subsequently, the second stage utilizes a somewhat more computationally intensive process characterized by higher precision and lower recall (such as BM25) to "rerank" the initially retrieved candidates.
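
The two-stage idea can be sketched independently of any particular library. In the sketch below, the first stage ranks by dot product over toy vectors, and the second stage reranks the surviving candidates with a pluggable scoring function. The scorer used here, `word_overlap`, is a deliberately simple stand-in for a real second-stage scorer such as BM25 or an LLM; all names and data are illustrative assumptions.

```python
from typing import Callable, List, Sequence


def first_stage(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int,
) -> List[int]:
    """Cheap, recall-oriented stage: keep the top_k documents by dot product."""
    def score(i: int) -> float:
        return sum(q * d for q, d in zip(query_vec, doc_vecs[i]))

    return sorted(range(len(doc_vecs)), key=score, reverse=True)[:top_k]


def second_stage(
    query: str,
    candidate_ids: List[int],
    texts: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int,
) -> List[int]:
    """Costlier, precision-oriented stage: rerank only the candidates."""
    ranked = sorted(candidate_ids, key=lambda i: score_fn(query, texts[i]), reverse=True)
    return ranked[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Stand-in scorer (assumption): number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


texts = [
    "AWS launched EC2 in 2006",
    "Amazon fulfillment network grew",
    "How AWS evolved after EC2 launched",
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]  # toy embeddings
query, query_vec = "how has AWS evolved", [1.0, 0.0]

candidates = first_stage(query_vec, doc_vecs, top_k=2)
final = second_stage(query, candidates, texts, word_overlap, top_n=1)
print(candidates, final)
```

Because the second stage only sees the `top_k` candidates, its per-document cost can be much higher than the first stage's without dominating query latency.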

Delving into the shortcomings of embedding-based retrieval would require an entire series of blog posts. This current post serves as an initial exploration of an alternative retrieval technique and its potential to enhance embedding-based retrieval methodologies.
 
![How LLM retrieval works](./images/arch.png)

### LLM Retrieval and Reranking

The LLM retrieval and reranking strategy uses the LLM itself to decide which documents or sections of text are relevant to the given query. The input prompt contains a set of candidate documents, and the LLM is asked to select the relevant ones and to assign each a relevance score based on its own internal judgment.


In this notebook we explain how to implement the LLM-powered retrieval and reranking pattern using an Amazon Bedrock LLM and LlamaIndex.
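
As a rough illustration of this pattern, the sketch below builds a reranking prompt from candidate passages and parses a scored answer. The prompt wording and the `Doc: <number>, Relevance: <score>` answer format are assumptions made for this example and are not necessarily the prompt LlamaIndex sends. The answer string is hand-written to exercise the parser; it is not real model output.

```python
import re
from typing import List, Tuple


def build_rerank_prompt(query: str, passages: List[str]) -> str:
    """Assemble a prompt asking the LLM to score each passage (hypothetical wording)."""
    lines = [
        "Rate each document's relevance to the question from 1 to 10.",
        "Answer with one line per relevant document, formatted as",
        "'Doc: <number>, Relevance: <score>'.",
        "",
    ]
    for number, passage in enumerate(passages, start=1):
        lines.append(f"Document {number}:\n{passage}\n")
    lines.append(f"Question: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


_ANSWER_LINE = re.compile(r"Doc:\s*(\d+)\s*,\s*Relevance:\s*(\d+(?:\.\d+)?)")


def parse_rerank_answer(answer: str, num_passages: int, top_n: int) -> List[Tuple[int, float]]:
    """Return (zero-based passage index, score) pairs, best first.

    Lines that reference a document number outside the batch are ignored,
    since an LLM can return numbers that were never in the prompt.
    """
    picks = []
    for match in _ANSWER_LINE.finditer(answer):
        doc_number, score = int(match.group(1)), float(match.group(2))
        if 1 <= doc_number <= num_passages:
            picks.append((doc_number - 1, score))
    picks.sort(key=lambda pair: pair[1], reverse=True)
    return picks[:top_n]


passages = ["EC2 launched in 2006.", "Fulfillment centers expanded."]
prompt = build_rerank_prompt("How has AWS evolved?", passages)

# Hand-written stand-in for an LLM response (assumption, not real output).
mock_answer = "Doc: 1, Relevance: 9\nDoc: 2, Relevance: 3\nDoc: 7, Relevance: 10"
ranked = parse_rerank_answer(mock_answer, len(passages), top_n=1)
print(ranked)
```

Parsing defensively matters here because the quality of the reranking depends on the LLM following the requested answer format.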

#### LlamaIndex
LlamaIndex is a data framework for your LLM application. It provides the following tools:

* Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
* Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
* Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
* Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, anything else).

LlamaIndex provides tools for both beginner and advanced users. Its high-level API allows beginners to ingest and query their data in 5 lines of code. Its lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules) to fit their needs.

### LLM Used:
We will be using the Amazon Titan LLM and the Amazon Titan embedding model, both served through Amazon Bedrock, for this demonstration.



### Setup

We will first install the necessary libraries.

In [2]:
!cd .. && ./download-dependencies.sh

./download-dependencies.sh: 2: Bad substitution
Existing dependency directory found, removing and replacing...
Creating dependency directory
Downloading dependencies
Unpacking dependencies
/usr/bin/unzip
Archive:  sdk.zip
   creating: reviews/
  inflating: awscli-1.29.21.tar.gz   
  inflating: AWSCLI32PY3.msi         
  inflating: manifest.json           
  inflating: AWSCLISetup.exe         
  inflating: .functional             
  inflating: awscli-bundle.zip       
  inflating: boto3-1.28.21-py3-none-any.whl  
  inflating: models-starfort-report.json  
  inflating: .unit                   
  inflating: AWSCLI32.msi            
  inflating: .unit-crt               
  inflating: AWSCLI64PY3.msi         
  inflating: boto3-1.28.21.tar.gz    
  inflating: .functional-crt         
  inflating: AWSCLI64.msi            
  inflating: botocore-1.31.21.tar.gz  
  inflating: awscli-1.29.21-py3-none-any.whl  
  inflating: botocore-1.31.21-py3-none-any.whl  
  inflating: reviews/awscli-1.29.21.di

In [3]:
import glob
import subprocess

botocore_whl_filename = glob.glob("../dependencies/botocore-*-py3-none-any.whl")[0]
boto3_whl_filename = glob.glob("../dependencies/boto3-*-py3-none-any.whl")[0]

subprocess.Popen(['pip', 'install', botocore_whl_filename, boto3_whl_filename, '--force-reinstall'], bufsize=1, universal_newlines=True)

<Popen: returncode: None args: ['pip', 'install', '../dependencies/botocore-...>

In [4]:
%pip install pydantic==1.10.12 --force-reinstall 

Processing /root/amazon-bedrock-rag/dependencies/botocore-1.31.21-py3-none-any.whl
Processing /root/amazon-bedrock-rag/dependencies/boto3-1.28.21-py3-none-any.whl
Collecting jmespath<2.0.0,>=0.7.1 (from botocore==1.31.21)
  Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting python-dateutil<3.0.0,>=2.1 (from botocore==1.31.21)
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pydantic==1.10.12
  Obtaining dependency information for pydantic==1.10.12 from https://files.pythonhosted.org/packages/bc/e0/0371e9b6c910afe502e5fe18cc94562bfd9399617c7b4f5b6e13c29115b3/pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (149 kB)
Collecting urllib3<1.27,>=1.25.4 (from botocore==1.31.21)
  Obtaining dependency information for urllib3<1.27,>=1.25.4 from https://files.pythonhosted.org/packages/c5/05/c214b32d21c0b465506f95c4f28ccb

In [5]:
# langchain==0.0.266 is required by llama-index==0.8.8
%pip install langchain==0.0.266 \
    pypdf==3.15.2 \
    llama-index==0.8.8 \
    sentence_transformers==2.2.2 --force-reinstall

  Attempting uninstall: s3transfer
    Found existing installation: s3transfer 0.6.2
    Uninstalling s3transfer-0.6.2:
      Successfully uninstalled s3transfer-0.6.2
  Attempting uninstall: boto3
    Found existing installation: boto3 1.28.21
    Uninstalling boto3-1.28.21:
      Successfully uninstalled boto3-1.28.21
Successfully installed boto3-1.28.21 botocore-1.31.21 jmespath-1.0.1 python-dateutil-2.8.2 s3transfer-0.6.2 six-1.16.0 urllib3-1.26.16


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
awscli 1.29.14 requires botocore==1.31.14, but you have botocore 1.31.21 which is incompatible.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.2 which is incompatible.
jupyterlab 3.4.4 requires jupyter-server~=1.16, but you have jupyter-server 2.7.0 which is incompatible.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.7.0 which is incompatible.
notebook 6.5.5 requires jupyter-client<8,>=5.3.4, but you have jupyter-client 8.3.0 which is incompatible.
notebook 6.5.5 requires pyzmq<25,>=17, but you have pyzmq 25.1.0 which is incompatible.
panel 0.13.1 requires bokeh<2.5.0,>=2.4.0, but you have bokeh 3.2.1 which is incompatib

Collecting langchain==0.0.266
  Obtaining dependency information for langchain==0.0.266 from https://files.pythonhosted.org/packages/20/bf/976db78bec94a810375d3eb7fab5501ef774993da5f54947d5ba71efc997/langchain-0.0.266-py3-none-any.whl.metadata
  Using cached langchain-0.0.266-py3-none-any.whl.metadata (14 kB)
Collecting pypdf==3.15.2
  Obtaining dependency information for pypdf==3.15.2 from https://files.pythonhosted.org/packages/7e/19/d90c9a6b187df41a0136c8effdede51a63ed23c0be80f9792f042c992df8/pypdf-3.15.2-py3-none-any.whl.metadata
  Using cached pypdf-3.15.2-py3-none-any.whl.metadata (7.1 kB)
Collecting llama-index==0.8.8
  Obtaining dependency information for llama-index==0.8.8 from https://files.pythonhosted.org/packages/95/02/3ba91f40e7eed449158f2aaba47cbfa0f81e9525ccc5c0d4816302662593/llama_index-0.8.8-py3-none-any.whl.metadata
  Using cached llama_index-0.8.8-py3-none-any.whl.metadata (4.8 kB)
Collecting sentence_transformers==2.2.2
  Using cached sentence_transformers-2.2.2-py

In [6]:
import nest_asyncio

nest_asyncio.apply()

In [7]:
import sys

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    LLMPredictor,
    get_response_synthesizer,
    set_global_service_context,
    StorageContext,
    ListIndex
)
from llama_index.indices.postprocessor import LLMRerank
from llama_index.llms import OpenAI
from IPython.display import Markdown, display

### Setup langchain and llama index

In this step we create instances of the LLM and the embedding model. Both are Amazon Titan models served through Bedrock.

In [10]:
from llama_index import LangchainEmbedding
from langchain.llms.bedrock import Bedrock 
from langchain.embeddings.bedrock import BedrockEmbeddings

model_kwargs_titan = { 
        "maxTokenCount": 512,
        "stopSequences": [],
        "temperature":0.0,  
        "topP":0.5
    }

llm = Bedrock(model_id="amazon.titan-tg1-large",
              model_kwargs=model_kwargs_titan)

embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-e1t-medium")
)

service_context = ServiceContext.from_defaults(llm=llm, 
                                               embed_model=embed_model, 
                                               chunk_size=512)
set_global_service_context(service_context)

### Load Datasets

In [11]:
!mkdir -p ./data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

As part of Amazon's culture, the CEO always includes a copy of the 1997 Letter to Shareholders with every new release. This will cause repetition, take longer to generate embeddings, and may skew your results. In the next section you will take the downloaded data, trim the 1997 letter (last 3 pages) and overwrite them as processed files.

In [12]:
import glob
from pypdf import PdfReader, PdfWriter

local_pdfs = glob.glob(data_root + '*.pdf')

for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    with open(local_pdf, 'wb') as new_file:
        new_file.seek(0)
        pdf_writer.write(new_file)
        new_file.truncate()


Now that you have clean PDFs to work with, you will load them as documents, tag them with a document ID, and then use a process called "chunking" to break each larger document into small pieces. These small pieces allow you to generate embeddings without surpassing the input limit of the embedding model.

In this example the chunk size comes from the `ServiceContext` configured earlier with `chunk_size=512`, so LlamaIndex splits each document into chunks of at most 512 tokens when the index is built, with a small overlap between neighbouring chunks. The overlap lets each chunk keep some of its surrounding context.
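
To illustrate what chunking with overlap means, here is a minimal character-based sketch. It is a simplification for explanation only: LlamaIndex's node parser performs its own splitting, and the function name `chunk_text` and the sample values are assumptions for this example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, which is how a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.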

In [13]:
docs = []
for filename in filenames:
    doc = SimpleDirectoryReader(input_files=[f"data/{filename}"]).load_data()
    doc[0].doc_id = filename.replace(".pdf", "")
    docs.extend(doc)

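### Build the Vector Store Index

We first create a Bedrock client and confirm which foundation models are available. We then build a `VectorStoreIndex` over the loaded documents, parse the documents into nodes, and add those nodes to an in-memory document store.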


In [14]:
#### Uncomment the following lines to run from your local environment outside of the AWS account with Bedrock access

#import os
#os.environ['BEDROCK_ASSUME_ROLE'] = '<YOUR_VALUES>'
#os.environ['AWS_PROFILE'] = 'bedrock-user'

In [15]:
import os
import boto3
import json
import sys

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww

os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'
boto3_bedrock = bedrock.get_bedrock_client(os.environ.get('BEDROCK_ASSUME_ROLE', None))

Create new client
  Using region: us-west-2
boto3 Bedrock client successfully created!
bedrock(https://bedrock.us-west-2.amazonaws.com)


In [16]:
boto3_bedrock.list_foundation_models()

{'ResponseMetadata': {'RequestId': '37ef4a35-d74a-48a2-86ea-8012ccffaeb3',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Thu, 24 Aug 2023 04:50:12 GMT',
   'content-type': 'application/json',
   'content-length': '1166',
   'connection': 'keep-alive',
   'x-amzn-requestid': '37ef4a35-d74a-48a2-86ea-8012ccffaeb3'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large'},
  {'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-e1t-medium',
   'modelId': 'amazon.titan-e1t-medium'},
  {'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/stability.stable-diffusion-xl',
   'modelId': 'stability.stable-diffusion-xl'},
  {'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/ai21.j2-grande-instruct',
   'modelId': 'ai21.j2-grande-instruct'},
  {'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/ai21.j2-jumbo-instruct',
   'modelId': 'ai21.j2-jumbo-i

In [17]:
index = VectorStoreIndex.from_documents(docs,
    service_context=service_context)

In [18]:
nodes = service_context.node_parser.get_nodes_from_documents(docs)

In [19]:
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

## Retrieval

In [20]:
from llama_index.retrievers import VectorIndexRetriever
from llama_index.indices.query.schema import QueryBundle
import pandas as pd
from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

def get_retrieved_nodes(
    query_str, vector_top_k=10, reranker_top_n=3, with_reranker=False
):
    query_bundle = QueryBundle(query_str)
    # configure retriever
    retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=vector_top_k

    )
    retrieved_nodes = retriever.retrieve(query_bundle)

    if with_reranker:
        # configure reranker
        reranker = LLMRerank(
            choice_batch_size=5, 
            top_n=reranker_top_n, 
            service_context=service_context
        )
        retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)

    return retrieved_nodes


def pretty_print(df):
    return display(HTML(df.to_html().replace("\\n", "<br>")))


def visualize_retrieved_nodes(nodes) -> None:
    result_dicts = []
    for node in nodes:
        result_dict = {"Score": node.score, "Text": node.node.get_text()}
        result_dicts.append(result_dict)

    pretty_print(pd.DataFrame(result_dicts))

Now, we will showcase how to do a two-stage pass for retrieval. Use embedding-based retrieval with a high top-k value in order to maximize recall and get a large set of candidate items. Then, use LLM-based retrieval to dynamically select the nodes that are actually relevant to the query.

In [21]:
retrieved_nodes1 = get_retrieved_nodes(
    "How has AWS evolved?", vector_top_k=3, with_reranker=False
)

In [22]:
len(retrieved_nodes1)

3

In [23]:
for i, node in enumerate(retrieved_nodes1):
    print(node.score)
    print(node.node.get_text())
    print("-----------------------------------------------------------------------------------------------------------------------------------")
    

0.6064377194306085
Everybody agreed that having a persistent block store was important to a complete compute service;
however, to have one ready would take an extra year.The question became could we offer customers auseful service where they could get meaningful value before we had all the features we thought they wanted?We decided that the initial launch of EC2 could be feature-poor if we also organized ourselves to listen tocustomers and iterate quickly.This approach works well if you indeed iterate quickly; but, is disastrous if youcan’t.We launched EC2 in 2006 with one instance size, in one data center, in one region of the world, withLinux operating system instances only (no Windows), without monitoring, load balancing, auto-scaling, oryes, persistent storage.EC2 was an initial success, but nowhere near the multi-billion-dollar service it’sbecome until we added the missing capabilities listed above, and then some.In the early days of AWS, people sometimes asked us why compute woul

In [24]:
retrieved_nodes1_withreranker = get_retrieved_nodes(
    "How has AWS evolved?",
    vector_top_k=3,
    reranker_top_n=1,
    with_reranker=True,
)

In [25]:
len(retrieved_nodes1_withreranker)

1

In [26]:
for i, node in enumerate(retrieved_nodes1_withreranker):
    print(node.score)
    print(node.node.get_text())
    print("-----------------------------------------------------------------------------------------------------------------------------------")
    

10.0
Everybody agreed that having a persistent block store was important to a complete compute service;
however, to have one ready would take an extra year.The question became could we offer customers auseful service where they could get meaningful value before we had all the features we thought they wanted?We decided that the initial launch of EC2 could be feature-poor if we also organized ourselves to listen tocustomers and iterate quickly.This approach works well if you indeed iterate quickly; but, is disastrous if youcan’t.We launched EC2 in 2006 with one instance size, in one data center, in one region of the world, withLinux operating system instances only (no Windows), without monitoring, load balancing, auto-scaling, oryes, persistent storage.EC2 was an initial success, but nowhere near the multi-billion-dollar service it’sbecome until we added the missing capabilities listed above, and then some.In the early days of AWS, people sometimes asked us why compute wouldn’t just be a

In [27]:
retrieved_nodes2 = get_retrieved_nodes(
    "Human: Why is Amazon successful?", vector_top_k=3, with_reranker=False
)

In [28]:
len(retrieved_nodes2)

3

In [29]:
for i, node in enumerate(retrieved_nodes2):
    print(node.score)
    print(node.node.get_text())
    print("-----------------------------------------------------------------------------------------------------------------------------------")
    

0.6096547344564625
through the pandemic the same way without the dedication and extraordinary efforts shown by our teams
during this period, and I’m eternally grateful.It’s not normal for a company of any size to be able to respond to something as discontinuous and
unpredictable as this pandemic turned out to be.What is it about Amazon that made it possible for us to doso?It’s because we weren’t starting from a standing start.We had been iterating on and remaking ourfulfillment capabilities for nearly two decades.In every business we pursue, we’re constantly experimentingand inventing.We’re divinely discontented with customer experiences, whether they’re our own or not.Webelieve these customer experiences can always be better, and we strive to make customers’ lives better andeasier every day.The beauty of this mission is that you never run out of runway; customers always want better,and our job is both to listen to their feedback and to imagine what else is possible and invent on their

In [30]:
retrieved_nodes2_withreranker = get_retrieved_nodes(
    "Human: Why is Amazon successful?",
    vector_top_k=3,
    reranker_top_n=1,
    with_reranker=True,
)

In [31]:
len(retrieved_nodes2_withreranker)

1

In [32]:
for i, node in enumerate(retrieved_nodes2_withreranker):
    print(node.score)
    print(node.node.get_text())
    print("-----------------------------------------------------------------------------------------------------------------------------------")
    

10.0
through the pandemic the same way without the dedication and extraordinary efforts shown by our teams
during this period, and I’m eternally grateful.It’s not normal for a company of any size to be able to respond to something as discontinuous and
unpredictable as this pandemic turned out to be.What is it about Amazon that made it possible for us to doso?It’s because we weren’t starting from a standing start.We had been iterating on and remaking ourfulfillment capabilities for nearly two decades.In every business we pursue, we’re constantly experimentingand inventing.We’re divinely discontented with customer experiences, whether they’re our own or not.Webelieve these customer experiences can always be better, and we strive to make customers’ lives better andeasier every day.The beauty of this mission is that you never run out of runway; customers always want better,and our job is both to listen to their feedback and to imagine what else is possible and invent on theirbehalf.People 