# Using LLM-powered retrieval and reranking - (Titan LLM + Bedrock Titan embedding)

### Context

LLM-driven retrieval can return more relevant documents than embedding-based retrieval, but at the cost of higher latency and expense. We will show that using embedding-based retrieval as a first stage, followed by a second, LLM-based stage that reranks the candidates, can offer a balanced solution.

The past several months have seen a surge in "Develop a chatbot using your data" applications, facilitated by frameworks such as LlamaIndex and LangChain. Many of these applications rely on a standard approach known as retrieval augmented generation (RAG):

1) A vector store is employed to store unstructured documents (knowledge corpus).
2) When presented with a query, a retrieval model is utilized to fetch relevant documents from the corpus, followed by a synthesis model that generates a response.
3) The retrieval model retrieves the top-k documents by the similarity of their embeddings to the query. Note that top-k embedding-based semantic search has existed for over a decade and does not involve an LLM.

The utilization of embedding-based retrieval offers numerous advantages:

* Dot products are fast to compute and require no LLM calls at query time; only the query itself has to be embedded.
* Although not flawless, embeddings can effectively capture the semantics of documents and queries. There's a subset of queries for which embedding-based retrieval yields highly relevant outcomes.
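
To make the first point concrete, here is a minimal sketch of top-k retrieval by dot product. It is not taken from LlamaIndex: the function names and the three-dimensional toy vectors are assumptions chosen for illustration, and a real system would obtain its vectors from an embedding model.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs for the k highest-scoring documents."""
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    # Scoring is plain arithmetic over precomputed vectors; no LLM is called here.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" with made-up values (hypothetical, for illustration only).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [0.8, 0.6, 0.0]

top_two = top_k_by_dot_product(toy_query_vec, toy_doc_vecs, k=2)
print(top_two)
```

The document vectors are computed once at indexing time, which is why the query-time cost stays low.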

However, embedding-based retrieval can exhibit imprecision and return irrelevant context for the query due to various factors. This subsequently diminishes the quality of the overall RAG system, irrespective of the LLM's quality.

Addressing this challenge is not novel; existing information retrieval and recommendation systems have adopted a two-stage approach. The initial stage employs embedding-based retrieval with a higher top-k value to maximize recall while accepting a lower precision. Subsequently, the second stage utilizes a somewhat more computationally intensive process characterized by higher precision and lower recall (such as BM25) to "rerank" the initially retrieved candidates.
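
The two-stage idea can be sketched in a few lines. This is an illustrative sketch only: `two_stage_retrieve` and the two toy scoring functions are hypothetical stand-ins (word counting instead of embeddings, a density score instead of an LLM or BM25), chosen so the example runs without any model.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    first_stage_k: int,
    final_n: int,
) -> List[str]:
    """Stage 1 keeps the top first_stage_k by cheap_score; stage 2 reorders them."""
    # Stage 1: wide, recall-oriented candidate selection over the whole corpus.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:first_stage_k]
    # Stage 2: the costlier scorer only sees the short candidate list.
    reranked = sorted(
        candidates, key=lambda doc: precise_score(query, doc), reverse=True
    )
    return reranked[:final_n]


def term_hits(query: str, doc: str) -> float:
    """Toy stage-1 score: how many document words appear in the query."""
    query_terms = set(query.split())
    return float(sum(1 for word in doc.split() if word in query_terms))


def coverage_density(query: str, doc: str) -> float:
    """Toy stage-2 score: distinct query terms covered, per document word."""
    doc_words = doc.split()
    covered = set(query.split()) & set(doc_words)
    return len(covered) / len(doc_words)


toy_corpus = [
    "aws launched ec2 in 2006",
    "aws aws aws marketing page about many unrelated things and more words here",
    "kindle was released in 2007",
    "ec2 is the aws compute service",
]
results = two_stage_retrieve(
    "aws ec2", toy_corpus, term_hits, coverage_density, first_stage_k=3, final_n=2
)
print(results)
```

In this toy corpus the keyword-stuffed document ranks first in stage 1 but is pushed out by stage 2, which is the behaviour a reranker is meant to provide.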

Delving into the shortcomings of embedding-based retrieval would require an entire series of blog posts. This current post serves as an initial exploration of an alternative retrieval technique and its potential to enhance embedding-based retrieval methodologies.
 
![LLM retrieval works](./images/arch.png)

### LLM Retrieval and Reranking

The LLM retrieval and reranking strategy uses the LLM itself to decide which documents or chunks of text are relevant to the given query. The input prompt contains a set of candidate documents, and the LLM is asked to select the relevant ones and to assign each a relevance score.
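
As a rough sketch of what such a prompt and its parsing might look like, consider the code below. The prompt wording, the `Doc: <number>, Relevance: <score>` answer format, and both function names are assumptions made for this illustration; they are not presented as the exact prompt or parser that LlamaIndex uses.

```python
import re
from typing import List, Tuple


def build_rerank_prompt(query: str, documents: List[str]) -> str:
    """Number the candidate documents and ask the LLM to pick and score them."""
    numbered = "\n\n".join(
        f"Document {i}:\n{text}" for i, text in enumerate(documents, start=1)
    )
    return (
        "A list of documents is shown below.\n"
        "List the documents that help answer the question, one per line, as\n"
        "'Doc: <number>, Relevance: <score from 1 to 10>'.\n\n"
        f"{numbered}\n\nQuestion: {query}\nAnswer:\n"
    )


# Assumed answer-line format (hypothetical): "Doc: 2, Relevance: 9"
_ANSWER_LINE = re.compile(r"Doc:\s*(\d+)\s*,\s*Relevance:\s*(\d+(?:\.\d+)?)")


def parse_rerank_answer(
    answer: str, num_docs: int, top_n: int
) -> List[Tuple[int, float]]:
    """Return (zero-based doc index, score) pairs, best first, at most top_n."""
    picks = []
    for match in _ANSWER_LINE.finditer(answer):
        doc_number = int(match.group(1))
        # Ignore numbers that do not refer to a document in the prompt.
        if 1 <= doc_number <= num_docs:
            picks.append((doc_number - 1, float(match.group(2))))
    picks.sort(key=lambda pick: pick[1], reverse=True)
    return picks[:top_n]


prompt = build_rerank_prompt("How has AWS evolved?", ["text A", "text B", "text C"])
parsed = parse_rerank_answer(
    "Doc: 2, Relevance: 9\nDoc: 1, Relevance: 4\nDoc: 7, Relevance: 3",
    num_docs=3,
    top_n=2,
)
unparseable = parse_rerank_answer("I am not sure.", num_docs=3, top_n=2)
print(parsed, unparseable)
```

Note that an answer which does not follow the expected format yields an empty selection, so the quality of this stage depends on the LLM following the output instructions.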


In this notebook we explain how to implement the retriever pattern of LLM-powered retrieval and reranking using an Amazon Bedrock LLM and LlamaIndex.

#### LlamaIndex
LlamaIndex is a data framework for your LLM application. It provides the following tools:

* Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
* Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
* Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
* Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, anything else).
* Serves both beginner and advanced users: the high-level API lets beginners ingest and query their data in 5 lines of code, while the lower-level APIs let advanced users customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules) to fit their needs.

### LLM Used:
We will use the Amazon Titan text LLM and the Amazon Titan embedding model, both accessed through Amazon Bedrock, for this demonstration.



### Setup

We will first install the necessary libraries

In [None]:
%pip install -U pip

In [1]:
%pip install pypdf==3.15.2 --force-reinstall --quiet
%pip install llama-index==0.9.10 --force-reinstall --quiet
%pip install sentence_transformers==2.2.2 --force-reinstall --quiet

In [3]:
#%pip install pydantic==1.10.13 --force-reinstall --quiet

In [4]:
#%pip install sqlalchemy==2.0.21 --force-reinstall --quiet

In [5]:
import nest_asyncio

nest_asyncio.apply()

In [6]:
import sys

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    LLMPredictor,
    get_response_synthesizer,
    set_global_service_context,
    StorageContext,
    ListIndex
)
from llama_index.indices.postprocessor import LLMRerank
from IPython.display import Markdown, display

### Setup LlamaIndex

In this step we create instances of the LLM and the embedding model. We use Amazon Titan models for both.

In [7]:
#### Uncomment the following lines to run from your local environment outside of the AWS account with Bedrock access

#import os
#os.environ['BEDROCK_ASSUME_ROLE'] = '<YOUR_VALUES>'
#os.environ['AWS_PROFILE'] = 'bedrock-user'

In [8]:
import os
import boto3
import json
import sys

region = os.environ.get("AWS_DEFAULT_REGION", None)


In [100]:
from llama_index.llms import Bedrock
from llama_index.embeddings import BedrockEmbedding

# Inference parameters passed through to the Titan text model
model_kwargs_titan = {
    "stopSequences": [],
    "temperature": 0.0,
    "topP": 0.5,
}

llm = Bedrock(
    model="amazon.titan-tg1-large",
    context_size=512,
    aws_region_name=region,
    additional_kwargs=model_kwargs_titan,
)

# Build the embedding client; aws_profile=None leaves credential lookup to the defaults
embed_model = BedrockEmbedding.from_credentials(
    aws_profile=None, model_name="amazon.titan-embed-g1-text-02"
)

In [101]:
chunk_overlap = 20
chunk_size = 512
service_context = ServiceContext.from_defaults(llm=llm, 
                                               embed_model=embed_model, 
                                               chunk_size=chunk_size,
                                               chunk_overlap=chunk_overlap,)
set_global_service_context(service_context)

### Load Datasets

In [102]:
!mkdir -p ./data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

As part of Amazon's culture, the CEO always includes a copy of the 1997 Letter to Shareholders with every new release. This will cause repetition, take longer to generate embeddings, and may skew your results. In the next section you will take the downloaded files, trim the 1997 letter (the last 3 pages of each file), and overwrite each file in place. Because the files are overwritten, running the cell a second time would remove three more pages.

In [103]:
import glob
from pypdf import PdfReader, PdfWriter

local_pdfs = glob.glob(data_root + '*.pdf')

for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    with open(local_pdf, 'wb') as new_file:
        new_file.seek(0)
        pdf_writer.write(new_file)
        new_file.truncate()


Now that you have clean PDFs to work with, you will load them, tag each document with an ID, and rely on a process called "chunking" to break each larger document into small pieces. These small pieces will allow you to generate embeddings without surpassing the input limit of the embedding model.

In this example the documents are split into chunks of 512 tokens with a 20-token overlap (the `chunk_size` and `chunk_overlap` values set in the service context above). The overlap lets neighbouring chunks share some context.
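
The effect of chunk size and overlap can be illustrated with a simplified sliding window. This is a sketch under stated assumptions: `chunk_tokens` is a hypothetical helper that works on a plain list of tokens, whereas the splitter used in this notebook also tries to respect sentence boundaries.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(
    tokens: Sequence[T], chunk_size: int, chunk_overlap: int
) -> List[List[T]]:
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        # An overlap equal to the chunk size would never advance the window.
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


toy_tokens = list(range(10))
toy_chunks = chunk_tokens(toy_tokens, chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

Each chunk starts with the last item of the previous one, which is how overlapping chunks carry a little context across the boundary.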

In [104]:
docs = []
for filename in filenames:
    doc = SimpleDirectoryReader(input_files=[f"data/{filename}"]).load_data()
    doc[0].doc_id = filename.replace(".pdf", "")
    docs.extend(doc)

### Build the Vector Store Index

We build a `VectorStoreIndex` over the loaded documents, using the service context configured above. We then parse the same documents into nodes and add them to an in-memory docstore.


In [105]:
index = VectorStoreIndex.from_documents(docs,
    service_context=service_context)

In [106]:
from llama_index import Document
from llama_index.node_parser import SentenceSplitter

# Use the overlap configured above (passing chunk_size as the overlap was a typo)
node_parser = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

nodes = node_parser.get_nodes_from_documents(
    docs, show_progress=True
)


In [107]:
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

## Retrieval

In [108]:
from llama_index.retrievers import VectorIndexRetriever
from llama_index.indices.query.schema import QueryBundle
import pandas as pd
from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

def get_retrieved_nodes(
    query_str, vector_top_k=10, reranker_top_n=3, with_reranker=False
):
    query_bundle = QueryBundle(query_str)
    # configure retriever
    retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=vector_top_k

    )
    retrieved_nodes = retriever.retrieve(query_bundle)

    if with_reranker:
        # configure reranker
        reranker = LLMRerank(
            choice_batch_size=5, 
            top_n=reranker_top_n, 
            service_context=service_context
        )
        retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)

    return retrieved_nodes


def pretty_print(df):
    return display(HTML(df.to_html().replace("\\n", "<br>")))


def visualize_retrieved_nodes(nodes) -> None:
    result_dicts = []
    for node in nodes:
        result_dict = {"Score": node.score, "Text": node.node.get_text()}
        result_dicts.append(result_dict)

    pretty_print(pd.DataFrame(result_dicts))

Now, we will showcase how to do a two-stage pass for retrieval. Use embedding-based retrieval with a high top-k value in order to maximize recall and get a large set of candidate items. Then, use LLM-based retrieval to dynamically select the nodes that are actually relevant to the query.

In [109]:
retrieved_nodes1 = get_retrieved_nodes(
    "How has AWS evolved?", vector_top_k=3, with_reranker=False
)

In [110]:
len(retrieved_nodes1)

3

In [111]:
visualize_retrieved_nodes(retrieved_nodes1)

Unnamed: 0,Score,Text
0,0.593798,"This type of iterative innovation is never finished and has periodic peaks in investment years, but leadsto better long-term customer experiences, customer loyalty, and returns for our shareholders. AWS : As we were defining AWS and working backwards on the services we thought customers wanted, we kept triggering one of the biggest tensions in product development—where to draw the line on functionality inV1. One early meeting in particular—for our core compute service called Elastic Compute Cloud (“EC2”)—was scheduled for an hour, and took three, as we animatedly debated whether we could launch a computeservice without an accompanying persistent block storage companion (a form of network attached storage)."
1,0.593097,"Compute is used for every bit of technology. That’s ahuge deal for customers. And, while Graviton2 has been a significant success thus far (48 of the top 50 AWSEC2 customers have already adopted it), the AWS Chips team was already learning from what customerssaid could be better, and announced Graviton3 this past December (offering a 25% improvement on top ofGraviton2’s relative gains). The list of what we’ve invented and delivered for customers in EC2 (and AWS ingeneral) is pretty mind-boggling, and this iterative approach to innovation has not only given customersmuch more functionality in AWS than they can find anywhere else (which is a significant differentiator), butalso allowed us to arrive at the much more game-changing offering that AWS is today. Devices : Our first foray into devices was the Kindle, released in 2007. It was not the most sophisticated industrial design (it was creamy white in color and the corners were uncomfortable for some people to hold),but revolutionary because it offered customers the ability to download any of over 90,000 books (nowmillions) in 60 seconds—and we got better and faster at building attractive designs. Shortly thereafter, welaunched a tablet, and then a phone (with the distinguishing feature of having front-facing cameras and agyroscope to give customers a dynamic perspective along with varied 3D experiences). The phone wasunsuccessful, and though we determined we were probably too late to this party and directed these resourceselsewhere, we hired some fantastic long-term builders and learned valuable lessons from this failure thathave served us well in devices like Echo and FireTV . When I think of the first Echo device—and what Alexa could do for customers at that point—it was noteworthy, yet so much less capable than what’s possible today."
2,0.552629,"Everybody agreed that having a persistent block store was important to a complete compute service; however, to have one ready would take an extra year. The question became could we offer customers auseful service where they could get meaningful value before we had all the features we thought they wanted?We decided that the initial launch of EC2 could be feature-poor if we also organized ourselves to listen tocustomers and iterate quickly. This approach works well if you indeed iterate quickly; but, is disastrous if youcan’t. We launched EC2 in 2006 with one instance size, in one data center, in one region of the world, withLinux operating system instances only (no Windows), without monitoring, load balancing, auto-scaling, oryes, persistent storage. EC2 was an initial success, but nowhere near the multi-billion-dollar service it’sbecome until we added the missing capabilities listed above, and then some. In the early days of AWS, people sometimes asked us why compute wouldn’t just be an undifferentiated commodity. But, there’s a lot more to compute than just a server. Customers want various flavors of compute(e.g. server configurations optimized for storage, memory, high-performance compute, graphics rendering,machine learning), multiple form factors (e.g. fixed instance sizes, portable containers, serverless functions),various sizes and optimizations of persistent storage, and a slew of networking capabilities. Then, there’sthe CPU chip that runs in your compute. For many years, the industry had used Intel or AMD x86 processors.We have important partnerships with these companies, but realized that if we wanted to push price andperformance further (as customers requested), we’d have to develop our own chips, too. Our first generalizedchip was Graviton, which we announced in 2018. This helped a subset of customer workloads run morecost-effectively than prior options. 
But, it wasn’t until 2020, after taking the learnings from Graviton and innovating on a new chip, that we had something remarkable with our Graviton2 chip, which provides up to40% better price-performance than the comparable latest generation x86 processors. Think about howmuch of an impact 40% improvement on compute is. Compute is used for every bit of technology."


In [112]:
retrieved_nodes1_withreranker = get_retrieved_nodes(
    "How has AWS evolved?",
    vector_top_k=3,
    reranker_top_n=1,
    with_reranker=True,
)

In [113]:
len(retrieved_nodes1_withreranker)

0

In [114]:
visualize_retrieved_nodes(retrieved_nodes1_withreranker)   
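
In this run the reranker returned no nodes (the length printed above is 0). The notebook does not establish the cause. One possible defensive pattern, which is not part of the original notebook, is to fall back to the first-stage ranking when the second stage comes back empty. The sketch below uses a hypothetical helper, `rerank_with_fallback`, and plain strings in place of retrieved nodes.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def rerank_with_fallback(
    candidates: List[T],
    rerank: Callable[[List[T]], List[T]],
    top_n: int,
) -> List[T]:
    """Return the reranked top_n, or the first-stage top_n if reranking is empty."""
    reranked = rerank(candidates)
    if reranked:
        return reranked[:top_n]
    # Reranker selected nothing: keep the embedding-based order instead.
    return candidates[:top_n]


first_stage = ["chunk-a", "chunk-b", "chunk-c"]

# Simulate a reranker that selects nothing.
kept_on_empty = rerank_with_fallback(first_stage, lambda nodes: [], top_n=1)
# Simulate a reranker that reverses the order.
kept_on_success = rerank_with_fallback(
    first_stage, lambda nodes: list(reversed(nodes)), top_n=2
)
print(kept_on_empty, kept_on_success)
```

Inside `get_retrieved_nodes`, the `rerank` argument could be a small wrapper around `reranker.postprocess_nodes(nodes, query_bundle)`. Whether a silent fallback is appropriate depends on the application; logging the empty result is a reasonable alternative.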

In [115]:
retrieved_nodes2 = get_retrieved_nodes(
    "Why is Amazon successful?", vector_top_k=2, with_reranker=False
)

In [116]:
len(retrieved_nodes2)

2

In [117]:
visualize_retrieved_nodes(retrieved_nodes2)

Unnamed: 0,Score,Text
0,0.573374,"through the pandemic the same way without the dedication and extraordinary efforts shown by our teams during this period, and I’m eternally grateful. It’s not normal for a company of any size to be able to respond to something as discontinuous and unpredictable as this pandemic turned out to be. What is it about Amazon that made it possible for us to doso? It’s because we weren’t starting from a standing start. We had been iterating on and remaking ourfulfillment capabilities for nearly two decades. In every business we pursue, we’re constantly experimentingand inventing. We’re divinely discontented with customer experiences, whether they’re our own or not. Webelieve these customer experiences can always be better, and we strive to make customers’ lives better andeasier every day. The beauty of this mission is that you never run out of runway; customers always want better,and our job is both to listen to their feedback and to imagine what else is possible and invent on theirbehalf. People often assume that the game-changing inventions they admire just pop out of somebody’s head, a light bulb goes off, a team executes to that idea, and presto—you have a new invention that’s a breakawaysuccess for a long time. That’s rarely, if ever, how it happens. One of the lesser known facts about innovativecompanies like Amazon is that they are relentlessly debating, re-defining, tinkering, iterating, andexperimenting to take the seed of a big idea and make it into something that resonates with customers andmeaningfully changes their customer experience over a long period of time. Let me give you some Amazon examples.Our Fulfillment Network : Going back to the pandemic, there’s no way we could have started working on our fulfillment network in March 2020 and satisfied anything close to what our customers needed. We’d beeninnovating in our fulfillment network for 20 years, constantly trying to shorten the time to get items to customers. 
In the early 2000s, it took us an average of 18 hours to get an item through our fulfillment centersand on the right truck for shipment. Now, it takes us two."
1,0.561009,"We’ll do consumers first. We offer low prices, vast selection, and fast delivery, but imagine we ignore all of that for the purpose of this estimate and value only one thing: we save customers time. Customers complete 28% of purchases on Amazon in three minutes or less, and half of all purchases are finished in less than 15 minutes. Compare that to the typical shopping trip to a physical store – driving, parking, searching store aisles, waiting in the checkout line, finding your car, and driving home. Research suggests the typical physical store trip takes about an hour. If you assume that a typical Amazon purchase takes 15 minutes and that it saves you a couple of trips to a physical store a week, that’s more than 75 hours a year saved. That’s important. We’re all busy in the early 21stcentury. So that we can get a dollar figure, let’s value the time savings at $10 per hour, which is conservative. Seventy- five hours multiplied by $10 an hour and subtracting the cost of Prime gives you value creation for each Prime member of about $630. We have 200 million Prime members, for a total in 2020 of $126 billion of value creation. AWS is challenging to estimate because each customer’s workload is so different, but we’ll do it anyway, acknowledging up front that the error bars are high. Direct cost improvements from operating in the cloud versus on premises vary, but a reasonable estimate is 30%. Across AWS’s entire 2020 revenue of $45 billion, that 30% would imply customer value creation of $19 billion (what would have cost them $64 billion on their own cost $45 billion from AWS). The difficult part of this estimation exercise is that the direct cost reduction is the smallest portion of the customer benefit of moving to the cloud. The bigger benefit is the increased speed of software development – something that can significantly improve the customer’s competitiveness and top line. 
We have no reasonable way of estimating that portion of customer value except to say that it’s almost certainly larger than the direct cost savings."


In [118]:
retrieved_nodes2_withreranker = get_retrieved_nodes(
    "Why is Amazon successful?",
    vector_top_k=2,
    reranker_top_n=1,
    with_reranker=True,
)

In [119]:
len(retrieved_nodes2_withreranker)

0

In [99]:
visualize_retrieved_nodes(retrieved_nodes2_withreranker)