# Llama Document Summary Index With Amazon Bedrock

This demo showcases the document summary index over IRS Forms.

The document summary index extracts a summary from each document and stores it alongside all nodes (text chunks) that belong to that document.

Retrieval can be performed through the LLM or through embeddings. We first select the documents relevant to the query based on their summaries. All nodes corresponding to the selected documents are then returned.
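
The two-step flow described above can be sketched in plain Python. This is a minimal illustration, not the llama_index implementation: `ToySummaryIndex`, `first_sentence_summary`, and `keyword_select` are hypothetical names, the summarizer and selector are placeholders for LLM or embedding calls, and the sample text is toy data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToySummaryIndex:
    """Hypothetical stand-in for a document summary index."""
    summaries: Dict[str, str] = field(default_factory=dict)        # doc_id -> summary
    doc_nodes: Dict[str, List[str]] = field(default_factory=dict)  # doc_id -> chunks

    def add_document(self, doc_id: str, chunks: List[str],
                     summarize: Callable[[List[str]], str]) -> None:
        # Store every chunk ("node") plus one summary for the whole document.
        self.doc_nodes[doc_id] = list(chunks)
        self.summaries[doc_id] = summarize(chunks)

    def retrieve(self, query: str,
                 select_docs: Callable[[str, Dict[str, str]], List[str]]) -> List[str]:
        # Step 1: choose documents by looking only at their summaries.
        chosen = select_docs(query, self.summaries)
        # Step 2: return all nodes of every chosen document.
        return [node for doc_id in chosen for node in self.doc_nodes[doc_id]]


def first_sentence_summary(chunks: List[str]) -> str:
    # Placeholder summarizer: the first sentence of the first chunk.
    return chunks[0].split(".")[0] + "."


def keyword_select(query: str, summaries: Dict[str, str]) -> List[str]:
    # Placeholder selector: keep documents whose summary shares a word with the query.
    words = set(query.lower().split())
    return [d for d, s in summaries.items() if words & set(s.lower().split())]


index = ToySummaryIndex()
index.add_document(
    "p1544",
    ["Businesses report cash payments over $10,000 on Form 8300. Details follow.",
     "More detail about related transactions."],
    first_sentence_summary,
)
index.add_document(
    "other",
    ["Toy text about an unrelated topic. Nothing on cash."],
    first_sentence_summary,
)

nodes = index.retrieve("Who must file Form 8300?", keyword_select)
print(len(nodes))  # both chunks of "p1544" come back, none from "other"
```

The selector is passed in as a function so that the same index can be queried with either an LLM-based or an embedding-based selection step.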

In [5]:
#!pip install ../dependencies/boto3-1.26.162-py3-none-any.whl
#!pip install ../dependencies/botocore-1.29.162-py3-none-any.whl
!pip install ../dependencies/boto3-1.28.21-py3-none-any.whl
!pip install ../dependencies/botocore-1.31.21-py3-none-any.whl
!pip install langchain --quiet
!pip install pypdf --quiet
!pip install llama-index --quiet
!pip install sentence_transformers --quiet

Processing /root/amazon-bedrock-rag/dependencies/boto3-1.28.21-py3-none-any.whl
Installing collected packages: boto3
  Attempting uninstall: boto3
    Found existing installation: boto3 1.26.162
    Uninstalling boto3-1.26.162:
      Successfully uninstalled boto3-1.26.162
Successfully installed boto3-1.28.21
Processing /root/amazon-bedrock-rag/dependencies/botocore-1.31.21-py3-none-any.whl
botocore is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.

In [6]:
import nest_asyncio

nest_asyncio.apply()

In [7]:
from llama_index import (
    SimpleDirectoryReader,
    LLMPredictor,
    ServiceContext,
    get_response_synthesizer,
    set_global_service_context
)
from llama_index.indices.document_summary import DocumentSummaryIndex


### Load Datasets

The demo targets IRS Forms p1212, p15, and p1544. As configured below, only p1544 is indexed.

In [8]:
documents = SimpleDirectoryReader(input_files=["data/p1212.pdf"]).load_data()
#print(documents)

In [9]:
#doc_titles = ["p1212", "p15", "p1544"]
doc_titles = ["p1544"]

In [10]:
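# Load specified IRS Forms
form_docs = []
for doc_title in doc_titles:
    docs = SimpleDirectoryReader(input_files=[f"data/{doc_title}.pdf"]).load_data()
    # The PDF reader returns one Document per page, so only the first page
    # gets the title as its id; the other pages keep auto-generated ids.
    docs[0].doc_id = doc_title
    form_docs.extend(docs)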

### Build Document Summary Index

The index can be built in two ways; this notebook runs only the first:
- default mode of building the document summary index
- customizing the summary query


In [None]:
import boto3
import botocore
print(f"boto3={boto3.__version__}, botocore={botocore.__version__}")
bedrock_client = boto3.client(service_name='bedrock',
                              region_name='us-east-1')
bedrock_client.list_foundation_models()

In [11]:
from llama_index import LangchainEmbedding
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.embeddings.bedrock import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

# LLM used for summarization and answer synthesis
llm = Bedrock(model_id="anthropic.claude-v2")

# Embeddings come from a local sentence-transformers model
embed_model = LangchainEmbedding(
  HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)
# Alternative: embed with Amazon Titan through Bedrock
# embed_model = LangchainEmbedding(
#   BedrockEmbeddings(model_id="amazon.titan-e1t-medium")
# )
#service_context = ServiceContext.from_defaults(llm=llm, embed_model="local", chunk_size=1024)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=1024)
set_global_service_context(service_context)

In [12]:
# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    form_docs,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
)

current doc id: p1544
current doc id: d9de4aea-9250-4ebe-9068-8d0ad4caafe4
current doc id: 0fbd6a81-460a-4a4a-9f18-932a026ee118
current doc id: c0160685-0c21-4257-935d-8852a4a0ec2a
current doc id: a31047d3-19d7-4ff0-bdcc-b8bf23ac2cef
current doc id: 2e25a96c-03bb-499c-997b-54c3169dc6cb


In [13]:
doc_summary_index.get_document_summary("p1544")

'This document provides information about reporting cash payments over $10,000 received in a trade or business on IRS Form 8300. The key points are:\n\n- Businesses that receive over $10,000 in cash from a single buyer in'

In [None]:
#doc_summary_index.storage_context.persist("index")

In [None]:
"""
from llama_index.indices.loading import load_index_from_storage
from llama_index import StorageContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="index")
doc_summary_index = load_index_from_storage(storage_context)
"""

### Perform Retrieval from Document Summary Index

We show how to execute queries at a high level. We also show how to perform retrieval at a lower level so that you can see which parameters are in play. Both LLM-based retrieval and embedding-based retrieval over the document summaries are covered.

#### High-level Querying

Note: this uses the default, LLM-based form of retrieval

In [14]:
query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)

In [17]:
response = query_engine.query("Who Must File Form 8300?")
print(response)


Based on the context information, the answer is:
Any person in a trade or business who receives more than $10,000 in cash in a single transaction or in related transactions must file Form 8300.

--------------------
page_label


#### LLM-based Retrieval
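
In LLM-based retrieval, the model is shown the document summaries and asked which ones are relevant to the query, typically with a relevance rating. The sketch below illustrates that idea under stated assumptions: the `Doc: <n>, Relevance: <score>` answer format, the prompt wording, and the helper names (`format_summaries`, `parse_choices`, `fake_llm`) are all hypothetical, and `fake_llm` returns a canned answer instead of calling a model. A rating scale like this would be consistent with a score such as the `10.0` printed for the first retrieved node in this section.

```python
import re
from typing import Dict, List, Tuple


def format_summaries(summaries: Dict[str, str]) -> Tuple[List[str], str]:
    # Number each summary so the LLM can refer to documents by index.
    doc_ids = list(summaries)
    blocks = [f"Document {i}:\n{summaries[d]}" for i, d in enumerate(doc_ids, start=1)]
    return doc_ids, "\n\n".join(blocks)


def parse_choices(answer: str, doc_ids: List[str]) -> List[Tuple[str, float]]:
    # Pull (doc_id, relevance) pairs out of lines like "Doc: 2, Relevance: 7".
    pattern = r"Doc:\s*(\d+),\s*Relevance:\s*(\d+(?:\.\d+)?)"
    pairs = []
    for num, score in re.findall(pattern, answer):
        idx = int(num) - 1
        if 0 <= idx < len(doc_ids):  # ignore indices the LLM made up
            pairs.append((doc_ids[idx], float(score)))
    # Most relevant document first.
    return sorted(pairs, key=lambda p: p[1], reverse=True)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same answer.
    return "Doc: 2, Relevance: 10\nDoc: 1, Relevance: 3"


summaries = {
    "doc_a": "Toy summary about an unrelated topic.",
    "doc_b": "Toy summary about reporting cash payments on Form 8300.",
}
doc_ids, summary_block = format_summaries(summaries)
prompt = (
    f"{summary_block}\n\n"
    "Question: Who must file Form 8300?\n"
    "Answer with lines of the form 'Doc: <n>, Relevance: <1-10>'."
)
ranked = parse_choices(fake_llm(prompt), doc_ids)
print(ranked)  # [('doc_b', 10.0), ('doc_a', 3.0)]
```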

In [18]:
from llama_index.indices.document_summary import DocumentSummaryIndexRetriever

In [19]:
retriever = DocumentSummaryIndexRetriever(
    doc_summary_index,
    # choice_select_prompt=choice_select_prompt,
    # choice_batch_size=choice_batch_size,
    # format_node_batch_fn=format_node_batch_fn,
    # parse_choice_select_answer_fn=parse_choice_select_answer_fn,
    # service_context=service_context
)

In [20]:
retrieved_nodes = retriever.retrieve("Who Must File Form 8300?")

In [21]:
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

10.0
her. You can tell that the cashier's check is the 
proceeds of a bank loan because it includes in-
structions to you to have a lien put on the car as 
security for the loan. For this reason, the cash-
ier's check is not treated as cash. You do not 
have to file Form 8300 for the transaction.
Exception for certain installment sales.  A 
cashier's check, bank draft, traveler's check, or 
money order is not treated as cash if it is re-
ceived in payment on a promissory note or an 
installment sales contract (including a lease that 
is considered a sale for federal tax purposes). 
However, this exception applies only if:
1.You use similar notes or contracts in other 
sales to ultimate consumers in the ordi-
nary course of your trade or business, and
2.The total payments for the sale that you 
receive on or before the 60th day after the 
sale are 50% or less of the purchase price.
Exception for certain down payment plans. 
A cashier's check, bank draft, traveler's check, 
or money orde

In [None]:
# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer()

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("Who Must File Form 8300?")
print(response)

#### Embedding-based Retrieval
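
In embedding-based retrieval, the query embedding is compared against the embedding of each document summary, and the nodes of the closest documents are returned. The following is a minimal sketch of that selection step, not the library's code: `embed` is a toy word-count embedding standing in for a real embedding model, and `cosine`, `top_k_docs`, and the parameter `k` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real setup would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_docs(query: str, summaries: Dict[str, str], k: int = 1) -> List[str]:
    # Rank documents by similarity between the query and each summary.
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(s)), doc_id) for doc_id, s in summaries.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


summaries = {
    "p1544": "reporting cash payments over 10000 on form 8300",
    "other": "toy summary about an unrelated topic",
}
doc_nodes = {
    "p1544": ["chunk 1 of p1544", "chunk 2 of p1544"],
    "other": ["chunk 1 of other"],
}

selected = top_k_docs("who must file form 8300", summaries, k=1)
# Every node of each selected document is returned, not just the best chunk.
nodes = [n for doc_id in selected for n in doc_nodes[doc_id]]
print(selected, len(nodes))
```

Because matching happens at the summary level, the number of nodes returned depends on how many chunks the selected documents contain rather than on `k` alone.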

In [22]:
from llama_index.indices.document_summary import DocumentSummaryIndexEmbeddingRetriever

In [23]:
retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k=1,
)

In [24]:
retrieved_nodes = retriever.retrieve("Who Must File Form 8300?")

In [25]:
len(retrieved_nodes)

3