# Long Document Summarization with Llama3 on Bedrock with LangChain


## Overview
This notebook demonstrates how to use the [Llama3 family of models](https://aws.amazon.com/about-aws/whats-new/2024/04/meta-llama-3-foundation-models-aws/) on Amazon Bedrock for abstractive document summarization. Although the Llama3 8B and 70B models are powerful, versatile language models suited to a wide range of natural language processing tasks, they have relatively small context windows compared to other models in their class. As a result, several challenges can arise when working with multiple large documents. The main challenge is that the input text might exceed the model's context length of 8k tokens, which can lead to incomplete or inaccurate responses because the model does not have access to all the relevant information in the document. Another challenge is that language models can hallucinate or generate factually incorrect responses when dealing with very long documents, because the model may lose track of the overall context or make incorrect inferences from partial information. Additionally, processing large documents can lead to out-of-memory errors, especially on resource-constrained systems or with large language models that have high memory requirements.

To address these challenges, this notebook will go through various summarization strategies that will use [LangChain](https://python.langchain.com/docs/get_started/introduction.html), a popular framework for developing applications powered by large language models (LLMs).



---
## Llama 3 Model Selection

Today, there are two Llama 3 models available on Amazon Bedrock:

### 1. Llama 3 8B

- **Description:** Ideal for limited computational power and resources, faster training times, and edge devices.
- **Max Tokens:** 2,048
- **Context Window:** 8,192
- **Languages:** English
- **Supported Use Cases:** Synthetic Text Generation, Text Classification, and Sentiment Analysis.

### 2. Llama 3 70B

- **Description:** Ideal for content creation, conversational AI, language understanding, research development, and enterprise applications. 
- **Max Tokens:** 2,048
- **Context Window:** 8,192
- **Languages:** English
- **Supported Use Cases:** Synthetic Text Generation and Accuracy, Text Classification and Nuance, Sentiment Analysis and Nuance Reasoning, Language Modeling, Dialogue Systems, and Code Generation.

### Performance and Cost Trade-offs

The table below compares the model performance on the Massive Multitask Language Understanding (MMLU) benchmark and their on-demand pricing on Amazon Bedrock.

| Model           | MMLU Score | Price per 1,000 Input Tokens | Price per 1,000 Output Tokens |
|-----------------|------------|------------------------------|-------------------------------|
| Llama 3 8B | 68.4%      | \$0.0004                   | \$0.0006                    |
| Llama 3 70B | 82.0%      | \$0.00265                   | \$0.0035                     |
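
To make the trade-off concrete, the following minimal sketch estimates the on-demand cost of a workload from the per-1,000-token prices in the table above. The model labels (`llama3-8b`, `llama3-70b`), the function name, and the workload size are illustrative assumptions, and prices may change, so treat the numbers as an example rather than a quote.

```python
# On-demand prices in USD per 1,000 tokens, copied from the table above.
# The dictionary keys are illustrative labels, not Bedrock model IDs.
PRICES_PER_1K = {
    "llama3-8b": {"input": 0.0004, "output": 0.0006},
    "llama3-70b": {"input": 0.00265, "output": 0.0035},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated on-demand cost in USD for a given token volume."""
    price = PRICES_PER_1K[model]
    input_cost = (input_tokens / 1000) * price["input"]
    output_cost = (output_tokens / 1000) * price["output"]
    return input_cost + output_cost


# Hypothetical workload: 1,000,000 input tokens and 100,000 output tokens.
for model in PRICES_PER_1K:
    cost = estimate_cost(model, 1_000_000, 100_000)
    print(f"{model}: {cost:.2f} USD")
```

For this hypothetical workload, the 70B model costs roughly six to seven times as much as the 8B model, which is the premium paid for the higher MMLU score.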

For more information, refer to the following links:

1. [Llama 3 8B Model Cards and Prompt Formats](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3)
2. [Amazon Bedrock Pricing Page](https://aws.amazon.com/bedrock/pricing/)


### Local Setup (Optional)

For a local server, follow these steps to run this Jupyter notebook:

1. **Configure AWS CLI**: Configure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with your AWS credentials. Run `aws configure` and enter your AWS Access Key ID, AWS Secret Access Key, AWS Region, and default output format.

2. **Install required libraries**: Install the necessary Python libraries for working with SageMaker, such as [sagemaker](https://github.com/aws/sagemaker-python-sdk/), [boto3](https://github.com/boto/boto3), and others. You can use a Python environment manager like [conda](https://docs.conda.io/en/latest/) or [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your Python packages in your preferred IDE (e.g. [Visual Studio Code](https://code.visualstudio.com/)).

3. **Create an IAM role for SageMaker**: Create an AWS Identity and Access Management (IAM) role that grants your user [SageMaker permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). 

By following these steps, you can set up a local Jupyter Notebook environment capable of deploying machine learning models on Amazon SageMaker using the appropriate IAM role for granting the necessary permissions.

## Requirements

---
1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose ml.t3.medium.
2. For Select Kernel, choose [conda_pytorch_p310](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

---

Before we start building the summarization workflow, we'll first install some libraries:

+ The AWS SDK for Python, [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html), to submit API calls to [Amazon Bedrock](https://aws.amazon.com/bedrock/).
+ [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction/), a framework that provides off-the-shelf components that make it easier to build applications with large language models. It is officially available for Python and JavaScript/TypeScript.

---

In [23]:
%%writefile requirements.txt
langchain==0.1.14
boto3==1.34.58
botocore==1.34.101
sqlalchemy==2.0.29
pypdf==4.1.0
langchain-aws==0.1.6
transformers

Overwriting requirements.txt


In [24]:
!pip install -U -r requirements.txt --quiet

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


#### Restart the kernel so that the updated packages installed above take effect

---


## Initiate the Bedrock Client

Import the necessary libraries, including the LangChain components used to select and call Bedrock models.

In [25]:
import boto3
from boto3 import client
from botocore.config import Config
import json
from langchain_aws import ChatBedrock
from langchain.chains import ConversationChain
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import PyPDFLoader
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
import numpy as np
from pypdf import PdfReader
from urllib.request import urlretrieve

In [26]:
config = Config(read_timeout=2000)

bedrock = boto3.client(service_name='bedrock-runtime', 
                       region_name='us-west-2',
                       config=config)

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> Ensure that you have access to the Llama3 model you wish to use through Bedrock in the selected region.
</div>

## Configure LangChain with Boto3

---


With LangChain, you can access Bedrock once you pass the boto3 session information to LangChain. Below, we also specify Meta Llama3 70b/8b in `model_id` and pass the Llama3 inference parameters as desired in `model_kwargs`.



### Supported parameters

The Llama models have the following inference parameters.


```
{
    "prompt": string,
    "temperature": float,
    "top_p": float,
    "max_gen_len": int
}
```
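
As an illustration of how these parameters fit together, the sketch below serializes them into a JSON request body. The function name, its default values, and the sample prompt are assumptions made for this example; the upper bound on `max_gen_len` is taken from the "Max Tokens" value listed for both models earlier in this notebook.

```python
import json

MAX_GEN_LEN_LIMIT = 2048  # "Max Tokens" value listed for both models above


def build_llama_request(prompt: str, temperature: float = 0.5,
                        top_p: float = 0.9, max_gen_len: int = 512) -> str:
    """Serialize the Llama inference parameters into a JSON request body."""
    if not 1 <= max_gen_len <= MAX_GEN_LEN_LIMIT:
        raise ValueError(f"max_gen_len must be between 1 and {MAX_GEN_LEN_LIMIT}")
    return json.dumps({
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_gen_len": max_gen_len,
    })


# Hypothetical prompt, used only to show the shape of the body.
body = build_llama_request("Summarize: AWS offers HIPAA-eligible services.")
print(body)
```

A body of this shape is what the Bedrock runtime client's `invoke_model(modelId=..., body=...)` call accepts for Llama models. When you go through LangChain, `ChatBedrock` assembles a comparable body from `model_kwargs`, so you normally do not need to build it by hand.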

In [27]:
#Set the desired Llama3 model
llama3_70b_instruct = "meta.llama3-70b-instruct-v1:0"
llama3_8b_instruct = "meta.llama3-8b-instruct-v1:0"

DEFAULT_MODEL = llama3_70b_instruct

In [28]:
llm = ChatBedrock(
    model_id=DEFAULT_MODEL,
    model_kwargs={
        "max_gen_len": 2048,  
        "temperature": 0.5,
        "top_p": 0.9
    },
    client=bedrock,
)

In [29]:
#Initialize conversation chain 
conversation = ConversationChain(
    # We set verbose to false to suppress the printing of logs during the execution of the conversation chain. This can be set to true when you're debugging your conversation chain or trying to understand how it's working under the hood.
    llm=llm, verbose=False, memory=ConversationBufferMemory() 
)

conversation.predict(input="Hi there!")

'Hello! It\'s lovely to chat with you! I\'ve been running on a Linux-based operating system, specifically Ubuntu 20.04, and I\'ve been trained on a massive dataset of text from the internet, which I like to call my "knowledge graph." It\'s a vast network of interconnected concepts, entities, and relationships that I can draw upon to answer your questions. By the way, did you know that I\'ve been designed to learn and improve over time, so the more conversations I have, the more accurate and informative I become? What\'s on your mind today?'

---

## Document Processing Step

In this example, we demonstrate summarization using an AWS whitepaper.

> The document is a [whitepaper](https://docs.aws.amazon.com/whitepapers/latest/architecting-hipaa-security-and-compliance-on-aws/architecting-hipaa-security-and-compliance-on-aws.pdf) on architecting HIPAA-compliant workloads on AWS.


Let's first download the file to build our document store.

In [30]:
!mkdir -p ./data

urls = [
    'https://docs.aws.amazon.com/whitepapers/latest/architecting-hipaa-security-and-compliance-on-aws/architecting-hipaa-security-and-compliance-on-aws.pdf'
]

filenames = [
    'AWS-security-whitepaper.pdf'
]

metadata = [
    dict(year=2023, source=filenames[0])
]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)



After downloading, we can load the document with LangChain's `PyPDFLoader` (which is built on the `pypdf` library) and split it into smaller chunks.

Note: For this use case, we create chunks of roughly 1,000 characters with an overlap of 100 characters using `RecursiveCharacterTextSplitter`.
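
To show what chunk size and overlap mean, here is a minimal pure-Python sketch of fixed-size chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, newlines, and spaces); the function name and the tiny sizes are assumptions chosen to keep the example readable.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks where neighbours share some characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the last.
    step = chunk_size - chunk_overlap
    # Stop early so the final chunk is not just a repeat of the previous overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Tiny example: chunks of 4 characters that overlap by 1 character.
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# -> ['abcd', 'defg', 'ghij']
```

The overlap means that text near a chunk boundary appears in both neighbouring chunks, which reduces the chance that a sentence is cut off from the context it needs.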

#### HIPAA Compliance document

In this section, we will load the HIPAA compliance document with `PyPDFLoader`, append document fragments with the metadata, and use LangChain's `RecursiveCharacterTextSplitter` to split the documents in `hipaa_documents` list into smaller text chunks using the `split_documents` method. 

In [31]:
#document 1 (HIPAA COMPLIANCE ON AWS)
hipaa_documents = []

# Load only the first file
hipaa_file = filenames[0]
hipaa_loader = PyPDFLoader(data_root + hipaa_file)
hipaa_document = hipaa_loader.load()

for idx, hipaa_document_fragment in enumerate(hipaa_document):
    hipaa_document_fragment.metadata = metadata[0] if metadata else {}
    hipaa_documents.append(hipaa_document_fragment)
    
#chunking
hipaa_doc_text_splitter = RecursiveCharacterTextSplitter(
    # Use a small chunk size for demonstration purposes.
    chunk_size=1000,
    chunk_overlap=100,
)

hipaa_docs = hipaa_doc_text_splitter.split_documents(hipaa_documents)
print(hipaa_docs[0])

#chunked doc count
hipaa_chunked_count = len(hipaa_docs)
print(
    f"\nNumber of documents chunked and created from the HIPAA Security document: {hipaa_chunked_count}"
)

page_content='AWS Whitepaper\nArchitecting for HIPAA Security and \nCompliance on Amazon Web Services\nCopyright © 2024 Amazon Web Services, Inc. and/or its aﬃliates. All rights reserved.' metadata={'year': 2023, 'source': 'AWS-security-whitepaper.pdf'}

Number of documents chunked and created from the HIPAA Security document: 237


---

## Summarizing Long Documents with LangChain

In the following sections, we will go over three different summarization techniques with LangChain:
    
 #####   1. Stuff
 #####   2. Map Reduce
 #####   3. Refine
 ---

### 1. Stuff with load_summarize_chain

Stuffing is the simplest method to pass data to a language model. It "stuffs" text into the prompt as context in a way that all of the relevant information can be processed by the model to get what you want. It is the default way to process documents with an LLM.

In LangChain, you can use `StuffDocumentsChain` through the `load_summarize_chain` function by setting `chain_type` to `stuff`.

In [32]:
stuff_summary_chain = load_summarize_chain(llm=llm, chain_type="stuff", verbose=False)

Next, let's take a look at the Prompt template used by the Stuff summarize chain:

In [33]:
stuff_summary_chain.llm_chain.prompt.template

'Write a detailed and complete summary of the following:\n\n\n"{text}"\n\n\nDETAILED SUMMARY:'

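LangChain's default prompt template for `llm_chain` is: 'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'. The output above shows the detailed template instead, most likely because the override below had already been applied in an earlier run of this kernel session.

The prompt can be changed by creating a new `PromptTemplate` with `from_template` and assigning its template to the chain. We do that below: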



In [34]:
stuff_prompt = PromptTemplate.from_template('Write a detailed and complete summary of the following:\n\n\n"{text}"\n\n\nDETAILED SUMMARY:')

In [35]:
stuff_summary_chain.llm_chain.prompt.template = stuff_prompt.template #set new prompt template

Now that we have set the new prompt template, let us first try generating a summary of the whitepaper.

In [36]:
try:
    stuff_hipaa_summary = stuff_summary_chain.invoke(hipaa_docs) 
except Exception as e:
    print(e)

Error raised by bedrock service: An error occurred (ValidationException) when calling the InvokeModel operation: This model's maximum context length is 8192 tokens. Please reduce the length of the prompt


#### Notes:
In the output of the cell above, an error is raised because the prompt far exceeds the model's maximum context length. Stuffing summarizes text by feeding the entire document to the LLM in a single call, so the document and the prompt together must fit within the context length, which is 8k tokens (8,192) for the Llama 3 models. If the document is longer than that, stuffing will not work. Even when a long document does fit, a single very large prompt can be slow and may not produce a good summary.
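
A rough pre-flight check can catch this situation before calling the model. The sketch below is an approximation under two stated assumptions: that English text averages about 4 characters per token, and that the tokens reserved for the generated output count against the same context window. The function names and the stand-in chunks are hypothetical; for accurate counts you would use the model's own tokenizer.

```python
import math

CONTEXT_WINDOW_TOKENS = 8192  # limit reported in the error message above
CHARS_PER_TOKEN = 4           # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(texts: list[str], reserved_for_output: int = 2048) -> bool:
    """Return True if the texts plus the reserved output tokens fit the window."""
    prompt_tokens = sum(estimate_tokens(t) for t in texts)
    return prompt_tokens + reserved_for_output <= CONTEXT_WINDOW_TOKENS


# Hypothetical stand-ins for document chunks of about 1,000 characters each.
small_batch = ["x" * 1000] * 20    # about 5,000 estimated tokens
whole_paper = ["x" * 1000] * 237   # about 59,000 estimated tokens

print(fits_in_context(small_batch))
print(fits_in_context(whole_paper))
```

With these assumptions the check passes for the small batch and fails for all 237 chunks, which matches the `ValidationException` seen above.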

Let's explore a couple of chunk-wise summarization techniques with [LangChain](https://python.langchain.com/docs/get_started/introduction.html) that work around the problem of large documents not fitting into the model's context window.

---

### 2. Map Reduce with load_summarize_chain

The `Map_Reduce` method is designed for documents that exceed the token limit of the language model. It divides the document into chunks, summarizes each chunk individually (map step), and then combines these summaries into a final summary (reduce step). Because the map calls are independent of one another, they can run in parallel, which makes this approach scalable to larger volumes of text and reduces overall processing time.

In LangChain, you can use `MapReduceDocumentsChain` through the `load_summarize_chain` function by setting `chain_type` to `map_reduce`.

In this architecture:

1. A large document (or a single file built by concatenating smaller ones) is loaded.
2. A LangChain utility splits it into multiple smaller chunks (chunking).
3. The model generates individual summaries for all document chunks in parallel.
4. These summaries are reduced to a condensed final summary.
---

![map-reduce](imgs/llama3mapreduce.png)
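
The flow in the diagram can be sketched in plain Python. This is an illustration of the idea rather than LangChain's implementation: `first_sentence` and `join_summaries` are hypothetical stand-ins for the LLM calls, and the thread pool is one assumed way to run the independent map calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def map_reduce_summarize(chunks: Sequence[str],
                         map_fn: Callable[[str], str],
                         reduce_fn: Callable[[Sequence[str]], str],
                         max_workers: int = 4) -> str:
    """Summarize each chunk independently, then combine the partial summaries."""
    # Map step: the chunks do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_summaries = list(pool.map(map_fn, chunks))
    # Reduce step: a single call over all partial summaries.
    return reduce_fn(partial_summaries)


def first_sentence(text: str) -> str:
    """Stand-in for a map-step LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def join_summaries(summaries: Sequence[str]) -> str:
    """Stand-in for the reduce-step LLM call: concatenate the summaries."""
    return " ".join(summaries)


chunks = [
    "CloudWatch streams events. It is near real time.",
    "Comprehend extracts insights. It uses NLP.",
]
print(map_reduce_summarize(chunks, first_sentence, join_summaries))
# -> CloudWatch streams events. Comprehend extracts insights.
```

Note that each map call sees only its own chunk, which is why information that spans chunk boundaries can be lost with this strategy.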

In [37]:
# Summarizes each document individually with an LLMChain (map step), then combines and iteratively reduces the mapped summaries into a single output (reduce step)
map_reduce_summary_chain = load_summarize_chain(llm=llm, chain_type="map_reduce", verbose=False)

The `ReduceDocumentsChain` takes the results of the map step and reduces them into a single output. It wraps a generic `CombineDocumentsChain` (like `StuffDocumentsChain`) but adds the ability to collapse documents before passing them to the `CombineDocumentsChain` if their cumulative size exceeds `token_max`.
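
The collapse behaviour can be illustrated with a small sketch. This is not LangChain's code: the word-based token counter, the grouping helper, and the `first_three_words` stand-in for an LLM call are all assumptions made to keep the example self-contained. The idea is to group consecutive summaries so that each group stays within the budget, summarize each group, and repeat until everything fits.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def group_by_budget(texts: list[str], token_max: int) -> list[list[str]]:
    """Pack consecutive texts into groups whose total size stays within token_max."""
    groups, current, used = [], [], 0
    for text in texts:
        size = count_tokens(text)
        if current and used + size > token_max:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += size
    if current:
        groups.append(current)
    return groups


def collapse(summaries: list[str], summarize: Callable[[str], str],
             token_max: int) -> list[str]:
    """Repeatedly summarize groups until the summaries fit within token_max."""
    while sum(count_tokens(s) for s in summaries) > token_max:
        groups = group_by_budget(summaries, token_max)
        if len(groups) == len(summaries):
            # No two summaries fit together, so grouping cannot shrink the list.
            raise ValueError("cannot collapse further within token_max")
        summaries = [summarize(" ".join(group)) for group in groups]
    return summaries


def first_three_words(text: str) -> str:
    """Stand-in for an LLM summarization call."""
    return " ".join(text.split()[:3])


print(collapse(["a b c d", "e f g h", "i j k l", "m n o p"],
               first_three_words, token_max=8))
# -> ['a b c', 'i j k']
```

Once the collapsed summaries fit within the budget, they can be passed to the final combine step in a single call.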

In [38]:
# Instantiation using from_template (recommended)
#sets the prompt template for the summaries generated for all the individual document chunks.
initial_map_prompt = PromptTemplate.from_template("""
                      Write a summary of this chunk of text that includes the main points and any important details.
                      {text}
                      """)

map_reduce_summary_chain.llm_chain.prompt.template = initial_map_prompt.template

#sets the prompt template for generating a cumulative summary of all the document chunks for reduce documents chain.
reduce_documents_prompt= PromptTemplate.from_template("""
                      Write a detailed summary of the following text delimited by triple backquotes.
                      Return your response in bullet points which covers the key points of the text.
                      ```{text}```
                      BULLET POINT SUMMARY:
                      """)

map_reduce_summary_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt.template = reduce_documents_prompt.template

Here, we summarize the **HIPAA and Security Compliance** document with `Map_Reduce`. Since this document is quite large, summarizing all of it can take a while.
To see how `Map_Reduce` works, let us generate a summary of a subset of the document chunks **(50 to 70)**.

In [39]:
#this cell might take 5-10 minutes to run
try:
    map_reduce_summary = map_reduce_summary_chain.invoke(hipaa_docs[50:71])  
except Exception as e:
    print(e)

In [40]:
print(map_reduce_summary['output_text'].strip())

Here is a detailed summary of the text in bullet points, covering the key points:

**Amazon CloudWatch**

* Delivers a near-real-time stream of system events that describe changes in AWS resources
* Ensure PHI does not flow into CloudWatch Events, and configure AWS resources emitting CloudWatch events according to the Guidance
* Configure Amazon CloudWatch Events to register as an AWS API call in CloudTrail

**Amazon CloudWatch Logs**

* Allows customers to monitor, store, and access log files from various sources
* Log data can be retrieved from CloudWatch Logs
* Log data is encrypted while in transit and at rest, ensuring its security
* No re-encryption is needed for PHI emitted by other services and delivered to CloudWatch Logs

**Amazon Comprehend**

* Uses natural language processing to extract insights from documents
* Processes text files in UTF-8 format
* Develops insights by recognizing entities, key phrases, language, sentiments, and other common elements in a document
* Can 

#### Notes:
With `Map_Reduce`, the model can summarize a large document: splitting the work across chunks and processing them in parallel overcomes the context limit of the `Stuff` method.
However, it requires multiple calls to the model and can lose context between the individual chunk summaries. To address this, let us try another method that performs chunk-wise summarization.

---

### 3. Refine with load_summarize_chain

The `Refine` method summarizes the input incrementally: it loops over the input documents and updates its answer at each step, refining the running summary with each new piece of context. It generates a summary for the first chunk, combines that summary with the second chunk, generates an updated summary, and continues until a final summary is reached. `Refine` is conceptually simpler than `Map_Reduce` and is also suitable for large documents.

In this architecture:

1. A large document (or a single file built by concatenating smaller ones) is loaded.
2. A LangChain utility splits it into multiple smaller chunks (chunking).
3. The first chunk is sent to the model, which returns the corresponding summary.
4. LangChain takes the next chunk, appends it to the returned summary, and sends the combined text to the model as a new request; the process repeats until all chunks are processed.
5. In the end, you have a final summary that has been incrementally updated using all the document chunks.

---

![refine](imgs/llamarefine.png)
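
The loop in the diagram can be sketched in a few lines of plain Python. As with any sketch, this illustrates the idea rather than LangChain's implementation: `first_sentence` and `append_first_sentence` are hypothetical stand-ins for the initial and refine LLM calls.

```python
from typing import Callable, Sequence


def refine_summarize(chunks: Sequence[str],
                     initial_fn: Callable[[str], str],
                     refine_fn: Callable[[str, str], str]) -> str:
    """Build a summary incrementally, one chunk at a time."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # The first chunk produces the initial summary.
    summary = initial_fn(chunks[0])
    # Every later call depends on the previous summary, so the loop is sequential.
    for chunk in chunks[1:]:
        summary = refine_fn(summary, chunk)
    return summary


def first_sentence(text: str) -> str:
    """Stand-in for the initial LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def append_first_sentence(existing_summary: str, new_chunk: str) -> str:
    """Stand-in for the refine LLM call: extend the summary with the new chunk."""
    return existing_summary + " " + first_sentence(new_chunk)


chunks = [
    "CloudWatch streams events. It is near real time.",
    "Comprehend extracts insights. It uses NLP.",
]
print(refine_summarize(chunks, first_sentence, append_first_sentence))
# -> CloudWatch streams events. Comprehend extracts insights.
```

Because each iteration needs the summary from the previous one, the calls form a chain and cannot be issued in parallel.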



In [41]:
# Run an initial prompt on a small chunk of data to generate a summary. Then, for each subsequent document, the output from the previous document is passed in along with the new document, and the LLM is asked to refine the output based on the new document.
refine_summary_chain = load_summarize_chain(llm=llm, chain_type="refine", verbose=False)
refine_summary_chain_french = load_summarize_chain(llm=llm, chain_type="refine", verbose=False) #refine summary chain for summarization in french

Here, we summarize the **HIPAA and Security Compliance** document with `Refine`. Since this document is quite large, summarizing all of it can take a while.
To see how `Refine` works, let us generate a summary of a subset of the document chunks **(50 to 70)**.

In [42]:
#initial llm chain prompt template
initial_refine_prompt = PromptTemplate.from_template("""
                      Write a summary of this chunk of text that includes the main points and any important details.
                      {text}
                      """)

refine_summary_chain.initial_llm_chain.prompt.template = initial_refine_prompt.template

#refine llm chain prompt template
refine_documents_prompt= PromptTemplate.from_template("Your job is to produce a final summary.\nWe have provided an existing summary up to a certain point: {existing_answer}\nWe have the opportunity to refine the existing summary (only if needed) with some more context below.\n------------\n{text}\n------------\nGiven the new context, refine the original summary.\nIf the context isn't useful, return the original summary.")

refine_summary_chain.refine_llm_chain.prompt.template = refine_documents_prompt.template

In [43]:
#this cell might take 5-10 minutes to run
try:
    refine_summary = refine_summary_chain.invoke(hipaa_docs[50:71])
except Exception as e:
    print(e)

In [44]:
print(refine_summary['output_text'].strip())

Based on the additional context provided, I have refined the original summary to incorporate the new information. Here is the revised summary:

The text discusses the use of Amazon CloudWatch and Amazon Comprehend in relation to Protected Health Information (PHI). The main points are:

* Amazon CloudWatch Events delivers a near-real-time stream of system events that describe changes in AWS resources, and customers should ensure that PHI does not flow into CloudWatch Events.
* Customers should configure AWS resources emitting CloudWatch events that store, process, or transmit PHI in accordance with the Guidance.
* CloudWatch Events can be configured to register as an AWS API call in CloudTrail, and customers can find more information on how to do this in the "Creating a CloudWatch Events Rule That Triggers on an AWS API Call Using AWS CloudTrail" section.
* Amazon CloudWatch Logs can be used to monitor, store, and access log files from various AWS services, including Amazon EC2 instance

#### Notes:
`Refine` has the potential to incorporate more relevant context than `Map_Reduce`, which can result in a more comprehensive and accurate summary. However, it comes with a trade-off: `Refine` requires far more LLM calls than `Stuff` (one per chunk rather than one in total), and because each call uses the summary produced by the previous one, the calls are not independent. Unlike the map step of `Map_Reduce`, they cannot be parallelized, which can lead to longer processing times. Another consideration is that `Refine` may exhibit recency bias: because chunks are processed in order, the most recent chunks can carry more weight in the final summary.

---
## Conclusion

In this notebook, we looked at three different summarization techniques using LangChain: **Stuff**, **Map_Reduce**, and **Refine**. Each of these methods has its own distinct advantages and use cases.

- ***Stuff*** is straightforward and is the fastest of the three methods, since it makes a single call to the LLM. However, it requires the entire document to fit within the model's context window, so, as we saw with the HIPAA Compliance document, it does not scale to large volumes of text.

- ***Map_Reduce*** deals with the issue of the context window length while being able to parallelize generation of summaries for individual chunks, thereby speeding up the model's response while being able to process long documents. An issue with Map_Reduce is that since this is not a recursive process, we lose context between chunks during this process.

- ***Refine*** addresses the loss of context between chunks. It summarizes incrementally, updating a running summary with each chunk so that context is retained between them. While this method can produce the most comprehensive summary of the three, its calls to the LLM cannot be parallelized, which can result in longer processing times. Additionally, more recent document chunks tend to carry more weight because of the order in which they are processed.
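
The latency difference between the three strategies comes down to how many calls must happen one after another. The sketch below tallies this under simplifying assumptions: `Map_Reduce` needs exactly one reduce call (no intermediate collapse step), and up to `max_parallel` map calls can run at the same time. The function name and both parameters are hypothetical.

```python
import math


def call_profile(num_chunks: int, max_parallel: int = 1) -> dict:
    """Count total LLM calls and sequential rounds for each strategy."""
    map_rounds = math.ceil(num_chunks / max_parallel)
    return {
        # One call over the whole document.
        "stuff": {"calls": 1, "sequential_rounds": 1},
        # One call per chunk plus one reduce call; map calls can overlap.
        "map_reduce": {"calls": num_chunks + 1, "sequential_rounds": map_rounds + 1},
        # One call per chunk, each waiting for the previous summary.
        "refine": {"calls": num_chunks, "sequential_rounds": num_chunks},
    }


# The 21 chunks used in this notebook (hipaa_docs[50:71]), fully parallel map step.
for strategy, profile in call_profile(21, max_parallel=21).items():
    print(strategy, profile)
```

For 21 chunks the total number of calls is similar for `Map_Reduce` and `Refine`, but `Map_Reduce` can finish in 2 sequential rounds when the map step is fully parallel, while `Refine` always needs 21.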


---
## Distributors
- Amazon Web Services
- Meta

