# Long Document Summarization with Llama 3.1 on Bedrock with LangChain


## Overview
This notebook is meant to demonstrate how you can use the [Llama 3.1 family of models](https://aws.amazon.com/about-aws/whats-new/2024/07/meta-llama-3-1-405b-generally-available-amazon-bedrock/) on Amazon Bedrock for document summarization tasks. 

All Llama 3.1 models demonstrate significant improvements over previous versions. The models support a 128K context length and exhibit improved reasoning for multilingual dialogue use cases in eight languages. The models access more information from lengthy text to make more informed decisions and leverage richer contextual data to generate more subtle and refined responses. According to Meta, Llama 3.1 405B is one of the best and largest publicly available foundation models and is well suited for synthetic data generation and model distillation. Llama 3.1 models also provide state-of-the-art capabilities in general knowledge, math, tool use, and multilingual translation.

This notebook will go through various summarization strategies that will use [LangChain](https://python.langchain.com/docs/get_started/introduction.html), a popular framework for developing applications powered by large language models (LLMs). It will show improvements that Llama 3.1 offers as compared with Llama 3.

<b>Note (for reference):</b> Long document summarization using Llama 3 on Bedrock with LangChain is discussed [here](https://github.com/aws-samples/Meta-Llama-on-AWS/blob/main/long-text-summarization/Llama3-Long-Document-summarization-LangChain.ipynb).


---
## Llama 3.1 Model Selection

There are [three](https://aws.amazon.com/about-aws/whats-new/2024/07/meta-llama-3-1-405b-generally-available-amazon-bedrock/) Llama 3.1 models available on Amazon Bedrock:

### 1. Llama 3.1 8B

- **Description:** Ideal for limited computational power and resources, faster training times, and edge devices. The model excels at text summarization, text classification, sentiment analysis, and language translation.
- **Context Window:** 128k
- **Languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- **Supported Use Cases:** Synthetic Text Generation, Text Classification, and Sentiment Analysis.

### 2. Llama 3.1 70B

- **Description:** Ideal for content creation, conversational AI, language understanding, research development, and enterprise applications. The model excels at text summarization and accuracy, text classification and nuance, sentiment analysis and nuance reasoning, language modeling, dialogue systems, code generation, and following instructions.
- **Context Window:** 128k
- **Languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- **Supported Use Cases:** Synthetic Text Generation and Accuracy, Text Classification and Nuance, Sentiment Analysis and Nuance Reasoning, Language Modeling, Dialogue Systems, and Code Generation.

### 3. Llama 3.1 405B

- **Description:** Ideal for enterprise level applications, research and development, synthetic data generation, and model distillation. The model excels at general knowledge, long-form text generation, machine translation, enhanced contextual understanding, advanced reasoning and decision making, better handling of ambiguity and uncertainty, increased creativity and diversity, steerability, math, tool use, multilingual translation, and coding.
- **Context Window:** 128k
- **Languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- **Supported Use Cases:** Synthetic Text Generation and Accuracy, Text Classification and Nuance, Sentiment Analysis and Nuance Reasoning, Language Modeling, Dialogue Systems, and Code Generation.


### Performance and Cost Trade-offs

The table below summarizes the model performance on the Massive Multitask Language Understanding ([MMLU](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md#instruction-tuned-models)) benchmark and their on-demand pricing on Amazon Bedrock.

| Model           | MMLU Score | Price per 1,000 Input Tokens | Price per 1,000 Output Tokens |
|-----------------|------------|------------------------------|-------------------------------|
| Llama 3.1 8B | 69.4%      | \$0.0003                   | \$0.0006                    |
| Llama 3.1 70B | 83.6%      | \$0.00265                   | \$0.0035                     |
| Llama 3.1 405B | 87.3%      | \$0.00532                   | \$0.016                     |

For more information, refer to the following links:

1. [Llama 3.1 Model Cards and Prompt Formats](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1)
2. [Amazon Bedrock Pricing Page](https://aws.amazon.com/bedrock/pricing/)
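
As a rough illustration of the trade-off, the per-1,000-token prices in the table can be turned into a per-request estimate. This is a minimal sketch: the helper name `estimate_cost` and the `PRICE_PER_1K_TOKENS` dictionary are hypothetical, and the prices are simply copied from the table above, so check the pricing page for current values.

```python
# Prices in USD per 1,000 tokens, copied from the table above.
# Assumption: these may be out of date; verify on the Bedrock pricing page.
PRICE_PER_1K_TOKENS = {
    "meta.llama3-1-8b-instruct-v1:0": {"input": 0.0003, "output": 0.0006},
    "meta.llama3-1-70b-instruct-v1:0": {"input": 0.00265, "output": 0.0035},
    "meta.llama3-1-405b-instruct-v1:0": {"input": 0.00532, "output": 0.016},
}


def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Return the approximate on-demand cost in USD for a single request."""
    price = PRICE_PER_1K_TOKENS[model_id]
    input_cost = (input_tokens / 1000) * price["input"]
    output_cost = (output_tokens / 1000) * price["output"]
    return input_cost + output_cost


# Compare the three models for the same hypothetical request size.
for model_id in PRICE_PER_1K_TOKENS:
    print(model_id, round(estimate_cost(model_id, 10_000, 1_000), 5))
```

The token counts passed in can come from the `prompt_token_count` and `generation_token_count` fields that the native Bedrock response returns later in this notebook.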


### Local Setup (Optional)

For a local server, follow these steps to run this Jupyter notebook:

1. **Configure AWS CLI**: Configure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with your AWS credentials. Run `aws configure` and enter your AWS Access Key ID, AWS Secret Access Key, AWS Region, and default output format.

2. **Install required libraries**: Install the necessary Python libraries for working with SageMaker, such as [sagemaker](https://github.com/aws/sagemaker-python-sdk/), [boto3](https://github.com/boto/boto3), and others. You can use a Python environment manager like [conda](https://docs.conda.io/en/latest/) or [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your Python packages in your preferred IDE (e.g. [Visual Studio Code](https://code.visualstudio.com/)).

3. **Create an IAM role for SageMaker**: Create an AWS Identity and Access Management (IAM) role that grants your user [SageMaker permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). 

By following these steps, you can set up a local Jupyter Notebook environment capable of deploying machine learning models on Amazon SageMaker using the appropriate IAM role for granting the necessary permissions.

## Requirements

---
1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html). <I>We are using Notebooks on SageMaker as it provides the kernel we need to run these examples</I>.
2. For Notebook Instance type, choose ml.t3.medium.
3. For Select Kernel, choose [conda_pytorch_p310](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).

---

Before we start building the summarization workflow, we'll first install some libraries:

1. The AWS SDK for Python, [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html), to submit API calls to [Amazon Bedrock](https://aws.amazon.com/bedrock/).
2. [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction/), a framework that provides off-the-shelf components to make it easier to build applications with large language models. Official implementations are available for Python and JavaScript/TypeScript; this notebook uses the Python packages.

---

In [None]:
%%writefile requirements.txt
langchain==0.1.14
boto3
botocore==1.34.142
sqlalchemy==2.0.29
pypdf==4.1.0
langchain-aws==0.1.6
transformers
rich

In [None]:
!pip install -U -r requirements.txt --quiet

<div class="alert alert-block alert-info"> 
<b>NOTE:</b> Restart the kernel so that the updated packages installed above take effect.
</div>

In [None]:
# Restart the kernel
import os
os._exit(00)

---


## Initiate the Bedrock Client

Import the necessary libraries, along with the LangChain classes used for Bedrock model selection.

In [None]:
import json
import rich
import numpy as np
import logging

from urllib.request import urlretrieve
from pypdf import PdfReader

import boto3
from boto3 import client
from botocore.config import Config
from botocore.exceptions import ClientError

from langchain_aws import ChatBedrock
from langchain.chains import ConversationChain
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import PyPDFLoader
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

In [None]:
# Set config object and create bedrock runtime client.
config = Config(read_timeout=2000)

DEFAULT_REGION = "us-west-2"

bedrock = boto3.client(service_name='bedrock-runtime',
                       region_name=DEFAULT_REGION,
                       config=config)

<div class="alert alert-block alert-warning"> 
<b>NOTE:</b> Ensure that you have access to the Llama 3.1 model you wish to use through Bedrock in the selected region. At the time of writing, Llama 3.1 models are available only in <code>us-west-2</code>.
</div>

In [None]:
# Set the desired Llama 3.1 model ID
llama3_8b_instruct = "meta.llama3-8b-instruct-v1:0"
llama3_70b_instruct = "meta.llama3-70b-instruct-v1:0"

llama3_1_8b_instruct = "meta.llama3-1-8b-instruct-v1:0"
llama3_1_70b_instruct = "meta.llama3-1-70b-instruct-v1:0"
llama3_1_405b_instruct = "meta.llama3-1-405b-instruct-v1:0"

DEFAULT_MODEL = llama3_1_70b_instruct

## Configure LangChain with Boto3

---


With LangChain, you can access Bedrock by passing the boto3 Bedrock runtime client to LangChain. Below, we specify the Meta Llama 3.1 model (405B, 70B, or 8B) in `model_id` and pass the desired Llama inference parameters in `model_kwargs`.



### Supported parameters

The Llama models have the following inference parameters.


```
{
    "prompt": string,
    "temperature": float,
    "top_p": float,
    "max_gen_len": int
}
```
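
To make the payload shape concrete, here is a minimal sketch that builds the JSON body from these four parameters. The helper name `build_native_request` is hypothetical, and the range checks (temperature and top_p between 0 and 1, `max_gen_len` between 1 and 2048) are assumptions that should be confirmed against the Bedrock documentation for Meta Llama models.

```python
import json


def build_native_request(prompt: str,
                         temperature: float = 0.5,
                         top_p: float = 0.9,
                         max_gen_len: int = 2048) -> str:
    """Serialize the Llama inference parameters into a JSON request body."""
    # Assumed limits; verify against the Bedrock model parameter reference.
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0 and 1")
    if not 1 <= max_gen_len <= 2048:
        raise ValueError("max_gen_len must be between 1 and 2048")

    return json.dumps({
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_gen_len": max_gen_len,
    })


print(build_native_request("Summarize the following text: ..."))
```

The resulting string is what would be passed as `body` to `invoke_model`; with LangChain, the same three sampling parameters go into `model_kwargs` instead.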

In [None]:
# For this notebook, we will look at the 70b model in Llama 3 and Llama 3.1 and 
# review the results. We will use this method to return the ChatBedrock object
# corresponding to the appropriate LLM by passing the model Id as string.
def GetLLMChatBedrockObject(modelID=DEFAULT_MODEL):
    # Instantiate the LangChain ChatBedrock object. This will allow you to use 
    # LangChain with Chat models on Amazon Bedrock
    llm = ChatBedrock(
        model_id=modelID,
        model_kwargs={
            "max_gen_len": 2048,
            "temperature": 0.5,
            "top_p": 0.9
        },
        client=bedrock,
    )

    return llm

In [None]:
# Get the ChatBedrock object for the intended model
# llm = GetLLMChatBedrockObject(llama3_70b_instruct)
llm = GetLLMChatBedrockObject(llama3_1_70b_instruct)

# Initialize conversation chain
conversation = ConversationChain(
    # We set verbose to false to suppress the printing of logs during the
    # execution of the conversation chain. This can be set to true when
    # you're debugging your conversation chain or trying to understand how
    # it is working under the hood.
    llm=llm, verbose=False, memory=ConversationBufferMemory()
)

In [None]:
%%time

#  Generate some text with a prompt.
prediction = conversation.predict(input="Hi there! How are you doing? Please provide me a trivia question and the answer.")

# Having multilingual support built-in means that you can use Llama 3.1 to write prompts
# and receive responses directly in those languages.
# Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
# Few examples below, uncomment a line and run the next cell to see the results.
# prediction = conversation.predict(input="Hallo! Wie geht es dir? Bitte stellen Sie mir eine Quizfrage und die Antwort.")  # German
# prediction = conversation.predict(input="Salut! Comment allez-vous? Veuillez me fournir une question triviale et la réponse.")  # French
# prediction = conversation.predict(input="Ciao! Come va? Per favore forniscimi una domanda banale e la risposta.")  # Italian
# prediction = conversation.predict(input="Olá! Como vai? Por favor, forneça-me uma pergunta trivial e a resposta.")  # Portuguese
# prediction = conversation.predict(input="नमस्ते! आप कैसे हैं? कृपया मुझे एक सामान्य प्रश्न और उत्तर प्रदान करें।")  # Hindi
# prediction = conversation.predict(input="¡Hola! ¿Cómo estás? Por favor dame una pregunta de trivia y la respuesta.")  # Spanish
# prediction = conversation.predict(input="สวัสดี! เป็นอย่างไรบ้าง โปรดให้คำถามและคำตอบเรื่องไม่สำคัญแก่ฉัน")  # Thai

print(prediction)

## Document Processing Step

To demonstrate summarization, we will be using an [AWS whitepaper](https://docs.aws.amazon.com/whitepapers/latest/architecting-hipaa-security-and-compliance-on-aws/architecting-hipaa-security-and-compliance-on-aws.pdf) on architecting HIPAA-compliant workloads on AWS.

In [None]:
# Let's first download the file to build our document store.
!mkdir -p ./data

urls = [
    'https://docs.aws.amazon.com/whitepapers/latest/architecting-hipaa-security-and-compliance-on-aws/architecting-hipaa-security-and-compliance-on-aws.pdf'
]

filenames = [
    'AWS-security-whitepaper.pdf'
]

metadata = [
    dict(year=2024, source=filenames[0])
]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

After downloading, we can load the document with the `PyPDFLoader` available in LangChain and split it into smaller chunks.

Note: For this use case, we create chunks of roughly 1024 characters with an overlap of 100 characters using `RecursiveCharacterTextSplitter`.
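
To see what chunk size and overlap mean in practice, here is a deliberately simplified sketch using a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators such as paragraph and line breaks); the helper name `chunk_text` is hypothetical, and the defaults mirror the `chunk_size=1024` and `chunk_overlap=100` values used in the code cell below.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 100) -> list:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means the tail of each chunk is repeated at the head of the next one, which helps preserve sentences that would otherwise be cut at a chunk boundary.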

#### HIPAA Compliance document

In this section, we will load the HIPAA compliance document with `PyPDFLoader`, attach the metadata to each document fragment, and use the `split_documents` method of LangChain's `RecursiveCharacterTextSplitter` to split the documents in the `hipaa_documents` list into smaller text chunks.

In [None]:
%%time
# Document 1 (HIPAA COMPLIANCE ON AWS)
hipaa_documents = []

# Load only the first file
hipaa_file = filenames[0]
hipaa_loader = PyPDFLoader(data_root + hipaa_file)

# Here we load a PDF using pypdf into array of documents, where each document contains the page content and metadata with page number.
# To access a subset of pages use something like: hipaa_document[70:84], for no of pages, use: len(hipaa_document)
hipaa_document = hipaa_loader.load()

for idx, hipaa_document_fragment in enumerate(hipaa_document):
    hipaa_document_fragment.metadata = metadata[0] if metadata else {}
    hipaa_documents.append(hipaa_document_fragment)
    
# Chunking
hipaa_doc_text_splitter = RecursiveCharacterTextSplitter(
    # Set a  small chunk size, just to show.
    chunk_size=1024,
    chunk_overlap=100,
)

hipaa_docs = hipaa_doc_text_splitter.split_documents(hipaa_documents)
print(hipaa_docs[0])

# Chunked doc count
hipaa_chunked_count = len(hipaa_docs)
print(
    f"\nTotal number of pages: {len(hipaa_documents)}\nNumber of documents chunked and created from the HIPAA Security document: {hipaa_chunked_count}"
)

### Using the native invoke method with Llama 3.1's larger context window

Because Llama 3.1 has a larger context window, you can call the native invoke method directly without using LangChain. However, LangChain offers a few techniques that give you more control over the summarization task at hand.
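
The native call expects the prompt to be wrapped in Llama 3.1's instruction format. As a minimal sketch, the wrapper below assembles that format from a user message and an optional system message; the helper name `build_llama3_prompt` is hypothetical, and the token layout follows Meta's published prompt format as best understood here, so compare it against the model card before relying on it.

```python
from typing import Optional


def build_llama3_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Wrap a message in the Llama 3 / 3.1 instruct chat format."""
    parts = ["<|begin_of_text|>"]
    if system_message:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>"
        )
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>"
    )
    # Leave the assistant header open so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(build_llama3_prompt("Summarize this text in three bullet points: ..."))
```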

In [101]:
# Taking a piece of text which is larger than the context window size for Llama 3
text = hipaa_docs[50:97]

prompt = f"""Write a detailed summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""

In [103]:
%%time
# Embed the prompt in Llama 3.1's instruction format.
# More information: 
# https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1
# https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md

# Switch between models here.
# DEFAULTMODEL = llama3_70b_instruct  #  Will throw an error
DEFAULTMODEL = llama3_1_70b_instruct  #  Takes about 2-3 mins to run.

formatted_prompt = f"""
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

# Format the request payload using the model's native structure.
native_request = {
    "prompt": formatted_prompt,
    "temperature": 0.5,
    "top_p": 0.9,
    "max_gen_len": 2048
}

# Convert the native request to JSON.
request = json.dumps(native_request)

inputTokens = outputTokens = 0
try:
    # Invoke the model with the request.
    response = bedrock.invoke_model(modelId=DEFAULTMODEL, body=request)

    # Decode the response body.
    model_response = json.loads(response["body"].read())

    # Extract and print the response text.
    response_text = model_response["generation"]
    
    inputTokens = model_response["prompt_token_count"]
    outputTokens = model_response["generation_token_count"]
        
    print(response_text)
    
    # Convert this to a lookup based on the model
    # print(f"Approx cost of this operation = {getCost(DEFAULT_REGION, DEFAULT_MODEL, inputTokens, outputTokens)}")

except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{DEFAULTMODEL}'. Reason: {e}")
    exit(1)

Here is a bullet point summary of the text:

* Amazon CloudWatch Events delivers a near-real-time stream of system events that describe changes in AWS resources.
* Customers should ensure that PHI does not flow into CloudWatch Events and that any AWS resource emitting a CloudWatch event that is storing, processing, or transmitting PHI is configured in accordance with the Guidance.
* Amazon CloudWatch Logs can be used to monitor, store, and access log files from Amazon EC2 instances, AWS CloudTrail, Amazon Route 53, and other sources.
* Log data is encrypted while in transit and at rest, and customers do not need to re-encrypt PHI emitted by any other service and delivered to CloudWatch Logs.
* Amazon Comprehend uses natural language processing to extract insights about the content of documents and can be used with data containing PHI.
* Amazon Comprehend processes any text file in UTF-8 format and develops insights by recognizing entities, key phrases, language, sentiments, and other c

In [104]:
# Util method to print output.
def print_json(data):
    rich.print_json(json.dumps(data))

# Print the request and response objects
print_json(native_request)
print_json(model_response)

As you can see, for the data provided, the token count (```prompt_token_count``` **+** ```generation_token_count```) is about **10,711 tokens**, which is larger than the Llama 3 context window (**8192 tokens**). However, thanks to the larger context window of Llama 3.1, you can use the native invoke method directly to get summary results.
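
The comparison above can be expressed as a small check on the response fields. This is a sketch only: the helper names are hypothetical, the example response uses a made-up split of the 10,711 total, and `128_000` is an assumed reading of the "128K" figure quoted for Llama 3.1.

```python
LLAMA_3_CONTEXT_WINDOW = 8_192
LLAMA_3_1_CONTEXT_WINDOW = 128_000  # Assumption: "128K" interpreted as 128,000 tokens.


def total_tokens(model_response: dict) -> int:
    """Add the prompt and generation token counts from a native Llama response."""
    return model_response["prompt_token_count"] + model_response["generation_token_count"]


def fits_in_context(model_response: dict, context_window: int) -> bool:
    """Return True when the whole exchange fits inside the given context window."""
    return total_tokens(model_response) <= context_window


# Hypothetical split of the roughly 10,711 tokens reported above.
example_response = {"prompt_token_count": 10_000, "generation_token_count": 711}

print("Fits in Llama 3:  ", fits_in_context(example_response, LLAMA_3_CONTEXT_WINDOW))
print("Fits in Llama 3.1:", fits_in_context(example_response, LLAMA_3_1_CONTEXT_WINDOW))
```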

---

## Summarizing Long Documents with LangChain

In the following sections, we will go over three different summarization techniques with LangChain:

1. Stuff
2. Map Reduce
3. Refine

---

### 1. Stuff with load_summarize_chain

Stuffing is the simplest method to pass data to a language model. It "stuffs" text into the prompt as context in a way that all of the relevant information can be processed by the model to get what you want. It is the default way to process documents with an LLM.

In LangChain, you can use `StuffDocumentsChain` as part of the `load_summarize_chain` method. What you need to do is set `stuff` as the `chain_type` of your chain.
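
Conceptually, stuffing amounts to joining every chunk into one string and placing it in a single prompt. The sketch below shows only that idea and is not LangChain's implementation: `build_stuff_prompt` is a hypothetical helper, the template is the default summarization prompt quoted later in this section, and the blank-line separator is an assumption.

```python
# Default "stuff" summarization template as quoted in this notebook.
STUFF_TEMPLATE = 'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'


def build_stuff_prompt(chunks: list, template: str = STUFF_TEMPLATE,
                       separator: str = "\n\n") -> str:
    """Join all chunks and insert them into a single prompt."""
    return template.format(text=separator.join(chunks))


chunks = ["First chunk of the document.", "Second chunk of the document."]
prompt = build_stuff_prompt(chunks)
print(prompt)
# One prompt means one model call, so everything must fit in the context window.
```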

In [110]:
# Instantiate the LLM object corresponding to the model we need.
# llm = GetLLMChatBedrockObject(llama3_70b_instruct)
llm = GetLLMChatBedrockObject(llama3_1_70b_instruct)
    
stuff_summary_chain = load_summarize_chain(llm=llm,
                                           chain_type="stuff",
                                           verbose=False)

Next, let's take a look at the Prompt template used by the Stuff summarize chain:

In [111]:
stuff_summary_chain.llm_chain.prompt.template

'Write a detailed summary of the following text delimited by triple backquotes.\nReturn your response in bullet points which covers the key points of the text.\n```{text}```\nBULLET POINT SUMMARY:\n'

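By default, the prompt template for `llm_chain` is: 'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'. The output shown above already contains the custom bullet-point template, which suggests it was captured after the override below had been run.

You can replace the default by building a new prompt with `PromptTemplate.from_template` and assigning its template to the chain, as we do below: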



In [112]:
stuff_prompt = PromptTemplate.from_template('Write a detailed summary of the following text delimited by triple backquotes.\nReturn your response in bullet points which covers the key points of the text.\n```{text}```\nBULLET POINT SUMMARY:\n')

In [113]:
stuff_summary_chain.llm_chain.prompt.template = stuff_prompt.template  # Set new prompt template

Now that we have set the new prompt template, let us first try generating a summary of the whitepaper.

In [114]:
%%time
# We will get an ERROR in Llama 3.
# With Llama 3.1 this cell might take 1-2 minutes to run, but there is no
# error thanks to the increased (128k) context window size.
try:
    stuff_hipaa_summary = stuff_summary_chain.invoke(hipaa_docs[50:97])
except Exception as e:
    print(e)

CPU times: user 15.3 ms, sys: 142 μs, total: 15.4 ms
Wall time: 1min 36s


In [115]:
print(stuff_hipaa_summary['output_text'].strip())

Here is a bullet point summary of the text:

**Amazon CloudWatch Events**

* Delivers a near-real-time stream of system events that describe changes in AWS resources
* Customers should ensure PHI does not flow into CloudWatch Events
* Can be configured to register as an AWS API call in CloudTrail

**Amazon CloudWatch Logs**

* Allows customers to monitor, store, and access log files from various sources
* Log data is encrypted while in transit and at rest
* Customers can retrieve log data from CloudWatch Logs

**Amazon Comprehend**

* Uses natural language processing to extract insights from documents
* Can process data containing PHI
* Does not retain or store any data, and all API calls are encrypted with SSL/TLS
* Uses CloudTrail to log all API calls

**Identity and Access Management**

* Customers can use IAM to create a user with attached policies to enable Amazon Comprehend permissions
* Can use multi-factor authentication to require users to authenticate to AWS
* Can create a cu

#### Notes:
In the output for the above cell, we don't get an error despite the context length being longer than 8192 tokens. Refer to the [Llama 3 notebook](https://github.com/aws-samples/Meta-Llama-on-AWS/blob/main/long-text-summarization/Llama3-Long-Document-summarization-LangChain.ipynb), where you will receive an error on this cell. 

Because of the larger context window, you can also bypass LangChain completely and use the native invoke method. In our runs, however, the LangChain call was faster: about 1.5 minutes of wall time here, versus the 2-3 minutes noted for the native call earlier.

However, note that the stuffing method is not well suited to very large documents: it can be slow, it may not produce a good summary, and it fails once the text exceeds the model's context window. Let's explore a couple of chunk-wise summarization techniques with [LangChain](https://python.langchain.com/docs/get_started/introduction.html).

---

### 2. Map Reduce with load_summarize_chain

The `Map_Reduce` method summarizes each document chunk individually (map step) and then combines these summaries into a final summary (reduce step). It is designed for documents that exceed the token limit of the language model, and because the map calls are independent of one another, it scales to larger volumes of text and can reduce processing time.

In LangChain, you can use `MapReduceDocumentsChain` as part of the `load_summarize_chain` method. What you need to do is set `map_reduce` as the `chain_type` of your chain.

In this architecture:

1. A large document (or a single file made by appending smaller ones) is loaded.
2. A LangChain utility splits it into multiple smaller chunks (chunking).
3. The model generates individual summaries for all document chunks in parallel.
4. All these summaries are reduced to a condensed final summary.

---

![map-reduce](imgs/llama3mapreduce.png)
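
The architecture above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than LangChain's code: `map_reduce_summarize` is a hypothetical helper, `summarize` stands for any callable that sends text to a model and returns a summary, and the stub `fake_summarize` just truncates text so the example runs without a model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(chunks: List[str],
                         summarize: Callable[[str], str],
                         max_workers: int = 4) -> str:
    """Summarize each chunk independently, then summarize the summaries."""
    # Map step: the calls do not depend on each other, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_summaries = list(pool.map(summarize, chunks))

    # Reduce step: one more call over the concatenated partial summaries.
    return summarize("\n\n".join(partial_summaries))


def fake_summarize(text: str) -> str:
    """Stand-in for a model call: keep only the first 30 characters."""
    return text[:30]


chunks = ["Chunk one talks about logging.", "Chunk two talks about encryption."]
print(map_reduce_summarize(chunks, fake_summarize))
```

With a real model, `summarize` would wrap a call such as `llm.invoke(...)`, and the thread pool size would need to respect the service's rate limits.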

In [122]:
# Instantiate the LLM object corresponding to the model we need.
# llm = GetLLMChatBedrockObject(llama3_70b_instruct)
llm = GetLLMChatBedrockObject(llama3_1_70b_instruct)

# Applies an LLMChain to each document individually (map step), then combines
# the resulting summaries, collapsing them iteratively if needed, into a
# single final summary (reduce step).
map_reduce_summary_chain = load_summarize_chain(llm=llm,
                                                chain_type="map_reduce",
                                                verbose=False)

The `ReduceDocumentsChain` takes the document mapping results and reduces them into a single output. It wraps a generic `CombineDocumentsChain` (like `StuffDocumentsChain`) but adds the ability to collapse documents before passing them to the `CombineDocumentsChain` if their cumulative size exceeds `token_max`.
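
The collapse behaviour can be pictured as follows. This is a simplified sketch, not LangChain's actual implementation: the helper names `group_by_budget` and `collapse_then_combine` are hypothetical, `summarize` is any callable that returns a shorter text, and `len` (a character count) stands in for a real token counter.

```python
from typing import Callable, List


def group_by_budget(summaries: List[str], token_max: int,
                    length_fn: Callable[[str], int] = len) -> List[List[str]]:
    """Greedily pack summaries into groups whose total size stays within token_max."""
    groups, current, current_len = [], [], 0
    for summary in summaries:
        size = length_fn(summary)
        if current and current_len + size > token_max:
            groups.append(current)
            current, current_len = [], 0
        current.append(summary)
        current_len += size
    if current:
        groups.append(current)
    return groups


def collapse_then_combine(summaries: List[str], summarize: Callable[[str], str],
                          token_max: int, length_fn: Callable[[str], int] = len) -> str:
    """Collapse groups of summaries until they fit, then produce the final summary."""
    while len(summaries) > 1 and sum(length_fn(s) for s in summaries) > token_max:
        groups = group_by_budget(summaries, token_max, length_fn)
        if len(groups) == len(summaries):
            break  # No group could be merged; avoid looping forever.
        summaries = [summarize("\n\n".join(group)) for group in groups]
    return summarize("\n\n".join(summaries))


parts = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]
print(group_by_budget(parts, token_max=100))
print(collapse_then_combine(parts, lambda text: text[:10], token_max=100))
```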

In [123]:
# Instantiation using from_template (recommended)
# Sets the prompt template for the summaries generated for all the individual document chunks.
initial_map_prompt = PromptTemplate.from_template("""
                      Write a summary of this chunk of text that includes the main points and any important details.
                      {text}
                      """)

map_reduce_summary_chain.llm_chain.prompt.template = initial_map_prompt.template

# Sets the prompt template for generating a cumulative summary of all the document chunks for reduce documents chain.
reduce_documents_prompt = PromptTemplate.from_template("""
                      Write a detailed summary of the following text delimited by triple backquotes.
                      Return your response in bullet points which covers the key points of the text.
                      ```{text}```
                      BULLET POINT SUMMARY:
                      """)

map_reduce_summary_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt.template = reduce_documents_prompt.template

Here, we perform summarization on the **HIPAA and Security Compliance** document with `Map_Reduce`. Since this document is quite large, it can take a while to run.
To see how `Map_Reduce` works, let us generate a summary of a subset of the document chunks (**50 through 70**, that is, `hipaa_docs[50:71]`).

In [124]:
%%time
# This cell might take 3-5 minutes to run on Llama 3.
# This cell might take 2-3 minutes to run on Llama 3.1.

try:
    map_reduce_summary = map_reduce_summary_chain.invoke(hipaa_docs[50:71])
    # map_reduce_summary = map_reduce_summary_chain.invoke(hipaa_docs[50:97])
except Exception as e:
    print(e)

CPU times: user 188 ms, sys: 417 μs, total: 189 ms
Wall time: 3min 1s


In [125]:
print(map_reduce_summary['output_text'].strip())
# map_reduce_summary

Here is a detailed summary of the text in bullet points:

**Amazon CloudWatch Events**

* Delivers a near-real-time stream of system events that describe changes in AWS resources.
* Customers should ensure that Protected Health Information (PHI) does not flow into CloudWatch Events.
* AWS resources emitting CloudWatch events that store, process, or transmit PHI must be configured according to the Guidance.

**Amazon CloudWatch Logs**

* Allows customers to monitor, store, and access log files from various sources.
* Log data is stored and can be accessed through CloudWatch Logs.
* Log data is encrypted while in transit and at rest.

**Amazon Comprehend**

* Uses natural language processing to extract insights from documents.
* Supports text files in UTF-8 format.
* Can be used with data containing Protected Health Information (PHI).
* Does not retain or store any data.
* All API calls are encrypted with SSL/TLS.
* Uses CloudTrail to log all API calls.

**Security and Access Management*

#### Notes:

In our runs, Llama 3.1 showed a slight latency improvement over Llama 3 on this cell (about 2-3 minutes versus 3-5 minutes), possibly owing to its larger context window.

With `Map_Reduce`, the model can summarize a large document, overcoming the context limit of the `Stuff` method by processing chunks in parallel.
However, it requires multiple calls to the model and can lose context between the individual chunk summaries. To address this, let us try another method that performs chunk-wise summarization.

---

### 3. Refine with load_summarize_chain

The `Refine` method lets us summarize the input iteratively: it updates its answer while looping over the input documents, which makes it useful for refining a summary as new context arrives. `Refine` is a simpler alternative to `Map_Reduce`. It generates a summary for the first chunk, combines that summary with the second chunk, generates another summary, and continues until all chunks are processed. This method is suitable for large documents and is less complex than `Map_Reduce`.

In this architecture:

1. A large document (or a single file made by appending smaller ones) is loaded.
2. A LangChain utility splits it into multiple smaller chunks (chunking).
3. The first chunk is sent to the model, and the model returns the corresponding summary.
4. LangChain takes the next chunk, appends it to the returned summary, and sends the combined text to the model as a new request; the process repeats until all chunks are processed.
5. In the end, you have a final summary that has been iteratively updated using all the document chunks.

---

![refine](imgs/llamarefine.png)
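
The loop described above can be sketched without any model at all. This is a minimal illustration, not LangChain's implementation: `refine_summarize` is a hypothetical helper, and `summarize_initial` and `refine` are placeholders for the two model calls driven by the initial and refine prompts.

```python
from typing import Callable, List


def refine_summarize(chunks: List[str],
                     summarize_initial: Callable[[str], str],
                     refine: Callable[[str, str], str]) -> str:
    """Summarize the first chunk, then fold each later chunk into the summary."""
    if not chunks:
        raise ValueError("at least one chunk is required")

    summary = summarize_initial(chunks[0])
    for chunk in chunks[1:]:
        # Each step needs the previous summary, so the calls must run in order.
        summary = refine(summary, chunk)
    return summary


# Stub "model" calls so the sketch runs offline.
def fake_initial(chunk: str) -> str:
    return f"[{chunk}]"


def fake_refine(existing_summary: str, chunk: str) -> str:
    return f"{existing_summary} + [{chunk}]"


print(refine_summarize(["logging", "encryption", "access control"], fake_initial, fake_refine))
```

Because every iteration depends on the one before it, the total latency grows roughly linearly with the number of chunks.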



In [138]:
# Instantiate the LLM object corresponding to the model we need.
llm = GetLLMChatBedrockObject(llama3_70b_instruct)
# llm = GetLLMChatBedrockObject(llama3_1_70b_instruct)

# Run an initial prompt on a small chunk of data to generate a summary.
# Then, for each subsequent document, the output from the previous document is
# passed in along with the new document, and the LLM is asked to refine the
# output based on the new document.
refine_summary_chain = load_summarize_chain(llm=llm,
                                            chain_type="refine",
                                            verbose=False)

# Refine summary chain for summarization
refine_summary_chain_french = load_summarize_chain(llm=llm,
                                                   chain_type="refine",
                                                   verbose=False)

Here, we perform summarization on the **HIPAA and Security Compliance** document with `Refine`. Since this document is quite large, it can take a while to run.
To see how `Refine` works, let us generate a summary of a subset of the document chunks (**50 through 70**, that is, `hipaa_docs[50:71]`).

In [139]:
# Initial llm chain prompt template
initial_refine_prompt = PromptTemplate.from_template("""
                      Write a summary of this chunk of text that includes the main points and any important details.
                      {text}
                      """)

refine_summary_chain.initial_llm_chain.prompt.template = initial_refine_prompt.template

# Refine llm chain prompt template
refine_documents_prompt = PromptTemplate.from_template(
    "Your job is to produce a final summary.\nWe have provided an existing summary up to a certain point: {existing_answer}\nWe have the opportunity to refine the existing summary (only if needed) with some more context below.\n------------\n{text}\n------------\nGiven the new context, refine the original summary.\nIf the context isn't useful, return the original summary.")

refine_summary_chain.refine_llm_chain.prompt.template = refine_documents_prompt.template

In [140]:
%%time
# This cell might take 8-10 minutes to run on Llama 3.
# This cell might take 6-8 minutes to run on Llama 3.1.

try:
    refine_summary = refine_summary_chain.invoke(hipaa_docs[50:71])
except Exception as e:
    print(e)

CPU times: user 120 ms, sys: 19.6 ms, total: 140 ms
Wall time: 8min 49s


In [141]:
print(refine_summary['output_text'].strip())

Based on the provided context, I have refined the original summary to incorporate the new information about Amazon DocumentDB (with MongoDB compatibility) and additional security recommendations. Here is the refined summary:

**Main Points:**

* Amazon CloudWatch Events provides a near-real-time stream of system events related to AWS resources.
* Customers should ensure that Protected Health Information (PHI) is not sent to CloudWatch Events.
* AWS resources emitting CloudWatch events that involve PHI must be configured according to the Guidance.
* Amazon Comprehend can be used with data containing PHI, and all API calls are encrypted with SSL/TLS.
* Amazon Connect is a self-service, cloud-based contact center service that enables dynamic, personal, and natural customer engagement at any scale, and customers should not include any PHI in any fields associated with managing users, security profiles, and contact flows within Amazon Connect.
* Amazon DocumentDB (with MongoDB compatibility

### Notes:

Llama 3.1 shows a slight latency improvement over Llama 3 owing to its larger context window.

`Refine` has the potential to incorporate more relevant context than `Map_Reduce`, which can result in a more comprehensive and accurate summary. However, it comes with a trade-off: `Refine` is an incremental process that makes one call to the LLM per chunk, far more than the single call `Stuff` makes, and each call uses the summary produced from the previous chunks. Because these calls are not independent, they cannot be parallelized the way the per-chunk calls in `Map_Reduce` can, which can lead to longer processing times. Another consideration is that `Refine` may exhibit recency bias: because the method processes documents in a fixed order, the most recent chunks in the sequence can carry more weight in the final summary.
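
The incremental process described above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `refine_summarize` and `fake_llm` are hypothetical names, the prompts are shortened, and `fake_llm` is a stand-in for a real model so the example runs offline.

```python
from typing import Callable, List


def refine_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Fold each chunk into a running summary, one LLM call per chunk."""
    if not chunks:
        return ""
    # First call: summarize the first chunk on its own.
    summary = llm("Write a summary of this chunk of text:\n" + chunks[0])
    # Every later call needs the summary from the previous call,
    # so the calls must run strictly in order.
    for chunk in chunks[1:]:
        summary = llm(
            "Existing summary: " + summary + "\n"
            "New context:\n" + chunk + "\n"
            "Refine the existing summary with the new context."
        )
    return summary


# Hypothetical stand-in for a model: records each prompt and returns a tag.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary-after-call-{len(calls)}"


result = refine_summarize(["chunk A", "chunk B", "chunk C"], fake_llm)
print(result)      # summary-after-call-3
print(len(calls))  # 3, one call per chunk
```

Each prompt after the first embeds the previous summary, which is why the calls cannot be issued concurrently and why later chunks get the last word on the final text.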

---
## Conclusion

In this notebook, we looked at three different summarization techniques using LangChain: **Stuff**, **Map_Reduce**, and **Refine**. Each of these methods has its own distinct advantages and uses.

- ***Stuff*** is straightforward and is the fastest of the three methods, since it makes a single call to the LLM and fits the entire document within the model's context window. However, as we saw with the HIPAA Compliance document, it does not scale well to large volumes of text.

- ***Map_Reduce*** deals with the issue of the context window length while parallelizing the generation of summaries for individual chunks, which speeds up the model's response and makes it possible to process long documents. An issue with Map_Reduce is that, because each chunk is summarized independently rather than recursively, we lose context between chunks.

- ***Refine*** deals with the issues that arise with the previous methods. It performs recursive summarization by incrementally generating summaries for each of the chunks while retaining context between them. While this method generates the most accurate and comprehensive summary of the three methods, the calls made to the LLM cannot be parallelized, which can result in longer processing times. Additionally, more recent document chunks tend to carry more weight because of the order in which they are processed.
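
To make the contrast concrete, here is a minimal sketch of the Map_Reduce call pattern using only the standard library. It is an illustration under stated assumptions, not LangChain's implementation: `map_reduce_summarize` and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str], llm: Callable[[str], str], max_workers: int = 4
) -> str:
    # Map step: each chunk is summarized independently, so the calls
    # can run concurrently. pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = list(pool.map(lambda c: llm("Summarize:\n" + c), chunks))
    # Reduce step: one more call combines the partial summaries.
    return llm("Combine these summaries:\n" + "\n".join(partial))


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in: wraps everything after the instruction line.
    body = prompt.split("\n")[1:]
    return "S(" + "+".join(body) + ")"


print(map_reduce_summarize(["chunk A", "chunk B", "chunk C"], fake_llm))
# S(S(chunk A)+S(chunk B)+S(chunk C))
```

No map call ever sees another chunk's text or summary, which is what allows the parallelism and also what causes the loss of context between chunks.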

We saw that, thanks to the larger context window of Llama 3.1, you can summarize a document by calling the model's native `invoke` method directly instead of using one of the `load_summarize_chain` options (e.g., stuff). This lets you avoid LangChain entirely; however, LangChain's `load_summarize_chain` and its options (namely stuff, map_reduce, and refine) offer latency benefits and better control.

We can see that both Llama 3.1 and Llama 3 work well with the Map_Reduce and Refine methods, both of which are meant to work around the context window limitation when the input is a large document. We do see a slight latency improvement with Llama 3.1 on the same tasks owing to its larger context window (128K for Llama 3.1 compared with 8K for Llama 3). This is just one aspect of the comparison. For your use case, choose the appropriate chunking strategy based on a holistic analysis that considers all aspects.
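
As one small input to that analysis, the sketch below checks whether a document is likely to fit in a model's context window before choosing a strategy. Everything here is an assumption for illustration: the helper names are hypothetical, the four-characters-per-token ratio is a rough rule of thumb rather than the Llama tokenizer, and the reserved budget for the prompt and the generated summary is arbitrary.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Rough heuristic only; a real tokenizer gives the exact count.
    return int(len(text) / chars_per_token) + 1


def pick_strategy(
    text: str, context_window_tokens: int, reserved_tokens: int = 2000
) -> str:
    # Leave room for the prompt template and the generated summary.
    budget = context_window_tokens - reserved_tokens
    if estimate_tokens(text) <= budget:
        return "stuff"
    return "map_reduce or refine"


doc = "x" * 100_000  # roughly 25K tokens under the heuristic above
print(pick_strategy(doc, 8_192))    # map_reduce or refine
print(pick_strategy(doc, 128_000))  # stuff
```

A fit check like this only tells you whether a single call is possible; latency, cost, and summary quality still need to be weighed separately.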

We also saw that, with direct support for eight languages, Llama 3.1 lets you prompt and get results in multiple languages. This can help you build powerful user experiences with multilingual chat interfaces.


---
## Distributors
- Amazon Web Services
- Meta

