# Long text summarization using LCEL chains on LangChain with Bedrock APIs

> *This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

## Overview
Large documents pose several challenges: the input text might not fit into the model's context length, the model may hallucinate on very long inputs, or out-of-memory errors may occur.

To address these problems, this notebook walks through a solution based on chunking and chaining prompts. The solution uses [LangChain](https://python.langchain.com/docs/get_started/introduction.html), a popular open source framework for developing applications powered by language models.

In this architecture:

1. A large document (or a single file made by concatenating smaller ones) is loaded
2. A LangChain utility splits it into multiple smaller chunks (chunking)
3. The first chunk is sent to the model; the model returns the corresponding summary
4. LangChain takes the next chunk, appends it to the returned summary, and sends the combined text to the model as a new request; this repeats until all chunks are processed
5. The result is a final summary based on the entire content
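
The steps above can be sketched in plain Python. This is a minimal illustration of the chunk-and-chain idea, not LangChain's implementation: `chain_summaries` and `fake_summarize` are hypothetical names, and the stand-in summarizer only records what it was asked to summarize so the flow of text is visible.

```python
from typing import Callable, List


def chain_summaries(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Fold all chunks into one running summary, one model call per chunk."""
    if not chunks:
        return ""
    # Step 3: summarize the first chunk on its own.
    running_summary = summarize(chunks[0])
    # Step 4: append each later chunk to the summary so far and summarize again.
    for chunk in chunks[1:]:
        running_summary = summarize(f"{running_summary}\n\n{chunk}")
    # Step 5: the last summary has seen every chunk.
    return running_summary


# Hypothetical stand-in for a model call: records its input, returns a label.
calls: List[str] = []


def fake_summarize(text: str) -> str:
    calls.append(text)
    return f"S{len(calls)}"


final_summary = chain_summaries(["c1", "c2", "c3"], fake_summarize)
print(final_summary)
print(calls)
```

In a real run, `summarize` would wrap a call to the model; each request then carries only one chunk plus the running summary, which keeps every prompt well below the context limit.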

### Use case
This approach can be used to summarize call transcripts, meeting transcripts, books, articles, blog posts, and other relevant content.

### Install the anthropic package for counting tokens

In [2]:
%pip install -Uq anthropic

Note: you may need to restart the kernel to use updated packages.


### Install LangChain prerequisites

In [3]:
%pip install -U --no-cache-dir boto3
%pip install -U --no-cache-dir  \
    "langchain>=0.1.11" \
    langchain_aws==0.1.2 \
    sqlalchemy \
    "faiss-cpu>=1.7,<2" \
    "pypdf>=3.8,<4" \
    pinecone-client==2.2.4 \
    apache-beam==2.52. \
    tiktoken==0.5.2 \
    "ipywidgets>=7,<8" \
    matplotlib==3.8.2 \
    anthropic==0.9.0 # why is this being installed again?
%pip install -U --no-cache-dir transformers


### Imports

In [4]:
import json
import os
import sys

from langchain.agents import XMLAgent, tool, AgentExecutor
from langchain_aws import BedrockLLM

# Optional: create an explicit Bedrock runtime client and pass it as `client=` below.
# import boto3
# boto3_bedrock_runtime = boto3.client('bedrock-runtime')

# Make modules in the parent directory importable.
module_path = ".."
sys.path.append(os.path.abspath(module_path))

# Claude v2 on Bedrock with a low sampling temperature.
model = BedrockLLM(
    model_id="anthropic.claude-v2",
    # client=boto3_bedrock_runtime,
    model_kwargs={'temperature': 0.3}
)

### Load shareholder letter

We will follow a process similar to lab 02 in this summarization section. First, let us load the 2022 Amazon shareholder letter.

In [5]:
shareholder_letter = "./letters/2022-letter.txt"

with open(shareholder_letter, "r") as file:
    letter = file.read()

In [6]:
len(letter.split(' '))

5084

In [7]:
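from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on paragraph breaks first, then on line breaks. With the default
# length function, chunk_size and chunk_overlap are measured in characters.
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=4000, chunk_overlap=100
)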

docs = text_splitter.create_documents([letter])
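
To build intuition for what the splitter does, here is a simplified sketch in plain Python. It is not LangChain's algorithm: `split_on_paragraphs` is a hypothetical helper that uses a single separator, adds no overlap, and keeps an oversized paragraph whole, whereas `RecursiveCharacterTextSplitter` falls back to the next separator and applies `chunk_overlap`. The shared idea is that pieces are packed greedily until the character budget is reached.

```python
from typing import List


def split_on_paragraphs(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily pack paragraphs into chunks of at most chunk_size characters.

    Simplifications: a single paragraph longer than chunk_size is kept whole,
    and no overlap between neighbouring chunks is added.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split(separator):
        if not paragraph.strip():
            continue  # skip empty pieces produced by repeated separators
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            current = candidate  # still fits (or is the first piece of a chunk)
        else:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddddddddddd"
for chunk in split_on_paragraphs(sample, chunk_size=10):
    print(len(chunk), repr(chunk))
```

Because the budget is counted in characters, a 4000-character chunk usually corresponds to far fewer tokens, which is why the token count is checked separately in the next cell.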

In [8]:
num_docs = len(docs)

num_tokens_first_doc = model.get_num_tokens(docs[0].page_content)

print(
    f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens"
)

Now we have 10 documents and the first one has 435 tokens


In [9]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import XMLOutputParser, PydanticOutputParser
from langchain.output_parsers.json import SimpleJsonOutputParser
from langchain.schema.output_parser import StrOutputParser


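# The XML parser is used only to generate format instructions for the prompt;
# the chain itself returns the raw model text through StrOutputParser.
xml_parser = XMLOutputParser(tags=['insight'])
str_parser = StrOutputParser()

# Claude v2 text completions expect "\n\nHuman:" and "\n\nAssistant:" turns,
# so the template keeps both markers at the start of a line.
prompt = PromptTemplate(
    template="""

H: {instructions} : \"{document}\"
Format help: {format_instructions}.

A:""",
    input_variables=["instructions","document"],
    partial_variables={"format_instructions": xml_parser.get_format_instructions()},
)

insight_chain = prompt | model | str_parser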
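
The `|` operator in the chain definition composes the steps from left to right: the prompt's output feeds the model, and the model's output feeds the parser. The toy sketch below illustrates that composition idea in plain Python. It is not how LangChain implements runnables; `Step`, `fill_prompt`, `fake_model`, and `strip_output` are hypothetical names used only for illustration.

```python
from typing import Any, Callable


class Step:
    """A callable unit that can be chained to another Step with `|`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # The combined step runs self first, then passes the result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the prompt, the model, and the output parser.
fill_prompt = Step(lambda inputs: f'{inputs["instructions"]}: "{inputs["document"]}"')
fake_model = Step(lambda prompt_text: f"  echo {prompt_text}  ")
strip_output = Step(str.strip)

chain = fill_prompt | fake_model | strip_output
result = chain.invoke({"instructions": "Summarize", "document": "hello"})
print(result)
```

The practical consequence is that `insight_chain.invoke(...)` takes the same dictionary the prompt expects and returns whatever the last step produces, here a plain string.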

## Option 1: Manually process insights, then summarize

In [22]:
%%time
insights = []
for doc in docs:
    # Pass the chunk text as a plain string; wrapping it in braces would
    # create a set and inject its repr into the prompt.
    insights.append(
        insight_chain.invoke({
            "instructions": "Provide key insights from the following text",
            "document": doc.page_content,
        })
    )

CPU times: user 66.5 ms, sys: 7.96 ms, total: 74.5 ms
Wall time: 1min 6s
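
Each entry collected by the loop is a raw string, because the chain ends in a string parser. If the individual insights are needed as a list, they can be pulled out with a regular expression. The sketch below assumes the model wraps each point in numbered tags such as `<insight1>`, which is an assumption about the output shape and may not hold for every response; `extract_insights` and `sample_output` are hypothetical.

```python
import re
from typing import List


def extract_insights(raw: str) -> List[str]:
    """Return the text inside each complete <insightN>...</insightN> pair."""
    # \1 requires the closing tag to carry the same number as the opening tag.
    pattern = re.compile(r"<insight(\d+)>\s*(.*?)\s*</insight\1>", re.DOTALL)
    return [match.group(2) for match in pattern.finditer(raw)]


# Hypothetical response, cut off mid-tag the way a truncated output might be.
sample_output = (
    " <insight>\n<insight1>\nFirst point.\n</insight1>\n\n"
    "<insight2>\nSecond point.\n</insight2> \n\n<in"
)
parsed = extract_insights(sample_output)
print(parsed)
```

Incomplete tags at the end of a truncated response are ignored rather than raising an error, which is convenient when a response hits the output token limit.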


In [31]:
# Show one entry from the list of insights
insights[5]

" <insight>\n<insight1>\nAmazon is making large investments in machine learning to improve its advertising algorithms and provide better insights and measurement tools for advertisers through products like Amazon Marketing Cloud. This allows advertisers to analyze audience data and campaign performance to optimize their marketing strategies.\n</insight1>\n\n<insight2>\nAmazon believes there are future opportunities to integrate advertising into its video, live sports, audio, and grocery offerings to help brands engage their target audiences. The company will continue growing its advertising business.\n</insight2> \n\n<insight3>\nAmazon uses a systematic process to evaluate new investment opportunities based on their potential size, current level of service in the market, whether Amazon has a differentiated approach, and if Amazon has the needed competencies or can acquire them quickly. This framework has led Amazon to expand into new areas like international stores.\n</insight3>\n\n<in

In [24]:
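# Second prompt: no format instructions, because the summary should be plain prose.
# The "\n\nHuman:" and "\n\nAssistant:" markers follow the Claude v2 turn format.
prompt = PromptTemplate(
    template="""

H: {instructions} : \"{document}\"

A:""",
    input_variables=["instructions","document"]
)

summary_chain = prompt | model | StrOutputParser()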

In [25]:
%%time
# Join the per-chunk insights into one string (not a set) before summarizing.
print(summary_chain.invoke({
        "instructions":"You will be provided with multiple sets of insights. Compile and summarize these insights and provide key takeaways in one concise paragraph. Do not use the original xml tags. Just provide a paragraph with your compiled insights.",
        "document": '\n'.join(insights)
    }))

 Here is a concise paragraph summarizing the key insights:

Amazon has transformed over 25 years from an online bookseller to a global e-commerce and cloud computing giant, pioneering innovations like Kindle, Alexa, and AWS along the way. Despite economic challenges, Amazon continues investing for the long-term in high-potential areas and optimizing operations to improve efficiency. Key focus areas going forward are expanding grocery and advertising, geographic growth, new healthcare offerings, satellite broadband access, and generative AI/machine learning. Amazon believes its best days lie ahead as it leads in customer experience, innovation and hard work across its consumer business, AWS, and new initiatives.
CPU times: user 8.63 ms, sys: 3.06 ms, total: 11.7 ms
Wall time: 6.92 s


## Option 2: Use the map-reduce pattern in LangChain

In [26]:
from langchain.chains.summarize import load_summarize_chain
summary_chain = load_summarize_chain(llm=model, chain_type="map_reduce", verbose=False)
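
As a rough mental model of the map-reduce pattern: each chunk is summarized independently (map), and the partial summaries are then combined in one more model call (reduce). The sketch below shows that shape in plain Python under stated assumptions. `map_reduce_summarize` and `fake_llm` are hypothetical, the prompt wording is invented for illustration, and the library version may do more, such as collapsing intermediate summaries when they grow too long.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str], llm: Callable[[str], str], max_workers: int = 4
) -> str:
    """Summarize each chunk independently, then summarize the summaries."""
    # Map: the chunk-level calls do not depend on each other, so they can run
    # concurrently. pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_summaries = list(
            pool.map(lambda chunk: llm(f"Summarize:\n{chunk}"), chunks)
        )
    # Reduce: one final call over the concatenated partial summaries.
    combined = "\n".join(partial_summaries)
    return llm(f"Combine these summaries into one:\n{combined}")


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in: joins the first word of each line after the first."""
    body_lines = prompt.split("\n")[1:]
    return "+".join(line.split()[0] for line in body_lines if line.strip())


summary = map_reduce_summarize(["alpha one two", "beta three"], fake_llm)
print(summary)
```

Unlike the sequential chunk-and-chain flow, the map step has no dependency between chunks, so the wall-clock time is driven mostly by the slowest chunk call plus the final reduce call.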

In [27]:
%%time
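# The chain takes its documents under "input_documents" and returns a dict;
# the summary text is stored under "output_text".
print(summary_chain.invoke({"input_documents": docs})["output_text"])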

 Here is a concise summary of the key points:

Amazon has faced macroeconomic challenges in 2022 but remains optimistic about future growth opportunities. The company continues to innovate, expanding into new business areas like healthcare and satellite broadband. Amazon is focused on long-term investments like AWS and advertising despite needing to streamline costs with workforce reductions. The leadership team embraces change, rapidly evolving the business over 25+ years from an online bookseller to a global ecommerce and cloud giant. Amazon believes in-person collaboration drives innovation so is asking corporate employees to return to the office. The company is leveraging strengths like logistics and customer service to disrupt massive markets like grocery and business procurement. Amazon is investing heavily in AI like large language models to improve customer experiences, positioning itself to capture significant market share as retail and IT spending move online.
CPU times: user