# Summarize text with Bedrock and LangChain

This notebook explains the steps required to build a summarization workflow with Bedrock.

## Pre-requisites
Install the required libraries and dependencies.

In [None]:
!pip install langchain --upgrade

In [None]:
!pip install transformers==4.24.0

In [None]:
!pip install sagemaker --upgrade

In [None]:
!pip install boto3 --upgrade

## Restart Kernel

In [None]:
# Restart the kernel so the newly installed packages are picked up
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)  

## Setup Dependencies

In [1]:
# Check that the Python version is 3.8 or newer, which LangChain requires
import sys
sys.version

'3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]'

In [2]:
assert sys.version_info >= (3, 8)

In [3]:
import langchain

In [4]:
langchain.__version__

'0.1.1'

In [5]:
import os, json
from tqdm import tqdm
import pathlib 

In [6]:
import boto3
import sagemaker
session = boto3.Session()
sagemaker_session = sagemaker.Session()
studio_region = sagemaker_session.boto_region_name 
bedrock = session.client("bedrock-runtime", region_name=studio_region)


## Summarize Short text with boto3 API

In [7]:
# Titan text model and its generation settings (temperature 0 to reduce randomness)
model_id = "amazon.titan-tg1-large"
model_args = {"maxTokenCount": 4096, "stopSequences": [], "temperature": 0, "topP": 1}

In [8]:
prompt = """
Please provide a summary of the following text. 
<text>
Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. \
It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. \
Use Amazon Comprehend to create new products based on understanding the structure of documents. \
For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.\
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. \
You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. \
You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition.\
Amazon Comprehend may store your content to continuously improve the quality of its pre-trained models. \
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input.\
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature.
</text>
"""

In [9]:
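# Build the JSON request body in the shape the Titan text model expects
body = json.dumps({"inputText": prompt, "textGenerationConfig": model_args})

accept = 'application/json'
content_type = 'application/json'

# Invoke the model, then read and parse the JSON payload from the response body
response = bedrock.invoke_model(body=body, modelId=model_id, accept=accept, contentType=content_type)
response_body = json.loads(response.get('body').read())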

In [10]:
response_body

{'inputTextTokenCount': 268,
 'results': [{'tokenCount': 162,
   'outputText': 'Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to extract insights about the content of documents. It recognizes entities, key phrases, language, sentiments, and other common elements in a document and can be used to create new products based on understanding the structure of documents. It can access document analysis capabilities using the Amazon Comprehend console or APIs, and can run real-time or asynchronous analysis jobs for small or large document sets. It can use pre-trained models or train custom models for classification and entity recognition, and may store your content to improve the quality of its pre-trained models. All Amazon Comprehend features accept UTF-8 text documents as input, and custom classification and custom entity recognition accept image files, PDF files, and Word files as input.',
   'completionReason': 'FINISH'}]}

In [11]:
response_body['results'][0]['outputText']

'Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to extract insights about the content of documents. It recognizes entities, key phrases, language, sentiments, and other common elements in a document and can be used to create new products based on understanding the structure of documents. It can access document analysis capabilities using the Amazon Comprehend console or APIs, and can run real-time or asynchronous analysis jobs for small or large document sets. It can use pre-trained models or train custom models for classification and entity recognition, and may store your content to improve the quality of its pre-trained models. All Amazon Comprehend features accept UTF-8 text documents as input, and custom classification and custom entity recognition accept image files, PDF files, and Word files as input.'
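
The request and response handling above can be wrapped in two small helpers. This is a minimal sketch, assuming the Titan text request and response shape shown in this notebook (`inputText`, `textGenerationConfig`, `results[0]['outputText']`). The function names `build_titan_body` and `extract_output_text` are hypothetical and are not part of boto3.

```python
import json


def build_titan_body(prompt, model_args):
    """Serialize a prompt and generation settings into a Titan text request body."""
    return json.dumps({"inputText": prompt, "textGenerationConfig": model_args})


def extract_output_text(response_body):
    """Return the generated text of the first result, or raise if there is none."""
    results = response_body.get("results", [])
    if not results:
        raise ValueError("Response contains no results")
    return results[0]["outputText"]


# Hand-written response dict shaped like the output above (not a real model reply)
sample_response = {
    "inputTextTokenCount": 3,
    "results": [
        {"tokenCount": 4, "outputText": "A short summary.", "completionReason": "FINISH"}
    ],
}
print(extract_output_text(sample_response))
```

With the live client, the same helpers would be used as `extract_output_text(json.loads(response.get('body').read()))`.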

## Summarize Long text with LangChain and Chunking

In [12]:
from langchain.llms.bedrock import Bedrock

In [13]:
# Read the shareholder letter from disk
letter_path = "letters/2022-letter.txt"
with open(letter_path, "r") as file:
    letter = file.read()
print(letter)

As I sit down to write my second annual shareholder letter as CEO, I find myself optimistic and energized by what lies ahead for Amazon. Despite 2022 being one of the harder macroeconomic years in recent memory, and with some of our own operating challenges to boot, we still found a way to grow demand (on top of the unprecedented growth we experienced in the first half of the pandemic). We innovated in our largest businesses to meaningfully improve customer experience short and long term. And, we made important adjustments in our investment decisions and the way in which we’ll invent moving forward, while still preserving the long-term investments that we believe can change the future of Amazon for customers, shareholders, and employees.

While there were an unusual number of simultaneous challenges this past year, the reality is that if you operate in large, dynamic, global market segments with many capable and well-funded competitors (the conditions in which Amazon operates all of it

In [14]:
llm = Bedrock(model_id=model_id, client=bedrock, model_kwargs=model_args)  
llm.get_num_tokens(letter)

Token indices sequence length is longer than the specified maximum sequence length for this model (6526 > 1024). Running this sequence through the model will result in indexing errors


6526

In [15]:
# Chunk the document into pieces of up to 4000 characters, with a 100-character overlap between consecutive chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=4000, chunk_overlap=100
)

docs = text_splitter.create_documents([letter])
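
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It uses a plain fixed-size sliding window over characters. It is not the separator-based recursive algorithm that `RecursiveCharacterTextSplitter` uses, and the helper name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy input
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=2)
print([len(c) for c in chunks])  # [12, 12, 10]
```

The overlap means the last characters of one chunk are repeated at the start of the next, so a sentence cut at a boundary still appears with some context on both sides.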

In [16]:
len(docs[4].page_content)

3673

In [17]:
num_docs = len(docs)

num_tokens_first_doc = llm.get_num_tokens(docs[0].page_content)

print(
    f"There are {num_docs} documents and the first one has {num_tokens_first_doc} tokens"
)

There are 10 documents and the first one has 439 tokens


In [18]:
# Set verbose=True if you want to see the prompts being used
from langchain.chains.summarize import load_summarize_chain
summary_chain = load_summarize_chain(llm=llm, chain_type="map_reduce", verbose=False)
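
Conceptually, a `map_reduce` summarization summarizes each chunk on its own (map) and then summarizes the combined partial summaries (reduce). The sketch below illustrates only that idea under simplifying assumptions: it omits LangChain's prompt templates and any handling of intermediate summaries that are too long. `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is a stand-in for a real model call.

```python
def map_reduce_summarize(chunks, summarize):
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    combined = "\n".join(partial_summaries)
    return summarize(combined)


def first_sentence(text):
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


toy_chunks = [
    "Cats sleep a lot. They also purr.",
    "Dogs like walks. They also bark.",
]
print(map_reduce_summarize(toy_chunks, first_sentence))  # Cats sleep a lot.
```

Each chunk triggers its own model call plus one more for the reduce step, so the number of requests grows with the number of chunks.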

In [19]:
output = summary_chain.run(docs)

  warn_deprecated(


ValueError: Error raised by bedrock service: Read timeout on endpoint URL: "https://bedrock-runtime.us-west-2.amazonaws.com/model/amazon.titan-tg1-large/invoke"
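
The run above failed with a read timeout. One possible mitigation is to retry the call with an increasing delay. The sketch below is a generic retry helper written for illustration; `call_with_retries` is a hypothetical name, not a LangChain or boto3 function, and it catches every exception type, which is broader than a production version would want. Raising the client's read timeout (for example through a botocore `Config` passed to `session.client`) is another option that is not shown here.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and try again up to max_attempts times.

    The delay doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    The last exception is re-raised if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Toy demonstration: a function that fails twice, then succeeds
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("simulated read timeout")
    return "summary text"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, max_attempts=3, base_delay=1, sleep=delays.append)
print(result, delays)  # summary text [1, 2]
```

In this notebook the helper could wrap the chain call, for example `call_with_retries(lambda: summary_chain.run(docs))`.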

In [None]:
output