# Chapter 7 - Open-source Frameworks: Document Summarization with Amazon Bedrock and LangChain

## Overview
This notebook demonstrates how to build a document summarization system using LangChain integrated with Amazon Bedrock. We'll explore how to process large documents, extract key information, and generate concise summaries using foundation models.

## Introduction
The pipeline loads a text document with LangChain, splits it into overlapping chunks, counts tokens, and then uses Claude 3 Sonnet on Amazon Bedrock with a map-reduce chain to produce a concise summary.

## Prerequisites
- AWS account with Amazon Bedrock access
- Access to the Claude 3 Sonnet model
- Text documents for summarization

## Setup

### Install Required Dependencies

In [None]:
# Installing boto3 package with pip, using upgrade flag and disabling cache
%pip install -U --no-cache-dir boto3
# Installing LangChain and supporting libraries with pinned or bounded versions
%pip install -U --no-cache-dir \
    "langchain>=0.1.11" \
    sqlalchemy \
    "faiss-cpu>=1.7,<2" \
    "pypdf>=3.8,<4" \
    pinecone-client==2.2.4 \
    apache-beam==2.52.0 \
    tiktoken==0.5.2 \
    "ipywidgets>=7,<8" \
    matplotlib==3.8.2 \
    anthropic==0.9.0
%pip install -U --no-cache-dir transformers

### Import Libraries

In [None]:
# Importing required Python modules
import warnings  
from io import StringIO
import sys
import textwrap
import os
from typing import Optional
import json
import boto3
import botocore

In [None]:
# Creating a client for Amazon Bedrock runtime service
boto3_bedrock = boto3.client('bedrock-runtime')

### Define Helper Functions

In [None]:
# Suppressing warning messages to keep output clean
warnings.filterwarnings('ignore')
# Defining a utility function to print text with word wrapping at specified width
def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))


## Document Processing

### Load and Split Document

In [None]:
# Importing LangChain modules for document processing and chain creation
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

In [None]:
# Installing PyPDF2 library for PDF processing
%pip install pypdf2

In [None]:
# Load the text file
loader = TextLoader('data/noob.txt')
data = loader.load()
#print(data)

In [None]:
# Split the text into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(data)
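
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-width chunking with overlap. It is a simplification, not the library's algorithm: `CharacterTextSplitter` splits on a separator first and then merges pieces up to `chunk_size`, so its chunk boundaries will differ. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split `text` into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters of toy input
demo_chunks = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
for chunk in demo_chunks:
    print(repr(chunk))
```

Each chunk begins with the last `chunk_overlap` characters of the previous one, which is how a sentence that straddles a boundary can still appear whole in at least one chunk.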

### Analyze Token Count

In [None]:
# Counting and displaying the total token count
# Note: cl100k_base is an OpenAI encoding, so this count only approximates Claude's tokenization
import tiktoken

# Initialize the tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# Count the total number of tokens
total_tokens = 0

for text in texts:
    tokens = tokenizer.encode(text.page_content)
    num_tokens = len(tokens)
    total_tokens += num_tokens
#    print(f"Number of tokens in chunk: {num_tokens}")

print(f"Total number of tokens: {total_tokens}")
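
The total is a useful sanity check, but whether each map call fits the model depends on per-chunk counts. The sketch below reports per-chunk statistics and flags chunks above a budget. The names `chunk_token_report`, `max_tokens_per_chunk`, and `whitespace_tokens` are hypothetical, and the whitespace counter is only a stand-in so the example runs without extra packages; in this notebook you could pass `lambda s: len(tokenizer.encode(s))` instead.

```python
from typing import Callable, Iterable

def chunk_token_report(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens_per_chunk: int,
) -> dict:
    """Summarize per-chunk token counts and flag chunks above a budget."""
    counts = [count_tokens(chunk) for chunk in chunks]
    return {
        "num_chunks": len(counts),
        "total_tokens": sum(counts),
        "largest_chunk": max(counts, default=0),
        # Indices of chunks whose count exceeds the budget
        "over_budget": [i for i, n in enumerate(counts) if n > max_tokens_per_chunk],
    }

# Stand-in tokenizer (assumption): counts whitespace-separated words, not model tokens.
def whitespace_tokens(s: str) -> int:
    return len(s.split())

report = chunk_token_report(
    ["one two three", "four five", "six seven eight nine ten"],
    count_tokens=whitespace_tokens,
    max_tokens_per_chunk=4,
)
print(report)
```

Any index listed under `over_budget` points at a chunk that may need a smaller `chunk_size` or a different separator.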

In [None]:
# Installing the langchain-aws package quietly (suppressing output)
%pip install -U langchain-aws --quiet

## Create Summarization Pipeline

### Define Summarization Prompt

In [None]:
# Importing PromptTemplate to create structured prompts
from langchain.prompts import PromptTemplate

summarize_prompt = PromptTemplate(
    input_variables=["text"],
    template="Please summarize the following text: {text}",
)
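
As a rough illustration of what the template above produces for one chunk, the following sketch fills the same `{text}` placeholder with plain `str.format`. This stands in for `PromptTemplate` only so the example runs without LangChain; `render_prompt` and the sample sentence are hypothetical.

```python
# Same template string as the prompt defined above; str.format stands in for PromptTemplate.
template = "Please summarize the following text: {text}"

def render_prompt(chunk: str) -> str:
    """Fill the {text} placeholder with one chunk of the document."""
    return template.format(text=chunk)

rendered = render_prompt("Amazon Bedrock provides access to foundation models.")
print(rendered)
```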

### Initialize Language Model

In [None]:
# Importing the ChatBedrock class from langchain_aws for Claude integration
from langchain_aws import ChatBedrock

In [None]:
# Creating a language model instance using Claude 3 Sonnet
llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0")

### Build Summarization Chain

In [None]:
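# Load the map-reduce summarize chain, passing the prompt defined earlier
# so it is used for both the per-chunk (map) and the combining (reduce) step
chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=summarize_prompt,
    combine_prompt=summarize_prompt,
)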
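
Conceptually, the `map_reduce` chain type summarizes every chunk on its own and then summarizes the collected partial summaries. The sketch below shows that control flow under stated assumptions: `map_reduce_summarize` and `first_sentence` are hypothetical names, and the stand-in summarizer keeps the first sentence instead of calling a model. It omits details the library handles, such as prompt formatting and collapsing intermediate summaries that grow too large.

```python
from typing import Callable, Sequence

def map_reduce_summarize(chunks: Sequence[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]  # map step: one call per chunk
    return summarize("\n\n".join(partial))            # reduce step: one final call

calls = []  # records every input the stand-in summarizer receives

# Stand-in summarizer (assumption): keeps the first sentence instead of calling a model.
def first_sentence(text: str) -> str:
    calls.append(text)
    return text.split(".")[0].strip() + "."

demo = map_reduce_summarize(
    ["Cats sleep a lot. They also purr.", "Dogs bark. They fetch."],
    summarize=first_sentence,
)
print(demo)
print(f"Summarizer calls: {len(calls)}")
```

With two chunks the summarizer runs three times: twice in the map step and once in the reduce step. With a real model, the number of calls therefore grows with the number of chunks.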

## Generate Document Summary

### Run Summarization Chain

In [None]:
# Summarize the text; the chain expects the chunks under the "input_documents" key
summary = chain.invoke({"input_documents": texts})

### Display Results

In [None]:
print(summary)

In [None]:

# Print just the output_text
print(summary['output_text'])

## Conclusion

In this notebook, we built a document summarization pipeline that combines Claude 3 Sonnet on Amazon Bedrock with LangChain's orchestration capabilities. The pipeline distills a lengthy document into a concise summary that retains its core message and key points.

Our approach addressed several critical challenges in document summarization:

1. **Large Document Processing**: By breaking text into manageable chunks with appropriate overlap, we ensured that even lengthy documents could be processed efficiently without exceeding token limitations.

2. **Context Preservation**: The map-reduce strategy summarizes each chunk independently and then combines the partial summaries, while chunk overlap keeps context that spans chunk boundaries, so the final summary stays cohesive.

3. **Token Management**: Counting tokens across the chunks showed how much text the model has to process, which informs the choice of chunk size and overlap.

4. **Foundation Model Integration**: Amazon Bedrock provided access to a pre-trained foundation model that can summarize complex text without any specialized model training.

The resulting summarization capability has numerous practical applications, from creating executive briefs of technical documents to processing academic papers, news articles, or legal texts. The approach is flexible enough to be adapted for different document types and summarization requirements by adjusting prompt templates, chunk sizes, or summarization strategies.

For future enhancements, consider implementing:
- Multi-document summarization for comparative analysis
- Custom prompts for different summary styles (extractive vs. abstractive)
- Domain-specific summarization by tailoring prompts for legal, medical, or technical content
- Integration with document management systems for automated summary generation

Combining Amazon Bedrock's foundation models with LangChain's orchestration lets teams build practical summarization tools quickly, making information more accessible and actionable.