# Chapter 5 - Knowledge Bases: Chat with Your Document using Amazon Bedrock

## Overview
This notebook demonstrates how to build an interactive document question-answering system with the Amazon Bedrock `RetrieveAndGenerate` API. You pass a document directly in the request, either as byte content or as an S3 location, and a foundation model answers natural language questions about it, returning citations that point back to the source text.

## Prerequisites
- AWS account with Amazon Bedrock access
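- Model access enabled for Anthropic Claude 3 Sonnet in Amazon Bedrock
- A PDF document to query (the examples use `data/sample-transcript.pdf`)
- Python environment (the required packages are installed in the Setup section)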

## Setup

### Install Required Dependencies

In [None]:
# Install AWS SDK and PDF processing library
%pip install --upgrade boto3      # AWS SDK for Python
%pip install --upgrade botocore   # Core AWS library
%pip install pypdf                # PDF text extraction library

### Import Libraries

In [None]:
# Import AWS SDK for Bedrock integration
import boto3

### Initialize Bedrock Client

In [None]:
# Set the AWS Region used for both the client and the model ARN
region = "us-east-1"

# Initialize the Bedrock Agent Runtime client in that Region
bedrock_agent_client = boto3.client("bedrock-agent-runtime", region_name=region)

# Configure the AI model - Claude 3 Sonnet for high-quality text generation
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

## Document Processing

### Extract Text from Local PDF File

In [None]:
# Import PDF processing library
from pypdf import PdfReader  # Note: For more accurate extraction, consider Amazon Textract

# Specify the path to your PDF file
file_name = "data/sample-transcript.pdf"  # Update this path to your local file

# Create PDF reader object
reader = PdfReader(file_name)

# Display number of pages in the PDF
print(f"📄 Total pages in PDF: {len(reader.pages)}")

# Extract text from all pages, tagging each one with its page number
text = ""
for page_count, page in enumerate(reader.pages, start=1):
    text += f"\npage_{page_count}\n {page.extract_text()}"

# Display extracted text
print("\n📝 Extracted Text:")
print(text)
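
Scanned or image-only pages can yield little or no text from `extract_text()`, which weakens the answers generated later without raising any error. The following is a minimal sketch of a check for such pages, assuming the per-page strings have been collected into a list. The function name `find_sparse_pages` and the `min_chars` threshold are hypothetical, introduced here for illustration; they are not part of pypdf.

```python
def find_sparse_pages(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty."""
    sparse = []
    for number, page_text in enumerate(page_texts, start=1):
        # Treat None like an empty string so the check is defensive
        # about whatever the extraction step produced.
        if len((page_text or "").strip()) < min_chars:
            sparse.append(number)
    return sparse


# Illustrative input only, not taken from a real document
sample_pages = ["Speaker 1: Welcome to the quarterly review call.", "", "   \n"]
print(find_sparse_pages(sample_pages))
```

With the notebook's `reader`, the input list could be built as `[page.extract_text() for page in reader.pages]`. Pages reported here are candidates for OCR-based extraction such as Amazon Textract.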

### Using S3 Document Location (Optional)

In [None]:
# Configure the S3 document location (only needed when calling with sourceType="S3")
bucket_name = "<REPLACE_WITH_YOUR_BUCKET_NAME>"           # Your S3 bucket name
prefix_file_name = "<REPLACE_WITH_OBJECT_NAME_INCLUDING_PREFIX>"  # Object key/path in S3
document_s3_uri = f's3://{bucket_name}/{prefix_file_name}'        # Complete S3 URI
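
Because the placeholders above are plain strings, a forgotten replacement only surfaces later as a failed API call. As a sketch, the URI could be built by a small helper that rejects unfilled placeholders first. `build_s3_uri` is a hypothetical helper written for this example, and it assumes object keys do not intentionally begin with a slash.

```python
def build_s3_uri(bucket_name, object_key):
    """Build an s3:// URI, rejecting empty values and unfilled placeholders."""
    for value in (bucket_name, object_key):
        if not value or value.startswith("<REPLACE"):
            raise ValueError(f"Replace the placeholder value before use: {value!r}")
    # Assumption: keys do not start with "/", so a leading slash is treated
    # as a typo and removed to avoid a double slash in the URI.
    return f"s3://{bucket_name}/{object_key.lstrip('/')}"


# Illustrative bucket and key names only
print(build_s3_uri("example-bucket", "docs/sample-transcript.pdf"))
```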

## Retrieve and Generate Function

### Define Core Function

In [None]:
def retrieveAndGenerate(input, sourceType="S3", model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """
    Generate responses based on document content using Amazon Bedrock.
    
    Args:
        input (str): User question or query about the document
        sourceType (str): "S3" for S3-stored files, "BYTE_CONTENT" for local files
        model_id (str): Bedrock model identifier
    
    Returns:
        dict: Bedrock response containing generated text and citations
    """
    
    # Construct the foundation model ARN for Bedrock
    model_arn = f'arn:aws:bedrock:{region}::foundation-model/{model_id}'

    if sourceType == "S3":
        # Source definition for S3-stored documents
        source = {
            "sourceType": sourceType,
            "s3Location": {
                "uri": document_s3_uri
            }
        }
    else:
        # Source definition for local content. The data is the text extracted
        # earlier, so it is sent as UTF-8 bytes with a plain-text content type.
        source = {
            "sourceType": sourceType,
            "byteContent": {
                "identifier": file_name,
                "contentType": "text/plain",
                "data": text.encode("utf-8"),
            }
        }

    # Both source types share the same request shape
    return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': input
        },
        retrieveAndGenerateConfiguration={
            'type': 'EXTERNAL_SOURCES',
            'externalSourcesConfiguration': {
                'modelArn': model_arn,
                "sources": [source]
            }
        }
    )
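
The `byteContent` source can also carry the PDF file itself rather than text extracted from it, so that the declared content type matches the payload. The following is a minimal sketch of building such a source entry. `pdf_byte_source` is a hypothetical helper, not part of boto3, and it assumes the file is small enough to read into memory and within the service's document size limits.

```python
from pathlib import Path


def pdf_byte_source(pdf_path):
    """Build a BYTE_CONTENT source entry from the raw bytes of a PDF file."""
    path = Path(pdf_path)
    return {
        "sourceType": "BYTE_CONTENT",
        "byteContent": {
            "identifier": path.name,           # file name without directories
            "contentType": "application/pdf",
            "data": path.read_bytes(),         # raw file bytes, not extracted text
        },
    }
```

The returned dictionary would take the place of the single entry in the `sources` list, for example `"sources": [pdf_byte_source(file_name)]`.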

## Chat with Document

### Ask Questions About Document Content

In [None]:
# Ask a question about the document
query = "Summarize the document"

# Generate response using our function
response = retrieveAndGenerate(input=query, sourceType="BYTE_CONTENT")

# Extract the generated text from the response
generated_text = response['output']['text']

# Display the AI's response
print("🤖 AI Response:")
print(generated_text)

## Source Citations

### View and Verify Source Information

In [None]:
# Extract and display citations from the response
citations = response["citations"]
contexts = []

# Process each citation to extract the source text
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])

# Display the source citations
print("📖 Source Citations (Text snippets used to generate the response):")
for i, context in enumerate(contexts, 1):
    print(f"\nCitation {i}:")
    print(context)
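
The loop above flattens all snippets into one list, which loses the link between each snippet and the part of the answer it supports. As a sketch, the same response can be regrouped so every generated text part is shown next to its sources. `pair_citations` is a hypothetical helper, and `sample_response` is a hand-written stand-in that mirrors the response fields used in this notebook; it is not real model output.

```python
def pair_citations(response):
    """Pair each generated text part with the source snippets that support it."""
    pairs = []
    for citation in response.get("citations", []):
        part = citation.get("generatedResponsePart", {}).get("textResponsePart", {})
        snippets = [
            reference.get("content", {}).get("text", "")
            for reference in citation.get("retrievedReferences", [])
        ]
        pairs.append({"claim": part.get("text", ""), "sources": snippets})
    return pairs


# Hand-written stand-in for a response, for illustration only
sample_response = {
    "output": {"text": "The call covered quarterly revenue."},
    "citations": [
        {
            "generatedResponsePart": {
                "textResponsePart": {"text": "The call covered quarterly revenue."}
            },
            "retrievedReferences": [
                {"content": {"text": "page_1 Revenue grew this quarter."}}
            ],
        }
    ],
}

for pair in pair_citations(sample_response):
    print(f"Claim: {pair['claim']}")
    for snippet in pair["sources"]:
        print(f"  Source: {snippet}")
```

Using `.get()` with defaults keeps the helper from raising a `KeyError` when a citation has no references or a field is absent.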

## Conclusion

In this notebook, we built a document question-answering system using the Amazon Bedrock `RetrieveAndGenerate` API. The solution combines PDF text extraction with a foundation model to answer natural language questions about a document and cite the passages it relied on.

Key accomplishments:
- **Document Processing**: Extracted text from a PDF document with `pypdf`, tagging each page with a page marker.
- **Contextual Question Answering**: Used Claude 3 Sonnet to answer questions based on the document content.
- **Source Attribution**: Displayed the citations returned with each response to show which parts of the document were used.
- **Flexible Document Sources**: Defined a function that accepts both local content and S3-stored documents.

This implementation has numerous practical applications:
- **Knowledge Base Interaction**: Query company documentation, reports, or research papers
- **Legal Document Analysis**: Extract insights from contracts or regulatory documents
- **Educational Content**: Create interactive learning experiences with textbooks or course materials
- **Research Assistance**: Quickly find relevant information in academic papers or reports

For production deployments, consider these enhancements:
- Implement robust error handling for various document formats and sizes
- Add a user interface for easier document uploading and interaction
- Integrate with document management systems for seamless workflow
- Implement caching mechanisms to improve response times for frequently accessed documents
- Add document preprocessing capabilities to handle complex layouts and formatting

By combining Amazon Bedrock's foundation models with the retrieve and generate pattern, you've created a powerful tool that makes document information more accessible and actionable through natural language interaction.