# Document Insights with Amazon Bedrock Data Automation: Setup

This notebook covers the prerequisites for setting up a multimodal Retrieval-Augmented Generation (RAG) application using Amazon Bedrock Data Automation (BDA) and Bedrock Knowledge Bases (KB). The application can analyze and generate insights from multiple data modalities, including documents, images, audio, and video.

## Setup and Configuration

Let's start by setting up the necessary dependencies and AWS clients.

In [1]:
%pip install "boto3>=1.37.4" s3fs tqdm retrying packaging --upgrade -qq

import boto3
import json
import uuid
import time
import os
import random
import sagemaker
import logging
import mimetypes
from botocore.exceptions import ClientError
import warnings
warnings.filterwarnings('ignore')

# Import utils and access the business context function
from utils.utils import BDARAGUtils

# Create utility instance to use its methods
rag_utils = BDARAGUtils()

# Display comprehensive business context for RAG
rag_utils.show_business_context("rag_complete")

# Configure logging
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region_name = boto3.session.Session().region_name

s3_client = boto3.client('s3')
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime')

print("Setup complete!")
print(f"Using AWS region: {region_name}")

Note: you may need to restart the kernel to use updated packages.
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Setup complete!
Using AWS region: us-west-2
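
The install cell pins `boto3>=1.37.4`, and pip notes that a kernel restart may be needed before the upgraded package is picked up. As a minimal sketch (the helper names `parse_version`, `meets_minimum`, and `installed_version` are illustrative and not part of the workshop utilities), the following check reports which boto3 version the running kernel can actually see. It uses only the standard library and a simplified numeric comparison, so pre-release suffixes are ignored.

```python
from importlib import metadata


def parse_version(text):
    """Turn a dotted release string such as '1.37.4' into a tuple of ints.

    Simplified on purpose: only the leading digits of each dotted part are kept.
    """
    numbers = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


boto3_version = installed_version("boto3")
if boto3_version is None:
    print("boto3 is not installed in this kernel")
elif meets_minimum(boto3_version, "1.37.4"):
    print(f"boto3 {boto3_version} satisfies >=1.37.4")
else:
    print(f"boto3 {boto3_version} is older than 1.37.4; restart the kernel after upgrading")
```

If the last message appears right after running the install cell, restarting the kernel and re-running the setup cell is usually enough.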


## 1. Create Multimodal Knowledge Base

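Now we'll create a Knowledge Base backed by the workshop S3 bucket. It uses Bedrock Data Automation as the parser, so documents, images, audio, and video can all be indexed, and fixed-size chunking.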

In [2]:
# BDARAGUtils was already imported in the setup cell; repeating the import is harmless
from utils.utils import BDARAGUtils

# Define the S3 prefix for our dataset
s3_prefix = 'bda-workshop/'

# Create or get the bucket name for knowledge base
bucket_name_kb = f"bda-workshop-{region_name}-{account_id}"

print(f"Using bucket: {bucket_name_kb}")
print(f"Using S3 prefix: {s3_prefix}")

# Display business context for Knowledge Base creation
rag_utils.show_business_context("knowledge_base")

# Verify the bucket exists and has content
try:
    response = s3_client.list_objects_v2(Bucket=bucket_name_kb, Prefix=s3_prefix, MaxKeys=10)
    if 'Contents' in response:
        print(f"Found {len(response['Contents'])} objects in bucket with prefix {s3_prefix}")
        for obj in response['Contents']:
            print(f"  - {obj['Key']} ({obj['Size']} bytes)")
    else:
        print(f"⚠️ No objects found in bucket {bucket_name_kb} with prefix {s3_prefix}")
        print("Checking if bucket exists without prefix...")
        response = s3_client.list_objects_v2(Bucket=bucket_name_kb, MaxKeys=10)
        if 'Contents' in response:
            print(f"Found {len(response['Contents'])} objects in bucket (without prefix):")
            for obj in response['Contents']:
                print(f"  - {obj['Key']}")
        else:
            # The call succeeded, so the bucket exists but holds no objects
            print(f"Bucket {bucket_name_kb} exists but is empty.")
except Exception as e:
    print(f"Error accessing bucket {bucket_name_kb}: {e}")
    print("Please ensure the bucket exists and contains the workshop data.")

# Create a timestamp-based suffix for unique resource names
# (last 7 digits of YYYYMMDDHHMMSS: the final day digit plus HHMMSS)
timestamp_str = time.strftime("%Y%m%d%H%M%S")[-7:]
kb_suffix = timestamp_str

# Define Knowledge Base parameters
knowledge_base_name = f"multimodal-rag-kb-{kb_suffix}"
knowledge_base_description = "Multimodal RAG Knowledge Base for the BDA Workshop"

# Define data sources
data_sources = [{
    "type": "S3", 
    "bucket_name": bucket_name_kb,
    "inclusionPrefixes": [s3_prefix]
}]

# Create the Knowledge Base
print(f"🏗️ Creating Knowledge Base: {knowledge_base_name}")
print("This may take several minutes to complete...")

try:
    knowledge_base = BDARAGUtils(
        kb_name=knowledge_base_name,
        kb_description=knowledge_base_description,
        data_sources=data_sources,
        multi_modal=True,
        parser='BEDROCK_DATA_AUTOMATION',  # Parse multimodal content with Bedrock Data Automation
        chunking_strategy="FIXED_SIZE",
        suffix=kb_suffix
    )
    
    knowledge_base.setup_resources()
    
    kb_id = knowledge_base.get_knowledge_base_id()
    print(f"\nKnowledge Base created successfully!")
    print(f"Knowledge Base ID: {kb_id}")
except Exception as e:
    print(f"\nError creating Knowledge Base: {e}")

Using bucket: bda-workshop-us-west-2-033741858282
Using S3 prefix: bda-workshop/


✅ Found 10 objects in bucket with prefix bda-workshop/
  - bda-workshop/audio/output/1df6199b-cd86-432f-bf33-ccfe9148c22d/0/.s3_access_check (0 bytes)
  - bda-workshop/audio/output/1df6199b-cd86-432f-bf33-ccfe9148c22d/0/standard_output/0/result.json (129382 bytes)
  - bda-workshop/audio/output/1df6199b-cd86-432f-bf33-ccfe9148c22d/job_metadata.json (454 bytes)
  - bda-workshop/audio/podcastdemo.mp3 (4846268 bytes)
  - bda-workshop/image/output/5d3c053f-34cc-4993-a99b-65f71290b8f7/0/.s3_access_check (0 bytes)
  - bda-workshop/image/output/5d3c053f-34cc-4993-a99b-65f71290b8f7/0/custom_output/0/result.json (321 bytes)
  - bda-workshop/image/output/5d3c053f-34cc-4993-a99b-65f71290b8f7/0/standard_output/0/result.json (6045 bytes)
  - bda-workshop/image/output/5d3c053f-34cc-4993-a99b-65f71290b8f7/job_metadata.json (641 bytes)
  - bda-workshop/image/travel.png (409501 bytes)
  - bda-workshop/video/content-moderation-demo.mp4 (49457586 bytes)
🏗️ Creating Knowledge Base: multimodal-rag-kb-1103523

[2025-10-31 10:37:07,681] p2573 {base.py:258} INFO - PUT https://rgfoevyjulncbexf9fwj.us-west-2.aoss.amazonaws.com:443/bedrock-sample-rag-index-1103523 [status:200 request:0.333s]


Creating vector index:
Response: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'bedrock-sample-rag-index-1103523'}
Waiting for index creation to complete...
This may take about 60 seconds...
✓ Vector index created successfully.........................
Step 6 - Checking if Lambda function is needed
Not creating Lambda function as chunking strategy is FIXED_SIZE
Step 7 - Creating Knowledge Base
Creating Knowledge Base
✓ Knowledge Base created with ID: AT5MCPMWSZ
Using chunking strategy: FIXED_SIZE
Creating S3 data source for bucket: bda-workshop-us-west-2-033741858282
✓ Data source created with ID: YOE6F9NXLA
✅ Knowledge Base 'multimodal-rag-kb-1103523' created successfully with ID: AT5MCPMWSZ

✅ Knowledge Base created successfully!
Knowledge Base ID: AT5MCPMWSZ
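
The internals of `BDARAGUtils` are not shown in this notebook, so the following is only a sketch of how the `parser` and `chunking_strategy` arguments could map onto the request for the `create_data_source` operation of the `bedrock-agent` client. The function name `build_data_source_request`, the data source name, and the `max_tokens` and `overlap_percentage` values are assumptions chosen for illustration; the helper may use different values.

```python
def build_data_source_request(kb_id, bucket_name, prefixes,
                              name="bda-workshop-s3",   # hypothetical data source name
                              max_tokens=300,            # illustrative chunk size
                              overlap_percentage=20):    # illustrative chunk overlap
    """Build keyword arguments for bedrock-agent create_data_source."""
    return {
        "knowledgeBaseId": kb_id,
        "name": name,
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {
                "bucketArn": f"arn:aws:s3:::{bucket_name}",
                "inclusionPrefixes": list(prefixes),
            },
        },
        "vectorIngestionConfiguration": {
            # Split parsed content into chunks of a fixed token length
            "chunkingConfiguration": {
                "chunkingStrategy": "FIXED_SIZE",
                "fixedSizeChunkingConfiguration": {
                    "maxTokens": max_tokens,
                    "overlapPercentage": overlap_percentage,
                },
            },
            # Let Bedrock Data Automation parse documents, images, audio, and video
            "parsingConfiguration": {
                "parsingStrategy": "BEDROCK_DATA_AUTOMATION",
                "bedrockDataAutomationConfiguration": {
                    "parsingModality": "MULTIMODAL",
                },
            },
        },
    }


request = build_data_source_request(
    kb_id="EXAMPLEKBID",
    bucket_name="bda-workshop-us-west-2-111122223333",
    prefixes=["bda-workshop/"],
)
print(request["vectorIngestionConfiguration"]["parsingConfiguration"]["parsingStrategy"])

# With real identifiers, the request could be submitted as:
# bedrock_agent_client.create_data_source(**request)
```

Building the request as a plain dictionary keeps the sketch runnable without AWS credentials and makes the parser and chunking choices easy to inspect before any resource is created.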


## 2. Save Knowledge Base Configuration

Now we'll save the knowledge base configuration so the RAG notebook can use it.

In [3]:
# Save knowledge base configuration for the RAG notebook
if 'knowledge_base' in locals() and 'kb_id' in locals():
    kb_config = {
        'knowledge_base_id': kb_id,
        'knowledge_base_name': knowledge_base_name,
        'bucket_name': bucket_name_kb,
        's3_prefix': s3_prefix,
        'suffix': kb_suffix,
        'region_name': region_name,
        'account_id': account_id,
        'knowledge_base_object': {
            'kb_name': knowledge_base.kb_name,
            'kb_description': knowledge_base.kb_description,
            'data_sources': knowledge_base.data_sources,
            'multi_modal': knowledge_base.multi_modal,
            'parser': knowledge_base.parser,
            'chunking_strategy': knowledge_base.chunking_strategy,
            'embedding_model': knowledge_base.embedding_model
        }
    }
    
    # Save to a JSON file that the RAG notebook can read
    # (json is already imported in the setup cell)
    with open('../05-rag/kb_config.json', 'w') as f:
        json.dump(kb_config, f, indent=2)
    
    print("Knowledge Base configuration saved to ../05-rag/kb_config.json")
    print("The RAG notebook (05-rag) can now use this knowledge base.")
    print(f"Knowledge Base ID: {kb_id}")
else:
    print("⚠️ Knowledge Base was not created successfully. Please run the previous cell first.")

✅ Knowledge Base configuration saved to ../05-rag/kb_config.json
The RAG notebook (05-rag) can now use this knowledge base.
Knowledge Base ID: AT5MCPMWSZ
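
On the consuming side, the RAG notebook needs to read this file back. A minimal sketch of that step follows, assuming the notebook runs from the `05-rag` directory so the file sits next to it; the function name `load_kb_config` and the choice of required keys are illustrative and not taken from the workshop code.

```python
import json
from pathlib import Path

# Keys the consumer is assumed to need; all of them are written by the cell above
REQUIRED_KEYS = (
    "knowledge_base_id",
    "knowledge_base_name",
    "bucket_name",
    "s3_prefix",
    "region_name",
)


def load_kb_config(path):
    """Load the saved configuration and fail early if it is missing or incomplete."""
    config_path = Path(path)
    if not config_path.exists():
        raise FileNotFoundError(
            f"{config_path} not found; run the setup notebook first"
        )
    with config_path.open() as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{config_path} is missing keys: {missing}")
    return config


# Example usage from the RAG notebook (path is an assumption):
# kb_config = load_kb_config("kb_config.json")
# kb_id = kb_config["knowledge_base_id"]
```

Validating the keys up front turns a later, harder-to-trace failure in a retrieval call into an immediate and descriptive error.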


## Summary

This notebook set up the foundation for a multimodal RAG application using Amazon Bedrock Data Automation and Bedrock Knowledge Bases.

### What was accomplished:
- **Environment Setup**: Installed required dependencies and configured AWS clients
- **Knowledge Base Creation**: Established a multimodal knowledge base capable of processing documents, images, audio, and video
- **S3 Integration**: Connected the knowledge base to your S3 data sources
- **Configuration Export**: Saved the knowledge base configuration for use in the RAG notebook

### Next Steps:
1. **Data Ingestion**: Upload your multimodal content to the configured S3 bucket
2. **Knowledge Base Sync**: Allow the knowledge base to process and index your content
3. **RAG Implementation**: Use the saved configuration in the RAG notebook to build your application
4. **Testing & Optimization**: Test queries across different modalities and optimize performance
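
The sync in step 2 can be sketched with the `start_ingestion_job` and `get_ingestion_job` operations of the `bedrock-agent` client. This is an illustration under stated assumptions rather than the workshop's own method: `sync_data_source`, `poll_seconds`, and `max_polls` are hypothetical names, and `BDARAGUtils` may already provide its own ingestion helper. The client is passed in as a parameter so the function can be exercised without AWS access.

```python
import time

# Statuses after which an ingestion job no longer changes
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def sync_data_source(client, kb_id, data_source_id, poll_seconds=10, max_polls=180):
    """Start an ingestion job and poll until it reaches a terminal status."""
    job = client.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]
    job_id = job["ingestionJobId"]
    status = job["status"]

    polls = 0
    while status not in TERMINAL_STATUSES:
        if polls >= max_polls:
            raise TimeoutError(
                f"Ingestion job {job_id} still {status} after {max_polls} polls"
            )
        time.sleep(poll_seconds)
        status = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job_id,
        )["ingestionJob"]["status"]
        polls += 1
    return status


# Example usage with the client from the setup cell; data_source_id is an
# assumption and would be the data source ID reported when the KB was created:
# final_status = sync_data_source(bedrock_agent_client, kb_id, data_source_id)
# print(f"Ingestion finished with status: {final_status}")
```

A result of `FAILED` or `STOPPED` is returned rather than raised, so the caller can decide whether to inspect the job details or retry.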

Once its content has been synced, the knowledge base can power your multimodal RAG application, enabling search and generation across documents, images, audio, and video.