# Module 2 - Create Knowledge Base and Ingest Documents

----

This notebook provides sample code with step-by-step instructions for setting up an Amazon Bedrock Knowledge Base.

----

### Contents

1. *Setup*
2. *Create an S3 Data Source*
3. *Setup AOSS Vector Index and Configure BKB Access Permissions*
4. *Configure Amazon Bedrock Knowledge Base and Synchronize it with Data Source*
5. *Conclusions and Next Steps*

----

### Introduction

Foundation models (FMs) are powerful AI models trained on vast amounts of general-purpose data. However, many real-world applications require these models to generate responses grounded in domain-specific or proprietary information. Retrieval Augmented Generation (RAG) is a technique that enhances generative AI responses by retrieving relevant information from external data sources at query time.

Amazon Bedrock Knowledge Bases (BKBs) provide a fully managed capability to implement RAG-based solutions. By integrating your own data — such as documents, manuals, and other domain-specific sources of information — into a knowledge base, you can improve the accuracy, relevance, and usefulness of model-generated responses. When a user submits a query, Amazon Bedrock Knowledge Bases search across the available data sources, retrieve the most relevant content, and pass this information to the foundation model to generate a more informed response.
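
The retrieve-then-generate flow described above can be illustrated with a toy example. This is only a sketch of the idea, not how Bedrock implements it: the three-dimensional vectors, the sample texts, and the helper names (`cosine_similarity`, `retrieve`, `build_prompt`) are all hypothetical, and real embeddings would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, indexed_chunks, top_k=2):
    """Return the top_k chunk texts most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk["vector"]),
        reverse=True,
    )
    return [chunk["text"] for chunk in ranked[:top_k]]

def build_prompt(question, retrieved_texts):
    """Combine the retrieved context and the user question into one prompt."""
    context = "\n".join(f"- {text}" for text in retrieved_texts)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

# Toy 3-dimensional "embeddings" (made up for illustration only)
toy_index = [
    {"text": "AWS revenue grew this year.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Prime membership expanded.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Kindle sales were strong.", "vector": [0.0, 0.2, 0.9]},
]
toy_query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the question

top_chunks = retrieve(toy_query_vector, toy_index, top_k=1)
prompt = build_prompt("How did AWS perform?", top_chunks)
print(prompt)
```

In a real deployment the prompt would then be sent to a foundation model; a knowledge base handles the embedding, retrieval, and prompt assembly steps for you.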

![BKB illustration](./images/bkb_illustration.png)

This notebook demonstrates how to create an empty Amazon OpenSearch Serverless (AOSS) index, build an Amazon Bedrock Knowledge Base, and ingest documents into it for retrieval-augmented generation.

### Pre-requisites

Please make sure that you have enabled the following model access in _Amazon Bedrock Console_:
- `Amazon Titan Text Embeddings V2`.

**If you are running an AWS-facilitated event**, all other pre-requisites are satisfied and you can go to the next section.

**If you are running this notebook as a self-paced lab**, then please note that this notebook requires permissions to:
- create and delete *AWS IAM* roles
- create, update and delete *Amazon S3* buckets
- access *Amazon Bedrock*
- access *Amazon OpenSearch Serverless*

In particular, if running on *SageMaker Studio*, you should add the following managed policies to your SageMaker execution role:
- `IAMFullAccess`,
- `AWSLambda_FullAccess`,
- `AmazonS3FullAccess`,
- `AmazonBedrockFullAccess`,
- Custom policy for Amazon OpenSearch Serverless such as:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "aoss:*",
            "Resource": "*"
        }
    ]
}
```

----

### Pre-flight Checklist

Before starting this lab, let's verify your environment is properly configured.

In [None]:
# Interactive pre-flight checklist
import boto3
from botocore.exceptions import ClientError

checklist_status = {
    "prerequisites": {
        "AWS Credentials": False,
        "Bedrock Access": False,
        "Titan Embeddings V2 Access": False,
        "IAM Permissions": False,
        "Region Verification": False
    },
    "warnings": []
}

print("🔍 Running pre-flight checks...\n")

# Check 1: AWS Credentials
try:
    boto3.client('sts').get_caller_identity()
    checklist_status["prerequisites"]["AWS Credentials"] = True
    print("✅ AWS Credentials: Valid")
except Exception as e:
    checklist_status["warnings"].append("AWS credentials not configured")
    print("❌ AWS Credentials: Invalid or not configured")

# Check 2: Region
try:
    region = boto3.Session().region_name
    if region:
        checklist_status["prerequisites"]["Region Verification"] = True
        print(f"✅ Region: {region}")
    else:
        checklist_status["warnings"].append("AWS region not set")
        print("❌ Region: Not configured")
except Exception as e:
    print("❌ Region: Cannot determine")

# Check 3: Bedrock Access
try:
    bedrock_client = boto3.client('bedrock')
    bedrock_client.list_foundation_models(byProvider='amazon')
    checklist_status["prerequisites"]["Bedrock Access"] = True
    print("✅ Bedrock Access: Available")
except ClientError as e:
    if e.response['Error']['Code'] == 'AccessDeniedException':
        checklist_status["warnings"].append("No Bedrock access - check IAM permissions")
    print("❌ Bedrock Access: Denied or unavailable")
except Exception as e:
    print("❌ Bedrock Access: Error checking access")

# Check 4: Model Access - Titan Embeddings V2
try:
    bedrock_client = boto3.client('bedrock')
    response = bedrock_client.list_foundation_models(
        byProvider='amazon',
        byInferenceType='ON_DEMAND'
    )
    
    titan_v2_available = False
    for model in response.get('modelSummaries', []):
        if 'amazon.titan-embed-text-v2' in model['modelId']:
            titan_v2_available = True
            break
    
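    if titan_v2_available:
        checklist_status["prerequisites"]["Titan Embeddings V2 Access"] = True
        # list_foundation_models reports regional availability; it does not confirm that model access was granted
        print("✅ Titan Embeddings V2: Model is listed in this region (confirm model access in Bedrock Console)")
    else:
        checklist_status["warnings"].append("Titan Embeddings V2 not listed in this region - check region and model access in Bedrock Console")
        print("⚠️  Titan Embeddings V2: Not listed in this region; check your region and model access in Bedrock Console")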
except Exception as e:
    print("❌ Titan Embeddings V2: Cannot verify access")

# Check 5: IAM Permissions
try:
    iam_client = boto3.client('iam')
    iam_client.list_roles(MaxItems=1)
    checklist_status["prerequisites"]["IAM Permissions"] = True
    print("✅ IAM Permissions: Can list roles (role creation itself is not verified here)")
except ClientError as e:
    if e.response['Error']['Code'] == 'AccessDenied':
        checklist_status["warnings"].append("Limited IAM permissions - may need IAMFullAccess")
        print("⚠️  IAM Permissions: Limited (required for this lab)")
except Exception as e:
    print("❌ IAM Permissions: Cannot verify")

# Summary
print("\n" + "="*60)
print("CHECKLIST SUMMARY")
print("="*60)

all_passed = all(checklist_status["prerequisites"].values())

if all_passed:
    print("🎉 All checks passed! You're ready to proceed.\n")
else:
    print("⚠️  Some checks failed. Review the warnings below:\n")
    for warning in checklist_status["warnings"]:
        print(f"  - {warning}")
    print("\nYou may still proceed, but expect potential errors.\n")

print("Prerequisites status:")
for check, status in checklist_status["prerequisites"].items():
    status_icon = "✅" if status else "❌"
    print(f"  {status_icon} {check}")

## 1. Setup

### 1.1 Install and import the required libraries


In [None]:
%pip install --force-reinstall -q -r ./requirements.txt

In [None]:
# Restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
# Standard library imports
import os
import sys
import json
import time
import random

# Third-party imports
import boto3
from botocore.exceptions import ClientError

# Local imports
import utility

# Print SDK versions
print(f"Python version: {sys.version.split()[0]}")
print(f"Boto3 SDK version: {boto3.__version__}")

### 1.2 Initial setup for clients and global variables

In [None]:
# Create boto3 session and set AWS region
boto_session = boto3.Session()
aws_region = boto_session.region_name

# Create boto3 clients for AOSS, Bedrock, and S3 services
aoss_client = boto3.client('opensearchserverless')
bedrock_agent_client = boto3.client('bedrock-agent')
s3_client = boto3.client('s3')

# Define names for AOSS, Bedrock, and S3 resources
resource_suffix = random.randrange(100, 999)
s3_bucket_name = f"bedrock-kb-{aws_region}-{resource_suffix}"
aoss_collection_name = f"bedrock-kb-collection-{resource_suffix}"
aoss_index_name = f"bedrock-kb-index-{resource_suffix}"
bedrock_kb_name = f"bedrock-kb-{resource_suffix}"

# Set the Bedrock model to use for embedding generation
embedding_model_id = 'amazon.titan-embed-text-v2:0'
embedding_model_arn = f'arn:aws:bedrock:{aws_region}::foundation-model/{embedding_model_id}'
embedding_model_dim = 1024

# Some temporary local paths
local_data_dir = 'data'

# Print configurations
print("AWS Region:", aws_region)
print("S3 Bucket:", s3_bucket_name)
print("AOSS Collection Name:", aoss_collection_name)
print("Bedrock Knowledge Base Name:", bedrock_kb_name)

## 2. Create an S3 Data Source

Amazon Bedrock Knowledge Bases can connect to a variety of data sources for downstream RAG applications. Supported data sources include Amazon S3, Confluence, Microsoft SharePoint, Salesforce, Web Crawler, and custom data sources.

In this workshop, we will use Amazon S3 to store unstructured data — specifically, PDF files containing Amazon Shareholder Letters from different years. This S3 bucket will serve as the source of documents for our Knowledge Base. During the ingestion process, Bedrock will parse these documents, convert them into vector embeddings using an embedding model, and store them in a vector database for efficient retrieval during queries.
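
To make the ingestion step more concrete, here is a minimal sketch of the chunking idea: a long document is split into overlapping pieces before each piece is embedded. Bedrock performs chunking for you during ingestion and its strategy is configurable; the word-based `chunk_text` helper below, along with its `chunk_size` and `overlap` parameters, is a hypothetical illustration and not Bedrock's implementation.

```python
def chunk_text(text, chunk_size=300, overlap=60):
    """Split text into word-based chunks of up to chunk_size words,
    where consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny example: 10 "words", chunks of 4 words with 1 word of overlap
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of embedding some text twice.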

### 2.1 Create an S3 bucket, if needed

In [None]:
# Check if bucket exists, and if not create S3 bucket for KB data source
max_attempts = 5
attempt = 0

while attempt < max_attempts:
    try:
        s3_client.head_bucket(Bucket=s3_bucket_name)
        print(f"Bucket '{s3_bucket_name}' already exists..")
        break
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            # Bucket doesn't exist in our account, try to create it
            try:
                print(f"Creating bucket: '{s3_bucket_name}'..")
                if aws_region == 'us-east-1':
                    s3_client.create_bucket(Bucket=s3_bucket_name)
                else:
                    s3_client.create_bucket(
                        Bucket=s3_bucket_name,
                        CreateBucketConfiguration={'LocationConstraint': aws_region}
                    )
                print(f"Successfully created bucket: '{s3_bucket_name}'")
                break
            except ClientError as create_error:
                if create_error.response['Error']['Code'] == 'BucketAlreadyExists':
                    # Bucket name is taken globally, generate a new one
                    attempt += 1
                    if attempt < max_attempts:
                        # Generate a more unique bucket name
                        resource_suffix = random.randrange(100000, 999999)
                        timestamp_suffix = int(time.time())
                        s3_bucket_name = f"bedrock-kb-{aws_region}-{resource_suffix}-{timestamp_suffix}"
                        print(f"Bucket name taken globally, trying new name: '{s3_bucket_name}'")
                        continue
                    else:
                        raise Exception("Failed to create a unique bucket name after 5 attempts. Please try running the notebook again.")
                else:
                    # Re-raise other errors
                    raise create_error
        else:
            # Re-raise other errors (like permission issues)
            raise e

### 2.1.1 Troubleshooting Common Issues

As you progress through this workshop, you might encounter some common issues. Here's how to resolve them:

#### Issue 1: BucketAlreadyExists Error

**Symptom**:
```
ClientError: An error occurred (BucketAlreadyExists) when calling the CreateBucket operation: 
The requested bucket name is not available.
```

**Cause**: S3 bucket names are globally unique across ALL AWS accounts

**Solution**: The notebook handles this automatically by:
1. Catching the `BucketAlreadyExists` error
2. Generating a new random suffix
3. Retrying up to 5 times

#### Issue 2: Access Denied (IAM Permissions)

**Symptom**:
```
ClientError: An error occurred (AccessDenied) when calling the CreateRole operation:
User is not authorized to perform: iam:CreateRole
```

**Cause**: Your IAM user/role lacks necessary permissions

**Required Permissions**:
- `iam:CreateRole`, `iam:AttachRolePolicy`, `iam:CreatePolicy`
- `s3:CreateBucket`, `s3:PutObject`
- `aoss:CreateCollection`, `aoss:CreateSecurityPolicy`
- `bedrock:CreateKnowledgeBase`, `bedrock:InvokeModel`

**Solutions**:
1. **AWS Workshop**: Contact facilitator to grant permissions
2. **Self-paced**: Add managed policies to your IAM role:
   - `IAMFullAccess`
   - `AmazonS3FullAccess`
   - `AmazonBedrockFullAccess`
   - Custom policy for AOSS (see prerequisites)

#### Issue 3: Model Access Not Enabled

**Symptom**:
```
ValidationException: The model 'amazon.titan-embed-text-v2:0' is not available.
```

**Solution**:
1. Go to [Amazon Bedrock Console](https://console.aws.amazon.com/bedrock/)
2. Click "Model access" in left navigation
3. Enable "Titan Text Embeddings V2"
4. Wait 2-3 minutes for access to propagate
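
One way to confirm that model access actually works is to request a single embedding. The sketch below assumes a `bedrock-runtime` client and the Titan Text Embeddings V2 request format (`inputText`, `dimensions`, `normalize`); the `embed_text` helper is a hypothetical name introduced here, not part of the workshop's `utility` module.

```python
import json

def embed_text(runtime_client, text,
               model_id="amazon.titan-embed-text-v2:0", dimensions=1024):
    """Request one embedding and return it as a list of floats.

    runtime_client is expected to behave like boto3.client("bedrock-runtime").
    """
    response = runtime_client.invoke_model(
        modelId=model_id,
        body=json.dumps({
            "inputText": text,
            "dimensions": dimensions,
            "normalize": True,
        }),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

# Usage in the notebook (requires AWS credentials and model access):
# import boto3
# runtime_client = boto3.client("bedrock-runtime")
# vector = embed_text(runtime_client, "hello world")
# print(len(vector))
```

If the call raises a `ClientError`, the error code in the exception usually indicates whether the cause is missing model access, missing IAM permissions, or an unsupported region.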

#### Issue 4: Ingestion Job Stuck or Failed

**Possible Causes**:
- S3 bucket permissions incorrect
- PDF files corrupted
- Bedrock service quota exceeded

**Debugging**:
```python
response = bedrock_agent_client.get_ingestion_job(
    knowledgeBaseId=bedrock_kb_id,
    dataSourceId=bedrock_ds_id,
    ingestionJobId=bedrock_job_id
)
print("Status:", response['ingestionJob']['status'])
print("Failures:", response['ingestionJob'].get('failureReasons', []))
```

### 2.1.2 Understanding Workshop Costs

Let's estimate the AWS costs for this workshop module to help you plan accordingly.

#### Expected Costs (for completing all notebooks in Module 02)

| Service | Usage | Estimated Cost | Notes |
|---------|-------|---------------|-------|
| **Amazon S3** | ~30 MB storage<br/>28 PDFs | ~$0.01 | $0.023/GB/month (first 50TB) |
| **OpenSearch Serverless** | 2-3 hours active | ~$0.48 - $0.72 | $0.24/OCU-hour; estimate assumes 1 OCU billed (the billed minimum depends on the collection's redundancy settings) |
| **Bedrock Embeddings** | ~600 chunks<br/>Titan V2 | ~$0.02 | $0.00002/1K input tokens<br/>~1M tokens total |
| **Bedrock Inference** | ~20 queries<br/>Nova Micro | ~$0.01 | $0.035/1M input tokens<br/>$0.14/1M output tokens |
| **Data Transfer** | Minimal | ~$0.00 | Within same region |
| **IAM/CloudWatch** | Standard usage | $0.00 | No additional charges |
| **TOTAL** | | **~$0.52 - $0.76** | **< $1 for complete module** |

#### Cost Optimization Tips

1. **Complete Workshop in One Session**: OpenSearch Serverless charges by the hour; finishing quickly reduces costs
2. **Run Clean-up Notebook**: Delete resources when done (especially AOSS collection)
3. **Use Same Region**: Avoid data transfer charges by keeping all resources in one region
4. **Reuse Knowledge Base**: If experimenting, use the same KB instead of creating multiple

#### Cost Breakdown by Notebook

- **Notebook 1** (Setup & Ingestion): ~$0.50 (mostly AOSS creation + embeddings)
- **Notebook 2** (Managed RAG): ~$0.01 (4 queries)
- **Notebook 3** (Custom RAG): ~$0.01 (3 queries)
- **Notebook 4** (Clean-up): $0.00 (resource deletion)

**Important**: If you forget to run the clean-up notebook, the AOSS collection continues charging **$0.24 per OCU-hour** (~$5.76/day for each billed OCU) until deleted.
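
The estimates in this section can be reproduced with simple arithmetic. The sketch below uses the rates quoted above and makes two assumptions explicit: the AOSS figure corresponds to one billed OCU, and the embedding rate is billed per 1,000 tokens (which is what makes ~1M tokens come out at ~$0.02). The helper names are hypothetical, and current prices should be checked on the AWS pricing pages.

```python
def aoss_cost(hours, ocus=1, rate_per_ocu_hour=0.24):
    """OpenSearch Serverless compute cost for a number of active hours."""
    return hours * ocus * rate_per_ocu_hour

def embedding_cost(total_tokens, rate_per_1k_tokens=0.00002):
    """Embedding cost, assuming the rate applies per 1,000 input tokens."""
    return total_tokens / 1000 * rate_per_1k_tokens

# S3 storage (~$0.01) plus inference (~$0.01), taken from the table
fixed_costs = 0.01 + 0.01
embeddings = embedding_cost(1_000_000)

low_total = fixed_costs + embeddings + aoss_cost(hours=2)
high_total = fixed_costs + embeddings + aoss_cost(hours=3)
idle_per_day = aoss_cost(hours=24)

print(f"Module total: ${low_total:.2f} - ${high_total:.2f}")
print(f"Idle AOSS collection per day (1 OCU): ${idle_per_day:.2f}")
print(f"Idle AOSS collection per day (2 OCUs): ${aoss_cost(24, ocus=2):.2f}")
```

Because the AOSS term dominates, the number of billed OCUs and the hours the collection stays alive matter far more than the number of queries you run.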

### 2.1.3 Resource Naming Convention

You may have noticed we're using a `resource_suffix` random number in our resource names. Let's understand why:

#### Current Naming Pattern

```python
resource_suffix = random.randrange(100, 999)
s3_bucket_name = f"bedrock-kb-{aws_region}-{resource_suffix}"
aoss_collection_name = f"bedrock-kb-collection-{resource_suffix}"
bedrock_kb_name = f"bedrock-kb-{resource_suffix}"
```

#### Why Random Suffixes?

**Problem**: Some AWS resource names must be unique
- S3 bucket names are unique across ALL AWS accounts worldwide
- AOSS collection names must be unique within your account
- If 50 students run this workshop simultaneously, bucket-name conflicts are likely

**Solution**: A random suffix reduces naming collisions
- `resource_suffix = random.randrange(100, 999)` generates a 3-digit number (100 to 998)
- Each student gets their own suffix (e.g., 347, 892, 156)
- Probability that two given students draw the same suffix: ~0.1% (1 in 899). Probability that at least two of 50 students collide: ~75% (birthday problem), which is why the bucket-creation cell retries with a longer suffix when a name is already taken
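
Whether a 3-digit suffix is enough depends on how many names are drawn. A short birthday-problem calculation shows the difference between a single pair of students and a whole class; `collision_probability` is a hypothetical helper written for this illustration.

```python
def collision_probability(n_names, n_values):
    """Probability that at least two of n_names uniformly random draws
    from n_values possible suffixes are identical."""
    p_all_distinct = 1.0
    for i in range(n_names):
        p_all_distinct *= (n_values - i) / n_values
    return 1.0 - p_all_distinct

# random.randrange(100, 999) can return 100..998, i.e. 899 distinct values
n_values = len(range(100, 999))

p_two = collision_probability(2, n_values)
p_fifty = collision_probability(50, n_values)

print(f"Possible suffixes: {n_values}")
print(f"Collision among 2 students:  {p_two:.2%}")
print(f"Collision among 50 students: {p_fifty:.2%}")
```

Only globally unique names (S3 buckets) are affected by collisions between students in different accounts, and a wider range such as `random.randrange(100000, 999999)` makes such collisions far less likely.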

#### Resource Name Examples

Your resources will be named like:
```
S3 Bucket: bedrock-kb-us-east-1-742
AOSS Collection: bedrock-kb-collection-742
Knowledge Base: bedrock-kb-742
IAM Role: AmazonBedrockExecutionRoleForKnowledgeBase_742
```

#### Best Practices

- ✅ **Use suffixes in workshops/shared environments**
- ✅ **Include region in S3 bucket names** (aids troubleshooting)
- ✅ **Use descriptive prefixes** (`bedrock-kb-` indicates purpose)
- ❌ **Don't use hardcoded names** in reusable notebooks
- ❌ **Don't use special characters** (stick to lowercase letters, digits, and hyphens; S3 bucket names do not allow underscores or uppercase letters)

#### Cleanup Consideration

The random suffix ensures your resources don't conflict, but remember:
- You must track your specific suffix to clean up later
- We use Jupyter's `%store` magic to persist variables across notebooks
- Always run notebook 4 (clean-up) to avoid orphaned resources

### 2.2 Download data and upload to S3

In [None]:
import os
from urllib.request import urlretrieve

# List of shareholder-letter URLs (1997–2024)
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2025/ar/2024-Shareholder-Letter-Final.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2018-Letter-to-Shareholders.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/Amazon_Shareholder_Letter.pdf',  # 2017
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2016-Letter-to-Shareholders.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2015-Letter-to-Shareholders.PDF',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/AMAZON-2014-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2013-Letter-to-Shareholders.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2012-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/letter.PDF',  # 2011
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/117006_ltr_ltr2.pdf',  # 2010
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/AMZN_Shareholder-Letter-2009-(final).pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/Amazon_SH_Letter_2008.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2007letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2006.PDF',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/shareholderletter2005.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2004_shareholderLetter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2003_-Shareholder_-Letter041304.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2002_shareholderLetter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/2001_shareholderLetter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/00ar_letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/Shareholderletter99.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/Shareholderletter98.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/annual/Shareholderletter97.pdf',
]

# Corresponding output filenames
filenames = [
    'AMZN-2024-Shareholder-Letter.pdf',
    'AMZN-2023-Shareholder-Letter.pdf',
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter-and-1997-Reprint.pdf',
    'AMZN-2019-Shareholder-Letter.pdf',
    'AMZN-2018-Shareholder-Letter.pdf',
    'AMZN-2017-Shareholder-Letter.pdf',
    'AMZN-2016-Shareholder-Letter.pdf',
    'AMZN-2015-Shareholder-Letter.pdf',
    'AMZN-2014-Shareholder-Letter.pdf',
    'AMZN-2013-Shareholder-Letter.pdf',
    'AMZN-2012-Shareholder-Letter.pdf',
    'AMZN-2011-Shareholder-Letter.pdf',
    'AMZN-2010-Shareholder-Letter.pdf',
    'AMZN-2009-Shareholder-Letter.pdf',
    'AMZN-2008-Shareholder-Letter.pdf',
    'AMZN-2007-Shareholder-Letter.pdf',
    'AMZN-2006-Shareholder-Letter.pdf',
    'AMZN-2005-Shareholder-Letter.pdf',
    'AMZN-2004-Shareholder-Letter.pdf',
    'AMZN-2003-Shareholder-Letter.pdf',
    'AMZN-2002-Shareholder-Letter.pdf',
    'AMZN-2001-Shareholder-Letter.pdf',
    'AMZN-2000-Shareholder-Letter.pdf',
    'AMZN-1999-Shareholder-Letter.pdf',
    'AMZN-1998-Shareholder-Letter.pdf',
    'AMZN-1997-Shareholder-Letter.pdf',
]

# Ensure the output directory exists
os.makedirs(local_data_dir, exist_ok=True)

# Download each file
for url, filename in zip(urls, filenames):
    file_path = os.path.join(local_data_dir, filename)
    urlretrieve(url, file_path)
    print(f"Downloaded: '{filename}' to '{local_data_dir}'.")


In [None]:
for root, _, files in os.walk(local_data_dir):
    for file in files:
        full_path = os.path.join(root, file)
        s3_client.upload_file(full_path, s3_bucket_name, file)
        print(f"Uploaded: '{file}' to 's3://{s3_bucket_name}'..")

### 2.3 About the Amazon Shareholder Letters Dataset

We've just downloaded 28 years of Amazon shareholder letters (1997-2024). Let's understand why this is an excellent dataset for RAG demonstrations.

#### Dataset Characteristics

**Volume & Timespan**:
- 28 PDF documents spanning 28 years
- ~150-200 pages total content
- File size: ~30 MB total
- Multiple CEO voices: Jeff Bezos (1997-2020), Andy Jassy (2021-2024)

**Content Type**:
- Annual business performance reviews
- Strategic vision and long-term thinking
- Financial metrics and growth statistics
- Customer obsession principles
- Technology innovation discussions (AWS, Alexa, Prime)

#### Why This Dataset is Valuable for RAG

1. **Temporal Dimension**: Tracks Amazon's evolution from bookstore to tech giant
2. **Rich Context**: Business strategy, financial data, historical events
3. **Real-World Complexity**: Mixed content (narratives, tables, financial data)
4. **Educational Value**: Publicly available, well-written, relevant

#### Example Queries This Dataset Supports

- **Factual**: "What was Amazon's net income in 2023?"
- **Conceptual**: "What does Jeff Bezos mean by 'Day 1 thinking'?"
- **Temporal**: "How has R&D investment changed from 1997 to 2024?"
- **Synthesis**: "Summarize Amazon's AI strategy evolution"

#### Dataset Limitations

⚠️ **Be Aware**:
- Some letters include reprints (2020 includes 1997)
- Financial data requires exact year specification
- CEO transition in 2021 may affect response consistency

## 3. Setup AOSS Vector Index and Configure BKB Access Permissions

In this section, we’ll create a vector index using Amazon OpenSearch Serverless (AOSS) and configure the necessary access permissions for the Bedrock Knowledge Base (BKB) that we’ll set up later. AOSS provides a fully managed, serverless solution for running vector search workloads at billion-vector scale. It automatically handles resource scaling and eliminates the need for cluster management, while delivering low-latency, millisecond response times with pay-per-use pricing.

While this example uses AOSS, it’s worth noting that Bedrock Knowledge Bases also supports other popular vector stores, including Amazon Aurora PostgreSQL with pgvector, Pinecone, Redis Enterprise Cloud, and MongoDB, among others.

### 3.1 Create IAM Role with Necessary Permissions for Bedrock Knowledge Base

Let's first create an IAM role with all the necessary policies and permissions to allow BKB to execute operations, such as invoking Bedrock FMs and reading data from an S3 bucket. We will use a helper function for this.
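
The helper's source is not shown in this notebook. As a rough sketch of what such a role typically needs, the example below builds a trust policy that lets the Bedrock service assume the role and a permissions policy for invoking the embedding model and reading the bucket. The function name, the example bucket name, and the exact set of actions are assumptions for illustration; the real helper may differ.

```python
import json

def build_kb_role_documents(bucket_name, embedding_model_arn):
    """Return (trust_policy, permissions_policy) dicts for a KB execution role."""
    # Allows the Bedrock service to assume the role
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    # Allows invoking the embedding model and reading documents from S3
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": [embedding_model_arn],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            },
        ],
    }
    return trust_policy, permissions_policy

# Hypothetical values, for illustration only
trust, permissions = build_kb_role_documents(
    bucket_name="example-bucket",
    embedding_model_arn="arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
)
print(json.dumps(trust, indent=2))
print(json.dumps(permissions, indent=2))
```

Scoping the resources to one bucket and one model keeps the role closer to least privilege than a broad managed policy would.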

In [None]:
bedrock_kb_execution_role = utility.create_bedrock_execution_role(bucket_name=s3_bucket_name)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

print("Created KB execution role with ARN:", bedrock_kb_execution_role_arn)

### 3.2 Create AOSS Policies and Vector Collection

Next we need to create and attach three key policies for securing and managing access to the AOSS collection: an encryption policy, a network access policy, and a data access policy. These policies ensure proper encryption, network security, and the necessary permissions for creating, reading, updating, and deleting collection items and indexes. This step is essential for configuring the OpenSearch collection to interact with BKB securely and efficiently (you can read more about AOSS collections [here](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html)). We will use another helper function for this.

⚠️ **Note:** _to keep setup overhead to a minimum, this example **allows public internet access** to the OpenSearch Serverless collection. For production environments, we strongly recommend using a private connection between your VPC and Amazon OpenSearch Serverless resources via a VPC endpoint, as described [here](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vpc.html)._
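
To give a sense of what the three policies contain, the sketch below builds example policy documents as plain Python structures. It follows the general AOSS policy JSON format, but the function name, the chosen permissions, and the example ARN are assumptions; the helper used in this notebook may build them differently.

```python
import json

def build_aoss_policy_documents(collection_name, principal_arns, allow_public=True):
    """Return (encryption, network, data_access) policy documents."""
    collection_resource = f"collection/{collection_name}"

    # Encryption policy: encrypt the collection with an AWS-owned key
    encryption_policy = {
        "Rules": [{"Resource": [collection_resource], "ResourceType": "collection"}],
        "AWSOwnedKey": True,
    }
    # Network policy: controls whether the endpoint is reachable publicly
    network_policy = [{
        "Rules": [{"Resource": [collection_resource], "ResourceType": "collection"}],
        "AllowFromPublic": allow_public,
    }]
    # Data access policy: what the listed principals may do with the data
    data_access_policy = [{
        "Rules": [
            {
                "Resource": [collection_resource],
                "ResourceType": "collection",
                "Permission": [
                    "aoss:CreateCollectionItems",
                    "aoss:DeleteCollectionItems",
                    "aoss:UpdateCollectionItems",
                    "aoss:DescribeCollectionItems",
                ],
            },
            {
                "Resource": [f"index/{collection_name}/*"],
                "ResourceType": "index",
                "Permission": [
                    "aoss:CreateIndex",
                    "aoss:DeleteIndex",
                    "aoss:UpdateIndex",
                    "aoss:DescribeIndex",
                    "aoss:ReadDocument",
                    "aoss:WriteDocument",
                ],
            },
        ],
        "Principal": list(principal_arns),
    }]
    return encryption_policy, network_policy, data_access_policy

# Hypothetical collection name and role ARN, for illustration only
enc, net, data = build_aoss_policy_documents(
    "example-collection",
    ["arn:aws:iam::123456789012:role/ExampleRole"],
)
# The documents are passed to the AOSS APIs as JSON strings
print(json.dumps(data, indent=2))
```

With boto3, documents like these would be serialized with `json.dumps` and passed to `create_security_policy` (types `encryption` and `network`) and `create_access_policy` (type `data`).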

In [None]:
# Create AOSS policies for the new vector collection
aoss_encryption_policy, aoss_network_policy, aoss_access_policy = utility.create_policies_in_oss(
    vector_store_name=aoss_collection_name,
    aoss_client=aoss_client,
    bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn)

print("Created encryption policy with name:", aoss_encryption_policy['securityPolicyDetail']['name'])
print("Created network policy with name:", aoss_network_policy['securityPolicyDetail']['name'])
print("Created access policy with name:", aoss_access_policy['accessPolicyDetail']['name'])

With all the necessary policies in place, let's proceed to actually creating a new AOSS collection. Please note that this can take a **few minutes to complete**. While you wait, you may want to [explore the AOSS Console](https://console.aws.amazon.com/aos/home?#opensearch/collections), where you will see your AOSS collection being created.

In [None]:
# Request to create AOSS collection
aoss_collection = aoss_client.create_collection(name=aoss_collection_name, type='VECTORSEARCH')

# Wait until the collection becomes active, printing background information in the meantime
print("Creating Amazon OpenSearch Serverless collection...")
print("\n" + "="*70)
print("WHAT'S HAPPENING: AOSS Collection Provisioning")
print("="*70)
print("""
While we wait (2-3 minutes), here's what's happening behind the scenes:

1. Resource Allocation:
   - AWS is provisioning compute units (OCUs) for your collection
   - A baseline number of OCUs is allocated, split between indexing and search
   - Each OCU provides 6 GB memory + corresponding compute

2. Security Configuration:
   - Applying encryption policy (data encrypted at rest with AWS KMS)
   - Enforcing network policy (public access enabled for workshop)
   - Activating data access policy (permissions for Bedrock + your IAM user)

3. Collection Initialization:
   - Setting up OpenSearch cluster infrastructure
   - Preparing vector search engine (FAISS-based HNSW algorithm)
   - Creating API endpoints for data plane operations

This serverless architecture means:
✅ No cluster management needed
✅ Automatic scaling based on workload
✅ Pay only for resources used
❌ Initial provisioning latency (2-3 min)
""")

print("Progress: ", end='', flush=True)
start_time = time.time()
check_count = 0

while True:
    response = aoss_client.list_collections(collectionFilters={'name': aoss_collection_name})
    status = response['collectionSummaries'][0]['status']
    if status in ('ACTIVE', 'FAILED'):
        elapsed = time.time() - start_time
        print(f" Done! ({elapsed:.1f} seconds)")
        break
    print('█', end='', flush=True)
    check_count += 1
    
    # Educational milestones
    if check_count == 12:  # ~60 seconds
        print("\n  ⏱️  1 minute elapsed - OCUs being allocated...")
        print("Progress: ", end='', flush=True)
    elif check_count == 24:  # ~120 seconds
        print("\n  ⏱️  2 minutes elapsed - Security policies applying...")
        print("Progress: ", end='', flush=True)
    
    time.sleep(5)

if status == 'ACTIVE':
    print("\n✅ AOSS collection is now ACTIVE and ready for indexing!")
    print(f"   Collection ID: {response['collectionSummaries'][0]['id']}")
    print(f"   Collection ARN: {response['collectionSummaries'][0]['arn']}")
else:
    print(f"\n❌ Collection creation FAILED with status: {status}")

print("An AOSS collection created:", json.dumps(response['collectionSummaries'], indent=2))

### 3.3 Grant BKB Access to AOSS Data

In this step, we create an IAM policy that grants BKB the permissions it needs to access our AOSS collection. We then attach this policy to the Bedrock execution role we created earlier, allowing BKB to securely access AOSS data when generating responses. We will use a helper function once again.

In [None]:
aoss_policy_arn = utility.create_oss_policy_attach_bedrock_execution_role(
    collection_id=aoss_collection['createCollectionDetail']['id'],
    bedrock_kb_execution_role=bedrock_kb_execution_role)

print("Waiting 60 sec for data access rules to be enforced: ", end='')
for _ in range(12):  # 12 * 5 sec = 60 sec
    print('█', end='', flush=True)
    time.sleep(5)
print(" done.")

print("Created and attached policy with ARN:", aoss_policy_arn)

### 3.4 Create an AOSS Vector Index

Now that we have all necessary access permissions in place, we can create a vector index in the AOSS collection we created previously.



In [None]:
from requests_aws4auth import AWS4Auth
from opensearchpy import OpenSearch, RequestsHttpConnection

# Use default credential configuration for authentication
credentials = boto_session.get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    aws_region,
    'aoss',
    session_token=credentials.token)

# Construct AOSS endpoint host
host = f"{aoss_collection['createCollectionDetail']['id']}.{aws_region}.aoss.amazonaws.com"

# Build the OpenSearch client
os_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

We first need to define the index with the desired configuration: the number of shards and replicas, the vector embedding dimension, the vector search engine (we are using FAISS here), and the names and types of any other fields the index needs:

In [None]:
# Define the configuration for the AOSS vector index
index_definition = {
    "settings": {
        "index.knn": "true",
        "number_of_shards": 1,
        "knn.algo_param.ef_search": 512,
        "number_of_replicas": 0,
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": embedding_model_dim,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2"
                },
            },
            "text": {
                "type": "text"
            },
            "text-metadata": {
                "type": "text"
            }
        }
    }
}

# Create an OpenSearch index
response = os_client.indices.create(index=aoss_index_name, body=index_definition)

# Waiting for index creation to propagate
print("Waiting 30 sec for index update to propagate: ", end='')
for _ in range(6):  # 6 * 5 sec = 30 sec
    print('█', end='', flush=True)
    time.sleep(5)
print(" done.")

print("A new AOSS index created:", json.dumps(response, indent=2))

## 4. Configure Amazon Bedrock Knowledge Base and Synchronize it with Data Source

In this section, we’ll create an Amazon Bedrock Knowledge Base (BKB) and connect it to the data that will be stored in our newly created AOSS vector index.

### 4.1 Create a Bedrock Knowledge Base

Setting up a Knowledge Base involves providing two key configurations:
1. **Storage Configuration** tells Bedrock where to store the generated vector embeddings by specifying the target vector store and providing the necessary connection details (here, we use the AOSS vector index we created earlier),
2. **Knowledge Base Configuration** defines how Bedrock should generate vector embeddings from your data by specifying the embedding model to use (`Titan Text Embeddings V2` in this sample), along with any additional settings required for handling multimodal content.

In [None]:
# Knowledge Base Configuration
knowledge_base_config = {
    "type": "VECTOR",
    "vectorKnowledgeBaseConfiguration": {
        "embeddingModelArn": embedding_model_arn
    }
}

# Storage Configuration  
storage_config = {
    "type": "OPENSEARCH_SERVERLESS",
    "opensearchServerlessConfiguration": {
        "collectionArn": aoss_collection['createCollectionDetail']['arn'],
        "vectorIndexName": aoss_index_name,
        "fieldMapping": {
            "vectorField": "vector",
            "textField": "text",
            "metadataField": "text-metadata"
        }
    }
}

# Check if Knowledge Base already exists
try:
    # List existing knowledge bases to see if ours already exists
    existing_kbs = bedrock_agent_client.list_knowledge_bases()
    existing_kb = None
    
    for kb in existing_kbs['knowledgeBaseSummaries']:
        if kb['name'] == bedrock_kb_name:
            existing_kb = kb
            break
    
    if existing_kb:
        print(f"Knowledge Base '{bedrock_kb_name}' already exists with ID: {existing_kb['knowledgeBaseId']}")
        print("Using existing Knowledge Base...")
        bedrock_kb_id = existing_kb['knowledgeBaseId']
        
        # Check if it's active
        response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=bedrock_kb_id)
        if response['knowledgeBase']['status'] == 'ACTIVE':
            print("Existing Knowledge Base is already active.")
        else:
            print("Waiting for existing Knowledge Base to become active: ", end='')
            while True:
                response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=bedrock_kb_id)
                if response['knowledgeBase']['status'] == 'ACTIVE':
                    print(" done.")
                    break
                print('█', end='', flush=True)
                time.sleep(5)
    else:
        # Create new Knowledge Base
        response = bedrock_agent_client.create_knowledge_base(
            name=bedrock_kb_name,
            description="Amazon shareholder letter knowledge base.",
            roleArn=bedrock_kb_execution_role_arn,
            knowledgeBaseConfiguration=knowledge_base_config,
            storageConfiguration=storage_config)

        bedrock_kb_id = response['knowledgeBase']['knowledgeBaseId']

        print("Waiting until BKB becomes active: ", end='')
        while True:
            response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=bedrock_kb_id)
            if response['knowledgeBase']['status'] == 'ACTIVE':
                print(" done.")
                break
            print('█', end='', flush=True)
            time.sleep(5)

        print("A new Bedrock Knowledge Base created with ID:", bedrock_kb_id)

except Exception as e:
    print(f"Error: {e}")
    raise
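
The lookup above reads only the first page returned by `list_knowledge_bases`, so in an account with many Knowledge Bases the name match could be missed. A minimal sketch of a paginated lookup follows, assuming the `bedrock-agent` client exposes the standard boto3 `list_knowledge_bases` paginator. The helper name `find_knowledge_base_by_name` is hypothetical and not part of the original notebook.

```python
def find_knowledge_base_by_name(client, name):
    """Return the summary dict of the first Knowledge Base called `name`, or None.

    `client` is expected to be a boto3 `bedrock-agent` client (or any object
    exposing a compatible `get_paginator` method).
    """
    paginator = client.get_paginator("list_knowledge_bases")
    for page in paginator.paginate():
        for summary in page.get("knowledgeBaseSummaries", []):
            if summary["name"] == name:
                return summary
    return None


# Possible usage inside the notebook (names taken from the cells above):
# existing_kb = find_knowledge_base_by_name(bedrock_agent_client, bedrock_kb_name)
```

The helper returns the same summary structure the cell above iterates over, so it could replace the manual `for kb in existing_kbs[...]` loop without other changes.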

In [None]:
# Verification: Confirm Knowledge Base configuration
print("🔍 Verifying Knowledge Base Configuration...\n")

try:
    kb_details = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=bedrock_kb_id)
    kb = kb_details['knowledgeBase']
    
    # Check status
    assert kb['status'] == 'ACTIVE', f"Expected ACTIVE, got {kb['status']}"
    print("✅ Knowledge Base Status: ACTIVE")
    
    # Check embedding model
    expected_model = 'amazon.titan-embed-text-v2:0'
    actual_model = kb['knowledgeBaseConfiguration']['vectorKnowledgeBaseConfiguration']['embeddingModelArn']
    assert expected_model in actual_model, f"Wrong embedding model: {actual_model}"
    print(f"✅ Embedding Model: {expected_model}")
    
    # Check vector store type
    storage_type = kb['storageConfiguration']['type']
    assert storage_type == 'OPENSEARCH_SERVERLESS', f"Wrong storage type: {storage_type}"
    print(f"✅ Vector Store: {storage_type}")
    
    # Check AOSS configuration
    aoss_config = kb['storageConfiguration']['opensearchServerlessConfiguration']
    print(f"✅ AOSS Collection ARN: {aoss_config['collectionArn']}")
    print(f"✅ Vector Index Name: {aoss_config['vectorIndexName']}")
    
    # Check IAM role
    print(f"✅ Execution Role ARN: {kb['roleArn']}")
    
    print("\n🎉 All verifications passed! Knowledge Base is properly configured.")
    
except AssertionError as e:
    print(f"❌ Verification failed: {e}")
except Exception as e:
    print(f"❌ Error during verification: {e}")

Let's call a Bedrock API to get the information about our newly created Knowledge Base:

In [None]:
response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=bedrock_kb_id)

print(json.dumps(response['knowledgeBase'], indent=2, default=str))

### 4.1.5 Understanding Chunking Strategies

Before we connect our Knowledge Base to a data source, it's crucial to understand **chunking strategies** — one of the most important configuration decisions in RAG systems.

**What is Chunking?**

When documents are ingested into a Knowledge Base, they cannot be stored as single, monolithic texts. Instead, they must be split into smaller "chunks" that can be:
- Converted into vector embeddings
- Retrieved independently based on semantic similarity
- Injected into LLM prompts without exceeding context limits

The chunking strategy determines **how** this splitting occurs, which directly impacts:
- **Retrieval accuracy**: Whether the right information is found
- **Response quality**: Whether the LLM has sufficient context
- **Cost**: Number of chunks = number of embeddings stored
- **Latency**: Smaller chunks = faster retrieval but potentially less context

#### Available Chunking Strategies in Amazon Bedrock Knowledge Bases

Amazon Bedrock supports four chunking strategies:

| Strategy | Description | Best For | Trade-offs |
|----------|-------------|----------|------------|
| **FIXED_SIZE** | Splits text into equal-sized chunks based on token count with configurable overlap | General-purpose documents, consistent structure | May split semantic units (sentences/paragraphs) |
| **NONE** | No chunking; each document becomes one chunk | Short documents (<512 tokens), pre-chunked data | Not suitable for long documents; may exceed embedding limits |
| **HIERARCHICAL** | Creates parent-child chunk relationships; retrieves child, returns parent context | Documents with clear structure (sections, chapters) | More complex, requires structured content |
| **SEMANTIC** | Uses AI to identify natural semantic boundaries; chunks based on topic shifts | Unstructured content, narratives, varied topics | Higher ingestion cost, slower processing |

**Current Configuration**: In this workshop, we use **FIXED_SIZE** with:
- `maxTokens: 512` - Each chunk contains up to 512 tokens
- `overlapPercentage: 20` - 20% overlap between adjacent chunks (prevents context loss at boundaries)

#### Why Use Overlap?

Consider this example from an Amazon shareholder letter:

**Without Overlap (0%)**:
- Chunk 1: "...we launched Amazon Prime in 2005. This program transformed customer expectations"
- Chunk 2: "and created unprecedented loyalty. Members shop more frequently..."

**With 20% Overlap**:
- Chunk 1: "...we launched Amazon Prime in 2005. This program transformed customer expectations and created unprecedented loyalty."
- Chunk 2: "This program transformed customer expectations and created unprecedented loyalty. Members shop more frequently..."

**Query**: "How did Amazon Prime affect customer loyalty?"

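With overlap, the sentence "This program transformed customer expectations and created unprecedented loyalty" stays intact instead of being split across the chunk boundary, and Chunk 1 now contains both "Amazon Prime" and the loyalty statement. A query about Prime and customer loyalty can therefore match a single chunk that holds the full context, which improves retrieval accuracy.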
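
The effect of overlap can be sketched with a simple sliding window. This is only an illustration: it treats whitespace-separated words as tokens and uses a tiny window of 10 tokens, whereas the managed chunker works on real model tokens and may split differently. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, max_tokens, overlap_pct):
    """Split `tokens` into windows of up to `max_tokens`, sharing `overlap_pct` percent."""
    overlap = int(max_tokens * overlap_pct / 100)
    step = max(1, max_tokens - overlap)  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Words stand in for tokens here (an assumption made for readability)
tokens = ("we launched Amazon Prime in 2005. This program transformed customer "
          "expectations and created unprecedented loyalty. Members shop more frequently").split()

for pct in (0, 20):
    print(f"--- overlap {pct}% ---")
    for i, chunk in enumerate(chunk_with_overlap(tokens, max_tokens=10, overlap_pct=pct), start=1):
        print(f"Chunk {i}: {' '.join(chunk)}")
```

With a non-zero overlap, the last tokens of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.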

#### Practical Comparison: Shareholder Letters

Let's analyze how different strategies would handle our Amazon shareholder letters:

**Example Text** (from 2024 letter):
> "Our AWS business continued its strong momentum in Q4, with revenue reaching $26.3 billion, up 13% year-over-year. The AI and machine learning services within AWS saw particularly strong adoption, with customers leveraging Amazon Bedrock and SageMaker for their generative AI initiatives."

**FIXED_SIZE (512 tokens, 20% overlap)**:
- ✅ Predictable chunk sizes
- ✅ Works well for financial data with numbers
- ⚠️ Might split "AWS revenue" from "year-over-year comparison"

**SEMANTIC**:
- ✅ Keeps "AWS revenue" and "year-over-year growth" together
- ✅ Groups related services (Bedrock, SageMaker) in one chunk
- ⚠️ Variable chunk sizes might affect consistency

**HIERARCHICAL**:
- ✅ Could treat each year's letter as parent, each section as child
- ✅ Retrieves specific metric but returns full section context
- ⚠️ Requires structured PDFs with clear section markers

**NONE**:
- ❌ Each full shareholder letter (5-10 pages) becomes one chunk
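- ❌ Can exceed the embedding model's input limit (Titan Text Embeddings V2 accepts up to 8,192 tokens), and a single vector per letter gives very coarse retrieval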
- ❌ Not recommended for this use case

In [None]:
# DEMONSTRATION: How different chunking strategies affect the same text
# Note: This is a simulation for educational purposes

sample_text = """
Amazon Web Services (AWS) continued to show strong performance throughout 2024. 
The cloud infrastructure business grew revenue by 13% year-over-year, reaching $26.3 billion in Q4 alone.

Our AI and machine learning services saw exceptional adoption. Amazon Bedrock, our fully managed 
generative AI service, enabled thousands of customers to build innovative applications. SageMaker 
customers increased by 45% as organizations accelerated their ML initiatives.

Looking forward, we remain focused on three key areas: expanding our infrastructure footprint, 
enhancing our AI/ML capabilities, and deepening customer relationships through innovation.
"""

def simulate_fixed_size_chunking(text, max_tokens=50, overlap_pct=20):
    """Simulates FIXED_SIZE chunking strategy"""
    # Simplified: using words as proxy for tokens (1 word ≈ 1.3 tokens)
    words = text.split()
    max_words = int(max_tokens / 1.3)
    overlap_words = int(max_words * overlap_pct / 100)
    # Each new chunk starts (chunk_size - overlap) words after the previous one;
    # max(1, ...) guards against a zero or negative stride
    step = max(1, max_words - overlap_words)

    chunks = []
    for chunk_num, start in enumerate(range(0, len(words), step), start=1):
        chunk_end = min(start + max_words, len(words))
        chunk = ' '.join(words[start:chunk_end])
        chunks.append(f"Chunk {chunk_num}: {chunk}...")

        # Stop once a chunk reaches the end of the text; otherwise the loop
        # would emit a trailing chunk made up only of already-covered words
        if chunk_end == len(words):
            break

    return chunks

def simulate_semantic_chunking(text):
    """Simulates SEMANTIC chunking strategy"""
    # Simplified: splits on paragraph boundaries (semantic breaks)
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks = [f"Semantic Chunk {i+1}: {p}" for i, p in enumerate(paragraphs)]
    return chunks

def simulate_no_chunking(text):
    """Simulates NONE strategy"""
    return [f"Single Chunk (NONE strategy): {text}"]

# Compare strategies
print("="*80)
print("FIXED_SIZE Chunking (50 tokens, 20% overlap)")
print("="*80)
for chunk in simulate_fixed_size_chunking(sample_text):
    print(f"\n{chunk}\n")

print("\n" + "="*80)
print("SEMANTIC Chunking (natural boundaries)")
print("="*80)
for chunk in simulate_semantic_chunking(sample_text):
    print(f"\n{chunk}\n")

print("\n" + "="*80)
print("NONE Strategy (no chunking)")
print("="*80)
for chunk in simulate_no_chunking(sample_text):
    print(f"\n{chunk}\n")

print("\n" + "="*80)
print("Analysis:")
print("="*80)
print("""
FIXED_SIZE:
  - Created multiple overlapping chunks
  - Consistent size enables predictable retrieval
  - Overlap preserves context at boundaries

SEMANTIC:
  - Aligned with natural topic breaks
  - Variable sizes (AWS performance, AI/ML services, Future focus)
  - Each chunk is semantically complete

NONE:
  - Entire text as single chunk
  - Only suitable if text is already small (<512 tokens)
""")

#### Chunking Best Practices

**For Financial Documents (like shareholder letters)**:
- ✅ **Recommended**: FIXED_SIZE with 512 tokens, 20% overlap
- **Rationale**: Financial data benefits from consistent chunk sizes; overlap ensures metrics stay with their context
- **Alternative**: SEMANTIC for narrative sections, but FIXED_SIZE for tables/metrics

**General Guidelines**:

1. **Chunk Size Selection**:
   - **300-512 tokens**: Optimal for most use cases (balances context vs. precision)
   - **Less than 300**: Risk of insufficient context
   - **More than 800**: Risk of retrieving irrelevant information

2. **Overlap Percentage**:
   - **15-20%**: Standard overlap for general content
   - **30-40%**: Higher overlap for dense technical content
   - **0-10%**: Lower overlap for clearly structured documents

3. **Strategy Selection Decision Tree**:
   ```
   Is your content pre-chunked? (e.g., Q&A pairs)
       YES → Use NONE
       NO → Continue
   
   Does your content have clear hierarchical structure? (e.g., documentation with sections)
       YES → Consider HIERARCHICAL
       NO → Continue
   
   Is your content highly unstructured with varying topics? (e.g., emails, transcripts)
       YES → Consider SEMANTIC
       NO → Use FIXED_SIZE (safest default)
   ```

4. **Cost Considerations**:
   - Smaller chunks = More embeddings = Higher storage cost
   - Balance chunk size with retrieval quality needs
   - For our 28 shareholder letters: ~500-700 chunks expected with FIXED_SIZE
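
To get a feel for how chunk size and overlap drive the number of stored embeddings, the arithmetic can be sketched as below. This is a rough estimate under a simplifying assumption: every chunk is filled to `maxTokens` and each new chunk advances by `maxTokens` minus the overlap. The managed chunker may produce a different count, and the function name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(doc_tokens, max_tokens=512, overlap_pct=20):
    """Estimate how many FIXED_SIZE chunks a document of `doc_tokens` tokens yields."""
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= max_tokens:
        return 1  # the whole document fits into a single chunk
    overlap = int(max_tokens * overlap_pct / 100)
    step = max(1, max_tokens - overlap)  # tokens of new text per additional chunk
    return 1 + math.ceil((doc_tokens - max_tokens) / step)


# A 10,000-token document with the workshop settings (512 tokens, 20% overlap)
print(estimate_chunk_count(10_000))  # → 25
# The same document without overlap needs fewer chunks
print(estimate_chunk_count(10_000, overlap_pct=0))  # → 20
```

Raising the overlap or lowering `maxTokens` increases the estimate, which is the storage-cost trade-off described in the list above.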

#### What Happens Next?

When we create our data source in the next cell, Bedrock will:
1. Read each PDF from S3
2. Extract text content
3. Apply the FIXED_SIZE chunking strategy (512 tokens, 20% overlap)
4. Generate embeddings for each chunk using Titan Text Embeddings V2
5. Store embeddings in our AOSS vector index

Let's proceed to configure the data source!

#### 🎯 Learning Checkpoint: Chunking Strategies

Before proceeding, test your understanding:

**Question 1**: If a document is 2,000 tokens long and we use FIXED_SIZE chunking with maxTokens=512 and overlapPercentage=20, approximately how many chunks will be created?

<details>
<summary>Click to see answer</summary>

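**Answer**: Approximately 5 chunks

**Explanation**:
- Each chunk holds up to 512 tokens
- Overlap is 20% of 512 ≈ 102 tokens
- Effective advancement per chunk = 512 - 102 = 410 tokens
- After the first chunk, 2000 - 512 = 1488 tokens remain; 1488 / 410 ≈ 3.6, so 4 more chunks are needed, for 5 in total (the last one is partial)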

Calculation:
- Chunk 1: tokens 0-512
- Chunk 2: tokens 410-922 (102 overlap from chunk 1)
- Chunk 3: tokens 820-1332
- Chunk 4: tokens 1230-1742
- Chunk 5: tokens 1640-2000 (partial)

</details>

**Question 2**: Why might SEMANTIC chunking be slower than FIXED_SIZE during ingestion?

<details>
<summary>Click to see answer</summary>

**Answer**: SEMANTIC chunking uses AI models to identify topic boundaries, requiring additional inference calls.

**Explanation**:
- FIXED_SIZE: Simple token counting (fast)
- SEMANTIC: Analyzes text semantics using ML models (slower but more intelligent)
- Trade-off: Better semantic coherence vs. longer ingestion time

</details>

**Question 3**: When would you choose NONE as your chunking strategy?

<details>
<summary>Click to see answer</summary>

**Answer**: When documents are already small (<512 tokens) or pre-chunked into semantic units.

**Examples**:
- FAQ pairs where each Q&A is already a complete unit
- Product descriptions (50-200 tokens each)
- Pre-processed data where you've already done custom chunking
- Email subjects and bodies stored separately

**Warning**: NONE with large documents will fail if they exceed embedding model limits.

</details>

### 4.2 Connect BKB to a Data Source

With our Knowledge Base in place, the next step is to connect it to a data source. This involves two key actions:

1. **Create a data source for the Knowledge Base** that will point to the location of our raw data (in this case, S3),
2. **Define how that data should be processed and ingested into the vector store** — for example, by specifying a chunking configuration that controls how large each text fragment should be when generating vector embeddings for retrieval.

In [None]:
# Data Source Configuration
data_source_config = {
        "type": "S3",
        "s3Configuration":{
            "bucketArn": f"arn:aws:s3:::{s3_bucket_name}",
            # "inclusionPrefixes": ["your/prefix/"]   # optional: restrict ingestion to objects under specific S3 key prefixes (placeholder value; not a wildcard pattern)
        }
    }

# Vector Ingestion Configuration
vector_ingestion_config = {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 512,
                "overlapPercentage": 20
            }
        }
    }

response = bedrock_agent_client.create_data_source(
    name=bedrock_kb_name,
    description="Amazon shareholder letter knowledge base.",
    knowledgeBaseId=bedrock_kb_id,
    dataSourceConfiguration=data_source_config,
    vectorIngestionConfiguration=vector_ingestion_config
)

bedrock_ds_id = response['dataSource']['dataSourceId']

print("A new BKB data source created with ID:", bedrock_ds_id)
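
For comparison with the FIXED_SIZE settings used above, here is a minimal sketch of how the `chunkingConfiguration` block might look for the other strategies. The helper `build_chunking_configuration` and its keyword names are hypothetical, and the default numbers for HIERARCHICAL and SEMANTIC are illustrative assumptions rather than tuned recommendations. The dictionary keys follow the `create_data_source` request shape.

```python
def build_chunking_configuration(strategy, **params):
    """Return a `chunkingConfiguration` dict for the given Bedrock chunking strategy."""
    if strategy == "NONE":
        return {"chunkingStrategy": "NONE"}
    if strategy == "FIXED_SIZE":
        return {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": params.get("max_tokens", 512),
                "overlapPercentage": params.get("overlap_percentage", 20),
            },
        }
    if strategy == "HIERARCHICAL":
        return {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # First level = parent chunks, second level = child chunks
                "levelConfigurations": [
                    {"maxTokens": params.get("parent_max_tokens", 1500)},  # illustrative
                    {"maxTokens": params.get("child_max_tokens", 300)},    # illustrative
                ],
                "overlapTokens": params.get("overlap_tokens", 60),         # illustrative
            },
        }
    if strategy == "SEMANTIC":
        return {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": params.get("max_tokens", 300),                   # illustrative
                "bufferSize": params.get("buffer_size", 0),                   # illustrative
                "breakpointPercentileThreshold": params.get("threshold", 95), # illustrative
            },
        }
    raise ValueError(f"Unknown chunking strategy: {strategy}")


# Equivalent to the configuration used in this notebook:
vector_ingestion_config_sketch = {
    "chunkingConfiguration": build_chunking_configuration("FIXED_SIZE")
}
```

Passing a different strategy name to the helper yields a dictionary that could be supplied as `vectorIngestionConfiguration` when creating a separate data source for experimentation.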

Let's also use the Bedrock API to get information about our newly created BKB data source:

In [None]:
response = bedrock_agent_client.get_data_source(knowledgeBaseId=bedrock_kb_id, dataSourceId=bedrock_ds_id)

print(json.dumps(response['dataSource'], indent=2, default=str))

### 4.3 Synchronize BKB with Data Source

Once the Knowledge Base and its data source are configured, we can start a fully-managed data ingestion job. During this process, BKB will retrieve the documents from the connected data source (on S3, in this case), extract and preprocess the content, split it into smaller chunks based on the configured chunking strategy, generate vector embeddings for each chunk, and store those embeddings in the vector store (AOSS vector store, in this case).

![BKB data ingestion](./images/data_ingestion.png)

In [None]:
# Start an ingestion job
response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId=bedrock_kb_id, dataSourceId=bedrock_ds_id)

bedrock_job_id = response['ingestionJob']['ingestionJobId']

print("A new BKB ingestion job started with ID:", bedrock_job_id)

In [None]:
# Wait until ingestion job completes
print("Starting Knowledge Base ingestion job...")
print("\n" + "="*70)
print("WHAT'S HAPPENING: Document Ingestion Pipeline")
print("="*70)
print("""
Processing 28 Amazon shareholder letters (1997-2024). This typically takes 5-7 minutes.

The Ingestion Pipeline:

 S3 → Parse → Extract → Chunk → Embed → Index → Validate

Step-by-Step Process:

1. Document Retrieval (30 sec):
   - Bedrock downloads all 28 PDFs from S3
   - Validates file integrity and permissions

2. Content Extraction (1-2 min):
   - PDF parsing (text extraction, layout analysis)
   - Removes headers, footers, page numbers
   - Handles multi-column layouts and tables

3. Chunking Strategy Application (30 sec):
   - Splits text using FIXED_SIZE strategy
   - Creates chunks of 512 tokens with 20% overlap
   - Expected output: ~600-700 chunks from 28 letters

4. Embedding Generation (3-4 min) - SLOWEST STEP:
   - Each chunk sent to Titan Embeddings V2 model
   - Generates 1,024-dimensional vectors
   - Batch processing: ~10-20 chunks/second

5. Vector Indexing (1 min):
   - Inserts vectors into AOSS collection
   - Builds HNSW graph for similarity search

6. Validation & Finalization (10 sec):
   - Verifies all documents processed successfully
   - Updates Knowledge Base status

Cost: ~$0.02 for ~600 chunks (embedding is billed per input token, not per vector dimension)
""")

print("\nProgress: ", end='', flush=True)
start_time = time.time()
check_count = 0

while True:
    response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId = bedrock_kb_id,
        dataSourceId = bedrock_ds_id,
        ingestionJobId = bedrock_job_id)
    
    status = response['ingestionJob']['status']
    
    if status == 'COMPLETE':
        elapsed = time.time() - start_time
        print(f" Done! ({elapsed:.1f} seconds)")
        break
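    elif status in ('FAILED', 'STOPPED'):
        # Terminal non-success states: report why and stop polling
        print(f"\n❌ Ingestion {status}")
        for reason in response['ingestionJob'].get('failureReasons', []):
            print(f"   Reason: {reason}")
        break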
    
    print('█', end='', flush=True)
    check_count += 1
    
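    # Educational milestones (approximate, based only on elapsed time;
    # the API reports job status, not individual pipeline stages)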
    if check_count == 12:  # ~60 seconds
        print("\n  ⏱️  1 minute - Document retrieval complete, extracting text...")
        print("Progress: ", end='', flush=True)
    elif check_count == 36:  # ~180 seconds
        print("\n  ⏱️  3 minutes - Chunking complete, generating embeddings (~60% done)...")
        print("Progress: ", end='', flush=True)
    elif check_count == 60:  # ~300 seconds
        print("\n  ⏱️  5 minutes - Embedding generation complete, indexing vectors...")
        print("Progress: ", end='', flush=True)
    
    time.sleep(5)

# Display statistics
if status == 'COMPLETE':
    stats = response['ingestionJob'].get('statistics', {})
    print("\n✅ Ingestion Complete!")
    print(f"\n📊 Ingestion Statistics:")
    print(f"   Documents Processed: {stats.get('numberOfDocumentsScanned', 'N/A')}")
    print(f"   Documents Failed: {stats.get('numberOfDocumentsFailed', 0)}")
    if 'numberOfChunks' in stats:
        print(f"   Chunks Created: {stats['numberOfChunks']}")
    print(f"   Total Time: {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")

print("\nFull job details:", json.dumps(response['ingestionJob'], indent=2, default=str))
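
The polling loop above runs until the job reports a terminal status. A reusable variant with a timeout can be sketched as follows, assuming the same `get_ingestion_job` call used in the cell. The helper name `wait_for_ingestion_job`, its default timeout, and the injectable `sleep` argument are all hypothetical choices made for this sketch.

```python
import time

# Statuses after which the job will not make further progress
TERMINAL_FAILURE_STATES = {"FAILED", "STOPPED"}


def wait_for_ingestion_job(client, kb_id, ds_id, job_id,
                           poll_seconds=5, timeout_seconds=1800, sleep=time.sleep):
    """Poll an ingestion job until COMPLETE; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        status = job["status"]
        if status == "COMPLETE":
            return job
        if status in TERMINAL_FAILURE_STATES:
            reasons = job.get("failureReasons", [])
            raise RuntimeError(f"Ingestion job {job_id} ended with status {status}: {reasons}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion job {job_id} still {status} after {timeout_seconds} s")
        sleep(poll_seconds)


# Possible usage inside the notebook (names taken from the cells above):
# job = wait_for_ingestion_job(bedrock_agent_client, bedrock_kb_id, bedrock_ds_id, bedrock_job_id)
# print(job.get("statistics", {}))
```

Raising an exception on failure, rather than printing and breaking, stops a "Run All" execution before later cells rely on an empty index.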

## 5. Conclusions and Next Steps

In this notebook, we walked through the process of creating an Amazon Bedrock Knowledge Base (BKB) and ingesting documents to enable Retrieval Augmented Generation (RAG) capabilities. We started by setting up the environment, installing the required libraries, and initializing the necessary AWS service clients. Then, we created an Amazon S3 bucket to store unstructured data (PDF documents) and uploaded sample files. We proceeded by provisioning an Amazon OpenSearch Serverless (AOSS) collection and index, configuring the appropriate IAM roles and permissions, and granting access to the BKB. Finally, we created the BKB, connected it to the S3 data source, and synchronized the documents to generate vector embeddings, which were stored in AOSS.

### Next Steps

Please execute the next cell to store some important variables that will be needed in other notebooks of this module:

In [None]:
%store s3_bucket_name aoss_encryption_policy aoss_network_policy aoss_access_policy aoss_collection bedrock_kb_id

To explore how you can interact with the newly created Knowledge Base via Bedrock APIs for RAG applications, please proceed to the next notebook:

&nbsp; **NEXT ▶** [2_managed-rag-with-retrieve-and-generate-api.ipynb](./2\_managed-rag-with-retrieve-and-generate-api.ipynb).