More details in the [AWS Samples notebook](https://github.com/aws-samples/amazon-bedrock-samples/blob/main/rag/knowledge-bases/features-examples/01-rag-concepts/01_create_ingest_documents_test_kb_multi_ds.ipynb)

![](./images/data_ingestion.png)

In [3]:
import boto3
import pprint

### List Knowledge Bases

In [None]:
# create a boto3 client for bedrock
bedrock = boto3.client(service_name='bedrock-agent')

# list all knowledge bases
response = bedrock.list_knowledge_bases()

# print the response
response
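
The raw response can be long, so a small helper that pulls out only the name, ID, and status of each Knowledge Base can make it easier to find the `kb_id` used later. This is a minimal sketch: `summarize_knowledge_bases` is a hypothetical helper, and `sample_response` contains made-up values shaped like the `knowledgeBaseSummaries` list, not real output.

```python
def summarize_knowledge_bases(list_response):
    """Return (name, id, status) tuples from a list_knowledge_bases response."""
    summaries = list_response.get("knowledgeBaseSummaries", [])
    return [
        (kb.get("name"), kb.get("knowledgeBaseId"), kb.get("status"))
        for kb in summaries
    ]


# Illustrative response fragment (made-up values, not real output).
sample_response = {
    "knowledgeBaseSummaries": [
        {"knowledgeBaseId": "KBEXAMPLE01", "name": "demo-kb", "status": "ACTIVE"},
    ]
}

for name, kb_identifier, status in summarize_knowledge_bases(sample_response):
    print(f"{name:<30} {kb_identifier:<12} {status}")
```

In the notebook, the same helper could be called as `summarize_knowledge_bases(response)` on the real response.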

### Adding a new Data Source to the existing Knowledge Base (KB)

#### Base syntax:
```python
response = client.create_data_source(
    dataDeletionPolicy='RETAIN'|'DELETE',
    dataSourceConfiguration={
        's3Configuration': {
            'bucketArn': 'string',
            'bucketOwnerAccountId': 'string',
            'inclusionPrefixes': [
                'string',
            ]
        },
        'type': 'S3'
    },
    description='string',
    knowledgeBaseId=kb_id,
    name='string',
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE'|'NONE'|'HIERARCHICAL'|'SEMANTIC',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 123,
                'overlapPercentage': 123
            },
            'hierarchicalChunkingConfiguration': {
                'levelConfigurations': [
                    {
                        'maxTokens': 123
                    },
                ],
                'overlapTokens': 123
            },
            'semanticChunkingConfiguration': {
                'breakpointPercentileThreshold': 123,
                'bufferSize': 123,
                'maxTokens': 123
            }
        }
    }
)
```

> It is easier to wrap this call in a function

In [5]:
def create_s3_data_source(kb_id,
                          kb_data_source_name,
                          kb_s3_bucket_name_arn,
                          kb_s3_data_source_path,
                          kb_s3_bucket_account_id,
                          vector_ingestion_configuration):
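    """Create an S3 data source in an existing Bedrock Knowledge Base.

    Args:
        kb_id (str): ID of the Knowledge Base that will own the data source.
        kb_data_source_name (str): Name to give the new data source.
        kb_s3_bucket_name_arn (str): ARN of the S3 bucket that holds the documents.
        kb_s3_data_source_path (str): S3 key prefix to ingest from.
        kb_s3_bucket_account_id (str): ID of the AWS account that owns the bucket.
        vector_ingestion_configuration (dict): Chunking settings passed as
            vectorIngestionConfiguration.

    Returns:
        dict: The create_data_source response, including the 'dataSource' entry.
    """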
    # Set SDK
    client = boto3.client('bedrock-agent')

    # Create S3 Data Source 
    response = client.create_data_source(
        dataDeletionPolicy='RETAIN',
        dataSourceConfiguration={
            's3Configuration': {
                'bucketArn': kb_s3_bucket_name_arn,
                'bucketOwnerAccountId': kb_s3_bucket_account_id,
                'inclusionPrefixes': [
                    kb_s3_data_source_path,
                ]
            },
            'type': 'S3'
        },
        description='S3 data source with different chunking strategy for testing purposes',
        knowledgeBaseId=kb_id,
        name=kb_data_source_name,
        vectorIngestionConfiguration=vector_ingestion_configuration
    )

    return response

---
## Chunking Strategy: HIERARCHICAL  
> IMPORTANT! Change the details below

In [6]:
# CHANGE ME!!
kb_chunking_strategy = "HIERARCHICAL"

# Knowledge Base and New Data Source details:
# - Note: Account ID can be fetched using sts_client.get_caller_identity()["Account"]
kb_id = "6ER5P7TAJM"
kb_s3_bucket_name_arn = "arn:aws:s3:::genai-carlos-contreras-bucket-data-quarks-labs-oregon-01"
kb_s3_bucket_account_id = "992382616037"

# No need to change the following values:
kb_s3_bucket_name = kb_s3_bucket_name_arn.split(":::")[-1]
kb_data_source_name = f"virtual-assistant-s3-{kb_chunking_strategy}"
kb_s3_data_source_path = f"datasets/demo_kb/knowledge-base-ecommerce-s3-001/{kb_data_source_name}/"
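
The note above mentions that the account ID can be fetched through STS instead of being hard-coded. A minimal sketch of that idea follows, together with a stricter version of the bucket-name split. Both `get_account_id` and `bucket_name_from_arn` are hypothetical helpers introduced here for illustration; `boto3` is imported lazily so the functions can also be exercised with a stub client.

```python
def get_account_id(sts_client=None):
    """Return the AWS account ID of the credentials in use."""
    if sts_client is None:
        import boto3  # imported lazily; only needed when no client is passed in
        sts_client = boto3.client("sts")
    return sts_client.get_caller_identity()["Account"]


def bucket_name_from_arn(bucket_arn):
    """Extract the bucket name from an S3 bucket ARN (arn:<partition>:s3:::<bucket>)."""
    marker = ":s3:::"
    if not bucket_arn.startswith("arn:") or marker not in bucket_arn:
        raise ValueError(f"Not an S3 bucket ARN: {bucket_arn!r}")
    return bucket_arn.split(marker, 1)[1]
```

With these in place, `kb_s3_bucket_account_id = get_account_id()` would replace the hard-coded value, assuming the bucket lives in the same account as the caller.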

> Quick concepts on [Hierarchical Chunking](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking.html):

For hierarchical chunking, Amazon Bedrock Knowledge Bases supports two levels of chunking depth:

- Parent: You set the maximum parent chunk token size.

- Child: You set the maximum child chunk token size.
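
The parent and child levels map to the two entries of `levelConfigurations`, in that order. The sketch below builds that structure from three numbers. `build_hierarchical_chunking_config` is a hypothetical helper, and the check that the child is smaller than the parent is a local sanity check added here as an assumption, not a rule quoted from the service.

```python
def build_hierarchical_chunking_config(parent_max_tokens, child_max_tokens, overlap_tokens):
    """Build a vectorIngestionConfiguration dict for HIERARCHICAL chunking."""
    # Local sanity check (an assumption of this sketch): child chunks are
    # expected to be smaller than the parent chunks that contain them.
    if child_max_tokens >= parent_max_tokens:
        raise ValueError("child_max_tokens should be smaller than parent_max_tokens")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # First entry is the parent level, second is the child level.
                "levelConfigurations": [
                    {"maxTokens": parent_max_tokens},
                    {"maxTokens": child_max_tokens},
                ],
                "overlapTokens": overlap_tokens,
            },
        }
    }


config = build_hierarchical_chunking_config(
    parent_max_tokens=100, child_max_tokens=30, overlap_tokens=20
)
```

The values used here (100, 30, 20) are the same ones this lab passes by hand.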

In [8]:
# Define Lab:
vectorIngestionConfiguration={
    'chunkingConfiguration': {
        'chunkingStrategy': kb_chunking_strategy,
        'hierarchicalChunkingConfiguration': {
            'levelConfigurations': [
                {
                    'maxTokens': 100
                },
                {
                    'maxTokens': 30
                }
            ],
            'overlapTokens': 20
        }
    }
}

In [None]:
# Create data source
response = create_s3_data_source(kb_id=kb_id,
                                 kb_data_source_name=kb_data_source_name,
                                 kb_s3_bucket_name_arn=kb_s3_bucket_name_arn,
                                 kb_s3_bucket_account_id=kb_s3_bucket_account_id,
                                 kb_s3_data_source_path=kb_s3_data_source_path,
                                 vector_ingestion_configuration=vectorIngestionConfiguration)

# Get Data Source ID, so we can delete it after this lab
data_source_id = response['dataSource']['dataSourceId']
print(f"New Data Source ID: {data_source_id}")

In [10]:
# Copy PDF to S3:
pdf_file = 'octank_financial_10K.pdf'
s3_pdf_file = f"{kb_s3_data_source_path}{pdf_file}"
s3_client = boto3.client('s3')

In [11]:
# Upload file to s3:
s3_client.upload_file(f"synthetic_dataset/{pdf_file}",kb_s3_bucket_name,s3_pdf_file)

> Sync the Knowledge Base, this time using Boto3

In [None]:
# Sync the KB
bedrock_agent_client = boto3.client('bedrock-agent')
response = bedrock_agent_client.start_ingestion_job(
    dataSourceId=data_source_id,
    description='Ingesting data for the first time, from ETL X',
    knowledgeBaseId=kb_id
)
ingestion_job = response['ingestionJob']['ingestionJobId']
print(f'Ingestion Job ID: {ingestion_job}')

In [None]:
# Check ingestion status
response = bedrock_agent_client.get_ingestion_job(
    dataSourceId=data_source_id,
    ingestionJobId=ingestion_job,
    knowledgeBaseId=kb_id
)

# Show status
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(response['ingestionJob'])
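
Checking the status once only gives a snapshot, so the cell usually has to be re-run until the job finishes. A polling loop is one way to avoid that. This is a minimal sketch: `wait_for_ingestion_job`, the poll interval, and the timeout are assumptions introduced here, and the set of terminal statuses is the one I would expect from `get_ingestion_job`, so check it against the current API reference.

```python
import time

# Statuses after which the job will not change any more (assumed set).
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion_job(client, kb_id, data_source_id, ingestion_job_id,
                           poll_seconds=10, timeout_seconds=1800):
    """Poll get_ingestion_job until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=ingestion_job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion job {ingestion_job_id} still {job['status']}")
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_ingestion_job(bedrock_agent_client, kb_id, data_source_id, ingestion_job)`, and the returned dict printed with `pp.pprint`.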

## 2. Testing the Knowledge Base

> Inspecting the chunks

In [25]:
# User question
query = "Provide a summary of consolidated statements of cash flows of Octank Financial for the fiscal years ended December 31, 2019?"

In [26]:
# SDK and model settings
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 
boto_region = bedrock_agent_runtime_client.meta.region_name

In [None]:
# Query the knowledge base and generate an answer
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(boto_region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

print(response['output']['text'],end='\n'*2)

In [29]:
def citations_rag_print(response_ret):
    for num,chunk in enumerate(response_ret,1):
        print(f'Chunk {num}: ',chunk['content']['text'],end='\n'*2)
        print(f'Chunk {num} Location: ',chunk['location'],end='\n'*2)
        print(f'Chunk {num} Metadata: ',chunk['metadata'],end='\n'*2)

In [None]:
response_standard = response['citations'][0]['retrievedReferences']
print("# of citations or chunks used to generate the response: ", len(response_standard))
citations_rag_print(response_standard)
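
The cell above reads only the first entry of `citations`. A response can carry several citations, each with its own `retrievedReferences`, so a sketch that flattens all of them may give a more complete view of the chunks used. `collect_retrieved_references` is a hypothetical helper, and `sample_rag_response` is made-up data shaped like the response, not real output.

```python
def collect_retrieved_references(rag_response):
    """Flatten retrievedReferences across every citation in the response."""
    references = []
    for citation in rag_response.get("citations", []):
        references.extend(citation.get("retrievedReferences", []))
    return references


# Illustrative response fragment (made-up values, not real output).
sample_rag_response = {
    "citations": [
        {"retrievedReferences": [{"content": {"text": "chunk A"}}]},
        {"retrievedReferences": [{"content": {"text": "chunk B"}},
                                 {"content": {"text": "chunk C"}}]},
    ]
}

all_references = collect_retrieved_references(sample_rag_response)
print("# of chunks across all citations:", len(all_references))
```

The flattened list has the same item shape as `response_standard`, so it could be passed to `citations_rag_print` unchanged.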

### Testing the Knowledge Base with the Retrieve API

In [None]:
# Call Bedrock Retrieve
response = bedrock_agent_runtime_client.retrieve(
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults":5,
        } 
    },
    retrievalQuery={
        "text": "How many new positions were opened across Amazon's fulfillment and delivery network?"
    }
)

def response_print(retrieve_resp):
    # Iterate over the response passed in, not the global `response`
    for num,chunk in enumerate(retrieve_resp['retrievalResults'],1):
        print(f'Chunk {num}: ',chunk['content']['text'],end='\n'*2)
        print(f'Chunk {num} Location: ',chunk['location'],end='\n'*2)
        print(f'Chunk {num} Score: ',chunk['score'],end='\n'*2)
        print(f'Chunk {num} Metadata: ',chunk['metadata'],end='\n'*2)

response_print(response)

In [None]:
pp.pprint(response)
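
Each item in `retrievalResults` carries a `score`, which makes it possible to drop weak matches before passing chunks to a model. The sketch below shows one way to do that. `filter_by_score` and the `0.5` threshold are assumptions chosen for illustration, and `sample_results` is made-up data, not real output.

```python
def filter_by_score(retrieval_results, min_score=0.5):
    """Keep only the retrieval results whose score is at least min_score."""
    return [
        result for result in retrieval_results
        if result.get("score", 0.0) >= min_score
    ]


# Illustrative results (made-up values, not real output).
sample_results = [
    {"content": {"text": "chunk A"}, "score": 0.82},
    {"content": {"text": "chunk B"}, "score": 0.41},
    {"content": {"text": "chunk C"}, "score": 0.57},
]

strong_matches = filter_by_score(sample_results, min_score=0.5)
print([match["content"]["text"] for match in strong_matches])
```

In the notebook, it could be applied as `filter_by_score(response['retrievalResults'])`. A suitable threshold depends on the embedding model and the data, so it is best tuned by inspecting real scores.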

### Chunking Lab cleanup steps:
1. Delete the files in S3: this makes the next sync remove their data (embeddings)
2. Sync the Knowledge Base again
3. Delete the Data Source

In [None]:
# Step 1
s3_client.delete_object(Bucket=kb_s3_bucket_name,Key=s3_pdf_file)

In [None]:
# Sync the KB. This can also be triggered from EventBridge or similar
bedrock_agent_client = boto3.client('bedrock-agent')
response = bedrock_agent_client.start_ingestion_job(
    dataSourceId=data_source_id,
    description='Deleting embeddings (syncing) after removing S3 files',
    knowledgeBaseId=kb_id
)

ingestion_job = response['ingestionJob']['ingestionJobId']
print(f'Ingestion Job ID: {ingestion_job}')

In [None]:
# Check ingestion status
response = bedrock_agent_client.get_ingestion_job(
    dataSourceId=data_source_id,
    ingestionJobId=ingestion_job,
    knowledgeBaseId=kb_id
)

# Show status
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(response['ingestionJob'])

In [37]:
# Delete the Data Source
bedrock_agent_client = boto3.client('bedrock-agent')
response = bedrock_agent_client.delete_data_source(
    dataSourceId=data_source_id,
    knowledgeBaseId=kb_id
)

In [None]:
# Show status
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(response)