# Knowledge Base Setup for PR Article Generation

This notebook sets up the Amazon Bedrock Knowledge Base used by the lab6 agentic notebooks. The knowledge base contains example movie title information that serves as reference material for title-matching search.

## Purpose
- Create a Knowledge Base for Amazon Bedrock
- Upload example movie titles to S3
- Configure embeddings for semantic search
- Store knowledge base ID for use in the main workflow

## Prerequisites
- AWS credentials configured
- S3 bucket access
- Bedrock service permissions
- Example movie title information in the `titles` directory
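
The last prerequisite can be checked locally before any AWS call is made. The following is a minimal sketch, assuming the `titles` directory sits next to this notebook; `list_title_files` is a hypothetical helper and is not part of the lab utilities.

```python
from pathlib import Path


def list_title_files(directory="titles"):
    """Return the files under `directory`, sorted; raise if it is missing or empty."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Expected a directory at {root.resolve()}")
    # Walk the directory recursively and keep regular files only
    files = sorted(p for p in root.rglob("*") if p.is_file())
    if not files:
        raise ValueError(f"No files found under {root.resolve()}")
    return files
```

Calling `list_title_files("titles")` before the upload step surfaces a missing or empty folder early, rather than after the knowledge base has been created.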

Let's get started!

In [1]:
import boto3
import os
import uuid
import time

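# Resolve the current AWS account and region from the configured credentials
sts_client = boto3.client('sts')
session = boto3.session.Session()

account_id = sts_client.get_caller_identity()["Account"]
region = session.region_name

# Clients for S3 and the Bedrock runtime in the same region
s3_client = boto3.client('s3', region_name=region)
bedrock_client = boto3.client('bedrock-runtime', region_name=region)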

## Importing helper functions
In the following section, we add `bedrock_agent_helper.py` and `knowledge_base_helper.py` to the Python path so that the files can be imported and their functions invoked.

In general, the helper functions handle common tasks such as agent creation, Knowledge Base creation, and accessing data in S3.

In [2]:
import sys

sys.path.insert(0, ".")
sys.path.insert(1, "..")
sys.path.insert(2, "../..")

from utils.bedrock_agent_helper import (
    AgentsForAmazonBedrock
)
from utils.knowledge_base_helper import (
    KnowledgeBasesForAmazonBedrock, upload_directory
)
agents = AgentsForAmazonBedrock()
kb = KnowledgeBasesForAmazonBedrock()

## Create and synchronize Knowledge Base
Before creating an agent, we need to create a Knowledge Base (KB) that can later be associated with the PR Generator agent.
This KB will contain example title information for different movies. We have synthetically generated some example titles and stored them in the `titles` folder. We'll use them as the basis for our knowledge base.

This creation process can take several minutes.

In [3]:
knowledge_base_name = f'lab6-media-agent-kb-{str(uuid.uuid4())[:5]}'
knowledge_base_description = "KB containing information about media dataset"
s3_bucket_name = f"labs-bucket-{region}-{account_id}"
bucket_prefix = "lab6/"

In [None]:
%%time
lab6_kb_id, lab6_ds_id = kb.create_or_retrieve_knowledge_base(
    knowledge_base_name,
    knowledge_base_description,
    s3_bucket_name,
    "amazon.titan-embed-text-v2:0",
    bucket_prefix
)

print(f"Knowledge Base ID: {lab6_kb_id}")
print(f"Data Source ID: {lab6_ds_id}")
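
Because creation can take several minutes, it can be useful to confirm that the knowledge base has reached the `ACTIVE` state before uploading and synchronizing data. The helper above may already wait internally; the following is only a sketch of an independent check, assuming a boto3 `bedrock-agent` client is passed in. `wait_for_knowledge_base` is a hypothetical function name, and the polling interval and timeout are arbitrary defaults.

```python
import time


def wait_for_knowledge_base(bedrock_agent_client, knowledge_base_id,
                            poll_seconds=10, timeout_seconds=900):
    """Poll GetKnowledgeBase until the status is ACTIVE; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = bedrock_agent_client.get_knowledge_base(
            knowledgeBaseId=knowledge_base_id
        )
        status = response["knowledgeBase"]["status"]
        if status == "ACTIVE":
            return status
        # These states will not turn into ACTIVE by waiting longer
        if status in ("FAILED", "DELETING", "DELETE_UNSUCCESSFUL"):
            raise RuntimeError(
                f"Knowledge base {knowledge_base_id} is in state {status}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Knowledge base {knowledge_base_id} is still {status}"
            )
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_knowledge_base(boto3.client("bedrock-agent", region_name=region), lab6_kb_id)`.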

## Upload title information to S3 Bucket
For this lab, we created a few synthetic media titles that can be used for title information retrieval.
The data is in the `titles` subfolder of this lab. For simplicity, each title has the following details:

* title_id: a unique identifier for the title
* title: name of the media
* year: the year when the title was released
* duration: the total duration of the title

Feel free to explore these files for ideas on how to structure title data for media search and retrieval.
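
To make the field list above concrete, here is a sketch of what one title record could look like as a JSON document. The values are invented for illustration and are not taken from the lab's `titles` folder; the types chosen (an integer year, a free-text duration) and the JSON file format are assumptions, so check the actual files for the real layout.

```python
import json

# Hypothetical record: field names follow the list above, values are made up
example_title = {
    "title_id": "t-0001",
    "title": "An Example Movie",
    "year": 2020,
    "duration": "1h 45m",
}


def to_document(record):
    """Serialize one title record as a JSON string, ready to be written to a file."""
    return json.dumps(record, indent=2)


print(to_document(example_title))
```

Keeping the identifier and the descriptive fields together in one small document means a retrieved chunk is likely to carry everything needed to identify the matched title.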

In [None]:
upload_directory("titles", s3_bucket_name, bucket_prefix)

In [None]:
kb.synchronize_data(lab6_kb_id, lab6_ds_id)
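
Once synchronization has finished, a quick retrieval query can confirm that the titles are searchable. This is a minimal sketch that assumes a boto3 `bedrock-agent-runtime` client and uses its `retrieve` operation; `search_titles` is a hypothetical function name and the query text in the usage note is an arbitrary example.

```python
def search_titles(agent_runtime_client, knowledge_base_id, query, max_results=3):
    """Run a semantic search against the knowledge base; return (score, text) pairs."""
    response = agent_runtime_client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": max_results}
        },
    )
    # Each result carries the matched chunk text and, usually, a relevance score
    return [
        (result.get("score"), result["content"]["text"])
        for result in response["retrievalResults"]
    ]
```

For example, `search_titles(boto3.client("bedrock-agent-runtime", region_name=region), lab6_kb_id, "a movie released in 2020")` would return the closest matching chunks. An empty list suggests the ingestion has not completed or the upload prefix does not match the data source.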

## Store Knowledge Base ID

Store the knowledge base ID and data source ID with the IPython `%store` magic so that other notebooks can restore them with `%store -r`.

In [None]:
# Store the knowledge base ID for use in other notebooks
%store lab6_kb_id
%store lab6_ds_id

print("Stored variables:")
print(f"  kb_id = {lab6_kb_id}")
print(f"  ds_id = {lab6_ds_id}")