# Smart Content Processor with RAG for Safe Learning



This application converts multimedia content into text so that it can be summarized and queried. It uses Retrieval Augmented Generation (RAG) to summarize multi-part video series and answer context-specific questions, which makes it useful to students and teachers in educational institutions. Guardrails filter out obscene or irrelevant content to keep the learning environment safe. The solution is built on Amazon Bedrock Knowledge Bases and Amazon OpenSearch Serverless (OSS).

This notebook works well on an ml.t3.medium instance with the Python 3 kernel in JupyterLab or the Data Science 3.0 kernel in SageMaker Studio Classic.

### Step 1: Setup: Installing prerequisites

In [1]:
%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

%pip install langchain==0.0.309 "transformers>=4.24,<5"

%pip install tiktoken

Collecting boto3>=1.28.57
  Downloading boto3-1.35.7-py3-none-any.whl.metadata (6.6 kB)
Collecting awscli>=1.29.57
  Downloading awscli-1.34.7-py3-none-any.whl.metadata (11 kB)
Collecting botocore>=1.31.57
  Downloading botocore-1.35.7-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3>=1.28.57)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from boto3>=1.28.57)
  Downloading s3transfer-0.10.2-py3-none-any.whl.metadata (1.7 kB)
Collecting docutils<0.17,>=0.10 (from awscli>=1.29.57)
  Downloading docutils-0.16-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting PyYAML<6.1,>=3.10 (from awscli>=1.29.57)
  Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting colorama<0.4.7,>=0.2.5 (from awscli>=1.29.57)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting rsa<4.8,>=3.1.2 (from awscli>=1.29.57)
  Downloading rsa-4.7.2-py3-no

The file utility.py creates the following roles and policies:
1. A policy for accessing the foundation model
2. A policy for reading documents from S3
3. An Amazon Bedrock Knowledge Base execution role for accessing OSS and S3
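The contents of utility.py are not shown in this notebook. As a rough illustration only, the first two items and the role's trust relationship could be expressed as IAM policy documents like the ones below. The function names, the action lists, and the resource scoping are assumptions made for this sketch, not the file's actual code, and the OSS permissions are left out.

```python
import json


def foundation_model_policy(model_arns):
    """Hypothetical policy: allow invoking the given Bedrock foundation models."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": list(model_arns),
            }
        ],
    }


def s3_read_policy(bucket_name):
    """Hypothetical policy: allow listing the bucket and reading its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


def knowledge_base_trust_policy():
    """Hypothetical trust policy: let the Bedrock service assume the execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Inspect one of the documents before attaching it to a role.
print(json.dumps(s3_read_policy("edify-apjc"), indent=2))
```

Keeping the documents as plain dictionaries makes them easy to review before they are passed, serialized with `json.dumps`, to IAM.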

### Step 2: Downloading sample content

Content from two domains is used in this demo:
1. Quantum physics lectures from MIT
2. Lectures in the field of Botany

Sample content

In [6]:
!wget -O podcast.mp3 http://www.brown.edu/Departments/Watson_Institute/Open_Source/RadioOpenSource-Harold_Bloom-Melville.mp3 --no-check-certificate

--2024-08-28 06:37:03--  http://www.brown.edu/Departments/Watson_Institute/Open_Source/RadioOpenSource-Harold_Bloom-Melville.mp3
Resolving www.brown.edu (www.brown.edu)... 104.18.2.173, 104.18.3.173, 2606:4700::6812:2ad, ...
Connecting to www.brown.edu (www.brown.edu)|104.18.2.173|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.brown.edu/Departments/Watson_Institute/Open_Source/RadioOpenSource-Harold_Bloom-Melville.mp3 [following]
--2024-08-28 06:37:03--  https://www.brown.edu/Departments/Watson_Institute/Open_Source/RadioOpenSource-Harold_Bloom-Melville.mp3
Connecting to www.brown.edu (www.brown.edu)|104.18.2.173|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 17252346 (16M) [audio/mpeg]
Saving to: ‘podcast.mp3’


2024-08-28 06:37:03 (255 MB/s) - ‘podcast.mp3’ saved [17252346/17252346]



### Sample upload

In [8]:
import boto3

s3_client = boto3.client('s3')
file_name = "podcast.mp3"
bucket = "edify-apjc"
object_name = "sample/podcast.mp3"
response = s3_client.upload_file(file_name, bucket, object_name)

### Step 3: Converting the media into text using an Amazon Transcribe job

In [9]:
import json
import os
import sys
import time
import warnings

import langchain
import sagemaker
import tiktoken
warnings.filterwarnings('ignore')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [10]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
role

'arn:aws:iam::533267439967:role/service-role/AmazonSageMaker-ExecutionRole-20240416T114832'

In [14]:
transcribe = boto3.client('transcribe')

job_name = "transcribe_quantum"
job_uri = "s3://edify-apjc/transcribe-input/quantum/Lecture 1_ Introduction to Superposition.mp4"
output_job_uri = "edify-apjc"
output_loc = "quantumtranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

In [15]:
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    job_status = job['TranscriptionJob']['TranscriptionJobStatus']
    if job_status in ['COMPLETED', 'FAILED']:
        print(f"Job {job_name} is {job_status}.")
        if job_status == 'COMPLETED':
            print(
                f"Download the transcript from\n"
                f"\t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}.")
        break
    else:
        print(f"Waiting for {job_name}. Current status is {job_status}.")
    time.sleep(10)

Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Waiting for transcribe_quantum. Current status is IN_PROGRESS.
Job transcribe_quantum is COMPLETED.
Download the transcript from
	https://s3.us-east-1.amazonaws.com/edify-apjc/quantumtranscribe-output/transcribe_quantum.json.
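The polling cell above is repeated for later lectures, and it ends silently if the job is still running after the last try. A minimal sketch of a reusable helper follows. The function name `wait_for_transcription_job` and its parameters are hypothetical; it only relies on the same `get_transcription_job` call used above.

```python
import time


def wait_for_transcription_job(client, job_name, max_tries=60,
                               delay_seconds=10, sleep=time.sleep):
    """Poll a Transcribe job and return the transcript URI once it completes.

    Raises RuntimeError if the job fails and TimeoutError if it is still
    running after max_tries polls.
    """
    for _ in range(max_tries):
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            reason = job.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Job {job_name} failed: {reason}")
        print(f"Waiting for {job_name}. Current status is {status}.")
        sleep(delay_seconds)
    raise TimeoutError(f"Job {job_name} did not finish after {max_tries} polls.")


# Example usage with the client created above (requires AWS credentials):
# uri = wait_for_transcription_job(transcribe, "transcribe_quantum")
```

Passing the client and the `sleep` function as arguments keeps the helper easy to exercise with a stub instead of a live AWS call.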


In [17]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'quantumtranscribe-output/transcribe_quantum.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][:500])
#text

The following content is provided under a Creative Commons license. Your support will help mit open courseware continue to offer high quality educational resources for free to make a donation or to view additional materials from hundreds of mit courses. Visit Mit Open Courseware at ocw.mit.edu. Hi everyone. Welcome to 804 for spring 2013. This is the uh fourth and uh presumably final time that I will be teaching this class. So I'm pretty excited about it. Um So my name is Alan Adams. I'll be lec
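The same read-and-parse steps recur for every lecture, so they can be wrapped in one function. The sketch below assumes the Transcribe output layout used in the cell above (`results.transcripts[0].transcript`); the name `load_transcript` is hypothetical.

```python
import json


def load_transcript(s3_client, bucket, key):
    """Read a Transcribe output JSON object from S3 and return the transcript text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    data = json.loads(response["Body"].read().decode("utf-8"))
    return data["results"]["transcripts"][0]["transcript"]


# Example usage with the S3 client created above (requires AWS credentials):
# text = load_transcript(s3, "edify-apjc", "quantumtranscribe-output/transcribe_quantum.json")
# print(text[:500])
```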


In [18]:
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(text))
num_tokens

14562
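The token count gives a rough idea of whether a transcript fits into a model's context window. Note that `cl100k_base` is an OpenAI encoding, so for other model families the number is only an approximation. If a transcript had to be split, one possible approach is fixed-size token windows with a small overlap, as sketched below. The function name and the window sizes in the usage comment are assumptions, not part of the original notebook.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens items.

    Consecutive windows share `overlap` tokens so that sentences cut at a
    boundary still appear whole in one of the two neighbouring chunks.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Example usage with the tiktoken encoding from the cell above:
# token_ids = encoding.encode(text)
# pieces = [encoding.decode(c) for c in chunk_tokens(token_ids, 4000, overlap=200)]
```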

Adding more content

In [20]:
transcribe = boto3.client('transcribe')

job_name = "ExperimentalFactsofLife"
job_uri = "s3://edify-apjc/transcribe-input/quantum/Lecture 2_ Experimental Facts of Life.mp4"
output_job_uri = "edify-apjc"
output_loc = "quantumtranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

In [21]:
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    job_status = job['TranscriptionJob']['TranscriptionJobStatus']
    if job_status in ['COMPLETED', 'FAILED']:
        print(f"Job {job_name} is {job_status}.")
        if job_status == 'COMPLETED':
            print(
                f"Download the transcript from\n"
                f"\t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}.")
        break
    else:
        print(f"Waiting for {job_name}. Current status is {job_status}.")
    time.sleep(10)

Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current status is IN_PROGRESS.
Waiting for ExperimentalFactsofLife. Current sta

In [43]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'quantumtranscribe-output/ExperimentalFactsofLife.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][7000:9000])
#text

lain that that's gonna be our first challenge. Now, interestingly Geiger, who is this uh this collaborator of Rutherford young junior collaborator of Rutherford um went on uh to develop a really neat instrument. So suppose you wanna see radiation, we do this all the time. I'm looking at you and I'm seeing radiation, seeing light, but I'm not seeing ultra high energy radiation. I'm seeing eng uh uh radiation in the op electromagnetic waves in the optical spectrum. Meanwhile, I'm also not seeing alpha particles. So what Geiger wanted was a way to detect without using your eyes radiation that's, you know, hard to see. So the way he did this is he, he took a capacitor and he filled the surrounded the capacitor with some noble gas, it doesn't interact. There's no, it's very hard to ionize. And uh if you crank up the potential across his capacitor plate high enough. What do you get a spark? Right? You, you all know this, if you crank up a capacitor eventually breaks down because the dielectr

Lecture 3

In [26]:
transcribe = boto3.client('transcribe')

job_name = "TheWaveFunction"
job_uri = "s3://edify-apjc/transcribe-input/quantum/Lecture 3_ The Wave Function.mp4"
output_job_uri = "edify-apjc"
output_loc = "quantumtranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

In [44]:
# Counts tokens in the transcript currently held in `text` (the most recently loaded one)
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(text))
num_tokens

16865

In [45]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'quantumtranscribe-output/TheWaveFunction.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][7000:9000])
#text

 electrons have both particulate and wave like features simultaneously. So these relations are called the de bru relations. They're de Broli, I leave it up to you to decide how to pronounce that. Uh uh And those relations are gonna play an important role for us in the next few lectures. I just wanna give them a name and a little context. This is a good example of um parsimony and elegance, theoretical elegance um leading you to an idea that turns out to be true of the world. Now, that's a dangerous strategy for finding truth, right? Boy, wouldn't it be nice if, right? Wouldn't it be nice if you know, we didn't have to pay taxes? But we also had Medicare. Um uh But uh but of course, you know, so that's not a terribly useful uh uh guide all the time, but sometimes it really does lead you in the right direction. This is a great example of physical intuition, wildly divorced from experiment pushing you in the right direction. I'm making it sound a little more shocking than, well, it was sh

Lecture 5

In [31]:
transcribe = boto3.client('transcribe')

job_name = "OperatorsandtheSchrdingerEquation"
job_uri = "s3://edify-apjc/transcribe-input/quantum/Lecture 5_ Operators and the Schrödinger Equation.mp4"
output_job_uri = "edify-apjc"
output_loc = "quantumtranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

In [46]:
# Counts tokens in the transcript currently held in `text` (the most recently loaded one)
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(text))
num_tokens

13465

In [47]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'quantumtranscribe-output/OperatorsandtheSchrdingerEquation.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][7000:9000])
#text

ch you have vectors with two components. So example a vector that has two components V one and V two and the matrices, this is the object and the operators a matrix, a 11812821822 as a operator an M on a vector is a vector. If you add with a matrix on a vector, this two by two matrix on this column vector, you get another vector. So that's the simplest example of operators acting on objects. In our case, we're gonna talk about a more so we're gonna have to begin in quantum mechanics. We're required to begin with a more sophisticated one in which the objects are gonna be object are gonna be functions. In fact, I will write complex functions of X. So let's see the list of operators and what do the operators do? The operators act on the functions. So uh what is an operator is a rule on how to take any function? And you must give a rule on how to obtain from that function, another function. So let's start with the examples is probably the easiest thing to do. So an operator acts on the fun

Lecture 6

In [35]:
transcribe = boto3.client('transcribe')

job_name = "TimeEvolutionandtheSchrdingerEquation"
job_uri = "s3://edify-apjc/transcribe-input/quantum/Lecture 6_ Time Evolution and the Schrödinger Equation.mp4"
output_job_uri = "edify-apjc"
output_loc = "quantumtranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

In [48]:
# Counts tokens in the transcript currently held in `text` (the most recently loaded one)
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(text))
num_tokens

10984

In [49]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'quantumtranscribe-output/TimeEvolutionandtheSchrdingerEquation.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][7000:9000])
#text

ed as a superposition of those Eigen functions of your physical observable. We'll go over this in more detail in a minute. But here I want to emphasize that uh these, that the Eigen functions have a special property that for observables, for operators corresponding to observables. The Eigen functions form a basis. Any function can be expanded as some linear combination of these basis functions. The classic example being the fourier expansion, any function, any periodic function can be expanded as a sum of sines and cosines and any function on the real line can be expanded as a sum of exponentials E to the IKX. This is the same statement the Eigen functions of momentum are what even the IKX. And so this is the statement that an arbitrary function when the observable is the momentum. This is the statement that an arbitrary function can be expanded as a superposition or a sum of exponentials. And that's the 48. The cool. Was there a question? OK. Good. Yeah. OK. Other questions on these p

Lecture 7

In [38]:
transcribe = boto3.client('transcribe')

job_name = "MoreonEnergyEigenstates"
job_uri = "s3://edify-apjc/transcribe-input/quantum/Lecture 7_ More on Energy Eigenstates.mp4"
output_job_uri = "edify-apjc"
output_loc = "quantumtranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

In [50]:
# Counts tokens in the transcript currently held in `text` (the most recently loaded one)
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(text))
num_tokens

16690

In [51]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'quantumtranscribe-output/MoreonEnergyEigenstates.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][7000:9000])
#text

 this is equal to from both of these, I get an overall factor of E one. So notice that we get the same vector back times a constant, a common constant. So when we have degenerate Eigen functions, we can take arbitrary linear combinations to them get another degenerate Eigen function cool. So this is like imagine I have a vector and I have another vector and they share the property that they're both Eigen functions of some operator. That means any linear combination of them is also right. So there's a whole vector space, there's a whole space of possible functions that all have the same eigenvalue. So now you say, well look are these two orthogonal to each other. No, these two, no. But here's the thing. If you have a vector space, if you have a space, you can always find orthogonal guys and a basis for that space. Yes. So while it's not true that the, the Eigen functions are always orthogonal, it is true, we will not prove this, but we will discuss the proof of it later by pulling the m

### Step 4: Adding botany content as a separate job/location

In [41]:
transcribe = boto3.client('transcribe')

job_name = "IntroductiontoBotany"
job_uri = "s3://edify-apjc/transcribe-input/botany/Introduction to Botany.mp4"
output_job_uri = "edify-apjc"
output_loc = "botanytranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

In [55]:
# Counts tokens in the transcript currently held in `text` (the most recently loaded one)
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(text))
num_tokens

690

In [56]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'botanytranscribe-output/IntroductiontoBotany.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][:700])
#text

A lot about the science star da explains. In my biology series, we learned all about biology, which is the study of all living things there. We came to understand what cells are and how they function as well as what it means to be alive. We've also learned about biochemistry in my biochemistry series. So we know about the molecular components of cells and we understand how chemical processes occur within cells, including how energy is produced. And in my classical physics series, we learned all about energy conservation and the ways that energy can be converted from one form to another through all of the biological activity that occurs every day, all over the world. But all of this begs the 


Lecture 2

In [57]:
transcribe = boto3.client('transcribe')

job_name = "MechanismsofPlantGrowth"
job_uri = "s3://edify-apjc/transcribe-input/botany/Mechanisms of Plant Growth.mp4"
output_job_uri = "edify-apjc"
output_loc = "botanytranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

Lecture 3

In [58]:

transcribe = boto3.client('transcribe')

job_name = "PlantAnatomyandStructure"
job_uri = "s3://edify-apjc/transcribe-input/botany/Plant Anatomy and Structure.mp4"
output_job_uri = "edify-apjc"
output_loc = "botanytranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

Lecture 4

In [60]:

transcribe = boto3.client('transcribe')

job_name = "PlantPigments"
job_uri = "s3://edify-apjc/transcribe-input/botany/Plant Pigments.mp4"
output_job_uri = "edify-apjc"
output_loc = "botanytranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

Lecture 5

In [61]:

transcribe = boto3.client('transcribe')

job_name = "TypesofPlantCells"
job_uri = "s3://edify-apjc/transcribe-input/botany/Types of Plant Cells.mp4"
output_job_uri = "edify-apjc"
output_loc = "botanytranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')

Lecture 6

In [62]:
transcribe = boto3.client('transcribe')

job_name = "TypesofPlantTissues"
job_uri = "s3://edify-apjc/transcribe-input/botany/Types of Plant Tissues.mp4"
output_job_uri = "edify-apjc"
output_loc = "botanytranscribe-output/"
response = transcribe.start_transcription_job( 
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName=output_job_uri,
    OutputKey=output_loc,
    MediaFormat='mp3',
    LanguageCode='en-US')
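The job cells above differ only in the input file and the job name. The names follow a visible pattern: the file title with the "Lecture N_ " prefix, spaces, and any character outside `0-9a-zA-Z._-` removed (Transcribe job names only accept that character set). A sketch of deriving the arguments from an S3 key is shown below. The helper names are hypothetical, and `media_format` defaults to the value used in the cells above.

```python
import re
from pathlib import PurePosixPath


def job_name_from_key(key):
    """Derive a Transcribe job name from an S3 object key."""
    stem = PurePosixPath(key).stem                  # file name without extension
    stem = re.sub(r"^Lecture \d+_\s*", "", stem)    # drop a "Lecture N_ " prefix
    return re.sub(r"[^0-9a-zA-Z._-]", "", stem)     # keep only allowed characters


def transcription_job_args(bucket, key, output_prefix, media_format="mp3"):
    """Build keyword arguments for transcribe.start_transcription_job."""
    return {
        "TranscriptionJobName": job_name_from_key(key),
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "OutputBucketName": bucket,
        "OutputKey": output_prefix,
        "MediaFormat": media_format,
        "LanguageCode": "en-US",
    }


# Example usage (requires AWS credentials):
# for key in ["transcribe-input/botany/Plant Pigments.mp4",
#             "transcribe-input/botany/Types of Plant Cells.mp4"]:
#     transcribe.start_transcription_job(
#         **transcription_job_args("edify-apjc", key, "botanytranscribe-output/"))
```

The first quantum lecture uses the hand-picked name `transcribe_quantum`, so this rule reproduces the other job names in the notebook but not that one.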

In [72]:
# Counts tokens in the transcript currently held in `text` (the most recently loaded one)
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(text))
num_tokens

2116

In [73]:
s3 = boto3.client('s3')
bucket = 'edify-apjc'
key = 'botanytranscribe-output/TypesofPlantTissues.json'
response = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(response['Body'].read().decode('utf-8'))
text = data['results']['transcripts'][0]['transcript']
print(data['results']['transcripts'][0]['transcript'][1000:2000])
#text

ponsible for the photosynthetic layer in leaves called the mesophyll where the plant performs gas exchange and creates sugars, making its own food pereny tissue is also how a plant stores excess energy in the form of starches, which are complex polysaccharides. Starch filled, pereny tissue can be found in a plant's roots and pereny. A tissue also makes up the majority of a seed so that the starches can feed the embryonic plant until it's able to photosynthesize on its own. Additionally, pereny tissue is so prevalent throughout a plant that it also takes on the role of growing to cover wounds and replace other tissues lost through physical trauma or disease. Wound closure is an important function for plants just like it is for us. Because if a plant has an open wound, then all sorts of pathogens like fungi and bacteria could invade the plant and quickly kill it. The other two subtypes of ground tissue, ground Kenya tissue and ground sclerenchyma tissue are also composed of cells by the 

### Step 6: Experimenting with Anthropic Claude 3 Sonnet

In [90]:
import boto3
import json
from botocore.exceptions import ClientError

# Initialize S3 and Bedrock clients
s3 = boto3.client('s3')
bedrock = boto3.client(service_name='bedrock-runtime')

# S3 bucket and object details
bucket_name = 'edify-apjc'
object_key = 'botanytranscribe-output/TypesofPlantTissues.json'

modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
accept = 'application/json'
contentType = 'application/json'

def read_json_from_s3(bucket, key):
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        json_content = json.loads(response['Body'].read().decode('utf-8'))
        return json_content
    except ClientError as e:
        print(f"An error occurred: {e}")
        return None
    except json.JSONDecodeError:
        print(f"The file {key} is not a valid JSON.")
        return None

# Read the JSON from S3
json_data = read_json_from_s3(bucket_name, object_key)

if json_data is None:
    print("Failed to retrieve or parse JSON from S3.")
else:
    # Function to extract text from nested JSON
    def extract_text(data):
        if isinstance(data, dict):
            for value in data.values():
                text = extract_text(value)
                if text:
                    return text
        elif isinstance(data, list):
            for item in data:
                text = extract_text(item)
                if text:
                    return text
        elif isinstance(data, str) and len(data) > 100:  # Assuming the text content is relatively long
            return data
        return None

    text_to_summarize = extract_text(json_data)
    
    if not text_to_summarize:
        print("No suitable text found in the JSON to summarize.")
    else:
        # Prepare the request body for the Messages API
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Summarize the following text in a concise manner. Write the response in bullet points.\n\n{text_to_summarize}"
                    }
                ]
            }
        ]

        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "temperature": 0.1,
            "top_p": 0.5,
            "messages": messages
        })

        try:
            response = bedrock.invoke_model(
                body=body,
                modelId=modelId,
                accept=accept,
                contentType=contentType
            )

            response_body = json.loads(response['body'].read())

            # Extract the assistant's response
            assistant_message = response_body['content'][0]['text']
            print(assistant_message)
        except Exception as e:
            print(f"An error occurred while invoking the model: {e}")

Here is a summary of the text in bullet points:

- Three main types of plant tissues:
    - Ground tissue (parenchyma, collenchyma, sclerenchyma)
    - Dermal tissue (epidermis, cuticle, stomata, guard cells)
    - Vascular tissue (xylem, phloem)

- Ground tissue:
    - Parenchyma - photosynthesis, energy storage, wound healing
    - Collenchyma and sclerenchyma - structural support

- Dermal tissue:
    - Epidermis - protective layer, cuticle secretion
    - Stomata - gas exchange openings, controlled by guard cells
    - Periderm and bark in older plants

- Vascular tissue:
    - Xylem - water and mineral transport upwards
    - Phloem - sugar transport from sources to sinks
    - Transpiration and translocation drive xylem and phloem transport

- Tissues organize into organs like roots, stems, leaves


### Step 7: Summarizing a quantum lecture

In [94]:
import boto3
import json
from botocore.exceptions import ClientError

# Initialize S3 and Bedrock clients
s3 = boto3.client('s3')
bedrock = boto3.client(service_name='bedrock-runtime')

# S3 bucket and object details
bucket_name = 'edify-apjc'
object_key = 'quantumtranscribe-output/TimeEvolutionandtheSchrdingerEquation.json'  

modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
accept = 'application/json'
contentType = 'application/json'

def read_json_from_s3(bucket, key):
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        json_content = json.loads(response['Body'].read().decode('utf-8'))
        return json_content
    except ClientError as e:
        print(f"An error occurred: {e}")
        return None
    except json.JSONDecodeError:
        print(f"The file {key} is not a valid JSON.")
        return None

# Read the JSON from S3
json_data = read_json_from_s3(bucket_name, object_key)

if json_data is None:
    print("Failed to retrieve or parse JSON from S3.")
else:
    # Function to extract text from nested JSON
    def extract_text(data):
        if isinstance(data, dict):
            for value in data.values():
                text = extract_text(value)
                if text:
                    return text
        elif isinstance(data, list):
            for item in data:
                text = extract_text(item)
                if text:
                    return text
        elif isinstance(data, str) and len(data) > 100:  # Assuming the text content is relatively long
            return data
        return None

    text_to_summarize = extract_text(json_data)
    
    if not text_to_summarize:
        print("No suitable text found in the JSON to summarize.")
    else:
        # Prepare the request body for the Messages API
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Summarize the following text in a concise manner. Write the response in bullet points.\n\n{text_to_summarize}"
                    }
                ]
            }
        ]

        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "temperature": 0.1,
            "top_p": 0.5,
            "messages": messages
        })

        try:
            response = bedrock.invoke_model(
                body=body,
                modelId=modelId,
                accept=accept,
                contentType=contentType
            )

            response_body = json.loads(response['body'].read())

            # Extract the assistant's response
            assistant_message = response_body['content'][0]['text']
            print(assistant_message)
        except Exception as e:
            print(f"An error occurred while invoking the model: {e}")

Here are the key points summarized in bullet form:

- Solving the Schrödinger equation can be done via brute force, extreme cleverness, or numerically
- For an energy eigenstate, the wavefunction only changes by a phase factor over time
- Any wavefunction can be expressed as a superposition of energy eigenstates
- The expansion coefficients give the probability of measuring each energy eigenvalue
- Example 1: Free particle - energy eigenfunctions are plane waves
- Example 2: Particle in a box 
    - Energy eigenvalues are quantized (discrete)
    - Ground state energy is non-zero
    - Energy levels get further apart for higher energies
- Quantization of energy levels connects to experimental observations like emission spectra of atoms


### Step 8: Q&A service test


In [101]:
import boto3
import json

# Initialize Bedrock client
bedrock = boto3.client(service_name='bedrock-runtime')

modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
accept = 'application/json'
contentType = 'application/json'

def invoke_bedrock_model(messages):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "temperature": 0.1,
        "top_p": 0.5,
        "messages": messages
    })

    try:
        response = bedrock.invoke_model(
            body=body,
            modelId=modelId,
            accept=accept,
            contentType=contentType
        )

        response_body = json.loads(response['body'].read())
        return response_body['content'][0]['text']
    except Exception as e:
        print(f"An error occurred while invoking the model: {e}")
        return None

# Ensure text_to_summarize is defined and contains the full content
# If it's not already defined, uncomment and fill in the following line:
# text_to_summarize = """Paste your original text here"""

# Check if text_to_summarize is defined and not empty
if 'text_to_summarize' not in locals() or not text_to_summarize.strip():
    print("Error: text_to_summarize is not defined or is empty.")
    text_to_summarize = input("Please paste the original content here: ")

# Initialize conversation history with the original text
messages = [
    {"role": "user", "content": [{"type": "text", "text": f"Here's the content for reference. Please acknowledge that you've read it and are ready to answer questions about it:\n\n{text_to_summarize}"}]},
]

# Get initial acknowledgment from the model
response = invoke_bedrock_model(messages)
print("\nInitial Response:")
print(response)

messages.append({"role": "assistant", "content": [{"type": "text", "text": response}]})

# Follow-up questions loop
while True:
    follow_up = input("\nDo you have a question about the content? (yes/no): ").lower()
    if follow_up != 'yes':
        break

    question = input("Enter your question: ")
    messages.append({"role": "user", "content": [{"type": "text", "text": question}]})

    response = invoke_bedrock_model(messages)
    print("\nResponse:")
    print(response)

    # Add the model's response to the conversation history
    messages.append({"role": "assistant", "content": [{"type": "text", "text": response}]})

print("Thank you for using the Q&A service!")


Initial Response:
Yes, I have read through the provided content on solving the Schrödinger equation and understanding energy eigenfunctions and eigenvalues in quantum mechanics. I am ready to answer questions related to this material.



Do you have a question about the content? (yes/no):  yes
Enter your question:  how does the lecture relate time and schrodinger equation?



Response:
The lecture discusses how the Schrödinger equation governs the time evolution of the wavefunction in quantum mechanics. Some key points about time evolution from the Schrödinger equation:

1. The Schrödinger equation is a first-order linear differential equation for the time derivative of the wavefunction Ψ(x,t):

iħ ∂Ψ/∂t = ĤΨ

Where Ĥ is the Hamiltonian operator representing the total energy of the system.

2. For a state that is an energy eigenfunction Ψ = φE with eigenvalue E, the time evolution is simple:

Ψ(x,t) = φE(x)e^(-iEt/ħ)

The wavefunction just picks up a phase factor that oscillates at a frequency E/ħ.

3. Using the superposition principle, the general solution for an arbitrary initial state Ψ(x,0) can be written as a linear combination of energy eigenfunctions:

Ψ(x,t) = Σn cn φEn(x) e^(-iEnt/ħ)

Where the coefficients cn determine the probability to measure each energy eigenvalue En.

4. This shows that the Schrödinger equation provides deterministic, unitar


Do you have a question about the content? (yes/no):  yes
Enter your question:  what does the lecture say about of plant tissues?



Response:
The lecture does not mention anything about plant tissues. It is focused entirely on solving the Schrödinger equation in quantum mechanics and discussing the time evolution of wavefunctions, energy eigenfunctions and eigenvalues for simple systems like a free particle and a particle in an infinite square well potential. There is no discussion of plant tissues or anything related to plants or biology in this lecture content.



Do you have a question about the content? (yes/no):  no


Thank you for using the Q&A service!


As expected, the model cannot answer the biology question, because plant tissues are not part of the supplied content.
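
The loop above keeps the whole exchange in `messages`, so each follow-up question is answered with the original content and the earlier answers in context. A minimal sketch of that bookkeeping follows, assuming a hypothetical helper `append_turn` that is not part of the notebook. It rejects empty turns (`invoke_bedrock_model` returns `None` on failure) and two consecutive turns from the same role, since the history in this notebook always alternates between user and assistant.

```python
def append_turn(messages, role, text):
    """Append one turn to a Messages-style history (hypothetical helper)."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unknown role: {role!r}")
    if not text or not text.strip():
        # invoke_bedrock_model returns None on failure; do not store that.
        raise ValueError("refusing to append an empty turn")
    if messages and messages[-1]["role"] == role:
        raise ValueError(f"two consecutive {role!r} turns")
    messages.append({"role": role, "content": [{"type": "text", "text": text}]})
    return messages


# Illustrative history; the texts are placeholders, not model output.
history = []
append_turn(history, "user", "Here's the content for reference: ...")
append_turn(history, "assistant", "I have read it.")
append_turn(history, "user", "How does the lecture relate time and the Schrodinger equation?")
print([turn["role"] for turn in history])
```

In the Q&A loop, a failed call could then be handled by skipping the append instead of storing `None` as the assistant's text.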

### Step 9 : Creating a vector DB for RAG

In [105]:
%pip install -U opensearch-py==2.3.1
%pip install -U boto3==1.33.3
%pip install -U retrying==1.3.4

Collecting opensearch-py==2.3.1
  Downloading opensearch_py-2.3.1-py2.py3-none-any.whl.metadata (6.9 kB)
Collecting urllib3<2,>=1.21.1 (from opensearch-py==2.3.1)
  Downloading urllib3-1.26.19-py2.py3-none-any.whl.metadata (49 kB)
Downloading opensearch_py-2.3.1-py2.py3-none-any.whl (327 kB)
Downloading urllib3-1.26.19-py2.py3-none-any.whl (143 kB)
Installing collected packages: urllib3, opensearch-py
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.2.2
    Uninstalling urllib3-2.2.2:
      Successfully uninstalled urllib3-2.2.2
ERROR: pip's dependency resolver does not currently take int

In [106]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [107]:
import warnings
warnings.filterwarnings('ignore')

Check that the S3 bucket exists.

In [109]:
import json
import os
import boto3
from botocore.exceptions import ClientError
import pprint
from utility import create_bedrock_execution_role, create_oss_policy_attach_bedrock_execution_role, create_policies_in_oss, interactive_sleep
import random
from retrying import retry
suffix = random.randrange(200, 900)

sts_client = boto3.client('sts')
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
service = 'aoss'
s3_client = boto3.client('s3')
account_id = sts_client.get_caller_identity()["Account"]
s3_suffix = f"{region_name}-{account_id}"
bucket_name = 'edify-apjc'
pp = pprint.PrettyPrinter(indent=2)

In [110]:
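try:
    s3_client.head_bucket(Bucket=bucket_name)
    print(f'Bucket {bucket_name} Exists')
except ClientError as e:
    print(f'Creating bucket {bucket_name}')
    # us-east-1 must be created without a LocationConstraint
    extra = {} if region_name == 'us-east-1' else {
        'CreateBucketConfiguration': {'LocationConstraint': region_name}}
    s3_client.create_bucket(Bucket=bucket_name, **extra)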

Bucket edify-apjc Exists


In [111]:
import boto3
import time
vector_store_name = f'bedrock-edify-rag-{suffix}'
index_name = f"bedrock-edify-rag-index-{suffix}"
aoss_client = boto3_session.client('opensearchserverless')
bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name=bucket_name)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

In [112]:
encryption_policy, network_policy, access_policy = create_policies_in_oss(vector_store_name=vector_store_name,
                       aoss_client=aoss_client,
                       bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn)
collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')

In [113]:
pp.pprint(collection)

{ 'ResponseMetadata': { 'HTTPHeaders': { 'connection': 'keep-alive',
                                         'content-length': '313',
                                         'content-type': 'application/x-amz-json-1.0',
                                         'date': 'Thu, 29 Aug 2024 07:12:03 '
                                                 'GMT',
                                         'x-amzn-requestid': 'c3901f6c-ac9e-4fbd-aee2-fc00c39a9833'},
                        'HTTPStatusCode': 200,
                        'RequestId': 'c3901f6c-ac9e-4fbd-aee2-fc00c39a9833',
                        'RetryAttempts': 0},
  'createCollectionDetail': { 'arn': 'arn:aws:aoss:us-east-1:533267439967:collection/hyurqnu1q1dut2msng40',
                              'createdDate': 1724915523156,
                              'id': 'hyurqnu1q1dut2msng40',
                              'kmsKeyArn': 'auto',
                              'lastModifiedDate': 1724915523156,
                             

In [114]:
collection_id = collection['createCollectionDetail']['id']
host = collection_id + '.' + region_name + '.aoss.amazonaws.com'
print(host)

hyurqnu1q1dut2msng40.us-east-1.aoss.amazonaws.com


In [115]:
response = aoss_client.batch_get_collection(names=[vector_store_name])
# Periodically check collection status
while (response['collectionDetails'][0]['status']) == 'CREATING':
    print('Creating collection...')
    interactive_sleep(30)
    response = aoss_client.batch_get_collection(names=[vector_store_name])
print('\nCollection successfully created:')
pp.pprint(response["collectionDetails"])


Collection successfully created:
[ { 'arn': 'arn:aws:aoss:us-east-1:533267439967:collection/hyurqnu1q1dut2msng40',
    'collectionEndpoint': 'https://hyurqnu1q1dut2msng40.us-east-1.aoss.amazonaws.com',
    'createdDate': 1724915523156,
    'dashboardEndpoint': 'https://hyurqnu1q1dut2msng40.us-east-1.aoss.amazonaws.com/_dashboards',
    'id': 'hyurqnu1q1dut2msng40',
    'kmsKeyArn': 'auto',
    'lastModifiedDate': 1724915547601,
    'name': 'bedrock-edify-rag-357',
    'standbyReplicas': 'ENABLED',
    'status': 'ACTIVE',
    'type': 'VECTORSEARCH'}]
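
The polling loop above runs until the status leaves `CREATING`, with no upper bound on the wait. A sketch of the same pattern with a timeout is shown below. `wait_for_status` is a hypothetical helper, not part of the notebook or of boto3; it takes any zero-argument callable that returns the current status, so it can be tried here with a stand-in instead of a real AWS call.

```python
import time


def wait_for_status(get_status, pending=("CREATING",), interval=30.0,
                    timeout=600.0, sleep=time.sleep):
    """Poll get_status() until it leaves the pending states or the timeout passes."""
    waited = 0.0
    status = get_status()
    while status in pending:
        if waited >= timeout:
            raise TimeoutError(f"still {status} after {waited:.0f}s")
        sleep(interval)
        waited += interval
        status = get_status()
    return status


# Stand-in for the real call. In the notebook the callable could be:
# lambda: aoss_client.batch_get_collection(
#     names=[vector_store_name])["collectionDetails"][0]["status"]
fake_statuses = iter(["CREATING", "CREATING", "ACTIVE"])
final = wait_for_status(lambda: next(fake_statuses), interval=0.0,
                        sleep=lambda seconds: None)
print(final)
```

Returning the final status, rather than assuming success, lets the caller distinguish `ACTIVE` from a failed creation.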


In [117]:
try:
    create_oss_policy_attach_bedrock_execution_role(collection_id=collection_id,
                                                    bedrock_kb_execution_role=bedrock_kb_execution_role)
    # It can take up to a minute for data access rules to be enforced
    interactive_sleep(60)
except Exception as e:
    print("Policy already exists")
    pp.pprint(e)

Opensearch serverless arn:  arn:aws:iam::533267439967:policy/AmazonBedrockOSSPolicyForKnowledgeBase_564
Done!.......................................................


### Step 10 : Creating vector index

In [118]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, RequestError
credentials = boto3.Session().get_credentials()
awsauth = auth = AWSV4SignerAuth(credentials, region_name, service)

index_name = f"bedrock-edify-index-{suffix}"
body_json = {
   "settings": {
      "index.knn": "true",
      "number_of_shards": 1,
      "knn.algo_param.ef_search": 512,
      "number_of_replicas": 0,
   },
   "mappings": {
      "properties": {
         "vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
               "name": "hnsw",
               "engine": "faiss",
               "space_type": "l2"
            },
         },
         "text": {
            "type": "text"
         },
         "text-metadata": {
            "type": "text"
         }
      }
   }
}

oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)
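
The `dimension` of the `vector` field has to match the length of the vectors produced by the embedding model; 1536 is the output size of `amazon.titan-embed-text-v1`, the model this notebook configures for the knowledge base. A sketch follows, assuming a hypothetical helper `build_knn_index_body`, that rebuilds the same mapping with the dimension and field names as parameters, so it can be regenerated if a different embedding model is chosen.

```python
import json


def build_knn_index_body(dimension, vector_field="vector",
                         text_field="text", metadata_field="text-metadata"):
    """Return an index body shaped like body_json above (hypothetical helper)."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return {
        "settings": {
            "index.knn": "true",
            "number_of_shards": 1,
            "knn.algo_param.ef_search": 512,
            "number_of_replicas": 0,
        },
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
                },
                text_field: {"type": "text"},
                metadata_field: {"type": "text"},
            }
        },
    }


body = build_knn_index_body(1536)
print(json.dumps(body["mappings"]["properties"]["vector"], indent=2))
```

The field names must stay in sync with the `fieldMapping` passed to the knowledge base, which is why they are parameters rather than literals here.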

In [119]:
try:
    response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
    print('\nCreating index:')
    pp.pprint(response)

    # index creation can take up to a minute
    interactive_sleep(60)
except RequestError as e:
    # If the index already exists, you can delete it and rerun this cell:
    # oss_client.indices.delete(index=index_name)
    print(f'Error while trying to create the index: {e.error}\nUncomment the delete call above to delete and recreate the index.')


Creating index:
{ 'acknowledged': True,
  'index': 'bedrock-edify-index-357',
  'shards_acknowledged': True}
Done!.......................................................


In [120]:
opensearchServerlessConfiguration = {
            "collectionArn": collection["createCollectionDetail"]['arn'],
            "vectorIndexName": index_name,
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "text-metadata"
            }
        }

# Ingest strategy - How to ingest data from the data source
chunkingStrategyConfiguration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 512,
        "overlapPercentage": 20
    }
}

# The data source to ingest documents from, into the OpenSearch serverless knowledge base index
s3Configuration = {
    "bucketArn": f"arn:aws:s3:::{bucket_name}",
    # "inclusionPrefixes":["*.*"] # you can use this if you want to create a KB using data within s3 prefixes.
}

# The embedding model used by Bedrock to embed ingested documents, and realtime prompts
embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1"

name = f"bedrock-edify-knowledge-base-{suffix}"
description = "edify knowledge base"
roleArn = bedrock_kb_execution_role_arn
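
To build intuition for `maxTokens: 512` with `overlapPercentage: 20`: consecutive chunks share roughly a fifth of their tokens, so a sentence cut at a chunk boundary still appears whole in one of the two neighbours. The sketch below illustrates the idea on a plain list of tokens. It is not how Bedrock tokenizes or splits documents, and `fixed_size_chunks` is a hypothetical name.

```python
def fixed_size_chunks(tokens, max_tokens=512, overlap_percentage=20):
    """Split tokens into windows of max_tokens that overlap by the given percentage."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # always >= 1 because overlap < max_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(1200))  # integers stand in for real tokens
chunks = fixed_size_chunks(tokens)
for chunk in chunks:
    print(chunk[0], chunk[-1], len(chunk))
```

Larger overlap improves the chance that a retrieved chunk contains the full context of an answer, at the cost of more vectors to store and search.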

In [121]:
from retrying import retry

@retry(wait_random_min=1000, wait_random_max=2000,stop_max_attempt_number=7)
def create_knowledge_base_func():
    create_kb_response = bedrock_agent_client.create_knowledge_base(
        name = name,
        description = description,
        roleArn = roleArn,
        knowledgeBaseConfiguration = {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embeddingModelArn
            }
        },
        storageConfiguration = {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration":opensearchServerlessConfiguration
        }
    )
    return create_kb_response["knowledgeBase"]

In [122]:
try:
    kb = create_knowledge_base_func()
except Exception as err:
    print(f"{err=}, {type(err)=}")

In [123]:
pp.pprint(kb)

{ 'createdAt': datetime.datetime(2024, 8, 29, 7, 18, 30, 639648, tzinfo=tzlocal()),
  'description': 'edify knowledge base',
  'knowledgeBaseArn': 'arn:aws:bedrock:us-east-1:533267439967:knowledge-base/5SSLBN29TM',
  'knowledgeBaseConfiguration': { 'type': 'VECTOR',
                                  'vectorKnowledgeBaseConfiguration': { 'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1'}},
  'knowledgeBaseId': '5SSLBN29TM',
  'name': 'bedrock-edify-knowledge-base-357',
  'roleArn': 'arn:aws:iam::533267439967:role/AmazonBedrockExecutionRoleForKnowledgeBase_564',
  'status': 'CREATING',
  'storageConfiguration': { 'opensearchServerlessConfiguration': { 'collectionArn': 'arn:aws:aoss:us-east-1:533267439967:collection/hyurqnu1q1dut2msng40',
                                                                   'fieldMapping': { 'metadataField': 'text-metadata',
                                                                                     'textF

In [124]:
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb['knowledgeBaseId'])


In [125]:
create_ds_response = bedrock_agent_client.create_data_source(
    name = name,
    description = description,
    knowledgeBaseId = kb['knowledgeBaseId'],
    dataSourceConfiguration = {
        "type": "S3",
        "s3Configuration":s3Configuration
    },
    vectorIngestionConfiguration = {
        "chunkingConfiguration": chunkingStrategyConfiguration
    }
)
ds = create_ds_response["dataSource"]
pp.pprint(ds)

{ 'createdAt': datetime.datetime(2024, 8, 29, 7, 19, 18, 890100, tzinfo=tzlocal()),
  'dataSourceConfiguration': { 's3Configuration': { 'bucketArn': 'arn:aws:s3:::edify-apjc'},
                               'type': 'S3'},
  'dataSourceId': 'IC4L6ZCSYY',
  'description': 'edify knowledge base',
  'knowledgeBaseId': '5SSLBN29TM',
  'name': 'bedrock-edify-knowledge-base-357',
  'status': 'AVAILABLE',
  'updatedAt': datetime.datetime(2024, 8, 29, 7, 19, 18, 890100, tzinfo=tzlocal()),
  'vectorIngestionConfiguration': { 'chunkingConfiguration': { 'chunkingStrategy': 'FIXED_SIZE',
                                                               'fixedSizeChunkingConfiguration': { 'maxTokens': 512,
                                                                                                   'overlapPercentage': 20}}}}


In [126]:
bedrock_agent_client.get_data_source(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])


{'ResponseMetadata': {'RequestId': 'ab586efd-a0f0-4806-9560-cf7b2e63414d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Thu, 29 Aug 2024 07:19:29 GMT',
   'content-type': 'application/json',
   'content-length': '558',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'ab586efd-a0f0-4806-9560-cf7b2e63414d',
   'x-amz-apigw-id': 'dQoYVFv2IAMERfA=',
   'x-amzn-trace-id': 'Root=1-66d02101-6cf075fa382e7a8b2696e759'},
  'RetryAttempts': 0},
 'dataSource': {'knowledgeBaseId': '5SSLBN29TM',
  'dataSourceId': 'IC4L6ZCSYY',
  'name': 'bedrock-edify-knowledge-base-357',
  'status': 'AVAILABLE',
  'description': 'edify knowledge base',
  'dataSourceConfiguration': {'type': 'S3',
   's3Configuration': {'bucketArn': 'arn:aws:s3:::edify-apjc'}},
  'vectorIngestionConfiguration': {'chunkingConfiguration': {'chunkingStrategy': 'FIXED_SIZE',
    'fixedSizeChunkingConfiguration': {'maxTokens': 512,
     'overlapPercentage': 20}}},
  'createdAt': datetime.datetime(2024, 8, 29, 7, 19, 18, 89010

### Step 11 : Ingesting the data in S3 bucket to vector DB

In [127]:
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])


In [128]:
job = start_job_response["ingestionJob"]
pp.pprint(job)

{ 'dataSourceId': 'IC4L6ZCSYY',
  'ingestionJobId': 'QXYX750ICY',
  'knowledgeBaseId': '5SSLBN29TM',
  'startedAt': datetime.datetime(2024, 8, 29, 7, 19, 56, 98919, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 0,
                  'numberOfModifiedDocumentsIndexed': 0,
                  'numberOfNewDocumentsIndexed': 0},
  'status': 'STARTING',
  'updatedAt': datetime.datetime(2024, 8, 29, 7, 19, 56, 98919, tzinfo=tzlocal())}


In [129]:
# Poll until the job reaches a terminal state, pausing between calls
while job['status'] not in ('COMPLETE', 'FAILED'):
    time.sleep(10)
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId = kb['knowledgeBaseId'],
        dataSourceId = ds["dataSourceId"],
        ingestionJobId = job["ingestionJobId"]
    )
    job = get_job_response["ingestionJob"]
pp.pprint(job)
interactive_sleep(50)

{ 'dataSourceId': 'IC4L6ZCSYY',
  'failureReasons': [ '["Ignored 7 files as they were larger than service '
                      'limit of MaximumFileSizeSupported: 52428800 '
                      'bytes","Encountered error: Ignored 3 files as their '
                      'file format was not supported. [Files: '
                      's3://edify-apjc/transcribe-input/botany/Introduction to '
                      'Botany.mp4, '
                      's3://edify-apjc/transcribe-input/botany/Plant Anatomy '
                      'and Structure.mp4, '
                      's3://edify-apjc/transcribe-input/botany/Mechanisms of '
                      'Plant Growth.mp4]. Call to Customer Source did not '
                      'succeed.","Encountered error: Ignored 2 files as their '
                      'file format was not supported. [Files: '
                      's3://edify-apjc/transcribe-input/botany/Types of Plant '
                      'Tissues.mp4, s3://edify-apjc/sample/pod
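
The `failureReasons` above report two kinds of skipped objects: files larger than 52428800 bytes and files in an unsupported format (the `.mp4` sources sitting in the same bucket as the transcripts). A sketch follows, with a hypothetical helper `split_ingestible`, that classifies an object listing before ingestion starts. The extension set is an assumption and should be checked against the current Bedrock Knowledge Bases documentation; the listing is made up for illustration.

```python
import os

MAX_FILE_BYTES = 52428800  # limit quoted in the failureReasons output above
# Assumed set of text-like formats; verify against the service documentation.
ASSUMED_EXTENSIONS = {".txt", ".md", ".html", ".pdf", ".csv", ".doc", ".docx"}


def split_ingestible(objects, max_bytes=MAX_FILE_BYTES, extensions=ASSUMED_EXTENSIONS):
    """Split (key, size) pairs into ingestible keys and skipped (key, reason) pairs."""
    ok, skipped = [], []
    for key, size in objects:
        ext = os.path.splitext(key)[1].lower()
        if ext not in extensions:
            skipped.append((key, f"unsupported format {ext or '(none)'}"))
        elif size > max_bytes:
            skipped.append((key, f"{size} bytes exceeds {max_bytes}"))
        else:
            ok.append(key)
    return ok, skipped


# Made-up listing; real (key, size) pairs could come from s3_client.list_objects_v2.
listing = [
    ("transcribe-input/botany/Introduction to Botany.mp4", 80_000_000),
    ("transcripts/lecture-notes.txt", 120_000),
    ("transcripts/huge-archive.pdf", 60_000_000),
]
ok, skipped = split_ingestible(listing)
print(ok)
print(skipped)
```

Keeping transcripts under a dedicated prefix and pointing the data source at it with `inclusionPrefixes` is another way to keep the media files out of the ingestion job.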

Checking the Knowledge Base ID

In [130]:
kb_id = kb["knowledgeBaseId"]
pp.pprint(kb_id)

'5SSLBN29TM'


In [131]:
%store kb_id

Stored 'kb_id' (str)


### Step 12 : Testing output with RAG

In [139]:
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)

claude_model_ids = [
    ["Claude 3 Sonnet", "anthropic.claude-3-sonnet-20240229-v1:0"],
    ["Claude Instant", "anthropic.claude-instant-v1"]
]

In [140]:
def ask_bedrock_llm_with_knowledge_base(query: str, model_arn: str, kb_id: str) -> dict:
    response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kb_id,
                'modelArn': model_arn
            }
        },
    )

    return response

In [141]:
query = " how does the lecture relate time and schrodinger equation?"

for model_id in claude_model_ids:
    model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id[1]}'
    response = ask_bedrock_llm_with_knowledge_base(query, model_arn, kb_id)
    generated_text = response['output']['text']
    citations = response["citations"]
    contexts = []
    for citation in citations:
        retrievedReferences = citation["retrievedReferences"]
        for reference in retrievedReferences:
            contexts.append(reference["content"]["text"])
    print(f"---------- Generated using {model_id[0]}:")
    pp.pprint(generated_text )
    print(f'---------- The citations for the response generated by {model_id[0]}:')
    pp.pprint(contexts)
    print()

---------- Generated using Claude 3 Sonnet:
('The lecture explains that the Schrodinger equation governs the time '
 "evolution of a quantum system's wavefunction. Specifically, given a "
 'wavefunction at some initial time, the Schrodinger equation allows us to '
 'determine the time derivative (rate of change) of the wavefunction and '
 'integrate it forward in time to find how the wavefunction evolves over time. '
 'The lecture also notes that time evolution in quantum mechanics, governed by '
 'the Schrodinger equation, is surprisingly simple compared to classical '
 'mechanics. The Schrodinger equation is a linear differential equation, '
 'whereas the equations of motion in classical mechanics are generally highly '
 'nonlinear.')
---------- The citations for the response generated by Claude 3 Sonnet:
[ 'And so that leaves us with the following question, which is another way of '
  'asking the question you just asked, what happens next? How does the system '
  'evolve subsequentl
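
Besides its text, each retrieved reference carries a `location` entry pointing at the source object, which is useful for showing learners where an answer came from. A sketch follows, assuming the `location.s3Location.uri` shape returned for S3 data sources. `collect_sources` is a hypothetical helper, and the sample response is hand-written for illustration, not service output.

```python
def collect_sources(response):
    """Return the unique source URIs cited in a retrieve_and_generate response, in order."""
    seen = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in seen:
                seen.append(uri)
    return seen


# Hand-written sample with the same nesting as the response used above.
sample_response = {
    "output": {"text": "..."},
    "citations": [
        {"retrievedReferences": [
            {"content": {"text": "chunk one"},
             "location": {"type": "S3",
                          "s3Location": {"uri": "s3://example-bucket/lecture-1.txt"}}},
            {"content": {"text": "chunk two"},
             "location": {"type": "S3",
                          "s3Location": {"uri": "s3://example-bucket/lecture-1.txt"}}},
        ]},
        {"retrievedReferences": []},
    ],
}
print(collect_sources(sample_response))
```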

Test summarization

In [142]:
import boto3
import pprint
from botocore.client import Config

pp = pprint.PrettyPrinter(indent=2)

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
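# Note: this rebinds bedrock_agent_client (previously the 'bedrock-agent' client)
# to the 'bedrock-agent-runtime' client that provides retrieve_and_generate.
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config)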
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

model_id = "anthropic.claude-v2"  # also try Claude Instant: "anthropic.claude-instant-v1"
region_id = "us-east-1"  # replace with the region where your SageMaker notebook runs

In [143]:
def retrieveAndGenerate(input, kbId, sessionId=None, model_id = "anthropic.claude-v2", region_id = "us-east-1"):
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    request = {
        'input': {
            'text': input
        },
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,
                'modelArn': model_arn
            }
        }
    }
    # Pass sessionId only when continuing an existing session
    if sessionId:
        request['sessionId'] = sessionId
    return bedrock_agent_client.retrieve_and_generate(**request)

In [144]:
query = "can you summarize the lecture on wave functions and energy eigen states?"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)

('The lecture covered energy eigenfunctions and wave functions in quantum '
 'mechanics. The professor discussed that energy eigenfunctions satisfy the '
 'time-independent Schrodinger equation and represent stationary states. The '
 'wave function can be expanded in terms of energy eigenfunctions. Measuring '
 'an observable collapses the wave function into an eigenstate of that '
 'observable operator.')


In [145]:
query = "summarize the lecture on plant pigments"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)

('The lecture summarizes the different types of plant pigments and their '
 'functions. The main pigments discussed are chlorophyll, carotenoids, '
 'anthocyanins, and flavonoids. Chlorophyll is the most important pigment for '
 'photosynthesis and makes plants green. Carotenoids assist chlorophyll and '
 'give red/orange/yellow colors. Anthocyanins produce red colors in '
 'flowers/fruits. Flavonoids are found in plant cell vacuoles.')


In [146]:
query = "what role does energy eigen states play in plant pigments?"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)

'Sorry, I am unable to assist you with this request.'


In [148]:
query = "is there a relation between energy eigen states and plant pigments?"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)

('I could not find an exact answer to whether there is a relation between '
 'energy eigenstates and plant pigments based on the provided search results. '
 'The search results discuss various plant pigments and their roles, but do '
 'not mention energy eigenstates.')


In [151]:
query = "provide a gist of plant pigments and mechanisms of plant growth lecture"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)

('The lecture discusses different types of plant pigments like chlorophyll, '
 'carotenoids, anthocyanins, and phytochromes. Chlorophyll is the most '
 'important pigment and is involved in photosynthesis. Carotenoids assist '
 'chlorophyll in light absorption. Anthocyanins produce red colors in plants. '
 'Phytochromes regulate plant growth and development.')


In [153]:
query = "can you check if does anthocyanins play a role in schrodinger equation?"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)

'Sorry, I am unable to assist you with this request.'
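
`retrieveAndGenerate` accepts a `sessionId`, but every query above starts a fresh session. Each response includes a `sessionId` that can be passed back so follow-up questions share context. A sketch of that threading follows, assuming a hypothetical helper `ask_in_session`; it is exercised with a stand-in for `retrieveAndGenerate` so that it runs without AWS access.

```python
def ask_in_session(ask, queries, kb_id):
    """Send queries one by one, reusing the sessionId from the previous response."""
    session_id = None
    answers = []
    for query in queries:
        response = ask(query, kb_id, sessionId=session_id)
        session_id = response.get("sessionId", session_id)
        answers.append(response["output"]["text"])
    return answers, session_id


def fake_ask(query, kb_id, sessionId=None):
    # Stand-in for retrieveAndGenerate; echoes the query instead of calling Bedrock.
    return {"sessionId": sessionId or "session-1",
            "output": {"text": f"answer to: {query}"}}


answers, session_id = ask_in_session(fake_ask, ["q1", "q2"], "KB_ID")
print(answers, session_id)
```

In the notebook, `ask_in_session(retrieveAndGenerate, [...], kb_id)` would run the same flow against the knowledge base with the function's default model and region.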


### Step 13 : Cleanup


In [None]:
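To avoid incurring future charges, delete the resources created above: the Bedrock knowledge base and its data source, the OpenSearch Serverless collection, index, and policies, and the IAM execution role.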
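
A sketch of what the cleanup could look like for the data source, knowledge base, vector index, and collection, deleted in reverse order of creation. The function `cleanup_rag_resources` and its argument names are assumptions. The policies and the IAM role created by the `utility` helpers are not covered, because their names are generated inside those helpers. The block runs as a dry run against stand-in clients.

```python
from unittest.mock import MagicMock


def cleanup_rag_resources(bedrock_agent, aoss, oss, kb_id, ds_id, index_name, collection_id):
    """Delete data source, knowledge base, index and collection; collect failures."""
    steps = [
        ("data source", lambda: bedrock_agent.delete_data_source(
            knowledgeBaseId=kb_id, dataSourceId=ds_id)),
        ("knowledge base", lambda: bedrock_agent.delete_knowledge_base(
            knowledgeBaseId=kb_id)),
        ("vector index", lambda: oss.indices.delete(index=index_name)),
        ("collection", lambda: aoss.delete_collection(id=collection_id)),
    ]
    failures = []
    for label, action in steps:
        try:
            action()
            print(f"Deleted {label}")
        except Exception as err:  # keep going so later resources are still removed
            failures.append((label, err))
            print(f"Could not delete {label}: {err}")
    return failures


# Dry run with stand-in clients and placeholder identifiers.
agent, aoss, oss = MagicMock(), MagicMock(), MagicMock()
failures = cleanup_rag_resources(agent, aoss, oss, "KB_ID", "DS_ID",
                                 "my-index", "collection-id")
```

In the notebook, the real arguments would be a `boto3.client('bedrock-agent')` (created afresh, since `bedrock_agent_client` was rebound to the runtime client in Step 12), `aoss_client`, `oss_client`, `kb_id`, `ds["dataSourceId"]`, `index_name`, and `collection_id`.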