Run the cells in this section to install the packages needed by the notebooks in this workshop.

You can ignore the following error if it appears: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

In [1]:
%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"


Collecting boto3>=1.28.57
  Using cached boto3-1.42.11-py3-none-any.whl.metadata (6.8 kB)
Collecting awscli>=1.29.57
  Using cached awscli-1.44.1-py3-none-any.whl.metadata (11 kB)
Collecting botocore>=1.31.57
  Using cached botocore-1.42.11-py3-none-any.whl.metadata (5.9 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3>=1.28.57)
  Using cached jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.17.0,>=0.16.0 (from boto3>=1.28.57)
  Using cached s3transfer-0.16.0-py3-none-any.whl.metadata (1.7 kB)
Collecting python-dateutil<3.0.0,>=2.1 (from botocore>=1.31.57)
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting urllib3!=2.2.0,<3,>=1.25.4 (from botocore>=1.31.57)
  Using cached urllib3-2.6.2-py3-none-any.whl.metadata (6.6 kB)
Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore>=1.31.57)
  Using cached six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting docutils<=0.19,>=0.18.1 (from awscli>=1.29.57)
  Usi

This notebook starts by invoking Bedrock models directly through the AWS SDK. The later parts of the notebook also need the following packages.

In [None]:
%pip install  \
    "langchain>=0.0.350" \
    "transformers>=4.24,<5" \
    sqlalchemy -U \
    "faiss-cpu>=1.7,<2" \
    "pypdf>=3.8,<4" \
    pinecone-client==2.2.4 \
    tiktoken==0.5.2 \
    "ipywidgets>=7,<8" \
    matplotlib==3.8.2 \
    anthropic==0.9.0 \
    datasets==2.15.0 \
    numexpr==2.8.8

### Restart Kernel 

In [3]:
# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

boto3 provides separate clients for Amazon Bedrock, each covering different actions. The InvokeModel and InvokeModelWithResponseStream actions are served by the Amazon Bedrock Runtime client (bedrock-runtime).
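
As a rough illustration of what a direct InvokeModel call looks like, the sketch below builds a request body in the Claude v2 text-completion format and sends it through a bedrock-runtime client. The helper names `build_claude_body` and `invoke_claude` are hypothetical and not part of this workshop; the client is passed in as an argument so the helpers make no assumption about how it was created.

```python
import json


def build_claude_body(prompt: str, max_tokens: int = 300) -> str:
    """Serialize a request body in the Claude v2 text-completion format."""
    return json.dumps({
        # Claude v2 expects the Human/Assistant turn markers in the prompt.
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
    })


def invoke_claude(client, prompt: str, model_id: str = "anthropic.claude-v2") -> str:
    """Call InvokeModel on a bedrock-runtime client and return the completion text."""
    response = client.invoke_model(
        modelId=model_id,
        body=build_claude_body(prompt),
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream; read it and decode the JSON payload.
    return json.loads(response["body"].read())["completion"]
```

With the client created in the next cell, a call would look like `invoke_claude(boto3_bedrock, "What is RAG?")`.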

In [4]:
import json
import os
import sys

import boto3

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock



boto3_bedrock = bedrock.get_bedrock_client(
    runtime = True
)

Create new client
  Using region: us-east-1
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)


In [5]:
%pip install -U langchain langchain-community


In [6]:
# We will be using the Titan Embeddings Model to generate our Embeddings.
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.llms import Bedrock

# - create the Anthropic Model
llm = Bedrock(model_id="anthropic.claude-v2", client=boto3_bedrock, model_kwargs={'max_tokens_to_sample':300})
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=boto3_bedrock)



The cell above instantiates the LLM and the embeddings model: Anthropic Claude for text generation and Amazon Titan for text embedding.

Note: Other models available in Bedrock can be used instead; change the model_id argument to switch models.

Let's first download some files to build our document store. This example downloads the RAG paper from arXiv.

In [7]:
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://arxiv.org/pdf/2005.11401.pdf"
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

After downloading, we can load the documents with PyPDFDirectoryLoader, available in LangChain, and split them into smaller chunks.

Note: Each retrieved chunk should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt.
The embeddings model also limits its input to 8192 tokens, which roughly translates to ~32,000 characters.
For this use case we create chunks of roughly 2000 characters with an overlap of 200 characters using RecursiveCharacterTextSplitter.
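
To make the chunk size and overlap numbers concrete, here is a simplified fixed-window splitter. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers natural separators such as paragraphs and sentences, which is why real chunks are often shorter than the limit); the function name `split_with_overlap` is hypothetical and only illustrates how consecutive chunks share their boundary characters.

```python
def split_with_overlap(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "x" * 5000
chunks = split_with_overlap(text)
print([len(c) for c in chunks])  # [2000, 2000, 1400]
```

Each window starts 1800 characters after the previous one, so the last 200 characters of one chunk are the first 200 of the next.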


In [None]:
pip install langchain-text-splitters 

In [9]:
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader


loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()
# - in our testing, character-based splitting works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of roughly 2000 characters, overlapping by 200 characters.
    chunk_size = 2000,
    chunk_overlap  = 200,
)
docs = text_splitter.split_documents(documents)



Let's review how many chunks and characters we are dealing with.

In [10]:

avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents more than the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

Average length among 19 documents loaded is 3637 characters.
After the split we have 48 documents more than the original 19.
Average length among 48 documents (after split) is 1542 characters.


Now we can see what a sample embedding looks like for the first chunk.

In [11]:
try:
    # Embed the first chunk and print that same chunk next to its embedding.
    sample_chunk = docs[0].page_content
    sample_embedding = np.array(bedrock_embeddings.embed_query(sample_chunk))
    print("Sample chunk: ", sample_chunk)
    print("Sample embedding of a document chunk: ", sample_embedding)
    print("Size of the embedding: ", sample_embedding.shape)

except ValueError as error:
    if "AccessDeniedException" in str(error):
        print(f"\x1b[41m{error}\
        \nTo troubleshoot this issue please refer to the following resources.\
         \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
         \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n")
        # Stop the notebook without printing a traceback.
        class StopExecution(ValueError):
            def _render_traceback_(self):
                pass
        raise StopExecution
    else:
        raise error

Sample chunk:  Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research;‡University College London;⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks. However, their ability to access and precisely manipulate knowl-
edge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-speciﬁc architectures. Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pre-
trained models with a differentiable access mechanism to explicit non-parametric
memory have so far been only investigated for extractive

Following the same pattern, embeddings can be generated for the entire corpus and stored in a vector store.

This is straightforward with the FAISS implementation in LangChain, which takes the embeddings model and the documents as input and builds the entire vector store.
An index wrapper can then abstract away most of the heavy lifting, such as creating the prompt, embedding the query, fetching the relevant documents and calling the LLM.
VectorStoreIndexWrapper helps us with that.

⚠️⚠️⚠️ NOTE: it might take a few minutes to run the following cell ⚠️⚠️⚠️


In [None]:
pip install -U langchain-core

In [13]:

from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.faiss import DistanceStrategy
#from langchain_community.indexes import VectorstoreIndexCreator
#from langchain_community.vectorstores import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
    distance_strategy=DistanceStrategy.COSINE
)

#wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

Now that we have our vector store in place, we can start asking questions.

In [14]:
query = """Explain Retrieval Augment Generation to a 6th grader"""

Let's review the embedding for the query.

In [15]:
query_embedding = vectorstore_faiss.embedding_function.embed_query(query)
np.array(query_embedding)

array([-0.62890625, -0.41796875, -0.234375  , ...,  0.66015625,
        0.375     , -0.03222656])

Now that the query is represented as an embedding, we can run a similarity search against the vector store to fetch the most relevant documents.
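
Conceptually, a similarity search scores every stored vector against the query vector and keeps the best matches. The sketch below does this with cosine similarity on toy 2-dimensional vectors. It is only an illustration: the helper names are hypothetical, and the metric a real FAISS index uses depends on how the index was configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, similarity) pairs for the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, s) for s, i in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, similarity in top_k([1.0, 0.1], doc_vecs, k=2):
    print(index, round(similarity, 3))
```

The query points almost along the first axis, so vector 0 ranks first and the diagonal vector 2 ranks second.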

In [16]:
relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

4 documents are fetched which are relevant to the query.
----
## Document 1: Appendices for Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
A Implementation Details
For Open-domain QA we report test numbers using 15 retrieved documents for RAG-Token models.
For RAG-Sequence models, we report test results using 50 retrieved documents, and we use the
Thorough Decoding approach since answers are generally short. We use greedy decoding for QA as
we did not ﬁnd beam search improved results. For Open-MSMarco and Jeopardy question generation,
we report test numbers using ten retrieved documents for both RAG-Token and RAG-Sequence,
and we also train a BART-large model as a baseline. We use a beam size of four, and use the Fast
Decoding approach for RAG-Sequence models, as Thorough Decoding did not improve performance.
B Human Evaluation
Figure 4: Annotation interface for human evaluation of factuality. A pop-out for detailed instructions
and a worked example appear when clicki

LangChain also provides a wrapper around the vector store that takes the LLM as input. This wrapper performs the following steps behind the scenes:

    1. Take the question as input
    2. Create the question embedding
    3. Fetch the relevant documents
    4. Stuff the documents and the question into a prompt
    5. Invoke the model with the prompt and generate the answer in a human-readable manner.
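
The steps above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `answer_question`, `retrieve` and `llm` are hypothetical names, with `retrieve` standing in for the embedding and search steps and `llm` for the model call.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question.\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {question}"
)


def answer_question(question, retrieve, llm, k=4):
    """Retrieve k chunks, stuff them into one prompt, and ask the model."""
    chunks = retrieve(question, k)       # embed the question and fetch relevant text
    context = "\n\n".join(chunks)        # "stuff": concatenate every chunk
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return llm(prompt)                   # generate the answer from the prompt


# Stand-ins so the sketch runs without a vector store or a model.
def fake_retrieve(question, k):
    return ["RAG combines a retriever with a text generator."][:k]


def fake_llm(prompt):
    return f"(model received {len(prompt)} characters)"


print(answer_question("What is RAG?", fake_retrieve, fake_llm))
```

Because every retrieved chunk goes into a single prompt, the chunk size and the number of chunks together have to fit within the model's context window.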


In [None]:
pip install -U langchain langchain-community langchain-text-splitters boto3 faiss-cpu

In [18]:

#from langchain_community.chains import RetrievalQA  # modern supported QA chain
#from langchain.prompts import ChatPromptTemplate


retriever = vectorstore_faiss.as_retriever()



In [19]:
# RetrievalQA is imported from langchain.chains; langchain_core has no chains module.
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

prompt_template = """

H: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

A:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 4}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
query = "Explain Retrieval Augment Generation to a 6th grader"
answer = qa.invoke({"query": query})
print(answer['result'])

Review the documents which became context for the LLM

In [None]:
answer['source_documents']

Let's ask a question that cannot be answered on the basis of the provided content.

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """

Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 4}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
query = "Discuss challenges of DevSecOps"
answer = qa.invoke({"query": query})
print(answer['result'])


You can also query the vector database directly and get a similarity score for each result. The score is a distance, so the lower the score, the better the match. Read more at https://python.langchain.com/docs/integrations/vectorstores/faiss
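
To see why a distance score is not confined to the 0-1 range, the sketch below compares squared Euclidean (L2) distance on raw and on length-normalized vectors. It assumes the index compares vectors with squared L2 distance, which is an assumption about the index configuration rather than something this notebook verifies; the helper names are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Same direction, very different magnitude.
a, b = [3.0, 4.0], [30.0, 40.0]
print(squared_l2(a, b))                        # 2025.0
print(squared_l2(normalize(a), normalize(b)))  # 0.0
```

On raw vectors the distance grows with vector magnitude, so scores in the hundreds are possible. On unit-length vectors the squared distance equals 2 - 2·cos(θ) and stays between 0 and 4.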

In [None]:
relevant_documents = vectorstore_faiss.similarity_search_with_score("Explain benefits of RAG")
relevant_documents


In [None]:
relevant_documents = vectorstore_faiss.similarity_search_with_score("Discuss challenges of DevSecOps")
relevant_documents

In [None]:

query = "what is RAG"

# The relevance-score search is a method of the vector store,
# not of the list of chunks (docs).
docs_and_scores = vectorstore_faiss.similarity_search_with_relevance_scores(query)

# Iterate through the results to access documents and scores
for doc, score in docs_and_scores:
    print(f"Document content: {doc.page_content}")
    print(f"Relevance score: {score}")
    print("-" * 20)

In [None]:
retriever=vectorstore_faiss.as_retriever(search_type="similarity", search_kwargs={"k": 4})

In [None]:
retriever

In [23]:
db = FAISS.from_documents(
    docs,
    bedrock_embeddings,
    distance_strategy=DistanceStrategy.COSINE # This line sets the configuration
)

print("FAISS vector database created with COSINE distance configuration.")

# Example search with scores. similarity_search_with_score returns the raw
# distance reported by the index (lower is better), not a 0-1 relevance score.
query = "What is RAG"
results_with_scores = db.similarity_search_with_score(query, k=2)

print(f"\nSearch results for query: '{query}'\n")

for document, score in results_with_scores:
    print("-" * 40)
    print(f"Distance score (lower is better): {score:.4f}")
    print(f"Document snippet: {document.page_content}")

FAISS vector database created with COSINE distance configuration.

Search results for query: 'What is RAG'

----------------------------------------
Distance score (lower is better): 258.9944
Document snippet: Broader Impact
This work offers several positive societal beneﬁts over previous work: the fact that it is more
strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less
with generations that are more factual, and offers more control and interpretability. RAG could be
employed in a wide variety of scenarios with direct beneﬁt to society, for example by endowing it
with a medical index and asking it open-domain questions on that topic, or by helping people be more
effective at their jobs.
With these advantages also come potential downsides: Wikipedia, or any potential external knowledge
source, will probably never be entirely factual and completely devoid of bias. Since RAG can be
employed as a language model, similar concerns as for GP

In [20]:
import boto3
from langchain_aws import BedrockEmbeddings, ChatBedrockConverse

bedrock_client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1" # Use your region
)

# Initialize the Bedrock LLM with ChatBedrockConverse, which uses the Bedrock Converse API.
llm = ChatBedrockConverse(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
    client=bedrock_client,
    max_tokens=512,   # passed directly; replaces 'max_tokens_to_sample' from the older text-completion API
    temperature=0.1,
)


In [26]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = db.as_retriever(search_kwargs={"k": 3})


# --- 3. Define the RAG Prompt Template (Optimized for Chat Models) ---

# Chat models prefer system prompts and specific message roles.
# Use MessagesPlaceholder for more complex chat history handling if needed,
# but for basic RAG, we stick to system/human roles.
rag_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful AI assistant.\n\nContext: {context}",
        ),
        ("human", "{question}"),
    ]
)


# --- 4. Build and Run the RAG Chain ---
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

question = "Who is elon musk?"
print(f"Asking question: {question}\n")

# Invoke the chain
response = rag_chain.invoke(question)

print("Response from Bedrock LLM:")
print(response)

Asking question: Who is elon musk?

Response from Bedrock LLM:
I don't see any information about Elon Musk in the provided context documents. The documents appear to be from a research paper about RAG (Retrieval-Augmented Generation) models and contain references to academic work, but they don't contain biographical information about Elon Musk.

To answer your question about who Elon Musk is, I would need different source material that actually contains information about him. The current context focuses on machine learning research, specifically around question-answering systems and language models.
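
In the chain above, the retriever's output (a list of documents) is placed into {context} as is. A common refinement is to join the page_content of each document into one string first. The sketch below shows such a helper; `Doc` is a hypothetical stand-in for a LangChain document so the example runs on its own, and `format_docs` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain document (only page_content is used)."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


print(format_docs([Doc("First chunk."), Doc("Second chunk.")]))
```

Assuming LCEL's usual coercion of plain functions into runnables, the helper could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`.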
