# Lab 3: RAG with Amazon SageMaker AI endpoint and Amazon OpenSearch



## Overview
This notebook demonstrates how to implement a Retrieval Augmented Generation (RAG) solution using:
- Amazon SageMaker for hosting embedding and LLM models
- Amazon OpenSearch for vector search
- LangChain for orchestrating the RAG pipeline

This notebook builds a question answering solution with Large Language Models (LLMs) and Amazon OpenSearch Service. An application using the Retrieval Augmented Generation (RAG) approach retrieves the information most relevant to the user's request from the enterprise knowledge base or content, bundles it as context along with the user's request into a prompt, and then sends the prompt to the LLM to get a generated response.

LLMs limit the number of tokens in the input prompt, so choosing the right passages from among thousands or millions of enterprise documents has a direct impact on the LLM's accuracy.
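
To make the prompt-size constraint concrete, here is a minimal sketch of packing ranked passages into a fixed token budget. The helper name `select_passages` and the whitespace word count used as a token estimate are assumptions for illustration only; a real pipeline would count tokens with the LLM's own tokenizer.

```python
def select_passages(ranked_passages, max_tokens, count_tokens=None):
    """Keep the highest-ranked passages that fit within a token budget."""
    if count_tokens is None:
        # Assumption: whitespace word count as a crude token estimate.
        count_tokens = lambda text: len(text.split())
    selected, used = [], 0
    for passage in ranked_passages:
        cost = count_tokens(passage)
        if used + cost > max_tokens:
            break  # stop at the first passage that would overflow the budget
        selected.append(passage)
        used += cost
    return selected, used


# Passages are assumed to be sorted from most to least relevant.
passages = [
    "Sleep disorders are often self-reported.",
    "Reporting styles differ across demographic groups.",
    "Unrelated text about something else entirely that is long.",
]
chosen, used = select_passages(passages, max_tokens=12)
print(len(chosen), used)
```

Stopping at the first overflow keeps the ranking order intact; skipping the oversized passage and trying the next one is an equally reasonable policy.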

<H2>Part 1: Build conversational search with OpenSearch Service</H2>

The vector dataset used in this part of the lab is comprised of a predefined content resource from the [PubMedQA](https://pubmedqa.github.io/) dataset.

You will use OpenSearch ingest pipeline with embedding processor to generate text embeddings for the dataset. Using the neural plugin in OpenSearch will allow you to generate the embeddings of the search query as well.
You will then use the large language model (LLM) hosted on Amazon SageMaker endpoints with the RAG processor in the search pipeline to generate text. The RAG processor will combine the retrieved search results from OpenSearch with the generated answer from the LLM to send back to the end user.

Follow steps 1 through 3 to complete part 1 of the lab.

### The key steps in part 1 of this lab are as follows:

1. Get prerequisites installed and libraries imported.
1. Deploy the embedding model to a SageMaker endpoint, create a k-NN-enabled index, and ingest the dataset into the index.
1. Build the end-to-end pipeline with LangChain.

# 1. Lab Pre-requisites
This notebook is designed to be run as part of the larger workshop [placeholder for workshop].
Before proceeding with this notebook, you should complete all of the steps.

## Prerequisites
- Required Python libraries: opensearch-py, langchain, langchain-community, transformers, sagemaker, boto3, requests_aws4auth
- Access to Amazon SageMaker and OpenSearch
- Appropriate IAM roles and permissions

## 1.1. Import libraries & initialize resources
The code blocks below will install and import all the relevant libraries and modules used in this notebook.

In [None]:
%pip install opensearch-py -q
%pip install opensearch_py_ml -q
%pip install deprecated -q
%pip install requests_aws4auth -q
%pip install langchain langchain-community transformers boto3 -q
print("Installs completed.")

Import the required libraries:

In [None]:
# Import Python libraries
import boto3
import json
from opensearchpy import OpenSearch, RequestsHttpConnection
import os
from os import path
import urllib.request
import tarfile
from requests_aws4auth import AWS4Auth
from ruamel.yaml import YAML
from PIL import Image
import base64
import re
import time as t
import pandas as pd
from IPython.display import display, HTML
import sys
import requests
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel
from typing import Dict, List

In [None]:
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models, and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
    
sm_runtime_client = boto3.client("sagemaker-runtime")

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# 2. Deploy the embedding model to a SageMaker endpoint & build retrieval integration with OpenSearch

We have taken the PubMedQA dataset and prepared it to include the contexts in the `extracted_context.json` file.

The following cells will perform the steps to generate embeddings with the dataset and ingest into the OpenSearch vector database.

## 2.1 Establish a connection to the OpenSearch Service domain

### OpenSearch Configuration
- Establish connection to OpenSearch domain
- Create index with KNN vector search capabilities
- Define mapping for document embeddings

In [None]:
# Look up the account ID and Region for the current session
session = boto3.Session()
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = session.region_name

aos_host = "<TODO>" # replace with the OpenSearch Service domain endpoint (host name only, without https://); you can find it in the CloudFormation stack outputs

### 🚨 Authentication cell below 🚨 
The cell below establishes an authenticated connection to our OpenSearch Service domain. The connection will periodically expire.
If you see an `AuthorizationException` error later in this notebook, the connection has expired and you just need to re-run the cell to get a new security token.
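
If re-running the cell by hand becomes tedious, the reconnect can be wrapped in a small helper. The sketch below is an illustration only: `call_with_reconnect`, `FakeClient`, and `ExpiredToken` are hypothetical names, and the fake client stands in for the real OpenSearch client so the example runs without AWS access.

```python
def call_with_reconnect(make_client, operation, retry_on=(Exception,)):
    """Run operation(client); on an auth-style error, rebuild the client and retry once."""
    client = make_client()
    try:
        return operation(client)
    except retry_on:
        client = make_client()  # a fresh client picks up fresh credentials
        return operation(client)


# Hypothetical stand-ins used only to exercise the helper.
class ExpiredToken(Exception):
    pass


class FakeClient:
    def __init__(self, expired):
        self.expired = expired

    def ping(self):
        if self.expired:
            raise ExpiredToken("token expired")
        return True


clients = iter([FakeClient(expired=True), FakeClient(expired=False)])
result = call_with_reconnect(
    lambda: next(clients),
    lambda client: client.ping(),
    retry_on=(ExpiredToken,),
)
print(result)
```

With the real client, `make_client` would rebuild `AWS4Auth` and `OpenSearch` as in the authentication cell, and `retry_on` would be `opensearchpy.exceptions.AuthorizationException`.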

In [None]:
# Connect to OpenSearch using the IAM Role of this notebook
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    'es',
    session_token=credentials.token
)

# Create OpenSearch client
aos_client = OpenSearch(
    hosts=[f'https://{aos_host}'],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=60
)
print("Connection details: ")
aos_client

## 2.2 Create the index with defined mappings.

It is important to define the `knn_vector` field explicitly: without the proper definition, dynamic mapping would type the embeddings as plain float fields.

A **k-NN (k-Nearest Neighbors)** enabled index is created in OpenSearch to store vector embeddings. The index schema defines a **knn_vector** field (`context_vector`) for storing the embeddings.

To learn more about Amazon OpenSearch Service, see the [product page](https://aws.amazon.com/opensearch-service/).
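
To build intuition for what a k-NN search with the `l2` space type ranks by, here is a brute-force sketch in plain Python. It is not how OpenSearch works internally (the `hnsw` method is an approximate graph search); the helper names and the 2-dimensional toy vectors are assumptions for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(query, vectors, k):
    """Return the k (doc_id, squared_distance) pairs closest to the query."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


# Toy 2-dimensional vectors; the lab index stores 384-dimensional ones.
vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 4.0]}
neighbors = exact_knn([0.9, 0.1], vectors, k=2)
print([doc_id for doc_id, _ in neighbors])
```

An exact scan like this compares the query against every stored vector, which is why large indexes rely on approximate methods such as HNSW instead.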

In [None]:
### Create the k-NN index
# Check if the index exists. Delete and recreate if it does. 
if aos_client.indices.exists(index='opensearch-rag-index'):
    print("The index exists. Deleting...")
    response = aos_client.indices.delete(index='opensearch-rag-index')
    
payload = { 
  "settings": {
    "index": {
      "knn": True
    }
  },
    "mappings": {
        "properties": {
            "context_vector": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "faiss",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            },
            "template": {
              "type": "keyword"
            }
          }
        }
}

print("Creating index...")
response = aos_client.indices.create(index='opensearch-rag-index',body=payload)
response

## 2.3 Create SageMaker Embedding Endpoint
A **Hugging Face text embedding model (all-MiniLM-L6-v2)** is deployed via SageMaker JumpStart to a SageMaker real-time endpoint. This model converts text into 384-dimensional vectors for semantic search.
### Embedding Model Deployment
- Deploy Hugging Face embedding model (all-MiniLM-L6-v2) on SageMaker
- Create embedding endpoint for text vectorization
- Configure content handlers for model input/output processing

In [None]:
# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = "huggingface-tei" if instance_type.startswith("ml.g") or instance_type.startswith("ml.p") else "huggingface-tei-cpu"
    return get_huggingface_llm_image_uri(key, version="1.4.0")

In [None]:
model_id, model_version = "huggingface-textembedding-all-MiniLM-L6-v2", "*"

In [None]:
model = JumpStartModel(model_id=model_id, model_version=model_version)

In [None]:
# sagemaker config
instance_type = "ml.g5.xlarge"
 
# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(
  role=role,
  image_uri=get_image_uri(instance_type),
  model_data=model.model_data['S3DataSource']['S3Uri'],
  env={'HF_MODEL_ID': "/opt/ml/model"}     # Path to the model in the container
)

Deploy the model onto a SageMaker endpoint

In [None]:
predictor = model.deploy()

In [None]:
embed_endpoint_name = predictor.endpoint_name
print(f"Successfully deployed embedding model to the SageMaker endpoint: {embed_endpoint_name}")

In [None]:
query_text = "Is adjustment for reporting heterogeneity necessary in sleep disorders?"
# invoke the embedding model
input_str = {"inputs": query_text}
output = sm_runtime_client.invoke_endpoint(
    EndpointName=embed_endpoint_name,
    Body=json.dumps(input_str),
    ContentType="application/json"
)
embeddings = output["Body"].read().decode("utf-8")
print(embeddings)

We can wrap the SageMaker embedding endpoint in LangChain's `SagemakerEndpointEmbeddings` class (imported from `langchain_community.embeddings`) so that it can be used with other LangChain components.

In [None]:
from langchain_community.embeddings import SagemakerEndpointEmbeddings
from langchain_community.embeddings.sagemaker_endpoint import EmbeddingsContentHandler

class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: list[str], model_kwargs: Dict) -> bytes:
        """
        Transforms the input into bytes that can be consumed by SageMaker endpoint.
        Args:
            inputs: List of input strings.
            model_kwargs: Additional keyword arguments to be passed to the endpoint.
        Returns:
            The transformed bytes input.
        """
        # The endpoint expects a JSON object with an "inputs" key:
        input_str = json.dumps({"inputs": inputs, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        """
        Transforms the bytes output from the endpoint into a list of embeddings.
        Args:
            output: The bytes output from SageMaker endpoint.
        Returns:
            The transformed output - list of embeddings
        Note:
            The length of the outer list is the number of input strings.
            The length of the inner lists is the embedding dimension.
        """
        # This handler assumes the response body is already a JSON list of
        # embeddings (one per input string) and returns it unchanged:
        response_json = json.loads(output.read().decode("utf-8"))
        # print(len(response_json))
        return response_json


content_handler = ContentHandler()


embeddings_function = SagemakerEndpointEmbeddings(
    endpoint_name=embed_endpoint_name,
    region_name=region,
    content_handler=content_handler,
)

query_result = embeddings_function.embed_query(query_text)
print("Output:\n", query_result, end="\n\n")

## 2.4 Load data into the new index

### Data Processing
- Load and process input data
- Generate embeddings for documents
- Index documents with their embeddings in OpenSearch

We will use the [bulk API](https://opensearch.org/docs/latest/api-reference/document-apis/bulk/) to load all of the documents into our newly created index. 
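
The bulk API takes a newline-delimited body in which each document line is preceded by an action line. As a minimal sketch of that layout (the helper name `build_bulk_body` and the 2-dimensional toy vectors are assumptions for illustration):

```python
import json


def build_bulk_body(docs, index_name):
    """Build a newline-delimited bulk body: one action line, then one document line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"contexts": "first passage", "context_vector": [0.1, 0.2]},
    {"contexts": "second passage", "context_vector": [0.3, 0.4]},
]
body = build_bulk_body(docs, "opensearch-rag-index")
print(body)
```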

In [None]:
def get_embedding(text, embed_endpoint_name, model_kwargs=None):
    """
    Call the SageMaker embedding model to embed the given text.
    Adjust the payload and response parsing according to your model's API.
    """
    embeddings = SagemakerEndpointEmbeddings(
        endpoint_name=embed_endpoint_name,
        region_name=region,
        content_handler=content_handler,
    )

    return embeddings.embed_query(text)

get_embedding(query_text, embed_endpoint_name)

- **Chunking**: Long documents are split into smaller passages (max 256 tokens) using LangChain's `RecursiveCharacterTextSplitter`.

- **Embedding Generation**: Each chunk is converted into a vector using the SageMaker embedding endpoint.

- **Bulk Ingestion**: The embeddings and text are indexed into OpenSearch for efficient retrieval.
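
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter breaks on separators first and then merges pieces); the helper name `chunk_tokens` and the whitespace tokens are assumptions for illustration.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.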

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer  # provides the tokenizer that matches the embedding model

# Initialize tokenizer matching your embedding model (e.g., "sentence-transformers/all-mpnet-base-v2")
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)

# Configure splitter with model-aware tokenization
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer,
    chunk_size=250,  # 256 - safety buffer
    chunk_overlap=10,
    separators=["\n\n", "\n"],  # FIRST try splitting at paragraphs, then lines
    keep_separator=True,  # Preserve paragraph/line breaks in chunks
    is_separator_regex=False
)

def validate_chunk(chunk: str) -> bool:
    """Ensure chunk doesn't exceed token limit with model's actual tokenization"""
    tokens = tokenizer.encode(chunk, add_special_tokens=True)
    return len(tokens) <= 256

In [None]:
input_filename = "extracted_context.json"
output_filename = "output_embedded.jsonl"  # Line-delimited JSON


# Load the input JSON file (mapping IDs to lists of context strings)
with open(input_filename, "r", encoding="utf-8") as infile:
    data = json.load(infile)

# Open the output file for writing line-delimited JSON objects
with open(output_filename, "w", encoding="utf-8") as outfile:
    for key, contexts in data.items():
        embeddings = []
        all_chunks = []
        for ctx_idx, context in enumerate(contexts):
                # First attempt: split at paragraphs/lines only
                chunks = text_splitter.split_text(context)

                # Second pass: check and fix any chunks that still exceed limits
                final_chunks = []
                for chunk in chunks:
                    if validate_chunk(chunk):
                        final_chunks.append(chunk)
                    else:
                        # Force split at sentences ONLY if absolutely necessary
                        emergency_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
                            tokenizer=tokenizer,
                            chunk_size=250,
                            chunk_overlap=50,
                            separators=[". "],  # Only split sentences when forced
                            keep_separator=True
                        )
                        final_chunks.extend(emergency_splitter.split_text(chunk))

                # Embed validated chunks
                for chunk_idx, chunk in enumerate(final_chunks):
                    if not validate_chunk(chunk):
                        continue  # Skip invalid chunks or handle differently

                    embedding = get_embedding(chunk, embed_endpoint_name)
                    output_obj = {
                        "id": f"{key}-{ctx_idx}-{chunk_idx}",
                        "contexts": chunk,
                        "context_vector": embedding
                    }
                    outfile.write(json.dumps(output_obj) + "\n")

print(f"Embeddings saved to {output_filename}")


In [None]:
# Read all JSON objects from the JSONL file
with open("output_embedded.jsonl", "r", encoding="utf-8") as infile:
    json_objects = [json.loads(line) for line in infile]

# Write the objects as a JSON array into a new .txt file
with open("merged_output.txt", "w", encoding="utf-8") as outfile:
    json.dump(json_objects, outfile, indent=4)

print("Merged JSON objects have been saved to merged_output.txt")

In [None]:
def transform_file(input_filename, output_filename):
    # Load the merged file, which is expected to be a JSON array
    with open(input_filename, 'r', encoding='utf-8') as infile:
        records = json.load(infile)
    
    with open(output_filename, 'w', encoding='utf-8') as outfile:
        # Process each record in the array
        for record in records:
            contexts = record.get("contexts", [])
            vectors = record.get("context_vector", [])
            # Keep only the chunk text and its embedding vector;
            # the "id" field is dropped so OpenSearch assigns document IDs.
            new_obj = {
                "contexts": contexts,
                "context_vector": vectors
            }
            # Write the JSON object as a single line
            outfile.write(json.dumps(new_obj) + "\n")

transform_file("merged_output.txt", "final_output_oneline.txt")
print("Transformation complete. Check final_output_oneline.txt")


In [None]:
## Index TEXT file into index: opensearch-rag-index
batch = 0
count = 0
batch_size = 5
body_ = ''
action = json.dumps({ 'index': { '_index': 'opensearch-rag-index' } })
errors = []
with open('final_output_oneline.txt', 'r') as file:
    for line in file:
        if count > 5000:
            break # Use this to run a limited number of items.
        body_ = body_ + action + "\n" + line + "\n"
        # print(f"body: {body_}")
        if count % batch_size == 0 and count != 0:
            batch+=1
            if count % (batch_size*30) == 0:
                print("Batch: " + str(batch) + ", count: " + str(count)+ ", errors: " + str(len(errors)))
            response = aos_client.bulk(
                index = 'opensearch-rag-index',
                body = body_
            )
            body_ = ''
            if response['errors'] == True:
                for item in response['items']:
                    if item['index']['status'] != 201:
                        errors.append(item['index']['error']) 
        # print(response)
        # break 
        count += 1
if body_ != "":
    # Flush the final partial batch and record only its own errors
    response = aos_client.bulk(
        index = 'opensearch-rag-index',
        body = body_
    )
    if response['errors'] == True:
        for item in response['items']:
            if item['index']['status'] != 201:
                errors.append(item['index']['error'])
print("Last batch: " + str(batch) + ", document count: " + str(count)+ ", errors: " + str(len(errors)))

## 2.5 Query OpenSearch Database

Once the vectors are ingested into the database, we can run queries to retrieve relevant contexts based on the input query.
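
A search response nests each match under `hits.hits`, with the relevance score in `_score` and the stored document in `_source`. The sketch below pulls out just the score and passage text. The helper name `extract_contexts` and the hand-written sample response are assumptions for illustration, not real query output.

```python
def extract_contexts(response, text_field="contexts"):
    """Return (score, text) pairs from an OpenSearch search response."""
    return [
        (hit["_score"], hit["_source"][text_field])
        for hit in response["hits"]["hits"]
    ]


# Hand-written sample with the same shape as a search response (illustrative values).
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.91, "_source": {"contexts": "passage A", "context_vector": [0.1]}},
            {"_score": 0.47, "_source": {"contexts": "passage B", "context_vector": [0.2]}},
        ]
    }
}
results = extract_contexts(sample_response)
for score, text in results:
    print(score, text)
```

Printing only the text keeps the output readable, since each `_source` otherwise includes the full 384-dimensional vector.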

In [None]:
# Your natural language query
query_vector = get_embedding(query_text, embed_endpoint_name)

# Now, use the embedding in a k-NN query
knn_query = {
    "size": 5,  # adjust how many results you want to retrieve
    "query": {
        "knn": {
            "context_vector": {
                "vector": query_vector,
                "k": 5
            }
        }
    }
}

response_knn = aos_client.search(
    index="opensearch-rag-index",
    body=knn_query
)

print("KNN Query Results:")
for hit in response_knn['hits']['hits']:
    print(hit['_source'])



# 3. Build end-to-end RAG pipeline with LLM models hosted on SageMaker AI and LangChain

We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.

To achieve that, we will do the following:

1. **Generate embeddings for each document in the knowledge library with the SageMaker-hosted embedding model.**
2. **Identify the top K most relevant documents based on the user query.**
    - 2.1 **For a query of interest, generate the embedding of the query using the same embedding model.**
    - 2.2 **Search for the top K most relevant documents in the embedding space using OpenSearch k-NN search.**
    - 2.3 **Use the search results to retrieve the corresponding documents.**
3. **Combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.**
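
Steps 1 and 2 above can be sketched without any AWS resources. In the sketch below, `toy_embed` is a word-count stand-in for the SageMaker embedding endpoint and the brute-force ranking is a stand-in for the vector search; both are assumptions chosen so the example is self-contained.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for the embedding endpoint: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, documents, k):
    """Embed the query with the same function as the documents and rank by similarity."""
    query_vec = toy_embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, toy_embed(doc)), reverse=True)
    return ranked[:k]


library = [
    "sleep disorders and reporting heterogeneity",
    "vector search with opensearch",
    "heterogeneity in sleep reporting across groups",
]
best = top_k("reporting heterogeneity in sleep disorders", library, k=2)
print(best)
```

The important property carried over to the real pipeline is that the query and the documents go through the same embedding function, so their vectors are comparable.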



Note: The retrieved document/text should be large enough to contain enough information to answer a question; but small enough to fit into the LLM prompt -- maximum sequence length of 1024 tokens. 

---
To build a simplified QA application with LangChain, we need to:
1. Wrap our SageMaker endpoints for the embedding model and the LLM in `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. (We already created the embedding wrapper in the previous section.)
2. Prepare the dataset to build the knowledge base.

---

Now we need to deploy a **Llama 3.1 8B LLM** onto a SageMaker real-time endpoint and prepare the SageMaker Endpoint class for LangChain integration.

In [None]:
model_id_llm, model_version = "meta-textgeneration-llama-3-1-8b-instruct", "*"
accept_eula = True

In [None]:
model = JumpStartModel(model_id=model_id_llm, model_version=model_version)

In [None]:
predictor = model.deploy(accept_eula=accept_eula)

Invoke the LLM endpoint for a quick test

In [None]:
llm_endpoint_name = predictor.endpoint_name
input_str = { "inputs": query_text, 
            "parameters": { 
                "max_new_tokens": 100, 
                "top_p": 0.9, 
                "temperature": 0.6 
            }
        }
output = sm_runtime_client.invoke_endpoint(
    EndpointName=llm_endpoint_name,
    Body=json.dumps(input_str),
    ContentType="application/json"
)
llm_response = output["Body"].read().decode("utf-8")  # raw JSON text returned by the LLM endpoint
print(llm_response)

In [None]:
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_community.llms import SagemakerEndpoint
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler


from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

Next, we wrap the SageMaker LLM endpoint in LangChain's `SagemakerEndpoint` class.

In [None]:
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint

parameters = {
    "max_new_tokens": 100,
    "temperature": 0.2,
    "top_p": 0.9
}


# Named differently from the embedding ContentHandler / content_handler above,
# so that get_embedding() keeps using the embedding handler.
class LlamaContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        # Serialize the prompt and generation parameters as the request body
        self.len_prompt = len(prompt)
        input_str = json.dumps({"inputs": prompt, "parameters": {**model_kwargs}})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Parse the response body and return only the generated text
        response_json = output.read()
        res = json.loads(response_json)
        
        ans = res['generated_text']
        # print(ans)
        return ans 


llm_content_handler = LlamaContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=region,
    model_kwargs=parameters,
    content_handler=llm_content_handler,
)

We combine the retrieved documents with prompt and question and send them into SageMaker LLM.
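
The "stuff" approach simply concatenates the retrieved passages and substitutes them into the prompt template. The sketch below shows the idea in plain Python; the helper name `stuff_prompt`, the shortened template, and the `"\n\n"` separator are assumptions for illustration rather than LangChain's exact implementation.

```python
def stuff_prompt(template, documents, question, separator="\n\n"):
    """Join the retrieved passages and fill them into the prompt template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


# Shortened template for illustration; the lab's own template is defined in the next cell.
template = "Use the context to answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
prompt = stuff_prompt(template, ["passage A", "passage B"], "What is RAG?")
print(prompt)
```

Because every retrieved passage lands in a single prompt, the number of passages (`k`) and the chunk size together determine whether the prompt stays within the model's input limit.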

We define a customized prompt as below.

In [None]:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.:\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

opensearch_url = f"https://{aos_host}"


For this example, the OpenSearch cluster was created with a master user name and password. In a real application, we suggest storing the user name and password in a service built for holding secrets, for example AWS Secrets Manager, as [shown here](https://github.com/aws-samples/rag-with-amazon-opensearch-and-sagemaker/blob/main/app/opensearch_retriever_llama2.py#L89).
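
As a minimal sketch of the Secrets Manager approach: the helper below reads a JSON secret and returns a `(username, password)` tuple suitable for `http_auth`. The secret name, the `username`/`password` key names, and the `FakeSecretsClient` stand-in are assumptions for illustration; the fake client only mimics the shape of a `get_secret_value` response so the example runs without AWS access.

```python
import json


def load_opensearch_credentials(secrets_client, secret_id):
    """Read a JSON secret and return a (username, password) tuple."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    # Assumption: the secret stores its values under "username" and "password".
    return secret["username"], secret["password"]


class FakeSecretsClient:
    """Hypothetical stand-in that mimics the Secrets Manager response shape."""

    def get_secret_value(self, SecretId):
        return {"SecretString": json.dumps({"username": "master", "password": "example-only"})}


http_auth_example = load_opensearch_credentials(
    FakeSecretsClient(), "hypothetical/opensearch/credentials"
)
print(http_auth_example[0])
```

In the notebook, the fake client would be replaced with `boto3.client("secretsmanager")`, and the notebook role would need permission to read the secret.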

>Note: If you are in an AWS hosted workshop, the prebuilt OpenSearch cluster password will be in your lab instructions.

In [None]:
http_auth = ("master", "<<YOUR_OS_PASSWORD>>") 

In [None]:
opensearch_vector_search = OpenSearchVectorSearch(
    opensearch_url=opensearch_url,
    index_name='opensearch-rag-index',
    embedding_function=embeddings_function,
    http_auth=http_auth
)

In [None]:
retriever = opensearch_vector_search.as_retriever(
    search_kwargs={"k": 3, "vector_field": "context_vector", "text_field": "contexts"})


In [None]:
chain_type_kwargs = {"prompt": PROMPT, "verbose": True}
qa = RetrievalQA.from_chain_type(
    sm_llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    # return_source_documents=True, ## you can uncomment this line to see the detailed retrieved data source
    # verbose=True, #DEBUG
)


In [None]:
qa.invoke({"query": query_text})

### Key Workflow Summary
- Data Preparation: Text is split into chunks and embedded.
- OpenSearch Setup: A vector index is created and populated.
- Model Deployment: Embedding and LLM models are hosted on SageMaker.
- RAG Pipeline: Queries retrieve relevant context, and the LLM generates answers.

This notebook provides an end-to-end example of building a production-ready RAG system using AWS services. The same approach can be adapted for other domains by replacing the dataset and fine-tuning the models.

# Congratulations on finishing Lab 3. Please continue to the next lab.