# Lab 3: RAG with an Amazon SageMaker AI endpoint and Amazon OpenSearch, evaluated with Ragas and Langfuse



## Overview
This notebook demonstrates how to implement a Retrieval Augmented Generation (RAG) solution using:
- Amazon SageMaker for hosting the embedding model and the LLM
- Amazon OpenSearch for vector search
- LangChain for orchestrating the RAG pipeline

We'll also explore ways to evaluate the quality of RAG pipelines with open-source tools such as [RAGAS](https://docs.ragas.io/en/v0.1.21/index.html), and use [Langfuse](https://langfuse.com/) to manage and trace the pipeline with traces and spans. We will create an OpenSearch vector database and generate RAG results to demonstrate offline evaluation and scoring.

This notebook builds a question answering solution with Large Language Models (LLMs) and Amazon OpenSearch Service. An application using the RAG (Retrieval Augmented Generation) approach retrieves the information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request into a prompt, and then sends the prompt to the LLM to get a generated response.
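
The retrieve, bundle, and prompt flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline built later in this notebook: `retrieve` and `build_prompt` are hypothetical helper names, and the keyword-overlap ranking is only a stand-in for the vector search that OpenSearch performs.

```python
def retrieve(question: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by how many question words they share."""
    question_words = set(question.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(question_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Bundle the retrieved passages and the user's question into one prompt."""
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


knowledge_base = [
    "Sleep problems were reported by 53 % of participants.",
    "Anchoring vignettes describe a hypothetical character.",
    "OpenSearch stores vectors in a knn_vector field.",
]
question = "How many participants reported sleep problems?"
prompt = build_prompt(question, retrieve(question, knowledge_base, k=1))
print(prompt)  # this string is what would be sent to the LLM
```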

LLMs limit the maximum length of the input prompt, so choosing the right passages among thousands or millions of enterprise documents has a direct impact on the LLM’s accuracy.

<H2>Part 1: Build conversational search with OpenSearch Service</H2>

The vector dataset used in this part of the lab is comprised of a predefined content resource from the [PubMedQA](https://pubmedqa.github.io/) dataset.

You will use an OpenSearch ingest pipeline with an embedding processor to generate text embeddings for the dataset. The neural plugin in OpenSearch also lets you generate embeddings for the search query.
You will then use the large language model (LLM) hosted on an Amazon SageMaker endpoint with the RAG processor in the search pipeline to generate text. The RAG processor combines the search results retrieved from OpenSearch with the answer generated by the LLM and returns them to the end user.

Follow step 1 to step 5 to complete part 1 of the lab.

### The key steps in part 1 of this lab are as follows:

1. Get pre-requisites installed and libraries imported.
1. Deploy the embedding model to a SageMaker endpoint, create a k-NN-enabled index, and ingest the dataset into the index.
1. Build the end-to-end pipeline with LangChain.

# 1. Lab Pre-requisites
This notebook is designed to be run as part of the larger workshop [placeholder for workshop].
Before proceeding with this notebook, you should complete all of the steps.

## Prerequisites
- Required Python libraries: opensearch-py, langchain, boto3, requests_aws4auth
- Access to Amazon SageMaker and OpenSearch
- Appropriate IAM roles and permissions

## 1.1. Import libraries & initialize resources
The code blocks below will install and import all the relevant libraries and modules used in this notebook.

In [1]:
!pip install opensearch-py -q
!pip install opensearch_py_ml -q
!pip install deprecated -q
!pip install requests_aws4auth -q
!pip install langchain boto3 -q
!pip install transformers -q
print("Installs completed.")

Installs completed.


In [None]:
%pip install langfuse datasets ragas python-dotenv sagemaker langchain-aws opensearch-py requests_aws4auth boto3 --upgrade

In [None]:
!pip uninstall packaging -y -q
!pip install packaging==24.1

Import the required libraries:

In [None]:
# Import Python libraries
from typing import Any, Dict, List, Optional
import boto3
import json
from opensearchpy import OpenSearch, RequestsHttpConnection
import os
from os import path
import urllib.request
import tarfile
from requests_aws4auth import AWS4Auth
from ruamel.yaml import YAML
from PIL import Image
import base64
import re
import time as t
import pandas as pd
from IPython.display import display, HTML
import sys
import requests
from botocore.response import StreamingBody
from transformers import AutoTokenizer
from datasets import load_dataset
from random import sample
from datasets import Dataset

# Langchain
from langchain_aws.chat_models.sagemaker_endpoint import ChatSagemakerEndpoint, ChatModelContentHandler
from langchain_core.messages import HumanMessage, AIMessageChunk, SystemMessage
from langchain_aws.embeddings import BedrockEmbeddings
from langchain_community.embeddings import SagemakerEndpointEmbeddings
from langchain_community.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_community.llms import SagemakerEndpoint
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Tokenizer

# Langfuse
import langfuse  # assuming you're using the SDK
from langfuse import Langfuse
from langfuse.api.resources.commons.types.trace_with_details import TraceWithDetails
from langfuse.decorators import observe, langfuse_context

# Sagemaker
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel

# RAGAS
import ragas
from ragas.run_config import RunConfig
from ragas.metrics.base import MetricWithLLM, MetricWithEmbeddings
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from ragas.metrics import answer_relevancy, faithfulness, context_precision
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.dataset_schema import SingleTurnSample

In [5]:
sess = sagemaker.Session()
# SageMaker session bucket: used for uploading data, models, and logs
# SageMaker creates this bucket automatically if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
    
sm_runtime_client = boto3.client("sagemaker-runtime")

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::626836212174:role/service-role/AmazonSageMaker-ExecutionRole-20220824T222021
sagemaker bucket: sagemaker-us-east-1-626836212174
sagemaker session region: us-east-1


# 2. Deploy the embedding model to a SageMaker endpoint & build retrieval integration with OpenSearch

We have taken the PubMedQA dataset and prepared it to include the contexts in the `extracted_context.json` file.

The following cells generate embeddings for the dataset and ingest them into the OpenSearch vector database.

## 2.1 Establish a connection to the OpenSearch Service domain

### OpenSearch Configuration
- Establish connection to OpenSearch domain
- Create index with KNN vector search capabilities
- Define mapping for document embeddings

In [6]:
# Look up the account ID and region, then set the Amazon OpenSearch Service domain endpoint
session = boto3.Session()
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = session.region_name

aos_host = "search-opensearchservi-gtmmdbjee3lt-43vqrfek2ekx4ah6sdk3iw3eli.us-east-1.es.amazonaws.com" # replace with your OpenSearch domain endpoint, which is listed in the CloudFormation stack outputs

### 🚨 Authentication cell below 🚨 
The cell below establishes an authenticated connection to our OpenSearch Service domain. The connection will periodically expire.
If you see an `AuthorizationException` error later in this notebook, the connection has expired and you just need to re-run the cell to get a new security token.
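
If you would rather not re-run the cell by hand, the reconnect-and-retry idea can be sketched as a small wrapper. This is an illustrative sketch only: `call_with_reauth` is a hypothetical helper, and the fake error class and fake search function exist purely so the example runs without AWS access. In the notebook you would presumably pass `opensearchpy.exceptions.AuthorizationException` as `auth_error` and a function that rebuilds `awsauth` and `aos_client` as `reconnect`.

```python
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_reauth(
    operation: Callable[[], T],
    reconnect: Callable[[], None],
    auth_error: Type[Exception],
) -> T:
    """Run `operation`; on an authorization error, reconnect once and retry."""
    try:
        return operation()
    except auth_error:
        reconnect()  # e.g. rebuild the signed credentials and the client
        return operation()


# Stand-in demo with a fake error type so the sketch runs without AWS access.
class FakeAuthError(Exception):
    pass


state = {"token_valid": False, "reconnects": 0}


def fake_search() -> str:
    if not state["token_valid"]:
        raise FakeAuthError("expired")
    return "ok"


def fake_reconnect() -> None:
    state["token_valid"] = True
    state["reconnects"] += 1


result = call_with_reauth(fake_search, fake_reconnect, FakeAuthError)
print(result, state["reconnects"])
```

The wrapper retries only once, so a genuine permissions problem still surfaces as an error instead of looping.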

In [42]:
# Connect to OpenSearch using the IAM Role of this notebook
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    'es',
    session_token=credentials.token
)

# Create OpenSearch client
aos_client = OpenSearch(
    hosts=[f'https://{aos_host}'],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=60
)
print("Connection details: ")
aos_client

Connection details: 


<OpenSearch([{'host': 'search-opensearchservi-gtmmdbjee3lt-43vqrfek2ekx4ah6sdk3iw3eli.us-east-1.es.amazonaws.com', 'port': 443, 'use_ssl': True}])>

## 2.2 Create the index with defined mappings.

It is important to define the `knn_vector` field explicitly. Without a proper definition, dynamic mapping would type it as a simple float field.

A **k-NN (k-Nearest Neighbors)** enabled index is created in OpenSearch to store vector embeddings. The index schema defines a **knn_vector** field (`context_vector`) for storing embeddings.

To learn more about OpenSearch Service, refer to the [product page](https://aws.amazon.com/opensearch-service/).
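
The mapping in the next cell sets `space_type` to `l2`. As a rough sketch of what that means for ranking, and assuming the scoring rule described in the OpenSearch k-NN documentation (score = 1 / (1 + squared Euclidean distance)), the relevance score of a hit could be reproduced as follows. `l2_score` is a hypothetical helper written for illustration, not part of any library.

```python
def l2_score(query: list[float], doc: list[float]) -> float:
    """Score under the assumed `l2` rule: 1 / (1 + squared Euclidean distance)."""
    if len(query) != len(doc):
        raise ValueError("vectors must have the same dimension")
    squared_distance = sum((q - d) ** 2 for q, d in zip(query, doc))
    return 1.0 / (1.0 + squared_distance)


print(l2_score([1.0, 0.0], [1.0, 0.0]))  # identical vectors give the highest score
print(l2_score([1.0, 0.0], [0.0, 1.0]))  # a larger distance gives a lower score
```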

In [None]:
### Create the k-NN index
# Check if the index exists. Delete and recreate if it does. 
if aos_client.indices.exists(index='opensearch-rag-index'):
    print("The index exists. Deleting...")
    response = aos_client.indices.delete(index='opensearch-rag-index')
    
payload = {
    "settings": {
        "index": {
            "knn": True
        }
    },
    "mappings": {
        "properties": {
            "context_vector": {
                "type": "knn_vector",
                "dimension": 384,  # must match the embedding model's output size
                "method": {
                    "engine": "faiss",
                    "space_type": "l2",
                    "name": "hnsw",
                    "parameters": {}
                }
            },
            "template": {
                "type": "keyword"
            }
        }
    }
}

print("Creating index...")
response = aos_client.indices.create(index='opensearch-rag-index',body=payload)
response

## 2.3 Create SageMaker Embedding Endpoint
A **Hugging Face text embedding model (all-MiniLM-L6-v2)** is deployed via SageMaker JumpStart to a SageMaker real-time endpoint. This model converts text into 384-dimensional vectors for semantic search.
### Embedding Model Deployment
- Deploy Hugging Face embedding model (all-MiniLM-L6-v2) on SageMaker
- Create embedding endpoint for text vectorization
- Configure content handlers for model input/output processing
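
To make the input/output processing concrete, here is a minimal sketch of the two transformations a content handler performs. It assumes the request body is `{"inputs": [...]}` and that the endpoint answers with a JSON array of vectors, as the outputs in this notebook suggest. `encode_request` and `decode_response` are hypothetical names, and the fake response stands in for a real endpoint call.

```python
import json

EMBEDDING_DIM = 384  # dimension stated for all-MiniLM-L6-v2 in this lab


def encode_request(texts: list[str]) -> bytes:
    """Serialize the input texts into the JSON body sent to the endpoint."""
    return json.dumps({"inputs": texts}).encode("utf-8")


def decode_response(body: bytes, expected_dim: int = EMBEDDING_DIM) -> list[list[float]]:
    """Parse the endpoint response and check the dimension of every vector."""
    vectors = json.loads(body.decode("utf-8"))
    for vector in vectors:
        if len(vector) != expected_dim:
            raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return vectors


# Fake response standing in for the endpoint: one 384-dimensional vector.
fake_body = json.dumps([[0.0] * EMBEDDING_DIM]).encode("utf-8")
vectors = decode_response(fake_body)
print(len(vectors), len(vectors[0]))
```

Checking the dimension early is useful because the index mapping fixes the vector size, so a mismatched model would otherwise only fail at ingestion time.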

In [9]:
# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = "huggingface-tei" if instance_type.startswith("ml.g") or instance_type.startswith("ml.p") else "huggingface-tei-cpu"
    return get_huggingface_llm_image_uri(key, version="1.4.0")

In [10]:
model_id, model_version = "huggingface-textembedding-all-MiniLM-L6-v2", "*"

In [11]:
model = JumpStartModel(model_id=model_id, model_version=model_version)

Using model 'huggingface-textembedding-all-MiniLM-L6-v2' with wildcard version identifier '*'. You can pin to version '2.0.7' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


No instance type selected for inference hosting endpoint. Defaulting to ml.g5.xlarge.


In [12]:
# sagemaker config
instance_type = "ml.g5.xlarge"
 
# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(
  role=role,
  image_uri=get_image_uri(instance_type),
  model_data=model.model_data['S3DataSource']['S3Uri'],
  env={'HF_MODEL_ID': "/opt/ml/model"}     # Path to the model in the container
)

Deploy the model onto a SageMaker endpoint

In [13]:
predictor = model.deploy()

-------------!

In [14]:
embed_endpoint_name = predictor.endpoint_name
print(f"Successfully deployed embedding model to the SageMaker endpoint: {embed_endpoint_name}")

Successfully deployed embedding model to the SageMaker endpoint: hf-textembedding-all-minilm-l6-v2-2025-04-18-07-26-15-099


In [15]:
query_text = "Is adjustment for reporting heterogeneity necessary in sleep disorders?"
# invoke the embedding model
input_str = {"inputs": query_text}
output = sm_runtime_client.invoke_endpoint(
    EndpointName=embed_endpoint_name,
    Body=json.dumps(input_str),
    ContentType="application/json"
)
embeddings = output["Body"].read().decode("utf-8")
print(embeddings)

[[0.096046135,0.05919965,0.0036651383,0.08261194,0.04320125,0.06315744,-0.07904435,-0.0029822798,-0.007873776,-0.033055924,-0.0019092165,0.017405923,-0.025697775,0.041194484,0.03704159,-0.014242477,0.035062693,0.075476766,0.0358431,0.033390384,0.052482553,0.008863225,0.12598042,0.019705344,-0.0038323689,-0.02158669,0.0053444128,-0.05546483,0.007539315,-0.022757303,-0.041222353,0.095767416,0.004786977,-0.032721464,0.004104119,-0.06678077,0.0029822798,0.013991631,-0.02232529,0.034365896,0.09654783,-0.03896474,-0.0053862203,-0.053346574,-0.11828781,-0.022617945,-0.016709128,-0.023231125,-0.09838736,0.09409511,0.035118435,-0.029794928,0.010577339,0.060091544,0.011329876,-0.013796528,-0.014897464,0.03503482,0.059422623,0.11550064,-0.03517418,0.08751737,-0.0016514027,-0.02622734,0.09515424,0.13712913,-0.043479968,-0.07893287,0.005225958,0.022590073,-0.055855036,0.00005574355,0.016068079,-0.0049925316,0.015998399,0.0132948365,-0.0063617327,0.009469436,0.08879948,-0.01758709,0.010633082,0.0918

We can wrap the SageMaker embedding endpoint in the `SagemakerEndpointEmbeddings` class from `langchain_community.embeddings` so that it can be used with other LangChain functions.

In [16]:
class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: list[str], model_kwargs: Dict) -> bytes:
        """
        Transforms the input into bytes that can be consumed by SageMaker endpoint.
        Args:
            inputs: List of input strings.
            model_kwargs: Additional keyword arguments to be passed to the endpoint.
        Returns:
            The transformed bytes input.
        """
        # Example: inference.py expects a JSON string with a "inputs" key:
        input_str = json.dumps({"inputs": inputs, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        """
        Transforms the bytes output from the endpoint into a list of embeddings.
        Args:
            output: The bytes output from SageMaker endpoint.
        Returns:
            The transformed output - list of embeddings
        Note:
            The length of the outer list is the number of input strings.
            The length of the inner lists is the embedding dimension.
        """
        # The endpoint returns a JSON array with one embedding per input string,
        # so the parsed response can be returned as-is.
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json


content_handler = ContentHandler()


embeddings_function = SagemakerEndpointEmbeddings(
    endpoint_name=embed_endpoint_name,
    region_name=region,
    content_handler=content_handler,
)

query_result = embeddings_function.embed_query(query_text)
print("Output:\n", query_result, end="\n\n")

Output:
 [0.096046135, 0.05919965, 0.0036651383, 0.08261194, 0.04320125, 0.06315744, -0.07904435, -0.0029822798, -0.007873776, -0.033055924, -0.0019092165, 0.017405923, -0.025697775, 0.041194484, 0.03704159, -0.014242477, 0.035062693, 0.075476766, 0.0358431, 0.033390384, 0.052482553, 0.008863225, 0.12598042, 0.019705344, -0.0038323689, -0.02158669, 0.0053444128, -0.05546483, 0.007539315, -0.022757303, -0.041222353, 0.095767416, 0.004786977, -0.032721464, 0.004104119, -0.06678077, 0.0029822798, 0.013991631, -0.02232529, 0.034365896, 0.09654783, -0.03896474, -0.0053862203, -0.053346574, -0.11828781, -0.022617945, -0.016709128, -0.023231125, -0.09838736, 0.09409511, 0.035118435, -0.029794928, 0.010577339, 0.060091544, 0.011329876, -0.013796528, -0.014897464, 0.03503482, 0.059422623, 0.11550064, -0.03517418, 0.08751737, -0.0016514027, -0.02622734, 0.09515424, 0.13712913, -0.043479968, -0.07893287, 0.005225958, 0.022590073, -0.055855036, 5.574355e-05, 0.016068079, -0.0049925316, 0.015998399

## 2.4 Load data into the new index

### Data Processing
- Load and process input data
- Generate embeddings for documents
- Index documents with their embeddings in OpenSearch

We will use the [bulk API](https://opensearch.org/docs/latest/api-reference/document-apis/bulk/) to load all of the document chunks into our newly created index.
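
As a sketch of what a bulk request body looks like, the helpers below group documents into batches and build the newline-delimited payload: one action line followed by one document line per document. `batched` and `build_bulk_body` are hypothetical helper names, and the sample documents are placeholders rather than real dataset entries.

```python
import json
from typing import Iterable, Iterator


def batched(items: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` documents."""
    batch: list[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_bulk_body(index_name: str, docs: list[dict]) -> str:
    """Build a bulk body: an action line, then a document line, per document."""
    action = json.dumps({"index": {"_index": index_name}})
    lines = []
    for doc in docs:
        lines.append(action)
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


docs = [{"contexts": f"passage {i}", "context_vector": [0.0, 1.0]} for i in range(5)]
bodies = [build_bulk_body("opensearch-rag-index", b) for b in batched(docs, 2)]
print(len(bodies))  # number of bulk requests that would be sent
```

Each string in `bodies` could then be passed as the `body` argument of a bulk call on the OpenSearch client.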

In [17]:
def get_embedding(text, embed_endpoint_name, model_kwargs=None):
    """
    Call the SageMaker embedding model to embed the given text.
    Adjust the payload and response parsing according to your model's API.
    """
    embeddings = SagemakerEndpointEmbeddings(
        endpoint_name=embed_endpoint_name,
        region_name=region,
        content_handler=content_handler,
    )

    return embeddings.embed_query(text)

get_embedding(query_text, embed_endpoint_name)

[0.096046135,
 0.05919965,
 0.0036651383,
 ...]

- **Chunking**: Long documents are split into smaller passages (max 256 tokens) using LangChain's `RecursiveCharacterTextSplitter`.

- **Embedding Generation**: Each chunk is converted into a vector using the SageMaker embedding endpoint.

- **Bulk Ingestion**: The embeddings and text are indexed into OpenSearch for efficient retrieval.
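
The chunking idea can be illustrated with a simplified sliding window over tokens. This is a sketch, not how `RecursiveCharacterTextSplitter` works internally: the splitter used below splits on separators first, whereas the hypothetical `chunk_tokens` helper here just cuts fixed-size windows that overlap, using whitespace-separated words as stand-in tokens.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = "a b c d e f g h i j".split()
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means the last token of one chunk reappears at the start of the next, which helps keep a sentence that straddles a boundary retrievable from either chunk.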

In [18]:
# Initialize tokenizer matching your embedding model (e.g., "sentence-transformers/all-mpnet-base-v2")
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)

# Configure splitter with model-aware tokenization
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer,
    chunk_size=250,  # 256 - safety buffer
    chunk_overlap=10,
    separators=["\n\n", "\n"],  # FIRST try splitting at paragraphs, then lines
    keep_separator=True,  # Preserve paragraph/line breaks in chunks
    is_separator_regex=False
)

def validate_chunk(chunk: str) -> bool:
    """Ensure chunk doesn't exceed token limit with model's actual tokenization"""
    tokens = tokenizer.encode(chunk, add_special_tokens=True)
    return len(tokens) <= 256

In [None]:
input_filename = "extracted_context.json"
output_filename = "output_embedded.jsonl"  # Line-delimited JSON


# Load the input JSON file (mapping IDs to lists of context strings)
with open(input_filename, "r", encoding="utf-8") as infile:
    data = json.load(infile)

# Open the output file for writing line-delimited JSON objects
with open(output_filename, "w", encoding="utf-8") as outfile:
    for key, contexts in data.items():
        for ctx_idx, context in enumerate(contexts):
            # First attempt: split at paragraphs/lines only
            chunks = text_splitter.split_text(context)

            # Second pass: check and fix any chunks that still exceed limits
            final_chunks = []
            for chunk in chunks:
                if validate_chunk(chunk):
                    final_chunks.append(chunk)
                else:
                    # Force split at sentences ONLY if absolutely necessary
                    emergency_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
                        tokenizer=tokenizer,
                        chunk_size=250,
                        chunk_overlap=50,
                        separators=[". "],  # Only split sentences when forced
                        keep_separator=True
                    )
                    final_chunks.extend(emergency_splitter.split_text(chunk))

            # Embed validated chunks
            for chunk_idx, chunk in enumerate(final_chunks):
                if not validate_chunk(chunk):
                    continue  # Skip chunks that still exceed the token limit

                embedding = get_embedding(chunk, embed_endpoint_name)
                output_obj = {
                    "id": f"{key}-{ctx_idx}-{chunk_idx}",
                    "contexts": chunk,
                    "context_vector": embedding
                }
                outfile.write(json.dumps(output_obj) + "\n")

print(f"Embeddings saved to {output_filename}")

In [20]:
# Read all JSON objects from the JSONL file
with open("output_embedded.jsonl", "r", encoding="utf-8") as infile:
    json_objects = [json.loads(line) for line in infile]

# Write the objects as a JSON array into a new .txt file
with open("merged_output.txt", "w", encoding="utf-8") as outfile:
    json.dump(json_objects, outfile, indent=4)

print("Merged JSON objects have been saved to merged_output.txt")

Merged JSON objects have been saved to merged_output.txt


In [21]:
def transform_file(input_filename, output_filename):
    # Load the merged file, which is expected to be a JSON array
    with open(input_filename, 'r', encoding='utf-8') as infile:
        records = json.load(infile)
    
    with open(output_filename, 'w', encoding='utf-8') as outfile:
        # Process each record in the array
        for record in records:
            contexts = record.get("contexts", [])
            vectors = record.get("context_vector", [])
            # Keep only the chunk text and its embedding vector;
            # the new object omits the "id" field.
            new_obj = {
                "contexts": contexts,
                "context_vector": vectors
            }
            # Write the JSON object as a single line
            outfile.write(json.dumps(new_obj) + "\n")

transform_file("merged_output.txt", "final_output_oneline.txt")
print("Transformation complete. Check final_output_oneline.txt")


Transformation complete. Check final_output_oneline.txt


In [None]:
## Index the text file into the index: opensearch-rag-index
batch = 0
count = 0
batch_size = 5
max_docs = 5000  # Upper limit on the number of documents to index
body_ = ''
action = json.dumps({ 'index': { '_index': 'opensearch-rag-index' } })
errors = []

def send_bulk(body):
    """Send one bulk request and collect the errors of any failed items."""
    response = aos_client.bulk(
        index = 'opensearch-rag-index',
        body = body
    )
    if response['errors']:
        for item in response['items']:
            if item['index']['status'] != 201:
                errors.append(item['index']['error'])

with open('final_output_oneline.txt', 'r') as file:
    for line in file:
        if count >= max_docs:
            break # Use this to run a limited number of items.
        # Strip the trailing newline so the bulk body contains no blank lines
        body_ = body_ + action + "\n" + line.strip() + "\n"
        count += 1
        if count % batch_size == 0:
            batch += 1
            if count % (batch_size*30) == 0:
                print("Batch: " + str(batch) + ", count: " + str(count)+ ", errors: " + str(len(errors)))
            send_bulk(body_)
            body_ = ''
# Send the remaining documents that did not fill a complete batch
if body_ != "":
    batch += 1
    send_bulk(body_)
print("Last batch: " + str(batch) + ", document count: " + str(count)+ ", errors: " + str(len(errors)))

## 2.5 Query OpenSearch Database

Once the vectors are ingested into the database, we can run queries to retrieve relevant contexts based on the input query.
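
By default each hit returns the stored embedding along with the text, which makes the printed results very long. One option, sketched below, is to use source filtering so that the vector field is left out of the response. `build_knn_query` is a hypothetical helper that only builds the request body; the resulting dictionary would be passed as `body` to the client's `search` call.

```python
def build_knn_query(query_vector: list[float], k: int = 5,
                    vector_field: str = "context_vector") -> dict:
    """Build a k-NN search body that leaves the stored vector out of each hit."""
    return {
        "size": k,
        "_source": {"excludes": [vector_field]},  # return the text, not the embedding
        "query": {"knn": {vector_field: {"vector": query_vector, "k": k}}},
    }


body = build_knn_query([0.1, 0.2, 0.3], k=3)
print(body["_source"])
```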

In [23]:
# Embed the natural language query with the same embedding model used for the documents
query_vector = get_embedding(query_text, embed_endpoint_name)

# Now, use the embedding in a k-NN query
knn_query = {
    "size": 5,  # adjust how many results you want to retrieve
    "query": {
        "knn": {
            "context_vector": {
                "vector": query_vector,
                "k": 5
            }
        }
    }
}

response_knn = aos_client.search(
    index="opensearch-rag-index",
    body=knn_query
)

print("KNN Query Results:")
for hit in response_knn['hits']['hits']:
    print(hit['_source'])



KNN Query Results:
{'contexts': 'The prevalence of self-reported problems with sleep and energy was 53 %. Without correction of cut-point shifts, age, sex, and the number of comorbidities were significantly associated with a greater severity of sleep-related problems. After correction, age, the number of comorbidities, and regular exercise were significantly associated with a greater severity of sleep-related problems; sex was no longer a significant factor. Compared to the ordered probit model, the CHOPIT model provided two changes with a subtle difference in the magnitude of regression coefficients after correction for reporting heterogeneity.', 'context_vector': [0.088420495, 0.04165407, 0.015496825, 0.1216508, 0.032155547, 0.10311852, -0.011321251, 0.016789436, 0.02392059, 0.025271298, -0.027566047, 0.025067966, -0.039969318, 0.014705281, 0.072386295, -0.0453431, 0.058821123, -0.004709321, -0.001998829, 0.04484929, 0.062510155, 0.034072682, 0.10503565, 0.056148756, 0.06361396, -0.0

# 3. Build end-to-end RAG pipeline with LLM models hosted on SageMaker AI and LangChain

We use document embeddings to fetch the most relevant documents from our knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following:

1. **Generate embeddings for each document in the knowledge library with the SageMaker hosted embedding model.**
2. **Identify the top K most relevant documents based on the user query.**
    - 2.1 **For a query of your interest, generate the embedding of the query using the same embedding model.**
    - 2.2 **Search for the top K most relevant documents in the embedding space using the OpenSearch k-NN index (Faiss engine).**
    - 2.3 **Use the search results to retrieve the corresponding documents.**
3. **Combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.**
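
To show what step 2 computes, here is a brute-force version of top K retrieval in plain Python. It is only a conceptual sketch: OpenSearch uses an approximate HNSW index rather than scanning every vector, and `top_k` and the tiny two-dimensional vectors are made up for illustration.

```python
import heapq


def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors closest to `query` by squared L2 distance."""
    def squared_l2(vector: list[float]) -> float:
        return sum((q - v) ** 2 for q, v in zip(query, vector))

    return heapq.nsmallest(k, docs, key=lambda doc_id: squared_l2(docs[doc_id]))


docs = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [0.9, 0.1],
}
print(top_k([1.0, 0.0], docs, k=2))
```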



Note: The retrieved text should be large enough to contain the information needed to answer the question, but small enough to fit into the LLM prompt, which has a maximum sequence length of 1024 tokens.
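
One simple way to respect such a limit is to keep retrieved passages in ranked order until the budget is used up. The sketch below assumes whitespace-separated words as a rough proxy for tokens; a real implementation would count tokens with the LLM's own tokenizer. `fit_to_budget` is a hypothetical helper.

```python
def fit_to_budget(passages: list[str], max_tokens: int) -> list[str]:
    """Keep passages in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for passage in passages:
        cost = len(passage.split())  # whitespace words as a rough token proxy
        if used + cost > max_tokens:
            break
        kept.append(passage)
        used += cost
    return kept


ranked = ["one two three", "four five", "six seven eight nine"]
print(fit_to_budget(ranked, max_tokens=5))
```

The budget passed in should leave room for the prompt template, the question, and the tokens the model is expected to generate.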

---
To build a simplified QA application with LangChain, we need to:
1. Wrap our SageMaker endpoints for the embedding model and the LLM in `langchain_community.embeddings.SagemakerEndpointEmbeddings` and `langchain_community.llms.SagemakerEndpoint`. (We already created the embedding wrapper in the previous section.)
2. Prepare the dataset to build the knowledge base.

---

Now we need to deploy a **Llama 3.1 8B LLM** onto a SageMaker real-time endpoint and prepare the SageMaker Endpoint class for LangChain integration.

In [24]:
model_id_llm, model_version = "meta-textgeneration-llama-3-1-8b-instruct", "*"
accept_eula = True

In [25]:
model = JumpStartModel(model_id=model_id_llm, model_version=model_version)

Model 'meta-textgeneration-llama-3-1-8b-instruct' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3_1Eula.txt for terms of use.


Using model 'meta-textgeneration-llama-3-1-8b-instruct' with wildcard version identifier '*'. You can pin to version '2.7.2' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


No instance type selected for inference hosting endpoint. Defaulting to ml.g5.4xlarge.


In [26]:
predictor = model.deploy(accept_eula=accept_eula)

--------------!

Invoke the LLM endpoint for a quick test

In [27]:
llm_endpoint_name = predictor.endpoint_name
input_str = { "inputs": query_text, 
            "parameters": { 
                "max_new_tokens": 100, 
                "top_p": 0.9, 
                "temperature": 0.6 
            }
        }
output = sm_runtime_client.invoke_endpoint(
    EndpointName=llm_endpoint_name,
    Body=json.dumps(input_str),
    ContentType="application/json"
)
llm_response = output["Body"].read().decode("utf-8")
print(llm_response)

{"generated_text": " A systematic review and meta-analysis of the effects of sleep disorders on quality of life?\nA. yes\nB. A\nC. A\nD. B\nAnswer: A\nExplanation: To assess the impact of sleep disorders on quality of life (QoL) and to determine whether adjustment for reporting heterogeneity is necessary.\nA systematic review and meta-analysis of studies examining the relationship between sleep disorders and QoL was conducted. Studies were identified through a comprehensive search of electronic databases. Studies"}


Next, we wrap the SageMaker LLM endpoint in `langchain_community.llms.SagemakerEndpoint`.

In [14]:
parameters = {
    "max_new_tokens": 100,
    "temperature": 0.2,
    "top_p": 0.9
}


class LlamaContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Optional[Dict] = None) -> bytes:
        # Wrap the prompt and generation parameters in the JSON body the endpoint expects
        input_str = json.dumps({"inputs": prompt, "parameters": model_kwargs or {}})
        return input_str.encode("utf-8")

    def transform_output(self, output: StreamingBody) -> str:
        # The endpoint returns a JSON object with a "generated_text" key
        res = json.loads(output.read())
        return res["generated_text"]

In [None]:
# Use a separate name so the embedding `content_handler` defined earlier is not overwritten
llm_content_handler = LlamaContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=region,
    model_kwargs=parameters,
    content_handler=llm_content_handler,
)



We combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.

We define a customized prompt below.

In [29]:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.:\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

opensearch_url = f"https://{aos_host}"


For this example, we created the OpenSearch cluster with the user name and password shown in the next cell. In a real application, we suggest storing the user name and password in a service that can hold them securely, for example AWS Secrets Manager, as [shown here](https://github.com/aws-samples/rag-with-amazon-opensearch-and-sagemaker/blob/main/app/opensearch_retriever_llama2.py#L89).
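
As a sketch of the Secrets Manager approach, the helper below reads a secret and returns a `(username, password)` tuple that could be used as `http_auth`. Several things here are assumptions: the secret is assumed to be a JSON string with `username` and `password` keys, the secret name is hypothetical, and a fake client is used so the example runs without AWS access. In the notebook the client would presumably be `boto3.client("secretsmanager")`, whose `get_secret_value` call returns the secret under `SecretString`.

```python
import json
from typing import Any


def load_opensearch_credentials(secrets_client: Any, secret_id: str) -> tuple[str, str]:
    """Read a JSON secret with `username` and `password` keys (assumed layout)."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]


# Stand-in client so the sketch runs without AWS access.
class FakeSecretsClient:
    def get_secret_value(self, SecretId: str) -> dict:
        payload = {"username": "demo-user", "password": "demo-pass"}
        return {"SecretString": json.dumps(payload)}


http_auth_example = load_opensearch_credentials(
    FakeSecretsClient(), "hypothetical/opensearch-secret"
)
print(http_auth_example[0])
```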

In [30]:
http_auth = ("master", "ML-Search123!") 

In [31]:
opensearch_vector_search = OpenSearchVectorSearch(
    opensearch_url=opensearch_url,
    index_name='opensearch-rag-index',
    embedding_function=embeddings_function,
    http_auth=http_auth
)

In [32]:
retriever = opensearch_vector_search.as_retriever(
    search_kwargs={"k": 3, "vector_field": "context_vector", "text_field": "contexts"})


In [33]:
chain_type_kwargs = {"prompt": PROMPT, "verbose": True}
qa = RetrievalQA.from_chain_type(
    sm_llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    # return_source_documents=True, ## you can uncomment this line to see the detailed retrieved data source
    # verbose=True, #DEBUG
)


In [34]:
qa(query_text)





> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.:

The prevalence of self-reported problems with sleep and energy was 53 %. Without correction of cut-point shifts, age, sex, and the number of comorbidities were significantly associated with a greater severity of sleep-related problems. After correction, age, the number of comorbidities, and regular exercise were significantly associated with a greater severity of sleep-related problems; sex was no longer a significant factor. Compared to the ordered probit model, the CHOPIT model provided two changes with a subtle difference in the magnitude of regression coefficients after correction for reporting heterogeneity.

Anchoring vignettes are brief texts describing a hypothetical character who il

{'query': 'Is adjustment for reporting heterogeneity necessary in sleep disorders?',
 'result': ' Yes, adjustment for reporting heterogeneity is necessary in sleep disorders. The study found that after correction for reporting heterogeneity, age, the number of comorbidities, and regular exercise were significantly associated with a greater severity of sleep-related problems, while sex was no longer a significant factor. This suggests that reporting heterogeneity can affect the results of studies on sleep disorders, and adjustment for it is necessary to get accurate results. Additionally, the study used anchoring vignettes to elucidate factors associated with'}

# 📂 Test Evaluation Pipeline


Let's begin by setting up the Langfuse API authentication. You need to:
- go to https://us.cloud.langfuse.com and sign up for a new account
- create a new project
- once the project is created, create the API keys (Secret Key and Public Key)
- use these API keys to fill in the variables `os.environ["LANGFUSE_SECRET_KEY"]` and `os.environ["LANGFUSE_PUBLIC_KEY"]` below.

In [None]:
# If you have already defined these environment variables in the .env file of the VS Code server, skip this cell.
# Define the environment variables for Langfuse.
# You can find these values when you create the API key in Langfuse.
import os
os.environ["LANGFUSE_SECRET_KEY"] = "<TODO>" # Your Langfuse project secret key
os.environ["LANGFUSE_PUBLIC_KEY"] = "<TODO>" # Your Langfuse project public key
os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # Langfuse domain

# Required Langfuse environment variables
required_env_vars = [
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_HOST"
]
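
The list of required variables can be used to fail early when a key is missing. The sketch below is one possible check, not part of the Langfuse SDK. The helper name `missing_env_vars` is made up, and it treats the `<TODO>` placeholder as "not set".

```python
import os

# Same variable names as the cell above.
REQUIRED_ENV_VARS = ["LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_HOST"]


def missing_env_vars(names, environ=os.environ, placeholder="<TODO>"):
    """Return the names that are unset, empty, or still hold the placeholder."""
    return [name for name in names if environ.get(name, "") in ("", placeholder)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing Langfuse settings: {', '.join(missing)}")
else:
    print("All Langfuse environment variables are set.")
```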

In [9]:
# used to access Bedrock configuration
bedrock = boto3.client(service_name="bedrock", region_name="us-east-1")

bedrock_agent_runtime = boto3.client(
    service_name="bedrock-agent-runtime", region_name="us-east-1"
)


In [10]:
# langfuse client
langfuse = Langfuse()
if langfuse.auth_check():
    print("Langfuse has been set up correctly")
    print(f"You can access your Langfuse instance at: {os.environ['LANGFUSE_HOST']}")
else:
    print(
        "Credentials not found or invalid. Check your Langfuse API key and host in the .env file."
    )

Langfuse has been set up correctly
You can access your Langfuse instance at: https://us.cloud.langfuse.com


Before loading the dataset, let's choose the metrics we will use to score the pipeline.

### 📊 RAGAS Evaluation Metrics

We're going to measure the following aspects of a RAG system. These metrics are defined in **[RAGAS](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/)**:

- 🔍 **[Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/)**  
  Measures how factually consistent the generated answer is with the retrieved context. It evaluates whether the answer could reasonably be derived from the context.

- 🎯 **[Response Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/)**  
  Assesses how relevant the generated answer is to the original user query. A high score indicates the answer is on-topic and useful.

- 🧠 **[Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/)**  
  Measures how many of the retrieved contexts are truly relevant to answering the question. Precision reflects the "purity" of the retrieved chunks.

- 📥 **[Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/)**  
  Evaluates how well the retrieved context covers the information needed to answer the question completely. High recall means fewer relevant facts are missed.

- 🧬 **[Answer Similarity](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_similarity/)**  
  Compares the generated answer to a reference answer (if available), measuring how semantically close they are using embedding-based similarity.

- ✅ **[Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_correctness/)**  
  Evaluates whether the generated answer is factually correct and aligns with known ground-truth answers, if such references are available.

> 📚 Want to dive deeper into how each metric is computed?  
Check out the full [RAGAS metrics documentation](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/).
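
To give a feel for how one of these scores is assembled, here is a small sketch of the Context Precision@K formula as described in the Ragas documentation: the mean of precision@k, taken over the ranks k that hold a relevant chunk. In Ragas the per-chunk relevance verdicts come from an LLM judge. In this sketch they are hard-coded 0/1 flags, and the function name is illustrative.

```python
def context_precision_at_k(relevance_flags):
    """relevance_flags[i] is 1 if the chunk at rank i+1 is relevant, else 0."""
    total_relevant = sum(relevance_flags)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for rank, flag in enumerate(relevance_flags, start=1):
        if flag:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / total_relevant


print(context_precision_at_k([1, 0, 1]))  # relevant chunks at ranks 1 and 3
print(context_precision_at_k([0, 1, 1]))  # relevant chunks at ranks 2 and 3
```

Relevant chunks that are ranked higher give a higher score: the first call scores about 0.83 and the second about 0.58, even though both retrieve two relevant chunks out of three.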


In [19]:
# Metrics to evaluate
import ragas.metrics

metrics = [
    ragas.metrics.answer_relevancy,
    ragas.metrics.faithfulness,
    ragas.metrics.context_precision,
    ragas.metrics.context_recall,
    ragas.metrics.answer_similarity,
    ragas.metrics.answer_correctness,
]

In [20]:
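from ragas.metrics.base import MetricWithEmbeddings, MetricWithLLM
from ragas.run_config import RunConfig


# Utility function to initialize Ragas metrics with an LLM and an embedding model
def init_ragas_metrics(metrics, llm, embedding):
    run_config = RunConfig()  # default settings, shared by all metrics
    for metric in metrics:
        if isinstance(metric, MetricWithLLM):
            print(metric.name + " llm")
            metric.llm = llm
        if isinstance(metric, MetricWithEmbeddings):
            print(metric.name + " embedding")
            metric.embeddings = embedding
        metric.init(run_config)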

Now we initialize the metrics with the LLM and embedding model of your choice. In this example, we use the Llama-3-1-8b-instruct model and the amazon.titan-embed-text-v1 embedding model, together with the convenience wrappers from the `langchain-aws` library.

### Creating the SageMaker Endpoint for the Llama 3.1 8B Instruct Model

In [21]:
endpoint_name = predictor.endpoint_name
print(endpoint_name)

llama-3-1-8b-instruct-2025-04-20-22-17-54-324


In [22]:
sm = boto3.Session().client('sagemaker-runtime')

In [23]:
chat_content_handler = ContentHandler()

chat_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    client=sm,
    model_kwargs={
        "temperature": 0.7,  # Adjust temperature for balanced randomness
        "max_new_tokens": 1200,  # Ensure sufficient token generation
        "top_p": 0.95,  # Use nucleus sampling for diversity
        "do_sample": True  # Enable sampling for generative tasks
    },
    content_handler=chat_content_handler
)
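
With `chat_llm` defined, the metrics can be wired to it. The sketch below is one way to do that, under these assumptions: the `LangchainLLMWrapper` and `LangchainEmbeddingsWrapper` classes from Ragas wrap the LangChain objects, `BedrockEmbeddings` from `langchain-aws` provides the Titan embedding model, and your account has access to that model in the chosen region. The helper name `build_ragas_models` is made up for this example.

```python
def build_ragas_models(langchain_llm, embedding_model_id="amazon.titan-embed-text-v1",
                       region_name="us-east-1"):
    """Wrap a LangChain LLM and a Bedrock embedding model for use by Ragas metrics."""
    # Imports are local so this sketch only needs the libraries when it is called.
    from langchain_aws import BedrockEmbeddings
    from ragas.embeddings import LangchainEmbeddingsWrapper
    from ragas.llms import LangchainLLMWrapper

    embeddings = BedrockEmbeddings(model_id=embedding_model_id, region_name=region_name)
    return LangchainLLMWrapper(langchain_llm), LangchainEmbeddingsWrapper(embeddings)


# Usage in this notebook (needs AWS credentials and Bedrock model access):
# ragas_llm, ragas_embeddings = build_ragas_models(chat_llm)
# init_ragas_metrics(metrics, llm=ragas_llm, embedding=ragas_embeddings)
```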

### Score with RAGAS

## Trace eval results with Langfuse

You can use model-based evaluation with Ragas in two ways:
1. Score every trace: You run the evaluations for every trace. This gives you a much better idea of how each call to your RAG pipeline performs, but be mindful of the cost.

2. Score with sampling: You periodically take a random sample of traces and score only those. This lowers the cost and gives you a rough estimate of your app's performance, but it may miss important samples.

In this example, we will demonstrate both solutions using a prebuilt dataset and a live RAG pipeline with Amazon OpenSearch Service.
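
The sampling option can be sketched in a few lines. This is an illustrative helper, not a Langfuse or Ragas API. The trace IDs are placeholders, and in practice you would fetch real trace IDs from Langfuse before sampling.

```python
import random


def sample_traces(trace_ids, fraction=0.1, seed=None):
    """Pick a random subset of trace IDs to score; returns at least one if any exist."""
    if not trace_ids:
        return []
    rng = random.Random(seed)
    sample_size = max(1, round(len(trace_ids) * fraction))
    return rng.sample(list(trace_ids), sample_size)


all_trace_ids = [f"trace-{i}" for i in range(50)]  # placeholder IDs
to_score = sample_traces(all_trace_ids, fraction=0.1, seed=42)
print(f"Scoring {len(to_score)} of {len(all_trace_ids)} traces")
```

Only the sampled traces are then passed to the scoring function, so evaluation cost scales with `fraction` rather than with total traffic.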

### Score every trace

Let's take a small example of a single trace and see how you can score it with Ragas. We first define a utility function that scores a trace with the metrics you chose.

In [25]:
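from ragas.dataset_schema import SingleTurnSample


async def score_with_ragas(query, chunks, answer, metrics):
    # The sample is the same for every metric, so build it once.
    sample = SingleTurnSample(
        user_input=query,
        retrieved_contexts=chunks,
        response=answer,
        reference=chunks[0],  # the first retrieved chunk stands in for a reference answer
    )
    scores = {}
    for metric in metrics:
        print(f"calculating {metric.name}")
        scores[metric.name] = await metric.single_turn_ascore(sample)
    return scores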

#### Scoring RAG
We have already set up the OpenSearch vector database in the first part of this lab, so we can now **evaluate** the quality of its results against a test dataset. This helps us **optimize** the configuration for high quality and low cost.

First, let's load the sample dataset of questions, reference answers, and their source documents (for more details on how to prepare a dataset like this, see [this GitHub notebook](https://github.com/aws-samples/llm-evaluation-methodology/blob/main/datasets/Prepare-SQuAD.ipynb)):


In [34]:
import pandas as pd
dataset_df = pd.read_csv("ori_pqal_10_records.csv")
dataset_df.head(10)

Unnamed: 0.1,Unnamed: 0,QUESTION,CONTEXTS,LABELS,MESHES,YEAR,reasoning_required_pred,reasoning_free_pred,final_decision,LONG_ANSWER
0,21645374,Do mitochondria play a role in remodelling lac...,['Programmed cell death (PCD) is the regulated...,"['BACKGROUND', 'RESULTS']","['Alismataceae', 'Apoptosis', 'Cell Differenti...",2011.0,yes,yes,yes,Results depicted mitochondrial dynamics in viv...
1,16418930,Landolt C and snellen e acuity: differences in...,['Assessment of visual acuity depends on the o...,"['BACKGROUND', 'PATIENTS AND METHODS', 'RESULTS']","['Adolescent', 'Adult', 'Aged', 'Aged, 80 and ...",2006.0,no,no,no,"Using the charts described, there was only a s..."
2,9488747,"Syncope during bathing in infants, a pediatric...",['Apparent life-threatening events in infants ...,"['BACKGROUND', 'CASE REPORTS']","['Baths', 'Histamine', 'Humans', 'Infant', 'Sy...",1997.0,yes,yes,yes,"""Aquagenic maladies"" could be a pediatric form..."
3,17208539,Are the long-term results of the transanal pul...,['The transanal endorectal pull-through (TERPT...,"['PURPOSE', 'METHODS', 'RESULTS']","['Child', 'Child, Preschool', 'Colectomy', 'Fe...",2007.0,yes,no,no,Our long-term study showed significantly bette...
4,10808977,Can tailored interventions increase mammograph...,['Telephone counseling and tailored print comm...,"['BACKGROUND', 'DESIGN', 'PARTICIPANTS', 'INTE...","['Cost-Benefit Analysis', 'Female', 'Health Ma...",2000.0,yes,no,yes,The effects of the intervention were most pron...
5,23831910,Double balloon enteroscopy: is it efficacious ...,"['From March 2007 to January 2011, 88 DBE proc...","['METHODS', 'RESULTS']","['Community Health Centers', 'Double-Balloon E...",2013.0,yes,yes,yes,DBE appears to be equally safe and effective w...
6,26037986,30-Day and 1-year mortality in emergency gener...,['Emergency surgery is associated with poorer ...,"['AIMS', 'METHODS', 'RESULTS']","['Adult', 'Age Factors', 'Aged', 'Aged, 80 and...",2015.0,maybe,yes,maybe,Emergency laparotomy carries a high rate of mo...
7,26852225,Is adjustment for reporting heterogeneity nece...,['Anchoring vignettes are brief texts describi...,"['BACKGROUND', 'METHODS', 'RESULTS']","['Adult', 'Aged', 'Female', 'Health Status Dis...",2016.0,yes,no,no,Sleep disorders are common in the general adul...
8,17113061,Do mutations causing low HDL-C promote increas...,['Although observational data support an inver...,"['BACKGROUND', 'METHODS', 'RESULTS']","['Cholesterol, HDL', 'Contrast Media', 'Corona...",2007.0,no,no,no,Genetic variants identified in the present stu...
9,10966337,A short stay or 23-hour ward in a general and ...,"[""We evaluated the usefulness of a short stay ...","['OBJECTIVE', 'METHODS', 'RESULTS']","['Academic Medical Centers', 'Acute Disease', ...",2000.0,yes,yes,yes,This data demonstrates the robust nature of th...


Records in this dataset include:

- (`QUESTION`) The user question to be asked
- (`CONTEXTS`) A list of text passages from the source document
- (`LABELS`) The section label of each passage (for example `BACKGROUND`, `METHODS`, `RESULTS`)
- (`MESHES`) Subject terms associated with the source document
- (`YEAR`) The publication year of the source document
- (`final_decision`) The short reference answer (`yes`, `no`, or `maybe`)
- (`LONG_ANSWER`) The long-form reference answer

As shown in [Ragas' API Reference](https://docs.ragas.io/en/latest/references/evaluation.html), records in Ragas evaluation datasets typically include:

- The `question` that was asked
- The `answer` the system generated
- The actual text `contexts` the answer was based on (i.e. snippets of document text retrieved by the search engine)
- The `ground_truth` answer(s)
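
As a sketch of how one row of the dataset above could be mapped onto these four fields, consider the helper below. Two things are assumptions here rather than facts from the lab: `LONG_ANSWER` is used as the ground truth, and the `CONTEXTS` column is treated as the text of a Python list (which is how it appears in the table output). The function names are illustrative.

```python
import ast


def parse_contexts(cell):
    """The CSV stores CONTEXTS as the text of a Python list; turn it back into a list."""
    return ast.literal_eval(cell)


def to_ragas_record(row, generated_answer, retrieved_contexts):
    """Build a Ragas-style record from one dataset row plus the pipeline's outputs."""
    return {
        "question": row["QUESTION"],
        "answer": generated_answer,
        "contexts": list(retrieved_contexts),
        "ground_truth": row["LONG_ANSWER"],  # assumption: LONG_ANSWER is the reference
    }


# Placeholder row; a real row would come from dataset_df.
row = {
    "QUESTION": "Example question?",
    "CONTEXTS": "['First passage.', 'Second passage.']",
    "LONG_ANSWER": "Example reference answer.",
}
# Here the dataset's own passages stand in for what the retriever would return.
record = to_ragas_record(row, "Example generated answer.", parse_contexts(row["CONTEXTS"]))
print(record)
```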

Here we integrate [Langfuse Tracing](https://langfuse.com/docs/tracing) into the RAG pipeline with the Langfuse Python SDK, using the `@observe()` decorator.

We can run an example question through the retrieve-and-generate pipeline backed by the OpenSearch vector database, as shown below, and extract the retrieved references so they are ready for metric calculation.

In [27]:
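from typing import List

from langchain_core.messages import HumanMessage, SystemMessage
from langfuse.decorators import langfuse_context, observe

# Bedrock Runtime
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")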

@observe(name="OpenSearch RAG with Llama")
def retrieve_and_generate(
    question: str,
    top_k: int = 3,
    system_prompt: str = "You are a helpful assistant. Use the context to answer concisely.",
    **kwargs,
):
    # Step 1: Retrieve relevant context from OpenSearch
    response = aos_client.search(
        index="documents",
        body={
            "query": {
                "match": {
                    "content": {
                        "query": question
                    }
                }
            }
        },
        size=top_k
    )
    
    hits = response["hits"]["hits"]
    contexts = [hit["_source"]["content"] for hit in hits]
    doc_ids = [hit["_id"] for hit in hits]

    # Step 2: Format prompt with retrieved context
    combined_context = "\n\n".join(contexts)
    full_prompt = f"""Context:
{combined_context}

Question: {question}
Answer:"""

    # Step 3: Call your SageMaker-hosted model using LangChain
    messages: List = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=full_prompt)
    ]
    
    response_chunk = chat_llm.invoke(messages)
    answer = response_chunk  # already joined by your handler

    # Step 4: Log trace to Langfuse
    langfuse_context.update_current_observation(
        input={"question": question, "contexts": contexts},
        output=answer,
        model=endpoint_name,
        session_id="opensearch-rag-session",
        tags=["dev", "llama", "opensearch"],
        metadata=kwargs,
    )

    trace_id = langfuse_context.get_current_trace_id()

    return {
        "answer": answer,
        "retrieved_doc_ids": doc_ids,
        # Note: this is a list slice. It keeps at most 300 passages; it does not truncate their text.
        "retrieved_doc_texts": contexts[:300],
        "trace_id": trace_id,
    }

Run RAG as requests come in and score the results immediately.

In [40]:
from asyncio import run
from typing import Any, Optional

#langfuse
langfuse_client = Langfuse()  # picks up env vars: LANGFUSE_PUBLIC_KEY, SECRET_KEY, HOST


@observe(name="OpenSearch, Llama, Langfuse Pipeline")
def rag_pipeline(
    question: str,
    user_id: Optional[str] = None,
    session_id: Optional[str] = None,
    metrics: Optional[Any] = None,
):
    generated_answer = retrieve_and_generate(
        question=question,
        top_k=3,  # or whatever makes sense for your context window
        system_prompt="You are a helpful assistant. Use the context below to answer the question."
    )

    answer = generated_answer["answer"]
    contexts = generated_answer["retrieved_doc_texts"]
    trace_id = generated_answer["trace_id"]

    
    # Default to a single embedding-based metric when the caller does not pass any.
    # Other options under ragas.metrics: answer_relevancy, faithfulness,
    # context_precision, context_recall, answer_correctness.
    if metrics is None:
        metrics = [ragas.metrics.answer_similarity]


    score = run(score_with_ragas(question, contexts, answer=answer, metrics=metrics))

    langfuse_context.update_current_trace(
        user_id=user_id,
        session_id=session_id,
        tags=["dev", "opensearch", "llama"]
    )

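    # Attach each score to the trace created by this pipeline run
    for name, value in score.items():
        langfuse_client.score(trace_id=trace_id, name=name, value=value)


    print(f"🔗 Langfuse trace: {os.environ['LANGFUSE_HOST']}/trace/{trace_id}")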

    return trace_id



In [None]:
%%time
for index, row in dataset_df[["QUESTION"]].iterrows():
    #print(row["QUESTION"])
    response = rag_pipeline(
        question=row["QUESTION"],
        user_id="AWSome-"+str(index),
        session_id="llama-test-session-"+str(index)
    )
    print(f"🔗 Langfuse trace for question {index}: {os.environ['LANGFUSE_HOST']}/trace/{response}")

![](images/LangfuseTraces.png)

### Key Workflow Summary
- Data Preparation: Text is split into chunks and embedded.
- OpenSearch Setup: A vector index is created and populated.
- Model Deployment: Embedding and LLM models are hosted on SageMaker.
- RAG Pipeline: Queries retrieve relevant context, and the LLM generates answers.
- Evaluation: Amazon Bedrock, Ragas, and Langfuse are used to evaluate and score the RAG workflow.

This notebook provides an end-to-end example of building a production-ready RAG system using AWS services. The same approach can be adapted for other domains by replacing the dataset and fine-tuning the models.

# Congratulations on finishing Lab 3. Please continue to the next lab.