# Lab 0: Prerequisites



## Overview
This notebook demonstrates how to implement a Retrieval Augmented Generation (RAG) solution using:
- Amazon SageMaker for hosting embedding and LLM models
- Amazon OpenSearch for vector search
- LangChain for orchestrating the RAG pipeline
- [RAGAS](https://docs.ragas.io/en/v0.1.21/index.html) and [Langfuse](https://langfuse.com/) for evaluating and tracing the RAG pipeline

You will explore ways to evaluate the quality of RAG pipelines with open-source tools such as RAGAS, and use Langfuse traces and spans to manage and trace those pipelines. In this lab you create an OpenSearch vector database and generate the RAG results that are used for offline evaluation and scoring.

This notebook builds a question answering solution with large language models (LLMs) and Amazon OpenSearch Service. An application using the RAG (Retrieval Augmented Generation) approach retrieves the information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request into a prompt, and then sends the prompt to the LLM to get a generated response.

LLMs limit the number of tokens that fit in the input prompt, so choosing the right passages among thousands or millions of enterprise documents has a direct impact on the LLM’s accuracy.
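
To make the prompt-size constraint concrete, here is a minimal sketch of packing ranked passages into a prompt under a budget. The function names, the prompt wording, and the whitespace-based token estimate are assumptions made for illustration; a real pipeline would count tokens with the model's own tokenizer.

```python
from typing import List


def approx_token_count(text: str) -> int:
    """Rough estimate: count whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def build_rag_prompt(question: str, passages: List[str], max_context_tokens: int = 200) -> str:
    """Pack the highest-ranked passages into the prompt until the budget is used up."""
    selected = []
    used = 0
    for passage in passages:  # passages are assumed to be sorted by relevance
        cost = approx_token_count(passage)
        if used + cost > max_context_tokens:
            break  # stop at the first passage that does not fit
        selected.append(passage)
        used += cost
    context = "\n\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_rag_prompt(
    "Does aspirin reduce fever?",
    ["Aspirin is an antipyretic drug.", "Unrelated passage " * 300],
    max_context_tokens=50,
)
print(prompt)
```

The second passage exceeds the budget and is dropped, which is why retrieval quality matters: only the passages that make it into the context can inform the answer.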

## Part 1: Build conversational search with OpenSearch Service

The vector dataset used in this part of the lab is comprised of a predefined content resource from the [PubMedQA](https://pubmedqa.github.io/) dataset.

You will use an OpenSearch ingest pipeline with a text embedding processor to generate text embeddings for the dataset. The neural search plugin in OpenSearch also generates the embedding of the search query at query time.
You will then use the large language model (LLM) hosted on an Amazon SageMaker endpoint with the RAG processor in a search pipeline to generate text. The RAG processor passes the search results retrieved from OpenSearch to the LLM as context and returns the generated answer to the end user together with those results.
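
As a rough sketch of what these two pipelines look like, the snippet below builds their request bodies as Python dictionaries. The processor names `text_embedding` and `retrieval_augmented_generation` come from OpenSearch; the helper functions, field names, prompts, and model IDs are hypothetical placeholders. Note that `model_id` here refers to a model registered in OpenSearch ML Commons (for example through a connector to a SageMaker endpoint), not to the SageMaker endpoint name itself.

```python
import json
from typing import Any, Dict, List


def build_ingest_pipeline(embedding_model_id: str, text_field: str, vector_field: str) -> Dict[str, Any]:
    """Body for PUT /_ingest/pipeline/<name>: embed text_field into vector_field."""
    return {
        "description": "Generate embeddings at ingest time",
        "processors": [
            {
                "text_embedding": {
                    "model_id": embedding_model_id,
                    "field_map": {text_field: vector_field},
                }
            }
        ],
    }


def build_rag_search_pipeline(llm_model_id: str, context_fields: List[str]) -> Dict[str, Any]:
    """Body for PUT /_search/pipeline/<name>: send retrieved fields to the LLM."""
    return {
        "response_processors": [
            {
                "retrieval_augmented_generation": {
                    "tag": "rag_demo",
                    "description": "RAG response processor (illustrative values)",
                    "model_id": llm_model_id,
                    "context_field_list": context_fields,
                    "system_prompt": "You are a helpful assistant.",
                    "user_instructions": "Answer using only the provided context.",
                }
            }
        ]
    }


# Placeholder model IDs and field names, for illustration only.
ingest_pipeline = build_ingest_pipeline("<embedding-model-id>", "context", "context_vector")
search_pipeline = build_rag_search_pipeline("<llm-model-id>", ["context"])
print(json.dumps(ingest_pipeline, indent=2))
print(json.dumps(search_pipeline, indent=2))
```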

### The key steps in this lab are as follows:

1. Get prerequisites installed and libraries imported.
1. Create an OpenSearch Service domain.
1. Create a k-NN-enabled index and ingest the PubMedQA contexts into the index.
1. Deploy the embedding model to a SageMaker endpoint.
1. Deploy the generation model to a SageMaker endpoint.
1. Build the end-to-end pipeline with LangChain.
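
For the k-NN index step, an index body might look like the sketch below. It is an illustration only: the helper name, the field names, and the choice of the `lucene` engine with `cosinesimil` are assumptions, and the vector dimension must match the embedding model you deploy (768 for `gte-base-en-v1.5`).

```python
from typing import Any, Dict, Optional


def build_knn_index_body(
    vector_field: str,
    dimension: int,
    text_field: str = "context",
    default_pipeline: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the settings and mappings for a k-NN-enabled index."""
    index_settings: Dict[str, Any] = {"knn": True}
    if default_pipeline:
        # Run this ingest pipeline automatically for every indexed document.
        index_settings["default_pipeline"] = default_pipeline
    return {
        "settings": {"index": index_settings},
        "mappings": {
            "properties": {
                text_field: {"type": "text"},
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                    },
                },
            }
        },
    }


# "context_vector" and "embedding-pipeline" are hypothetical names.
index_body = build_knn_index_body("context_vector", 768, default_pipeline="embedding-pipeline")
print(index_body)
```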

## 1.1. Import libraries & initialize resources
The code blocks below will install and import all the relevant libraries and modules used in this notebook.

In [None]:
%pip uninstall -q -y autogluon-multimodal autogluon-timeseries autogluon-features autogluon-common autogluon-core

%pip install -Uq boto3==1.37.38
%pip install -Uq sagemaker==2.243.2
    
%pip install -Uq opensearch-py==2.8.0
%pip install -Uq opensearch_py_ml==1.1.0
    
print("Installs completed.")

In [None]:
# Import Python libraries
from typing import Any, Dict, List, Optional
import boto3
import json
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import os

import time

from transformers import AutoTokenizer


# Sagemaker
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel

In [None]:
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models, and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

region = sess.boto_region_name

account_id = boto3.client('sts').get_caller_identity().get('Account')

sm_runtime_client = boto3.client("sagemaker-runtime")
opensearch_client = boto3.client('opensearch')


print(f"account id: {account_id}")
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {region}")

# 2. Deploy the embedding model to a SageMaker endpoint & build retrieval integration with OpenSearch

We have taken the PubMedQA dataset and prepared it to include the contexts in the `extracted_context.json` file.

The following cells will perform the steps to generate embeddings with the dataset and ingest into the OpenSearch vector database.
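
As a sketch of the ingestion step, the generator below turns records into bulk-index actions. The layout of `extracted_context.json` is not shown in this notebook, so the record shape (a list of dictionaries with a `context` key), the index name, and the helper name are all assumptions. With a connected `opensearch-py` client, actions like these can be passed to `opensearchpy.helpers.bulk(client, actions)`.

```python
from typing import Any, Dict, Iterable, Iterator


def to_bulk_actions(
    records: Iterable[Dict[str, Any]],
    index_name: str,
    text_key: str = "context",
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk-index action per record that has non-empty text."""
    for i, record in enumerate(records):
        text = record.get(text_key)
        if not text:
            continue  # skip records without usable text
        yield {
            "_index": index_name,
            "_id": str(i),
            "_source": {text_key: text},
        }


# Hypothetical sample records standing in for the contents of the JSON file.
sample_records = [
    {"context": "Aspirin is an antipyretic drug."},
    {"context": ""},
    {"context": "Ibuprofen is a nonsteroidal anti-inflammatory drug."},
]
actions = list(to_bulk_actions(sample_records, "pubmedqa-contexts"))
print(actions)
```

If the index has a default ingest pipeline with an embedding processor, the vector field is populated by OpenSearch at ingest time, so the actions only need to carry the text.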

## 2.1 Create an OpenSearch Service Domain

In the following steps, you will create a new OpenSearch Service domain. The configuration used here creates a publicly accessible domain in a single Availability Zone (AZ), with your SageMaker execution role as the master user. If you deploy an OpenSearch Service domain for a production use case, deploy it inside a VPC and use multiple nodes across multiple AZs for high availability.
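
For comparison, the sketch below shows how a lab-style payload could be adjusted toward the production guidance above. `VPCOptions`, `ZoneAwarenessEnabled`, and `ZoneAwarenessConfig` are `create_domain` parameters; the helper name, subnet IDs, and security group ID are hypothetical, and a real production setup needs further choices (instance types, dedicated master nodes, access policies) that are not covered here.

```python
import copy
from typing import Any, Dict, List


def harden_domain_payload(
    payload: Dict[str, Any],
    subnet_ids: List[str],
    security_group_ids: List[str],
    az_count: int = 2,
    data_nodes: int = 2,
) -> Dict[str, Any]:
    """Return a copy of the payload that uses a VPC and multiple AZs."""
    if data_nodes % az_count != 0:
        raise ValueError("data_nodes should be a multiple of az_count")
    hardened = copy.deepcopy(payload)  # leave the original payload untouched
    cluster = hardened.setdefault("ClusterConfig", {})
    cluster["InstanceCount"] = data_nodes
    cluster["ZoneAwarenessEnabled"] = True
    cluster["ZoneAwarenessConfig"] = {"AvailabilityZoneCount": az_count}
    hardened["VPCOptions"] = {
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }
    return hardened


base_payload = {
    "DomainName": "opensearch-rag-domain",
    "ClusterConfig": {"InstanceCount": 1, "ZoneAwarenessEnabled": False},
}
# Placeholder network identifiers, for illustration only.
prod_payload = harden_domain_payload(
    base_payload,
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    security_group_ids=["sg-cccc3333"],
)
print(prod_payload)
```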

In [None]:
OS_DOMAIN_NAME = 'opensearch-rag-domain'

Store this variable so that later notebooks can restore it with `%store -r OS_DOMAIN_NAME`.

In [None]:
%store OS_DOMAIN_NAME

In [None]:
# Resource-based access policy: allow the SageMaker execution role to call
# any OpenSearch Service (es:*) action, on this domain only.
# Reuses role, region, and account_id resolved in the setup cell.
domain_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": [role]},
            "Action": ["es:*"],
            "Resource": f"arn:aws:es:{region}:{account_id}:domain/{OS_DOMAIN_NAME}/*",
        }
    ],
}

domain_policy

In [None]:
create_domain_payload = {
    "DomainName": OS_DOMAIN_NAME,
    "EngineVersion": "OpenSearch_2.17",
    "ClusterConfig": {
        "InstanceType": "t3.small.search",
        "InstanceCount": 1,
        "DedicatedMasterEnabled": False,
        "ZoneAwarenessEnabled": False,
        "WarmEnabled": False,
        "ColdStorageOptions": {
            "Enabled": False
        },
        "MultiAZWithStandbyEnabled": False
    },
    "EBSOptions": {
        "EBSEnabled": True,
        "VolumeType": "gp3",
        "VolumeSize": 100,
        "Iops": 3000,
        "Throughput": 125
    },
    "AccessPolicies": json.dumps(domain_policy),
    "IPAddressType": "dualstack",
    "SnapshotOptions": {},
    "EncryptionAtRestOptions": {
        "Enabled": True
    },
    "NodeToNodeEncryptionOptions": {
        "Enabled": True
    },
    "AdvancedOptions": {
        "indices.fielddata.cache.size": "20",
        "override_main_response_version": "false",
        "indices.query.bool.max_clause_count": "1024",
        "rest.action.multi.allow_explicit_index": "true"
    },
    "DomainEndpointOptions": {
        "EnforceHTTPS": True,
        "CustomEndpointEnabled": False
    },
    "AdvancedSecurityOptions": {
        "Enabled": True,
        "InternalUserDatabaseEnabled": False,
        "MasterUserOptions": {
            "MasterUserARN": sagemaker.get_execution_role()
        }
    },
    "TagList": [],
    "OffPeakWindowOptions": {
        "Enabled": True,
        "OffPeakWindow": {
            "WindowStartTime": {
                "Hours": 0,
                "Minutes": 0
            }
        }
    },
    "SoftwareUpdateOptions": {
        "AutoSoftwareUpdateEnabled": False
    },
    "AIMLOptions": {
        "NaturalLanguageQueryGenerationOptions": {
            "DesiredState": "ENABLED"
        }
    }
}

Creating an OpenSearch Domain will take about X minutes.
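
Because domain creation is slow, it can help to poll with an upper time limit instead of looping indefinitely. The helper below is a minimal sketch of that idea; `wait_for_domain`, its default timeout, and the stub client are assumptions made for illustration, and the stub stands in for `boto3.client('opensearch')` so the example runs offline.

```python
import time
from typing import Any, Callable


def wait_for_domain(
    client: Any,
    domain_name: str,
    timeout_s: float = 3600,
    poll_s: float = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll describe_domain until processing ends and an endpoint is reported."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.describe_domain(DomainName=domain_name)["DomainStatus"]
        if not status.get("Processing") and status.get("Endpoint"):
            return status["Endpoint"]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Domain {domain_name} was not ready after {timeout_s} seconds")
        sleep(poll_s)


class StubOpenSearchClient:
    """Offline stand-in: reports 'processing' twice, then a ready domain."""

    def __init__(self) -> None:
        self.calls = 0

    def describe_domain(self, DomainName: str) -> dict:
        self.calls += 1
        if self.calls < 3:
            return {"DomainStatus": {"Processing": True}}
        return {
            "DomainStatus": {
                "Processing": False,
                "Endpoint": "search-example.us-east-1.es.amazonaws.com",
            }
        }


stub = StubOpenSearchClient()
endpoint = wait_for_domain(stub, "opensearch-rag-domain", poll_s=0, sleep=lambda _: None)
print(endpoint)
```

In the notebook you would pass the real `opensearch_client` and keep the default `sleep`.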

In [None]:
create_domain_response = opensearch_client.create_domain(**create_domain_payload)

In [None]:
# Poll once a minute until the domain has finished processing its creation.
while True:
    domain_status = opensearch_client.describe_domain(DomainName=OS_DOMAIN_NAME)["DomainStatus"]
    processing = domain_status["Processing"]
    print(f"DomainProcessingStatus: {processing}")
    # The endpoint can appear shortly after processing ends, so wait for both.
    if not processing and "Endpoint" in domain_status:
        break
    time.sleep(60)

AOS_HOST = domain_status["Endpoint"]
print(f"Your OpenSearch Endpoint is available: https://{AOS_HOST}")

In [None]:
%store AOS_HOST

## 2.3 Create SageMaker Embedding Endpoint
A **Hugging Face text embedding model (gte-base-en-v1.5)** is deployed to a SageMaker real-time endpoint using the Hugging Face Text Embeddings Inference (TEI) container. This model converts text into 768-dimensional vectors for semantic search.
### Embedding Model Deployment
- Deploy Hugging Face embedding model (gte-base-en-v1.5) on SageMaker
- Create embedding endpoint for text vectorization
- Configure content handlers for model input/output processing

Now deploy `Alibaba-NLP/gte-base-en-v1.5` as an embedding model endpoint.
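
Once the endpoint is in service, text can be converted to vectors and compared. The sketch below assumes the TEI container accepts a JSON body of the form `{"inputs": [...]}` and returns a list of vectors; verify this against the container version you deploy. The helper names and the stub client are illustrative, and the stub replaces `boto3.client("sagemaker-runtime")` so the example runs without AWS access.

```python
import io
import json
import math
from typing import Any, List


def embed_texts(runtime_client: Any, endpoint_name: str, texts: List[str]) -> List[List[float]]:
    """Call the embedding endpoint and return one vector per input text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": texts}),
    )
    return json.loads(response["Body"].read())


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class StubRuntimeClient:
    """Offline stand-in that returns tiny fake 2-dimensional vectors."""

    def invoke_endpoint(self, EndpointName: str, ContentType: str, Body: str) -> dict:
        texts = json.loads(Body)["inputs"]
        vectors = [[float(len(t)), 1.0] for t in texts]
        return {"Body": io.BytesIO(json.dumps(vectors).encode("utf-8"))}


vectors = embed_texts(StubRuntimeClient(), "hypothetical-endpoint", ["fever", "aspirin"])
print(cosine_similarity(vectors[0], vectors[1]))
```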

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

def get_embedding_image_uri(instance_type, version="1.4"):
    """Return the Hugging Face TEI image URI: GPU image for ml.g*/ml.p* instances, CPU image otherwise."""
    is_gpu = instance_type.startswith(("ml.g", "ml.p"))
    key = "huggingface-tei" if is_gpu else "huggingface-tei-cpu"
    return get_huggingface_llm_image_uri(key, version=version)

In [None]:
EMBEDDING_MODEL_NAME = "Alibaba-NLP/gte-base-en-v1.5"
embedding_instance_type = "ml.c5.4xlarge"

# Currently the latest HF TEI containers aren't supported by the Python SDK, so use the image URI directly.
# Verify that this registry account and image tag are available in your region.
#embedding_image = get_embedding_image_uri(embedding_instance_type)
embedding_image = f"683313688378.dkr.ecr.{region}.amazonaws.com/tei-cpu:2.0.1-tei1.6.0-cpu-py310-ubuntu22.04"
embedding_image

In [None]:
from sagemaker.huggingface import HuggingFaceModel

hub = {
    'HF_MODEL_ID': EMBEDDING_MODEL_NAME
}

embedding_model_for_deployment = HuggingFaceModel(
    role=role,
    env=hub,
    image_uri=embedding_image,
)

EMBED_ENDPOINT_NAME = sagemaker.utils.name_from_base("gte-base-en-v1-5")

health_check_timeout = 300

embedding_model_for_deployment.deploy(
    endpoint_name=EMBED_ENDPOINT_NAME,
    initial_instance_count=1,
    instance_type=embedding_instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    routing_config={
        "RoutingStrategy": sagemaker.enums.RoutingStrategy.LEAST_OUTSTANDING_REQUESTS
    }
)

In [None]:
print(f"Successfully deployed embedding model to the SageMaker endpoint: {EMBED_ENDPOINT_NAME}")

In [None]:
%store EMBEDDING_MODEL_NAME
%store EMBED_ENDPOINT_NAME

Now deploy `Llama 3.1 8B Instruct` to a SageMaker real-time endpoint for text generation.

In [None]:
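# JumpStart model ID for Llama 3.1 8B Instruct; "*" selects the latest available version.
model_id_llm, model_version = "meta-textgeneration-llama-3-1-8b-instruct", "*"
# Setting this to True means you accept the model's end-user license agreement (EULA).
accept_eula = True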

In [None]:
generation_instance_type = "ml.g5.4xlarge"
generation_model = JumpStartModel(
    model_id=model_id_llm,
    model_version=model_version,
    instance_type=generation_instance_type,
)

In [None]:
generation_predictor = generation_model.deploy(accept_eula=accept_eula)

In [None]:
GENERATION_ENDPOINT_NAME = generation_predictor.endpoint_name
print(f"Successfully deployed generation model to the SageMaker endpoint: {GENERATION_ENDPOINT_NAME}")

In [None]:
%store GENERATION_ENDPOINT_NAME
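
With the generation endpoint deployed, a quick smoke test helps confirm that it responds. The sketch below only builds the request and parses the response. It assumes the common text-generation schema (`inputs` plus `parameters`, with `generated_text` in the response), so check the payload format documented for the model version you deployed. The helper names and parameter defaults are illustrative.

```python
import json
from typing import Any, Dict


def build_generation_payload(prompt: str, max_new_tokens: int = 256, temperature: float = 0.1) -> Dict[str, Any]:
    """Build a request body in the assumed text-generation schema."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def extract_generated_text(result: Any) -> str:
    """Accept either a single result object or a list containing one."""
    if isinstance(result, list):
        result = result[0]
    return result["generated_text"]


payload = build_generation_payload("What is Retrieval Augmented Generation?")
print(json.dumps(payload, indent=2))

# With the real endpoint, the payload would be sent roughly like this:
# response = sm_runtime_client.invoke_endpoint(
#     EndpointName=GENERATION_ENDPOINT_NAME,
#     ContentType="application/json",
#     Body=json.dumps(payload),
# )
# answer = extract_generated_text(json.loads(response["Body"].read()))
```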