# Deploy a Vector Search Microservice with Amazon SageMaker

This notebook shows how to deploy a vector search microservice on an Amazon SageMaker real-time endpoint. The microservice uses FAISS for efficient similarity search, together with LangChain's document handlers and its FAISS wrapper. The embedding model that powers the similarity search is retrieved from the HuggingFace model hub.

# Notebook Setup

This section imports necessary AWS SDK libraries like boto3 and sagemaker. It also sets up the boto3 session to point to the right AWS credentials and resources.

The `local` variable indicates whether the notebook is running outside SageMaker Studio. If `local` is `True`, boto3 is configured from the `AWS_PROFILE` environment variable and the role is read from the `SAGEMAKER_ROLE` environment variable. Otherwise, the role is retrieved with `sagemaker.get_execution_role()`.

Note: the `SAGEMAKER_ROLE` variable must refer to a SageMaker execution role for this to work.

In [None]:
%pip install -r src/requirements.txt

In [None]:
import boto3
import sagemaker
import os

In [None]:
# set to True when running locally rather than in SageMaker Studio
local = False
if local:
    boto3.setup_default_session(profile_name=os.environ['AWS_PROFILE'])
    role = os.environ['SAGEMAKER_ROLE']
else:
    role = sagemaker.get_execution_role()
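
The role lookup above raises a bare `KeyError` when `SAGEMAKER_ROLE` is missing. A minimal sketch of a friendlier variant follows; `resolve_role` is a hypothetical helper name, not part of boto3 or the SageMaker SDK, and the `env` parameter exists only so the function can be exercised without touching real environment variables.

```python
import os


def resolve_role(local: bool, env=os.environ) -> str:
    """Return the role to use for SageMaker calls.

    Hypothetical helper for illustration; it mirrors the branch in the cell above.
    """
    if local:
        role = env.get("SAGEMAKER_ROLE")
        if not role:
            raise RuntimeError(
                "SAGEMAKER_ROLE is not set; export a SageMaker execution role first."
            )
        return role
    # Imported lazily so the local branch works without the SDK installed.
    import sagemaker

    return sagemaker.get_execution_role()
```

Calling `resolve_role(local)` could then replace the `if`/`else` block, with the same result when the environment is configured correctly.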

# Embed Documents with LangChain and FAISS

This section prepares the example document (a sample of the Amazon SageMaker FAQ docs), embeds it with a HuggingFace model, and saves the embeddings to a FAISS index.

This is all done with LangChain wrappers which we import below.

In [None]:
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import FAISS

## Prepare Documents

First, download the dataset to your local environment using the cell below. You can substitute any other text data you see fit.

In [None]:
s3_path = "s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv"
!aws s3 cp $s3_path ./data/sagemaker_faqs.csv

Now we load the example CSV file containing SageMaker FAQs and wrap the text with the Document class from LangChain.

Since the document is large and we would not want to pass the whole FAQ list as context to a large language model, we split it into smaller chunks using the `CharacterTextSplitter`. This matters because smaller chunks reduce the input token count for a retrieval-augmented generation (RAG) system.

In [None]:
# load the sagemaker FAQ list
with open('./data/sagemaker_faqs.csv') as f:
    doc = f.read()

# wrap the raw text in a LangChain Document
docs = [Document(page_content=doc)]

# split documents into chunks
text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=1000,
    chunk_overlap=0,
)
split_docs = text_splitter.split_documents(docs)
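
To make the chunking step concrete, here is a simplified pure-Python sketch of separator-based splitting with a size limit and no overlap. It is an illustration of the idea only, not LangChain's actual implementation, and `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(text: str, separator: str = "\n", chunk_size: int = 1000) -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept as its own oversized chunk.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty lines
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            # The next piece would overflow the chunk, so start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

With `separator='\n'`, each chunk ends on a line boundary, so a FAQ row is not cut in half as long as it fits within `chunk_size`.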

## Create Vector Store

Finally, we create the FAISS index from the documents using the embeddings module. This stores the document embeddings for efficient similarity search later.

Storing the embeddings in a vector store like FAISS allows us to quickly find similar passages by doing nearest neighbor search on the embedding index. This is the foundation for dense retrieval in systems like RAG.

In [None]:
# instantiate the embedding model
model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages: "
)

# create vector store
vs = FAISS.from_documents(split_docs, hf)
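
The nearest-neighbor idea behind the index can be sketched as a brute-force search in pure Python. This is a toy illustration with made-up two-dimensional vectors; FAISS performs the same kind of lookup far more efficiently over real embeddings, and the function names here are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, vectors, k=3):
    """Return the indices of the k vectors closest to the query."""
    scored = sorted((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]


# Toy "embeddings" for three passages and one query.
passages = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
closest = nearest_neighbors([0.9, 0.9], passages, k=2)
```

In the real vector store, each index returned by the search maps back to a text chunk, which is what the retrieval step hands to a RAG system.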

## Save the Vector Store Locally

Once the documents have been indexed into FAISS, we save the index locally in a directory called `faiss_vector_store`. As shown below, the `load_local` method recreates an in-memory FAISS index from this persisted vector store.

In [None]:
vs.save_local('faiss_vector_store')

In [None]:
vs = FAISS.load_local("faiss_vector_store", hf)

# Deploy the Vector Store to a SageMaker Endpoint

Now that we have created our vector store, let's go ahead and deploy it to a SageMaker endpoint!

First, let's start by creating a SageMaker session and specifying an S3 bucket location to store our vector search index.

In [None]:
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
prefix = 'faiss-demo-deploy'
s3_uri = f's3://{bucket}/{prefix}'

## Package the vector store as a tar file in S3

The local directory containing the FAISS index we created earlier now has to be packaged into a tar file because SageMaker expects all model objects in tar.gz format. This process is similar to packaging a serialized machine learning model for SageMaker deployment; only in this case, we are using a FAISS index as our "model".

In [None]:
!tar -czvf model.tar.gz ./faiss_vector_store
!tar -tvf model.tar.gz
model_uri = sess.upload_data('model.tar.gz', bucket = bucket, key_prefix=f"{prefix}/model")
!rm model.tar.gz
!rm -rf faiss_vector_store
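
If shell commands are not available (for example, on Windows), the packaging step could be done with Python's standard `tarfile` module instead. This is a sketch under that assumption; `package_directory` is a hypothetical helper, and it must run before the directory is deleted.

```python
import tarfile
from pathlib import Path


def package_directory(source_dir: str, archive_path: str = "model.tar.gz") -> list[str]:
    """Write source_dir into a gzip-compressed tar archive and return the member names."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the directory name as the top-level entry in the archive.
        tar.add(source, arcname=source.name)
    # Re-open the archive to list its contents, like `tar -tvf`.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

For example, `package_directory('faiss_vector_store')` would produce `model.tar.gz` with the index directory at its root, ready for `sess.upload_data`.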

## Create a SageMaker Model Object

Once the model artifact has been uploaded to S3, you use the SageMaker SDK to create a model object that references three things: the model artifact in S3, one of SageMaker's PyTorch inference containers, and the inference code stored in the `src` directory of this repository. The `inference.py` script is executed at runtime, while `requirements.txt` tells SageMaker which libraries to install inside its Docker container.

In [None]:
import time
image = sagemaker.image_uris.retrieve(
    framework='pytorch',
    region=sess.boto_region_name,  # match the session's region instead of hardcoding one
    image_scope='inference',
    version='1.12',
    instance_type='ml.m5.2xlarge',
)

model_name = f'faiss-vs-{int(time.time())}'
faiss_model_sm = sagemaker.model.Model(
    model_data=model_uri,
    image_uri=image,
    role=role,
    entry_point="inference.py",
    source_dir='src',
    name=model_name,
)
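
The contents of `src/inference.py` are not shown in this notebook. As a rough guide, the SageMaker PyTorch container lets a script override the `model_fn`, `input_fn`, `predict_fn`, and `output_fn` hooks, so a handler for this service might look like the sketch below. Everything inside the functions is an assumption: the archive layout, the request fields (`text`, `k`), and the response shape are inferred from the other cells, not taken from the real script.

```python
import json
import os


def model_fn(model_dir):
    """Load the FAISS index extracted from model.tar.gz (assumed layout)."""
    # Lazy imports; the embedding settings must match those used at indexing time.
    from langchain.embeddings import HuggingFaceBgeEmbeddings
    from langchain.vectorstores import FAISS

    embeddings = HuggingFaceBgeEmbeddings(
        model_name="BAAI/bge-small-en-v1.5",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": False},
        query_instruction="Represent this sentence for searching relevant passages: ",
    )
    return FAISS.load_local(os.path.join(model_dir, "faiss_vector_store"), embeddings)


def input_fn(request_body, request_content_type):
    """Parse the JSON request into the query text and result count."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    data = json.loads(request_body)
    return {"text": data["text"], "k": int(data.get("k", 3))}


def predict_fn(input_data, model):
    """Run the similarity search and return plain dictionaries."""
    docs = model.similarity_search(input_data["text"], k=input_data["k"])
    return [{"page_content": d.page_content, "metadata": d.metadata} for d in docs]


def output_fn(prediction, accept):
    """Serialize the search results as JSON (the assumed response format)."""
    return json.dumps(prediction)
```

Splitting the work across the four hooks keeps parsing, search, and serialization separately testable, since only `model_fn` needs the index and the embedding model.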

## Deploy the Vector Store Endpoint

The model object can be deployed to SageMaker with the `deploy` method. We use a CPU instance in this case. In a production system, scale this instance vertically for faster vector search and horizontally to handle load as required.

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name = f'faiss-endpoint-{int(time.time())}'
faiss_model_sm.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge",
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    wait=True,
)

## Test Retrieval from SageMaker Endpoint

Once the model has been deployed, you can connect to the endpoint using the `Predictor` class in the SageMaker SDK. The predictor's `predict` method then searches the input text against our SageMaker FAQ index.

In [None]:
sagemaker_vector_store = sagemaker.predictor.Predictor(endpoint_name)
assert sagemaker_vector_store.endpoint_context().properties['Status'] == 'InService'

In [None]:
import json
payload = json.dumps({
    "text": "what is a shadow test?",
    "k": 3,
})

out = sagemaker_vector_store.predict(
    payload,
    initial_args={"ContentType": "application/json", "Accept": "application/json"}
)
out = json.loads(out)

The final output contains the text chunks that most closely match the input question. Just like that, we have a retrieval API up and running that can power a RAG-based application!

In [None]:
out
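
Applications that do not use the SageMaker SDK can call the same endpoint through the boto3 `sagemaker-runtime` client. The sketch below assumes the request and response formats used in the cells above; `query_endpoint` is a hypothetical helper, and the client is passed in as an argument so the function is easy to test.

```python
import json


def query_endpoint(runtime_client, endpoint_name, text, k=3):
    """Send a search request to the endpoint and return the decoded JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"text": text, "k": k}),
    )
    # The response body is a stream, so it has to be read before decoding.
    return json.loads(response["Body"].read())
```

A typical call would create the client with `boto3.client('sagemaker-runtime')` and then run `query_endpoint(client, endpoint_name, 'what is a shadow test?')`.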

## Optional: Clean Up Endpoint

Once you have finished testing your endpoint, you can delete it. This is good practice, because removing experimental endpoints when they are not in use reduces your SageMaker costs.

In [None]:
sagemaker_vector_store.delete_endpoint()