# 4: Load Docs into OpenSearch Serverless Vector Database 

- SageMaker Notebook Kernel: `conda_python3`
- SageMaker Notebook Instance Type: ml.m5d.large | ml.t3.large

In this notebook, you will load previously extracted, split, and embedded texts from the [Amazon Bedrock](https://aws.amazon.com/bedrock/) user guide into an [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html) vector database. OpenSearch is a fully open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. This notebook uses a new feature (in preview as of November 2023) named Vector Search. The vector search collection type in OpenSearch Serverless provides a scalable, high-performing similarity search capability. It lets you build modern machine learning (ML) augmented search experiences and generative artificial intelligence (AI) applications without having to manage the underlying vector database infrastructure.

## Runtime 

This notebook takes approximately 10 minutes to run.

## Contents

1. [Prerequisites](#prerequisites)
1. [Setup](#setup)
1. [Download data](#download-data)
1. [Get OpenSearch Serverless collection name](#get-opensearch-serverless-collection-name)
1. [Update the OpenSearch access policy](#update-the-opensearch-access-policy-with-the-notebook-assumed-role)
1. [Create the OpenSearch runtime client](#create-the-opensearch-runtime-client)
1. [Patch the OpenSearch client creation in LangChain](#patch-the-opensearch-client-creation-in-langchain)
1. [Load embeddings into OpenSearch](#load-embeddings-into-opensearch)
1. [Similarity search](#similarity-search)

## Prerequisites

The `amazon.titan-embed-text-v1` model must be enabled in the Amazon Bedrock console in `us-west-2`.


## Setup

Let's start by installing and importing the required packages for this notebook. 

<div class="alert alert-block alert-warning"><b>Note:</b> Verify that the notebook kernel is <code>conda_python3</code>. Also, if a module can't be imported after installation, restart the notebook kernel, then rerun the import cell.</div>

In [None]:
%pip install langchain==0.0.317 --quiet
%pip install opensearch-py==2.3.2 --quiet

In [None]:
import os
import pickle
import json
import boto3
import langchain.vectorstores.opensearch_vector_search as ovs

from pprint import pprint
from IPython.display import display
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, helpers
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import OpenSearchVectorSearch

***

Next, you will initialize the Amazon Bedrock boto3 client and the Amazon OpenSearch Serverless boto3 client. 

***


In [None]:
boto3_session = boto3.Session()
aoss_client = boto3_session.client("opensearchserverless")
region = boto3_session.region_name

bedrock_client = boto3_session.client("bedrock-runtime")
embeddings_model_id = "amazon.titan-embed-text-v1"

# Create the data directory
os.makedirs("data", exist_ok=True)

bedrock_embeddings = BedrockEmbeddings(
    client=bedrock_client, model_id=embeddings_model_id
)

print(f"boto3 region: {region}")

## Download data

To save some time, we extracted all of the content on the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/bedrock/latest/userguide) site, split the data into chunks, and generated the embeddings. Let's download the file from S3.


In [None]:
s3_data_bucket = os.getenv("ASSETS_BUCKET_NAME")
s3_data_prefix = os.getenv("ASSETS_BUCKET_PREFIX")
s3_data_uri = f"s3://{s3_data_bucket}/{s3_data_prefix}data/bedrock_user_guide_embeddings.pkl"
!aws s3 cp {s3_data_uri} ./data/bedrock_user_guide_embeddings.pkl --region {region}
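
If either environment variable is unset, the f-string above silently produces a URI containing `None`, and the copy fails with a confusing error. The following is a minimal sketch of a guard, not part of the workshop code: `build_s3_uri` is a hypothetical helper, and it assumes (as the f-string above implies) that the prefix already ends with `/`.

```python
def build_s3_uri(bucket, prefix, key="data/bedrock_user_guide_embeddings.pkl"):
    """Build the S3 URI for the embeddings file, failing early on a missing bucket."""
    if not bucket:
        raise ValueError("ASSETS_BUCKET_NAME is not set")
    # Treat a missing prefix as empty; a non-empty prefix is assumed to end with "/".
    prefix = prefix or ""
    return f"s3://{bucket}/{prefix}{key}"


# Placeholder values for illustration, not the workshop's real bucket.
print(build_s3_uri("example-bucket", "assets/"))
```

In the notebook, you could call it as `build_s3_uri(os.getenv("ASSETS_BUCKET_NAME"), os.getenv("ASSETS_BUCKET_PREFIX"))`.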


***

Open the file and print details about the contents. The file contains a dictionary with three arrays: one for the document text, one for the metadata about the document, and one for the embeddings. This data structure makes it easy to load the documents into OpenSearch.

***

In [None]:
with open(os.path.join("data", "bedrock_user_guide_embeddings.pkl"), "rb") as f:
    bedrock_user_guide_embeddings = pickle.load(f)

stats = {key: len(bedrock_user_guide_embeddings[key]) for key in bedrock_user_guide_embeddings}

print(json.dumps(stats, indent=4))
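
The three arrays are parallel: entry `i` of each one describes the same chunk, so all three should have the same length. The sketch below shows one way to check that before loading. The key names `texts`, `metadatas`, and `embeddings` are an assumption (the notebook does not list them), and `check_parallel_lists` is a hypothetical helper.

```python
def check_parallel_lists(data):
    """Return the shared length of all lists in `data`, or raise if they differ."""
    lengths = {key: len(values) for key, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"List lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Tiny stand-in for the pickle contents; key names and values are assumed.
sample = {
    "texts": ["Amazon Bedrock is a managed service.", "Titan is a model family."],
    "metadatas": [{"source": "page-1"}, {"source": "page-2"}],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}
print(check_parallel_lists(sample))
```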

## Get OpenSearch Serverless collection name

The OpenSearch collection has already been created for you. Let's query the collections and print the collection name and id. There should only be one.

In [None]:
list_collections_response = aoss_client.list_collections()
# The workshop account is expected to contain exactly one collection.
collection = list_collections_response.get("collectionSummaries")[0]
collection_id = collection.get("id")
collection_name = collection.get("name")

print(f"Collection Id: {collection_id}")
print(f"Collection Name: {collection_name}")
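
The cell above takes the first collection and assumes it is the only one. If your account might hold zero or several collections, a stricter lookup can make the failure explicit. This is a sketch under that assumption: `pick_single_collection` is a hypothetical helper, and the response dictionary is a hand-written stand-in that mirrors only the `collectionSummaries`, `id`, and `name` keys used above.

```python
def pick_single_collection(response):
    """Return (id, name) of the only collection, or raise if the count is not one."""
    summaries = response.get("collectionSummaries", [])
    if len(summaries) != 1:
        raise RuntimeError(f"Expected exactly one collection, found {len(summaries)}")
    summary = summaries[0]
    return summary["id"], summary["name"]


# Stand-in for the list_collections() response; values are placeholders.
fake_response = {"collectionSummaries": [{"id": "abc123", "name": "demo-collection"}]}
print(pick_single_collection(fake_response))
```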

## Update the OpenSearch access policy with the notebook assumed role

The OpenSearch encryption, network, and access policies were created with the collection for you, but the assumed role of the SageMaker notebook hasn't been added to the access policy yet. You need to add the notebook's assumed role to the policy principals so that the notebook can access the runtime API.


In [None]:
def update_access_policy(policy_name):
    policy_type = "data"
    policy_response = aoss_client.get_access_policy(name=policy_name, type=policy_type)
    access_policy_detail = policy_response.get("accessPolicyDetail")
    policy = access_policy_detail.get("policy")
    policy_version = access_policy_detail.get("policyVersion")
    policy_principals = policy[0].get("Principal")
    assumed_role_arn = boto3.client("sts").get_caller_identity().get("Arn")
    update_needed = False
    if assumed_role_arn not in policy_principals:
        policy_principals.append(assumed_role_arn)
        update_needed = True
    if update_needed:
        print("Updating the access policy with the notebook assumed role")
        response = aoss_client.update_access_policy(
            name=policy_name, policy=json.dumps(policy), policyVersion=policy_version, type=policy_type
        )
        print(response)
    else:
        print("Notebook assumed role already exists in the policy, skipping update")


update_access_policy(f"{collection_name}-access")
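
The core of the function above is an idempotent "add the principal if it is missing" step. The sketch below isolates that step so it can be tried without AWS access. `add_principal` is a hypothetical helper, the policy is a simplified stand-in that keeps only the `Principal` list used above, and the ARNs are placeholders.

```python
import copy


def add_principal(policy, principal_arn):
    """Return (new_policy, changed) without mutating the input policy."""
    updated = copy.deepcopy(policy)
    # As in the cell above, only the first statement of the policy is considered.
    principals = updated[0].setdefault("Principal", [])
    if principal_arn in principals:
        return updated, False
    principals.append(principal_arn)
    return updated, True


# Simplified stand-in policy; real policies contain more fields.
sample_policy = [{"Principal": ["arn:aws:iam::111122223333:role/Admin"]}]
notebook_arn = "arn:aws:sts::111122223333:assumed-role/NotebookRole/SageMaker"

new_policy, changed = add_principal(sample_policy, notebook_arn)
print(changed, new_policy[0]["Principal"])
```

Running the helper a second time on `new_policy` reports no change, which is why the notebook cell can be rerun safely.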

## Create the OpenSearch runtime client

Create the `OpenSearch` runtime client. The `AWSV4SignerAuth` class signs the requests with [AWS Signature V4](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html), which lets you use AWS IAM role credentials when invoking the endpoint.

In [None]:
service = "aoss"
host = f"{collection_id}.{region}.aoss.amazonaws.com"

credentials = boto3_session.get_credentials()
http_auth = AWSV4SignerAuth(credentials, region, service)

aoss_runtime_client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=http_auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300,
    pool_maxsize=20,
)

## Patch the OpenSearch client creation in LangChain

The `OpenSearchVectorSearch` class from LangChain doesn't allow you to pass the OpenSearch client you just created. Since you want to use AWS IAM role-based credentials, you will patch the module-level `_get_opensearch_client` function so that it returns the pre-configured client whenever LangChain calls it.

In [None]:
def get_opensearch_client(opensearch_url: str, **kwargs):
    return aoss_runtime_client

ovs._get_opensearch_client = get_opensearch_client
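
The assignment above replaces the function for the rest of the kernel session, which is what the following cells rely on. For comparison, the sketch below shows a scoped alternative using the standard library's `unittest.mock.patch.object`, which restores the original when the `with` block exits. It runs against a stand-in namespace rather than the real `ovs` module, and all names in it are illustrative.

```python
import types
from unittest import mock

# Stand-in for the `ovs` module; the real one is imported from LangChain.
fake_ovs = types.SimpleNamespace(
    _get_opensearch_client=lambda opensearch_url, **kwargs: f"new client for {opensearch_url}"
)

# Stand-in for the pre-configured OpenSearch client.
preconfigured_client = object()


def get_preconfigured_client(opensearch_url, **kwargs):
    """Ignore the URL and return the client that was built earlier."""
    return preconfigured_client


# Inside the block, the patched function is used.
with mock.patch.object(fake_ovs, "_get_opensearch_client", get_preconfigured_client):
    patched_result = fake_ovs._get_opensearch_client("ignored-url")

# After the block, the original function is back in place.
restored_result = fake_ovs._get_opensearch_client("example-host")
print(patched_result is preconfigured_client, restored_result)
```

A scoped patch is mainly useful in tests. In this notebook the patch must stay active for later cells, so the plain assignment is the simpler fit.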


## Load embeddings into OpenSearch 

The `OpenSearchVectorSearch` class from LangChain does a lot of the heavy lifting for us. The `from_embeddings` class method creates the index, uploads the texts, metadata, and embeddings to the database, and then returns an instance of the class that we can use for similarity searching. One of the parameters to the class is the `BedrockEmbeddings` we used in an earlier notebook. The embeddings model is used when loading new texts as well as when searching for documents based on text. If you want to learn more about the class, see [OpenSearchVectorSearch](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html). 

There are some restrictions on which engines and similarity algorithms you can use with the vector search component of OpenSearch Serverless. To learn more, see the [OpenSearch Developer Guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html). For this workshop, you will create the index using cosine similarity with the `nmslib` engine.

In [None]:
index_name = "bedrock-docs"

db = OpenSearchVectorSearch.from_embeddings(
    opensearch_url=host,
    http_auth=http_auth,
    index_name=index_name,
    engine="nmslib",
    space_type="cosinesimil",
    embedding=bedrock_embeddings,
    **bedrock_user_guide_embeddings
)
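
To make the `cosinesimil` space type more concrete, here is a minimal pure-Python sketch of cosine similarity between two vectors. It illustrates the measure only. It is not how OpenSearch computes or scales its scores internally, and the short vectors are made-up examples rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Same direction, orthogonal, and opposite direction.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Because the measure depends only on direction, two chunks with similar meaning can score highly even if their embedding vectors differ in magnitude.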

***

Query the index and take a look at the contents.

***

In [None]:
try:
    response = aoss_runtime_client.indices.get(index_name)
    print(json.dumps(response, indent=2))
except Exception as ex:
    print(ex)
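
The response can be long, so it may help to pull out just the vector fields and their dimensions. The sketch below does this for a hand-written stand-in response. The field names and the nesting shown are assumptions about what the index looks like, `find_knn_fields` is a hypothetical helper, and the dimension of 1536 matches the output size of `amazon.titan-embed-text-v1`.

```python
def find_knn_fields(index_response, index_name):
    """Map each knn_vector field name to its dimension for the given index."""
    properties = index_response[index_name]["mappings"].get("properties", {})
    return {
        name: spec.get("dimension")
        for name, spec in properties.items()
        if spec.get("type") == "knn_vector"
    }


# Illustrative stand-in; the real response contains more settings and fields.
sample_response = {
    "bedrock-docs": {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "vector_field": {"type": "knn_vector", "dimension": 1536},
            }
        }
    }
}
print(find_knn_fields(sample_response, "bedrock-docs"))
```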

## Similarity search

Run a couple of queries and review the results. The `similarity_search_with_score` method returns up to `k` documents, where `k` is the second parameter of the method, along with their similarity scores. We can use the scores to eliminate results that return with low similarity.

<div class="alert alert-block alert-warning"><b>Note:</b> It takes a minute or two after uploading the texts to OpenSearch for them to be indexed and available for query. If you run the cell below and it returns no results, wait a minute, then run the cell again.</div>

In [None]:
db.similarity_search_with_score("What is Amazon Bedrock?", 1)
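
The wait-and-rerun advice in the note above can also be automated. The following is a minimal sketch, assuming an empty result list means the documents are not indexed yet. `search_with_retry` is a hypothetical helper, and the default attempt count and delay are arbitrary choices.

```python
import time


def search_with_retry(search_fn, attempts=5, delay_seconds=30):
    """Call search_fn until it returns a non-empty result or attempts run out."""
    for attempt in range(attempts):
        results = search_fn()
        if results:
            return results
        # Sleep only between attempts, not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return []


# Stand-in search that returns nothing on the first two calls.
demo_calls = {"count": 0}


def demo_search():
    demo_calls["count"] += 1
    return ["result"] if demo_calls["count"] >= 3 else []


print(search_with_retry(demo_search, attempts=5, delay_seconds=0))
```

In the notebook, you could pass `lambda: db.similarity_search_with_score("What is Amazon Bedrock?", 1)` as `search_fn`.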


In [None]:
db.similarity_search_with_score("What large language models are available on bedrock?", 2)
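
As mentioned at the start of this section, the scores can be used to drop weak matches. The sketch below filters a list of `(document, score)` pairs by a minimum score. It assumes that a higher score means a closer match, the threshold of 0.5 is an arbitrary illustration rather than a recommended value, and the documents are plain strings standing in for LangChain `Document` objects.

```python
def filter_by_score(results, min_score):
    """Keep only the (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Stand-in results; real results pair Document objects with float scores.
sample_results = [
    ("Amazon Bedrock is a fully managed service.", 0.82),
    ("Unrelated chunk about pricing footnotes.", 0.31),
]
print(filter_by_score(sample_results, min_score=0.5))
```

A suitable threshold depends on the embeddings model and the data, so it is worth comparing scores for related and unrelated queries before choosing one.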

***

Try a query for completely unrelated input. What do you notice?

***

In [None]:
db.similarity_search_with_score("Who was the main actor in Jurassic Park?", 2)

***

Try your own queries below about Amazon Bedrock and see if the results match your understanding. If you are unfamiliar with Bedrock, navigate to the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/bedrock/latest/userguide) and peek around.

***

In [None]:
db.similarity_search_with_score("<question-here>", 2)

## Notebook complete

Now that we have all the data loaded into OpenSearch Serverless, move to the next notebook to learn how to tie the model and data together using LangChain.
