# Semantic Search with OpenSearch Neural Search 

We will use the Neural Search plugin in OpenSearch to implement semantic search.

### 1. Check PyTorch Version


As in the previous modules, let's import PyTorch and confirm that we have the latest version.

In [None]:
import torch
print(torch.__version__)

### 2. Retrieve notebook variables

The line below will retrieve your shared variables from the previous notebook.

In [None]:
%store -r

### 3. Install OpenSearch ML Python library

In [None]:
!pip install opensearch-py-ml
!pip install accelerate
!pip install deprecated
!pip install requests-aws4auth

Now restart the kernel by running the cell below.

In [None]:
from IPython.display import display_html
def restartkernel() :
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)
restartkernel()

### 4. Import library



In [None]:
import boto3
import re
import time

### 5. Prepare Headset PQA data
This module uses the headset subset of the Amazon Product Question and Answer (PQA) dataset: https://registry.opendata.aws/amazon-pqa/

We already downloaded the dataset in Module 2; the cell below downloads the headset file again in case it is not available locally. We then ingest 1000 rows of the data into a Pandas data frame.

In [None]:
!aws s3 cp --no-sign-request s3://amazon-pqa/amazon_pqa_headsets.json ./amazon-pqa/amazon_pqa_headsets.json

In [None]:
import json
import pandas as pd

def load_pqa(file_name, number_rows=1000):
    """Load the first `number_rows` question/answer pairs into a data frame."""
    rows = []
    with open(file_name) as f:
        for line in f:
            if len(rows) == number_rows:
                break
            data = json.loads(line)
            # Keep the question text and the text of its first answer.
            rows.append((data['question_text'], data['answers'][0]['answer_text']))
    return pd.DataFrame(rows, columns=('question', 'answer'))


qa_list = load_pqa('amazon-pqa/amazon_pqa_headsets.json',number_rows=1000)
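
The loader above reads `answers[0]` for every record. As a minimal sketch, assuming some records might carry an empty `answers` list (an assumption, not something verified against the dataset here), the parsing step could skip such records. The function name `parse_pqa_lines` and the sample records are hypothetical and only illustrate the idea.

```python
import json

def parse_pqa_lines(lines, number_rows=1000):
    """Collect up to number_rows (question, answer) pairs from JSON lines."""
    rows = []
    for line in lines:
        if len(rows) == number_rows:
            break
        line = line.strip()
        if not line:
            continue
        data = json.loads(line)
        answers = data.get("answers") or []
        if not answers:
            # Assumption: questions without any answer are skipped.
            continue
        rows.append((data["question_text"], answers[0]["answer_text"]))
    return rows

# Hypothetical sample records in the same shape the loader expects.
sample = [
    '{"question_text": "does this work with xbox?", "answers": [{"answer_text": "yes"}]}',
    '{"question_text": "is it wireless?", "answers": []}',
]
pairs = parse_pqa_lines(sample)
print(pairs)
```

The resulting list of tuples can be passed to `pd.DataFrame(pairs, columns=('question', 'answer'))` to obtain the same frame layout used in the rest of the notebook.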



### 6. Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection to the OpenSearch cluster.

Note: if you're using a region other than us-east-1, please update the region in the code below.

#### Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store them in `outputs` to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
import boto3

cfn = boto3.client('cloudformation')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search"

outputs = get_cfn_outputs(cloudformation_stack_name)

bucket = outputs['s3BucketTraining']
aos_host = outputs['OpenSearchDomainEndpoint']

outputs

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3
import json

kms = boto3.client('secretsmanager')
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])

region = 'us-east-1' 

auth = (aos_credentials['username'], aos_credentials['password'])

index_name = 'nlp_pqa'

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### 7. Register Model Groups and External Connector 

Initialize OpenSearch authentication with AWS auth.

In [None]:
import boto3
import requests 
from requests_aws4auth import AWS4Auth

service = 'es'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)


Register model group

---
### Note: Before running this cell, make sure you have completed the steps in the lab instruction "Map the ML role in OpenSearch Dashboards". 

If you have not completed those steps, the request fails with a "403" error and a message like "There is error in creating connector{"error":{"root_cause":[{"type":"security_exception","reason":"no permissions for ...". 


In [None]:
path = '_plugins/_ml/model_groups/_register'
url = 'https://' + aos_host + '/' + path

payload = {
  "name": "remote_model_groups_for_embedding",
  "description": "A model group for remote models"
}

headers = {"Content-Type": "application/json"}

r = requests.post(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
if r.status_code == 200:
    data = json.loads(r.text)

    model_group_id = data['model_group_id']
    print("model group id:" + model_group_id)
else:
    raise Exception("There is error in creating model groups" + str(r.text))
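
The POST-then-check pattern in the cell above repeats for the connector and model registration calls later in this notebook. As a hedged sketch, it could be factored into one helper. The name `post_ml_api` is hypothetical, and the `post` argument is injected so the helper can be exercised without a cluster; in the notebook it would be `requests.post`.

```python
import json

def post_ml_api(post, host, path, payload, auth=None):
    """POST a JSON payload to an ML Commons path and return the parsed response."""
    url = "https://" + host + "/" + path
    response = post(url, auth=auth, json=payload,
                    headers={"Content-Type": "application/json"})
    if response.status_code != 200:
        raise RuntimeError("Request to " + path + " failed: " + str(response.text))
    return json.loads(response.text)

# Stand-ins so the sketch runs offline; they are not part of any library.
class _FakeResponse:
    status_code = 200
    text = '{"model_group_id": "example-id"}'

def _fake_post(url, auth=None, json=None, headers=None):
    return _FakeResponse()

result = post_ml_api(_fake_post, "example-host",
                     "_plugins/_ml/model_groups/_register", {"name": "demo"})
print(result["model_group_id"])
```

In the notebook, a call such as `post_ml_api(requests.post, aos_host, path, payload, auth=awsauth)` would play the same role as the cell above.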

Uncomment the following code if you want to delete a model group. Replace `{model_group_id}` with the ID of the model group you want to delete.

In [None]:
# path = '_plugins/_ml/model_groups/{model_group_id}'
# url = 'https://' + aos_host + '/' + path


# headers = {"Content-Type": "application/json"}

# r = requests.delete(url, auth=awsauth, headers=headers)
# print(r.text)

Get account id

In [None]:
account_id = boto3.client('sts').get_caller_identity().get('Account')
print(account_id)

### Register SageMaker Connector


---

Several parameters are important when registering the SageMaker connector. Please refer to [OpenSearch Connector blueprints](https://opensearch.org/docs/latest/ml-commons-plugin/extensibility/blueprints/#configuration-options) for more information.

* pre_process_function: Function used to pre-process data before sending it to the embedding model. An example script that pre-processes the input data:
```
    StringBuilder builder = new StringBuilder();
    builder.append("\"");
    builder.append(params.text_docs[0]);
    builder.append("\"");
    def parameters = "{" +"\"inputs\":" + builder + "}";
    return "{" +"\"parameters\":" + parameters + "}";
```

* request_body: Defines the data structure sent to the embedding model. Example input data for the GPT-J embedding model:

```
{ "text_inputs": "${parameters.inputs}"}

```

* post_process_function: Function used to post-process the embedding data returned by the ML model. An example script that post-processes the model output:
```
    def name = "sentence_embedding";
    def dataType = "FLOAT32";
    if (params.embedding == null || params.embedding.length == 0) {
        return null;
    }
    def shape = [params.embedding[0].length];
    def json = "{" +
               "\"name\":\"" + name + "\"," +
               "\"data_type\":\"" + dataType + "\"," +
               "\"shape\":" + shape + "," +
               "\"data\":" + params.embedding[0] +
               "}";
    return json;
```
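
To make the data flow through these three parameters easier to follow, here is a rough Python imitation of what the scripts above do. It is only an illustration: the connector runs the Painless scripts inside OpenSearch, and the function names and the naive string substitution below are assumptions made for this sketch (for example, it does not escape quotes in the input text).

```python
import json

def pre_process(text_docs):
    # Mirrors the pre-process script: wrap the first document as parameters.inputs.
    return {"parameters": {"inputs": text_docs[0]}}

def fill_request_body(template, parameters):
    # Naive stand-in for the ${parameters.<name>} substitution in request_body.
    body = template
    for key, value in parameters.items():
        body = body.replace("${parameters." + key + "}", str(value))
    return body

def post_process(embedding):
    # Mirrors the post-process script: describe the first vector as a tensor.
    if not embedding:
        return None
    vector = embedding[0]
    return {
        "name": "sentence_embedding",
        "data_type": "FLOAT32",
        "shape": [len(vector)],
        "data": vector,
    }

request_template = '{ "text_inputs": "${parameters.inputs}"}'
params = pre_process(["this is test"])["parameters"]
request_body = fill_request_body(request_template, params)
tensor = post_process([[0.1, 0.2, 0.3]])  # made-up 3-dimensional embedding

print(request_body)
print(json.dumps(tensor))
```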


In [None]:

path = '_plugins/_ml/connectors/_create'
url = 'https://' + aos_host + '/' + path

sagemaker_url = "https://runtime.sagemaker." + region + ".amazonaws.com/endpoints/" + outputs["EmbeddingEndpointName"] + "/invocations"
role_arn = "arn:aws:iam::" + account_id + ":role/opensearch-sagemaker-role"

payload = {
   "name": "sagemaker embedding",
   "description": "Remote connector for Sagemaker embedding model",
   "version": 1,
   "protocol": "aws_sigv4",
   "credential": {
      "roleArn": role_arn
   },
   "parameters": {
      "region": region,
      "service_name": "sagemaker"
   },
   "actions": [
      {
         "action_type": "predict",
         "method": "POST",
         "headers": {
            "content-type": "application/json"
         },
         "url": sagemaker_url,
         "pre_process_function": '\n    StringBuilder builder = new StringBuilder();\n    builder.append("\\"");\n    builder.append(params.text_docs[0]);\n    builder.append("\\"");\n    def parameters = "{" +"\\"inputs\\":" + builder + "}";\n    return "{" +"\\"parameters\\":" + parameters + "}";\n    ', 
         "request_body": "{ \"text_inputs\": \"${parameters.inputs}\"}",
         "post_process_function": '\n    def name = "sentence_embedding";\n    def dataType = "FLOAT32";\n    if (params.embedding == null || params.embedding.length == 0) {\n        return null;\n    }\n    def shape = [params.embedding[0].length];\n    def json = "{" +\n               "\\"name\\":\\"" + name + "\\"," +\n               "\\"data_type\\":\\"" + dataType + "\\"," +\n               "\\"shape\\":" + shape + "," +\n               "\\"data\\":" + params.embedding[0] +\n               "}";\n    return json;\n    '

      }
   ]
}
headers = {"Content-Type": "application/json"}

r = requests.post(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
if r.status_code == 200:
    data = json.loads(r.text)

    sagemaker_connector_id = data['connector_id']
    print("SageMaker connector id:" + sagemaker_connector_id)
else:
    raise Exception("There is error in creating connector" + str(r.text))

### Register Bedrock Connector

---

Registering the Bedrock connector is similar to registering the SageMaker connector; only some parameters differ. Please refer to [Amazon Bedrock connector](https://opensearch.org/docs/latest/ml-commons-plugin/extensibility/connectors/#amazon-bedrock-connector) for how to register a Bedrock connector.


In [None]:

path = '_plugins/_ml/connectors/_create'
url = 'https://' + aos_host + '/' + path

bedrock_url = "https://bedrock-runtime." + region + ".amazonaws.com/model/amazon.titan-embed-text-v1/invoke"
role_arn = "arn:aws:iam::" + account_id + ":role/opensearch-sagemaker-role"

payload = {
  "name": "Amazon Bedrock Connector: embedding",
  "description": "The connector to bedrock Titan embedding model",
  "version": 1,
  "protocol": "aws_sigv4",
  "parameters": {
    "region": region,
    "service_name": "bedrock"
  },
  "credential": {
    "roleArn": role_arn
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": bedrock_url,
      "headers": {
        "content-type": "application/json",
        "x-amz-content-sha256": "required"
      },
      "pre_process_function": "\n    StringBuilder builder = new StringBuilder();\n    builder.append(\"\\\"\");\n    String first = params.text_docs[0];\n    builder.append(first);\n    builder.append(\"\\\"\");\n    def parameters = \"{\" +\"\\\"inputText\\\":\" + builder + \"}\";\n    return  \"{\" +\"\\\"parameters\\\":\" + parameters + \"}\";",
      "request_body": "{ \"inputText\": \"${parameters.inputText}\" }",
      "post_process_function": "\n      def name = \"sentence_embedding\";\n      def dataType = \"FLOAT32\";\n      if (params.embedding == null || params.embedding.length == 0) {\n        return params.message;\n      }\n      def shape = [params.embedding.length];\n      def json = \"{\" +\n                 \"\\\"name\\\":\\\"\" + name + \"\\\",\" +\n                 \"\\\"data_type\\\":\\\"\" + dataType + \"\\\",\" +\n                 \"\\\"shape\\\":\" + shape + \",\" +\n                 \"\\\"data\\\":\" + params.embedding +\n                 \"}\";\n      return json;\n    "
    }
  ]
}
headers = {"Content-Type": "application/json"}

r = requests.post(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
if r.status_code == 200:
    data = json.loads(r.text)

    bedrock_connector_id = data['connector_id']
    print("Bedrock connector id:" + bedrock_connector_id)
else:
    raise Exception("There is error in creating connector" + str(r.text))

### Select Embedding Model

Here you can choose SageMaker or Bedrock for embedding. After the dropdown is shown, select the model you will use for the following steps.

Depending on how this lab environment was created, Bedrock may or may not be available. If you don't have Bedrock in your environment, set `is_bedrock_available` to False.


In [None]:
is_bedrock_available=True

In [None]:
from ipywidgets import Dropdown

llm_selection = [
    "SageMaker",
    "Bedrock",
]

llm_dropdown = Dropdown(
    options=llm_selection,
    value="SageMaker",
    description="Select embedding model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(llm_dropdown)

In [None]:
llm_category = llm_dropdown.value

if not is_bedrock_available:
    llm_category = "SageMaker"

In [None]:
print("You selected {0} as embedding model".format(llm_category))

In [None]:
match llm_category:
    case "SageMaker":
        connector_id = sagemaker_connector_id
    case "Bedrock":
        connector_id = bedrock_connector_id
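
The `match` statement used above requires Python 3.10 or later, and it leaves `connector_id` unset if the category is neither of the two cases. As a small sketch, a dictionary lookup does the same selection on older interpreters and fails loudly on an unexpected value. The helper name `pick_connector` and the placeholder IDs are hypothetical.

```python
def pick_connector(llm_category, connector_ids):
    """Return the connector ID for the selected embedding model."""
    try:
        return connector_ids[llm_category]
    except KeyError:
        raise ValueError("Unknown embedding model: " + repr(llm_category)) from None

# Placeholder IDs; in the notebook these would be sagemaker_connector_id
# and bedrock_connector_id from the registration cells.
connector_ids = {
    "SageMaker": "sagemaker-connector-id-placeholder",
    "Bedrock": "bedrock-connector-id-placeholder",
}

selected = pick_connector("Bedrock", connector_ids)
print(selected)
```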


In [None]:
print("connector id: " + connector_id)


### 8. Register Remote Model


Register a remote model with the connector and the model group.

In [None]:
path = '_plugins/_ml/models/_register'
url = 'https://' + aos_host + '/' + path

sagemaker_payload = {
    "name": "sagemaker-opensearch-embedding",
    "function_name": "remote",
    "model_group_id": model_group_id,
    "description": "opensearch gpt-j embedding",
    "connector_id": connector_id
}

bedrock_payload = {
    "name": "bedrock-opensearch-embedding",
    "function_name": "remote",
    "model_group_id": model_group_id,
    "description": "bedrock titan embedding",
    "connector_id": connector_id
}

match llm_category:
    case "SageMaker":
        payload = sagemaker_payload
    case "Bedrock":
        payload = bedrock_payload
        
headers = {"Content-Type": "application/json"}

r = requests.post(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
if r.status_code == 200:
    data = json.loads(r.text)

    model_id = data['model_id']
    print(model_id)
else:
    raise Exception("There is error in registering model" + str(r.text))

### 9. Load the model for inference.

In [None]:
from opensearch_py_ml.ml_commons import MLCommonClient

ml_client = MLCommonClient(aos_client)

load_model_output = ml_client.deploy_model(model_id)

print(load_model_output)

### 10. Get detailed model information and test the model

In [None]:
model_info = ml_client.get_model_info(model_id)

print(model_info)

Test the remote embedding model 

In [None]:
path = '_plugins/_ml/models/' + model_id + '/_predict'
url = 'https://' + aos_host + '/' + path


sagemaker_payload = {
  "parameters": {
    "inputs": "this is test"
  }
}

bedrock_payload = {
  "parameters": {
    "inputText": "this is test"
  }
}

match llm_category:
    case "SageMaker":
        payload = sagemaker_payload
    case "Bedrock":
        payload = bedrock_payload
        
headers = {"Content-Type": "application/json"}

r = requests.post(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
if r.status_code == 200:
    data = json.loads(r.text)
    print(data)
else:
    raise Exception("There is error in calling model" + str(r.text))
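
Beyond printing the response, it can be useful to check that the returned vector has the dimension the index will expect (4096 for SageMaker, 1536 for Bedrock, as used in the index definitions later in this notebook). The sketch below assumes the predict response nests the tensor under `inference_results[0].output[0]`; that layout is an assumption to verify against the actual printed response, and `extract_embedding` is a hypothetical helper.

```python
EXPECTED_DIMENSIONS = {"SageMaker": 4096, "Bedrock": 1536}

def extract_embedding(predict_response):
    # Assumed layout: {"inference_results": [{"output": [{"data": [...]}]}]}
    output = predict_response["inference_results"][0]["output"][0]
    return output["data"]

def has_expected_dimension(predict_response, llm_category):
    """Check the embedding length against the dimension used for the index."""
    return len(extract_embedding(predict_response)) == EXPECTED_DIMENSIONS[llm_category]

# Made-up response in the assumed layout, with a zero vector of the right size.
fake_response = {
    "inference_results": [
        {"output": [{"name": "sentence_embedding",
                     "data_type": "FLOAT32",
                     "shape": [1536],
                     "data": [0.0] * 1536}]}
    ]
}
print(has_expected_dimension(fake_response, "Bedrock"))
```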

### 11. Create pipeline to convert text into vector with the remote embedding model
We will use the model we just registered to convert the `question` field into a vector (embedding) and store it in the `question_vector` field.

In [None]:
pipeline={
  "description": "A semantic search pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": model_id,
        "field_map": {
           "question": "question_vector"
        }
      }
    }
  ]
}
pipeline_id = 'nlp_pipeline'
aos_client.ingest.put_pipeline(id=pipeline_id,body=pipeline)

Verify that the pipeline was created successfully.

In [None]:
aos_client.ingest.get_pipeline(id=pipeline_id)

### 12. Create an index in Amazon OpenSearch Service 
Whereas we previously created an index with 2 fields, this time we'll define the index with 3 fields: the first field, `question_vector`, holds the vector representation of the question; the second, `question`, holds the raw sentence; and the third, `answer`, holds the raw answer data.

To create the index, we first define the index in JSON, then use the aos_client connection we initiated earlier to create the index in OpenSearch.

We need to define different indexes for SageMaker and Bedrock embeddings because the vector dimensions differ: the SageMaker embedding dimension is 4096 and the Bedrock embedding dimension is 1536.
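
Since the two index definitions differ only in the vector dimension, one way to avoid duplicating the whole body is to generate it from a small lookup table. This is a sketch under that observation; the function name `build_knn_index` and the table `EMBEDDING_DIMENSIONS` are hypothetical, and the settings are copied from the definitions used in this notebook.

```python
EMBEDDING_DIMENSIONS = {"SageMaker": 4096, "Bedrock": 1536}

def build_knn_index(llm_category, pipeline_id):
    """Build the k-NN index body for the selected embedding model."""
    return {
        "settings": {
            "index.knn": True,
            "index.knn.space_type": "cosinesimil",
            "default_pipeline": pipeline_id,
            "analysis": {
                "analyzer": {
                    "default": {"type": "standard", "stopwords": "_english_"}
                }
            },
        },
        "mappings": {
            "properties": {
                "question_vector": {
                    "type": "knn_vector",
                    # Only this value differs between the two models.
                    "dimension": EMBEDDING_DIMENSIONS[llm_category],
                    "method": {"name": "hnsw", "space_type": "l2", "engine": "faiss"},
                    "store": True,
                },
                "question": {"type": "text", "store": True},
                "answer": {"type": "text", "store": True},
            }
        },
    }

bedrock_body = build_knn_index("Bedrock", "nlp_pipeline")
print(bedrock_body["mappings"]["properties"]["question_vector"]["dimension"])
```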

In [None]:
sagemaker_knn_index = {
    "settings": {
        "index.knn": True,
        "index.knn.space_type": "cosinesimil",
        "default_pipeline": pipeline_id,
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "question_vector": {
                "type": "knn_vector",
                "dimension": 4096,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss"
                },
                "store": True
            },
            "question": {
                "type": "text",
                "store": True
            },
            "answer": {
                "type": "text",
                "store": True
            }
        }
    }
}

bedrock_knn_index = {
    "settings": {
        "index.knn": True,
        "index.knn.space_type": "cosinesimil",
        "default_pipeline": pipeline_id,
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "question_vector": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss"
                },
                "store": True
            },
            "question": {
                "type": "text",
                "store": True
            },
            "answer": {
                "type": "text",
                "store": True
            }
        }
    }
}

match llm_category:
    case "SageMaker":
        knn_index = sagemaker_knn_index
    case "Bedrock":
        knn_index = bedrock_knn_index


## Note: If this is the first time you're running this notebook, you can comment out the line below, which tries to delete a previously created index.

In [None]:
aos_client.indices.delete(index="nlp_pqa")


Using the above index definition, we now need to create the index in Amazon OpenSearch.

In [None]:
print(knn_index)
aos_client.indices.create(index="nlp_pqa",body=knn_index,ignore=400)


Let's verify the created index information

In [None]:
aos_client.indices.get(index="nlp_pqa")

### 13. Load the raw data into the Index
Next, let's load the headset enhanced PQA data into the index we've just created. During ingestion, the `question` field will also be converted to a vector (embedding) by the `nlp_pipeline` we defined.

In [None]:
# Index each question/answer pair; the default pipeline adds `question_vector` at ingest time.
for question, answer in zip(qa_list["question"], qa_list["answer"]):
    aos_client.index(index='nlp_pqa', body={"question": question, "answer": answer})
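
The loop above sends one request per document. As a hedged alternative, the documents could be grouped into bulk actions and sent with the `helpers.bulk` function from `opensearch-py`. The sketch below only builds the actions so that it runs without a cluster; the generator name `build_bulk_actions` is hypothetical, and whether bulk ingestion is actually faster here is an assumption, since each document still passes through the embedding model.

```python
def build_bulk_actions(questions, answers, index_name):
    """Yield one bulk index action per question/answer pair."""
    for question, answer in zip(questions, answers):
        yield {
            "_index": index_name,
            "_source": {"question": question, "answer": answer},
        }

# Made-up rows standing in for qa_list["question"] and qa_list["answer"].
actions = list(build_bulk_actions(
    ["does this work with xbox?", "is it wireless?"],
    ["yes", "no"],
    "nlp_pqa",
))
print(len(actions))

# With a live connection, the actions could be sent like this:
# from opensearchpy import helpers
# helpers.bulk(aos_client, build_bulk_actions(qa_list["question"], qa_list["answer"], "nlp_pqa"))
```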

To validate the load, we'll query the number of documents in the index. We should have 1000 hits in the index.

In [None]:
res = aos_client.search(index="nlp_pqa", body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])


### 14. Search vector with "Semantic Search" 

We can search the data with neural search.


In [None]:
query={
  "_source": {
        "exclude": [ "question_vector" ]
    },
  "size": 30,
  "query": {
    "neural": {
      "question_vector": {
        "query_text": "does this work with xbox?",
        "model_id": model_id,
        "k": 30
      }
    }
  }
}

res = aos_client.search(index="nlp_pqa", 
                       body=query,
                       stored_fields=["question","answer"])
print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question'],hit['_source']['answer']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)

### 15. Search the same query with "Text Search"

Let's repeat the same query with a keyword search and compare the differences.

In [None]:
query={
    "size": 30,
    "query": {
        "match": {
            "question":"does this work with xbox?"
        }
    }
}

res = aos_client.search(index="nlp_pqa", 
                       body=query,
                       stored_fields=["question","answer"])
#print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['fields']['question'][0],hit['fields']['answer'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)

### 16. Observe The Results

Compare the first few records in the two searches above. For the semantic search, the first 10 or so results are very similar to our input question, as we expect. Compare this to the keyword search, where the results quickly start to deviate from our search query (e.g. "it shows xbox 360. Does it work for ps3 as well?" - this matches on keywords but has a different meaning).
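
One rough way to put a number on how different the two result lists are is the overlap between their top-k document IDs. The sketch below computes a Jaccard overlap; it is an illustrative measure chosen for this example, not something the lab prescribes, and the ID lists are made up. In the notebook, the IDs would come from the `_id` column of each result frame, saved under separate names before the second query overwrites `query_result_df`.

```python
def top_k_overlap(ids_a, ids_b, k=10):
    """Jaccard overlap between the top-k IDs of two ranked result lists."""
    top_a, top_b = set(ids_a[:k]), set(ids_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)

# Made-up document IDs for the semantic and keyword result lists.
semantic_ids = ["a", "b", "c", "d"]
keyword_ids = ["b", "c", "e", "f"]

overlap = top_k_overlap(semantic_ids, keyword_ids, k=4)
print(round(overlap, 3))
```

A value near 1 means both searches return mostly the same documents; a value near 0 means they surface different ones.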

You can also use "Compare search results" in the Search Relevance plugin to compare search relevance side by side. Please refer to the lab "Option 2: OpenSearch Dashboard Dev Tools" to compare search results.

### 17. Choose another embedding model 

Go back to [Select Embedding Model](#Select-Embedding-Model)

### 18. Summary
With an OpenSearch Neural Search remote model, embeddings are generated automatically by a model hosted in SageMaker or Bedrock. We no longer need to manage the inference pipeline ourselves, which makes the semantic search solution simpler to develop and maintain.