# Neural Sparse Search with Amazon OpenSearch Service

**Welcome to the neural sparse search notebook. Use this notebook to build a search application powered by Amazon OpenSearch Service.**

In this notebook, you will perform the following steps in sequence:
1. [Step 1: Get the Cloudformation outputs](#Step-1:-Get-the-Cloudformation-outputs)
2. [Step 2: Initialise OpenSearch client](#Step-2:-Initialise-OpenSearch-client)
3. [Step 3: Create the OpenSearch-Sagemaker ML connector](#Step-3:-Create-the-OpenSearch-Sagemaker-ML-connector)
4. [Step 4: Create the OpenSearch ingest pipeline with sparse_encoding processor](#Step-4:-Create-the-OpenSearch-ingest-pipeline-with-sparse_encoding-processor)
5. [Step 5: Create the Sparse index with rank_features](#Step-5:-Create-the-Sparse-index-with-rank_features)
6. [Step 6: Download the dataset (.gz) and extract the .gz file](#Step-6:-Download-the-dataset-(.gz)-and-extract-the-.gz-file)
7. [Step 7: Ingest the prepared data into OpenSearch](#Step-7:-Ingest-the-prepared-data-into-OpenSearch)
8. [Step 8: Create two phase search pipeline](#Step-8:-Create-two-phase-search-pipeline)
9. [Step 9: Launch the Search application](#Step-9:-Launch-the-Search-application)

In [None]:
#Install dependencies
#Implement header-based authentication and request authentication for AWS services that support AWS auth v4
%pip install requests_aws4auth
#OpenSearch Python SDK
%pip install opensearch_py

## Step 1: Get the Cloudformation outputs

Here, we retrieve the resources already deployed by the CloudFormation template, which are used to build the application. They include:
1. **SageMaker Endpoint**
2. **OpenSearch Domain Endpoint**
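
The stack outputs come back from `describe_stacks` as a list of dictionaries with `OutputKey` and `OutputValue` entries. As a minimal sketch, assuming that shape, the list can be turned into a lookup table so that a missing output fails with a clear message instead of an undefined variable later on. The helpers `outputs_to_dict` and `require_output` and the sample values are hypothetical and not part of the notebook:

```python
def outputs_to_dict(outputs):
    """Map each CloudFormation OutputKey to its OutputValue."""
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}

def require_output(outputs_by_key, key_fragment):
    """Return the first output whose key contains key_fragment, or raise."""
    for key, value in outputs_by_key.items():
        if key_fragment in key:
            return value
    raise KeyError(f"No stack output containing '{key_fragment}'")

# Made-up sample data in the shape returned by describe_stacks
sample_outputs = [
    {"OutputKey": "OpenSearchDomainEndpoint", "OutputValue": "search-example.us-east-1.es.amazonaws.com"},
    {"OutputKey": "SageMakerSparseModelEndpoint", "OutputValue": "example-sparse-endpoint"},
]
by_key = outputs_to_dict(sample_outputs)
print(require_output(by_key, "OpenSearchDomainEndpoint"))
```

The substring match mirrors the `in output['OutputKey']` checks used in the cell below.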

In [None]:
import sagemaker, boto3, json, time
from sagemaker.session import Session
import subprocess
from IPython.utils import io

cfn = boto3.client('cloudformation')

response = cfn.list_stacks(StackStatusFilter=['CREATE_COMPLETE','UPDATE_COMPLETE'])

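# Find the stack deployed for this lab by matching its template description
stackname = None
for cfns in response['StackSummaries']:
    if 'Neural Sparse search' in cfns.get('TemplateDescription', ''):
        stackname = cfns['StackName']

if stackname is None:
    raise RuntimeError("No CREATE_COMPLETE or UPDATE_COMPLETE stack with 'Neural Sparse search' in its description was found")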

response = cfn.describe_stack_resources(
    StackName=stackname
)
# for resource in response['StackResources']:
#     if(resource['ResourceType'] == "AWS::SageMaker::Endpoint"):
#         SagemakerEmbeddingEndpoint = resource['PhysicalResourceId']

cfn_outputs = cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']

for output in cfn_outputs:
    if('OpenSearchDomainEndpoint' in output['OutputKey']):
        OpenSearchDomainEndpoint = output['OutputValue']
        
    if('SageMakerSparseModelEndpoint' in output['OutputKey']):
        SagemakerEmbeddingEndpoint = output['OutputValue']
    if('WebAppURL' in output['OutputKey']):
        WebAppURL = output['OutputValue']
        
region = boto3.Session().region_name  
        

account_id = boto3.client('sts').get_caller_identity().get('Account')



print("stackname: "+stackname)
print("account_id: "+account_id)  
print("region: "+region)
print("SageMakerSparseModelEndpoint: "+SagemakerEmbeddingEndpoint)
print("OpenSearchDomainEndpoint: "+OpenSearchDomainEndpoint)

## Step 2: Initialise OpenSearch client


In [None]:
#Initialise OpenSearch client
import boto3
import requests 
from requests_aws4auth import AWS4Auth
import json

host = 'https://'+OpenSearchDomainEndpoint+'/'
service = 'es'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

## Step 3: Create the OpenSearch-Sagemaker ML connector 

Amazon OpenSearch Service AI connectors allow you to create a connector from OpenSearch Service to SageMaker Runtime.
To create a connector, we use the Amazon OpenSearch domain endpoint, the SageMaker endpoint that hosts the sparse encoding model, and an IAM role that grants OpenSearch Service permission to invoke the SageMaker model (this role is already created as part of the CloudFormation template).

Then, using the connector_id returned when the connector is created, we register and deploy the model in OpenSearch and obtain a model identifier (model_id).
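
The connector below defines a `pre_process_function` (a Painless script) and a `request_body` template. The following is a rough Python emulation of what those two pieces produce for a single input text. It only illustrates the string handling and is not how OpenSearch executes the connector. The function names `emulate_pre_process` and `render_request_body` are hypothetical:

```python
import json

def emulate_pre_process(text_docs):
    """Mimic the connector's pre_process_function for the first input document."""
    # json.dumps adds the surrounding quotes (and escapes special characters);
    # the Painless script only adds the quotes by hand with builder.append("\"")
    quoted = json.dumps(text_docs[0])
    parameters = '{"inputs":' + quoted + '}'
    return '{"parameters":' + parameters + '}'

def render_request_body(template, parameters):
    """Substitute ${parameters.<name>} placeholders in the request_body template."""
    body = template
    for name, value in parameters.items():
        body = body.replace("${parameters." + name + "}", value)
    return body

pre_processed = json.loads(emulate_pre_process(["hello"]))
request_body = render_request_body('["${parameters.inputs}"]', pre_processed["parameters"])
print(pre_processed)
print(request_body)
```

The intermediate dictionary has the same shape as the `{"parameters": {"inputs": "hello"}}` payload that the test call at the end of the next cell sends to `_predict`.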

In [None]:
connector_path_url = host+'_plugins/_ml/connectors/_create'
register_deploy_model_path_url = host+'_plugins/_ml/models/_register?deploy=true'
headers = {"Content-Type": "application/json"}

#create connector
payload_1 = {
       "name": "sparse encoding model",
       "description": "Connector for sparse encoding model",
       "version": 1,
       "protocol": "aws_sigv4",
       "credential": {
          "roleArn": "arn:aws:iam::"+account_id+":role/opensearch-sagemaker-role"
       },
       "parameters": {
          "region": region,
          "service_name": 'sagemaker'
       },
       "actions": [
          {
             "action_type": "predict",
             "method": "POST",
             "headers": {
                "content-type": "application/json"
             },
             "url": "https://runtime.sagemaker."+region+".amazonaws.com/endpoints/"+SagemakerEmbeddingEndpoint+"/invocations",
             "pre_process_function": '\n    StringBuilder builder = new StringBuilder();\n    builder.append("\\"");\n    builder.append(params.text_docs[0]);\n    builder.append("\\"");\n    def parameters = "{" +"\\"inputs\\":" + builder + "}";\n    return "{" +"\\"parameters\\":" + parameters + "}";\n    ', 
             "request_body": """["${parameters.inputs}"]""",
          }
       ]
    }
    

r_1 = requests.post(connector_path_url, auth=awsauth, json=payload_1, headers=headers)
connector_id = json.loads(r_1.text)["connector_id"]
time.sleep(1)

#register and deploy model
    
payload_2 = { 
                "name": "sparse encoding model",
                "function_name":"remote",
                "description": "sparse encoding model",
                "connector_id": connector_id
                
                }

r_2 = requests.post(register_deploy_model_path_url, auth=awsauth, json=payload_2, headers=headers)
model_id = json.loads(r_2.text)["model_id"]
time.sleep(1)
    
#test model

payload_4 = {
      "parameters": {
        "inputs": "hello"
          }
            }

path_4 = host+'_plugins/_ml/models/'+model_id+'/_predict'
r_4 = requests.post(path_4, auth=awsauth, json=payload_4, headers=headers)
print(r_4.text)

## Step 4: Create the OpenSearch ingest pipeline with sparse_encoding processor

In the ingest pipeline, you use the "sparse_encoding" processor to generate sparse encodings (token and weight pairs) from the "product_description" field and store them in the "product_description_sparse_encoding" field, which is mapped as rank_features in the next step.
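
To illustrate what the `field_map` of the sparse_encoding processor does to a document, here is a minimal offline sketch. The encoder `toy_sparse_encode` is a hypothetical stand-in based on term counts. The real weights come from the SageMaker-hosted model and will look different. `apply_field_map` is likewise an illustrative helper, not an OpenSearch API:

```python
from collections import Counter

def toy_sparse_encode(text):
    """Hypothetical stand-in for the model: token -> weight from term counts."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {token: count / len(tokens) for token, count in counts.items()}

def apply_field_map(document, field_map, encode):
    """Add one encoded field per field_map entry, leaving the source text intact."""
    enriched = dict(document)
    for source_field, target_field in field_map.items():
        if source_field in document:
            enriched[target_field] = encode(document[source_field])
    return enriched

doc = {"product_description": "red running shoes for trail running"}
field_map = {"product_description": "product_description_sparse_encoding"}
enriched = apply_field_map(doc, field_map, toy_sparse_encode)
print(enriched)
```

The enriched document keeps the original text and gains a token-to-weight map under the target field name.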

In [None]:
path = "_ingest/pipeline/sagemaker-sparse-ingest-pipeline"
url = host + path
payload = {
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": model_id,
        "field_map": {
          "product_description": "product_description_sparse_encoding"
        }
      }
    }
  ]
}

r = requests.put(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
print(r.text)


## Step 5: Create the Sparse index with rank_features

Create an OpenSearch index and set the pipeline created in the previous step, "sagemaker-sparse-ingest-pipeline", as the default pipeline. The product_description_sparse_encoding field must be mapped as the rank_features field type.
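
A rank_features field stores a map of feature names to positive numeric weights, which fits the token-to-weight output of a sparse encoder. As a simplified illustration, assuming relevance is the inner product of the query's and the document's token weights (the engine's actual scoring may differ in detail), ranking can be sketched as follows. The function name and all weights are made up:

```python
def sparse_dot_product(query_weights, doc_weights):
    """Sum weight products over tokens present in both sparse vectors."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

# Made-up token weights for one query and two documents
query = {"running": 1.5, "shoes": 1.0}
docs = {
    "doc-1": {"running": 0.8, "shoes": 0.6, "trail": 0.4},
    "doc-2": {"leather": 0.9, "shoes": 0.5},
}
ranked = sorted(docs, key=lambda d: sparse_dot_product(query, docs[d]), reverse=True)
print(ranked)
```

Only tokens shared by the query and a document contribute to the score, which is why documents can be stored and searched through an inverted index rather than a k-NN structure.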

In [None]:
path = "sagemaker-sparse-search-index"
url = host + path
payload = {
  "settings": {
    
    "default_pipeline": "sagemaker-sparse-ingest-pipeline",
    "number_of_shards": 1,
    "number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "product_description_sparse_encoding": {
        "type": "rank_features"
      },
      "product_description": {
        "type": "text"
      }
    }
  }
}
r = requests.put(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
print(r.text)

## Step 6: Download the dataset (.gz) and extract the .gz file

In [None]:
import os
import urllib.request
import tarfile

os.makedirs('tmp/images', exist_ok = True)
metadata_file = urllib.request.urlretrieve('https://aws-blogs-artifacts-public.s3.amazonaws.com/BDB-3144/products-data.yml', 'tmp/images/products.yaml')
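# Discard the response headers so the request `headers` dict defined earlier is not overwritten
img_filename, _ = urllib.request.urlretrieve('https://aws-blogs-artifacts-public.s3.amazonaws.com/BDB-3144/images.tar.gz', 'tmp/images/images.tar.gz')
print(img_filename)
# Extract the archive; the context manager closes the file afterwards
with tarfile.open('tmp/images/images.tar.gz') as file:
    file.extractall('tmp/images/')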
#remove images.tar.gz
os.remove('tmp/images/images.tar.gz')

## Step 7: Ingest the prepared data into OpenSearch

We ingest the product description, caption, category, price and local image path of each product into the OpenSearch index.

This step takes approximately 3 minutes to load the data into OpenSearch.

**A total of 49 batches will be ingested into the index, where every batch has 50 documents**
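
The batching can also be expressed as a generator that yields one newline-delimited `_bulk` body per batch, including a final partial batch. This is an illustrative sketch, not the code used in the cell below. `bulk_bodies` is a hypothetical helper and the sample documents are made up:

```python
import json
import math

def bulk_bodies(documents, index_name, batch_size=50):
    """Yield one NDJSON _bulk body per batch, including a final partial batch."""
    action = json.dumps({"index": {"_index": index_name}})
    for start in range(0, len(documents), batch_size):
        lines = []
        for doc in documents[start:start + batch_size]:
            lines.append(action)           # action line
            lines.append(json.dumps(doc))  # document source line
        # The _bulk API expects the body to end with a newline
        yield "\n".join(lines) + "\n"

docs = [{"caption": f"item {i}"} for i in range(120)]
bodies = list(bulk_bodies(docs, "sagemaker-sparse-search-index"))
print(len(bodies), math.ceil(len(docs) / 50))
```

Each yielded string could be passed as the `body` argument of a bulk call, so every document is sent exactly once regardless of whether the total is a multiple of the batch size.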

In [None]:
from ruamel.yaml import YAML
from PIL import Image
import os
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

headers = { "Content-Type": "application/json"}
aos_client = OpenSearch(
    hosts = [{'host': OpenSearchDomainEndpoint, 'port': 443}],
    http_auth = awsauth,
    use_ssl = True,
    #verify_certs = True,
    connection_class = RequestsHttpConnection
)

# Load the products from the dataset
yaml = YAML()
with open('tmp/images/products.yaml') as f:
    items_ = yaml.load(f)

batch = 0
count = 0
body_ = ''
batch_size = 50
last_batch = int(len(items_)/batch_size)
action = json.dumps({ 'index': { '_index': 'sagemaker-sparse-search-index' } })

for item in items_:
    count+=1
    payload = {}
    payload['image_url'] = "/home/ec2-user/SageMaker/AI-search-with-amazon-opensearch-service/tmp/images/"+item["category"]+"/"+item["image"]
    payload['product_description'] = item['description']
    payload['caption'] = item['name']
    payload['category'] = item['category']
    payload['price'] = item['price']
    if('gender_affinity' in item):
        if(item['gender_affinity'] == 'M'):
            payload['gender_affinity'] = 'Male'
        else:
            if(item['gender_affinity'] == 'F'):
                payload['gender_affinity'] = 'Female'
            else:
                payload['gender_affinity'] = item['gender_affinity']
    if('style' in item):          
        payload['style'] = item['style']
    
    body_ = body_ + action + "\n" + json.dumps(payload) + "\n"
    
    if(count == batch_size):
        response = aos_client.bulk(
        index = 'sagemaker-sparse-search-index',
        body = body_
        )
        batch += 1
        count = 0
        print("batch "+str(batch) + " ingestion done!")
        # Always clear the buffer so the last full batch is not sent twice
        body_ = ""

#ingest the remaining rows, if any
if(body_ != ""):
    response = aos_client.bulk(
        index = 'sagemaker-sparse-search-index',
        body = body_
        )
        
print("All "+str(last_batch)+" batches ingested into index")

## Step 8: Create two phase search pipeline

In [None]:
path = "_search/pipeline/neural_sparse_two_phase_search_pipeline"
url = host + path
payload = {
  "request_processors": [
    {
      "neural_sparse_two_phase_processor": {
        "tag": "neural-sparse",
        "description": "This processor is making two-phase processor.",
        "enabled": True,
        "two_phase_parameter": {
          "prune_ratio": 0.4
        }
      }
    }
  ]
}
r = requests.put(url, auth=awsauth, json=payload, headers=headers)
print(r.status_code)
print(r.text)
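
Once the search pipeline exists, a neural_sparse query can reference it through the `search_pipeline` request parameter. The sketch below only builds the request body and parameters. `neural_sparse_query` is a hypothetical helper, `"example-model-id"` is a placeholder for the `model_id` obtained in Step 3, and excluding the encoding field from `_source` is an optional choice to keep responses small:

```python
import json

def neural_sparse_query(query_text, model_id, size=5):
    """Build a neural_sparse query body against the sparse encoding field."""
    return {
        "size": size,
        # Leave the large token-weight map out of the returned hits
        "_source": {"excludes": ["product_description_sparse_encoding"]},
        "query": {
            "neural_sparse": {
                "product_description_sparse_encoding": {
                    "query_text": query_text,
                    "model_id": model_id,
                }
            }
        },
    }

# "example-model-id" is a placeholder; in the notebook this would be model_id
body = neural_sparse_query("waterproof hiking boots", "example-model-id")
search_params = {"search_pipeline": "neural_sparse_two_phase_search_pipeline"}
print(json.dumps(body, indent=2))
```

With the notebook's variables in scope, the body could then be sent with something like `requests.get(host + 'sagemaker-sparse-search-index/_search', auth=awsauth, json=body, params=search_params, headers=headers)`.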


## Step 9: Launch the Search application

In [None]:
print(WebAppURL)