# Semantic Search with Amazon OpenSearch Service 
To create semantic search, we will add a vector representation of the metadata to our data set in OpenSearch, then do the same with our sample query "Wildfires in Canada". In OpenSearch, we'll use a KNN search to find matches based on a cosine similarity rating on the vector.
We will:
1. Use a HuggingFace sentence-transformer BERT model to generate sentence embeddings for the geo.ca metadata dataset.
2. Upload the dataset to OpenSearch, with the original metadata text combined with its vector representation.
3. Translate the query question to a vector.
4. Run a KNN search in OpenSearch to perform semantic search.
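
To make the "cosine similarity rating" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented toy values (the embeddings used later in this notebook have 384 dimensions), and `cosine_similarity` is an illustrative helper, not a function from OpenSearch or sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy "embeddings" (invented values for illustration only)
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, different magnitude
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))  # close to 1: same direction
print(cosine_similarity(query, doc_far))    # 0.0: unrelated direction
```

Because only the direction of the vectors matters, a long and a short description of the same topic can still score as close matches.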

### 1. Check PyTorch Version


As in the previous modules, let's import PyTorch and confirm that we have a recent version installed. The version should already be 2.0.1 or higher. If not, please run the lab notebooks in order so that everything is set up.

In [1]:
import torch 
print(torch.__version__)

2.1.0


### 2. Retrieve notebook variables

The line below will retrieve your shared variables from the previous notebook.

In [2]:
%store -r

### 3. Import libraries

In [3]:
# #installed in the previous notebook
# !pip install -q boto3
# !pip install -q requests
# !pip install -q requests-aws4auth
# !pip install -q opensearch-py
# !pip install -q tqdm
# !pip install -q transformers[torch]
# !pip install -q transformers
# !pip install -q sentence-transformers rank_bm25
# !pip install -q nltk

In [4]:
import boto3
import re
import time
import sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### 4. Prepare BERT Model 
#### Option 1: DistilBert model
For this option, we use a HuggingFace DistilBERT model to generate embeddings, where every sentence is represented as a 768-dimensional vector. Let's create some helper functions we'll use later on.
![BERT](image/nlp_bert.png)

We are creating 2 functions:
1. mean_pooling
2. sentence_to_vector - this is the key function we'll use to generate our vector embedding for the metadata dataset.
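
As a rough illustration of what `mean_pooling` does, the sketch below restates the idea in plain Python for a single sentence: token vectors are averaged, and padding positions (mask value 0) are skipped. This is a simplified stand-in, not the PyTorch helper used in this notebook, and the token values are invented.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0.

    token_embeddings: list of token vectors for one sentence
    attention_mask:   list of 0/1 flags, 1 for real tokens and 0 for padding
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                sums[i] += value
    kept = max(kept, 1)  # guard against an all-padding input
    return [s / kept for s in sums]

# Two real tokens and one padding token (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pooling(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

Without the mask, the padding vector would drag the average far away from the real tokens, which is why the mask is taken into account.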

A reason for not using DistilBert:
 Transformer models like DistilBert have a fixed maximum input length (512 tokens), and any input longer than this limit can cause errors during processing. Our input sequence length (1086 tokens) exceeds that limit.
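
One common way to handle input that exceeds a fixed limit is to split the token sequence into overlapping windows and embed each window separately. The sketch below only illustrates that idea on a list of token ids; `split_into_windows`, the overlap of 50 tokens, and the synthetic ids are assumptions for illustration and are not used elsewhere in this notebook. A real tokenizer also adds special tokens that count toward the limit, so `max_len` would need to leave room for them.

```python
def split_into_windows(token_ids, max_len=512, overlap=50):
    """Split a token-id sequence into windows of at most max_len tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the input
    return windows

# Synthetic ids standing in for the 1086-token input mentioned above
token_ids = list(range(1086))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [512, 512, 162]
```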

In [5]:
# import torch
# from transformers import AutoTokenizer, AutoModel
# from transformers import DistilBertTokenizer, DistilBertModel

# #model_name = "distilbert-base-uncased"
# #model_name = "sentence-transformers/msmarco-distilbert-base-dot-prod-v3"
# model_name = "sentence-transformers/distilbert-base-nli-stsb-mean-tokens" #https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens


# #Mean Pooling - Take attention mask into account for correct averaging
# def mean_pooling(model_output, attention_mask):
#     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#     sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
#     sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
#     return sum_embeddings / sum_mask


# def sentence_to_vector(raw_inputs):
#     tokenizer = DistilBertTokenizer.from_pretrained(model_name)
#     model = DistilBertModel.from_pretrained(model_name)
#     inputs_tokens = tokenizer(raw_inputs, padding=True, return_tensors="pt")
    
#     with torch.no_grad():
#         outputs = model(**inputs_tokens)

#     sentence_embeddings = mean_pooling(outputs, inputs_tokens['attention_mask'])
#     return sentence_embeddings




#### Option 2: all-MiniLM-L6-v2
We can also use the sentence-transformers model ['all-MiniLM-L6-v2'](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which simplifies the process of obtaining sentence embeddings. It produces 384-dimensional vectors. It is designed for generating sentence embeddings directly, so the Sentence Transformers library handles both tokenization and embedding, which is more streamlined than working manually with DistilBertModel and DistilBertTokenizer.

The SentenceTransformer class's encode method directly handles the text input, tokenization, and conversion to sentence embeddings, eliminating the need for manual mean pooling. The encode method returns a tensor of sentence embeddings, where each embedding corresponds to the input sentences provided to the function.
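
An embedding has to be JSON-serializable before it can be sent to OpenSearch, which is why the tensor is converted to a plain list of floats. The sketch below shows that conversion on a hand-written tuple standing in for a real embedding; `to_indexable_document` is a hypothetical helper, and the `text` and `vector` keys simply mirror the column names used later in this notebook.

```python
import json

def to_indexable_document(text, embedding):
    """Build a JSON-serializable document from text and an embedding."""
    # float() turns NumPy or PyTorch scalar types into plain Python floats
    vector = [float(x) for x in embedding]
    return {"text": text, "vector": vector}

# A 3-value tuple stands in for a real 384-dimensional embedding
doc = to_indexable_document("Wildfires in Canada", (0.25, -0.5, 0.125))
payload = json.dumps(doc)
print(payload)
```

One caveat worth checking against the model card: all-MiniLM-L6-v2 truncates input longer than 256 word pieces by default, so long descriptions are embedded from their leading portion only.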



In [5]:
#!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import numpy as np 

# Load the Sentence Transformer model
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

def sentence_to_vector(raw_inputs):
    """Encode text into a sentence embedding and return it as a list of floats."""
    # Encode sentences to get sentence embeddings
    sentence_embeddings = model.encode(raw_inputs, convert_to_tensor=True)
    # OpenSearch (like Elasticsearch) expects a JSON-serializable list of floats,
    # so move the tensor to the CPU and convert it before indexing the document.
    return sentence_embeddings.cpu().tolist()



### 5. Preprocess and embed the text


In [6]:
import pandas as pd 
import boto3
import json
from requests_aws4auth import AWS4Auth
import io
import os 
from tqdm import tqdm

In [7]:
# Load metadata
def read_parquet_from_s3_as_df(region, s3_bucket, s3_key):
    """
    Load a Parquet file from an S3 bucket into a pandas DataFrame.

    Parameters:
    - region: AWS region where the S3 bucket is located.
    - s3_bucket: Name of the S3 bucket.
    - s3_key: Key (path) to the Parquet file within the S3 bucket.

    Returns:
    - df: pandas DataFrame containing the data from the Parquet file.
    """

    # Setup AWS session and clients
    session = boto3.Session(region_name=region)
    s3 = session.resource('s3')

    # Load the Parquet file as a pandas DataFrame
    s3_object = s3.Object(s3_bucket, s3_key)  # avoid shadowing the built-in `object`
    body = s3_object.get()['Body'].read()
    df = pd.read_parquet(io.BytesIO(body))
    return df

# Upload the DataFrame to S3 as a Parquet file
def upload_df_to_s3_as_parquet(df, bucket_name, file_key):
    # Save DataFrame as a Parquet file locally
    parquet_file_path = 'temp.parquet'
    df.to_parquet(parquet_file_path)

    # Create an S3 client
    s3_client = boto3.client('s3')

    # Upload the Parquet file to S3 bucket
    try:
        s3_client.upload_file(parquet_file_path, bucket_name, file_key)
        print(f'Uploading {file_key} to {bucket_name} as parquet file')
        return True
    except Exception as e:
        print(e)
        return False
    finally:
        # Delete the local Parquet file whether or not the upload succeeded
        if os.path.exists(parquet_file_path):
            os.remove(parquet_file_path)

# Create new column 'organisation_en' required by the API JSON response
def extract_organisation_en(contact_str):
    try:
        # Parse the stringified JSON into Python objects
        contact_data = json.loads(contact_str)
        # If the parsed data is a list, iterate through it
        if isinstance(contact_data, list):
            for item in contact_data:
                # Check if 'organisation' and 'en' keys exist
                if 'organisation' in item and 'en' in item['organisation']:
                    return item['organisation']['en']
        elif isinstance(contact_data, dict):
            # If the data is a dictionary, extract 'organisation' in 'en' directly
            return contact_data.get('organisation', {}).get('en', None)
    except json.JSONDecodeError:
        # Handle cases where the contact string is not valid JSON
        return None
    except Exception as e:
        # Catch-all for any other unexpected errors
        return f"Error: {str(e)}"

# Text preprocess
def preprocess_records_into_text(df):
    selected_columns = ['features_properties_title_en','features_properties_description_en','features_properties_keywords_en']
    df = df[selected_columns]
    return df.apply(lambda x: f"{x['features_properties_title_en']}\n{x['features_properties_description_en']}\nkeywords:{x['features_properties_keywords_en']}",axis=1 )
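
The contact parsing can be checked on a few hand-made strings before running it over the whole dataset. The sketch below is a condensed variant of `extract_organisation_en`, written under the assumption that returning `None` for any unparseable input is acceptable (the notebook version returns an `"Error: ..."` string for unexpected errors). The function name and the sample JSON strings are invented for illustration.

```python
import json

def first_organisation_en(contact_str):
    """Return the first English organisation name found in a contact JSON string."""
    try:
        contact = json.loads(contact_str)
    except (json.JSONDecodeError, TypeError):
        return None  # not valid JSON, or not a string at all
    # Treat a single contact object like a one-element list
    items = contact if isinstance(contact, list) else [contact]
    for item in items:
        if isinstance(item, dict):
            organisation = item.get("organisation")
            if isinstance(organisation, dict) and "en" in organisation:
                return organisation["en"]
    return None

samples = [
    '[{"organisation": {"en": "Example Agency", "fr": "Agence exemple"}}]',
    '{"organisation": {"en": "Example Department"}}',
    'not json',
]
print([first_organisation_en(s) for s in samples])
# ['Example Agency', 'Example Department', None]
```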

In [8]:
#1) Step1: Load the data 
df_parquet = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-stage', 'records.parquet')
df_sentinel1 = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-stage', 'sentinel1.parquet')
df = pd.concat([df_parquet, df_sentinel1], ignore_index=True)
df.head()
#df.columns

Unnamed: 0,features_type,features_geometry_type,features_geometry_coordinates,features_properties_id,features_properties_title_en,features_properties_title_fr,features_properties_description_en,features_properties_description_fr,features_properties_keywords_en,features_properties_keywords_fr,...,features_properties_distributor,features_properties_options,features_properties_temporalExtent_end_@indeterminatePosition,features_properties_temporalExtent_end_#text,features_properties_eoCollection,features_properties_eoFilters,features_properties_sourceSystemName,features_properties_plugins,features_popularity,features_similarity
0,Feature,Polygon,"[[[-143, 39.05], [-47, 39.05], [-47, 85], [-14...",d3881c4c-650d-4070-bf9b-1e00aabf0a1d,Canadian Hydrographic Service Non-Navigational...,Données bathymétriques non navigationnelles (N...,"**CHS NONNA data has been updated: April 21, 2...",**Les données NONNA du Service hydrographique ...,"Bathymetry, Depth, Hydrography","Bathymétrie, les profondeurs des fonds marins,...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://data.chs-shc.ca/"", ""protocol...",,,,[],cgp,[],3929,"[{""sim"": ""sim1"", ""features_properties_id"": ""ff..."
1,Feature,Polygon,"[[[-141.0027151, 41.7], [-52.6, 41.7], [-52.6,...",3d282116-e556-400c-9306-ca1a3cada77f,National Road Network - NRN - GeoBase Series,Réseau routier national - RRN - Série GéoBase,Notice - Format decommissioning\n\nGML (Geogra...,Avis - Changement aux formats offerts\n\nLes f...,"Canada, Geographic Infrastructure, NRN, Nation...","Canada, Infrastructure géographique, RRN, Rése...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://geo.statcan.gc.ca/geo_wa/ser...",,,,[],cgp,[],1991,"[{""sim"": ""sim1"", ""features_properties_id"": ""24..."
2,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",b6567c5c-8339-4055-99fa-63f92114d9e4,First Nations Location,Localisation des Premières Nations,The First Nations geographic location dataset ...,Le jeu de données des Premières Nations contie...,"First Nation, Band, Aboriginal, Indian and Nor...","Première Nation, bande, autochtone, Affaires i...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://data.aadnc-aandc.gc.ca/geoma...",,,,[],cgp,[],1881,"[{""sim"": ""sim1"", ""features_properties_id"": ""f4..."
3,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",522b07b9-78e2-4819-b736-ad9208eb1067,Aboriginal Lands of Canada Legislative Boundaries,Limites législatives des terres autochtones du...,The Aboriginal Lands of Canada Legislative Bou...,Le service web des limites législatives des te...,"Canada Lands, Indian reserves, Land management...","Terres du Canada, Réserves indiennes, Gestion ...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://proxyinternet.nrcan-rncan.gc...",,,,[],cgp,[],1832,"[{""sim"": ""sim1"", ""features_properties_id"": ""65..."
4,Feature,Polygon,"[[[-139.88850063, 42.16046612], [-50.94318813,...",db177a8c-5d7d-49eb-8290-31e6a45d786c,Critical Habitat of Species at Risk,L'habitat essentiel désigné en vertu de la Loi...,The Species at Risk (SAR) Program is responsib...,Le Programme des espèces en péril consiste à r...,"Species at risk, Canada, Diadromous fish, Sea ...","Espèces en péril, Canada, Les poissons diadrom...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://pacgis01.dfo-mpo.gc.ca/FGPPu...",,,,[],cgp,[],1711,"[{""sim"": ""sim1"", ""features_properties_id"": ""e0..."


Subset the data to the columns that are required in the app.geo.ca [API response](https://geocore.api.geo.ca/geo?north=81.77364370720657&east=360&south=-8.407168163601076&west=-359.6484375&keyword=&lang=en&min=1&max=10&sort=popularity-desc).
##### Note: we are focusing on English search at the moment.

In [9]:
#2) Step2: Clean the data  
col_names_list = [
    'features_properties_id','features_geometry_coordinates','features_properties_title_en',
    'features_properties_description_en','features_properties_date_published_date',
    'features_properties_keywords_en','features_properties_options','features_properties_contact',
    'features_properties_topicCategory','features_properties_date_created_date',
    'features_properties_spatialRepresentation','features_properties_type',
    'features_properties_temporalExtent_begin','features_properties_temporalExtent_end',
    'features_properties_graphicOverview','features_properties_language','features_popularity',
    'features_properties_sourceSystemName','features_properties_eoCollection',
    'features_properties_eoFilters'
]
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
df_en = df[col_names_list].copy()
df_en['organisation_en'] = df_en['features_properties_contact'].apply(extract_organisation_en)

# Replace placeholder values in the temporal extent columns with None
values_to_replace = {'Present': None, 'Not Available; Indisponible': None}
columns_to_replace = ['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

# Create a new column 'temporalExtent' as a dictionary of {'begin': ..., 'end': ...}
df_en['temporalExtent'] = df_en.apply(lambda row: {'begin': row['features_properties_temporalExtent_begin'], 'end': row['features_properties_temporalExtent_end']}, axis=1)
df_en = df_en.drop(columns =['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end'])

values_to_replace = {'Not Available; Indisponible': None} # modifies dates to acceptable values
columns_to_replace = ['features_properties_date_published_date', 'features_properties_date_created_date']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

In [10]:
#3) Step 3: Preprocess text 
df_en['text'] = preprocess_records_into_text(df_en)

df_en.describe()
print(type(df_en['text'].head(6)[1]))
print(df_en['text'].head(6)[1])

<class 'str'>
National Road Network - NRN - GeoBase Series
Notice - Format decommissioning\n\nGML (Geography Markup Language) and KML (Keyhole Markup Language) distribution formats will no longer be included in NRN releases produced after April 1, 2022. After this date, NRN versions will still be available in GeoPackage and ESRI Shapefile formats. This change does not impact the currently available NRN releases on this portal.\n\nThe NRN product is distributed in the form of thirteen provincial or territorial datasets and consists of two linear entities (Road Segment and Ferry Connection Segment) and three punctual entities (Junction, Blocked Passage, Toll Point) with which is associated a series of descriptive attributes such as, among others: First House Number, Last House Number, Street Name Body, Place Name, Functional Road Class, Pavement Status, Number Of Lanes, Structure Type, Route Number, Route Name, Exit Number. The development of the NRN was realized by means of individual m

### (Optional) Text preprocessing using NLTK

Create a new column that concatenates the selected columns (features_properties_title_en, features_properties_description_en, features_properties_keywords_en), and apply the following preprocessing before tokenization:
- convert to lower case
- remove stopwords and punctuation
- remove apostrophes
- stemming


In [None]:
# import nltk
# from nltk.corpus import stopwords          # module for stop words that come with NLTK
# from nltk.stem import PorterStemmer        # module for stemming
# from nltk.tokenize import word_tokenize   # module for tokenizing strings 
# import string
# # Download necessary NLTK resources
# nltk.download('punkt')
# nltk.download('stopwords')


# df_en['metadata_en'] = df_en['features_properties_title_en'] + ' ' + df_en['features_properties_description_en'] + ' ' + df_en['features_properties_keywords_en'] 
# if df_en['metadata_en'].isnull().any():
#     df_en['metadata_en'] = df_en['metadata_en'].fillna('')

# # Function to clean text
# def clean_text(text):
#     """
#     text: raw text input, a single string
#     output: preprocessed text as a single string
#     """
#     # Set of stopwords
#     stop_words = set(stopwords.words('english'))
#     # Initialize the Porter Stemmer
#     stemmer = PorterStemmer()
    
#     # Convert text to lowercase
#     text = text.lower()
#     # Remove punctuation
#     text = text.translate(str.maketrans('', '', string.punctuation.replace("'", "")))  # Keep apostrophe
#     # Remove apostrophes
#     text = text.replace("'", "")
#     # Tokenize text
#     word_tokens = word_tokenize(text)
#     # Remove stopwords and stem
#     filtered_text = [stemmer.stem(word) for word in word_tokens if word not in stop_words]
#     return " ".join(filtered_text)

# df_en['processed_metadata_en'] = df_en['metadata_en'].apply(clean_text)
# # Show the processed column
# print(df_en.loc[1:3, ['metadata_en', 'processed_metadata_en']])
# df_en.head(4)

# tqdm.pandas()
# df_en['vector'] = df_en["processed_metadata_en"].progress_apply(sentence_to_vector)

In [11]:
# Step 4: Embedding text 
tqdm.pandas()
df_en['vector'] = df_en["text"].progress_apply(sentence_to_vector)


100%|██████████| 61302/61302 [49:15<00:00, 20.74it/s]  


In [13]:
print(f'The dimension of the vector is {len(df_en.loc[1, "vector"])}')
df_en.head(4)

The dimension of the vector is 384


Unnamed: 0,features_properties_id,features_geometry_coordinates,features_properties_title_en,features_properties_description_en,features_properties_date_published_date,features_properties_keywords_en,features_properties_options,features_properties_contact,features_properties_topicCategory,features_properties_date_created_date,...,features_properties_graphicOverview,features_properties_language,features_popularity,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,organisation_en,temporalExtent,text,vector
0,d3881c4c-650d-4070-bf9b-1e00aabf0a1d,"[[[-143, 39.05], [-47, 39.05], [-47, 85], [-14...",Canadian Hydrographic Service Non-Navigational...,"**CHS NONNA data has been updated: April 21, 2...",2018-10-11,"Bathymetry, Depth, Hydrography","[{""url"": ""https://data.chs-shc.ca/"", ""protocol...","[{""individual"": ""null"", ""position"": {""en"": ""nu...","oceans, inlandWaters",2018-10-01,...,"[{""overviewFileName"": ""https://pacgis01.dfo-mp...",eng; CAN,3929,cgp,,[],Government of Canada; Fisheries and Oceans Can...,"{'begin': <NA>, 'end': <NA>}",Canadian Hydrographic Service Non-Navigational...,"[-0.03204740211367607, 0.007585172541439533, 0..."
1,3d282116-e556-400c-9306-ca1a3cada77f,"[[[-141.0027151, 41.7], [-52.6, 41.7], [-52.6,...",National Road Network - NRN - GeoBase Series,Notice - Format decommissioning\n\nGML (Geogra...,2015,"Canada, Geographic Infrastructure, NRN, Nation...","[{""url"": ""https://geo.statcan.gc.ca/geo_wa/ser...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",transportation,2010,...,"[{""overviewFileName"": ""http://ftp.geogratis.gc...",eng; CAN,1991,cgp,,[],Government of Canada; Statistics Canada,"{'begin': '1979-07', 'end': '2020-05'}",National Road Network - NRN - GeoBase Series\n...,"[-0.03426723554730415, -0.06323028355836868, 0..."
2,b6567c5c-8339-4055-99fa-63f92114d9e4,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",First Nations Location,The First Nations geographic location dataset ...,2015-05-01,"First Nation, Band, Aboriginal, Indian and Nor...","[{""url"": ""https://data.aadnc-aandc.gc.ca/geoma...","[{""individual"": ""null"", ""position"": {""en"": ""Me...","location, society",2007-06-01,...,"[{""overviewFileName"": ""https://data.aadnc-aand...",eng; CAN,1881,cgp,,[],Government of Canada;Indigenous Services Canad...,"{'begin': '2007-06-01', 'end': <NA>}",First Nations Location\nThe First Nations geog...,"[0.029996875673532486, -0.042383823543787, -0...."
3,522b07b9-78e2-4819-b736-ad9208eb1067,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",Aboriginal Lands of Canada Legislative Boundaries,The Aboriginal Lands of Canada Legislative Bou...,2017-07-28,"Canada Lands, Indian reserves, Land management...","[{""url"": ""https://proxyinternet.nrcan-rncan.gc...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",boundaries,2016-02-03,...,[],eng; CAN,1832,cgp,,[],Government of Canada; Natural Resources Canada...,"{'begin': '2004-04-02', 'end': <NA>}",Aboriginal Lands of Canada Legislative Boundar...,"[0.037921637296676636, -0.005010603461414576, ..."


In [14]:
# Step 5 Upload the embeddings as a parquet file to S3 bucket 
upload_df_to_s3_as_parquet(df=df_en, bucket_name='webpresence-nlp-data-preprocessing-stage', file_key='semantic_search_embeddings-pretrain.parquet') 

Uploading semantic_search_embeddings-pretrain.parquet to webpresence-nlp-data-preprocessing-stage as parquet file


True

### 6. Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection to the OpenSearch cluster.


In [15]:
from opensearch import get_awsauth_from_secret, create_opensearch_connection, delete_aos_index_if_exists, load_data_to_opensearch_index

In [9]:
#Optional: read the embedding data from the S3 bucket 
# import pandas as pd 
# import io
# df_en = read_parquet_from_s3_as_df('ca-central-1', 'webpresence-nlp-data-preprocessing-stage', 'semantic_search_embeddings-mpnet-mpf.parquet')

In [12]:
df_test = df_en.head(10)

In [29]:
vector = df_en['vector']
print(type(vector))
# Check for null values
has_null = vector.isnull().any()
print(f"Series has null values: {has_null}")

json_en = df_en.to_dict("records")
print(type(json_en))
#print(json_en[0])

# Extract the 'vector' values
vectors = [item['vector'] for item in json_en]
print(type(vectors))
#print(vectors[0])

# Check whether there are any null values
import numpy as np 
array = np.array(vectors, dtype=object)
has_null = np.any(array == None)
print(f"List of lists has null values: {has_null}")

<class 'pandas.core.series.Series'>
Series has null values: False
<class 'list'>
<class 'list'>
List of lists has null values: False


In the Outputs tab of the CloudFormation stack 'geocore-semantic-search-with-opensearch-stage', find the values for region, aos_host, and os_secret_id.

In [16]:
region = "ca-central-1"
aos_host = "search-semantic-search-arieibeskhrn6vn2qd7gf5br7q.ca-central-1.es.amazonaws.com"
os_secret_id = "OpenSearchSecret-geocore-semantic-search-with-opensearch-stage"

awsauth = get_awsauth_from_secret(region, secret_id=os_secret_id)
aos_client = create_opensearch_connection(aos_host, awsauth)

Connection to OpenSearch established: <OpenSearch([{'host': 'search-semantic-search-arieibeskhrn6vn2qd7gf5br7q.ca-central-1.es.amazonaws.com', 'port': 443}])>


### 7. Create an index in Amazon OpenSearch Service
The following index settings configure an OpenSearch index to support k-nearest neighbor (KNN) searches. KNN enables similarity search: finding the "nearest" documents in a high-dimensional space. These settings are what make vector-based search possible, where each vector represents a document's features in that space.

How k-NN search works: k-NN search calculates the distance between a query vector and the vectors in the dataset to find the closest matches. OpenSearch stores these vectors in an index and uses specialized algorithms (such as HNSW, Hierarchical Navigable Small World graphs) to perform efficient similarity search at scale.
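
The distance calculation described above can be sketched as an exact, brute-force search in plain Python. This is only an illustration of the idea: it compares the query against every stored vector, whereas OpenSearch uses approximate index structures such as HNSW to avoid that full scan. The document ids and three-dimensional vectors are invented toy data.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_search(query, documents, k):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, _cosine(query, vec)) for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]

# Toy index: document id -> embedding (invented values)
documents = {
    "floods": [0.9, 0.1, 0.0],
    "roads": [0.0, 1.0, 0.0],
    "rivers": [0.8, 0.0, 0.2],
}
top = knn_search([1.0, 0.0, 0.0], documents, k=2)
print([doc_id for doc_id, _ in top])  # ['floods', 'rivers']
```

The brute-force scan costs one similarity computation per document, which is what makes graph-based approximate search attractive for tens of thousands of records.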



In [17]:
index_name = "minilm-knn"
knn_index = {
    "settings": {
        "index.knn": True, #This enables the k-nearest neighbor (KNN) search capability on the index.
        "index.knn.space_type": "cosinesimil", #cosine similarity 
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": 384,
                "store": True
            },
            "coordinates":{
              "type": "geo_shape", 
              "store": True 
            }  
        }
    }
}
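
The `dimension` in the mapping has to match the embedding model (384 for all-MiniLM-L6-v2, 768 for the DistilBERT option described earlier). A small sketch of building the index body from that value, so the two cannot drift apart, might look like this. `make_knn_index_body` and `MODEL_DIMENSIONS` are hypothetical names, and the body deliberately omits the analyzer and `geo_shape` parts of the full configuration.

```python
# Embedding sizes as stated in this notebook for the two model options
MODEL_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "sentence-transformers/distilbert-base-nli-stsb-mean-tokens": 768,
}

def make_knn_index_body(dimension, space_type="cosinesimil"):
    """Build a minimal k-NN index body for a given embedding dimension."""
    if dimension < 1:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {
            "index.knn": True,
            "index.knn.space_type": space_type,
        },
        "mappings": {
            "properties": {
                "vector": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }

body = make_knn_index_body(MODEL_DIMENSIONS["all-MiniLM-L6-v2"])
print(body["mappings"]["properties"]["vector"])
```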


In [18]:
#Delete index if it exists 
delete_aos_index_if_exists(aos_client, index_to_delete=index_name)

Current indexes: ['minilm-knn', 'minilm-pretrain-knn', '.opensearch-observability', '.plugins-ml-config', '.ql-datasources', '.kibana_1', '.opendistro_security', 'mpnet-mpf-knn']
Deleted index: minilm-knn
Response: {'acknowledged': True}
Indexes after deletion attempt: ['minilm-pretrain-knn', '.opensearch-observability', '.plugins-ml-config', '.ql-datasources', '.kibana_1', '.opendistro_security', 'mpnet-mpf-knn']


In [19]:
# Create an index
aos_client.indices.create(index=index_name,body=knn_index,ignore=400)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'minilm-knn'}

In [None]:
#Load data to OpenSearch Index 
load_data_to_opensearch_index(df_en, aos_client, index_name)

In [22]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print(f"Records loaded into the index {index_name} is {res['hits']['total']['value']}.")

Records loaded into the index minilm-knn is 10000.
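
The figure of 10000 may reflect a default cap rather than the real document count: 61,302 rows were embedded in step 5, and a search response reports `hits.total.value` as a lower bound of 10,000 (with `relation` set to `"gte"`) unless `track_total_hits` is enabled. A hedged sketch of two ways to check, assuming an opensearch-py style client such as `aos_client`, is shown below. Both helper names are hypothetical.

```python
def exact_document_count(client, index_name):
    """Return the exact number of documents in an index via the count API."""
    response = client.count(index=index_name)
    return response["count"]

def search_total_is_lower_bound(search_response):
    """True when hits.total is a capped lower bound rather than an exact count."""
    return search_response["hits"]["total"]["relation"] == "gte"

# Shape of a capped search response (hand-written example, not real output)
capped = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
print(search_total_is_lower_bound(capped))  # True

# Usage with the notebook's client (not executed here):
# print(exact_document_count(aos_client, index_name))
```

Adding `"track_total_hits": True` to the search body is another option when the exact total is needed alongside the hits.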


### 8. Test "Semantic Search" 

Now that the document vectors are stored in OpenSearch, we can convert a query into a vector and perform a KNN search against them.
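
Since the same query body is built more than once in this notebook, it can be convenient to wrap it in a small function. The sketch below follows the body structure used in this section; `build_knn_query` is a hypothetical helper, and setting `size` equal to `k` mirrors what the notebook does rather than being a requirement.

```python
def build_knn_query(query_vector, k=20, field="vector"):
    """Build an OpenSearch k-NN query body for a knn_vector field."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "size": k,
        "query": {
            "knn": {
                field: {
                    "vector": list(query_vector),  # must be a plain list of floats
                    "k": k,
                }
            }
        },
    }

# A 3-value vector stands in for a real 384-dimensional query embedding
body = build_knn_query([0.1, 0.2, 0.3], k=5)
print(body["size"], body["query"]["knn"]["vector"]["k"])  # 5 5
```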

In [23]:
INPUT = "Riverice events in ottawa"
search_vector = sentence_to_vector(INPUT)
print(index_name)

query={
    "size": 20,
    "query": {
        "knn": {
            "vector":{
                "vector":search_vector,
                "k":20
            }
        }
    }
}

res = aos_client.search(index=index_name, size=20,body=query,request_timeout=55)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title'],hit['_source']['id']]
    query_result.append(row)
query_result_df = pd.DataFrame(data=query_result,columns=["_id","relevancy_score","title",'uuid'])
display(query_result_df)

minilm-knn


Unnamed: 0,_id,relevancy_score,title,uuid
0,sZ6rWpABNvGO3XAAW7u7,0.653354,Mapping major floods April-May 2017,34085f6d-106a-41af-a29b-53ed6947c249
1,Z56qWpABNvGO3XAA-rPd,0.643385,Ontario Hydro Network - Watercourses,17bfc1bb-4849-4615-83a4-f7bd1dd21fd2
2,lp6qWpABNvGO3XAA_LO4,0.642681,Ontario Integrated Hydrology (OIH) data,ef0c4387-38ce-4adc-b761-f0506b82564e
3,eZ6rWpABNvGO3XAAZLyp,0.639701,2022 Events,c5bb7cfa-1eb1-45fb-8d1a-1ca9d0eafec4
4,b56rWpABNvGO3XAAK7cL,0.638624,Road Point Events,01164c08-450c-4d66-8291-1ba018f2fc1c
5,3p6qWpABNvGO3XAA_7PK,0.636887,Major river systems in the Far North,e10f6d04-8eff-4ebc-bd47-37d640c443c5
6,Ep6qWpABNvGO3XAAva5R,0.635798,Lakes and Rivers Database (LCE),a4a5575d-e8e8-4410-bfc6-18e9361ffd3f
7,9p6qWpABNvGO3XAAjanz,0.635618,Waterfront course,14be6566-5e91-4553-9049-857bcfc0f7ca
8,bZ6rWpABNvGO3XAABrRN,0.635004,Rivers and ditches,c128aff5-325c-4599-ab66-1c9d0b3abc94
9,Gp6qWpABNvGO3XAA97OP,0.634775,River-Mountain Promenade,71433534-694b-4538-8cd7-86530790ab0c


### 9. Test "Semantic Search" using the model endpoints

Instead of embedding the query locally, this section sends the query text to a deployed SageMaker model endpoint, takes the returned vector, and runs the same KNN search in OpenSearch.

In [25]:
import json
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface.model import HuggingFaceModel

# Initialize a boto3 client for SageMaker
sagemaker_client = boto3.client('sagemaker', region_name='ca-central-1')  # Specify the AWS region

def list_sagemaker_endpoints():
    """List all SageMaker endpoints"""
    try:
        # Get the list of all SageMaker endpoints
        response = sagemaker_client.list_endpoints(SortBy='Name')
        print("Listing SageMaker Endpoints:")
        for endpoint in response['Endpoints']:
            print(f"Endpoint Name: {endpoint['EndpointName']}, Status: {endpoint['EndpointStatus']}")
    except Exception as e:
        print(f"Error listing SageMaker endpoints: {e}")

def invoke_sagemaker_endpoint_ft(endpoint_name, payload):
    """Invoke a SageMaker endpoint to get predictions with ContentType='application/json'."""
    # Initialize the runtime SageMaker client
    runtime_client = boto3.client('runtime.sagemaker', region_name='ca-central-1')
    try:
        # Invoke the SageMaker endpoint with a JSON-encoded payload
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='application/json',
            Body=json.dumps(payload)
        )
        # Decode the JSON response body
        return json.loads(response['Body'].read().decode())
    except Exception as e:
        # Returns None on failure, so callers should check the result before using it
        print(f"Error invoking SageMaker endpoint {endpoint_name}: {e}")

def invoke_sagemaker_endpoint_pretrain(endpoint_name, payload):
    """Invoke a SageMaker endpoint to get predictions with ContentType='text/plain'."""
    # Initialize the runtime SageMaker client
    runtime_client = boto3.client('runtime.sagemaker', region_name='ca-central-1')  

    try:
        # Ensure payload is a string, since ContentType is 'text/plain'
        if not isinstance(payload, str):
            payload = str(payload)
        
        # Invoke the SageMaker endpoint
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='text/plain',
            Body=payload
        )
        
        # Decode the response
        result = json.loads(response['Body'].read().decode())
        return (result)
        #print(f"Prediction from {endpoint_name}: {result}")
    except Exception as e:
        print(f"Error invoking SageMaker endpoint {endpoint_name}: {e}")
    


In [26]:
list_sagemaker_endpoints()

Listing SageMaker Endpoints:
Endpoint Name: semantic-search-pretrain-all-MiniLM-L6-v2-1719456491, Status: InService
Endpoint Name: all-mpnet-base-v2-mpf-huggingface-test, Status: InService


In [140]:
endpoint_name = 'semantic-search-pretrain-all-MiniLM-L6-v2-1719456491'
payload = "Riverice events in ottawa"
vector = invoke_sagemaker_endpoint_pretrain(endpoint_name, payload)
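
Before passing an endpoint result to OpenSearch, it can help to confirm that it really is a flat list of 384 numbers, since the invoke helpers return `None` on failure and the index mapping fixes the dimension. The sketch below is a hypothetical check: the response shape of the endpoint is not shown in this notebook, so the unwrapping of a nested `[[...]]` list is an assumption.

```python
def as_query_vector(endpoint_result, expected_dim=384):
    """Normalise an endpoint response into a flat list of floats."""
    vector = endpoint_result
    # Assumption: a single embedding may arrive wrapped in an outer list, [[...]]
    while isinstance(vector, list) and len(vector) == 1 and isinstance(vector[0], list):
        vector = vector[0]
    if not isinstance(vector, list) or not all(isinstance(x, (int, float)) for x in vector):
        raise ValueError("endpoint did not return a list of numbers")
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return [float(x) for x in vector]

# Toy response with 3 dimensions instead of 384
print(as_query_vector([[0.5, 1, -2]], expected_dim=3))  # [0.5, 1.0, -2.0]
```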

In [141]:
query={
    "size": 20,
    "query": {
        "knn": {
            "vector":{
                "vector":vector,
                "k":20
            }
        }
    }
}

res = aos_client.search(index=index_name, size=20,body=query,request_timeout=55)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title'],hit['_source']['id']]
    query_result.append(row)
query_result_df = pd.DataFrame(data=query_result,columns=["_id","relevancy_score","title",'uuid'])
display(query_result_df)

Unnamed: 0,_id,relevancy_score,title,uuid
0,ZtwtVpAB8FP_I1hR7YnC,0.653354,Mapping major floods April-May 2017,34085f6d-106a-41af-a29b-53ed6947c249
1,HNwtVpAB8FP_I1hRZoE9,0.643385,Ontario Hydro Network - Watercourses,17bfc1bb-4849-4615-83a4-f7bd1dd21fd2
2,S9wtVpAB8FP_I1hRaYEp,0.642681,Ontario Integrated Hydrology (OIH) data,ef0c4387-38ce-4adc-b761-f0506b82564e
3,LtwtVpAB8FP_I1hR-4qU,0.639701,2022 Events,c5bb7cfa-1eb1-45fb-8d1a-1ca9d0eafec4
4,JNwtVpAB8FP_I1hRqYXe,0.638624,Road Point Events,01164c08-450c-4d66-8291-1ba018f2fc1c
5,k9wtVpAB8FP_I1hRbYGa,0.636887,Major river systems in the Far North,e10f6d04-8eff-4ebc-bd47-37d640c443c5
6,x9wtVpAB8FP_I1hRDXuL,0.635798,Lakes and Rivers Database (LCE),a4a5575d-e8e8-4410-bfc6-18e9361ffd3f
7,q9wsVpAB8FP_I1hRx3fH,0.635618,Waterfront course,14be6566-5e91-4553-9049-857bcfc0f7ca
8,ItwtVpAB8FP_I1hRdoJN,0.635004,Rivers and ditches,c128aff5-325c-4599-ab66-1c9d0b3abc94
9,z9wtVpAB8FP_I1hRYYCu,0.634775,River-Mountain Promenade,71433534-694b-4538-8cd7-86530790ab0c
