# Semantic Search with Amazon OpenSearch Service 
To create semantic search, we will add a vector representation of the metadata to our data set in OpenSearch, then do the same with our sample query "Wildfires in Canada". In OpenSearch, we'll use a KNN search to find matches based on a cosine similarity rating on the vector.
We will:
1. Use a HuggingFace sentence-transformer BERT model to generate sentence embedding for the geo.ca metadata dataset
2. Upload the dataset to OpenSearch, with the original metadata schema text combined with the vector representation of the questions.
3. Translate the query question to a vector.
4. Perform a KNN search in OpenSearch to perform semantic search

### 1. Check PyTorch Version


As in the previous modules, let's import PyTorch and confirm that have have the latest version of PyTorch. The version should already be 2.0.1 or higher. If not, please run the lab in order to get everything set up.

In [2]:
import torch 
print(torch.__version__)

2.1.0


### 2. Retrieve notebook variables

The line below will retrieve your shared variables from the previous notebook.

In [4]:
%store -r

### 3. import library 

In [13]:
# #installed in the previous notebook
# !pip install -q boto3
# !pip install -q requests
# !pip install -q requests-aws4auth
# !pip install -q opensearch-py
# !pip install -q tqdm
# !pip install -q boto3
# !pip install -q install transformers[torch]
# !pip install -q transformers
# !pip install -q sentence-transformers rank_bm25
# !pip install -q nltk

In [3]:
import boto3
import re
import time
import sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### 4. Prepare BERT Model 
#### Option 1: DistilBert model
For this module, we will be using the HuggingFace BERT model to generate vectorization data, where every sentence is 768 dimension data. Let's create some helper functions we'll use later on.
![BERT](image/nlp_bert.png)

We are creating 2 functions:
1. mean_pooling
2. sentence_to_vector - this is the key function we'll use to generate our vector embedding for the metadata dataset.

A reason for not using DistilBert:
 Transformer models like DistilBert have a fixed maximum input length (512), and any input longer than this limit can cause errors during processing.Our input sequence length (1086 tokens) exceeds the model's maximum sequence length (512 tokens).

In [5]:
# import torch
# from transformers import AutoTokenizer, AutoModel
# from transformers import DistilBertTokenizer, DistilBertModel

# #model_name = "distilbert-base-uncased"
# #model_name = "sentence-transformers/msmarco-distilbert-base-dot-prod-v3"
# model_name = "sentence-transformers/distilbert-base-nli-stsb-mean-tokens" #https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens


# #Mean Pooling - Take attention mask into account for correct averaging
# def mean_pooling(model_output, attention_mask):
#     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#     sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
#     sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
#     return sum_embeddings / sum_mask


# def sentence_to_vector(raw_inputs):
#     tokenizer = DistilBertTokenizer.from_pretrained(model_name)
#     model = DistilBertModel.from_pretrained(model_name)
#     inputs_tokens = tokenizer(raw_inputs, padding=True, return_tensors="pt")
    
#     with torch.no_grad():
#         outputs = model(**inputs_tokens)

#     sentence_embeddings = mean_pooling(outputs, inputs_tokens['attention_mask'])
#     return sentence_embeddings




#### Option 2: all-MiniLM-L6-v2
We can also use sentence-transformer models ['all-MiniLM-L6-v2'](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) which simplifies the process of obtaining sentence embeddings. It has 384 dimensions.It designed for generating sentence embeddings directly, which means we can use the Sentence Transformers library's functionality to handle both tokenization and embedding in a more streamlined manner compared to manually handling with DistilBertModel and DistilBertTokenizer.

The SentenceTransformer class's encode method directly handles the text input, tokenization, and conversion to sentence embeddings, eliminating the need for manual mean pooling. The encode method returns a tensor of sentence embeddings, where each embedding corresponds to the input sentences provided to the function.



In [4]:
#!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import numpy as np 

# Load the Sentence Transformer model
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

def sentence_to_vector(raw_inputs):
    # Encode sentences to get sentence embeddings
    sentence_embeddings = model.encode(raw_inputs, convert_to_tensor=True)
    """
    When you work with vectors (such as embeddings) in Elasticsearch or OpenSearch, you need to convert the PyTorch tensor to a list of floats before indexing the document. 
    """
    encod_np_array = np.array(sentence_embeddings)
    encod_list = encod_np_array.tolist()
        
    return encod_list



model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

### 5. Prepare the metadata 

In [5]:
import pandas as pd 
import boto3
import json
from requests_aws4auth import AWS4Auth
import io

In [6]:
def load_parquet_from_s3_to_df(region, s3_bucket, s3_key):
    """
    Load a Parquet file from an S3 bucket into a pandas DataFrame.

    Parameters:
    - region: AWS region where the S3 bucket is located.
    - s3_bucket: Name of the S3 bucket.
    - s3_key: Key (path) to the Parquet file within the S3 bucket.

    Returns:
    - df: pandas DataFrame containing the data from the Parquet file.
    """
    
    # Setup AWS session and clients
    session = boto3.Session(region_name=region)
    s3 = session.resource('s3')

    # Load the Parquet file as a pandas DataFrame
    object = s3.Object(s3_bucket, s3_key)
    body = object.get()['Body'].read()
    df = pd.read_parquet(io.BytesIO(body))
    return df

df = load_parquet_from_s3_to_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-dev', 'records.parquet')

In [7]:
df.head()
#df.columns

Unnamed: 0,features_type,features_geometry_type,features_geometry_coordinates,features_properties_id,features_properties_title_en,features_properties_title_fr,features_properties_description_en,features_properties_description_fr,features_properties_keywords_en,features_properties_keywords_fr,...,features_properties_distributor,features_properties_options,features_properties_temporalExtent_end_@indeterminatePosition,features_properties_temporalExtent_end_#text,features_properties_plugins,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,features_popularity,features_similarity
0,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",000183ed-8864-42f0-ae43-c4313a860720,"Principal Mineral Areas, Producing Mines, and ...","Principales régions minières, principales mine...",This dataset is produced and published annuall...,Ce jeu de données est produit et publié annuel...,"mineralization, mineral occurrences, mines, hy...","minéralisation, indices minéralisés, mines, hy...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://maps-cartes.services.geo.ca/...",,,[],,,[],1250806,"[{""sim"": ""sim1"", ""features_properties_id"": ""b6..."
1,Feature,Polygon,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...",7f245e4d-76c2-4caa-951a-45d1d2051333,"Canadian Digital Elevation Model, 1945-2011","Modèle numérique d'élévation du Canada, 1945-2011",This collection is a legacy product that is no...,Ce produit fait maintenant partie du patrimoin...,"Canada, Earth Sciences, elevation, relief, geo...","Canada, Sciences de la Terre, élévation, relie...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://maps.geogratis.gc.ca/wms/ele...",,,[],,,[],210798,"[{""sim"": ""sim1"", ""features_properties_id"": ""76..."
2,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",085024ac-5a48-427a-a2ea-d62af73f2142,Canada's National Earthquake Scenario Catalogue,Catalogue national de scénarios de tremblement...,"The National Earthquake Scenario Catalogue, pr...",Le dépôt est utilisé pour l’élaboration du cat...,"Emergency preparedness, Earth sciences, Earthq...","Protection civile, Sciences de la terre, Tremb...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://github.com/OpenDRR/earthquak...",,,[],,,[],140088,"[{""sim"": ""sim1"", ""features_properties_id"": ""4c..."
3,Feature,Polygon,"[[[-104.75571511, 50.42392886], [-104.56356008...",03ccfb5c-a06e-43e3-80fd-09d4f8f69703,Temporal Series of the National Air Photo Libr...,Série temporelle de la photothèque nationale d...,"Note: To visualize the data in the viewer, zoo...",Note: Pour visualiser les données dans l’outil...,"Mosaic, Aerial photography, Access to informat...","Mosaïque, Photographie aérienne, Accès à l'inf...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://datacube-prod-data-public.s3...",,,[],,,[],120162,"[{""sim"": ""sim1"", ""features_properties_id"": ""23..."
4,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",488faf70-b50b-4749-ac1c-a1fd44e06f11,Indigenous Mining Agreements,Ententes minières autochtones,The Indigenous Mining Agreements dataset provi...,Les données des ententes minières autochtones ...,"Indigenous, First Nations, Métis, Indigenous a...","Autochtones, Premières nations, Métis, Affaire...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://atlas.gc.ca/imaema/en/"", ""pr...",,,[],,,[],111036,"[{""sim"": ""sim1"", ""features_properties_id"": ""CG..."


In [8]:
df.describe()

Unnamed: 0,features_type,features_geometry_type,features_geometry_coordinates,features_properties_id,features_properties_title_en,features_properties_title_fr,features_properties_description_en,features_properties_description_fr,features_properties_keywords_en,features_properties_keywords_fr,...,features_properties_distributor,features_properties_options,features_properties_temporalExtent_end_@indeterminatePosition,features_properties_temporalExtent_end_#text,features_properties_plugins,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,features_popularity,features_similarity
count,10957,10957,10957,10956,10956,10956,10957,10957,7534,7534,...,10957,10957,25,22,10957,4364,51,10957,10957,9987
unique,1,1,5002,10955,9534,9492,6955,6919,4092,4073,...,1074,10836,2,7,2,3,1,6,547,8574
top,Feature,Polygon,"[[[-139.5, 48], [-113.5, 48], [-113.5, 60], [-...",0c25772d-da22-4ac1-b130-c1d97b935f6f,WMS,WMS,This record has been generated from a geospati...,Cet enregistrement a été généré à partir d'un ...,"SpatioTemporal Asset Catalog, stac, digital el...","SpatioTemporal Asset Catalog, stac, modèle num...",...,"""Not Available; Indisponible""",[],after,2016-03-30,[],Canadian Geospatial Data Infrastructure Web Ha...,sentinel-1,[],0,"[{""sim"": ""sim1"", ""features_properties_id"": ""CG..."
freq,10957,10957,514,2,144,144,787,787,698,698,...,3423,52,22,8,10955,3423,51,10016,5337,279


Subset to columns that are required in the app.geo.ca [api response](https://geocore.api.geo.ca/geo?north=81.77364370720657&east=360&south=-8.407168163601076&west=-359.6484375&keyword=&lang=en&min=1&max=10&sort=popularity-desc). 
##### Note, we are focus on the english search at the moment. 

In [9]:
#Extract organization from contact json, English only 
def extract_organisation_en(contact_str):
    try:
        # Parse the stringified JSON into Python objects
        contact_data = json.loads(contact_str)
        # If the parsed data is a list, iterate through it
        if isinstance(contact_data, list):
            for item in contact_data:
                # Check if 'organisation' and 'en' keys exist
                if 'organisation' in item and 'en' in item['organisation']:
                    return item['organisation']['en']
        elif isinstance(contact_data, dict):
            # If the data is a dictionary, extract 'organisation' in 'en' directly
            return contact_data.get('organisation', {}).get('en', None)
    except json.JSONDecodeError:
        # Handle cases where the contact string is not valid JSON
        return None
    except Exception as e:
        # Catch-all for any other unexpected errors
        return f"Error: {str(e)}"

# Subset to selected columns 
col_names_list = ['features_properties_id','features_geometry_coordinates','features_properties_title_en','features_properties_description_en','features_properties_date_published_date','features_properties_keywords_en','features_properties_options','features_properties_contact','features_properties_topicCategory','features_properties_date_created_date','features_properties_spatialRepresentation','features_properties_type','features_properties_temporalExtent_begin','features_properties_temporalExtent_end','features_properties_graphicOverview','features_properties_language','features_popularity','features_properties_sourceSystemName','features_properties_eoCollection','features_properties_eoFilters']
df_en = df[col_names_list]
#df_en = df_en[:100]
    
# Create new column 'organization_en'
df_en['organisation_en'] = df_en['features_properties_contact'].apply(extract_organisation_en)

# Create a new column 'temporalExtent' as a dictionary of {'begin': ..., 'end': ...}
values_to_replace = {'Present': None, 'Not Available; Indisponible': None}
columns_to_replace = ['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

df_en['temporalExtent'] = df_en.apply(lambda row: {'begin': row['features_properties_temporalExtent_begin'], 'end': row['features_properties_temporalExtent_end']}, axis=1)
df_en = df_en.drop(columns =['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end'])

#modifies dates to acceptable values
values_to_replace = {'Not Available; Indisponible': None}
columns_to_replace = ['features_properties_date_published_date', 'features_properties_date_created_date']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_en['organisation_en'] = df_en['features_properties_contact'].apply(extract_organisation_en)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_en['temporalExtent'] = df_

In [10]:
log_level=""
if log_level == "DEBUG":
    print(df_en.describe())
    print(f'Total rows are {df_en.shape[0]}, total columns are {df_en.shape[1]}')
    print(f'Types of each columns are \n {df_en.dtypes}')

print(df_en.shape)
df_en.head(4)

(10957, 20)


Unnamed: 0,features_properties_id,features_geometry_coordinates,features_properties_title_en,features_properties_description_en,features_properties_date_published_date,features_properties_keywords_en,features_properties_options,features_properties_contact,features_properties_topicCategory,features_properties_date_created_date,features_properties_spatialRepresentation,features_properties_type,features_properties_graphicOverview,features_properties_language,features_popularity,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,organisation_en,temporalExtent
0,000183ed-8864-42f0-ae43-c4313a860720,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...","Principal Mineral Areas, Producing Mines, and ...",This dataset is produced and published annuall...,2020-02-27,"mineralization, mineral occurrences, mines, hy...","[{""url"": ""https://maps-cartes.services.geo.ca/...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",economy,2019-04-12,vector; vecteur,series; série,"[{""overviewFileName"": ""http://ftp.maps.canada....",eng; CAN,1250806,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '2020-01', 'end': '2020-12'}"
1,7f245e4d-76c2-4caa-951a-45d1d2051333,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...","Canadian Digital Elevation Model, 1945-2011",This collection is a legacy product that is no...,2015,"Canada, Earth Sciences, elevation, relief, geo...","[{""url"": ""https://maps.geogratis.gc.ca/wms/ele...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",elevation,2012-11-06,grid; grille,dataset; jeuDonnées,"[{""overviewFileName"": ""https://ftp.maps.canada...",eng; CAN,210798,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '1945', 'end': '2011'}"
2,085024ac-5a48-427a-a2ea-d62af73f2142,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",Canada's National Earthquake Scenario Catalogue,"The National Earthquake Scenario Catalogue, pr...",2021-07-06,"Emergency preparedness, Earth sciences, Earthq...","[{""url"": ""https://github.com/OpenDRR/earthquak...","[{""individual"": ""Dr. Tiegan Hobbs"", ""position""...",geoscientificInformation,2021-07-06,vector; vecteur,series; série,[],eng; CAN,140088,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '2021-07-06', 'end': <NA>}"
3,03ccfb5c-a06e-43e3-80fd-09d4f8f69703,"[[[-104.75571511, 50.42392886], [-104.56356008...",Temporal Series of the National Air Photo Libr...,"Note: To visualize the data in the viewer, zoo...",2021-03-31,"Mosaic, Aerial photography, Access to informat...","[{""url"": ""https://datacube-prod-data-public.s3...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",imageryBaseMapsEarthCover,2020-08-01,grid; grille,dataset; jeuDonnées,"[{""overviewFileName"": ""http://datacube-prod-da...",eng; CAN,120162,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '1947', 'end': '1967'}"


### 6. Text Preprocess using NLTK 

Create a new column that concadenate the selected columns: features_properties_title_en, features_properties_description_en,features_properties_keywords_en, and apply the following preprocesing before tokenization:
- convert to lower case 
- remove stopwords and punctuation
- remove apostrophe
- stemming


In [11]:
import pandas as pd
import nltk
from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import word_tokenize   # module for tokenizing strings 
import string
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Creata a new column for metadata text 

In [12]:
df_en['metadata_en'] = df_en['features_properties_title_en'] + ' ' + df_en['features_properties_description_en'] + ' ' + df_en['features_properties_keywords_en'] 
if df_en['metadata_en'].isnull().any():
    df_en['metadata_en'] = df_en['metadata_en'].fillna('')

In [13]:
# Function to clean text
def clean_text(text):
    """
    text: raw tex input, a string of text, or a list of string
    output: preprocess text in string format 
    """
    # Set of stopwords
    stop_words = set(stopwords.words('english'))
    # Initialize the Porter Stemmer
    stemmer = PorterStemmer()
    
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation.replace("'", "")))  # Keep apostrophe
    # Remove apostrophes
    text = text.replace("'", "")
    # Tokenize text
    word_tokens = word_tokenize(text)
    # Remove stopwords and stem
    filtered_text = [stemmer.stem(word) for word in word_tokens if word not in stop_words]
    return " ".join(filtered_text)

df_en['processed_metadata_en'] = df_en['metadata_en'].apply(clean_text)
# Show the processed column
print(df_en.loc[1:3, ['metadata_en', 'processed_metadata_en']])
df_en.head(4)

                                         metadata_en  \
1  Canadian Digital Elevation Model, 1945-2011 Th...   
2  Canada's National Earthquake Scenario Catalogu...   
3  Temporal Series of the National Air Photo Libr...   

                               processed_metadata_en  
1  canadian digit elev model 19452011 collect leg...  
2  canada nation earthquak scenario catalogu nati...  
3  tempor seri nation air photo librari napl regi...  


Unnamed: 0,features_properties_id,features_geometry_coordinates,features_properties_title_en,features_properties_description_en,features_properties_date_published_date,features_properties_keywords_en,features_properties_options,features_properties_contact,features_properties_topicCategory,features_properties_date_created_date,...,features_properties_graphicOverview,features_properties_language,features_popularity,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,organisation_en,temporalExtent,metadata_en,processed_metadata_en
0,000183ed-8864-42f0-ae43-c4313a860720,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...","Principal Mineral Areas, Producing Mines, and ...",This dataset is produced and published annuall...,2020-02-27,"mineralization, mineral occurrences, mines, hy...","[{""url"": ""https://maps-cartes.services.geo.ca/...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",economy,2019-04-12,...,"[{""overviewFileName"": ""http://ftp.maps.canada....",eng; CAN,1250806,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '2020-01', 'end': '2020-12'}","Principal Mineral Areas, Producing Mines, and ...",princip miner area produc mine oil ga field 90...
1,7f245e4d-76c2-4caa-951a-45d1d2051333,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...","Canadian Digital Elevation Model, 1945-2011",This collection is a legacy product that is no...,2015,"Canada, Earth Sciences, elevation, relief, geo...","[{""url"": ""https://maps.geogratis.gc.ca/wms/ele...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",elevation,2012-11-06,...,"[{""overviewFileName"": ""https://ftp.maps.canada...",eng; CAN,210798,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '1945', 'end': '2011'}","Canadian Digital Elevation Model, 1945-2011 Th...",canadian digit elev model 19452011 collect leg...
2,085024ac-5a48-427a-a2ea-d62af73f2142,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",Canada's National Earthquake Scenario Catalogue,"The National Earthquake Scenario Catalogue, pr...",2021-07-06,"Emergency preparedness, Earth sciences, Earthq...","[{""url"": ""https://github.com/OpenDRR/earthquak...","[{""individual"": ""Dr. Tiegan Hobbs"", ""position""...",geoscientificInformation,2021-07-06,...,[],eng; CAN,140088,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '2021-07-06', 'end': <NA>}",Canada's National Earthquake Scenario Catalogu...,canada nation earthquak scenario catalogu nati...
3,03ccfb5c-a06e-43e3-80fd-09d4f8f69703,"[[[-104.75571511, 50.42392886], [-104.56356008...",Temporal Series of the National Air Photo Libr...,"Note: To visualize the data in the viewer, zoo...",2021-03-31,"Mosaic, Aerial photography, Access to informat...","[{""url"": ""https://datacube-prod-data-public.s3...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",imageryBaseMapsEarthCover,2020-08-01,...,"[{""overviewFileName"": ""http://datacube-prod-da...",eng; CAN,120162,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '1947', 'end': '1967'}",Temporal Series of the National Air Photo Libr...,tempor seri nation air photo librari napl regi...


### 7. Create tokens - convert the text data to vector 
Using the helper function we created earlier, let's convert the processed_metadata_en from into vectors. 
Note, we test with the first 100 records only. 

In [14]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

# Make sure to call this after importing tqdm
tqdm.pandas()

#df_en = df_en[:100]
df_en['vector'] = df_en["processed_metadata_en"].progress_apply(sentence_to_vector)


  0%|          | 0/10957 [00:00<?, ?it/s]

In [15]:
print(f'The dimension of the vector is len(df_en.loc[1, "vector"]')
df_en.head(4)

The dimension of the vector is len(df_en.loc[1, "vector"]


Unnamed: 0,features_properties_id,features_geometry_coordinates,features_properties_title_en,features_properties_description_en,features_properties_date_published_date,features_properties_keywords_en,features_properties_options,features_properties_contact,features_properties_topicCategory,features_properties_date_created_date,...,features_properties_language,features_popularity,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,organisation_en,temporalExtent,metadata_en,processed_metadata_en,vector
0,000183ed-8864-42f0-ae43-c4313a860720,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...","Principal Mineral Areas, Producing Mines, and ...",This dataset is produced and published annuall...,2020-02-27,"mineralization, mineral occurrences, mines, hy...","[{""url"": ""https://maps-cartes.services.geo.ca/...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",economy,2019-04-12,...,eng; CAN,1250806,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '2020-01', 'end': '2020-12'}","Principal Mineral Areas, Producing Mines, and ...",princip miner area produc mine oil ga field 90...,"[-0.054844897240400314, -0.02167576551437378, ..."
1,7f245e4d-76c2-4caa-951a-45d1d2051333,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...","Canadian Digital Elevation Model, 1945-2011",This collection is a legacy product that is no...,2015,"Canada, Earth Sciences, elevation, relief, geo...","[{""url"": ""https://maps.geogratis.gc.ca/wms/ele...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",elevation,2012-11-06,...,eng; CAN,210798,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '1945', 'end': '2011'}","Canadian Digital Elevation Model, 1945-2011 Th...",canadian digit elev model 19452011 collect leg...,"[0.0068742078728973866, -0.042781200259923935,..."
2,085024ac-5a48-427a-a2ea-d62af73f2142,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",Canada's National Earthquake Scenario Catalogue,"The National Earthquake Scenario Catalogue, pr...",2021-07-06,"Emergency preparedness, Earth sciences, Earthq...","[{""url"": ""https://github.com/OpenDRR/earthquak...","[{""individual"": ""Dr. Tiegan Hobbs"", ""position""...",geoscientificInformation,2021-07-06,...,eng; CAN,140088,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '2021-07-06', 'end': <NA>}",Canada's National Earthquake Scenario Catalogu...,canada nation earthquak scenario catalogu nati...,"[0.07048127800226212, 0.027553539723157883, 0...."
3,03ccfb5c-a06e-43e3-80fd-09d4f8f69703,"[[[-104.75571511, 50.42392886], [-104.56356008...",Temporal Series of the National Air Photo Libr...,"Note: To visualize the data in the viewer, zoo...",2021-03-31,"Mosaic, Aerial photography, Access to informat...","[{""url"": ""https://datacube-prod-data-public.s3...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",imageryBaseMapsEarthCover,2020-08-01,...,eng; CAN,120162,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '1947', 'end': '1967'}",Temporal Series of the National Air Photo Libr...,tempor seri nation air photo librari napl regi...,"[0.07804589718580246, -0.019797218963503838, 0..."


### 8. Create an OpenSearch cluster connection.
Next, we'll use Python API to set up connection with OpenSearch Cluster.


In [16]:
import boto3
import json

cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')


def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search-with-opensearch"

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OpenSearchDomainEndpoint']
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])

outputs
print(aos_host)

search-semantic-search-dfcizxxxuj62dusl5skmeu3czu.ca-central-1.es.amazonaws.com


In [17]:
from opensearchpy import OpenSearch,RequestsHttpConnection
import boto3

#update the region if you're working other than us-east-1
region = 'ca-central-1' 

# #Alternatively, auth can be get using the AWS4SignerAuth library
# from opensearchpy import AWSV4SignerAuth
# credentials = boto3.Session().get_credentials()
# auth = AWSV4SignerAuth(credentials, region)

auth = (aos_credentials['username'], aos_credentials['password'])
aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    timeout = 60,
    connection_class = RequestsHttpConnection
)
print(aos_client)

<OpenSearch([{'host': 'search-semantic-search-dfcizxxxuj62dusl5skmeu3czu.ca-central-1.es.amazonaws.com', 'port': 443}])>


### 9. Create a index in Amazon Opensearch Service 
The following index setting is configuring an OpenSearch index to support k-nearest neighbor (KNN) searches with specific characteristics. KNN is a feature that allows for similarity searches, finding the "nearest" documents in a high-dimensional space. This setting is crucial for enabling vector-based searches, where vectors represent document features in a multidimensional space.

How k-NN search works:K-NN search works by calculating the distance between a query vector and the vectors in the dataset to find the closest matches. OpenSearch stores these vectors in an index and uses specialized algorithms (like HNSW, Hierarchical Navigable Small World graphs) to perform efficient similarity search at scale.



In [18]:
index_name = "nlp_knn"
knn_index = {
    "settings": {
        "index.knn": True, #This enables the k-nearest neighbor (KNN) search capability on the index.
        "index.knn.space_type": "cosinesimil", #cosine similarity 
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": 384,
                "store": True
            },
            "coordinates":{
              "type": "geo_shape", 
              "store": True 
            }  
        }
    }
}



If the index is already exist, we will delete the index and recreate 

In [19]:
from opensearchpy import OpenSearch

def delete_index_if_exists(aos_client, index_to_delete):
    """
    Deletes the specified index if it exists.

    :param aos_client: An instance of OpenSearch client.
    :param index_to_delete: The name of the index to delete.
    """
    # List all indexes and check if the specified index exists
    all_indices = aos_client.cat.indices(format='json')
    existing_indices = [index['index'] for index in all_indices]
    print("Current indexes:", existing_indices)

    if index_to_delete in existing_indices:
        # Delete the specified index
        try:
            response = aos_client.indices.delete(index=index_to_delete)
            print(f"Deleted index: {index_to_delete}")
            print("Response:", response)
        except Exception as e:
            print(f"Error deleting index {index_to_delete}:", e)
    else:
        print(f"Index {index_to_delete} does not exist.")

    # List all indexes again to confirm deletion
    all_indices_after_deletion = aos_client.cat.indices(format='json')
    existing_indices_after_deletion = [index['index'] for index in all_indices_after_deletion]
    print("Indexes after deletion attempt:", existing_indices_after_deletion)

delete_index_if_exists(aos_client, index_to_delete=index_name)

Current indexes: ['nlp_knn', '.opensearch-observability', '.plugins-ml-config', 'search', 'keyword_search', '.ql-datasources', '.opendistro_security', '.kibana_1']
Deleted index: nlp_knn
Response: {'acknowledged': True}
Indexes after deletion attempt: ['.opensearch-observability', '.plugins-ml-config', 'search', 'keyword_search', '.ql-datasources', '.opendistro_security', '.kibana_1']


In [20]:
#Create a index 
aos_client.indices.create(index=index_name,body=knn_index,ignore=400)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'nlp_knn'}

In [21]:
aos_client.indices.get(index=index_name)

{'nlp_knn': {'aliases': {},
  'mappings': {'properties': {'coordinates': {'type': 'geo_shape',
     'store': True},
    'vector': {'type': 'knn_vector', 'store': True, 'dimension': 384}}},
  'settings': {'index': {'replication': {'type': 'DOCUMENT'},
    'number_of_shards': '5',
    'provided_name': 'nlp_knn',
    'knn.space_type': 'cosinesimil',
    'knn': 'true',
    'creation_date': '1713375761340',
    'analysis': {'analyzer': {'default': {'type': 'standard',
       'stopwords': '_english_'}}},
    'number_of_replicas': '1',
    'uuid': 'sWaN69J2TuiqUdihASo1qA',
    'version': {'created': '136327827'}}}}}

### 10. Load the raw data into the Index
To mimic the API response required by app.geo.ca, we will index the required properties 

In [22]:
import json
from tqdm import tqdm
import time

In [23]:
def index_data_to_opensearch(df_en, aos_client, index_name, log_level="INFO"):
    """
    Index data from a pandas DataFrame to an OpenSearch index.

    Parameters:
    - df_en: DataFrame containing the data to index.
    - aos_client: OpenSearch client.
    - index_name: Name of the OpenSearch index to which the data will be indexed.
    - log_level: Logging level, defaults to "INFO". Set to "DEBUG" for detailed logs.
    """
    start_time = time.time()

    # Convert DataFrame to a list of dictionaries (JSON)
    json_en = df_en.to_dict("records")

    # Index the data
    for x in tqdm(json_en, desc="Indexing Records"):
        try:
            bounding_box = json.loads(x.get('features_geometry_coordinates', '[]'))
            coordinates = {
                "type": "polygon",
                "coordinates": bounding_box
            }
            
            document = {
                'id': x.get('features_properties_id', ''),
                'coordinates': coordinates,
                'title': x.get('features_properties_title_en', ''),
                'description': x.get('features_properties_description_en', ''),
                'published': x.get('features_properties_date_published_date', ''),
                'keywords': x.get('features_properties_keywords_en', ''),
                'options': json.loads(x.get('features_properties_options', '[]')),
                'contact': json.loads(x.get('features_properties_contact', '[]')),
                'topicCategory': x.get('features_properties_topicCategory', ''),
                'created': x.get('features_properties_date_created_date', ''),
                'spatialRepresentation': x.get('features_properties_spatialRepresentation', ''),
                'type': x.get('features_properties_type', ''),
                'temporalExtent': x.get('temporalExtent', ''),
                'graphicOverview': json.loads(x.get('features_properties_graphicOverview', '[]')),
                'language': x.get('features_properties_language', ''),
                'organisation': x.get('organisation_en', ''),
                'popularity': int(x.get('features_popularity', '0')),
                'systemName': x.get('features_properties_sourceSystemName', ''),
                'eoCollection': x.get('features_properties_eoCollection', ''),
                'eoFilters': json.loads(x.get('features_properties_eoFilters', '[]')),
                "vector":x.get("vector", "")
            }

            if log_level == "DEBUG":
                print((json.dumps(document, indent=4)))

            aos_client.index(index=index_name, body=document)

        except Exception as e:
            print(e)

    # Final record count check (Optional, can slow down the script if the index is large)
    res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
    total_time = time.time() - start_time
    print(f"Completed indexing. Records loaded into the index {index_name}: {res['hits']['total']['value']}. Total time taken: {total_time:.2f} seconds.")

# Example usage
index_data_to_opensearch(df_en, aos_client, index_name)

Indexing Records:  54%|█████▍    | 5958/10957 [01:35<01:14, 67.30it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  57%|█████▋    | 6202/10957 [01:39<01:11, 66.87it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  57%|█████▋    | 6239/10957 [01:40<01:09, 67.98it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  58%|█████▊    | 6339/10957 [01:41<01:04, 71.29it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')
RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  60%|█████▉    | 6559/10957 [01:44<01:04, 67.77it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  60%|██████    | 6622/10957 [01:45<01:05, 66.19it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')
RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  61%|██████    | 6694/10957 [01:46<00:57, 74.34it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')
RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  63%|██████▎   | 6857/10957 [01:49<00:56, 72.24it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  63%|██████▎   | 6937/10957 [01:50<01:01, 65.10it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  64%|██████▎   | 6974/10957 [01:51<00:58, 68.16it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  64%|██████▍   | 7013/10957 [01:51<00:53, 73.21it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')
RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  65%|██████▍   | 7091/10957 [01:52<01:03, 60.91it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  65%|██████▌   | 7142/10957 [01:53<01:03, 60.30it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  72%|███████▏  | 7919/10957 [02:06<00:43, 69.91it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  74%|███████▍  | 8160/10957 [02:12<00:55, 50.83it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  77%|███████▋  | 8383/10957 [02:15<00:40, 64.35it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')
RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  77%|███████▋  | 8397/10957 [02:16<00:43, 58.90it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  77%|███████▋  | 8409/10957 [02:16<00:45, 55.63it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  77%|███████▋  | 8428/10957 [02:16<00:49, 51.02it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  77%|███████▋  | 8465/10957 [02:17<00:48, 51.67it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  79%|███████▉  | 8643/10957 [02:20<00:37, 60.96it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  80%|███████▉  | 8722/10957 [02:21<00:37, 60.25it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  83%|████████▎ | 9072/10957 [02:27<00:28, 66.40it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')
RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  84%|████████▍ | 9209/10957 [02:29<00:25, 69.17it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  84%|████████▍ | 9247/10957 [02:30<00:24, 70.31it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')
RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  85%|████████▍ | 9287/10957 [02:30<00:23, 70.76it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records:  86%|████████▋ | 9468/10957 [02:33<00:24, 60.42it/s]

RequestError(400, 'mapper_parsing_exception', 'failed to parse field [coordinates] of type [geo_shape]')


Indexing Records: 100%|██████████| 10957/10957 [02:57<00:00, 61.68it/s]


Completed indexing. Records loaded into the index nlp_knn: 10000. Total time taken: 191.09 seconds.


In [24]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print(f"Records loaded into the index {index_name} is {res['hits']['total']['value']}.")

Records loaded into the index nlp_knn is 10000.


### 11. Test "Semantic Search" 

Now that we have vector in OpenSearch and a vector for our query question, let's perform a KNN search in OpenSearch.

In [29]:
INPUT = "Ontario Road Network"
search_vector = sentence_to_vector(INPUT)

query={
    "size": 20,
    "query": {
        "knn": {
            "vector":{
                "vector":search_vector,
                "k":20
            }
        }
    }
}

res = aos_client.search(index=index_name, size=20,body=query,request_timeout=55)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title'],hit['_source']['id']]
    query_result.append(row)
query_result_df = pd.DataFrame(data=query_result,columns=["_id","relevancy_score","title",'uuid'])
display(query_result_df)

Unnamed: 0,_id,relevancy_score,title,uuid
0,15Yn7Y4BdooaaHUeh40e,0.770623,Canada’s National Highway System,c5c249c4-dea6-40a6-8fae-188a42030908
1,tpYo7Y4BdooaaHUeVZpl,0.741267,Ontario Road Network: segment with address,0a38cc68-c82d-4125-bbe8-2ad73c2cb7f7
2,rZYo7Y4BdooaaHUehJ3C,0.699612,Commercial Vehicle Survey Data (Commercial veh...,35eddd83-a0ca-4932-a901-2bd90540df84
3,uJYn7Y4BdooaaHUehY0d,0.694755,Transport Networks in Canada - CanVec Series -...,2dac78ba-8543-48a6-8f07-faeef56f9895
4,7JYo7Y4BdooaaHUeeZyg,0.688326,Border crossings,39b3cc87-d369-4465-995f-34ebd0660432
5,nZYn7Y4BdooaaHUeg41I,0.687971,GO Train stations,023ef1fd-878e-425d-a369-15c3e27a67f9
6,lZYo7Y4BdooaaHUep58K,0.685451,Interchanges,6556b771-9564-4beb-a1ed-b663601571bd
7,8JYn7Y4BdooaaHUe7JME,0.683694,Forest Tenure Road Segment Lines,9e5bfa62-2339-445e-bf67-81657180c682
8,p5Yo7Y4BdooaaHUeVJqJ,0.683301,Ontario Railway Network (ORWN),5ff2a676-d1a4-4aa1-88ce-0a7d7ac28d75
9,UJYp7Y4BdooaaHUeTKo-,0.683038,JUNCTION OFFICIAL,62d78217-8413-3266-d1db-bb61805a495b


### Construct the API response 

In [54]:
def add_to_top_of_dict(original_dict, key, value):
    """
    Adds a new key-value pair to the top of an existing dictionary.
    """
    # Check if the key or value is empty
    if key is None or value is None:
        print("Key and value must both be non-empty.")
        return original_dict  # Optionally handle this case differently

    # Create a new dictionary with the new key-value pair
    new_dict = {key: value}
    
    # Update the new dictionary with the original dictionary
    new_dict.update(original_dict)
    
    # Return the updated dictionary
    return new_dict

def create_api_response(search_results):
    """
    Creates an API response from the search results.
    
    :param search_results: The search results returned by Elasticsearch/OpenSearch.
    :return: A list of items with added metadata (total, relevancy, and row number).
    """
    items = []
    total_hits = len(search_results['hits']['hits'])
    
    for count, hit in enumerate(search_results['hits']['hits'], start=1):
        try:
            # Extract the source data
            source_data = hit['_source']
            
            # Check and delete 'vector' key if it exists
            source_data.pop('vector', None)  # Remove 'vector' key without raising an error if it's not present
        
            # Add custom metadata to the source data
            source_data = add_to_top_of_dict(source_data, 'total', total_hits)
            source_data = add_to_top_of_dict(source_data, 'relevancy', hit.get('_score', ''))
            source_data = add_to_top_of_dict(source_data, 'row_num', count)
            
            items.append(source_data)
        except Exception as e:
            print(f"Error processing hit: {e}")
    
    return items


In [55]:
api_response = create_api_response(res)
print(api_response)
print(json.dumps(api_response[0], indent=4))

[{'row_num': 1, 'relevancy': 0.65576303, 'total': 30, 'id': 'ef0c4387-38ce-4adc-b761-f0506b82564e', 'title': 'Ontario Integrated Hydrology (OIH) data', 'description': 'Ontario Integrated Hydrology (OIH) data is used to generate watersheds and support provincial-scale hydrology applications including: * watershed generation * hydrologic modelling * watercourse network analysis Four key datasets are represented in each data package: * stream network (Enhanced Watercourse) * hydrology-enforced digital elevation model [DEM ] (Enforced DEM) * flow direction grid (Enhanced Flow Direction - EFDIR) * raster representation of the stream network (StreamGrid) __Technical information__ For the first time, OIH data is complete for the entire province making it possible to create a watershed for any location in Ontario. This includes areas flowing in from neighbouring provinces and Minnesota with the following exceptions: * points on the international border that drain to Lake Superior, south of Pig