### Semantic Search vs Keyword Search
Semantic search seeks to improve search accuracy by understanding the intent and contextual meaning of the search query. In contrast to traditional keyword search, which finds documents only through lexical matches, semantic search can also match synonyms and related concepts.

Its advantages include: 
- Content understanding (attention mechanism) 
- Handling of natural-language and complex queries
- Dealing with synonyms and variations 
- Improved accuracy



### Fine-tuning vs Pre-training
Fine-tuning is the process of adjusting a pre-trained model (a model trained on a large dataset to learn general features) for a specific task by continuing its training on a new, often smaller dataset.

This approach leverages the model's learned knowledge, reducing the need for extensive data and computational power for new tasks. Pre-trained models serve as adaptable starting points for various machine learning tasks, enabling efficient model development without starting from scratch.


### Amazon OpenSearch 

Amazon OpenSearch Service (formerly known as Amazon Elasticsearch Service) is a managed service offered by AWS that makes it easy to deploy, operate, and scale OpenSearch, an open-source search and analytics suite.

Its advantages include: 
- Fully managed, secure, and scalable
- Supports semantic search, keyword search, hybrid search, etc.
- Provides search relevance ranking
- Supports geospatial search and indexing
- Provides features such as pagination, sorting, autocomplete, and did-you-mean


### Amazon SageMaker 
Amazon SageMaker is a fully managed service provided by AWS to quickly and easily build, train, and deploy machine learning (ML) models at scale. 


### Keyword Search with Amazon OpenSearch

In [None]:
# !pip install -q boto3
# !pip install -q requests
# !pip install -q requests-aws4auth
# !pip install -q opensearch-py
# !pip install -q tqdm
# !pip install -q "transformers[torch]"
# !pip install -q sentence-transformers rank_bm25
# !pip install -q pandas pyarrow  # pyarrow is needed by pd.read_parquet

### 1. Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store this in "outputs" to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [1]:
import boto3
import json

cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')  # note: a Secrets Manager client, despite the variable name


def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search-with-opensearch"

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OpenSearchDomainEndpoint']
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])

outputs
#print(aos_credentials)

{'OpenSearchDomainEndpoint': 'search-semantic-search-dfcizxxxuj62dusl5skmeu3czu.ca-central-1.es.amazonaws.com',
 'Region': 'ca-central-1',
 'OpenSearchDomainName': 'semantic-search',
 'OpenSearchSecret': 'arn:aws:secretsmanager:ca-central-1:006288227511:secret:OpenSearchSecret-semantic-search-with-opensearch-dFOH1b'}

### 2. Create an OpenSearch cluster connection.
Next, we'll use the OpenSearch Python SDK to set up a connection with the Amazon OpenSearch Service domain.

In [12]:
from opensearchpy import OpenSearch,RequestsHttpConnection
import boto3

# Update the region if your domain is not in ca-central-1
region = 'ca-central-1' 

# Alternatively, auth can be obtained with the AWSV4SignerAuth class
# from opensearchpy import AWSV4SignerAuth
# credentials = boto3.Session().get_credentials()
# auth = AWSV4SignerAuth(credentials, region)

auth = (aos_credentials['username'], aos_credentials['password'])
aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)
print(aos_client)

<OpenSearch([{'host': 'search-semantic-search-dfcizxxxuj62dusl5skmeu3czu.ca-central-1.es.amazonaws.com', 'port': 443}])>


### 3. Create an index in OpenSearch Service
We will use the aos_client connection we initiated earlier to create an index in Amazon OpenSearch Service.

An OpenSearch index includes components such as documents (in JSON format), aliases, templates, and settings (which cover analyzers, shards, and replicas).
In this index, the default analyzer is the standard analyzer configured with the English stopword list, which strips common stopwords such as `the`, `is`, `a`, and `an`.
- OpenSearch [Index Settings](https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/index-settings/)
- OpenSearch [Index Analyzers](https://opensearch.org/docs/latest/analyzers/index-analyzers/): Index analyzers are specified at indexing time and are used to analyze text fields when a document is indexed. Analyzers define how text is processed before indexing, including tokenization and the application of token filters.
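
To make the analyzer's effect concrete, here is a minimal pure-Python sketch of roughly what a standard analyzer with English stopwords does to a text field: tokenize, lowercase, and drop stopwords. The `ENGLISH_STOPWORDS` set is an abbreviated, illustrative subset and `analyze` is a hypothetical helper; the real analyzer uses Unicode-aware tokenization, so treat this only as an approximation.

```python
import re

# Abbreviated, illustrative subset of English stopwords (assumption: the
# list behind `_english_` is longer than this).
ENGLISH_STOPWORDS = {"a", "an", "and", "are", "as", "at", "in", "is", "of", "the", "to"}


def analyze(text):
    """Approximate a standard analyzer with English stopwords removed."""
    # Keep runs of letters and digits as tokens, then lowercase them.
    tokens = [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]
    # Drop tokens that appear in the stopword list.
    return [token for token in tokens if token not in ENGLISH_STOPWORDS]


print(analyze("The River Ice events in Canada"))
# ['river', 'ice', 'events', 'canada']
```

Because the same analysis is applied to the query text by default, a search for "the river" and a search for "river" match the same indexed terms.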



### Create an index with dynamic mapping

In [16]:
# Static index-level index settings: https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/index-settings/#static-index-level-index-settings
index_name = 'search'
metadata_default_index = {
    "settings": {
        "number_of_replicas": 1,
        "number_of_shards": 1,
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    }
   
}

# Retrieve cluster settings including default settings
response = aos_client.cluster.get_settings(include_defaults=True)
print(response)


If the index already exists, we delete it and then recreate it.

In [20]:
from opensearchpy import OpenSearch

def delete_index_if_exists(aos_client, index_to_delete):
    """
    Deletes the specified index if it exists.

    :param aos_client: An instance of OpenSearch client.
    :param index_to_delete: The name of the index to delete.
    """
    # List all indexes and check if the specified index exists
    all_indices = aos_client.cat.indices(format='json')
    existing_indices = [index['index'] for index in all_indices]
    print("Current indexes:", existing_indices)

    if index_to_delete in existing_indices:
        # Delete the specified index
        try:
            response = aos_client.indices.delete(index=index_to_delete)
            print(f"Deleted index: {index_to_delete}")
            print("Response:", response)
        except Exception as e:
            print(f"Error deleting index {index_to_delete}:", e)
    else:
        print(f"Index {index_to_delete} does not exist.")

    # List all indexes again to confirm deletion
    all_indices_after_deletion = aos_client.cat.indices(format='json')
    existing_indices_after_deletion = [index['index'] for index in all_indices_after_deletion]
    print("Indexes after deletion attempt:", existing_indices_after_deletion)

delete_index_if_exists(aos_client, index_to_delete=index_name)


Current indexes: ['nlp_knn', '.opensearch-observability', '.plugins-ml-config', 'keyword_search', 'search', '.ql-datasources', '.opendistro_security', '.kibana_1']
Deleted index: search
Response: {'acknowledged': True}
Indexes after deletion attempt: ['nlp_knn', '.opensearch-observability', '.plugins-ml-config', 'keyword_search', '.ql-datasources', '.opendistro_security', '.kibana_1']
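
As a more compact alternative to listing every index, the existence check can be done directly. The sketch below assumes a client object exposing `indices.exists`, `indices.delete`, and `indices.create`, as the opensearch-py client does; `recreate_index` is a hypothetical helper name, not part of the library.

```python
def recreate_index(client, index_name, index_body):
    """Delete `index_name` if it exists, then create it from `index_body`."""
    if client.indices.exists(index=index_name):
        # Deleting an index permanently removes its documents.
        client.indices.delete(index=index_name)
    return client.indices.create(index=index_name, body=index_body)


# Usage sketch (requires a live domain, so it is left commented out):
# recreate_index(aos_client, index_name, metadata_default_index)
```

This avoids fetching the full index list twice, at the cost of the before-and-after printout that the longer function provides.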


In [21]:
# Create an index
aos_client.indices.create(index=index_name,body=metadata_default_index,ignore=400)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'search'}

Let's verify the created index information

In [22]:
aos_client.indices.get(index=index_name)

{'search': {'aliases': {},
  'mappings': {},
  'settings': {'index': {'replication': {'type': 'DOCUMENT'},
    'number_of_shards': '1',
    'provided_name': 'search',
    'creation_date': '1710198966505',
    'analysis': {'analyzer': {'default': {'type': 'standard',
       'stopwords': '_english_'}}},
    'number_of_replicas': '1',
    'uuid': 'k-7zDllvRDepbKqEfTsxUg',
    'version': {'created': '136327827'}}}}}

### Alternatively, create an index with explicit mapping
Amazon OpenSearch Service uses dynamic mapping by default. However, certain OpenSearch features, such as aggregations, sorting, and specific queries (e.g., geo_shape queries), require fields to be indexed in a particular way. Defining the mappings in advance (explicit mapping) ensures these features work as expected.

In our case, we explicitly map the `coordinates` field, which holds the values from "features_geometry_coordinates", as a geo_shape.
- [OpenSearch Mapping](https://opensearch.org/docs/latest/field-types/)
- [Supported field types](https://opensearch.org/docs/latest/field-types/supported-field-types/index/)

In [96]:
index_name = 'search'
metadata_default_index = {
    "settings": {
        "number_of_replicas": 1,
        "number_of_shards": 1,
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        },
        "index": {#default relevancy score is BM25: https://opensearch.org/docs/latest/search-plugins/keyword-search/
            "similarity": {
                "custom_similarity": {
                    "type": "BM25",
                    "k1": 1.2,
                    "b": 0.75,
                    "discount_overlaps": "true"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "coordinates":{
              "type": "geo_shape", 
              "store": True 
            }
        }
    }
}


In [97]:
#delete_index_if_exists(aos_client, index_to_delete=index_name)
#aos_client.indices.create(index=index_name,body=metadata_default_index,ignore=400)

Current indexes: ['nlp_knn', '.opensearch-observability', '.plugins-ml-config', 'keyword_search', 'search', '.ql-datasources', '.opendistro_security', '.kibana_1']
Deleted index: search
Response: {'acknowledged': True}
Indexes after deletion attempt: ['nlp_knn', '.opensearch-observability', '.plugins-ml-config', 'keyword_search', '.ql-datasources', '.opendistro_security', '.kibana_1']


List all the indexes in the OpenSearch domain 


In [201]:
all_indices = aos_client.cat.indices(format='json')
print([index['index'] for index in all_indices]) 

['nlp_knn', '.opensearch-observability', '.plugins-ml-config', 'search', 'keyword_search', '.ql-datasources', '.opendistro_security', '.kibana_1']


Check the data in the index 

In [202]:
index_name = 'search'
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print(f"Records loaded into the index {index_name} is {res['hits']['total']['value']}.")

Records loaded into the index search is 10000.


### 4. Load the raw data (records.parquet) from the S3 bucket
Next, let's load the metadata (records.parquet) from the S3 bucket so that it can be indexed into the index we've just created.
Note: the raw data needs to be converted to JSON format before indexing.

In [28]:
import pandas as pd 
#import boto3
#import json
from requests_aws4auth import AWS4Auth
import io

First, we will read the parquet file from S3. 

In [31]:
def load_parquet_from_s3_to_df(region, s3_bucket, s3_key):
    """
    Load a Parquet file from an S3 bucket into a pandas DataFrame.

    Parameters:
    - region: AWS region where the S3 bucket is located.
    - s3_bucket: Name of the S3 bucket.
    - s3_key: Key (path) to the Parquet file within the S3 bucket.

    Returns:
    - df: pandas DataFrame containing the data from the Parquet file.
    """
    
    # Setup AWS session and clients
    session = boto3.Session(region_name=region)
    s3 = session.resource('s3')

    # Load the Parquet file as a pandas DataFrame
    # (named s3_object to avoid shadowing the built-in `object`)
    s3_object = s3.Object(s3_bucket, s3_key)
    body = s3_object.get()['Body'].read()
    df = pd.read_parquet(io.BytesIO(body))
    return df

df = load_parquet_from_s3_to_df('ca-central-1', 'webpresence-geocore-geojson-to-parquet-dev', 'records.parquet')

In [32]:
df.head()
#df.columns

Unnamed: 0,features_type,features_geometry_type,features_geometry_coordinates,features_properties_id,features_properties_title_en,features_properties_title_fr,features_properties_description_en,features_properties_description_fr,features_properties_keywords_en,features_properties_keywords_fr,...,features_properties_distributor,features_properties_options,features_properties_temporalExtent_end_@indeterminatePosition,features_properties_temporalExtent_end_#text,features_properties_plugins,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,features_popularity,features_similarity
0,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",000183ed-8864-42f0-ae43-c4313a860720,"Principal Mineral Areas, Producing Mines, and ...","Principales régions minières, principales mine...",This dataset is produced and published annuall...,Ce jeu de données est produit et publié annuel...,"mineralization, mineral occurrences, mines, hy...","minéralisation, indices minéralisés, mines, hy...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://maps-cartes.services.geo.ca/...",,,[],,,[],1250806,"[{""sim"": ""sim1"", ""features_properties_id"": ""b6..."
1,Feature,Polygon,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...",7f245e4d-76c2-4caa-951a-45d1d2051333,"Canadian Digital Elevation Model, 1945-2011","Modèle numérique d'élévation du Canada, 1945-2011",This collection is a legacy product that is no...,Ce produit fait maintenant partie du patrimoin...,"Canada, Earth Sciences, elevation, relief, geo...","Canada, Sciences de la Terre, élévation, relie...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://maps.geogratis.gc.ca/wms/ele...",,,[],,,[],210798,"[{""sim"": ""sim1"", ""features_properties_id"": ""76..."
2,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",085024ac-5a48-427a-a2ea-d62af73f2142,Canada's National Earthquake Scenario Catalogue,Catalogue national de scénarios de tremblement...,"The National Earthquake Scenario Catalogue, pr...",Le dépôt est utilisé pour l’élaboration du cat...,"Emergency preparedness, Earth sciences, Earthq...","Protection civile, Sciences de la terre, Tremb...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://github.com/OpenDRR/earthquak...",,,[],,,[],140088,"[{""sim"": ""sim1"", ""features_properties_id"": ""4c..."
3,Feature,Polygon,"[[[-104.75571511, 50.42392886], [-104.56356008...",03ccfb5c-a06e-43e3-80fd-09d4f8f69703,Temporal Series of the National Air Photo Libr...,Série temporelle de la photothèque nationale d...,"Note: To visualize the data in the viewer, zoo...",Note: Pour visualiser les données dans l’outil...,"Mosaic, Aerial photography, Access to informat...","Mosaïque, Photographie aérienne, Accès à l'inf...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://datacube-prod-data-public.s3...",,,[],,,[],120162,"[{""sim"": ""sim1"", ""features_properties_id"": ""23..."
4,Feature,Polygon,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",488faf70-b50b-4749-ac1c-a1fd44e06f11,Indigenous Mining Agreements,Ententes minières autochtones,The Indigenous Mining Agreements dataset provi...,Les données des ententes minières autochtones ...,"Indigenous, First Nations, Métis, Indigenous a...","Autochtones, Premières nations, Métis, Affaire...",...,"[{""individual"": ""null"", ""position"": {""en"": ""nu...","[{""url"": ""https://atlas.gc.ca/imaema/en/"", ""pr...",,,[],,,[],111036,"[{""sim"": ""sim1"", ""features_properties_id"": ""CG..."


Subset to the columns that are required in the app.geo.ca [API response](https://geocore.api.geo.ca/geo?north=81.77364370720657&east=360&south=-8.407168163601076&west=-359.6484375&keyword=&lang=en&min=1&max=10&sort=popularity-desc).
##### Note: we are focusing on English search at the moment.

In [None]:
#Extract organization from contact json, English only 
def extract_organisation_en(contact_str):
    try:
        # Parse the stringified JSON into Python objects
        contact_data = json.loads(contact_str)
        # If the parsed data is a list, iterate through it
        if isinstance(contact_data, list):
            for item in contact_data:
                # Check if 'organisation' and 'en' keys exist
                if 'organisation' in item and 'en' in item['organisation']:
                    return item['organisation']['en']
        elif isinstance(contact_data, dict):
            # If the data is a dictionary, extract 'organisation' in 'en' directly
            return contact_data.get('organisation', {}).get('en', None)
    except json.JSONDecodeError:
        # Handle cases where the contact string is not valid JSON
        return None
    except Exception as e:
        # Catch-all for any other unexpected errors
        return f"Error: {str(e)}"

# Subset to selected columns 
col_names_list = ['features_properties_id','features_geometry_coordinates','features_properties_title_en','features_properties_description_en','features_properties_date_published_date','features_properties_keywords_en','features_properties_options','features_properties_contact','features_properties_topicCategory','features_properties_date_created_date','features_properties_spatialRepresentation','features_properties_type','features_properties_temporalExtent_begin','features_properties_temporalExtent_end','features_properties_graphicOverview','features_properties_language','features_popularity','features_properties_sourceSystemName','features_properties_eoCollection','features_properties_eoFilters']
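# .copy() makes df_en an independent DataFrame, so the column assignments
# below do not trigger pandas' SettingWithCopyWarning
df_en = df[col_names_list].copy()
#df_en = df_en[:100]

# Create new column 'organisation_en'
df_en['organisation_en'] = df_en['features_properties_contact'].apply(extract_organisation_en)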

# Create a new column 'temporalExtent' as a dictionary of {'begin': ..., 'end': ...}
values_to_replace = {'Present': None, 'Not Available; Indisponible': None}
columns_to_replace = ['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

df_en['temporalExtent'] = df_en.apply(lambda row: {'begin': row['features_properties_temporalExtent_begin'], 'end': row['features_properties_temporalExtent_end']}, axis=1)
df_en = df_en.drop(columns =['features_properties_temporalExtent_begin', 'features_properties_temporalExtent_end'])

#modifies dates to acceptable values
values_to_replace = {'Not Available; Indisponible': None}
columns_to_replace = ['features_properties_date_published_date', 'features_properties_date_created_date']
df_en[columns_to_replace] = df_en[columns_to_replace].replace(values_to_replace)

In [160]:
print(df_en.shape)
df_en.head(4)

(10993, 20)


Unnamed: 0,features_properties_id,features_geometry_coordinates,features_properties_title_en,features_properties_description_en,features_properties_date_published_date,features_properties_keywords_en,features_properties_options,features_properties_contact,features_properties_topicCategory,features_properties_date_created_date,features_properties_spatialRepresentation,features_properties_type,features_properties_graphicOverview,features_properties_language,features_popularity,features_properties_sourceSystemName,features_properties_eoCollection,features_properties_eoFilters,organisation_en,temporalExtent
0,000183ed-8864-42f0-ae43-c4313a860720,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...","Principal Mineral Areas, Producing Mines, and ...",This dataset is produced and published annuall...,2020-02-27,"mineralization, mineral occurrences, mines, hy...","[{""url"": ""https://maps-cartes.services.geo.ca/...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",economy,2019-04-12,vector; vecteur,series; série,"[{""overviewFileName"": ""http://ftp.maps.canada....",eng; CAN,1250806,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '2020-01', 'end': '2020-12'}"
1,7f245e4d-76c2-4caa-951a-45d1d2051333,"[[[-142, 41], [-52, 41], [-52, 84], [-142, 84]...","Canadian Digital Elevation Model, 1945-2011",This collection is a legacy product that is no...,2015,"Canada, Earth Sciences, elevation, relief, geo...","[{""url"": ""https://maps.geogratis.gc.ca/wms/ele...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",elevation,2012-11-06,grid; grille,dataset; jeuDonnées,"[{""overviewFileName"": ""https://ftp.maps.canada...",eng; CAN,210798,,,[],Government of Canada; Natural Resources Canada...,"{'begin': '1945', 'end': '2011'}"
2,085024ac-5a48-427a-a2ea-d62af73f2142,"[[[-141.003, 41.6755], [-52.6174, 41.6755], [-...",Canada's National Earthquake Scenario Catalogue,"The National Earthquake Scenario Catalogue, pr...",2021-07-06,"Emergency preparedness, Earth sciences, Earthq...","[{""url"": ""https://github.com/OpenDRR/earthquak...","[{""individual"": ""Dr. Tiegan Hobbs"", ""position""...",geoscientificInformation,2021-07-06,vector; vecteur,series; série,[],eng; CAN,140088,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '2021-07-06', 'end': <NA>}"
3,03ccfb5c-a06e-43e3-80fd-09d4f8f69703,"[[[-104.75571511, 50.42392886], [-104.56356008...",Temporal Series of the National Air Photo Libr...,"Note: To visualize the data in the viewer, zoo...",2021-03-31,"Mosaic, Aerial photography, Access to informat...","[{""url"": ""https://datacube-prod-data-public.s3...","[{""individual"": ""null"", ""position"": {""en"": ""nu...",imageryBaseMapsEarthCover,2020-08-01,grid; grille,dataset; jeuDonnées,"[{""overviewFileName"": ""http://datacube-prod-da...",eng; CAN,120162,,,[],Government of Canada;Natural Resources Canada;...,"{'begin': '1947', 'end': '1967'}"


### 5. Import the raw data to the index
To import the raw data, we first need to convert the parquet data frame to JSON format.

The following attributes are stored as stringified nested JSON. Each must be converted to a Python list object, which maps to the [object](https://opensearch.org/docs/latest/field-types/supported-field-types/object-fields/) field type:
- features_properties_eoFilters
- features_similarity
- features_properties_options
- features_properties_distributor
- features_properties_cited
- features_properties_credits
- features_properties_contact
- features_properties_graphicOverview
- features_geometry_coordinates
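
One detail worth guarding against when converting these columns: `dict.get(key, '[]')` only falls back to `'[]'` when the key is absent, not when its value is `None` or NaN, and `json.loads(None)` raises a `TypeError`. The following is a minimal sketch of a defensive parser; `parse_json_field` is a hypothetical helper and not part of the original notebook.

```python
import json


def parse_json_field(value, default=None):
    """Parse a stringified JSON column value into a Python object.

    Returns `default` (an empty list if not given) when the value is
    missing, is not a string, or is not valid JSON.
    """
    if default is None:
        default = []
    if isinstance(value, (list, dict)):
        # Already parsed, so pass it through unchanged.
        return value
    if not isinstance(value, str):
        # Covers None, NaN (a float), and pandas NA markers.
        return default
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return default


print(parse_json_field('[{"url": "https://example.org"}]'))
print(parse_json_field(None))
```

With such a helper, a single malformed or missing value falls back to an empty list instead of causing the whole record to be skipped.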

In [None]:
import json
from tqdm import tqdm
import time

In [None]:
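# Quick smoke test: index a small sample before loading the full dataset.
# Assumption: `test_json` was not defined in the original notebook; here it is
# taken to be the first few rows of df_en as a list of dictionaries.
test_json = df_en.head(5).to_dict("records")

for x in test_json:
    try:
        document = {
            'id': x.get('features_properties_id', ''),  # str
            'title': x.get('features_properties_title_en', ''),  # str
            'description': x.get('features_properties_description_en', ''),  # str
            'keywords': x.get('features_properties_keywords_en', ''),  # str
            'topicCategory': x.get('features_properties_topicCategory', ''),  # str
            'organisation': x.get('organisation_en', ''),  # str
            'systemName': x.get('features_properties_sourceSystemName', ''),
            'vector': x.get('vector', ''),  # str
        }
        aos_client.index(index=index_name, body=document)
    except Exception as e:
        # Report the failure and continue with the next record
        print(e)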

In [None]:
def index_data_to_opensearch(df_en, aos_client, index_name, log_level="INFO"):
    """
    Index data from a pandas DataFrame to an OpenSearch index.

    Parameters:
    - df_en: DataFrame containing the data to index.
    - aos_client: OpenSearch client.
    - index_name: Name of the OpenSearch index to which the data will be indexed.
    - log_level: Logging level, defaults to "INFO". Set to "DEBUG" for detailed logs.
    """
    start_time = time.time()

    # Convert DataFrame to a list of dictionaries (JSON)
    json_en = df_en.to_dict("records")

    # Index the data
    for x in tqdm(json_en, desc="Indexing Records"):
        try:
            bounding_box = json.loads(x.get('features_geometry_coordinates', '[]'))
            coordinates = {
                "type": "polygon",
                "coordinates": bounding_box
            }
            
            document = {
                'id': x.get('features_properties_id', ''),
                'coordinates': coordinates,
                'title': x.get('features_properties_title_en', ''),
                'description': x.get('features_properties_description_en', ''),
                'published': x.get('features_properties_date_published_date', ''),
                'keywords': x.get('features_properties_keywords_en', ''),
                'options': json.loads(x.get('features_properties_options', '[]')),
                'contact': json.loads(x.get('features_properties_contact', '[]')),
                'topicCategory': x.get('features_properties_topicCategory', ''),
                'created': x.get('features_properties_date_created_date', ''),
                'spatialRepresentation': x.get('features_properties_spatialRepresentation', ''),
                'type': x.get('features_properties_type', ''),
                'temporalExtent': x.get('temporalExtent', ''),
                'graphicOverview': json.loads(x.get('features_properties_graphicOverview', '[]')),
                'language': x.get('features_properties_language', ''),
                'organisation': x.get('organisation_en', ''),
                'popularity': int(x.get('features_popularity', '0')),
                'systemName': x.get('features_properties_sourceSystemName', ''),
                'eoCollection': x.get('features_properties_eoCollection', ''),
                'eoFilters': json.loads(x.get('features_properties_eoFilters', '[]'))
            }

            if log_level == "DEBUG":
                print((json.dumps(document, indent=4)))

            aos_client.index(index=index_name, body=document)

        except Exception as e:
            print(e)

    # Final record count check (Optional, can slow down the script if the index is large)
    res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
    total_time = time.time() - start_time
    print(f"Completed indexing. Records loaded into the index {index_name}: {res['hits']['total']['value']}. Total time taken: {total_time:.2f} seconds.")

# Example usage
index_data_to_opensearch(df_en, aos_client, index_name)
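
Indexing one document per request means one HTTP round trip per record. A common alternative is the bulk helper in opensearch-py. The sketch below only builds the bulk actions; `generate_bulk_actions` and its `id_field` default are hypothetical names chosen for illustration, and the commented-out call to `helpers.bulk` assumes a live domain.

```python
def generate_bulk_actions(records, index_name, id_field="features_properties_id"):
    """Yield one bulk action per record, using `id_field` as the document _id."""
    for record in records:
        action = {"_index": index_name, "_source": record}
        doc_id = record.get(id_field)
        if doc_id:
            # A stable _id makes re-runs overwrite documents instead of
            # creating duplicates with auto-generated ids.
            action["_id"] = doc_id
        yield action


# Usage sketch (requires a live domain, so it is left commented out):
# from opensearchpy import helpers
# success, errors = helpers.bulk(aos_client, generate_bulk_actions(docs, index_name))
```

Here `docs` stands for a list of already-prepared document dictionaries, such as those built inside `index_data_to_opensearch`.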

In [143]:
res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print(f"Records loaded into the index {index_name} is {res['hits']['total']['value']}.")

Records loaded into the index search is 10000.
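
Note that the reported figure of 10000 is lower than the 10,993 rows in `df_en`. One likely explanation is that, by default, OpenSearch tracks the total hit count accurately only up to 10,000, so the value may be a lower bound; some records may also have failed to index. The sketch below requests an exact total with `track_total_hits`; `exact_hit_count` is a hypothetical helper that assumes a client with a `search` method like the one used above.

```python
def exact_hit_count(client, index_name):
    """Return the exact number of documents in `index_name`."""
    body = {
        "size": 0,                 # no documents needed, only the total
        "track_total_hits": True,  # count past the default 10,000 cap
        "query": {"match_all": {}},
    }
    res = client.search(index=index_name, body=body)
    return res["hits"]["total"]["value"]


# Usage sketch (requires a live domain, so it is left commented out):
# print(exact_hit_count(aos_client, index_name))
```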


### 6. Run a "simple text search" 

In [203]:
import pandas as pd
import json
search_query = "wildfire"
query={
  "size": 20,
  "query": {
    "match": {
      "title": search_query
    }
  }
}
res = aos_client.search(index=index_name, body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title'],hit['_source']['id']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["id","relevancy_score","title",'uuid'])
display(query_result_df)


Unnamed: 0,id,relevancy_score,title,uuid
0,K5VKOI4BdooaaHUeJKHO,11.244217,Wildfire Information,c4a1037c-cf9e-491a-2e3f-70cf13ee29a2
1,BpVIOI4BdooaaHUe9I74,8.911254,Wildfire hotspots Cumulative Effects products,574c32db-aba7-4919-9c9f-c58398754173
2,fJVJOI4BdooaaHUeiJdU,8.334815,Wildfire Year/dNBR/Mask 1985-2015,9eb51c7b-5336-4b50-a9f8-a6791bad15a9


### 7. Keyword search across multiple fields 
The current geo.ca search is a multi-column keyword search based on the following properties: features_properties_topicCategory, features_properties_keywords_en, features_properties_description_en, features_properties_title_en, features_properties_contact (organization), features_properties_sourceSystemName, features_properties_geometry

In [210]:
import pandas as pd
search_query = "River Ice events in Canada in 2020"
query={
  "size": 20,
  "query": {
    "multi_match": {
      "query": search_query,
      "fields": ["topicCategory","keywords", "description", "title*", "organisation", "systemName"]
    }
  }
}
res = aos_client.search(index=index_name, size=20,body=query)

query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title'],hit['_source']['id']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["id","relevancy_score","title",'uuid'])
display(query_result_df)


Unnamed: 0,id,relevancy_score,title,uuid
0,fZVIOI4BdooaaHUe_Y4G,17.015722,River Ice in Canada - Current Year,8ca6f047-ddef-43d7-81c2-47654f4c69bd
1,k5VIOI4BdooaaHUe_o5l,17.015722,Active Monitoring of River Ice in Canada,7b210c58-2fc7-47c5-8b8a-2605c77d725c
2,x5VJOI4BdooaaHUeAY7B,17.015722,River Ice in Canada - Archive,5e6b40bf-299f-4e05-87c8-d10b9c8210f9
3,QZVIOI4BdooaaHUe-I7m,15.963156,River Ice State in Canada - Cartographic Produ...,d1fcb44f-5f86-4957-bdb4-e6fd1aa69283
4,M5VKOI4BdooaaHUeUqTs,14.826182,River Ice in Canada - Current Year,CGDIWH-150516
5,T5VKOI4BdooaaHUeVKS6,14.826182,Active Monitoring of River Ice in Canada - Pro...,CGDIWH-150521
6,xJVKOI4BdooaaHUe6K0V,14.434877,River Ice in Canada - Archive - Footprints,CGDIWH-150519
7,xZVKOI4BdooaaHUe6K0k,14.253883,Active monitoring of river ice in Canada - Foo...,CGDIWH-150522
8,MZVKOI4BdooaaHUeUqSs,13.886137,River ice in Canada - Current year,CGDIWH-150514
9,w5VKOI4BdooaaHUe6K0H,12.656118,River ice in Canada - Archive - Footprints on ...,CGDIWH-150518


### Keyword Search Relevance
Search relevance evaluates the accuracy of the search results returned by a query. The higher the relevance, the better the search engine.
- [Search features in OpenSearch](https://opensearch.org/docs/latest/search-plugins/)
- [Keyword (BM25) search](https://opensearch.org/docs/latest/search-plugins/keyword-search/): By default, OpenSearch calculates document scores using the Okapi BM25 algorithm. BM25 is a keyword-based algorithm that performs lexical search for words that appear in the query.
- [k-NN search](https://opensearch.org/docs/latest/search-plugins/knn/index/): Searches for the k-nearest neighbors to a search term across an index of vectors.
- [Hybrid search](https://opensearch.org/docs/latest/search-plugins/hybrid-search/)

[Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) (Best Matching 25) is a keyword-based algorithm that performs lexical search for words that appear in the query. When determining a document's relevance, BM25 builds on term frequency and inverse document frequency (TF-IDF). It improves upon plain TF-IDF by adding document length normalization and a saturation function, which prevents the term-frequency contribution from growing without bound as a term repeats in a document.
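
The scoring idea can be sketched in a few lines of plain Python. This is the textbook form of BM25 with the same `k1` and `b` values used in the index settings; the toy documents are made up for illustration, and the scores will not match OpenSearch's exactly, since the engine's implementation differs in details such as constant factors and how document lengths are stored.

```python
import math


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs  # average document length
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            tf = doc.count(term)  # term frequency in this document
            if tf == 0:
                continue
            df = sum(1 for d in docs if term in d)  # documents containing the term
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            # Saturation: the fraction approaches (k1 + 1) as tf grows.
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


docs = [
    ["wildfire", "information"],
    ["wildfire", "hotspots", "cumulative", "effects", "products"],
    ["river", "ice", "in", "canada"],
]
print(bm25_scores(["wildfire"], docs))
```

The shorter title scores higher than the longer one for the same single match, and the document without the term scores zero.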

The k-NN search feature in OpenSearch enables users to perform similarity search by finding the "nearest" vectors in high-dimensional space. This approach is particularly useful for scenarios like product recommendations, image search, and finding similar documents, where items can be represented as vectors in a multidimensional space.
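
As a rough illustration of the idea, the sketch below performs an exact, brute-force nearest-neighbor search with cosine similarity. The three-dimensional vectors and document names are hand-made for illustration only; real embeddings have hundreds of dimensions, and OpenSearch's k-NN plugin typically relies on approximate index structures instead of comparing against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query_vector, vectors, k=2):
    """Return the k (doc_id, similarity) pairs most similar to query_vector."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vec))
        for doc_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made 3-dimensional vectors, purely illustrative.
vectors = {
    "doc_flood": [0.9, 0.1, 0.0],
    "doc_wildfire": [0.1, 0.9, 0.0],
    "doc_elevation": [0.0, 0.1, 0.9],
}
print(knn_search([0.8, 0.2, 0.0], vectors, k=2))
```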

### 8. Search with Field preference or boosting

When searching across fields, all fields are given the same priority by default. You can control the preference by assigning a static boost to each field, for example `keywords^2`.
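
The `field^boost` strings can be generated from a dictionary so the weights live in one place. This is a small convenience sketch; `build_boosted_fields` is a hypothetical helper, and the weights shown are only examples.

```python
def build_boosted_fields(boosts):
    """Turn {"field": boost} into multi_match field strings such as "keywords^2"."""
    fields = []
    for field, boost in boosts.items():
        # A boost of 1 is the default, so the suffix can be omitted.
        fields.append(field if boost == 1 else f"{field}^{boost}")
    return fields


fields = build_boosted_fields(
    {"topicCategory": 1, "keywords": 2, "description": 1.5, "title": 1.5}
)
print(fields)
# ['topicCategory', 'keywords^2', 'description^1.5', 'title^1.5']
```

The resulting list can be passed as the `fields` value of a `multi_match` query.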

In [None]:
import pandas as pd
search_query = "Urban planning and development trends"
query={
  "size": 20,
  "query": {
    "multi_match": {
      "query": search_query,
      "fields": ["topicCategory","keywords^2", "description^1.5", "title^1.5", "organisation", "systemName"]
    }
  }
}
res = aos_client.search(index=index_name, size=20,body=query)

query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['title'],hit['_source']['id']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["id","relevancy_score","title",'uuid'])
display(query_result_df)

### 9. Keywords Search geo_shape search

Ottawa, the capital city of Canada, is located in the province of Ontario. Its geographical coordinates are centered roughly at latitude 45.4215° N and longitude 75.6972° W (-75.6972 in signed decimal degrees).
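
A polygon around such a center point can be built programmatically. The helper below is a minimal sketch; `bounding_box_polygon` and the half-width and half-height values are hypothetical choices, not part of the original notebook. Note that GeoJSON-style coordinates are ordered `[longitude, latitude]` and that a polygon ring must end on its starting point.

```python
def bounding_box_polygon(lon, lat, half_width_deg, half_height_deg):
    """Return GeoJSON-style polygon coordinates for a rectangle centered on (lon, lat)."""
    west, east = lon - half_width_deg, lon + half_width_deg
    south, north = lat - half_height_deg, lat + half_height_deg
    # Counterclockwise ring in [longitude, latitude] order, closed on the first point.
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],
    ]
    # A polygon is a list of rings; only the outer ring is used here.
    return [ring]

# Rectangle of roughly 0.4 x 0.3 degrees around Ottawa (sizes are illustrative).
ottawa_polygon = bounding_box_polygon(-75.6972, 45.4215, 0.2, 0.15)
```

The returned value has the same nesting as the `coordinates` list used in a `geo_shape` polygon query.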

In [140]:
import pandas as pd

search_query = 'flood risk mapping'
# Example polygon ([longitude, latitude] pairs) covering a rough area around Ottawa
coordinates = [[[-75.9, 45.2], [-75.9, 45.5], [-75.6, 45.5], [-75.6, 45.2], [-75.9, 45.2]]]

# Both clauses under "must" have to match: the keyword query and the spatial filter.
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": search_query,
                        "fields": ["topicCategory", "keywords", "description", "title*", "organisation", "systemName"]
                    }
                },
                {
                    "geo_shape": {
                        "coordinates": {  # Name of the geo_shape field in the index
                            "shape": {
                                "type": "polygon",
                                "coordinates": coordinates
                            },
                            "relation": "intersects"  # Change as needed (e.g., 'within')
                        }
                    }
                }
            ]
        }
    }
}

res = aos_client.search(index=index_name, size=20, body=query)

# Collect the document id, relevancy score, title, and source uuid of each hit.
query_result = []
for hit in res['hits']['hits']:
    row = [hit['_id'], hit['_score'], hit['_source']['title'], hit['_source']['id']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result, columns=["id", "relevancy_score", "title", "uuid"])
display(query_result_df)


| | id | relevancy_score | title | uuid |
|---|---|---|---|---|
| 0 | y5VKOI4BdooaaHUeuqqR | 13.991396 | Geo-Floods: Map showing the presence of mapped... | 5690277b-cef2-45bc-a50f-b07efc521fe6 |
| 1 | a5VJOI4BdooaaHUeh5dj | 13.101778 | Schedule for the presence of a flood zone iden... | 0d9de0d6-9873-4a8c-adc7-0e94d51b3fa0 |
| 2 | 9JVJOI4BdooaaHUef5b- | 12.810937 | Flood Risk Areas Database (BDZI) | 3ac8ddff-fe0a-4a7a-8393-d5938e8f35e5 |
| 3 | OpVIOI4BdooaaHUe-I5- | 11.411324 | Active Floods in Canada | 9cad712a-5ac5-4248-b7d7-2db1a3892509 |
| 4 | RpVIOI4BdooaaHUe-Y5B | 11.411324 | Floods in Canada - Current Year | b1afd8d2-6e14-4ec4-9a09-652221a6cb71 |
| 5 | V5VKOI4BdooaaHUeRqMf | 11.17375 | Large and low-current areas of the Outaouais a... | b1e7c9a9-a34a-457c-8671-bae4529749fb |
| 6 | TJVIOI4BdooaaHUe-Y7L | 10.672247 | Floods in Canada - Archive | 74144824-206e-4cea-9fb9-72925a128189 |
| 7 | 5ZVJOI4BdooaaHUef5Yh | 10.318212 | 2017-2019 Special Intervention Zone (ZIS annex... | fef0a7a8-07ee-43d5-93d7-979d12dfcff3 |
| 8 | yJVKOI4BdooaaHUe6K1P | 9.487976 | Floods in Canada - Archive | CGDIWH-150527 |
| 9 | 6pVKOI4BdooaaHUeP6Ju | 9.432468 | Limnimetric scales (public) | cbc3a496-cc0f-4930-bc91-b351373110f6 |


### Construct the geolambda API response directly from the search results 

In [None]:
def add_to_top_of_dict(original_dict, key, value):
    """
    Adds a new key-value pair to the top of an existing dictionary.
    """
    # Skip the insert if the key or value is missing (None)
    if key is None or value is None:
        print("Key and value must both be set (not None).")
        return original_dict  # Optionally handle this case differently

    # Create a new dictionary with the new key-value pair
    new_dict = {key: value}
    
    # Update the new dictionary with the original dictionary
    new_dict.update(original_dict)
    
    # Return the updated dictionary
    return new_dict

def create_api_response(search_results):
    """
    Creates an API response from the search results.
    
    :param search_results: The search results returned by Elasticsearch/OpenSearch.
    :return: A list of items with added metadata (total, relevancy, and row number).
    """
    items = []
    # Number of hits returned in this response (at most `size`), not the index-wide match count
    total_hits = len(search_results['hits']['hits'])
    
    for count, hit in enumerate(search_results['hits']['hits'], start=1):
        try:
            # Extract the source data
            source_data = hit['_source']
            
            # Check and delete 'vector' key if it exists
            source_data.pop('vector', None)  # Remove 'vector' key without raising an error if it's not present
        
            # Add custom metadata to the source data
            source_data = add_to_top_of_dict(source_data, 'total', total_hits)
            source_data = add_to_top_of_dict(source_data, 'relevancy', hit.get('_score', ''))
            source_data = add_to_top_of_dict(source_data, 'row_num', count)
            
            items.append(source_data)
        except Exception as e:
            print(f"Error processing hit: {e}")
    
    return items

api_response = create_api_response(res)
#print(json.dumps(api_response[0], indent=4))
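
The `total` value attached above counts only the hits returned in the current response. If the overall number of matching documents is needed instead, it can usually be read from `hits.total` in the search response. The helper below is a minimal sketch; `total_matches` and `mock_response` are hypothetical names, and the mock dictionary only imitates the shape of a real response.

```python
def total_matches(search_results):
    """Return the overall match count reported by the search response."""
    total = search_results.get("hits", {}).get("total", 0)
    # Current responses report {"value": N, "relation": "eq" | "gte"};
    # some older response formats reported a plain integer.
    if isinstance(total, dict):
        return total.get("value", 0)
    return total

# Mock response imitating a search that matched 42 documents but returned 2.
mock_response = {
    "hits": {
        "total": {"value": 42, "relation": "eq"},
        "hits": [{"_id": "a"}, {"_id": "b"}],
    }
}
overall_total = total_matches(mock_response)
```

When `relation` is `gte`, the reported value is a lower bound rather than an exact count, so it should be presented accordingly.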