# Building AI-powered semantic product catalog search
### Using a pretrained LLM and the Amazon Aurora PostgreSQL extension `pgvector`

---


## Contents


1. [Background](#Background)
1. [Architecture](#Architecture)
1. [Setup](#Setup)
1. [Amazon Bedrock Model Hosting](#Amazon-Bedrock-Model-Hosting)
1. [Load data into PostgreSQL](#Open-source-extension-pgvector-in-PostgreSQL)
1. [Evaluate Search Results](#Evaluate-PostgreSQL-vector-search-results)

## Background


Semantic search is a type of search technique that aims to understand the intent and context of a user's query, rather than simply matching keywords or phrases. It goes beyond traditional keyword-based search by considering the meaning of words, the relationships between them, and the overall context of the query to deliver more relevant search results. Semantic search is important because it can help users to find the information they are looking for more quickly and easily.

Here are some examples of how semantic search is used today:
- Amazon uses semantic search to help customers find the products they are looking for. For example, if you search for "blue running shoes," Amazon will return results for shoes that are both blue and designed for running, even if you didn't use both of those keywords in your query.
- Netflix uses semantic search to recommend movies and TV shows to its users. For example, if you watch a lot of documentaries, Netflix will recommend other documentaries that you may be interested in.
- Google uses semantic search to improve the relevance of its search results. For example, if you search for "capital of France," Google will return results for Paris, even though you didn't explicitly mention Paris in your query.


In this notebook, we'll build the core components of a search for textually similar products. People often don't know exactly what they are looking for, so they type a description of an item and hope the search retrieves similar items. Other times, they have a photo of a product and are looking for similar products.

One of the core components of searching for textually similar items is a fixed-length sentence or word embedding, i.e. a "feature vector" that represents the text. The reference embeddings are typically generated offline and must be stored so that they can be searched efficiently. In this use case we use the pretrained text embeddings model `amazon.titan-embed-g1-text-02` from [Amazon Titan](https://aws.amazon.com/bedrock/titan/).
 
To enable efficient searches for textually similar items, we'll use [Amazon Bedrock](https://aws.amazon.com/bedrock/) to generate fixed-length sentence embeddings, i.e. "feature vectors", and run nearest neighbor search in Amazon Aurora PostgreSQL using the [pgvector](https://github.com/pgvector/pgvector) extension. The `pgvector` extension lets you store and search for points in vector space and find the "nearest neighbors" of those points. Use cases include recommendations (for example, an "other songs you might like" feature in a music application), image recognition, and fraud detection.

Here are the steps we'll follow to build the search for textually similar items:
- Generate feature vectors for the product descriptions from the Kaggle dataset using the Amazon Titan embeddings model.
- Store the generated vectors in Amazon Aurora PostgreSQL as the `vector` data type, along with the product metadata.
- Explore some sample text queries and visualize the results.

## Architecture

![](./static/arch_product_recommendation.png)

**Step 1.** We download the Kaggle dataset, generate embeddings using `amazon.titan-embed-g1-text-02` from Amazon Titan, and store them in the Amazon Aurora PostgreSQL instance with the pgvector extension.

**Step 2.** We search for a product with a text query. The query is converted to an embedding, which is matched against the Amazon Aurora PostgreSQL database with approximate nearest neighbor search to return the results.

## Setup
Install the Python libraries required for the workshop.


In [None]:
# Install all the required prerequisite libraries - approx 3 min to complete
%pip install -U pgvector pandarallel boto3 psycopg numexpr

## Download Amazon Product Catalog from Kaggle

The [dataset](https://www.kaggle.com/datasets/promptcloud/amazon-product-dataset-2020) consists of 9000+ Amazon products along with several descriptions of each product. The data has already been downloaded as a CSV file, and we will load it into a pandas dataframe for further processing. We will combine the data in the columns `About Product`, `Product Specification` and `Technical Details` into a single `all_descriptions` column, which is then converted to vector embeddings.


In [None]:
import pandas as pd

# Load the data of csv
df = pd.read_csv('data/amazon.csv')
df = df[['Uniq Id','Product Name','Category','About Product','Product Specification','Technical Details','Image']]

df = df.dropna(subset=['About Product'])
df = df.fillna('')
df.rename(columns={'Uniq Id': 'id', 
                   'Product Name': 'product_name',
                   'Category':'category',
                   'About Product':'product_description',
                   'Product Specification':'product_specification',
                   'Technical Details':'product_details',
                   'Image':'image_url'}, inplace=True)

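# Join the three text fields with spaces so that words at the field boundaries do not run together
df['all_descriptions'] = df['product_description'] + ' ' + df['product_specification'] + ' ' + df['product_details']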

print("Total number of records : {}".format(len(df.index)))

display(df.head(2))

##  Amazon Bedrock Model Hosting

In this section we will use the pretrained `amazon.titan-embed-g1-text-02` text embeddings model from Amazon Titan, hosted on Amazon Bedrock, to generate 1536-dimensional vector embeddings for our product catalog descriptions.


In [None]:
import boto3
import json

bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

The following function converts text into vector embeddings. It will be called for each individual product description.

In [None]:
def generate_embeddings(query):
    # The Titan embeddings model expects a JSON body with the text under 'inputText'
    payload = json.dumps({'inputText': query})

    response = bedrock_runtime.invoke_model(
        body=payload,
        modelId='amazon.titan-embed-g1-text-02',
        accept="application/json",
        contentType="application/json")
    # The response body is a stream; read it and parse the JSON it contains
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")

# Try the function on a single product description
description_embeddings = generate_embeddings(df.iloc[1].get('all_descriptions'))

print("Number of dimensions : {}".format(len(description_embeddings)))

In this code block, we go through every row of the dataframe, convert the text stored in the `all_descriptions` column into embeddings using the Amazon Titan embeddings model, and store the result in a new `description_embeddings` column of the same dataframe.

In [None]:
# Generate embeddings for all the products descriptions - approx 3 min to complete

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=8)

# Generate Embeddings for all the products 
df['description_embeddings'] = df['all_descriptions'].parallel_apply(generate_embeddings)

df.head()

print("Completed generation of embeddings for all the products descriptions")
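
The cell above issues thousands of embedding requests from parallel workers, so some calls may fail transiently, for example when the service throttles requests. The following is a minimal sketch of a retry wrapper with exponential backoff, assuming such failures surface as exceptions. The names `with_retries`, `max_attempts`, `base_delay`, `retry_on` and `sleep` are hypothetical and not part of boto3; in practice you would likely narrow `retry_on` to the specific error your client raises.

```python
import time
from typing import Callable, List, Tuple, Type


def with_retries(
    embed: Callable[[str], List[float]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], List[float]]:
    """Wrap `embed` so that failed calls are retried with exponential backoff.

    Hypothetical helper for illustration; all parameter names are assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def wrapped(text: str) -> List[float]:
        for attempt in range(max_attempts):
            try:
                return embed(text)
            except retry_on:
                # Give up after the last attempt and surface the original error
                if attempt == max_attempts - 1:
                    raise
                # Wait base_delay, then 2x, 4x, ... before trying again
                sleep(base_delay * (2 ** attempt))
        raise AssertionError("unreachable: the loop either returns or raises")

    return wrapped
```

With this in place, `safe_embed = with_retries(generate_embeddings)` gives a function with the same signature as `generate_embeddings`. The `sleep` parameter exists only so the waiting can be replaced in tests.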

## Open-source extension pgvector in PostgreSQL

`pgvector` is an open-source extension for PostgreSQL that allows you to store and search vector embeddings for exact and approximate nearest neighbors. It is designed to work seamlessly with other PostgreSQL features, including indexing and querying.

One of the key benefits of using pgvector is that it allows you to perform similarity searches on large datasets quickly and efficiently. This is particularly useful in industries like e-commerce, where businesses need to be able to quickly search through large product catalogs to find the items that best match a customer's preferences. It supports exact and approximate nearest neighbor search, L2 distance, inner product, and cosine distance.
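
To make the three distance measures concrete, here is a small pure-Python sketch of what each one computes, together with a brute-force exact nearest neighbor search. It is an illustration only, not how pgvector is implemented. The function names and the `exact_nearest` helper are hypothetical; the operators named in the comments (`<->`, `<#>`, `<=>`) are the ones pgvector uses for these distances. Note that `<#>` returns the negative inner product, so that smaller values still mean "closer".

```python
import math
from typing import Callable, Dict, List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (pgvector operator: <->)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Negative dot product (pgvector operator: <#>)."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (pgvector operator: <=>)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_nearest(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
    distance: Callable[[Sequence[float], Sequence[float]], float] = cosine_distance,
) -> List[str]:
    """Exact search: score every stored vector and keep the k closest ids."""
    ranked = sorted(vectors.items(), key=lambda item: distance(query, item[1]))
    return [name for name, _ in ranked[:k]]
```

Exact search compares the query against every stored vector, which is why its cost grows with the size of the catalog.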

To further optimize your searches, you can also use pgvector's indexing features. By creating indexes on your vector data, you can speed up your searches and reduce the amount of time it takes to find the nearest neighbors to a given vector.
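
An approximate index trades some accuracy for speed, so it can be worth checking how many of the true nearest neighbors it still returns. The sketch below computes recall@k from two lists of ids: the ids returned by the indexed (approximate) query and the ids returned by an exact query, which could be obtained, for example, by running the same query before the index is created. `recall_at_k` and its parameter names are hypothetical and not part of pgvector.

```python
from typing import Hashable, Sequence


def recall_at_k(
    approx_ids: Sequence[Hashable],
    exact_ids: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one id")
    hits = len(truth & set(approx_ids[:k]))
    return hits / len(truth)
```

A value of 1.0 means the approximate search returned the same top-k set as the exact search. Averaging this over a sample of queries gives a rough picture of how index parameters affect result quality.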

In this step we'll take the embeddings generated for the product descriptions of the *Amazon Products* dataset and store them in Amazon Aurora PostgreSQL in a column of type `vector`.

In [None]:
import psycopg
from pgvector.psycopg import register_vector
import boto3 
import json 
import numpy as np

client = boto3.client('secretsmanager')

response = client.get_secret_value(SecretId='apgpg-pgvector-secret')
database_secrets = json.loads(response['SecretString'])

dbhost = database_secrets['host']
dbport = database_secrets['port']
dbuser = database_secrets['username']
dbpass = database_secrets['password']

dbconn = psycopg.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10, autocommit=True)

dbconn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(dbconn)

dbconn.execute("DROP TABLE IF EXISTS products;")

dbconn.execute("""CREATE TABLE IF NOT EXISTS products(
                   id text primary key, 
                   product_name text, 
                   category text, 
                   product_description text, 
                   product_specification text,
                   product_details text,   
                   image_url text,
                   description_embeddings vector(1536));""")

for _, x in df.iterrows():
    dbconn.execute("""INSERT INTO products
                  (id, product_name, category, product_description, product_specification, product_details, image_url, description_embeddings) 
                   VALUES(%s, %s, %s, %s, %s, %s, %s, %s);""", 
                   (x.get('id'), x.get('product_name'), x.get('category'), x.get('product_description'), x.get('product_specification'), x.get('product_details'), x.get('image_url'), x.get('description_embeddings')))

dbconn.execute("""CREATE INDEX ON products 
                   USING hnsw (description_embeddings vector_cosine_ops) 
                   WITH  (m = 16, ef_construction = 64);""")

dbconn.execute("VACUUM ANALYZE products;")

dbconn.close()
print("Vector embeddings have been successfully loaded into the Aurora PostgreSQL table")
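
The loop in the cell above sends one `INSERT` statement per product. For larger catalogs, one possible alternative is to send rows in batches with the `executemany` method of a psycopg 3 cursor. The sketch below shows that idea under stated assumptions: `chunked`, `insert_in_batches` and `batch_size` are hypothetical names, `cursor` is assumed to come from `dbconn.cursor()`, and each row is assumed to be a tuple in the same column order as the `INSERT` statement. Whether batching helps in your environment is something to measure rather than assume.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Tuple

INSERT_SQL = """INSERT INTO products
    (id, product_name, category, product_description,
     product_specification, product_details, image_url, description_embeddings)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s);"""


def chunked(rows: Iterable[Tuple[Any, ...]], size: int) -> Iterator[List[Tuple[Any, ...]]]:
    """Yield successive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(cursor: Any, rows: Iterable[Tuple[Any, ...]], batch_size: int = 500) -> int:
    """Insert rows batch by batch and return the number of rows sent."""
    total = 0
    for batch in chunked(rows, batch_size):
        # One executemany call per batch instead of one execute call per row
        cursor.executemany(INSERT_SQL, batch)
        total += len(batch)
    return total
```

The default `batch_size` of 500 is an arbitrary starting point, not a recommendation.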


## Evaluate PostgreSQL vector search results

In this step we will use the pretrained `amazon.titan-embed-g1-text-02` model from Amazon Titan to generate an embedding for the query, use that embedding to search Amazon Aurora PostgreSQL for the nearest neighbors, and retrieve the relevant product images along with their descriptions.


In [None]:
import html
import numpy
from IPython.display import display, HTML


def similarity_search(search_text):
    # Embed the query text with the same model that was used for the catalog
    embedding = numpy.array(generate_embeddings(search_text))
    dbconn = psycopg.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10)
    register_vector(dbconn)

    # <=> is pgvector's cosine distance operator; the closest rows come first
    r = dbconn.execute("""SELECT id, image_url, product_name, product_description, product_details
                         FROM products 
                         ORDER BY description_embeddings <=> %s limit 3;""", (embedding,)).fetchall()
    dbconn.close()

    img_td = ""
    for x in r:
        # Keep the first entry when image_url holds several "|"-separated values
        url = x[1].split("|")[0]
        # Escape values before placing them in HTML so special characters render as text
        img_td = img_td + """<tr><td><img src="{}" width="1000"></td>""".format(html.escape(url, quote=True))
        img_td = img_td + """<td style="text-align: left; vertical-align: top;"> <h3>{}</h3> <p>{}</p></td></tr>""".format(html.escape(str(x[2])), html.escape(str(x[4])))

    display(HTML("""<table>{}</table>""".format(img_td)))

print("Created similarity_search function successfully")


Using the above `similarity_search` function, let's try some search queries on the product catalog.

In [None]:
similarity_search("suggest something for 5 year old")

In [None]:
similarity_search("suggest something for halloween")

In [None]:
similarity_search("suggest something for home office")

In [None]:
similarity_search("suggest something for december")

In [None]:
similarity_search("suggest something for thanksgiving")
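
The queries above always return the three nearest products, even when nothing in the catalog is a close match. One way to inspect or filter such results is to convert the cosine distance into a similarity score (cosine similarity = 1 - cosine distance) and drop rows below a threshold. The sketch below assumes the query has been extended to also return the distance, for example by selecting `description_embeddings <=> %s AS distance`, which the notebook's query does not currently do. `filter_by_similarity` and `min_similarity` are hypothetical names, and any threshold value would need tuning on real queries.

```python
from typing import List, Sequence, Tuple


def filter_by_similarity(
    rows: Sequence[Tuple[str, float]],
    min_similarity: float,
) -> List[Tuple[str, float]]:
    """Convert cosine distances to similarities and drop weak matches.

    `rows` holds (product_name, cosine_distance) pairs, nearest first.
    """
    kept = []
    for name, distance in rows:
        similarity = 1.0 - distance  # cosine similarity = 1 - cosine distance
        if similarity >= min_similarity:
            kept.append((name, similarity))
    return kept
```

Because the input is already ordered by distance, the output keeps the same nearest-first order.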

## Conclusion
In this workshop you have learned how semantic search can be used to search through a product catalog for an e-commerce application.

### Takeaways
- Adapt this notebook to experiment with different models available through Hugging Face or Amazon Bedrock, such as Anthropic Claude and AI21 Labs Jurassic models.
- Change the input dataset and experiment with your organizational data.