# Building AI-powered image search in PostgreSQL using Amazon Bedrock and pgvector
_**Using a pretrained LLM and the PostgreSQL extension `pgvector` for image similarity search on a product catalog**_

---

## Contents


1. [Background](#Background)
1. [Setup](#Setup)
1. [Amazon Bedrock Model Hosting](#Amazon-Bedrock-Model-Hosting)
1. [Load data into PostgreSQL](#Open-source-extension-pgvector-in-PostgreSQL)
1. [Evaluate Search Results](#Evaluate-PostgreSQL-vector-Search-Results)

## Background

Image search is the process of using an image as a query to find related or similar images. Users provide an image, and the search engine uses visual features such as color, shape, and texture to find visually similar images. This type of search goes beyond keyword matching and relies on image recognition and computer vision technologies. Image search is useful for various purposes, such as finding the source of an image, identifying objects or landmarks in a picture, or discovering visually similar content.

In this notebook, we'll build the core components of a search for visually similar products. Often people don't know exactly what they are looking for. In that case they type an item description or upload a photo of a product and look for similar products that match it.

One of the core components of searching for visually similar items is a fixed-length embedding, i.e. a "feature vector" that corresponds to each image. For image search, we convert each product image into such a feature vector. The reference embeddings are typically generated offline and must be stored so they can be searched efficiently. In this use case we use the pretrained multimodal `amazon.titan-embed-image-v1` embeddings model from [Amazon Titan](https://aws.amazon.com/bedrock/titan/).

To enable efficient searches for visually similar items, we'll use `amazon.titan-embed-image-v1` to generate fixed-length embeddings ("feature vectors") and run nearest-neighbor search in Amazon Aurora PostgreSQL using the `pgvector` extension. The PostgreSQL `pgvector` extension lets you store and search for points in vector space and find the "nearest neighbors" of those points. Use cases include recommendations (for example, an "other songs you might like" feature in a music application), image recognition, and fraud detection.
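
To make the idea of nearest-neighbor search concrete, here is a minimal pure-Python sketch of exact search by cosine distance. The toy three-dimensional vectors and the names `cosine_distance` and `nearest_neighbors` are illustrative assumptions, not part of the workshop code; `pgvector` performs the same kind of ranking inside the database.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, catalog, k=3):
    """Rank catalog items by cosine distance to the query (exact search)."""
    ranked = sorted(catalog.items(), key=lambda item: cosine_distance(query, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
catalog = {
    "red dress": [0.9, 0.1, 0.0],
    "blue jeans": [0.1, 0.9, 0.1],
    "sport shoes": [0.0, 0.2, 0.9],
}
print(nearest_neighbors([0.8, 0.2, 0.0], catalog, k=2))
```

Exact search like this compares the query against every stored vector, which is why an approximate index becomes attractive as the catalog grows.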

Here are the steps we'll follow to build the similar-item search:

- Generate feature vectors for the product images from the [Kaggle dataset](https://www.kaggle.com/datasets/vikashrajluhaniwal/fashion-images/) using Amazon Titan Multimodal Embeddings.
- Store the generated vectors, along with the metadata, in Amazon Aurora PostgreSQL with the pgvector extension.
- Explore some sample text queries, and visualize the results.


## Setup
Install the required Python libraries for the workshop.


In [None]:
!pip install -U pgvector boto3 pandarallel psycopg ipywidgets

### Downloading Fashion image dataset from Kaggle

The dataset consists of 2900+ product images in the Apparel and Footwear categories. Apparel covers two gender types, Boys and Girls; Footwear covers Men and Women.


In [2]:
import json
import boto3

# Load configuration information (the file is closed automatically)
with open('data/config.json') as fp:
    data = json.load(fp)

# Generating list of embedding models available 
bedrock = boto3.client(service_name='bedrock', region_name='us-west-2')
listModels = bedrock.list_foundation_models()

model_list = []
for model in listModels['modelSummaries']:
    if 'EMBEDDING' in model['outputModalities'] and 'ON_DEMAND' in model['inferenceTypesSupported']:
        model_list.append(model['modelId'])


In [3]:
# In this section we will choose the dataset and the embedding model for this lab

import IPython
import ipywidgets as widgets

dataset = widgets.Dropdown(
    options=data['dataset'],
    description="Dataset:",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)

model = widgets.Dropdown(
    options=model_list,
    description="Embeddings Models: ",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(IPython.display.Markdown("## Select Dataset for this lab"))
display(dataset)

display(IPython.display.Markdown("## Select an Embedding Model"))
display(model)


In [None]:
import pandas as pd

# Load the selected dataset from its CSV file
df = pd.read_csv('data/{}'.format(dataset.value))
df = df[['descriptions','display','image_url']]
print("Total number of records : {}".format(len(df.index)))

display(df.head(2))

## Amazon Bedrock Model Hosting

In this section we will use the pretrained `amazon.titan-embed-image-v1` model hosted on Amazon Bedrock to generate 1024-dimensional vector embeddings for our product catalog.

**Note**
Please make sure you have completed the Amazon Bedrock setup before continuing with this section.


In [None]:
import boto3
import json

bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

The function below converts a piece of text into a vector embedding. It is called once for each product description.

In [None]:
def generate_embeddings(query):
    # Amazon Titan models expect 'inputText'; the other branch assumes a
    # Cohere-style request body ('texts' plus 'input_type')
    if model.value.startswith('amazon'):
        payload = json.dumps({'inputText': query})
    else:
        payload = json.dumps({'texts': [query], 'input_type': 'search_document'})

    response = bedrock_runtime.invoke_model(
        body=payload,
        modelId=model.value,
        accept="application/json",
        contentType="application/json")

    # Titan returns a single 'embedding'; Cohere-style responses return a list of 'embeddings'
    response_body = json.loads(response.get("body").read())
    if model.value.startswith('amazon'):
        output = response_body.get("embedding")
    else:
        output = response_body.get("embeddings")[0]
    return output

# Embed one sample description to discover the vector size of the selected model
description_embeddings = generate_embeddings(df.iloc[1].get('descriptions'))
no_of_embeddings = len(description_embeddings)

print("Embedding model : {}".format(model.value))
print("Number of dimensions : {}".format(no_of_embeddings))
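
The function above sends text only. Since `amazon.titan-embed-image-v1` is a multimodal model, an image can be embedded by sending it base64-encoded. The following is a minimal sketch of how such a request body might be built, assuming the `inputImage` and `embeddingConfig.outputEmbeddingLength` fields described in the Amazon Bedrock documentation for Titan Multimodal Embeddings. The helper name `build_titan_image_payload` and the placeholder image bytes are hypothetical.

```python
import base64
import json

def build_titan_image_payload(image_bytes, text=None, output_length=1024):
    """Build a JSON request body for an image (optionally plus text) embedding."""
    body = {
        # The image is sent as a base64-encoded string
        "inputImage": base64.b64encode(image_bytes).decode("utf-8"),
        # Assumed field for choosing the size of the returned vector
        "embeddingConfig": {"outputEmbeddingLength": output_length},
    }
    if text:
        body["inputText"] = text
    return json.dumps(body)

# Placeholder bytes for illustration; in practice read the real image file
payload = build_titan_image_payload(b"placeholder-image-bytes", text="red summer dress")
print(payload[:60])
```

The resulting string could be passed as `body` to `bedrock_runtime.invoke_model` with `modelId='amazon.titan-embed-image-v1'`, and the vector read from the `embedding` field of the response, as `generate_embeddings` already does for text.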



In this code block, we go through every row in the dataframe, convert the text in the `descriptions` column into an embedding using the selected Amazon Bedrock embedding model (for example, Amazon Titan Multimodal Embeddings), and store the result in the `embeddings` column of the same dataframe. We use [pandarallel](https://pypi.org/project/pandarallel/) to parallelize the generation of vector embeddings. pandarallel is a simple and efficient tool to parallelize Pandas operations on all available CPUs.

In [None]:
# Generate embeddings for all the product descriptions - approx 3 min to complete

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=8)

# Each worker calls generate_embeddings on its share of the descriptions
df['embeddings'] = df['descriptions'].parallel_apply(generate_embeddings)
display(df.head())

print("Completed generation of embeddings for all the product descriptions")
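
pandarallel spreads the work across worker processes. Because each embedding call mostly waits on a network response, a thread pool is another option. The following sketch swaps in Python's standard `concurrent.futures.ThreadPoolExecutor`; `embed_all` and `fake_embed` are hypothetical names, and `fake_embed` stands in for `generate_embeddings` so the example runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor

def embed_all(texts, embed_fn, max_workers=8):
    """Apply embed_fn to every text concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_fn, texts))

def fake_embed(text):
    # Stand-in for generate_embeddings: returns a tiny made-up vector
    return [float(len(text)), float(text.count(" "))]

vectors = embed_all(["red dress", "blue denim jeans"], fake_embed)
print(vectors)
```

In the notebook this might be called as `df['embeddings'] = embed_all(df['descriptions'].tolist(), generate_embeddings)`, assuming the Bedrock client can be shared across threads.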

## Open-source extension pgvector in PostgreSQL

pgvector is an open-source extension for PostgreSQL that offers a powerful and versatile way to store, manipulate, and search vector data within PostgreSQL. Its features and ease of use make it a valuable tool for various applications like recommendation systems, image and text retrieval, and data clustering.

Here are some of the key features of the pgvector extension for PostgreSQL:

**Vector Storage and Operations:**

- **Dedicated vector type**: Stores vectors directly in tables, providing efficient storage and retrieval.   
- **Multiple vector types:** Stores single-precision vectors; recent versions also support half-precision, binary, and sparse vectors.   
- **Basic vector operations:** Allows element-wise addition, subtraction, and multiplication, along with functions such as inner product and vector norm.   

**Similarity Search:**

- **Exact and approximate nearest neighbor search:** Find the closest data points based on similarity metrics like L2 distance, inner product, and cosine distance.    
- **Trade-off between accuracy and performance:** Exact search offers perfect recall but slower speed, while approximate search sacrifices some accuracy for higher speed.    
- **Multiple indexing options:** Supports HNSW and IVFFlat indexes for optimizing similarity search queries.   

**Integration and Ease of Use:**    

- **Seamless SQL integration:** Use pgvector functions and operators directly within your SQL queries for a familiar workflow.   
- **Multiple programming language support:** Works with any client library that connects to PostgreSQL.    
- **Active community and documentation:** Benefit from community contributions, tutorials, and extensive documentation.    
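
As a rough guide to the distance metrics listed above, the sketch below reimplements in plain Python what pgvector's three most common operators compute: `<->` (L2 distance), `<#>` (negative inner product), and `<=>` (cosine distance). The function names and sample vectors are illustrative assumptions; in practice the operators run inside PostgreSQL.

```python
import math

def l2_distance(a, b):
    """Equivalent of pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Equivalent of pgvector's <#> operator (note the sign flip)."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """Equivalent of pgvector's <=> operator: 1 - cosine similarity."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 + negative_inner_product(a, b) / (norm_a * norm_b)

a, b = [1.0, 0.0], [0.6, 0.8]
print(l2_distance(a, b), negative_inner_product(a, b), cosine_distance(a, b))
```

The `<#>` operator returns the negated inner product so that, like the other two, smaller values mean closer vectors and results can be sorted in ascending order.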


In this step we'll take the embeddings generated for the Kaggle dataset and store them in a PostgreSQL table using the `vector` type.


In [None]:
import psycopg
from pgvector.psycopg import register_vector
import boto3 
import json 
import numpy as np

client = boto3.client('secretsmanager')

response = client.get_secret_value(SecretId='apgpg-pgvector-secret')
database_secrets = json.loads(response['SecretString'])

dbhost = database_secrets['host']
dbport = database_secrets['port']
dbuser = database_secrets['username']
dbpass = database_secrets['password']

dbconn = psycopg.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10, autocommit=True)

dbconn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(dbconn)

dbconn.execute("DROP TABLE IF EXISTS similiarity_search;")

dbconn.execute("""CREATE TABLE IF NOT EXISTS similiarity_search(
                   id bigserial primary key, 
                   descriptions text, 
                   image_url text,
                   display text,
                   description_embeddings vector({}));""".format(no_of_embeddings))


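# Insert one row per product; the embedding is passed as a NumPy array so that
# the adapter installed by register_vector maps it to the vector type
for _, x in df.iterrows():
    dbconn.execute("""INSERT INTO similiarity_search
                  (descriptions, image_url, description_embeddings, display) 
                   VALUES(%s, %s, %s, %s);""", 
                   (x.get('descriptions'), x.get('image_url'), np.array(x.get('embeddings')), x.get('display')))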

dbconn.execute("""CREATE INDEX ON similiarity_search 
                   USING hnsw (description_embeddings vector_cosine_ops) 
                   WITH  (m = 16, ef_construction = 64);""")

dbconn.execute("VACUUM ANALYZE similiarity_search;")

dbconn.close()
print("Vector embeddings have been successfully loaded into Aurora PostgreSQL tables")
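
The loader above relies on `register_vector` to adapt Python values to the `vector` type. As an alternative sketch, a vector can also be sent as text in pgvector's `[x1,x2,...]` literal form and cast with `::vector`. The helper name `to_pgvector_literal` is hypothetical and not part of pgvector or psycopg.

```python
def to_pgvector_literal(values):
    """Format a sequence of numbers in pgvector's text form, e.g. '[1.0,2.5]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

literal = to_pgvector_literal([0.25, -1, 3])
print(literal)
```

The string could then be bound as an ordinary parameter, for example with a placeholder written as `%s::vector` in the SQL text, assuming the target column uses the same number of dimensions.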

## Evaluate PostgreSQL vector Search Results

In this step we will use Amazon Bedrock to generate an embedding for the query, then use it to search PostgreSQL for the nearest neighbors and retrieve the relevant product images.


In [None]:
import numpy 
from IPython.display import display, Markdown, Latex, HTML

def similarity_search(search_text):
    # Embed the query text with the same model used for the catalog
    embedding = numpy.array(generate_embeddings(search_text))
    dbconn = psycopg.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10)
    register_vector(dbconn)

    # <=> is pgvector's cosine distance operator; smaller means more similar
    r = dbconn.execute("""SELECT id, image_url, display
                          FROM similiarity_search 
                          ORDER BY description_embeddings <=> %s
                          LIMIT 3;""", (embedding,)).fetchall()
    dbconn.close()

    # Render each match as a table row: image on the left, display text on the right
    img_td = ""
    for x in r:
        img_td = img_td + """<tr><td><img src="{}" width="1000"></td>""".format(x[1])
        img_td = img_td + """<td style="text-align: left; vertical-align: top;"> <p>{}</p></td></tr>""".format(str(x[2]))

    display(HTML("""<table>{}</table>""".format(img_td)))

print("Created similarity_search function successfully")

similarity_search("Gift")
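
The function above interpolates dataset values directly into HTML. A slightly more defensive sketch, assuming rows shaped like the `(id, image_url, display)` tuples returned by the query, escapes each value first with the standard-library `html.escape`. The helper name `rows_to_html` and the sample row are hypothetical.

```python
import html

def rows_to_html(rows):
    """Build an HTML table from (id, image_url, display) rows, escaping each value."""
    cells = []
    for _id, image_url, display_text in rows:
        cells.append(
            '<tr><td><img src="{}" width="1000"></td>'
            '<td style="text-align: left; vertical-align: top;"><p>{}</p></td></tr>'.format(
                html.escape(str(image_url), quote=True),
                html.escape(str(display_text)),
            )
        )
    return "<table>{}</table>".format("".join(cells))

# Hypothetical sample row for illustration only
table = rows_to_html([(1, "https://example.com/a.jpg", "Tops & <Tees>")])
print(table)
```

In the notebook, the returned string could be wrapped in `HTML(...)` and passed to `display`, as the search function already does.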

Let's run some searches using the `similarity_search` function above. You can search by passing a free-text description or a product type.
For example, you can pass "Sports Shoes" as the product type.   
You can try product types such as Tops, Dresses, Shorts, Tshirts, Jeans, Casual Shoes, Flip Flops, Formal Shoes, Sports Shoes, etc.

In [None]:
similarity_search("red sleeveless summer wear")

In [None]:
similarity_search("suggest something for december")

Let's search for the 3rd image in the dataset.

Let's search for the product type "Jeans".

In this workshop you have successfully implemented image search functionality in PostgreSQL using Amazon Bedrock and pgvector.