# Building AI-powered image search in PostgreSQL using Amazon Bedrock and pgvector
_**Using a pretrained model and the PostgreSQL extension `pgvector` for image similarity search on a product catalog**_

---


## Contents


1. [Background](#Background)
1. [Setup](#Setup)
1. [Amazon SageMaker Model Hosting](#Amazon-SageMaker-Model-Hosting)
1. [Load data into PostgreSQL](#Open-source-extension-pgvector-in-PostgreSQL)
1. [Evaluate Search Results](#Evaluate-PostgreSQL-vector-Search-Results)

## Background

Image search refers to the process of using an image as a query to find related or similar images. Image search is useful for various purposes, such as finding the source of an image, identifying objects or landmarks in a picture, or discovering visually similar content.

In this notebook, we'll build the core components of a visually similar product search. Shoppers often don't know exactly what they are looking for; in that case they type an item description or upload a photo of a product and look for similar products.

One of the core components of searching for visually similar items is a fixed-length embedding, i.e. a "feature vector", for each image. For image search, we convert each product image into such a feature vector. The reference image embeddings are typically generated offline and must be stored so they can be searched efficiently. In this use case we use the pretrained image model `tensorflow-icembedding-imagenet-inception-v2-featurevector-4` from [Tensorflow](https://tfhub.dev/google/imagenet/inception_v2/feature_vector/5).

To enable efficient searches for visually similar items, we'll use Amazon SageMaker to generate fixed-length image embeddings, i.e. "feature vectors", and use nearest neighbor search in Amazon Aurora PostgreSQL with the `pgvector` extension. The PostgreSQL `pgvector` extension lets you store and search for points in vector space and find the "nearest neighbors" for those points. Use cases include recommendations (for example, an "other songs you might like" feature in a music application), image recognition, and fraud detection.
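
To make the idea of "nearest neighbors" concrete, here is a minimal pure-Python sketch of an exact search over a handful of toy vectors. The vectors, the product names, and the helper `nearest_neighbors` are made up for illustration; the real embeddings in this notebook have hundreds of dimensions and the search itself is delegated to pgvector.

```python
import math

def nearest_neighbors(query, catalog, k=2):
    """Return the k catalog keys whose vectors are closest to `query` (Euclidean distance)."""
    ranked = sorted(catalog, key=lambda name: math.dist(query, catalog[name]))
    return ranked[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
catalog = {
    "white_top": [0.9, 0.1, 0.0],
    "black_top": [0.8, 0.2, 0.1],
    "sandal":    [0.0, 0.1, 0.9],
}

print(nearest_neighbors([1.0, 0.0, 0.0], catalog))
```

This brute-force approach compares the query against every stored vector, which is fine for a few thousand items but is the cost that approximate indexes aim to avoid on larger catalogs.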

Here are the steps we'll follow to build a visually similar item search:

- Generate feature vectors for the product images from the [Kaggle dataset](https://www.kaggle.com/datasets/vikashrajluhaniwal/fashion-images/) using a pretrained image embedding model.
- Store the generated vectors, along with the metadata, in Amazon Aurora PostgreSQL with the pgvector extension.
- Explore some sample image queries and visualize the results.


## Setup
Install the required Python libraries for the workshop.


In [2]:
!pip install -U pgvector tqdm boto3 requests scikit-image pillow pandarallel psycopg



### Downloading Fashion image dataset from Kaggle

The dataset consists of 2,900+ product images in the Apparel and Footwear categories. Apparel covers two gender types, Boys and Girls; Footwear covers Men and Women.


In [7]:
import pandas as pd

# Load the product catalog from the CSV file
df = pd.read_csv('data/fashion.csv')
df = df[['ProductId','Gender','Category','SubCategory','ProductType','Colour','Usage','ProductTitle','Image','ImageURL']]
print("Total number of records : {}".format(len(df.index)))

display(df.head(2))



Total number of records : 2906


|   | ProductId | Gender | Category | SubCategory | ProductType | Colour | Usage | ProductTitle | Image | ImageURL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42419 | Girls | Apparel | Topwear | Tops | White | Casual | Gini and Jony Girls Knit White Top | 42419.jpg | http://assets.myntassets.com/v1/images/style/p... |
| 1 | 34009 | Girls | Apparel | Topwear | Tops | Black | Casual | Gini and Jony Girls Black Top | 34009.jpg | http://assets.myntassets.com/v1/images/style/p... |


## Amazon SageMaker Model Hosting

In this section we will deploy the pretrained `tensorflow-icembedding-imagenet-inception-v2-featurevector-4` model to SageMaker and generate 1024-dimensional vector embeddings for our product catalog images.

In [4]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::192355327736:role/genai-pgvector-lab-ExecutionRole-7bvQUcFhTZga
sagemaker bucket: sagemaker-us-east-1-192355327736
sagemaker session region: us-east-1


In [5]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base


model_id = "tensorflow-icembedding-imagenet-inception-v2-featurevector-4"
model_version = "2.0.0"
endpoint_name = "apg-image-vector"
inference_instance_type = "ml.m5.xlarge"

# Retrieve the inference docker container uri. This is the base Tensorflow container image for the default model above.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri. This includes all dependencies and scripts for model loading, inference handling etc.
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)


# Retrieve the model uri. This includes the model and model parameters.
model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)


# Create the SageMaker model instance
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=model_uri,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    role=role,
    predictor_cls=Predictor,
    name=endpoint_name,
)

# deploy the Model. Note that we need to pass Predictor class when we deploy model through Model class,
# for being able to run inference through the sagemaker API.
image_model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)

print(f"Image Model has been deployed successfully to SageMaker")


----!Image Model has been deployed successfully to SageMaker


The following function converts an image into a vector embedding. It is called once for each product image.

In [6]:
from io import BytesIO
import requests
import json

def generate_embeddings(url):
    image_req = requests.get(url)
    image_bytes = BytesIO(image_req.content)
    image_embeddings_byte = image_model_predictor.predict(image_bytes,{
            "ContentType": "application/x-image",
            "Accept": "application/json",
        },)
    image_embeddings = json.loads(image_embeddings_byte.decode("utf-8"))['embedding']
    return image_embeddings

image_embeddings = generate_embeddings(df.iloc[0].get('ImageURL'))

print ("Number of image Dimensions : {}".format(len(image_embeddings)))


Number of image Dimensions : 1024
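
Downloading thousands of images over HTTP can fail intermittently, and a single exception would abort the whole run. The following is a minimal sketch of a retry wrapper. `with_retries` and its parameters are hypothetical helpers, not part of the notebook, and the sketch assumes the wrapped function raises an exception on failure.

```python
import time

def with_retries(embed_fn, retries=3, delay_seconds=0.0):
    """Wrap `embed_fn` so that failures are retried and finally reported as None."""
    def wrapper(url):
        for attempt in range(1, retries + 1):
            try:
                return embed_fn(url)
            except Exception as exc:  # broad on purpose: network, decoding, endpoint errors
                print(f"attempt {attempt}/{retries} failed for {url}: {exc}")
                if attempt < retries:
                    time.sleep(delay_seconds)  # wait before the next attempt
        return None  # the caller decides how to handle rows without an embedding
    return wrapper
```

With this in place, `df['ImageURL'].apply(with_retries(generate_embeddings))` would leave `None` for images that could not be processed, and those rows can be filtered out before loading.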


In this code block, we go through every row of the dataframe, convert the image referenced in the `ImageURL` column into an embedding using the deployed TensorFlow model, and store the result in a new `image_embeddings` column of the same dataframe.

In [8]:
# Generate embeddings for all the product images - approx 3 min to complete

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=8)

df['image_embeddings'] = df['ImageURL'].parallel_apply(generate_embeddings)
df.head()

print("Completed generation of embeddings for all the products images")

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.



Completed generation of embeddings for all the products images

The next cells repeat the embedding step locally with the CLIP model `clip-ViT-B-32` from the `sentence_transformers` library. This redefines `generate_embeddings` and overwrites the `image_embeddings` column with 512-dimensional vectors, which is why the PostgreSQL table later in this notebook is declared with `vector(512)`.


In [2]:
!pip install sentence_transformers


In [4]:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import torch
import pickle
import zipfile
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm
torch.set_num_threads(4)



#First, we load the respective CLIP model
model = SentenceTransformer('clip-ViT-B-32')

In [17]:
import requests
from io import BytesIO

# Download the first product image and open it with PIL
response = requests.get(df.iloc[0].get('ImageURL'))
query = Image.open(BytesIO(response.content))

# Encode the image; after tolist()[0] the result is a flat list of floats (one embedding)
query_emb = model.encode([query], convert_to_tensor=True, show_progress_bar=False).tolist()[0]
print(len(query_emb))

512


In [18]:
from io import BytesIO
import requests
import json

def generate_embeddings(url):
    # Download the image and open it with PIL so the CLIP model can encode it
    image_req = requests.get(url)
    image = Image.open(BytesIO(image_req.content))
    # encode() returns one embedding per input image; take the first (and only) one
    image_embeddings = model.encode([image], convert_to_tensor=True, show_progress_bar=False).tolist()[0]
    return image_embeddings

image_embeddings = generate_embeddings(df.iloc[0].get('ImageURL'))

print ("Number of image Dimensions : {}".format(len(image_embeddings)))


Number of image Dimensions : 512


In [20]:
# Generate embeddings for all the product images - approx 3 min to complete

from tqdm.notebook import tqdm
tqdm.pandas()

df['image_embeddings'] = df['ImageURL'].progress_apply(generate_embeddings)

df.head()

print("Completed generation of embeddings for all the products images")


Completed generation of embeddings for all the products images
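
The cell above encodes one image per call. Since `model.encode` accepts a list of inputs, grouping images into batches may reduce per-call overhead. A minimal sketch of the batching logic follows. `chunked` and `embed_in_batches` are hypothetical helpers, and `encode_batch` stands in for any function that takes a list of items and returns one embedding per item (for example, a thin wrapper around `model.encode`).

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(items, encode_batch, batch_size=32):
    """Apply `encode_batch` to each chunk and return the embeddings in input order."""
    embeddings = []
    for batch in chunked(items, batch_size):
        embeddings.extend(encode_batch(batch))
    return embeddings

# Stand-in encoder for demonstration only: maps each item to a 1-element "embedding"
fake_encode = lambda batch: [[float(len(item))] for item in batch]
print(embed_in_batches(["a", "bb", "ccc"], fake_encode, batch_size=2))
```

Whether batching actually helps depends on the hardware and on how much time is spent downloading images rather than encoding them, so it is worth measuring on a small sample first.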


## Open-source extension pgvector in PostgreSQL

pgvector is an open-source extension for PostgreSQL that allows you to store and search vector embeddings for exact and approximate nearest neighbors. It is designed to work seamlessly with other PostgreSQL features, including indexing and querying.

One of the key benefits of using pgvector is that it allows you to perform similarity searches on large datasets quickly and efficiently. This is particularly useful in industries like e-commerce, where businesses need to be able to quickly search through large product catalogs to find the items that best match a customer's preferences. It supports exact and approximate nearest neighbor search, L2 distance, inner product, and cosine distance.
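
The three distance measures correspond to pgvector's `<->` (L2 distance), `<#>` (negative inner product), and `<=>` (cosine distance) operators. As a rough illustration of what each one computes, here is a pure-Python sketch. The function names are made up for this example; pgvector evaluates these inside the database.

```python
import math

def l2_distance(a, b):             # pgvector: a <-> b
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):  # pgvector: a <#> b
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):         # pgvector: a <=> b
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a, b = [1.0, 2.0], [3.0, 4.0]
print(l2_distance(a, b), negative_inner_product(a, b), cosine_distance(a, b))
```

Because `<#>` returns the negative inner product, smaller values mean "closer" for all three operators, so an ascending `ORDER BY` ranks the nearest items first.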

To further optimize your searches, you can also use pgvector's indexing features. By creating indexes on your vector data, you can speed up your searches and reduce the amount of time it takes to find the nearest neighbors to a given vector.
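
pgvector offers two approximate index types, HNSW and IVFFlat, and an index is only considered for a query when its operator class matches the distance operator in the `ORDER BY` clause. The sketch below assembles the corresponding `CREATE INDEX` statements as strings. The helper `build_index_sql` and the parameter values shown are illustrative assumptions, not tuned recommendations.

```python
# Operator class to use for each pgvector distance operator
OPCLASS_FOR_OPERATOR = {
    "<->": "vector_l2_ops",      # L2 distance
    "<#>": "vector_ip_ops",      # negative inner product
    "<=>": "vector_cosine_ops",  # cosine distance
}

def build_index_sql(table, column, operator, method="hnsw", **params):
    """Return a CREATE INDEX statement for an hnsw or ivfflat index.

    Identifiers are interpolated as-is, so only pass trusted table and column names.
    """
    if method not in ("hnsw", "ivfflat"):
        raise ValueError(f"unsupported index method: {method}")
    opclass = OPCLASS_FOR_OPERATOR[operator]
    options = ", ".join(f"{key} = {value}" for key, value in params.items())
    with_clause = f" WITH ({options})" if options else ""
    return f"CREATE INDEX ON {table} USING {method} ({column} {opclass}){with_clause};"

print(build_index_sql("fashion", "image_embeddings", "<->", m=16, ef_construction=64))
print(build_index_sql("fashion", "image_embeddings", "<=>", method="ivfflat", lists=100))
```

Keeping the operator-to-opclass mapping in one place makes it harder to create an index for one distance measure and then query with another.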

In this step we'll take the image embeddings generated for the Kaggle dataset and store them in a PostgreSQL column of type `vector`.

In [22]:
import psycopg
from pgvector.psycopg import register_vector
import boto3 
import json 
import numpy as np

client = boto3.client('secretsmanager')

response = client.get_secret_value(SecretId='apgpg-pgvector-secret')
database_secrets = json.loads(response['SecretString'])

dbhost = database_secrets['host']
dbport = database_secrets['port']
dbuser = database_secrets['username']
dbpass = database_secrets['password']

dbconn = psycopg.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10, autocommit=True)

dbconn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(dbconn)

dbconn.execute("DROP TABLE IF EXISTS fashion;")

dbconn.execute("""CREATE TABLE IF NOT EXISTS fashion(
                   id bigserial primary key, 
                   product_id text, 
                   category text, 
                   product_type text, 
                   product_title text,
                   image_url text,
                   image_embeddings vector(512));""")


for _, x in df.iterrows():
    dbconn.execute("""INSERT INTO fashion
                  (product_id, category, product_type, product_title, image_url, image_embeddings) 
                   VALUES(%s, %s, %s, %s, %s, %s);""", 
                   (x.get('ProductId'), x.get('Category'), x.get('ProductType'), x.get('ProductTitle'), x.get('ImageURL'), x.get('image_embeddings')))

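# Use the L2 operator class so the index matches the `<->` (L2 distance) operator used by the search query
dbconn.execute("""CREATE INDEX ON fashion 
                   USING hnsw (image_embeddings vector_l2_ops) 
                   WITH  (m = 16, ef_construction = 64);""")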

dbconn.execute("VACUUM ANALYZE fashion;")

dbconn.close()
print ("Vector embeddings have been successfully loaded into Aurora PostgreSQL tables")

Vector embeddings have been successfully loaded into Aurora PostgreSQL tables
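
The load cell above issues one `INSERT` per row. A minimal sketch of a batched alternative, assuming a psycopg 3 connection set up as in that cell: build all parameter tuples first, then pass them to `cursor.executemany`. The helper names `rows_from_records` and `load_rows` are hypothetical, and the records are plain dictionaries with the same keys as the DataFrame columns.

```python
INSERT_SQL = """INSERT INTO fashion
    (product_id, category, product_type, product_title, image_url, image_embeddings)
    VALUES (%s, %s, %s, %s, %s, %s);"""

COLUMNS = ("ProductId", "Category", "ProductType", "ProductTitle", "ImageURL", "image_embeddings")

def rows_from_records(records):
    """Turn dict records into parameter tuples, skipping rows without an embedding."""
    return [
        tuple(record[name] for name in COLUMNS)
        for record in records
        if record.get("image_embeddings") is not None
    ]

def load_rows(conn, records):
    """Send all rows with a single executemany call and return how many were sent."""
    rows = rows_from_records(records)
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    return len(rows)
```

In the notebook this could be called as `load_rows(dbconn, df.to_dict("records"))`. Skipping rows whose embedding is `None` is a design choice for this sketch; logging or re-processing them may be preferable in practice.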


## Evaluate PostgreSQL vector Search Results

In this step we generate an embedding for the query image with `generate_embeddings`, use it to search PostgreSQL for the nearest neighbors, and retrieve the relevant product images.


In [23]:
import numpy as np
from skimage import io
import matplotlib.pyplot as plt
import requests
from IPython.display import display, Markdown, Latex, HTML
import ipywidgets as widgets


def similarity_image_search(image_url):
    res1 = generate_embeddings(image_url)
    
    client = boto3.client('secretsmanager')
    response = client.get_secret_value(SecretId='apgpg-pgvector-secret')
    database_secrets = json.loads(response['SecretString'])

    dbhost = database_secrets['host']
    dbport = database_secrets['port']
    dbuser = database_secrets['username']
    dbpass = database_secrets['password']

    dbconn = psycopg.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10)
    register_vector(dbconn)
        
    r = dbconn.execute("""SELECT product_id, product_title, image_url, image_embeddings 
                            FROM fashion where image_url != %s
                            ORDER BY image_embeddings <-> %s limit 3;""", (image_url, np.array(res1),))
 
    urls = []
    plt.rcParams["figure.figsize"] = [7.50, 3.50]
    plt.rcParams["figure.autolayout"] = True
    
    display(Markdown("## Reference product"))
    display(HTML("""<table><tr><td><img src={} width="250"></td></tr></table>""".format(image_url)))
    
    display(Markdown(("## Similar products")))
        
    item_td = ""
    img_td = ""
    for x in r:
        url = x[2]
        item_td = item_td + """<td style="text-align: center; vertical-align: middle;"> <h4> ProductId: {}</h4></td>""".format(str(x[0]))
        img_td = img_td + """<td><img src={} width="250"></td>""".format(url)

    display(HTML("""<table><tr>{}</tr><tr>{}</tr></table>""".format(item_td,img_td)))
    dbconn.close()
    

print("Search function created successfully")

Search function created successfully
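
The function above interpolates product IDs and URLs straight into HTML. If the catalog could contain untrusted values, escaping them first is a safer habit. A minimal sketch using the standard library's `html.escape` follows. `render_results_table` is a hypothetical helper that takes `(product_id, image_url)` pairs like the ones returned by the query, and the URL in the example is a placeholder.

```python
from html import escape

def render_results_table(results, width=250):
    """Build an HTML table with one column per (product_id, image_url) pair."""
    id_cells = "".join(
        f'<td style="text-align: center;"><h4>ProductId: {escape(str(pid))}</h4></td>'
        for pid, _ in results
    )
    img_cells = "".join(
        f'<td><img src="{escape(url, quote=True)}" width="{int(width)}"></td>'
        for _, url in results
    )
    return f"<table><tr>{id_cells}</tr><tr>{img_cells}</tr></table>"

# Placeholder URL for illustration only
print(render_results_table([(52123, "http://example.com/a.jpg?x=1&y=2")]))
```

The returned string can be passed to `display(HTML(...))` in the same way as the table built inside `similarity_image_search`.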


Using the function `similarity_image_search`, let's run some image searches.
For example, the parameter passed to the function below is the reference image URL of the 4th item in the dataset of 2,900+ items.

In [24]:
similarity_image_search(df.iloc[3].get('ImageURL'))

## Reference product

## Similar products

| ProductId: 52123 | ProductId: 50721 | ProductId: 38286 |
|---|---|---|


Let's search for Jeans similar to the image in the 1004th row of the dataset. 

In [25]:
similarity_image_search(df.iloc[1003].get('ImageURL'))

## Reference product

## Similar products

| ProductId: 38993 | ProductId: 40928 | ProductId: 40925 |
|---|---|---|


Let's search for shoes similar to the image in the 2001st row of the dataset.

In [26]:
similarity_image_search(df.iloc[2000].get('ImageURL'))

## Reference product

## Similar products

| ProductId: 3301 | ProductId: 33822 | ProductId: 13213 |
|---|---|---|


In this workshop you have successfully implemented image search functionality in PostgreSQL using Amazon Bedrock and pgvector.