# Using GPT-4o mini to tag & caption images

This notebook explores how to leverage the vision capabilities of the GPT-4* models (for example `gpt-4o`, `gpt-4o-mini` or `gpt-4-turbo`) to tag & caption images.

We can leverage the multimodal capabilities of these models to provide input images along with additional context on what they represent, and prompt the model to output tags or image descriptions. The image descriptions can then be further refined with a language model (in this notebook, we'll use `gpt-4o-mini`) to generate captions.

Generating text content from images can be useful for multiple use cases, especially use cases involving search.  
We will illustrate a search use case in this notebook by using generated keywords and product captions to search for products - both from a text input and an image input.

As an example, we will use a dataset of Amazon furniture items, tag them with relevant keywords and generate short, descriptive captions.

## Setup

In [1]:
# Install dependencies if needed
%pip install --upgrade openai
%pip install scikit-learn

Collecting openai
  Downloading openai-1.39.0-py3-none-any.whl.metadata (22 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.39.0-py3-none-any.whl (336 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m336.7/336.7 kB[0m [31m16.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpx-0.27.0-py3-none-any.whl (75 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpcore-1.0.5-py3-none-any.whl (77 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.9/77.9 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading h11-0.14.0-py3-none-a

In [2]:
from IPython.display import Image, display
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from openai import OpenAI
import os

# Set the environment variable
os.environ["OPENAI_API_KEY"] = "OPEN_AI_KEY_HERE"

# Initializing OpenAI client - see https://platform.openai.com/docs/quickstart?context=python
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

In [3]:
# Loading dataset
dataset_path =  "/content/amazon_furniture_dataset.csv"
df = pd.read_csv(dataset_path)
df.head()

Unnamed: 0,asin,url,title,brand,price,availability,categories,primary_image,images,upc,...,color,material,style,important_information,product_overview,about_item,description,specifications,uniq_id,scraped_at
0,B0CG1N9QRC,https://www.amazon.com/dp/B0CG1N9QRC,JOIN IRON Foldable TV Trays for Eating Set of ...,JOIN IRON Store,$89.99,Usually ships within 5 to 6 weeks,"['Home & Kitchen', 'Furniture', 'Game & Recrea...",https://m.media-amazon.com/images/I/41p4d4VJnN...,['https://m.media-amazon.com/images/I/41p4d4VJ...,,...,Grey Set of 4,Iron,X Classic Style,[],,['Includes 4 Folding Tv Tray Tables And one Co...,Set of Four Folding Trays With Matching Storag...,"['Brand: JOIN IRON', 'Shape: Rectangular', 'In...",bdc9aa30-9439-50dc-8e89-213ea211d66a,2/2/2024 18:53
1,B0C9WYYFLB,https://www.amazon.com/dp/B0C9WYYFLB,"LOVMOR 30'' Bathroom Vanity Sink Base Cabine, ...",LOVMOR,,Only 5 left in stock - order soon.,"['Home & Kitchen', 'Furniture', 'Bathroom Furn...",https://m.media-amazon.com/images/I/41zMuj2wvv...,['https://m.media-amazon.com/images/I/41zMuj2w...,,...,Cameo Scotch,Wood,"Soft-closing Switch, Soft-closing Switch",[],,['Durable & Lightweight Construction: Our bath...,Our versatile bathroom sink base cabinet is pe...,"['Brand: LOVMOR', 'Color: Cameo Scotch', 'Reco...",20da3703-26f1-53e5-aa0b-a8104527d1bb,2/2/2024 18:53
2,B09NZY3R1T,https://www.amazon.com/dp/B09NZY3R1T,Folews Bathroom Organizer Over The Toilet Stor...,Folews Store,$63.99,In Stock,"['Home & Kitchen', 'Furniture', 'Bathroom Furn...",https://m.media-amazon.com/images/I/41ixgM73Dg...,['https://m.media-amazon.com/images/I/41ixgM73...,,...,,,Classic,[],,['4 Tier Large Capacity: The 4-tier design of ...,,"['Room Type: Laundry Room, Bathroom, Bedroom, ...",aba4138e-6401-52ca-a099-02e30b638db4,2/2/2024 18:53
3,B09PTXGFZD,https://www.amazon.com/dp/B09PTXGFZD,"Lerliuo Nightstand, Side Table, Industrial Bed...",Lerliuo Store,$39.99,In Stock,"['Home & Kitchen', 'Furniture', 'Bedroom Furni...",https://m.media-amazon.com/images/I/41IzLmM91F...,['https://m.media-amazon.com/images/I/41IzLmM9...,,...,Grey,Stone,Classic,[],,['Elegant Grey: This industrial modern style b...,,"['Brand: Lerliuo', 'Shape: Rectangular', 'Room...",fa87da9a-a8cf-51d7-895f-d64b75ee02a3,2/2/2024 18:53
4,B002FL3LL2,https://www.amazon.com/dp/B002FL3LL2,Boss Office Products Any Task Mid-Back Task Ch...,Boss Office Products Store,,,"['Home & Kitchen', 'Furniture', 'Home Office F...",https://m.media-amazon.com/images/I/41rMElFrXB...,['https://m.media-amazon.com/images/I/41rMElFr...,,...,Grey,Foam,Loop Arms,[],,['Mid-back styling with firm lumbar support; E...,Mid-back styling with firm lumbar support. Ele...,"['Brand: Boss Office Products', 'Color: Grey',...",a0a69530-a944-589d-a036-90358cb9e485,2/2/2024 18:53


## Tag images

In this section, we'll use GPT-4o mini to generate relevant tags for our products.

We'll use a simple zero-shot approach to extract keywords, and deduplicate those keywords using embeddings to avoid having multiple keywords that are too similar.

We will use a combination of an image and the product title to avoid extracting keywords for other items that are depicted in the image - sometimes there are multiple items used in the scene and we want to focus on just the one we want to tag.

### Extract keywords

In [4]:
system_prompt = '''
    You are an agent specialized in tagging images of furniture items, decorative items, or furnishings with relevant keywords that could be used to search for these items on a marketplace.

    You will be provided with an image and the title of the item that is depicted in the image, and your goal is to extract keywords for only the item specified.

    Keywords should be concise and in lower case.

    Keywords can describe things like:
    - Item type e.g. 'sofa bed', 'chair', 'desk', 'plant'
    - Item material e.g. 'wood', 'metal', 'fabric'
    - Item style e.g. 'scandinavian', 'vintage', 'industrial'
    - Item color e.g. 'red', 'blue', 'white'

    Only deduce material, style or color keywords when it is obvious that they make the item depicted in the image stand out.

    Return keywords in the format of an array of strings, like this:
    ['desk', 'industrial', 'metal']

'''

def analyze_image(img_url, title):
    response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": img_url,
                    }
                },
            ],
        },
        {
            "role": "user",
            "content": title
        }
    ],
        max_tokens=300,
        top_p=0.1
    )

    return response.choices[0].message.content

#### Testing with a few examples

In [5]:
examples = df.iloc[:7]

In [6]:
for index, ex in examples.iterrows():
    url = ex['primary_image']
    img = Image(url=url)
    display(img)
    result = analyze_image(url, ex['title'])
    print(result)
    print("\n\n")

['tv tray', 'foldable', 'metal', 'grey']





['vanity cabinet', 'storage cabinet', 'wood', 'brown', 'bathroom', 'kitchen', 'laundry']





['bathroom organizer', 'over toilet shelf', 'storage rack', 'freestanding', 'black', 'metal']





['nightstand', 'side table', 'industrial', 'grey', 'black', 'metal', '2 drawers', 'open shelf']





['task chair', 'mid-back', 'grey', 'fabric', 'adjustable', 'rolling']





['towel bar', 'brass', 'vintage', 'gold']





['wall mount', 'swing-arm', 'black', 'metal']





### Looking up existing keywords

Using embeddings to avoid duplicates (synonyms) and/or match pre-defined keywords

In [7]:
# Feel free to change the embedding model here
def get_embedding(value, model="text-embedding-3-large"):
    embeddings = client.embeddings.create(
      model=model,
      input=value,
      encoding_format="float"
    )
    return embeddings.data[0].embedding

#### Testing with example keywords

In [8]:
# Existing keywords
keywords_list = ['industrial', 'metal', 'wood', 'vintage', 'bed']

In [9]:
df_keywords = pd.DataFrame(keywords_list, columns=['keyword'])
df_keywords['embedding'] = df_keywords['keyword'].apply(lambda x: get_embedding(x))
df_keywords

Unnamed: 0,keyword,embedding
0,industrial,"[-0.026137426, 0.021297162, -0.007273361, -0.0..."
1,metal,"[-0.020472562, 0.0045137997, -0.011044847, -0...."
2,wood,"[0.013877833, 0.02955235, 0.0006239023, -0.035..."
3,vintage,"[-0.052324098, 0.008192246, -0.015525414, 0.00..."
4,bed,"[-0.011677503, 0.023275835, 0.0026937425, -0.0..."


In [10]:
def compare_keyword(keyword):
    embedded_value = get_embedding(keyword)
    df_keywords['similarity'] = df_keywords['embedding'].apply(lambda x: cosine_similarity(np.array(x).reshape(1,-1), np.array(embedded_value).reshape(1, -1)))
    most_similar = df_keywords.sort_values('similarity', ascending=False).iloc[0]
    return most_similar

def replace_keyword(keyword, threshold = 0.6):
    most_similar = compare_keyword(keyword)
    if most_similar['similarity'] > threshold:
        print(f"Replacing '{keyword}' with existing keyword: '{most_similar['keyword']}'")
        return most_similar['keyword']
    return keyword

In [11]:
# Example keywords to compare to our list of existing keywords
example_keywords = ['bed frame', 'wooden', 'vintage', 'old school', 'desk', 'table', 'old', 'metal', 'metallic', 'woody']
final_keywords = []

for k in example_keywords:
    final_keywords.append(replace_keyword(k))

final_keywords = set(final_keywords)
print(f"Final keywords: {final_keywords}")

Replacing 'bed frame' with existing keyword: 'bed'
Replacing 'wooden' with existing keyword: 'wood'
Replacing 'vintage' with existing keyword: 'vintage'
Replacing 'metal' with existing keyword: 'metal'
Replacing 'metallic' with existing keyword: 'metal'
Replacing 'woody' with existing keyword: 'wood'
Final keywords: {'bed', 'metal', 'desk', 'wood', 'vintage', 'old school', 'table', 'old'}


## Generate captions

In this section, we'll use GPT-4o mini to generate an image description and then use a few-shot examples approach with GPT-4-turbo to generate captions from the images.

If few-shot examples are not enough for your use case, consider fine-tuning a model to get the generated captions to match the style & tone you are targeting.

In [12]:
# Cleaning up dataset columns
selected_columns = ['title', 'primary_image', 'style', 'material', 'color', 'url']
df = df[selected_columns].copy()
df.head()

Unnamed: 0,title,primary_image,style,material,color,url
0,JOIN IRON Foldable TV Trays for Eating Set of ...,https://m.media-amazon.com/images/I/41p4d4VJnN...,X Classic Style,Iron,Grey Set of 4,https://www.amazon.com/dp/B0CG1N9QRC
1,"LOVMOR 30'' Bathroom Vanity Sink Base Cabine, ...",https://m.media-amazon.com/images/I/41zMuj2wvv...,"Soft-closing Switch, Soft-closing Switch",Wood,Cameo Scotch,https://www.amazon.com/dp/B0C9WYYFLB
2,Folews Bathroom Organizer Over The Toilet Stor...,https://m.media-amazon.com/images/I/41ixgM73Dg...,Classic,,,https://www.amazon.com/dp/B09NZY3R1T
3,"Lerliuo Nightstand, Side Table, Industrial Bed...",https://m.media-amazon.com/images/I/41IzLmM91F...,Classic,Stone,Grey,https://www.amazon.com/dp/B09PTXGFZD
4,Boss Office Products Any Task Mid-Back Task Ch...,https://m.media-amazon.com/images/I/41rMElFrXB...,Loop Arms,Foam,Grey,https://www.amazon.com/dp/B002FL3LL2


### Describing images with GPT-4o mini

In [13]:
describe_system_prompt = '''
    You are a system generating descriptions for furniture items, decorative items, or furnishings on an e-commerce website.
    Provided with an image and a title, you will describe the main item that you see in the image, giving details but staying concise.
    You can describe unambiguously what the item is and its material, color, and style if clearly identifiable.
    If there are multiple items depicted, refer to the title to understand which item you should describe.
    '''

def describe_image(img_url, title):
    response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.2,
    messages=[
        {
            "role": "system",
            "content": describe_system_prompt
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": img_url,
                    }
                },
            ],
        },
        {
            "role": "user",
            "content": title
        }
    ],
    max_tokens=300,
    )

    return response.choices[0].message.content

#### Testing on a few examples

In [14]:
for index, row in examples.iterrows():
    print(f"{row['title'][:50]}{'...' if len(row['title']) > 50 else ''} - {row['url']} :\n")
    img_description = describe_image(row['primary_image'], row['title'])
    print(f"{img_description}\n--------------------------\n")

JOIN IRON Foldable TV Trays for Eating Set of 4 wi... - https://www.amazon.com/dp/B0CG1N9QRC :

The JOIN IRON Foldable TV Trays set includes four sleek, grey folding tables designed for convenience and space-saving. Each tray features a sturdy top with a smooth finish, supported by a durable black iron frame that allows for easy folding and storage. The design is perfect for small spaces, making it ideal for dining, snacks, or as a side table. The set also comes with a stand for organized storage when not in use.
--------------------------

LOVMOR 30'' Bathroom Vanity Sink Base Cabine, Stor... - https://www.amazon.com/dp/B0C9WYYFLB :

The LOVMOR 30'' Bathroom Vanity Sink Base Cabinet features a classic design with a rich brown finish. It includes three drawers on the left side for ample storage, along with a spacious cabinet door on the right. The cabinet is constructed with detailed paneling, adding a touch of elegance, making it suitable for bathrooms, kitchens, laundry rooms, and ot

### Turning descriptions into captions
Using a few-shot examples approach to turn a long description into a short image caption

In [15]:
caption_system_prompt = '''
Your goal is to generate short, descriptive captions for images of furniture items, decorative items, or furnishings based on an image description.
You will be provided with a description of an item image and you will output a caption that captures the most important information about the item.
Your generated caption should be short (1 sentence), and include the most relevant information about the item.
The most important information could be: the type of the item, the style (if mentioned), the material if especially relevant and any distinctive features.
'''

few_shot_examples = [
    {
        "description": "This is a multi-layer metal shoe rack featuring a free-standing design. It has a clean, white finish that gives it a modern and versatile look, suitable for various home decors. The rack includes several horizontal shelves dedicated to organizing shoes, providing ample space for multiple pairs. Above the shoe storage area, there are 8 double hooks arranged in two rows, offering additional functionality for hanging items such as hats, scarves, or bags. The overall structure is sleek and space-saving, making it an ideal choice for placement in living rooms, bathrooms, hallways, or entryways where efficient use of space is essential.",
        "caption": "White metal free-standing shoe rack"
    },
    {
        "description": "The image shows a set of two dining chairs in black. These chairs are upholstered in a leather-like material, giving them a sleek and sophisticated appearance. The design features straight lines with a slight curve at the top of the high backrest, which adds a touch of elegance. The chairs have a simple, vertical stitching detail on the backrest, providing a subtle decorative element. The legs are also black, creating a uniform look that would complement a contemporary dining room setting. The chairs appear to be designed for comfort and style, suitable for both casual and formal dining environments.",
        "caption": "Set of 2 modern black leather dining chairs"
    },
    {
        "description": "This is a square plant repotting mat designed for indoor gardening tasks such as transplanting and changing soil for plants. It measures 26.8 inches by 26.8 inches and is made from a waterproof material, which appears to be a durable, easy-to-clean fabric in a vibrant green color. The edges of the mat are raised with integrated corner loops, likely to keep soil and water contained during gardening activities. The mat is foldable, enhancing its portability, and can be used as a protective surface for various gardening projects, including working with succulents. It's a practical accessory for garden enthusiasts and makes for a thoughtful gift for those who enjoy indoor plant care.",
        "caption": "Waterproof square plant repotting mat"
    }
]

formatted_examples = [[{
    "role": "user",
    "content": ex['description']
},
{
    "role": "assistant",
    "content": ex['caption']
}]
    for ex in few_shot_examples
]

formatted_examples = [i for ex in formatted_examples for i in ex]

In [16]:
def caption_image(description, model="gpt-4o-mini"):
    messages = formatted_examples
    messages.insert(0,
        {
            "role": "system",
            "content": caption_system_prompt
        })
    messages.append(
        {
            "role": "user",
            "content": description
        })
    response = client.chat.completions.create(
    model=model,
    temperature=0.2,
    messages=messages
    )

    return response.choices[0].message.content

#### Testing on a few examples

In [17]:
examples = df.iloc[5:8]

In [18]:
for index, row in examples.iterrows():
    print(f"{row['title'][:50]}{'...' if len(row['title']) > 50 else ''} - {row['url']} :\n")
    img_description = describe_image(row['primary_image'], row['title'])
    print(f"{img_description}\n--------------------------\n")
    img_caption = caption_image(img_description)
    print(f"{img_caption}\n--------------------------\n")

Kingston Brass BA1752BB Heritage 18-Inch Towel-Bar... - https://www.amazon.com/dp/B0B9PLM9P8 :

The Kingston Brass BA1752BB Heritage 18-Inch Towel Bar features a sleek and elegant design in brushed brass. It is 18 inches long, making it ideal for hanging towels in bathrooms or kitchens. The towel bar is supported by two decorative wall mounts, adding a classic touch to your decor. Its durable construction ensures longevity while maintaining a stylish appearance.
--------------------------

18-inch brushed brass towel bar with decorative wall mounts
--------------------------

Chief Mfg.Swing-Arm Wall Mount Hardware Mount Blac... - https://www.amazon.com/dp/B007E40Z5K :

The Chief Mfg Swing-Arm Wall Mount (TS218SU) is a versatile and sturdy hardware mount designed for flat-screen televisions. It features a sleek black finish and a swing-arm design that allows for adjustable positioning. The mount is constructed from durable materials, ensuring stability and support for your TV. Its desi

## Image search

In this section, we will use generated keywords and captions to search items that match a given input, either text or image.

We will leverage our embeddings model to generate embeddings for the keywords and captions and compare them to either input text or the generated caption from an input image.

In [19]:
# Df we'll use to compare keywords
df_keywords = pd.DataFrame(columns=['keyword', 'embedding'])
df['keywords'] = ''
df['img_description'] = ''
df['caption'] = ''

In [20]:
# Function to replace a keyword with an existing keyword if it's too similar
def get_keyword(keyword, df_keywords, threshold = 0.6):
    embedded_value = get_embedding(keyword)
    df_keywords['similarity'] = df_keywords['embedding'].apply(lambda x: cosine_similarity(np.array(x).reshape(1,-1), np.array(embedded_value).reshape(1, -1)))
    sorted_keywords = df_keywords.copy().sort_values('similarity', ascending=False)
    if len(sorted_keywords) > 0 :
        most_similar = sorted_keywords.iloc[0]
        if most_similar['similarity'] > threshold:
            print(f"Replacing '{keyword}' with existing keyword: '{most_similar['keyword']}'")
            return most_similar['keyword']
    new_keyword = {
        'keyword': keyword,
        'embedding': embedded_value
    }
    df_keywords = pd.concat([df_keywords, pd.DataFrame([new_keyword])], ignore_index=True)
    return keyword

### Preparing the dataset

In [21]:
import ast

def tag_and_caption(row):
    keywords = analyze_image(row['primary_image'], row['title'])
    try:
        keywords = ast.literal_eval(keywords)
        mapped_keywords = [get_keyword(k, df_keywords) for k in keywords]
    except Exception as e:
        print(f"Error parsing keywords: {keywords}")
        mapped_keywords = []
    img_description = describe_image(row['primary_image'], row['title'])
    caption = caption_image(img_description)
    return {
        'keywords': mapped_keywords,
        'img_description': img_description,
        'caption': caption
    }


In [22]:
df.shape

(301, 9)

Processing all 312 lines of the dataset will take a while.
To test out the idea, we will only run it on the first 50 lines: this takes ~20 mins.
Feel free to skip this step and load the already processed dataset (see below).

In [23]:
# Running on first 50 lines
for index, row in df[:50].iterrows():
    print(f"{index} - {row['title'][:50]}{'...' if len(row['title']) > 50 else ''}")
    updates = tag_and_caption(row)
    df.loc[index, updates.keys()] = updates.values()

0 - JOIN IRON Foldable TV Trays for Eating Set of 4 wi...
1 - LOVMOR 30'' Bathroom Vanity Sink Base Cabine, Stor...
2 - Folews Bathroom Organizer Over The Toilet Storage,...
3 - Lerliuo Nightstand, Side Table, Industrial Bedside...
4 - Boss Office Products Any Task Mid-Back Task Chair ...
5 - Kingston Brass BA1752BB Heritage 18-Inch Towel-Bar...
6 - Chief Mfg.Swing-Arm Wall Mount Hardware Mount Blac...
7 - DOMYDEVM Black End Table, Nightstand with Charging...
8 - LASCO 35-5019 Hallmack Style 24-Inch Towel Bar Acc...
9 - Table-Mate II PRO TV Tray Table - Folding Table wi...
10 - EGFheal White Dress Up Storage
11 - Caroline's Treasures PPD3013JMAT Enchanted Garden ...
12 - Leick Home 70007-WTGD Mixed Metal and Wood Stepped...
13 - Caroline's Treasures CK3435MAT Bichon Frise Doorma...
14 - Wildkin Kids Canvas Sling Bookshelf with Storage f...
15 - Gbuzozie 38L Round Laundry Hamper Cute Mermaid Gir...
16 - Tiita Comfy Saucer Chair, Soft Faux Fur Oversized ...
17 - Summer Desk Decor,Welcome

In [24]:
df.head()

Unnamed: 0,title,primary_image,style,material,color,url,keywords,img_description,caption
0,JOIN IRON Foldable TV Trays for Eating Set of ...,https://m.media-amazon.com/images/I/41p4d4VJnN...,X Classic Style,Iron,Grey Set of 4,https://www.amazon.com/dp/B0CG1N9QRC,"[tv tray, foldable, metal, grey]",The JOIN IRON Foldable TV Trays set includes f...,Set of 4 foldable grey TV trays with metal fra...
1,"LOVMOR 30'' Bathroom Vanity Sink Base Cabine, ...",https://m.media-amazon.com/images/I/41zMuj2wvv...,"Soft-closing Switch, Soft-closing Switch",Wood,Cameo Scotch,https://www.amazon.com/dp/B0C9WYYFLB,"[vanity cabinet, storage cabinet, wood, brown,...",The LOVMOR 30'' Bathroom Vanity Sink Base Cabi...,Classic 30'' brown bathroom vanity sink base c...
2,Folews Bathroom Organizer Over The Toilet Stor...,https://m.media-amazon.com/images/I/41ixgM73Dg...,Classic,,,https://www.amazon.com/dp/B09NZY3R1T,"[bathroom organizer, over toilet shelf, storag...",The Folews Bathroom Organizer is a freestandin...,Freestanding black metal four-tier bathroom or...
3,"Lerliuo Nightstand, Side Table, Industrial Bed...",https://m.media-amazon.com/images/I/41IzLmM91F...,Classic,Stone,Grey,https://www.amazon.com/dp/B09PTXGFZD,"[nightstand, side table, industrial, grey, bla...",The Lerliuo Nightstand is a stylish and functi...,Sleek grey nightstand with two drawers and an ...
4,Boss Office Products Any Task Mid-Back Task Ch...,https://m.media-amazon.com/images/I/41rMElFrXB...,Loop Arms,Foam,Grey,https://www.amazon.com/dp/B002FL3LL2,"[task chair, mid-back, grey, fabric, adjustabl...",The Boss Office Products Any Task Mid-Back Tas...,Sleek grey mid-back task chair with adjustable...


In [25]:
data_path = "/content/items_tagged_and_captioned.csv"

In [26]:
# Saving locally for later - optional: do not execute if you prefer to use the provided file
df.to_csv(data_path, index=False)

In [27]:
# Optional: load data from saved file if you haven't processed the whole dataset
df = pd.read_csv(data_path)

### Embedding captions and keywords
We can now use the generated captions and keywords to match relevant content to an input text query or caption.
To do this, we will embed a combination of keywords + captions.
Note: creating the embeddings will take ~3 mins to run. Feel free to load the pre-processed dataset (see below).

In [28]:
df_search = df.copy()

In [29]:
def embed_tags_caption(x):
    if x['caption'] != '':
        try:
            keywords_string = ",".join(k for k in x['keywords']) + '\n'
            content = keywords_string + x['caption']
            embedding = get_embedding(content)
            return embedding
        except Exception as e:
            print(f"Error creating embedding for {x}: {e}")

In [30]:
df_search['embedding'] = df_search.apply(lambda x: embed_tags_caption(x), axis=1)

Error creating embedding for title              TIMCORR CD Case DVD Holder Storage: 144 Capaci...
primary_image      https://m.media-amazon.com/images/I/411Q2ETwel...
style                                                       Portable
material                           EVA + PVC + PP + Non-woven fabric
color                                                          Black
url                             https://www.amazon.com/dp/B0B19ZGGXC
keywords                                                         NaN
img_description                                                  NaN
caption                                                          NaN
Name: 50, dtype: object: 'float' object is not iterable
Error creating embedding for title              Ginger Cayden Closed Towel Ring - 4905/SN - Sa...
primary_image      https://m.media-amazon.com/images/I/31LNv7QILd...
style                                                            NaN
material                                                  

In [31]:
df_search.head()

Unnamed: 0,title,primary_image,style,material,color,url,keywords,img_description,caption,embedding
0,JOIN IRON Foldable TV Trays for Eating Set of ...,https://m.media-amazon.com/images/I/41p4d4VJnN...,X Classic Style,Iron,Grey Set of 4,https://www.amazon.com/dp/B0CG1N9QRC,"['tv tray', 'foldable', 'metal', 'grey']",The JOIN IRON Foldable TV Trays set includes f...,Set of 4 foldable grey TV trays with metal fra...,"[-0.03314441, 0.0034025249, -0.019191332, -0.0..."
1,"LOVMOR 30'' Bathroom Vanity Sink Base Cabine, ...",https://m.media-amazon.com/images/I/41zMuj2wvv...,"Soft-closing Switch, Soft-closing Switch",Wood,Cameo Scotch,https://www.amazon.com/dp/B0C9WYYFLB,"['vanity cabinet', 'storage cabinet', 'wood', ...",The LOVMOR 30'' Bathroom Vanity Sink Base Cabi...,Classic 30'' brown bathroom vanity sink base c...,"[-0.032412935, 0.02662491, -0.008775895, -0.01..."
2,Folews Bathroom Organizer Over The Toilet Stor...,https://m.media-amazon.com/images/I/41ixgM73Dg...,Classic,,,https://www.amazon.com/dp/B09NZY3R1T,"['bathroom organizer', 'over toilet shelf', 's...",The Folews Bathroom Organizer is a freestandin...,Freestanding black metal four-tier bathroom or...,"[-0.041910503, -0.016502438, -0.0042892112, -0..."
3,"Lerliuo Nightstand, Side Table, Industrial Bed...",https://m.media-amazon.com/images/I/41IzLmM91F...,Classic,Stone,Grey,https://www.amazon.com/dp/B09PTXGFZD,"['nightstand', 'side table', 'industrial', 'gr...",The Lerliuo Nightstand is a stylish and functi...,Sleek grey nightstand with two drawers and an ...,"[-0.002086421, -8.6050735e-05, -0.014964935, -..."
4,Boss Office Products Any Task Mid-Back Task Ch...,https://m.media-amazon.com/images/I/41rMElFrXB...,Loop Arms,Foam,Grey,https://www.amazon.com/dp/B002FL3LL2,"['task chair', 'mid-back', 'grey', 'fabric', '...",The Boss Office Products Any Task Mid-Back Tas...,Sleek grey mid-back task chair with adjustable...,"[0.004858285, -0.025756875, -0.016096056, -0.0..."


In [32]:
# Keep only the lines where we have embeddings
df_search = df_search.dropna(subset=['embedding'])
print(df_search.shape)

(50, 10)


In [35]:
data_embeddings_path = "/content/items_tagged_and_captioned_embeddings.csv"

In [36]:
# Saving locally for later - optional: do not execute if you prefer to use the provided file
df_search.to_csv(data_embeddings_path, index=False)

In [None]:
# Optional: load data from saved file if you haven't processed the whole dataset
from ast import literal_eval
df_search = pd.read_csv(data_embeddings_path)
df_search["embedding"] = df_search.embedding.apply(literal_eval).apply(np.array)

FileNotFoundError: [Errno 2] No such file or directory: 'data/items_tagged_and_captioned_embeddings.csv'

### Search from input text    

We can compare the input text from a user directly to the embeddings we just created.

In [37]:
# Searching for N most similar results
def search_from_input_text(query, n = 2):
    embedded_value = get_embedding(query)
    df_search['similarity'] = df_search['embedding'].apply(lambda x: cosine_similarity(np.array(x).reshape(1,-1), np.array(embedded_value).reshape(1, -1)))
    most_similar = df_search.sort_values('similarity', ascending=False).iloc[:n]
    return most_similar

In [38]:
user_inputs = ['shoe storage', 'black metal side table', 'doormat', 'step bookshelf', 'ottoman']

In [39]:
for i in user_inputs:
    print(f"Input: {i}\n")
    res = search_from_input_text(i)
    for index, row in res.iterrows():
        similarity_score = row['similarity']
        if isinstance(similarity_score, np.ndarray):
            similarity_score = similarity_score[0][0]
        print(f"{row['title'][:50]}{'...' if len(row['title']) > 50 else ''} ({row['url']}) - Similarity: {similarity_score:.2f}")
        img = Image(url=row['primary_image'])
        display(img)
        print("\n\n")

Input: shoe storage

Suptsifira Shoe storage box, 24 Packs Shoe Boxes C... (https://www.amazon.com/dp/B0BZ85JVBN) - Similarity: 0.59





MAEPA RV Shoe Storage for Bedside - 8 Extra Large ... (https://www.amazon.com/dp/B0C4PL1R3F) - Similarity: 0.53





Input: black metal side table

FLYJOE Narrow Side Table with PU Leather Magazine ... (https://www.amazon.com/dp/B0CHYDTQKN) - Similarity: 0.59





HomePop Metal Accent Table Triangle Base Round Mir... (https://www.amazon.com/dp/B08N5H868H) - Similarity: 0.56





Input: doormat

AnyDesign Christmas Welcome Doormat Decorative Xma... (https://www.amazon.com/dp/B0BC85H7Y7) - Similarity: 0.51





Let the Adventure Begin Door Mat 17"x30" Decorativ... (https://www.amazon.com/dp/B0C8SJSZYS) - Similarity: 0.49





Input: step bookshelf

Leick Home 70007-WTGD Mixed Metal and Wood Stepped... (https://www.amazon.com/dp/B098KNRNLQ) - Similarity: 0.56





Wildkin Kids Canvas Sling Bookshelf with Storage f... (https://www.amazon.com/dp/B07GBVFZ1Y) - Similarity: 0.45





Input: ottoman

Furnistar 15.9 inch Modern Round Velvet Storage Ot... (https://www.amazon.com/dp/B0C4NT8N8C) - Similarity: 0.47





HomePop Home Decor | K2380-YDQY-2 | Luxury Large F... (https://www.amazon.com/dp/B0B94T1TZ1) - Similarity: 0.47







### Search from image

If the input is an image, we can find similar images by first turning images into captions, and embedding those captions to compare them to the already created embeddings.

In [40]:
# We'll take a mix of images: some we haven't seen and some that are already in the dataset
example_images = df.iloc[306:]['primary_image'].to_list() + df.iloc[5:10]['primary_image'].to_list()

In [41]:
for i in example_images:
    img_description = describe_image(i, '')
    caption = caption_image(img_description)
    img = Image(url=i)
    print('Input: \n')
    display(img)
    res = search_from_input_text(caption, 1).iloc[0]
    similarity_score = res['similarity']
    if isinstance(similarity_score, np.ndarray):
        similarity_score = similarity_score[0][0]
    print(f"{res['title'][:50]}{'...' if len(res['title']) > 50 else ''} ({res['url']}) - Similarity: {similarity_score:.2f}")
    img_res = Image(url=res['primary_image'])
    display(img_res)
    print("\n\n")


Input: 



Kingston Brass BA1752BB Heritage 18-Inch Towel-Bar... (https://www.amazon.com/dp/B0B9PLM9P8) - Similarity: 0.63





Input: 



Chief Mfg.Swing-Arm Wall Mount Hardware Mount Blac... (https://www.amazon.com/dp/B007E40Z5K) - Similarity: 0.66





Input: 



DOMYDEVM Black End Table, Nightstand with Charging... (https://www.amazon.com/dp/B0BFJGDHVF) - Similarity: 0.67





Input: 



LASCO 35-5019 Hallmack Style 24-Inch Towel Bar Acc... (https://www.amazon.com/dp/B00N2OZU42) - Similarity: 0.50





Input: 



Table-Mate II PRO TV Tray Table - Folding Table wi... (https://www.amazon.com/dp/B093KMM9D3) - Similarity: 0.59







## Wrapping up


In this notebook, we explored how to leverage the multimodal capabilities of `gpt-4o-mini` to tag and caption images. By providing images along with contextual information to the model, we were able to generate tags and descriptions that can be further refined to create captions. This process has practical applications in various scenarios, particularly in enhancing search functionalities.

The search use case illustrated can be directly applied to applications such as recommendation systems, but the techniques covered in this notebook can be extended beyond items search and used in multiple use cases, for example RAG applications leveraging unstructured image data.

As a next step, you could explore using a combination of rule-based filtering with keywords and embeddings search with captions to retrieve more relevant results.