<div align="center" dir="auto">
<p dir="auto"><a href="https://colab.research.google.com/github/encord-team/encord-notebooks/blob/main/colab-notebooks/Encord_Notebooks_Building_Semantic_Search_for_Visual_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<div align="center" dir="auto">
  <div style="flex: 1; padding: 10px;">
    <a href="https://join.slack.com/t/encordactive/shared_invite/zt-1hc2vqur9-Fzj1EEAHoqu91sZ0CX0A7Q" target="_blank" style="text-decoration:none">
      <img alt="Join us on Slack" src="https://img.shields.io/badge/Join_Our_Community-4A154B?label=&logo=slack&logoColor=white">
    </a>
    <a href="https://docs.encord.com/docs/active-overview" target="_blank" style="text-decoration:none">
      <img alt="Documentation" src="https://img.shields.io/badge/docs-Online-blue">
    </a>
    <a href="https://twitter.com/encord_team" target="_blank" style="text-decoration:none">
      <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&amp;style=social">
    </a>
    <img alt="Python versions" src="https://img.shields.io/pypi/pyversions/encord-active">
    <a href="https://pypi.org/project/encord-active/" target="_blank" style="text-decoration:none">
      <img alt="PyPi project" src="https://img.shields.io/pypi/v/encord-active">
    </a>
    <a href="https://docs.encord.com/docs/active-contributing" target="_blank" style="text-decoration:none">
      <img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue">
    </a>
    <img alt="Licence" src="https://img.shields.io/github/license/encord-team/encord-active">
  </div>
</div>

<div align="center">
  <p>
    <a align="center" href="" target="_blank">
      <img
        width="7232"
        src="https://storage.googleapis.com/encord-notebooks/encord_active_notebook_banner.png">
    </a>
  </p>
</div>

# 🟣 Encord Notebooks | 🔎 Building Semantic Search for Visual Data





## 🏁 Overview

👋 Hi there! In this notebook, we will build a semantic search engine using CLIP and ChatGPT.

We will use an 🟣 Encord-Active sandbox project to search over.
The dataset is the COCO Validation dataset.

## 📥 Install 🟣 Encord-Active



👟 Run the following script to install 🟣[Encord Active](https://docs.encord.com/active/docs/).

<br>

📌  `python3.9` and `python3.10` are the version requirements to run 🟣Encord Active.

<br>


👉 Depending on your internet speed this might take 1-3 minutes.

In [None]:
# Assert that Python is 3.9 or 3.10
import sys
assert sys.version_info[:2] in [(3, 9), (3, 10)], "Encord Active is only supported on Python 3.9 and 3.10."

# Install Encord Active
!python -m pip install -qq encord-active==0.1.60

> # Please _RESTART_ your runtime before going any further.
We've noticed some complications with the latest version of Google Colab and NumPy, which are fixed by restarting the runtime.

Later, we'll also need the `openai` and `langchain` modules, so let's install them as well.

In [None]:
!python -m pip install -qq langchain openai

## 📩 Download an 🟣 Encord Active sandbox project

🌆 We will use the [COCO Validation set](https://paperswithcode.com/dataset/coco) project for this notebook 📙.

In [None]:
project_name = "[open-source][validation]-coco-2017-dataset"
!encord-active download --project-name $project_name

# 📨 Import all the necessary libraries

In this section, you will import the libraries used throughout the walkthrough to build the semantic search engine.

In [None]:
import os
import random
import sys
from functools import reduce
from getpass import getpass
from pathlib import Path
from pprint import pprint
from time import perf_counter
from typing import List

import clip
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import umap
from encord_active.lib.common.image_utils import show_image_and_draw_polygons
from encord_active.lib.common.iterator import DatasetIterator
from encord_active.lib.db.connection import DBConnection
from encord_active.lib.db.merged_metrics import (
    MergedMetrics,
    ensure_initialised_merged_metrics,
)
from encord_active.lib.project.project import Project
from faiss import IndexFlatIP
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser, RetryWithErrorOutputParser
from langchain.prompts import (
    AIMessagePromptTemplate,
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from PIL import Image
from pydantic import BaseModel, Field, root_validator, validator
from sklearn.preprocessing import normalize
from tqdm.auto import tqdm

# Another patch to make Colab work
sys.stdout.fileno = lambda: 0
sys.stderr.fileno = lambda: 1
# End patch


First, load the Encord Project

In [None]:
project = Project(Path(project_name)).load()

In [None]:
class DatasetImage(BaseModel):
    image: Path
    data_hash: str


iterator = DatasetIterator(project.file_structure.project_dir)

# 🗒️ List all images in the project
project_images: list[DatasetImage] = [
    DatasetImage(
        image=data_unit[1],
        data_hash=iterator.du_hash,
    )
    for data_unit in iterator.iterate()
]
project_img_df = pd.DataFrame([i.dict() for i in project_images])


You've loaded the image paths and associated data hashes to be able to match them to other queries later.

In [None]:
project_img_df = pd.DataFrame([i.dict() for i in project_images])
project_img_df.head()

# 📎Embedding Images with CLIP

In the following cells, you will learn how to embed images with CLIP. You will load in a bunch of images from the COCO Validation project and compute the CLIP embeddings.

Next, you will see how to search these embeddings based on both new Images and on Text.

Encord has made OpenAI's [CLIP model](https://github.com/openai/CLIP) available via pip for ease of use.
The dependency is already installed with `encord-active` so nothing needs to be done.

However, if you want the dependency in isolation, you can install it with the following command:

In [None]:
#!python -m pip install clip-ea

With the installation, it's easy to instantiate a pretrained model to use for embedding images:

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
print(f"Model loaded on the {'CPU' if device == 'cpu' else 'GPU'}")

Now embed some images. For starters, grab 1000 images and embed them in batches of 100 images.

In [None]:
BATCH_SIZE = 100
DB_SIZE = 1000
image_list = project_img_df.image.to_list()
db_images, unindexed_images = image_list[:DB_SIZE], image_list[DB_SIZE:]

@torch.inference_mode()
def embed_images(model, images: list[Path], device):
    out: list[np.ndarray] = []
    for batch_start in tqdm(range(0, len(images), BATCH_SIZE)):
        batch = images[batch_start : batch_start + BATCH_SIZE]
        if not batch:
            continue

        batch_images = [preprocess(Image.open(i).convert("RGB")) for i in batch]
        if len(batch_images) == 1:
            tensors = batch_images[0].to(device)[None]
        else:
            tensors = torch.stack(batch_images).to(device)
        # Use the model passed in as an argument rather than the global one
        out.append(model.encode_image(tensors).detach().cpu().numpy())

    # create one np array with all images
    if len(out) == 1:
        return out[0]
    return np.concatenate(out, axis=0)
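
The batching logic inside `embed_images` can be looked at in isolation. The following is a minimal sketch, assuming nothing beyond the standard library; `iter_batches` and the file names are hypothetical and not part of the notebook or of Encord Active.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# 250 hypothetical file names split into batches of 100
paths = [f"img_{i}.jpg" for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(paths, 100)]
print(batch_sizes)
```

The last batch is simply shorter than the others, which is why `embed_images` has to cope with batches of any size, including a single image.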


In [None]:
t0 = perf_counter()
embeddings = embed_images(clip_model, db_images, device=device)
t1 = perf_counter()
print(f"Embedding {embeddings.shape[0]} images took {t1 - t0:.3f} seconds ({embeddings.shape[0] / (t1-t0):.3f} img/sec)")

In [None]:
!nvidia-smi

# 📊 See how it looks with UMAP

UMAP is one of several ways of projecting high-dimensional data into 2D, so we can plot it.
Similar high-dimensional vectors should end up close to each other in the low-dimensional space.

In [None]:
reducer = umap.UMAP(random_state=0)
embeddings_2d = reducer.fit_transform(embeddings)

fig, ax = plt.subplots()
ax.scatter(*embeddings_2d.T)

## ✂️ Indexing and searching CLIP Embeddings

To search the embeddings efficiently, it makes sense to build an index over them.

In this example, you'll keep it simple and build the index using `faiss`, as it's already available on Colab.

In [None]:
index = IndexFlatIP(embeddings.shape[1])
index.add(normalize(embeddings))
# ☝️ That's it really. Normalizing the vectors to unit norm makes the search equivalent to cosine similarity.
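
To see why normalizing makes an inner-product search behave like cosine similarity, here is a small NumPy sketch. It uses made-up random vectors instead of CLIP embeddings and a brute-force matrix product as a stand-in for the exact search that `IndexFlatIP` performs; the helper `l2_normalize` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(5, 8)).astype(np.float32)  # 5 made-up embeddings
query = rng.normal(size=(1, 8)).astype(np.float32)


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale every row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Inner product of unit-length vectors ...
ip_scores = l2_normalize(query) @ l2_normalize(database).T

# ... equals the cosine similarity of the raw vectors.
cosine = (query @ database.T) / (
    np.linalg.norm(query, axis=1, keepdims=True) * np.linalg.norm(database, axis=1)
)

# Brute-force top-k by score, highest first
k = 3
top_k = np.argsort(-ip_scores, axis=1)[:, :k]
print(np.allclose(ip_scores, cosine, atol=1e-5), top_k)
```

Because both sides are normalized once up front, the search itself only needs dot products.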

With the index, you can now query the embeddings 🔍

In [None]:
random.seed(0)

num_neighbors = 3
num_tries = 5

# Sample random image outside the ones in the index
query_indices = random.sample(list(range(len(unindexed_images))), k=num_tries)
query_images = [unindexed_images[i] for i in query_indices]

# Do search and embedding
query_embeddings = embed_images(clip_model, query_images, device=device)
similarities, indices = index.search(normalize(query_embeddings), k=num_neighbors)
query_2d = reducer.transform(query_embeddings)

# Plotting
fig, axs = plt.subplots(num_tries, num_neighbors+2, figsize=(15, 15))
for try_, (img, emb_2d, nn_similarities, nn_indices) in enumerate(zip(query_images, query_2d, similarities, indices)):
    # Plot 2D embeddings
    axs[try_, 0].scatter(*embeddings_2d.T)
    axs[try_, 0].axis("off")
    axs[try_, 0].scatter(*emb_2d.T, c="red")
    axs[try_, 0].scatter(*embeddings_2d[nn_indices].T, c="orange")

    # Plot images
    axs[try_, 1].set_title("Query Image")
    axs[try_, 1].imshow(Image.open(img))
    axs[try_, 1].axis("off")
    for sim, neighbor, ax in zip(nn_similarities, nn_indices, axs[try_, 2:]):
        ax.set_title(f"Similarity: {sim:.3f}")
        ax.imshow(Image.open(db_images[neighbor]))
        ax.axis("off")
fig.tight_layout()


✨ It gets even more powerful when you search via text embeddings!

In [None]:
text_queries = [
    "surfing",
    "motorbikes",
    "transportation",
    "red flowers in a vase"
]
num_neighbors = 3
num_tries = len(text_queries)

# Do search and embeddings
text_tensors = torch.concatenate([clip.tokenize(t) for t in text_queries], dim=0).to(device)
query_embeddings = clip_model.encode_text(
    text_tensors
).detach().cpu().numpy()
similarities, indices = index.search(normalize(query_embeddings), k=num_neighbors)
query_2d = reducer.transform(query_embeddings)

# Plot
fig, axs = plt.subplots(num_tries, num_neighbors+1, figsize=(15, 12))
for try_, (query, emb_2d, nn_similarities, nn_indices) in enumerate(zip(text_queries, query_2d, similarities, indices)):
    # Plot 2D embeddings
    axs[try_, 0].scatter(*embeddings_2d.T)
    axs[try_, 0].axis("off")
    axs[try_, 0].scatter(*emb_2d.T, c="red")
    axs[try_, 0].scatter(*embeddings_2d[nn_indices].T, c="orange")
    axs[try_, 0].set_title(f'Query: "{query}"')

    # Plot images
    for sim, neighbor, ax in zip(nn_similarities, nn_indices, axs[try_, 1:]):
        ax.set_title(f"Similarity: {sim:.3f}")
        ax.imshow(Image.open(db_images[neighbor]))
        ax.axis("off")

fig.tight_layout()

# 🔎 Indirect Search with ChatGPT



For this, you'll use `langchain` to send prompts to the model and parse its answers.

Steps:
1. Load Quality Metrics from the Encord Project
2. Setup prompt
3. Ask ChatGPT for help

Get the complete data frame from the project

In [None]:
ensure_initialised_merged_metrics(project.file_structure)
with DBConnection(project.file_structure) as conn:
    df = MergedMetrics(conn).all()

df["data_hash"] = df.index.str.split("_", expand=False).str[1]
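
The last line above takes the second underscore-separated field of each row identifier as the data hash. The sketch below shows the same per-row logic in plain Python. It assumes identifiers follow a `<label_hash>_<data_hash>_<frame>` layout; the example values and the helper `data_hash_from_identifier` are made up for illustration.

```python
# Hypothetical identifiers following the assumed "<label_hash>_<data_hash>_<frame>" layout
identifiers = [
    "lh-aaa_du-111_00000",
    "lh-aaa_du-222_00000",
    "lh-bbb_du-333_00000",
]


def data_hash_from_identifier(identifier: str) -> str:
    """Return the second underscore-separated field of an identifier."""
    parts = identifier.split("_")
    if len(parts) < 3:
        raise ValueError(f"Unexpected identifier format: {identifier!r}")
    return parts[1]


data_hashes = [data_hash_from_identifier(i) for i in identifiers]
print(data_hashes)
```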

A few insights from the table

In [None]:
pd.set_option("display.precision", 3)
print(df.describe().to_string())

In [None]:
#@title 🗝️ Set api key and instantiate model
OPENAI_API_KEY = getpass("What's your OpenAI API key? ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

model_name = 'text-davinci-003'
temperature = 0.0
model = OpenAI(model_name=model_name, temperature=temperature)

Prepare the prompts:

In [None]:
# Define the prompt that we'll be giving ChatGPT
def form_prompt(dataframe, parser):
    system_message_prompt = SystemMessagePromptTemplate.from_template(
        "You are a helpful assistant that translates human queries to filters that apply to a data frame."
    )
    columns_str = "\n".join(dataframe.columns)
    instructions_prompt = HumanMessagePromptTemplate.from_template(
        f"Columns in the dataframe are: \n{columns_str}\n\n"
        f"Data frame description: \n{dataframe.describe()}\n\n"
        "Here are some rules:\n"
        "1. Top, highest, or largest means the highest quartile.\n"
        "2. Bottom, least, and lowest means the lowest quartile.\n"
        "3. `min_value` and `max_value` should be floats or ints related to the data frame description above.\n"
        "4. `min_value` cannot be larger than the `max_value`\n"
        'If you are not able to answer, please respond with {{"filters": [{{"column": "unknown", "min_value": -1, "max_value": -1}}]}}\n\n'
    )
    query_prompt = HumanMessagePromptTemplate(
        prompt=PromptTemplate(
            template="Answer the user query.\n{format_instructions}\n{query}\n",
            input_variables=["query"],
            partial_variables={"format_instructions": parser.get_format_instructions()}
        )
    )
    return ChatPromptTemplate.from_messages([system_message_prompt, instructions_prompt, query_prompt])
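
Rules 1 and 2 in the prompt ask the model to map words like "high" and "low" to quartiles of the described data frame. As a rough illustration of the bounds you would expect back, here is a NumPy sketch on made-up values; the column name `Contrast` and the dictionaries are assumptions for this example, not output from the model.

```python
import numpy as np

# Made-up metric values standing in for one column of the metrics data frame
contrast = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90])

q25, q75 = np.percentile(contrast, [25, 75])

# "High contrast" -> keep the highest quartile
high_filter = {"column": "Contrast", "min_value": float(q75), "max_value": float(contrast.max())}
# "Low contrast" -> keep the lowest quartile
low_filter = {"column": "Contrast", "min_value": float(contrast.min()), "max_value": float(q25)}

print(high_filter)
print(low_filter)
```

The 25% and 75% rows printed by `dataframe.describe()` carry the same information, which is why the description is included in the prompt.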


In [None]:
# Define pydantic model for filter outputs
class Filter(BaseModel):
    column: str = Field(description="The column of the provided dataframe to filter")
    min_value: float = Field(description="The minimum value to include")
    max_value: float = Field(description="The maximum value to include")

    @validator("column")
    def column_exists(cls, field):
        if field == "unknown":
            return field

        if field not in df.columns:
            raise ValueError("The specified column does not exist in the provided dataframe")
        return field

    @root_validator()
    def check_min_smaller_than_max(cls, values):
        min_value = values.get("min_value")
        max_value = values.get("max_value")

        if not isinstance(min_value, (float, int)):
            raise ValueError(f"`min_value` should be a number")

        if not isinstance(max_value, (float, int)):
            raise ValueError(f"`max_value` should be a number")

        if min_value > max_value:
            raise ValueError(f"`min_value` ({min_value}) cannot be larger than `max_value` ({max_value})")
        return values

class Filters(BaseModel):
    filters: list[Filter] = Field(description="A list of filters needed to be applied in given order")
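
To make the validation rules easier to follow, here is a dependency-free sketch that applies the same checks with `dataclasses` and `json` in place of pydantic. `RangeFilter`, `parse_filters`, and the example reply are hypothetical names and data, not part of the notebook or of `langchain`.

```python
import json
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class RangeFilter:
    column: str
    min_value: float
    max_value: float


def parse_filters(raw: str, known_columns: Set[str]) -> List[RangeFilter]:
    """Parse a JSON reply and enforce the same checks as the validators."""
    payload = json.loads(raw)
    parsed = []
    for item in payload["filters"]:
        candidate = RangeFilter(item["column"], float(item["min_value"]), float(item["max_value"]))
        if candidate.column != "unknown" and candidate.column not in known_columns:
            raise ValueError(f"Unknown column: {candidate.column}")
        if candidate.min_value > candidate.max_value:
            raise ValueError("`min_value` cannot be larger than `max_value`")
        parsed.append(candidate)
    return parsed


reply = '{"filters": [{"column": "Contrast", "min_value": 0.7, "max_value": 1.0}]}'
parsed_filters = parse_filters(reply, known_columns={"Contrast", "Brightness"})
print(parsed_filters)
```

Rejecting unknown columns and inverted ranges early means a bad model reply fails at parse time instead of silently returning an empty result later.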


See an example of what you would pass to ChatGPT:

In [None]:
parser = PydanticOutputParser(pydantic_object=Filters)
example_query = "What are all the images with both high contrast and many objects?"
input_prompt = form_prompt(df, parser).format_prompt(query=example_query, format_instructions=parser.get_format_instructions())
pprint(input_prompt.to_string())

And now the final bit, which is stitching it all together.

In [None]:
def do_indirect_query(model, query:str, dataframe: pd.DataFrame):
    # Generate the prompt
    parser = PydanticOutputParser(pydantic_object=Filters)
    input_prompt = form_prompt(dataframe, parser).format_prompt(query=query, format_instructions=parser.get_format_instructions())

    # Ask ChatGPT for help given the prompt
    response = model(input_prompt.to_string())

    # Parse the output, retrying once with the error fed back to the model
    output: Filters | None = None
    try:
        output = parser.parse(response)
    except Exception:
        print(f"Trying to fix error after receiving {response}")
        retry_parser = RetryWithErrorOutputParser.from_llm(parser=parser, llm=OpenAI(temperature=0))
        try:
            output = retry_parser.parse_with_prompt(response, input_prompt)
        except Exception:
            pass

    if not output or not output.filters or output.filters[0].column == "unknown":
        print(f"This query couldn't be processed properly. The response received from ChatGPT was: {response}")
        return None

    # Do the actual filtering on the data frame that was passed in
    subset_df = dataframe.copy()
    for filter_ in output.filters:
        subset_df = subset_df[subset_df[filter_.column].between(filter_.min_value, filter_.max_value, inclusive="both")]
    return subset_df.sort_values([f.column for f in output.filters], ascending=False).reset_index(), output
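
The filtering step at the end of the function can be tried without calling the model. The sketch below applies two range filters in sequence with `Series.between` on a made-up table; the column names, values, and the `range_filters` triples are assumptions for illustration only.

```python
import pandas as pd

# Made-up metric table; column names are illustrative only
metrics = pd.DataFrame(
    {
        "Contrast": [0.2, 0.8, 0.9, 0.5],
        "Object Count": [1, 12, 3, 9],
    },
    index=["a", "b", "c", "d"],
)

# (column, min_value, max_value) triples standing in for the parsed filters
range_filters = [("Contrast", 0.5, 1.0), ("Object Count", 5, 20)]

subset = metrics
for column, low, high in range_filters:
    # Keep only rows whose value lies in the closed interval [low, high]
    subset = subset[subset[column].between(low, high, inclusive="both")]

subset = subset.sort_values([c for c, _, _ in range_filters], ascending=False)
print(subset.index.tolist())
```

Each filter narrows the result of the previous one, so the order of the filters does not change which rows survive, only how much work each step does.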



Try it out for an example query on the entire dataset.

In [None]:
example_query = "What are all the images with both high contrast and many objects?"
subset_df, filters = do_indirect_query(model, example_query, df)
print(f"Number of results: {subset_df.shape[0]}")
print("Filters")
pprint(filters)

Plot the results to see the actual images found based on the query.

In [None]:
def plot_top_k_data_units(df, filters: Filters, k=12, cols=3):
    rows = k // cols if k % cols == 0 else k // cols + 1
    fig, axs = plt.subplots(rows, cols, figsize=(cols*3, rows*3))
    axs = axs.reshape(-1)
    fig.suptitle("; ".join(map(lambda f: f.column, filters.filters)))

    for (idx, row), ax in zip(df.iterrows(), axs):
        img = show_image_and_draw_polygons(row, project.file_structure)
        ax.imshow(img)
        ax.set_title("; ".join([f"{row[f.column]:.3f}" for f in filters.filters]))
        ax.axis("off")
    fig.tight_layout()
    return fig

_ = plot_top_k_data_units(subset_df, filters)

# 🪢 Putting it all together



Now that you know how to do direct semantic queries with CLIP and indirect semantic queries with ChatGPT, combine them.

The steps are:

1. Compute embeddings for the entire dataset
2. Define some direct and indirect query pairs
3. Use an index to find the nearest neighbors based on CLIP Embeddings
4. Use ChatGPT to refine the search by indirect queries
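
Before running the steps on the real project, it can help to see them on toy data. The following NumPy sketch mirrors the four steps with random stand-in embeddings, a brute-force inner-product search instead of a `faiss` index, and a hard-coded metric range instead of a ChatGPT answer. All names, values, and the threshold are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def unit_rows(x: np.ndarray) -> np.ndarray:
    """Scale every row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# 1. Stand-in embeddings for six images (real ones would come from CLIP)
image_embeddings = unit_rows(rng.normal(size=(6, 4)))
# Stand-in quality metric, one value per image
brightness = np.array([0.9, 0.1, 0.8, 0.4, 0.7, 0.2])

# 2. A "direct" query embedding (a slightly perturbed copy of image 0)
#    and an "indirect" metric range
query_embedding = unit_rows(image_embeddings[[0]] + 0.05 * rng.normal(size=(1, 4)))
min_brightness, max_brightness = 0.6, 1.0

# 3. Nearest neighbours by inner product, thresholded on similarity
similarity_threshold = 0.5  # arbitrary value for this toy data
scores = (query_embedding @ image_embeddings.T)[0]
order = np.argsort(-scores)
semantic_hits = order[scores[order] > similarity_threshold]

# 4. Refine the semantic hits with the metric range
hit_brightness = brightness[semantic_hits]
in_range = (hit_brightness >= min_brightness) & (hit_brightness <= max_brightness)
final_hits = semantic_hits[in_range]
print(final_hits)
```

The semantic search narrows the candidates first, and the metric range is applied only to those candidates, which is the same order used in the cells that follow.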

In [None]:
# Set some thresholds for the CLIP search
num_neighbors = 1000
similarity_threshold = 0.265  # 👈 The minimum similarity required to be considered relevant

In [None]:
# Embed the entire dataset
t0 = perf_counter()
project_embeddings = embed_images(clip_model, image_list, device=device)
t1 = perf_counter()
print(f"Embedding {project_embeddings.shape[0]} images took {t1 - t0:.3f} seconds ({project_embeddings.shape[0] / (t1-t0):.3f} img/sec)")

# Create an index
project_index = IndexFlatIP(project_embeddings.shape[1])
project_index.add(normalize(project_embeddings))

In [None]:
# Define queries
direct_queries = [
    "outdoor sports",
    "transportation"
]
indirect_queries = [
    "All the images with high brightness and many objects",
    "All the objects with high annotation quality"
]

# Embed direct queries
text_tensors = torch.concatenate([clip.tokenize(t) for t in direct_queries], dim=0).to(device)
query_embeddings = clip_model.encode_text(
    text_tensors
).detach().cpu().numpy()

In [None]:
# Do the direct semantic querying
# Search the index built over the entire project, not the 1000-image index from earlier
similarities, indices = project_index.search(normalize(query_embeddings), k=num_neighbors)

In [None]:
# Filter dataframe based on search result.
for (in_query, di_query, sim, idx) in zip(indirect_queries, direct_queries, similarities, indices):
    idx = idx[sim>similarity_threshold]
    data_hashes = set(project_img_df.iloc[idx].data_hash.to_list())

    filtered_df = df.copy()

    # Filter project data
    clip_filtered_df = filtered_df[filtered_df.data_hash.isin(data_hashes)]
    gpt_result = do_indirect_query(model, in_query, clip_filtered_df)

    if gpt_result is None:
        print(f"ChatGPT failed to produce valid filters for the indirect query {in_query}")
        continue

    gpt_filtered_df, filters = gpt_result

    print(f"Results for direct query: '{di_query}' and indirect query: '{in_query}'")
    print(f"Found {gpt_filtered_df.shape[0]} results matching the query out of {clip_filtered_df.shape[0]} semantically similar images.")
    print(f"Based on filters: {filters}")
    print("- " * 10)
    fig = plot_top_k_data_units(gpt_filtered_df, filters)
    fig.suptitle(f"IQ: '{in_query}', DQ: '{di_query}'", fontsize=16)

# ✅ Wrap up


📓This Colab notebook showed you how to build a semantic search engine for visual data using CLIP and ChatGPT.

---

🟣 Encord Active is an open-source framework for computer vision model testing, evaluation, and validation. Check out the project on [GitHub](https://github.com/encord-team/encord-active), leave a star 🌟 if you like it, and leave an issue if you find something is missing.

---

👉 Check out our 📖[blog](https://encord.com/blog/webinar-semantic-visual-search-chatgpt-clip/) and 📺[YouTube](https://www.youtube.com/@encord) channel to stay up-to-date with the latest in computer vision, foundation models, active learning, and data-centric AI.



### ✨ Want more walkthroughs like this? Check out the 🟣 [Encord Notebooks repository](https://github.com/encord-team/encord-notebooks/).