
# 🧠 Exploring Embeddings and Similarity Search with FiftyOne + Vector Search
Welcome to Notebook 2 of the workshop! This notebook builds a complete visual search workflow with **FiftyOne** and **vector search**: we compute **image embeddings** with the **CLIP model**, use them for **similarity search**, and build intuitive tools for dataset exploration.

## 🚦 What You'll Learn

In this notebook, you will:

- Compute **CLIP embeddings** for all samples in the dataset
- Use FiftyOne’s **similarity plugin** to build a fast vector search index
- Perform **nearest-neighbor search** using:
  - A reference image (sample ID)
  - A natural language prompt (e.g., "foggy day")
- Visualize embeddings in a 2D semantic space using **UMAP**
- Tag and curate samples based on similarity queries

## 🔍 Why This Matters

Embedding-based search enables a more **semantic and intuitive way** to explore visual datasets. Instead of relying only on structured metadata, you can:
- Discover similar images based on content
- Perform prompt-driven exploration
- Detect clusters, outliers, or annotation inconsistencies

👉 For an example of vector search with a dedicated backend, see the official [FiftyOne + Mosaic AI docs](https://docs.voxel51.com/integrations/mosaic.html).


<img src="assets/mosaic_fiftyone_recipe.png" alt="Image2" width="600"/>


In [None]:
# Install necessary packages
#!pip install fiftyone torch torchvision python-dotenv mlflow umap-learn

Wait until this endpoint is ready; any action taken before that can produce a 400 or 500 HTTP error.

## 📁 Load the BDD100K Dataset and Launch FiftyOne
We will use the `BDD100K` dataset from the Hugging Face Hub.

In [1]:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

import fiftyone.utils.huggingface as fouh # Hugging Face integration

import os

# Increase both connection and read timeout values (in seconds)
# os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"  # default is 10
# os.environ["HF_HUB_ETAG_TIMEOUT"] = "30"      # metadata fetch timeout
# dataset = fouh.load_from_hub("dgural/bdd100k", persistent=True, name= "bdd10k") #, overwrite=True)
#fo.delete_dataset("dgural/bdd100k")

# # Define the new dataset name
# dataset_name = "bdd10k"
dataset_name = "bdd10k_imported"

# Load the dataset if it already exists; otherwise download it from the Hub
if dataset_name in fo.list_datasets():
    print(f"Dataset '{dataset_name}' exists. Loading...")
    dataset = fo.load_dataset(dataset_name)
else:
    print(f"Dataset '{dataset_name}' does not exist. Creating a new one...")
    # `dataset` is not defined yet on this branch, so there is nothing to clone:
    # download from the Hugging Face Hub and store it persistently under the new name
    dataset = fouh.load_from_hub("dgural/bdd100k", name=dataset_name, persistent=True)



Dataset 'bdd10k_imported' exists. Loading...


### 📋 List Existing Datasets
This cell lists the currently available datasets in your FiftyOne environment.

In [2]:
print(fo.list_datasets())
print(dataset)

['bdd100k_100_unique', 'bdd100k_test', 'bdd10k_imported']
Name:        bdd10k_imported
Media type:  image
Num samples: 10000
Persistent:  True
Tags:        []
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    detections:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    polylines:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Polylines)
    weather:            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    timeofday:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classif

### 🚀 Launch FiftyOne App
This cell launches the FiftyOne App to enable interactive dataset exploration.

In [3]:
proxy_host = "https://"+os.getenv("VIRTUAL_HOST")+"/fiftyone/"
fo.app_config.proxy_url = proxy_host
session = fo.launch_app(dataset, auto=False )

Connected to FiftyOne on port 5151 at 0.0.0.0.
If you are not connecting to a remote session, you may need to start a new session and specify a port
Session launched. Run `session.show()` to open the App in a cell output.


In [4]:
print(session.url)

https://ml-az-05.oit.duke.edu:40003/fiftyone/?proxy=/fiftyone/&polling=true
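
The launch cell above concatenates `"https://"` with `os.getenv("VIRTUAL_HOST")`, which raises a `TypeError` when the variable is unset. Here is a minimal sketch of a more defensive version; the helper name `build_proxy_url` and its parameters are hypothetical and not part of FiftyOne.

```python
import os


def build_proxy_url(host_env_var="VIRTUAL_HOST", path="/fiftyone/"):
    """Build an https proxy URL from an environment variable.

    Raises a descriptive error instead of the TypeError that
    `"https://" + None` would produce when the variable is unset.
    """
    host = os.getenv(host_env_var)
    if not host:
        raise RuntimeError(
            f"Environment variable {host_env_var!r} is not set; "
            "cannot build the FiftyOne proxy URL."
        )
    # Strip stray slashes so the result has exactly one between host and path
    return f"https://{host.strip('/')}{path}"
```

The returned string could then be assigned to `fo.app_config.proxy_url` exactly as in the cell above.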


## Using the sklearn backend (default)
By default, calling ```compute_similarity()``` or ```sort_by_similarity()``` uses the sklearn backend.
To use the Mosaic backend instead, set the optional `backend` parameter of ```compute_similarity()``` to ```"mosaic"```.

## 🧠 Compute Embeddings, Similarity, and Index with SKLearn
Now we compute a similarity index using the sklearn backend. In this section we will:
- Use a CLIP model to generate embeddings
- Compute a visualization of the embedding space
- Compute similarity
- Query the dataset with text prompts, create a view, and find mistakes

In [None]:
#mosaic_index = fob.compute_similarity(
#    dataset,
#    model="clip-vit-base32-torch",
#    backend="sklearn",
#    brain_key="sklearn_key",
#    index_name="fiftyone_index",
#)

### 📥 Extracting Embeddings from the Similarity Index

Now that we've computed embeddings using the CLIP model and built a similarity index (the variable is named `mosaic_index`, although this notebook uses the sklearn backend), we can retrieve those embeddings for further analysis or search.

There are two common retrieval strategies:

1. **Random Subset Retrieval:**  
   We can take a random subset of samples from the dataset (e.g., 10 samples) and extract their embeddings. This is useful for quick inspection, debugging, or creating interactive demos.

2. **Full Embedding Retrieval:**  
   We can extract **all embeddings** from the similarity index to enable bulk operations like clustering, dimensionality reduction, or similarity-based visualization.

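In both cases, `mosaic_index.get_embeddings()` returns a tuple of three values:
- A 2D NumPy array of embeddings with shape `(N, D)`, where:
  - `N` is the number of samples queried.
  - `D` is the embedding dimensionality (512 for the CLIP model used here).
- An array of the `N` corresponding sample IDs.
- An array of label IDs when the index was built on patches, otherwise `None` (discarded as `_` in the cells below).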

Understanding how to access these embeddings is key for downstream tasks like nearest-neighbor search, semantic grouping, and embedding visualization.
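
To make the `(N, D)` and `(N,)` shapes concrete, here is a minimal NumPy sketch of ID-based retrieval. It does not call FiftyOne. The function `select_embeddings`, the toy IDs, and the random vectors are all illustrative assumptions.

```python
import numpy as np


def select_embeddings(embeddings, sample_ids, wanted_ids):
    """Return the rows of `embeddings` whose IDs appear in `wanted_ids`.

    Rows are returned in the order of `wanted_ids`.
    """
    row_of = {str(sid): i for i, sid in enumerate(sample_ids)}
    rows = [row_of[str(sid)] for sid in wanted_ids]
    return embeddings[rows], np.asarray(wanted_ids)


# Toy stand-in for an index: 6 samples with 4-dimensional embeddings
rng = np.random.default_rng(0)
all_embeddings = rng.normal(size=(6, 4))
all_ids = np.array([f"id_{i}" for i in range(6)])

subset, subset_ids = select_embeddings(all_embeddings, all_ids, ["id_4", "id_1"])
print(subset.shape)      # (2, 4)
print(subset_ids.shape)  # (2,)
```

Keeping the ID array alongside the matrix is what lets later steps (visualization, tagging) map each row back to its sample.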


In [5]:
## Retrieve embeddings for a view
#ids = dataset.take(10).values("id")
#embeddings, sample_ids, _ = mosaic_index.get_embeddings(sample_ids=ids)
#print(embeddings.shape)  # (10, 512)
#print(sample_ids.shape)  # (10,)

(10, 512)
(10,)


In [7]:
## Get all embeddings from the similarity index
#embeddings, sample_ids, _ = mosaic_index.get_embeddings()

## Confirm shape
#print("Embeddings shape:", embeddings.shape)  # (N, D) => N samples, D dimensions
#print("Sample IDs shape:", sample_ids.shape)

Embeddings shape: (10000, 512)
Sample IDs shape: (10000,)


In [None]:
#from fiftyone.brain.internal.core.utils import add_embeddings

## Your variables
## - dataset: your FiftyOne dataset
## - embeddings: a NumPy array of shape (N, D)
## - sample_ids: list of sample IDs corresponding to the embeddings
## - embedding_field: the name of the field to store the embeddings, e.g., "sklearn_embedding"

## Optional: if you're dealing with patches instead of samples
## patches_field = "detections"  # or whatever your field is called

#embedding_field = "sklearn_embedding"

## Add embeddings to samples
#add_embeddings(
#    samples=dataset,
#    embeddings=embeddings,
#    sample_ids=sample_ids,
#    label_ids=None,  # only needed for patches; otherwise None
#    embeddings_field= embedding_field,
#    patches_field=None  # or your patches field
#)

In [5]:
print(dataset)

Name:        bdd10k_imported
Media type:  image
Num samples: 10000
Persistent:  True
Tags:        []
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    detections:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    polylines:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Polylines)
    weather:            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    timeofday:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    scene:              fiftyone.core.fields.Embe

### 🧭 Visualizing Embeddings in FiftyOne

With the embeddings now extracted, we can generate a 2D or 3D projection to **visualize the semantic space** of our dataset using FiftyOne's `compute_visualization()` function.

This step performs dimensionality reduction (e.g., via t-SNE or UMAP) to map high-dimensional embeddings (like 512D CLIP vectors) into a space humans can interpret visually.

Key parameters:
- `dataset`: The FiftyOne dataset we are working with.
- `embeddings`: The matrix of embeddings to visualize.
- `brain_key`: A unique identifier to label this visualization in the FiftyOne App.
- `sample_ids`: The list of sample IDs corresponding to the embeddings.

Once the visualization is computed, we launch the FiftyOne App again to explore the embedding space, inspect clusters, and interactively analyze how similar images group together.

This is a powerful way to uncover patterns, detect outliers, and validate the semantic structure of your data.
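
As a rough illustration of what dimensionality reduction does, the sketch below projects synthetic high-dimensional points to 2D with PCA. PCA is swapped in here because it needs only NumPy; it is a simpler, linear technique than the UMAP run shown in this notebook. The function `pca_project` and the synthetic clusters are assumptions made for the example.

```python
import numpy as np


def pca_project(embeddings, num_dims=2):
    """Project (N, D) embeddings onto their top `num_dims` principal axes."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_dims].T


rng = np.random.default_rng(0)
# Two synthetic clusters in 16 dimensions, 50 points each
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 16))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(50, 16))
points = pca_project(np.vstack([cluster_a, cluster_b]))

print(points.shape)  # (100, 2)
```

Each row of `points` is a 2D coordinate for one sample, so the two clusters land in separate regions of a scatter plot, which is the kind of structure the embeddings panel in the App lets you inspect.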


In [None]:
## Compute the visualization
#fob.compute_visualization(
#    dataset,                      # your FiftyOne dataset
#    embeddings=embeddings,        # the N x D matrix
#    brain_key="sklearn_viz",       # identifier for visualization (name it!)
#    sample_ids=sample_ids         # make sure this matches the dataset
#)
#session = fo.launch_app(dataset, auto=False)

Generating visualization...




UMAP( verbose=True)
Tue May 20 10:36:03 2025 Construct fuzzy simplicial set
Tue May 20 10:36:03 2025 Finding Nearest Neighbors
Tue May 20 10:36:03 2025 Building RP forest with 10 trees
Tue May 20 10:36:05 2025 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	 3  /  13
	 4  /  13
	 5  /  13
	Stopping threshold met -- exiting after 5 iterations
Tue May 20 10:36:08 2025 Finished Nearest Neighbor Search
Tue May 20 10:36:09 2025 Construct embedding


	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Tue May 20 10:36:10 2025 Finished embedding
Session launched. Run `session.show()` to open the App in a cell output.


In [12]:
# from fiftyone.brain.internal.core.utils import get_embeddings

# # Extract the embeddings
# embeddings, sample_ids, patches_field = get_embeddings(
#     dataset,
#     embeddings_field="sklearn_embedding"
# )

# # Delete previous brain_key related to visualization
# dataset.delete_brain_run("sklearn_embedding_viz")

# # Compute the visualization
# fob.compute_visualization(
#     samples=dataset,
#     embeddings=embeddings,
#     brain_key="sklearn_embedding_viz",
#     sample_ids=sample_ids
# )

Generating visualization...




UMAP( verbose=True)
Tue Jun  3 18:21:49 2025 Construct fuzzy simplicial set
Tue Jun  3 18:21:49 2025 Finding Nearest Neighbors
Tue Jun  3 18:21:49 2025 Building RP forest with 10 trees
Tue Jun  3 18:21:53 2025 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	 3  /  13
	 4  /  13
	 5  /  13
	Stopping threshold met -- exiting after 5 iterations
Tue Jun  3 18:22:33 2025 Finished Nearest Neighbor Search
Tue Jun  3 18:22:38 2025 Construct embedding


	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Tue Jun  3 18:32:47 2025 Finished embedding


<fiftyone.brain.visualization.VisualizationResults at 0x7f4b913d9bb0>

### 🔄 Reloading the Dataset

After computing embeddings or visualizations, it's often useful to **reload the dataset** to ensure all computed metadata (like new fields or brain keys) is properly synchronized with the dataset's in-memory representation.

- `dataset.reload()` refreshes the dataset, pulling in any updates.
- `print(dataset)` displays the overall dataset summary.
- `print(dataset.first())` prints the first sample with all of its field values, which helps verify that new fields such as embeddings have been added correctly.

This step is especially helpful when chaining multiple operations and ensuring consistency before launching or updating the FiftyOne App.


In [6]:
dataset.reload()

print(dataset)
print(dataset.first())

Name:        bdd10k_imported
Media type:  image
Num samples: 10000
Persistent:  True
Tags:        []
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    detections:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    polylines:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Polylines)
    weather:            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    timeofday:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    scene:              fiftyone.core.fields.Embe

### 🧠 Similarity Search by Sample ID

In this step, we demonstrate how to retrieve the top-`k` most similar images in the dataset based on the **embedding proximity** to a specific sample.

- We select the **first sample** in the dataset using `dataset.first().id` to obtain its sample ID.
- The `sort_by_similarity()` method retrieves the `k=10` nearest samples in the embedding space relative to this sample ID.
- `session.view = view` updates the FiftyOne App to display only these top-10 similar samples.

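Note: Unlike the text-prompt search in the next section, this query starts from an image that is already in the dataset. The embedding of the reference sample, identified by its sample ID, is compared against the other embeddings in the index. This is useful for finding visually or semantically related images from a known reference point. If the dataset holds more than one similarity index, pass the `brain_key` argument to `sort_by_similarity()` to select one.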
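
Conceptually, a query by sample ID looks up the reference embedding and ranks all embeddings by their similarity to it. The sketch below shows this with brute-force cosine similarity in NumPy. It is an illustration under that assumption, not FiftyOne's implementation, and `most_similar` and the toy vectors are hypothetical.

```python
import numpy as np


def most_similar(embeddings, sample_ids, query_id, k=10):
    """Return the k sample IDs closest to `query_id` by cosine similarity."""
    # L2-normalize rows so that a dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_row = list(sample_ids).index(query_id)
    scores = unit @ unit[query_row]
    # Highest similarity first; the query itself scores 1.0
    order = np.argsort(-scores)[:k]
    return [sample_ids[i] for i in order], scores[order]


embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
sample_ids = ["a", "b", "c", "d"]

ids, scores = most_similar(embeddings, sample_ids, "a", k=3)
print(ids)  # ['a', 'b', 'c']
```

In this sketch the reference sample ranks first because it is identical to itself; the remaining results are ordered by how closely their direction matches the reference vector.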


In [7]:
# Query by first image sample
query = dataset.first().id
view = dataset.sort_by_similarity(query, k=10)
session.view = view



### 🔎 Semantic Search by Text Prompt

This cell demonstrates how to perform **semantic search** over the dataset using a **natural language prompt**, powered by CLIP embeddings.

- The query `"foggy day"` is passed to `sort_by_similarity()`, which uses the CLIP model to compare the prompt with precomputed image embeddings.
- The top-50 most semantically similar images are returned and visualized in the FiftyOne App.
- `view_txt.tag_samples("foggy_day")` tags these samples for easy reference and downstream analysis.

The following categories are available to guide your experimentation:

- **DETECTIONS:** `bike`, `bus`, `car`, `motor`, `person`, `rider`, `traffic light`, `traffic sign`, `train`, `truck`
- **WEATHER:** `overcast`, `foggy`, `rainy`, `snowy`, `undefined`, `partly cloudy`, `clear`
- **SCENE:** `city street`, `gas stations`, `highway`, `parking lot`, `residential`, `tunnel`
- **TIME OF DAY:** `daytime`, `night`, `dawn/dusk`

Feel free to try different prompts like:
- `"busy city street at night"`
- `"snowy highway with trucks"`
- `"parking lot with motorcycles"`

This interaction shows the power of CLIP embeddings for enabling **language-driven dataset exploration**, making it intuitive to search by meaning rather than metadata.


In [8]:
# Query by text prompt
# DETECTIONS: bike  bus  car  motor  person  rider  traffic light  traffic sign  train  truck
# WEATHER: overcast  foggy  rainy  snowy  undefined  partly cloudy  clear
# SCENE: city street  gas stations  highway  parking lot  residential  tunnel 
# TIME OF DAY: daytime  night  dawn/dusk

query_txt = "foggy day" 
view_txt = dataset.sort_by_similarity(query_txt, k=50)
session.view = view_txt

# Tag all samples in the semantic search result
view_txt.tag_samples("foggy_day")



In [16]:
# Query by text prompt
# DETECTIONS: bike  bus  car  motor  person  rider  traffic light  traffic sign  train  truck
# WEATHER: overcast  foggy  rainy  snowy  undefined  partly cloudy  clear
# SCENE: city street  gas stations  highway  parking lot  residential  tunnel 
# TIME OF DAY: daytime  night  dawn/dusk

query_txt = "foggy night" 
view_txt = dataset.sort_by_similarity(query_txt, k=50)
session.view = view_txt

# Tag all samples in the semantic search result
view_txt.tag_samples("foggy_night")



In [34]:
# Query by text prompt
# DETECTIONS: bike  bus  car  motor  person  rider  traffic light  traffic sign  train  truck
# WEATHER: overcast  foggy  rainy  snowy  undefined  partly cloudy  clear
# SCENE: city street  gas stations  highway  parking lot  residential  tunnel 
# TIME OF DAY: daytime  night  dawn/dusk

query_txt = "foggy" 
view_txt = dataset.sort_by_similarity(query_txt, k=50)
session.view = view_txt

# Tag all samples in the semantic search result
view_txt.tag_samples("foggy")
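
Because the three prompts above are closely related, their top-50 results may overlap, so some samples can end up with several tags. The plain-Python sketch below shows the bookkeeping involved. The result IDs are made up for illustration, and `tag_results` is a hypothetical helper, not a FiftyOne function.

```python
def tag_results(tags_by_sample, result_ids, tag):
    """Add `tag` to every sample in `result_ids`, skipping duplicates."""
    for sample_id in result_ids:
        tags = tags_by_sample.setdefault(sample_id, [])
        if tag not in tags:
            tags.append(tag)


# Hypothetical result IDs for three prompts (not real query output)
results = {
    "foggy_day": ["s1", "s2", "s3"],
    "foggy_night": ["s3", "s4"],
    "foggy": ["s1", "s3", "s5"],
}

tags_by_sample = {}
for tag, ids in results.items():
    tag_results(tags_by_sample, ids, tag)

# Samples that were returned by more than one prompt
multi_tagged = sorted(s for s, t in tags_by_sample.items() if len(t) > 1)
print(multi_tagged)          # ['s1', 's3']
print(tags_by_sample["s3"])  # ['foggy_day', 'foggy_night', 'foggy']
```

On the real dataset, `dataset.count_sample_tags()` reports how many samples carry each tag, and `dataset.match_tags("foggy")` returns a view of the samples with a given tag.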