# Multimodal Search

In this final exercise, we will learn how to use vector databases to search through images using natural language. 

We will be searching through an open source image dataset using an open source model called CLIP.
This model is able to encode both images and text into the same embedding space, allowing us to retrieve images that are similar to a user question.
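
To make the idea of a shared embedding space concrete, here is a minimal sketch that uses made-up 3-dimensional vectors in place of real CLIP embeddings. The file names, the vectors, and the helper names `cosine_similarity` and `rank_images` are purely illustrative assumptions. The point is only that a text vector and image vectors can be compared with the same similarity measure once they live in the same space.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_vec, image_vecs):
    """Return image names sorted from most to least similar to the query."""
    return sorted(
        image_vecs,
        key=lambda name: cosine_similarity(query_vec, image_vecs[name]),
        reverse=True,
    )

# Toy stand-ins for image embeddings (hypothetical values, not CLIP output).
toy_image_vecs = {
    "goldfish.png": [0.9, 0.1, 0.0],
    "school_bus.png": [0.0, 0.2, 0.9],
    "koi_pond.png": [0.7, 0.3, 0.1],
}
# Toy stand-in for the embedding of the text query "fish".
toy_query_vec = [1.0, 0.0, 0.0]

ranking = rank_images(toy_query_vec, toy_image_vecs)
print(ranking)
```

A vector database performs the same kind of nearest-neighbour ranking, but over many more vectors and usually with an index to avoid comparing against every row.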

In [None]:
# Christoph:
# does not run!!!
# large differences between the given example code and the solution video clip.
# didn't solve it


In [None]:
# pip install --quiet datasets gradio lancedb pandas transformers [This has been preinstalled for you]

## Setup CLIP model

First, let's prepare the [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) model to encode the images.
We want to set up two things:
1. a model to encode the image
2. a processor to prepare the image to be encoded

Fill in the code below to initialize a pre-trained model and processor.

In [1]:
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"

device = "cpu"

model = CLIPModel.from_pretrained(MODEL_ID).to(device)
processor = CLIPProcessor.from_pretrained(MODEL_ID)



## Setup data model

The dataset itself has an image field and an integer label.
We'll also need an embedding vector (CLIP produces 512D vectors) field.

For this problem, please add a field named "vector" to the Image class below
that is a 512D vector.

The image that comes out of the raw dataset is a PIL image, so we'll add
some conversion code between PIL and bytes to make serialization and deserialization easier.

In [2]:
import io

from lancedb.pydantic import LanceModel, vector
import PIL.Image  # import the submodule explicitly so PIL.Image.open is always available

class Image(LanceModel):
    image: bytes
    label: int
    vector: vector(512)
        
    def to_pil(self):
        return PIL.Image.open(io.BytesIO(self.image))
    
    @classmethod
    def pil_to_bytes(cls, img) -> bytes:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()

## Image processing function

Next we will implement a function to process batches of data from the dataset.
We will be using the `zh-plus/tiny-imagenet` dataset from Hugging Face Datasets.
This dataset has an `image` and a `label` column.

For this problem, please fill in the code to extract the image embeddings from
the image using the CLIP model.

In [None]:
def process_image(row):
    # Extract the actual image bytes from the dictionary
    image_bytes = row["image"]["bytes"]  # Access the 'bytes' key inside the dictionary

    # Convert bytes to PIL image
    pil_image = PIL.Image.open(io.BytesIO(image_bytes))
    
    # Process the image using CLIPProcessor
    image_tensor = processor(text=None, images=pil_image, return_tensors="pt")["pixel_values"].to(device)
    # create the image embedding from the processed image and the model
    img_emb = <fill me in>
    
    # Flatten the vector and ensure it's a list
    row["vector"] = img_emb.flatten().tolist()  # Flatten and convert to a list
    row["image"] = Image.pil_to_bytes(pil_image)  # Convert back to bytes
    return row
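
One possible way to complete the embedding step is sketched below as a standalone helper; it is not the official course solution. It relies on `CLIPModel.get_image_features`, which takes the processor's `pixel_values` and, in the transformers 4.x releases assumed here, returns a tensor with one projected embedding per image. The helper name `embed_image` and the choice to pass `model` and `processor` in as arguments are assumptions made for illustration.

```python
def embed_image(pil_image, model, processor, device="cpu"):
    """Return the CLIP embedding of one PIL image as a flat list of floats.

    `model` is expected to behave like transformers.CLIPModel and
    `processor` like transformers.CLIPProcessor (assumed, not enforced).
    """
    # Preprocess the image into a batch containing a single image tensor.
    inputs = processor(images=pil_image, return_tensors="pt")
    pixel_values = inputs["pixel_values"].to(device)

    # Run the vision encoder and projection head to get the image embedding.
    img_emb = model.get_image_features(pixel_values=pixel_values)

    # Drop gradient tracking, move to CPU, and flatten (1, dim) to (dim,).
    return img_emb.detach().cpu().flatten().tolist()
```

Inside `process_image`, the equivalent line would be `img_emb = model.get_image_features(pixel_values=image_tensor)`.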

## Table creation

Please create a LanceDB table called `image_search` to store the image, label, and vector.

In [None]:
import lancedb
TABLE_NAME = "image_search"

<fill me in>
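
A minimal sketch of the table-creation step follows, assuming a LanceDB connection object is passed in. The helper name `create_image_table` is hypothetical, and `mode="overwrite"` is a choice made here so that re-running the cell replaces an existing table instead of failing.

```python
def create_image_table(db, schema, name="image_search"):
    """Create (or replace) a table whose columns follow `schema`.

    `db` is expected to behave like the object returned by
    lancedb.connect(...), and `schema` like a LanceModel subclass.
    """
    # mode="overwrite" replaces any existing table with the same name.
    return db.create_table(name, schema=schema, mode="overwrite")
```

In the notebook this could be used as `db = lancedb.connect("~/.lancedb")` followed by `table = create_image_table(db, Image, TABLE_NAME)`, where the database path is an assumption. Some older LanceDB releases may expect `schema=Image.to_arrow_schema()` rather than the class itself.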

## Adding data

Now we're ready to process the images and generate embeddings.
Please write a function called `datagen` that calls `process_image` on each image in the validation set (10K images) and returns a list of `Image` instances.

**HINT**
1. You may find it faster to use the [dataset.map](https://huggingface.co/docs/datasets/process#map) function.
2. You'll want to store the image bytes that `process_image` writes back to the `image` field.

In [None]:
import pandas as pd
from tqdm import tqdm

# Define a wrapper for tqdm progress tracking
# A progress bar is required because the datagen() function processes 10,000 images, which takes approximately 50–60 minutes.
def process_image_with_progress(row):
    global pbar  # Use a global progress bar
    result = process_image(row)
    pbar.update(1)  # Update progress after processing each row
    return result

# Load and Process Data from Parquet
def datagen() -> list[Image]:
    dataset = pd.read_parquet("../zh-plus-tiny-imagenet_valid_split.parquet")

    global pbar
    
    # Process rows using Pandas apply
    with tqdm(total=len(dataset), desc="Processing images") as pbar:
        processed_df = dataset.apply(process_image_with_progress, axis=1)

    # Convert rows into Image objects
    return [
        Image(image=row["image"], label=row["label"], vector=row["vector"])
        for _, row in processed_df.iterrows()
    ]


Now call the function you just wrote and add the generated instances to the LanceDB table.  The following process can take up to 60 minutes to complete.

In [None]:
data = datagen()


In [None]:
table.add(data)

## Encoding user queries

We have image embeddings, but how do we generate the embeddings for the user query?
Furthermore, how can the image embeddings and the text embeddings possibly share the same features?
This is where the power of CLIP comes in.

Please write a function to turn user query text into an embedding
in the same latent space as the images. 

**HINT** 
You can refer to the [CLIPModel documentation](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel)

In [None]:
from transformers import CLIPTokenizerFast

MODEL_ID = "openai/clip-vit-base-patch32"
model = <fill me in>
tokenizer = <fill me in>

def embed_func(query):
    inputs = tokenizer([query], padding=True, return_tensors="pt")
    
    # generate the text embeddings
    text_features = <fill me in>
    
    return text_features.detach().numpy()[0]
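
One possible way to complete the text-embedding step is sketched below; it is not the official course solution. It relies on `CLIPModel.get_text_features`, which, in the transformers 4.x releases assumed here, returns a tensor with one projected embedding per input string. The helper name `embed_text` and the explicit `model` and `tokenizer` arguments are illustrative assumptions.

```python
def embed_text(query, model, tokenizer):
    """Return the CLIP text embedding of `query` as a 1-D array.

    `model` is expected to behave like transformers.CLIPModel and
    `tokenizer` like transformers.CLIPTokenizerFast (assumed, not enforced).
    """
    # Tokenize a batch that contains only the single query string.
    inputs = tokenizer([query], padding=True, return_tensors="pt")

    # Run the text encoder and projection head; the result has shape (1, dim).
    text_features = model.get_text_features(**inputs)

    # Drop gradient tracking, convert to NumPy, and take the only row.
    return text_features.detach().numpy()[0]
```

In the notebook, the two blanks above the function could be filled with `CLIPModel.from_pretrained(MODEL_ID)` and `CLIPTokenizerFast.from_pretrained(MODEL_ID)`, so that the text embeddings come from the same checkpoint as the image embeddings.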

## Core search function

Now let's write the core search function `find_images`, which takes a text query as input and returns a list of the PIL images that are most similar to the query.

In [None]:
def find_images(query):
    
    # Generate the embedding for the query
    emb = <fill me in>    
    
    # Search for the closest 9 images
    rs = <fill me in>
    
    # Return PIL instances for visualization
    return <fill me in>
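
The three blanks can be read as embed, search, convert. The sketch below shows one way to wire them together, assuming a LanceDB table and a query-embedding function are passed in. The name `find_similar_images` and its parameters are hypothetical; `search`, `limit`, and `to_pydantic` are the LanceDB query-builder calls this sketch relies on.

```python
def find_similar_images(query, table, embed_func, schema, k=9):
    """Return up to `k` PIL images whose embeddings are closest to `query`.

    `table` is expected to behave like a LanceDB table, `embed_func` maps
    text to a vector, and `schema` is the LanceModel class stored in the
    table (it must provide a `to_pil` method).
    """
    # Embed the text query into the same space as the stored image vectors.
    emb = embed_func(query)

    # Nearest-neighbour search on the vector column, keeping the top k rows.
    rows = table.search(emb).limit(k).to_pydantic(schema)

    # Convert each stored row back into a PIL image for display.
    return [row.to_pil() for row in rows]
```

With the notebook's own names, the body of `find_images` would correspond to `emb = embed_func(query)`, `rs = table.search(emb).limit(9).to_pydantic(Image)`, and `return [image.to_pil() for image in rs]`.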

In [None]:
find_images("fish")[0]

## Create an App

Let's use Gradio to create a small app to search through the images.
The code below has been completed for you. It contains:
1. A [text input](https://www.gradio.app/docs/textbox) where the user can type in a query
2. A "Submit" [button](https://www.gradio.app/docs/button) that finds images similar to the input query and displays them
3. A [Gallery component](https://www.gradio.app/docs/gallery) that displays the images

In [None]:
import gradio as gr


with gr.Blocks() as demo:
    with gr.Row():
        vector_query = gr.Textbox(value="fish", show_label=False)
        b1 = gr.Button("Submit")
    with gr.Row():
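        # Layout options are passed to the constructor because the
        # `.style()` method is no longer available in Gradio 4.
        gallery = gr.Gallery(
                label="Found images", show_label=False, elem_id="gallery",
                columns=[3], rows=[3], object_fit="contain", height="auto",
            )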
        
    b1.click(find_images, inputs=vector_query, outputs=gallery)
    
demo.launch(server_name="0.0.0.0", inline=False)

To view the interface, click on the **Links** button at the bottom of the workspace window.  Then click on **gradio**.  This will open a new browser window with the interface.

Now try a bunch of different queries and see the results.
By default, CLIP search results leave a lot of room for improvement. More advanced applications in this space can improve these results in a number of ways, such as retraining the model with your own dataset and labels, or using image and text vectors to train the index. The details, however, are beyond the scope of this lesson.

## Summary

Congrats! 

Through this exercise, you learned how to use CLIP to generate image and text embeddings. You've mastered how to use vector databases to enable searching through images using natural language. And you even created a simple app to show off your work. 

Great job!