# OWL-ViT-based zero-shot object detection on images

Here we use [OWL-ViT](https://huggingface.co/docs/transformers/main/model_doc/owlvit), a pretrained Vision Transformer-based model from Google, to demonstrate zero-shot object detection.

In [None]:
import warnings

warnings.filterwarnings("ignore")

import transformers

# Suppress non-critical warnings emitted by transformers
from transformers.utils import logging

logging.set_verbosity_error()

In [None]:
# Additional imports for image visualisation
import requests
from PIL import Image, ImageDraw, ImageFont
import io

In [None]:
def visualize_bboxes(image_url: str, predictions: list[dict]) -> None:
    """Utility function to visualise images and predictions.

    Args:
        image_url (str): Input image url
        predictions (list[dict]): List of dictionaries containing the predictions for the image
                                   eg:[{'score': 0.29,
                                      'label': 'car',
                                      'box': {'xmin': 69, 'ymin': 188, 'xmax': 156, 'ymax': 291}},
                                     {'score': 0.78,
                                      'label': 'car',
                                      'box': {'xmin': 496, 'ymin': 160, 'xmax': 525, 'ymax': 182}},
                                     {'score': 0.18434,
                                      'label': 'person',
                                      'box': {'xmin': 399, 'ymin': 106, 'xmax': 769, 'ymax': 516}}]

    Returns:
        None
    """
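    # Load image from URL; fail early on HTTP errors instead of on a corrupt image
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    image = Image.open(io.BytesIO(response.content)).convert("RGB")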

    # Create a drawing context
    draw = ImageDraw.Draw(image)

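    # Load a font for text annotations; fall back to PIL's default font
    # if DejaVu is not installed at this (Linux-specific) path
    font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
    font_size = 20
    try:
        font = ImageFont.truetype(font_path, size=font_size)
    except OSError:
        font = ImageFont.load_default()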

    # Overlay each detection's box, label, and confidence score on the image
    for pred in predictions:
        label = pred["label"]
        score = pred["score"]
        box = pred["box"]
        xmin, ymin, xmax, ymax = box["xmin"], box["ymin"], box["xmax"], box["ymax"]

        # Draw rectangle
        draw.rectangle([xmin, ymin, xmax, ymax], outline="red", width=2)

        # Add label and confidence score
        text = f"{label}: {score:.2f}"
        draw.text((xmin, ymin), text, fill="blue", font=font)

    # Display the image
    image.show()

In [None]:
# Set the ID of the free GPU that you intend to use.
# "auto" is not yet supported by this pipeline as of transformers v4.33.0.
# The model is small enough to fit on a single GPU; it uses less than 6 GB.
gpu_id = 0

# model = "google/owlvit-base-patch32"
model = "google/owlvit-large-patch14"

In [None]:
pipeline = transformers.pipeline(
    task="zero-shot-object-detection",
    model=model,
    device=gpu_id,
)

In [None]:
url1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
items_to_detect1 = ["cat", "couch", "remote", "ear", "tail"]

result1 = pipeline(
    url1,
    candidate_labels=items_to_detect1,
)

In [None]:
visualize_bboxes(url1, result1)

This model can detect custom `classes` or categories without being explicitly trained on them.

This is generally called [zero-shot object detection](https://huggingface.co/docs/transformers/main/tasks/zero_shot_object_detection).

In [None]:
url2 = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"
items_to_detect2 = ["bird", "eye", "beak"]

result2 = pipeline(
    url2,
    candidate_labels=items_to_detect2,
)

In [None]:
# Print the results to see what the pipeline returns
result2
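
The pipeline returns a list of dictionaries with `score`, `label`, and `box` keys, as described in the `visualize_bboxes` docstring. As a rough sketch of how such a list could be inspected, the snippet below groups scores by label. The `summarize` helper and its `min_score` parameter are hypothetical names introduced for illustration, and `sample_result` reuses the example values from the docstring rather than real model output.

```python
from collections import defaultdict

# Hypothetical sample in the pipeline's output format
# (values copied from the visualize_bboxes docstring, not real model output)
sample_result = [
    {"score": 0.29, "label": "car",
     "box": {"xmin": 69, "ymin": 188, "xmax": 156, "ymax": 291}},
    {"score": 0.78, "label": "car",
     "box": {"xmin": 496, "ymin": 160, "xmax": 525, "ymax": 182}},
    {"score": 0.18434, "label": "person",
     "box": {"xmin": 399, "ymin": 106, "xmax": 769, "ymax": 516}},
]


def summarize(predictions: list[dict], min_score: float = 0.0) -> dict:
    """Group rounded scores by label, keeping predictions with score >= min_score."""
    by_label = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            by_label[pred["label"]].append(round(pred["score"], 2))
    return dict(by_label)


print(summarize(sample_result))                  # {'car': [0.29, 0.78], 'person': [0.18]}
print(summarize(sample_result, min_score=0.25))  # {'car': [0.29, 0.78]}
```

Several scores under the same label are a hint that one object may have been detected more than once.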

In [None]:
visualize_bboxes(url2, result2)

Did you notice anything odd about the detections?

The same object is sometimes detected several times, with boxes that differ by only a few pixels. These duplicates can be removed with a post-processing technique called [non-maximum suppression](https://learnopencv.com/non-maximum-suppression-theory-and-implementation-in-pytorch/).
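
A minimal sketch of greedy, per-label non-maximum suppression over predictions in the pipeline's output format is shown below. It is a plain-Python illustration, not the implementation from the linked article. The helper names `iou` and `non_max_suppression`, the default `iou_threshold` of 0.5, and the sample boxes are all assumptions made for this example.

```python
def iou(box_a: dict, box_b: dict) -> float:
    """Intersection over union of two boxes given as xmin/ymin/xmax/ymax dicts."""
    inter_w = min(box_a["xmax"], box_b["xmax"]) - max(box_a["xmin"], box_b["xmin"])
    inter_h = min(box_a["ymax"], box_b["ymax"]) - max(box_a["ymin"], box_b["ymin"])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (box_a["xmax"] - box_a["xmin"]) * (box_a["ymax"] - box_a["ymin"])
    area_b = (box_b["xmax"] - box_b["xmin"]) * (box_b["ymax"] - box_b["ymin"])
    return inter / (area_a + area_b - inter)


def non_max_suppression(predictions: list[dict], iou_threshold: float = 0.5) -> list[dict]:
    """Keep the highest-scoring box per object; drop same-label boxes that overlap it."""
    kept = []
    # Visit predictions from most to least confident
    for pred in sorted(predictions, key=lambda p: p["score"], reverse=True):
        overlaps_kept = any(
            k["label"] == pred["label"] and iou(k["box"], pred["box"]) > iou_threshold
            for k in kept
        )
        if not overlaps_kept:
            kept.append(pred)
    return kept


# Hypothetical detections: two near-identical "bird" boxes and one "beak" box
sample = [
    {"score": 0.90, "label": "bird",
     "box": {"xmin": 10, "ymin": 10, "xmax": 110, "ymax": 110}},
    {"score": 0.60, "label": "bird",
     "box": {"xmin": 12, "ymin": 14, "xmax": 112, "ymax": 112}},
    {"score": 0.50, "label": "beak",
     "box": {"xmin": 40, "ymin": 40, "xmax": 60, "ymax": 60}},
]

filtered = non_max_suppression(sample)
print([(p["label"], p["score"]) for p in filtered])  # [('bird', 0.9), ('beak', 0.5)]
```

Suppression here is applied per label, so a `beak` box inside a `bird` box survives. The same function could be applied to `result2` before passing it to `visualize_bboxes`; a suitable threshold depends on the images and would need some experimentation.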