[![Roboflow Notebooks](https://ik.imagekit.io/roboflow/notebooks/template/bannertest2-2.png?ik-sdk-version=javascript-1.4.3&updatedAt=1672932710194)](https://github.com/roboflow/notebooks)

# Zero-Shot Object Detection with Grounding DINO

---

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/IDEA-Research/GroundingDINO) [![arXiv](https://img.shields.io/badge/arXiv-2303.05499-b31b1b.svg)](https://arxiv.org/abs/2303.05499)

Grounding DINO can detect **arbitrary objects** from human inputs such as category names or referring expressions. The key to open-set object detection is introducing language to the closed-set detector DINO for open-set concept generalization. If you want to learn more, visit the official GitHub [repository](https://github.com/IDEA-Research/GroundingDINO) and read the [paper](https://arxiv.org/abs/2303.05499).

![grounding dino figure](https://media.roboflow.com/notebooks/examples/grounding-dino-figure.png)

## Complementary Materials

---

[![Roboflow](https://raw.githubusercontent.com/roboflow-ai/notebooks/main/assets/badges/roboflow-blogpost.svg)](https://blog.roboflow.com/grounding-dino-zero-shot-object-detection) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk)

We recommend that you follow along in this notebook while reading the blog post on Grounding DINO. We will talk about the advantages of Grounding DINO, analyze the model architecture, and provide real prompt examples.

![grounding dino](https://media.roboflow.com/notebooks/examples/grounding-dino.png)

## ⚠️ Disclaimer

The Grounding DINO codebase is still under development. If you experience any problems launching the notebook, please let us know by creating an [issue](https://github.com/roboflow/notebooks/issues) on our GitHub.

## Pro Tip: Use GPU Acceleration

If you are running this notebook in Google Colab, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `GPU`, and then click `Save`. This will ensure your notebook uses a GPU, which will significantly speed up model inference.

## Steps in this Tutorial

In this tutorial, we are going to cover:

- Before you start
- Install Grounding DINO 🦕
- Download Grounding DINO Weights 🏋️
- Download Example Data
- Load Grounding DINO Model
- Grounding DINO Demo
- Grounding DINO with Roboflow Dataset
- 🏆 Congratulations

**Let's begin!**

## Before you start

Let's make sure that we have access to a GPU. We can use the `nvidia-smi` command to do that. In case of any problems, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `GPU`, and then click `Save`.

In [None]:
!nvidia-smi

In [None]:
import os
HOME = os.getcwd()
print(HOME)
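
The same GPU check can be done from Python, which is handy when the rest of a script should branch on the result. The sketch below is one possible approach, not part of the original notebook: `gpu_available` is a hypothetical helper that only reports whether `nvidia-smi` is on the `PATH` and exits cleanly. It does not confirm that PyTorch itself can see the device.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits with status 0.

    Hypothetical helper: it checks the driver tooling only, not PyTorch.
    """
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return False
    try:
        result = subprocess.run(
            [executable], capture_output=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU available:", gpu_available())
```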

## Install Grounding DINO 🦕 

In [None]:
# %cd {HOME}
# !git clone https://github.com/IDEA-Research/GroundingDINO.git
# %cd {HOME}/GroundingDINO
# !pip install -q -e .
# !pip install -q roboflow

In [None]:
import os

CONFIG_PATH = os.path.join(HOME, "groundingdino/config/GroundingDINO_SwinT_OGC.py")
print(CONFIG_PATH, "; exist:", os.path.isfile(CONFIG_PATH))

## Download Grounding DINO Weights 🏋️

In [None]:
# %cd {HOME}
# !mkdir {HOME}/weights
# %cd {HOME}/weights

# !wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

In [None]:
import os

WEIGHTS_NAME = "groundingdino_swint_ogc.pth"
WEIGHTS_PATH = os.path.join(HOME, "weights", WEIGHTS_NAME)
print(WEIGHTS_PATH, "; exist:", os.path.isfile(WEIGHTS_PATH))
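
If the weights file is reported as missing, a small download-if-absent helper avoids fetching the checkpoint again on every run. This is a minimal sketch under stated assumptions: `ensure_file` and its `fetch` parameter are hypothetical names introduced here, and the URL is the one used in the `wget` command above.

```python
import os
import urllib.request

WEIGHTS_URL = (
    "https://github.com/IDEA-Research/GroundingDINO/releases/download/"
    "v0.1.0-alpha/groundingdino_swint_ogc.pth"
)


def ensure_file(path: str, url: str, fetch=urllib.request.urlretrieve) -> bool:
    """Download `url` to `path` only if `path` does not exist yet.

    Returns True if a download happened, False if the file was already there.
    `fetch` is injectable so the helper can be exercised without network access.
    """
    if os.path.isfile(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    fetch(url, path)
    return True


# Example (performs a real download, so it is left commented out):
# ensure_file(WEIGHTS_PATH, WEIGHTS_URL)
```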

## Download Example Data

In [None]:
%cd {HOME}
!mkdir {HOME}/data
%cd {HOME}/data

!wget -q https://media.roboflow.com/notebooks/examples/dog.jpeg
!wget -q https://media.roboflow.com/notebooks/examples/dog-2.jpeg
!wget -q https://media.roboflow.com/notebooks/examples/dog-3.jpeg
!wget -q https://media.roboflow.com/notebooks/examples/dog-4.jpeg

## Load Grounding DINO Model

In [None]:
# !pip install torchsummary
!pip install torchviz

In [None]:
# %cd {HOME}/GroundingDINO

import torch
from groundingdino.util.inference import load_model, load_image, predict, annotate
# from torchsummary import summary
from torch.utils.tensorboard import SummaryWriter

log_dir = "./logs"
writer = SummaryWriter(log_dir)

model = load_model(CONFIG_PATH, WEIGHTS_PATH)
# torch.save(model,'/content/weights/groundingdino_swint_ogc_with_DAG.pth')

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

## Grounding DINO Demo

In [None]:
from transformers import AutoTokenizer
import torch

text_encoder_type = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(text_encoder_type)

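# Encode the word "chair"
encoded = tokenizer("chair")
# The tokenizer returns a dict-like BatchEncoding; convert only the token ids to a tensor.
tensor_encoded = torch.tensor(encoded["input_ids"])
print(tensor_encoded)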

In [None]:
# Tried to visualize Grounding DINO's computation graph without success; attempts used tensorboard, torchviz, networkx, and netron.
# from torchsummary import summary
import os
from torchviz import make_dot
from IPython.display import display
import pydot

import networkx as nx
import matplotlib.pyplot as plt

def preprocess_caption(caption: str) -> str:
    result = caption.lower().strip()
    if result.endswith("."):
        return result
    return result + "."

CONFIG_PATH = os.path.join(HOME, "groundingdino/config/GroundingDINO_SwinT_OGC.py")
print(CONFIG_PATH, "; exist:", os.path.isfile(CONFIG_PATH))

WEIGHTS_NAME = "groundingdino_swint_ogc.pth"
WEIGHTS_PATH = os.path.join(HOME, "weights", WEIGHTS_NAME)
print(WEIGHTS_PATH, "; exist:", os.path.isfile(WEIGHTS_PATH))

IMAGE_NAME = "dog-3.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "chair"
TEXT_PROMPT = preprocess_caption(TEXT_PROMPT)
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

model = load_model(CONFIG_PATH, WEIGHTS_PATH)

image_source, image = load_image(IMAGE_PATH)
with torch.no_grad():
    outputs = model(image[None], captions=[TEXT_PROMPT])

# torch.onnx.export(model, (image[None], captions=[TEXT_PROMPT]), './groundingdino_swint_ogc_with_DAG_0606_1.onnx')
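# torch.onnx.export takes no `captions` keyword; model keyword arguments go in a trailing dict.
# The export may still fail, because the caption is a Python string rather than a tensor.
torch.onnx.export(model, (image[None], {"captions": [TEXT_PROMPT]}), './groundingdino_swint_ogc_with_DAG_0606_1.onnx')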
torch.save(model, './groundingdino_swint_ogc_with_DAG_0606_1.pth')

# torch.save(model,'./groundingdino_swint_ogc_with_DAG_0606.pth')

# summary(model, (image[None], captions=[TEXT_PROMPT]))
# dot = make_dot(outputs["pred_logits"])
# G = nx.DiGraph(dot)
# pos = nx.nx_pydot.graphviz_layout(G, prog='dot')
# nx.draw(G, pos, with_labels=True, arrows=False)
# plt.show()

# dot = make_dot(outputs)
# pydot_graph = pydot.graph_from_dot_data(dot.source)[0]
# pydot_graph.write_png('graph.png')

# outputs = {k: str(v.data_ptr()) if isinstance(v, torch.Tensor) else v for k, v in outputs.items()}
# outputs = {k: v.id if isinstance(v, torch.Tensor) else v for k, v in outputs.items()}
# dot = make_dot(outputs, params=None)
# dot = make_dot(outputs, params=dict(model.named_parameters()))
# dot = make_dot(outputs["pred_logits"])
# display(dot)
# log_dir = "./logs"
# writer = SummaryWriter(log_dir)
# writer.add_graph(model)
# writer.add_graph(model, input_to_model={"samples": image[None], "captions": TEXT_PROMPT})
# writer.add_graph(model,(image[None], [TEXT_PROMPT]))

# # Write the model's computation graph to the log file
# inputs = (image,TEXT_PROMPT)
# dot = make_dot(model(*inputs), params=dict(model.named_parameters()))
# writer.add_graph(dot)

# # Close the SummaryWriter object
# writer.close()

In [None]:
import os

def preprocess_caption(caption: str) -> str:
    result = caption.lower().strip()
    if result.endswith("."):
        return result
    return result + "."

CONFIG_PATH = os.path.join(HOME, "groundingdino/config/GroundingDINO_SwinT_OGC.py")
print(CONFIG_PATH, "; exist:", os.path.isfile(CONFIG_PATH))

WEIGHTS_NAME = "groundingdino_swint_ogc.pth"
WEIGHTS_PATH = os.path.join(HOME, "weights", WEIGHTS_NAME)
print(WEIGHTS_PATH, "; exist:", os.path.isfile(WEIGHTS_PATH))

IMAGE_NAME = "dog-3.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "chair"
TEXT_PROMPT = preprocess_caption(TEXT_PROMPT)
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

model = load_model(CONFIG_PATH, WEIGHTS_PATH)

image_source, image = load_image(IMAGE_PATH)
with torch.no_grad():
    outputs = model(image[None], captions=[TEXT_PROMPT])

log_dir = "./logs"
writer = SummaryWriter(log_dir)
# writer.add_graph(model, input_to_model={"samples": image[None], "captions": TEXT_PROMPT})
writer.add_graph(model,(image[None], [TEXT_PROMPT]))
writer.close()


# outputs = {k: str(v.data_ptr()) if isinstance(v, torch.Tensor) else v for k, v in outputs.items()}
# outputs = {k: v.id if isinstance(v, torch.Tensor) else v for k, v in outputs.items()}
# dot = make_dot(outputs, params=None)
# dot = make_dot(outputs, params=dict(model.named_parameters()))
# dot = make_dot(outputs["pred_logits"])
# display(dot)

# writer.add_graph(model)



In [None]:
import os
import supervision as sv

# from torch.utils.tensorboard import SummaryWriter
# from torchviz import make_dot

IMAGE_NAME = "dog-3.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "chair"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

text_exaple = torch.randn(1)

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)



# print(model)
%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))
# sv.plot_image(torch.randn(20,20),size=(16, 16))
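
To use the detections outside of `annotate`, the boxes usually need to be in pixel coordinates. As far as we can tell from the upstream `annotate` helper, `predict` returns boxes as normalized `(cx, cy, w, h)` values in the range 0 to 1. The sketch below assumes that layout and converts it to pixel `(x1, y1, x2, y2)` corners; `cxcywh_to_xyxy_pixels` is a hypothetical helper name, not part of the Grounding DINO API.

```python
from typing import Iterable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy_pixels(
    boxes: Iterable[Sequence[float]], image_width: int, image_height: int
) -> List[Box]:
    """Convert normalized (cx, cy, w, h) boxes to pixel (x1, y1, x2, y2) boxes."""
    converted = []
    for cx, cy, w, h in boxes:
        x1 = (cx - w / 2) * image_width
        y1 = (cy - h / 2) * image_height
        x2 = (cx + w / 2) * image_width
        y2 = (cy + h / 2) * image_height
        converted.append((x1, y1, x2, y2))
    return converted


# A box centered in a 200x100 image, covering half of each dimension.
print(cxcywh_to_xyxy_pixels([(0.5, 0.5, 0.5, 0.5)], image_width=200, image_height=100))
```

With the notebook's variables, a call might look like `cxcywh_to_xyxy_pixels(boxes.tolist(), image_source.shape[1], image_source.shape[0])`, assuming `image_source` is a height × width × channels array.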

In [None]:
import torch

# Print information about each layer of the model
for i, layer in enumerate(model.modules()):
    print(f"Layer {i}: {layer.__class__.__name__}")
    # print(f"Input shape: {layer.in_features}")
    # print(f"Output shape: {layer.out_features}")
    print()

# Print the feature output of each layer
# with torch.no_grad():
#     for layer in model.modules():
#         intermediate_layer_model = torch.nn.Sequential(*list(model.children())[:list(model.modules()).index(layer)+1])
#         intermediate_output = intermediate_layer_model(input_data)
#         print(f"Layer {layer.__class__.__name__} output shape: {intermediate_output.shape}")
#         print(f"Layer {layer.__class__.__name__} output: {intermediate_output}")
#         print()

In [None]:
# This cell prints the output of every layer the model runs during inference; so far it is the most useful part of this exploration.
import torch

# Define a hook function
def print_output_shape(module, input, module_output):
    # print(module)
    print(f"The module of GDINO: {module.__class__.__name__}")
    if isinstance(module_output,tuple):
        print(f"The module output: {module_output.__class__.__name__}")
        for i, output in enumerate(module_output):
            if isinstance(output, torch.Tensor):
                print(f"{output} shape: {output.shape}")
            else:
                print(f"Unknown output type: {type(output)}")
    else:
        if isinstance(module_output, torch.Tensor):
            print(f"{module_output} shape: {module_output.shape}")
        else:
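            print(f"Unknown output type: {type(module_output)}")
            # print(type(module_output))
            # Not every output is sized (e.g. None), so guard the len() call.
            if hasattr(module_output, "__len__"):
                print(len(module_output))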
        # print(module_output.shape)
        # pass


# Register the hook on every module
handles = []
for module in model.modules():
    handle = module.register_forward_hook(print_output_shape)
    handles.append(handle)
print(len(handles))
# Run forward inference
boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

# Remove all hooks
for handle in handles:
    handle.remove()
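
Printing from inside the hook produces a lot of output for a model this size. An alternative is to collect the shapes into a dictionary and inspect them afterwards. The sketch below shows the pattern on a toy `nn.Sequential` model rather than Grounding DINO; `collect_output_shapes` is a hypothetical helper, and it records tensor outputs only.

```python
import torch
from torch import nn


def collect_output_shapes(model: nn.Module, *inputs) -> dict:
    """Run one forward pass and return {module name: output shape} for tensor outputs."""
    shapes = {}
    handles = []
    for name, module in model.named_modules():
        if name == "":
            continue  # skip the root module itself

        # Bind `name` as a default argument so each hook keeps its own name.
        def hook(mod, inp, out, name=name):
            if isinstance(out, torch.Tensor):
                shapes[name] = tuple(out.shape)

        handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        # Always remove the hooks, even if the forward pass raises.
        for handle in handles:
            handle.remove()
    return shapes


toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
shapes = collect_output_shapes(toy_model, torch.randn(3, 4))
print(shapes)
```

Removing the handles in a `finally` block keeps the model clean for later cells even when inference fails partway through.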

In [None]:
# from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# token_ids = tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", ".", "?"])
# print(token_ids)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "This is a sentence."
# tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
print(token_ids)

In [None]:
import os
import supervision as sv

# IMAGE_NAME = "dog-3.jpeg"
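# NOTE: this is an absolute path on the author's machine. os.path.join discards
# HOME and "data" when a later component is absolute, so IMAGE_PATH equals IMAGE_NAME.
IMAGE_NAME = "/data/ML_document/datasets/custom_6dpose_dataset/train/5.png"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)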

TEXT_PROMPT = "chair with man sitting on it"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))

In [None]:
import os
import supervision as sv

IMAGE_NAME = "dog-3.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "chair, dog, table, shoe, light bulb, coffee, hat, glasses, car, tail, umbrella"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))
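
The prompt in the cell above joins class names with commas. As far as we know, the upstream Grounding DINO README suggests separating category names with a period instead, so it may be worth comparing both styles on your own images. The sketch below builds such a prompt from a list; `build_caption` is a hypothetical helper, and the exact separator spacing is an assumption.

```python
from typing import Iterable


def build_caption(class_names: Iterable[str]) -> str:
    """Join class names into a lowercase, period-separated prompt."""
    cleaned = [name.lower().strip() for name in class_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty class name is required")
    return " . ".join(cleaned) + " ."


print(build_caption(["Chair", " dog ", "light bulb"]))
```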

In [None]:
import os
import supervision as sv

IMAGE_NAME = "dog-2.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "glass"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))

In [None]:
import os
import supervision as sv

IMAGE_NAME = "dog-2.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "glass most to the right"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))

In [None]:
import os
import supervision as sv

IMAGE_NAME = "dog-2.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "straw"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))

In [None]:
import os
import supervision as sv

IMAGE_NAME = "dog-4.jpeg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "mens shadow"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))

## Grounding DINO with Roboflow Dataset

In [None]:
%cd {HOME}

In [None]:
import roboflow

roboflow.login()

In [None]:
import os
from random import randrange
from roboflow.core.dataset import Dataset


def pick_random_image(dataset: Dataset, subdirectory: str = "valid") -> str:
    image_directory_path = f"{dataset.location}/{subdirectory}"
    image_names = os.listdir(image_directory_path)
    image_index = randrange(len(image_names))
    image_name = image_names[image_index]
    image_path = os.path.join(image_directory_path, image_name)
    return image_path
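
One caveat: `os.listdir` returns every file in the directory, and a COCO export typically keeps its annotation JSON next to the images, so a random pick can occasionally land on a non-image file. The sketch below is a variant that filters by file extension. `pick_random_image_path` and `IMAGE_EXTENSIONS` are hypothetical names, and the extension list is an assumption about what the dataset contains.

```python
import os
import random

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")


def pick_random_image_path(image_directory_path: str, rng=random) -> str:
    """Pick a random file from the directory, considering image files only."""
    image_names = sorted(
        name
        for name in os.listdir(image_directory_path)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    if not image_names:
        raise FileNotFoundError(f"no images found in {image_directory_path}")
    return os.path.join(image_directory_path, rng.choice(image_names))


# Example with the notebook's dataset object (left commented out):
# image_path = pick_random_image_path(os.path.join(dataset.location, "valid"))
```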

In [None]:
from roboflow import Roboflow
rf = Roboflow()
project = rf.workspace("work-safe-project").project("safety-vest---v4")
dataset = project.version(3).download("coco")

In [None]:
TEXT_PROMPT = ", ".join(project.classes.keys())
TEXT_PROMPT

In [None]:
image_path = pick_random_image(dataset=dataset)

In [None]:
import os
import supervision as sv

BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(image_path)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))

**NOTE:** The design of the prompt is very important. Try to be as accurate as possible. Avoid abbreviations.

In [None]:
TEXT_PROMPT = "reflective safety vest, helmet, head, nonreflective safety vest"

In [None]:
import os
import supervision as sv

BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(image_path)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_TRESHOLD, 
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline  
sv.plot_image(annotated_frame, (16, 16))
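
To compare predictions with the dataset's labels, the returned phrases have to be mapped back to class indices. The sketch below is one possible matching rule, not the method used by Grounding DINO or Roboflow: it prefers an exact match and otherwise picks the longest class name contained in the phrase, so that "nonreflective safety vest" is not mistaken for "reflective safety vest". `phrases_to_class_ids` is a hypothetical helper.

```python
from typing import List, Optional, Sequence


def phrases_to_class_ids(
    phrases: Sequence[str], classes: Sequence[str]
) -> List[Optional[int]]:
    """Map each predicted phrase to a class index, or None if nothing matches."""
    normalized = [name.lower().strip() for name in classes]
    class_ids: List[Optional[int]] = []
    for phrase in phrases:
        text = phrase.lower().strip()
        if text in normalized:
            class_ids.append(normalized.index(text))
            continue
        # Fall back to the longest class name that appears inside the phrase.
        candidates = [i for i, name in enumerate(normalized) if name and name in text]
        best = max(candidates, key=lambda i: len(normalized[i]), default=None)
        class_ids.append(best)
    return class_ids


CLASSES = ["reflective safety vest", "helmet", "head", "nonreflective safety vest"]
print(phrases_to_class_ids(["helmet", "nonreflective safety vest", "safety vest"], CLASSES))
```

Phrases that match no class map to `None`, which makes it easy to count how often the prompt wording and the labels disagree.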

## 🏆 Congratulations

### Learning Resources

Roboflow has produced many resources that you may find interesting as you advance your knowledge of computer vision:

- [Roboflow Notebooks](https://github.com/roboflow/notebooks): A repository of over 20 notebooks that walk through how to train custom models with a range of model types, from YOLOv7 to SegFormer.
- [Roboflow YouTube](https://www.youtube.com/c/Roboflow): Our library of videos featuring deep dives into the latest in computer vision, detailed tutorials that accompany our notebooks, and more.
- [Roboflow Discuss](https://discuss.roboflow.com/): Have a question about how to do something on Roboflow? Ask your question on our discussion forum.
- [Roboflow Models](https://roboflow.com): Learn about state-of-the-art models and their performance. Find links and tutorials to guide your learning.

### Convert data formats

Roboflow provides free utilities to convert data between dozens of popular computer vision formats. Check out [Roboflow Formats](https://roboflow.com/formats) to find tutorials on how to convert data between formats in a few clicks.

### Connect computer vision to your project logic

[Roboflow Templates](https://roboflow.com/templates) is a public gallery of code snippets that you can use to connect computer vision to your project logic. Code snippets range from sending emails after inference to measuring object distance between detections.