<div align="center" dir="auto">
<p dir="auto"><a href="https://colab.research.google.com/github/encord-team/encord-notebooks/blob/main/colab-notebooks/Encord_Notebooks_Team_gDINO_SAM_vs_maskrcnn_webinar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></p>
</div>
<div align="center" dir="auto">
  <div style="flex: 1; padding: 10px;">
    <a href="https://join.slack.com/t/encordactive/shared_invite/zt-1hc2vqur9-Fzj1EEAHoqu91sZ0CX0A7Q" target="_blank" style="text-decoration:none">
      <img alt="Join us on Slack" src="https://img.shields.io/badge/Join_Our_Community-4A154B?label=&logo=slack&logoColor=white">
    </a>
    <a href="https://docs.encord.com/docs/active-overview" target="_blank" style="text-decoration:none">
      <img alt="Documentation" src="https://img.shields.io/badge/docs-Online-blue">
    </a>
    <a href="https://twitter.com/encord_team" target="_blank" style="text-decoration:none">
      <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&amp;style=social">
    </a>
    <img alt="Python versions" src="https://img.shields.io/pypi/pyversions/encord-active">
    <a href="https://pypi.org/project/encord-active/" target="_blank" style="text-decoration:none">
      <img alt="PyPi project" src="https://img.shields.io/pypi/v/encord-active">
    </a>
    <a href="https://docs.encord.com/docs/active-contributing" target="_blank" style="text-decoration:none">
      <img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue">
    </a>
    <img alt="License" src="https://img.shields.io/github/license/encord-team/encord-active">
  </div>
</div>

<div align="center">
  <p>
    <a align="center" href="" target="_blank">
      <img
        width="7232"
        src="https://storage.googleapis.com/encord-notebooks/encord_active_notebook_banner.png">
    </a>
  </p>
</div>

# 🟣 Encord Notebooks | 🆚 Grounding-DINO+SAM vs. Mask-RCNN

## 🏁 Overview

👋 Hi there!

In this notebook, you will generate segmentation predictions for images using Grounding-DINO and the Segment Anything Model (SAM).

<br>

---

💡 If you want to read more about 🟣 Encord Active, check out our [GitHub](https://github.com/encord-team/encord-active) and [documentation](https://encord-active-docs.web.app/).


## 📽️ Complementary Webinar

[![Are VFMs on par with SOTA - GroundingDINO+SAM vs MaskRCNN](https://storage.googleapis.com/encord-notebooks/ground_dino_sam/Encord_Webinar_Are_VFMs_on_par_with_SOTA.jpeg)](https://encord.com/learning-hub/are-vfms-on-par-with-sota/)

With foundation models increasing in prominence, Encord's President and Co-Founder sits down with Lead ML Engineer, Frederik, to dissect Meta's new Visual Foundation Model (VFM), the Segment Anything Model (SAM).

After combining the model with Grounding-DINO to allow for zero-shot segmentation, the team compares it to a SOTA Mask-RCNN model to see whether SAM really is revolutionary for segmentation. They discuss:

- The rise of VFMs and how they differ from standard models
- What Meta's release of DINOv2 means for Grounding-DINO + SAM
- How SAM and Grounding-DINO compare to previous segmentation models for performance and predictions
- Evaluating model performance using Encord Active

Check out [the webinar](https://encord.com/learning-hub/are-vfms-on-par-with-sota/).

## 📥 Installation and Set Up: Grounding-DINO and Segment Anything Model (SAM)

In [None]:
# Change /content to your working directory if you are not running on Colab.
%cd /content
!git clone https://github.com/IDEA-Research/Grounded-Segment-Anything
%cd /content/Grounded-Segment-Anything
%pip install -q -r requirements.txt

%cd /content/Grounded-Segment-Anything/GroundingDINO
%pip install -q .
%cd /content/Grounded-Segment-Anything/segment_anything
%pip install -q .

# 📨 Import all the necessary libraries

This section imports the libraries used by the code samples in the rest of the walkthrough.

In [None]:
import sys

# ⚠️ REMEMBER TO CHANGE TO THE CORRECT DIRECTORY 👇

module_paths = [
    "/content/Grounded-Segment-Anything/GroundingDINO",
    "/content/Grounded-Segment-Anything/segment_anything",
]
for path in module_paths:
    if path not in sys.path:
        sys.path.append(path)

import json
import os
import pickle
from pathlib import Path

import groundingdino.datasets.transforms as T
import matplotlib.pyplot as plt
import numpy as np
import torch

from groundingdino.models import build_model
from groundingdino.util import box_ops
from groundingdino.util.inference import load_image, predict
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
from huggingface_hub import hf_hub_download
from matplotlib.patches import Rectangle
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, SamPredictor, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
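
The cells above hard-code `/content`, which is Colab's default working directory. A minimal sketch of deriving every path from a single root, assuming the repository was cloned directly under that root; `WORK_DIR` and `repo_paths` are hypothetical names introduced here for illustration and are not part of the notebook:

```python
from pathlib import Path

# Assumption: the Grounded-Segment-Anything repo was cloned directly under this root.
WORK_DIR = Path("/content")


def repo_paths(work_dir: Path) -> dict:
    """Derive the paths this notebook uses from a single working directory."""
    repo = work_dir / "Grounded-Segment-Anything"
    return {
        "repo": repo,
        "groundingdino": repo / "GroundingDINO",
        "segment_anything": repo / "segment_anything",
        "assets": repo / "assets",
    }


paths = repo_paths(WORK_DIR)
print(paths["groundingdino"])
```

Changing `WORK_DIR` once would then update the module paths and the demo image paths together.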

# ⚓ Try GroundingDINO


Let's start by trying out GroundingDINO. We will:

1. Fetch a test image
2. Define what we're searching for (the prompt)
3. Download the model weights and load the model
4. Run model inference on our example image and search query
5. Display the results

In [None]:
#@title ### 💻 Fetch an image for testing
# Here, we will use one of the demo images from the GroundedSAM repo
dino_image_transform = T.Compose(
    [
        T.RandomResize([800], max_size=1333),
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)

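def load_image(image_path):
    # Load the image as RGB and apply the GroundingDINO preprocessing.
    # Note: this redefines the `load_image` imported from groundingdino.util.inference.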
    image_pil = Image.open(image_path).convert("RGB")
    image, _ = dino_image_transform(image_pil, None)
    return image_pil, image


# ⚠️ REMEMBER TO CHANGE TO THE CORRECT DIRECTORY 👇

image_path = "/content/Grounded-Segment-Anything/assets/demo6.jpg"
img, img_tensor = load_image(image_path)

plt.imshow(img)
_ = plt.axis("off")

In [None]:
#@title ### 💬 Define query string
# GroundingDINO suggests encoding classes by joining them with dots.

class_descriptions = ".".join(["cat", "dog", "horse"])
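
Before running the model, `get_grounding_output` lowercases the query, strips surrounding whitespace, and appends a trailing dot when one is missing. A small sketch of that normalization in isolation; `build_prompt` is a hypothetical helper name, not a GroundingDINO function, and it additionally strips whitespace from each class name:

```python
def build_prompt(class_names):
    """Join class names with dots and normalize them like get_grounding_output does."""
    prompt = ".".join(name.strip() for name in class_names)
    prompt = prompt.lower().strip()
    if not prompt.endswith("."):
        prompt += "."
    return prompt


print(build_prompt(["Cat", "dog ", "horse"]))  # cat.dog.horse.
```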

In [None]:
#@title 📥 Download model weights for GroundingDINO and load model
def load_dino_from_hf(device, repo_id="ShilongLiu/GroundingDINO", filename="groundingdino_swinb_cogcoor.pth", ckpt_config_filename="GroundingDINO_SwinB.cfg.py"):
    cache_config_file = hf_hub_download(repo_id=repo_id, filename=ckpt_config_filename)

    args = SLConfig.fromfile(cache_config_file)
    cache_file = hf_hub_download(repo_id=repo_id, filename=filename)
    args.device = device
    model = build_model(args)
    checkpoint = torch.load(cache_file, map_location="cpu")
    load_res = model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
    _ = model.eval()
    return model

dino_model = load_dino_from_hf(device)

In [None]:
#@title ### 🪄 Run GroundingDINO
def get_grounding_output(dino_model, image, class_description, device, box_threshold=0.3, text_threshold=0.25):
    class_description = class_description.lower()
    class_description = class_description.strip()
    if not class_description.endswith("."):
        class_description = class_description + "."

    dino_model = dino_model.to(device)
    image = image.to(device)
    with torch.no_grad():
        outputs = dino_model(image[None], captions=[class_description])
    logits = outputs["pred_logits"].cpu().sigmoid()[0]  # (nq, 256)
    boxes = outputs["pred_boxes"].cpu()[0]  # (nq, 4)

    # Keep only the queries whose best token score exceeds box_threshold
    logits_filt = logits.clone()
    boxes_filt = boxes.clone()
    filt_mask = logits_filt.max(dim=1)[0] > box_threshold
    logits_filt = logits_filt[filt_mask]  # (num_filt, 256)
    boxes_filt = boxes_filt[filt_mask]  # (num_filt, 4)

    # Tokenize the prompt so token scores can be mapped back to phrases
    tokenizer = dino_model.tokenizer
    tokenized = tokenizer(class_description)

    # Build a "<phrase> (<score>)" label for each remaining box
    pred_phrases = []
    for logit, box in zip(logits_filt, boxes_filt):
        pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer)
        pred_phrases.append(pred_phrase + f" ({str(logit.max().item())[:4]})")

    return boxes_filt, pred_phrases

boxes, phrases = get_grounding_output(dino_model, img_tensor, class_descriptions, device)
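
The filtering step keeps a query only when its highest token score exceeds `box_threshold`. A toy sketch of that rule in plain Python, using made-up scores (the real tensors have one row per query and one column per text token); `filter_queries` is a hypothetical name used only for illustration:

```python
def filter_queries(scores, boxes, box_threshold=0.3):
    """Keep (scores, box) pairs whose best token score exceeds the threshold."""
    kept = [(s, b) for s, b in zip(scores, boxes) if max(s) > box_threshold]
    kept_scores = [s for s, _ in kept]
    kept_boxes = [b for _, b in kept]
    return kept_scores, kept_boxes


# Made-up values for illustration: 3 queries, 4 token scores each.
toy_scores = [
    [0.05, 0.62, 0.10, 0.01],  # confident about token 1 -> kept
    [0.02, 0.03, 0.04, 0.01],  # nothing above 0.3 -> dropped
    [0.31, 0.08, 0.02, 0.00],  # just above the threshold -> kept
]
toy_boxes = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.7, 0.3, 0.4, 0.5)]

kept_scores, kept_boxes = filter_queries(toy_scores, toy_boxes)
print(len(kept_boxes))  # 2
```

Raising `box_threshold` trades recall for precision: fewer boxes survive, but the ones that do are the model's most confident detections.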

In [None]:
#@title ### 🖼️ Display results
def plot_boxes(image_pil, boxes, labels):
    W, H = image_pil.size
    assert len(boxes) == len(labels), "boxes and labels must have same length"

    fig, ax = plt.subplots()
    ax.imshow(image_pil)
    ax.axis("off")

    for box, label in zip(boxes, labels):
        box = box * torch.Tensor([W, H, W, H])
        # from cxcywh (box center) to xywh (top-left corner)
        box[:2] -= box[2:] / 2
        x, y, w, h = box

        color = tuple(np.random.random(size=3).tolist())

        # Draw the outline, then a translucent fill in the same color
        ax.add_patch(Rectangle((x, y), w, h, color=color, fill=False))
        ax.add_patch(Rectangle((x, y), w, h, color=color, fill=True, alpha=0.3))
        ax.text(x, y, str(label), color="white", ha="left", va="top")

    return ax

image_with_box = plot_boxes(img, boxes, phrases)
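
`plot_boxes` relies on GroundingDINO returning boxes as normalized `(cx, cy, w, h)` values between 0 and 1. A worked sketch of the conversion to pixel corner coordinates in plain Python, using a made-up box and image size; `cxcywh_to_xyxy` is a hypothetical helper introduced for illustration:

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    # Scale from the 0..1 range to pixels.
    cx, w = cx * img_w, w * img_w
    cy, h = cy * img_h, h * img_h
    # Move from the box center to its top-left and bottom-right corners.
    x1, y1 = cx - w / 2, cy - h / 2
    return (x1, y1, x1 + w, y1 + h)


# Made-up example: a box centered in a 1000x800 image, half as wide and a quarter as tall.
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.25), 1000, 800))  # (250.0, 300.0, 750.0, 500.0)
```

The `(x1, y1, x2, y2)` form is the one the SAM section uses for its box prompts.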

# 🪣 Try SAM


Let's also try out SAM. We will follow similar steps as above, but with the SAM weights and code:

1. ~Load image~ 👈 we already did this
2. Define what we're searching for (the prompt)
3. Download and load the model
4. Run model inference
5. Display the results

In [None]:
#@title ### 💬 Define what you're searching for
#@markdown We will try out a couple of bounding boxes, although SAM supports other prompt types too.
#@markdown Let's have a look at the image again, this time with axes.
fig, ax = plt.subplots()
ax.imshow(img)

#@markdown It seems like some good boxes, given as (x1, y1, x2, y2) in pixels, would be:
boxes = np.array([
    (50, 0, 1000, 900),
    (1550, 550, 1800, 1150),
])

def show_box(box, ax, color="red"):
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    ax.add_patch(Rectangle((x1, y1), w, h, color=color, fill=None))

for box in boxes:
    show_box(box, ax)


In [None]:
#@title ### 📥 Download and load SAM
%cd /content
def load_sam_model(device,  model_file: Path = Path("sam_vit_h_4b8939.pth"), model_type = "default"):
# Smaller alternative (note the matching model_type):
# def load_sam_model(device,  model_file: Path = Path("sam_vit_b_01ec64.pth"), model_type = "vit_b"):
    print(model_file)
    if not model_file.exists():
        # Hack for UTF-8 input encoding
        import locale
        def getpreferredencoding(do_setlocale = True):
            return "UTF-8"
        locale.getpreferredencoding = getpreferredencoding
        # Hack end
        # Download the checkpoint into the current working directory
        model_name = model_file.name
        import subprocess
        subprocess.run(f"wget https://dl.fbaipublicfiles.com/segment_anything/{model_name}", shell=True, check=True)

    sam = sam_model_registry[model_type](checkpoint=model_file.name)
    sam.to(device=device)
    sam_model = SamPredictor(sam)
    return sam_model

sam_model = load_sam_model(device)
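
The `model_type` key has to match the checkpoint: in `sam_model_registry`, `"default"` is an alias for `"vit_h"`, so a ViT-L or ViT-B checkpoint needs its own key. A sketch that infers the key from the released checkpoint filenames; `infer_sam_model_type` is a hypothetical helper, and it assumes the file keeps its original `sam_vit_<size>_...` name:

```python
from pathlib import Path


def infer_sam_model_type(model_file):
    """Map a released SAM checkpoint filename to its sam_model_registry key."""
    name = Path(model_file).name
    for model_type in ("vit_h", "vit_l", "vit_b"):
        if name.startswith(f"sam_{model_type}_"):
            return model_type
    raise ValueError(f"Cannot infer SAM model type from {name!r}")


print(infer_sam_model_type("sam_vit_h_4b8939.pth"))  # vit_h
print(infer_sam_model_type("sam_vit_b_01ec64.pth"))  # vit_b
```

The result could be passed as the `model_type` argument of `load_sam_model` when trying the smaller ViT-B checkpoint.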

In [None]:
#@title ### 🪄 Run inference
def get_sam_output(sam_model, img, boxes, device):
    img_np = np.asarray(img)
    sam_model.set_image(img_np)

    boxes_tensor = torch.tensor(boxes, device=device)
    boxes_sam = sam_model.transform.apply_boxes_torch(boxes_tensor, img_np.shape[:2])

    masks, *_ = sam_model.predict_torch(
        point_coords=None,
        point_labels=None,
        boxes=boxes_sam,
        multimask_output=False,
    )
    return masks.detach().cpu().numpy().squeeze()

masks = get_sam_output(sam_model, img, boxes, device)

In [None]:
#@title ### 🖼️ Display SAM results
def show_mask(mask, ax):
    color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
    return color

def plot_masks(img, masks):
    fig, ax = plt.subplots()
    ax.imshow(img)
    ax.axis("off")
    for mask in masks:
        show_mask(mask, ax)
    return fig

print(masks.shape)
_ = plot_masks(img, masks)
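
`show_mask` turns a boolean mask into a translucent overlay by broadcasting it against an RGBA color. A toy sketch of that broadcasting step with a made-up 2x3 mask and a fixed color instead of a random one:

```python
import numpy as np

# Toy 2x3 boolean mask: True marks pixels that belong to the object.
mask = np.array([[True, False, False],
                 [True, True, False]])
# RGBA color with 60% opacity, matching the alpha used in show_mask.
color = np.array([1.0, 0.0, 0.0, 0.6])

h, w = mask.shape[-2:]
# (h, w, 1) * (1, 1, 4) broadcasts to (h, w, 4): one RGBA value per pixel.
mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)

print(mask_image.shape)  # (2, 3, 4)
print(mask_image[0, 1])  # background pixel: all zeros, so fully transparent
print(mask_image[0, 0])  # object pixel: the RGBA color
```

Because background pixels end up with zero alpha, `ax.imshow(mask_image)` only tints the segmented region and leaves the rest of the photo visible.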

# 🪢 Combine GroundingDINO and Segment Anything Model (SAM)

Combining the two models is straightforward: we take the boxes from GroundingDINO, feed them to SAM as prompts, and get a mask for each box.

In [None]:
def transform_boxes_to_xyxy(boxes, img_w, img_h):
    boxes = boxes * torch.tensor([img_w, img_h, img_w, img_h], device=boxes.device)
    boxes[:,:2] -= boxes[:,2:] / 2
    boxes[:,2:] = boxes[:,:2] + boxes[:,2:]
    return boxes.numpy()

def plot_masks_with_labels(img, masks, boxes, labels):
    fig, ax = plt.subplots()
    ax.imshow(img)
    ax.axis("off")

    for mask, box, label in zip(masks, boxes, labels):
        color = show_mask(mask, ax)
        show_box(box, ax, color=color)
        x, y, *_ = box
        ax.text(x, y, str(label), color="white", ha="left", va="top")

@torch.inference_mode()
def predict(img_path, class_descriptions):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pil_img, tensor_img = load_image(img_path)
    boxes, phrases = get_grounding_output(dino_model, tensor_img, class_descriptions, device)

    # `boxes` is a torch tensor with one row per detection
    if boxes.shape[0] == 0:
        print("Couldn't find any boxes")
        return

    # Keep the normalized cxcywh boxes for plot_boxes; SAM needs pixel xyxy boxes
    boxes_xyxy = transform_boxes_to_xyxy(boxes, pil_img.size[0], pil_img.size[1])
    masks = get_sam_output(sam_model, pil_img, boxes_xyxy, device)

    if masks.sum():
        # A single box yields a 2D mask after squeeze; restore the leading axis
        if masks.ndim == 2:
            masks = masks[None]
        plot_masks_with_labels(pil_img, masks, boxes_xyxy, phrases)
    else:
        print("No masks detected")
        plot_boxes(pil_img, boxes, phrases)

predict(image_path, "cat.dog.horse")

# ✅ Wrap up

That's it, folks!
If you want to go further, check out our [notebook](./Encord_Notebooks_Zero_shot_image_segmentation_with_grounding_dino_and_sam.ipynb) that runs the end-to-end experiments, including model evaluation on an actual segmentation task.

# 🖼️ Other examples

In [None]:
# ⚠️ REMEMBER TO CHANGE TO THE CORRECT DIRECTORY 👇

files = [
    ("/content/Grounded-Segment-Anything/assets/demo1.jpg", "bear.dog.chair.person."),
    ("/content/Grounded-Segment-Anything/assets/demo2.jpg", "bear.dog.chair.person."),
    ("/content/Grounded-Segment-Anything/assets/demo3.jpg", "bear.dog.chair.person."),
    ("/content/Grounded-Segment-Anything/assets/demo4.jpg", "bear.dog.chair.person."),
    ("/content/Grounded-Segment-Anything/assets/demo5.jpg", "bear.dog.chair.person."),
]
for f, q in files:
    predict(f, q)

📓 This Colab notebook showed you how to combine the power of GroundingDINO and SAM to produce accurate segmentation masks and bounding boxes that compare with state-of-the-art techniques like [Mask R-CNN](https://paperswithcode.com/paper/mask-r-cnn).

---

🟣 Encord Active is an open-source framework for computer vision model testing, evaluation, and validation. Check out the project on [GitHub](https://github.com/encord-team/encord-active), leave a star 🌟 if you like it, and leave an issue if you find something is missing.

---

👉 Check out our 📖[blog](https://encord.com/blog/webinar-semantic-visual-search-chatgpt-clip/) and 📺[YouTube](https://www.youtube.com/@encord) channel to stay up-to-date with the latest in computer vision, foundation models, active learning, and data-centric AI.

### ✨ Want more walkthroughs like this? Check out the 🟣 [Encord Notebooks repository](https://github.com/encord-team/encord-notebooks/).