<div align="center" dir="auto">
<p dir="auto"><a href="https://colab.research.google.com/github/encord-team/encord-notebooks/blob/main/colab-notebooks/Encord_Notebooks_Zero_shot_image_segmentation_with_grounding_dino_and_sam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<div align="center" dir="auto">
  <div style="flex: 1; padding: 10px;">
    <a href="https://join.slack.com/t/encordactive/shared_invite/zt-1hc2vqur9-Fzj1EEAHoqu91sZ0CX0A7Q" target="_blank" style="text-decoration:none">
      <img alt="Join us on Slack" src="https://img.shields.io/badge/Join_Our_Community-4A154B?label=&logo=slack&logoColor=white">
    </a>
    <a href="https://docs.encord.com/docs/active-overview" target="_blank" style="text-decoration:none">
      <img alt="Documentation" src="https://img.shields.io/badge/docs-Online-blue">
    </a>
    <a href="https://twitter.com/encord_team" target="_blank" style="text-decoration:none">
      <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&amp;style=social">
    </a>
    <img alt="Python versions" src="https://img.shields.io/pypi/pyversions/encord-active">
    <a href="https://pypi.org/project/encord-active/" target="_blank" style="text-decoration:none">
      <img alt="PyPi project" src="https://img.shields.io/pypi/v/encord-active">
    </a>
    <a href="https://docs.encord.com/docs/active-contributing" target="_blank" style="text-decoration:none">
      <img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue">
    </a>
    <img alt="Licence" src="https://img.shields.io/github/license/encord-team/encord-active">
  </div>
</div>

<div align="center">
  <p>
    <a align="center" href="" target="_blank">
      <img
        width="7232"
        src="https://storage.googleapis.com/encord-notebooks/encord_active_notebook_banner.png">
    </a>
  </p>
</div>

# 🟣 Encord Notebooks | 🔧 Zero-Shot Image Segmentation with Grounding-DINO + Segment Anything Model (SAM)

## 🏁 Overview

👋 Hi there!

In this notebook, you will generate and evaluate segmentation predictions for images using [Grounding-DINO](https://encord.com/blog/grounding-dino-sam-vs-mask-rcnn-comparison/) and the Segment Anything Model (SAM).

You will use an 🟣 Encord Active sandbox project to run the segmentation pipeline, and then visualize the prediction performance (mAP/mAR) in 🟣 Encord Active.

<br>

---

💡 If you want to read more about 🟣 Encord Active, check out our [GitHub](https://github.com/encord-team/encord-active) and [documentation](https://encord-active-docs.web.app/).

## 📰 Complementary Blog Post

![Encord_Notebooks_Grounding-DINO_Segment_Anything_Model_Header_Image](https://images.prismic.io/encord/63abbb6a-9fe0-4bd0-a76f-5cb4e0b1955a_Grounding-DINO%20%2B%20Segment%20Anything%20model%20Header%20image.png?ixlib=gatsbyFP&auto=compress%2Cformat&fit=max)

This notebook implements the steps discussed in this blog post: https://encord.com/blog/grounding-dino-sam-vs-mask-rcnn-comparison/

Read it alongside this notebook as a complementary guide.

## 📥 Installation and Set Up: Grounding-DINO and Segment Anything Model (SAM)

To follow this walkthrough, you first need to install the required libraries, dependencies, and models.

The cell below clones the Grounded-Segment-Anything repository and installs its requirements, along with Grounding DINO and Segment Anything, so you can run the remaining cells without interruption.

In [None]:
%cd /content

!git clone https://github.com/IDEA-Research/Grounded-Segment-Anything

%cd /content/Grounded-Segment-Anything
!pip install -q -r requirements.txt
%cd /content/Grounded-Segment-Anything/GroundingDINO
!pip install -q .
%cd /content/Grounded-Segment-Anything/segment_anything
!pip install -q .

## Install 🟣 Encord Active

In [None]:
# Check that the Python version is 3.9 or 3.10
import sys
assert sys.version_info[:2] in [(3, 9), (3, 10)], "Encord Active is only supported on Python 3.9 and 3.10."

from IPython.display import display, Markdown

!python -m pip install -qq encord-active

display(Markdown('## ‼ Please restart your runtime before running the next cell.'))

## 📩 Download an Encord Active sandbox project


You will use the 🟣 Encord Active quickstart project (a 200-image subset of the COCO validation set) in this notebook.

In [None]:
%cd /content
!encord-active download --project-name quickstart
%cd /content/Grounded-Segment-Anything

In [None]:
! wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

In [None]:
#@title 👇🏽 Run this utility code
import sys
sys.stdout.fileno = lambda: 1
sys.stderr.fileno = lambda: 2

## Load Grounding DINO and SAM models


You'll need to set up the necessary libraries, load the Grounding DINO model and the SAM model, and prepare the required data structures and objects for further processing.

You load the Grounding DINO model with the function `load_model_hf`, which takes the repository ID, the checkpoint and configuration filenames, and the device type as inputs. The function downloads the model files from the Hugging Face model hub, builds the model from the provided configuration, and loads the model's state dictionary. It then sets the model to evaluation mode and returns it.

The code also reads the ontology from a file and extracts the object names and their corresponding feature node hashes. You'll build a text prompt from the ontology object names and set the threshold values for box and text predictions.
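
To make the ontology step concrete, here is a minimal sketch of how the object names become a text prompt and a name-to-hash lookup. The ontology contents and the helper name `build_prompt` are made up for illustration; the real ontology is read from the project files in the cell below.

```python
import json

# Hypothetical, trimmed-down ontology used only for this illustration.
ontology_json = """
{
  "objects": [
    {"name": "person", "featureNodeHash": "aaaa1111"},
    {"name": "traffic light", "featureNodeHash": "bbbb2222"},
    {"name": "dog", "featureNodeHash": "cccc3333"}
  ]
}
"""


def build_prompt(ontology: dict):
    """Return the text prompt and a name -> feature hash lookup."""
    objects = ontology.get("objects", [])
    names = [obj["name"] for obj in objects]
    name_to_hash = {obj["name"]: obj["featureNodeHash"] for obj in objects}
    # Class names are joined with " . ", as in the cell below.
    return " . ".join(names), name_to_hash


prompt, name_to_hash = build_prompt(json.loads(ontology_json))
print(prompt)  # person . traffic light . dog
```

The lookup is what later ties a predicted phrase back to the ontology class it belongs to.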



In [None]:
import os
import sys

sys.path.append(os.path.join(os.getcwd(), "GroundingDINO"))
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import pickle

import numpy as np
import torch

# Grounding DINO
from GroundingDINO.groundingdino.models import build_model
from GroundingDINO.groundingdino.util import box_ops
from GroundingDINO.groundingdino.util.slconfig import SLConfig
from GroundingDINO.groundingdino.util.utils import clean_state_dict
from GroundingDINO.groundingdino.util.inference import load_image, predict


# segment anything
sys.path.append("..")
from segment_anything import sam_model_registry, SamPredictor
import cv2

from huggingface_hub import hf_hub_download

from tqdm import tqdm
from encord_active.lib.project.project_file_structure import ProjectFileStructure
from encord_active.lib.common.iterator import DatasetIterator
from encord_active.lib.db.predictions import Format, ObjectDetection, Prediction
import json
from pathlib import Path

# Function to load Grounding DINO model from Hugging Face Hub
def load_model_hf(repo_id, filename, ckpt_config_filename, device='cpu'):
    # Download and load model configuration
    cache_config_file = hf_hub_download(repo_id=repo_id, filename=ckpt_config_filename)
    args = SLConfig.fromfile(cache_config_file)

    # Build Grounding DINO model
    model = build_model(args)
    args.device = device

    # Download and load model checkpoint
    cache_file = hf_hub_download(repo_id=repo_id, filename=filename)
    checkpoint = torch.load(cache_file, map_location='cpu')
    log = model.load_state_dict(clean_state_dict(checkpoint['model']), strict=False)
    print("Model loaded from {} \n => {}".format(cache_file, log))

    # Set model to evaluation mode
    _ = model.eval()
    return model

device = "cuda"
ea_project_path = Path('/content/quickstart') # Path to the project directory

ckpt_repo_id = "ShilongLiu/GroundingDINO" # Repository ID of Grounding DINO model
ckpt_filename = "groundingdino_swinb_cogcoor.pth" # Model checkpoint filename
ckpt_config_filename = "GroundingDINO_SwinB.cfg.py" # Model configuration filename

# Load Grounding DINO model
groundingdino_model = load_model_hf(ckpt_repo_id, ckpt_filename, ckpt_config_filename, device=device)

# Create ProjectFileStructure object for the project directory
project_fs: ProjectFileStructure = ProjectFileStructure(ea_project_path)

# Initialize DatasetIterator object with the project directory
iterator = DatasetIterator(project_fs.project_dir)

# Read ontology from file and extract object names and feature node hashes
ontology = json.loads(project_fs.ontology.read_text(encoding="utf-8"))
ontology_names = [obj["name"] for obj in ontology.get("objects")]
ontology_name_to_featurehash = {obj["name"]: obj['featureNodeHash'] for obj in ontology.get("objects")}

TEXT_PROMPT = " . ".join(ontology_names) # Build the text prompt from the ontology names, e.g. "person . dog"
BOX_TRESHOLD = 0.3 # Minimum confidence for a predicted box to be kept
TEXT_TRESHOLD = 0.25 # Minimum score for a prompt token to be included in a predicted phrase



Initialize a SAM model using the checkpoint you downloaded earlier and its matching model type, and move it to the CUDA device.

Next, you'll create a `SamPredictor` object with the initialized SAM model.

In [None]:
sam_checkpoint = 'sam_vit_b_01ec64.pth'
model_type = "vit_b"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint) # Initialize SAM model
sam.to(device=device) # Move SAM model to cuda device
sam_predictor = SamPredictor(sam) # Create SamPredictor object with SAM model

## Import predictions to Encord Active

Now, you'll import the model predictions into 🟣 Encord Active so you can see how the model performs on each metric, evaluate its quality, identify failure modes, and detect labelling errors, among other insights that help improve the overall system.

Essentially, you will use the Grounding-DINO model and the SAM model to perform object detection and segmentation, respectively, and import the predictions into 🟣 Encord Active.

> 💡 Learn more in [the documentation](https://encord-active-docs.web.app/import/import-predictions).
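
Before SAM can use the boxes, the cell below converts them from normalized `(cx, cy, w, h)` to pixel `(x0, y0, x1, y1)` coordinates. The following is a minimal pure-Python sketch of that conversion for a single box; the helper name `cxcywh_to_pixel_xyxy` and the sample values are illustrative only, and the notebook itself does this with `box_ops.box_cxcywh_to_xyxy` on tensors.

```python
def cxcywh_to_pixel_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * width   # left edge
    y0 = (cy - h / 2) * height  # top edge
    x1 = (cx + w / 2) * width   # right edge
    y1 = (cy + h / 2) * height  # bottom edge
    return (x0, y0, x1, y1)


# A box centered in a 640x480 image, half as wide and a quarter as tall.
pixel_box = cxcywh_to_pixel_xyxy((0.5, 0.5, 0.5, 0.25), width=640, height=480)
print(pixel_box)  # (160.0, 180.0, 480.0, 300.0)
```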

In [None]:
predictions_to_store = [] # List to store predictions

# Iterate over the dataset using the DatasetIterator
for data_unit, img_path in tqdm(iterator.iterate()):
    try:
        image_source, image = load_image(img_path.as_posix()) # Load the image

        # Get bounding boxes from Grounding-DINO
        boxes, logits, phrases = predict(
            model=groundingdino_model,
            image=image,
            caption=TEXT_PROMPT,
            box_threshold=BOX_TRESHOLD,
            text_threshold=TEXT_TRESHOLD
        )

        if boxes.shape[0] > 0:

            sam_predictor.set_image(image_source)

            H, W, _ = image_source.shape
            boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.Tensor([W, H, W, H]) # Convert box coordinates

            transformed_boxes = sam_predictor.transform.apply_boxes_torch(boxes_xyxy, image_source.shape[:2]).to(device)

            # Get masks for bounding boxes using SAM predictor
            masks, _, _ = sam_predictor.predict_torch(
                point_coords=None,
                point_labels=None,
                boxes=transformed_boxes,
                multimask_output=False,
            )

            for idx, mask in enumerate(masks):
                mask = mask[0].detach().cpu().numpy()  # Convert the mask to a numpy array
                contours, hierarchy = cv2.findContours(mask.astype(np.uint8), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE) # Find contours in the mask

                # Map the predicted phrase to an ontology class name:
                # try the full phrase, then its first word, then its first two words
                phrase = phrases[idx]
                if phrase in ontology_name_to_featurehash:
                    class_name = phrase
                elif phrase.split(" ")[0] in ontology_name_to_featurehash:
                    class_name = phrase.split(" ")[0]
                else:
                    class_name = " ".join(phrase.split(" ")[:2])

                for contour in contours:
                    # Normalize the polygon points to [0, 1] relative to the image width and height
                    contour = contour.reshape(contour.shape[0], 2) / np.array([[W, H]])

                    # Create a Prediction object with the predicted polygon, class, and confidence
                    prediction = Prediction(
                        data_hash=data_unit["data_hash"],
                        confidence=logits[idx].item(),
                        object=ObjectDetection(
                            format=Format.POLYGON,
                            data=contour,
                            feature_hash=ontology_name_to_featurehash[class_name],
                        ),
                    )
                    predictions_to_store.append(prediction) # Add the prediction to the list

    except Exception as e:
        # Report which image failed and continue with the next one
        print(f"Error while processing {img_path}: {e}")

# Save the predictions to a pickle file in the project directory
with open(os.path.join(project_fs.project_dir.as_posix(), "predictions_sam.pkl"), "wb") as f:
    pickle.dump(predictions_to_store, f)
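
The loop above matches each phrase returned by Grounding-DINO to an ontology class by trying the full phrase, then its first word, then its first two words. The sketch below isolates that fallback so you can try it on a few phrases. The function name `resolve_class_name` and the class names are hypothetical, and unlike the cell above, this version returns `None` for an unmatched phrase instead of raising a `KeyError`.

```python
def resolve_class_name(phrase, known_names):
    """Return the ontology class matching a predicted phrase, or None."""
    if phrase in known_names:
        return phrase
    words = phrase.split(" ")
    if words[0] in known_names:
        return words[0]
    # Fall back to the first two words, e.g. "traffic light" in "traffic light pole".
    candidate = " ".join(words[:2])
    return candidate if candidate in known_names else None


known = {"person", "traffic light", "dog"}  # Illustrative class names
for phrase in ["dog", "person dog", "traffic light pole", "zebra"]:
    print(repr(phrase), "->", resolve_class_name(phrase, known))
```

Returning `None` lets a caller skip a single unmatched mask rather than abandon the remaining masks of that image, which is one possible design choice rather than what the notebook does.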

## 📥 Importing the `predictions_sam.pkl` file to Encord Active project



- Download the `predictions_sam.pkl` file.
- Run `encord-active import predictions /path/to/predictions_sam.pkl -t /path/to/target/project/folder`.
- When the import finishes, you can open 🟣 Encord Active to see the model quality results.
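
Before importing, it can help to confirm that the pickle file holds a non-empty list. The sketch below is one way to do that; `count_predictions` is a hypothetical helper, and the demo uses stand-in dictionaries so it runs anywhere. Unpickling the real `predictions_sam.pkl` requires `encord-active` to be installed, because the file stores `Prediction` objects.

```python
import pickle
import tempfile
from pathlib import Path


def count_predictions(pickle_path) -> int:
    """Load a pickled list of predictions and return how many it holds."""
    with open(pickle_path, "rb") as f:
        predictions = pickle.load(f)
    if not isinstance(predictions, list):
        raise TypeError(f"Expected a list, got {type(predictions).__name__}")
    return len(predictions)


# Stand-in data for illustration; replace demo_path with the path to
# predictions_sam.pkl to check the real file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "predictions_demo.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump([{"data_hash": "a"}, {"data_hash": "b"}], f)
    demo_count = count_predictions(demo_path)

print(demo_count)  # 2
```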

Here are some screenshots from the model performance page of 🟣 Encord Active:

**Metric correlation**

![Encord Notebooks - Metric Correlation Viz](https://storage.googleapis.com/encord-notebooks/ground_dino_sam/encord_notebooks_metric_correlation.png)


**Metrics per class**

![Encord Notebooks - Metrics per class](https://storage.googleapis.com/encord-notebooks/ground_dino_sam/encord_notebooks_metrics_per_class.png)

**Performance by Metric**


![Encord Notebooks - Performance by Metric](https://storage.googleapis.com/encord-notebooks/ground_dino_sam/encord_notebooks_performance_by_metric.png)

# ✅ Wrap up


📓This Colab notebook showed you how to run zero-shot image segmentation with [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) and [Segment Anything Model](https://encord.com/blog/segment-anything-model-explained/) (SAM).

Most importantly, you learnt how to import the model's predictions to Encord Active to analyse class errors and visualize the model's performance.

If you would like to learn more, check out the [complementary blog post](https://encord.com/blog/grounding-dino-sam-vs-mask-rcnn-comparison/).

---

🟣 Encord Active is an open-source framework for computer vision model testing, evaluation, and validation. Check out the project on [GitHub](https://github.com/encord-team/encord-active), leave a star 🌟 if you like it, and leave an issue if you find something is missing.

---

👉 Check out our 📖[blog](https://encord.com/blog/webinar-semantic-visual-search-chatgpt-clip/) and 📺[YouTube](https://www.youtube.com/@encord) channel to stay up-to-date with the latest in computer vision, foundation models, active learning, and data-centric AI.



### ✨ Want more walkthroughs like this? Check out the 🟣 [Encord Notebooks repository](https://github.com/encord-team/encord-notebooks/tree/9617d8bc6cea52563ecb18bf173c2043195403e8).