## Object Tracking and Image Segmentation Pipeline Using SAM2 Model

### Introduction

This project presents an advanced image segmentation and object tracking pipeline using the SAM2 model, a solution for tasks requiring precise object detection and segmentation in video frames. The pipeline is designed to automate the process of tracking objects across multiple frames, predicting segmentation masks, and evaluating the accuracy of these predictions using COCO format metrics. It offers a reusable framework adaptable to different categories of objects.

### Implementation Details

The core of this project is encapsulated in the ImageSegmentationPipeline class. This class integrates several components:

1. **Initialization and Setup:**

- The class initializes the SAM2 model for both image and video predictions.
- A temporary directory is created for storing intermediate files during processing.

2. **Utility Functions:**

- Functions for directory management, clearing temporary folders, and image display are provided.
- The ```show_mask``` and ```show_box``` methods allow visualization of segmentation masks and bounding boxes on images.

3. **Tracking and Visualization:**

- The ```track_item_boxes``` method tracks objects across frames based on the bounding boxes provided.
- The ```visualize_tracking``` method helps visualize the object tracking results.

4. **Image Processing:**

- The ```process_img_png_mask``` method processes the mask of an image to extract bounding box coordinates.
- The ```process_category method``` handles the entire workflow for a given category, including mask prediction, bounding box extraction, and saving the results.

5. **Evaluation:**

- The ```evaluate_category``` method evaluates the predicted bounding boxes against the ground truth using COCO's evaluation tools, providing performance metrics for each category.

6. **Pipeline Execution:**

- The ```run_pipeline``` method is the entry point for running the entire process on specified categories, from image processing to evaluation.

### Importing necessary libraries

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
import shutil
import json
from PIL import Image
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
from torchvision import transforms

from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.build_sam import build_sam2_video_predictor

  OLD_GPU, USE_FLASH_ATTN, MATH_KERNEL_ON = get_sdpa_settings()


### Defining the Image Segmentation Pipeline Class

In [2]:
class ImageSegmentationPipeline:
    def __init__(self, model_cfg, checkpoint, device='cuda'):
        self.predictor_prompt = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
        self.sam2 = build_sam2(model_cfg, checkpoint, device=device, apply_postprocessing=False)
        self.mask_generator = SAM2AutomaticMaskGenerator(self.sam2)
        self.predictor_vid = build_sam2_video_predictor(model_cfg, checkpoint, device=device)
        
        self.tempfolder = "./tempdir"
        self.create_if_not_exists(self.tempfolder)

    @staticmethod
    def create_if_not_exists(dirname):
        if not os.path.exists(dirname):
            os.makedirs(dirname)

    def cleardir(self, folder):
        filepaths = glob.glob(folder + "/*")
        for filepath in filepaths:
            os.unlink(filepath)

    def show_mask(self, mask, ax, obj_id=None, random_color=False):
        if random_color:
            color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
        else:
            cmap = plt.get_cmap("tab10")
            cmap_idx = 0 if obj_id is None else obj_id
            color = np.array([*cmap(cmap_idx)[:3], 0.6])
        h, w = mask.shape[-2:]
        mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
        ax.imshow(mask_image)

    def show_box(self, box, ax):
        x0, y0 = box[0], box[1]
        w, h = box[2] - box[0], box[3] - box[1]
        ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0, 0, 0, 0), linewidth=1))

    def track_item_boxes(self, imgpath1, imgpath2, img1boxclasslist, visualize=False):
        self.cleardir(self.tempfolder)
        shutil.copy(imgpath1, self.tempfolder + "/00000.jpg")
        shutil.copy(imgpath2, self.tempfolder + "/00001.jpg")

        inference_state = self.predictor_vid.init_state(video_path=self.tempfolder)
        self.predictor_vid.reset_state(inference_state)
        ann_frame_idx = 0

        for img1boxclass in img1boxclasslist:
            ([xmin, xmax, ymin, ymax], objectnumint) = img1boxclass
            box = np.array([xmin, ymin, xmax, ymax], dtype=np.float32)
            _, out_obj_ids, out_mask_logits = self.predictor_vid.add_new_points_or_box(
                inference_state=inference_state,
                frame_idx=ann_frame_idx,
                obj_id=objectnumint,
                box=box,
            )

        video_segments = {}
        for out_frame_idx, out_obj_ids, out_mask_logits in self.predictor_vid.propagate_in_video(inference_state):
            video_segments[out_frame_idx] = {
                out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
                for i, out_obj_id in enumerate(out_obj_ids)
            }

        if visualize:
            self.visualize_tracking(imgpath1, imgpath2, img1boxclasslist[0][0], video_segments)

        return video_segments

    def visualize_tracking(self, imgpath1, imgpath2, box, video_segments):
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
        
        # Original image with box
        ax1.imshow(Image.open(imgpath1))
        ax1.set_title("Original image object")
        xmin, xmax, ymin, ymax = box
        self.show_box([xmin, ymin, xmax, ymax], ax1)
        
        # Test image with detected mask
        ax2.imshow(Image.open(imgpath2))
        ax2.set_title("Detected object in test image")
        for out_obj_id, out_mask in video_segments[1].items():
            self.show_mask(out_mask, ax2, obj_id=out_obj_id)
        
        plt.show()

    @staticmethod
    def process_img_png_mask(img_path, mask_path, visualize=False):
        mask = Image.open(mask_path).convert("L")
        mask_np = np.array(mask)

        y_indices, x_indices = np.where(mask_np > 0)
        
        if y_indices.size > 0 and x_indices.size > 0:
            xmin, xmax = x_indices.min(), x_indices.max()
            ymin, ymax = y_indices.min(), y_indices.max()
        else:
            xmin, xmax, ymin, ymax = 0, 0, 0, 0

        if visualize:
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
            ax1.imshow(Image.open(img_path))
            ax1.set_title("Original Image")
            ax2.imshow(mask_np, cmap='gray')
            ax2.add_patch(plt.Rectangle((xmin, ymin), xmax-xmin, ymax-ymin, 
                                        edgecolor='red', facecolor='none', linewidth=2))
            ax2.set_title("Mask with Bounding Box")
            plt.show()

        return xmin, xmax, ymin, ymax

    def process_category(self, category, data_dir, output_dir):
        self.create_if_not_exists(output_dir)
        
        # Directories for saving outputs
        pred_masks_dir = os.path.join(output_dir, f'{category}_Predicted_Masks')
        pred_bbox_dir = os.path.join(output_dir, f'{category}_Bounding_Boxes_Predicted')
        gt_bbox_dir = os.path.join(output_dir, f'{category}_Bounding_Boxes_Ground_Truth')
        
        for dir_path in [pred_masks_dir, pred_bbox_dir, gt_bbox_dir]:
            self.create_if_not_exists(dir_path)

        # Get the first image and its mask for training
        first_img_path = sorted(glob.glob(os.path.join(data_dir, f'{category}_*.jpg')))[0]
        first_img_mask_path = first_img_path.replace('.jpg', '_1_gt.png')
        
        # Extract the bounding box from the first image's mask
        xmin, xmax, ymin, ymax = self.process_img_png_mask(first_img_path, first_img_mask_path)

        # Process remaining images
        image_files = sorted(glob.glob(os.path.join(data_dir, f'{category}_*.jpg')))[1:]
        for img_path in image_files:
            # Predict mask
            op = self.track_item_boxes(first_img_path, img_path, [([xmin, xmax, ymin, ymax], 1)])
            relevant_mask = op[1][1]
            
            if relevant_mask.ndim > 2:
                relevant_mask = np.squeeze(relevant_mask)
            
            mask_to_save = (relevant_mask.astype(np.uint8) * 255)
            
            # Save predicted mask
            mask_name = os.path.basename(img_path).replace('.jpg', '_pred_mask.png')
            save_path = os.path.join(pred_masks_dir, mask_name)
            Image.fromarray(mask_to_save).save(save_path)
            
            # Calculate and save predicted bounding box
            y_indices, x_indices = np.where(mask_to_save > 0)
            if y_indices.size > 0 and x_indices.size > 0:
                xmin, xmax = x_indices.min(), x_indices.max()
                ymin, ymax = y_indices.min(), y_indices.max()
                bbox = [xmin, ymin, xmax-xmin, ymax-ymin]
            else:
                bbox = [0, 0, 0, 0]
            
            self.save_bbox_coco_format(os.path.basename(img_path), bbox, 1, 
                                       os.path.join(pred_bbox_dir, f'{os.path.basename(img_path).replace(".jpg", "_pred_bbox.json")}'))
            
            # Calculate and save ground truth bounding box
            gt_mask_path = img_path.replace('.jpg', '_1_gt.png')
            if os.path.exists(gt_mask_path):
                xmin, xmax, ymin, ymax = self.process_img_png_mask(img_path, gt_mask_path)
                bbox = [xmin, ymin, xmax-xmin, ymax-ymin]
                self.save_bbox_coco_format(os.path.basename(img_path), bbox, 1, 
                                           os.path.join(gt_bbox_dir, f'{os.path.basename(img_path).replace(".jpg", "_gt_bbox.json")}'))
        
        # Evaluate predictions
        self.evaluate_category(category, gt_bbox_dir, pred_bbox_dir)

    @staticmethod
    def save_bbox_coco_format(img_name, bbox, category_id, output_path):
        bbox = [int(coord) for coord in bbox]
        annotation = {
            "image_id": img_name,
            "category_id": category_id,
            "bbox": bbox,
            "area": bbox[2] * bbox[3],
            "iscrowd": 0,
        }
        with open(output_path, 'w') as f:
            json.dump(annotation, f)

    @staticmethod
    def load_annotations_from_directory(directory):
        annotations = []
        for file_path in glob.glob(f'{directory}/*.json'):
            with open(file_path, 'r') as file:
                annotations.append(json.load(file))
        return annotations

    @staticmethod
    def convert_to_coco_format(annotations, image_ids, category_id, is_pred=False):
        coco_annotations = []
        for i, ann in enumerate(annotations):
            coco_annotation = {
                "id": i + 1,
                "image_id": image_ids.index(ann['image_id']),
                "category_id": category_id,
                "bbox": ann['bbox'],
                "area": ann['area'],
                "iscrowd": ann['iscrowd']
            }
            if is_pred:
                coco_annotation["score"] = 1.0  # Assuming high confidence for all predictions
            coco_annotations.append(coco_annotation)
        return coco_annotations

    #Evaluation
    def evaluate_category(self, category, gt_bbox_dir, pred_bbox_dir):
        gt_annotations = self.load_annotations_from_directory(gt_bbox_dir)
        pred_annotations = self.load_annotations_from_directory(pred_bbox_dir)
        
        image_ids = list({ann['image_id'] for ann in gt_annotations + pred_annotations})
        
        coco_gt = COCO()
        coco_pred = COCO()

        coco_gt.dataset = {
            'images': [{'id': idx, 'file_name': img_id} for idx, img_id in enumerate(image_ids)],
            'annotations': self.convert_to_coco_format(gt_annotations, image_ids, 1),
            'categories': [{'id': 1, 'name': category}]
        }
        
        coco_pred.dataset = {
            'images': [{'id': idx, 'file_name': img_id} for idx, img_id in enumerate(image_ids)],
            'annotations': self.convert_to_coco_format(pred_annotations, image_ids, 1, is_pred=True),
            'categories': [{'id': 1, 'name': category}]
        }

        coco_gt.createIndex()
        coco_pred.createIndex()

        coco_eval = COCOeval(coco_gt, coco_pred, 'bbox')
        coco_eval.evaluate()
        coco_eval.accumulate()
        coco_eval.summarize()

        print(f"Evaluation results for category: {category}")
        

    def run_pipeline(self, categories, data_dir, output_dir):
        for category in categories:
            print(f"Processing category: {category}")
            self.process_category(category, data_dir, output_dir)
            print(f"Finished processing category: {category}\n")



### Initializing and Running the Pipeline

In [3]:
# Initialize the pipeline
model_cfg = "./sam2_hiera_t.yaml"
checkpoint = "./model/sam2_hiera_tiny.pt"
data_dir = "./data/data_2D"
output_dir = "./outputs"

pipeline = ImageSegmentationPipeline(model_cfg, checkpoint)

#### Category: can_chowder

In [4]:
categories = ["can_chowder"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)


Processing category: can_chowder


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  5.63it/s]
  x = F.scaled_dot_product_attention(

Skipping the post-processing step due to the error above. You can still use SAM 2 and it's OK to ignore the error above, although some post-processing functionality may be limited (which doesn't affect the results in most cases; see https://github.com/facebookresearch/segment-anything-2/blob/main/INSTALL.md).
  pred_masks_gpu = fill_holes_in_mask_scores(
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.05it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.86it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.00it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.43it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.48it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 14.75it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.42it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.51it/s]


creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.10s).
Accumulating evaluation results...
DONE (t=0.04s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.008
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.033
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.003
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.185
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.003
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.041
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.041
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.041
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 

#### Category: can_soymilk

In [5]:
categories = ["can_soymilk"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: can_soymilk


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.90it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.27it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.95it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.45it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.93it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.45it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.59it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.11it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 11.25it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.19it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 11.50it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.93it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.12it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.29it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.26

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.02s).
Accumulating evaluation results...
DONE (t=0.01s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.003
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.011
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.020
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.020
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.020
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 

#### Category: can_tomatosoup

In [6]:
categories = ["can_tomatosoup"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: can_tomatosoup


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.17it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.97it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.46it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.18it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.68it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.84it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 11.22it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.07it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.29it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.28it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.68it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.17it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  7.10it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.23it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.29

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.02s).
Accumulating evaluation results...
DONE (t=0.02s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.013
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.005
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.035
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.035
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.035
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 

#### Category: carton_oj

In [7]:
categories = ["carton_oj"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: carton_oj


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.95it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.44it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.92it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.12it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 12.07it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.44it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.23it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.29it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.92it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.86it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.00it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.86it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 12.08it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.13it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.20

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.02s).
Accumulating evaluation results...
DONE (t=0.00s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.003
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.006
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.010
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.010
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.010
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000


#### Category: carton_soymilk

In [8]:
categories = ["carton_soymilk"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: carton_soymilk


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.52it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.61it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.96it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.30it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 13.47it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.51it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 13.42it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.32it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.68it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.66it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 11.04it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.45it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.94it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.63it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  7.72

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.00s).
Accumulating evaluation results...
DONE (t=0.02s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.014
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.013
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.016
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.016
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.016
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 

#### Category: diet_coke

In [9]:
categories = ["diet_coke"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: diet_coke


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.16it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.46it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  7.39it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.39it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.03it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.28it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.38it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.13it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.30it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.28it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 11.33it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.02it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.00it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.27it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.34

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.00s).
Accumulating evaluation results...
DONE (t=0.02s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.009
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.012
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.008
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.065
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.065
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.065
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 

#### Category: hc_potroastsoup

In [10]:
categories = ["hc_potroastsoup"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: hc_potroastsoup


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.64it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.80it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.78it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.89it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.05it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.42it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.06it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.62it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 12.25it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.29it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.68it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.76it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.85it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.17it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 12.06

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.02s).
Accumulating evaluation results...
DONE (t=0.00s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.008
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.021
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.033
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.033
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.033
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 

#### Category: juicebox

In [11]:
categories = ["juicebox"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: juicebox


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.36it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.63it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 12.12it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.44it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.06it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.14it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.39it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.32it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.95it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.13it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.99it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.15it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 11.96it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.29it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  7.79

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.02s).
Accumulating evaluation results...
DONE (t=0.02s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.008
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.013
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.011
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.068
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.029
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.059
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.059
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.059
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 

#### Category: rice_tuscan

In [12]:
categories = ["rice_tuscan"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: rice_tuscan


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  5.02it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.79it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.72it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.29it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.50it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.21it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.00it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.18it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.59it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.87it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.34it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.44it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.07it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.21it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.98

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.00s).
Accumulating evaluation results...
DONE (t=0.01s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.027
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.031
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.031
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.031
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.031
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.031
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.031
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000


#### Category: ricepilaf

In [13]:
categories = ["ricepilaf"]

# Run the pipeline
pipeline.run_pipeline(categories, data_dir, output_dir)

Processing category: ricepilaf


frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.88it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.75it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.32it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.43it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 10.89it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.14it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.61it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.62it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.56it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  3.99it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  8.57it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.14it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00, 12.03it/s]
propagate in video: 100%|██████████| 2/2 [00:00<00:00,  4.47it/s]
frame loading (JPEG): 100%|██████████| 2/2 [00:00<00:00,  9.13

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.02s).
Accumulating evaluation results...
DONE (t=0.02s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000


### Conclusion

This project provides a solution for object tracking and image segmentation using the SAM2 model. The pipeline's modular design allows for easy adaptation to different categories.