# Step 2: Object Detection and Image Segmentation 

Using the images downloaded from the Library of Congress API in [Step 1 (Metadata Collection and Image Download)](https://github.com/beefoo/lclabs-jfp24/blob/main/workflow/step_1_metadata_and_image_download.ipynb), the second step in our workflow focuses on the computer vision aspects of the collage tool.

This step uses PyTorch's Faster R-CNN object detection model and weights to predict the objects in an image, returning a confidence score, class label, and bounding box for each one. The bounding box is then supplied as a box prompt to the segmentation model, EfficientSAM, which generates a mask (outline) of the object for extraction. This notebook also generates the thumbnails used by the website's UI.

Overall, outputs from both of these computer vision models are used to generate masks, extract objects from images, and generate data that is stored as a JSON file ('model_results.json').
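
As a rough guide to what ends up in 'model_results.json', the sketch below builds one entry by hand using the keys that the cells in this notebook write. Every value is a placeholder invented for illustration, not real model output.

```python
import json

# Illustrative only: the keys mirror those written later in this notebook,
# but every value here is a made-up placeholder.
example_results = {
    "items": [
        {
            "resource_id": "0000000000",
            "original_format": [1024, 768],  # (width, height) of the source image
            "thumbnail": "image_0000000000_thumbnail.jpg",
            "segments": [
                {
                    "confidence": 0.97,
                    "label": "person",
                    "cutout": "cutout_0000000000_person_0.png",
                    "mask": "mask_0000000000_person_0.png",
                    "bounding_box": [10.0, 20.0, 410.0, 620.0],  # x1, y1, x2, y2
                    "instance": 0,
                }
            ],
        }
    ]
}

print(json.dumps(example_results, indent=4))
```

Items for which the detector returns no predictions never reach the line that sets `segments`, so code reading the file should not assume that key is present.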

### I. Imports

Importing all necessary libraries and modules. During your first run, it may take some time to import the models.

In [1]:
# General utility libraries
import os
import matplotlib.pyplot as plt
import numpy as np
import regex as re
import json

# Importing PyTorch ML libraries
import torch
import torchvision
from torchvision.transforms import ToTensor

# Importing the Models and their respective weights
from torchvision.models.detection import (
    # Faster R-CNN
    fasterrcnn_resnet50_fpn_v2,
    FasterRCNN_ResNet50_FPN_V2_Weights,
)

# Utility functions that help visualize the models and describe the model outputs.
from torchvision.io.image import read_image
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image
from PIL import ImageFont, ImageDraw, Image
from IPython.display import display
from torchvision.utils import make_grid

# Libraries for mask manipulation and generation
import cv2
from scipy.ndimage import binary_dilation, binary_erosion, binary_closing
from scipy.ndimage import binary_fill_holes
from workflow_helpers import *

### II. Create Directories and Model Results Dictionary

Outputs from the computer vision models will be stored as JSON. This part creates the directories that will hold the data and the dictionaries that will eventually become the final JSON.

In [2]:
# How many items do you want to process? Use the same value as in the Step 1 notebook to process the same number.
number_of_instances = 100

In [3]:
# Data Directories for reference
root_directory = os.getcwd()
data_directory = "workflow_data"
output_directory = os.path.join(data_directory, "image-collection-output")


In [4]:
model_dictionary = {}
model_dictionary['items'] = []

for picture in os.listdir('image-collection-output/')[:number_of_instances]:
    if picture != '.DS_Store':
        item_dictionary = {}
        resource_id = extract_number(picture)
        item_dictionary['resource_id'] = resource_id
        model_dictionary['items'].append(item_dictionary)
        print(picture,resource_id)

image_2001699137.jpg 2001699137
image_2001702332.jpg 2001702332
image_2001703618.jpg 2001703618
image_2001703638.jpg 2001703638
image_2002705861.jpg 2002705861
image_2002716781.jpg 2002716781
image_2003666591.jpg 2003666591
image_2003680531.jpg 2003680531
image_2010630036.jpg 2010630036
image_2010630192.jpg 2010630192
image_2010630446.jpg 2010630446
image_2010630700.jpg 2010630700
image_2010641712.jpg 2010641712
image_2010641826.jpg 2010641826
image_2010648441.jpg 2010648441
image_2010719313.jpg 2010719313
image_2011630135.jpg 2011630135
image_2011630582.jpg 2011630582
image_2011630694.jpg 2011630694
image_2011630889.jpg 2011630889
image_2011631396.jpg 2011631396
image_2011631448.jpg 2011631448
image_2011631890.jpg 2011631890
image_2011632545.jpg 2011632545
image_2011632658.jpg 2011632658
image_2011633142.jpg 2011633142
image_2011633149.jpg 2011633149
image_2011633233.jpg 2011633233
image_2011634248.jpg 2011634248
image_2011635657.jpg 2011635657
image_2013634071.jpg 2013634071
image_20
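
`extract_number` comes from `workflow_helpers` and is not shown in this notebook. Judging only from the printed output above, it pulls the numeric ID out of a file name. The following is a hypothetical stand-in (the regular expression and the string return type are assumptions), not the helper's actual code.

```python
import re


def extract_number_sketch(filename):
    """Return the first run of digits in a file name, or None if there is none.

    Hypothetical stand-in for workflow_helpers.extract_number.
    """
    match = re.search(r"\d+", filename)
    return match.group(0) if match else None


print(extract_number_sketch("image_2001699137.jpg"))
```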

### III. Create Item Thumbnail

Image thumbnails are created using the 'items_metadata.json' generated in Step 1. The thumbnails are output to the tool's UI, and the file paths are stored in the model_results dictionary.

In [5]:
def create_main_thumbnail(image_path, output_path, item):
    # thumbnail_name
    # resource = os.path.basename(image_path)
    base_name = os.path.basename(image_path).split('.')[0]

    # Create the resource thumbnail
    thumbnail_image = Image.open(image_path)
    original_size = thumbnail_image.size
    max_size = (480,480)
    thumbnail_image.thumbnail(max_size)

    # Create Output directory if it doesn't exist
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    thumbnail_name = f'{base_name}_thumbnail' + '.jpg'
    output_filename =  os.path.join(output_path,thumbnail_name)  
    thumbnail_image.save(output_filename)
    print(f'Saved {thumbnail_name}')

    # Saves original format and thumbnail filename to dictionary.
    item['original_format'] = original_size
    item['thumbnail'] = thumbnail_name

In [6]:
for item in model_dictionary['items']:
    # Avoid shadowing the built-in `id`
    resource_id = item['resource_id']

    image = f'../workflow/image-collection-output/image_{resource_id}.jpg'
    create_main_thumbnail(image, '../ui/data', item)


Saved image_2001699137_thumbnail.jpg
Saved image_2001702332_thumbnail.jpg
Saved image_2001703618_thumbnail.jpg
Saved image_2001703638_thumbnail.jpg
Saved image_2002705861_thumbnail.jpg
Saved image_2002716781_thumbnail.jpg
Saved image_2003666591_thumbnail.jpg
Saved image_2003680531_thumbnail.jpg
Saved image_2010630036_thumbnail.jpg
Saved image_2010630192_thumbnail.jpg
Saved image_2010630446_thumbnail.jpg
Saved image_2010630700_thumbnail.jpg
Saved image_2010641712_thumbnail.jpg
Saved image_2010641826_thumbnail.jpg
Saved image_2010648441_thumbnail.jpg
Saved image_2010719313_thumbnail.jpg
Saved image_2011630135_thumbnail.jpg
Saved image_2011630582_thumbnail.jpg
Saved image_2011630694_thumbnail.jpg
Saved image_2011630889_thumbnail.jpg
Saved image_2011631396_thumbnail.jpg
Saved image_2011631448_thumbnail.jpg
Saved image_2011631890_thumbnail.jpg
Saved image_2011632545_thumbnail.jpg
Saved image_2011632658_thumbnail.jpg
Saved image_2011633142_thumbnail.jpg
Saved image_2011633149_thumbnail.jpg
S
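
`Image.thumbnail` shrinks the image in place so that it fits inside `max_size` while keeping its aspect ratio, and it does not enlarge images that are already smaller. The arithmetic can be approximated as below. `fit_within` is a hypothetical helper written for this explanation, and Pillow's own rounding may differ by a pixel in some cases.

```python
def fit_within(size, max_size=(480, 480)):
    """Approximate the size that Pillow's Image.thumbnail would produce."""
    width, height = size
    max_width, max_height = max_size
    # Scale factor that makes both sides fit; capped at 1 so nothing is upscaled.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within((1000, 500)))  # a wide image is limited by its width
print(fit_within((300, 200)))   # an image that already fits is left alone
```

Because `original_format` is captured before the call to `thumbnail`, the JSON keeps the full-size dimensions rather than the thumbnail's.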

### IV. Load the Faster R-CNN Model and Weights

In [7]:
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT

# Load the Faster R-CNN model, keeping only detections with a score of at least 0.9
model = fasterrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.9)
model.eval()
preprocess = weights.transforms()
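
Each prediction returned by the model is a dictionary with `boxes`, `labels`, and `scores` entries, and `process_image` (defined further down) segments only the first few of them. The selection logic can be sketched with plain Python lists. `select_detections` is a hypothetical helper, and it assumes, as the notebook code also does, that detections arrive sorted by descending score.

```python
def select_detections(boxes, scores, threshold=0.9, max_objects=3, min_area=30000):
    """Return the indices of the detections that would be segmented.

    Assumes `scores` is sorted from highest to lowest, so the detections
    that pass the threshold are the first ones in the list.
    """
    passing = sum(score >= threshold for score in scores)
    kept = []
    for i in range(min(passing, max_objects)):
        x1, y1, x2, y2 = boxes[i]
        # Skip boxes whose area is too small to make a useful cutout.
        if (x2 - x1) * (y2 - y1) <= min_area:
            continue
        kept.append(i)
    return kept


boxes = [
    [0, 0, 400, 300],    # area 120000
    [10, 10, 60, 60],    # area 2500, too small
    [50, 50, 350, 450],  # area 120000
    [0, 0, 500, 500],    # passes the threshold but falls outside the top three
    [0, 0, 500, 500],    # below the threshold
]
scores = [0.99, 0.97, 0.95, 0.93, 0.40]
print(select_detections(boxes, scores))
```

Note that the cap of three applies before the area check, so an image can yield fewer than three cutouts even when more than three detections pass the threshold.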

### V. Load the EfficientSAM Model

Unlike the PyTorch model, the EfficientSAM model and weights must be downloaded locally. The cell block below pulls the original repository to generate the model. Expect it to take some time on the first run.

In [15]:
parent_dir = os.getcwd()

path_dit = os.path.join(parent_dir,'EfficientSAM')

if not os.path.exists(path_dit):
    !git clone https://github.com/yformer/EfficientSAM.git
    
os.chdir("EfficientSAM")

# Importing the EfficientSAM model and setting the correct directory
from efficient_sam.build_efficient_sam import build_efficient_sam_vitt, build_efficient_sam_vits
import zipfile

efficient_sam_vitt_model = build_efficient_sam_vitt()
efficient_sam_vitt_model.eval()

# Since EfficientSAM-S checkpoint file is >100MB, we store the zip file.
with zipfile.ZipFile("weights/efficient_sam_vits.pt.zip", 'r') as zip_ref:
    zip_ref.extractall("weights")
efficient_sam_vits_model = build_efficient_sam_vits()
efficient_sam_vits_model.eval()

os.chdir(parent_dir)


Cloning into 'EfficientSAM'...
Updating files:  92% (35/38)
Updating files:  94% (36/38)
Updating files:  97% (37/38)
Updating files: 100% (38/38)
Updating files: 100% (38/38), done.
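
EfficientSAM is prompted with points, and a box is passed as its two corners. In `process_image` the corners are paired with the labels 2 and 3, which appear to follow the convention in the EfficientSAM examples for marking the top-left and bottom-right corner of a box (an assumption worth checking against the repository). `box_to_prompt` is a hypothetical name used only for this sketch.

```python
import numpy as np


def box_to_prompt(bbox):
    """Turn an [x1, y1, x2, y2] box into the point/label arrays used as a box prompt."""
    x1, y1, x2, y2 = bbox
    input_point = np.array([[x1, y1], [x2, y2]])  # top-left, bottom-right
    input_label = np.array([2, 3])                # assumed corner labels
    return input_point, input_label


points, labels = box_to_prompt([10.0, 20.0, 410.0, 620.0])
print(points.shape, labels.tolist())
```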


In [16]:
def process_image(image_path, output_path, item, structuring_value=25,threshold =0.9):
    # Read the image
    img = read_image(image_path)

    batch = [preprocess(img)]
    # Get prediction from the model
    prediction = model(batch)[0]
    
    if len(prediction['labels']) == 0:
        print(f'No Object Detection predictions within the Scope of MS COCO dataset: {os.path.basename(image_path)}')

    else:

        # Count the predictions whose score meets the threshold
        score_len = (prediction["scores"] >= threshold).sum().item()
        # Keep at most the top 3 predictions that meet the threshold
        score_len = min(score_len, 3)


        resource = os.path.basename(image_path)
        base_name = os.path.basename(image_path).split('.')[0]
        resource_id = item['resource_id']
        item['segments'] = []

        for i in range(score_len):                
            segment = {}
            bbox =  prediction['boxes'].tolist()[i]
            # Extract bounding box coordinates

            x1 = bbox[0]
            y1 = bbox[1]
            x2 = bbox[2]
            y2 = bbox[3]
            w = x2 - x1
            h = y2 - y1

            if (h*w) <= 30000: 
                continue
            else:
                class_index = prediction['labels'][i].item()
                class_label = weights.meta["categories"][class_index]
                # print(class_label)

                
                # fig, ax = plt.subplots(1, 3, figsize=(30, 30))
                input_point = np.array([[x1, y1], [x2, y2]])
                input_label = np.array([2, 3])
                

                mask_efficient_sam_vitt = run_ours_box_or_points(image_path, input_point, input_label, efficient_sam_vitt_model)
                # show_anns_ours(mask_efficient_sam_vitt, ax[1])
                binary_mask = mask_efficient_sam_vitt
                structuring_element = np.ones((structuring_value,structuring_value), dtype=bool)
                dilated_mask = binary_dilation(binary_mask, structure=structuring_element)
                eroded_mask = binary_erosion(dilated_mask, structure=structuring_element)

                closed_mask_uint8 = (eroded_mask * 255).astype(np.uint8)

                mask_name = f'mask_{resource_id}_{class_label}_{i}' + '.png'
                mask_path = os.path.join(output_path, f'masks/{mask_name}')
                cv2.imwrite(mask_path, closed_mask_uint8)
                img_val = cv2.imread(image_path) 
                mask = cv2.imread(mask_path)

                img_foreground = np.array((mask/255)*(img_val/255)) * img_val
                na = img_foreground
                

                '''
                Important to note that part of the following code is from substack
                '''
                # Make a True/False mask of pixels whose BGR values sum to more than zero
                alpha = np.sum(na, axis=-1) > 0

                # Convert True/False to 0/255 and change type to "uint8" to match "na"
                alpha = np.uint8(alpha * 255)

                # Stack new alpha layer with existing image to go from BGR to BGRA, i.e. 3 channels to 4 channels
                res = np.dstack((na, alpha))
                img = Image.fromarray(res, mode='RGBa')

                # Save result
                cutout_name =  f'cutout_{resource_id}_{class_label}_{i}' + '.png'
                cutout_path = os.path.join(output_path, f'cutouts/{cutout_name}')
                cv2.imwrite(cutout_path, res)
                
                crop_image(cutout_path, x1, y1, x2, y2)

                resize_to_thumbnail(cutout_path)
                resize_to_thumbnail(mask_path)

                segment['confidence'] = prediction["scores"][i].item()
                segment['label'] =  class_label
                segment['cutout'] = cutout_name
                segment['mask'] =  mask_name
                segment['bounding_box'] = bbox
                segment['instance'] = i
                item['segments'].append(segment)
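

The dilation followed by erosion in `process_image` is a morphological closing: it fills small gaps and holes in the mask while leaving its outer boundary largely where it was. A toy example follows, using a 3x3 structuring element instead of the notebook's default 25x25 so that the effect is visible on a tiny array.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_dilation, binary_erosion

# A 5x5 square of True pixels with a one-pixel hole in the middle.
mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True
mask[4, 4] = False

structure = np.ones((3, 3), dtype=bool)

# Same two steps as in process_image: dilate, then erode.
closed = binary_erosion(binary_dilation(mask, structure=structure), structure=structure)

# scipy's binary_closing performs the same pair of operations in one call.
same = binary_closing(mask, structure=structure)

print(bool(closed[4, 4]), int(closed.sum()), bool(np.array_equal(closed, same)))
```

A larger structuring element closes larger gaps, which is what the `structuring_value` parameter controls.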



In [18]:
os.chdir(root_directory)

for item in model_dictionary['items']:
    # Avoid shadowing the built-in `id`
    resource_id = item['resource_id']
    image = f'image-collection-output/image_{resource_id}.jpg'
    process_image(image, '../ui/data', item)

No Object Detection predictions within the Scope of MS COCO dataset: image_2010630036.jpg
No Object Detection predictions within the Scope of MS COCO dataset: image_2010630192.jpg
No Object Detection predictions within the Scope of MS COCO dataset: image_2010630700.jpg
No Object Detection predictions within the Scope of MS COCO dataset: image_2010641826.jpg
No Object Detection predictions within the Scope of MS COCO dataset: image_2011630135.jpg
No Object Detection predictions within the Scope of MS COCO dataset: image_2011630582.jpg
No Object Detection predictions within the Scope of MS COCO dataset: image_2011630694.jpg
No Object Detection predictions within the Scope of MS COCO dataset: image_2011630889.jpg
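
For the images that do yield detections, the cutout step turns every fully black pixel of the masked image into a transparent one by adding an alpha channel. The same idea is shown here on a tiny synthetic BGR array (the pixel values are made up for illustration).

```python
import numpy as np

# 2x2 BGR image; the right column stands in for pixels blanked out by the mask.
foreground = np.array(
    [[[10, 20, 30], [0, 0, 0]],
     [[40, 50, 60], [0, 0, 0]]],
    dtype=np.uint8,
)

# Opaque (255) where any channel is non-zero, transparent (0) elsewhere.
alpha = np.uint8((foreground.sum(axis=-1) > 0) * 255)

# Append alpha as a fourth channel: BGR -> BGRA.
bgra = np.dstack((foreground, alpha))

print(bgra.shape)
print(alpha.tolist())
```

One side effect of this approach is that genuinely black pixels inside the object also become transparent.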


In [19]:
with open(os.path.join(data_directory, "model_results.json"), 'w') as f:
    json.dump(model_dictionary, f, indent=4)

print('All done! Model outputs stored as JSON and masks/extractions saved to UI')

All done! Model outputs stored as JSON and masks/extractions saved to UI
