# OrgaQuant dataset creation

This notebook walks through all the steps needed to generate ground-truth masks for the organoids in the dataset from the OrgaQuant paper. The dataset contains intestinal organoid images, but it was designed for object detection tasks and is annotated with bounding boxes. Here we use those boxes to generate a dataset for instance segmentation.

## Initialization

Import relevant libraries

In [1]:
import os
import numpy as np
import torch
import pandas as pd
import cv2


# Grounding DINO
from groundingdino.util.inference import load_image
from groundingdino.util import box_ops




Ignore possible warnings.

In [2]:
import warnings

# Suppress SettingWithCopyWarning (exposed publicly under pd.errors)
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

Set up the main directory and the data directory.

In [12]:
# Set working directory as the main directory
os.chdir("/home/ubuntu/")
# Data directory
data_dir = "/home/ubuntu/data/intestinal_organoid_dataset"

Select the device: CUDA if available, otherwise CPU.

In [4]:
# Use CUDA if possible
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


Initialize the lists that collect the information for the whole dataset:
* Image source list: contains the paths to all images.
* Boxes list: contains the boxes corresponding to the organoids in each image.
* Masks list: contains the paths to the masks generated by SAM for each image.
* Split list: contains the split (`train` or `test`) each image belongs to.

In [5]:
img_source_list = []
boxes_list = []
masks_list = []
split_list = []

In [6]:
from transformers import SamModel, SamProcessor
from utils.inference_sam import sam_inference_from_dino

# Use large encoder here
model = SamModel.from_pretrained("facebook/sam-vit-large").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-large")

## Train dataset creation

In this part, the training dataset is generated from the boxes given in `train_labels.csv`.

### 1. Get the information from the .csv.

In [7]:
df_boxes_train = pd.read_csv(os.path.join(data_dir, "train_labels.csv"),
                             header = None,
                             names= ['image_path', 'x1', 'y1', 'x2', 'y2', 'class_name'])
# The file name of each image is saved in a new column of the dataframe.
df_boxes_train['image_name'] = df_boxes_train['image_path'].apply(os.path.basename)
print(df_boxes_train.shape)


(13004, 7)
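
The label file has one row per box, so an image with several organoids spans several rows. The sketch below shows the grouping that the later steps rely on. It uses only the standard library and three hypothetical rows (`SAMPLE_CSV`, `group_boxes_by_image`, and the file names are illustrative, not part of the dataset); in the notebook itself pandas does this work.

```python
import csv
import io
import os
from collections import defaultdict

# Hypothetical rows in the column order used for train_labels.csv:
# image_path, x1, y1, x2, y2, class_name
SAMPLE_CSV = """\
train/images/img_a.jpg,10,20,110,140,organoid
train/images/img_a.jpg,200,50,260,120,organoid
train/images/img_b.jpg,5,5,55,65,organoid
"""


def group_boxes_by_image(csv_text):
    """Map each image file name to its list of [x1, y1, x2, y2] boxes."""
    boxes = defaultdict(list)
    for image_path, x1, y1, x2, y2, _class_name in csv.reader(io.StringIO(csv_text)):
        image_name = os.path.basename(image_path)
        boxes[image_name].append([int(x1), int(y1), int(x2), int(y2)])
    return dict(boxes)


boxes_by_image = group_boxes_by_image(SAMPLE_CSV)
for image_name, image_boxes in boxes_by_image.items():
    print(image_name, len(image_boxes))
```

With the real file, the number of keys would be the number of images and the total number of boxes would equal the number of rows.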


### 2. Lists for train dataset

Create the lists that contain train dataset information.

In [8]:
train_img_source_list = []
train_boxes_list = []
train_masks_list = []

### 3. Generate the masks for every image

Given the boxes contained in `train_labels.csv`, generate the masks with SAM.

We create a list containing all the image names. For each image name, we get all of its boxes from the dataframe. Then we run the SAM model with the boxes as prompts and save each mask separately.
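
Before the boxes are passed on, the notebook converts them from pixel corner coordinates (`x1, y1, x2, y2`) to normalized center format (`cx, cy, w, h`), which is the box format Grounding DINO uses. As a minimal sketch of that arithmetic for a single box, in plain Python (the function name `xyxy_to_normalized_cxcywh` and the example values are illustrative assumptions):

```python
def xyxy_to_normalized_cxcywh(box, width, height):
    """Convert a pixel-space [x1, y1, x2, y2] box to normalized [cx, cy, w, h]."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / width   # box center, as a fraction of image width
    cy = (y1 + y2) / 2 / height  # box center, as a fraction of image height
    w = (x2 - x1) / width        # box width, as a fraction of image width
    h = (y2 - y1) / height       # box height, as a fraction of image height
    return [cx, cy, w, h]


# A 200x100 box centered in a 400x200 image
box = xyxy_to_normalized_cxcywh([100, 50, 300, 150], width=400, height=200)
print(box)  # [0.5, 0.5, 0.5, 0.5]
```

The notebook applies the same two steps in the opposite order (divide by `[W, H, W, H]`, then call `box_ops.box_xyxy_to_cxcywh`), which gives the same result.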

In [9]:
unique_img_names = np.unique(np.array(df_boxes_train['image_name'].tolist()))
for i, name in enumerate(unique_img_names):
    # Save image path
    img_path = os.path.join(data_dir, "train", "images", name)
    train_img_source_list.append(img_path)
    img_source_list.append(img_path)
    # Save split in dataset
    split_list.append("train")
    # Save image name root
    root, _ = os.path.splitext(name)

    # Load the image
    image_source, _ = load_image(img_path)
    H, W, _ = image_source.shape

    # Get all boxes corresponding to the image as an array of shape [n_boxes, 4]
    df_image = df_boxes_train.query("image_name == @name")
    image_boxes = df_image[['x1', 'y1', 'x2', 'y2']].to_numpy()
    train_boxes_list.append(image_boxes)
    boxes_list.append(image_boxes)
    # Normalize the boxes by the image size and convert them from xyxy to cxcywh
    scale = torch.tensor([W, H, W, H], dtype=torch.float32)
    sam_boxes = box_ops.box_xyxy_to_cxcywh(torch.tensor(image_boxes, dtype=torch.float32) / scale)
    
    # Get masks with SAM
    masks, _, _ = sam_inference_from_dino(image_source, sam_boxes, 
                                          model, processor, device)
    # Transform masks to a numpy array of shape [n_masks, height, width]
    np_masks = (masks[0].numpy())[:,0,:,:]*1
    # Save the masks individually and save the path of each mask
    image_masks = []
    for count, mask in enumerate(np_masks):
        # Zero-pad the mask index to two digits (e.g. "_03")
        mask_path = os.path.join(data_dir, "train", "masks", f"{root}_{count:02d}.png")
        # Save the mask as an image
        cv2.imwrite(mask_path, mask * 255)
        # Save the mask location
        image_masks.append(mask_path)
    masks_list.append(image_masks)
    train_masks_list.append(image_masks)
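
Each mask file is named after its image plus a mask index padded to two digits. A small sketch of that naming scheme, under the assumption that the helper name `mask_path` and the example file names are purely illustrative:

```python
import os


def mask_path(masks_dir, image_name, index):
    """Build the path of the index-th mask of an image, e.g. img_a.jpg -> img_a_03.png."""
    root, _ = os.path.splitext(image_name)
    # :02d pads indices below 10 with a leading zero and leaves larger ones unchanged
    return os.path.join(masks_dir, f"{root}_{index:02d}.png")


print(os.path.basename(mask_path("masks", "img_a.jpg", 3)))   # img_a_03.png
print(os.path.basename(mask_path("masks", "img_a.jpg", 12)))  # img_a_12.png
```

With two-digit padding, the file names of one image sort alphabetically in mask order as long as the image has at most 100 masks (indices 0 to 99).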

### 4. Get dataset dimensions

In [10]:
train_images = 0
train_masks = 0
for i, masks in enumerate(train_masks_list):
    if len(masks) > 0:
        train_images += 1
        train_masks += len(masks)

print("TRAIN DATASET")
print("Total number of images:", train_images)
print("Total number of masks:", train_masks)

TRAIN DATASET
Total number of images: 1630
Total number of masks: 13004
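
The mask total matches the 13004 rows of the label file, which is consistent with one mask per box. A per-image version of that check could look like the sketch below; the function name and the two demo lists are hypothetical stand-ins for `train_boxes_list` and `train_masks_list`.

```python
def images_with_mismatched_counts(boxes_list, masks_list):
    """Return the indices of images whose number of masks differs from their number of boxes."""
    return [
        i
        for i, (boxes, masks) in enumerate(zip(boxes_list, masks_list))
        if len(boxes) != len(masks)
    ]


# Hypothetical stand-ins: two images, with two boxes and one box respectively
demo_boxes_list = [[[10, 20, 110, 140], [200, 50, 260, 120]], [[5, 5, 55, 65]]]
demo_masks_list = [["img_a_00.png", "img_a_01.png"], ["img_b_00.png"]]

mismatched = images_with_mismatched_counts(demo_boxes_list, demo_masks_list)
print(mismatched)  # [] means every image has one mask per box
```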


## Test dataset creation

In this part, the test dataset is generated from the boxes given in `test_labels.csv`.

### 1. Get the information from the .csv.

In [11]:
df_boxes_test = pd.read_csv(os.path.join(data_dir, "test_labels.csv"))
# The file name of each image is saved in a new column of the dataframe.
df_boxes_test['image_name'] = df_boxes_test['image_path'].apply(os.path.basename)
print(df_boxes_test.shape)


(1135, 7)


### 2. Lists for test dataset

Create the lists that contain test dataset information.

In [12]:
test_img_source_list = []
test_boxes_list = []
test_masks_list = []
test_split = []

### 3. Generate the masks for every image

Given the boxes contained in `test_labels.csv`, generate the masks with SAM.

We create a list containing all the image names. For each image name, we get all of its boxes from the dataframe. Then we run the SAM model with the boxes as prompts and save each mask separately.
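
SAM box prompts are given as pixel-space corner coordinates (`x1, y1, x2, y2`), while the notebook hands normalized `cx, cy, w, h` boxes to `sam_inference_from_dino`. The internals of that helper are not shown here, so the following is only a sketch of the inverse conversion such a helper would presumably need; the function name `normalized_cxcywh_to_xyxy` and the values are assumptions for illustration.

```python
def normalized_cxcywh_to_xyxy(box, width, height):
    """Convert a normalized [cx, cy, w, h] box back to pixel-space [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return [x1, y1, x2, y2]


# A box covering the central half of a 400x200 image in each dimension
pixel_box = normalized_cxcywh_to_xyxy([0.5, 0.5, 0.5, 0.5], width=400, height=200)
print(pixel_box)  # [100.0, 50.0, 300.0, 150.0]
```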

In [13]:
unique_img_names = np.unique(np.array(df_boxes_test['image_name'].tolist()))
for i, name in enumerate(unique_img_names):
    # Save image path
    img_path = os.path.join(data_dir, "test", "images", name)
    test_img_source_list.append(img_path)
    img_source_list.append(img_path)
    # Save split in dataset
    split_list.append("test")
    test_split.append("test")
    # Save image name root
    root, _ = os.path.splitext(name)

    # Load the image
    image_source, _ = load_image(img_path)
    H, W, _ = image_source.shape

    # Get all boxes corresponding to the image as an array of shape [n_boxes, 4]
    df_image = df_boxes_test.query("image_name == @name")
    image_boxes = df_image[['x1', 'y1', 'x2', 'y2']].to_numpy()
    test_boxes_list.append(image_boxes)
    boxes_list.append(image_boxes)
    # Normalize the boxes by the image size and convert them from xyxy to cxcywh
    scale = torch.tensor([W, H, W, H], dtype=torch.float32)
    sam_boxes = box_ops.box_xyxy_to_cxcywh(torch.tensor(image_boxes, dtype=torch.float32) / scale)
    
    # Get masks with SAM
    masks, _, _ = sam_inference_from_dino(image_source, sam_boxes, 
                                          model, processor, device)
    # Transform masks to a numpy array of shape [n_masks, height, width]
    np_masks = (masks[0].numpy())[:,0,:,:]*1
    # Save the masks individually and save the path of each mask
    image_masks = []
    for count, mask in enumerate(np_masks):
        # Zero-pad the mask index to two digits (e.g. "_03")
        mask_path = os.path.join(data_dir, "test", "masks", f"{root}_{count:02d}.png")
        # Save the mask as an image
        cv2.imwrite(mask_path, mask * 255)
        # Save the mask location
        image_masks.append(mask_path)
    masks_list.append(image_masks)
    test_masks_list.append(image_masks)

### 4. Get dataset dimensions

In [14]:
test_images = 0
test_masks = 0
for i, masks in enumerate(test_masks_list):
    if len(masks) > 0:
        test_images += 1
        test_masks += len(masks)

print("TEST DATASET")
print("Total number of images:", test_images)
print("Total number of masks:", test_masks)

TEST DATASET
Total number of images: 112
Total number of masks: 1135


## FINAL: Dataset creation

Now we can create the file needed to load the information as a dataset later. To do so, we build a pandas dataframe and save it in JSON Lines format (one JSON record per line).

In [15]:
df = pd.DataFrame(list(zip(img_source_list, boxes_list, masks_list, split_list)),
               columns =['img', 'boxes', 'masks', 'split'])

df.to_json(data_dir + "/metadata.json", orient = "records", lines = True)
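
With `orient = "records"` and `lines = True`, the file holds one JSON object per line rather than a single JSON array. The sketch below writes and reads that layout with the standard library only, using two hypothetical records with the same four keys; `write_jsonl`, `read_jsonl`, and the paths are illustrative. In pandas, `pd.read_json(path, lines=True)` reads the same layout.

```python
import json
import os
import tempfile

# Hypothetical records with the same keys as the dataframe columns
records = [
    {"img": "train/images/img_a.jpg", "boxes": [[10, 20, 110, 140]],
     "masks": ["train/masks/img_a_00.png"], "split": "train"},
    {"img": "test/images/img_b.jpg", "boxes": [[5, 5, 55, 65]],
     "masks": ["test/masks/img_b_00.png"], "split": "test"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metadata.json")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

# Select one split, as a dataset loader would
train_rows = [row for row in loaded if row["split"] == "train"]
print(len(loaded), len(train_rows))  # 2 1
```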

Get the size of the dataset.

In [17]:
print("------------------")
print("TOTAL DATASET")
print("Total number of images:", train_images + test_images)
print("Total number of masks:", train_masks + test_masks)
print("------------------")
print("TRAIN DATASET")
print("Total number of images:", train_images)
print("Total number of masks:", train_masks)
print("------------------")
print("TEST DATASET")
print("Total number of images:", test_images)
print("Total number of masks:", test_masks)
print("------------------")

------------------
TOTAL DATASET
Total number of images: 1742
Total number of masks: 14139
------------------
TRAIN DATASET
Total number of images: 1630
Total number of masks: 13004
------------------
TEST DATASET
Total number of images: 112
Total number of masks: 1135
------------------
