# Organoseg dataset creation

This notebook outlines the full sequence of steps used to generate the patches for the dataset used in the Organoseg paper. The dataset contains colon organoid images and is designed for semantic segmentation tasks. It has 64 original images, which we augment here by splitting each one into patches.

## Initialization

Import relevant libraries

In [54]:
import os
import pandas as pd
import cv2
from patchify import patchify
import tifffile
import numpy as np

Set up the main directory and the data directory.

In [2]:
# Set working directory as the main directory
os.chdir("/home/ubuntu/")
# Data directory
data_dir = "/home/ubuntu/data/colon_dataset"

Initialize the lists that collect the dataset information:
* `img_source_list`: contains the paths to all images.
* `masks_list`: contains the paths to the corresponding masks.

In [3]:
img_source_list = []
masks_list = []

## Original semantic segmentation dataset creation
In this part we read the original images, which are in `.tif` format, and record the path of each image and of its mask.
### Directories

In [4]:
original_images_dir = os.path.join(data_dir, "original", "colon_images")
original_masks_dir = os.path.join(data_dir, "original", "colon_masks")

### Image conversion and saving information
Get all images in the directory in `.tif` format and convert them to `.png`. Save the paths to the images and their corresponding masks.

In [5]:
# Get the names of all .tif files in the image directory
# (this skips any .png files written by a previous run of this cell).
list_original_images_name = sorted(
    f for f in os.listdir(original_images_dir) if f.lower().endswith((".tif", ".tiff"))
)

# Convert each image and record its paths
for file_name in list_original_images_name:
    # Get image as array
    tif_array = tifffile.imread(os.path.join(original_images_dir, file_name))
    # Get image source name
    img_name, _ = os.path.splitext(file_name)

    # Save image in .png format
    png_path = os.path.join(original_images_dir, img_name + ".png")
    cv2.imwrite(png_path, tif_array)

    # Save image path
    img_source_list.append(png_path)
    # Save image mask path
    masks_list.append(os.path.join(original_masks_dir, "slice_" + img_name + "_bw.png"))
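
One detail worth checking in the conversion above is channel order: `cv2.imwrite` interprets a 3-channel array as BGR, while `tifffile.imread` returns the channels in the order stored in the file. Whether this matters here depends on how the `.tif` files were saved, which this notebook does not state. The following is only a sketch of the swap on a synthetic array, assuming RGB input.

```python
import numpy as np

# Synthetic 2 x 2 image, pure red everywhere (assumption: the input is RGB)
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 255

# Reverse the last axis to obtain the BGR order that OpenCV writers expect
bgr = np.ascontiguousarray(rgb[..., ::-1])

print(rgb[0, 0], bgr[0, 0])
```

With real data, `cv2.cvtColor(tif_array, cv2.COLOR_RGB2BGR)` performs the same swap. If the three channels of an image are identical, the swap has no visible effect.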


### Dimensions of original dataset

In [6]:
H, W, _ = tif_array.shape
print("This dataset contains", len(img_source_list), "images of size", H, "x", W, ".")

This dataset contains 64 images of size 648 x 864 .


## Creation of semantic segmentation augmented dataset
In this part we split each original image into 4 non-overlapping patches to build an augmented dataset.
### Directories

In [7]:
augmented_images_dir = os.path.join(data_dir, "augmented", "colon_images")
augmented_masks_dir = os.path.join(data_dir, "augmented", "colon_masks")
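
The write calls further down assume that these output directories already exist, since `cv2.imwrite` does not create missing parent directories. A minimal sketch of creating them up front, using a temporary folder as a stand-in for `data_dir` (an assumption made only so the example runs anywhere):

```python
import os
import tempfile

# Temporary root standing in for data_dir (illustration only)
root = tempfile.mkdtemp()

for sub in ("colon_images", "colon_masks"):
    target = os.path.join(root, "augmented", sub)
    # exist_ok=True makes the call safe to repeat when the cell is rerun
    os.makedirs(target, exist_ok=True)
    print(target, os.path.isdir(target))
```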

In [66]:
augmented_image_path_list = []
augmented_mask_path_list = []

### Patches creation and saving information

Split each image into 4 patches and split its mask in the same way.

In [46]:
# Patch height and width: half of the image size in each dimension
h_patches = H // 2
w_patches = W // 2

# Desired patch size
patch_size = (h_patches, w_patches, 3)

# Step equal to the patch size, so that patches do not overlap
step_size_h = h_patches
step_size_w = w_patches

# Create patches
patches = patchify(tif_array, patch_size, step=(step_size_h, step_size_w, 1))

print(patches.shape[0]*patches.shape[1], "patches for image of size", 
      patches.shape[3], "x", patches.shape[4])

4 patches for image of size 324 x 432
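
To make the 2 x 2 split concrete, here is a NumPy-only sketch of the same non-overlapping tiling. It is an illustration, not the `patchify` implementation; the helper name `split_into_quadrants` is hypothetical. Unlike the `patchify` output used above, which carries an extra singleton axis at index 2, this sketch returns a 5-dimensional array.

```python
import numpy as np

def split_into_quadrants(image):
    """Split an (H, W, C) array into a 2 x 2 grid of non-overlapping patches."""
    H, W, C = image.shape
    h, w = H // 2, W // 2
    # Crop any odd remainder, then expose the grid as (2, h, 2, w, C)
    grid = image[: 2 * h, : 2 * w].reshape(2, h, 2, w, C)
    # Reorder the axes to (row, col, h, w, C)
    return grid.transpose(0, 2, 1, 3, 4)

# Synthetic image with the same size as the originals (648 x 864, 3 channels)
image = np.arange(648 * 864 * 3, dtype=np.uint32).reshape(648, 864, 3)
patches = split_into_quadrants(image)
print(patches.shape)  # (2, 2, 324, 432, 3)
```

Each `patches[r, c]` equals the slice `image[r*324:(r+1)*324, c*432:(c+1)*432]`, which is the patch saved with suffix `_<r><c>` in the loop that follows.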


Write each patch to disk and record the paths of the new image and mask patches in the lists.

In [67]:
# Use only the .png files, so that leftover .tif sources are not processed twice
images_name = sorted(f for f in os.listdir(original_images_dir) if f.lower().endswith(".png"))

for file_name in images_name:
    # Get image as array
    image_array = cv2.imread(os.path.join(original_images_dir, file_name))
    # Get image source name
    img_name, _ = os.path.splitext(file_name)
    # Get mask as array
    mask_array = cv2.imread(os.path.join(original_masks_dir, "slice_" + img_name + "_bw.png"))

    # Get patches for the image
    image_patches = patchify(image_array, patch_size, step=(step_size_h, step_size_w, 1))
    # Get patches for the mask
    mask_patches = patchify(mask_array, patch_size, step=(step_size_h, step_size_w, 1))

    # Run through patches
    for r in range(image_patches.shape[0]):
        for c in range(image_patches.shape[1]):
            # Patch files are named <name>_<row><col>
            patch_name = img_name + "_" + str(r) + str(c)
            image_patch_path = os.path.join(augmented_images_dir, patch_name + ".png")
            mask_patch_path = os.path.join(augmented_masks_dir, "slice_" + patch_name + "_bw.png")

            # Save image patch in .png format and record its path
            cv2.imwrite(image_patch_path, image_patches[r, c, 0, :, :])
            img_source_list.append(image_patch_path)
            augmented_image_path_list.append(image_patch_path)
            # Save mask patch in .png format and record its path
            cv2.imwrite(mask_patch_path, mask_patches[r, c, 0, :, :])
            masks_list.append(mask_patch_path)
            augmented_mask_path_list.append(mask_patch_path)
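
The naming convention in this loop is what links every image patch to its mask patch later on. A small sketch of that convention in isolation; the helper name `patch_names` and the example image name are hypothetical.

```python
from itertools import product

def patch_names(img_name, n_rows=2, n_cols=2):
    """List (image_file, mask_file) name pairs for every patch of one image."""
    pairs = []
    for r, c in product(range(n_rows), range(n_cols)):
        suffix = "_" + str(r) + str(c)
        pairs.append((img_name + suffix + ".png",
                      "slice_" + img_name + suffix + "_bw.png"))
    return pairs

# Hypothetical image name, for illustration only
for image_file, mask_file in patch_names("img_001"):
    print(image_file, "->", mask_file)
```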

### Dimensions of patch dataset

In [71]:
print("This dataset contains", len(augmented_image_path_list), "images of size", patch_size[0], "x", patch_size[1], ".")

This dataset contains 256 images of size 324 x 432 .


## FINAL: Semantic segmentation dataset creation

Now we can create the metadata file used later to load the dataset. We build a pandas DataFrame with one row per image and save it in JSON Lines format (one JSON record per line).

In [72]:
df = pd.DataFrame(list(zip(img_source_list, masks_list)),
               columns =['img', 'masks'])

df.to_json(data_dir + "/metadata_semantic_segmentation.json", orient = "records", lines = True)
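
With `orient="records"` and `lines=True`, the file holds one JSON object per line rather than a single JSON array. The sketch below writes and reads that layout with the standard library only; the records and the file name `metadata_example.json` are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records mirroring the 'img' and 'masks' columns
records = [
    {"img": "colon_images/a.png", "masks": "colon_masks/slice_a_bw.png"},
    {"img": "colon_images/b.png", "masks": "colon_masks/slice_b_bw.png"},
]

path = os.path.join(tempfile.mkdtemp(), "metadata_example.json")

# JSON Lines layout: one JSON object per line
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back line by line, skipping empty lines
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["img"])
```

In pandas, `pd.read_json(path, lines=True)` reads the same layout back into a DataFrame.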

Get the size of the dataset.

In [73]:
print("------------------")
print("TOTAL DATASET")
print("Total number of images:", len(img_source_list))
print("Total number of masks:", len(masks_list))
print("------------------")
print("ORIGINAL DATASET")
print("Total number of images:", len(img_source_list) - len(augmented_image_path_list))
print("Total number of masks:", len(masks_list) - len(augmented_mask_path_list))
print("------------------")
print("AUGMENTED DATASET")
print("Total number of images:", len(augmented_image_path_list))
print("Total number of masks:", len(augmented_mask_path_list))
print("------------------")

------------------
TOTAL DATASET
Total number of images: 320
Total number of masks: 320
------------------
ORIGINAL DATASET
Total number of images: 64
Total number of masks: 64
------------------
AUGMENTED DATASET
Total number of images: 256
Total number of masks: 256
------------------


## Instance segmentation augmented dataset

Here we create a dataset containing one mask for every organoid in the augmented (patch) images. The instance masks are extracted one by one from the semantic segmentation masks using connected components analysis.

### Image loading

First we set the paths to the image and mask directories, and list the file names of all images.

In [102]:
augmented_seg_images_dir = os.path.join(data_dir, "augmented", "colon_images")
augmented_seg_masks_dir = os.path.join(data_dir, "augmented", "colon_masks")
augmented_inst_masks_dir = os.path.join(data_dir, "augmented", "colon_instance_masks")

# File names of the images
list_augmented_seg_images_name = os.listdir(augmented_seg_images_dir)

Create empty lists that will hold all the information saved later in `metadata_instance_segmentation.json`.

In [103]:
image_path_list = []
mask_path_list = []
box_list = []

We now create the instance segmentation masks for all the images.

In [None]:
for file_name in list_augmented_seg_images_name:
    # Paths to the image and its semantic segmentation mask
    img_name, _ = os.path.splitext(file_name)
    img_path = os.path.join(augmented_seg_images_dir, file_name)
    seg_mask_path = os.path.join(augmented_seg_masks_dir, "slice_" + img_name + "_bw.png")
    # Load the binary semantic mask as a single-channel image
    mask = cv2.imread(seg_mask_path, cv2.IMREAD_GRAYSCALE)

    # Perform connected components analysis
    output = cv2.connectedComponentsWithStats(mask, connectivity=8, ltype=cv2.CV_32S)
    # Here we save the relevant information about the connected components
    (numLabels, labels, stats, centroids) = output

    # Save one mask per component, plus its bounding box and paths,
    # to be written later to the metadata file.
    # Label 0 is the background, so instances start at label 1.
    for label in range(1, numLabels):
        # Zero-pad the label to two digits
        num = str(label).zfill(2)
        # Binary mask (0 or 255) for this label
        mask_for_label = (labels == label).astype(np.uint8) * 255
        # Save it
        inst_mask_path = os.path.join(augmented_inst_masks_dir, "slice_" + img_name + "_" + num + "_bw.png")
        cv2.imwrite(inst_mask_path, mask_for_label)

        # Bounding box [x_min, y_min, x_max, y_max] from stats (left, top, width, height)
        x, y, w, h = stats[label, :4]
        box_mask = [x, y, x + w, y + h]

        # Save information in the lists
        image_path_list.append(img_path)
        mask_path_list.append(inst_mask_path)
        box_list.append(box_mask)
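
To show what the connected components step computes, here is a pure NumPy and standard library sketch of 8-connected labelling with bounding boxes. It is a plain breadth-first search written for illustration, not OpenCV's algorithm, and the function name `label_components` is hypothetical. The boxes follow the same `[x_min, y_min, x_max, y_max]` convention as above, with the maximum coordinates exclusive (left + width, top + height).

```python
from collections import deque

import numpy as np

def label_components(mask):
    """Label 8-connected foreground regions and return (labels, boxes).

    boxes[i] is [x_min, y_min, x_max, y_max] for label i + 1.
    """
    H, W = mask.shape
    labels = np.zeros((H, W), dtype=np.int32)
    boxes = []
    current = 0
    for y0 in range(H):
        for x0 in range(W):
            # Skip background pixels and pixels that already have a label
            if mask[y0, x0] == 0 or labels[y0, x0] != 0:
                continue
            current += 1
            labels[y0, x0] = current
            queue = deque([(y0, x0)])
            x_min, y_min, x_max, y_max = x0, y0, x0, y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                # Visit the 8 neighbours (diagonals included)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < H and 0 <= nx < W
                                and mask[ny, nx] != 0 and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
            boxes.append([x_min, y_min, x_max + 1, y_max + 1])
    return labels, boxes

mask = np.zeros((6, 8), dtype=np.uint8)
mask[0:2, 0:3] = 255   # first blob
mask[2, 3] = 255       # touches the first blob only diagonally
mask[4:6, 6:8] = 255   # second, separate blob

labels, boxes = label_components(mask)
print(labels.max(), boxes)
```

With 8-connectivity the diagonal pixel joins the first blob, so the example yields two components. With 4-connectivity it would be counted as a third one.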

Dimensions of the dataset.

In [105]:
print("There are", len(list_augmented_seg_images_name), "images with a total number of", len(mask_path_list), "organoids.")

There are 256 images with a total number of 5706 organoids.


### Save metadata

Here we save all the information in the `metadata_instance_segmentation.json` file.

In [106]:
df = pd.DataFrame(list(zip(image_path_list, box_list, mask_path_list)),
               columns =['img', 'box', 'mask'])

df.to_json(data_dir + "/metadata_instance_segmentation.json", orient = "records", lines = True)
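
The instance metadata has one row per organoid, so each image appears once for every instance it contains. A loader will often want the annotations grouped per image; the following is one possible sketch of that grouping, using made-up records with the same `img`, `box` and `mask` fields.

```python
from collections import defaultdict

# Hypothetical records mirroring the 'img', 'box' and 'mask' columns
records = [
    {"img": "a_00.png", "box": [0, 0, 4, 3], "mask": "slice_a_00_01_bw.png"},
    {"img": "a_00.png", "box": [6, 4, 8, 6], "mask": "slice_a_00_02_bw.png"},
    {"img": "a_01.png", "box": [1, 1, 5, 5], "mask": "slice_a_01_01_bw.png"},
]

# Collect all boxes and instance masks that belong to the same image
per_image = defaultdict(lambda: {"boxes": [], "masks": []})
for rec in records:
    per_image[rec["img"]]["boxes"].append(rec["box"])
    per_image[rec["img"]]["masks"].append(rec["mask"])

for img, ann in per_image.items():
    print(img, len(ann["boxes"]), "organoids")
```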