# Data aggregation

This notebook aggregates the metadata of all our datasets into a single `.json` file, so that SAM can be trained on all the data we have collected. It also generates a smaller dataset for testing implementations.

## Initialization

Import the relevant libraries.

In [1]:
import os
import numpy as np
import torch
import pandas as pd
import json
import random

Set up the main directory and the data directory.

In [2]:
# Set working directory as the main directory
os.chdir("/home/ubuntu/")
# Data directory
data_dir = "/home/ubuntu/data/"

Use CUDA if it is available, otherwise fall back to the CPU.

In [3]:
# Use CUDA if possible
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


### Private dataset loading

Here we load the metadata of the private dataset and store each field in a list.

In [5]:
with open(data_dir + "private/metadata.json", 'r') as file:
    data = [json.loads(line) for line in file]

df = pd.concat([pd.json_normalize(entry) for entry in data], ignore_index=True)

private_img_source_list = list(df['img'])
private_boxes_list = list(df['box'])
private_masks_list = list(df['mask'])
private_split_list = list(df["split"])

In [6]:
print(len(private_img_source_list), len(private_boxes_list), len(private_masks_list), len(private_split_list))

25962 25962 25962 25962
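
The loading cell above reads the metadata file one JSON object per line. As a minimal sketch, assuming the records are flat (no nested fields that need `json_normalize`), `pd.read_json` with `lines=True` should load the same table in a single call. The temporary file and its two records are made up for illustration and are not taken from the real dataset.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records mimicking the fields used in this notebook
records = [
    {"img": "images/a.png", "box": [0, 0, 10, 10], "mask": "masks/a.png", "split": "train"},
    {"img": "images/b.png", "box": [5, 5, 20, 20], "mask": "masks/b.png", "split": "test"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metadata.json")
    # Write one JSON object per line, the layout the loading cell expects
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # lines=True tells pandas to parse one JSON object per line
    demo_df = pd.read_json(path, lines=True)

demo_img_list = list(demo_df["img"])
demo_split_list = list(demo_df["split"])
print(len(demo_img_list), len(demo_split_list))
```

If the real records contain nested objects, the `json_normalize` approach used in the cell above is the safer choice, since it flattens them into columns.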


### Intestinal organoid dataset loading

Here we load the metadata of the intestinal organoid dataset and store each field in a list.

In [7]:
with open(data_dir + "intestinal_organoid_dataset/metadata.json", 'r') as file:
    data = [json.loads(line) for line in file]

df = pd.concat([pd.json_normalize(entry) for entry in data], ignore_index=True)

intestinalorg_img_source_list = list(df['img'])
intestinalorg_boxes_list = list(df['box'])
intestinalorg_masks_list = list(df['mask'])
intestinalorg_split_list = list(df["split"])

In [8]:
print(len(intestinalorg_img_source_list), len(intestinalorg_boxes_list), len(intestinalorg_masks_list), len(intestinalorg_split_list))

14139 14139 14139 14139


## Modify paths of every dataset

The aggregated `metadata.json` file will be located in the `/data/` folder, so the paths stored in the `img` and `mask` fields must be prefixed with the folder name of the dataset each element belongs to.

In [9]:
for i in range(len(private_img_source_list)):
    private_img_source_list[i] = "private/" + private_img_source_list[i]
    private_masks_list[i] = "private/" + private_masks_list[i]

for i in range(len(intestinalorg_img_source_list)):
    intestinalorg_img_source_list[i] = "intestinal_organoid_dataset/" + intestinalorg_img_source_list[i]
    intestinalorg_masks_list[i] = "intestinal_organoid_dataset/" +  intestinalorg_masks_list[i]
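
The loops above modify the lists in place, so running the cell a second time would prepend the prefix twice. A minimal sketch of an alternative, using a hypothetical helper `add_prefix` (not part of the notebook) that returns a new list and leaves its input untouched:

```python
def add_prefix(paths, prefix):
    """Return a new list with `prefix` prepended to every relative path."""
    return [prefix + p for p in paths]


# Hypothetical relative paths, as they might appear in a per-dataset metadata file
demo_imgs = ["images/0001.png", "images/0002.png"]
demo_prefixed = add_prefix(demo_imgs, "private/")
print(demo_prefixed)
```

Assigning the result to a new name keeps the original lists available, which makes the cell safe to re-run.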

## Get smaller dataset

Here we build a smaller dataset for experiments by randomly keeping 10% of each split of every dataset.

### Private dataset

We randomly sample 10% of the `train`, `val`, and `test` splits of this dataset.

In [16]:
random.seed(98)

# Filter elements containing "train"
private_train_indices = [index for index, value in enumerate(private_split_list) if "train" in value]

# Choose 10% of the filtered "train" elements randomly
num_samples = int(len(private_train_indices) * 0.1)
private_train_indices = random.sample(private_train_indices, num_samples)

# Filter elements containing "val"
private_val_indices = [index for index, value in enumerate(private_split_list) if "val" in value]

# Choose 10% of the filtered "val" elements randomly
num_samples = int(len(private_val_indices) * 0.1)
private_val_indices = random.sample(private_val_indices, num_samples)

# Filter elements containing "test"
private_test_indices = [index for index, value in enumerate(private_split_list) if "test" in value]

# Choose 10% of the filtered "test" elements randomly
num_samples = int(len(private_test_indices) * 0.1)
private_test_indices = random.sample(private_test_indices, num_samples)


In [17]:
print('Length of small private train split:', len(private_train_indices))
print('Length of small private validation split:', len(private_val_indices))
print('Length of small private test split:', len(private_test_indices))

Length of small private train split: 2079
Length of small private validation split: 251
Length of small private test split: 266
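
The three filter-and-sample steps above follow the same pattern, so they could be factored into one function. The sketch below is an illustration under assumptions: `sample_split_indices` is a hypothetical helper, the split labels are made up, and a local `random.Random` generator stands in for the global seed. It keeps the substring test (`in`) used in the cell above; an exact comparison (`==`) would be stricter if labels could overlap.

```python
import random


def sample_split_indices(split_list, split_name, fraction, rng):
    """Randomly pick `fraction` of the indices whose split label contains `split_name`."""
    indices = [i for i, value in enumerate(split_list) if split_name in value]
    num_samples = int(len(indices) * fraction)  # truncates, as in the cell above
    return rng.sample(indices, num_samples)


# Hypothetical split labels: 60 train, 20 val, 20 test
demo_split_list = ["train"] * 60 + ["val"] * 20 + ["test"] * 20
rng = random.Random(98)  # local generator, so the global random state is untouched

demo_indices = {
    name: sample_split_indices(demo_split_list, name, 0.1, rng)
    for name in ("train", "val", "test")
}
print({name: len(idx) for name, idx in demo_indices.items()})
```

Because `int()` truncates, a split with fewer than 10 elements would contribute no samples at a fraction of 0.1.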


### Intestinal organoid dataset

We randomly sample 10% of the `train` and `test` splits of this dataset.

In [18]:
random.seed(98)

# Filter elements containing "train"
intestinal_train_indices = [index for index, value in enumerate(intestinalorg_split_list) if "train" in value]

# Choose 10% of the filtered "train" elements randomly
num_samples = int(len(intestinal_train_indices) * 0.1)
intestinal_train_indices = random.sample(intestinal_train_indices, num_samples)

# Filter elements containing "test"
intestinal_test_indices = [index for index, value in enumerate(intestinalorg_split_list) if "test" in value]

# Choose 10% of the filtered "test" elements randomly
num_samples = int(len(intestinal_test_indices) * 0.1)
intestinal_test_indices = random.sample(intestinal_test_indices, num_samples)

In [19]:
print('Length of small intestinal organoid train split:', len(intestinal_train_indices))
print('Length of small intestinal organoid test split:', len(intestinal_test_indices))

Length of small intestinal organoid train split: 1300
Length of small intestinal organoid test split: 113


## Dataset aggregation

Finally, we aggregate the datasets into two files: a complete dataset that contains all the data from the two original sources, and a smaller dataset that will be used for experiments.

### Small dataset aggregation

First we concatenate the sampled entries of both datasets into lists.

In [28]:
small_dataset_img_source_list = [private_img_source_list[i] for i in private_train_indices] + [private_img_source_list[i] for i in private_val_indices] + [private_img_source_list[i] for i in private_test_indices] + [intestinalorg_img_source_list[i] for i in intestinal_train_indices] + [intestinalorg_img_source_list[i] for i in intestinal_test_indices]
small_dataset_box_list = [private_boxes_list[i] for i in private_train_indices] + [private_boxes_list[i] for i in private_val_indices] + [private_boxes_list[i] for i in private_test_indices] + [intestinalorg_boxes_list[i] for i in intestinal_train_indices] + [intestinalorg_boxes_list[i] for i in intestinal_test_indices]
small_dataset_mask_list = [private_masks_list[i] for i in private_train_indices] + [private_masks_list[i] for i in private_val_indices] + [private_masks_list[i] for i in private_test_indices] + [intestinalorg_masks_list[i] for i in intestinal_train_indices] + [intestinalorg_masks_list[i] for i in intestinal_test_indices]
small_dataset_split_list = [private_split_list[i] for i in private_train_indices] + [private_split_list[i] for i in private_val_indices] + [private_split_list[i] for i in private_test_indices] + [intestinalorg_split_list[i] for i in intestinal_train_indices] + [intestinalorg_split_list[i] for i in intestinal_test_indices]
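
Each of the four lines above repeats the same index-and-concatenate pattern. As a rough sketch, a hypothetical helper `gather` (not part of the notebook) could select the values of one source list for several index groups at once; the values and indices below are made up for illustration.

```python
def gather(values, *index_groups):
    """Concatenate values[i] for every index in each group, preserving group order."""
    return [values[i] for group in index_groups for i in group]


# Hypothetical source list and sampled indices
demo_values = ["a", "b", "c", "d", "e"]
demo_train_idx, demo_test_idx = [3, 0], [4]

demo_subset = gather(demo_values, demo_train_idx, demo_test_idx)
print(demo_subset)
```

With such a helper, the private part of the image list would read `gather(private_img_source_list, private_train_indices, private_val_indices, private_test_indices)`, followed by the same call for the intestinal organoid lists.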

Save everything as a `.json` file.

In [29]:
df = pd.DataFrame(list(zip(small_dataset_img_source_list, small_dataset_box_list, small_dataset_mask_list, small_dataset_split_list)),
               columns =['img', 'box', 'mask', 'split'])

df.to_json(data_dir + "small_metadata.json", orient = "records", lines = True)

del df

In [33]:
print('Total number of instances in the small dataset:', len(small_dataset_box_list))

Total number of instances in the small dataset: 4009
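
Calling `to_json` with `orient="records"` and `lines=True` writes one JSON object per row, which is the same layout the loading cells at the top of this notebook read. A minimal round-trip sketch, using made-up rows and a temporary file rather than the real `small_metadata.json`:

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical rows with the same four columns as the aggregated metadata
demo_df = pd.DataFrame(
    {
        "img": ["private/images/a.png", "intestinal_organoid_dataset/images/b.png"],
        "box": [[0, 0, 10, 10], [5, 5, 20, 20]],
        "mask": ["private/masks/a.png", "intestinal_organoid_dataset/masks/b.png"],
        "split": ["train", "test"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "small_metadata.json")
    demo_df.to_json(path, orient="records", lines=True)

    # Read the file back line by line, as the loading cells do
    with open(path, "r") as f:
        demo_rows = [json.loads(line) for line in f]

print(len(demo_rows), sorted(demo_rows[0].keys()))
```

Reading the file back this way is a cheap check that the saved metadata can be consumed by the same loading code.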


### Complete dataset aggregation

First we concatenate all the data from both datasets into lists.

In [30]:
big_dataset_img_source_list = private_img_source_list + intestinalorg_img_source_list
big_dataset_box_list = private_boxes_list + intestinalorg_boxes_list
big_dataset_mask_list = private_masks_list + intestinalorg_masks_list
big_dataset_split_list = private_split_list + intestinalorg_split_list

Save everything as a `.json` file.

In [32]:
df = pd.DataFrame(list(zip(big_dataset_img_source_list, big_dataset_box_list, big_dataset_mask_list, big_dataset_split_list)),
               columns =['img', 'box', 'mask', 'split'])

df.to_json(data_dir + "metadata.json", orient = "records", lines = True)

del df

In [34]:
print('Total number of instances in the complete dataset:', len(big_dataset_box_list))

Total number of instances in the complete dataset: 40101
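
Beyond the total, a per-split count is a quick sanity check on the aggregated lists. The sketch below uses `collections.Counter` on made-up split labels; in the notebook, the same call could be applied to `big_dataset_split_list`.

```python
from collections import Counter

# Hypothetical split labels from two concatenated datasets
demo_split_list = ["train", "train", "val", "test"] + ["train", "test"]

split_counts = Counter(demo_split_list)
print(split_counts)

# The per-split counts should add up to the total number of instances
assert sum(split_counts.values()) == len(demo_split_list)
```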
