# DATASET MERGING

In [1]:
import os
import shutil

## A. Classification-Set

This notebook contains the code that combines all datasets into a single Classification-Set, as explained in the paper. Three classification datasets are used in this research: BACH, BRACS, and RSDS.
1. *BACH (BreAst Cancer Histology)* is an open-source dataset of 400 breast cancer histopathology images: 100 normal, 100 benign, 100 in situ, and 100 invasive. This research uses 300 of them: 100 Benign, 100 In Situ, and 100 Invasive. (Source: https://iciar2018-challenge.grand-challenge.org/)
2. *BRACS (BReAst Carcinoma Subtyping)* is an open-source dataset of 4539 breast cancer histopathology images: 484 normal (N), 836 pathological benign (PB), 517 usual ductal hyperplasia (UDH), 756 flat epithelial atypia (FEA), 507 atypical ductal hyperplasia (ADH), 790 ductal carcinoma in situ (DCIS), and 649 invasive carcinoma (IC). This research uses 2792 of them: 1353 PB and UDH images as Benign, 790 DCIS images as In Situ, and 649 IC images as Invasive. (Source: https://www.bracs.icar.cnr.it/)
3. *RSDS (RSUD Dr. Soetomo)* is a private dataset of 61 breast cancer histopathology images from patients of Dr. Soetomo General Hospital, Surabaya: 10 Benign, 29 In Situ, and 22 Invasive. All images from this dataset are used. Only sample images are shown, for publication purposes.

BACH and BRACS are used for training, while RSDS is used for testing, supplemented with some images from BACH and BRACS so that the split matches the 8:1:1 ratio.
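The counts above can be tallied to see how many BACH and BRACS images the test set needs on top of RSDS. The following is a minimal sketch, not the procedure used in the paper: it assumes 8:1:1 means train:validation:test, and the rounding rule (floor for validation and test, remainder to train) is an assumption. The names `COUNTS`, `class_totals`, and `split_sizes` are illustrative.

```python
# Image counts per dataset and class, taken from the descriptions above.
COUNTS = {
    "BACH": {"Benign": 100, "In Situ": 100, "Invasive": 100},
    "BRACS": {"Benign": 1353, "In Situ": 790, "Invasive": 649},
    "RSDS": {"Benign": 10, "In Situ": 29, "Invasive": 22},
}


def class_totals(counts):
    """Sum the per-dataset counts into one total per class."""
    totals = {}
    for per_class in counts.values():
        for name, n in per_class.items():
            totals[name] = totals.get(name, 0) + n
    return totals


def split_sizes(total, ratios=(8, 1, 1)):
    """Split `total` by `ratios`.

    Assumption: validation and test sizes are floored and the
    remainder goes to the training set.
    """
    denom = sum(ratios)
    val = total * ratios[1] // denom
    test = total * ratios[2] // denom
    return total - val - test, val, test


totals = class_totals(COUNTS)
grand_total = sum(totals.values())
train, val, test = split_sizes(grand_total)

# RSDS goes entirely into the test set; the rest of the test set
# would have to come from BACH and BRACS.
rsds_total = sum(COUNTS["RSDS"].values())
top_up = test - rsds_total

print(totals)
print(grand_total, train, val, test, top_up)
```

Under these assumptions the merged set holds 3153 images, so a test set of about 315 images would need roughly 254 images from BACH and BRACS in addition to the 61 RSDS images.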

In [4]:
source_root = r"E:\SKRIPSI ANGGUN\All-Data\Classification-Set"

target_dir = r"E:\SKRIPSI ANGGUN\Dataset-All\Classification-Set"

datasets = ["BACH", "BRACS", "RSDS"]

class_folders = [
    "Benign",
    os.path.join("Malignant", "In Situ"),
    os.path.join("Malignant", "Invasive")
]

valid_ext = ('.png', '.jpg', '.jpeg', '.tif', '.tiff')

In [5]:
print("[INFO] Clearing target folder contents...")
for class_folder in class_folders:
    class_path = os.path.join(target_dir, class_folder)
    if os.path.exists(class_path):
        # Remove files and symlinks only; subfolders are left in place
        for f in os.listdir(class_path):
            f_path = os.path.join(class_path, f)
            if os.path.isfile(f_path) or os.path.islink(f_path):
                os.remove(f_path)
    else:
        os.makedirs(class_path, exist_ok=True)

[INFO] Clearing target folder contents...
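The clearing step is repeated for the Segmentation-Set later in this notebook, so it could be wrapped in a helper. The sketch below is one way to do that; `clear_files` is a hypothetical name that does not appear in the original cells, and the demo runs in a temporary directory rather than on the real dataset paths.

```python
import os
import tempfile


def clear_files(folder):
    """Delete files and symlinks directly inside `folder`.

    Creates the folder if it does not exist. Subfolders are left
    untouched, matching the behaviour of the cell above.
    Returns the number of entries removed.
    """
    if not os.path.exists(folder):
        os.makedirs(folder, exist_ok=True)
        return 0

    removed = 0
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path) or os.path.islink(path):
            os.remove(path)
            removed += 1
    return removed


# Demo in a throwaway directory (not the real dataset paths).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "Benign")
    first = clear_files(demo)                 # folder is created, nothing removed
    open(os.path.join(demo, "a.png"), "w").close()
    os.makedirs(os.path.join(demo, "sub"))
    second = clear_files(demo)                # removes a.png, keeps sub/
    remaining = os.listdir(demo)

print(first, second, remaining)
```

Returning the number of removed entries makes it easy to log how much was cleared before each merge.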


In [6]:
print("\n[INFO] Starting to merge files...")
for dataset_name in datasets:
    dataset_path = os.path.join(source_root, dataset_name)

    if not os.path.exists(dataset_path):
        print(f"[SKIP] Folder not found: {dataset_path}")
        continue

    for class_folder in class_folders:
        src_folder = os.path.join(dataset_path, class_folder)
        dst_folder = os.path.join(target_dir, class_folder)
        os.makedirs(dst_folder, exist_ok=True)

        if not os.path.exists(src_folder):
            print(f"[SKIP] Class folder not found: {src_folder}")
            continue

        for file in os.listdir(src_folder):
            # Skip anything that is not an image file
            if not file.lower().endswith(valid_ext):
                continue

            src_path = os.path.join(src_folder, file)

            # Add the dataset name as a prefix if it is not already present
            if not file.startswith(f"{dataset_name}_"):
                new_name = f"{dataset_name}_{file}"
            else:
                new_name = file

            dst_path = os.path.join(dst_folder, new_name)

            # Avoid overwriting if the file name already exists
            count = 1
            base, ext = os.path.splitext(new_name)
            while os.path.exists(dst_path):
                dst_path = os.path.join(dst_folder, f"{base}_{count}{ext}")
                count += 1

            shutil.copy2(src_path, dst_path)
            print(f"✓ Copied: {src_path} → {dst_path}")


[INFO] Starting to merge files...
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Classification-Set\BACH\Benign\b001.tif → E:\SKRIPSI ANGGUN\Dataset-All\Classification-Set\Benign\BACH_b001.tif
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Classification-Set\BACH\Benign\b002.tif → E:\SKRIPSI ANGGUN\Dataset-All\Classification-Set\Benign\BACH_b002.tif
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Classification-Set\BACH\Benign\b003.tif → E:\SKRIPSI ANGGUN\Dataset-All\Classification-Set\Benign\BACH_b003.tif
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Classification-Set\BACH\Benign\b004.tif → E:\SKRIPSI ANGGUN\Dataset-All\Classification-Set\Benign\BACH_b004.tif
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Classification-Set\BACH\Benign\b005.tif → E:\SKRIPSI ANGGUN\Dataset-All\Classification-Set\Benign\BACH_b005.tif
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Classification-Set\BACH\Benign\b006.tif → E:\SKRIPSI ANGGUN\Dataset-All\Classification-Set\Benign\BACH_b006.tif
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Classification-Set\BACH\Benign\b0
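The renaming rule used in the merge loop (prefix with the dataset name, then append a counter on collision) can be isolated into a small function and checked on its own. This is a minimal sketch of that rule; `unique_destination` is a hypothetical helper name, and the demo uses a temporary directory with made-up file names.

```python
import os
import tempfile


def unique_destination(dst_folder, dataset_name, filename):
    """Build a destination path that does not overwrite an existing file.

    The file name is prefixed with `<dataset_name>_` unless it already
    has that prefix. If the resulting path exists, `_1`, `_2`, ... is
    appended before the extension until a free name is found.
    """
    prefix = f"{dataset_name}_"
    new_name = filename if filename.startswith(prefix) else prefix + filename

    base, ext = os.path.splitext(new_name)
    dst_path = os.path.join(dst_folder, new_name)
    count = 1
    while os.path.exists(dst_path):
        dst_path = os.path.join(dst_folder, f"{base}_{count}{ext}")
        count += 1
    return dst_path


# Demo in a throwaway directory (not the real dataset paths).
with tempfile.TemporaryDirectory() as tmp:
    first = unique_destination(tmp, "BACH", "b001.tif")
    open(first, "w").close()  # simulate the copy
    second = unique_destination(tmp, "BACH", "b001.tif")
    already_prefixed = unique_destination(tmp, "BACH", "BACH_b002.tif")
    names = [os.path.basename(p) for p in (first, second, already_prefixed)]

print(names)
```

Because the target folders are cleared before each merge, the counter suffix should only appear when two source files map to the same prefixed name.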

## B. Segmentation-Set

The dataset used for the Segmentation-Set is Monuseg-2018 (Multi-Organ NUclei SEGmentation - 2018), which consists of 82 image-mask pairs of histology images from multiple organs diagnosed with cancer. It combines 32 images from the original MonuSeg MICCAI 2018 set with 50 images from the Triple Negative Breast Cancer (TNBC) set. (Source: http://www.koreascience.or.kr/article/JAKO202125761193585.pdf) The merge cells below also read a BreCaHad source folder, as listed in the datasets variable.

In [2]:
source_root = r"E:\SKRIPSI ANGGUN\All-Data\Segmentation-Set"
target_root = r"E:\SKRIPSI ANGGUN\Dataset-All\Segmentation-Set"

datasets = ["BreCaHad", "Monuseg-2018"]

segment_folders = [
    "images",
    "masks"
]

valid_ext = ('.png', '.jpg', '.jpeg', '.tif', '.tiff')

In [3]:
print("[INFO] Clearing target folder contents...")
for segment_folder in segment_folders:
    segment_path = os.path.join(target_root, segment_folder)
    if os.path.exists(segment_path):
        # Remove files and symlinks only; subfolders are left in place
        for f in os.listdir(segment_path):
            f_path = os.path.join(segment_path, f)
            if os.path.isfile(f_path) or os.path.islink(f_path):
                os.remove(f_path)
    else:
        os.makedirs(segment_path, exist_ok=True)

[INFO] Clearing target folder contents...


In [4]:
for dataset_name in datasets:
    dataset_path = os.path.join(source_root, dataset_name)

    if not os.path.exists(dataset_path):
        print(f"[SKIP] Folder not found: {dataset_path}")
        continue
    for segment_folder in segment_folders:
        src_folder = os.path.join(dataset_path, segment_folder)
        dst_folder = os.path.join(target_root, segment_folder)
        os.makedirs(dst_folder, exist_ok=True)

        if not os.path.exists(src_folder):
            print(f"[SKIP] Folder not found: {src_folder}")
            continue

        for file in os.listdir(src_folder):
            if not file.lower().endswith(valid_ext):
                continue

            src_path = os.path.join(src_folder, file)

            # Add the dataset name as a prefix if it is not already present
            if not file.startswith(f"{dataset_name}_"):
                new_name = f"{dataset_name}_{file}"
            else:
                new_name = file

            dst_path = os.path.join(dst_folder, new_name)

            # Avoid overwriting if the file name already exists
            count = 1
            base, ext = os.path.splitext(new_name)
            while os.path.exists(dst_path):
                dst_path = os.path.join(dst_folder, f"{base}_{count}{ext}")
                count += 1

            shutil.copy2(src_path, dst_path)
            print(f"✓ Copied: {src_path} → {dst_path}")

✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Segmentation-Set\BreCaHad\images\Case_1-01.png → E:\SKRIPSI ANGGUN\Dataset-All\Segmentation-Set\images\BreCaHad_Case_1-01.png
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Segmentation-Set\BreCaHad\images\Case_1-02.png → E:\SKRIPSI ANGGUN\Dataset-All\Segmentation-Set\images\BreCaHad_Case_1-02.png
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Segmentation-Set\BreCaHad\images\Case_1-03.png → E:\SKRIPSI ANGGUN\Dataset-All\Segmentation-Set\images\BreCaHad_Case_1-03.png
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Segmentation-Set\BreCaHad\images\Case_1-04.png → E:\SKRIPSI ANGGUN\Dataset-All\Segmentation-Set\images\BreCaHad_Case_1-04.png
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Segmentation-Set\BreCaHad\images\Case_1-05.png → E:\SKRIPSI ANGGUN\Dataset-All\Segmentation-Set\images\BreCaHad_Case_1-05.png
✓ Copied: E:\SKRIPSI ANGGUN\All-Data\Segmentation-Set\BreCaHad\images\Case_1-06.png → E:\SKRIPSI ANGGUN\Dataset-All\Segmentation-Set\images\BreCaHad_Case_1-06.png
✓ Copied: E:\SKRIPSI A
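Since the images and masks folders are merged independently, it may be worth checking afterwards that every image still has a matching mask. The sketch below assumes that a mask shares its image's file name apart from the extension; that naming convention is an assumption and is not confirmed by the notebook. `stems` and `unpaired` are hypothetical helper names, and the demo uses made-up files in a temporary directory.

```python
import os
import tempfile

VALID_EXT = ('.png', '.jpg', '.jpeg', '.tif', '.tiff')


def stems(folder):
    """Return the set of image file names in `folder`, without extensions."""
    return {
        os.path.splitext(name)[0]
        for name in os.listdir(folder)
        if name.lower().endswith(VALID_EXT)
    }


def unpaired(images_dir, masks_dir):
    """List stems that appear in only one of the two folders.

    Assumption: an image and its mask share the same stem.
    Returns (images_without_mask, masks_without_image), both sorted.
    """
    image_stems = stems(images_dir)
    mask_stems = stems(masks_dir)
    return sorted(image_stems - mask_stems), sorted(mask_stems - image_stems)


# Demo with made-up file names in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    images_dir = os.path.join(tmp, "images")
    masks_dir = os.path.join(tmp, "masks")
    os.makedirs(images_dir)
    os.makedirs(masks_dir)
    for name in ("a.png", "b.png"):
        open(os.path.join(images_dir, name), "w").close()
    for name in ("a.png", "c.tif"):
        open(os.path.join(masks_dir, name), "w").close()

    missing_masks, missing_images = unpaired(images_dir, masks_dir)

print(missing_masks, missing_images)
```

Two empty lists would indicate that every merged image has a mask and vice versa, under the naming assumption stated above.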