# Data Processing

This notebook walks through the steps we used to process the PraNet dataset for polyp segmentation. The processed dataset can be directly downloaded [here](https://github.com/chrisyeh96/conformal-risk-training/releases/download/v1.0.0/polyps_data.zip) and unzipped into `polyps/data`.

This notebook is meant to be run from the `polyps` directory.

1. Create a subfolder `polyps/data_raw/`
2. Download the PraNet test set from [Google Drive](https://drive.google.com/file/d/1Y2z7FD5p5y31vkZwQQomXFRB0HutHyao/). Extract the zip file to `polyps/data_raw/TestDataset/`.
3. Optionally, download the PraNet training set from [Google Drive](https://drive.google.com/file/d/1YiGHLw4iTvKdvbT6MgwO9zcCv8zJ_Bnb/). Extract the file to `polyps/data_raw/TrainDataset`.
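
Before running the remaining cells, it can help to confirm that the archives were extracted where the notebook expects them. The sketch below is a hypothetical pre-flight check (the helper name `missing_raw_dirs` is ours and is not part of the notebook). It assumes `TestDataset` is required and `TrainDataset` is optional, following the steps above.

```python
import tempfile
from pathlib import Path


def missing_raw_dirs(raw_root: str = 'data_raw', need_train: bool = False) -> list[str]:
    """Return the expected PraNet folders that are absent under `raw_root`.

    Hypothetical helper: the folder names follow the setup steps above.
    """
    expected = ['TestDataset']
    if need_train:
        expected.append('TrainDataset')
    root = Path(raw_root)
    return [name for name in expected if not (root / name).is_dir()]


# demo on a throwaway directory that only contains TestDataset/
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'TestDataset').mkdir()
    demo_required = missing_raw_dirs(tmp)
    demo_with_train = missing_raw_dirs(tmp, need_train=True)

print(demo_required)    # []
print(demo_with_train)  # ['TrainDataset']
```

When run from the `polyps` directory, `missing_raw_dirs(need_train=True)` would report which of the two folders still needs to be downloaded.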

This notebook ensures the following:
1. training set is saved in `polyps/data/train/{DATASET}/`
2. test set is saved in `polyps/data/test/{DATASET}/`
3. input images are saved as 24-bit PNGs within `{DATASET}/images/`
4. ground-truth masks are saved as 1-bit PNGs within `{DATASET}/masks/`
5. input images and ground-truth masks have the same (height, width) shape
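
The guarantees listed above can be spot-checked after processing. The following is a minimal sketch, assuming the `{DATASET}/images/` and `{DATASET}/masks/` layout described above; `check_dataset_dir` is a hypothetical helper name, and the toy files exist only to exercise it.

```python
import os
import tempfile

import numpy as np
import PIL.Image as Image


def check_dataset_dir(dataset_dir: str) -> list[str]:
    """Return a list of problems found under `dataset_dir` (empty if none).

    Hypothetical helper: checks PNG format, image mode, and matching sizes.
    """
    problems = []
    img_dir = os.path.join(dataset_dir, 'images')
    mask_dir = os.path.join(dataset_dir, 'masks')
    for name in sorted(os.listdir(img_dir)):
        mask_path = os.path.join(mask_dir, name)
        if not os.path.exists(mask_path):
            problems.append(f'{name}: missing mask')
            continue
        with Image.open(os.path.join(img_dir, name)) as img, \
                Image.open(mask_path) as mask:
            if img.format != 'PNG' or img.mode != 'RGB':
                problems.append(f'{name}: image is not a 24-bit RGB PNG')
            if mask.format != 'PNG' or mask.mode != '1':
                problems.append(f'{name}: mask is not a 1-bit PNG')
            # PIL's .size is (width, height)
            if img.size != mask.size:
                problems.append(f'{name}: image and mask sizes differ')
    return problems


# demo on a toy dataset with one valid image / mask pair
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'images'))
    os.makedirs(os.path.join(tmp, 'masks'))
    Image.fromarray(np.zeros((4, 6, 3), dtype=np.uint8)).save(
        os.path.join(tmp, 'images', '1.png'))
    Image.fromarray(np.zeros((4, 6), dtype=bool)).save(
        os.path.join(tmp, 'masks', '1.png'))
    demo_problems = check_dataset_dir(tmp)

print(demo_problems)  # []
```

After the notebook has run, a call such as `check_dataset_dir('data/test/CVC-ClinicDB')` should return an empty list if the processing succeeded.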

## Imports / Setup

In [None]:
from collections.abc import Iterable
import os

import numpy as np

import imageio
import PIL.Image as Image
from tqdm.auto import tqdm

# create folder structure
os.makedirs('data', exist_ok=True)
os.makedirs('data/train', exist_ok=True)
os.makedirs('data/test', exist_ok=True)

## CVC-ClinicDB dataset

> Bernal, J., Sánchez, F. J., Fernández-Esparrach, G., Gil, D., Rodríguez, C., & Vilariño, F. (2015). WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics, 43, 99-111.

Website: https://polyp.grand-challenge.org/CVCClinicDB/

Download the CVC-ClinicDB dataset from this [Dropbox link](https://www.dropbox.com/s/p5qe9eotetjnbmq/CVC-ClinicDB.rar?dl=0) and save it to the `polyps/data_raw` directory. If `unrar` is not installed, install it with

```bash
sudo apt update
sudo apt install unrar
```

The bash command in the code cell below extracts the `.rar` archive and creates the following directory structure:

```plain
data_raw/
    CVC-ClinicDB/
        Original/
            1.tif
            ...
            612.tif
        Ground Truth/
            1.tif
            ...
            612.tif
```

The CVC-ClinicDB dataset includes 612 images with segmentation masks. Of the 612 images, we use a 550 train / 62 test split. The split matches what [PraNet](https://github.com/DengPingFan/PraNet) used.

We do not use this dataset as-is from the PraNet repo because the PraNet masks are slightly pixel-shifted and contain some dithering noise compared to the original ground-truth masks. Therefore, we process the original ground-truth masks ourselves.
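
One way to quantify how far two versions of a mask disagree is to count the pixels where they differ. The sketch below is illustrative only and is not necessarily how the comparison against the PraNet masks was originally done; `mask_disagreement` is a hypothetical helper, and the toy masks simulate a one-pixel shift.

```python
import numpy as np


def mask_disagreement(mask_a: np.ndarray, mask_b: np.ndarray) -> int:
    """Count pixels where two boolean masks of the same shape differ."""
    assert mask_a.shape == mask_b.shape
    return int(np.count_nonzero(mask_a != mask_b))


# toy example: a 4x4 square, and the same square shifted right by one pixel
original = np.zeros((8, 8), dtype=bool)
original[2:6, 2:6] = True
shifted = np.roll(original, shift=1, axis=1)

print(mask_disagreement(original, original))  # 0
print(mask_disagreement(original, shifted))   # 8
```

The shifted square differs along its left and right edges (4 pixels each), which is why even a one-pixel shift shows up as a nonzero count.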

In [None]:
%%bash
unrar x -r "data_raw/CVC-ClinicDB.rar" "data_raw"

In [None]:
# train / test split from PraNet
train_inds = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 26, 27, 28, 29, 30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 48, 49, 51, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 67, 68, 69, 70, 71, 72, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 88, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 149, 150, 151, 152, 153, 155, 156, 157, 158, 159, 160, 161, 162, 164, 165, 167, 168, 169, 170, 172, 173, 174, 175, 176, 177, 178, 180, 182, 183, 184, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 280, 281, 282, 283, 284, 285, 286, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 301, 302, 303, 304, 305, 306, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 350, 351, 352, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 375, 376, 377, 378, 379, 380, 382, 383, 384, 385, 386, 387, 389, 390, 391, 392, 393, 394, 395, 396, 398, 399, 401, 402, 403, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 426, 427, 428, 430, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 454, 455, 456, 457, 458, 460, 461, 462, 463, 465, 
466, 467, 468, 469, 470, 471, 472, 473, 475, 476, 477, 478, 479, 480, 482, 484, 485, 486, 487, 488, 489, 490, 491, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 527, 528, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 546, 547, 548, 549, 550, 551, 552, 553, 554, 556, 557, 558, 560, 562, 563, 564, 565, 566, 567, 568, 570, 572, 573, 574, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612]
test_inds = [14, 21, 25, 31, 42, 50, 52, 61, 65, 66, 73, 80, 89, 100, 106, 119, 134, 148, 154, 163, 166, 171, 179, 181, 185, 191, 205, 240, 251, 266, 279, 287, 300, 307, 349, 353, 374, 381, 388, 397, 400, 404, 425, 429, 431, 442, 453, 459, 464, 474, 481, 483, 492, 526, 529, 545, 555, 559, 561, 569, 571, 575]
assert set(train_inds).isdisjoint(test_inds)
assert set(train_inds).union(test_inds) == set(range(1, 613))

print('# train:', len(train_inds), '# test:', len(test_inds))

raw_img_dir = 'data_raw/CVC-ClinicDB/Original'
raw_mask_dir = 'data_raw/CVC-ClinicDB/Ground Truth'

In [None]:
def process_cvc_clinicdb(img_dir: str, mask_dir: str, img_inds: Iterable[int]):
    os.makedirs(img_dir, exist_ok=True)
    os.makedirs(mask_dir, exist_ok=True)

    for i in tqdm(img_inds):
        raw_img_path = os.path.join(raw_img_dir, f'{i}.tif')
        raw_mask_path = os.path.join(raw_mask_dir, f'{i}.tif')

        # for whatever reason, PIL.Image.open() doesn't work with these TIFFs,
        # so we use imageio instead
        with open(raw_img_path, 'rb') as f_img, open(raw_mask_path, 'rb') as f_mask:
            raw_img = imageio.imread(f_img, format='TIFF')
            raw_mask = imageio.imread(f_mask, format='TIFF')

        # check that height and width match
        assert raw_img.shape[0] == raw_mask.shape[0]
        assert raw_img.shape[1] == raw_mask.shape[1]

        # save image
        img = Image.fromarray(raw_img)
        assert img.mode == 'RGB'
        img.save(os.path.join(img_dir, f'{i}.png'))
        img.close()

        # some images have a mask with values other than 0 and 255
        # most of the values are closer to 0, but a small number are larger than 127
        # if not np.array_equal(np.unique(raw_mask), [0, 255]):
        #     if raw_mask[raw_mask != 255].max() > 127:
        #         print(raw_mask_path)
        #         print('Unique mask values:', np.unique(raw_mask))
        #         break

        # manually round the mask to 0 and 255
        # - if we instead called PIL.Image.convert('1') to convert the mask to 1-bit, it will
        #   apply unwanted dithering (e.g., if 255 pixels all have value 1, then one of them
        #   will be set to white, and the others will be set to black)
        mask = Image.fromarray(raw_mask >= 128)
        assert mask.mode == '1'

        # save mask
        mask.save(os.path.join(mask_dir, f'{i}.png'))
        mask.close()

In [None]:
process_cvc_clinicdb(
    img_dir='data/train/CVC-ClinicDB/images',
    mask_dir='data/train/CVC-ClinicDB/masks',
    img_inds=train_inds)

process_cvc_clinicdb(
    img_dir='data/test/CVC-ClinicDB/images',
    mask_dir='data/test/CVC-ClinicDB/masks',
    img_inds=test_inds)
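
The reason for thresholding at 128 rather than calling `convert('1')` can be seen on a toy array. This is a minimal sketch, assuming a uniform gray patch (value 100, chosen for illustration) that should map entirely to background.

```python
import numpy as np
import PIL.Image as Image

# toy grayscale "mask": a uniform patch below the 128 threshold
raw_mask = np.full((16, 16), 100, dtype=np.uint8)

# Pillow's default L -> 1 conversion applies Floyd-Steinberg dithering,
# which turns some of the gray pixels white.
# (Convert back to 'L' before handing to NumPy so values are plain 0 / 255.)
dithered_img = Image.fromarray(raw_mask).convert('1')
dithered = np.asarray(dithered_img.convert('L')) > 0

# thresholding maps every pixel below 128 to background
thresholded = raw_mask >= 128
thresholded_img = Image.fromarray(thresholded)

print('mode after thresholding:', thresholded_img.mode)
print('foreground pixels after dithering:   ', int(dithered.sum()))
print('foreground pixels after thresholding:', int(thresholded.sum()))
```

Dithering leaves scattered white pixels in a region that should be entirely background, whereas thresholding leaves none.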

## CVC-ColonDB dataset

As of 2025-04-29, the form on the original [CVC-ColonDB website](http://vi.cvc.uab.es/colon-qa/cvccolondb/) for requesting dataset access is no longer functional. Instead, we use the CVC-ColonDB dataset hosted on Kaggle by user `longvil`: https://www.kaggle.com/datasets/longvil/cvc-colondb

We verify that the Kaggle CVC-ColonDB dataset and the PraNet CVC-ColonDB dataset are identical once the masks are cast to the same dtype. (The Kaggle version stores the masks as `uint8`, whereas the PraNet version stores the masks as `bool`.)

Because the datasets are identical (and because the PraNet version uses the smaller `bool` dtype for masks), we copy over the PraNet CVC-ColonDB dataset to our `data` directory.
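
Comparing masks stored with different dtypes requires a cast first. The toy example below sketches why; the `0` / `255` encoding of the `uint8` mask is an assumption made for illustration.

```python
import numpy as np

# toy masks: the same 2x3 mask stored two ways
mask_uint8 = np.array([[0, 255, 255], [0, 0, 255]], dtype=np.uint8)
mask_bool = np.array([[False, True, True], [False, False, True]])

# a direct comparison fails, because True compares as 1, not 255
equal_without_cast = np.array_equal(mask_uint8, mask_bool)
# casting the uint8 mask to bool first makes the comparison meaningful
equal_with_cast = np.array_equal(mask_uint8.astype(bool), mask_bool)

print(equal_without_cast)  # False
print(equal_with_cast)     # True
```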

In [None]:
%%bash
curl -L -o data_raw/cvc-colondb.zip https://www.kaggle.com/api/v1/datasets/download/longvil/cvc-colondb
unzip data_raw/cvc-colondb.zip -d data_raw

In [None]:
%%bash
# -p creates parent directories and does not fail if they already exist
mkdir -p data/test/CVC-ColonDB/images
mkdir -p data/test/CVC-ColonDB/masks
cp data_raw/TestDataset/CVC-ColonDB/images/*.png data/test/CVC-ColonDB/images/
cp data_raw/TestDataset/CVC-ColonDB/masks/*.png data/test/CVC-ColonDB/masks/

In [None]:
raw_dir = 'data_raw/CVC-ColonDB/'
old_dir = 'data_raw/TestDataset/CVC-ColonDB/'

num_imgs_equal = 0
num_masks_equal = 0
filenames = sorted(os.listdir(os.path.join(raw_dir, 'masks')))

for filename in filenames:
    # check images
    raw_img_path = os.path.join(raw_dir, 'images', filename)
    old_img_path = os.path.join(old_dir, 'images', filename)
    with open(raw_img_path, 'rb') as f_raw_img:
        raw_img = imageio.imread(f_raw_img, format='PNG')
    with open(old_img_path, 'rb') as f_old_img:
        old_img = imageio.imread(f_old_img, format='PNG')

    if np.array_equal(raw_img, old_img):
        num_imgs_equal += 1
    else:
        print(f'{filename} images are not equal')

    # check masks
    raw_mask_path = os.path.join(raw_dir, 'masks', filename)
    old_mask_path = os.path.join(old_dir, 'masks', filename)
    with open(raw_mask_path, 'rb') as f_raw_mask:
        raw_mask = imageio.imread(f_raw_mask, format='PNG')
        # use the builtin `bool`: the `np.bool` alias is unavailable in NumPy 1.24-1.26
        raw_mask = raw_mask.astype(bool)
    with open(old_mask_path, 'rb') as f_old_mask:
        old_mask = imageio.imread(f_old_mask, format='PNG')

    if np.array_equal(raw_mask, old_mask):
        num_masks_equal += 1
    else:
        print(f'{filename} masks are not equal')

    # check that images and masks have the same (height, width)
    if raw_img.shape[:2] != raw_mask.shape[:2]:
        print(f'{filename} images and masks have different sizes')

print('Total # of files:', len(filenames))
print('Number of equal images:', num_imgs_equal)
print('Number of equal masks:', num_masks_equal)

## ETIS-LaribPolypDB dataset

> Juan S. Silva, Aymeric Histace, Olivier Romain, Xavier Dray, Bertrand Granado. Towards embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery, Springer Verlag (Germany), 2014, 9 (2), pp. 283-293.

Website: https://polyp.grand-challenge.org/ETISLarib/

Download the ETIS-LaribPolypDB dataset from this [Dropbox link](https://www.dropbox.com/s/j4nsxijf5dhzb6w/ETIS-LaribPolypDB.rar?dl=0) and save it to the `polyps/data_raw` directory. If `unrar` is not installed, install it with

```bash
sudo apt update
sudo apt install unrar
```

The bash command in the code cell below extracts the `.rar` archive and creates the following directory structure:

```plain
data_raw/
    ETIS-LaribPolypDB/
        ETIS-LaribPolypDB/
            1.tif
            ...
            196.tif
        Ground Truth/
            p1.tif
            ...
            p196.tif
```

The ETIS-LaribPolypDB dataset includes 196 images with segmentation masks. Following [PraNet](https://github.com/DengPingFan/PraNet), all images are included in the test set.

We confirm that our dataset processing procedure for ETIS-LaribPolypDB produces identical files to the images included in the PraNet test set.
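
A byte-for-byte directory comparison can also be done from Python with the standard-library `filecmp` module. The following is a minimal sketch; `differing_files` is a hypothetical helper that compares a single directory level, and the toy files stand in for real images.

```python
import filecmp
import os
import tempfile


def differing_files(dir_a: str, dir_b: str) -> list[str]:
    """List names that differ byte-for-byte or exist in only one directory.

    Hypothetical helper: compares one (non-nested) directory level.
    """
    names_a = set(os.listdir(dir_a))
    names_b = set(os.listdir(dir_b))
    common = sorted(names_a & names_b)
    # shallow=False compares file contents instead of trusting os.stat()
    _, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)
    only_one_side = sorted(names_a ^ names_b)
    return sorted(mismatch + errors + only_one_side)


# demo: 1.png matches in both directories, 2.png does not
with tempfile.TemporaryDirectory() as tmp:
    dir_a, dir_b = os.path.join(tmp, 'a'), os.path.join(tmp, 'b')
    os.makedirs(dir_a)
    os.makedirs(dir_b)
    for path, payload in [
        (os.path.join(dir_a, '1.png'), b'same'),
        (os.path.join(dir_b, '1.png'), b'same'),
        (os.path.join(dir_a, '2.png'), b'ours'),
        (os.path.join(dir_b, '2.png'), b'theirs'),
    ]:
        with open(path, 'wb') as f:
            f.write(payload)
    demo_result = differing_files(dir_a, dir_b)

print(demo_result)  # ['2.png']
```

An empty result corresponds to `diff -r` printing nothing for that directory level.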

In [None]:
%%bash
unrar x -r "data_raw/ETIS-LaribPolypDB.rar" "data_raw"

In [None]:
raw_img_dir = 'data_raw/ETIS-LaribPolypDB/ETIS-LaribPolypDB'
raw_mask_dir = 'data_raw/ETIS-LaribPolypDB/Ground Truth'

def process_etis_laribpolypdb(img_dir: str, mask_dir: str):
    os.makedirs(img_dir, exist_ok=True)
    os.makedirs(mask_dir, exist_ok=True)

    num_files = len(os.listdir(raw_img_dir))
    for i in tqdm(range(1, num_files+1)):
        raw_img_path = os.path.join(raw_img_dir, f'{i}.tif')
        raw_mask_path = os.path.join(raw_mask_dir, f'p{i}.tif')

        raw_img = Image.open(raw_img_path)
        raw_mask = Image.open(raw_mask_path)

        # check that height and width match
        assert raw_img.size == raw_mask.size
        assert raw_img.mode == 'RGB'
        assert raw_mask.mode == 'L'

        # save image
        img = raw_img
        img.save(os.path.join(img_dir, f'{i}.png'))
        raw_img.close()

        # check that mask is strictly black or white, then convert it to binary
        raw_mask_arr = np.asarray(raw_mask)
        assert np.array_equal(np.unique(raw_mask_arr), [0, 255])
        mask = raw_mask.convert('1')

        # save mask
        mask.save(os.path.join(mask_dir, f'{i}.png'))
        raw_mask.close()

In [None]:
process_etis_laribpolypdb(
    img_dir='data/test/ETIS-LaribPolypDB/images',
    mask_dir='data/test/ETIS-LaribPolypDB/masks')

In [None]:
%%bash
diff -r data/test/ETIS-LaribPolypDB/ data_raw/TestDataset/ETIS-LaribPolypDB/

## Kvasir-SEG dataset

> Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D. Johansen. 2020. Kvasir-SEG: A Segmented Polyp Dataset. In _International Conference on Multimedia Modeling_, 2020. Springer International Publishing, 451–462. https://doi.org/10.1007/978-3-030-37734-2_37

Website: https://datasets.simula.no/kvasir-seg/

Download the Kvasir-SEG dataset and unzip it using the subsequent commands. It should create the following directory structure:

```plain
data_raw/
    Kvasir-SEG/
        images/
            cju0qkwl35piu0993l0dewei2.jpg
            ...
            ck2395w2mb4vu07480otsu6tw.jpg
        masks/
            cju0qkwl35piu0993l0dewei2.jpg
            ...
            ck2395w2mb4vu07480otsu6tw.jpg
```

The Kvasir-SEG dataset includes 1000 images with segmentation masks. Of the 1000 images, we use a 900 train / 100 test split. The split matches what [PraNet](https://github.com/DengPingFan/PraNet) used.

<!-- We do not use this dataset as-is from the PraNet repo because the PraNet masks are slightly pixel-shifted and contain some dithering noise compared to the original ground-truth masks. Therefore, we process the original ground-truth masks ourselves. -->

In [None]:
%%bash
wget https://datasets.simula.no/downloads/kvasir-seg.zip -O data_raw/kvasir-seg.zip
unzip data_raw/kvasir-seg.zip -d data_raw

In [None]:
# test split from PraNet
test_filenames = ['cju0u82z3cuma0835wlxrnrjv', 'cju15wdt3zla10801odjiw7sy', 'cju16ach3m1da0993r1dq3sn2', 'cju16whaj0e7n0855q7b6cjkm', 'cju17z0qongpa0993de4boim4', 'cju1amqw6p8pw0993d9gc5crl', 'cju1bm8063nmh07996rsjjemq', 'cju1c3218411b08014g9f6gig', 'cju1cbokpuiw70988j4lq1fpi', 'cju1cj3f0qi5n0993ut8f49rj', 'cju1cqc7n4gpy0855jt246k68', 'cju1ddr6p4k5z08780uuuzit2', 'cju1f8w0t65en0799m9oacq0q', 'cju1h89h6xbnx08352k2790o9', 'cju1hp9i2xu8e0988u2dazk7m', 'cju2hfqnmhisa0993gpleeldd', 'cju2hjrqcvi2j0801bx1i6gxg', 'cju2hos57llxm08359g92p6jj', 'cju2hqt33lmra0988fr5ijv8j', 'cju2lberzkdzm09938cl40pog', 'cju2mh8t6p07008350e01tx2a', 'cju2nnqrqzp580855z8mhzgd6', 'cju2np2k9zi3v079992ypxqkn', 'cju2omjpeqj5a0988pjdlb8l1', 'cju2osuru0ki00855txo0n3uu', 'cju2pag1f0s4r0878h52uq83s', 'cju2rga4psq9n09881z519xx0', 'cju2rmd2rsw9g09888hh1efu0', 'cju2rqo702wpx0855fn7d5cxh', 'cju2top2ruxxy0988p1svx36g', 'cju2wve9v7esz0878mxsdcy04', 'cju2y40d8ulqo0993q0adtgtb', 'cju2yi9tz8vky0801yqip0xyl', 'cju2yo1j1v0qz09934o0e683p', 'cju2yv4imv6cz099314jveiib', 'cju2zp89k9q1g0855k1x0f1xa', 'cju2zwg05a0oy0801yr73ig7g', 'cju30ajhw09sx0988qyahx9s8', 'cju30gxjq0djk0988jytm49rs', 'cju30j1rgadut0801vuyrsnt8', 'cju31w6goazci0799n014ly1q', 'cju32srle1xfq083575i3fl75', 'cju34m7h536wq0988xz7gx79v', 'cju34xspwzenf0993cyzajv9n', 'cju3tp94kfstl08181awh6z49', 'cju3uhb79gcgr0871orbrbi3x', 'cju3v11mrgwwb0755u242ygye', 'cju3x5u2tiihx0818914gzxy1', 'cju3xga12iixg0817dijbvjxw', 'cju3ya7goj6at0818v2l5ay7f', 'cju3ykamdj9u208503pygyuc8', 'cju40m0rjkpw80871z6n6yg1u', 'cju42qet0lsq90871e50xbnuv', 'cju42wamblrqn098798r2yyok', 'cju43jcqim2cp08172dvjvyui', 'cju45rj7ln8980850a7821fov', 'cju45ty6zn9oz0850qy4qnck1', 'cju45v0pungu40871acnwtmu5', 'cju5cky5xb0ay0801oxet697t', 'cju5clr68b48r0755cmuvponm', 'cju5hi52odyf90817prvcwg45', 'cju5hyi9yegob0755ho3do8en', 'cju5k3j3uf6de0817hszzfr7n', 'cju5o4pk9h0720755lgp9jq8m', 'cju5wrrs0m2af0818vmnajbtw', 'cju5x00l6m5j608503k78ptee', 'cju5xjn5mm78b09871spyqhhr', 'cju5xkwzxmf0z0818gk4xabdm', 
'cju5xq3tdm9fn0987pbedxdg5', 'cju5y4hgqmk0i08180rjhbwvp', 'cju5yeqiwmkgl0801fzv2douc', 'cju6us80mv1b50871ebyq2wxa', 'cju6uy20suzbl0987rzuhz7z9', 'cju6v1m1xv07w09870ah3njy1', 'cju6vifjlv55z0987un6y4zdo', 'cju6vrs1ov8cr098788h8gs6j', 'cju6x0yqbvxqt0755dhxislgb', 'cju7ajnbo1gvm098749rdouk0', 'cju7awzmu1ncs0871hziy65zx', 'cju7bd1qu1mx409877xjxibox', 'cju7bgnvb1sf808717qa799ir', 'cju7crgxa28550755wbsgqkel', 'cju7da88w2eod0755wejzynvt', 'cju7ddtz729960801uazp1knc', 'cju7do8c72dbo0801vxfzxdc4', 'cju7dymur2od30755eg8yv2ht', 'cju7ecl9i2i060987xawjp4l0', 'cju7fbndk2sl608015ravktum', 'cju7fcgbe2z3p07550vaflqdb', 'cju7fpfzq2wyf0818xxd1oziv', 'cju84hibuktj80871u519o71q', 'cju88cddensj00987788yotmg', 'cju88t4fvokxf07558ymyh281', 'cju88vx2uoocy075531lc63n3', 'cju8alhigqn2h0801zksudldd', 'cju8aqq8uqmoq0987hphto9gg', 'cju8bk8oirjhw0817hgkua2w8', 'cju8c2rqzs5t80850d0zky5dy', 'cju8d4jgatgpj0871q2ophhkm', 'cju8dqkrqu83i0818ev74qpxq']
print('# test:', len(test_filenames))

raw_img_dir = 'data_raw/Kvasir-SEG/images'
raw_mask_dir = 'data_raw/Kvasir-SEG/masks'

In [None]:
def process_kvasir_seg(img_dir: str, mask_dir: str, is_train: bool):
    os.makedirs(img_dir, exist_ok=True)
    os.makedirs(mask_dir, exist_ok=True)

    if is_train:
        all_filenames = set(os.path.splitext(x)[0] for x in os.listdir(raw_img_dir))
        img_filenames = sorted(all_filenames - set(test_filenames))
        assert len(img_filenames) == 900
    else:
        img_filenames = test_filenames

    for img_filename in tqdm(img_filenames):
        raw_img_path = os.path.join(raw_img_dir, f'{img_filename}.jpg')
        raw_mask_path = os.path.join(raw_mask_dir, f'{img_filename}.jpg')

        raw_img = Image.open(raw_img_path)
        raw_mask = Image.open(raw_mask_path)

        # check that height and width match
        assert raw_img.size == raw_mask.size
        assert raw_img.mode == 'RGB'
        assert raw_mask.mode == 'RGB'

        # save image
        img = raw_img
        img.save(os.path.join(img_dir, f'{img_filename}.png'))
        raw_img.close()

        raw_mask_arr = np.asarray(raw_mask)
        # assert all channels in the mask are identical, then just use 1st channel
        assert np.array_equal(raw_mask_arr[..., 0], raw_mask_arr[..., 1])
        assert np.array_equal(raw_mask_arr[..., 0], raw_mask_arr[..., 2])

        # manually round the mask to 0 and 255
        # - if we instead called PIL.Image.convert('1') to convert the mask to 1-bit, it will
        #   apply unwanted dithering (e.g., if 255 pixels all have value 1, then one of them
        #   will be set to white, and the others will be set to black)
        mask = Image.fromarray(raw_mask_arr[..., 0] >= 128)
        assert mask.mode == '1'

        # save mask
        mask.save(os.path.join(mask_dir, f'{img_filename}.png'))
        mask.close()
        raw_mask.close()

In [None]:
process_kvasir_seg(
    img_dir='data/train/Kvasir-SEG/images',
    mask_dir='data/train/Kvasir-SEG/masks',
    is_train=True)

process_kvasir_seg(
    img_dir='data/test/Kvasir-SEG/images',
    mask_dir='data/test/Kvasir-SEG/masks',
    is_train=False)

Check that the images and masks are identical between our processed Kvasir-SEG dataset and the PraNet versions.

Note: the PNG files for the input images differ byte-for-byte from the PraNet versions (so `diff` reports them as different), but their decoded pixel values are identical.
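
Two PNG files can hold the same pixels yet differ as files, for example when they were written with different compression settings. The sketch below illustrates this with an in-memory toy image; we do not know which encoder settings PraNet used, so the compression levels here are assumptions chosen only to show the effect.

```python
import io

import numpy as np
import PIL.Image as Image

# toy RGB image with reproducible random pixels
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
img = Image.fromarray(pixels)

# encode the same pixels twice with different zlib compression levels
buf_fast, buf_small = io.BytesIO(), io.BytesIO()
img.save(buf_fast, format='PNG', compress_level=1)
img.save(buf_small, format='PNG', compress_level=9)

# the encoded files differ ...
same_bytes = buf_fast.getvalue() == buf_small.getvalue()

# ... but decoding both recovers identical pixel arrays
decoded_fast = np.asarray(Image.open(io.BytesIO(buf_fast.getvalue())))
decoded_small = np.asarray(Image.open(io.BytesIO(buf_small.getvalue())))
same_pixels = np.array_equal(decoded_fast, decoded_small)

print('same bytes: ', same_bytes)   # False
print('same pixels:', same_pixels)  # True
```

This is why a file-level `diff` is followed by a comparison of the decoded arrays.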

In [None]:
%%bash
diff -r data/test/Kvasir-SEG/masks/ data_raw/TestDataset/Kvasir/masks/

for file in data/train/Kvasir-SEG/masks/c*.png; do
    diff "$file" "data_raw/TrainDataset/masks/${file##*/}"
done

In [None]:
%%bash
diff -r data/test/Kvasir-SEG/images/ data_raw/TestDataset/Kvasir/images/

for file in data/train/Kvasir-SEG/images/c*.png; do
    diff "$file" "data_raw/TrainDataset/image/${file##*/}";
done

:  # no-op, to ensure bash returns exit-code 0

In [None]:
def check_images_equal(new_dir: str, old_dir: str):
    for img_name in os.listdir(new_dir):
        new_img_path = os.path.join(new_dir, img_name)
        old_img_path = os.path.join(old_dir, img_name)

        with Image.open(new_img_path) as new_img, Image.open(old_img_path) as old_img:
            if not np.array_equal(np.asarray(new_img), np.asarray(old_img)):
                print(f'{img_name} images are not equal')


new_img_dir = 'data/test/Kvasir-SEG/images/'
old_img_dir = 'data_raw/TestDataset/Kvasir/images/'
check_images_equal(new_img_dir, old_img_dir)

new_img_dir = 'data/train/Kvasir-SEG/images/'
old_img_dir = 'data_raw/TrainDataset/image/'
check_images_equal(new_img_dir, old_img_dir)

## Duplicates

We are aware that some of the images in the test set (in CVC-ColonDB and ETIS-LaribPolypDB) are duplicates of each other. However, because this is a widely used dataset and no one else seems to de-duplicate these images, we do not do so either.

We do not believe there are any duplicated images in the CVC-ClinicDB and Kvasir-SEG datasets. Since these are the only two datasets that contribute images to both the training and test sets, we do not believe there is any contamination between the train and test sets.

In [None]:
%%bash
find data ! -empty -type f -exec md5sum {} + | sort | uniq -w32 -dD | awk '{print $2, $1}' | sort
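
For environments without GNU `md5sum` and `uniq`, the same duplicate search can be approximated in Python. This is a rough equivalent of the pipeline above, not a drop-in replacement; `find_duplicate_files` is a hypothetical helper, and it reads each file fully into memory, which is assumed to be acceptable for images of this size.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def find_duplicate_files(root: str) -> list[list[str]]:
    """Group non-empty files under `root` that share an MD5 digest."""
    by_digest = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                continue  # mirrors `find ! -empty`
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            by_digest[digest].append(path)
    # keep only digests shared by more than one file
    return sorted(sorted(paths) for paths in by_digest.values() if len(paths) > 1)


# demo: a.png and b.png have identical contents, c.png does not
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [('a.png', b'polyp'), ('b.png', b'polyp'), ('c.png', b'other')]:
        with open(os.path.join(tmp, name), 'wb') as f:
            f.write(payload)
    demo_groups = [[os.path.basename(p) for p in group]
                   for group in find_duplicate_files(tmp)]

print(demo_groups)  # [['a.png', 'b.png']]
```

Calling `find_duplicate_files('data')` from the `polyps` directory would list each group of byte-identical files.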

## CVC-300 dataset

We don't use this dataset because it duplicates CVC-ColonDB. The following code checks this by comparing each CVC-300 image and mask against the file with the same name in CVC-ColonDB.

In [None]:
cvc300_dir = 'data_raw/TestDataset/CVC-300/'
cvc_colondb_dir = 'data_raw/TestDataset/CVC-ColonDB/'

num_imgs_equal = 0
num_masks_equal = 0
filenames = sorted(os.listdir(os.path.join(cvc300_dir, 'masks')))

for filename in filenames:
    # check images
    cvc300_img_path = os.path.join(cvc300_dir, 'images', filename)
    cvc_colondb_img_path = os.path.join(cvc_colondb_dir, 'images', filename)
    with open(cvc300_img_path, 'rb') as f_cvc300_img:
        cvc300_img = imageio.imread(f_cvc300_img, format='PNG')
    with open(cvc_colondb_img_path, 'rb') as f_cvc_colondb_img:
        cvc_colondb_img = imageio.imread(f_cvc_colondb_img, format='PNG')

    if np.array_equal(cvc300_img, cvc_colondb_img):
        num_imgs_equal += 1
    else:
        print(f'{filename} images are not equal')

    # check masks
    cvc300_mask_path = os.path.join(cvc300_dir, 'masks', filename)
    cvc_colondb_mask_path = os.path.join(cvc_colondb_dir, 'masks', filename)
    with open(cvc300_mask_path, 'rb') as f_cvc300_mask:
        cvc300_mask = imageio.imread(f_cvc300_mask, format='PNG')
    with open(cvc_colondb_mask_path, 'rb') as f_cvc_colondb_mask:
        cvc_colondb_mask = imageio.imread(f_cvc_colondb_mask, format='PNG')

    if np.array_equal(cvc300_mask, cvc_colondb_mask):
        num_masks_equal += 1
    else:
        print(f'{filename} masks are not equal')

print('Total # of files:', len(filenames))
print('Number of equal images:', num_imgs_equal)
print('Number of equal masks:', num_masks_equal)