# Dataset Preprocessing
In this notebook we are going to preprocess the slides to generate our sets of patches for training, validation, and testing.

In [1]:
from pathgen.utils.seeds import set_seed

global_seed = 123
set_seed(global_seed)

In [2]:
from pathgen.utils.paths import project_root

experiment_name = "all"
experiment_root = project_root() / "experiments" / experiment_name

## 1. Audit the dataset
Let's find out how many patches there are in each slide in the Camelyon16 dataset, so we can select from them in a principled way.

In [3]:
import pathgen.data.datasets.camelyon16 as camelyon16

train = camelyon16.training()
len(train)

270

In [4]:
train[0]

(PosixPath('/home/ubuntu/pathgen/data/camelyon16/raw/training/tumor/tumor_001.tif'),
 PosixPath('/home/ubuntu/pathgen/data/camelyon16/raw/training/lesion_annotations/tumor_001.xml'),
 'tumor',
 '')

In [5]:
from pathgen.preprocess.patching import GridPatchFinder, make_index
from pathgen.preprocess.tissue_detection import TissueDetectorOTSU

tissue_detector = TissueDetectorOTSU()
patch_finder = GridPatchFinder(6, 0, 256, 256)

index = make_index(train, tissue_detector, patch_finder)
index.summary()

indexing tumor_001.tif
indexing tumor_002.tif
indexing tumor_003.tif
indexing tumor_004.tif
indexing tumor_005.tif
indexing tumor_006.tif
indexing tumor_007.tif
indexing tumor_008.tif
indexing tumor_009.tif
indexing tumor_010.tif
indexing tumor_011.tif
indexing tumor_012.tif
indexing tumor_013.tif
indexing tumor_014.tif
indexing tumor_015.tif
indexing tumor_016.tif
indexing tumor_017.tif
indexing tumor_018.tif
indexing tumor_019.tif
indexing tumor_020.tif
indexing tumor_021.tif
indexing tumor_022.tif
indexing tumor_023.tif
indexing tumor_024.tif
indexing tumor_025.tif
indexing tumor_026.tif
indexing tumor_027.tif
indexing tumor_028.tif
indexing tumor_029.tif
indexing tumor_030.tif
indexing tumor_031.tif
indexing tumor_032.tif
indexing tumor_033.tif
indexing tumor_034.tif
indexing tumor_035.tif
indexing tumor_036.tif
indexing tumor_037.tif
indexing tumor_038.tif
indexing tumor_039.tif
indexing tumor_040.tif
indexing tumor_041.tif
indexing tumor_042.tif
indexing tumor_043.tif
indexing tu

Unnamed: 0,background,normal,tumor
0,0.0,27687,238
1,0.0,14454,28
2,0.0,23267,171
3,0.0,42063,447
4,0.0,12353,71
...,...,...,...
265,0.0,24853,0
266,0.0,40939,0
267,0.0,39127,0
268,0.0,31096,0


In [6]:
training_summary = index.summary()
total_normal = training_summary['normal'].sum()
total_tumor = training_summary['tumor'].sum()
print(f"Total normal patches: {total_normal}")
print(f"Total tumor patches: {total_tumor}")

Total normal patches: 7438602
Total tumor patches: 307594
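
To put these totals in perspective, the class balance can be worked out directly from the two printed numbers. The snippet below is a small sketch that hard-codes the totals from the output above; the names `total_patches`, `tumor_fraction`, and `normal_per_tumor` are illustrative and not part of pathgen.

```python
# Totals copied from the output of the cell above.
total_normal = 7438602
total_tumor = 307594

total_patches = total_normal + total_tumor
tumor_fraction = total_tumor / total_patches      # share of patches that are tumor
normal_per_tumor = total_normal / total_tumor     # normal patches per tumor patch

print(f"Tumor share of all patches: {tumor_fraction:.2%}")
print(f"Normal patches per tumor patch: {normal_per_tumor:.1f}")
```

Tumor patches make up about 4% of the total, roughly one for every 24 normal patches.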


## 2. Sampling at a slide level
We want the samples in the different sets (train, validate, and test) to come from separate slides. Test is its own dataset, but validate has to be broken off from train on a per-slide basis. We want the split to take into account the number of patches in each set, because it is the patches that we are interested in. Let's make the split so that the validate set holds about 30% of the tumor patches. Basing it on the tumor patches makes sense because they are the class with the fewest samples.

Our algorithm for splitting works like so:
1. Work out the total number of tumor patches in the whole training dataset.
2. Work out what 30% (or whatever the chosen fraction is) of that total is. This is the target.
3. Start counting with 0 tumor patches and 0 slides in the validation set.
4. Randomly select a slide in training and move it to validation.
5. Add the number of tumor patches in that slide to the running total.
6. Repeat from step 4 until the running total reaches the target.
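
The steps above can be tried out on toy data before running them on the real summary. The following is a minimal sketch, not the notebook's own implementation: the function name `split_by_tumor_count`, its parameters, and the toy counts are all made up for illustration, and it uses a local `random.Random` instance rather than the global seed.

```python
import random

def split_by_tumor_count(tumor_counts, valid_fraction, seed):
    """Move randomly chosen slides to validation until it holds at least
    `valid_fraction` of all tumor patches. Returns (train_idx, valid_idx)."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    target = int(sum(tumor_counts) * valid_fraction)
    remaining = list(range(len(tumor_counts)))
    rng.shuffle(remaining)
    valid, running_total = [], 0
    # Stop once the target is reached (or, defensively, if no slides are left).
    while running_total < target and remaining:
        idx = remaining.pop(0)
        running_total += tumor_counts[idx]
        valid.append(idx)
    return remaining, valid

# Toy per-slide tumor counts (made up); zeros stand in for normal slides.
toy_counts = [120, 30, 200, 450, 70, 0, 0, 0, 0, 0]
train_idx, valid_idx = split_by_tumor_count(toy_counts, 0.3, seed=123)
valid_tumor = sum(toy_counts[i] for i in valid_idx)
print(train_idx, valid_idx, valid_tumor)
```

Because normal slides contribute zero tumor patches, they can be moved to validation without advancing the running total, so the validation set ends up containing both tumor and normal slides.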

In [7]:
total_tumor = training_summary['tumor'].sum()
valid_tumor_count = int(total_tumor * 0.3)  # rounds down
valid_tumor_count

92278

In [8]:
import random

# Shuffle the slide positions, then move slides from the front of the list
# into validation until it holds enough tumor patches.
train_slide_indices = list(range(len(training_summary)))
random.shuffle(train_slide_indices)
valid_slide_indices = []
total_valid_patches = 0  # running count of tumor patches in validation
while total_valid_patches < valid_tumor_count:
    slide_idx = train_slide_indices.pop(0)
    total_valid_patches += training_summary.iloc[slide_idx]['tumor']
    valid_slide_indices.append(slide_idx)

# print them out
print(train_slide_indices)
print(valid_slide_indices)

# check there are no duplicates: the two numbers printed below should match
print(len(train_slide_indices) + len(valid_slide_indices))
print(len(set(train_slide_indices + valid_slide_indices)))

[203, 98, 184, 117, 182, 94, 135, 199, 178, 37, 83, 104, 153, 68, 223, 66, 7, 266, 10, 192, 30, 181, 47, 127, 173, 91, 190, 239, 20, 156, 240, 226, 267, 53, 42, 71, 161, 256, 60, 33, 252, 166, 51, 258, 88, 140, 177, 248, 147, 228, 186, 169, 254, 191, 16, 27, 251, 193, 237, 257, 189, 65, 230, 49, 176, 34, 2, 236, 118, 244, 113, 200, 39, 45, 216, 139, 59, 126, 73, 15, 128, 265, 103, 58, 90, 209, 93, 160, 247, 150, 48, 4, 180, 227, 99, 116, 18, 100, 141, 86, 85, 43, 225, 168, 217, 21, 185, 46, 204, 8, 130, 95, 107, 171, 218, 124, 245, 234, 212, 268, 175, 249, 131, 165, 221, 101, 3, 241, 207, 144, 134, 163, 155, 210, 52, 123, 205, 133, 87, 78, 196, 9, 214, 120, 67, 122, 146, 110, 74, 5, 202, 32, 36, 255, 23, 11, 250, 259, 114, 187, 80, 1, 17, 96, 152, 224, 22, 198, 111, 231, 0, 41, 62, 179, 261, 172, 69, 81, 269, 174, 170, 194, 19, 55, 136, 208, 44, 137, 26]
[105, 262, 40, 61, 148, 238, 195, 246, 263, 115, 162, 222, 157, 119, 50, 77, 132, 201, 82, 92, 31, 197, 167, 24, 213, 206, 149, 70, 1

Now that we have the indices of the two sets, let's turn them into their own datasets based on those indices.

In [9]:
len(train_slide_indices), len(valid_slide_indices)

(189, 81)
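
One way to turn index lists into datasets, and to check how close a split came to its target, is sketched below. It assumes the dataset supports integer indexing (as `train[0]` does above); the toy data and the names `toy_slides`, `toy_tumor_counts`, and `subset` are hypothetical stand-ins, not pathgen APIs.

```python
# Hypothetical stand-ins for the slide dataset and its per-slide tumor counts.
toy_slides = [f"slide_{i:03d}.tif" for i in range(6)]
toy_tumor_counts = [50, 0, 30, 0, 20, 0]
toy_train_idx = [0, 1, 3, 5]
toy_valid_idx = [2, 4]

def subset(items, indices):
    """Pick the entries of an indexable collection at the given positions."""
    return [items[i] for i in indices]

train_slides = subset(toy_slides, toy_train_idx)
valid_slides = subset(toy_slides, toy_valid_idx)

# Share of all tumor patches that ended up in validation.
valid_tumor = sum(subset(toy_tumor_counts, toy_valid_idx))
valid_share = valid_tumor / sum(toy_tumor_counts)
print(valid_slides, f"{valid_share:.0%}")
```

On the real data, the equivalent check would sum the `tumor` column of `training_summary` over `valid_slide_indices` and divide by `total_tumor`. Since the loop stops only after the target is reached, the achieved share will be at or slightly above 30%.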