# Dataset Preprocessing
In this notebook we preprocess the slides to generate the sets of patches used for training, validation, and testing.

In [1]:
from pathgen.utils.seeds import set_seed

global_seed = 123
set_seed(global_seed)

In [2]:
from pathgen.utils.paths import project_root

experiment_name = "all"
experiment_root = project_root() / "experiments" / experiment_name

## 1. Audit the dataset
Let's find out how many patches there are in each slide in the Camelyon16 dataset, so we can select from them in a principled way.

In [3]:
import pathgen.data.datasets.camelyon16 as camelyon16

train = camelyon16.training_small()
len(train)

10

In [4]:
train[0]

(PosixPath('/home/ubuntu/pathgen/data/camelyon16/raw/training/tumor/tumor_024.tif'),
 PosixPath('/home/ubuntu/pathgen/data/camelyon16/raw/training/lesion_annotations/tumor_024.xml'),
 'tumor',
 '')

In [5]:
from pathgen.preprocess.patching import GridPatchFinder, make_index
from pathgen.preprocess.tissue_detection import TissueDetectorOTSU

tissue_detector = TissueDetectorOTSU()
patch_finder = GridPatchFinder(6, 0, 256, 256)

index = make_index(train, tissue_detector, patch_finder)
index

indexing tumor_024.tif
indexing tumor_038.tif
indexing tumor_054.tif
indexing tumor_063.tif
indexing tumor_065.tif
indexing tumor_076.tif
indexing tumor_089.tif
indexing normal_014.tif
indexing normal_038.tif
indexing normal_100.tif


<pathgen.preprocess.patching.slides_index.SlidesIndex at 0x7f0fe5b735f8>

In [6]:
training_summary = index.summary()
total_normal = training_summary['normal'].sum()
total_tumor = training_summary['tumor'].sum()
print(f"Total normal patches: {total_normal}")
print(f"Total tumor patches: {total_tumor}")
training_summary

Total normal patches: 235153
Total tumor patches: 89530


label,background,normal,tumor
0,0,50101,31
1,0,5283,736
2,0,25753,22335
3,0,45720,49
4,0,12051,71
5,0,15591,37151
6,0,44378,29157
7,0,8002,0
8,0,3365,0
9,0,24909,0


In [7]:
index.save(experiment_root / 'train_index')

In [8]:
from pathgen.preprocess.patching import SlidesIndex

loaded_index = SlidesIndex.load(experiment_root / 'train_index')
len(loaded_index)

10

## 2. Sampling at a slide level
We want the samples in the different sets (train, validate, and test) to come from separate slides. Test is its own dataset, but validate has to be broken off from train on a per-slide basis. The split should take into account the number of patches in each set, because it is the patches that we are interested in. Let's choose the split so that the validate set holds about 30% of the tumor patches. Basing it on the tumor patches makes sense because they are the class with the fewest samples.

Our algorithm for splitting will work like so:
1. Work out the total number of tumor patches in the whole dataset.
2. Work out what 30% (or whatever the amount is) of the total tumor patches is.
3. Start counting with 0 patches and 0 slides in the validation set.
4. Randomly select a slide in training and move it to validation.
5. Add the number of tumor patches in that slide to the running total.
6. Repeat from step 4 until the running total reaches the 30% target.
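
The steps above can be sketched as a reusable function. This is a minimal sketch, not the notebook's own code: the name `split_slides_by_tumor` and its parameters `valid_fraction` and `seed` are hypothetical, and it uses a local `random.Random` instance, so the slides it selects will not necessarily match the notebook's output.

```python
import random


def split_slides_by_tumor(tumor_counts, valid_fraction=0.3, seed=123):
    """Split slide indices into (train, valid) by tumor patch count.

    Hypothetical helper: moves randomly ordered slides into validation
    until they hold at least `valid_fraction` of the tumor patches.
    """
    target = int(sum(tumor_counts) * valid_fraction)  # rounds down
    rng = random.Random(seed)  # local RNG, independent of global state
    remaining = list(range(len(tumor_counts)))
    rng.shuffle(remaining)

    valid = []
    valid_total = 0
    # Stop when the target is reached or there are no slides left to move.
    while valid_total < target and remaining:
        slide_idx = remaining.pop(0)
        valid.append(slide_idx)
        valid_total += tumor_counts[slide_idx]
    return remaining, valid


# Per-slide tumor counts copied from the summary table above.
tumor_counts = [31, 736, 22335, 49, 71, 37151, 29157, 0, 0, 0]
train_idx, valid_idx = split_slides_by_tumor(tumor_counts)
print(train_idx, valid_idx)
```

The `and remaining` guard is an addition over the listed steps: it keeps the loop from indexing into an empty list if the target could not be met.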

In [8]:
total_tumor = training_summary['tumor'].sum()
valid_tumor_count = int(total_tumor * 0.3)  # rounds down
valid_tumor_count

26859

In [9]:
import random

train_slide_indices = list(range(len(training_summary)))
random.shuffle(train_slide_indices)
valid_slide_indices = []
total_valid_patches = 0  # tumor patches moved to validation so far
# Move slides from train to validation until the tumor target is reached.
while total_valid_patches < valid_tumor_count:
    slide_idx = train_slide_indices.pop(0)
    total_valid_patches += training_summary.iloc[slide_idx]['tumor']
    valid_slide_indices.append(slide_idx)

# print them out
print(train_slide_indices)
print(valid_slide_indices)

# check there are no duplicates
print(len(train_slide_indices) + len(valid_slide_indices))
print(len(set(train_slide_indices + valid_slide_indices)))

[9, 2, 3, 6, 1, 4, 0]
[8, 7, 5]
10
10
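
Because whole slides are moved, the achieved split can overshoot the 30% target. The sketch below checks this using only the counts from the summary table and the validation indices printed above, held in plain lists. The variable names are illustrative.

```python
# Per-slide counts copied from the summary table (list position = slide index).
normal = [50101, 5283, 25753, 45720, 12051, 15591, 44378, 8002, 3365, 24909]
tumor = [31, 736, 22335, 49, 71, 37151, 29157, 0, 0, 0]
valid_slide_indices = [8, 7, 5]  # validation slides from the output above

valid_tumor = sum(tumor[i] for i in valid_slide_indices)
valid_normal = sum(normal[i] for i in valid_slide_indices)

print(f"Validation tumor patches: {valid_tumor} ({valid_tumor / sum(tumor):.1%})")
print(f"Validation normal patches: {valid_normal} ({valid_normal / sum(normal):.1%})")
```

With this split, validation holds 37151 of the 89530 tumor patches, roughly 41% rather than 30%, because slide 5 alone exceeds the target. It holds only about 11% of the normal patches.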


Now that we have the indices of the two sets, let's turn them into their own datasets based on those indices.

In [10]:
len(train_slide_indices), len(valid_slide_indices)

(7, 3)
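
To see which slides each set contains, the indices can be mapped back to slide names. This is a sketch under one assumption: that the rows of the summary table follow the order in which the slides were indexed in the log above. It uses a plain list of file names as a stand-in, since how a `pathgen` dataset is subset is not shown here.

```python
# Slide file names in the order they appear in the indexing log
# (assumed to match the row order of the summary table).
slide_names = [
    "tumor_024.tif", "tumor_038.tif", "tumor_054.tif", "tumor_063.tif",
    "tumor_065.tif", "tumor_076.tif", "tumor_089.tif",
    "normal_014.tif", "normal_038.tif", "normal_100.tif",
]
train_slide_indices = [9, 2, 3, 6, 1, 4, 0]
valid_slide_indices = [8, 7, 5]

# Select the slides belonging to each set by index.
train_slides = [slide_names[i] for i in train_slide_indices]
valid_slides = [slide_names[i] for i in valid_slide_indices]

# The two sets must not share any slide.
assert set(train_slides).isdisjoint(valid_slides)
print("train:", train_slides)
print("valid:", valid_slides)
```

Under that assumption, the validation set consists of two normal slides and a single tumor slide, `tumor_076.tif`.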