### Preprocessing of CT-scans

In this tutorial you will learn how to:
1. Load CT-scans from MetaImage (mhd) format
2. Run preprocessing and dump scans to [blosc](https://github.com/Blosc/python-blosc)
    1. Resize all scans to a fixed size
    2. Unify spacing AND resize to a fixed size
3. Load dumped files and make masks for them
4. Visualize slices of scans
5. Sample crops of fixed size from preprocessed scans with masks

The examples in this notebook use the [LUNA16 competition dataset](https://luna16.grand-challenge.org/) in MetaImage (mhd/raw) format.

In [1]:
import os
import sys
import glob
import shutil
from ipywidgets import interact
from copy import deepcopy
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [4]:
sys.path.append('..')

In [5]:
from radio import CTImagesMaskedBatch as CTIMB
from radio.dataset import *

### 1. Load CT-scans from MetaImage (mhd) format


You need to specify a glob mask for the '\*.mhd' input files in DIR_LUNA and provide the output directory path in DIR_DUMP. Here we use the unzipped competition dataset, in which the mhd files are stored in subfolders. Because the index below is built with `no_ext=True`, the file names without extension are taken as scan ids.
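To make the role of the mask more concrete, here is a minimal sketch that uses only the standard library. It builds a throwaway folder tree that mimics the dataset layout and lists what a mask of the form `s*/*.mhd` matches. The helper `ids_from_mask` and the file names are hypothetical, and deriving the id from the file name without its extension is an assumption based on the `no_ext=True` argument used below.

```python
import glob
import os
import tempfile


def ids_from_mask(mask):
    """Return {id: path} for every file matching the glob mask.

    The id is the file name with its last extension removed (an assumption).
    """
    result = {}
    for path in sorted(glob.glob(mask)):
        name = os.path.basename(path)
        dot = name.rfind('.')
        result[name[:dot] if dot > 0 else name] = path
    return result


# Build a temporary tree: <root>/subsetN/<scan>.mhd plus the paired .raw file
root = tempfile.mkdtemp()
for subset, scan in [('subset0', 'scan_a'), ('subset0', 'scan_b'), ('subset1', 'scan_c')]:
    os.makedirs(os.path.join(root, subset), exist_ok=True)
    for ext in ('.mhd', '.raw'):
        open(os.path.join(root, subset, scan + ext), 'w').close()

# Only the .mhd headers match; the .raw files are skipped by the mask
found = ids_from_mask(os.path.join(root, 's*', '*.mhd'))
print(sorted(found))
```

If the printed list is empty for your own mask, the pattern most likely does not match the folder layout on disk.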

In [6]:
DIR_LUNA = '/notebooks/data/MRT/luna/s*/*.mhd'
DIR_DUMP = '/notebooks/data/MRT/output/'

**WARNING**: think carefully before running the cell below, because it deletes the output folder.

In [None]:
if os.path.exists(DIR_DUMP):
    shutil.rmtree(DIR_DUMP)

Start by creating a `Dataset.FilesIndex` and a `Dataset`.

In [None]:
ind = FilesIndex(path=DIR_LUNA, no_ext=True)

If everything is OK, you'll see the total number of mhd files.

In [None]:
len(ind.index)

In [None]:
ds = Dataset(index=ind, batch_class=CTIMB)

### 2. Run preprocessing on the dataset and dump it

#### A. Reshaping dataset to fixed shape

Note that the workflow is lazy, so nothing is executed yet.

In [None]:
workflow = (
    ds.pipeline()
      .load(fmt='raw')
      .resize(n_workers=6, shape=(128, 256, 256))
      .dump(dst=DIR_DUMP)
)
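Lazy execution can be pictured with a toy pipeline that only records the requested actions and executes them when `run` is called. This is a sketch of the general idea, not RadIO's implementation: the class `LazyPipeline` and its `add` method are hypothetical names introduced for illustration.

```python
class LazyPipeline:
    """Toy pipeline: chained calls are recorded, not executed."""

    def __init__(self, items):
        self.items = list(items)
        self.actions = []              # queued (name, function) pairs

    def add(self, name, func):
        self.actions.append((name, func))
        return self                    # returning self allows chaining

    def run(self, batch_size):
        """Execute all queued actions, batch by batch."""
        results = []
        for start in range(0, len(self.items), batch_size):
            batch = self.items[start:start + batch_size]
            for _, func in self.actions:
                batch = [func(x) for x in batch]
            results.extend(batch)
        return results


calls = []                             # records when an action really runs


def double(x):
    calls.append(x)
    return x * 2


toy = LazyPipeline(range(5)).add('double', double).add('inc', lambda x: x + 1)
calls_before_run = len(calls)          # still 0: nothing has been executed
out = toy.run(batch_size=2)
print(calls_before_run, out)
```

Building the chain is cheap; the work happens only inside `run`, which is why the real workflow above returns immediately.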

Here you actually start preprocessing.

Note that preprocessing all LUNA16 scans may take a significant amount of time.

In [None]:
BATCH_SIZE = 8
workflow.run(batch_size=BATCH_SIZE, shuffle=False)

#### B. Unify spacing AND resize to fixed size

For this purpose the pipeline would be:

In [None]:
workflow = (
    ds.pipeline()
      .load(fmt='raw')
      .unify_spacing(shape=(384, 448, 448), spacing=(0.9, 0.9, 0.9))
      .dump(dst=DIR_DUMP)
)

The idea is the following:

1) Every scan is resized so that its spacing meets the required **```(0.9, 0.9, 0.9)```**

2) The interim shape is cropped (if it is bigger) or padded (if it is smaller) to meet **```shape```**
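The two steps can be sketched with plain NumPy arithmetic. This is an illustration under stated assumptions, not RadIO's code: the helpers `interim_shape` and `crop_or_pad` are hypothetical, cropping and padding are centered, and the padding value is a plain constant; the library may choose differently. The resampling itself is left out, only the shape bookkeeping is shown.

```python
import numpy as np


def interim_shape(shape, spacing, new_spacing):
    """Shape a scan must be resized to so that its spacing becomes new_spacing."""
    shape = np.asarray(shape, dtype=float)
    factor = np.asarray(spacing, dtype=float) / np.asarray(new_spacing, dtype=float)
    return tuple(int(s) for s in np.rint(shape * factor))


def crop_or_pad(arr, target_shape, fill=0):
    """Center-crop axes that are too long, center-pad axes that are too short."""
    for axis, target in enumerate(target_shape):
        size = arr.shape[axis]
        if size > target:
            start = (size - target) // 2
            arr = np.take(arr, np.arange(start, start + target), axis=axis)
        elif size < target:
            before = (target - size) // 2
            pad = [(0, 0)] * arr.ndim
            pad[axis] = (before, target - size - before)
            arr = np.pad(arr, pad, mode='constant', constant_values=fill)
    return arr


# A hypothetical scan: 120 slices of 512 x 512 voxels, spacing (2.5, 0.7, 0.7) mm
new_shape = interim_shape((120, 512, 512), (2.5, 0.7, 0.7), (0.9, 0.9, 0.9))
print(new_shape)

# Small demo array: axis 0 is cropped, axis 1 is padded, axis 2 is untouched
demo = crop_or_pad(np.ones((6, 3, 4)), (4, 5, 4))
print(demo.shape)
```

For the hypothetical scan the interim shape is smaller than `(384, 448, 448)` along every axis, so in this sketch the scan would be padded rather than cropped.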

### 3. Load dumped scans, build masks

Here you need the annotation file with nodule locations and diameters.

It is also provided by LUNA16: https://luna16.grand-challenge.org/data/

In [None]:
nodules = pd.read_csv('/notebooks/data/MRT/luna/CSVFILES/annotations.csv')

ind_dumped = FilesIndex(path=DIR_DUMP + '*', dirs=True)

batch_dumped = CTIMB(ind_dumped.create_subset(ind_dumped.index[0 : 4]))

batch_dumped.load(fmt='blosc')

batch_dumped.fetch_nodules_info(nodules)

batch_dumped.create_mask()

So, you have just loaded one batch of 4 scans and built masks for it.
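To show what building a mask involves, here is a simplified NumPy sketch that turns one annotated nodule (a center in world coordinates plus a diameter in mm) into a binary mask. It is an assumption-laden illustration rather than RadIO's `create_mask`: the function `nodule_mask` is hypothetical, the nodule is drawn as a ball (the library may use a different shape, such as a box), and the origin and spacing values are made up.

```python
import numpy as np


def nodule_mask(shape, origin, spacing, center_world, diameter_mm):
    """Binary mask with ones inside a ball around one nodule.

    All triples are ordered (z, y, x); origin, center and diameter are in mm.
    """
    origin = np.asarray(origin, dtype=float)
    spacing = np.asarray(spacing, dtype=float)
    # World coordinates (mm) -> voxel coordinates
    center_vox = (np.asarray(center_world, dtype=float) - origin) / spacing
    # Squared physical distance of every voxel from the nodule center
    grids = np.ogrid[tuple(slice(0, s) for s in shape)]
    dist2 = sum(((g - c) * sp) ** 2 for g, c, sp in zip(grids, center_vox, spacing))
    return (dist2 <= (diameter_mm / 2.0) ** 2).astype(np.uint8)


mask = nodule_mask(shape=(16, 32, 32), origin=(-10.0, -20.0, -20.0),
                   spacing=(2.0, 1.0, 1.0), center_world=(6.0, -4.0, -4.0),
                   diameter_mm=8.0)
print(mask.shape, mask[8, 16, 16])
```

Because the spacing along z is coarser than along y and x, the same physical radius covers fewer voxels along z, which is why distances are measured in mm rather than in voxels.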

### 4. Check the whole thing: visualise slices of scans

It is convenient to use `interact` for visualising various slices and masks.

In [None]:
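def plot_arr_slices(height, *arrays, clim=(-1200, 300)):
    """Plot the slice at relative depth `height` (0..1) of each 3D array, side by side."""
    # squeeze=False keeps `axes` two-dimensional even when a single array is passed
    fig, axes = plt.subplots(1, len(arrays), figsize=(14, len(arrays)*8), squeeze=False)

    for i, arr in enumerate(arrays):
        depth = arr.shape[0]
        n_slice = int(depth * height)

        # Arrays with a wide value range (scans) use `clim`; masks use (0, 1)
        kwargs = dict()
        if np.max(arr) - np.min(arr) > 2.0:
            kwargs.update(clim=clim)
        else:
            kwargs.update(clim=(0, 1))

        axes[0, i].imshow(arr[n_slice], cmap=plt.cm.gray, **kwargs)
    plt.show()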

Let's look at the scan of the first patient.

In [None]:
n_pat = 0

interact(lambda height: plot_arr_slices(height, batch_dumped[n_pat], batch_dumped.get_mask(n_pat)), 
         height=(0.01, 0.99, 0.01))

### 5. Sample crops of fixed size from preprocessed scans with masks

Let's take the batch we loaded and visualised and sample crops with nodules from it via the ```sample_nodules``` method.

In [None]:
nods_batch = batch_dumped.sample_nodules(batch_size=10, nodule_size=(32, 64, 64), share=0.7,
                                         variance=[100, 400, 400])

It creates a new batch with 10 items, which are crops (aka patches) of size (32, 64, 64) from the original scans.

However, you may want crops with nodules in different positions (not only in the center of the crop). For this, specify ```variance```, which randomly shifts the center of the crop; in the example above the shift is on the order of (10, 20, 20) voxels along the (z, y, x) axes, i.e. the square roots of the given variances.

Also, for training neural nets, you may want to have crops without nodules at all. Specify ```share```, which is the share of items with nodules.
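The roles of ```share``` and ```variance``` can be sketched with NumPy. This is a simplified illustration, not RadIO's ```sample_nodules```: the function `sample_crop_starts` is hypothetical, the jitter is assumed to be Gaussian with the given variances, and the remaining crops are drawn uniformly without checking that they are nodule-free.

```python
import numpy as np


def sample_crop_starts(nodule_centers, scan_shape, crop_size, batch_size,
                       share, variance, seed=0):
    """Return (starts, has_nodule) for batch_size crops inside one scan.

    All triples are in voxels and ordered (z, y, x).
    """
    rng = np.random.default_rng(seed)
    scan_shape = np.asarray(scan_shape)
    crop_size = np.asarray(crop_size)
    centers = np.asarray(nodule_centers, dtype=float)
    n_cancer = int(round(batch_size * share))

    # Crops around nodules: pick a nodule, then jitter the crop center
    picked = centers[rng.integers(0, len(centers), size=n_cancer)]
    jitter = rng.normal(0.0, np.sqrt(variance), size=(n_cancer, 3))
    cancer_starts = np.rint(picked + jitter - crop_size / 2).astype(int)

    # Remaining crops: uniformly random positions (nodule-free is NOT checked here)
    random_starts = rng.integers(0, scan_shape - crop_size + 1,
                                 size=(batch_size - n_cancer, 3))

    starts = np.vstack([cancer_starts, random_starts])
    starts = np.clip(starts, 0, scan_shape - crop_size)   # keep crops inside the scan
    has_nodule = np.arange(batch_size) < n_cancer
    return starts, has_nodule


crop = (32, 64, 64)
starts, has_nodule = sample_crop_starts(
    nodule_centers=[(60, 120, 130)], scan_shape=(128, 256, 256),
    crop_size=crop, batch_size=10, share=0.7, variance=[100, 400, 400])
print(starts.shape, int(has_nodule.sum()))
```

With `share=0.7` and `batch_size=10`, seven crop positions are derived from nodule centers and three are random. A larger `variance` moves the nodule further from the crop center and, if it is large relative to the crop size, can push the nodule out of the crop.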