In [None]:
import sys
sys.path.append('../')

# Building batches for training neural nets

Welcome! This is the third tutorial in the series on lung cancer research with RadIO. If you haven't read the first two tutorials, we encourage you to do so before tackling this one. In any case, here is a quick reminder.

Machine learning solutions always start with preprocessing. RadIO treats preprocessing as a chained sequence of actions, a `Pipeline`. Each `Pipeline` represents a *plan* of what is going to happen to the data rather than a real computation, and is made of actions implemented in RadIO ([or by you](https://analysiscenter.github.io/lung_cancer/intro/preprocessing.html#writing-your-own-actions)). E.g., you can set up a simple preprocessing pipeline that includes `load` from the [Luna dataset](https://luna16.grand-challenge.org/) format and `resize` to shape **[92, 256, 256]** in the following way:

In [None]:
from radio.dataset import Pipeline                 # the cell executes fast
simple_preproc = (Pipeline()                       # we only write a plan
                  .load(fmt='raw')                 # no computations here
                  .resize(shape=(92, 256, 256)))   # it happens later
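
To make the "plan, not computation" idea concrete, here is a minimal sketch of a deferred pipeline in plain Python. `LazyPipeline`, the toy `load`/`resize` functions, and the `run` method are hypothetical names invented for this illustration; this is not how RadIO implements `Pipeline`, only the same general pattern.

```python
class LazyPipeline:
    """Toy stand-in for a deferred pipeline (hypothetical, not RadIO's class)."""

    def __init__(self):
        self.plan = []  # recorded (action name, kwargs) pairs; nothing runs yet

    def __getattr__(self, name):
        # Called only for attributes that do not exist: treat them as action names.
        if name.startswith('_'):
            raise AttributeError(name)

        def record(**kwargs):
            self.plan.append((name, kwargs))
            return self  # returning self is what makes chaining possible

        return record

    def run(self, actions, data):
        # Only here are the recorded actions actually executed, in order.
        for name, kwargs in self.plan:
            data = actions[name](data, **kwargs)
        return data


# Toy actions that just annotate a dict instead of touching real scans.
def load(data, fmt):
    return {**data, 'fmt': fmt}


def resize(data, shape):
    return {**data, 'shape': shape}


toy = LazyPipeline().load(fmt='raw').resize(shape=(92, 256, 256))
planned = [name for name, _ in toy.plan]                 # the plan exists already
result = toy.run({'load': load, 'resize': resize}, {})   # the work happens here
```

Building `toy` only appends to a list, which is why a cell that defines a pipeline executes fast; the cost is paid when the plan is run on data.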

It might be a good idea to replace `resize` with `unify_spacing`, which not only changes the shape of scans but also zooms them to the same scale:

In [None]:
other_simple_preproc = (Pipeline()
                        .load(fmt='raw')
                        .unify_spacing(shape=(92, 256, 256), spacing=(3.5, 1.0, 1.0)))
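
The arithmetic behind "zooming to the same scale" can be sketched as follows. The example scan (shape `(140, 512, 512)`, spacing `(2.5, 0.7, 0.7)`) is a made-up value, and the rounding rule and the function name `resampled_shape` are assumptions for illustration; RadIO's own implementation may round or pad differently.

```python
def resampled_shape(shape, spacing, new_spacing):
    """Shape a scan would have after resampling to new_spacing.

    Each axis keeps its physical extent, so the voxel count scales by
    old_spacing / new_spacing. Rounding to the nearest integer is an assumption.
    """
    return tuple(int(round(n * s / t))
                 for n, s, t in zip(shape, spacing, new_spacing))


# Hypothetical scan: 140 slices 2.5 mm apart, 512 x 512 pixels of 0.7 mm.
toy_shape = resampled_shape(shape=(140, 512, 512),
                            spacing=(2.5, 0.7, 0.7),
                            new_spacing=(3.5, 1.0, 1.0))

# Voxels per axis that would still have to be cropped (positive) or padded
# (negative) to reach the requested output shape.
target_shape = (92, 256, 256)
excess = tuple(r - t for r, t in zip(toy_shape, target_shape))
```

With these numbers the resampled scan is larger than the target along every axis, so reaching `(92, 256, 256)` means cropping rather than padding. A plain `resize` would instead squeeze the whole scan into the target shape, leaving each scan with its own scale.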

You can also add data-augmenting actions to your pipeline, e.g. `rotate` to rotate scans, or `central_crop` to crop out their central part:

In [None]:
augmenting_pipeline = (Pipeline()
                       .load(fmt='raw')
                       .unify_spacing(shape=(92, 256, 256), spacing=(3.5, 1.0, 1.0))
                       .central_crop(crop_size=(64, 192, 192))) 
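
For intuition, the voxel range kept by a central crop can be computed by hand. The helper `central_crop_slices` below is a hypothetical name, and centering with floor division is an assumption; RadIO may treat odd remainders differently.

```python
def central_crop_slices(shape, crop_size):
    """Slices that select the central crop_size block out of an array of shape."""
    slices = []
    for n, c in zip(shape, crop_size):
        if c > n:
            raise ValueError('crop size %d exceeds axis length %d' % (c, n))
        start = (n - c) // 2          # equal margin on both sides (floor division)
        slices.append(slice(start, start + c))
    return tuple(slices)


toy_slices = central_crop_slices(shape=(92, 256, 256), crop_size=(64, 192, 192))
toy_offsets = tuple(s.start for s in toy_slices)   # first voxel kept on each axis
```

For the pipeline above this drops 14 slices on each side along the first axis and 32 pixels on each side along the other two.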

In [None]:
from radio.dataset import FilesIndex, Dataset
from radio import CTImagesMaskedBatch

LUNA_MASK = '/data/MRT/luna/s*/*.mhd'                                      # set glob-mask for scans from Luna-dataset here
luna_index = FilesIndex(path=LUNA_MASK, no_ext=True)                       # preparing indexing structure
luna_dataset = Dataset(index=luna_index, batch_class=CTImagesMaskedBatch)
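
As a rough picture of what an index built from a glob mask with `no_ext=True` looks like, here is a small standard-library sketch. `index_from_mask` is a hypothetical helper, and the temporary files are empty placeholders; the real `FilesIndex` does more than this.

```python
import glob
import os
import tempfile


def index_from_mask(mask, no_ext=True):
    """Map an id derived from each matching file name to that file's path."""
    index = {}
    for path in sorted(glob.glob(mask)):
        name = os.path.basename(path)
        if no_ext:
            name = os.path.splitext(name)[0]   # 'scan_a.mhd' -> 'scan_a'
        index[name] = path
    return index


# Build a throwaway directory tree that mimics the 's*/*.mhd' layout.
with tempfile.TemporaryDirectory() as root:
    for subset, scan in [('s0', 'scan_a'), ('s1', 'scan_b')]:
        os.makedirs(os.path.join(root, subset))
        open(os.path.join(root, subset, scan + '.mhd'), 'w').close()
    toy_index = index_from_mask(os.path.join(root, 's*', '*.mhd'))
    toy_ids = sorted(toy_index)
```

Each scan is then addressed by its file name without the extension, regardless of which subset folder it lives in.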

In [None]:
# run the plan: load and preprocess the first two scans, in index order
bch1 = (luna_dataset >> other_simple_preproc).next_batch(2, shuffle=False)

In [None]:
# same two scans, this time also cropped to their central part
bch2 = (luna_dataset >> augmenting_pipeline).next_batch(2, shuffle=False)

In [None]:
from utils import show_slices
show_slices([bch1, bch2], scan_indices=[0, 0], ns_slice=[30, 58], grid=True)

In [None]:
bch1.indices == bch2.indices

In [None]:
bch1.origin

In [None]:
bch2.origin
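
When comparing the two `origin` values, it can help to know what a crop does to world coordinates in general. The sketch below assumes that `origin` is the world coordinate of the first voxel and uses the same axis order as `spacing`; the origin value is made up, and `cropped_origin` is a hypothetical helper. Whether `central_crop` updates `origin` in exactly this way should be checked against the outputs of the cells above.

```python
def cropped_origin(origin, spacing, offset):
    """World coordinate of the first voxel kept after cropping.

    Skipping `offset` voxels along an axis moves the origin by
    offset * spacing along that axis.
    """
    return tuple(o + k * s for o, s, k in zip(origin, spacing, offset))


# Hypothetical origin; offsets of a (64, 192, 192) central crop of (92, 256, 256).
toy_origin = cropped_origin(origin=(-300.0, -200.0, -200.0),
                            spacing=(3.5, 1.0, 1.0),
                            offset=(14, 32, 32))
```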