## Dataset creation notebook

This notebook shows how to build a dataset for the training of a new model in AxonDeepSeg. It covers the following steps:

* How to structure the raw data.
* How to define the parameters of the patch extraction and divide the raw labelled dataset into patches.
* How to generate the training dataset of patches by combining all raw data patches.



### STEP 0: IMPORTS.

In [8]:
from AxonDeepSeg.data_management import dataset_building
import os, shutil
# scipy.misc.imread/imsave were removed in SciPy 1.2; imageio provides equivalents.
from imageio import imread, imwrite as imsave

### STEP 1: GENERATE THE DATASET.

### Suggested procedure for training/validation split of the dataset:

* From the raw dataset folder, split the raw samples (subfolders) into training and validation. For instance, if you have 6 samples of similar image size, you can keep 5 of them for the training and the remaining one for the validation. 

#### The folder structure after the training/validation split:

* ***folder_raw_data***
    * **Train**
        * **sample1**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*
        * **sample2**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*
            
            ...
            
        * **sample5**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*
    * **Validation**
        * **sample6**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*

* Put the 5 selected samples in the Train subfolder and the remaining one in the Validation subfolder.
* Run the `raw_img_to_patches` function on both the Train and Validation subfolders to split the data into patches.
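
The split described above can also be scripted. The following is a minimal sketch, assuming that every raw sample initially sits directly inside the raw data folder as its own subfolder. The helper `split_samples` and its parameters `n_validation` and `seed` are hypothetical names introduced for illustration; they are not part of AxonDeepSeg.

```python
import os
import random
import shutil


def split_samples(path_raw_data, n_validation=1, seed=2017):
    """Move sample subfolders into Train/ and Validation/ subfolders.

    Hypothetical helper (not part of AxonDeepSeg). Assumes every
    subfolder of path_raw_data is one labelled sample.
    """
    samples = sorted(
        name for name in os.listdir(path_raw_data)
        if os.path.isdir(os.path.join(path_raw_data, name))
        and name not in ("Train", "Validation")
    )
    if not 0 < n_validation < len(samples):
        raise ValueError("n_validation must leave at least one training sample")

    # Shuffle with a fixed seed so that the split is reproducible.
    random.Random(seed).shuffle(samples)
    split = {
        "Validation": samples[:n_validation],
        "Train": samples[n_validation:],
    }

    for subset, names in split.items():
        os.makedirs(os.path.join(path_raw_data, subset), exist_ok=True)
        for name in names:
            shutil.move(os.path.join(path_raw_data, name),
                        os.path.join(path_raw_data, subset, name))
    return split
```

With 6 sample folders and `n_validation=1`, the function moves 5 samples into `Train` and 1 into `Validation`, and returns the names assigned to each subset.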

#### 1.1. Define the parameters of the patch extraction.

* **path_raw_data**: Path of the folder that contains the raw data. Each labelled sample of the dataset should be in a different subfolder. For each sample (and subfolder), the expected files are the following:
    * *"image.png"*: The microscopy sample image (uint8 format).
    * *"mask.png"*: The labelled mask (ground truth) of the microscopy sample image (uint8 format).
    * *"pixel_size_in_micrometer.txt"*: A one-line text file with the value of the pixel size of the sample. For instance, if the pixel size of the sample is 0.02 µm, the value in the text file should be **"0.02"**.
    
* **path_patched_data**: Path of the folder that will contain the raw data divided into patches. Each sample (i.e. subfolder) of the raw dataset will be divided into patches and saved in this folder. For instance, if a sample of the original data is divided into 10 patches, the corresponding folder in **path_patched_data** will contain 10 image patches and 10 mask patches, named **image_0.png** to **image_9.png** and **mask_0.png** to **mask_9.png**, respectively.

* **patch_size**: The size of the patches in pixels. For instance, a patch size of **128** means that each generated patch will be 128x128 pixels.

* **general_pixel_size**: The pixel size (i.e. resolution) of the generated patches in micrometers. The pixel size will be the same for all generated patches. If the selected pixel size is different from the native pixel sizes of the samples, downsampling or upsampling will be performed. Note that the pixel size should be chosen by taking into account the modality of the dataset and the patch size.  
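
Before extracting patches, it can help to check that each sample folder follows the layout described for **path_raw_data**. The sketch below is one possible way to do that with the standard library only. The helpers `missing_files` and `read_pixel_size` are hypothetical names, not AxonDeepSeg functions; the file names are the ones listed above.

```python
import os

EXPECTED_FILES = ("image.png", "mask.png", "pixel_size_in_micrometer.txt")


def missing_files(sample_dir):
    """Return the expected files that are absent from one sample folder."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(sample_dir, name))]


def read_pixel_size(sample_dir):
    """Read the pixel size (in micrometers) from the one-line text file."""
    path = os.path.join(sample_dir, "pixel_size_in_micrometer.txt")
    with open(path) as f:
        return float(f.read().strip())
```

An empty list returned by `missing_files` means the sample folder is complete; `read_pixel_size` raises an error if the text file is missing or does not contain a number.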

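In [9]:
# Raw samples and output folders for the patches (training set)
path_raw_data_train = '/Users/rudinakaprata/Dropbox/raw_data_SEM/Train'
path_patched_data_train = '/Users/rudinakaprata/Dropbox/patched_data_SEM/Train'

# Raw samples and output folders for the patches (validation set)
path_raw_data_validation = '/Users/rudinakaprata/Dropbox/raw_data_SEM/Validation'
path_patched_data_validation = '/Users/rudinakaprata/Dropbox/patched_data_SEM/Validation'

patch_size = 256          # patch side length, in pixels
general_pixel_size = 0.1  # pixel size of the generated patches, in micrometers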
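
To judge whether a combination of `patch_size` and `general_pixel_size` is sensible, two quantities are useful: the factor by which a sample is resampled, and the physical area covered by one patch. The sketch below only illustrates the arithmetic; the function names are hypothetical, the native pixel size of 0.02 µm is the example value used earlier, and the exact resampling performed inside AxonDeepSeg may differ.

```python
def resampling_factor(native_pixel_size, general_pixel_size):
    """Scale factor applied to each image side when resampling.

    A factor below 1 means downsampling; above 1 means upsampling.
    """
    return native_pixel_size / general_pixel_size


def patch_field_of_view(patch_size, general_pixel_size):
    """Physical side length (in micrometers) covered by one patch."""
    return patch_size * general_pixel_size


# Example: a sample acquired at 0.02 um per pixel, resampled to 0.1 um per pixel.
print(resampling_factor(0.02, 0.1))   # below 1, so the sample is downsampled
print(patch_field_of_view(256, 0.1))  # side length of one patch in micrometers
```

With the values chosen in this notebook, each 256x256 patch covers a square of about 25.6 µm per side.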

Important! A random seed is used when the final dataset is generated (the `random_seed` argument of `patched_to_dataset`). Using the same random seed reproduces exactly the same images on every run.

This makes it possible to regenerate the same validation set and the same testing set.
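
The effect of a fixed seed can be illustrated with the standard `random` module. This is only a sketch of the general principle, not the code AxonDeepSeg runs internally; `shuffled_patch_ids` is a hypothetical helper.

```python
import random


def shuffled_patch_ids(n_patches, seed):
    """Return patch indices in a pseudo-random order fixed by the seed."""
    ids = list(range(n_patches))
    random.Random(seed).shuffle(ids)
    return ids


# The same seed gives the same order on every call.
first = shuffled_patch_ids(10, seed=2017)
second = shuffled_patch_ids(10, seed=2017)
print(first == second)  # True
```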

We then call the function `raw_img_to_patches`. It automatically creates the patches in the previously specified folder.

#### 1.2. Create the training and validation patches.

Change `thresh_indices` if you want to generate a mask with different classes.

In [11]:
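dataset_building.raw_img_to_patches(path_raw_data_train, path_patched_data_train, thresh_indices=[0, 0.2, 0.8], patch_size=patch_size, resampling_resolution=general_pixel_size)
dataset_building.raw_img_to_patches(path_raw_data_validation, path_patched_data_validation, thresh_indices=[0, 0.2, 0.8], patch_size=patch_size, resampling_resolution=general_pixel_size)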

100%|██████████| 7/7 [00:06<00:00,  1.02it/s]
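
The role of `thresh_indices` is easier to see on a small example. The sketch below assumes that each threshold is the lower bound of one class, expressed as a fraction of the maximum 8-bit intensity (255). This convention and the helper name `pixel_class` are assumptions made for illustration; they are not taken from the AxonDeepSeg source, so check the library code before relying on them.

```python
from bisect import bisect_right


def pixel_class(value, thresh_indices=(0, 0.2, 0.8)):
    """Map an 8-bit mask value to a class index.

    Assumed convention: each threshold is the lower bound of a class,
    expressed as a fraction of the maximum intensity (255).
    """
    return bisect_right(thresh_indices, value / 255) - 1


for value in (0, 128, 255):
    print(value, pixel_class(value))
# prints:
# 0 0
# 128 1
# 255 2
```

Under that assumption, `[0, 0.2, 0.8]` yields three classes, while a list such as `[0, 0.5]` would yield two.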


Enter the path of the new dataset here (the folder will be created if it does not exist).

In [5]:
path_dataset = '/Users/rudinakaprata/Dropbox/processed_final_data_SEM/'

Create the dataset by combining the patches.

In [6]:
# Combine the training patches into the final dataset.
dataset_building.patched_to_dataset(path_patched_data_train, path_dataset, type_='unique', random_seed=2017)

100%|██████████| 8/8 [00:08<00:00,  1.11s/it]
