## Dataset creation notebook

This notebook shows how to build a dataset for training a new model in AxonDeepSeg. It covers the following steps:

* How to structure the raw data.
* How to define the parameters of the patch extraction and divide the raw labelled dataset into patches.
* How to generate the training dataset of patches by combining all raw data patches.



### STEP 0: IMPORTS.

In [1]:
from AxonDeepSeg.data_management import dataset_building
import os, shutil
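# Note: scipy.misc.imread/imsave were removed in SciPy 1.2, so this import only works with older SciPy versions.
# Neither function is used in the cells below.
from scipy.misc import imread, imsave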

### STEP 1: GENERATE THE DATASET.

#### Suggested procedure for the training/validation split of the dataset:

* **Example use case:** we have 6 labelled samples in our dataset. To respect the split convention (10% to 30% of the samples kept for validation), we can keep 5 samples for training and the remaining one (about 17% of the samples) for validation.

---
#### The folder structure *before* the training/validation split:

* ***folder_of_your_raw_data***

     * **sample1**
          * *image.png*
          * *mask.png*
          * *pixel_size_in_micrometer.txt*
     * **sample2**
          * *image.png*
          * *mask.png*
          * *pixel_size_in_micrometer.txt*
            
            ...
            
     * **sample6**
          * *image.png*
          * *mask.png*
          * *pixel_size_in_micrometer.txt*
            
---
#### The folder structure *after* the training/validation split:

* ***folder_of_your_raw_data***

    * **Train**
        * **sample1**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*
        * **sample2**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*
            
            ...
            
        * **sample5**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*
            
    * **Validation**
        * **sample6**
            * *image.png*
            * *mask.png*
            * *pixel_size_in_micrometer.txt*
---         
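The split above is done by hand. As a minimal sketch, assuming the *before* layout shown earlier, the move into **Train** and **Validation** could be scripted as follows. The helper name `split_samples` and its parameters are hypothetical and are not part of AxonDeepSeg.

```python
import os
import random
import shutil


def split_samples(path_raw_data, validation_fraction=0.2, seed=0):
    """Move sample subfolders into Train/ and Validation/ (hypothetical helper)."""
    # List the sample subfolders, ignoring existing Train/Validation folders.
    samples = sorted(
        name for name in os.listdir(path_raw_data)
        if os.path.isdir(os.path.join(path_raw_data, name))
        and name not in ("Train", "Validation")
    )
    # Keep at least one sample for validation.
    n_validation = max(1, round(len(samples) * validation_fraction))
    # Shuffle with a fixed seed so that the split is reproducible.
    random.Random(seed).shuffle(samples)
    split = {
        "Validation": samples[:n_validation],
        "Train": samples[n_validation:],
    }
    for subset, names in split.items():
        os.makedirs(os.path.join(path_raw_data, subset), exist_ok=True)
        for name in names:
            shutil.move(os.path.join(path_raw_data, name),
                        os.path.join(path_raw_data, subset, name))
    return split
```

With 6 samples and `validation_fraction=0.2`, this keeps 1 sample for validation and 5 for training, as in the example use case.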

#### 1.1. Define the parameters of the patch extraction.

* **path_raw_data**: Path of the folder that contains the raw data. Each labelled sample of the dataset should be in a different subfolder. For each sample (and subfolder), the expected files are the following:
    * *"image.png"*: The microscopy sample image (uint8 format).
    * *"mask.png"*: The corresponding labelled mask of the sample (uint8 format).
    * *"pixel_size_in_micrometer.txt"*: A one-line text file with the value of the pixel size of the sample. For instance, if the pixel size of the sample is 0.02um, the value in the text file should be **"0.02"**.
    
* **path_patched_data**: Path of the folder that will contain the raw data divided into patches. Each sample (i.e. subfolder) of the raw dataset will be divided into patches and saved in this folder. For instance, if a sample of the original data is divided into 10 patches, the corresponding folder in **path_patched_data** will contain 10 image patches and 10 mask patches, named **image_0.png** to **image_9.png** and **mask_0.png** to **mask_9.png**, respectively.

* **patch_size**: The size of the patches in pixels. For instance, a patch size of **128** means that each generated patch will be 128x128 pixels.

* **general_pixel_size**: The pixel size (i.e. resolution) of the generated patches in micrometers. The pixel size will be the same for all generated patches. If the selected pixel size is different from the native pixel sizes of the samples, downsampling or upsampling will be performed. Note that the pixel size should be chosen by taking into account the modality of the dataset and the patch size.  

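The expected content of **path_raw_data** described above can be checked before running the patch extraction. The following is only a sketch: `check_raw_samples` is a hypothetical helper, not an AxonDeepSeg function, and it only verifies that the three files exist and that the pixel size parses as a positive number.

```python
import os

EXPECTED_FILES = ("image.png", "mask.png", "pixel_size_in_micrometer.txt")


def check_raw_samples(path_raw_data):
    """Return {sample_name: [problems]} for samples that look incomplete."""
    problems = {}
    for name in sorted(os.listdir(path_raw_data)):
        sample_dir = os.path.join(path_raw_data, name)
        if not os.path.isdir(sample_dir):
            continue
        # Report every expected file that is missing from the subfolder.
        issues = [f for f in EXPECTED_FILES
                  if not os.path.isfile(os.path.join(sample_dir, f))]
        # If the pixel size file exists, check that it holds a positive number.
        pixel_file = os.path.join(sample_dir, "pixel_size_in_micrometer.txt")
        if os.path.isfile(pixel_file):
            with open(pixel_file) as handle:
                text = handle.read().strip()
            try:
                if float(text) <= 0:
                    issues.append("pixel size must be positive")
            except ValueError:
                issues.append("pixel size is not a number")
        if issues:
            problems[name] = issues
    return problems
```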
In [2]:
# Define the paths for the training samples
path_raw_data_train = '/Users/Dropbox/raw_data_SEM/Train'
path_patched_data_train = '/Users/Dropbox/patched_data_SEM/Train'

# Define the paths for the validation samples
path_raw_data_validation = '/Users/Dropbox/raw_data_SEM/Validation'
path_patched_data_validation = '/Users/Dropbox/patched_data_SEM/Validation'

patch_size = 256
general_pixel_size = 0.1

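To see how **patch_size** and **general_pixel_size** interact, the arithmetic can be worked through for the values above. This sketch assumes a sample whose native pixel size is 0.02 micrometers (the example value from section 1.1) and that the resampling factor is the ratio of native to target pixel size. The function name `resampling_summary` is hypothetical.

```python
def resampling_summary(patch_size, general_pixel_size, native_pixel_size):
    """Return (field of view in um, scale factor, raw pixels per patch side)."""
    # Physical width covered by one patch after resampling.
    field_of_view_um = patch_size * general_pixel_size
    # Factor by which the raw image is rescaled (< 1 means downsampling).
    scale_factor = native_pixel_size / general_pixel_size
    # Number of raw-image pixels that end up in one patch side.
    native_pixels_per_side = patch_size / scale_factor
    return field_of_view_um, scale_factor, native_pixels_per_side


fov, scale, native_side = resampling_summary(256, 0.1, 0.02)
print("Patch field of view: %.1f um" % fov)
print("Scale factor applied to the raw image: %.2f" % scale)
print("Raw pixels covered by one patch side: %.0f" % native_side)
```

With these values, each 256x256 patch covers 25.6 micrometers per side, and a raw image at 0.02 micrometers per pixel is downsampled by a factor of 0.2, so one patch side corresponds to 1280 raw pixels.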
#### 1.2. Divide the training/validation samples into patches.

In the **path_patched_data** folder defined above, each original sample is split into patches of the same size. For instance, sample 1 of the training set in the example use case above will be split into patches indexed from 0 to *n*, and its corresponding subfolder in the **path_patched_data** folder will have the following structure:

---
* ***folder_of_your_patched_data***

    * **Train**
        * **sample1**
            * *image_0.png*
            * *mask_0.png*
            * *image_1.png* 
            * *mask_1.png*
            * *image_2.png*
            * *mask_2.png*
            
            ...
            
            * *image_n.png* 
            * *mask_n.png*        
---


* Run the *raw_img_to_patches* function on both the *Train* and *Validation* subfolders to split the data into patches. The **thresh_indices** input parameter is a list of the threshold values used to generate the classes of the training masks. The default value is [0, 0.2, 0.8], meaning that the mask labels (background=0, myelin=0.5, axon=1) are separated into 3 classes: background falls between 0 and 0.2, myelin between 0.2 and 0.8, and axon above 0.8.

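The effect of **thresh_indices** can be illustrated with a small NumPy sketch. It assumes that the uint8 mask is normalized to the range 0 to 1 and that each pixel is assigned to the threshold interval it falls in. The function `mask_to_classes` is hypothetical and is not the code used inside *raw_img_to_patches*.

```python
import numpy as np


def mask_to_classes(mask_uint8, thresh_indices=(0, 0.2, 0.8)):
    """Map a uint8 mask to class indices (0=background, 1=myelin, 2=axon)."""
    # Normalize grey levels from [0, 255] to [0, 1].
    normalized = np.asarray(mask_uint8, dtype=float) / 255.0
    # np.digitize returns i such that bins[i-1] <= x < bins[i];
    # subtracting 1 gives a zero-based class index.
    return np.digitize(normalized, thresh_indices) - 1


# Grey levels for background (0), myelin (about 0.5) and axon (1).
example_mask = np.array([[0, 127, 255]], dtype=np.uint8)
print(mask_to_classes(example_mask))
```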
In [3]:
# Split the *Train* dataset into patches
dataset_building.raw_img_to_patches(path_raw_data_train, path_patched_data_train, thresh_indices=[0, 0.2, 0.8], patch_size=patch_size, resampling_resolution=general_pixel_size)

# Split the *Validation* dataset into patches
dataset_building.raw_img_to_patches(path_raw_data_validation, path_patched_data_validation, thresh_indices=[0, 0.2, 0.8], patch_size=patch_size, resampling_resolution=general_pixel_size)


100%|██████████| 7/7 [00:09<00:00,  1.30s/it]
100%|██████████| 1/1 [00:04<00:00,  4.34s/it]


#### 1.3. Regroup all the divided patches in the same training/validation folder.

Finally, to build the dataset folder that is going to be used for the training, all patches obtained from the different samples are regrouped into the same folder and renamed. The final training and validation folders will have the following structure (*m* and *p* are the indices of the last training patch and the last validation patch, respectively):

---
* ***folder_of_your_final_patched_data***

    * **Train**
         * *image_0.png*
         * *mask_0.png*
         * *image_1.png* 
         * *mask_1.png*
         * *image_2.png*
         * *mask_2.png*
            
         ...
           
         * *image_m.png* 
         * *mask_m.png*   
         
    * **Validation**
         * *image_0.png*
         * *mask_0.png*
         * *image_1.png* 
         * *mask_1.png*
         * *image_2.png*
         * *mask_2.png*
            
         ...
           
         * *image_p.png* 
         * *mask_p.png*         
         
---

Note that we pass a random seed to the *patched_to_dataset* function so that it produces exactly the same dataset each time it is run, which makes the training and validation sets reproducible. Also note that the **type_** input argument can be set to **"unique"** or **"mixed"** to specify whether the generated dataset comes from a single modality or contains more than one modality.

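To make the role of the random seed concrete, here is one plausible way a seeded regrouping step could work. This is an illustrative sketch only: `regroup_patches` is a hypothetical function, and the notebook does not show whether or how *patched_to_dataset* shuffles the patches internally.

```python
import os
import random
import shutil


def regroup_patches(path_patched_data, path_final_dataset, random_seed=2017):
    """Copy all image/mask pairs from sample subfolders into one folder."""
    pairs = []
    for sample in sorted(os.listdir(path_patched_data)):
        sample_dir = os.path.join(path_patched_data, sample)
        if not os.path.isdir(sample_dir):
            continue
        for name in sorted(os.listdir(sample_dir)):
            if name.startswith("image_") and name.endswith(".png"):
                # image_k.png is paired with mask_k.png in the same subfolder.
                mask_name = "mask_" + name[len("image_"):]
                pairs.append((os.path.join(sample_dir, name),
                              os.path.join(sample_dir, mask_name)))
    # A seeded shuffle gives the same order, and so the same file names, on every run.
    random.Random(random_seed).shuffle(pairs)
    os.makedirs(path_final_dataset, exist_ok=True)
    for index, (image_path, mask_path) in enumerate(pairs):
        # shutil.copy raises an error if a mask is missing for an image.
        shutil.copy(image_path,
                    os.path.join(path_final_dataset, "image_%d.png" % index))
        shutil.copy(mask_path,
                    os.path.join(path_final_dataset, "mask_%d.png" % index))
    return len(pairs)
```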
In [4]:
# Path of the final training dataset
path_final_dataset_train = '/Users/Dropbox/final_patched_data_SEM/Train'

# Path of the final validation dataset
path_final_dataset_validation = '/Users/Dropbox/final_patched_data_SEM/Validation'

In [5]:
# Regroup all training patches
dataset_building.patched_to_dataset(path_patched_data_train, path_final_dataset_train, type_='unique', random_seed=2017)

# Regroup all validation patches
dataset_building.patched_to_dataset(path_patched_data_validation, path_final_dataset_validation, type_='unique', random_seed=2017)


100%|██████████| 7/7 [00:05<00:00,  1.36it/s]
100%|██████████| 1/1 [00:02<00:00,  2.45s/it]

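As a final sanity check, one can verify that every image patch in the final folders has a matching mask. The sketch below assumes the *image_k.png* / *mask_k.png* naming shown above. The helper `count_patch_pairs` is hypothetical and not part of AxonDeepSeg.

```python
import os
import re


def count_patch_pairs(folder):
    """Return the number of image/mask pairs, raising if an index is unpaired."""
    images, masks = set(), set()
    for name in os.listdir(folder):
        match = re.fullmatch(r"(image|mask)_(\d+)\.png", name)
        if match:
            target = images if match.group(1) == "image" else masks
            target.add(int(match.group(2)))
    # Indices present in only one of the two sets have no partner file.
    if images != masks:
        raise ValueError("Unpaired patch indices: %s" % sorted(images ^ masks))
    return len(images)
```

For example, `count_patch_pairs(path_final_dataset_train)` would return the number of training pairs, using the path defined in the cells above.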