# Data preparation

- This notebook covers the data preprocessing steps for [SAM-Adapter](https://github.com/tianrun-chen/SAM-Adapter-PyTorch).
- The code is mainly adapted from [HiSup](https://github.com/SarahwXU/HiSup).
- This workflow is only suitable for ***binary segmentation***. Feel free to adapt it for multiclass segmentation.
- Optionally, you can upscale images by a factor of 4 with a super-resolution model ([EDSR](https://github.com/aswintechguy/Deep-Learning-Projects/tree/main/Super%20Resolution%20-%20OpenCV)) run through OpenCV.
- The goal is to convert large GeoTIFF image/label data into small patches for deep learning models.<br>
- The default structure and format of the input datasets are as follows:<br>

- **Data Structure:** <br>

    Dataset1<br>
    - raw
        - train
            - images  (GeoTIFF, uint8, 3 bands (RGB); you can create and enhance the image data in GIS software in advance)
            - gt      (GeoTIFF, uint8, values: 0 (background), 255 (targets); for binary segmentation the target value need not be 255, but it must differ from the background value)
        - test
            - images
            - gt
    
    Dataset2<br>
        ... ...<br>
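
Before running the cells below, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch, not part of `DataProcessing`; the helper name `missing_dirs` and its `splits`/`kinds` parameters are hypothetical. If your training folders use other names (for example the `train_` naming rule used later in this notebook), pass those names through `splits`.

```python
from pathlib import Path
import tempfile


def missing_dirs(dataset_root, splits=("train", "test"), kinds=("images", "gt")):
    """Return the expected raw sub-folders that do not exist under dataset_root.

    Hypothetical helper: it only checks folder names, not file contents.
    """
    root = Path(dataset_root)
    expected = [root / "raw" / split / kind for split in splits for kind in kinds]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_dir()]


# Demo on a throwaway directory; "Dataset1" is a placeholder dataset name.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "Dataset1"
    for split in ("train", "test"):
        (dataset / "raw" / split / "images").mkdir(parents=True)
    (dataset / "raw" / "train" / "gt").mkdir()  # test/gt is left out on purpose
    demo_missing = missing_dirs(dataset)
    print(demo_missing)
```

An empty list means every expected folder is present.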

In [1]:
import os

# the current working directory, where your code is stored
path = os.getcwd() 
print(path)

# if the path is not the expected one, uncomment these two lines to set the right working directory
# path_base = "/content/drive/MyDrive/PhD_Research/SAM/Data_Final/Data_Preparation/Data_Preparation_Final"
# os.chdir(path_base)

# set up the path for datasets
path_database = "/home/yunya/anaconda3/envs/Data"

# import self-defined functions
from DataProcessing import data_process_sam_seg_final, data_process_augmentation_final
from DataProcessing import upscale_data_by_SR_final, upscale_data_by_cubic_final
from DataProcessing import upscale_testing_image_by_SR_final, upscale_testing_data_cubic_final

/home/yunya/anaconda3/envs/Data_Preparation/Data_Preparation_Final


## Data preparation for SAM Adapter

In [None]:
# you may have multiple datasets to process; list them, then select the ones you want to process
os.listdir(path_database)

In [4]:
# list the datasets to be processed
data_list = ['Dagaha2017', "Djibo2019"]

# for type_list, the naming rule is "train_" plus a short description.
# this lets you try different training datasets.
# for example, to compare the influence of data size on model performance, I set up train_small and train_large.
# each training data folder can hold multiple GeoTIFF files of images and ground truth
type_list = ['train_small']

In [5]:
# 1024 is the default patch size for SAM-Adapter.
# however, a patch of 1024 by 1024 pixels often contains many small buildings, which can make training difficult
patch_size = 1024

data_process_sam_seg_final(path_database, data_list, type_list, patch_size)

Start processing: Dagaha2017 train_small 1024
Start processing: Djibo2019 train_small 1024
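
The patching step can be pictured as sliding a fixed-size window over the raster. The sketch below is only an illustration on a synthetic NumPy array; it is not the implementation of `data_process_sam_seg_final`, and the helper name `tile_patches` is hypothetical. It assumes non-overlapping patches and simply drops edge regions smaller than the patch size, which may differ from what the actual script does.

```python
import numpy as np


def tile_patches(image, patch_size):
    """Split a (bands, height, width) array into non-overlapping square patches.

    Assumption of this sketch: edge regions smaller than patch_size are dropped.
    """
    _, height, width = image.shape
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append(image[:, top:top + patch_size, left:left + patch_size])
    return patches


# Synthetic 3-band uint8 array standing in for a GeoTIFF already read into memory.
image = np.zeros((3, 2500, 3100), dtype=np.uint8)
patches_1024 = tile_patches(image, 1024)
patches_256 = tile_patches(image, 256)
print(len(patches_1024), len(patches_256))
```

The same raster yields far more 256-pixel patches than 1024-pixel ones, and each small patch covers fewer buildings.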


In [6]:
# 256 is selected to create smaller patches that can be upscaled to 1024 by EDSR or by traditional interpolation (bilinear/cubic).
# you can choose other sizes when upscaling by bicubic interpolation, but 256 is fixed for EDSR.
patch_size = 256

data_process_sam_seg_final(path_database, data_list, type_list, patch_size)

Start processing: Dagaha2017 train_small 256
Start processing: Djibo2019 train_small 256


## Data augmentation by flipping and rotation

In [7]:
# set up datasets to be used
data_list = ['Dagaha2017']
type_list = ['train_small']

# use data augmentation when the training data is too small to produce satisfying results
# select all or some of the following data augmentation choices
# operation_list = ["vertical_flip", "horizontal_flip", "rotate"]
# degrees_list = [90, 180, 270]

In [8]:
patch_size = 1024
operation_list = ["vertical_flip", "rotate"]
degrees_list = [90, 180]
# to test the influence of different data augmentation combinations,
# set "aug_idx" to another value, such as 1, 2... or _flip_rot40 (the leading _ makes the output folder name more readable)
aug_idx = ""

data_process_augmentation_final(path_database, data_list, type_list, patch_size, operation_list, degrees_list)

Start processing: /home/yunya/anaconda3/envs/Data/Dagaha2017/SAM/1024
Done.


In [9]:
patch_size = 256
operation_list = ["vertical_flip", "rotate"]
degrees_list = [90, 180]
# to test more combinations, set "aug_idx" to another value, such as 1, 2... or _flip_rot40 (the leading _ makes the output folder name more readable)
aug_idx = ""

data_process_augmentation_final(path_database, data_list, type_list, patch_size, operation_list, degrees_list)

Start processing: /home/yunya/anaconda3/envs/Data/Dagaha2017/SAM/256
Done.
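
For segmentation data, each flip or rotation has to be applied identically to the image patch and to its ground truth mask, otherwise the labels no longer line up. The sketch below shows the idea with NumPy only. It is not the implementation of `data_process_augmentation_final`; the helper name `augment` and the returned naming scheme are hypothetical, and it assumes rotations are multiples of 90 degrees.

```python
import numpy as np


def augment(image, mask, operation_list, degrees_list):
    """Return (name, image, mask) tuples with the same transform applied to both.

    image is (height, width, bands); mask is (height, width).
    Assumption of this sketch: only rotations by multiples of 90 degrees.
    """
    results = []
    if "vertical_flip" in operation_list:
        results.append(("vertical_flip", np.flipud(image), np.flipud(mask)))
    if "horizontal_flip" in operation_list:
        results.append(("horizontal_flip", np.fliplr(image), np.fliplr(mask)))
    if "rotate" in operation_list:
        for degrees in degrees_list:
            if degrees % 90 != 0:
                raise ValueError("this sketch only supports multiples of 90 degrees")
            # np.rot90 rotates counterclockwise in the (height, width) plane.
            k = degrees // 90
            results.append((f"rotate_{degrees}", np.rot90(image, k), np.rot90(mask, k)))
    return results


# Tiny synthetic patch: one target pixel at row 0, column 1.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[0, 1] = 255
image = np.stack([mask, mask, mask], axis=-1)

augmented = augment(image, mask, ["vertical_flip", "rotate"], [90, 180])
print([name for name, _, _ in augmented])
```

With two rotation angles and one flip, each input patch produces three extra patches in addition to the original.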


## Upscale training images (optional)
Upscaling by super resolution (SR) may **take quite a long time**, but it can generate better results in most experiments. <br>
Traditional upscaling approaches (cubic interpolation here) are much faster.<br>
**Therefore, cubic interpolation is recommended.**

In [10]:
data_list = ['Dagaha2017']
type_list = ['train_small']

In [None]:
# upscale by SR model - EDSR model (better results, but very slow)
upscale_data_by_SR_final(path_database, data_list, type_list)

In [11]:
# upscale by cubic interpolation (faster, recommended)
upscale_data_by_cubic_final(path_database, data_list, type_list)

Start processing: Dagaha2017    train_small
Done.


## Upscale testing data

In [12]:
data_list = ['Dagaha2017']

In [13]:
#### upscale data by cubic interpolation (recommended, faster); to try nearest or bilinear interpolation, change the method in the .py script
data_type = "images"
patch_size = 256

for dataset in data_list:
    upscale_testing_data_cubic_final(path_database, dataset, data_type, patch_size)

Processing by cubic: /home/yunya/anaconda3/envs/Data/Dagaha2017/raw/test/images/dagahaley1.tif
(3, 12404, 19132)
Processing by cubic: /home/yunya/anaconda3/envs/Data/Dagaha2017/raw/test/images/dagahaley2.tif
(3, 5496, 5556)


In [14]:
# upscale Ground Truth data
data_type = "gt"
patch_size = 256

for dataset in data_list:
    upscale_testing_data_cubic_final(path_database, dataset, data_type, patch_size)

Processing by cubic: /home/yunya/anaconda3/envs/Data/Dagaha2017/raw/test/gt/dagahaley1.tif
(1, 12408, 19128)
Processing by cubic: /home/yunya/anaconda3/envs/Data/Dagaha2017/raw/test/gt/dagahaley2.tif
(1, 5492, 5556)
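
One thing to check after upscaling ground truth: cubic interpolation blends neighbouring pixel values, so an upscaled label raster can contain values other than 0 and 255 along object boundaries. Whether `upscale_testing_data_cubic_final` re-binarises its output is not shown in this notebook. The sketch below shows two generic options on a synthetic mask, both outside the notebook's own code: nearest-neighbour upscaling (an alternative technique that never creates new values) and thresholding an already interpolated mask. The helper names `upscale_mask_nearest` and `rebinarize`, the scale of 4, and the threshold of 127 are all assumptions.

```python
import numpy as np


def upscale_mask_nearest(mask, scale=4):
    """Upscale a 2-D label mask by repeating each pixel scale x scale times.

    Nearest-neighbour upscaling keeps the original label values unchanged.
    """
    return np.repeat(np.repeat(mask, scale, axis=0), scale, axis=1)


def rebinarize(mask, threshold=127, target=255):
    """Map every value above threshold to target and everything else to 0."""
    return np.where(mask > threshold, target, 0).astype(np.uint8)


# Synthetic 2 x 2 binary mask standing in for a ground truth patch.
small = np.array([[0, 255], [255, 0]], dtype=np.uint8)
large = upscale_mask_nearest(small, scale=4)
print(large.shape, np.unique(large))

# A row with intermediate values, as interpolation might produce at an edge.
blurred = np.array([[0, 100, 200, 255]], dtype=np.uint8)
print(rebinarize(blurred))
```

A quick `np.unique` on an upscaled ground truth file shows whether any intermediate values are present.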


In [None]:
#### upscale image by SR (very slow)
data_list = ['Dagaha2017']
data_type = "images"

for dataset in data_list:
    upscale_testing_image_by_SR_final(path_database, dataset, data_type)

Processing by SR: /home/yunya/anaconda3/envs/Data/Dagaha2017/raw/test/images/dagahaley2.tif
num_patches_height: 3, num_patches_width: 3
Done:    Row: 0
Done:    Row: 1
Done:    Row: 2
Processing by SR: /home/yunya/anaconda3/envs/Data/Dagaha2017/raw/test/images/dagahaley1.tif
num_patches_height: 7, num_patches_width: 10
