# Data preparation

- This notebook includes data preprocessing steps for [HiSup](https://github.com/SarahwXU/HiSup), [SAM-Adapter](https://github.com/tianrun-chen/SAM-Adapter-PyTorch), and [Segmentation_Models](https://github.com/qubvel/segmentation_models.pytorch).
- The code is mainly adapted from [HiSup](https://github.com/SarahwXU/HiSup).
- This workflow is only suitable for ***binary segmentation***. Feel free to adapt it for multiclass segmentation.
- Optionally, you can upscale images 4x with a super-resolution model ([EDSR](https://github.com/aswintechguy/Deep-Learning-Projects/tree/main/Super%20Resolution%20-%20OpenCV)) through OpenCV.
- The goal is to convert large GeoTIFF images and labels into small patches for deep learning models.
- The default structure and format of the input datasets are:<br>
path/Data/Dataset_Name/raw/train(or test, val)/images<br>
path/Data/Dataset_Name/raw/train(or test, val)/gt<br>

    Dataset1<br>
    - raw
        - train
            - images  (GeoTIFF, uint8, 3 bands (RGB); you can create and enhance the image data in GIS software in advance)
            - gt      (GeoTIFF, uint8, values: 0 (background) and 255 (targets); for binary segmentation the target value does not have to be 255, but it must differ from the background value)
        - test
            - images
            - gt
        - val
            - images
            - gt<br>
    
    Dataset2<br>
        ... ...<br>
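
The layout above can be checked before any patches are generated. The following is a minimal sketch and not part of the `DataProcessing` module: `check_raw_layout` is a hypothetical helper, and its default split names follow the tree above. The cells below use `train_large`, `train_small`, and `val` instead, so pass those through `splits` if that is how your data is organised.

```python
from pathlib import Path


def check_raw_layout(dataset_dir, splits=("train", "test", "val")):
    """Return the expected raw/<split>/<images|gt> folders that are missing."""
    dataset_dir = Path(dataset_dir)
    missing = []
    for split in splits:
        for sub in ("images", "gt"):
            folder = dataset_dir / "raw" / split / sub
            if not folder.is_dir():
                # report the path relative to the dataset folder
                missing.append(folder.relative_to(dataset_dir).as_posix())
    return missing
```

An empty list means every expected folder exists; otherwise the list names the folders to create, for example `check_raw_layout("your path/Data/Dataset1")`.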

In [None]:
# Set up paths and data types

import os
from pathlib import Path
import glob

path = os.getcwd()  # current working directory, where the code is stored
print(path)

from DataProcessing import data_process_hisup, data_process_sam_seg, upscale_img, upscale_lab, set_sr_model, data_process_augmentation

# define the path of the data
path_database = "your path/Data"  # root folder that contains all datasets
data_list = os.listdir(path_database)
data_list.sort()
# remove .ipynb_checkpoints if it is included in the folder
data_list = [d for d in data_list if d != ".ipynb_checkpoints"]

# define the data types to be processed, in case you have several training datasets
# testing data is not processed here because model performance is evaluated on the complete large test images rather than on small patches
type_list = ['train_large', 'train_small', 'val']

# print all datasets to be processed; modify the lists if necessary
data_list, type_list

/home/yunya/anaconda3/envs/Data_Preparation/Data_Preparation_Final


(['Dagaha2017', 'Djibo2019', 'Kutupalong2018', 'Minawao2017', 'Nduta2017'],
 ['train_large', 'train_small', 'val'])

## Data preparation for HiSup

In [None]:
# the default patch size is 512.

for dataset in data_list:
    for dtype in type_list:

        path_dataset = os.path.join(path_database, dataset)
        print("Start processing: " + dataset + " " + dtype)

        data_process_hisup(path_dataset, dtype)
        print("Done")

Start processing: Dagaha2017 train_large


100%|██████████| 7/7 [02:14<00:00, 19.17s/it]


Done
Start processing: Dagaha2017 train_small


100%|██████████| 7/7 [00:32<00:00,  4.67s/it]


Done
Start processing: Dagaha2017 val


100%|██████████| 7/7 [00:10<00:00,  1.49s/it]


Done
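
How `data_process_hisup` cuts a large image into 512 x 512 patches is not shown in this notebook. As an illustration only, the sketch below computes patch windows on a regular grid and shifts the last row and column back so that they stay inside the image. The names `tile_origins` and `tile_windows` are hypothetical, and the edge handling (overlap instead of padding or dropping) is an assumption, not the confirmed behaviour of the HiSup code.

```python
def tile_origins(length, patch_size, stride=None):
    """Start offsets of patches along one axis of the image."""
    if length < patch_size:
        raise ValueError("image is smaller than the patch size")
    stride = stride or patch_size
    origins = list(range(0, length - patch_size + 1, stride))
    # assumption: cover the remaining border with one extra, overlapping patch
    if origins[-1] + patch_size < length:
        origins.append(length - patch_size)
    return origins


def tile_windows(height, width, patch_size=512, stride=None):
    """Return (row, col, patch_height, patch_width) windows covering the image."""
    return [
        (row, col, patch_size, patch_size)
        for row in tile_origins(height, patch_size, stride)
        for col in tile_origins(width, patch_size, stride)
    ]
```

With these assumptions, a 1200 x 1024 image yields row offsets 0, 512, and 688 and column offsets 0 and 512, so six patches in total.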


## Data preparation for SAM Adapter

In [None]:
# 1024 is the default patch size for SAM-Adapter.
# adapt this part to your own needs, e.g. data_list and patch_size

patch_size = 1024
model_name = "SAM"

for dataset in data_list:
    for dtype in type_list:

        path_dataset = os.path.join(path_database, dataset)
        print("Start processing: " + dataset + " " + dtype + " " + str(patch_size))

        data_process_sam_seg(path_dataset, dtype, patch_size, model_name)
        print("Done")

Start processing: Dagaha2017 train_large 1024


100%|██████████| 7/7 [01:41<00:00, 14.44s/it]


Done
Start processing: Dagaha2017 train_small 1024


100%|██████████| 7/7 [00:14<00:00,  2.04s/it]


Done
Start processing: Dagaha2017 val 1024


100%|██████████| 7/7 [00:00<00:00,  7.90it/s]

Done





In [None]:
# 256 x 256 patches are created so that they can be upscaled 4x to 1024 x 1024 by EDSR.
# adapt this part to your own needs, e.g. data_sr_list, type_sr_list and patch_size

patch_size = 256
model_name = "SAM"

data_sr_list = ['Dagaha2017']
type_sr_list = ['train_small', 'val']

for dataset in data_sr_list:
    for dtype in type_sr_list:

        path_dataset = os.path.join(path_database, dataset)
        print("Start processing: " + dataset + " " + dtype + " " + str(patch_size))

        data_process_sam_seg(path_dataset, dtype, patch_size, model_name)
        print("Done")

Start processing: Dagaha2017 train_small 256


100%|██████████| 7/7 [00:12<00:00,  1.85s/it]


Done
Start processing: Dagaha2017 val 256


100%|██████████| 7/7 [00:03<00:00,  1.99it/s]

Done





## Data preparation for Segmentation Models

In [None]:
# 224 is the default patch size for segmentation_models.pytorch.
# adapt this part to your own needs, e.g. data_list and patch_size
# in this research, we use the same dataset as for SAM (including the original and the upscaled data) and then apply a random crop of size 224.

patch_size = 224
model_name = "SS"

for dataset in data_list:
    for dtype in type_list:

        path_dataset = os.path.join(path_database, dataset)
        print("Start processing: " + dataset + " " + dtype + " " + str(patch_size))

        data_process_sam_seg(path_dataset, dtype, patch_size, model_name)
        print("Done")

Start processing: Dagaha2017 train_large 224


100%|██████████| 7/7 [02:05<00:00, 17.91s/it]


Done
Start processing: Dagaha2017 train_small 224


100%|██████████| 7/7 [00:15<00:00,  2.29s/it]


Done
Start processing: Dagaha2017 val 224


100%|██████████| 7/7 [00:03<00:00,  1.94it/s]

Done





## Data preparation for Flipping and Rotation

In [None]:
dataset = "Djibo2019"
dtype = "train_small"

path_base_img = os.path.join(path_database, dataset, "SAM", "SR", dtype, "augmentation", "images")
img_list = glob.glob(path_base_img + "/*.png")

path_base_gt = os.path.join(path_database, dataset, "SAM", "SR", dtype, "augmentation", "gt")
gt_list = glob.glob(path_base_gt + "/*.png")

operation_list = ["horizontal_flip", "vertical_flip", "rotate"]
degrees_list = [90, 180, 270]
op_times = 1

for operation in operation_list:

    if operation in ("horizontal_flip", "vertical_flip"):

        degrees = 0  # not used for flips
        data_process_augmentation(path_base_img, img_list, path_base_gt, gt_list, dtype, operation, degrees, op_times)
        op_times += 1

    elif operation == "rotate":

        for degrees in degrees_list:
            data_process_augmentation(path_base_img, img_list, path_base_gt, gt_list, dtype, operation, degrees, op_times)
            op_times += 1

    else:
        # unknown operation names now reach this branch instead of being treated as flips
        print(operation + " is not one of horizontal_flip, vertical_flip, rotate.")


horizontal_flip: images saved.
horizontal_flip: gt saved.
vertical_flip: images saved.
vertical_flip: gt saved.
rotate: images saved.
rotate: gt saved.
rotate: images saved.
rotate: gt saved.
rotate: images saved.
rotate: gt saved.
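
For segmentation, the image and its label must receive exactly the same geometric transform, otherwise the mask no longer lines up with the pixels. The internals of `data_process_augmentation` are not shown here, so the following is only a sketch of the idea using NumPy arrays. `augment_pair` is a hypothetical helper, and it assumes rotations in multiples of 90 degrees, as in `degrees_list` above.

```python
import numpy as np


def augment_pair(image, mask, operation, degrees=0):
    """Apply the same flip or 90-degree-step rotation to an image and its mask."""
    if operation == "horizontal_flip":
        return np.fliplr(image), np.fliplr(mask)
    if operation == "vertical_flip":
        return np.flipud(image), np.flipud(mask)
    if operation == "rotate":
        if degrees % 90 != 0:
            raise ValueError("only multiples of 90 degrees are supported")
        k = (degrees // 90) % 4  # number of counterclockwise quarter turns
        return np.rot90(image, k), np.rot90(mask, k)
    raise ValueError("unknown operation: " + operation)
```

Because flips and quarter turns only move pixels around, the mask keeps its original values (for example 0 and 255) and no interpolation is needed.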


# Print number of patches for each type of created data

### HiSup

In [None]:
model_name = "HiSup"

for dataset in data_list:
    for dtype in type_list:

        path_dataset = os.path.join(path_database, dataset, model_name, dtype, "images")
        alldata_list = os.listdir(path_dataset)

        print("{:<7}   {:<15}  {}".format(len(alldata_list), dataset, dtype))

1260      Dagaha2017       train_large
224       Dagaha2017       train_small
63        Dagaha2017       val
1232      Djibo2019        train_large
280       Djibo2019        train_small
63        Djibo2019        val
9450      Kutupalong2018   train_large
2156      Kutupalong2018   train_small
504       Kutupalong2018   val
1008      Minawao2017      train_large
168       Minawao2017      train_small
28        Minawao2017      val
5670      Nduta2017        train_large
896       Nduta2017        train_small
294       Nduta2017        val


### SAM

In [None]:
model_name = "SAM"
patch_size = 1024

for dataset in data_list:
    for dtype in type_list:

        path_dataset = os.path.join(path_database, dataset, model_name, str(patch_size), dtype, "images")
        alldata_list = os.listdir(path_dataset)

        print("{:<7}   {:<15}  {:<6}  {}".format(len(alldata_list), dataset, patch_size, dtype))

350       Dagaha2017       1024    train_large
56        Dagaha2017       1024    train_small
7         Dagaha2017       1024    val
280       Djibo2019        1024    train_large
56        Djibo2019        1024    train_small
7         Djibo2019        1024    val
1848      Kutupalong2018   1024    train_large
420       Kutupalong2018   1024    train_small
112       Kutupalong2018   1024    val
224       Minawao2017      1024    train_large
56        Minawao2017      1024    train_small
7         Minawao2017      1024    val
1176      Nduta2017        1024    train_large
224       Nduta2017        1024    train_small
63        Nduta2017        1024    val


In [None]:
model_name = "SAM"
patch_size = 256

data_sr_list = ['Dagaha2017']
type_sr_list = ['train_small', 'val']

for dataset in data_sr_list:
    for dtype in type_sr_list:

        path_dataset = os.path.join(path_database, dataset, model_name, str(patch_size), dtype, "images")
        alldata_list = os.listdir(path_dataset)

        print("{:<7}   {:<15}  {:<6}  {}".format(len(alldata_list), dataset, patch_size, dtype))

686       Dagaha2017       256     train_small
175       Dagaha2017       256     val
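
Counting only the `images` folder does not reveal a patch whose label is missing. As a hedged sketch, the hypothetical helper `count_patches` below compares the file names in `images` and `gt` of one split folder. It assumes the SAM-style layout used in this notebook, in which an image and its label share the same file name.

```python
from pathlib import Path


def count_patches(split_dir):
    """Return (n_images, n_labels, names that appear in only one of the two folders)."""
    split_dir = Path(split_dir)
    # is_file() also skips folders such as .ipynb_checkpoints
    images = {p.name for p in (split_dir / "images").iterdir() if p.is_file()}
    labels = {p.name for p in (split_dir / "gt").iterdir() if p.is_file()}
    return len(images), len(labels), sorted(images ^ labels)
```

For instance, `count_patches(os.path.join(path_database, "Dagaha2017", "SAM", "256", "val"))` should return two equal counts and an empty list when every patch has a label.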


## Upscale images (optional) (256x256 to 1024x1024)
The upscaling step may take a long time.

In [None]:
model_name = "SAM"
upscaled_folder = "SR"

data_sr_list = ['Dagaha2017']
type_sr_list = ['train_small', 'val']
patch_size = 256

# set up the SR model once and reuse it for every image
sr_model = set_sr_model()

for dataset in data_sr_list:
    for dtype in type_sr_list:

        path_in = os.path.join(path_database, dataset, model_name, str(patch_size), dtype)
        img_name_list = os.listdir(os.path.join(path_in, "images"))

        # create the output folders once per dataset and data type
        path_ups_img = os.path.join(path_database, dataset, model_name, upscaled_folder, dtype, "images")
        Path(path_ups_img).mkdir(parents=True, exist_ok=True)

        path_ups_lab = os.path.join(path_database, dataset, model_name, upscaled_folder, dtype, "gt")
        Path(path_ups_lab).mkdir(parents=True, exist_ok=True)

        for img_name in img_name_list:

            path_in_img = os.path.join(path_in, "images", img_name)
            path_out_img = os.path.join(path_ups_img, img_name)
            path_in_lab = os.path.join(path_in, "gt", img_name)
            path_out_lab = os.path.join(path_ups_lab, img_name)

            # upscale images
            upscale_img(path_in_img, path_out_img, sr_model)

            # upscale labels
            upscale_lab(path_in_lab, path_out_lab)

        # report the number of patches processed in this loop
        print("Done: {:<7}   {:<15}  {}".format(len(img_name_list), dataset, dtype))
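
Images are upscaled by the SR model, but labels need a method that does not create new pixel values. How `upscale_lab` resizes the labels is not shown in this notebook. The sketch below assumes nearest-neighbour upscaling, which turns every label pixel into a 4 x 4 block and therefore keeps the mask binary. `upscale_mask_nearest` is a hypothetical name used only for illustration.

```python
import numpy as np


def upscale_mask_nearest(mask, scale=4):
    """Nearest-neighbour upscaling: each pixel becomes a scale x scale block."""
    return np.repeat(np.repeat(mask, scale, axis=0), scale, axis=1)
```

With `scale=4`, a 256 x 256 label becomes 1024 x 1024, matching the size of the upscaled image patches.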

## Upscale testing data

In [None]:
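#### upscale data by nearest-neighbor and bilinear interpolation
# note: upscale_testing_data_nearest_bilinear is not imported in the setup cell; import it before running this cell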
upscale_list = ['nearest', 'bilinear']
data_type = "images"

for dataset in data_list:
    for upscaled_folder in upscale_list:
        upscale_testing_data_nearest_bilinear(path_database, dataset, upscaled_folder, data_type)


# upscale ground truth data
upscale_list = ['bilinear']
data_type = "gt"

for dataset in data_list:
    for upscaled_folder in upscale_list:
        upscale_testing_data_nearest_bilinear(path_database, dataset, upscaled_folder, data_type)

In [None]:
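#### upscale images by SR (EDSR)
# note: upscale_testing_data_SR is not imported in the setup cell; import it before running this cell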

upscaled_folder = "EDSR"
data_type = "images"

for dataset in data_list:
    upscale_testing_data_SR(path_database, dataset, upscaled_folder, data_type)