# Dataset creation

This notebook creates the datasets used for training, validating and testing the deep-learning model.

Author of the notebook:
Antonio Magherini (Antonio.Magherini@deltares.nl).

In [1]:
# move to root directory

%cd .. 

c:\Users\Magherin\OneDrive - Stichting Deltares\Desktop\jamuna_morpho


In [2]:
# reload modules to avoid restarting the notebook every time these are updated

%load_ext autoreload
%autoreload 2

In [3]:
# import modules 

import numpy as np  # used below to inspect array shapes
import torch

from preprocessing.dataset_generation import *

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cpu


Directories of the original images, the preprocessed images and the generated datasets.

In [5]:
dir_orig = r'data\satellite\original'
dir_proc = r'data\satellite\preprocessed'
dir_dataset = r'data\satellite\dataset'
dir_dataset_1024x512 = r'data\satellite\dataset_1024x512'
dir_dataset_jan = r'data\satellite\dataset_month1' 
dir_dataset_feb = r'data\satellite\dataset_month2'
dir_dataset_mar = r'data\satellite\dataset_month3'
dir_dataset_apr = r'data\satellite\dataset_month4'

Available collections.

In [6]:
JRC = r'JRC_GSW1_4_MonthlyHistory'

Set string variables.

In [7]:
train = 'training'
val = 'validation'
test = 'testing'

train_val_test_list = [train, val, test]

The next cells only demonstrate how the different functions work.

1. Create the input and target datasets: all images are loaded regardless of their quality.

In [8]:
input_mar, target_mar = create_datasets(val, 1, 5, dir_folders=dir_dataset_mar)

In [9]:
print(f'Input and target shape for March (validation reach 1):\n\
March --> input shape: {np.shape(input_mar)} - Target shape: {np.shape(target_mar)}')

Input and target shape for March (validation reach 1):
March --> input shape: (29, 4, 1000, 500) - Target shape: (29, 1, 1000, 500)


In [8]:
input_jan, target_jan = create_datasets(train, 1, 5, dir_folders=dir_dataset_jan)
input_feb, target_feb = create_datasets(train, 1, 5, dir_folders=dir_dataset_feb)
input_mar, target_mar = create_datasets(train, 1, 5, dir_folders=dir_dataset_mar)
input_apr, target_apr = create_datasets(train, 1, 5, dir_folders=dir_dataset_apr)



In [9]:
print(f'Input and target shape month by month (training reach 1):\n\
January --> input shape: {np.shape(input_jan)} - Target shape: {np.shape(target_jan)}\n\
February --> input shape: {np.shape(input_feb)} - Target shape: {np.shape(target_feb)}\n\
March --> input shape: {np.shape(input_mar)} - Target shape: {np.shape(target_mar)}\n\
April --> input shape: {np.shape(input_apr)} - Target shape: {np.shape(target_apr)}')

Input and target shape month by month (training reach 1):
January --> input shape: (29, 4, 1000, 500) - Target shape: (29, 1, 1000, 500)
February --> input shape: (29, 4, 1000, 500) - Target shape: (29, 1, 1000, 500)
March --> input shape: (29, 4, 1000, 500) - Target shape: (29, 1, 1000, 500)
April --> input shape: (29, 4, 1000, 500) - Target shape: (29, 1, 1000, 500)
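The shapes above (four input images and one target image per sample) are consistent with a sliding window over a time-ordered stack of images. The following is a minimal sketch of that idea, not the implementation of `create_datasets`: the helper name `sliding_window_samples`, the `n_inputs` parameter, the toy image size and the count of 33 images (chosen only so that 29 samples come out) are all assumptions made for illustration.

```python
import numpy as np


def sliding_window_samples(images, n_inputs=4):
    """Split a (T, H, W) stack of images into input/target samples.

    Hypothetical helper: each sample takes `n_inputs` consecutive images
    as input and the image that immediately follows them as target.
    """
    images = np.asarray(images)
    n_samples = images.shape[0] - n_inputs
    if n_samples < 1:
        raise ValueError('Need more than n_inputs images to build a sample.')
    # inputs: (n_samples, n_inputs, H, W)
    inputs = np.stack([images[i:i + n_inputs] for i in range(n_samples)])
    # targets keep a channel axis of length 1: (n_samples, 1, H, W)
    targets = np.stack([images[i + n_inputs:i + n_inputs + 1]
                        for i in range(n_samples)])
    return inputs, targets


# toy stack: 33 images of 10 x 5 pixels (assumed values, kept small on purpose)
stack = np.random.randint(0, 2, size=(33, 10, 5))
inputs, targets = sliding_window_samples(stack)
print(inputs.shape, targets.shape)  # (29, 4, 10, 5) (29, 1, 10, 5)
```

With this layout, the target of sample `i` is the image at index `i + n_inputs` of the original stack.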


2. Combine the input and target datasets, filtering out bad images (based on the `no-data` and `water` thresholds).

In [10]:
input_jan_filtered, target_jan_filtered = combine_datasets(train, 1, dir_folders=dir_dataset_jan)
input_feb_filtered, target_feb_filtered = combine_datasets(train, 1, dir_folders=dir_dataset_feb)
input_mar_filtered, target_mar_filtered = combine_datasets(train, 1, dir_folders=dir_dataset_mar)
input_apr_filtered, target_apr_filtered = combine_datasets(train, 1, dir_folders=dir_dataset_apr)

In [11]:
print(f'Input and target shape month by month after filtering out unsuitable images (training reach 1):\n\
January --> input shape: {np.shape(input_jan_filtered)} - Target shape: {np.shape(target_jan_filtered)}\n\
February --> input shape: {np.shape(input_feb_filtered)} - Target shape: {np.shape(target_feb_filtered)}\n\
March --> input shape: {np.shape(input_mar_filtered)} - Target shape: {np.shape(target_mar_filtered)}\n\
April --> input shape: {np.shape(input_apr_filtered)} - Target shape: {np.shape(target_apr_filtered)}')

Input and target shape month by month after filtering out unsuitable images (training reach 1):
January --> input shape: (6, 4, 1000, 500) - Target shape: (6, 1000, 500)
February --> input shape: (17, 4, 1000, 500) - Target shape: (17, 1000, 500)
March --> input shape: (13, 4, 1000, 500) - Target shape: (13, 1000, 500)
April --> input shape: (10, 4, 1000, 500) - Target shape: (10, 1000, 500)
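The filtering step can be pictured as discarding every sample in which any image has too many no-data pixels or too few water pixels. The sketch below illustrates that logic under stated assumptions; it is not the code of `combine_datasets`. The pixel encoding (`-1` no-data, `0` non-water, `1` water), the threshold values and the helper names `is_good_image` and `filter_samples` are hypothetical.

```python
import numpy as np

# Assumed pixel encoding and thresholds (not taken from the notebook)
NO_DATA, NON_WATER, WATER = -1, 0, 1
MAX_NODATA_FRACTION = 0.25
MIN_WATER_FRACTION = 0.05


def is_good_image(image, max_nodata=MAX_NODATA_FRACTION,
                  min_water=MIN_WATER_FRACTION):
    """Return True if the image passes both quality thresholds."""
    image = np.asarray(image)
    nodata_fraction = np.mean(image == NO_DATA)
    water_fraction = np.mean(image == WATER)
    return bool(nodata_fraction <= max_nodata and water_fraction >= min_water)


def filter_samples(inputs, targets):
    """Keep only the samples whose input and target images are all good.

    inputs: (N, C, H, W), targets: (N, 1, H, W).
    Returns inputs (M, C, H, W) and targets (M, H, W) with M <= N.
    """
    keep = np.array([
        all(is_good_image(img) for img in x) and is_good_image(y)
        for x, y in zip(inputs, targets)
    ], dtype=bool)
    # drop the singleton channel axis of the targets, as in the output above
    return inputs[keep], targets[keep][:, 0]


# three toy samples: one good, one with too much no-data, one without water
inputs = np.full((3, 4, 10, 5), NON_WATER)
targets = np.full((3, 1, 10, 5), NON_WATER)
inputs[:, :, :, 2] = WATER      # one water column: water fraction 0.2
targets[:, :, :, 2] = WATER
inputs[1, 0, :5] = NO_DATA      # half of one input image is no-data
targets[2] = NON_WATER          # target of the last sample has no water

inputs_f, targets_f = filter_samples(inputs, targets)
print(inputs_f.shape, targets_f.shape)  # (1, 4, 10, 5) (1, 10, 5)
```

Rejecting a whole sample when a single image fails keeps inputs and targets aligned, which would explain why the number of samples per month drops so sharply after filtering.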


### 1. Training dataset

In [12]:
# training
dtype = torch.float32

dataset_train_jan = create_full_dataset(train, dir_folders=dir_dataset_jan, device=device, dtype=dtype)
dataset_train_feb = create_full_dataset(train, dir_folders=dir_dataset_feb, device=device, dtype=dtype)
dataset_train_mar = create_full_dataset(train, dir_folders=dir_dataset_mar, device=device, dtype=dtype)
dataset_train_apr = create_full_dataset(train, dir_folders=dir_dataset_apr, device=device, dtype=dtype)

print(f'Total training samples considering different months:\n\
January --> {len(dataset_train_jan)}\n\
February --> {len(dataset_train_feb)}\n\
March --> {len(dataset_train_mar)}\n\
April --> {len(dataset_train_apr)}')

Total training samples considering different months:
January --> 378
February --> 402
March --> 413
April --> 262


In [13]:
print(f"Datasets shape (same for every monthly dataset)\n\
Input dataset sample shape: {dataset_train_jan[0][0].shape} - Target dataset sample shape: {dataset_train_jan[0][1].shape}")

Datasets shape (same for every monthly dataset)
Input dataset sample shape: torch.Size([4, 1000, 500]) - Target dataset sample shape: torch.Size([1000, 500])
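Each dataset element is an (input, target) pair of tensors, which is what a `torch.utils.data.TensorDataset` returns when indexed. As a minimal sketch, assuming the filtered NumPy arrays are wrapped this way (the source does not show the body of `create_full_dataset`, and the helper name `to_tensor_dataset` is hypothetical), the conversion could look like this:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset


def to_tensor_dataset(inputs, targets, device='cpu', dtype=torch.float32):
    """Wrap NumPy input/target arrays in a TensorDataset (hypothetical helper)."""
    x = torch.as_tensor(inputs, dtype=dtype, device=device)
    y = torch.as_tensor(targets, dtype=dtype, device=device)
    # TensorDataset requires both tensors to share the first dimension
    return TensorDataset(x, y)


# toy arrays with the same layout as above, but smaller images
inputs = np.zeros((6, 4, 10, 5))
targets = np.zeros((6, 10, 5))

dataset = to_tensor_dataset(inputs, targets)
sample_input, sample_target = dataset[0]
print(len(dataset), sample_input.shape, sample_target.shape)
```

Indexing the dataset drops the leading sample axis, which is why a single input has shape `[4, H, W]` and a single target has shape `[H, W]`.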


### 2. Validation dataset

In [14]:
# validation
dataset_val_jan = create_full_dataset(val, dir_folders=dir_dataset_jan, device=device, dtype=dtype)
dataset_val_feb = create_full_dataset(val, dir_folders=dir_dataset_feb, device=device, dtype=dtype)
dataset_val_mar = create_full_dataset(val, dir_folders=dir_dataset_mar, device=device, dtype=dtype)
dataset_val_apr = create_full_dataset(val, dir_folders=dir_dataset_apr, device=device, dtype=dtype)

print(f'Total validation samples considering different months:\n\
January --> {len(dataset_val_jan)}\n\
February --> {len(dataset_val_feb)}\n\
March --> {len(dataset_val_mar)}\n\
April --> {len(dataset_val_apr)}')

Total validation samples considering different months:
January --> 9
February --> 19
March --> 13
April --> 17


### 3. Testing dataset

In [15]:
# testing
dataset_test_jan = create_full_dataset(test, dir_folders=dir_dataset_jan, device=device, dtype=dtype)
dataset_test_feb = create_full_dataset(test, dir_folders=dir_dataset_feb, device=device, dtype=dtype)
dataset_test_mar = create_full_dataset(test, dir_folders=dir_dataset_mar, device=device, dtype=dtype)
dataset_test_apr = create_full_dataset(test, dir_folders=dir_dataset_apr, device=device, dtype=dtype)

print(f'Total testing samples considering different months:\n\
January --> {len(dataset_test_jan)}\n\
February --> {len(dataset_test_feb)}\n\
March --> {len(dataset_test_mar)}\n\
April --> {len(dataset_test_apr)}')

Total testing samples considering different months:
January --> 16
February --> 19
March --> 17
April --> 17
