# Data Exploration + Splitting Algorithm for overlapping tiles

The goal of this notebook is to compare the original images before they are split into tiles. When training the model on the entire dataset the dataset is split into training data, validation data and test data. When evaluating the model based on calculated metrics it is important to use data that the model has never seen before. As experiment two shows that it is beneficial if the tiles in which the images are split overlap it is crucial that tiles of the same original images don't appear in e.g. training data and test data. Therefore it is necessary to analyze the images in order to come up with a good training-, validation- and test split of the dataset.

The entire dataset consists of 16 images taken on the following dates:
- 2022_12_12
- 2022_12_02
- 2022_10_23
- 2022_10_13
- 2022_09_18
- 2022_09_13
- 2022_09_08
- 2022_09_03
- 2022_08_24
- 2022_08_14
- 2022_08_04
- 2022_07_30
- 2022_07_25
- 2022_07_15
- 2022_07_10
- 2022_06_20

All images have been split with a tile size of 256 and a step size of 200.

The goal is to explore the following parameters:
- amount of tiles per image
- amount of pixels per class per image

in order to come up with a good solution of splitting the dataset.

### 0. Get Stats for each image

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
! ls
%cd drive/MyDrive/MachineLearning/Geospatial_ML
! ls

drive  sample_data
/content/drive/.shortcut-targets-by-id/15HUD3sGdfvxy5Y_bjvuXgrzwxt7TzRfm/MachineLearning/Geospatial_ML
architecture.drawio	 evaluation   notebooks     requirements.txt
combine_npz_files.ipynb  experiments  prepare_data
data_exploration	 models       README.md


In [4]:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
import pickle

In [5]:
def check_shapes(x_input, y_mask):
  if not x_input.shape[0] == y_mask.shape[0]:
    raise TypeError('amount of tiles different in input and mask array.')
  if not (x_input.shape[1] == 256 and x_input.shape[2] == 256):
    raise TypeError('tile size of input array does not match 256')
  if not (x_input.shape[3] == 5):
    raise TypeError('input array does not have 5 channels')
  if not (y_mask.shape[1] == 256 and y_mask.shape[2] == 256):
    raise TypeError('tile size of mask array does not match 256')

In [6]:
def num_of_pixels_per_class(y_mask, label):
  flatten = np.reshape(y_mask, (-1,))
  pixel_match = (flatten == label)
  pix_per_class = np.count_nonzero(pixel_match)
  return pix_per_class


In [7]:
def data_exploration(x_input, y_mask, file_name):

  check_shapes(x_input, y_mask)

  num_tiles = x_input.shape[0]
  num_pixels = x_input.shape[0] * x_input.shape[1] * x_input.shape[2]

  num_land_pix = num_of_pixels_per_class(y_mask, 2)
  num_valid_pix = num_of_pixels_per_class(y_mask, 1)
  num_invalid_pix = num_of_pixels_per_class(y_mask, 0)

  if not num_pixels == (num_land_pix + num_valid_pix + num_invalid_pix):
    raise TypeError('pixels per class summed up is not equal to num_pixels.')

  percenetage_land = 100/ num_pixels * num_land_pix
  percenetage_valid = 100/ num_pixels * num_valid_pix
  percenetage_invalid = 100/ num_pixels * num_invalid_pix

  return {
      'file_name': file_name,
      'x_input.shape': x_input.shape,
      'y_mask.shape': y_mask.shape,
      'num_tiles': num_tiles,
      'num_pixels': num_pixels,
      'num_land_pix': num_land_pix,
      'num_valid_pix': num_valid_pix,
      'num_invalid_pix': num_invalid_pix,
      'percenetage_land': percenetage_land,
      'percenetage_valid': percenetage_valid,
      'percenetage_invalid': percenetage_invalid
  }
  

In [8]:
data_directory = "../data_colab/256_200"

all_stats = []

for file_name in os.listdir(data_directory):
  if file_name.startswith('2022'):
    tiles_path = os.path.join(data_directory, file_name)

    y_mask  = np.load(tiles_path)['y_mask']
    x_input = np.load(tiles_path)['x_input']

    print(file_name)
    print(y_mask.shape)
    print(x_input.shape)
    print()

    stats = data_exploration(x_input, y_mask, file_name)
    all_stats.append(stats)


2022_10_13.npz
(889, 256, 256)
(889, 256, 256, 5)

2022_07_15.npz
(864, 256, 256)
(864, 256, 256, 5)

2022_09_18.npz
(1174, 256, 256)
(1174, 256, 256, 5)

2022_06_20.npz
(1251, 256, 256)
(1251, 256, 256, 5)

2022_10_23.npz
(1164, 256, 256)
(1164, 256, 256, 5)

2022_07_25.npz
(1258, 256, 256)
(1258, 256, 256, 5)

2022_08_04.npz
(1319, 256, 256)
(1319, 256, 256, 5)

2022_07_10.npz
(1323, 256, 256)
(1323, 256, 256, 5)

2022_07_30.npz
(1183, 256, 256)
(1183, 256, 256, 5)

2022_08_14.npz
(1179, 256, 256)
(1179, 256, 256, 5)

2022_08_24.npz
(1306, 256, 256)
(1306, 256, 256, 5)

2022_09_03.npz
(1196, 256, 256)
(1196, 256, 256, 5)

2022_12_12.npz
(957, 256, 256)
(957, 256, 256, 5)

2022_09_08.npz
(927, 256, 256)
(927, 256, 256, 5)

2022_12_02.npz
(1142, 256, 256)
(1142, 256, 256, 5)

2022_09_13.npz
(1175, 256, 256)
(1175, 256, 256, 5)



### 1. Display stats

In [9]:
df = pd.DataFrame(all_stats)
df.shape


(16, 11)

In [10]:
# sort by num tiles
df_num_tiles = df[['file_name', 'num_tiles', 'percenetage_valid', 'percenetage_invalid', 'percenetage_land']]
df_num_tiles_sorted = df_num_tiles.sort_values(by='num_tiles')
df_num_tiles_sorted


Unnamed: 0,file_name,num_tiles,percenetage_valid,percenetage_invalid,percenetage_land
1,2022_07_15.npz,864,10.050526,50.681909,39.267565
0,2022_10_13.npz,889,16.807331,39.782475,43.410194
13,2022_09_08.npz,927,31.447306,24.559434,43.99326
12,2022_12_12.npz,957,21.371911,28.253614,50.374475
14,2022_12_02.npz,1142,17.023361,35.459047,47.517591
4,2022_10_23.npz,1164,10.701117,43.796029,45.502853
2,2022_09_18.npz,1174,48.241497,7.988471,43.770031
15,2022_09_13.npz,1175,46.242989,10.073545,43.683466
9,2022_08_14.npz,1179,47.804025,8.449461,43.746515
8,2022_07_30.npz,1183,52.403293,3.37269,44.224017


In [11]:
mean_tiles = df['num_tiles'].mean()
print(f'Mean number of tiles: {mean_tiles}')

Mean number of tiles: 1144.1875


In [12]:
# sort by ratio invalid valid
df_ratio = df[['file_name', 'percenetage_valid', 'percenetage_invalid', 'percenetage_land', 'num_tiles']]
df_ratio_sorted = df_ratio.sort_values(by='percenetage_valid')
df_ratio_sorted

Unnamed: 0,file_name,percenetage_valid,percenetage_invalid,percenetage_land,num_tiles
11,2022_09_03.npz,5.212303,52.216682,42.571015,1196
1,2022_07_15.npz,10.050526,50.681909,39.267565,864
4,2022_10_23.npz,10.701117,43.796029,45.502853,1164
0,2022_10_13.npz,16.807331,39.782475,43.410194,889
14,2022_12_02.npz,17.023361,35.459047,47.517591,1142
12,2022_12_12.npz,21.371911,28.253614,50.374475,957
10,2022_08_24.npz,27.908288,26.841526,45.250187,1306
7,2022_07_10.npz,30.30057,26.719039,42.980391,1323
13,2022_09_08.npz,31.447306,24.559434,43.99326,927
6,2022_08_04.npz,37.76005,13.440227,48.799723,1319


In [13]:
mean_perc_invalid = df['percenetage_invalid'].mean()
print(f'Mean percentage of invalid pixels: {mean_perc_invalid}')

print(f'Total amount of tiles: {df["num_tiles"].sum()}')
print(f'Total amount of invalid pixels: {df["num_invalid_pix"].sum()}')

Mean percentage of invalid pixels: 24.320244004101355
Total amount of tiles: 18307
Total amount of invalid pixels: 280566117


### 2. Splitting algorithm

In [14]:
def _validate(df, total_tiles, total_invalids, threshold, target):

  df_num_tiles = df['num_tiles'].sum()
  df_num_invalides = df['num_invalid_pix'].sum()


  percentage_tiles = 100 / total_tiles * df_num_tiles
  percentage_invalides = 100 / total_invalids * df_num_invalides

  if (target + threshold) >= percentage_tiles <= (target - threshold):
    return False
  elif (target + threshold) >= percentage_invalides <= (target - threshold):
    return False
  else:
    print(f'Total_tiles: {total_tiles} total_invalids: {total_invalids}')
    print(f'percentage_tiles: {percentage_tiles}, percentage_invalides: {percentage_invalides} ')
    print()
    
    return True

def splitting_algorithm(threshold, df):
  total_tiles = df['num_tiles'].sum()
  total_invalids = df['num_invalid_pix'].sum()

  df_copy = df.copy()

  valid = False
  count = 0
  
  while not valid:
    print(f'Count: {count}')
    df_copy = df.copy()

    training_set = df_copy.sample(n=10)
    df_copy = df_copy.drop(training_set.index)
    
    validation_set = df_copy.sample(n=3)
    df_copy = df_copy.drop(validation_set.index)

    test_set = df_copy.sample(n=3)
    df_copy = df_copy.drop(test_set.index)

    train_validate = _validate(training_set, total_tiles, total_invalids, threshold, 60)
    val_validate = _validate(validation_set, total_tiles, total_invalids, threshold, 20)
    test_validate = _validate(test_set, total_tiles, total_invalids, threshold, 20)

    if train_validate and val_validate and test_validate:
      valid = True
    else:
      training_set = None
      validation_set= None
      test_set= None
      count += 1

  return training_set, validation_set, test_set

training_set, validation_set, test_set = splitting_algorithm(1, df)

Count: 0
Count: 1
Total_tiles: 18307 total_invalids: 280566117
percentage_tiles: 59.3925820724313, percentage_invalides: 66.65511965580647 

Total_tiles: 18307 total_invalids: 280566117
percentage_tiles: 20.19992352651991, percentage_invalides: 19.660535844390647 

Count: 2
Total_tiles: 18307 total_invalids: 280566117
percentage_tiles: 59.3925820724313, percentage_invalides: 66.65511965580647 

Count: 3
Total_tiles: 18307 total_invalids: 280566117
percentage_tiles: 59.944283607363296, percentage_invalides: 69.6904626583972 

Count: 4
Total_tiles: 18307 total_invalids: 280566117
percentage_tiles: 19.151144371005625, percentage_invalides: 22.62645955926317 

Count: 5
Total_tiles: 18307 total_invalids: 280566117
percentage_tiles: 62.664554541978475, percentage_invalides: 61.40069543750359 

Total_tiles: 18307 total_invalids: 280566117
percentage_tiles: 19.604522860108155, percentage_invalides: 27.21555112087893 

Count: 6
Count: 7
Total_tiles: 18307 total_invalids: 280566117
percentage_ti

In [17]:
print('Training Set Images')
print(training_set['file_name'])
print(f'Total tiles: {training_set["num_tiles"].sum()}')
print()
print('Validation Set Images')
print(validation_set['file_name'])
print(f'Total tiles: {validation_set["num_tiles"].sum()}')
print()
print('Test Set Images')
print(test_set['file_name'])
print(f'Total tiles: {test_set["num_tiles"].sum()}')
print()

Training Set Images
5     2022_07_25.npz
9     2022_08_14.npz
4     2022_10_23.npz
0     2022_10_13.npz
8     2022_07_30.npz
7     2022_07_10.npz
6     2022_08_04.npz
1     2022_07_15.npz
12    2022_12_12.npz
13    2022_09_08.npz
Name: file_name, dtype: object
Total tiles: 11063

Validation Set Images
15    2022_09_13.npz
11    2022_09_03.npz
2     2022_09_18.npz
Name: file_name, dtype: object
Total tiles: 3545

Test Set Images
14    2022_12_02.npz
3     2022_06_20.npz
10    2022_08_24.npz
Name: file_name, dtype: object
Total tiles: 3699



- In total we have 16 images. When applying a 60-20-20 split this means we will use 10 image for training, 3 images for validation and 3 images for testing.

- To find the best split two parameters are relevant, the number of tiles to achieve the 60-20-20 split and the number of invalid pixels.

- The algorithm comes up with possible solutions until it finds one where the amount of tiles and the amount of invalid pixels differs max 1% from the target 60-20-20 split



### 3. Results

#### Training Set Images

60.03 % of total tiles, 59.11 % of total invalid pixels 

- 2022_06_20.npz
- 2022_09_13.npz
- 2022_12_12.npz
- 2022_07_10.npz
- 2022_07_25.npz
- 2022_09_08.npz
- 2022_10_13.npz
- 2022_07_15.npz
- 2022_10_23.npz
- 2022_07_30.npz



#### Validation Set Images

20.15 % of total tiles, 29.92 % of total invalid pixels 

- 2022_08_04.npz
- 2022_09_03.npz
- 2022_09_18.npz


#### Test Set Images

19.81 % of total tiles, 19.97 % of total invalid pixels 

- 2022_12_02.npz
- 2022_08_14.npz
- 2022_08_24.npz
