<a href="https://colab.research.google.com/github/emely3h/Geospatial_ML/blob/feature%2Fadd-data-generators-to-fix-ram-problem/combine_npz_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Combine npz files

This is a notebook for the last step in the prepare data pipeline as we did not have enough RAM to run it locally. To train the model on the entire dataset it is more convenient to have all tile-arrays of all images in one .npz file. 

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#! ls
%cd drive/MyDrive/MachineLearning
%cd Geospatial_ML
! ls

/content/drive/.shortcut-targets-by-id/15HUD3sGdfvxy5Y_bjvuXgrzwxt7TzRfm/MachineLearning
/content/drive/.shortcut-targets-by-id/15HUD3sGdfvxy5Y_bjvuXgrzwxt7TzRfm/MachineLearning/Geospatial_ML
architecture.drawio  colab.py	  models	__pycache__
colab-new.py	     evaluation   notebooks	README.md
colab_new.py	     experiments  prepare_data	requirements.txt


In [4]:
import numpy as np
import os
import pickle
import datetime
data_path = "../data_colab/256_200"

uncompressed file is 2GB, 50MB and compressed 274 MB
=> loading/ decompressing all arrays takes ~ 15 * 2,04 GB = 31 GB
=> loading all images into RAM still works but compressing them fails

all 5 images decompressed in memory ~ 18 GB RAM

- combining 8 images with savez() takes 20 GB < 5min, < 20GB System RAM
- combining 8 images with savez_compressed() takes 1,84 GB > 10min, ~ 30 GB System RAM
- trying to combine 11 images with savez_compressed crashed during loading 9th image


In [None]:
data_path = "../data_colab/256_200"

total_tiles = 0
for file in os.listdir(data_path):
  if not os.path.isdir(os.path.join(data_path, file)) and not file.startswith('combined') and not file.startswith('compressed'):
   
    print(f'Image: {file}')
    array = np.load(f'{data_path}/{file}')
    total_tiles += array['x_input'].shape[0]
    print(array['x_input'].shape)
    print(array['y_mask'].shape)
    print()

print(f'Total amount of tiles {total_tiles}')

Image: 2022_10_13.npz
(889, 256, 256, 5)
(889, 256, 256)

Image: 2022_07_15.npz
(864, 256, 256, 5)
(864, 256, 256)

Image: 2022_09_18.npz
(1174, 256, 256, 5)
(1174, 256, 256)

Image: 2022_06_20.npz
(1251, 256, 256, 5)
(1251, 256, 256)

Image: 2022_10_23.npz
(1164, 256, 256, 5)
(1164, 256, 256)

Image: 2022_07_25.npz
(1258, 256, 256, 5)
(1258, 256, 256)

Image: 2022_08_04.npz
(1319, 256, 256, 5)
(1319, 256, 256)

Image: 2022_07_10.npz
(1323, 256, 256, 5)
(1323, 256, 256)

Image: 2022_07_30.npz


In [5]:
print(f'Started at: {datetime.datetime.now()}')

x_output_shape = (18350, 256, 256, 5) # can not deal with rest, should be 18307
y_output_shape = (18350, 256, 256)

# memory-mapped array to hold the output data
output_file_x = np.memmap(os.path.join(data_path, "combined_x_input.npy"), mode="w+", shape=x_output_shape, dtype=np.float32)
output_file_y = np.memmap(os.path.join(data_path, "combined_y_mask.npy"), mode="w+", shape=y_output_shape, dtype=np.float32)

file_count = 0


for file in os.listdir(data_path):
  if not os.path.isdir(os.path.join(data_path, file)) and not file.startswith('combined') and not file.startswith('compressed'):
    file_count += 1
    print(f'loading file {file_count}: {file}')
    
    # Load the compressed numpy array in chunks using np.memmap
    with np.load(os.path.join(data_path, file), mmap_mode="r") as data:
        
        chunk_size = 50 
        num_chunks = data["x_input"].shape[0] // chunk_size
        
        for chunk in range(num_chunks):
            print(f'Chunk {chunk}')
            start_idx = (file_count - 1) * num_chunks * chunk_size + chunk * chunk_size
            end_idx = start_idx + chunk_size
            
            # Write the chunk to the output file using the memory-mapped array
            print(f'output file indexes: {start_idx} : {end_idx}  Chunk shape {data["x_input"][chunk * chunk_size:(chunk + 1) * chunk_size, ...].shape}')
            
            output_file_x[start_idx:end_idx, ...] = data["x_input"][chunk * chunk_size:(chunk + 1) * chunk_size, ...]
            output_file_y[start_idx:end_idx, ...] = data["y_mask"][chunk * chunk_size:(chunk + 1) * chunk_size, ...]

# 4h until here????

print('finished concatenating arrays')

output_file_x.flush()
output_file_y.flush()
print('finished flushing')

np.savez_compressed(os.path.join(data_path, "compressed_combined.npz"), x_input=output_file_x, y_mask=output_file_y)
print('finish compressing')

# Delete the memory-mapped array to free up resources
del output_file_x
del output_file_y

print(f'Finished at: {datetime.datetime.now()}')


Started at: 2023-04-01 15:16:11.933561
loading file 1: 2022_10_13.npz
Chunk 0
output file indexes: 0 : 50  Chunk shape (50, 256, 256, 5)
Chunk 1
output file indexes: 50 : 100  Chunk shape (50, 256, 256, 5)
Chunk 2
output file indexes: 100 : 150  Chunk shape (50, 256, 256, 5)
Chunk 3
output file indexes: 150 : 200  Chunk shape (50, 256, 256, 5)
Chunk 4
output file indexes: 200 : 250  Chunk shape (50, 256, 256, 5)
Chunk 5
output file indexes: 250 : 300  Chunk shape (50, 256, 256, 5)
Chunk 6
output file indexes: 300 : 350  Chunk shape (50, 256, 256, 5)
Chunk 7
output file indexes: 350 : 400  Chunk shape (50, 256, 256, 5)
Chunk 8
output file indexes: 400 : 450  Chunk shape (50, 256, 256, 5)
Chunk 9
output file indexes: 450 : 500  Chunk shape (50, 256, 256, 5)
Chunk 10
output file indexes: 500 : 550  Chunk shape (50, 256, 256, 5)
Chunk 11
output file indexes: 550 : 600  Chunk shape (50, 256, 256, 5)
Chunk 12
output file indexes: 600 : 650  Chunk shape (50, 256, 256, 5)
Chunk 13
output file 

ValueError: ignored

In [4]:
array_x = np.load(f'{data_path}/compressed_combined_x_input.npz', allow_pickle=True)
array_y = np.load(f'{data_path}/compressed_combined_y_mask.npz', allow_pickle=True)

# todo: do data splits while loading in batch

In [6]:
x_input = array_x['y_mask'] # should be x_input forgot to change
#print(array_x.files)
#print(x_input.shape)

In [7]:
y_mask = array_y['y_mask']
#print(array_y.files)
#print(y_mask.shape)

In [8]:
x_input = x_input[:19100]
#x_input.shape

In [11]:
y_mask.shape

(19100, 256, 256)

In [None]:
with np.load(f'{data_path}/compressed_combined.npz') as data:
    
    x_input = data['x_input.npy']
    y_mask = data['y_mask.npy']
    chunk_size = 60
    
    for i in range(0, len(x_input), chunk_size):
        
        chunk_x = x_input[i:i+chunk_size]
        
        chunk_y = y_mask[i:i+chunk_size]
        
        print(chunk_x.shape)
        print(chunk_y.shape)
        


time to execute npy: 50 min

system ram needed: ~10 GB

crashed on last img 2022_08_09: ValueError                                Traceback (most recent call last)

<ipython-input-19-2ced60a8ab90> in <module>
     19             end_idx = start_idx + chunk_size
     20             # Write the chunk to the output file using the memory-mapped array
---> 21             output_file[start_idx:end_idx, ...] = data["x_input"][j * chunk_size:(j + 1) * chunk_size, ...]
     22 
     23 # Delete the memory-mapped array to free up resources

ValueError: could not broadcast input array from shape (50,256,256,5) into shape (38,256,256,5)



file size npz
time to execute npz


=> saved as npy can not be read but when npy is then saved as compressed npz readin/ loading works, takes just more time but files are smaller



Problem: running out of ram when trying to save more than 5 images in one compressed npz, crashing always just at the savez_compressed() step

=> combining only 5 images into one file and then trying to combine those 2 files if possible

=> better way? Why does savez_compressed() consume most RAM?

50GB not enough for saving 10 images => loading + decompressing images takes ~ 30 GB why does last step, saving take so much RAM?
