The goal of this notebook is to pick n random simulations from all the available chunks (compressed Python dictionaries of simulations) and combine them into one dataset. This new dataset should be a good estimate of the distribution of the complete dataset.  

Currently a random 25 % of each chunk is selected as the sample from that chunk. This risks representing the complete distribution of the chunk poorly, since 25 % is a small fraction; e.g. certain parameters might be missing (if there are more gas mixing ratios than ch4, co2, co, h2o).
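
One way to check the risk described above is to compare the parameter values found in a full chunk with those found in its 25 % sample. The sketch below is only an illustration on synthetic data: the helper `missing_values`, the flat `{parameter: value}` layout of a simulation, and the value grid are all assumptions, not the actual structure of the ExoGAN chunks.

```python
import numpy as np


def missing_values(full, sample, keys):
    """Return, per key, the values that occur in `full` but not in `sample`."""
    missing = {}
    for key in keys:
        full_vals = {sim[key] for sim in full}
        sample_vals = {sim[key] for sim in sample}
        diff = full_vals - sample_vals
        if diff:
            missing[key] = sorted(diff)
    return missing


rng = np.random.default_rng(23)

# Synthetic stand-in for one chunk (hypothetical layout and value grid)
grid = [-8.0, -6.0, -4.0, -2.0]
chunk = [
    {"ch4": float(rng.choice(grid)), "h2o": float(rng.choice(grid))}
    for _ in range(1000)
]

# Draw 25 % of the chunk without replacement
picked = rng.choice(len(chunk), size=int(len(chunk) * 0.25), replace=False)
sample = [chunk[i] for i in picked]

# An empty dict means every value of every checked parameter is covered
print(missing_values(chunk, sample, ["ch4", "h2o"]))
```

If the report is not empty for a real chunk, a larger fraction or stratified sampling per parameter value might be worth considering.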

# Imports

In [1]:
import numpy as np
import glob

from keijzer_exogan import *

np.random.seed(23) # set random seed for reproducibility

In [2]:
"""
Collect the paths of all chunks; a random 25 % of the samples per chunk is added to X further below
"""
dir_ = '/datb/16011015/ExoGAN_data//'
paths = glob.glob(dir_+'chunck_*.pkgz')
print(len(paths))

100


In [3]:
paths[:3]

['/datb/16011015/ExoGAN_data/chunck_75.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_24.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_15.pkgz']

In [4]:
paths[:2]

['/datb/16011015/ExoGAN_data/chunck_75.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_24.pkgz']

In [5]:
paths[2:5]

['/datb/16011015/ExoGAN_data/chunck_15.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_41.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_54.pkgz']

In [6]:
X = []
for path in tqdm(paths[50:]): # iterate over the last 50 chunks (indexing paths[i] would load the first 50 instead)
    dict_ = load(path)
    
    X_complete = []
    for j in dict_.keys():
        X_complete.append(dict_[j])
     
    X_complete = np.array(X_complete)
    np.random.shuffle(X_complete) # shuffle the samples
    
    split_index = int(len(X_complete)*0.25) # select the first 25 % of each chunk (due to RAM limitations on laptop)
    X.append(X_complete[:split_index]) # add only first 25 % of each chunk to X
    
    # free memory
    del dict_, X_complete

100%|██████████| 50/50 [10:01<00:00, 12.13s/it]
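
The loop above shuffles each chunk and keeps the first 25 %. An equivalent way to draw the same kind of sample is to pick random indices without replacement, which avoids shuffling the full array. This is a minimal sketch on a synthetic dictionary; `sample_fraction` is a hypothetical helper and not part of `keijzer_exogan`.

```python
import numpy as np


def sample_fraction(chunk, fraction, rng):
    """Return a random `fraction` of the values of `chunk`, without replacement."""
    keys = list(chunk.keys())
    n_keep = int(len(keys) * fraction)
    picked = rng.choice(len(keys), size=n_keep, replace=False)
    return [chunk[keys[i]] for i in picked]


rng = np.random.default_rng(23)

# Synthetic stand-in for one loaded chunk (hypothetical content)
chunk = {i: {"id": i} for i in range(100)}

subset = sample_fraction(chunk, 0.25, rng)
print(len(subset))  # 25
```

Passing an explicit `rng` keeps the draw reproducible without relying on NumPy's global random state.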


In [7]:
len(X[0]),len(X[1])

(25000, 25000)

In [8]:
%%time
X = np.array(X) 

CPU times: user 12.9 s, sys: 8.39 s, total: 21.3 s
Wall time: 21.2 s


In [9]:
X.shape

(50, 25000)
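
The shape above keeps one row per chunk. If a single flat dataset is wanted, as stated in the goal of the notebook, the per-chunk selections can be concatenated along the first axis. The sketch below assumes each selection is a 1-D object array of dictionaries and uses small synthetic data.

```python
import numpy as np


def make_selection(start, size):
    """Build a 1-D object array of synthetic simulations (hypothetical layout)."""
    out = np.empty(size, dtype=object)
    for k in range(size):
        out[k] = {"id": start + k}
    return out


# Three synthetic per-chunk selections of four simulations each
selections = [make_selection(start, 4) for start in (0, 4, 8)]

flat = np.concatenate(selections)  # one row per simulation
print(flat.shape)  # (12,)
```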

In [10]:
np.save(dir_+"selection//"+'last_chunks_25_percent.npy', X)
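
Because the saved array holds Python dictionaries, NumPy stores it as a pickled object array, and reading it back needs `allow_pickle=True`. A small round-trip sketch with synthetic data and a temporary file (the file name is made up for the example):

```python
import os
import tempfile

import numpy as np

# Synthetic object array of dictionaries (stand-in for the saved selection)
data = np.empty(3, dtype=object)
for i in range(3):
    data[i] = {"id": i}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_selection.npy")  # hypothetical file name
    np.save(path, data)
    # Object arrays are pickled on save, so loading must allow pickle
    restored = np.load(path, allow_pickle=True)

print(restored.shape)
```

Only load pickled arrays from sources you trust, since unpickling can execute arbitrary code.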

In [11]:
print('done')

done
