The goal of this notebook is to pick n random simulations from each of the available chunks (compressed Python dictionaries) and to combine these simulations into one dataset. This smaller selection of the ExoGAN dataset should be representative of the distribution of the complete dataset, e.g. all parameters should still be uniformly distributed.

* Training dataset: a random 25% of each chunk, for the first 50 ExoGAN chunks ($1.25 \cdot 10^6$ simulations).
* Testing dataset: a random 25% of each chunk, for the last 50 ExoGAN chunks. For development purposes, however, a much smaller selection (5k simulations) of the last 50 chunks is used.
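
Whether a 25% selection still follows the distribution of the full data can be checked by comparing histograms. The sketch below is only an illustration on synthetic data: the single parameter drawn from a uniform distribution and the chunk size of 100,000 are assumptions for the example, not values read from the ExoGAN files.

```python
import numpy as np

rng = np.random.default_rng(23)

# Synthetic stand-in for one parameter of one chunk (assumed uniform on [0, 1))
full = rng.uniform(0.0, 1.0, size=100_000)

# Random 25 % selection without replacement
subset = rng.choice(full, size=len(full) // 4, replace=False)

# Fraction of samples per bin, for the full data and for the selection
bins = np.linspace(0.0, 1.0, 11)
full_frac = np.histogram(full, bins=bins)[0] / len(full)
subset_frac = np.histogram(subset, bins=bins)[0] / len(subset)

# Largest difference between the two histograms; small values mean
# the selection represents the full distribution well
max_diff = np.abs(full_frac - subset_frac).max()
print(f'largest per-bin difference: {max_diff:.4f}')
```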

<img src="../appendix/ExoGAN_train_test.png">

For the ExoGAN data download and description, see the [ExoGAN repository](https://github.com/ucl-exoplanets/ExoGAN_public).

# Imports

In [1]:
import numpy as np
import glob

from keijzer_exogan import *

np.random.seed(23) # set random seed for reproducibility

# Retrieving the file paths

In [2]:
"""
Collect the file paths of all chunks (the sampling itself happens further below)
"""
dir_ = '/datb/16011015/ExoGAN_data//'
paths = glob.glob(dir_+'chunck_*.pkgz')
print(len(paths))

100
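
`glob.glob()` returns paths in an arbitrary, filesystem-dependent order, so "the first 50 chunks" means the first 50 paths in that order. If a reproducible split by chunk number is preferred, the paths could be sorted numerically first. The following is a minimal sketch on a few example paths; the helper `chunk_number` is hypothetical and not part of `keijzer_exogan`.

```python
import re

def chunk_number(path):
    """Return the integer N from a file name of the form 'chunck_N.pkgz'."""
    match = re.search(r'chunck_(\d+)\.pkgz$', path)
    if match is None:
        raise ValueError(f'unexpected file name: {path}')
    return int(match.group(1))

# Example paths in arbitrary order (illustrative, not the real glob output)
example_paths = [
    '/datb/16011015/ExoGAN_data/chunck_75.pkgz',
    '/datb/16011015/ExoGAN_data/chunck_24.pkgz',
    '/datb/16011015/ExoGAN_data/chunck_15.pkgz',
    '/datb/16011015/ExoGAN_data/chunck_9.pkgz',
]

# Sort numerically; a plain string sort would place chunck_9 after chunck_75
sorted_paths = sorted(example_paths, key=chunk_number)
print([chunk_number(p) for p in sorted_paths])
```

Sorting changes which chunks end up in the training and testing halves, so it should be applied consistently in every notebook that uses the split.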


## Making sure there is no data leakage at the split point

In [3]:
paths[:6]

['/datb/16011015/ExoGAN_data/chunck_75.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_24.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_15.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_41.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_54.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_35.pkgz']

In [5]:
paths[:5]

['/datb/16011015/ExoGAN_data/chunck_75.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_24.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_15.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_41.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_54.pkgz']

In [6]:
paths[5:7]

['/datb/16011015/ExoGAN_data/chunck_35.pkgz',
 '/datb/16011015/ExoGAN_data/chunck_16.pkgz']

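Notice that `paths[:5]` and `paths[5:]` are split correctly: `paths[:5]` ends at `chunck_54`, `paths[5:]` starts at `chunck_35`, and no path appears in both slices.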
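
The visual check above can also be done programmatically. The sketch below uses a hypothetical helper `check_split` and a toy list of file names; for the real data it would be called with `paths` and a split index of 50.

```python
def check_split(paths, split_index):
    """Split paths at split_index and verify that the two parts do not overlap."""
    train_paths = paths[:split_index]
    test_paths = paths[split_index:]

    # No path may appear in both parts
    overlap = set(train_paths) & set(test_paths)
    assert not overlap, f'paths in both splits: {overlap}'

    # Together the two parts must cover every path exactly once
    assert len(train_paths) + len(test_paths) == len(paths)
    return train_paths, test_paths

# Toy file names for illustration
toy_paths = ['chunck_75.pkgz', 'chunck_24.pkgz', 'chunck_15.pkgz',
             'chunck_41.pkgz', 'chunck_54.pkgz', 'chunck_35.pkgz']

train_paths, test_paths = check_split(toy_paths, 5)
print(len(train_paths), len(test_paths))
```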

## Loading chunks, selecting a sample per chunk and saving to a list

In [6]:
X = []
for path in tqdm(paths[:50]): # the first 50 paths are the training chunks
    dict_ = load(path)
    
    # collect all simulations of this chunk in one array
    X_complete = np.array(list(dict_.values()))
    np.random.shuffle(X_complete) # shuffle the samples
    
    split_index = int(len(X_complete)*0.25) # keep only 25 % of each chunk (due to RAM limitations on laptop)
    X.append(X_complete[:split_index]) # after shuffling, the first 25 % is a random 25 %
    
    # free memory
    del dict_, X_complete

100%|██████████| 50/50 [09:57<00:00, 12.29s/it]


In [7]:
len(X[0]),len(X[1])

(25000, 25000)
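
As an alternative to shuffling the complete chunk and slicing it, the selection can be made by drawing random indices without replacement. This is a sketch of that idea on toy data, not the method used in this notebook; `sample_fraction` and the toy dictionaries are hypothetical.

```python
import numpy as np

def sample_fraction(samples, fraction, rng):
    """Return a random selection of int(len(samples) * fraction) items, without replacement."""
    n_selected = int(len(samples) * fraction)
    indices = rng.choice(len(samples), size=n_selected, replace=False)
    return [samples[i] for i in indices]

rng = np.random.default_rng(23) # seeded generator for reproducibility

# Toy chunk: 1000 placeholder simulations
toy_chunk = [{'id': i} for i in range(1000)]

selection = sample_fraction(toy_chunk, 0.25, rng)
print(len(selection))
```

Only the selected items are copied, which avoids holding a second, shuffled copy of the whole chunk in memory.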

## Convert the list to an array

In [8]:
%%time
X = np.array(X)

CPU times: user 12.5 s, sys: 7.87 s, total: 20.4 s
Wall time: 20.4 s


In [9]:
X.shape

(50, 25000)

Shape: (chunk, number of simulations)

## Saving array to disk
Notice that `X` is an object array whose elements are dictionaries.  
`np.save()` can only store such an array by pickling it, so it has to be loaded again with `np.load(..., allow_pickle=True)`.  
This increases RAM usage by a lot.  
  
One improvement would be to get rid of the dictionaries and save the parameters and spectra to separate `.npy` files.

In [11]:
np.save(dir_+"selection//"+'first_chunks_25_percent_test.npy', X)

print('done')

done
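
The improvement mentioned above, storing parameters and spectra as separate numeric arrays, could look roughly like the sketch below. The dictionary layout (`'params'`, `'spectrum'`), the parameter names and the output file names are all hypothetical and chosen for illustration only; the real ExoGAN dictionaries may be structured differently.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# HYPOTHETICAL layout of one simulation: a dict of scalar parameters plus a spectrum
toy_simulations = [
    {'params': {'temperature': rng.uniform(1000, 2000),
                'radius': rng.uniform(0.8, 1.5)},
     'spectrum': rng.random(8)}
    for _ in range(5)
]

# Fix the parameter order so that every row has the same columns
param_names = ['temperature', 'radius']
params = np.array([[sim['params'][name] for name in param_names]
                   for sim in toy_simulations])
spectra = np.stack([sim['spectrum'] for sim in toy_simulations])

# Plain float arrays are stored without pickling
out_dir = tempfile.mkdtemp()
np.save(os.path.join(out_dir, 'params.npy'), params)
np.save(os.path.join(out_dir, 'spectra.npy'), spectra)

# Loading works with the default allow_pickle=False
params_loaded = np.load(os.path.join(out_dir, 'params.npy'))
spectra_loaded = np.load(os.path.join(out_dir, 'spectra.npy'))
print(params_loaded.shape, spectra_loaded.shape)
```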
