## Train-Validation-Test set split

In this notebook, I split the full dataset into training, validation, and test sets.


This split is performed both on the images and the annotations.

In [18]:
# Import basic packages
import numpy as np 
import scipy as sp
import pandas as pd
import os
import json
from PIL import Image
import skimage.draw
import numpy.ma as ma
import imageio

# Train test split
from sklearn.model_selection import train_test_split

### Import image data 

Let's first import the images; these are stored as NumPy arrays.
They come from Mike Wang's training set for the paper "A Machine Learning Approach to the Detection of Ghosting Artifacts in Dark Energy Survey Images".

In [4]:
X_train_temp = np.load("train_set/x_ghstcln400-cnn5_ins4.npy") / 255.0  # Images
y_train_temp = np.load("train_set/y_ghstcln400-cnn5_fix9z0_ins4.npy")  # Labels
z_train_temp = np.load("train_set/z_ghstcln400-cnn5_ins4.npy")  # Expnum, year, and filter for each image in "x"

Keep only the ghosts (those with label = 1); get the exposure numbers.

In [5]:
X_train_ghosts = X_train_temp[(y_train_temp==1.0)]
z_train_ghosts = z_train_temp[(y_train_temp==1.0)]
Expnums = z_train_ghosts['expnum']
print(len(Expnums))

2499


For the training set we used, we also require that the neural network described in the paper above classified each ghost as such with probability > $95\%$.
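
The selection described above (label equal to 1 and classifier probability above $95\%$) can be sketched as a combined boolean mask. The array `probs` and all values below are hypothetical stand-ins for the network's predicted ghost probabilities, which are not loaded in this notebook.

```python
import numpy as np

# Toy stand-ins (assumed, not the real data): labels and predicted ghost probabilities
labels = np.array([1.0, 1.0, 0.0, 1.0, 0.0])
probs = np.array([0.99, 0.80, 0.97, 0.96, 0.10])  # hypothetical classifier outputs

# Keep samples that are labeled as ghosts AND were predicted as ghosts with p > 0.95
confident_ghosts = (labels == 1.0) & (probs > 0.95)

print(np.flatnonzero(confident_ghosts))
```

Only the first and fourth toy samples satisfy both conditions.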

### Import annotations

Now, let's import the `json` file containing the annotations for the full dataset of 2000 images we used.

In [9]:
# Use a context manager so the file is closed after loading
with open("annotations_full.json") as f:
    annotations_full = json.load(f)
annotations_file = list(annotations_full['_via_img_metadata'].values())  # don't need the dict keys

In [10]:
# Let's see its size, it should be 2000
print(np.shape(annotations_file))

(2000,)


Let's start by getting the filenames.

In [13]:
# Get filenames (one per annotated image)
filenames = [entry['filename'] for entry in annotations_file]

Get all the annotations as a list of dictionaries (one dictionary per image).

In [136]:
annotations = [a for a in annotations_file]
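
For orientation, the sketch below builds a toy entry in the layout that the code above relies on: a top-level `_via_img_metadata` dictionary whose values each carry a `filename`. The `regions` and `shape_attributes` fields follow the VGG Image Annotator (VIA) project format as I understand it and are an assumption here; the filename, size, and coordinates are made up.

```python
# Toy annotation file in the assumed VIA-style layout (all values are made up)
toy_annotations_full = {
    "_via_img_metadata": {
        "Ghost_img_D00000001.jpg12345": {
            "filename": "Ghost_img_D00000001.jpg",
            "size": 12345,
            "regions": [
                {
                    "shape_attributes": {
                        "name": "polygon",
                        "all_points_x": [10, 50, 30],
                        "all_points_y": [20, 25, 60],
                    },
                    "region_attributes": {},
                }
            ],
            "file_attributes": {},
        }
    }
}

# Same two steps as in the notebook: drop the dict keys, then collect filenames
toy_annotations_file = list(toy_annotations_full["_via_img_metadata"].values())
toy_filenames = [entry["filename"] for entry in toy_annotations_file]
print(toy_filenames)
```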

### Split now both annotations and images

Now I have to split the annotations and images into training, validation, and test sets.

To do that, I will simply create an array of the integer indices 0-1999 and split it into $70\%$ training, $15\%$ validation, and $15\%$ test sets. The split is done in two steps: first $70\%$ / $30\%$, then the $30\%$ is halved.

In [156]:
indices_full = np.arange(2000)

# Split into training and validation-test (combined)
indices_train, indices_valtest = train_test_split(indices_full, test_size=0.3, random_state=42)

# Split validation-test sets
indices_val, indices_test = train_test_split(indices_valtest, test_size=0.5, random_state=42)

As a sanity check, let's print their sizes.

In [157]:
print(len(indices_train),len(indices_val),len(indices_test))

1400 300 300
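
Beyond the sizes, it can be worth checking that the three index sets are disjoint and together cover every image. In the sketch below, `check_partition` is a hypothetical helper name, and the demo uses a plain NumPy permutation as a stand-in for `train_test_split` so that it runs without scikit-learn; it only mirrors the 1400/300/300 sizes, not the actual indices.

```python
import numpy as np

def check_partition(train, val, test, n_total):
    """Return True if the three index arrays are disjoint and cover 0..n_total-1."""
    combined = np.concatenate([train, val, test])
    # Disjoint and complete <=> the sorted combined indices equal 0, 1, ..., n_total-1
    return combined.size == n_total and np.array_equal(np.sort(combined), np.arange(n_total))

# Stand-in split (NOT train_test_split): shuffle the indices, then cut at 70% and 85%
rng = np.random.default_rng(42)
shuffled = rng.permutation(2000)
demo_train, demo_val, demo_test = np.split(shuffled, [1400, 1700])

print(len(demo_train), len(demo_val), len(demo_test))
print(check_partition(demo_train, demo_val, demo_test, 2000))
```

With the notebook's own arrays, the call would be `check_partition(indices_train, indices_val, indices_test, 2000)`.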


#### Split and save annotation json files

Now let's split the annotations (dictionaries) into training, validation, and test sets and save each to its own `json` file.

- For training set

In [163]:
#Array containing the filenames for the training set
filenames_train = np.array(filenames)[indices_train] 
# Now for the annotations of the training set
annotations_train = []
for i in range(len(indices_train)):
    local_annot = annotations[indices_train[i]]
    annotations_train.append(local_annot)

In [169]:
# Save into a json file (the context manager closes the file automatically)
json_annot_train = json.dumps(annotations_train)
with open("annotations_train.json", "w") as f:
    f.write(json_annot_train)

- For validation set

In [165]:
#Array containing the filenames for the validation set
filenames_val = np.array(filenames)[indices_val] 
# Now for the annotations of the validation set
annotations_val = []
for i in range(len(indices_val)):
    local_annot = annotations[indices_val[i]]
    annotations_val.append(local_annot)

In [170]:
# Save into a json file (the context manager closes the file automatically)
json_annot_val = json.dumps(annotations_val)
with open("annotations_val.json", "w") as f:
    f.write(json_annot_val)

- For test set

In [171]:
#Array containing the filenames for the test set
filenames_test = np.array(filenames)[indices_test] 
# Now for the annotations of the test set
annotations_test = []
for i in range(len(indices_test)):
    local_annot = annotations[indices_test[i]]
    annotations_test.append(local_annot)

In [172]:
# Save into a json file (the context manager closes the file automatically)
json_annot_test = json.dumps(annotations_test)
with open("annotations_test.json", "w") as f:
    f.write(json_annot_test)
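
The three select-and-save steps above follow the same pattern, so they could be folded into one helper. This is a minimal sketch on toy data; `save_annotation_split` is a hypothetical name, and the demo writes to a temporary directory rather than to the notebook's output files.

```python
import json
import os
import tempfile

def save_annotation_split(annotations, indices, path):
    """Select annotations[i] for each i in indices and write them to path as JSON."""
    subset = [annotations[int(i)] for i in indices]
    # The context manager closes the file even if serialization fails
    with open(path, "w") as f:
        json.dump(subset, f)
    return subset

# Toy annotations (made up) standing in for the list of per-image dictionaries
toy_annotations = [{"filename": "a.jpg"}, {"filename": "b.jpg"}, {"filename": "c.jpg"}]
out_path = os.path.join(tempfile.mkdtemp(), "annotations_demo.json")
saved = save_annotation_split(toy_annotations, [2, 0], out_path)

# Read the file back to confirm the round trip
with open(out_path) as f:
    reloaded = json.load(f)
print(reloaded == saved)
```

In the notebook this would be called once per split, for example `save_annotation_split(annotations, indices_val, "annotations_val.json")`.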

### Now split the images

- For the training set

In [186]:
# Get exposure numbers
Expnum_train = []

for i in range(len(indices_train)):
    filename_loc = filenames_train[i]
    expnum_loc = filename_loc[10:19]
    Expnum_train.append(expnum_loc)
    
    
print(Expnum_train[-1])

D00589715
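
The slice `[10:19]` relies on every filename having the exposure number at a fixed position. A slightly more defensive sketch uses a regular expression. It assumes filenames of the form `Ghost_img_<expnum>.jpg`, where `<expnum>` is `D` followed by eight digits; this is inferred from the printed value and from the save path used in this notebook, not confirmed. `extract_expnum` is a hypothetical helper name.

```python
import re

# Assumed pattern: "Ghost_img_" followed by "D" and eight digits
EXPNUM_PATTERN = re.compile(r"Ghost_img_(D\d{8})")

def extract_expnum(filename):
    """Return the exposure identifier in filename, or raise if the pattern is absent."""
    match = EXPNUM_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"No exposure number found in {filename!r}")
    return match.group(1)

# Illustrative filename built from the value printed above
name = "Ghost_img_D00589715.jpg"
print(extract_expnum(name), name[10:19])
```

Unlike the fixed slice, this version fails loudly on a filename that does not match the expected form instead of silently returning the wrong characters.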


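Now, populate the `Training_set` directory with the corresponding images. Note that the save call at the end of the loop is commented out; uncomment it to actually write the files.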

In [202]:
n = len(indices_train)

for i in range(n):
    # Get the exposure number for the i-th element of training set
    Expnum_loc = Expnum_train[i]
    # Local ghost 
    X_loc = X_train_ghosts[Expnums==Expnum_loc][0,:,:]
    # ================================================
    # ================================================
    # Initialize
    leng = np.shape(X_loc)[0] #Size of image (400pixels)
    # For three channels
    X_ghost_3ch = np.zeros((leng,leng,3))

    #Populate 
    X_ghost_3ch[:,:,0] = X_loc
    X_ghost_3ch[:,:,1] = X_loc
    X_ghost_3ch[:,:,2] = X_loc
    
    # Save the image
    #imageio.imwrite("./Training_set/Ghost_img_{0}.jpg".format(Expnum_loc),(X_ghost_3ch*255.).astype(np.uint8))   
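
The channel-by-channel copy above can also be written with `np.repeat`. The sketch below is an equivalent formulation on a small random array, not the notebook's data. `to_three_channel_uint8` is a hypothetical helper name, and it assumes pixel values in [0, 1], as produced by the division by 255 at load time.

```python
import numpy as np

def to_three_channel_uint8(image_2d):
    """Copy a 2D image with values in [0, 1] into three identical uint8 channels."""
    # Add a trailing channel axis, then repeat it three times: (H, W) -> (H, W, 3)
    stacked = np.repeat(image_2d[:, :, np.newaxis], 3, axis=2)
    # Rescale to 0-255 and cast to uint8, as in the commented-out save line
    return (stacked * 255.0).astype(np.uint8)

demo = np.random.default_rng(0).random((4, 4))
rgb = to_three_channel_uint8(demo)
print(rgb.shape, rgb.dtype)
```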

- For the validation set

In [189]:
# Get exposure numbers
Expnum_val = []

for i in range(len(indices_val)):
    filename_loc = filenames_val[i]
    expnum_loc = filename_loc[10:19]
    Expnum_val.append(expnum_loc)
    
    
print(Expnum_val[-1])

D00577759


In [203]:
n = len(indices_val)

for i in range(n):
    # Get the exposure number for the i-th element of the validation set
    Expnum_loc = Expnum_val[i]
    # Local ghost. Note: "train" in X_train_ghosts refers to the initial
    # (full) training set, not to the training split
    X_loc = X_train_ghosts[Expnums==Expnum_loc][0,:,:]
    # ================================================
    # ================================================
    # Initialize
    leng = np.shape(X_loc)[0] #Size of image (400pixels)
    # For three channels
    X_ghost_3ch = np.zeros((leng,leng,3))

    #Populate 
    X_ghost_3ch[:,:,0] = X_loc
    X_ghost_3ch[:,:,1] = X_loc
    X_ghost_3ch[:,:,2] = X_loc
    
    # Save the image
    #imageio.imwrite("./Validation_set/Ghost_img_{0}.jpg".format(Expnum_loc),(X_ghost_3ch*255.).astype(np.uint8))   

- For the test set

In [204]:
# Get exposure numbers
Expnum_test = []

for i in range(len(indices_test)):
    filename_loc = filenames_test[i]
    expnum_loc = filename_loc[10:19]
    Expnum_test.append(expnum_loc)
    
    
print(Expnum_test[-1])

D00262304


In [206]:
n = len(indices_test)

for i in range(n):
    # Get the exposure number for the i-th element of the test set
    Expnum_loc = Expnum_test[i]
    # Local ghost. Note: "train" in X_train_ghosts refers to the initial
    # (full) training set, not to the training split
    X_loc = X_train_ghosts[Expnums==Expnum_loc][0,:,:]
    # ================================================
    # ================================================
    # Initialize
    leng = np.shape(X_loc)[0] #Size of image (400pixels)
    # For three channels
    X_ghost_3ch = np.zeros((leng,leng,3))

    #Populate 
    X_ghost_3ch[:,:,0] = X_loc
    X_ghost_3ch[:,:,1] = X_loc
    X_ghost_3ch[:,:,2] = X_loc
    
    # Save the image
    #imageio.imwrite("./Test_set/Ghost_img_{0}.jpg".format(Expnum_loc),(X_ghost_3ch*255.).astype(np.uint8))   
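
Each loop above re-evaluates the mask `Expnums==Expnum_loc` over all ghosts for every image. One possible alternative is to build a lookup from exposure number to array index once and reuse it. The sketch uses toy values; `build_first_index` is a hypothetical name, and it keeps the first occurrence of each exposure number to match the `[0,:,:]` selection.

```python
import numpy as np

def build_first_index(expnums):
    """Map each exposure number to the index of its first occurrence."""
    lookup = {}
    for idx, expnum in enumerate(expnums):
        # setdefault keeps the earliest index if an exposure number repeats
        lookup.setdefault(expnum, idx)
    return lookup

# Toy data (made up): three 2x2 "images", one exposure number repeated
toy_expnums = np.array(["D00000003", "D00000001", "D00000003"])
toy_images = np.arange(3 * 2 * 2).reshape(3, 2, 2)

lookup = build_first_index(toy_expnums.tolist())
image = toy_images[lookup["D00000003"]]

# Same result as the mask-based selection used in the notebook
print(np.array_equal(image, toy_images[toy_expnums == "D00000003"][0, :, :]))
```

A missing exposure number then raises a `KeyError` at lookup time, which is easier to diagnose than an index error on an empty mask result.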