# [Galaxy Zoo - The Galaxy Challenge](https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge)

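Important note: The images provided by Kaggle must be unpacked into a folder called ``./data/images_training_rev1`` (at the root of the project). The code below reaches this folder through the relative path ``../data``, so it assumes the notebook runs from a subdirectory one level below the project root.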

Next, we create an 80/20 split of the dataset. We randomly select 80% of the images and move them into a training directory; any of these images may be used for training later on. The remaining 20% are moved into a separate validation directory and are used to validate our model.

In [1]:
import glob
import math
import random
import os
import warnings

# Store the data directory paths in variables
data_dir = '../data'
original_data_dir = data_dir + '/images_training_rev1'
training_dir = data_dir + '/training'
validation_dir = data_dir + '/validation'

def load_img_paths():
    '''
    Retrieve the full path of all images in the dataset
    '''
    return glob.glob(original_data_dir + '/*.jpg')

def get_train_set_size(dataset_size, train_set_split = 80):
    '''
    Return size of training set based on the size of the entire dataset and the desired split
    '''
    assert(dataset_size > 0)
    assert(train_set_split > 0)
    assert(train_set_split < 100)
    
    return math.floor((train_set_split * dataset_size)/100)

def create_train_and_validation_dirs(img_paths, train_set_size):
    '''
    Randomly select the desired number of images and move them to the new training directory.
    The remaining images are moved to a new directory to be used for validation.
    '''
    assert(len(img_paths) > 0)
    assert(train_set_size > 0)
    
    # Randomly select images that will be in each set
    random.shuffle(img_paths)
    train_img_paths = img_paths[0:train_set_size]
    validation_img_paths = img_paths[train_set_size:]

    # Create the training and validation directories if they don't exist yet
    os.makedirs(training_dir, exist_ok=True)
    os.makedirs(validation_dir, exist_ok=True)
        
    # Place training and validation images in their respective directories
    for x in train_img_paths:
        os.rename(x, x.replace(original_data_dir, training_dir))
    for x in validation_img_paths:
        os.rename(x, x.replace(original_data_dir, validation_dir))
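
Because `random.shuffle` is called without a seed, running the split on a fresh copy of the data gives a different partition each time. Below is a minimal sketch of a reproducible variant that only computes the two lists and leaves the filesystem untouched. The function name `split_paths` and its `seed` parameter are assumptions made for illustration; they are not part of this notebook.

```python
import random


def split_paths(paths, train_pct=80, seed=42):
    """Return (train, validation) lists without touching the filesystem.

    Hypothetical helper: `split_paths` and `seed` are illustrative names.
    """
    if not 0 < train_pct < 100:
        raise ValueError("train_pct must be strictly between 0 and 100")

    # Sort first so the result does not depend on the order glob returns files in
    shuffled = sorted(paths)
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic mirrors math.floor((train_pct * n) / 100)
    train_size = (train_pct * len(shuffled)) // 100
    return shuffled[:train_size], shuffled[train_size:]


example_paths = ["img_%d.jpg" % i for i in range(10)]
train, validation = split_paths(example_paths)
print(len(train), len(validation))  # 8 2
```

The two returned lists could then be fed to the same move loop used in `create_train_and_validation_dirs`.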

Let's go ahead and execute the helper functions defined above:

In [2]:
img_paths = load_img_paths()
if len(img_paths) > 0:
    train_set_size = get_train_set_size(len(img_paths), 80)
    create_train_and_validation_dirs(img_paths, train_set_size)
else:
    warning_msg = """
        No images were found in the '%s' directory. Either the training and validation
        directories have already been created, or the dataset structure is not set up correctly.
    """ % original_data_dir
    warnings.warn(warning_msg)

        No images were found in the '../data/images_training_rev1' directory. Either the training and validation
        directories have already been created, or the dataset structure is not set up correctly.
    
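
A quick way to confirm that the split has the expected proportions is to count the images in each directory. The sketch below assumes the directory layout used in this notebook; `count_images` and `split_summary` are hypothetical helper names introduced for illustration.

```python
import glob
import os


def count_images(directory, pattern="*.jpg"):
    """Count the files in `directory` whose names match `pattern`."""
    return len(glob.glob(os.path.join(directory, pattern)))


def split_summary(training_dir, validation_dir):
    """Return (train_count, validation_count, train_fraction)."""
    n_train = count_images(training_dir)
    n_val = count_images(validation_dir)
    total = n_train + n_val
    # Guard against division by zero when neither directory holds images
    fraction = n_train / total if total else 0.0
    return n_train, n_val, fraction


# Paths taken from the notebook's own variables
n_train, n_val, fraction = split_summary("../data/training", "../data/validation")
print("training: %d, validation: %d, training fraction: %.3f" % (n_train, n_val, fraction))
```

A training fraction close to 0.8 indicates that the split matches the requested 80/20 ratio. A missing or empty directory simply counts as zero images.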
