This script separates the images into training, validation, and testing sets. There are 11 labels: lake, plants, window, buildings, grass, animal, water, person, clouds, sky, and NA. Some images carry multiple labels, others just one. This first part of the script removes all images that are labelled NA or that belong to multiple categories.

When running the script for the first time, I discovered that some files contain no information. To get around this, I made a list of all the invalid files (MissingImages.txt). The single-label list must exclude these invalid files.

In [1]:
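import mmap
import numpy as np

labels = open("labels.txt", "r") # Open the labels file
newlabels = open("singlelabels.txt","wb") # Output: one "filename class_index" pair per line
missingInfo = open("MissingImages.txt",'r') # List of the files that contain no information
# Memory-map the missing-files list so it can be searched with s.find()
s = mmap.mmap(missingInfo.fileno(), 0, access=mmap.ACCESS_READ)
# Map each label name to an integer class index
values = dict()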
for index, lab in enumerate(['lake', 'plants', 'window', 'buildings',
                             'grass', 'animal', 'water', 'person',
                             'clouds', 'sky', 'NA']):
    values[lab] = index

This is also a good time to count how many images of each class we have. Each filename must also be checked against the missing-info list, which is why that file was memory-mapped with mmap in the previous cell.

In [2]:
class_freqs = np.zeros(10) # One counter per real class; 'NA' is excluded
for line in labels:
    words = line.split()
    # If length is 2 we have only a filename and a label
    if len(words) == 2 and words[1] != 'NA':
        # s.find returns -1 when the filename does not occur in the missing list
        if s.find(words[0]) == -1:
            string_to_write = words[0] + ' ' + str(values[words[1]]) + '\n'
            newlabels.write(string_to_write)
            class_freqs[values[words[1]]] += 1
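
One caveat about the `s.find(words[0])` test: it is a substring search over the whole file, so a name such as `73.jpg` would also be found inside `173.jpg` and be dropped by mistake. Below is a minimal sketch of an exact-match alternative. It assumes (the notebook does not confirm this) that MissingImages.txt holds one filename per line as its first whitespace-separated token. `load_missing` and `keep_line` are hypothetical helper names, and the sample lines are made up for illustration.

```python
def load_missing(lines):
    """Collect the first token of every non-empty line into a set."""
    missing = set()
    for line in lines:
        tokens = line.split()
        if tokens:
            missing.add(tokens[0])
    return missing


def keep_line(line, missing):
    """True for a single-label, non-NA line whose file is not in `missing`."""
    words = line.split()
    return len(words) == 2 and words[1] != 'NA' and words[0] not in missing


# Hypothetical data standing in for MissingImages.txt and labels.txt
missing = load_missing(["173.jpg\n", "\n", "9.jpg\n"])
sample = ["73.jpg person\n", "173.jpg sky\n", "5.jpg NA\n", "6.jpg sky clouds\n"]
kept = [line for line in sample if keep_line(line, missing)]
```

With these sample lines only `73.jpg person` survives, whereas a substring search would have discarded it because `73.jpg` occurs inside `173.jpg`.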

Read the first line back to check that the file was written properly:

In [3]:
newlabels.close()
labels.close()
newlabels = open("singlelabels.txt","r")
print newlabels.readline()

173.jpg 7



And report on how many of each class we have.

In [4]:
for index, lab in enumerate(['lake', 'plants', 'window', 'buildings',
                             'grass', 'animal', 'water', 'person',
                             'clouds', 'sky']):
    print "Number containing {} is {:.0f}".format(lab, class_freqs[index])

Number containing lake is 271
Number containing plants is 5591
Number containing window is 5222
Number containing buildings is 2707
Number containing grass is 3207
Number containing animal is 19240
Number containing water is 5663
Number containing person is 34597
Number containing clouds is 3354
Number containing sky is 9887
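
The printed counts can be cross-checked with a little arithmetic. The sketch below copies the ten numbers from the output above into a plain dict (the names `counts`, `total`, and `fractions` are only illustrative), then computes the total and each class's share of it.

```python
# Per-class counts copied from the output above
counts = {
    'lake': 271, 'plants': 5591, 'window': 5222, 'buildings': 2707,
    'grass': 3207, 'animal': 19240, 'water': 5663, 'person': 34597,
    'clouds': 3354, 'sky': 9887,
}

total = sum(counts.values())  # 89739
# Share of the dataset taken by each class
fractions = dict((lab, n / float(total)) for lab, n in counts.items())
largest = max(counts, key=counts.get)
smallest = min(counts, key=counts.get)
```

The classes are far from balanced: person accounts for well over a third of the images, while lake accounts for less than one percent.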


Now these are split into training, validation, and testing sets, using an 80% train, 10% validation, 10% test split.

There are 89739 images, which are split into the three groups:

In [5]:
import random as r
nItems = int(sum(class_freqs))
positions = range(0,nItems)
r.shuffle(positions) # Random order of the line indices
# Consecutive slices, so every index lands in exactly one set
train_idx = positions[:int(nItems*0.8)]
val_idx = positions[int(nItems*0.8):int(nItems*0.9)]
test_idx = positions[int(nItems*0.9):]
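
A minimal sketch of the same 80/10/10 split as a reusable helper, with a fixed seed so that the shuffle is reproducible and with contiguous slices so that each index falls in exactly one set. The function name `split_indices`, its parameters, and the default seed are assumptions made for illustration; they are not part of the notebook.

```python
import random


def split_indices(n_items, train_end=0.8, val_end=0.9, seed=0):
    """Shuffle 0..n_items-1 and cut the result at the two given fractions."""
    positions = list(range(n_items))
    random.Random(seed).shuffle(positions)  # private RNG, global state untouched
    first_cut = int(n_items * train_end)
    second_cut = int(n_items * val_end)
    train = positions[:first_cut]
    val = positions[first_cut:second_cut]
    test = positions[second_cut:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
sizes = (len(train_idx), len(val_idx), len(test_idx))  # (80, 10, 10)
```

Calling the helper twice with the same seed returns the same three lists, which makes it possible to regenerate the split files later.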

In [None]:
labels = open("singlelabels.txt", "r") # Open the labels file
test_labels = open("singlelabels_test.txt","wb")
train_labels = open("singlelabels_train.txt","wb")
val_labels = open("singlelabels_val.txt","wb")
# Sets give constant-time membership tests; "i in list" rescans the list for every line
test_set, train_set, val_set = set(test_idx), set(train_idx), set(val_idx)
for i, line in enumerate(labels):
    if i in test_set:
        test_labels.write(line)
    elif i in train_set:
        train_labels.write(line)
    elif i in val_set:
        val_labels.write(line)
labels.close()
test_labels.close()
train_labels.close()
val_labels.close()
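
A quick sanity check once the files are written is to confirm that the three split files together hold as many lines as singlelabels.txt. The sketch below uses two hypothetical helpers, `count_lines` and `splits_cover_source`, and demonstrates them on throwaway files with made-up contents in place of the real data.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path, "r") as handle:
        return sum(1 for _ in handle)


def splits_cover_source(source_path, split_paths):
    """True when the split files together hold as many lines as the source."""
    return sum(count_lines(p) for p in split_paths) == count_lines(source_path)


# Demonstration on throwaway files (hypothetical contents, not the real data)
tmp_dir = tempfile.mkdtemp()
demo = {
    "all.txt": ["a.jpg 0\n", "b.jpg 1\n", "c.jpg 2\n"],
    "train.txt": ["a.jpg 0\n"],
    "val.txt": ["b.jpg 1\n"],
    "test.txt": ["c.jpg 2\n"],
}
paths = {}
for name, lines in demo.items():
    paths[name] = os.path.join(tmp_dir, name)
    with open(paths[name], "w") as handle:
        handle.writelines(lines)

ok = splits_cover_source(
    paths["all.txt"],
    [paths["train.txt"], paths["val.txt"], paths["test.txt"]],
)
```

Matching line counts are a necessary check but not a sufficient one: they show that no line was lost, not that the three sets are disjoint.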

In [7]:
nItems

89739