This script separates the images into training and testing sets. There are 11 labels: lake, plants, window, buildings, grass, animal, water, person, clouds, sky, and NA. Some images carry several labels, others just one. This first part of the script removes all images that are labelled NA or that belong to more than one category.

When running the set for the first time, I discovered that some files contain no information. To get around this I made a list of all the invalid files. The single-label list needs to be modified to exclude these invalid instances.

In [12]:
import numpy as np
labels = open("labels_none_missing.txt", "r") # Open the labels file
newlabels = open("singlelabels.txt","wb")
values = dict()
for index, lab in enumerate(['lake', 'plants', 'window', 'buildings',
                             'grass', 'animal', 'water', 'person',
                             'clouds', 'sky', 'NA']):
    values[lab] = index

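This is also a good time to count how many images of each class we have. Checking each file against the missing-info list would require importing mmap, but the loop below reads labels_none_missing.txt, which (as its name suggests) already excludes those files, so no such check is made here.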
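The mmap lookup mentioned above does not appear in the code cells. A minimal sketch of what such a check could look like follows. The helper name `is_listed`, the temporary file, and the sample filenames are all hypothetical, and the list is assumed to hold one filename per line with a trailing newline. The sketch is written to run under both Python 2 and Python 3.

```python
import mmap
import os
import tempfile


def is_listed(mm, filename):
    """Return True if filename appears as a whole line in the mapped file."""
    needle = filename.encode("ascii") + b"\n"
    pos = mm.find(needle)
    while pos != -1:
        # Accept the hit only if it starts a line, so that "40.jpg"
        # does not match inside "140.jpg"
        if pos == 0 or mm[pos - 1:pos] == b"\n":
            return True
        pos = mm.find(needle, pos + 1)
    return False


# Hypothetical invalid-file list, written to a temporary file for the demo
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"12.jpg\n140.jpg\n")

with open(path, "rb") as f:
    # Length 0 maps the whole file; read-only access is enough here
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    found = is_listed(mm, "140.jpg")
    partial = is_listed(mm, "40.jpg")
    mm.close()
os.remove(path)
```

If the invalid list fits comfortably in memory, loading it into a Python set and testing membership is a simpler alternative to mapping the file.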

In [13]:
class_freqs = np.zeros(10)
for line in labels:
    words = line.split()
    # If length is 2 we have only a filename and a label
    if len(words) == 2:
        if words[1] != 'NA':
            string_to_write = words[0] + ' ' + str(values[words[1]]) + '\n'
            newlabels.write(string_to_write)
            class_freqs[values[words[1]]] += 1

Read the first line to check that we've done things properly

In [14]:
newlabels.close()
labels.close()
newlabels = open("singlelabels.txt","r")
print newlabels.readline()

173.jpg 7



And report on how many of each class we have.

In [15]:
for index, lab in enumerate(['lake', 'plants', 'window', 'buildings',
                             'grass', 'animal', 'water', 'person',
                             'clouds', 'sky']):
    print "Number containing {} is {:.0f}".format(lab, class_freqs[index])

Number containing lake is 271
Number containing plants is 5591
Number containing window is 5220
Number containing buildings is 2707
Number containing grass is 3207
Number containing animal is 19237
Number containing water is 5663
Number containing person is 34551
Number containing clouds is 3353
Number containing sky is 9874
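
The counts above are far from balanced: person has more than a hundred times as many images as lake. A small sketch that turns the printed counts into fractions of the single-label set makes this easier to see. The variable names are illustrative only, and the numbers are copied from the output above.

```python
# Single-label counts copied from the output above
counts = [('lake', 271), ('plants', 5591), ('window', 5220),
          ('buildings', 2707), ('grass', 3207), ('animal', 19237),
          ('water', 5663), ('person', 34551), ('clouds', 3353),
          ('sky', 9874)]

total = sum(n for _, n in counts)
# Fraction of the single-label images that falls in each class
fractions = dict((lab, n / float(total)) for lab, n in counts)
largest = max(counts, key=lambda pair: pair[1])[0]
smallest = min(counts, key=lambda pair: pair[1])[0]

for lab, n in counts:
    print("{:<10s}{:>7d}{:>7.1f}%".format(lab, n, 100 * fractions[lab]))
```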


Now these are split into training, validation, and testing sets, based on a split of 80% train, 10% validation, 10% test.

There are 89739 images. These are split into the three groups.

In [5]:
import random as r

nItems = int(sum(class_freqs))
positions = range(0,nItems)
r.shuffle(positions)
# Adjacent slices share their boundary index so that no image is dropped
train_idx = positions[:int(nItems*0.8)]
val_idx = positions[int(nItems*0.8):int(nItems*0.9)]
test_idx = positions[int(nItems*0.9):]
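
The same shuffle-and-slice idea can be wrapped in a function, which makes it easy to check that the three groups are disjoint and together cover every index. This is only a sketch: the function name `three_way_split` is hypothetical, and the fixed seed is an assumption added so that the split is repeatable (the cell above does not seed the generator).

```python
import random


def three_way_split(n_items, p_train=0.8, p_val=0.1, seed=0):
    """Shuffle range(n_items) and cut it into train, val and test index lists."""
    positions = list(range(n_items))  # list() so shuffle also works on Python 3
    random.Random(seed).shuffle(positions)
    n_train = int(n_items * p_train)
    n_val = int(n_items * (p_train + p_val))
    # Adjacent slices share their boundary, so no index is skipped or repeated
    return positions[:n_train], positions[n_train:n_val], positions[n_val:]


train, val, test = three_way_split(1000)
```

Sorting the concatenation of the three lists and comparing it with `list(range(1000))` is a quick way to confirm that nothing was lost.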

In [9]:
labels = open("singlelabels.txt", "r") # Open the labels file
test_labels = open("singlelabels_test.txt","wb")
train_labels = open("singlelabels_train.txt","wb")
val_labels = open("singlelabels_val.txt","wb")
# Sets make each membership test O(1) instead of a scan of the whole list
test_set, train_set, val_set = set(test_idx), set(train_idx), set(val_idx)
for i, line in enumerate(labels):
    if i in test_set:
        test_labels.write(line)
    elif i in train_set:
        train_labels.write(line)
    elif i in val_set:
        val_labels.write(line)
labels.close()
test_labels.close()
train_labels.close()
val_labels.close()

Before beginning the multitask model, let's see how often pairs of categories occur together. This can help guide which grouping should be put together first.
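
A sketch of how the pair counts might be gathered is given here, assuming the same "filename label label ..." line format as the labels file. The function name `pair_counts` and the sample lines are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(label_lines):
    """Count how many lines contain each unordered pair of labels."""
    counts = Counter()
    for line in label_lines:
        # First token is the filename; the rest are labels
        labs = sorted(set(line.split()[1:]))
        for pair in combinations(labs, 2):
            counts[pair] += 1
    return counts


# Hypothetical lines in the "filename label label ..." format
sample = ["1.jpg sky clouds\n", "2.jpg sky\n",
          "3.jpg clouds sky water\n", "4.jpg person animal\n"]
pairs = pair_counts(sample)
```

Calling `pairs.most_common(5)` then lists the five most frequent pairs, which is one way to pick a first grouping.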

For the multitask model, a simple model is made first with two categories, each a binary problem, trained on the same net. The data is processed so that only two labels are included. Person and animal, hopefully two similar problems, were the first choice; the code below is shown with sky and clouds, for the reason given after the counts. The dataset then contains three different classes: just the first label, just the second label, or both.

In [16]:
import numpy as np

# The labels currently in use
labelList = ['sky','clouds']
values = dict()
for index, lab in enumerate(labelList):
    values[lab] = index
newlabels = open("./Multilabel_two_classes/labels.txt","wb")
labels = open("labels_none_missing.txt", "r") # Open the labels file

In [17]:
class_freqs = np.zeros(3)

for line in labels:
    words = line.split()
    # Keep images that have at least one label, all of them in labelList
    present = words[1:]
    if present and all(lab in labelList for lab in present):
        # Find which labels are present and convert to [0 1] format
        label_txt = [int(lab in present) for lab in labelList]
        string_to_write = words[0] + ' ' + str(label_txt) + '\n'
        newlabels.write(string_to_write)
        # Index 0: sky only, 1: clouds only, 2: both
        class_freqs[(label_txt[0] + 2*label_txt[1])-1] += 1
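
The index expression `label_txt[0] + 2*label_txt[1] - 1` packs the two binary labels into one slot of `class_freqs`. The sketch below spells out the mapping, assuming the label order `['sky', 'clouds']` used above. The function name `combo_index` is hypothetical. An all-zero vector would give -1, which NumPy treats as the last element, so the sketch rejects it explicitly.

```python
def combo_index(label_vec):
    """Map a [sky, clouds] 0/1 vector to a slot in class_freqs."""
    sky, clouds = label_vec
    if not (sky or clouds):
        raise ValueError("at least one label must be present")
    # [1, 0] -> 0 (sky only), [0, 1] -> 1 (clouds only), [1, 1] -> 2 (both)
    return sky + 2 * clouds - 1


slots = [combo_index(v) for v in ([1, 0], [0, 1], [1, 1])]
```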

In [18]:
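line =  labels.readline()  # labels is exhausted, so this is an empty string
print line
print class_freqs
# Flush labels.txt to disk before it is read back in the next cell
newlabels.close()
labels.close()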


[  9874.   3353.  15636.]


So we only have 740 instances of people and animals occurring together. With relatively few training cases, the simpler ImageNet architecture from the base Caffe library is used. Sky and clouds are a better pair: they are difficult to classify separately and to recognise together! The counts printed above are for this pair: 9874 images with sky only, 3353 with clouds only, and 15636 with both. Now new label lists are made for the training and validation steps. These are two_class_train and two_class_test.

In [19]:
pTrain = 0.8 # Therefore pTest is 0.2

nItems = int(sum(class_freqs))
import random as r
positions = range(0,nItems)
r.shuffle(positions)
train_idx = positions[:int(nItems*pTrain)]
# Start the test slice where the train slice ends so that no line is skipped
test_idx = positions[int(nItems*pTrain):]

# Loop through and write to the new files.
labels = open("./Multilabel_two_classes/labels.txt","r") # Open the labels file
test_labels = open("./Multilabel_two_classes/two_class_test.txt","wb")
train_labels = open("./Multilabel_two_classes/two_class_train.txt","wb")
test_set = set(test_idx)  # O(1) membership tests
for i, line in enumerate(labels):
    # Every line that is not in the test set goes to the training set
    if i in test_set:
        test_labels.write(line)
    else:
        train_labels.write(line)
labels.close()
test_labels.close()
train_labels.close()

Now also add labels for the binary classifiers. Here two separate classifiers will be trained, and their results will then have to be combined to give the multilabel version. Each classifier therefore gets its own label file, for example [cloud present, cloud not present], with 1 meaning present and 0 meaning not present.
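
How the two sets of results are combined is not shown in this notebook. One possible sketch merges the per-image 0/1 values from the two classifiers into [sky, clouds] vectors, assuming both outputs list the same images in the same order in "filename value" format. The function name `combine_binary` and the sample data are hypothetical.

```python
def combine_binary(sky_lines, cloud_lines):
    """Merge per-image 0/1 labels from two files into [sky, clouds] vectors."""
    combined = {}
    for sky_line, cloud_line in zip(sky_lines, cloud_lines):
        sky_name, sky_val = sky_line.split()
        cloud_name, cloud_val = cloud_line.split()
        # Both files are expected to list the same images in the same order
        if sky_name != cloud_name:
            raise ValueError("files are not aligned: %s vs %s"
                             % (sky_name, cloud_name))
        combined[sky_name] = [int(sky_val), int(cloud_val)]
    return combined


# Hypothetical per-classifier outputs in "filename 0-or-1" format
sky_out = ["1.jpg 1\n", "2.jpg 0\n", "3.jpg 1\n"]
cloud_out = ["1.jpg 1\n", "2.jpg 1\n", "3.jpg 0\n"]
multilabel = combine_binary(sky_out, cloud_out)
```

The alignment check matters because a single shared index list is used to split both label files, which only works if line i refers to the same image in each.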

In [20]:
labelList = ['sky','clouds']
skylabels = open("./Multilabel_binary/sky_labels.txt","wb")
cloudlabels = open("./Multilabel_binary/cloud_labels.txt","wb")
labels = open("labels_none_missing.txt", "r") # Open the labels file

In [21]:
for line in labels:
    words = line.split()
    present = words[1:]
    # Keep images that have at least one label, all of them in labelList
    if present and all(lab in labelList for lab in present):
        # 1 means the label is present, 0 means it is not
        skyLabel = int('sky' in present)
        cloudLabel = int('clouds' in present)
        skylabels.write(words[0] + ' ' + str(skyLabel) + '\n')
        cloudlabels.write(words[0] + ' ' + str(cloudLabel) + '\n')
skylabels.close()
cloudlabels.close()
labels.close()

In [22]:
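# Write the test and train sets for these two, reusing test_idx from the
# two-class split so the same images are held out for both classifiers.
# splittxtfile is defined in the cell In [1] shown below.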
skylabel = "./Multilabel_binary/sky_labels.txt"
skylabel_train = "./Multilabel_binary/sky_labels_train.txt"
skylabel_test = "./Multilabel_binary/sky_labels_test.txt"

cloudlabel = "./Multilabel_binary/cloud_labels.txt"
cloudlabel_train = "./Multilabel_binary/cloud_labels_train.txt"
cloudlabel_test = "./Multilabel_binary/cloud_labels_test.txt"

splittxtfile(skylabel, skylabel_train, skylabel_test, test_idx)
splittxtfile(cloudlabel, cloudlabel_train, cloudlabel_test, test_idx)

In [1]:
def splittxtfile(labelFile, trainFile, testFile, test_idx):
    '''
    labelFile - file containing the labels
    trainFile - file to write train labels to
    testFile - file to write test labels to
    test_idx - indices of the lines belonging to the test file;
               every other line is written to the train file
    '''
    test_set = set(test_idx)  # O(1) membership tests
    labels = open(labelFile,"r") # Open the labels file
    test_labels = open(testFile,"wb")
    train_labels = open(trainFile,"wb")
    for i, line in enumerate(labels):
        if i in test_set:
            test_labels.write(line)
        else:
            train_labels.write(line)
    test_labels.close()
    train_labels.close()
    labels.close()