# Picking protagonists

To have a chance of doing well on the task at hand, we need a reasonably good starting point. That means the cat and dog images we pick for our training set should be as representative of the larger dataset as possible.

But how can this be measured? We will come up with an answer, but it will certainly not be a complete one, and likely not a very good one either. We will train a deep autoencoder and measure the reproduction error of each image.

The reasoning is that a picture of a cat that is in some way representative of the remaining cat pictures in the dataset should have a low reproduction error. After all, we reproduce the image using the latent factors shared by all the images. Unfortunately, the reproduction can also succeed because that particular image was 'easy' to reproduce, and that easiness could come from the image not containing any useful information at all.
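To make the idea concrete, here is a minimal sketch of ranking samples by reconstruction error. It is not the deep autoencoder itself: it swaps in PCA, which behaves like a linear autoencoder, and runs on synthetic data standing in for flattened images. The function name `reconstruction_errors` and all the toy data are assumptions made for illustration.

```python
import numpy as np

def reconstruction_errors(X, n_components):
    """Per-row squared reconstruction error under a rank-k linear 'autoencoder' (PCA)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, sorted by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    codes = Xc @ components.T          # encode: project onto the shared latent factors
    X_hat = codes @ components + mean  # decode: map back to the input space
    return ((X - X_hat) ** 2).sum(axis=1)

rng = np.random.RandomState(0)
# Toy stand-in for flattened images: 200 samples, 50 features, 5 shared latent factors.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))
# One sample that does not follow the shared structure.
X[0] = rng.normal(scale=5.0, size=50)

errors = reconstruction_errors(X, n_components=5)
ranking = np.argsort(errors)  # lowest error (most "representative" candidates) first
print(ranking[:5], ranking[-1])
```

A sample that sits very close to the dataset mean also gets a low error here, which mirrors the caveat above about images that are merely 'easy' to reproduce.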

I am still relatively optimistic about the heuristic we are going to employ: the starting point we come up with should be significantly better than selecting an image at random.

In [2]:
import os, shutil, random, glob
import bcolz
import keras
import keras.preprocessing.image
from keras.layers import Input, Flatten, Dense, Dropout, Activation, BatchNormalization, GlobalMaxPooling2D
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import Adam
from keras.models import Model
from keras.applications.vgg19 import VGG19
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import scipy


Using TensorFlow backend.


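The code below assumes that the training data from the https://www.kaggle.com/c/dogs-vs-cats competition has been downloaded and unzipped into the `train` directory under the root of the repository, with the images sorted into `cats` and `dogs` subdirectories. `flow_from_directory` treats each subdirectory as one class.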
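Because `flow_from_directory` infers classes from subdirectories, it reports zero images and zero classes when the images sit directly in `train`. A quick way to check the layout before building the generator is sketched below; `count_images_per_class` is a hypothetical helper written for this note, not part of Keras.

```python
import os

def count_images_per_class(root):
    """Map each class subdirectory of `root` to the number of .jpg files in it."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for f in os.listdir(class_dir) if f.lower().endswith('.jpg')
            )
    return counts

# Only inspect the directory if it exists, so the cell is safe to run anywhere.
if os.path.isdir('train'):
    print(count_images_per_class('train'))
```

An empty dictionary means there are no class subdirectories yet, and the images still need to be moved into `train/cats` and `train/dogs`.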

In [4]:
gen = ImageDataGenerator()
train_data = gen.flow_from_directory('train', target_size=(224, 224), batch_size=100, shuffle=False)

Found 0 images belonging to 0 classes.


In [4]:
train_filenames = train_data.filenames
bcolz.carray(train_filenames, rootdir='train_filenames', mode='w').flush()
train_y = keras.utils.to_categorical(train_data.classes)
bcolz.carray(train_y, rootdir='train_y', mode='w').flush()

In [5]:
base_model = VGG19(
    include_top=False,
    weights='imagenet',
    input_shape=(224, 224, 3),
    pooling=None
)

In [6]:
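# predict_generator draws `steps` batches (not samples), so one pass needs n / batch_size steps.
steps = int(np.ceil(train_data.n / float(train_data.batch_size)))
train_X = base_model.predict_generator(train_data, steps=steps)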
bcolz.carray(train_X, rootdir='train_X', mode='w').flush()
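`predict_generator` pulls `steps` batches from the generator, so one full pass over the data takes `ceil(n / batch_size)` steps. A small sketch of that arithmetic, using the 25,000 images and the batch size of 100 from this notebook (`steps_for_one_pass` is a hypothetical helper, not a Keras function):

```python
import math

def steps_for_one_pass(n_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    return int(math.ceil(n_samples / float(batch_size)))

print(steps_for_one_pass(25000, 100))  # 250
print(steps_for_one_pass(25001, 100))  # 251, the last batch holds a single image
```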

In [7]:
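# Pick 6 distinct training images; everything else is held out for validation.
trn_ids = np.random.choice(25000, size=6, replace=False)
val_ids = np.delete(np.arange(25000), trn_ids)

trn_X = train_X[trn_ids, ...]
trn_y = train_y[trn_ids]

# Quick validation on a random subset of 500 held-out images.
# Sampling from val_ids guarantees that no training image leaks into it.
random_subset = np.random.choice(val_ids, size=500, replace=False)
val_X = train_X[random_subset, ...]
val_y = train_y[random_subset]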

In [8]:
inputs = Input(shape=(7, 7, 512))
# x = keras.layers.MaxPooling2D(pool_size=(2,2), strides=(2,2))(inputs)
# x = Flatten()(x)
# x = Dense(4096)(x)

x = GlobalMaxPooling2D()(inputs)
x = Dense(4096)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(2)(x)
x = BatchNormalization()(x)
predictions = Activation('softmax')(x)

model = Model(inputs, predictions)

In [9]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 7, 7, 512)         0         
_________________________________________________________________
global_max_pooling2d_1 (Glob (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              2101248   
_________________________________________________________________
batch_normalization_1 (Batch (None, 4096)              16384     
_________________________________________________________________
activation_1 (Activation)    (None, 4096)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 8194      
_________________________________________________________________
batch_normalization_2 (Batch (None, 2)                 8         
__________
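The parameter counts in the summary can be checked by hand. A dense layer has one weight per input-output pair plus one bias per output, and a Keras `BatchNormalization` layer keeps four values per feature (gamma, beta, moving mean and moving variance). The two helper functions below are hypothetical names introduced only for this check.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """gamma, beta, moving mean and moving variance: four values per feature."""
    return 4 * n_features

print(dense_params(512, 4096))   # dense_1: 2101248
print(batchnorm_params(4096))    # batch_normalization_1: 16384
print(dense_params(4096, 2))     # dense_2: 8194
print(batchnorm_params(2))       # batch_normalization_2: 8
```

Nearly all of the head's parameters sit in `dense_1`, which is a lot of capacity to fit with only six training images.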

In [10]:
model.compile(Adam(lr=1e-4), 'categorical_crossentropy', metrics=['accuracy'])

In [11]:
model.fit(x=trn_X, y=trn_y, batch_size=6, epochs=40, validation_data=(val_X, val_y), verbose=2)

Train on 6 samples, validate on 500 samples
Epoch 1/40
13s - loss: 1.3000 - acc: 0.1667 - val_loss: 3.6826 - val_acc: 0.5820
Epoch 2/40
0s - loss: 0.3923 - acc: 1.0000 - val_loss: 2.5259 - val_acc: 0.6480
Epoch 3/40
0s - loss: 0.2874 - acc: 1.0000 - val_loss: 1.9335 - val_acc: 0.7020
Epoch 4/40
0s - loss: 0.2282 - acc: 1.0000 - val_loss: 1.5719 - val_acc: 0.7280
Epoch 5/40
0s - loss: 0.1965 - acc: 1.0000 - val_loss: 1.3259 - val_acc: 0.7400
Epoch 6/40
0s - loss: 0.1788 - acc: 1.0000 - val_loss: 1.1501 - val_acc: 0.7540
Epoch 7/40
0s - loss: 0.1686 - acc: 1.0000 - val_loss: 1.0168 - val_acc: 0.7640
Epoch 8/40
0s - loss: 0.1625 - acc: 1.0000 - val_loss: 0.9130 - val_acc: 0.7720
Epoch 9/40
0s - loss: 0.1587 - acc: 1.0000 - val_loss: 0.8284 - val_acc: 0.7800
Epoch 10/40
0s - loss: 0.1562 - acc: 1.0000 - val_loss: 0.7584 - val_acc: 0.7860
Epoch 11/40
0s - loss: 0.1544 - acc: 1.0000 - val_loss: 0.7007 - val_acc: 0.7920
Epoch 12/40
0s - loss: 0.1531 - acc: 1.0000 - val_loss: 0.6527 - val_acc:

<keras.callbacks.History at 0x7fdb1020b208>

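Let's validate on all the remaining images in the training set, that is, everything except the six images we trained on. Note that the call below uses `fit` with a single epoch, so the model takes one more training step before the validation metrics are reported.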

In [12]:
val_X = train_X[val_ids, ...]
val_y = train_y[val_ids]

In [13]:
model.fit(x=trn_X, y=trn_y, batch_size=6, epochs=1, validation_data=(val_X, val_y), verbose=2)

Train on 6 samples, validate on 24994 samples
Epoch 1/1
10s - loss: 0.1340 - acc: 1.0000 - val_loss: 0.4262 - val_acc: 0.8204


<keras.callbacks.History at 0x7fdaf369de80>

In [14]:
[train_filenames[idx] for idx in trn_ids]

['dogs/dog.9455.jpg',
 'cats/cat.4549.jpg',
 'cats/cat.10649.jpg',
 'dogs/dog.1881.jpg',
 'dogs/dog.4863.jpg',
 'cats/cat.9190.jpg']