<h1>George W Bush CNN Small</h1>

<strong>Abstract</strong> We randomly discard 95% of the non-George W Bush images so that there are roughly the same number of George W Bush and non-George W Bush images. The CNN nevertheless classifies every test image as non-George W Bush, probably because there are slightly more non-George W Bush images. This suggests that the network is not learning to identify faces properly and that more layers may be needed.

<strong>Purpose</strong> Build a CNN trained to identify George W Bush. He was chosen because of his large number of images: we would like to see how a CNN performs when it has many images of a single person. George W Bush has 529 images, and the roughly 5,748 other people have a total of 12,643 images.
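
To make the class balance concrete, the short sketch below recomputes it from the counts quoted above and the 95% discard rate mentioned in the abstract. The variable names are illustrative and do not appear in the notebook.

```python
# Image counts quoted in the Purpose paragraph.
bush_images = 529
other_images = 12643

# Fraction of non-Bush images that is thrown away (the 95% from the abstract).
discard_fraction = 0.95

# Non-Bush images that remain, truncated to a whole number of images.
other_kept = int((1 - discard_fraction) * other_images)
total = bush_images + other_kept

print("non-Bush images kept: %d" % other_kept)
print("share of Bush images: %.3f" % (bush_images / float(total)))
print("share of non-Bush images: %.3f" % (other_kept / float(total)))
```

The two classes end up close in size, with the non-Bush class slightly larger.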





In [1]:
%load_ext autoreload

In [2]:
%autoreload 2
%matplotlib inline

import os
import fnmatch
import matplotlib.pyplot as plt
import numpy as np
from skimage import io
from skimage.transform import resize
from sklearn.metrics import confusion_matrix

from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils

plt.rcParams['figure.figsize'] = (12.0, 10.0)
np.random.seed(123456)

Using Theano backend.


In [10]:
data_path = '../data/'
data_lfw_path = data_path + 'lfw_cropped/'

batch_size = 128
nb_epoch = 12
img_rows, img_cols = 100, 100
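test_size_percent = .8        # despite its name, the fraction of samples used for training
validation_split = .2         # fraction of the training samples held out for validation
random_discard_percent = .95  # fraction of non-target images that is discarded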

<h2>Preparing Data</h2>

In [11]:
def get_filenames_separated_from_target(target):
    """Return the .jpg paths of `target` and of everyone else, as two lists."""
    files = []
    target_files = []
    
    # Every sub-directory of data_lfw_path holds the images of one person.
    for dirname in os.listdir(data_lfw_path):
        person_dir = os.path.join(data_lfw_path, dirname)
        if not os.path.isdir(person_dir):
            continue
        for filename in os.listdir(person_dir):
            if filename.endswith(".jpg"):
                f = os.path.join(person_dir, filename)
                if dirname == target:
                    target_files.append(f)
                else:
                    files.append(f)
    return target_files, files

In [12]:
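def get_train_and_test_sets(target_data, data):
    # Keep only a random (1 - random_discard_percent) share of the non-target images.
    data_to_keep = int((1 - random_discard_percent) * len(data))
    np.random.shuffle(data)
    
    # Label target images 1 and all other images 0, then mix the two classes.
    all_data = [(t, 1) for t in target_data] + [(t, 0) for t in data[:data_to_keep]]
    np.random.shuffle(all_data)
    
    # test_size_percent is the share used for training; the rest is the test set.
    train_size = int(test_size_percent * len(all_data))
    X_train = np.array([x[0] for x in all_data[:train_size]])
    y_train = np.array([x[1] for x in all_data[:train_size]])
    X_test = np.array([x[0] for x in all_data[train_size:]])
    y_test = np.array([x[1] for x in all_data[train_size:]])
    
    # Caution: image_read returns (rows, cols, channels) arrays. reshape only
    # reinterprets that buffer as (channels, rows, cols); it does not move the
    # channel axis the way transpose(0, 3, 1, 2) would.
    X_train = X_train.reshape(X_train.shape[0], 3, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 3, img_rows, img_cols)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    # Caution: skimage's resize already returns floats in [0, 1], so this
    # division leaves the inputs in [0, 1/255].
    X_train /= 255
    X_test /= 255

    return (X_train, y_train), (X_test, y_test)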
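
The function above uses `reshape` to turn a stack of (rows, cols, channels) images into the (channels, rows, cols) layout that the network expects. As a minimal sketch on a made-up 2 x 2 "image" (not data from this notebook), the following compares `reshape` with `transpose`, which is the NumPy operation that actually moves an axis:

```python
import numpy as np

# A tiny fake colour image in (rows, cols, channels) layout: 2 x 2 pixels, 3 channels.
img = np.arange(2 * 2 * 3).reshape(2, 2, 3)
batch = img[np.newaxis]  # a batch of one image, shape (1, 2, 2, 3)

# reshape keeps the flat memory order and only changes how it is indexed.
reshaped = batch.reshape(1, 3, 2, 2)

# transpose moves the channel axis in front of the two spatial axes.
transposed = batch.transpose(0, 3, 1, 2)

first_channel = img[:, :, 0]
print("first channel of the original image:")
print(first_channel)
print("first channel after reshape:")
print(reshaped[0, 0])
print("first channel after transpose:")
print(transposed[0, 0])
```

Both results have the same shape, but only the transposed array keeps each colour channel intact; the reshaped array mixes values from different channels and pixels into each plane.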
    

In [13]:
def image_read(f):
    return resize(io.imread(f), (img_rows, img_cols))

In [14]:
target_files, files = get_filenames_separated_from_target('George_W_Bush')

In [15]:
images = [image_read(f) for f in files]
target_images = [image_read(f) for f in target_files]

In [16]:
(X_train, y_train), (X_test, y_test) = get_train_and_test_sets(target_images, images)
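
The sizes of the resulting sets can be worked out by hand from the settings in the configuration cell. The sketch below does so with illustrative variable names; it assumes that Keras truncates the training share to a whole number of samples when it applies `validation_split`, which is consistent with the sample counts in the training log further down.

```python
# Settings copied from the configuration cell.
test_size_percent = 0.8
validation_split = 0.2

# 529 Bush images plus the 5% of the 12,643 other images that are kept.
n_total = 529 + int((1 - 0.95) * 12643)

# Samples handed to fit() and to evaluate().
n_fit = int(test_size_percent * n_total)
n_test = n_total - n_fit

# Assumption: the training part is truncated and the remainder is used for validation.
n_train = int(n_fit * (1 - validation_split))
n_validate = n_fit - n_train

print("passed to fit(): %d (train %d, validate %d)" % (n_fit, n_train, n_validate))
print("held out for testing: %d" % n_test)
```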

<h2>Training and Testing the CNN</h2>

The model is based on the VGG-like convnet from the Keras examples: http://keras.io/examples/

In [19]:
VGG = Sequential()

VGG.add(Convolution2D(32, 3, 3, input_shape=(3, img_rows, img_cols)))
VGG.add(Activation('relu'))
VGG.add(Convolution2D(32, 3, 3))
VGG.add(Activation('relu'))
VGG.add(MaxPooling2D(pool_size=(2, 2)))
VGG.add(Dropout(0.25))

VGG.add(Convolution2D(64, 3, 3))
VGG.add(Activation('relu'))
VGG.add(Convolution2D(64, 3, 3))
VGG.add(Activation('relu'))
VGG.add(MaxPooling2D(pool_size=(2, 2)))
VGG.add(Dropout(0.25))

VGG.add(Flatten())

VGG.add(Dense(256))
VGG.add(Activation('relu'))
VGG.add(Dropout(0.5))

VGG.add(Dense(1))
VGG.add(Activation('sigmoid'))

VGG.compile(loss='binary_crossentropy',
            optimizer='rmsprop',
            class_mode='binary')

VGG.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch, 
        show_accuracy=True, verbose=1, validation_split=validation_split)

Train on 742 samples, validate on 186 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x11f74b0d0>
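
To get a feeling for the size of this network, the sketch below traces the width of the feature maps through the two convolutional blocks. It assumes that every 3 x 3 convolution uses stride 1 and no padding and that pooling does not overlap; the helper functions are illustrative and not part of Keras.

```python
def conv_valid(size, kernel=3):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool(size, pool_size=2):
    """Output width of non-overlapping max pooling."""
    return size // pool_size

size = 100                                 # img_rows == img_cols == 100
size = pool(conv_valid(conv_valid(size)))  # first block: 100 -> 98 -> 96 -> 48
size = pool(conv_valid(conv_valid(size)))  # second block: 48 -> 46 -> 44 -> 22

flattened = 64 * size * size               # 64 feature maps leave the second block
dense_params = flattened * 256 + 256       # weights plus biases of Dense(256)

print("feature map width after both blocks: %d" % size)
print("length of the flattened vector: %d" % flattened)
print("parameters in Dense(256): %d" % dense_params)
```

Under these assumptions the first dense layer alone holds several million parameters, while only a few hundred images are available for training.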

In [20]:
score = VGG.evaluate(X_test, y_test, show_accuracy=True, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

('Test score:', 0.68989973134748928)
('Test accuracy:', 0.54077253218884125)
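
One way to put the test score into perspective is to compare it with a model that ignores its input and returns the same probability for every image. The sketch below computes the binary cross-entropy of such a constant predictor, using the test-set class counts from the confusion matrix at the end of the notebook. It is a reference point under that assumption, not a statement about what the trained network actually outputs.

```python
import math

# Test-set class counts, taken from the confusion matrix at the end of the notebook.
n_negative = 126  # non-Bush images
n_positive = 107  # Bush images
n = n_negative + n_positive

# A constant predictor that always returns the share of positive samples.
p = n_positive / float(n)

# Mean binary cross-entropy when every sample receives the same prediction p.
constant_loss = -(n_positive * math.log(p) + n_negative * math.log(1 - p)) / n

print("constant prediction: %.4f" % p)
print("binary cross-entropy of the constant predictor: %.4f" % constant_loss)
```

The reported test score lies very close to this value, which is what one would expect if the network's output barely depends on the image.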


In [21]:
y_pred = VGG.predict_classes(X_test)



In [22]:
confusion_matrix(y_test, y_pred)

array([[126,   0],
       [107,   0]])
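
The confusion matrix can be summarised with the usual metrics. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions.
matrix = [[126, 0],
          [107, 0]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / float(total)

# Recall: share of the Bush images that were recognised as such.
recall = tp / float(tp + fn)

# Precision is undefined when no image is predicted to be Bush.
precision = tp / float(tp + fp) if (tp + fp) else None

print("accuracy: %.4f" % accuracy)
print("recall for the Bush class: %.4f" % recall)
print("precision for the Bush class: %s" % precision)
```

The accuracy equals the share of non-Bush images in the test set, and the recall for the Bush class is zero: the model never predicts the positive class.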