In this tutorial, we will present a few simple yet effective methods that you can use to build a powerful image classifier, using only very few training examples - just a few hundred pictures from each class you want to be able to recognize.

We will go over the following options:
1. training a small network from scratch (as a baseline)
2. using the bottleneck features of a pre-trained network
3. fine-tuning the top layers of a pre-trained network

This will lead us to cover the following Keras features:
1. fit_generator for training a Keras model using Python data generators
2. ImageDataGenerator for real-time data augmentation
3. layer freezing and model fine-tuning
4. ...and more.

## Our setup: only 2000 training examples (1000 per class)

In our examples we will use two sets of pictures, which we get from Kaggle:
1000 cats and 1000 dogs (although the original dataset had 12,500 cats and 12,500 dogs, we just took the first 1000 images for each class).

That is very few examples to learn from, for a classification problem that is far from simple.
So this is a challenging machine learning problem, but it is also a realistic one: in a lot of real-world use cases, even small-scale data collection can be extremely expensive or sometimes near-impossible (e.g. in medical imaging). Being able to make the most out of very little data is a key skill of a competent data scientist.

In the original Kaggle competition, top entrants were able to score over 98% accuracy by using modern deep learning techniques. In our case, because we restrict ourselves to only 8% of the dataset, the problem is much harder.

## On the relevance of deep learning for small-data problems

A message that I hear often is that "deep learning is only relevant when you have a huge amount of data."
While not entirely incorrect, this is somewhat misleading.
Certainly, deep learning requires the ability to learn features automatically from the data, which is generally only possible when lots of training data is available -- especially for problems where the input samples are very high-dimensional, like images.  
However, convolutional neural networks -- a pillar algorithm of deep learning -- are by design one of the best models available for most "perceptual" problems (such as image classification), even with very little data to learn from.  
Training a convnet from scratch on a small image dataset will still yield reasonable results, without the need for any custom feature engineering. Convnets are just plain good. They are the right tool for the job.

But what's more, deep learning models are by nature highly repurposable:  
you can take, say, an image classification or speech-to-text model trained on a large-scale dataset, then reuse it on a significantly different problem with only minor changes, as we will see in this post. Specifically in the case of computer vision, many pre-trained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models out of very little data.

## Data pre-processing and data augmentation

In order to make the most of our few training examples, we will "augment" them via a number of random transformations, so that our model never sees the exact same picture twice. This helps prevent overfitting and helps the model generalize better.

In Keras this can be done via the keras.preprocessing.image.ImageDataGenerator class. This class allows you to:
1. configure random transformations and normalization operations to be done on your image data during training
2. instantiate generators of augmented image batches (and their labels) via .flow(data, labels) or .flow_from_directory(directory). These generators can then be used with the Keras model methods that accept data generators as inputs, fit_generator, evaluate_generator and predict_generator.

In [1]:
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

Using TensorFlow backend.


In [2]:
datagen = ImageDataGenerator(
            rotation_range=40,
            width_shift_range=0.2,
            height_shift_range=0.2,
            shear_range=0.2,
            zoom_range=0.2,
            horizontal_flip=True,
            fill_mode="nearest"
            )

In [3]:
img = load_img("./Dogs_vs_Cats/train/cat.0.jpg") # This is a PIL image

In [4]:
x = img_to_array(img) # This is a NumPy array with shape (height, width, 3), here (374, 500, 3)
print(type(x), x.shape)

<class 'numpy.ndarray'> (374, 500, 3)


In [5]:
x = x.reshape((1,) + x.shape) # Add a batch dimension: the shape becomes (1, 374, 500, 3)

In [6]:
# the .flow() command below generates batches of randomly transformed images
# and saves the results to the "./Dogs_vs_Cats/preview" directory
i = 0
for batch in datagen.flow(x, batch_size=1, save_to_dir="./Dogs_vs_Cats/preview", save_prefix="cat", save_format="jpeg"):
    i = i + 1
    if i > 20:
        break # otherwise the generator would loop indefinitely
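
To make two of the options above more concrete, here is a minimal pure-Python sketch of a horizontal flip and of a horizontal shift with `fill_mode="nearest"`, applied to a single row of pixel values. It is only an illustration under simplifying assumptions: the helper names `flip_row` and `shift_row_nearest` are hypothetical, shifts are whole pixels, and Keras itself applies its transformations to full images rather than to individual rows.

```python
def flip_row(row):
    """Mirror a row of pixel values left to right."""
    return row[::-1]


def shift_row_nearest(row, shift):
    """Shift a row by `shift` pixels (positive = to the right).

    Positions that would fall outside the row are filled with the
    nearest edge value, which is the idea behind fill_mode="nearest".
    """
    last = len(row) - 1
    return [row[min(max(i - shift, 0), last)] for i in range(len(row))]


row = [10, 20, 30, 40, 50]
print(flip_row(row))               # [50, 40, 30, 20, 10]
print(shift_row_nearest(row, 2))   # [10, 10, 10, 20, 30]
print(shift_row_nearest(row, -1))  # [20, 30, 40, 50, 50]
```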

## Training a small convnet from scratch: 80% accuracy in 40 lines of code

The right tool for an image classification job is a convnet, so let us try to train one on our data, as an initial baseline. Since we only have few examples, our number one concern should be overfitting. Overfitting happens when a model exposed to too few examples learns patterns that do not generalize to new data, i.e. when the model starts using irrelevant features for making predictions. 

Data augmentation is one way to fight overfitting, but it is not enough since our augmented samples are still highly correlated. Your main focus for fighting overfitting should be the entropic capacity of your model -- how much information your model is allowed to store. A model that can store a lot of information has the potential to be more accurate by leveraging more features, but it is also more at risk of storing irrelevant features. Meanwhile, a model that can only store a few features will have to focus on the most significant features found in the data, and these are more likely to be truly relevant and to generalize better.

There are different ways to modulate entropic capacity. The main one is the choice of the number of parameters in your model, i.e. the number of layers and the size of each layer. Another way is the use of weight regularization, such as L1 or L2 regularization, which consists in forcing model weights to take smaller values.
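
As a rough illustration of weight regularization, the sketch below adds an L1 or L2 penalty to a loss value. The function names, the weight values and the data loss are all made up for the example; in Keras the same idea is attached to a layer through a regularizer argument rather than computed by hand.

```python
def l1_penalty(weights, strength):
    """Sum of absolute weight values, scaled by the regularization strength."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Sum of squared weight values, scaled by the regularization strength."""
    return strength * sum(w * w for w in weights)


weights = [0.5, -1.5, 2.0]   # hypothetical layer weights
data_loss = 0.35             # hypothetical loss on the training batch

# The optimizer minimizes the sum, so large weights are discouraged.
total_l2 = data_loss + l2_penalty(weights, 0.01)
total_l1 = data_loss + l1_penalty(weights, 0.01)
print(round(total_l2, 3))  # 0.415
print(round(total_l1, 3))  # 0.39
```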

In our case we will use a very small convnet with few layers and few filters per layer, alongside data augmentation and dropout. Dropout also helps reduce overfitting, by preventing a layer from seeing the exact same pattern twice, thus acting in a way analogous to data augmentation (you could say that both dropout and data augmentation tend to disrupt random correlations occurring in your data).
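
The following is a minimal sketch of what dropout does to a vector of activations at training time. It assumes the "inverted" variant, where surviving activations are rescaled by 1 / (1 - rate) so that their expected value is unchanged; the `dropout` helper and the input values are hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale
            for a in activations]


rng = random.Random(0)  # seeded only to make the example repeatable
activations = [1.0, 2.0, 3.0, 4.0]

# Two passes over the same input generally produce different masks,
# so the next layer rarely sees the exact same pattern twice.
first = dropout(activations, 0.5, rng)
second = dropout(activations, 0.5, rng)
print(first)
print(second)
```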

This script goes along with the blog post "Building powerful image classification models using very little data" from blog.keras.io.
It uses data that can be downloaded at:
https://www.kaggle.com/c/dogs-vs-cats/data

In our setup, we:
- created a data/ folder
- created train/ and validation/ subfolders inside data/
- created cats/ and dogs/ subfolders inside train/ and validation/
- put the cat pictures index 0-999 in data/train/cats
- put the cat pictures index 1000-1400 in data/validation/cats
- put the dog pictures index 12500-13499 in data/train/dogs
- put the dog pictures index 13500-13900 in data/validation/dogs

So that we have 1000 training examples for each class, and about 400 validation examples for each class (the inclusive index ranges above give 401 per class, which matches the 802 validation images reported by the generators further down).

In summary, this is our directory structure:

```
data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
```
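
A layout like this can be built with a few lines of standard-library Python. The sketch below is one possible way to do it, not the procedure used for this notebook: it assumes the raw Kaggle files are named like `cat.0.jpg` and `dog.0.jpg` with indices restarting at 0 for each class, and the helper names `copy_range` and `build_layout` are hypothetical.

```python
import os
import shutil


def copy_range(src_dir, dst_dir, prefix, start, stop):
    """Copy <prefix>.<i>.jpg for i in [start, stop) into dst_dir.

    Returns the number of files copied; missing files are skipped.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    for i in range(start, stop):
        name = "{}.{}.jpg".format(prefix, i)
        src = os.path.join(src_dir, name)
        if os.path.exists(src):
            shutil.copyfile(src, os.path.join(dst_dir, name))
            copied += 1
    return copied


def build_layout(src_dir, data_dir, n_train=1000, n_val=400):
    """Create data/{train,validation}/{cats,dogs} from a flat folder."""
    counts = {}
    for prefix, folder in (("cat", "cats"), ("dog", "dogs")):
        train_dir = os.path.join(data_dir, "train", folder)
        val_dir = os.path.join(data_dir, "validation", folder)
        counts[("train", folder)] = copy_range(
            src_dir, train_dir, prefix, 0, n_train)
        counts[("validation", folder)] = copy_range(
            src_dir, val_dir, prefix, n_train, n_train + n_val)
    return counts
```

Calling `build_layout("path/to/raw/train", "data")` would then populate the four class folders, where the source path is a placeholder for wherever the Kaggle archive was extracted.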

In [7]:
# the code below is our first model: a simple stack of 3 convolution layers, each with a ReLU activation
# and followed by a max-pooling layer. This is very similar to the architectures that Yann LeCun advocated
# in the 1990s for image classification.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K

In [8]:
# dimensions of your images
img_width, img_height = 150, 150

In [9]:
train_data_dir = "./data/train/"
validation_data_dir = "./data/validation/"
nb_train_samples = 2000
nb_validation_samples = 800

In [10]:
epochs = 50
batch_size = 16

In [11]:
if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
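
To relate this architecture to the earlier discussion of capacity, the number of trainable parameters can be worked out by hand. The sketch below assumes the Keras defaults used above (3x3 "valid" convolutions with stride 1, and 2x2 max-pooling that rounds odd sizes down) and a 150x150 RGB input; the helper names are ours and are not part of Keras.

```python
def conv_output(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Spatial size after max-pooling with stride equal to the pool size."""
    return size // pool


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


size, channels, total = 150, 3, 0
for filters in (32, 32, 64):          # the three Conv2D layers
    total += conv_params(3, channels, filters)
    size = pool_output(conv_output(size))
    channels = filters

flat = size * size * channels          # length of the Flatten output
total += dense_params(flat, 64) + dense_params(64, 1)
print(size, flat, total)  # 17 18496 1212513
```

Almost all of the roughly 1.2 million parameters sit in the first dense layer, which is why the width of that layer matters so much for the capacity of this model.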

Let us prepare our data. We will use .flow_from_directory() to generate batches of image data (and their labels)
directly from our jpgs in their respective folders.

In [12]:
# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
                    rescale = 1./255,
                    shear_range=0.2,
                    zoom_range=0.2,
                    horizontal_flip=True
                )


# This is the augmentation configuration we will use for testing
# only rescaling
test_datagen = ImageDataGenerator(rescale=1. / 255)

In [13]:
# this is a generator that will read pictures found in subfolders of "data/train", 
# and indefinitely generate batches of augmented image data
train_generator = train_datagen.flow_from_directory(
                    'data/train/', # this is the target directory
                    target_size = (150,150), # all images will be resized to 150x150
                    batch_size = batch_size,
                    class_mode = "binary" # since we use binary_crossentropy loss, we need binary labels
                    )

# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
                        "data/validation/",
                        target_size = (150,150),
                        batch_size = batch_size,
                        class_mode = "binary"
                        )

Found 2000 images belonging to 2 classes.
Found 802 images belonging to 2 classes.
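
Since the generators loop forever, the training call needs to be told how many batches make up one epoch. A small worked example of that arithmetic, using the sample counts and batch size defined above (the variable names `found_validation` and `leftover` are ours):

```python
batch_size = 16
nb_train_samples = 2000
nb_validation_samples = 800
found_validation = 802   # what flow_from_directory reported above

# Integer division gives the number of full batches per pass.
steps_per_epoch = nb_train_samples // batch_size
validation_steps = nb_validation_samples // batch_size
print(steps_per_epoch, validation_steps)  # 125 50

# Images on disk that a single validation pass of 50 batches does not reach.
leftover = found_validation - validation_steps * batch_size
print(leftover)  # 2
```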


In [14]:
model.fit_generator(
    train_generator,
    steps_per_epoch = nb_train_samples // batch_size,
    epochs = epochs,
    validation_data = validation_generator,
    validation_steps = nb_validation_samples // batch_size
)

model.save_weights("./first_try.h5")

Epoch 1/50
...
Epoch 50/50


This approach gets us to a validation accuracy of 0.79-0.81 after 50 epochs (a number that was picked arbitrarily). Because the model is small and uses aggressive dropout, it does not seem to be overfitting too much by that point. So at the time the Kaggle competition was launched, we would already be "state of the art" -- with 8% of the data, and no effort to optimize our architecture or hyperparameters. In fact, in the Kaggle competition, this model would have scored in the top 100 (out of 215 entrants).

Note that the variance of the validation accuracy is fairly high, both because accuracy is a high-variance metric and because we only use 800 validation samples. A good validation strategy in such cases would be to do k-fold cross validation, but this would require training k models for every evaluation round.
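
To get a feel for how noisy an accuracy estimate on 800 samples is, one can use the standard error of a proportion. This is only a back-of-the-envelope sketch: it assumes the validation samples are independent and uses a normal approximation, and the function name is ours.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Standard error of an accuracy measured on n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


se = accuracy_standard_error(0.80, 800)
print(round(se, 4))  # 0.0141

# Approximate 95% interval around a measured accuracy of 0.80.
low, high = 0.80 - 1.96 * se, 0.80 + 1.96 * se
print(round(low, 3), round(high, 3))  # 0.772 0.828
```

Under these assumptions, two runs that differ by one or two points of validation accuracy are hard to tell apart with a validation set of this size.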

## Using the bottleneck features of a pre-trained network: 90% accuracy in a minute

A more refined approach would be to leverage a network pre-trained on a large dataset.  
Such a network would have already learned features that are useful for most computer vision problems, and leveraging such features would allow us to reach a better accuracy than any method that would only rely on the available data.

We will use the VGG16 architecture, pre-trained on the ImageNet dataset -- a model previously featured on this blog. Because the ImageNet dataset contains several "cat" classes (Persian cat, Siamese cat...) and many "dog" classes among its total of 1000 classes, this model will already have learned features that are relevant to our classification problem.  

In fact, it is possible that merely recording the softmax predictions of the model over our data rather than the bottleneck features would be enough to solve our dogs vs cats classification problem extremely well. However, the method we present here is more likely to generalize well to a broader range of problems, including problems featuring classes absent from ImageNet.

!["VGG16 Architecture"](./vgg16_original.png)

Our strategy will be as follows: we will only instantiate the convolutional part of the model, everything up to the fully-connected layers. We will then run this model on our training and validation data once, recording the output (the "bottleneck features" from the VGG16 model: the last activation maps before the fully-connected layers) in two numpy arrays. Then we train a small fully-connected model on top of the stored features.

The reason why we are storing the features offline, rather than adding our fully-connected model directly on top of a frozen convolutional base and running the whole thing, is computational efficiency. Running VGG16 is expensive, especially if you are working on CPU, and we want to only do it once. Note that this prevents us from using data augmentation.
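
As a quick check of what gets stored, the size of the bottleneck features can be estimated by hand. The sketch assumes the standard VGG16 layout (five blocks that each end in a 2x2 max-pooling, size-preserving convolutions, 512 channels in the last block) and a 150x150 input. It also counts the parameters of the small top model used later in this section, with one 256-unit dense layer followed by a single output unit. The helper name is ours.

```python
def vgg16_feature_map_side(size, n_blocks=5):
    """Side length of the feature map after n_blocks 2x2 poolings."""
    for _ in range(n_blocks):
        size //= 2   # convolutions keep the size, each pooling halves it
    return size


side = vgg16_feature_map_side(150)   # 150 -> 75 -> 37 -> 18 -> 9 -> 4
features = side * side * 512         # values stored per image

# Top model: Flatten -> Dense(256) -> Dropout -> Dense(1)
top_params = (features * 256 + 256) + (256 * 1 + 1)
print(side, features, top_params)  # 4 8192 2097665
```

Each image is therefore reduced to a 4x4x512 block of activations, which is cheap to store and fast to train on.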

In [15]:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Dense
from keras import  applications

In [16]:
# dimensions of our images
img_width, img_height = 150, 150

top_model_weights_path = "bottleneck_fc_model.h5"
train_data_dir = "data/train"
validation_data_dir = "data/validation"
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 50
batch_size = 16

In [17]:
def save_bottleneck_features():
    datagen = ImageDataGenerator(rescale=1. / 255)

    # build the VGG16 network
    model = applications.VGG16(include_top=False, weights='imagenet')

    generator = datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode=None,
        shuffle=False)
    bottleneck_features_train = model.predict_generator(
        generator, nb_train_samples // batch_size)
    # np.save accepts a file path directly, so no open file handle is left behind
    np.save('bottleneck_features_train.npy', bottleneck_features_train)

    generator = datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode=None,
        shuffle=False)
    bottleneck_features_validation = model.predict_generator(
        generator, nb_validation_samples // batch_size)
    
    np.save('bottleneck_features_validation.npy',
            bottleneck_features_validation)

In [18]:
def train_top_model():
    train_data = np.load('bottleneck_features_train.npy')
    train_labels = np.array(
        [0] * (nb_train_samples // 2) + [1] * (nb_train_samples // 2))

    validation_data = np.load('bottleneck_features_validation.npy')
    validation_labels = np.array(
        [0] * (nb_validation_samples // 2) + [1] * (nb_validation_samples // 2))

    model = Sequential()
    model.add(Flatten(input_shape=train_data.shape[1:]))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy', metrics=['accuracy'])

    model.fit(train_data, train_labels,
              epochs=epochs,
              batch_size=batch_size,
              validation_data=(validation_data, validation_labels))
    
    model.save_weights(top_model_weights_path)
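
The hard-coded labels above rely on the features being stored class by class, with exactly half of the samples in each class. Because the generators were created with `shuffle=False`, images come out in alphanumeric folder order (cats, then dogs), but the halves are only equal if the folders are. The sketch below shows what happens under the assumption that the validation folders hold 401 images each, which is consistent with the "Found 802 images" message; the helper name `ordered_labels` is hypothetical.

```python
def ordered_labels(class_counts, n_predicted):
    """Labels for the first n_predicted samples of an unshuffled,
    class-sorted stream of images."""
    labels = []
    for class_index, count in enumerate(class_counts):
        labels.extend([class_index] * count)
    return labels[:n_predicted]


# 50 batches of 16 were predicted, out of 401 cats followed by 401 dogs.
labels = ordered_labels([401, 401], 800)
print(labels.count(0), labels.count(1))  # 401 399

# Compare with the "two equal halves" labelling used above.
halves = [0] * 400 + [1] * 400
mismatches = sum(a != b for a, b in zip(labels, halves))
print(mismatches)  # 1
```

With these counts a single validation sample would be mislabelled, which is negligible here but worth keeping in mind for unbalanced folders. The directory iterator also exposes the per-sample class indices through its `classes` attribute, which could be sliced to the number of predicted samples instead of building the labels by hand.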

In [20]:
save_bottleneck_features()

Found 2000 images belonging to 2 classes.
Found 802 images belonging to 2 classes.


In [21]:
train_top_model()

Train on 2000 samples, validate on 800 samples
Epoch 1/50
...
Epoch 50/50


We reach a validation accuracy of 0.90-0.91: not bad at all.  
This is definitely partly due to the fact that the base model was trained on a dataset that already featured dogs and cats (among hundreds of other classes).

## Fine-tuning the top layers of a pre-trained network

To further improve our previous result, we can try to "fine-tune" the last convolutional block of the VGG16 model alongside the top-level classifier. Fine-tuning consists in starting from a trained network, then re-training it on a new dataset using very small weight updates. In our case, this can be done in 3 steps:
- instantiate the convolutional base of VGG16 and load its weights
- add our previously defined fully-connected model on top, and load its weights
- freeze the layers of the VGG16 model up to the last convolutional block

![](./vgg16_modified.png)

Note that:

- in order to perform fine-tuning, all layers should start with properly trained weights: for instance you should not slap a randomly initialized fully-connected network on top of a pre-trained convolutional base. This is because the large gradient updates triggered by the randomly initialized weights would wreck the learned weights in the convolutional base. In our case this is why we first train the top-level classifier, and only then start fine-tuning convolutional weights alongside it.

- we choose to only fine-tune the last convolutional block rather than the entire network in order to prevent overfitting, since the entire network would have a very large entropic capacity and thus a strong tendency to overfit. The features learned by low-level convolutional blocks are more general, less abstract than those found higher-up, so it is sensible to keep the first few blocks fixed (more general features) and only fine-tune the last one (more specialized features).

- fine-tuning should be done with a very low learning rate, and typically with the SGD optimizer rather than an adaptive learning rate optimizer such as RMSProp. This is to make sure that the magnitude of the updates stays very small, so as not to wreck the previously learned features.
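
To see how small the updates are with the settings used below (learning rate 1e-4, momentum 0.9), here is a minimal sketch of the classic SGD-with-momentum update applied to a single weight. It is a simplification: the gradient is held constant at 1.0, which never happens in real training, and the function name is ours.

```python
def sgd_momentum_steps(gradient, lr, momentum, n_steps):
    """Return the weight update applied at each step for a constant gradient."""
    velocity = 0.0
    steps = []
    for _ in range(n_steps):
        # velocity accumulates past gradients, damped by the momentum factor
        velocity = momentum * velocity - lr * gradient
        steps.append(velocity)
    return steps


steps = sgd_momentum_steps(gradient=1.0, lr=1e-4, momentum=0.9, n_steps=200)
print(steps[0])    # first update: -lr * gradient
print(steps[-1])   # approaches -lr * gradient / (1 - momentum)
```

Even once momentum has built up, the update per step stays around one thousandth of the gradient, and it scales with the gradient itself rather than being renormalized as an adaptive optimizer would do.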

In [41]:
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential
from keras.models import Model
from keras.layers import Dropout, Flatten, Dense

In [23]:
# path to the model weights files
weights_path = "./vgg16_weights.h5"
top_model_weights_path = "./bottleneck_fc_model.h5"

In [24]:
# dimensions of our images.
img_width, img_height = 150, 150

In [25]:
train_data_dir = "./data/train/"
validation_data_dir = "./data/validation/"

In [26]:
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 50
batch_size = 16

In [37]:
# build the VGG network
base_model = applications.VGG16(weights="imagenet",  include_top=False, input_shape=(150, 150, 3))
print("Model loaded.")

Model loaded.


In [38]:
# build a classifier model to put on top of the convolutional model
top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(1, activation='sigmoid'))

In [39]:
# note that it is necessary to start with a fully-trained classifier
# including the top classifier
# in order to successfully do fine-tuning
top_model.load_weights(top_model_weights_path)

In [42]:
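# add the model on top of the convolutional base
# (the VGG16 base is a functional model, so we wire the two together with Model
# instead of calling model.add(top_model))
model = Model(inputs=base_model.input, outputs=top_model(base_model.output))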

In [43]:
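# set the first 15 layers (everything before the last conv block, block5)
# to non-trainable (weights will not be updated).
# With applications.VGG16(include_top=False) the base has 19 layers and
# block5_conv1 sits at index 15; a slice of [:25] would freeze the whole model.

for layer in model.layers[:15]:
    layer.trainable = False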
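
The right slice bound depends on how many layers the convolutional base has. As a sanity check, the sketch below lists the layers of the VGG16 convolutional base as we understand them from keras.applications (the exact name of the input layer varies, so `"input"` is a placeholder) and looks up where the last convolutional block starts.

```python
VGG16_CONV_BASE_LAYERS = [
    "input",
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool",
]

# Index of the first layer that should stay trainable.
first_trainable = VGG16_CONV_BASE_LAYERS.index("block5_conv1")
print(len(VGG16_CONV_BASE_LAYERS), first_trainable)  # 19 15

# Python slices clamp silently: a bound past the end selects every layer.
print(len(VGG16_CONV_BASE_LAYERS[:25]))  # 19
```

With the top model attached as one extra layer, the combined model should have 20 layers, so any bound of 20 or more would freeze everything, including the classifier, without raising an error. Printing `model.layers` on the real model is the safest way to confirm the indices.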

In [44]:
# compile the model with a SGD/momentum optimizer
# and a very slow learning rate

model.compile(
    loss="binary_crossentropy",
    optimizer = optimizers.SGD(lr=1e-4, momentum=0.9),
    metrics=['accuracy']
             )

In [45]:
# prepare data augmentation configuration
train_datagen = ImageDataGenerator(
    rescale = 1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

test_datagen = ImageDataGenerator(rescale=1./255)

In [46]:
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

Found 2000 images belonging to 2 classes.
Found 802 images belonging to 2 classes.


In [47]:
# fine-tune the model
model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)



Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f85edc35320>

In [48]:
model.save_weights("./fine_tuning_model.h5")