## Data processing

In [1]:
import pandas as pd
X_meta = pd.read_csv('/Users/alex/Desktop/comet/data/training/X_meta.csv')

In [2]:
X_meta.LabelName.value_counts()

Strawberry     7617
Orange         5972
Tomato         5826
Apple          3703
Grape          2588
Lemon          1678
Banana         1507
Grapefruit     1185
Watermelon      788
Pear            741
Peach           721
Pomegranate     653
Pineapple       630
Mango           404
Common fig      303
Cantaloupe      158
Name: LabelName, dtype: int64

### Attempting to apply imbalanced learning

The classes are extremely imbalanced, with Cantaloupe having roughly a fiftieth as many image samples in the dataset as Strawberry. If our goal is to build a classifier which is equally good for all of these image types, we'd benefit from rebalancing the classes.
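
To put a number on the imbalance, here is a minimal sketch that recomputes the ratio from the counts printed above. The `counts` dictionary is copied by hand from that output, and `balanced_total` is a name introduced here for illustration; it assumes random oversampling grows every class to the size of the largest one.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "Strawberry": 7617, "Orange": 5972, "Tomato": 5826, "Apple": 3703,
    "Grape": 2588, "Lemon": 1678, "Banana": 1507, "Grapefruit": 1185,
    "Watermelon": 788, "Pear": 741, "Peach": 721, "Pomegranate": 653,
    "Pineapple": 630, "Mango": 404, "Common fig": 303, "Cantaloupe": 158,
}

largest = max(counts, key=counts.get)
smallest = min(counts, key=counts.get)
ratio = counts[largest] / counts[smallest]
print(f"{largest} has {ratio:.1f}x as many samples as {smallest}")

# Assumption: random oversampling duplicates minority rows until every
# class matches the largest one.
balanced_total = counts[largest] * len(counts)
print(f"rows after oversampling: {balanced_total}")
```

An 80% split of that total gives 97,497 rows, which matches the length of `X_train_df` reported further down.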

In [3]:
X = X_meta[['CroppedImageURL']].values
y = X_meta['LabelName'].values

from imblearn.over_sampling import RandomOverSampler
X, y = RandomOverSampler(random_state=42).fit_resample(X, y)

In [4]:
import numpy as np

split_ratio = 0.8
n_samples = len(X)
split_idx = int(split_ratio * n_samples)

np.random.seed(42)
idxs = np.arange(len(X))
np.random.shuffle(idxs)

X_train, X_test = X[idxs[:split_idx]], X[idxs[split_idx:]]
y_train, y_test = y[idxs[:split_idx]], y[idxs[split_idx:]]

In order to use the `flow_from_dataframe` feature we need to have the latest version of `keras-preprocessing` installed. This library is not up-to-date in the `keras` version I have installed.

In [53]:
%pip install -U git+https://github.com/keras-team/keras-preprocessing.git

Uninstalling Keras-Preprocessing-1.0.2:
  Would remove:
    /Users/alex/miniconda3/envs/open-fruits-dev/lib/python3.6/site-packages/Keras_Preprocessing-1.0.2-py2.7.egg-info
    /Users/alex/miniconda3/envs/open-fruits-dev/lib/python3.6/site-packages/keras_preprocessing/*
Proceed (y/n)? ^C
Operation cancelled by user
Note: you may need to restart the kernel to use updated packages.
Collecting git+https://github.com/keras-team/keras-preprocessing.git
  Cloning https://github.com/keras-team/keras-preprocessing.git to /private/var/folders/kx/vhz7qj2j2537dm7m86qllzx00000gn/T/pip-req-build-zvf8qnc0
Building wheels for collected packages: Keras-Preprocessing
  Building wheel for Keras-Preprocessing (setup.py) ... done
  Stored in directory: /private/var/folders/kx/vhz7qj2j2537dm7m86qllzx00000gn/T/pip-ephem-wheel-cache-h40r8pft/wheels/03/a0/39/171f6040d36f36c71168dc69afa81334351b20955dc36ce932
Successfully built Keras-Preprocessing
keras 2.2.4 has requirement keras_app

In [5]:
import keras
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1/255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

test_datagen = ImageDataGenerator(
    rescale=1/255
)

Using TensorFlow backend.


To try to coax `keras` into respecting oversampling I will use the `flow_from_dataframe` facility targeting the oversampled `DataFrame`.

In [83]:
X_train_df = pd.DataFrame().assign(ImagePath=X_train[:, 0], ImageClass=y_train)
X_test_df = pd.DataFrame().assign(ImagePath=X_test[:, 0], ImageClass=y_test)

train_generator = train_datagen.flow_from_dataframe(
    X_train_df,
    directory='/Users/alex/Desktop/comet/data/images_cropped/',
    x_col='ImagePath',
    y_col='ImageClass',
    target_size=(48, 48),
    batch_size=512,
    class_mode='categorical'
)

validation_generator = test_datagen.flow_from_dataframe(
    X_test_df,
    directory='/Users/alex/Desktop/comet/data/images_cropped/',
    x_col='ImagePath',
    y_col='ImageClass',    
    target_size=(48, 48),
    batch_size=512,
    class_mode='categorical'
)

Found 972 validated image filenames belonging to 16 classes.
Found 13765 validated image filenames belonging to 16 classes.


In [7]:
print(len(X_train_df))
print(len(train_generator.filenames))

97497
30581


Unfortunately, it appears that the training generator does not respect oversampling in its input `DataFrame`. If we want to oversample and still use the training generators, it appears that we will have to write copies of image files directly to disk. We could omit the training generator, but then we'd have to form a whole-dataset `numpy` array, which would take more RAM than this computer has.
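
One alternative that avoids duplicating files is to weight classes in the loss instead of oversampling them. Note that this swaps in a different technique from the one tried above, and whether it works as well for this dataset is untested. The sketch below computes "balanced" weights, `n_samples / (n_classes * class_count)`, from the original counts. The `class_indices` mapping is a hand-built stand-in for `train_generator.class_indices`, so the index order is an assumption.

```python
# Class counts copied from the value_counts() output earlier in the notebook.
counts = {
    "Strawberry": 7617, "Orange": 5972, "Tomato": 5826, "Apple": 3703,
    "Grape": 2588, "Lemon": 1678, "Banana": 1507, "Grapefruit": 1185,
    "Watermelon": 788, "Pear": 741, "Peach": 721, "Pomegranate": 653,
    "Pineapple": 630, "Mango": 404, "Common fig": 303, "Cantaloupe": 158,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Hypothetical stand-in for train_generator.class_indices (name -> integer id).
class_indices = {name: i for i, name in enumerate(sorted(counts))}

# "Balanced" weighting: rare classes get proportionally larger weights.
class_weight = {
    class_indices[name]: n_samples / (n_classes * count)
    for name, count in counts.items()
}

for name in ("Strawberry", "Cantaloupe"):
    print(f"{name}: weight {class_weight[class_indices[name]]:.3f}")
```

A dictionary of this form could then be passed as the `class_weight` argument of `fit_generator`, so that each Cantaloupe sample counts for more in the loss than each Strawberry sample.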

## Model 1&mdash;Bottleneck VGG16

Bottleneck VGG comes from ["Building powerful image classification models using very little data"](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html).

Bottlenecking is a simple but extremely fast pretraining technique. We begin by passing the dataset through a pretrained model with the top (fully connected) layers omitted. Running each sample through this model produces the sparse feature representation that the model has learned for that sample. These features are then fed as input to our simpler custom model, which trains on them to produce the final predictions.

This is a fast pretraining technique because its total cost is just one pass of all the data through the pretrained model, plus training the simple model we put on top of the bottleneck.
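
The two steps can be illustrated with a toy sketch in plain `numpy`. Nothing here is VGG16: `W_frozen` is a hypothetical fixed random projection standing in for the pretrained convolutional base, the data is synthetic, and the "simple model" is a logistic-regression head. The point is only the structure: features are extracted once and cached, and only the head is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the image data: 200 samples, 20 raw features, 2 classes.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Hypothetical "pretrained" extractor: fixed weights that are never updated.
W_frozen = rng.normal(size=(20, 64)) / np.sqrt(20)

def extract_features(batch):
    """One forward pass through the frozen extractor (ReLU activations)."""
    return np.maximum(batch @ W_frozen, 0.0)

def log_loss(features, labels, w, b):
    """Mean binary cross-entropy of a logistic head on the given features."""
    logits = features @ w + b
    return np.mean(np.logaddexp(0.0, logits) - labels * logits)

# Step 1: pass every sample through the frozen extractor once; cache the result.
features = extract_features(X)

# Step 2: train only the small head on the cached features.
w, b = np.zeros(features.shape[1]), 0.0
initial_loss = log_loss(features, y, w, b)
for _ in range(500):
    probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    grad = probs - y
    w -= 0.1 * features.T @ grad / len(y)
    b -= 0.1 * grad.mean()
final_loss = log_loss(features, y, w, b)

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because the ReLU zeroes out negative activations, many entries of the cached features are exactly zero, which is the same kind of sparsity seen in the real bottleneck output.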

**Note**: to verify that this code was correct, I did `head(1000)` on the input `DataFrame`. This made it so I could complete one round of training locally, just to make sure that all of the sizes align.

In [23]:
batch_size = 512

model = keras.applications.VGG16(include_top=False, weights='imagenet')
bottleneck_features_train = model.predict_generator(
    train_generator,
    steps=len(train_generator.filenames) // batch_size
)

bottleneck_features_test = model.predict_generator(
    validation_generator,  # the held-out generator defined above
    steps=len(validation_generator.filenames) // batch_size
)
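
The `steps` arguments above use floor division, which silently skips the final partial batch. A small sketch of the arithmetic, using the 30,581 filenames reported by the training generator earlier; the variable names are introduced here for illustration only.

```python
import math

n_files = 30581      # len(train_generator.filenames) reported earlier
batch_size = 512

floor_steps = n_files // batch_size            # what the cell above computes
ceil_steps = math.ceil(n_files / batch_size)   # also covers the last partial batch

skipped = n_files - floor_steps * batch_size
print(f"floor: {floor_steps} steps, {skipped} files never seen")
print(f"ceil:  {ceil_steps} steps, covers all {n_files} files")
```

Whether the skipped remainder matters depends on the use: for a quick shape check it does not, but for caching features for the whole dataset the rounded-up value is the safer choice.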

The output of the last layer is an array of convolutional feature activations (as returned by a final `MaxPool` layer):

In [26]:
bottleneck_features_train[0, 0, 0]

array([0.3718927 , 0.        , 0.        , 0.7131844 , 0.        ,
       0.        , 0.        , 0.54641426, 0.        , 0.7506968 ,
       0.        , 0.        , 0.        , 0.        , 0.4727954 ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.6142101 , 0.        ,
       0.        , 1.4543753 , 0.        , 0.        , 0.        ,
       0.        , 0.58099055, 0.5119456 , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 1.3331678 ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.84954274,
       0.25664645, 0.        , 0.08509421, 0.        , 1.829702  ,
       0.        , 0.        , 0.2652532 , 0.505351  , 0.        ,
       0.        , 0.464088  , 0.        , 0.        , 0.        ,
       0.        , 1.2156435 , 0.        , 0.        , 0.     

In [27]:
model.layers[-1]

<keras.layers.pooling.MaxPooling2D at 0x1a269fd2b0>

The output shape is interesting and requires flattening:

In [25]:
bottleneck_features_train.shape

(972, 1, 1, 512)
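
The `1, 1, 512` part of that shape follows from the architecture. VGG16's convolutions use `same` padding, so only its five 2x2 max-pooling layers shrink the image, each halving the side length, and the last block has 512 channels. A minimal sketch of that arithmetic (the helper name `vgg16_output_side` is made up for this example):

```python
def vgg16_output_side(input_side, n_pools=5):
    """Side length of the final feature map after n_pools 2x2 max-pools."""
    side = input_side
    for _ in range(n_pools):
        side //= 2  # each pool halves the side, rounding down
    return side

for input_side in (48, 224):
    side = vgg16_output_side(input_side)
    flattened = side * side * 512  # 512 channels in the last conv block
    print(f"{input_side}x{input_side} input -> {side}x{side}x512 -> {flattened} features")
```

For the `48x48` inputs used here the feature map collapses to a single spatial position, so flattening leaves a plain 512-element vector per image.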

Although the blog post saves these to a `numpy` array and reads them back out, the best way to do this is actually to put the new layers that you want directly on top of the existing model.

In [79]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

input_shape = (1, 1, 512)  # == bottleneck_features_train.shape[1:]

prior = keras.applications.VGG16(
    include_top=False, 
    weights='imagenet',
    input_shape=(48, 48, 3)
)
model = Sequential()
model.add(prior)
model.add(Flatten())
model.add(Dense(256, activation='relu', name='Dense_Intermediate'))
model.add(Dropout(0.2, name='Dropout_Regularization'))
model.add(Dense(16, activation='sigmoid', name='Output'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

**Note**: you must specify the `input_shape` on the `keras.applications.VGG16` layer. If you do not, the model assumes a just-in-time `(None, None, None, 512)` output shape, where the first `None` is the number of samples and the other two are the height and width of the feature map, which are unknown until an input size is fixed. `Flatten` cannot work with this shape. It complains that it doesn't have enough information to do its job.

**Note**: `48x48` is the smallest legal input size for `VGG16` in the `keras` version used here. Keras raises an error if you specify inputs smaller than that.

In [84]:
model.fit_generator(
    train_generator,
    steps_per_epoch=len(train_generator.filenames) // batch_size,
    epochs=1
)

Epoch 1/1


<keras.callbacks.History at 0x1a64985438>

Putting it all together:

In [None]:
import numpy as np
import pandas as pd
import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout


X_meta = pd.read_csv('/Users/alex/Desktop/comet/data/training/X_meta.csv')
X = X_meta[['CroppedImageURL']].values
y = X_meta['LabelName'].values


np.random.seed(42)
split_ratio = 0.8
split_idx = int(split_ratio * len(X))  # first 80% of the shuffled rows train
idxs = np.arange(len(X))
np.random.shuffle(idxs)
X_train, X_test = X[idxs[:split_idx]], X[idxs[split_idx:]]
y_train, y_test = y[idxs[:split_idx]], y[idxs[split_idx:]]
X_train_df = pd.DataFrame().assign(ImagePath=X_train[:, 0], ImageClass=y_train)
X_test_df = pd.DataFrame().assign(ImagePath=X_test[:, 0], ImageClass=y_test)


train_datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1/255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)
test_datagen = ImageDataGenerator(
    rescale=1/255
)
train_generator = train_datagen.flow_from_dataframe(
    X_train_df,
    directory='/Users/alex/Desktop/comet/data/images_cropped/',
    x_col='ImagePath',
    y_col='ImageClass',
    target_size=(48, 48),
    batch_size=512,
    class_mode='categorical'
)
validation_generator = test_datagen.flow_from_dataframe(
    X_test_df,
    directory='/Users/alex/Desktop/comet/data/images_cropped/',
    x_col='ImagePath',
    y_col='ImageClass',
    target_size=(48, 48),
    batch_size=512,
    class_mode='categorical'
)


batch_size = 512
prior = keras.applications.VGG16(
    include_top=False, 
    weights='imagenet',
    input_shape=(48, 48, 3)
)
model = Sequential()
model.add(prior)
model.add(Flatten())
model.add(Dense(256, activation='relu', name='Dense_Intermediate'))
model.add(Dropout(0.2, name='Dropout_Regularization'))
model.add(Dense(16, activation='sigmoid', name='Output'))
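# freeze the pretrained convolutional base before compiling, so that the
# setting takes effect during training
model.layers[0].trainable = False
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])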

model.fit_generator(
    train_generator,
    steps_per_epoch=len(train_generator.filenames) // batch_size,
    epochs=1
)

A modified version of this script, adapted to use `comet` and `t4`, is `models/bottleneck_model.py`.

**Note**: the original run generating that model was done via `alekseylearn` on AWS SageMaker, hence why it ends by running `model.save('/opt/ml/model/model.h5')`. I downloaded the artifact to `bottleneck_model.h5` locally before pushing it to remote storage as a package.

In [57]:
import t4
t4.Package().set('bottleneck_model.h5', 'bottleneck_model.h5').push('quilt/open_fruit_models', 's3://quilt-example')

(remote Package)
 └─bottleneck_model.h5

## Model 2&mdash;Pretrained VGG16

The bottleneck approach is fast, but it does no training on the pretrained model itself. Training part of that model too should improve our score further.

We will unfreeze the topmost convolutional block of the network (its features are the least general, so retraining it should yield the most gain), then train again. To do this successfully we need to do a few things:

1. Start with already-trained layers on top of the network, as otherwise the large weight updates coming from a randomly initialized classifier will wreck the existing features in the unfrozen CNN block.
2. Set a small non-adaptive learning rate, e.g. SGD with a small step size, for the same reason.

We will reuse the trained top layers of the bottleneck model for this purpose. If you were starting from scratch you could pretrain this simple neural network using an [autoencoder](https://www.kaggle.com/residentmario/autoencoders).
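
To make "the topmost convolutional block" concrete, the sketch below lists the layers of `VGG16` with `include_top=False` and shows which ones a `[:-4]` slice freezes. The layer names are written out by hand rather than read from a model, and the input layer's name is an assumption, since Keras numbers it per session.

```python
# Layer names of keras.applications.VGG16(include_top=False), in order.
# The input layer's name is an assumption; Keras numbers it per session.
vgg16_layers = ["input_1"]
for block, n_convs in enumerate([2, 2, 3, 3, 3], start=1):
    vgg16_layers += [f"block{block}_conv{i}" for i in range(1, n_convs + 1)]
    vgg16_layers.append(f"block{block}_pool")

frozen = vgg16_layers[:-4]      # everything up to and including block4_pool
trainable = vgg16_layers[-4:]   # block5's three conv layers and its pool

print("frozen:", len(frozen), "layers, ending at", frozen[-1])
print("trainable:", trainable)
```

The pooling layer has no weights, so in practice only the three `block5` convolutions are updated during fine-tuning.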

In [60]:
import keras
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout
from keras.optimizers import SGD


# fetch the pretrained model from storage
import t4
t4.Package.browse('quilt/open_fruit_models', 's3://quilt-example')['bottleneck_model.h5']\
    .fetch('bottleneck_model.h5')
pretrained_model = keras.models.load_model('bottleneck_model.h5')


# define the model
input_shape = (1, 1, 512)  # == bottleneck_features_train.shape[1:]
prior = keras.applications.VGG16(
    include_top=False, 
    weights='imagenet',
    input_shape=(48, 48, 3)
)
model = Sequential()
model.add(prior)
model.add(Flatten())
model.add(Dense(256, activation='relu', name='Dense_Intermediate'))
model.add(Dropout(0.2, name='Dropout_Regularization'))
model.add(Dense(16, activation='sigmoid', name='Output'))


# set pretrained weights
for new_layer, old_layer in zip(model.layers[-4:], pretrained_model.layers[-4:]):
    new_layer.set_weights(old_layer.get_weights())


# leave the outermost convblock trainable, but freeze all other layers
for cnn_block_layer in model.layers[0].layers[:-4]:
    cnn_block_layer.trainable = False

    
# compile the model
model.compile(
    # one-hundredth of the default Keras SGD learning rate (0.01), w/ some momentum
    optimizer=SGD(lr=1e-4, momentum=0.9),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)


In [55]:
# run this cell to test that the model compiles successfully
# model.fit_generator(
#     train_generator,
#     # use a tiny batch size just to verify that the model trains correctly
#     steps_per_epoch=len(train_generator.filenames) // 8,
#     epochs=1
# )

## Model 3&mdash;Pretrained VGG16 with Progressive Resizing

In [4]:
import keras

In [8]:
keras.applications.VGG16??