# To do
1. Create validation and test sets (find the native Keras way of doing it rather than splitting manually).
2. Generate X and y at run time rather than building them in full, which is only feasible for the small data set used here.

Adapt F. Chollet's code from the Keras blog (https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) and GitHub (https://gist.github.com/fchollet/f35fbc80e066a49d65f1688a7e99f069) to apply the pre-trained VGG16 network to our data.

In [1]:
'''Things we need to do before analysis of the main data:
- create a data/ folder
- create train/, validation/, and test/ subfolders inside data/
'''


In [41]:
import numpy as np
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Dense
from keras import applications

Using code from https://www.kaggle.com/inversion/processing-bson-files/notebook

In [42]:
import io
import bson

from skimage.io import imread  # skimage.data.imread is not available in newer scikit-image releases
import multiprocessing
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.utils import to_categorical

## Define features and labels

In [43]:
#change path depending on where your data is
data = bson.decode_file_iter(open('/Users/satoru/Documents/Kaggle/Cdiscount/train_example.bson', 'rb'))

prod_to_category = dict()

for c, d in enumerate(data):
    product_id = d['_id']
    category_id = d['category_id'] #This won't be in Test data
    prod_to_category[product_id] = category_id
    for e, pic in enumerate(d['imgs']):
        picture = imread(io.BytesIO(pic['picture']))
        # do something with the picture, etc

prod_to_category = pd.DataFrame.from_dict(prod_to_category, orient='index')
prod_to_category.index.name = '_id'
prod_to_category.rename(columns={0: 'category_id'}, inplace=True)

For the train_example data, X and y fit in memory, so we can build them as arrays directly. For the real training set, we need a way to feed the data to the model at run time.

In [44]:
data = bson.decode_file_iter(open('/Users/satoru/Documents/Kaggle/Cdiscount/train_example.bson', 'rb'))
images = []
labels = []

for d in data:
    for pic in d['imgs']:
        # Each picture is 180x180 pixels with 3 color channels.
        images.append(imread(io.BytesIO(pic['picture'])))
        labels.append(d['category_id'])

# Stack once at the end; np.append inside the loop copies the whole array on every call.
X = np.stack(images).astype(np.float64)  # same dtype that np.append produced
y = np.array(labels, dtype=str)

In [45]:
X = X.reshape(-1, 180, 180, 3)  # (n_images, height, width, channels); n_images is 110 here
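
For the full training set, building X in memory is not feasible. A minimal sketch of the run-time alternative, assuming each record is a dict with `category_id` and `imgs` keys as in the loop above: a generator that decodes images lazily and yields fixed-size batches. The names `batch_generator`, `decode`, and `label_to_int` are illustrative and not part of Keras or the Kaggle kernel, and the fake records stand in for the output of `bson.decode_file_iter`.

```python
import numpy as np

def batch_generator(records, decode, label_to_int, batch_size=16):
    """Yield (X_batch, y_batch) pairs without materializing the full dataset.

    records      -- iterable of dicts with 'category_id' and 'imgs' keys
    decode       -- callable turning pic['picture'] bytes into an HxWx3 array
    label_to_int -- dict mapping category_id to an integer class index
    """
    images, labels = [], []
    for record in records:
        for pic in record['imgs']:
            images.append(decode(pic['picture']))
            labels.append(label_to_int[record['category_id']])
            if len(images) == batch_size:
                yield np.stack(images), np.array(labels)
                images, labels = [], []
    if images:  # final partial batch
        yield np.stack(images), np.array(labels)


# Stand-in data: 3 products with 1 to 3 pictures each (hypothetical, not real BSON).
fake_records = [
    {'category_id': 1000, 'imgs': [{'picture': b'a'}]},
    {'category_id': 2000, 'imgs': [{'picture': b'b'}, {'picture': b'c'}]},
    {'category_id': 1000, 'imgs': [{'picture': b'd'}, {'picture': b'e'}, {'picture': b'f'}]},
]

def fake_decode(raw):
    # Placeholder decoder; the real one would wrap imread(io.BytesIO(raw)).
    return np.zeros((180, 180, 3), dtype=np.uint8)

batches = list(batch_generator(fake_records, fake_decode, {1000: 0, 2000: 1}, batch_size=4))
```

With real data, `records` would come from `bson.decode_file_iter(...)`. Keras expects a training generator to loop indefinitely, so the BSON iterator would also need to be reopened at the end of each pass.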

In [47]:
# Map each distinct category_id in this example file to an integer class index (36 classes here).
classes = list(set(y))
y_dict = dict(zip(classes, range(len(classes))))
y_int = np.array([y_dict[cat] for cat in y])
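
Because the iteration order of a `set` of strings may differ between Python sessions, the mapping above is not guaranteed to be the same from one run to the next. A sketch of a deterministic alternative uses `np.unique`, which returns the sorted unique labels and the index of each label in one call. The category IDs below are made up for illustration.

```python
import numpy as np

# Made-up category IDs standing in for y
y_example = np.array(['1000010653', '1000004079', '1000010653', '1000018290'])

# classes is sorted; y_idx[i] is the position of y_example[i] in classes
classes, y_idx = np.unique(y_example, return_inverse=True)
num_classes = len(classes)

print(classes)      # sorted unique labels
print(y_idx)        # integer class index per sample
print(num_classes)  # could replace the hard-coded 36
```

Indexing `classes[y_idx]` recovers the original labels, which is useful when turning predictions back into category IDs.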

## Use VGG16 to get bottleneck features

In [48]:
# dimensions of our images.
img_width, img_height = 180, 180

top_model_weights_path = 'bottleneck_fc_model.h5'
#train_data_dir = 'data/train'
#validation_data_dir = 'data/validation'
nb_train_samples = 110
#nb_validation_samples = 800
epochs = 50
batch_size = 16

In [49]:
# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

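# shuffle=False keeps the generator output in the same order as y_int;
# flow() shuffles by default, which would misalign bottleneck features and labels.
train_generator = train_datagen.flow(X, y_int, batch_size=batch_size, shuffle=False)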

In [50]:
def save_bottlebeck_features():

    # build the VGG16 network
    model = applications.VGG16(include_top=False, weights='imagenet')
    
    # ceil(nb_train_samples / batch_size) steps, so the final partial batch is included
    bottleneck_features_train = model.predict_generator(
        train_generator, (nb_train_samples-1) // batch_size + 1)
    
    # given a file name, np.save opens the file in binary mode itself
    np.save('bottleneck_features_train.npy', bottleneck_features_train)

#    generator = datagen.flow_from_directory(
#        validation_data_dir,
#        target_size=(img_width, img_height),
#        batch_size=batch_size,
#        class_mode=None,
#        shuffle=False)
#    bottleneck_features_validation = model.predict_generator(
#        generator, nb_validation_samples // batch_size)
#    np.save(open('bottleneck_features_validation.npy', 'w'),
#            bottleneck_features_validation)
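
The step count passed to `predict_generator` is integer ceiling division: enough batches to cover every sample, with the last batch possibly smaller than the rest. A quick check of that arithmetic with the values used here:

```python
import math

nb_train_samples = 110
batch_size = 16

steps = (nb_train_samples - 1) // batch_size + 1   # form used in the function above
assert steps == math.ceil(nb_train_samples / batch_size)

# Number of samples left for the final batch
last_batch = nb_train_samples - (steps - 1) * batch_size
print(steps, last_batch)  # 7 14
```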

## Train the top model

In [51]:
def train_top_model():
    train_data = np.load(open('bottleneck_features_train.npy', 'rb'))
    train_labels = y_int

#    validation_data = np.load(open('bottleneck_features_validation.npy'))
#    validation_labels = np.array(
#        [0] * (nb_validation_samples / 2) + [1] * (nb_validation_samples / 2))

    model = Sequential()
    model.add(Flatten(input_shape=train_data.shape[1:]))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(36, activation='softmax'))

    model.compile(optimizer='rmsprop',
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    model.fit(train_data, train_labels,
              epochs=epochs,
              batch_size=batch_size)
#              validation_data=(validation_data, validation_labels))
    model.save_weights(top_model_weights_path)
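
The validation arguments are commented out because no validation set exists yet. Keras's `fit` accepts a `validation_split` fraction, which holds out the last part of the arrays without shuffling them first. The sketch below does the split explicitly with a shuffled permutation, so the held-out part could be passed as `validation_data`. The helper name `train_val_split` and the stand-in arrays are assumptions for illustration.

```python
import numpy as np

def train_val_split(features, labels, val_fraction=0.2, seed=0):
    """Shuffle once, then hold out val_fraction of the samples for validation."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(features))
    n_val = max(1, int(round(len(features) * val_fraction)))
    train_idx, val_idx = order[:-n_val], order[-n_val:]
    return (features[train_idx], labels[train_idx]), (features[val_idx], labels[val_idx])

# Stand-ins with the shapes from this notebook: 110 samples of 5x5x512 features
features = np.zeros((110, 5, 5, 512), dtype=np.float32)
labels = np.arange(110) % 36

(train_x, train_y), (val_x, val_y) = train_val_split(features, labels)
print(train_x.shape, val_x.shape)  # (88, 5, 5, 512) (22, 5, 5, 512)
```

With 36 classes and only 110 images, a random split may leave some classes without any validation examples, so the validation accuracy on this sample would be a rough indicator at best.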

In [52]:
save_bottlebeck_features()

In [53]:
train_top_model()

Epoch 1/50
...
Epoch 50/50


## Predictions

In [54]:
train_data = np.load(open('bottleneck_features_train.npy', 'rb'))
train_labels = y_int

top_model = Sequential()
top_model.add(Flatten(input_shape=train_data.shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(36, activation='softmax'))

In [55]:
top_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_4 (Flatten)          (None, 12800)             0         
_________________________________________________________________
dense_7 (Dense)              (None, 256)               3277056   
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 36)                9252      
=================================================================
Total params: 3,286,308
Trainable params: 3,286,308
Non-trainable params: 0
_________________________________________________________________


In [56]:
top_model.load_weights(top_model_weights_path)

In [57]:
np.argmax(top_model.predict(train_data), axis=1)

array([12, 12,  9,  1, 27, 12, 26,  9, 17,  3, 12, 16, 28, 28, 28, 28,  4,
       12, 34, 25, 20, 20, 12,  8, 12, 12, 12, 12, 31, 31, 31, 31, 12, 12,
       12, 21, 21, 12,  3,  9, 12, 12,  2,  2, 12, 12, 13, 13,  9,  9,  9,
       12, 29, 10, 12, 12,  0, 12, 12, 22, 35, 12, 12, 30, 30, 30, 11, 19,
       12, 12,  7,  7, 12, 17, 17, 17, 12, 12, 25, 14,  6,  5,  5,  5,  5,
       12, 12, 18, 18, 17, 12,  1, 24, 12, 32, 33, 12, 12, 12, 12,  4, 12,
       24, 24, 17, 17, 17, 17, 23, 15])

In [58]:
y_int

array([12, 12,  9,  1, 27, 12, 26,  9, 17,  3, 12, 16, 28, 28, 28, 28,  4,
       12, 34, 25, 20, 20, 12,  8, 12, 12, 12, 12, 31, 31, 31, 31, 12, 12,
       12, 21, 21, 12,  3,  9, 12, 12,  2,  2, 12, 12, 13, 13,  9,  9,  9,
       12, 29, 10, 12, 12,  0, 12, 12, 22, 35, 12, 12, 30, 30, 30, 11, 19,
       12, 12,  7,  7, 12, 17, 17, 17, 12, 12, 25, 14,  6,  5,  5,  5,  5,
       12, 12, 18, 18, 17, 12,  1, 24, 12, 32, 33, 12, 12, 12, 12,  4, 12,
       24, 24, 17, 17, 17, 17, 23, 15])

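These all match, but that is weak evidence: top_model has about 3.3M weights and was fitted to only 110 images, so it can simply memorize the training set. A held-out validation set is needed to judge real performance.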
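
Comparing two 110-element arrays by eye is error-prone, so the match is better checked numerically. A small sketch with short stand-in arrays follows; in the notebook these would be the `np.argmax(...)` result and `y_int`.

```python
import numpy as np

predicted = np.array([12, 12, 9, 1, 27])   # stand-in for np.argmax(top_model.predict(train_data), axis=1)
actual = np.array([12, 12, 9, 1, 26])      # stand-in for y_int, with one deliberate mismatch

accuracy = np.mean(predicted == actual)            # fraction of matching positions
mismatches = np.flatnonzero(predicted != actual)   # indices where the two arrays differ
print(accuracy, mismatches)  # 0.8 [4]
```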