## Model 3: transfer learning ##

In this notebook, I implement transfer learning using the ResNet50 pre-trained network.

Predictions from this model scored 0.277 on Kaggle (submitted on Jan 20, 2019), slightly better than the 2nd model's score of 0.276 but worse than the 1st model's score of 0.283.

Hardware used: CPU: i5 2.10GHz x 6, GPU: none, RAM: 16 GB + 32 GB virtual

In [1]:
# load libraries
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Activation, Conv2D, AveragePooling2D, MaxPooling2D, BatchNormalization, GlobalAveragePooling2D, Flatten, Dropout, Dense
from keras.callbacks import ModelCheckpoint
from keras.applications.imagenet_utils import preprocess_input
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import subprocess

Using TensorFlow backend.


In [2]:
# load train files and labels into dataframe
traindf_all = pd.read_csv('train.csv')
print(traindf_all.head())
print(len(traindf_all))

           Image         Id
0  0000e88ab.jpg  w_f48451c
1  0001f9222.jpg  w_c3d896a
2  00029d126.jpg  w_20df2c5
3  00050a15a.jpg  new_whale
4  0005c1ef8.jpg  new_whale
25361


In [3]:
# remove unlabeled images
traindf = traindf_all.drop(traindf_all[traindf_all.Id == 'new_whale'].index.tolist())
traindf.reset_index(drop=True, inplace=True)
del traindf_all
print(len(traindf))

15697


In [4]:
# create dataframe with distinct ids and count of images per id
# (sort=False keeps the ids in order of first appearance, as unique() would)
ids = traindf.groupby('Id', sort=False).size().reset_index(name='Count')
print(ids.head(3))
print(len(ids))

          Id  Count
0  w_f48451c     14
1  w_c3d896a      4
2  w_20df2c5      4
5004


In [None]:
# Get image dimensions and color mode for all training images
traindf['Width'] = 0
traindf['Height'] = 0
traindf['Mode'] = ''
i = 0
for r in traindf.itertuples(): 
    img_name = r.Image 
    img_path = 'train/' + img_name
    img = Image.open(img_path) 
    width, height = img.size
    mode = img.mode
    traindf.loc[i, ['Width', 'Height', 'Mode']] = width, height, mode
    i += 1
print(traindf.head())

***
### Transfer Learning ###

In the following cells, I use the bottleneck features for the ResNet50 pretrained network that I obtained earlier. 

See section **"Obtaining Bottleneck Features"** at the end of the notebook for code and other details.
***

In [20]:
# load previously saved bottleneck features and labels
bnfeatures_train = np.load('tensors/model_3/bnfeatures_train.npy')
tensors_train_labels = np.load('tensors/model_1/tensors_train_labels.npy')

In [21]:
# build the classifier head that goes on top of the pretrained model
INPUT_SHAPE = bnfeatures_train.shape[1:]
model = Sequential()
model.add(Flatten(input_shape=INPUT_SHAPE))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(5004, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 100352)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               25690368  
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 5004)              1286028   
Total params: 26,976,396
Trainable params: 26,976,396
Non-trainable params: 0
_________________________________________________________________
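The parameter counts in the summary can be checked by hand. The short sketch below recomputes them from the bottleneck feature shape `(7, 7, 2048)` and the two layer widths used above. The helper name `dense_params` is mine, not part of Keras.

```python
# Shape of one ResNet50 bottleneck feature map, as printed later in this notebook.
feature_shape = (7, 7, 2048)

# Flatten multiplies the dimensions together.
flat_units = 1
for dim in feature_shape:
    flat_units *= dim

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

dense_1 = dense_params(flat_units, 256)   # Flatten -> Dense(256)
dense_2 = dense_params(256, 5004)         # Dense(256) -> Dense(5004)
total = dense_1 + dense_2

print(flat_units, dense_1, dense_2, total)
print('share of parameters in dense_1: {:.1%}'.format(dense_1 / total))
```

About 95% of the weights sit in the first dense layer, because flattening the 7x7x2048 feature map produces more than 100,000 inputs.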


In [22]:
# train the model
EPOCHS = 10
BATCH_SIZE = 16
checkpointer = ModelCheckpoint(filepath='saved_models/weights.model_3.h5', verbose=1, save_best_only=True)
history = model.fit(
        x=bnfeatures_train,
        y=tensors_train_labels,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        callbacks=[checkpointer],
        validation_split=0.1,
        verbose=1)

Train on 14127 samples, validate on 1570 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 8.46365, saving model to saved_models/weights.model_3.h5
Epoch 2/10

Epoch 00002: val_loss did not improve from 8.46365
Epoch 3/10

Epoch 00003: val_loss did not improve from 8.46365
Epoch 4/10

Epoch 00004: val_loss did not improve from 8.46365
Epoch 5/10

Epoch 00005: val_loss did not improve from 8.46365
Epoch 6/10

Epoch 00006: val_loss did not improve from 8.46365
Epoch 7/10

Epoch 00007: val_loss did not improve from 8.46365
Epoch 8/10

Epoch 00008: val_loss did not improve from 8.46365
Epoch 9/10

Epoch 00009: val_loss did not improve from 8.46365
Epoch 10/10

Epoch 00010: val_loss did not improve from 8.46365
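To put the validation loss in context, it helps to compare it with the categorical cross-entropy of a model that assigns the same probability to each of the 5004 classes. The sketch below only does that arithmetic. Reading the small gap as "the network learned little beyond class frequencies" is my interpretation, not something the log proves.

```python
import math

n_classes = 5004          # distinct whale ids in the training set
best_val_loss = 8.46365   # best validation loss reported in the log above

# Cross-entropy of a uniform prediction: -log(1 / n_classes)
uniform_loss = -math.log(1.0 / n_classes)

print('uniform baseline: {:.3f}'.format(uniform_loss))
print('improvement over baseline: {:.3f}'.format(uniform_loss - best_val_loss))
```

The best epoch is only marginally below the uniform baseline, and later epochs never improve on it.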


In [23]:
# load best weights
model.load_weights('saved_models/weights.model_3.h5')

In [26]:
# load previously saved bottleneck features of testing set
bnfeatures_test = np.load('tensors/model_3/bnfeatures_test.npy')

In [27]:
# make predictions
predictions = model.predict(bnfeatures_test, verbose=1)



In [28]:
labels = np.array(ids.Id)

In [29]:
# get 5 best predictions per image and decode to whale ids
# insert "new_whale" where prediction probability drops below 10% 
testdf['Id'] = ''
for i, pred in enumerate(predictions):
    inx = np.argsort(pred)[-5:][::-1].tolist()
    preds = labels[inx].tolist()
    probs = pred[inx]
    try:
        # get index of first prediction with prob less than 10%
        j = (probs < 0.1).tolist().index(True)
        # enter "new_whale" in that index, and shift any remaining preds to right
        for ii in range(4, (j-1), -1):
            if ii==j:
                preds[ii] = 'new_whale'
            else:
                preds[ii] = preds[ii-1]
    except ValueError:
        pass
    testdf.loc[i,'Id'] = ' '.join(preds)
print(testdf.head())

           Image                                                 Id
0  21253f840.jpg  new_whale w_23a388d w_9b5109b w_0369a5c w_3de579a
1  769f8d32b.jpg  new_whale w_23a388d w_9b5109b w_0369a5c w_3de579a
2  a69dc856e.jpg  new_whale w_23a388d w_9b5109b w_0369a5c w_3de579a
3  79bee536e.jpg  new_whale w_23a388d w_9b5109b w_0369a5c w_3de579a
4  7eb9a6f1b.jpg  new_whale w_23a388d w_9b5109b w_0369a5c w_3de579a
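The insertion logic in the cell above can also be written as list slicing, which may be easier to check. This is a minimal sketch that should behave the same way as the loop. The function name `top5_with_new_whale` and the toy `labels` and `pred` arrays are made up for illustration.

```python
import numpy as np

def top5_with_new_whale(pred, labels, threshold=0.1):
    """Return five ids for one probability vector, best first.

    'new_whale' is inserted in front of the first top-5 candidate whose
    probability is below `threshold`; later ids shift right and the last
    one is dropped.
    """
    order = np.argsort(pred)[-5:][::-1]       # indices of the 5 highest probabilities
    ids = [str(labels[k]) for k in order]
    probs = pred[order]
    below = np.flatnonzero(probs < threshold)
    if below.size:
        j = int(below[0])
        ids = ids[:j] + ['new_whale'] + ids[j:4]
    return ids

# Toy example with six hypothetical classes.
labels = np.array(['w_a', 'w_b', 'w_c', 'w_d', 'w_e', 'w_f'])
pred = np.array([0.04, 0.50, 0.02, 0.30, 0.08, 0.06])
example = top5_with_new_whale(pred, labels)
print(' '.join(example))
```

In the notebook this would be called once per row of `predictions`, with `labels` taken from `ids.Id`.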


In [30]:
# write to file and submit to Kaggle
testdf.to_csv('submissions/submit_0120_01.txt', index=False)

This submission scored 0.277, worse than my 1st model's score of 0.281.

This model makes nearly the same prediction for every image. The Ids it predicts are those with the largest number of images in the training set: Id "w_23a388d" has 73 images, "w_9b5109b" has 65, etc. The model is clearly biased towards the whales it has seen the most.
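One way to quantify this collapse is to count how many distinct prediction strings the model produced and compare the predicted ids with the most frequent training ids. The helper below is a hypothetical diagnostic, shown on toy data rather than on the real submission.

```python
from collections import Counter

def collapse_report(prediction_rows, train_ids, k=3):
    """Summarise how concentrated a set of predictions is.

    prediction_rows: list of space-separated top-5 id strings, one per test image
    train_ids: list of ids, one per training image
    """
    row_counts = Counter(prediction_rows)
    _, n_most_common = row_counts.most_common(1)[0]
    return {
        'distinct_rows': len(row_counts),
        'share_of_most_common_row': n_most_common / len(prediction_rows),
        'most_frequent_train_ids': [i for i, _ in Counter(train_ids).most_common(k)],
    }

# Toy data (hypothetical ids, not taken from the competition files).
toy_rows = ['new_whale w_x w_y w_z w_q'] * 4 + ['w_y new_whale w_x w_z w_q']
toy_train_ids = ['w_x'] * 5 + ['w_y'] * 3 + ['w_z']
report = collapse_report(toy_rows, toy_train_ids, k=2)
print(report)
```

With the notebook's dataframes, the call would be `collapse_report(testdf['Id'].tolist(), traindf['Id'].tolist())`.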

***
### Obtaining Bottleneck Features ###
***

In [8]:
# load pretrained model without its fully connected top (include_top=False)
from keras.applications.resnet50 import ResNet50
from keras.applications.resnet50 import preprocess_input
model_resnet50 = ResNet50(weights='imagenet', include_top=False)

In [9]:
# function to convert images to tensors
# (same as created in Model 1)
def imgs_to_tensors(df, path, size=(100, 100)):
    '''
    df: dataframe listing image file names in column "Image"
    path: directory where image files are located (don't include /)
    size: target height and width to resize images to
    '''
    HEIGHT, WIDTH = size
    LEN=df.shape[0]   
    tensors = np.zeros((LEN, HEIGHT, WIDTH, 3))
    i = 0
    for im_name in df.Image:
        if (i%1000==0):
            print('Processing image {}: {}'.format(i, im_name))
        im_path = path + '/' + im_name
        # load image to PIL format
        im = image.load_img(path=im_path,
                            color_mode='rgb',
                            target_size=(HEIGHT, WIDTH),
                            interpolation='nearest')
        # convert to numpy array/tensor with shape (HEIGHT, WIDTH, 3)
        x = image.img_to_array(im)
        x = preprocess_input(x)  # RGB -> BGR and subtract ImageNet channel means, as the pretrained weights expect
        tensors[i] = x
        i += 1   
    return tensors
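The `preprocess_input` call matters because the ImageNet weights were trained on inputs prepared in a specific way. For ResNet50, Keras uses its "caffe" mode: channels are reordered from RGB to BGR and the ImageNet mean of each channel is subtracted, with no scaling. The sketch below is my own NumPy re-implementation of that idea for illustration, not the Keras code. It also shows what happens to the value range when the result is divided by 255 afterwards, as the tensor-creation cells in this notebook do.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by Keras' 'caffe' mode.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(rgb):
    """Illustrative re-implementation: RGB -> BGR, then subtract channel means."""
    bgr = rgb[..., ::-1].astype('float64')
    return bgr - IMAGENET_BGR_MEAN

# A hypothetical 2x2 RGB image with pixel values in [0, 255].
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [128, 128, 128]]], dtype='float64')

x = caffe_style_preprocess(img)
x_scaled = x / 255

print('after preprocessing:', x.min(), x.max())
print('after extra /255:   ', x_scaled.min(), x_scaled.max())
```

After the extra division the inputs span well under one unit instead of a couple of hundred, which is not the range the pretrained network saw during training. That may be one reason the bottleneck features turned out to be uninformative, although I have not verified this.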

In [10]:
# create training tensors and save on disk
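# (224x224 is the input size ResNet50's ImageNet weights were trained with)
# (divide by 255 to scale pixel values; note that imgs_to_tensors has already
# applied preprocess_input, so this rescales the mean-centered values)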
tensors_train = imgs_to_tensors(df=traindf, path='train', size=(224, 224))/255
np.save('tensors/model_3/tensors_train', tensors_train)
print(tensors_train.shape)

Processing image 0: 0000e88ab.jpg
Processing image 1000: 10b694367.jpg
Processing image 2000: 21e28ae02.jpg
Processing image 3000: 32533a7fb.jpg
Processing image 4000: 42f134dea.jpg
Processing image 5000: 5297b6c40.jpg
Processing image 6000: 6311688b7.jpg
Processing image 7000: 7390cbfab.jpg
Processing image 8000: 83336c385.jpg
Processing image 9000: 92f450203.jpg
Processing image 10000: a39babc55.jpg
Processing image 11000: b36da6f7c.jpg
Processing image 12000: c4160ee65.jpg
Processing image 13000: d3b15e280.jpg
Processing image 14000: e3fe27a84.jpg
Processing image 15000: f3f3f8b92.jpg
(15697, 224, 224, 3)
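The size of this array explains why virtual memory was needed on a machine with 16 GB of RAM. The estimate below assumes the default `float64` dtype of `np.zeros`, which is what `imgs_to_tensors` uses.

```python
shape = (15697, 224, 224, 3)   # training tensor shape printed above

n_values = 1
for dim in shape:
    n_values *= dim

bytes_float64 = n_values * 8   # np.zeros defaults to float64
bytes_float32 = n_values * 4   # what the same array would take as float32

print('{:.1f} GB as float64'.format(bytes_float64 / 1e9))
print('{:.1f} GB as float32'.format(bytes_float32 / 1e9))
```

Dividing the whole array by 255 creates a second array of the same size while the first is still alive, so peak usage is roughly double the float64 figure. Allocating with `dtype='float32'` would likely halve both numbers.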


In [11]:
# create bottleneck features from training tensors, and save on disk
bnfeatures_train = model_resnet50.predict(tensors_train, verbose=1)
np.save('tensors/model_3/bnfeatures_train', bnfeatures_train)
print(bnfeatures_train.shape)

(15697, 7, 7, 2048)


In [13]:
# load test files into dataframe
filelist = os.listdir('test')
testdf = pd.DataFrame(filelist, columns=['Image'])
print(testdf.head(3))
print(len(testdf))

           Image
0  21253f840.jpg
1  769f8d32b.jpg
2  a69dc856e.jpg
7960


In [14]:
# create testing tensors, and save on disk
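# (224x224 is the input size ResNet50's ImageNet weights were trained with)
# (divide by 255 to scale pixel values; note that imgs_to_tensors has already
# applied preprocess_input, so this rescales the mean-centered values)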
tensors_test = imgs_to_tensors(df=testdf, path='test', size=(224, 224))/255
np.save('tensors/model_3/tensors_test', tensors_test)
print(tensors_test.shape)

Processing image 0: 21253f840.jpg
Processing image 1000: b14876130.jpg
Processing image 2000: 8d8c7a728.jpg
Processing image 3000: ca3921cc1.jpg
Processing image 4000: e23615e20.jpg
Processing image 5000: 0e5538c86.jpg
Processing image 6000: 3234bf468.jpg
Processing image 7000: 465b5b1ab.jpg
(7960, 224, 224, 3)


In [15]:
# create bottleneck features from testing tensors, and save on disk
bnfeatures_test = model_resnet50.predict(tensors_test, verbose=1)
np.save('tensors/model_3/bnfeatures_test', bnfeatures_test)
print(bnfeatures_test.shape)

(7960, 7, 7, 2048)
