## Model 2: CNN with image augmentation ##

In this notebook, I use image augmentation with the CNN from Model 1.

Predictions made using this model scored 0.276 on Kaggle, slightly worse than the Model 1 result of 0.283 (submitted on Jan 18, 2019).

Hardware used: CPU: i5 2.10 GHz x 6; GPU: none; RAM: 16 GB + 32 GB virtual

In [1]:
# load libraries
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Activation, Conv2D, AveragePooling2D, MaxPooling2D, BatchNormalization, GlobalAveragePooling2D, Flatten, Dropout, Dense
from keras.callbacks import ModelCheckpoint
from keras.applications.imagenet_utils import preprocess_input
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import subprocess

Using TensorFlow backend.


In [2]:
# load train files and labels into dataframe
traindf_all = pd.read_csv('train.csv')
print(traindf_all.head())
print(len(traindf_all))

           Image         Id
0  0000e88ab.jpg  w_f48451c
1  0001f9222.jpg  w_c3d896a
2  00029d126.jpg  w_20df2c5
3  00050a15a.jpg  new_whale
4  0005c1ef8.jpg  new_whale
25361


In [3]:
# remove unlabeled images
traindf = traindf_all.drop(traindf_all[traindf_all.Id == 'new_whale'].index.tolist())
traindf.reset_index(drop=True, inplace=True)
del traindf_all
print(len(traindf))

15697


In [4]:
# create dataframe with distinct ids and count of images per id
# (sort=False keeps ids in order of first appearance, like unique())
ids = traindf.groupby('Id', sort=False).size().reset_index(name='Count')
print(ids.head(3))
print(len(ids))

          Id  Count
0  w_f48451c     14
1  w_c3d896a      4
2  w_20df2c5      4
5004


In [4]:
# Get image dimensions and color mode for all training images
traindf['Width'] = 0
traindf['Height'] = 0
traindf['Mode'] = ''
i = 0
for r in traindf.itertuples(): 
    img_name = r.Image 
    img_path = 'train/' + img_name
    img = Image.open(img_path) 
    width, height = img.size
    mode = img.mode
    traindf.loc[i, ['Width', 'Height', 'Mode']] = width, height, mode
    i += 1
print(traindf.head())

           Image         Id  Width  Height Mode
0  0000e88ab.jpg  w_f48451c   1050     700  RGB
1  0001f9222.jpg  w_c3d896a    758     325  RGB
2  00029d126.jpg  w_20df2c5   1050     497  RGB
3  000a6daec.jpg  w_dd88965   1050     458  RGB
4  0016b897a.jpg  w_64404ac   1050     450  RGB


***
### Using augmented images ###

In the following cells, I proceed with the dataset <code>subset</code>, which uses augmented images in addition to existing ones. This dataset has exactly 5 images per whale Id (25,020 images in total), drawn from a combination of existing images and new images obtained using random augmentation. Where an Id had fewer than 5 images, I filled the gap with augmented images; where an Id had more than 5 images, I dropped all but 5 of the existing ones.

See section **"Implementing Image Augmentation"** at the end of this notebook for code and other details. Briefly, I created 484,703 new images for a total of 500,400 old *and* new images, resulting in exactly 100 images per Id.
***
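
The image counts quoted in this section follow from two figures printed earlier in the notebook: 5,004 distinct Ids and 15,697 labeled training images. A quick sanity check of that arithmetic, with variable names chosen here purely for illustration:

```python
# Figures taken from the cell outputs earlier in this notebook
n_ids = 5004          # distinct whale Ids after dropping "new_whale"
n_original = 15697    # labeled training images

# Full augmented set: 100 images per Id
n_total_big = n_ids * 100
n_new = n_total_big - n_original

# Training subset: 5 images per Id
n_subset = n_ids * 5

print(n_total_big, n_new, n_subset)  # 500400 484703 25020
```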

In [5]:
# load subset
subset = pd.read_csv('subset.csv')
print(subset.head())
print(len(subset))

           Image         Id  New
0  0000e88ab.jpg  w_f48451c    0
1  0001f9222.jpg  w_c3d896a    0
2  00029d126.jpg  w_20df2c5    0
3  000a6daec.jpg  w_dd88965    0
4  0016b897a.jpg  w_64404ac    0
25020


In [None]:
# function to convert images to tensors
def imgs_to_tensors(df, path, size=(100, 100)):
    '''
    df: dataframe listing image file names in column "Image"
    path: directory where image files are located (don't include /)
    size: target height and width to resize images to
    '''
    HEIGHT, WIDTH = size
    LEN=df.shape[0]   
    tensors = np.zeros((LEN, HEIGHT, WIDTH, 3))
    i = 0
    for im_name in df.Image:
        if (i%1000==0):
            print('Processing image {}: {}'.format(i, im_name))
        im_path = path + '/' + im_name
        # load image to PIL format
        im = image.load_img(path=im_path, 
                            grayscale=False, 
                            color_mode='rgb', 
                            target_size=(HEIGHT, WIDTH), 
                            interpolation='nearest')
        # convert to numpy array/tensor with shape (HEIGHT, WIDTH, 3)
        x = image.img_to_array(im)
        # default "caffe" mode: reorder RGB -> BGR and subtract the ImageNet channel means
        x = preprocess_input(x)
        tensors[i] = x
        i += 1   
    return tensors
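
For context on the `preprocess_input` call: with its default arguments, the `keras.applications.imagenet_utils` version works in "caffe" mode, which reorders the channels from RGB to BGR and subtracts the per-channel ImageNet means without any scaling. The NumPy sketch below mimics that behaviour for channels-last input. `caffe_style_preprocess` is a hypothetical name used here for illustration and is not part of Keras.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by Keras "caffe" mode
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(x):
    """Mimic preprocess_input's default mode for an RGB array shaped (..., 3)."""
    x = np.asarray(x, dtype='float64')
    x = x[..., ::-1]                 # RGB -> BGR
    return x - IMAGENET_BGR_MEANS    # zero-center each channel

# A 1x1 "image" holding a single pure-white pixel
pixel = np.full((1, 1, 3), 255.0)
out = caffe_style_preprocess(pixel)
print(out)        # mean-centered BGR values
print(out / 255)  # the extra scaling this notebook applies afterwards
```

Under these assumptions, dividing the result by 255 afterwards leaves values in roughly the range -0.49 to 0.59 rather than 0 to 1.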

In [None]:
# create tensors of training images and save on disk
# (divide by 255 to rescale pixel values; preprocess_input has already mean-centered them)
tensors_train = imgs_to_tensors(df=subset, path='augmented/train')/255
np.save('tensors/model_2/tensors_train', tensors_train)
print(tensors_train.shape)

In [6]:
labels = np.array(ids.Id)

In [None]:
# create labels of training images and save on disk
tensors_train_labels = np.zeros((len(subset), len(ids)))
i = 0
for id in subset.Id:
    j = np.argwhere(labels==id)[0, 0]
    tensors_train_labels[i, j] = 1
    i += 1
np.save('tensors/model_2/tensors_train_labels', tensors_train_labels)
print(tensors_train_labels.shape)

In [7]:
# load previously saved tensors and labels, if any
tensors_train = np.load('tensors/model_2/tensors_train.npy')
tensors_train_labels = np.load('tensors/model_2/tensors_train_labels.npy')

In [8]:
# build basic model
# (similar to one described in Lesson 2.18 in Deep Learning section of ML Nanodegree)

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu',
                       input_shape=(tensors_train.shape[1], tensors_train.shape[2], 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(tensors_train_labels.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 100, 100, 16)      208       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 50, 50, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 50, 50, 32)        2080      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 25, 25, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 25, 25, 64)        8256      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 9216)              0         
__________

In [9]:
# train the model
EPOCHS = 10
BATCH_SIZE = 16
checkpointer = ModelCheckpoint(filepath='saved_models/weights.model_2.h5', verbose=1, save_best_only=True)
history = model.fit(
        x=tensors_train,
        y=tensors_train_labels,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        callbacks=[checkpointer],
        validation_split=0.1,
        verbose=1)

Train on 22518 samples, validate on 2502 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 8.96174, saving model to saved_models/weights.model_2.h5
Epoch 2/10

Epoch 00002: val_loss did not improve from 8.96174
Epoch 3/10

Epoch 00003: val_loss did not improve from 8.96174
Epoch 4/10

Epoch 00004: val_loss did not improve from 8.96174
Epoch 5/10

Epoch 00005: val_loss did not improve from 8.96174
Epoch 6/10

Epoch 00006: val_loss did not improve from 8.96174
Epoch 7/10

Epoch 00007: val_loss did not improve from 8.96174
Epoch 8/10

Epoch 00008: val_loss did not improve from 8.96174
Epoch 9/10

Epoch 00009: val_loss did not improve from 8.96174
Epoch 10/10

Epoch 00010: val_loss did not improve from 8.96174


In [10]:
# load best weights
model.load_weights('saved_models/weights.model_2.h5')

In [11]:
# load test files into dataframe
filelist = os.listdir('test')
testdf = pd.DataFrame(filelist, columns=['Image'])
print(testdf.head(3))
print(len(testdf))

           Image
0  21253f840.jpg
1  769f8d32b.jpg
2  a69dc856e.jpg
7960


In [None]:
# create tensors for test images and save on disk
tensors_test = imgs_to_tensors(df=testdf, path='test')/255
np.save('tensors/model_2/tensors_test', tensors_test)
print(tensors_test.shape)

In [12]:
# load previously saved test tensors, if any
tensors_test = np.load('tensors/model_2/tensors_test.npy')

In [13]:
# make predictions
predictions = model.predict(tensors_test, verbose=1)



In [14]:
# get 5 best predictions per image and decode to whale ids
# insert "new_whale" at the first position where prediction probability drops below 10%
testdf['Id'] = ''
for i, pred in enumerate(predictions):
    inx = np.argsort(pred)[-5:][::-1].tolist()
    preds = labels[inx].tolist()
    probs = pred[inx]
    # positions (best first) of predictions with prob less than 10%
    low = np.flatnonzero(probs < 0.1)
    if low.size:
        # enter "new_whale" at the first such position, shift remaining preds right, keep 5
        preds.insert(int(low[0]), 'new_whale')
        preds = preds[:5]
    testdf.loc[i,'Id'] = ' '.join(preds)
print(testdf.head(10))

           Image                                                 Id
0  21253f840.jpg  new_whale w_022b708 w_4ba728f w_686c0b3 w_aa3489d
1  769f8d32b.jpg  new_whale w_686c0b3 w_022b708 w_aa3489d w_4ba728f
2  a69dc856e.jpg  new_whale w_022b708 w_4ba728f w_686c0b3 w_aa3489d
3  79bee536e.jpg  new_whale w_022b708 w_686c0b3 w_4ba728f w_aa3489d
4  7eb9a6f1b.jpg  new_whale w_022b708 w_4ba728f w_686c0b3 w_aa3489d
5  8e0a9e74b.jpg  new_whale w_686c0b3 w_022b708 w_c06798f w_a6067a9
6  4853537ad.jpg  new_whale w_022b708 w_4ba728f w_686c0b3 w_aa3489d
7  8cba4a867.jpg  new_whale w_686c0b3 w_022b708 w_aa3489d w_4ba728f
8  8da08a11a.jpg  new_whale w_022b708 w_686c0b3 w_4ba728f w_aa3489d
9  48a937823.jpg  new_whale w_022b708 w_686c0b3 w_4ba728f w_aa3489d
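
The decoding rule used in the cell above can be illustrated on a single made-up probability vector. This is a simplified sketch: `decode_top5`, the toy labels, and the probabilities are hypothetical, and it uses plain Python lists instead of NumPy arrays.

```python
def decode_top5(probs, labels, threshold=0.1, k=5):
    """Return the k best labels, inserting 'new_whale' before the first
    prediction whose probability is below `threshold`."""
    # indices of the k highest probabilities, best first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    preds = [labels[i] for i in order]
    for pos, i in enumerate(order):
        if probs[i] < threshold:
            preds.insert(pos, 'new_whale')  # later predictions shift right
            break
    return preds[:k]

toy_labels = ['w_a', 'w_b', 'w_c', 'w_d', 'w_e', 'w_f']
toy_probs = [0.06, 0.60, 0.02, 0.25, 0.03, 0.04]
print(decode_top5(toy_probs, toy_labels))
# ['w_b', 'w_d', 'new_whale', 'w_a', 'w_f']
```

Under this rule, a row that begins with `new_whale`, as every row in the output above does, means that even the top-ranked probability for that image was below 10%.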


In [15]:
# write to file and submit to Kaggle
testdf.to_csv('submissions/submit_0118_04.txt', index=False)

This submission scored 0.276 on Kaggle, slightly worse than the Model 1 result of 0.281.
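
To interpret these scores, it can help to see how a top-5 ranking might be graded. The sketch below assumes the competition metric is mean average precision at 5 (MAP@5), which this notebook does not state explicitly. `map_at_5` and the toy data are hypothetical.

```python
def map_at_5(true_ids, predicted_lists):
    """Mean average precision at 5 when each image has exactly one correct Id."""
    total = 0.0
    for true_id, preds in zip(true_ids, predicted_lists):
        for rank, pred in enumerate(preds[:5], start=1):
            if pred == true_id:
                total += 1.0 / rank  # credit decays with rank; only the first hit counts
                break
    return total / len(true_ids)

truth = ['w_a', 'new_whale', 'w_c']
preds = [
    ['w_a', 'w_b', 'w_c', 'w_d', 'w_e'],        # correct at rank 1 -> 1.0
    ['w_x', 'new_whale', 'w_y', 'w_z', 'w_q'],  # correct at rank 2 -> 0.5
    ['w_d', 'w_e', 'w_f', 'w_g', 'w_h'],        # never correct     -> 0.0
]
print(map_at_5(truth, preds))  # 0.5
```

If that assumption holds, the position of `new_whale` within each prediction list directly affects the score, because a correct guess earns less credit the further down the list it appears.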

***
### Implementing Image Augmentation ###

This section contains my code for creating new images using randomized image augmentation. My goal was to have 100 images (old and new) for each class. So, if a class had only 1 image, I created 99 new images. If it had 10 images, I created 90, and so on. When creating new images, I used all existing images for the class equally (or as equally as possible).
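
The allocation rule described above amounts to an integer division, with the remainder spread over randomly chosen images. A minimal sketch, where `augment_counts` is a hypothetical helper and not the notebook's own code:

```python
import random

def augment_counts(n_existing, target=100, rng=random):
    """Return how many new images to derive from each existing image of one class."""
    needed = max(target - n_existing, 0)
    base, extra = divmod(needed, n_existing)
    counts = [base] * n_existing
    # hand the leftover images to randomly chosen originals, one each
    for i in rng.sample(range(n_existing), extra):
        counts[i] += 1
    return counts

print(augment_counts(1))          # [99]
print(sum(augment_counts(10)))    # 90
print(sorted(augment_counts(3)))  # [32, 32, 33]
```

For a class with 3 images, 97 new images are needed, so two originals are augmented 32 times and one is augmented 33 times.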

**Note** that running the next few cells takes a very long time (over 20 hours). But it only needs to be done once because the results are saved to disk.

In [None]:
# get count of new images to create/augment per existing image

n = 100 # target number of total (original + augmented) imgs
traindf['Augment'] = 0

for r in ids.itertuples():
    id = r.Id
    cnt = r.Count
    aug_per_img = (n-cnt) // cnt
    indx = traindf[traindf['Id'] == id].index.tolist()
    if aug_per_img:
        traindf.loc[indx, 'Augment'] = aug_per_img
    total = cnt*aug_per_img + cnt
    short = n-total
    if short:
        indx_add = np.random.choice(indx, size=short, replace=False).tolist()
        traindf.loc[indx_add, 'Augment'] = aug_per_img + 1

print(traindf.head())

In [None]:
# create new images
# (careful, this cell will run about 20 hrs)

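# guard against running this cell by accident; remove the next line to run it
raise RuntimeError('augmentation cell is disabled: it takes about 20 hours to run')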

datagen = ImageDataGenerator(
        rotation_range=30,
        width_shift_range=0.1,
        height_shift_range=0.1,
        shear_range=20,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

# create new dataframe to hold results
traindf_big = pd.DataFrame(index=range(0,500400), columns=['Image','Id','Test','Width','Height','Mode','New'])
traindf_big.Test = 0
traindf_big.Width = 0
traindf_big.Height = 0
traindf_big.New = 0

print('please wait...')
cnt_new = 0
cnt_old = 0
i = 0
for r in traindf.itertuples():
    cnt_old += 1
    if cnt_old%1000 == 0:
        print(cnt_old)
    im = r.Image
    id = r.Id
    wi = r.Width
    he = r.Height
    mo = r.Mode
    au = r.Augment
    traindf_big.loc[i,'Image'] = im
    traindf_big.loc[i,'Id'] = id
    traindf_big.loc[i,'Width'] = wi
    traindf_big.loc[i,'Height'] = he
    traindf_big.loc[i,'Mode'] = mo
    i += 1
    if au:
        im_path = 'train/' + im
        img = Image.open(im_path)
        x = image.img_to_array(img)  # this is a Numpy array with shape (height, width, 3)
        x = x.reshape((1,) + x.shape)  # this is a Numpy array with shape (1, height, width, 3)
        j = 1 # not a typo
        # generate batches of randomly transformed images and save on disk
        for batch in datagen.flow(x, batch_size=1, save_to_dir='augmented/temp', save_format='jpeg'):
            cnt_new += 1
            # move file and give new name
            temf = os.listdir('augmented/temp/')[0]
            temf_path = 'augmented/temp/' + temf
            newf = 'aug{:0>6}.jpg'.format(cnt_new)
            newf_path = 'augmented/train/' + newf
            os.replace(temf_path, newf_path)
            traindf_big.loc[i,'Image'] = newf
            traindf_big.loc[i,'Id'] = id
            traindf_big.loc[i,'Width'] = wi
            traindf_big.loc[i,'Height'] = he
            traindf_big.loc[i,'Mode'] = mo
            traindf_big.loc[i,'New'] = 1
            # print('created a new image ' + str(j) + ' ' + newf)
            i += 1
            j += 1
            if j > au:
                break

# save the dataframe to file
traindf_big.to_csv('train_big.csv', index=False)

In [None]:
# load previously saved dataframe, if any
traindf_big = pd.read_csv('train_big.csv')

In [None]:
# let's see an example of augmented image, along with the existing image it was based on
fig, ax = plt.subplots(1, 2, figsize=(20, 10))
img_old = Image.open('train/9ab65fac4.jpg')
img_new = Image.open('augmented/train/aug295753.jpg')
ax[0].imshow(img_old)
ax[1].imshow(img_new)

In [None]:
# select 5 imgs per label for training into new dataframe
# select imgs randomly, but prefer old imgs to new/augmented
# expect 5004 * 5 = 25020 rows

traindf_big['Subset'] = 0

for r in ids.itertuples():
    id = r.Id
    cnt = r.Count
    inx = traindf_big[(traindf_big.Id == id) & (traindf_big.New == 0)].index.tolist()
    inx = random.sample(inx, min(len(inx), 5)) 
    inx_new = traindf_big[(traindf_big.Id == id) & (traindf_big.New == 1)].index.tolist()
    inx_new = random.sample(inx_new, 4)
    inx.extend(inx_new) # index of imgs for given Id, with old imgs listed first
    inx_5 = inx[:5]
    traindf_big.loc[inx_5, 'Subset'] = 1

subset = traindf_big[traindf_big.Subset==1].copy()
subset.drop(['Test', 'Width', 'Height', 'Mode', 'Subset'], axis=1, inplace=True)
subset.reset_index(drop=True, inplace=True)
subset.to_csv('subset.csv', index=False)
del traindf_big
del subset
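
The selection rule in the cell above (use existing images first, then fill up to 5 with augmented ones) can be shown for a single Id. This is a simplified sketch with hypothetical names and file lists. Unlike the cell above, which always samples 4 augmented images and then truncates the combined list to 5, it samples only as many augmented images as are missing.

```python
import random

def pick_five(old_images, new_images, k=5, rng=random):
    """Pick k images for one Id, using originals first and augmented ones as filler."""
    chosen = rng.sample(old_images, min(len(old_images), k))
    if len(chosen) < k:
        chosen += rng.sample(new_images, k - len(chosen))
    return chosen

old = ['a.jpg', 'b.jpg']
new = ['aug000001.jpg', 'aug000002.jpg', 'aug000003.jpg', 'aug000004.jpg']
picked = pick_five(old, new)
print(picked)  # both originals, followed by three randomly chosen augmented images
```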

In [None]:
# load previously saved dataframe, if any
subset = pd.read_csv('subset.csv')