# Kaggle State Farm Competition 

In this notebook we will create an entry for the State Farm Kaggle Competition. It is based on the concepts presented in lessons 1-3 from the fast.ai deep learning course.

First, we import the Keras layers and optimizer we will need, and then define a helper function that loads batches of images using the Keras preprocessing module.


In [1]:
from keras.models import Sequential
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam
import numpy as np

Using Theano backend.
Using cuDNN version 5005 on context None
Mapped name None to device cuda: GeForce GTX 960M (0000:01:00.0)


In [15]:
from keras.preprocessing import image, sequence

def get_batches(dirname, gen=image.ImageDataGenerator(), shuffle=True, batch_size=12, class_mode='categorical',
                target_size=(224,224)):
    return gen.flow_from_directory(dirname, target_size=target_size,
            class_mode=class_mode, shuffle=shuffle, batch_size=batch_size)

In [10]:
import os
DATA_DIR = os.path.join(os.getcwd(),'data','sample')
train_batches = get_batches(os.path.join(DATA_DIR,'train'))
val_batches = get_batches(os.path.join(DATA_DIR,'valid'),shuffle=False)

Found 3500 images belonging to 10 classes.
Found 2138 images belonging to 10 classes.


Next, the class labels of the training and validation sets need to be one-hot encoded so that they match the output format of our network.

In [None]:
from sklearn.preprocessing import OneHotEncoder
def one_hot(x): return np.array(OneHotEncoder().fit_transform(x.reshape(-1,1)).todense())

train_labels = one_hot(train_batches.classes)
val_labels = one_hot(val_batches.classes)
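
As an illustration of what one-hot encoding produces, here is a minimal NumPy-only sketch. The helper name `one_hot_np` and the sample labels are hypothetical and not part of the original notebook. Unlike the `OneHotEncoder` call above, it takes the number of classes explicitly, so a class that happens to be missing from the data still gets a column.

```python
import numpy as np

def one_hot_np(class_ids, num_classes):
    """Return a (n_samples, num_classes) one-hot matrix for integer class ids."""
    class_ids = np.asarray(class_ids, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing the identity matrix with the ids yields one row per sample.
    return np.eye(num_classes)[class_ids]

labels = np.array([0, 2, 1, 2])            # hypothetical class indices
encoded = one_hot_np(labels, num_classes=3)
print(encoded)
```

Each row contains a single 1 in the column of its class, which is the format a softmax output layer is compared against.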

A CNN based on the VGG16 architecture will be our main model. We use a class that wraps the VGG16 architecture together with its pretrained weights, so we do not need to rebuild it ourselves.

In [None]:
from vgg16 import Vgg16

In [None]:
vgg = Vgg16()
model = vgg.model
model.output_shape

Since VGG16 was originally trained to distinguish among 1000 classes and our dataset contains only 10, the last dense layer of the network needs to be replaced. The remaining layers are set as non-trainable, so that they keep their original weights.

In [None]:
model.pop()
for layer in model.layers: layer.trainable = False
model.add(Dense(10,activation='softmax'))
model.output_shape
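
To get a feel for how little of the network is actually trained after this change, the following sketch counts the parameters of the old and new output layers. It assumes the layer feeding the classifier outputs 4096 features, as in the standard VGG16 architecture; `dense_params` and `FC_WIDTH` are names introduced here for illustration only.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters in a dense layer: weights plus biases."""
    return n_in * n_out + n_out

FC_WIDTH = 4096  # assumed width of the last hidden fully connected layer in VGG16

original_head = dense_params(FC_WIDTH, 1000)  # ImageNet classifier
new_head = dense_params(FC_WIDTH, 10)         # our 10-class classifier

print(original_head)  # 4097000
print(new_head)       # 40970
```

With every other layer frozen, only the parameters of the new 10-way layer are updated during training.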

The model is then compiled with Keras.

In [None]:
vgg.compile(lr=0.0001)

Everything is set to start training the model.

In [None]:
model.fit_generator(train_batches, samples_per_epoch=train_batches.nb_sample, nb_epoch=5,
                validation_data=val_batches, nb_val_samples=val_batches.nb_sample)

Training reached an accuracy of 0.8589 on the training data but only 0.4471 on the validation data. The validation result is poor, so we will try other approaches to tackle this problem.

## Architecture from scratch

Since fine-tuning VGG16 did not produce good results (probably because the ImageNet task differs so much from ours), we will try training a CNN from scratch.

First, let's try an architecture based on 2 conv layers with batchnorm and maxpooling followed by a fully connected layer.

In [7]:
def conv1(batches,val_batches):
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Convolution2D(64,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dense(10, activation='softmax')
        ])

    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    model.optimizer.lr = 0.001
    model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    return model
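
To see where most of this model's parameters end up, the sketch below traces the spatial size of the feature maps through the layers. It assumes the Keras 1 defaults for these layers: unpadded ("valid") convolutions with stride 1, and pooling with a stride equal to the pool size. The helper names are introduced here for illustration.

```python
def conv_out(size, kernel):
    """Output size of an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output size of non-overlapping pooling (partial windows are dropped)."""
    return size // pool

size = 224
size = conv_out(size, 3)   # after Convolution2D(32,3,3): 222
size = pool_out(size, 3)   # after MaxPooling2D((3,3)):   74
size = conv_out(size, 3)   # after Convolution2D(64,3,3): 72
size = pool_out(size, 3)   # after MaxPooling2D((3,3)):   24

flat = 64 * size * size              # Flatten(): 64 channels of 24x24
dense_weights = flat * 200 + 200     # Dense(200): weights plus biases

print(flat)            # 36864
print(dense_weights)   # 7373000
```

Under these assumptions the first dense layer alone holds over seven million parameters, far more than the two convolutional layers combined, which is one reason a model like this can memorize a small training sample easily.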

In [5]:
conv1(train_batches,val_batches)

Epoch 1/2
Epoch 2/2
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.models.Sequential at 0x3127e550>

The validation accuracy is still much lower than the training accuracy. Let's try adding some dropout to reduce this overfitting.

In [4]:
def conv1(batches,val_batches):
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Convolution2D(64,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dropout(0.5),
            Dense(10, activation='softmax')
        ])

    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    model.optimizer.lr = 0.001
    model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    return model

In [5]:
model = conv1(train_batches,val_batches)
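
For readers unfamiliar with dropout, here is a minimal NumPy sketch of the "inverted dropout" idea: during training a random subset of activations is zeroed, and the survivors are scaled up so that the expected activation stays the same. This illustrates the technique only; it is not Keras's actual implementation, and `inverted_dropout` is a name introduced here.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each activation with probability `rate`, rescale the rest by 1/(1-rate)."""
    keep_prob = 1.0 - rate
    # Boolean mask: True where the activation is kept.
    mask = rng.random_sample(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.RandomState(0)
activations = np.ones((1000, 200))     # stand-in for the output of Dense(200)
dropped = inverted_dropout(activations, rate=0.5, rng=rng)

print(np.unique(dropped))   # only 0.0 and 2.0 appear
print(dropped.mean())       # close to 1.0, the original mean
```

Because a different subset of units is silenced on every batch, the network cannot rely on any single activation, which tends to reduce overfitting.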

## Data augmentation

In order to boost the performance of our network even more, we will try to augment our training data by performing some modifications such as shifts or rotations. Hopefully, this will help the model to generalize better and improve the performance on the validation set.

In [20]:
gen_t = image.ImageDataGenerator(rotation_range=15, height_shift_range=0.05, 
                shear_range=0.1, channel_shift_range=20, width_shift_range=0.1)

train_batches = get_batches(os.path.join(DATA_DIR,'train'), gen_t)

Found 3500 images belonging to 10 classes.
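
To make the idea of augmentation concrete, here is a simplified NumPy sketch of two of the transformations listed above: a random per-channel intensity shift and a random horizontal shift. It is an approximation for illustration and does not reproduce the exact behavior of `ImageDataGenerator`; the function names and the random stand-in image are assumptions.

```python
import numpy as np

def random_channel_shift(img, intensity, rng):
    """Add one random offset per channel; img is (channels, height, width) in [0, 255]."""
    offsets = rng.uniform(-intensity, intensity, size=(img.shape[0], 1, 1))
    return np.clip(img + offsets, 0, 255)

def random_width_shift(img, fraction, rng):
    """Shift horizontally by up to `fraction` of the width, repeating edge pixels."""
    width = img.shape[2]
    max_px = int(fraction * width)
    shift = rng.randint(-max_px, max_px + 1)
    # Source column for every output column, clamped to the image borders.
    cols = np.clip(np.arange(width) - shift, 0, width - 1)
    return img[:, :, cols]

rng = np.random.RandomState(42)
img = rng.uniform(0, 255, size=(3, 224, 224))   # stand-in for a real photo
augmented = random_width_shift(random_channel_shift(img, 20, rng), 0.1, rng)
print(augmented.shape)
```

Each epoch the network then sees a slightly different version of every training image, while the label stays the same.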


In [22]:
model = conv1(train_batches,val_batches)

Epoch 1/2
Epoch 2/2
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [23]:
model.save_weights('models/augmented.h5')

In [12]:
model.load_weights('models/augmented.h5')

In [21]:
model.optimizer.lr = 0.0001
model.fit_generator(train_batches, train_batches.nb_sample, nb_epoch=10, validation_data=val_batches, 
                 nb_val_samples=val_batches.nb_sample)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
 456/3500 [==>...........................] - ETA: 32s - loss: 0.0320 - acc: 1.0000

KeyboardInterrupt: 

## Submit

Once we are happy with the model, let's use it to make some predictions on the test set and submit these to Kaggle.

In [39]:
DATA_DIR = os.path.join(os.getcwd(),'data')
test_batches = get_batches(os.path.join(DATA_DIR,'test'),shuffle=False)

Found 79726 images belonging to 1 classes.


In [41]:
preds = model.predict_generator(test_batches, test_batches.nb_sample)

In [42]:
def do_clip(arr, mx): return np.clip(arr, (1-mx)/9, mx)

In [43]:
subm = do_clip(preds,0.93)
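
The reason for clipping can be seen with a small worked example, assuming the competition is scored with multi-class log loss. The sketch below compares the loss for a fully confident prediction that is right or wrong, with and without clipping. It does not model any additional processing Kaggle may apply before scoring, and the `1e-15` floor for the unclipped case is an illustrative assumption.

```python
import numpy as np

def clip_probs(arr, mx, n_classes=10):
    """Cap probabilities at mx and floor them so that a capped row still sums to 1."""
    return np.clip(arr, (1 - mx) / (n_classes - 1), mx)

def log_loss_single(probs, true_class):
    """Log loss contribution of one sample."""
    return -np.log(probs[true_class])

confident = np.array([1.0] + [0.0] * 9)   # model is certain the class is 0
clipped = clip_probs(confident, 0.93)

right = log_loss_single(clipped, 0)        # about 0.073
wrong = log_loss_single(clipped, 1)        # about 4.86
unclipped_wrong = log_loss_single(np.clip(confident, 1e-15, 1), 1)  # about 34.5

print(right, wrong, unclipped_wrong)
```

Clipping costs a little on every correct prediction but caps the penalty for a confident mistake, which matters when the model is overconfident on unseen data.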

In [44]:
subm_name = os.path.join(DATA_DIR,'results','subm.gz')

In [45]:
classes = sorted(train_batches.class_indices, key=train_batches.class_indices.get)
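
The line above orders the class names by their numeric index, so that the column names line up with the columns of the prediction array. A tiny sketch with a hypothetical mapping shows the effect:

```python
# Hypothetical class-name-to-index mapping, in the style of class_indices.
class_indices = {'dog': 0, 'cat': 1, 'bird': 2}

# Sorting the keys by their mapped value gives names in column order,
# which is not necessarily alphabetical order.
ordered = sorted(class_indices, key=class_indices.get)
print(ordered)   # ['dog', 'cat', 'bird']
```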

In [46]:
import pandas as pd
submission = pd.DataFrame(subm, columns=classes)
submission.insert(0, 'img', [a[8:] for a in test_batches.filenames])
submission.head()

| | img | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | img_1.jpg | 0.007778 | 0.007778 | 0.007778 | 0.007778 | 0.007778 | 0.93 | 0.007778 | 0.007778 | 0.007778 | 0.007778 |
| 1 | img_10.jpg | 0.007778 | 0.007778 | 0.007778 | 0.007778 | 0.007778 | 0.93 | 0.007778 | 0.007778 | 0.007778 | 0.007778 |
| 2 | img_100.jpg | 0.896594 | 0.007778 | 0.007778 | 0.007778 | 0.007778 | 0.007778 | 0.007778 | 0.011672 | 0.007778 | 0.074021 |
| 3 | img_1000.jpg | 0.007778 | 0.007778 | 0.77523 | 0.007778 | 0.007778 | 0.007778 | 0.193272 | 0.007778 | 0.024516 | 0.007778 |
| 4 | img_100000.jpg | 0.009174 | 0.007778 | 0.007778 | 0.899323 | 0.072271 | 0.007778 | 0.007778 | 0.007778 | 0.017353 | 0.007778 |


In [47]:
submission.to_csv(subm_name, index=False, compression='gzip')

The first submission got a private score of 1.04346 and a public score of 1.13722. This would place us in the top 37% of the leaderboard. Note that the model was trained on the sample data rather than the full training set; training on the full set would likely improve the score further.