# Building ResNet50 From Scratch (sort-of)
I wanted to become more familiar with the structure of a Deep Residual Network and so decided to define my own using Keras. I am aware that Keras [already has a ResNet50 model](https://keras.io/applications/#resnet50), but there are a few reasons to build my own:  
- I want to replace the Fully-Connected layer with my own 10-node layer (instead of appending it)
- I've never used a functional model in Keras, so it's good practice
- I'll have a better appreciation for developing a deep network  

Of course, I'll need something to train it on, so I'll be using it for my entries to a Kaggle competition on [identifying camera models from pictures](https://www.kaggle.com/c/sp-society-camera-model-identification).


## Imports and Setup
First, the imports

In [1]:
import math
import os
import tensorflow as tf
import keras
import cv2
import numpy as np
from enum import Enum
%pylab inline

Using TensorFlow backend.


Populating the interactive namespace from numpy and matplotlib


## The Data
For the competition, we have 10 camera models and a directory structure like this:
- data
- - HTC-1-M7
- - LG-Nexus-5x
- - ...

The images could also be manipulated in a number of ways, so from this, I have generated a new dataset as follows:
- data-512
- - train_cropped
- - - HTC-1-M7
- - - ...
- - train_compressed_70
- - - HTC-1-M7
- - - ...
- - train_resize_5
- - ...  

Basically, I just have a dedicated folder for each type of photo manipulation. The `data-512` name indicates that all of the images I'm using for training have been center-cropped to 512x512 (this may change at some point). Below, we'll define the labels to be used.

In [2]:
class CameraLabel(Enum):
    HTC_1_M7 = 0
    LG_Nexus_5x = 1
    Motorola_Droid_Maxx = 2
    Motorola_Nexus_6 = 3
    Motorola_X = 4
    Samsung_Galaxy_Note3 = 5
    Samsung_Galaxy_S4 = 6
    Sony_NEX_7 = 7
    iPhone_4s = 8
    iPhone_6 = 9
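
The enum names use underscores because Python identifiers can't contain hyphens, while the folder names use hyphens. Here is a small sketch of the round trip between the two. `label_to_dirname` and `dirname_to_label` are hypothetical helper names that are not part of the notebook; the loaders further down simply inline the same `replace` call.

```python
from enum import Enum

class CameraLabel(Enum):
    HTC_1_M7 = 0
    LG_Nexus_5x = 1
    Motorola_Droid_Maxx = 2
    Motorola_Nexus_6 = 3
    Motorola_X = 4
    Samsung_Galaxy_Note3 = 5
    Samsung_Galaxy_S4 = 6
    Sony_NEX_7 = 7
    iPhone_4s = 8
    iPhone_6 = 9

def label_to_dirname(label):
    # Hypothetical helper: enum member -> folder name on disk
    return label.name.replace('_', '-')

def dirname_to_label(dirname):
    # Hypothetical helper: folder name on disk -> enum member
    return CameraLabel[dirname.replace('-', '_')]

print(label_to_dirname(CameraLabel.Sony_NEX_7))
print(dirname_to_label('iPhone-4s').value)
```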

### Sampling the Data
While the images are 512x512, the network accepts a 224x224 image. To that end, I've created two basic functions for retrieving crops of the training data: *center_crop* does as you'd expect, and *crop_224* takes one crop from a random position within each image passed in. The *crops_per_image* argument isn't used inside *crop_224* yet; to get several crops per image, the function is simply called several times.

In [3]:
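'''
Crops one randomly-positioned 224x224x3 patch from each larger image
'''
def crop_224(images, crops_per_image=1):
    # NOTE: crops_per_image is accepted but unused; call this function once per crop wanted
    out_images = []
    for image in images:
        # randint's upper bound is exclusive, so add 1 to allow the last valid offset
        # (this also keeps images that are exactly 224 pixels wide or tall from raising)
        row_index = np.random.randint(0, image.shape[0] - 224 + 1)
        col_index = np.random.randint(0, image.shape[1] - 224 + 1)
        out_images += [image[row_index:row_index + 224, col_index:col_index + 224, :]]
    return np.asarray(out_images)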

def center_crop(img, crop_dim=224):
    edge = crop_dim // 2
    height, width, _ = img.shape
    center_height = height // 2
    center_width = width // 2
    top = center_height - edge
    bottom = center_height + edge
    left = center_width - edge
    right = center_width + edge
    return img[top:bottom,left:right]
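
If I ever want *crop_224* to honor *crops_per_image* directly, it could look something like the sketch below. This is not what the notebook uses: `random_crops` is a hypothetical name, and returning the labels alongside the crops is my own assumption about what would be convenient. Note that it groups crops by image, which is a different ordering from calling *crop_224* repeatedly.

```python
import numpy as np

def random_crops(images, labels, crops_per_image=1, crop_dim=224):
    """Hypothetical variant of crop_224: several random crops per image, with labels repeated to match."""
    out_images, out_labels = [], []
    for image, label in zip(images, labels):
        for _ in range(crops_per_image):
            # +1 because randint's upper bound is exclusive
            row = np.random.randint(0, image.shape[0] - crop_dim + 1)
            col = np.random.randint(0, image.shape[1] - crop_dim + 1)
            out_images.append(image[row:row + crop_dim, col:col + crop_dim, :])
            out_labels.append(label)
    return np.asarray(out_images), np.asarray(out_labels)

# Two blank stand-in "images" with made-up labels 3 and 7
fake_images = np.zeros((2, 512, 512, 3), dtype=np.uint8)
crops, crop_labels = random_crops(fake_images, [3, 7], crops_per_image=4)
print(crops.shape, crop_labels.tolist())
```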

### Loading the Data
Of course, we need to load the data to train our network. I can't load all of it at once, because even at 512x512 that takes at least 13GB of memory, so I'll introduce a *load_ratio* parameter to load a subset of the data for each epoch. I'm also being lazy with my development and just using a separate function for loading the validation data.

In [4]:
def load(path='data-224', load_ratio=1.0, image_dim=224, verbose=False):
    print('Loading images')
    data = []
    labels = []
    # Read data from every type of generated data
    for data_path in os.listdir(path):
        full_data_path = os.path.join(path, data_path)
        # Read data from every camera type
        for label in CameraLabel:
            camera_path = os.path.join(full_data_path, label.name.replace('_', '-'))
            if verbose:
                print("Loading images for ", camera_path)
            filenames = os.listdir(camera_path)
            last_index = int(load_ratio * len(filenames))
            for filename in filenames[:last_index]:
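                img = cv2.imread(os.path.join(camera_path, filename))
                # cv2.imread returns None (rather than raising) when a file can't be read
                if img is None:
                    print('Could not read image: ', os.path.join(camera_path, filename))
                    continue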
                if img.shape[0] != image_dim or img.shape[1] != image_dim or img.shape[2] != 3:
                    print('Image with invalid shape! Shape: ', img.shape, 'Name: ', os.path.join(camera_path, filename))
                    continue
                data += [img]
                labels += [CameraLabel[label.name].value]
    return np.asarray(data), np.expand_dims(np.asarray(labels), axis=1)

def load_validation(path='data/validation', needs_crop=False, verbose=False):
    data = []
    labels = []
    # Read data from every camera type
    for label in CameraLabel:
        camera_path = os.path.join(path, label.name.replace('_', '-'))
        if verbose:
            print("Loading images for ", camera_path)
        for filename in os.listdir(camera_path):
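            img = cv2.imread(os.path.join(camera_path, filename))
            # cv2.imread returns None (rather than raising) when a file can't be read
            if img is None:
                print('Could not read image: ', os.path.join(camera_path, filename))
                continue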
            if needs_crop:
                img = center_crop(img)
            if img.shape[0] != 224 or img.shape[1] != 224 or img.shape[2] != 3:
                print('Loaded an image with improper dimensions. Discarding')
            else:
                data += [img]
                labels += [CameraLabel[label.name].value]
            del img
    return np.asarray(data), np.expand_dims(np.asarray(labels), axis=1)
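
To get a rough feel for why loading everything at once is a problem, here's a back-of-the-envelope estimate. It assumes every image is stored as an uncompressed 8-bit array (one byte per channel value), and the image counts in the loop are made-up round numbers, not the size of my dataset. `images_memory_gb` is a hypothetical helper.

```python
def images_memory_gb(num_images, dim=512, channels=3, bytes_per_value=1):
    # Raw array size only; Python lists and temporary copies add overhead on top
    return num_images * dim * dim * channels * bytes_per_value / 1024 ** 3

# Illustrative counts only
for count in (1000, 10000, 20000):
    print(count, 'images ->', round(images_memory_gb(count), 2), 'GB')
```

Since the estimate scales linearly with the image count, *load_ratio* cuts the memory use by the same proportion.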

Let's go ahead and load the validation data

In [5]:
valid_images, valid_labels = load_validation('data/flickr-validation', needs_crop=True, verbose=True)
print('Loaded ', valid_images.shape[0], 'images')

Loading images for  data/flickr-validation/HTC-1-M7
Loading images for  data/flickr-validation/LG-Nexus-5x
Loading images for  data/flickr-validation/Motorola-Droid-Maxx
Loading images for  data/flickr-validation/Motorola-Nexus-6
Loaded an image with improper dimensions. Discarding
Loading images for  data/flickr-validation/Motorola-X
Loading images for  data/flickr-validation/Samsung-Galaxy-Note3
Loading images for  data/flickr-validation/Samsung-Galaxy-S4
Loading images for  data/flickr-validation/Sony-NEX-7
Loading images for  data/flickr-validation/iPhone-4s
Loading images for  data/flickr-validation/iPhone-6
Loaded  14583 images


## Building the Network
Ok! Time to build this network. I have built it based on this [netscope graph](https://dgschwend.github.io/netscope/#/preset/resnet-50) with the main exception being the final fully-connected layer. Also, of course, I've referenced the now-famous paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385).  

### A Note About Weights
I haven't spent much time researching weight initialization for the network outside of reading [some notes](http://cs231n.github.io/neural-networks-2/#init), but I'm following TensorFlow's practice of using *truncated normals* to initialize the weights of my network. From my experience, it works well for ReLU-based networks both with and without batch normalization, so I'll keep that in here for now. We'll need a simple standard deviation function to do so:

In [5]:
def std(inputs):
    return 1 / math.sqrt(inputs)
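
Just to see the scale of the numbers involved, this sketch evaluates the same function on two of the values the network definition later passes in (the first convolution and the final dense layer). The layer names in the loop are only labels for the printout.

```python
import math

def std(inputs):
    return 1 / math.sqrt(inputs)

# Values passed to std() by the first conv layer and the final dense layer
for name, size in [('first conv', 224 * 224 * 3), ('dense', 10 * 2048)]:
    print(name, size, std(size))
```

Larger inputs give a smaller standard deviation, so the big early layers start with weights very close to zero.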

### Loading the Network
In case there's an already-existing network I'd like to work with, I'll go ahead and load it here. Otherwise, we'll use the network definition after this part.

In [5]:
# The model to work with
model_filename = 'keras_models/second_resnet.h5'

In [6]:
if os.path.isfile(model_filename):
    print('Loading previously-created model,', model_filename)
    model = keras.models.load_model(model_filename)
else:
    print('No model found for path: ', model_filename)

Loading previously-created model, keras_models/second_resnet.h5


## Defining the Network
If you've loaded a network in the previous section, then skip to the section on Training.  

So, I need a fresh network sometimes and here's where I'll define it. Looking over the netscope graph and the paper, the key building *block* of a ResNet model is what's called the *bottleneck*. It is a series of 1x1, 3x3, and 1x1 convolutions with batch normalization and ReLU activations between them (**not at the end of it, however**) with various parameters to maintain shape. To make my life easier, I'm going to define the *bottlenet_block* function to build it for me

In [8]:
'''
The bottleneck block of a resnet has the following structure:
    1x1 conv on the input features (no bias, no pad, stride 1, or 2 when down-sampling), batch norm w/ scaling, followed by ReLU activation
    3x3 conv w/ no bias, 'same' padding, 1 stride, batch norm w/ scaling, ReLU follows
    1x1 conv w/ input_filters * output_filters_scale features (no bias, no pad, 1 stride), batch norm w/ scaling, NO ACTIVATION
'''
def bottlenet_block(input_net, input_filters=64, down_sampling=False, feature_size=56, middle_filters_divisor=2, output_filters_scale=1):
    
    first_stride = 2 if down_sampling else 1
    # if I want to initialize weights, I need to know incoming size, figure that out
    
    # First Convolution
    block = keras.layers.Conv2D(filters=input_filters, kernel_size=1, strides=first_stride, use_bias=False, kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(input_filters * feature_size ** 2)))(input_net)
    block = keras.layers.BatchNormalization()(block)
    block = keras.layers.Activation('relu')(block)
    # Second Convolution
    block = keras.layers.Conv2D(filters=input_filters // middle_filters_divisor, kernel_size=3, padding='same', strides=1, use_bias=False, kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(input_filters // middle_filters_divisor * (feature_size / first_stride) ** 2)))(block)
    block = keras.layers.BatchNormalization()(block)
    block = keras.layers.Activation('relu')(block)
    # Final Convolution (no activation)
    block = keras.layers.Conv2D(filters=input_filters * output_filters_scale, kernel_size=1, strides=1, use_bias=False, kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(input_filters * output_filters_scale * (feature_size / first_stride) ** 2)))(block)
    block = keras.layers.BatchNormalization()(block)
    
    return block

The parameters above are a little confusing, so I'll detail them here.
- input_net - The network we're attaching this bottleneck block to
- input_filters - The number of filters/features coming into this block
- down_sampling - Instead of pooling, ResNets just bump the convolution stride when the features are about to double
- feature_size - Used for calculating weights during initialization. Far as I can tell, there's not an easy way to get the incoming filter/feature shape for a functional model with Keras until the model is actually instantiated. Kinda hacky
- middle_filters_divisor - Used in tandem with output_filters_scale to determine the number of features in the middle convolution. Typically 2 or 4
- output_filters_scale - Multiplies the incoming number of filters for the next layer. Typically is 1 or 2

The last two parameters are the weird ones. If you look closely at the [netscope graph](https://dgschwend.github.io/netscope/#/preset/resnet-50), you'll notice each *block* typically will reduce the number of filters by a factor of 4 and then restore it to its original size at the end, which would give us `middle_filters_divisor=4` and `output_filters_scale=1`. However, at points in the network where the features will double, the middle layers of the block will only have the filters cut in *half* and then the output layer will *double* the number of features, so you'll see `middle_filters_divisor=2` and `output_filters_scale=2`.
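
The arithmetic is easier to see with numbers plugged in. The sketch below mirrors the filter counts computed inside *bottlenet_block* for its three convolutions; `block_filters` is a hypothetical helper for illustration only and builds no Keras layers.

```python
def block_filters(input_filters, middle_filters_divisor, output_filters_scale):
    # Same expressions bottlenet_block uses for its three Conv2D layers
    first = input_filters
    middle = input_filters // middle_filters_divisor
    last = input_filters * output_filters_scale
    return first, middle, last

print(block_filters(256, 4, 1))  # regular block
print(block_filters(256, 2, 2))  # block where the features double
print(block_filters(64, 1, 4))   # very first block of the network
```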

### Just Build It Already
Ok, so the blocks defined above will be repeated *16 times* (3x16 gives me 48 layers, plus an initial conv-layer and output layer, hence ResNet**50**!). Yes, the definition that follows is long and could be rolled up into a loop, but I wanted to show exactly what the structure looks like; plus, I don't have to define any extra function parameters :)

In [8]:
inputs = keras.layers.Input(shape=(224, 224, 3))

# An initial convolutional layer with strides=2 and pooling to reduce the image size
net = keras.layers.Conv2D(filters=64, kernel_size=7, strides=2, padding='same', input_shape=(224, 224, 3), kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(224*224*3)))(inputs)
net = keras.layers.BatchNormalization()(net)
net = keras.layers.Activation('relu')(net)
net = keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(net)

##########################################
# First Set of Bottlenecks
bottle_1 = bottlenet_block(net, input_filters=64, feature_size=56, middle_filters_divisor=1, output_filters_scale=4)
# Projection Function
net = keras.layers.Conv2D(filters=256, kernel_size=1, use_bias=False, strides=1, kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(56*56*256)))(net)
net = keras.layers.BatchNormalization()(net)
# Merge Bottleneck and Projection
net = keras.layers.Add()([bottle_1, net])
net = keras.layers.Activation('relu')(net)

bottle_2 = bottlenet_block(net, input_filters=256, feature_size=56, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_2, net])
net = keras.layers.Activation('relu')(net)

bottle_3 = bottlenet_block(net, input_filters=256, feature_size=56, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_3, net])
net = keras.layers.Activation('relu')(net)

##########################################
# Second Set of Bottlenecks
bottle_4 = bottlenet_block(net, input_filters=256, down_sampling=True, feature_size=56, middle_filters_divisor=2, output_filters_scale=2)
# Projection Function
net = keras.layers.Conv2D(filters=512, kernel_size=1, use_bias=False, strides=2, kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(28*28*512)))(net)
net = keras.layers.BatchNormalization()(net)
# Merge Bottleneck and Projection
net = keras.layers.Add()([bottle_4, net])
net = keras.layers.Activation('relu')(net)

bottle_5 = bottlenet_block(net, input_filters=512, feature_size=28, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_5, net])
net = keras.layers.Activation('relu')(net)

bottle_6 = bottlenet_block(net, input_filters=512, feature_size=28, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_6, net])
net = keras.layers.Activation('relu')(net)

bottle_7 = bottlenet_block(net, input_filters=512, feature_size=28, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_7, net])
net = keras.layers.Activation('relu')(net)

##########################################
# Third Set of Bottlenecks
bottle_8 = bottlenet_block(net, input_filters=512, down_sampling=True, feature_size=28, middle_filters_divisor=2, output_filters_scale=2)
# Projection Function
net = keras.layers.Conv2D(filters=1024, kernel_size=1, use_bias=False, strides=2, kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(14*14*1024)))(net)
net = keras.layers.BatchNormalization()(net)
# Merge Bottleneck and Projection
net = keras.layers.Add()([bottle_8, net])
net = keras.layers.Activation('relu')(net)

bottle_9 = bottlenet_block(net, input_filters=1024, feature_size=14, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_9, net])
net = keras.layers.Activation('relu')(net)

bottle_10 = bottlenet_block(net, input_filters=1024, feature_size=14, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_10, net])
net = keras.layers.Activation('relu')(net)

bottle_11 = bottlenet_block(net, input_filters=1024, feature_size=14, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_11, net])
net = keras.layers.Activation('relu')(net)

bottle_12 = bottlenet_block(net, input_filters=1024, feature_size=14, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_12, net])
net = keras.layers.Activation('relu')(net)

bottle_13 = bottlenet_block(net, input_filters=1024, feature_size=14, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_13, net])
net = keras.layers.Activation('relu')(net)

##########################################
# Fourth Set of Bottlenecks
bottle_14 = bottlenet_block(net, input_filters=1024, down_sampling=True, feature_size=14, middle_filters_divisor=2, output_filters_scale=2)
# Projection Function
net = keras.layers.Conv2D(filters=2048, kernel_size=1, use_bias=False, strides=2, kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(7*7*2048)))(net)
net = keras.layers.BatchNormalization()(net)
# Merge Bottleneck and Projection
net = keras.layers.Add()([bottle_14, net])
net = keras.layers.Activation('relu')(net)

bottle_15 = bottlenet_block(net, input_filters=2048, feature_size=7, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_15, net])
net = keras.layers.Activation('relu')(net)

bottle_16 = bottlenet_block(net, input_filters=2048, feature_size=7, middle_filters_divisor=4, output_filters_scale=1)
net = keras.layers.Add()([bottle_16, net])
net = keras.layers.Activation('relu')(net)

# Final Pooling. Should have 2048 outputs
net = keras.layers.AveragePooling2D(pool_size=(7,7))(net)

# Final Fully-Connected Layer
flatten_shape = 2048
net = keras.layers.Flatten()(net)
predictions = keras.layers.Dense(10, activation='softmax', kernel_initializer=keras.initializers.TruncatedNormal(stddev=std(10 * flatten_shape)))(net)

model = keras.models.Model(inputs=inputs, outputs=predictions)
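
For anyone who would rather roll this up into a loop, the sixteen calls above follow a regular pattern. The sketch below only generates the keyword arguments for each call, in order; it doesn't touch Keras, and `bottleneck_plan` is a hypothetical name. The first block of each set is the one that also needs the projection convolution on the shortcut.

```python
def bottleneck_plan():
    # (blocks in the set, input filters at the start, feature size at the start)
    stages = [(3, 64, 56), (4, 256, 56), (6, 512, 28), (3, 1024, 14)]
    plan = []
    for stage_index, (num_blocks, filters, size) in enumerate(stages):
        first_stage = stage_index == 0
        for block_index in range(num_blocks):
            if block_index == 0:
                # First block of a set: changes the filter count (and needs a projection)
                plan.append(dict(input_filters=filters,
                                 down_sampling=not first_stage,
                                 feature_size=size,
                                 middle_filters_divisor=1 if first_stage else 2,
                                 output_filters_scale=4 if first_stage else 2))
                filters *= 4 if first_stage else 2
                if not first_stage:
                    size //= 2
            else:
                # Remaining blocks keep the shape, so the shortcut is a plain Add
                plan.append(dict(input_filters=filters,
                                 down_sampling=False,
                                 feature_size=size,
                                 middle_filters_divisor=4,
                                 output_filters_scale=1))
    return plan

plan = bottleneck_plan()
print(len(plan), 'blocks,', 3 * len(plan) + 2, 'layers')
```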

### Compiling the Model
Nothing extravagant going on here, just using the Adam optimizer and typical cross-entropy classification.  

**Note: If loading a previously-trained model, this step is not necessary.**

In [9]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam', 
              metrics=['accuracy'])
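
One thing to keep in mind: `categorical_crossentropy` expects one-hot targets, which is why the training function converts the integer labels with `to_categorical`. A NumPy sketch of the same conversion, with `one_hot` as a hypothetical stand-in rather than the Keras implementation:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """labels: integer class ids with shape (n, 1) or (n,)."""
    labels = np.asarray(labels).reshape(-1)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set a single 1 per row, at the column given by the class id
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([[0], [9], [3]])
print(encoded.shape)
```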

## Training the Model
Now, I have the model and it's time to train! Hold on to your butts, though; on a modest 4th gen i5 and a powerful GTX 1080, it will take about 10 minutes for an epoch to train, assuming ~25000 images in batches of 16.  

First up, the training function. It's messy, I know

In [7]:
#loss_and_metrics = model.evaluate(train_data, train_labels, batch_size=50)
#model.fit(train_data, train_labels, epochs=5, batch_size=50)

''' Crop & Train Idea:

1. Load 25% of images, which will take about 8GB of memory 
2. Random crop 4 times for each image, giving 4 samples per image
3. Train for an epoch
4. Shuffle and repeat step 3 for an epoch if I feel like it (spend less time loading images. It's a bonus!)
5. Repeat from beginning
'''

def train(model, data_path='data-512', load_ratio=0.25, validation_data=None, epochs=1, batch_size=16, crops_per_image=4, bonus_training=False):
    
    if validation_data is not None:
        valid_images = validation_data[0]
        valid_label_one_hot = tf.keras.utils.to_categorical(validation_data[1], num_classes=10)
    else:
        valid_images = None
        valid_label_one_hot = None
        
    for i in range(epochs):
        images, labels = load(path=data_path, image_dim=512, load_ratio=load_ratio)
        train_images = crop_224(images, crops_per_image=crops_per_image)
        for j in range(1, crops_per_image):
            train_images = np.concatenate((train_images, crop_224(images, crops_per_image=crops_per_image)))
        
        # Repeat the whole label array once per crop pass so it lines up with the concatenated crops
        labels = np.reshape(np.tile(labels[:,0], crops_per_image), (crops_per_image * len(labels), 1))
        # One-hot-ify the labels
        train_labels = tf.keras.utils.to_categorical(labels, num_classes=10)
        
        # using initial_epoch does not work for some reason, so all epochs will say 1/1 :(
        if validation_data is not None:
            model.fit(train_images, train_labels, epochs=1, batch_size=batch_size, validation_data=(valid_images, valid_label_one_hot), verbose=1)
        else:
            model.fit(train_images, train_labels, epochs=1, batch_size=batch_size, verbose=1)
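        # get some more mileage outta training before reloading
        if bonus_training:
            # Shuffle images and labels with one shared permutation so each crop keeps its label
            order = np.random.permutation(len(train_images))
            train_images, train_labels = train_images[order], train_labels[order]
            # Only pass validation data along when it was actually provided
            bonus_validation = (valid_images, valid_label_one_hot) if validation_data is not None else None
            model.fit(train_images, train_labels, epochs=1, batch_size=batch_size, validation_data=bonus_validation, verbose=1)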
        if (i + 1) % 5 == 0:
            print('Saving model')
            model.save(model_filename)
        # I noticed some memory leakage, probably because of Jupyter, so I explicitly delete all of the generated data here
        del images
        del labels
        del train_images
        del train_labels
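
A quick check on the label bookkeeping in *train*: the crops are built by concatenating whole passes over the images, so the labels have to repeat as a whole array (`np.tile`) rather than element by element (`np.repeat`). A toy example with three made-up labels:

```python
import numpy as np

labels = np.array([[0], [1], [2]])
crops_per_image = 2

# Crops are ordered [pass 1 over all images, pass 2 over all images, ...]
tiled = np.tile(labels[:, 0], crops_per_image)
# np.repeat would instead group the copies of each label together
repeated = np.repeat(labels[:, 0], crops_per_image)

print(tiled.tolist())     # [0, 1, 2, 0, 1, 2]
print(repeated.tolist())  # [0, 0, 1, 1, 2, 2]
```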

### Get On With It and Train!
Ok ok, time to finally train! Sit back and watch the numbers (slowly) go up :) Feel free to bump up the batch_size if the GPU has enough memory; it should speed things along a tiny bit.

In [8]:
train(model, data_path='data-512-with-flickr', load_ratio=0.10, epochs=10, batch_size=32, crops_per_image=3)

Loading images
Epoch 1/1
Loading images
Epoch 1/1
Loading images
Epoch 1/1
Loading images
Epoch 1/1
Loading images
Epoch 1/1
Saving model
Loading images
Epoch 1/1
Loading images
Epoch 1/1
Loading images
Epoch 1/1
Loading images
Epoch 1/1
Loading images
Epoch 1/1
Saving model


## Saving the Model
Keras makes this way too easy, it almost feels dirty. Change the filename if it suits you.

In [10]:
model.save(model_filename)

## Evaluating the Model
We have a trained model, now what do we do? Make a submission to Kaggle, of course! The code below reads each image in `data/test` one at a time and writes the prediction out to a submission file.  

For added fun, I sample each test image 10 times (one center crop plus 9 random crops) and just take the prediction with the highest confidence; I found this gives slightly better results. It's a naive approach and will certainly be improved later with proper ensembling, averaging, etc.

In [9]:
'''
Guide to making predictions:
open file
write header
For each test image:
    load the image
    crop the image in multiple locations
    evaluate the image using the trained model and select label with highest confidence from all crops
    write filename and category name to file
save file
close file
submit file
'''
samples = 9
test_dir = 'data/test'
submission_filename = 'submissions/eighth_submission.csv'
result = ['fname,camera']
for filename in os.listdir(test_dir):
    image = cv2.imread(os.path.join(test_dir, filename))
    test_images = np.expand_dims(center_crop(image), axis=0)
    for i in range(samples):
        test_images = np.concatenate((test_images, crop_224([image])))
    predictions = model.predict(test_images, batch_size=test_images.shape[0], verbose=0)
    category = np.argmax(predictions) % 10
    camera_name = CameraLabel(category).name.replace('_','-')
    result += [filename + ',' + camera_name]
    #print(result)
    del image
    del test_images
result = np.asarray(result)

np.savetxt(submission_filename, result, fmt="%s")
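
A note on the `np.argmax(predictions) % 10` line: `argmax` runs over the flattened (crops x classes) array, so the remainder recovers the class of the single most confident crop. The toy example below uses made-up probabilities for 3 crops and 3 classes (instead of 10) to show that, and compares it with averaging over crops, which is one of the improvements I mentioned. This is only a sketch; I haven't measured whether averaging scores better.

```python
import numpy as np

# Made-up softmax outputs: one row per crop, one column per class
predictions = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.3, 0.7, 0.0],
])
num_classes = predictions.shape[1]

# What the notebook does: class of the single highest-confidence entry
most_confident = int(np.argmax(predictions) % num_classes)
# Alternative: average the probabilities over crops, then pick the best class
averaged = int(np.argmax(predictions.mean(axis=0)))

print(most_confident, averaged)  # 0 1
```

Here one very confident crop wins under the first rule, while the other two crops outvote it under averaging.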