# Statefarm Data - Phase 4B - All-Convolutional Models

Vgg16 performed better than InceptionV3 and Resnet50 in the Phase 4 experiments, but it was obvious that Vgg became over-fitted quite quickly. That is no wonder when trying to train over 3 million parameters in the dense layers based on only 50 subjects (in turn providing approx. 22000 training images). Using dropout to control over-fitting is not an efficient way of creating a stable model either. Over-fitting results when there is not enough data for the quantity of parameters requiring training. (Though with infinitely flexible non-linear models, over-fitting will eventually happen with too much training unless something is done to disrupt that process.) Most of the parameters in the Vgg16 neural network are contributed by the dense fully connected layers, and comparatively few by the convolutional layers. All-convolutional architectures are a way to reduce the number of parameters in a model and so help to reduce over-fitting when not much training data is available.
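
To make the dense-versus-convolutional split concrete, the sketch below counts the parameters of the stock Vgg16 architecture by hand. It assumes the standard ImageNet configuration (224x224 input, thirteen 3x3 convolutions, two 4096-unit dense layers and a 1000-way output), which is not necessarily the dense head used in the Phase 4 experiments. The helper names `conv_params` and `dense_params` are made up for this illustration.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square-kernel convolutional layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# (in_channels, out_channels) for the 13 3x3 conv layers of stock Vgg16 (assumed config)
vgg16_convs = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

# fc1, fc2 and the 1000-way ImageNet output; fc1 sees the flattened 7x7x512 feature map
vgg16_dense = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

n_conv = sum(conv_params(3, i, o) for i, o in vgg16_convs)
n_dense = sum(dense_params(i, o) for i, o in vgg16_dense)
dense_share = n_dense / float(n_conv + n_dense)

print(n_conv, n_dense, round(dense_share, 3))
```

Under these assumptions the convolutional layers hold 14,714,688 parameters and the dense layers 123,642,856, so close to 90% of the stock network sits in the fully connected layers.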

#### In this notebook, my objective is to compare the performance of various all-convolutional models based on Vgg19.

In [None]:
import theano
from theano.sandbox import cuda
cuda.use('gpu0')

In [None]:
%matplotlib inline
IMPORT_DIR = '/home/ubuntu/nbs'
%cd $IMPORT_DIR

In [None]:
from __future__ import division,print_function

import os, json
from glob import glob
import numpy as np
np.set_printoptions(precision=4, linewidth=100)
from matplotlib import pyplot as plt
import daveutils
from daveutils import *
import davenet
from davenet import *
import my_cv_modeler
from my_cv_modeler import *

In [None]:
ALL_DATA_DIR = '/home/ubuntu/'
DATA_HOME_DIR = ALL_DATA_DIR+'statefarm1/'
TRAIN_DIR = DATA_HOME_DIR+'train/'
VALID_DIR = DATA_HOME_DIR+'valid/'
SAMPLE_DIR = DATA_HOME_DIR+'sample/'
MODELS_DIR = DATA_HOME_DIR+'models/'
RESULTS_DIR = DATA_HOME_DIR+'results/'
TEST_DIR = DATA_HOME_DIR+'test/'

# 1. Prepare Data

#### Identify and remove poor quality training data

Previously identified data that is badly classified or multi-class:

In [None]:
bad_img_nums=np.array(['16927','101091','31121','27454','49471','47068','18737','14223','68147','68040','54867',
                  '38427', '8131', '62871', '99733', '92769','75819', '79819'])
# n.b. some of these image numbers are in the validation folder

In [None]:
%cd $DATA_HOME_DIR

Move the bad images from the /train folder to the /bad_train folder

In [None]:
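import os
from shutil import move
from shutil import copytree #(src, dst, symlinks=False, ignore=None)
%cd $DATA_HOME_DIR

def ignore_files(directory, names):
    # tell copytree to skip regular files so that only the class sub-folders are recreated
    return [n for n in names if not os.path.isdir(os.path.join(directory, n))]

def move_bad_to_bad_folder(from_dir, bad_filenames, bad_dir = 'bad_train'):  #bad_dir must not already exist
    count = 0
    copytree(from_dir, bad_dir, ignore=ignore_files)  # copy the folder structure only, not the images
    g = glob(from_dir+'/c?/*.jpg')
    for filename in g:
        # 'train/c0/img_16927.jpg' -> '16927' (skip '<from_dir>/c?/img_' and drop '.jpg')
        if filename[len(from_dir)+8:][:-4] in bad_filenames:
            print(filename[len(from_dir)+1:])
            move(filename, bad_dir+'/'+filename[len(from_dir)+1:])
            count+=1
    print(count, "items moved from", from_dir, "to", bad_dir)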

In [None]:
move_bad_to_bad_folder('train', bad_img_nums, 'bad_train')
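
As a sanity check after the move, one could confirm that none of the flagged image numbers remain under a given folder. The sketch below assumes the `img_<number>.jpg` naming and the `c?` class sub-folders used above. `find_remaining_bad` is a hypothetical helper, not part of `daveutils`.

```python
import os
from glob import glob

def find_remaining_bad(from_dir, bad_nums):
    """Return the paths under from_dir/c?/ whose image number is still in bad_nums."""
    bad = set(str(n) for n in bad_nums)
    remaining = []
    for path in glob(os.path.join(from_dir, 'c?', '*.jpg')):
        # 'img_16927.jpg' -> '16927'
        stem = os.path.splitext(os.path.basename(path))[0]
        if stem.split('_')[-1] in bad:
            remaining.append(path)
    return sorted(remaining)
```

Calling it on `'train'` after the move should return an empty list. Calling it on `'valid'` would list the flagged images that live in the validation folder.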

# 2. Create a Sequential Vgg Model 

### 1. Add fc_bn layers, and train only the final layer

Import the Vgg19 network pretrained on Imagenet

In [None]:
from keras.applications.vgg19 import VGG19
from keras.applications.vgg19 import preprocess_input
from keras.models import Model

vgg19layers = VGG19(include_top=True, weights='imagenet')
#base_model = VGG19(weights='imagenet')
#model = Model(input=base_model.input, output=base_model.get_layer('block4_pool').output)

In [None]:
save_model(vgg19layers,1,'vgg19')

In [None]:
for i, layer in enumerate(vgg19layers.layers):
    print(i, layer)

In [None]:
vgg19layers.summary()

Make the pretrained layers non-trainable (the loop below freezes every layer of the imported network)

# Freeze Conv Layers to FC1 and Add new Dense Layer

In [None]:
count_frozen = 0
for layer in vgg19layers.layers:
    layer.trainable = False
    if not layer.trainable:
        count_frozen += 1
print(count_frozen, "layers are frozen")

Create a functional model

In [None]:
#model = Model(input=vgg19layers.input, output=vgg19layers.output)
model = Model(input=vgg19layers.input, output=vgg19layers.get_layer('fc1').output)

In [None]:
model.summary()

In [None]:
x = model.get_layer('fc1')

# Baseline Model: Finetune a truncated Vgg19 model (1x4096 fc hidden layer)

In [None]:
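from keras.layers import Dense  # explicit import in case the wildcard imports above do not provide it
predictions = Dense(10, activation='softmax')(x.output)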

In [None]:
vgg19short = Model(input=vgg19layers.input,output=predictions)

In [None]:
vgg19short.summary()
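
For a rough sense of scale, the arithmetic below compares the one layer that is trained here (the new 10-way softmax on top of `fc1`) with the frozen `fc1` layer beneath it. It is a sketch that assumes the default 224x224 input, so that `fc1` receives a flattened 7x7x512 feature map. `dense_params` is a helper invented for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

trainable_head = dense_params(4096, 10)        # new softmax layer fed by fc1
frozen_fc1 = dense_params(7 * 7 * 512, 4096)   # frozen fc1 (assumes a 7x7x512 input)

print(trainable_head, frozen_fc1)
```

Under these assumptions only 40,970 parameters are trained, against 102,764,544 frozen parameters in `fc1` alone. The output of `vgg19short.summary()` should be the authoritative source for the actual counts.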

### Train the Baseline Model (1 hidden dense layer with 4096 units), including use of 14k pseudo-labelled test cases

Use ImageGenerator because there are too many training images to store (resized) in an array.
1. Not using data augmentation at this stage.
2. Not using validation data for training at this stage.

n.b. Mixiterator was not used. Only test data with a prediction probability >0.995 has been used.
This data is considered to be of such good quality that it can be mixed with real data. The pseudo-labelled data will make up 43% of the training data at this stage (39% after the validation data is added). Yes, that is a little high, but let's see how it goes.
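
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code used to build the pseudo-label set: `select_pseudo_labels` and `pseudo_fraction` are hypothetical helpers, and the toy probabilities are made up.

```python
def select_pseudo_labels(probs, threshold=0.995):
    """Return (row_index, class_index) pairs whose top probability exceeds threshold."""
    selected = []
    for i, row in enumerate(probs):
        top_class = max(range(len(row)), key=lambda k: row[k])
        if row[top_class] > threshold:
            selected.append((i, top_class))
    return selected

def pseudo_fraction(n_pseudo, n_real):
    """Share of the combined training set made up of pseudo-labelled images."""
    return n_pseudo / float(n_pseudo + n_real)

# toy predictions for three test images over three classes (made-up values)
toy_probs = [[0.998, 0.001, 0.001],
             [0.600, 0.300, 0.100],
             [0.002, 0.002, 0.996]]
picked = select_pseudo_labels(toy_probs)
print(picked)
```

Only the first and third toy rows pass the 0.995 threshold. The second is too uncertain and is left out of the training mix.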

Create the image generator (no augmentation)