# MNIST - Training

Given the optimal architecture that meets the memory constraints, we train it first in floating point and then quantized. Small networks appear to be especially sensitive to bad initializations and unlucky training steps, so we run each main training stage several times and keep the result that performs best on the validation set. Only the final quantized network is run on the test set.

In [1]:
from __future__ import absolute_import, division, print_function
import os, sys, pdb, pickle
from itertools import product
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

import tensorflow as tf
import keras
from keras.datasets import mnist
from keras.models import Model, Sequential, load_model
from keras.layers import Input, Dense, Dropout, Flatten, Conv2D, MaxPooling2D, AveragePooling2D, Lambda, Activation, Add, concatenate
from keras.callbacks import LearningRateScheduler, ModelCheckpoint
from keras.engine.topology import Layer
from keras import regularizers, activations
from keras import backend as K

from quantization_layers import *
from network_parameterization import *

#os.environ['CUDA_VISIBLE_DEVICES']=''

out_folder = 'models'
if not os.path.exists(out_folder):
    os.makedirs(out_folder)

Using TensorFlow backend.


## Prepare the data and network

In [2]:
num_classes = 10

# Grab and massage the training and test data.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
img_rows, img_cols = x_train.shape[1:3]

x_train = x_train.astype('float32') / 256
x_test  = x_test.astype('float32')  / 256
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test  = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

np.random.seed(0)
val_set = np.zeros(x_train.shape[0], dtype='bool')
val_set[np.random.choice(x_train.shape[0], 10000, replace=False)] = 1
x_val = x_train[val_set]
y_val = y_train[val_set]
x_train = x_train[~val_set]
y_train = y_train[~val_set]

with open('augmented_x_200k_v2.npy', 'rb') as f:
    x_train_aug = np.load(f) * 255 / 256
with open('augmented_y_200k_v2.npy', 'rb') as f:
    y_train_aug = np.load(f)

print('x_train shape:', x_train.shape)
print(x_train_aug.shape[0], 'train samples')
print(x_val.shape[0], 'val samples')
print(x_test.shape[0], 'test samples')

x_train shape: (50000, 28, 28, 1)
200000 train samples
10000 val samples
10000 test samples


In [3]:
config = [('A', 2, 4), ('C', 5, 3, 3, 1, 1, 4, 8, 4), ('C', 8, 3, 3, 1, 1, 4, 8, 4), ('C', 11, 3, 3, 1, 1, 4, 8, 4), ('M', 2, 4), ('D', 0.1, 4), ('S', 10, 4, 8, 8)]
storage = sum(compute_storage(config, input_dimensions=[28,28,1], input_bits=8, streaming='true', convolution_strategy='herringbone'))
print('[%04dB] - Network:'%storage, config)

[1947B] - Network: [('A', 2, 4), ('C', 5, 3, 3, 1, 1, 4, 8, 4), ('C', 8, 3, 3, 1, 1, 4, 8, 4), ('C', 11, 3, 3, 1, 1, 4, 8, 4), ('M', 2, 4), ('D', 0.1, 4), ('S', 10, 4, 8, 8)]
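
As a sanity check on the architecture, the `#param: 3003` figure printed in the training log can be reproduced by hand under one reading of `config`. The sketch below is not taken from `network_parameterization`; it assumes that `'A'` and `'M'` are 2x2 pooling layers, that each `('C', n, 3, 3, ...)` entry is an unpadded stride-1 convolution with `n` filters of size 3x3, that `'S'` is a dense layer with 10 outputs, and that every quantized layer carries three trainable scales (the `qw`, `qb`, `qa` values shown in the quantization log).

```python
# Hypothetical reading of the config tuples; the field meanings are assumptions.
def count_params(input_hw=28, input_ch=1):
    hw, ch, total = input_hw // 2, input_ch, 0   # 'A', 2: 2x2 average pool -> 14x14
    for filters in (5, 8, 11):                   # three 'C' layers with 3x3 kernels
        total += 3 * 3 * ch * filters + filters  # weights + biases
        total += 3                               # assumed scales: qw, qb, qa
        hw, ch = hw - 2, filters                 # unpadded 3x3 conv trims 2 pixels per axis
    hw //= 2                                     # 'M', 2: 2x2 max pool
    flat = hw * hw * ch                          # features entering the dense layer
    total += flat * 10 + 10 + 3                  # 'S', 10: weights + biases + scales
    return flat, total

flat, total = count_params()
print('flattened features:', flat, '- parameters:', total)
```

Under these assumptions the total matches the count Keras reports for the network, which suggests the reading is at least consistent with the layer definitions.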


## Run floating point training

Find the best FP32 network over 50 training sessions.
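
All training cells in this notebook use the same learning-rate rule, `max(1e-4, 0.005 * decay**epoch)`, with `decay` set to 0.95 for floating point and 0.98 for the quantized runs. The short sketch below evaluates that rule outside Keras to show how far each schedule decays within its epoch budget; the helper names `lr` and `first_floor_epoch` are illustrative and do not appear in the notebook.

```python
def lr(epoch, decay, base=0.005, floor=1e-4):
    """Exponentially decayed learning rate, clamped below at `floor`."""
    return max(floor, base * decay ** epoch)

def first_floor_epoch(decay, base=0.005, floor=1e-4, limit=1000):
    """First epoch at which the unclamped rate drops to the floor, or None."""
    for epoch in range(limit):
        if base * decay ** epoch <= floor:
            return epoch
    return None

# Epoch budgets taken from the fit() calls in this notebook.
for decay, epochs in ((0.95, 50), (0.98, 100), (0.98, 200)):
    print('decay %.2f: final lr %.2e, floor reached at epoch %s'
          % (decay, lr(epochs - 1, decay), first_floor_epoch(decay)))
```

With these constants the floor is first reached at epoch 77 for a decay of 0.95 and at epoch 194 for 0.98, so only the 200-epoch frozen-scale run ever trains at the minimum rate.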

In [4]:
trials = 50
best_acc = 0
for trial in range(trials):
    print('\n\nOn trial %d/%d:'%(trial+1, trials))
    lrate95 = LearningRateScheduler(lambda epoch: max(1e-4, 0.005 * 0.95**epoch))
    ckptL = ModelCheckpoint(out_folder + '/modelFL_%d.h5'%trial, monitor='val_loss', verbose=0, save_best_only=True)
    ckptA = ModelCheckpoint(out_folder + '/modelFA_%d.h5'%trial, monitor='val_acc', verbose=0, save_best_only=True)
    
    X_input = Input(shape=x_train.shape[1:])
    X = output_logits(X_input, config, fp=True, qt=0)
    X = Activation('softmax')(X)
    modelF = Model(X_input, X)
    modelF.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['accuracy'])
    print('#param:', modelF.count_params())
    histf = modelF.fit(x_train_aug, y_train_aug, batch_size=1024, epochs=50, callbacks=[lrate95, ckptA, ckptL], verbose=0, validation_data=(x_val, y_val))
    
    modelF = load_model(out_folder + '/modelFL_%d.h5'%trial, custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
    score = modelF.evaluate(x_val, y_val, verbose=0)
    cur_acc = np.max(histf.history['val_acc'])
    print('Best Validation loss: %.4f - accuracy: %.4f'%(np.min(histf.history['val_loss']), cur_acc))
    print('Model F Validation loss: %.4f - accuracy: %.4f'%(score[0], score[1]))
    if cur_acc > best_acc:
        best_acc = cur_acc
        modelF.save(out_folder + '/modelFL_best.h5')
        modelF = load_model(out_folder + '/modelFA_%d.h5'%trial, custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
        modelF.save(out_folder + '/modelFA_best.h5')
print('\n\n\nBest Validation Accuracy: %.4f'%best_acc)



On trial 1/50:
#param: 3003
Best Validation loss: 0.0304 - accuracy: 0.9905
Model F Validation loss: 0.0304 - accuracy: 0.9902


On trial 2/50:
#param: 3003
Best Validation loss: 0.0328 - accuracy: 0.9905
Model F Validation loss: 0.0328 - accuracy: 0.9896


On trial 3/50:
#param: 3003
Best Validation loss: 0.0381 - accuracy: 0.9892
Model F Validation loss: 0.0381 - accuracy: 0.9889


On trial 4/50:
#param: 3003
Best Validation loss: 0.0331 - accuracy: 0.9904
Model F Validation loss: 0.0331 - accuracy: 0.9892


On trial 5/50:
#param: 3003
Best Validation loss: 0.0381 - accuracy: 0.9894
Model F Validation loss: 0.0381 - accuracy: 0.9890


On trial 6/50:
#param: 3003
Best Validation loss: 0.0340 - accuracy: 0.9892
Model F Validation loss: 0.0340 - accuracy: 0.9888


On trial 7/50:
#param: 3003
Best Validation loss: 0.0328 - accuracy: 0.9906
Model F Validation loss: 0.0328 - accuracy: 0.9903


On trial 8/50:
#param: 3003
Best Validation loss: 0.0364 - accuracy: 0.9896
Model F Validation 

## Run quantized training twice

1. First run - find the best quantized network over 10 training sessions while allowing quantization scales to move.
2. Second run - find the best quantized network over 10 training sessions while freezing quantization scales (allows more stable convergence).
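
To illustrate why freezing the scales can stabilize convergence, here is a minimal NumPy sketch of uniform "fake" quantization. It is not the `quantize` function from `quantization_layers`, whose definition is not shown here; the function `fake_quantize`, the signed 4-bit grid, and the example weights are assumptions chosen for illustration.

```python
import numpy as np

def fake_quantize(x, scale, bits=4):
    """Snap x onto a signed uniform grid of 2**bits levels covering about [-scale, scale)."""
    levels = 2 ** (bits - 1)
    step = scale / levels                       # grid spacing is set by the scale
    q = np.clip(np.round(x / step), -levels, levels - 1)
    return q * step

w = np.array([-0.9, -0.26, 0.0, 0.31, 0.74])   # illustrative weights
for scale in (1.0, 1.1):
    print('scale %.1f ->' % scale, fake_quantize(w, scale))
```

When the scale is trainable, every update to it shifts the whole grid, so weights that had settled onto one set of levels are re-rounded onto another. With the scale frozen, the grid is fixed and only the weights move.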

In [4]:
trials = 10
best_acc = 0
for trial in range(trials):
    print('\n\n\nOn trial %d/%d:'%(trial+1, trials))
    modelF = load_model(out_folder + '/modelFL_best.h5', custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
    score = modelF.evaluate(x_val, y_val, verbose=0)
    print('Model F Validation loss: %.4f - accuracy: %.4f'%(score[0], score[1]))
    
    lrate98 = LearningRateScheduler(lambda epoch: max(1e-4, 0.005 * 0.98**epoch))
    ckptL = ModelCheckpoint(out_folder + '/modelTL_%d.h5'%trial, monitor='val_loss', verbose=0, save_best_only=True)
    ckptA = ModelCheckpoint(out_folder + '/modelTA_%d.h5'%trial, monitor='val_acc', verbose=0, save_best_only=True)
    X_input = Input(shape=x_train.shape[1:])
    X = output_logits(X_input, config, fp=False, qt=1)
    X = Activation('softmax')(X)
    model = Model(X_input, X)
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['accuracy'])
    loadQ(modelF, model, x_val, verbose=True)
    histL = model.fit(x_train_aug, y_train_aug, batch_size=1024, epochs=100, callbacks=[lrate98, ckptA, ckptL], verbose=0, validation_data=(x_val, y_val))

    model = load_model(out_folder + '/modelTL_%d.h5'%trial, custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
    score = model.evaluate(x_val, y_val, verbose=0)
    cur_acc = np.max(histL.history['val_acc'])
    print('Best Validation loss: %.4f - accuracy: %.4f'%(np.min(histL.history['val_loss']), cur_acc))
    print('Model Validation loss: %.4f - accuracy: %.4f'%(score[0], score[1]))
    if cur_acc > best_acc:
        best_acc = cur_acc
        model.save(out_folder + '/modelTL_best.h5')
        model = load_model(out_folder + '/modelTA_%d.h5'%trial, custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
        model.save(out_folder + '/modelTA_best.h5')
print('\n\n\nBest Validation Accuracy: %.4f'%best_acc)




On trial 1/10:
Model F Validation loss: 0.0335 - accuracy: 0.9916
Quantization layer 3: w(-0.0477 +/- 0.7150) b(0.2804 +/- 0.1756) qw(1.6116) qb(1.1171) qa(1.9521)
Quantization layer 4: w(0.0133 +/- 0.4118) b(0.0412 +/- 0.2700) qw(1.0611) qb(1.1542) qa(4.7224)
Quantization layer 5: w(-0.0514 +/- 0.3263) b(0.1365 +/- 0.2259) qw(0.8358) qb(1.4945) qa(6.5495)
Quantization layer 9: w(-0.0225 +/- 0.2671) b(0.0134 +/- 0.3728) qw(0.7327) qb(1.7259) qa(23.9298)
Best Validation loss: 0.0349 - accuracy: 0.9894
Model Validation loss: 0.0349 - accuracy: 0.9887



On trial 2/10:
Model F Validation loss: 0.0335 - accuracy: 0.9916
Quantization layer 3: w(-0.0477 +/- 0.7150) b(0.2804 +/- 0.1756) qw(1.6116) qb(1.1171) qa(1.9521)
Quantization layer 4: w(0.0133 +/- 0.4118) b(0.0412 +/- 0.2700) qw(1.0611) qb(1.1542) qa(4.7224)
Quantization layer 5: w(-0.0514 +/- 0.3263) b(0.1365 +/- 0.2259) qw(0.8358) qb(1.4945) qa(6.5495)
Quantization layer 9: w(-0.0225 +/- 0.2671) b(0.0134 +/- 0.3728) qw(0.7327) qb(1

In [5]:
trials = 10
best_acc = 0
for trial in range(trials):
    print('\n\n\nOn trial %d/%d:'%(trial+1, trials))
    modelL = load_model(out_folder + '/modelTL_best.h5', custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
    score = modelL.evaluate(x_val, y_val, verbose=0)
    print('Model L Validation loss: %.4f - accuracy: %.4f'%(score[0], score[1]))
    modelL.save_weights('temp.h5')

    lrate98 = LearningRateScheduler(lambda epoch: max(1e-4, 0.005 * 0.98**epoch))
    ckptA = ModelCheckpoint(out_folder + '/modelQA_%d.h5'%trial, monitor='val_acc', verbose=0, save_best_only=True)
    ckptL = ModelCheckpoint(out_folder + '/modelQL_%d.h5'%trial, monitor='val_loss', verbose=0, save_best_only=True)
    X_input = Input(shape=x_train.shape[1:])
    X = output_logits(X_input, config, fp=False, qt=0)
    X = Activation('softmax')(X)
    model = Model(X_input, X)
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['accuracy'])
    model.load_weights('temp.h5')
    histq = model.fit(x_train_aug, y_train_aug, batch_size=1024, epochs=200, callbacks=[lrate98, ckptA, ckptL], verbose=0, validation_data=(x_val, y_val))
    model = load_model(out_folder + '/modelQL_%d.h5'%trial, custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
    
    score = model.evaluate(x_val, y_val, verbose=0)
    cur_acc = np.max(histq.history['val_acc'])
    print('Best Validation loss: %.4f - accuracy: %.4f'%(np.min(histq.history['val_loss']), cur_acc))
    print('Model Validation loss: %.4f - accuracy: %.4f'%(score[0], score[1]))
    if cur_acc > best_acc:
        best_acc = cur_acc
        model.save(out_folder + '/modelQL_best.h5')
        model = load_model(out_folder + '/modelQA_%d.h5'%trial, custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
        model.save(out_folder + '/modelQA_best.h5')
print('\n\n\nBest Validation Accuracy: %.4f'%best_acc)




On trial 1/10:
Model L Validation loss: 0.0362 - accuracy: 0.9892
Best Validation loss: 0.0335 - accuracy: 0.9909
Model Validation loss: 0.0335 - accuracy: 0.9909



On trial 2/10:
Model L Validation loss: 0.0362 - accuracy: 0.9892
Best Validation loss: 0.0345 - accuracy: 0.9904
Model Validation loss: 0.0345 - accuracy: 0.9899



On trial 3/10:
Model L Validation loss: 0.0362 - accuracy: 0.9892
Best Validation loss: 0.0346 - accuracy: 0.9906
Model Validation loss: 0.0346 - accuracy: 0.9904



On trial 4/10:
Model L Validation loss: 0.0362 - accuracy: 0.9892
Best Validation loss: 0.0349 - accuracy: 0.9904
Model Validation loss: 0.0349 - accuracy: 0.9890



On trial 5/10:
Model L Validation loss: 0.0362 - accuracy: 0.9892
Best Validation loss: 0.0343 - accuracy: 0.9907
Model Validation loss: 0.0343 - accuracy: 0.9900



On trial 6/10:
Model L Validation loss: 0.0362 - accuracy: 0.9892
Best Validation loss: 0.0359 - accuracy: 0.9907
Model Validation loss: 0.0359 - accuracy: 0.9892



O

## Run the test set

Note: we manually selected the model that had both high accuracy and low loss, choosing the trial $i$ given by:

$\arg\max_i \left[ \frac{\sigma_a^2}{\sigma^2 + \sigma_a^2} z_a^{(i)} - z_\ell^{(i)} \right]$

where $z_a^{(i)}$ is the z-score of the $i^\text{th}$ accuracy and $z_\ell^{(i)}$ is the z-score of the $i^\text{th}$ loss. $\sigma = \sqrt{p(1-p)/M} \approx 10^{-3}$ is the standard deviation of accuracy on the test set, calculated by assuming the $M$ test samples are Bernoulli random variates with $p = \text{average accuracy}$. $\sigma_a$ is the approximate uncertainty (standard deviation) of accuracy around a given loss value.
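
The selection score can be sketched in NumPy as follows. The loss and accuracy values are the "Model Validation" figures from the six frozen-scale trials whose output is fully visible in the log; the later trials are truncated and therefore omitted. The value of `sigma_a` is not stated in this notebook, so `1e-3` is a hypothetical placeholder, and computing the z-scores over these per-trial checkpoint values is likewise an assumption.

```python
import numpy as np

# (loss, accuracy) of the best-loss checkpoint for the six fully logged trials.
loss = np.array([0.0335, 0.0345, 0.0346, 0.0349, 0.0343, 0.0359])
acc = np.array([0.9909, 0.9899, 0.9904, 0.9890, 0.9900, 0.9892])

M = 10000                          # samples in the evaluation set
p = acc.mean()                     # average accuracy
sigma = np.sqrt(p * (1 - p) / M)   # Bernoulli standard deviation of accuracy
sigma_a = 1e-3                     # hypothetical; not given in the source

def zscore(v):
    return (v - v.mean()) / v.std()

# Accuracy is down-weighted by how noisy it is relative to sigma_a.
weight = sigma_a ** 2 / (sigma ** 2 + sigma_a ** 2)
score = weight * zscore(acc) - zscore(loss)
best = int(np.argmax(score))
print('sigma = %.2e, weight = %.3f, best trial index = %d' % (sigma, weight, best))
```

Among these six trials the first one has both the lowest loss and the highest accuracy, so it wins for any positive `sigma_a`; the placeholder value only matters when accuracy and loss disagree.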

In [10]:
# Evaluate the manually selected model on the validation and test sets

model = load_model(out_folder + '/modelQL_0.h5', custom_objects={'DenseQ':DenseQ, 'ConvQ':ConvQ, 'ResidQ':ResidQ, 'quantize':quantize, 'concatenate':concatenate})
score = model.evaluate(x_val, y_val, verbose=0)
print('Validation loss: %.4f - accuracy: %.4f'%(score[0], score[1]))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test       loss: %.4f - accuracy: %.4f'%(score[0], score[1]))
print('%04dB - %05.2f%% - Network:'%(storage, 100*score[1]), config)

Validation loss: 0.0335 - accuracy: 0.9909
Test       loss: 0.0295 - accuracy: 0.9916
1947B - 99.16% - Network: [('A', 2, 4), ('C', 5, 3, 3, 1, 1, 4, 8, 4), ('C', 8, 3, 3, 1, 1, 4, 8, 4), ('C', 11, 3, 3, 1, 1, 4, 8, 4), ('M', 2, 4), ('D', 0.1, 4), ('S', 10, 4, 8, 8)]
