In [25]:
import numpy as np
from scipy import signal
import matplotlib
%matplotlib inline
np.set_printoptions(precision=3, suppress=True)

# Part 4: Optimization with loss feedback

This notebook explores loss functions, in particular the L1 and L2 regression losses versus classification (cross-entropy) loss.

In [26]:
# regression losses, written so that lower is better (a perfect prediction gives 0)
l2_loss = lambda labels,output: 0.5*np.nanmean(np.power(output - labels, 2.0))
l1_loss = lambda labels,output: np.nanmean(np.abs(output - labels))

one_hot = lambda labels, n_classes: np.eye(n_classes)[np.array(labels).reshape(-1)].transpose()
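# softmax is defined first so that ce_loss reads top to bottom; outputs have shape (n_classes, n_samples)
softmax = lambda x: np.exp(x)/np.nansum(np.exp(x), axis=0)
ce_loss = lambda labels,output: -np.nansum(np.log(softmax(output))* one_hot(labels, output.shape[0]), axis=0)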

raw_outputs = np.random.randn(10)
maxind = np.argmax(softmax(raw_outputs), axis=0)
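
Computing softmax as `np.exp(x)` divided by its sum can overflow for large raw outputs. The following is a minimal sketch of a numerically safer variant, assuming raw outputs of shape (n_classes, n_samples) and integer labels. The names `stable_softmax` and `stable_ce_loss` are hypothetical and not part of this notebook.

```python
import numpy as np

def stable_softmax(x):
    """Softmax over axis 0. Subtracting the per-column max does not change
    the result mathematically but keeps np.exp from overflowing."""
    shifted = x - np.max(x, axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

def stable_ce_loss(labels, output):
    """Per-sample cross-entropy for integer labels and raw outputs of shape
    (n_classes, n_samples), using the log-sum-exp form."""
    shifted = output - np.max(output, axis=0, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=0, keepdims=True))
    labels = np.asarray(labels).reshape(-1)
    # pick the log-probability of the labelled class for each sample
    return -log_probs[labels, np.arange(output.shape[1])]

# 3 classes, 2 samples; the first column would overflow a naive np.exp
logits = np.array([[1000.0, 0.0],
                   [0.0, 0.0],
                   [-1000.0, 0.0]])
probs = stable_softmax(logits)
losses = stable_ce_loss([0, 2], logits)
```

The first sample is confidently correct, so its loss is 0. The second sample has uniform probabilities over three classes, so its loss is log(3).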

For every loss except L1, the gradient at the loss carries the magnitude of the error (output minus label). The L1 gradient carries only the polarity of the error. L1 is like a game of warmer/colder, whereas L2 is like a compass (but still not a map).

 L1 is robust to outliers, but its major drawback is that it can have multiple solutions.

 L1 just adds up the absolute lengths along each of the N dimensions, so many different points share an identical L1 norm.

 L2 takes the vector length across the dimensions, and can have only one solution.
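
The points above can be checked numerically. This is a minimal sketch, assuming the usual convention that a lower loss is better. The helpers `l2_grad` and `l1_grad` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def l2_grad(labels, output):
    """Gradient of 0.5 * mean((output - labels)**2) with respect to output."""
    return (output - labels) / output.size

def l1_grad(labels, output):
    """(Sub)gradient of mean(|output - labels|) with respect to output."""
    return np.sign(output - labels) / output.size

labels = np.zeros(4)
output = np.array([0.1, -0.1, 1.0, 100.0])   # the last entry is an outlier

g2 = l2_grad(labels, output)   # scales with the size of each error
g1 = l1_grad(labels, output)   # same magnitude for every nonzero error

# Fitting a single constant c to the two points 0 and 1:
points = np.array([0.0, 1.0])
candidates = (0.0, 0.25, 0.5, 1.0)
l1_totals = [np.sum(np.abs(points - c)) for c in candidates]   # identical for every c in [0, 1]
l2_totals = [np.sum((points - c) ** 2) for c in candidates]    # smallest only at c = 0.5
```

The outlier dominates the L2 gradient but gets the same weight as every other point under L1. The second half shows the non-uniqueness: every candidate between the two points gives the same L1 total, while L2 singles out the midpoint.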
 
# Data splits for training, validation and testing

Data splitting means reserving some of the data for training and some for evaluation. Splitting is sometimes handled by the platform, but it is still necessary to understand. A typical split is 70:20:10 for train/validation/test. The validation set is used to decide how to modify training hyperparameters, and the test set provides a final (supposedly unbiased) evaluation of how well the network is going to perform on data not used to optimize it.
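
As a minimal sketch of the idea, the hypothetical helper `split_indices` below shuffles all indices with a fixed seed before cutting them into three non-overlapping parts. The fractions and the seed are assumptions chosen for illustration, and shuffling before the cut is one design choice among several.

```python
import numpy as np

def split_indices(n_items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Return shuffled, non-overlapping index arrays for train/val/test."""
    assert abs(sum(fractions) - 1.0) < 1e-9, 'fractions must sum to 1'
    rng = np.random.default_rng(seed)
    index = rng.permutation(n_items)
    # round() avoids off-by-one cuts caused by floating point (e.g. 0.7 + 0.2)
    split1 = int(round(n_items * fractions[0]))
    split2 = int(round(n_items * (fractions[0] + fractions[1])))
    return index[:split1], index[split1:split2], index[split2:]

train_idx, val_idx, test_idx = split_indices(1000)
```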

In [27]:
import gzip
import struct
datasets=['data/train-labels-idx1-ubyte.gz', 'data/train-images-idx3-ubyte.gz']
with gzip.open(datasets[0], 'rb') as fp:
    # labels: 4-byte magic number, 4-byte item count, then one unsigned byte per label
    magic=struct.unpack(">L", fp.read(4))[0]
    assert magic == 2049, 'magic number in header does not match expectations'
    num_items=struct.unpack(">L", fp.read(4))[0]
    mnist_labels=np.frombuffer(fp.read(), dtype=np.uint8).astype(int)

with gzip.open(datasets[1], 'rb') as fp:
    # images: magic number, image count, rows, cols, then one unsigned byte per pixel
    magic=struct.unpack(">L", fp.read(4))[0]
    assert magic == 2051, 'magic number in header does not match expectations'
    num_imgs=struct.unpack(">L", fp.read(4))[0]
    num_rows=struct.unpack(">L", fp.read(4))[0]
    num_cols=struct.unpack(">L", fp.read(4))[0]
    data=np.frombuffer(fp.read(), dtype=np.uint8).astype(int)
    # use the sizes read from the header rather than hard-coding 28x28
    mnist_data=data.reshape((num_imgs, num_rows, num_cols, 1))

# split into train, val and held-back test sets
index=np.arange(num_items)
split1=int(num_items*0.7)
split2=int(num_items*0.9)
index_train=index[:split1]
index_val=index[split1:split2]
index_test=index[split2:]
# randomize index (only for train set) and get data loader
np.random.shuffle(index_train)
# list(...) lets each split be iterated more than once (a bare zip is exhausted after one pass in Python 3)
train_data = list(zip(mnist_data[index_train], mnist_labels[index_train]))
val_data = list(zip(mnist_data[index_val], mnist_labels[index_val]))
test_data = list(zip(mnist_data[index_test], mnist_labels[index_test]))

The test set should ideally be used ONLY to evaluate the finished network, to reduce the opportunity for the network to learn how to fake you out. The ideal scenario we're aiming at is a network that learns representations that generalize to new data. But we haven't seen the new data yet, and it could be subtly different from the data we split into train/val/test. This is the "unseen data" problem.

Using the test set to direct training can have a negligible effect, or it can be catastrophic if you don't have additional realistic test sets to confirm the results.

A much bigger effect, however, is the shift between 1) the train/val/test dataset distributions and 2) the real-world production data distributions the network will encounter.
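
One coarse way to look for such a shift is to compare class frequencies between two labelled samples. The sketch below assumes that a labelled sample of production data is available, which is often not the case. It only detects a shift in the label distribution, not a shift in the inputs themselves. The function names and the toy label arrays are hypothetical.

```python
import numpy as np

def class_frequencies(labels, n_classes):
    """Fraction of samples that fall into each class."""
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=n_classes)
    return counts / counts.sum()

def total_variation(p, q):
    """Half the L1 distance between two discrete distributions
    (0 = identical, 1 = completely disjoint)."""
    return 0.5 * np.sum(np.abs(p - q))

train_labels = np.array([0, 0, 1, 1, 2, 2])
prod_labels = np.array([0, 0, 0, 0, 1, 2])    # hypothetical production sample
shift = total_variation(class_frequencies(train_labels, 3),
                        class_frequencies(prod_labels, 3))
```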

# Class imbalance
A related data sampling problem is class imbalance. Say 90% of your examples are of one class and 10% are of the other. Let's see what happens with an untrained 2-class network, whose accuracy should be about 50% if it is not biased by the sampling.

In [28]:
n_images=2000
labels=np.concatenate([np.ones((int(0.9*n_images),)), np.zeros((int(0.1*n_images),))])
data=np.random.random((n_images,784,))
trivial_inference = lambda data: np.ones(len(data),)
trivial_accuracy = lambda labels, outputs: sum(labels==outputs)/float(n_images)
outputs = trivial_inference(data)
accuracy = trivial_accuracy(labels, outputs)
print('Accuracy with an untrained, hand-crafted net: ', accuracy)

Accuracy with an untrained, hand-crafted net:  0.9
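
A metric that averages per-class recall makes the problem visible. This is a minimal sketch using a hypothetical `balanced_accuracy` helper on the same kind of 90/10 label split. It is an illustration, not part of the notebook's pipeline.

```python
import numpy as np

def balanced_accuracy(labels, outputs):
    """Mean of per-class recall, so every class counts equally."""
    classes = np.unique(labels)
    recalls = [np.mean(outputs[labels == c] == c) for c in classes]
    return float(np.mean(recalls))

labels = np.concatenate([np.ones(1800), np.zeros(200)])
outputs = np.ones_like(labels)                 # always predict class 1
plain = float(np.mean(labels == outputs))      # rewards guessing the majority class
balanced = balanced_accuracy(labels, outputs)  # majority recall 1, minority recall 0
```

Under this metric the trivial predictor falls back to the 50% expected from an unbiased, untrained 2-class network.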


There exist several techniques to handle imbalance, which fall approximately into three paths:

 1. rebalance the sampling of what the net is exposed to

 2. rebalance the gradients by reweighting them according to how often each class is seen (in many cases this is about the same as rebalancing the sampling)

 3. modify the loss to be self-balancing or less sensitive to imbalance (e.g. focal loss, sampling-sensitive loss, generalized regression loss)
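
The first two paths can be sketched with inverse-frequency weights. The helper name `inverse_frequency_weights`, the normalization, and the placeholder losses are assumptions made for this illustration. The helper also assumes every class appears at least once.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class weights proportional to 1/frequency, scaled so the
    average weight over all samples is 1."""
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=n_classes)
    return len(labels) / (n_classes * counts)

labels = np.concatenate([np.ones(1800, dtype=int), np.zeros(200, dtype=int)])
class_w = inverse_frequency_weights(labels, 2)   # one weight per class
sample_w = class_w[labels]                       # one weight per sample

# Path 1: resample with probability proportional to the sample weight.
rng = np.random.default_rng(0)
resampled = rng.choice(len(labels), size=len(labels), p=sample_w / sample_w.sum())

# Path 2: keep the sampling and scale each sample's loss (and so its gradient).
per_sample_loss = np.ones(len(labels))           # placeholder losses
weighted_loss = np.mean(sample_w * per_sample_loss)
```

With these weights both classes contribute the same total weight, which is why the two paths often behave about the same.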

Sampling considerations can be very subtle. For example, object detection requires learning from both positive and negative examples. Typically, there will be far more potential negative examples (e.g. draw a box anywhere that doesn't overlap a label, or use any label with the wrong class). Balancing negatives against positives was necessary to get the first detectors working well.

# Optimization with SGD vs Adam
Stochastic gradient descent (SGD) is a shortcut for computing the gradient over the total sample. Minibatches of data are passed through the network and their gradients are accumulated into a moving average with momentum, which is then used to adjust the weights.

In [29]:
sgd_update_batch_grads = lambda grads, grads_mov_avg, momentum: [momentum * mov_grads + (1-momentum)*g for g,mov_grads in zip(grads, grads_mov_avg)]
sgd_update_weights = lambda params, batch_grads, lr: [param - lr * delta for param,delta in zip(params, batch_grads)]
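
A minimal sketch of how the two helpers could be used together. They are rewritten here as plain functions so the block is self-contained. The toy quadratic problem, the learning rate, and the step count are assumptions. The gradients in this toy are exact rather than sampled from minibatches.

```python
import numpy as np

def sgd_update_batch_grads(grads, grads_mov_avg, momentum):
    """Blend the new gradients into the moving average."""
    return [momentum * m + (1 - momentum) * g for g, m in zip(grads, grads_mov_avg)]

def sgd_update_weights(params, batch_grads, lr):
    """Take one descent step along the averaged gradients."""
    return [p - lr * d for p, d in zip(params, batch_grads)]

# Toy problem: minimize 0.5 * ||w - target||^2, whose gradient is (w - target).
target = np.array([3.0, -2.0])
params = [np.zeros(2)]
mov_avg = [np.zeros(2)]
for step in range(300):
    grads = [params[0] - target]
    mov_avg = sgd_update_batch_grads(grads, mov_avg, momentum=0.9)
    params = sgd_update_weights(params, mov_avg, lr=0.1)
```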

Adaptive moment estimation (Adam) is more sophisticated: it attempts to estimate the ideal learning rate and momentum per parameter. It is described in detail in https://arxiv.org/pdf/1412.6980.pdf

First, initialize the gradient mean (m) and variance (v, strictly the uncentered second moment) to zero. Next, for each batch, update with:

 $m = \mathrm{beta1} \cdot m + (1 - \mathrm{beta1}) \cdot \mathrm{grad}$

 $v = \mathrm{beta2} \cdot v + (1 - \mathrm{beta2}) \cdot \mathrm{grad}^2$

 $w = w - \mathrm{learning\_rate} \cdot \frac{m}{\sqrt{v} + \mathrm{epsilon}}$

In [30]:
adam_update_movavg = lambda grads_m, grads_v, grads, beta1, beta2: [[beta1*m + (1-beta1)*g,beta2*v + (1-beta2)*g*g] for m,v,g in zip(grads_m, grads_v, grads)]
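# weight update from the formula above; epsilon is the notebook-level value set below.
# 'time' is unused here because the bias correction is folded into the learning rate further down.
adam_update_weights = lambda params, grads_m, grads_v, lr, time: [param - lr * m / (np.sqrt(v) + epsilon) for param,m,v in zip(params, grads_m, grads_v)]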
learning_rate=0.001
beta1=0.9
beta2=0.999
epsilon=1e-08
time=100

The moments (m and v) must be rescaled according to the number of times they've been accumulated

$m = \frac{m}{1 - beta1^{time}}$, where time = number of time steps

$v = \frac{v}{1 - beta2^{time}}$

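Because m and v are initialized at zero, their raw values start out too small (by a factor of 1-beta at the first step) and only gradually trend towards the real m and v. The correction is therefore largest at the start (10-fold for m and 1000-fold for v at the first step, with the beta values above) and shrinks towards 1 as time grows. It can be combined with the learning rate to set the effective learning rate as a function of time: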

In [31]:
learning_rate = learning_rate * np.sqrt(1-beta2**time)/(1-beta1**time)
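
Putting the pieces together, one Adam step could look like the sketch below. It follows the formulas above, with the bias correction folded into the step size. The function name `adam_step`, the toy quadratic problem, and the larger learning rate of 0.01 are assumptions made for this illustration.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for lists of arrays; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # fold both bias corrections into the effective learning rate
    lr_t = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    params = [p - lr_t * mi / (np.sqrt(vi) + epsilon) for p, mi, vi in zip(params, m, v)]
    return params, m, v

# Toy problem: minimize 0.5 * ||w - target||^2, whose gradient is (w - target).
target = np.array([3.0, -2.0])
params = [np.zeros(2)]
m = [np.zeros(2)]
v = [np.zeros(2)]
for t in range(1, 2001):
    grads = [params[0] - target]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.01)
```

Returning the updated moments instead of mutating them keeps the helper in the same functional style as the lambdas above.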

# Dropout and batchnorm

Dropout simply draws a random mask at some dropout probability threshold and sets the masked activation values to zero.
For backpropagation, it is essential to zero the gradients at the same positions on each backward pass, before we use them either to get the gradients for the next layer back or to accumulate weight gradients. The mask therefore has to remain unchanged between the forward and backward passes.

In [32]:
dropout = lambda data, mask: np.where(mask, data, np.zeros_like(data))
dropout_backprop = lambda grads, mask: np.where(mask, grads, np.zeros_like(grads))
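
A minimal sketch of drawing a mask and reusing it in both directions. The drop probability, array shapes, and stand-in gradients are assumptions. Many implementations also rescale the kept activations by 1/(1 - drop probability), so-called inverted dropout. That step is left out here to match the helpers above.

```python
import numpy as np

rng = np.random.default_rng(0)
drop_prob = 0.5                                     # probability of zeroing a unit
activations = rng.standard_normal((4, 8))
mask = rng.random(activations.shape) >= drop_prob   # True = keep

dropped = np.where(mask, activations, 0.0)          # forward pass

upstream_grads = np.ones_like(activations)          # stand-in for real gradients
grads = np.where(mask, upstream_grads, 0.0)         # backward pass, same mask
```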

Batchnorm builds a moving-average mean and standard deviation from the data passed through it. These are then used to normalize the data before doing anything else. The network is set up to learn a de-normalization. Specifically, it learns a gamma for the new stddev (scale) and a beta for the new mean (shift).

For backpropagation to learn beta and gamma, their contributions are the derivatives of the output (f_bn) with respect to each of them:

 $\frac{df_{bn}}{dbeta} = 1$
 
 $\frac{df_{bn}}{dgamma} = data\_norm$

Here data_norm is the incoming activation, normalized by the moving-average mean and stddev.


In [33]:
def batchnorm(data, mov_avg_mu, mov_avg_sigma, gamma, beta, across_axis=0, eps=1e-5, momentum=0.9):
    # batch statistics along the chosen axis (keepdims so they broadcast against data)
    mean = np.mean(data, axis=across_axis, keepdims=True)
    sigma = np.sqrt(np.mean((data - mean) ** 2, axis=across_axis, keepdims=True))
    # update the moving averages, then normalize with them (eps guards against division by zero)
    mov_avg_mu = momentum*mov_avg_mu + (1-momentum)*mean
    mov_avg_sigma = momentum*mov_avg_sigma + (1-momentum)*sigma
    data_norm = (data - mov_avg_mu) / (mov_avg_sigma + eps)
    data_bn = (data_norm * gamma) + beta
    return (data_bn,mov_avg_mu,mov_avg_sigma,gamma,beta)

# batchnorm_delta returns data_norm to be used in a backward pass to multiply into the backward-working gradients
# for gamma; for beta, treat it like we treated the bias in fc layer backprop
batchnorm_delta = lambda data, mov_avg_mu, mov_avg_sigma, eps=1e-5: (data - mov_avg_mu) / (mov_avg_sigma + eps)
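
As a sketch of how data_norm enters the backward pass, the hypothetical helper `batchnorm_param_grads` below turns the gradient flowing back into the batchnorm output into gradients for gamma and beta. It assumes one gamma and one beta per feature, with the batch along axis 0. A finite-difference check on gamma follows.

```python
import numpy as np

def batchnorm_param_grads(grad_out, data_norm, across_axis=0):
    """Gradients of the loss with respect to gamma and beta, given the
    gradient flowing back into the batchnorm output."""
    grad_gamma = np.sum(grad_out * data_norm, axis=across_axis)  # df_bn/dgamma = data_norm
    grad_beta = np.sum(grad_out, axis=across_axis)               # df_bn/dbeta = 1
    return grad_gamma, grad_beta

rng = np.random.default_rng(0)
data_norm = rng.standard_normal((16, 3))   # batch of 16, 3 features
grad_out = rng.standard_normal((16, 3))
grad_gamma, grad_beta = batchnorm_param_grads(grad_out, data_norm)

# Finite-difference check on gamma for loss = sum(grad_out * (data_norm * gamma + beta)).
gamma, beta, h = np.ones(3), np.zeros(3), 1e-6
loss = lambda g: np.sum(grad_out * (data_norm * g + beta))
numeric = np.array([(loss(gamma + h * np.eye(3)[i]) - loss(gamma)) / h for i in range(3)])
```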