First proof of concept
======================

_Chuan-Zheng Lee <<czlee@stanford.edu>>_ <br />
_July 2021_

Here's the idea in this notebook:

- Take the most basic nontrivial neural network task we can think of. I nominate the [MNIST digit recognition task](https://keras.io/examples/vision/mnist_convnet/).
- Run a well-known, impossible-to-fail training system on it.
- Code up the new [`Optimizer`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer) object, and try that out.
- Does it work? How does it compare?

This will not be journal-ready, but it will provide a short development cycle for our new `Optimizer`.

If we get bored of this task, there are plenty more basic working examples in https://keras.io/examples/vision/. Again, not saying we should use these in our paper, but they'll get us started.

Really basic MNIST task
-----------------------

This code is literally lifted straight out of https://keras.io/examples/vision/mnist_convnet/, except that I changed the optimizer to SGD (it's Adam in the example).
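For reference, plain SGD without momentum subtracts the learning rate times the gradient from each weight. The sketch below is a pure-Python toy rather than Keras code: `sgd_step`, `weights` and `grads` are hypothetical names, and the learning rate of 0.01 is chosen to match the default of `keras.optimizers.SGD`.

```python
def sgd_step(weights, grads, learning_rate=0.01):
    """Return the weights after one plain SGD step: w <- w - learning_rate * g."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]


# Hypothetical toy values, only to show the arithmetic of one update.
weights = [0.5, -1.0, 2.0]
grads = [1.0, -2.0, 0.0]
new_weights = sgd_step(weights, grads)
print(new_weights)
```

A weight whose gradient is zero is left unchanged, and the others move against the sign of their gradient.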

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


In [3]:
model = keras.Sequential(
    [
        layers.InputLayer(input_shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                16010     
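The parameter counts in the summary can be checked by hand from the layer shapes. This is a small sketch with hypothetical helper names (`conv2d_params`, `dense_params`); it assumes every layer has a bias term, which is the Keras default.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_features, units):
    """Weight matrix plus one bias per unit."""
    return in_features * units + units


conv1 = conv2d_params(3, 3, 1, 32)     # first Conv2D, applied to the 1-channel input
conv2 = conv2d_params(3, 3, 32, 64)    # second Conv2D, applied to 32 channels
dense = dense_params(5 * 5 * 64, 10)   # Dense layer on the flattened 5x5x64 features
total = conv1 + conv2 + dense
print(conv1, conv2, dense, total)
```

Pooling, flatten and dropout layers have no trainable parameters, so they do not appear in the total.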

In [4]:
batch_size = 128
nepochs = 15

# I changed the optimizer to SGD (it was Adam), and instantiated an Optimizer object to make
# it clearer when we write our own optimizer.
optimizer = keras.optimizers.SGD()
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=batch_size, epochs=nepochs, validation_split=0.1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7f1b74dd5970>

In [5]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

Test loss: 0.0861344262957573
Test accuracy: 0.9742000102996826


# Custom optimizer

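Nah, bad idea. Rather than subclass `Optimizer`, we open up the gradient update loop ourselves in the next section.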

# Opening the gradient update loop

This code is adapted from https://keras.io/getting_started/intro_to_keras_for_researchers/#layer-gradients. It differs from the code in that tutorial (which uses the same dataset, but a different model) in the following ways, mostly to be consistent with the basic MNIST tutorial we used above:
- We use the more complicated network architecture that we used above, not the simple three-layer network in the research tutorial
- The output is a probability vector (softmax), not logits
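The second difference matters for how the loss is set up: `keras.losses.CategoricalCrossentropy` has a `from_logits` argument that defaults to `False`, so it expects probabilities, which is what a softmax output layer provides. The pure-Python sketch below (hypothetical helper names, toy numbers) illustrates that cross-entropy computed from softmax probabilities agrees with cross-entropy computed directly from logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_probs(one_hot, probs):
    """Categorical cross-entropy given a probability vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def cross_entropy_from_logits(one_hot, logits):
    """Categorical cross-entropy given raw logits, via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_sum_exp) for t, z in zip(one_hot, logits))


logits = [2.0, -1.0, 0.5]      # toy logits for a 3-class problem
one_hot = [1.0, 0.0, 0.0]      # true class is class 0
loss_probs = cross_entropy_from_probs(one_hot, softmax(logits))
loss_logits = cross_entropy_from_logits(one_hot, logits)
print(loss_probs, loss_logits)
```

The two values agree up to floating-point rounding, so the choice between a softmax output and logits is about consistency with the loss configuration, not about a different objective.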

In [6]:
# running eagerly

# same model, loss function and optimizer as before, but instantiate new ones
model = keras.Sequential(
    [
        layers.InputLayer(input_shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
loss_fn = keras.losses.CategoricalCrossentropy()
optimizer = keras.optimizers.SGD()

nepochs = 15
batch_size = 64
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
n = x_train.shape[0] // batch_size

for epoch in range(nepochs):
    
    for i, (x, y) in dataset.enumerate():
        with tf.GradientTape() as tape:
            probs = model(x)            # forward pass (Dropout is inactive without training=True)
            loss = loss_fn(y, probs)    # external loss value
        
        gradients = tape.gradient(loss, model.trainable_weights)            # compute gradients
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))  # apply gradients
    
        if i % 100 == 0:
            print(f"epoch {epoch} of {nepochs}, {i} of {n}, loss: {loss:f}", end='\r')

epoch 14 of 15, 900 of 937, loss: 0.117984
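A note on the progress message: `n` is computed with floor division, so it counts full batches only, whereas `Dataset.batch` keeps the final partial batch by default (`drop_remainder=False`). The sketch below, with a hypothetical helper `batch_counts`, works through the arithmetic for the batch sizes used in this notebook.

```python
import math


def batch_counts(num_samples, batch_size):
    """Return (full batches, total batches, size of the last batch)."""
    full_batches = num_samples // batch_size
    total_batches = math.ceil(num_samples / batch_size)
    last_batch_size = num_samples - (total_batches - 1) * batch_size
    return full_batches, total_batches, last_batch_size


counts_64 = batch_counts(60000, 64)
counts_500 = batch_counts(60000, 500)
print(counts_64)
print(counts_500)
```

With a batch size of 64 the loop therefore runs one more step per epoch than `n` suggests, on a final batch of 32 samples; with a batch size of 500 the division is exact.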

In [7]:
probs_test = model(x_test)
test_loss = loss_fn(y_test, probs_test)
accuracy_fn = keras.metrics.CategoricalAccuracy()
test_accuracy = accuracy_fn(y_test, probs_test)  # metric signature is (y_true, y_pred)

print("Test loss:", float(test_loss))
print("Test accuracy:", float(test_accuracy))

Test loss: 0.0636865571141243
Test accuracy: 0.9805999994277954


In [8]:
# running with compiled function

# same model, loss function and optimizer as before, but instantiate new ones
model = keras.Sequential(
    [
        layers.InputLayer(input_shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
loss_fn = keras.losses.CategoricalCrossentropy()
optimizer = keras.optimizers.SGD()

nepochs = 15
batch_size = 64
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
n = x_train.shape[0] // batch_size

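@tf.function
def train_on_batch(x, y):
    """Run one SGD step on a mini-batch and return its loss (traced into a TensorFlow graph)."""
    with tf.GradientTape() as tape:
        probs = model(x)          # forward pass (Dropout is inactive without training=True)
        loss = loss_fn(y, probs)  # loss on this mini-batch
    gradients = tape.gradient(loss, model.trainable_weights)            # compute gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))  # apply gradients
    return loss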

for epoch in range(nepochs):
    for i, (x, y) in dataset.enumerate():
        loss = train_on_batch(x, y)
        if i % 100 == 0:
            print(f"epoch {epoch} of {nepochs}, {i} of {n}, loss: {loss:f}", end='\r')

epoch 14 of 15, 900 of 937, loss: 0.152469

In [9]:
probs_test = model(x_test)
test_loss = loss_fn(y_test, probs_test)
accuracy_fn = keras.metrics.CategoricalAccuracy()
test_accuracy = accuracy_fn(y_test, probs_test)  # metric signature is (y_true, y_pred)

print("Test loss:", float(test_loss))
print("Test accuracy:", float(test_accuracy))

Test loss: 0.06301199644804001
Test accuracy: 0.9800999760627747


# Larger batch sizes

In some sense, our algorithm is a variant of full-batch gradient descent, so we should make sure that plain full-batch gradient descent works on this task. Though… a full batch here is all 60,000 training images, which is a lot. Let's just try larger batch sizes for now.
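If we do eventually want a true full-batch gradient, one option is to accumulate mini-batch gradients and apply a single update per pass over the data. The sketch below is an assumption about how that could be done, not something this notebook does: it uses a toy one-parameter least-squares problem in pure Python, with hypothetical names, to show that a size-weighted average of mini-batch gradients equals the full-batch gradient.

```python
def mean_gradient(w, xs, ys):
    """Gradient with respect to w of the mean squared error of the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data: 12 points on the line y = 3x + 1.
xs = [float(i) for i in range(12)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5
batch_size = 4

# Gradient computed on the whole dataset at once.
full_grad = mean_gradient(w, xs, ys)

# Same gradient, accumulated one mini-batch at a time and weighted by batch size.
accumulated = 0.0
for start in range(0, len(xs), batch_size):
    xb = xs[start:start + batch_size]
    yb = ys[start:start + batch_size]
    accumulated += mean_gradient(w, xb, yb) * len(xb)
accumulated /= len(xs)

print(full_grad, accumulated)
```

Weighting by batch size keeps the result correct even when the last batch is smaller than the others.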

In [10]:
model = keras.Sequential(
    [
        layers.InputLayer(input_shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
loss_fn = keras.losses.CategoricalCrossentropy()
optimizer = keras.optimizers.SGD()

nepochs = 15
batch_size = 500
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
n = x_train.shape[0] // batch_size

@tf.function
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        probs = model(x)
        loss = loss_fn(y, probs)
    gradients = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    return loss

for epoch in range(nepochs):
    for i, (x, y) in dataset.enumerate():
        loss = train_on_batch(x, y)
        print(f"epoch {epoch} of {nepochs}, {i} of {n}, loss: {loss:f}", end='\r')

epoch 14 of 15, 119 of 120, loss: 0.228805

In [11]:
probs_test = model(x_test)
test_loss = loss_fn(y_test, probs_test)
accuracy_fn = keras.metrics.CategoricalAccuracy()
test_accuracy = accuracy_fn(y_test, probs_test)  # metric signature is (y_true, y_pred)

print("Test loss:", float(test_loss))
print("Test accuracy:", float(test_accuracy))

Test loss: 0.19686318933963776
Test accuracy: 0.9431999921798706


# Adding random noise to the gradients

Here's a really dumb idea that isn't quite what we mean: just add zero-mean Gaussian noise, with standard deviation `σₙ`, independently to every entry of every gradient.
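To get a feel for what gradient noise does, here is a toy sketch in pure Python, under assumptions that are not part of this notebook: a one-dimensional quadratic objective `f(w) = w**2 / 2`, the default SGD learning rate of 0.01, and a hypothetical helper `noisy_gradient_descent`. For this toy problem the iterates do not settle at the minimum but hover around it with stationary variance `learning_rate * noise_std**2 / (2 - learning_rate)`.

```python
import random


def noisy_gradient_descent(steps, learning_rate=0.01, noise_std=1.0, w0=1.0, seed=0):
    """Minimise f(w) = w**2 / 2 with Gaussian noise added to every gradient."""
    rng = random.Random(seed)
    w = w0
    trajectory = []
    for _ in range(steps):
        gradient = w                                         # exact gradient of w**2 / 2
        noisy_gradient = gradient + rng.gauss(0.0, noise_std)
        w -= learning_rate * noisy_gradient                  # plain SGD update
        trajectory.append(w)
    return trajectory


clean = noisy_gradient_descent(steps=2000, noise_std=0.0)
noisy = noisy_gradient_descent(steps=200000, noise_std=1.0)

# Drop the initial transient, then measure the spread of the noisy iterates around 0.
tail = noisy[2000:]
empirical_var = sum(w * w for w in tail) / len(tail)
predicted_var = 0.01 * 1.0 ** 2 / (2 - 0.01)  # learning_rate * noise_std**2 / (2 - learning_rate)
print(abs(clean[-1]), empirical_var, predicted_var)
```

The noise-free run converges to the minimum, while the noisy run keeps fluctuating at a scale set by the learning rate and the noise level. A real network is not a quadratic, so this is only intuition for why the loss plateaus higher when noise is added.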

In [12]:
model = keras.Sequential(
    [
        layers.InputLayer(input_shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
loss_fn = keras.losses.CategoricalCrossentropy()
optimizer = keras.optimizers.SGD()

nepochs = 15
batch_size = 500
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
n = x_train.shape[0] // batch_size
σₙ = 1.0  # standard deviation of the Gaussian noise added to each gradient entry

@tf.function
def train_on_batch(x, y):
    with tf.GradientTape() as tape:
        probs = model(x)
        loss = loss_fn(y, probs)
    gradients = tape.gradient(loss, model.trainable_weights)
    gradients = [g + tf.random.normal(shape=g.shape, mean=0.0, stddev=σₙ) for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    return loss

for epoch in range(nepochs):
    for i, (x, y) in dataset.enumerate():
        loss = train_on_batch(x, y)
        print(f"epoch {epoch} of {nepochs}, {i} of {n}, loss: {loss:f}", end='\r')

epoch 14 of 15, 119 of 120, loss: 0.775801

In [13]:
probs_test = model(x_test)
test_loss = loss_fn(y_test, probs_test)
accuracy_fn = keras.metrics.CategoricalAccuracy()
test_accuracy = accuracy_fn(y_test, probs_test)  # metric signature is (y_true, y_pred)

print("Test loss:", float(test_loss))
print("Test accuracy:", float(test_accuracy))

Test loss: 0.8540118932723999
Test accuracy: 0.807200014591217
