# **Searching for Activation Functions**

Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.

## **Abstract**

Experiments were conducted with ResNet-50 on the MNIST dataset. Details that the paper does not specify were chosen arbitrarily, and a few settings that the paper does specify were slightly revised where an improvement seemed necessary.

* Modified content

 - To speed up training, the model's matrix operations run in float16 via mixed precision.

 - Resize the input image to 112x112x3.

 - Change the batch size to 64 and the epoch to 15.

* Arbitrarily chosen content

 - Since the initial value of the Swish parameter beta is not specified, it is drawn from a normal distribution with mean = 1 and standard deviation = 0.05.
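
For reference, Swish is defined as f(x) = x * sigmoid(beta * x). The sketch below is a stdlib-only illustration, not the notebook's TensorFlow layer; the helper names `sigmoid`, `swish`, and `sample_beta` are hypothetical. It checks the two limiting cases (beta = 0 gives the scaled linear function x / 2, large beta approaches ReLU) and mimics the arbitrary Normal(1, 0.05) initialization of beta described above.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def sample_beta(rng, mean=1.0, stddev=0.05):
    """Draw an initial beta, mirroring the notebook's arbitrary choice."""
    return rng.gauss(mean, stddev)


xs = [-3.0, -1.0, 0.0, 1.0, 3.0]

# beta = 0: sigmoid(0) = 0.5, so Swish collapses to x / 2.
linear = [swish(x, beta=0.0) for x in xs]

# Large beta: the sigmoid becomes a step, so Swish approaches ReLU.
relu_like = [swish(x, beta=50.0) for x in xs]

# beta = 1: the fixed "Swish-1" variant.
swish_1 = [swish(x, beta=1.0) for x in xs]

beta0 = sample_beta(random.Random(0))
print(linear)
print(relu_like)
print(swish_1)
print(beta0)
```

The trainable variant starts near beta = 1 and lets training move it between these two regimes.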

All experimental results have been uploaded to TensorBoard.dev and can be viewed at the address below.

*https://tensorboard.dev/experiment/z9KMPf3GReOt79vaXOOVGg/*

## **Default Setting**

In [1]:
import tensorflow as tf

import numpy as np
from matplotlib import pyplot as plt

tf.__version__

'2.4.1'

In [2]:
!python -V

Python 3.7.10


In [3]:
!nvidia-smi

Wed Mar 24 04:25:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:            25G        755M         22G        1.0M        2.0G         24G
Swap:            0B          0B          0B


In [5]:
tf.keras.mixed_precision.set_global_policy("mixed_float16")

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-SXM2-16GB, compute capability 7.0
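
One reason mixed precision needs care is float16's narrow range: very small gradient values round to zero. The stdlib-only sketch below (the helper `to_float16` and both constants are hypothetical) round-trips values through IEEE half precision to show the underflow, and how multiplying by a scale factor before the cast, which is the idea behind loss scaling, preserves the value. According to the TensorFlow mixed precision guide, `Model.fit` applies loss scaling automatically under this policy, so this only illustrates the motivation.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8       # hypothetical gradient magnitude
loss_scale = 1024.0    # hypothetical scale factor (a power of two)

# Cast directly: the value is below float16's smallest subnormal (~6e-8).
underflowed = to_float16(tiny_grad)

# Scale first, cast to float16, then unscale in full precision.
scaled = to_float16(tiny_grad * loss_scale)
recovered = scaled / loss_scale

print(underflowed)   # 0.0: the gradient is lost
print(recovered)     # close to the original 1e-8
```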


## **Load Dataset**

In [6]:
## Load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

## Print the shapes.
print(f"x_train.shape: {x_train.shape}")
print(f"y_train.shape: {y_train.shape}")
print(f"x_test.shape: {x_test.shape}")
print(f"y_test.shape: {y_test.shape}")

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
x_train.shape: (60000, 28, 28)
y_train.shape: (60000,)
x_test.shape: (10000, 28, 28)
y_test.shape: (10000,)


In [7]:
RESIZED_IMAGE = [112, 112]

@tf.function
def tr_rescale_and_resize(images, labels):
    images = tf.expand_dims(images, axis = -1)
    images = tf.image.convert_image_dtype(images, tf.float32)
    images = tf.image.resize(images, RESIZED_IMAGE)
    images = tf.image.grayscale_to_rgb(images)
    labels = tf.cast(labels, tf.int32)
    return images, labels

@tf.function
def ts_rescale_and_resize(images):
    images = tf.expand_dims(images, axis = -1)
    images = tf.image.convert_image_dtype(images, tf.float32)
    images = tf.image.resize(images, RESIZED_IMAGE)
    images = tf.image.grayscale_to_rgb(images)
    return images
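
The preprocessing above can be traced shape by shape. The following NumPy sketch is an approximation rather than the notebook's code: the function `preprocess_sketch` is hypothetical, and it uses nearest-neighbour upsampling via `np.repeat` in place of `tf.image.resize`, whose default method is bilinear. It relies on 112 being an exact multiple of 28.

```python
import numpy as np


def preprocess_sketch(images, factor=4):
    """Mimic the shape and scaling changes of the preprocessing step.

    images: uint8 array of shape (batch, 28, 28).
    Returns a float32 array of shape (batch, 28 * factor, 28 * factor, 3).
    """
    # (B, 28, 28) -> (B, 28, 28, 1): add a channel axis.
    x = np.expand_dims(images, axis=-1)

    # uint8 [0, 255] -> float32 [0, 1], as convert_image_dtype does.
    x = x.astype(np.float32) / 255.0

    # Upsample height and width by repeating pixels (nearest neighbour).
    # NOTE: tf.image.resize defaults to bilinear; this is a simplification.
    x = np.repeat(np.repeat(x, factor, axis=1), factor, axis=2)

    # (B, H, W, 1) -> (B, H, W, 3): copy the gray channel three times.
    x = np.repeat(x, 3, axis=-1)
    return x


rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(2, 28, 28), dtype=np.uint8)
out = preprocess_sketch(batch)
print(out.shape, out.dtype, out.min(), out.max())
```

Because the three channels are identical copies, resizing and converting to RGB adds no information; it only matches the input shape the ResNet-50 model expects.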

In [8]:
BATCH_SIZE = 64
AUTO = tf.data.experimental.AUTOTUNE ## -1

## Train & validation split (the first 10,000 training examples are held out for validation).
NUM_VAL_SIZE = 10_000

tr_X, vl_X = x_train[NUM_VAL_SIZE:], x_train[:NUM_VAL_SIZE]
tr_Y, vl_Y = y_train[NUM_VAL_SIZE:], y_train[:NUM_VAL_SIZE]


## Make tf.data.Dataset.
tr_ds = tf.data.Dataset.from_tensor_slices((tr_X, tr_Y)
                        ).batch(BATCH_SIZE
                        ).map(tr_rescale_and_resize, num_parallel_calls = AUTO
                        ).cache(
                        ).shuffle(3_000, reshuffle_each_iteration = True
                        ).prefetch(AUTO)

vl_ds = tf.data.Dataset.from_tensor_slices((vl_X, vl_Y)
                        ).batch(BATCH_SIZE
                        ).map(tr_rescale_and_resize, num_parallel_calls = AUTO
                        ).cache(
                        # ).shuffle(3_000, reshuffle_each_iteration = True
                        ).prefetch(AUTO)

ts_ds = tf.data.Dataset.from_tensor_slices((x_test,)
                        ).batch(BATCH_SIZE
                        ).map(ts_rescale_and_resize, # num_parallel_calls = AUTO
                        ).cache(
                        # ).shuffle(3_000, reshuffle_each_iteration = True
                        ).prefetch(AUTO)

## Print the shapes.
print(f"tr_dataset.element_spec: {[i.shape for i in tr_ds.element_spec]}")
print(f"vl_dataset.element_spec: {[i.shape for i in vl_ds.element_spec]}")
print(f"ts_dataset.element_spec: {[ts_ds.element_spec.shape]}")

tr_dataset.element_spec: [TensorShape([None, 112, 112, 3]), TensorShape([None])]
vl_dataset.element_spec: [TensorShape([None, 112, 112, 3]), TensorShape([None])]
ts_dataset.element_spec: [TensorShape([None, 112, 112, 3])]
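
The numbers in this pipeline can be checked with a little arithmetic. The sketch below is plain Python, not `tf.data`; the helper `batches` and the constant names are hypothetical. It reproduces the 50,000/10,000 split of the training set, the resulting 782 steps per epoch that appear in the training log, and the fact that shuffling after batching reorders whole batches while leaving each batch's contents fixed.

```python
import math
import random

NUM_TRAIN_TOTAL = 60_000   # size of the MNIST training set
NUM_HELD_OUT = 10_000      # examples held out from it
BATCH_SIZE = 64

num_train = NUM_TRAIN_TOTAL - NUM_HELD_OUT
steps_per_epoch = math.ceil(num_train / BATCH_SIZE)   # last batch is partial
last_batch_size = num_train - (steps_per_epoch - 1) * BATCH_SIZE
print(num_train, steps_per_epoch, last_batch_size)


def batches(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Toy data: batch first, then shuffle, as in the pipeline above.
toy = batches(list(range(10)), batch_size=4)
shuffled = toy[:]
random.Random(0).shuffle(shuffled)

# The order of batches may change, but each batch keeps the same members.
print(toy)
print(shuffled)
```

If per-example shuffling were wanted instead, the shuffle step would have to come before batching; that is a design trade-off, not something the notebook does.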


## **Modeling**

In [9]:
## The native TF implementations have more memory-efficient gradients,
## so prefer the built-in functions where possible,

## e.g. tf.nn.swish, tf.nn.relu6, tf.nn.sigmoid.

## Reference: https://github.com/qubvel/efficientnet/blob/master/efficientnet/model.py#L99-L117

class Swish(tf.keras.layers.Layer):
    def __init__(self, trainable = True):
        super(Swish, self).__init__()
        ## The paper does not specify the initial value of Swish's beta parameter,
        ## so the initializer used here is an arbitrary choice.
        self.trainable = trainable

        if self.trainable:
            self.initializer = tf.keras.initializers.RandomNormal(mean = 1., stddev = 0.05)
        else:
            self.initializer = tf.keras.initializers.Ones()

    def build(self, input_shape):
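        ## Add the scalar parameter beta (trainable only if self.trainable is True).
        ## Note: reusing one Swish instance for every activation shares this beta across the model.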
        self.beta = self.add_weight(shape = (1,), 
                                    initializer = self.initializer, 
                                    trainable = self.trainable)

    def call(self, x, training = True):
        return x * tf.nn.sigmoid(self.beta * x)
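
Differentiating f(x) = x * sigmoid(beta * x) gives f'(x) = beta * f(x) + sigmoid(beta * x) * (1 - beta * f(x)), which for beta = 1 reduces to f(x) + sigmoid(x) * (1 - f(x)). The stdlib-only sketch below (helper names are hypothetical) checks this closed form against a central finite difference; it is a sanity check, not how TensorFlow computes the gradient.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_grad(x, beta):
    """Closed-form derivative of Swish with respect to x."""
    f = swish(x, beta)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)


def numeric_grad(x, beta, eps=1e-6):
    """Central finite-difference approximation of the derivative."""
    return (swish(x + eps, beta) - swish(x - eps, beta)) / (2.0 * eps)


max_error = max(
    abs(swish_grad(x, beta) - numeric_grad(x, beta))
    for x in [-4.0, -1.0, 0.0, 0.5, 2.0]
    for beta in [0.5, 1.0, 1.5]
)
print(max_error)

# Unlike ReLU, Swish is non-monotonic: its slope is negative for some x < 0.
print(swish_grad(-2.0, 1.0))
```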

In [10]:
def Conv2D_BN_ReLU(
    x, 
    filters, 
    kernel_size, 
    strides = 1,
    padding = "same",
    use_activation = True,
    activation_fn = None,
):
    x = tf.keras.layers.Conv2D(filters, kernel_size, strides = strides, padding = padding)(x)
    x = tf.keras.layers.BatchNormalization()(x)

    if use_activation:
        assert activation_fn is not None
        x = tf.keras.layers.Activation(activation_fn)(x)

    return x


def residual_block(
    x, 
    output_filters,
    activation_fn,
):
    assert x.shape[-1] <= output_filters, \
        "The size of the output dimension must be equal to or greater than the size of the input dimension."

    residual = x

    strides = 2 if (residual.shape[-1] < output_filters and output_filters !=  256) else 1

    x = Conv2D_BN_ReLU(x, output_filters // 4, 1, strides = strides, activation_fn = activation_fn)
    x = Conv2D_BN_ReLU(x, output_filters // 4, 3, activation_fn = activation_fn)
    x = Conv2D_BN_ReLU(x, output_filters, 1, use_activation = False)

    ## When the input and output have the same dimensions, the shortcut performs identity mapping.
    ## When the dimensions increase, the projection shortcut in Eqn.(2) is used to match dimensions (done by 1x1 convolutions).
    if residual.shape[-1] < output_filters:
        residual = Conv2D_BN_ReLU(residual, output_filters, 1, strides = strides, use_activation = False)

    x = tf.keras.layers.Add()([x, residual])
    x = tf.keras.layers.Activation(activation_fn)(x)

    return x


def ResNet50_Baseline(
    model_name,
    activation_fn = tf.nn.relu6,
    num_classes = 10,
):
    """Base ResNet50"""
    x = model_input = tf.keras.layers.Input(shape = (*RESIZED_IMAGE, 3), dtype = tf.float32)

    ## Entry flow.
    x = Conv2D_BN_ReLU(x, 64, 7, strides = 2, activation_fn = activation_fn)
    x = tf.keras.layers.MaxPool2D(3, strides = 2, padding = "same")(x)

    ## Middle flow.
    for output_filters in [256] * 3 + [512] * 4 + [1024] * 6 + [2048] * 3:
        x = residual_block(x, output_filters, activation_fn = activation_fn)

    ## Exit flow.
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(num_classes)(x)

    model_output = x = tf.keras.layers.Activation(tf.nn.softmax, dtype = tf.float32)(x) ## float16 to float32

    return tf.keras.Model(
        inputs = model_input,
        outputs = model_output,
        name = model_name)
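
The stage list and stride rule above can be traced without building the model. The sketch below is plain Python under stated assumptions: the helper `same_padding_output` and the counters are hypothetical, and "same" padding is taken to give an output size of ceil(size / stride). It shows where the "50" comes from (one stem convolution, 16 bottleneck blocks of three convolutions each, and one dense layer, with projection shortcuts not counted) and how the 112x112 input shrinks through the network.

```python
import math

STAGES = [256] * 3 + [512] * 4 + [1024] * 6 + [2048] * 3


def same_padding_output(size, stride):
    """Spatial size after a 'same'-padded conv or pool with the given stride."""
    return math.ceil(size / stride)


# Entry flow: 7x7 conv (stride 2) followed by 3x3 max-pool (stride 2).
size = 112
size = same_padding_output(size, 2)
size = same_padding_output(size, 2)
channels = 64

weighted_layers = 1    # the stem convolution
projections = 0        # 1x1 shortcut convs (not counted in the "50")
trace = []

for out_channels in STAGES:
    # Same rule as residual_block: downsample when channels grow,
    # except in the first (256-channel) stage.
    stride = 2 if (channels < out_channels and out_channels != 256) else 1
    if channels < out_channels:
        projections += 1
    size = same_padding_output(size, stride)
    channels = out_channels
    weighted_layers += 3   # 1x1, 3x3, 1x1 bottleneck convolutions
    trace.append((out_channels, size))

weighted_layers += 1   # the final Dense layer

print(weighted_layers, projections, size)
print(sorted(set(trace)))
```

With this input size the last stage works on a 4x4 feature map before global average pooling.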

In [11]:
# tmp = ResNet50_Baseline("swish", activation_fn = Swish(trainable = True))
# tmp.summary()

In [12]:
# del tmp

## **Train**

In [13]:
!rm -rf logs

In [14]:
relu6  = ResNet50_Baseline("ReLU-6",  activation_fn = tf.nn.relu6)
swish  = ResNet50_Baseline("Swish",   activation_fn = Swish(trainable = True))
swish1 = ResNet50_Baseline("Swish-1", activation_fn = Swish(trainable = False))

In [15]:
def get_callbacks(model_name):        
    ## TensorBoard callback.
    log_dir = f"logs/fit/{model_name}"
    tb_callback = tf.keras.callbacks.TensorBoard(log_dir = log_dir, histogram_freq = 1, update_freq = "batch")
    
    return [tb_callback]

In [16]:
total_hist = {}
for model in [relu6, swish, swish1]:
    print(f"Model {model.name}...\n")

    model.compile(optimizer = "adam",
                  loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                  metrics = ["acc"])
    
    hist = model.fit(tr_ds,
                     validation_data = vl_ds,
                     epochs = 15,
                     verbose = 2,
                     callbacks = get_callbacks(model.name))
    
    total_hist[model.name] = hist

Model ReLU-6...

Epoch 1/15
782/782 - 71s - loss: 0.2169 - acc: 0.9382 - val_loss: 0.2088 - val_acc: 0.9529
Epoch 2/15
782/782 - 41s - loss: 0.0736 - acc: 0.9795 - val_loss: 0.1395 - val_acc: 0.9659
Epoch 3/15
782/782 - 41s - loss: 0.0586 - acc: 0.9839 - val_loss: 0.4681 - val_acc: 0.8978
Epoch 4/15
782/782 - 41s - loss: 0.0554 - acc: 0.9843 - val_loss: 0.4252 - val_acc: 0.9008
Epoch 5/15
782/782 - 41s - loss: 0.0393 - acc: 0.9877 - val_loss: 0.0719 - val_acc: 0.9820
Epoch 6/15
782/782 - 41s - loss: 0.0413 - acc: 0.9879 - val_loss: 0.0690 - val_acc: 0.9822
Epoch 7/15
782/782 - 41s - loss: 0.0392 - acc: 0.9893 - val_loss: 0.0958 - val_acc: 0.9742
Epoch 8/15
782/782 - 41s - loss: 0.0294 - acc: 0.9911 - val_loss: 0.1214 - val_acc: 0.9713
Epoch 9/15
782/782 - 41s - loss: 0.0353 - acc: 0.9895 - val_loss: 0.0359 - val_acc: 0.9892
Epoch 10/15
782/782 - 41s - loss: 0.0246 - acc: 0.9926 - val_loss: 0.0544 - val_acc: 0.9856
Epoch 11/15
782/782 - 41s - loss: 0.0235 - acc: 0.9929 - val_loss: 0.060
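
The ReLU-6 validation metrics fluctuate noticeably between epochs. The sketch below copies the ten complete epoch lines visible above (the output is cut off during epoch 11) and picks out the best and worst epochs by validation accuracy with a regular expression; the variable names are hypothetical.

```python
import re

# Complete epoch lines copied from the ReLU-6 log above.
LOG = """\
782/782 - 71s - loss: 0.2169 - acc: 0.9382 - val_loss: 0.2088 - val_acc: 0.9529
782/782 - 41s - loss: 0.0736 - acc: 0.9795 - val_loss: 0.1395 - val_acc: 0.9659
782/782 - 41s - loss: 0.0586 - acc: 0.9839 - val_loss: 0.4681 - val_acc: 0.8978
782/782 - 41s - loss: 0.0554 - acc: 0.9843 - val_loss: 0.4252 - val_acc: 0.9008
782/782 - 41s - loss: 0.0393 - acc: 0.9877 - val_loss: 0.0719 - val_acc: 0.9820
782/782 - 41s - loss: 0.0413 - acc: 0.9879 - val_loss: 0.0690 - val_acc: 0.9822
782/782 - 41s - loss: 0.0392 - acc: 0.9893 - val_loss: 0.0958 - val_acc: 0.9742
782/782 - 41s - loss: 0.0294 - acc: 0.9911 - val_loss: 0.1214 - val_acc: 0.9713
782/782 - 41s - loss: 0.0353 - acc: 0.9895 - val_loss: 0.0359 - val_acc: 0.9892
782/782 - 41s - loss: 0.0246 - acc: 0.9926 - val_loss: 0.0544 - val_acc: 0.9856
"""

pattern = re.compile(r"val_loss: ([\d.]+) - val_acc: ([\d.]+)")

# One (epoch, val_loss, val_acc) record per matched line, epochs counted from 1.
records = [
    (epoch, float(m.group(1)), float(m.group(2)))
    for epoch, m in enumerate(pattern.finditer(LOG), start=1)
]

best_epoch, best_loss, best_acc = max(records, key=lambda r: r[2])
worst_epoch, worst_loss, worst_acc = min(records, key=lambda r: r[2])
print(best_epoch, best_acc)
print(worst_epoch, worst_acc)
```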

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir logs/fit

In [None]:
!tensorboard dev upload --logdir ./logs \
    --name "Simple experiment of swish activation function with MNIST" \
    --description "Training results from the paper 'Searching for Activation Functions'" \
    --one_shot