<h2>Activation Functions in Keras and PyTorch</h2>

This notebook provides a summary of the activation functions supported by Keras (2.4.0) and PyTorch (1.7.0).

Activation functions are a crucial component of deep learning. They are mathematical functions that determine the output of each neuron, and thereby affect both the accuracy of a network and the computational efficiency of training it. Each one serves as a mathematical “gate” between the input feeding the current neuron and the output going to the next layer.

An activation function is attached to each neuron in the network, and determines whether, and how strongly, that neuron is activated. Some activation functions also squash the output of each neuron into a bounded range, such as 0 to 1 or -1 to 1. When building a model and training a neural network, the selection of the most appropriate activation functions is crucial. We can achieve better results if we experiment with different activation functions for different problems.

There are 3 main types of activation functions:
1. Binary Step Functions
2. Linear Activation Functions
3. Non-Linear Activation Functions

In this notebook I will only cover Non-Linear Activation Functions for deep neural networks.

<strong>ReLU</strong> (= Rectified Linear Unit)

- The most commonly used activation function. Many libraries and hardware accelerators provide optimizations specific to ReLU.
- ReLU is a good choice when speed is a priority: it is cheap to compute, and networks that use it tend to converge quickly.
- The gradient of the ReLU function is 0 when its input is negative.
- The output is never negative, which matters if ReLU is used in the output layer.
- The function suffers from the dying ReLU problem: when the weighted sum of a neuron's inputs is negative for every training instance, the gradient of the function is 0, so backpropagation can no longer update that neuron's weights. Such neurons die during training and only output 0. With a large learning rate, as many as half of the neurons may die.
- A dead neuron may eventually come back to life if Gradient Descent tweaks the neurons in the layers below in such a way that the weighted sum of its inputs turns positive again.
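
The zero gradient for negative inputs, and the resulting dead neuron, can be sketched in plain Python. This is only an illustration: the helper names (`relu`, `relu_grad`, `neuron_weight_grads`) and the example weights are assumptions made for this sketch and are not part of Keras or PyTorch.

```python
def relu(z: float) -> float:
    """ReLU: pass positive inputs through unchanged, clamp the rest to 0."""
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """Derivative of ReLU (taken as 0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def neuron_weight_grads(weights, bias, inputs, upstream_grad=1.0):
    """Gradient of a single ReLU neuron's output with respect to its weights."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return [upstream_grad * relu_grad(z) * x for x in inputs]


inputs = [0.5, 1.0]

# Weighted sum is positive: gradients flow, so the weights can be updated.
print(neuron_weight_grads([0.4, 0.3], 0.1, inputs))

# Weighted sum is negative: every gradient is 0, so the neuron cannot learn
# from this input. If this holds for all training inputs, the neuron is dead.
print(neuron_weight_grads([-0.4, -0.3], -0.1, inputs))
```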

In [None]:
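import tensorflow as tf
from tensorflow.keras import layers
import torch
from torch import nn

# Keras example 1: functional form applied to a tensor
x = tf.constant([-2.0, 0.0, 3.0])
tf.keras.activations.relu(x, alpha=0.0, max_value=None, threshold=0)

# Keras example 2: ReLU as the activation of a Dense layer
model = tf.keras.Sequential()
model.add(layers.Dense(32, activation='relu'))

# PyTorch example
m = nn.ReLU()
inputs = torch.randn(2)
output = m(inputs)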

<strong>Leaky ReLU</strong>

- This function is a variation of ReLU, and provides a solution for the dying ReLU problem.
- It has a small positive slope (typically 0.01) in the negative region (z < 0), so the gradient is nonzero and backpropagation still works for negative input values. The small slope ensures that leaky ReLUs never die.
- It often outperforms the strict ReLU activation function.
- This function is a good option if runtime latency is a priority.
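
A minimal sketch of the function and its derivative shows why the gradient never vanishes completely. The names `leaky_relu` and `leaky_relu_grad` are illustrative, and the slope of 0.01 is the typical value mentioned above.

```python
def leaky_relu(z: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: identity for positive inputs, a small slope otherwise."""
    return z if z > 0 else negative_slope * z


def leaky_relu_grad(z: float, negative_slope: float = 0.01) -> float:
    """Derivative of Leaky ReLU: never exactly 0, even for negative inputs."""
    return 1.0 if z > 0 else negative_slope


for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```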

In [None]:
# Keras example (alpha is the negative slope; the Keras default is 0.3)
tf.keras.layers.LeakyReLU(alpha=0.3)

# PyTorch example (the default negative_slope is 0.01)
m = nn.LeakyReLU(negative_slope=0.01, inplace=False)
inputs = torch.randn(2)
output = m(inputs)

<strong>PReLU</strong> (= Parametric Leaky ReLU)

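- This function makes the slope of the negative part a parameter that is learned during training through backpropagation, rather than a fixed hyperparameter.
- A disadvantage is that its benefit depends on the problem: it may help on large training sets, but the extra learned parameter risks overfitting on smaller ones.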
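
How the negative slope can be learned is easiest to see on a single input. The sketch below performs one gradient descent step on the slope for a squared-error loss. The helper names, the learning rate, and the target value are assumptions chosen only for illustration; the starting slope of 0.25 matches the default `init` of `torch.nn.PReLU`.

```python
def prelu(z: float, alpha: float) -> float:
    """PReLU: identity for positive inputs, learnable slope alpha otherwise."""
    return z if z > 0 else alpha * z


def prelu_grad_alpha(z: float) -> float:
    """Derivative of the PReLU output with respect to alpha."""
    return z if z <= 0 else 0.0


def sgd_step_alpha(alpha: float, z: float, target: float, lr: float = 0.1) -> float:
    """One gradient descent step on alpha for the loss 0.5 * (output - target)**2."""
    error = prelu(z, alpha) - target
    return alpha - lr * error * prelu_grad_alpha(z)


alpha = 0.25
new_alpha = sgd_step_alpha(alpha, z=-2.0, target=-1.0)

# The updated slope moves the output for z = -2.0 closer to the target.
print(alpha, prelu(-2.0, alpha))
print(new_alpha, prelu(-2.0, new_alpha))
```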

<strong>RReLU</strong> (= Randomized Leaky ReLU)

- If we have spare time and additional computing power, we can use cross-validation to evaluate PReLU and RReLU. RReLU is worth evaluating if the network is overfitting, and PReLU if we work with a large training set.
- There is no official implementation of RReLU in Keras 2.4.0.

In [None]:
# Keras example for PReLU
tf.keras.layers.PReLU(
    alpha_initializer="zeros",
    alpha_regularizer=None,
    alpha_constraint=None,
    shared_axes=None)

# PyTorch example for PReLU
torch.nn.PReLU(num_parameters=1, init=0.25)

# PyTorch example for RReLU
torch.nn.RReLU(lower=0.125, upper=0.333, inplace=False)

<strong>Thresholded ReLU</strong>

In [None]:
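# Keras example: outputs x where x > theta, and 0 otherwise
tf.keras.layers.ThresholdedReLU(theta=1.0)

# PyTorch example: outputs x where x > threshold, and value otherwise
# (threshold and value are required; the values here mirror the Keras example)
torch.nn.Threshold(threshold=1.0, value=0.0, inplace=False)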

<strong>ELU</strong> (= Exponential Linear Unit)

- Networks that use ELUs typically converge faster during training than networks that use ReLU, but ELU is slower to compute than ReLU and its variants owing to the exponential function.
- ELUs take negative values for z < 0 and saturate to a negative value (-alpha) as z becomes very negative. This pushes mean activations closer to 0, which enables faster learning because it brings the gradient closer to the natural gradient.
- This function diminishes the vanishing gradient effect, and does not produce dead neurons.
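
The saturation and the nonzero gradient for negative inputs can be checked with a small sketch. The names `elu` and `elu_grad` are illustrative and not taken from either library.

```python
import math


def elu(z: float, alpha: float = 1.0) -> float:
    """ELU: identity for positive inputs, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)


def elu_grad(z: float, alpha: float = 1.0) -> float:
    """Derivative of ELU: stays positive for every finite negative input."""
    return 1.0 if z > 0 else alpha * math.exp(z)


# For increasingly negative inputs the output approaches -alpha,
# while the gradient shrinks but does not become exactly 0.
for z in (-10.0, -3.0, -1.0, 0.0, 1.0):
    print(z, elu(z), elu_grad(z))
```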

In [None]:
# Keras example
tf.keras.activations.elu(x, alpha=1.0)

# PyTorch example
torch.nn.ELU(alpha=1.0, inplace=False)

<strong>SELU</strong> (= Scaled ELU)

- This function can significantly outperform other activation functions in deep neural networks when the conditions for self-normalization are met. It may also improve performance in CNNs.
- If we build a neural network only with a stack of dense layers, and if all hidden layers use the SELU activation function, the network will self-normalize: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1 during training. This addresses the vanishing/exploding gradients problem.
- Requirements for self-normalization:
    - The network’s architecture must be Sequential.
    - The input features must be standardized (mean 0 and standard deviation 1).
    - Every hidden layer’s weights must be initialized with LeCun normal initialization (kernel_initializer="lecun_normal").
    - If dropout is used, it must be the variant tf.keras.layers.AlphaDropout (not the regular dropout).
- If the network’s architecture does not allow for self-normalization, ELU may perform better than SELU, since SELU is not smooth at z = 0.
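
SELU is an ELU multiplied by a fixed scale, with both constants chosen so that activations are pulled toward a mean of 0 and a standard deviation of 1. A minimal sketch of the function itself follows; the constant and function names are illustrative, and the values are the published SELU constants rounded to double precision.

```python
import math

# Published SELU constants (rounded to double precision).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(z: float) -> float:
    """SELU: a scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * math.expm1(z))


# Positive inputs are scaled up slightly; very negative inputs
# saturate at -SELU_SCALE * SELU_ALPHA.
for z in (-10.0, -1.0, 0.0, 1.0):
    print(z, selu(z))
```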

In [None]:
# Keras example 1
tf.keras.activations.selu(x)

# Keras example 2
model.add(layers.Dense(32, activation='selu', kernel_initializer='lecun_normal'))

# PyTorch example
torch.nn.SELU(inplace=False)

<strong>Sigmoid</strong> (or <strong>Logistic</strong>)

- Sigmoid is equivalent to a 2-element Softmax, where the second element is assumed to be 0.
- This function has a smooth gradient, which prevents “jumps” in output values.
- The output values fall between 0 and 1, which normalizes the output of each neuron, and provides clear predictions.
- For very negative inputs (< -5) Sigmoid returns a value close to 0, and for large positive inputs (> 5) the result of the function is close to 1.
- Disadvantages of the function: vanishing gradients (the gradient is close to 0 where the function saturates), outputs that are not zero-centered, and a higher computational cost than ReLU because of the exponential.
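
The equivalence with a 2-element Softmax and the saturation behavior can both be verified numerically. The sketch below uses only the standard library; the function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax2_first(z: float) -> float:
    """First component of softmax([z, 0])."""
    m = max(z, 0.0)
    a, b = math.exp(z - m), math.exp(0.0 - m)
    return a / (a + b)


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: largest at z = 0, near 0 where it saturates."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-5.0, 0.0, 1.3, 5.0):
    print(z, sigmoid(z), softmax2_first(z), sigmoid_grad(z))
```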

In [None]:
# Keras example
tf.keras.activations.sigmoid(x)

# PyTorch example
torch.nn.Sigmoid()

<strong>TanH</strong> (= Hyperbolic Tangent)

- This is an S-shaped, continuous, and differentiable function.
- It is very similar to Sigmoid, but is zero-centered.
- TanH squashes its output into the range -1 to 1.
- This range tends to make each layer’s output centered around 0 at the beginning of training, which often speeds up convergence.
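
The similarity to Sigmoid is exact up to rescaling: tanh(z) = 2 * sigmoid(2z) - 1, which stretches the output range from (0, 1) to (-1, 1) and centers it on 0. A short sketch checking this identity against `math.tanh` (the helper names are illustrative):

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid (plain form, adequate for moderate |z|)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_from_sigmoid(z: float) -> float:
    """TanH expressed as a rescaled, recentered sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, tanh_from_sigmoid(z), math.tanh(z))
```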

In [None]:
# Keras example
tf.keras.activations.tanh(x)

# PyTorch example
torch.nn.Tanh()

<strong>Softmax</strong>

- Softmax converts a real vector to a vector of categorical probabilities.
- Softmax is often used as the activation function for the output layer of a classification network, because the result can be interpreted as a probability distribution over the classes.
- The elements of the output vector are in range (0, 1) and sum to 1.
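
A minimal sketch of the computation is shown here. Subtracting the maximum logit before exponentiating is a common stability trick that leaves the result unchanged; the function name and the example logits are illustrative.

```python
import math


def softmax(logits):
    """Convert a list of real-valued logits into probabilities that sum to 1."""
    m = max(logits)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```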

In [None]:
# Keras example
tf.keras.activations.softmax(x, axis=-1)

# PyTorch example
torch.nn.Softmax(dim=None)

<strong>Softplus</strong>

- This is a smooth variant of ReLU.
- It is close to 0 for large negative z, and close to z for large positive z.
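
Softplus is defined as log(1 + exp(z)). The sketch below writes it in an overflow-safe form and prints values across the range; the function name is illustrative.

```python
import math


def softplus(z: float) -> float:
    """Softplus, log(1 + exp(z)), rearranged so exp never overflows."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Near 0 for very negative inputs, near z for very positive inputs,
# and a smooth curve in between (softplus(0) equals log(2)).
for z in (-20.0, -3.0, 0.0, 3.0, 20.0):
    print(z, softplus(z))
```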

In [None]:
# Keras example
tf.keras.activations.softplus(x)

# PyTorch example
torch.nn.Softplus(beta=1, threshold=20)

<strong>Softsign</strong>

In [None]:
# Keras example
tf.keras.activations.softsign(x)

# PyTorch example
torch.nn.Softsign()

<strong>Additional Activation Function supported in Keras (2.4.0)</strong>

In [None]:
# Exponential
tf.keras.activations.exponential(x)

<strong>Additional Activation Functions supported in PyTorch (1.7.0)</strong>

In [None]:
# ReLU6
torch.nn.ReLU6(inplace=False)

# CELU
torch.nn.CELU(alpha=1.0, inplace=False)

# GELU
torch.nn.GELU()

# SiLU
torch.nn.SiLU(inplace=False)

# Softmin
torch.nn.Softmin(dim=None)

# Softmax2d
torch.nn.Softmax2d()

# Softshrink
torch.nn.Softshrink(lambd=0.5)

# Hardshrink
torch.nn.Hardshrink(lambd=0.5)

# Hardsigmoid
torch.nn.Hardsigmoid(inplace=False)

# Hardtanh
torch.nn.Hardtanh(min_val=-1.0, max_val=1.0, inplace=False, min_value=None, max_value=None)

# Hardswish
torch.nn.Hardswish(inplace=False)

# LogSigmoid
torch.nn.LogSigmoid()

# LogSoftmax
torch.nn.LogSoftmax(dim=None)

# AdaptiveLogSoftmaxWithLoss
torch.nn.AdaptiveLogSoftmaxWithLoss(
    in_features,
    n_classes,
    cutoffs,
    div_value=4.0,
    head_bias=False)

# MultiheadAttention
torch.nn.MultiheadAttention(
    embed_dim,
    num_heads,
    dropout=0.0,
    bias=True,
    add_bias_kv=False,
    add_zero_attn=False,
    kdim=None,
    vdim=None)

# Tanhshrink
torch.nn.Tanhshrink()

In summary, one question remains:

<strong>Which activation functions are best to use for hidden layers in a deep neural network?</strong>

As a general rule of thumb: SELU > ELU > Leaky ReLU (and its variants RReLU and PReLU) > ReLU > TanH > Sigmoid