# Chapter 4 - Activation Functions

## ReLU Activation Function Code

In [17]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
outputs = []

for i in inputs:
    outputs.append(i if i > 0 else 0)

outputs

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]

In [18]:
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
outputs = []

for i in inputs:
    outputs.append(max(0, i))

outputs

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]

In [19]:
import numpy as np
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
outputs = np.maximum(0, inputs)

outputs

array([0. , 2. , 0. , 3.3, 0. , 1.1, 2.2, 0. ])

In [20]:
import nnfs
# nnfs.init() ensures repeatable results for everyone;
# call it after importing NumPy
nnfs.init()
from nnfs.datasets import spiral_data

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        self.output = None

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def __init__(self):
        self.output = None

    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

X, y = spiral_data(samples=100, classes=3)
dense1 = Layer_Dense(2, 3)
activation1 = Activation_ReLU()
dense1.forward(X)
activation1.forward(dense1.output)
activation1.output[:5]

array([[0.        , 0.        , 0.        ],
       [0.        , 0.00011395, 0.        ],
       [0.        , 0.00031729, 0.        ],
       [0.        , 0.00052666, 0.        ],
       [0.        , 0.00071401, 0.        ]], dtype=float32)

## Softmax Activation Function

In [21]:
# Values from the previous output when we described
# what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# e - mathematical constant, we use E here to match a common coding
# style where constants are uppercased
E = 2.71828182846  # you can also use math.e

# For each value in a vector, calculate the exponential value
exp_values = []
for output in layer_outputs:
    exp_values.append(E ** output)  # ** - power operator in Python

# this is how I implemented it on my own
summed_exp = sum(exp_values)
normalized_exp = list(map(lambda x: x/summed_exp, exp_values))
normalized_exp_sum = sum(normalized_exp)
print('exponentiated values: {}'.format(exp_values))
print('sum of exponentiated values: {}'.format(summed_exp))
print('normalized exponentiated values: {}'.format(normalized_exp))
print('sum of the normalized values (should be 1): {}'.format(normalized_exp_sum))

# here's the book's solution
# Now normalize values
norm_base = sum(exp_values)  # We sum all values
norm_values = []
for value in exp_values:
    norm_values.append(value / norm_base)
print('Normalized exponentiated values:')
print(norm_values)
print('Sum of normalized values:', sum(norm_values))


exponentiated values: [121.51041751893969, 3.3534846525504487, 10.85906266492961]
sum of exponentiated values: 135.72296483641975
normalized exponentiated values: [0.8952826639573506, 0.024708306782070668, 0.08000902926057876]
sum of the normalized values (should be 1): 1.0
Normalized exponentiated values:
[0.8952826639573506, 0.024708306782070668, 0.08000902926057876]
Sum of normalized values: 1.0


We can use NumPy to do the same thing more compactly:

In [22]:
import numpy as np

# Values from the earlier output when we described
# what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# For each value in a vector, calculate the exponential value
exp_values = np.exp(layer_outputs)
print('exponentiated values:')
print(exp_values)

# Now normalize values
norm_values = exp_values / np.sum(exp_values)
print('normalized exponentiated values:')
print(norm_values)
print('sum of normalized values:', np.sum(norm_values))

exponentiated values:
[121.51041752   3.35348465  10.85906266]
normalized exponentiated values:
[0.89528266 0.02470831 0.08000903]
sum of normalized values: 0.9999999999999999


For each sample in a batch, we want to sum all of a layer’s outputs, reducing each row of the output array (whose length equals the number of neurons in the layer) to a single value.
Keeping these sums as a column vector lets us normalize the whole batch, sample by sample, with a single calculation:

In [23]:
print('Sum axis 1, but keep the same dimensions as input:')
layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])
print(np.sum(layer_outputs, axis=1, keepdims=True))

Sum axis 1, but keep the same dimensions as input:
[[8.395]
 [7.29 ]
 [2.487]]
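
The column vector of per-sample sums can then divide the whole batch through broadcasting. The following is a minimal sketch that reuses the `layer_outputs` values from the cell above; the names `row_sums`, `probabilities`, and `flat_sums` are illustrative only and do not come from the book's code:

```python
import numpy as np

layer_outputs = np.array([[4.8, 1.21, 2.385],
                          [8.9, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

# Exponentiate every value in the batch
exp_values = np.exp(layer_outputs)

# One sum per sample, kept as a column vector of shape (3, 1)
row_sums = np.sum(exp_values, axis=1, keepdims=True)

# Broadcasting divides each row by its own sum
probabilities = exp_values / row_sums

print(row_sums.shape)
print(np.sum(probabilities, axis=1))

# Without keepdims the sums have shape (3,); dividing by that array
# would line the sums up with columns rather than rows
flat_sums = np.sum(exp_values, axis=1)
print(flat_sums.shape)
```

With `keepdims=True`, each row of `probabilities` should sum to approximately 1, which is the sample-wise normalization described above.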


In [24]:
# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

There are two main challenges with neural networks:

- “dead neurons”
- very large numbers (referred to as “exploding” values).

The exponential function used in softmax activation is one of the sources of exploding values.
With Softmax, thanks to the normalization, we can subtract any value from all the inputs, and it will not change the output:

In [25]:
softmax = Activation_Softmax()
softmax.forward([[1, 2, 3]])
print(softmax.output)
softmax.forward([[-2, -1, 0]])  # subtracted 3, the max of the list
print(softmax.output)

[[0.09003057 0.24472847 0.66524096]]
[[0.09003057 0.24472847 0.66524096]]
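
To see why subtracting the maximum matters, it may help to compare a naive softmax with the shifted version on deliberately large inputs. This is a sketch under assumptions: the input values in `logits` are made up for illustration, and the functions are written inline rather than using the `Activation_Softmax` class above.

```python
import numpy as np

# Hypothetical large inputs, chosen so that exp() overflows in float64
logits = np.array([[1000.0, 1001.0, 1002.0]])

# Naive softmax: np.exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over='ignore', invalid='ignore'):
    naive_exp = np.exp(logits)
    naive = naive_exp / np.sum(naive_exp, axis=1, keepdims=True)
print(naive)

# Shifted softmax: the largest exponent becomes 0, so exp() stays in (0, 1]
shifted_exp = np.exp(logits - np.max(logits, axis=1, keepdims=True))
stable = shifted_exp / np.sum(shifted_exp, axis=1, keepdims=True)
print(stable)
```

After the shift, the inputs become `[-2, -1, 0]`, so the stable result should match the output printed above for `[[1, 2, 3]]`.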


Now, we can add another dense layer as the output layer, setting it to contain as many inputs as the previous layer has outputs and as many outputs as our data includes classes. Then we can apply the softmax activation to the output of this new layer:

In [26]:
# Create dataset
X, y = spiral_data(samples=100, classes=3)

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 3 input features (as we take output
# of previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)

# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()

# Make a forward pass of our training data through this layer
dense1.forward(X)

# Make a forward pass through activation function
# it takes the output of first dense layer here
activation1.forward(dense1.output)

# Make a forward pass through second Dense layer
# it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Make a forward pass through activation function
# it takes the output of second dense layer here
activation2.forward(dense2.output)

# Let's see output of the first few samples:
print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]]


As you can see, the predictions are almost evenly distributed: each sample has about 33% (0.33) confidence for each class. This results from the weight initialization (random values drawn from a normal distribution and scaled by 0.01) together with zeroed biases, which keep the layer outputs close to zero. Not every random initialization will produce such an even split.
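
One way to see why the outputs sit so close to 1/3 is to feed softmax values that are tiny and nearly equal, which is roughly what small weights and zero biases produce. This is a minimal NumPy-only sketch; the numbers in `small_logits` and the helper name `softmax_rows` are assumptions for illustration, not values or functions from the network above.

```python
import numpy as np

def softmax_rows(values):
    # Same computation as Activation_Softmax.forward, written as a function
    exp_values = np.exp(values - np.max(values, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Hypothetical near-zero layer outputs
small_logits = np.array([[1e-5, -2e-5, 3e-5]])
probs_small = softmax_rows(small_logits)
print(probs_small)

# The same pattern scaled up gives a clearly uneven distribution
larger_logits = small_logits * 1e5
probs_large = softmax_rows(larger_logits)
print(probs_large)
```

The first result should be nearly uniform, while the second should put most of the probability on the last class.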

These outputs are also our “confidence scores.” To determine which classification the model has chosen to be the prediction, we perform an *argmax* on these outputs, which checks which of the classes in the output distribution has the highest confidence and returns its index - the predicted class index.

That said, the confidence score can be as important as the class prediction itself. For example, the argmax of [0.22, 0.6, 0.18] is the same as the argmax of [0.32, 0.36, 0.32]. In both cases, the argmax function returns an index of 1 (the second element in Python’s zero-indexed paradigm), but a 60% confidence is clearly much stronger than a 36% confidence.
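
The comparison above can be sketched with `np.argmax` and `np.max`. The two arrays are the example distributions from the text; the variable names are illustrative only.

```python
import numpy as np

confident = np.array([0.22, 0.6, 0.18])
uncertain = np.array([0.32, 0.36, 0.32])

# Same predicted class index, very different confidence
print(np.argmax(confident), np.max(confident))
print(np.argmax(uncertain), np.max(uncertain))

# For a batch of softmax outputs, take the argmax along axis 1
batch = np.array([confident, uncertain])
predictions = np.argmax(batch, axis=1)   # predicted class per sample
confidences = np.max(batch, axis=1)      # confidence of that prediction
print(predictions)
print(confidences)
```

Applied to the network above, the same call would be `np.argmax(activation2.output, axis=1)`, giving one predicted class index per sample.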