# Chapter 4 - Activation Functions

An activation function is applied to the output of a neuron. Nonlinear activation functions are what allow neural networks to map nonlinear functions.
Some examples:
- Step function (simplest): outputs 0 or 1
- Linear activation: y = x. This is effectively what we've been doing so far, passing the output of one layer straight to the next
- Sigmoid: y = 1 / (1 + e^-x)

The problem with a step function is that it gives the optimizer very little information about how
changes to the weights and biases affect the output. It’s either on (1) or off (0), so it’s hard to tell
how “close” the step function was to activating or deactivating.

When it comes time to optimize weights and biases, it’s easier for the
optimizer if we have activation functions that are more granular and informative. That's where the sigmoid comes in.
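
To make the difference concrete, here is a small sketch comparing the two functions on a few inputs. The helper names `step` and `sigmoid` are made up for this note and are not from the book.

```python
import math

def step(x):
    # Step activation: 1 if the input is positive, otherwise 0
    return 1 if x > 0 else 0

def sigmoid(x):
    # Sigmoid activation: y = 1 / (1 + e^-x)
    return 1 / (1 + math.exp(-x))

# Compare both activations on inputs near and far from the threshold
for x in [-5.0, -0.1, 0.1, 5.0]:
    print(f"x={x:5.1f}  step={step(x)}  sigmoid={sigmoid(x):.4f}")
```

The step function returns the same value for 0.1 and 5.0, so nothing in its output says how far the input was from the threshold. The sigmoid gives a different value for each input, which is the extra information the optimizer can use.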

![Alt text](image-1.png)

In [2]:
import numpy as np
from nnfs.datasets import spiral_data
import nnfs 
import matplotlib.pyplot as plt
nnfs.init()

## ReLU Activation in the Hidden Layers 
Pages 85-93

The book shows a brilliant way to visualize why you need multiple hidden layers for complex behaviour.

It seems that more layers allow neurons to interact with each other to produce more complex outputs, and more neurons in each layer allow you to produce more behaviours.
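
As a rough illustration of that interaction, the sketch below chains two single-input ReLU neurons. The weights and biases are hand-picked for this note (they are assumptions, not values taken from the book), and `two_neuron_chain` is a made-up helper name. One ReLU neuron alone can only be flat on one side and sloped on the other; two in a row can produce an output that is flat, then sloped, then flat again.

```python
def relu(x):
    # ReLU: pass positive values through, clamp negatives to 0
    return max(0.0, x)

def two_neuron_chain(x):
    # Hand-picked weights and biases, chosen only for illustration
    hidden = relu(1.0 * x + 0.0)       # first neuron: activates for x > 0
    return relu(-2.0 * hidden + 1.0)   # second neuron: switches off once hidden >= 0.5

# The output is 1 for x <= 0, falls linearly between 0 and 0.5, and is 0 after that
for x in [-1.0, 0.0, 0.25, 0.5, 1.0]:
    print(x, two_neuron_chain(x))
```

The sloped region is bounded on both sides, which neither neuron could do alone. Adding more neurons per layer gives more of these regions to combine.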


In [3]:
# Dense layer
class Layer_Dense:
    # Weights are scaled down because training steps only make small changes
    # relative to the initialized values. Disproportionately large initial
    # values would cause training to take a long time
    init_weight_multiplier = 0.01
    
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        # Initialize weights and biases
        self.weights = self.init_weight_multiplier * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases
        

In [4]:
# Relu activation function

inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
output = []
for i in  inputs:
    output.append(max(0, i))
    
#or numpy version
output2 = np.maximum(0, inputs)

print(f"{output} \n{output2}")


[0, 2, 0, 3.3, 0, 1.1, 2.2, 0] 
[0.  2.  0.  3.3 0.  1.1 2.2 0. ]


In [5]:
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from input
        self.output = np.maximum(0, inputs)

In [6]:
X, y = spiral_data(samples=100, classes=3)

In [7]:

# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()
# Make a forward pass of our training data through this layer
dense1.forward(X)
# Forward pass through activation func.
# Takes in output from previous layer
print(dense1.output[:5])
activation1.forward(dense1.output)
print(activation1.output[:5])
## all negative values have been reduced to 0, while everything else is mapped x->y

[[ 0.0000000e+00  0.0000000e+00  0.0000000e+00]
 [-1.0475188e-04  1.1395361e-04 -4.7983500e-05]
 [-2.7414842e-04  3.1729150e-04 -8.6921798e-05]
 [-4.2188365e-04  5.2666257e-04 -5.5912682e-05]
 [-5.7707680e-04  7.1401405e-04 -8.9430439e-05]]
[[0.         0.         0.        ]
 [0.         0.00011395 0.        ]
 [0.         0.00031729 0.        ]
 [0.         0.00052666 0.        ]
 [0.         0.00071401 0.        ]]


## Softmax Activation Function

Depending on what you require of your neural network, you may want to consider using other activation functions.
For our example, we'd like a classifier (to classify each branch of the spiral from spiral_data).

The ReLU activation function is
- unbounded: outputs can be any value from 0 to infinity
- not normalized: a consequence of being unbounded, so the values have no fixed scale
- exclusive: each output is independent of the others


To address this lack of context,
the softmax activation on the output data can take in non-normalized, or uncalibrated, inputs and
produce a normalized distribution of probabilities for our classes.
This distribution returned by the softmax activation function represents **confidence scores** for each
class and will add up to 1. 

![Alt text](image-2.png)


In [8]:
layer_outputs = [4.8, 1.21, 2.385]
E = np.e 

exp_values = []
for output in layer_outputs:
    exp_values.append(E ** output)
    
print('exponentiated values:')
print(exp_values)

# Now normalize values
norm_base = sum(exp_values) # We sum all values
norm_values = []
for value in exp_values:
    norm_values.append(value / norm_base)
print('Normalized exponentiated values:')
print(norm_values)
print('Sum of normalized values:', sum(norm_values))
#normalized values add up to 1. This will give us a probability score for each neuron output. 

exponentiated values:
[121.51041751873483, 3.353484652549023, 10.859062664920513]
Normalized exponentiated values:
[0.8952826639572619, 0.024708306782099374, 0.0800090292606387]
Sum of normalized values: 0.9999999999999999


The same in NumPy:

In [9]:
layer_outputs = [4.8, 1.21, 2.385]
# For each value in a vector, calculate the exponential value
exp_values = np.exp(layer_outputs)
print('exponentiated values:')
print(exp_values)
# Normalize them for each sample
probabilities = exp_values / np.sum(exp_values, axis=0, keepdims=True)
print(probabilities)

exponentiated values:
[121.51041752   3.35348465  10.85906266]
[0.89528266 0.02470831 0.08000903]


In [10]:
# Layer outputs as matrix
layer_outputs = np.array([  [4.8, 1.21, 2.385],
                            [8.9, -1.81, 0.2],
                            [1.41, 1.051, 0.026]
                            ])

print(f"np.sum no axis \n{np.sum(layer_outputs, axis=None)}\n")   # no shape, just a value
print(f"np.sum row-wise or sum of each col\n{np.sum(layer_outputs, axis=0, keepdims=True)}\n")     #.shape (1,3)
print(f"np.sum col-wise or sum of each row\n{np.sum(layer_outputs, axis=1, keepdims=True)}\n")     #.shape (3,1)
print(f"np.sum same as above, but we're not keeping dimensions. Keeping dimensions aids in matrix calculations later \n{np.sum(layer_outputs, axis=1)}\n")     #.shape (3,)
# For each value in a vector, calculate the exponential value
exp_values = np.exp(layer_outputs)
print('exponentiated values:')
print(exp_values)
# Normalize them for each sample
probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
print('probability values:')
print(probabilities)


np.sum no axis 
18.172

np.sum row-wise or sum of each col
[[15.11   0.451  2.611]]

np.sum col-wise or sum of each row
[[8.395]
 [7.29 ]
 [2.487]]

np.sum same as above, but we're not keeping dimensions. Keeping dimensions aids in matrix calculations later 
[8.395 7.29  2.487]

exponentiated values:
[[1.21510418e+02 3.35348465e+00 1.08590627e+01]
 [7.33197354e+03 1.63654137e-01 1.22140276e+00]
 [4.09595540e+00 2.86051020e+00 1.02634095e+00]]
probability values:
[[8.95282664e-01 2.47083068e-02 8.00090293e-02]
 [9.99811129e-01 2.23163963e-05 1.66554348e-04]
 [5.13097164e-01 3.58333899e-01 1.28568936e-01]]


The axis had to be set to 1 so that the sum runs along the second dimension of layer_outputs, which has shape (3, 3), indexed as (axis 0, axis 1). This gives one sum per row, i.e. one per sample.
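
The sketch below shows why `keepdims=True` matters for the division. It reuses the same numbers as the cell above; the variable names `row_sums_kept`, `row_sums_flat`, `correct` and `misaligned` are made up for this note.

```python
import numpy as np

exp_values = np.exp(np.array([[4.8, 1.21, 2.385],
                              [8.9, -1.81, 0.2],
                              [1.41, 1.051, 0.026]]))

row_sums_kept = np.sum(exp_values, axis=1, keepdims=True)   # shape (3, 1)
row_sums_flat = np.sum(exp_values, axis=1)                  # shape (3,)

# (3, 3) / (3, 1): each row is divided by its own sum
correct = exp_values / row_sums_kept
# (3, 3) / (3,): the sums line up with columns instead, so column j
# is divided by the sum of row j
misaligned = exp_values / row_sums_flat

print(correct.sum(axis=1))      # each row sums to 1
print(misaligned.sum(axis=1))   # rows no longer sum to 1
```

Both divisions run without an error because the shapes broadcast, so the mistake is silent. Only the `keepdims=True` version produces rows that are valid probability distributions.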

In [11]:
# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Get unnormalized probabilities
        # Subtract the max of each row before exponentiating, because the
        # exponential function can explode (try running np.exp(1000)).
        # After the shift the largest value in each row is 0, so every
        # exponentiated value lies between 0 and 1 and cannot overflow.
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize them for each sample
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)        #probabilities
        

In [12]:
print(np.max(layer_outputs, axis=1, keepdims=True))
print(np.exp(layer_outputs - np.max(layer_outputs, axis=1, keepdims=True)))
softmax = Activation_Softmax()
softmax.forward(layer_outputs)
softmax.output

[[4.8 ]
 [8.9 ]
 [1.41]]
[[1.00000000e+00 2.75983304e-02 8.93673389e-02]
 [1.00000000e+00 2.23206120e-05 1.66585811e-04]
 [1.00000000e+00 6.98374351e-01 2.50574249e-01]]


array([[8.95282664e-01, 2.47083068e-02, 8.00090293e-02],
       [9.99811129e-01, 2.23163963e-05, 1.66554348e-04],
       [5.13097164e-01, 3.58333899e-01, 1.28568936e-01]])
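
To see what the max subtraction protects against, here is a small sketch comparing a softmax without the shift to one with it. The function names `softmax_naive` and `softmax_shifted` and the test inputs are assumptions made for this note.

```python
import numpy as np

def softmax_naive(inputs):
    # Exponentiate directly; large inputs overflow to inf
    exp_values = np.exp(inputs)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_shifted(inputs):
    # Subtract each row's max first, so the largest exponent is exp(0) = 1
    exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0   # same differences between values, much larger magnitude

# Silence the overflow / invalid-value warnings raised by the naive version
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)

print(naive_large)              # contains nan because inf / inf is undefined
print(softmax_shifted(large))
print(softmax_shifted(small))   # same probabilities as for `large`
```

Subtracting a constant from every value in a row does not change the softmax result, since the common factor cancels in the division. That is why the shifted version gives identical probabilities for `small` and `large`.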

### Time to put the lessons together

In [13]:
# Create dataset
X, y = spiral_data(samples=100, classes=3)
# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()
# Create second Dense layer with 3 input features (as we take output
# of previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)
# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()
# Make a forward pass of our training data through this layer
dense1.forward(X)
# Make a forward pass through activation function
# it takes the output of first dense layer here
activation1.forward(dense1.output)
# Make a forward pass through second Dense layer
# it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)
# Make a forward pass through activation function
# it takes the output of second dense layer here
activation2.forward(dense2.output)
# Let's see output of the first few samples:
print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]]


As you can see, the distribution of predictions is almost equal, as each of the samples has ~33%
(0.33) predictions for each class. This results from the random initialization of weights (a draw
from the normal distribution, as not every random initialization will result in this) and zeroed
biases. These outputs are also our “confidence scores.”

To determine which classification the
model has chosen to be the prediction, we perform an argmax on these outputs, which checks
which of the classes in the output distribution has the highest confidence and returns its index - the
predicted class index. 

## Full code up to now

In [14]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()
# Dense layer
class Layer_Dense:
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases
        
# ReLU activation      
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)

# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1,
                                            keepdims=True))
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1,
                                            keepdims=True)
        self.output = probabilities
        
# Create dataset
X, y = spiral_data(samples=100, classes=3)
# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()
# Create second Dense layer with 3 input features (as we take output
# of previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)
# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()
# Make a forward pass of our training data through this layer
dense1.forward(X)
# Make a forward pass through activation function
# it takes the output of first dense layer here
activation1.forward(dense1.output)
# Make a forward pass through second Dense layer
# it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)
# Make a forward pass through activation function
# it takes the output of second dense layer here
activation2.forward(dense2.output)
# Let's see output of the first few samples:
print(activation2.output[:5])

# Applying arg max returns the index of highest value from each row
print(np.argmax(activation2.output[:5], axis=1, keepdims=True))

[[0.33333334 0.33333334 0.33333334]
 [0.3333332  0.3333332  0.33333364]
 [0.3333329  0.33333293 0.3333342 ]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
[[0]
 [2]
 [2]
 [2]
 [2]]


# Chapter 5 - Calculating Network Error with Loss

The loss function, also referred to as the cost function, is the algorithm
that quantifies how wrong a model is. Loss is the measure of this metric. Since loss is the model’s
error, we ideally want it to be 0.

If you’re familiar with linear regression, then you already know one of the loss functions used
with neural networks that do regression: squared error (or mean squared error with neural
networks).

We’re classifying, so we need a different loss function.


Categorical cross-entropy is explicitly used to
compare a “ground-truth” probability (y or “targets”) and some predicted distribution (y-hat or
“predictions”), so it makes sense to use cross-entropy here.


The formula for calculating the categorical cross-entropy of y (actual/desired distribution) and
y-hat (predicted distribution) is:
![Alt text](image-3.png)

Where L_i denotes the sample loss value, i is the index of the sample in the set, j is the label/output index, y
denotes the target values, and y-hat denotes the predicted values.
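
For reference, the usual form of categorical cross-entropy written with these symbols is given below. This is a transcription of the standard formula, not a copy of the image:

$$
L_i = -\sum_{j} y_{i,j} \log(\hat{y}_{i,j})
$$

When the targets are one-hot encoded, every term except the one for the correct class is multiplied by 0. Writing k for the index of the correct class of sample i (k is a symbol introduced for this note), the sum reduces to:

$$
L_i = -\log(\hat{y}_{i,k})
$$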



In [15]:
import math

softmax_output = [0.7, 0.1, 0.2]
# Ground truth
target_output = [1, 0, 0]

loss = -( target_output[0] * math.log(softmax_output[0]) +
          target_output[1] * math.log(softmax_output[1]) +
          target_output[2] * math.log(softmax_output[2]) )
print(loss)
#one-hot encoding means only one value is used for calculating loss, as the other terms get multiplied by 0.  Therefore we can skip multiplication
loss == -( target_output[0] * math.log(softmax_output[0]) ) 

0.35667494393873245


True

Doing this on a numpy array of layer outputs (softmax). 

In [16]:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = [0, 1, 1]

# NumPy array indexing is different from vanilla Python lists.
# We can pass an array of indices, which we get from class_targets. We use range(len()) to index each row in softmax_outputs.
print(softmax_outputs[ 
                      range(len(softmax_outputs)), 
                      class_targets 
                     ])
#now convert these confidence values to loss using our equation from earlier
loss = -np.log(softmax_outputs[ 
                               range(len(softmax_outputs)), 
                               class_targets 
                              ])
print(f"loss {loss}")
#we can do a simple analysis on loss, such as the average loss per batch
average_loss = np.mean(loss)
print(f"average loss  {average_loss}")



[0.7 0.5 0.9]
loss [0.35667494 0.69314718 0.10536052]
average loss  0.38506088005216804


In [17]:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
# one hot encoded labels
class_targets = np.array([  [1, 0, 0],  
                            [0, 1, 0],
                            [0, 1, 0]])
# Probabilities for target values -
# only if categorical labels
if len(class_targets.shape) == 1:
    correct_confidences = softmax_outputs[
        range(len(softmax_outputs)),
        class_targets
    ]
# Mask values - only for one-hot encoded labels
elif len(class_targets.shape) == 2:
    correct_confidences = np.sum(
            softmax_outputs*class_targets,
            axis=1
        )
# Losses
neg_log = -np.log(correct_confidences)
average_loss = np.mean(neg_log)

print(average_loss)

0.38506088005216804


A confidence of exactly 0 gives log(0) = -inf, which makes the loss infinite. Fix this by clipping the predictions to a small range just inside 0 and 1.

In [18]:
# LOSS_CLIP_VAL = 1e-7
# y_pred_clipped = np.clip(y_pred, LOSS_CLIP_VAL, 1 - LOSS_CLIP_VAL)
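
A minimal sketch of the effect of clipping, using made-up confidence values (the array `correct_confidences` below is hypothetical and not taken from the network):

```python
import numpy as np

# Hypothetical confidences for the correct class of three samples
correct_confidences = np.array([1.0, 0.5, 0.0])

# Without clipping, a confidence of exactly 0 gives an infinite loss
with np.errstate(divide="ignore"):
    raw_loss = -np.log(correct_confidences)

# Clip to [1e-7, 1 - 1e-7] so the log always stays finite
clipped = np.clip(correct_confidences, 1e-7, 1 - 1e-7)
clipped_loss = -np.log(clipped)

print(raw_loss)
print(clipped_loss)
print(np.mean(raw_loss), np.mean(clipped_loss))
```

A single infinite sample loss makes the mean loss infinite as well, so one bad prediction would hide everything else in the batch. After clipping, the worst possible sample loss is -log(1e-7), roughly 16.1.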

## Categorical Cross-Entropy Loss Class


In [37]:
# Common loss class
class Loss:
    # Calculates the mean data loss
    # given model output and ground truth values
    def calculate(self, output, y):
        # Calculate sample losses
        sample_losses = self._forward(output, y)
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        # Return loss
        return data_loss
    
# Cross-entropy loss
class Loss_CategoricalCrossentropy(Loss):
    def __init__(self):
        self.correct_confidences = []
    # Forward pass
    def _forward(self, y_pred, y_true):
        # Number of samples in a batch
        samples = len(y_pred)
        # Clip data to prevent log(0), which would give an infinite loss
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Probabilities for target values -
        # only if categorical labels  #vectors
        if len(y_true.shape) == 1:
            self.correct_confidences = y_pred_clipped[
                range(samples),
                y_true
            ]
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:    #matrix 
            # Reminder on what this does:
            # y_pred = [[0.3, 0.7],
            #           [0.4, 0.6]]
            # y_true = [[0, 1],
            #           [1, 0]]
            # then
            # correct_confidences = [0.3*0 + 0.7*1,
            #                        0.4*1 + 0.6*0]
            # Zero values contribute nothing, so we get [0.7, 0.4]
            # (keepdims is False by default, so the result is a 1-D array).
            # axis=1 sums across each row
            self.correct_confidences = np.sum(
                y_pred_clipped*y_true,
                axis=1
            )
            
        # Losses
        negative_log_likelihoods = -np.log(self.correct_confidences)
        return negative_log_likelihoods

In [31]:
loss_function = Loss_CategoricalCrossentropy()
loss = loss_function.calculate(softmax_outputs, class_targets)
print(loss)

0.38506088005216804


## Combining our NN with Categorical Cross-Entropy Loss Class

In [35]:
# Create dataset
X, y = spiral_data(samples=100, classes=3)
print(f"Input data {X.shape} | output data {y.shape}")
# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()
# Create second Dense layer with 3 input features (as we take output
# of previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)
# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()
# Create loss function
loss_function = Loss_CategoricalCrossentropy()
# Perform a forward pass of our training data through this layer
dense1.forward(X)
# Perform a forward pass through activation function
# it takes the output of first dense layer here
activation1.forward(dense1.output)
# Perform a forward pass through second Dense layer
# it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)
# Perform a forward pass through activation function
# it takes the output of second dense layer here
activation2.forward(dense2.output)
# Let's see output of the first few samples:
print(activation2.output[:5])
# Perform a forward pass through the loss function
# it takes the output of the softmax activation and the targets, and returns loss
loss = loss_function.calculate(activation2.output, y)
# Print loss value
print('loss:', loss)

Input data (300, 2) | output data (300,)
[[0.33333334 0.33333334 0.33333334]
 [0.33333373 0.33333296 0.33333334]
 [0.3333337  0.3333328  0.3333335 ]
 [0.3333341  0.33333236 0.33333355]
 [0.3333342  0.33333215 0.33333364]]
loss: 1.0986164
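
The loss of about 1.0986 is what you would expect from a model that guesses every class with equal confidence. A quick check, assuming three classes each predicted at exactly 1/3 (the variable names are made up for this note):

```python
import math

n_classes = 3
uniform_confidence = 1 / n_classes

# Cross-entropy for one sample when the correct class gets confidence 1/3
expected_loss = -math.log(uniform_confidence)
print(expected_loss)
```

This is ln(3), roughly 1.0986, which matches the printed loss to about five decimal places. It is a useful baseline: an untrained 3-class classifier should start near this value, and training should push the loss below it.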
