# This notebook explains the Softmax Activation Function in detail



## After learning several activation functions such as Sigmoid and ReLU, the first question is: why do we need another one?


### Imagine these are our output values from a layer :-
```sh
layer_outputs = [4.8, 1.21, 2.385]
```
### Now what do we do with these ?

### If we are only predicting, we take whichever value in layer_outputs is largest as our prediction. Here layer_outputs[0] = 4.8 is the prediction.


## We also have to remember that we are trying to train our neural network, not just predict values.

### We'll learn about training NNs in upcoming notebooks.

### But the first step in training a model is to determine *how wrong is our model?*

## Which one of these is more correct ?
```sh
layer_outputs = [4.8, 1.21, 2.385]
# (Most people would pick this one, because the output at index 0 is much larger than the others)
layer_outputs = [4.8, 4.79, 4.25]
```
### Accuracy-wise, they are identical (both predict index 0, the neuron with output 4.8).
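#### A minimal sketch of the point above, using the two example vectors. The names `sample_a`, `sample_b`, and the "margin" measure are illustrative assumptions, not part of the original notebook. Both samples give the same prediction, yet the gap between the winning neuron and the runner-up is very different:

```python
import numpy as np

sample_a = np.array([4.8, 1.21, 2.385])
sample_b = np.array([4.8, 4.79, 4.25])

# Both samples predict the same class (index 0) ...
pred_a = int(np.argmax(sample_a))
pred_b = int(np.argmax(sample_b))

# ... but the gap between the largest and second-largest value differs a lot.
sorted_a = np.sort(sample_a)
sorted_b = np.sort(sample_b)
margin_a = float(sorted_a[-1] - sorted_a[-2])
margin_b = float(sorted_b[-1] - sorted_b[-2])

print("predictions:", pred_a, pred_b)
print("margins:", round(margin_a, 3), round(margin_b, 3))
```

#### Accuracy alone only sees the two identical predictions; it says nothing about how decisive each one was.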

### The first step in measuring how wrong an output is, is to compare each neuron's value relative to the other neurons. ReLU can't do this because it is applied to each neuron independently and gives no per-neuron comparison of the outputs.

### The next problem is that both outputs are unbounded, so the relative closeness of values can vary from sample to sample, leaving no solid way to determine how wrong our model is.

## This is why we need some new activation function :- *Softmax Activation Function*

### Goal: say we need to classify cats and dogs, with 2 neurons in the output layer. We want the outputs of those 2 neurons to form a *probability distribution*: if the input is a cat, the first neuron should output something like 0.80 - 0.90, or up to 1.0 (100 % confidence), and the second neuron should output a low value such as 0.08 or 0.12 (i.e. a low confidence score for it being a dog).



### This will help us to determine how wrong / right our model is.


### If we want a probability distribution but keep ReLU as the activation function:
```sh
ReLU(0) = 0,   0.0000 %
ReLU(1) = 1, 100.0000 %
```
### The problem here is that if any of those output values is negative, ReLU will clip it and yield zero.

### ReLU(-1) / ReLU(-9000) / ReLU(-21) = 0, 0.000 %

#### What if all the values are negative? Then every output is zero and there is nothing left to normalize, which doesn't make any sense.
#### There are some other methods, but research shows they aren't effective either.
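#### A small sketch of why clipping with ReLU and then normalizing breaks down. `clip_and_normalize` is a hypothetical helper written only for this illustration:

```python
import numpy as np

def clip_and_normalize(values):
    """Hypothetical helper: ReLU-clip the values, then divide by their sum."""
    clipped = np.maximum(0, values)
    total = clipped.sum()
    if total == 0:
        return None  # every value was clipped to zero, nothing to normalize
    return clipped / total

mixed = clip_and_normalize(np.array([4.8, -1.21, 2.385]))
all_negative = clip_and_normalize(np.array([-1.0, -9000.0, -21.0]))

print(mixed)         # the negative entry ends up with a share of exactly 0
print(all_negative)  # None: no distribution can be formed
```

#### The negative entry loses all of its information (-1 and -9000 are treated the same), and an all-negative vector cannot be normalized at all.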

## Exponential Function 
```sh
y = e^x ("e ~ 2.718281828459045, Euler's Number") 
if x = -10.0000
y = e^-10.0000 ~ + 0.00005
```

### What this function does is make sure that no value is negative after passing through it, while keeping the ordering of the inputs (a larger x always gives a larger y). That lets us go on to calculate how accurate the output is.

In [2]:
# Coding for exponentiation function

import math

layer_outputs = [4.8, 1.21, 2.385]


E = math.e

exp_values = []

for output in layer_outputs:
    exp_values.append(E**output)
print(exp_values)

[121.51041751873483, 3.353484652549023, 10.859062664920513]


## After exponentiation, next step is to normalize these values

#### In our case, the exponentiated output of a single neuron is divided by the sum of the exponentiated outputs of all neurons in the output layer.

#### This gives us the probability distribution that we want.

#### Normalizing this way requires getting rid of all negative values without losing the information they carry, which is exactly what exponentiation did in the previous step.

In [14]:
## coding Normalization

norm_base = sum(exp_values)
norm_values = []

for value in exp_values:
    norm_values.append(value / norm_base)
print(norm_values,", Sum of Norm Values :"+str(sum(norm_values)))

### exp_values = exponentiated values
### norm_values = normalized exponentiated values

[0.8952826639572619, 0.024708306782099374, 0.0800090292606387] , Sum of Norm Values :0.9999999999999999


## Quicker way to calculate both

In [1]:
import numpy as np
import nnfs  # not necessary, just following nnfs code in video
nnfs.init() # --"--

layer_outputs = [4.8, 1.21, 2.385]

exp_values = np.exp(layer_outputs)  # numpy's exp function 
norm_values = exp_values / np.sum(exp_values)
print("Exp Values",exp_values)
print("Norm Exp Values",norm_values)

Exp Values [121.51041752   3.35348465  10.85906266]
Norm Exp Values [0.89528266 0.02470831 0.08000903]


## What we did here is :-
## layer_output -->  Exponentiate --> Normalize --> Output 

### This can be written as :-
## layer_output --> *Softmax* --> Output

### Exponentiation and normalization together are called the Softmax function!
### Now we can easily understand what the softmax activation function really is and why we apply it.
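#### The two steps can also be written as one formula. The symbols $z_j$ (raw output of neuron $j$), $S_j$ (its softmax output), and $L$ (number of neurons in the output layer) are notation assumed here for illustration, not taken from the notebook:

$$S_j = \frac{e^{z_j}}{\sum_{l=1}^{L} e^{z_l}}$$

#### The numerator is the exponentiation step and the denominator is the normalization base, so every $S_j$ is positive and the $S_j$ sum to 1.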

# Moving ahead,
## In the code above we were dealing with a *single vector of outputs*, but in practice we will be dealing with a *batch of outputs*


In [2]:
## dynamic code that can deal with batches 

layer_outpus = [[4.8, 1.21, 2.385],
               [8.9, -2.6, -12.5],
               [0.987, -5, 1.5]]


exp_values = np.exp(layer_outpus) # np.exp is applied element-wise by default
exp_values

array([[1.21510418e+02, 3.35348465e+00, 1.08590627e+01],
       [7.33197354e+03, 7.42735782e-02, 3.72665317e-06],
       [2.68317287e+00, 6.73794700e-03, 4.48168907e+00]])

In [3]:
np.sum(layer_outpus) 

-0.31799999999999784

#### But we want a set of 3 sums, one for each of the 3 vectors inside the batch.
#### NumPy doesn't do this by default; np.sum adds up every individual value into a single number.

In [5]:
# to take the sum of each row
np.sum(exp_values,axis = 1)   # axis = 1 gives one sum per row (per sample); axis = 0 would give one sum per column

array([1.35722965e+02, 7.33204782e+03, 7.17159988e+00])

#### But to compute the normalized values, the sums need to keep the same number of dimensions as exp_values, so that each row can be divided by its own sum

In [6]:
## keepdims=True keeps the result 2-D with shape (3, 1), so it lines up row by row with exp_values

np.sum(exp_values,axis = 1, keepdims=True)

array([[1.35722965e+02],
       [7.33204782e+03],
       [7.17159988e+00]])

In [7]:
exp_values   # also 2-D, with shape (3, 3)

array([[1.21510418e+02, 3.35348465e+00, 1.08590627e+01],
       [7.33197354e+03, 7.42735782e-02, 3.72665317e-06],
       [2.68317287e+00, 6.73794700e-03, 4.48168907e+00]])
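#### A minimal sketch of why the extra dimension matters, assuming the same batch as above. The variable names `flat_sums`, `kept_sums`, `row_normalized`, and `misaligned` are illustrative only. With shape (3, 1) each row is divided by its own sum; with shape (3,) NumPy broadcasting lines the sums up with the columns instead:

```python
import numpy as np

exp_values = np.exp(np.array([[4.8, 1.21, 2.385],
                              [8.9, -2.6, -12.5],
                              [0.987, -5, 1.5]]))

flat_sums = np.sum(exp_values, axis=1)                 # shape (3,)
kept_sums = np.sum(exp_values, axis=1, keepdims=True)  # shape (3, 1)

# (3, 3) / (3, 1): every row is divided by its own sum.
row_normalized = exp_values / kept_sums

# (3, 3) / (3,): the sums are matched to columns, so rows no longer sum to 1.
misaligned = exp_values / flat_sums

print(flat_sums.shape, kept_sums.shape)
print(row_normalized.sum(axis=1))
print(misaligned.sum(axis=1))
```

#### Because the batch here happens to be square, the misaligned division runs without any error, which makes this mistake easy to miss.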

In [9]:
# implementation


layer_outpus = [[4.8, 1.21, 2.385],
               [8.9, -2.6, -12.5],
               [0.987, -5, 1.5]]


exp_values = np.exp(layer_outpus) 


norm_values = exp_values / np.sum(exp_values, axis=1, keepdims=True)

print("The Normalized Values are : \n",norm_values)

The Normalized Values are : 
 [[8.95282664e-01 2.47083068e-02 8.00090293e-02]
 [9.99989870e-01 1.01299910e-05 5.08269076e-10]
 [3.74138673e-01 9.39531919e-04 6.24921795e-01]]


## One slight issue with exponentiation is that the values explode as the input to the exp function grows.
```sh
np.exp(1) = 2.718281....

np.exp(10) = 22026.465794....

np.exp(100) = 2.68811714...e+43  # it doesn't take much to get massive numbers

np.exp(1000) = inf  # overflow: NumPy emits a RuntimeWarning and returns inf
```
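#### A short sketch of what the overflow looks like in practice. For float64 inputs, `np.exp` does not raise an exception; it returns `inf` and emits a RuntimeWarning (whereas `math.exp(1000)` raises `OverflowError`). The warnings are silenced here with `np.errstate` only to keep the demo output clean:

```python
import numpy as np

with np.errstate(over="ignore"):      # suppress the overflow warning for this demo
    big = np.exp(1000.0)

with np.errstate(invalid="ignore"):   # suppress the invalid-value warning
    ratio = big / big                 # inf / inf, as would happen when normalizing

print(big)     # inf
print(ratio)   # nan
```

#### Once a single exponentiated value becomes `inf`, the normalization step produces `nan`, so the whole probability row is ruined.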

## Overflow Prevention 
```sh
v = u - max(u)  
# use this in the output layer prior to exponentiation
# subtract the largest value in the layer from all of the values in the layer


if output, u = [1, 2, 3],  max(u) = 3


 v = [1, 2, 3] - 3 = [-2, -1 , 0]

So the largest value prior to exponentiation is now zero, and every exponentiated value falls in the range (0, 1], with the largest being e^0 = 1

```

#### The output is the same with or without the overflow prevention step, because the common factor e^(-max(u)) cancels between the numerator and the denominator during normalization. So we are only guarding against numerical overflow, not changing the result.
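#### A minimal sketch comparing the two variants. `softmax_naive` and `softmax_stable` are hypothetical helper names used only for this comparison. On small inputs both give the same probabilities; on large inputs only the shifted version survives:

```python
import numpy as np

def softmax_naive(x):
    """Exponentiate and normalize directly (can overflow)."""
    exp_values = np.exp(x)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(x):
    """Subtract each row's maximum before exponentiating."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

same_on_small = np.allclose(softmax_naive(small), softmax_stable(small))

with np.errstate(over="ignore", invalid="ignore"):  # silence warnings for the demo
    naive_large = softmax_naive(large)
stable_large = softmax_stable(large)

print(same_on_small)
print(naive_large)    # all nan: inf / inf
print(stable_large)   # same probabilities as for [1, 2, 3]
```

#### `large` is just `small` shifted by a constant, so its softmax output matches that of `small`, which is the cancellation described above.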

## Now putting all above together in our Fundamental NN class

In [13]:
import numpy as np
import nnfs 
from nnfs.datasets import spiral_data
nnfs.init()
layer_outpus = [[4.8, 1.21, 2.385],
               [8.9, -2.6, -12.5],
               [0.987, -5, 1.5]]

X, y = spiral_data(100,3)
class layer_dense:
    def __init__(self,n_inputs, n_neurons):
        self.n_neurons = n_neurons
        self.weights = 0.10*np.random.randn(n_inputs, n_neurons) 
        print("Created a hidden layer with "+ str(self.n_neurons)+" neurons")
        self.biases = np.zeros((1,n_neurons))
    
    def forward(self,inputs):   # this method will just take the inputs
        self.output = np.dot(inputs, self.weights) + self.biases
        
class activation_relu:
        def forward(self, inputs):
            self.output= np.maximum(0,inputs)

class activation_softmax:
        def forward(self, inputs):
            exp_values = np.exp(inputs - np.max(inputs,axis=1,keepdims=True))
            probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True) 
            self.output = probabilities

In [14]:
X,y = spiral_data(100,3)   # 100 samples per class, 3 classes; each sample has 2 input features


dense1 = layer_dense(2,3)

Created a hidden layer with 3 neurons


In [16]:
activation1 = activation_relu()

### the previous layer has 3 neurons (3 outputs), so this layer takes 3 inputs
dense2 = layer_dense(3,3)  

# using 3 output neurons (one per class) and treating this as the output layer

activation2 = activation_softmax()

Created a hidden layer with 3 neurons


In [20]:
dense1.forward(X)
activation1.forward(dense1.output)

dense2.forward(activation1.output)
activation2.forward(dense2.output)


activation2.output   

array([[0.33333334, 0.33333334, 0.33333334],
       [0.33333334, 0.33333334, 0.33333334],
       [0.3333338 , 0.33333224, 0.333334  ],
       [0.33334652, 0.33330026, 0.33335325],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33336297, 0.3332589 , 0.3333781 ],
       [0.33335266, 0.3332848 , 0.33336252],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33348224, 0.333216  , 0.33330175],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33333334, 0.33333334, 0.33333334],
       [0.33348945, 0.33321032, 0.3333002 ],
       [0.33351874, 0.33318728, 0.333294  ],
       [0.33351728, 0.3331884 , 0.33329433],
       [0.33366132, 0.33307493, 0.3332637 ],
       [0.3337274 , 0.3330229 , 0.33324966],
       [0.3336114 , 0.3331143 , 0.33327436],
       [0.33366475, 0.33307222, 0.33326298],
       ...

In [22]:
activation2.output[8]  # sample at index 8 of the batch

array([0.33333334, 0.33333334, 0.33333334], dtype=float32)

In [24]:
activation2.output[4]   # sample at index 4 of the batch

array([0.33333334, 0.33333334, 0.33333334], dtype=float32)
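#### The near-uniform rows are what we would expect from an untrained network: the weights are small random numbers and the biases are zero, so the values reaching softmax are all close to each other. The sketch below reproduces that effect with NumPy only. The random inputs `X_demo` are a stand-in for `spiral_data`, and all variable names are assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(5, 2))  # stand-in for 5 samples with 2 features

# Same initialization scheme as layer_dense: small random weights, zero biases.
w1 = 0.10 * rng.standard_normal((2, 3))
b1 = np.zeros((1, 3))
w2 = 0.10 * rng.standard_normal((3, 3))
b2 = np.zeros((1, 3))

hidden = np.maximum(0, X_demo @ w1 + b1)       # dense layer + ReLU
logits = hidden @ w2 + b2                      # output layer before softmax

shifted = logits - logits.max(axis=1, keepdims=True)
exp_values = np.exp(shifted)
probs = exp_values / exp_values.sum(axis=1, keepdims=True)

print(probs.sum(axis=1))             # each row sums to 1
print(np.abs(probs - 1 / 3).max())   # largest deviation from a uniform 1/3
```

#### Rows that come out as exactly 1/3 each correspond to samples for which ReLU clipped every hidden value to zero, leaving three identical inputs to softmax.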

## Now, to calculate what's right and what's wrong, we are going to need a loss function.

## That's the topic for the next notebook!