## Softmax Activation Function

<br>

![](https://drive.google.com/uc?id=1OngR478lvQM7OWBvvKOo_5lX_J8TKf8U)

<br>

### Logits

In [None]:
logits = [.9, .7]

In [None]:
cat_logit = logits[0]
dog_logit = logits[1]

### Cat/dog example

In [None]:
import math

In [None]:
denominator = math.pow(math.e, cat_logit) + math.pow(math.e, dog_logit)

In [None]:
denominator

4.473355818627425

In [None]:
cat_prob = math.pow(math.e, cat_logit) / denominator

cat_prob

0.549833997312478

In [None]:
dog_prob = math.pow(math.e, dog_logit) / denominator

dog_prob

0.4501660026875221

In [None]:
cat_prob + dog_prob

1.0

### Create softmax function


In [None]:
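def softmax(x_array):
    """Return the softmax probabilities for a list of logits."""
    # Denominator: sum of the exponentiated logits
    denominator = 0
    for x in x_array:
        denominator += math.pow(math.e, x)

    # Each probability is an exponentiated logit divided by that sum
    probs = []
    for x in x_array:
        probs.append(
            math.pow(math.e, x) / denominator
        )

    return probs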


### Cat/dog example

In [None]:
logits = [.9, .7]

CAT_IDX = 0
DOG_IDX = 1

probs = softmax(logits)

print(f'Cat: {probs[CAT_IDX]}, Dog: {probs[DOG_IDX]}')

Cat: 0.549833997312478, Dog: 0.4501660026875221


### Example with 4 logits

In [None]:
logits = [.9, .7, .6, .0001]

HOUSE_IDX = 0
TREE_IDX = 1
LIZARD_IDX = 2
SKUNK_IDX = 3

probs = softmax(logits)

print(f'House: {probs[HOUSE_IDX]}, Tree: {probs[TREE_IDX]}, Lizard: {probs[LIZARD_IDX]}, Skunk: {probs[SKUNK_IDX]}')

House: 0.3371363104229755, Tree: 0.276023865322535, Lizard: 0.24975672161474802, Skunk: 0.13708310263974147


In [None]:
sum_val = 0

for prob in probs:
    sum_val += prob

sum_val

1.0

### Translational Invariance

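Here "translational invariance" means that adding the same constant to every logit leaves the softmax output unchanged: `[1, 4]` and `[101, 104]` below give the same probabilities. `[-101, -104]` is not a shifted copy of them (its larger logit comes first), so its probabilities come out swapped.

For translation invariance in the computer-vision sense, see: https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo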

In [None]:
list_of_logits = [
    [1, 4],
    [101, 104],
    [-101, -104],
]

In [None]:
for logits in list_of_logits:
    print(
        softmax(logits)
    )

[0.04742587317756678, 0.9525741268224331]
[0.04742587317756679, 0.9525741268224331]
[0.9525741268224333, 0.04742587317756679]


### Note

Why use softmax as opposed to standard normalization?
- https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization
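
A minimal sketch of two differences that are easy to check numerically. It assumes "standard normalization" means dividing each value by the sum of all values; `standard_normalize` is a hypothetical helper written for this comparison, not something from the linked discussion.

```python
import numpy as np

def standard_normalize(x):
    # Hypothetical helper: divide each value by the sum of all values
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def softmax(x):
    # Plain softmax, fine for the small logits used here
    x = np.asarray(x, dtype=float)
    exp_values = np.exp(x)
    return exp_values / exp_values.sum()

# 1) Negative logits: standard normalization can return negative "probabilities"
mixed_logits = [1.0, -2.0, 3.0]
print(standard_normalize(mixed_logits))
print(softmax(mixed_logits))

# 2) Shifting all logits by 100 changes standard normalization, but not softmax
print(standard_normalize([1, 4]), standard_normalize([101, 104]))
print(softmax([1, 4]), softmax([101, 104]))
```

With standard normalization the outputs still sum to 1, but they are only usable as probabilities when all inputs are non-negative, and they depend on the absolute size of the inputs rather than on the gaps between them.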

### With a single logit, the output is always 1 ... duh!

In [None]:
import numpy as np

In [None]:
for x in np.arange(-10, 10):
    print(softmax([x])[0])

1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0


## The Softmax Activation Function

In our case, we’re looking to get this model to be a **classifier**, so we want an activation function meant for classification. 

The softmax activation function can take in non-normalized, or uncalibrated, inputs and produce a normalized distribution of probabilities for our classes. 

This distribution returned by the softmax activation function represents **confidence scores** for each class and will add up to 1. 

<br>

![](https://drive.google.com/uc?id=1OngR478lvQM7OWBvvKOo_5lX_J8TKf8U)

<br>

The numerator exponentiates the current output value and the denominator takes a sum of all of the exponentiated outputs for a given sample.
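
Written out, and assuming the usual definition of softmax, this reads as follows. The symbols are notation chosen here for illustration: $z_{i,j}$ is the $j$-th output value for sample $i$, $L$ is the number of outputs per sample, and $S_{i,j}$ is the resulting confidence score.

$$
S_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l=1}^{L} e^{z_{i,l}}}
$$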

---

Exponentiation serves multiple purposes. 

The exponential of any real number is always positive. It:
- approaches 0 as the input approaches negative infinity
- returns 1 for an input of 0
- keeps increasing for positive inputs

![](https://drive.google.com/uc?id=1HroSgcamG-yFwdbvLTa30Ku9xdG8onx8)

---

The exponential function is a **monotonic function**: higher input values always give higher outputs. Applying it therefore does not change which class is predicted, while guaranteeing that every value is non-negative.

It also adds stability to the result, because **normalized exponentiation** depends on the differences between the inputs rather than on their magnitudes.
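
Both properties can be checked on a small example. This is a minimal sketch; the logit values are made up for illustration.

```python
import numpy as np

def softmax(x):
    # Plain softmax, fine for the small logits used here
    x = np.asarray(x, dtype=float)
    exp_values = np.exp(x)
    return exp_values / exp_values.sum()

logits = np.array([2.0, -1.0, 0.5, 3.5])  # made-up values
probs = softmax(logits)

# Monotonic: the ranking of the classes is the same before and after
print(np.argsort(logits), np.argsort(probs))
print(np.argmax(logits), np.argmax(probs))

# Differences, not magnitudes: the ratio of two probabilities
# equals exp(difference of the two logits)
print(probs[3] / probs[0], np.exp(logits[3] - logits[0]))
```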

## Interlude - `np.sum`

**Important Note** 

> With arrays, try not to think in terms of rows, columns, etc. That picture can mislead you into treating one dimension as "higher up" than another, as if the rows contained the columns/fields.
>
> That is **wrong**: all dimensions are equal. If you have a 3×3×3 array, think of it as a Rubik's cube, where each dimension is just as "important" and none "contains" the others.
>
> The same obviously applies when there are more than 3 dimensions ...

In [None]:
cube = np.ones((3, 3, 3))

cube

array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])

In [None]:
cube[0]

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [None]:
cube[0][0]

array([1., 1., 1.])

In [None]:
cube[0][0][0]

1.0

In [None]:
cube[0][0][0] = 5.0
cube[1][0][0] = 55.0
cube[2][0][0] = 99.0

cube

array([[[ 5.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 1.,  1.,  1.]],

       [[55.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 1.,  1.,  1.]],

       [[99.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 1.,  1.,  1.]]])

In [None]:
cube.shape

(3, 3, 3)

**Sum without axis**

In [None]:
np.sum(cube)

183.0

...or:

In [None]:
np.sum(cube, axis=None)

183.0

**Sum with axis 0**

Collapse dimension 0 into remaining dimensions.

In [None]:
np.sum(cube, axis=0)

array([[159.,   3.,   3.],
       [  3.,   3.,   3.],
       [  3.,   3.,   3.]])

**Sum with axis 1**

Collapse dimension 1 into remaining dimensions.

In [None]:
np.sum(cube, axis=1)

array([[  7.,   3.,   3.],
       [ 57.,   3.,   3.],
       [101.,   3.,   3.]])
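
The same rule holds for any axis: the axis you sum over disappears from the shape. A minimal sketch, using a hypothetical array `arr` whose three dimensions have different sizes so the shapes are easy to tell apart:

```python
import numpy as np

# Hypothetical example array with dimensions of size 2, 3 and 4
arr = np.arange(24).reshape(2, 3, 4)

for axis in range(arr.ndim):
    summed = np.sum(arr, axis=axis)
    # The summed axis is dropped from the resulting shape
    print(f"axis={axis}: {arr.shape} -> {summed.shape}")

# Negative axes count from the end, so axis=-1 is the last dimension
last_axis_sum = np.sum(arr, axis=-1)
print(last_axis_sum.shape)
```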

**Perfect - but now let's keep the same number of dimensions as the input (the sizes of the individual dimensions may still differ)**


In [None]:
new_array = np.sum(cube, axis=1, keepdims=True)

new_array

array([[[  7.,   3.,   3.]],

       [[ 57.,   3.,   3.]],

       [[101.,   3.,   3.]]])

In [None]:
new_array.shape

(3, 1, 3)

In [None]:
cube.shape

(3, 3, 3)

In [None]:
print(
    len(new_array.shape),
    len(cube.shape)
)

3 3
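
Keeping the dimension matters because the sums then line up with the original array for broadcasting. A minimal sketch, assuming a small hypothetical `batch` array (one row per sample) that is not part of the notebook above:

```python
import numpy as np

# Hypothetical batch: 2 samples, 3 values each
batch = np.array([[1.0, 2.0, 3.0],
                  [10.0, 20.0, 70.0]])

row_sums_kept = np.sum(batch, axis=1, keepdims=True)  # shape (2, 1)
row_sums_flat = np.sum(batch, axis=1)                 # shape (2,)

# (2, 3) / (2, 1) broadcasts: every row is divided by its own sum
normalized = batch / row_sums_kept
print(normalized.sum(axis=1))

# (2, 3) / (2,) does not broadcast, because the trailing sizes 3 and 2 differ
try:
    batch / row_sums_flat
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```

This is the same pattern that a batched softmax relies on when it divides each row by the row's sum.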


## Exploding Values

Finally, we also subtract the largest input from every input before exponentiation (see full code at end).

```py
# Get unnormalized probabilities
exp_values = np.exp(
    inputs - np.max(inputs, axis=1, keepdims=True)
)
```


There are two main pervasive challenges with neural networks: “dead neurons”
and very large numbers (referred to as “exploding” values). 

“Dead” neurons and enormous numbers can wreak havoc down the line and render a network useless over time. 

The exponential function used in softmax activation is one of the sources of exploding values. 

---

We know the exponential function tends toward 0 as its input value approaches negative infinity, and the output is 1 when the input is 0. 

We can use this property to prevent the exponential function from overflowing. 

Suppose we subtract the maximum value from a list of input values. The shifted values then always lie in a range from some negative value up to 0: the largest number minus itself is 0, and any smaller number minus the maximum is negative. This is exactly the range discussed above, where the exponential function returns values between 0 and 1.

<br>

**With softmax, thanks to the normalization, we can subtract the same
value from all of the inputs, and it will not change the output.**
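
A short derivation of why this holds, using notation chosen here for illustration: $z_j$ are the inputs for one sample and $c$ is the constant being subtracted (for example, the maximum input).

$$
\frac{e^{z_j - c}}{\sum_{l} e^{z_l - c}}
= \frac{e^{-c}\, e^{z_j}}{e^{-c} \sum_{l} e^{z_l}}
= \frac{e^{z_j}}{\sum_{l} e^{z_l}}
$$

The factor $e^{-c}$ appears in both the numerator and the denominator, so it cancels.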


In [None]:
print(f'''
    exp(-np.inf): {np.exp(-np.inf)}
    exp(-3)     : {np.exp(-3)}
    exp(-2)     : {np.exp(-2)}
    exp(-1)     : {np.exp(-1)}
    exp(-.5)    : {np.exp(-.5)}
    exp(0)      : {np.exp(0)}
'''
)


    exp(-np.inf): 0.0
    exp(-3)     : 0.049787068367863944
    exp(-2)     : 0.1353352832366127
    exp(-1)     : 0.36787944117144233
    exp(-.5)    : 0.6065306597126334
    exp(0)      : 1.0
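
To see the overflow and the fix side by side, here is a minimal sketch. It assumes `inputs` is a 2D array with one row per sample, as in the snippet above; `softmax_naive` and `softmax_stable` are hypothetical names used only for this comparison.

```python
import numpy as np

def softmax_naive(inputs):
    # Exponentiate directly: large inputs overflow to inf
    exp_values = np.exp(inputs)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

def softmax_stable(inputs):
    # Subtract each row's maximum first, so the largest exponent is exp(0) = 1
    shifted = inputs - np.max(inputs, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

inputs = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])

# Silence the overflow/invalid warnings that the naive version triggers
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(inputs)

stable = softmax_stable(inputs)

print(naive)   # second row is nan: exp(1000) overflows to inf, and inf / inf is nan
print(stable)  # both rows agree, since they differ only by a constant shift
```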



## Note - Softmax is Translationally Invariant

https://towardsdatascience.com/the-big-issue-with-softmax-cd6169fede8f
- **However**, see the comments - it seems okay after all

In [None]:
def softmax(x):
    return np.exp(x) / np.exp(x).sum()

Instead of treating the output as actual class probabilities, view it as an indication, based on the scores, of which class is the most likely.


In [None]:
print(softmax([1, 4]))
print(softmax([101, 104]))
print(softmax([-54, -51]))

[0.04742587 0.95257413]
[0.04742587 0.95257413]
[0.04742587 0.95257413]


---

With an added class whose logit is fixed at 0, only the other logits are shifted, so the output now changes:

In [None]:
print(f'{np.round(softmax([0, 1, 4]), 5)}')
print(f'{np.round(softmax([0, 101, 104]), 5)}')
print(f'{np.round(softmax([0, -54, -51]), 5)}')

[0.01715 0.04661 0.93624]
[0.      0.04743 0.95257]
[1. 0. 0.]
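
The invariance still holds when every logit is shifted, including the zero one. A minimal sketch of that check; the `softmax` here is redefined with the max subtraction so the block runs on its own, and the variable names are illustrative.

```python
import numpy as np

def softmax(x):
    # Stable variant: subtracting the maximum does not change the result
    x = np.asarray(x, dtype=float)
    exp_values = np.exp(x - np.max(x))
    return exp_values / exp_values.sum()

base = np.array([0.0, 1.0, 4.0])
shift = 100.0

reference = softmax(base)
all_shifted = softmax(base + shift)            # every logit moves by 100
partly_shifted = softmax([0.0, 101.0, 104.0])  # the zero logit stays put

print(np.round(reference, 5))
print(np.round(all_shifted, 5))
print(np.round(partly_shifted, 5))
```

Only the last case differs from the reference, because there the gaps between the logits have changed.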
