### Softmax Activation Function Intro


In our case, we want this model to act as a **classifier**, so the final layer needs an activation function suited to classification. 

The softmax activation function can take in non-normalized inputs and produce a normalized distribution of probabilities for our classes. 

This distribution returned by the softmax activation function represents **confidence scores** for each class and will add up to 1. 

<br>

![](https://drive.google.com/uc?id=1OngR478lvQM7OWBvvKOo_5lX_J8TKf8U)

<br>

- The numerator exponentiates the current output value 
- The denominator takes a sum of all of the exponentiated outputs for a given sample
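
Written out, the two bullets above correspond to the formula below. The symbols are assumptions chosen here for illustration ($z_i$ for the logit of class $i$, $K$ for the number of classes); the image may use different notation.

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$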



### Softmax Function Code


In [11]:
import math
import numpy as np

In [3]:
def softmax(logits):

    denominator = 0
    for logit in logits:
        denominator += math.pow(math.e, logit)

    probs = []
    for logit in logits:
        probs.append(
            math.pow(math.e, logit) / denominator
        )

    return probs


**Logits** here simply refers to the raw, unnormalized outputs of the earlier network layers. They do not have to sum to 1, and they are not probabilities.

As for why they're called logits, we'll leave that for another time ...
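
The loop-based `softmax` above can also be written with NumPy array operations. The following is a minimal sketch, not a replacement for the function used in the rest of this notebook; the name `softmax_vectorized` is hypothetical.

```python
import numpy as np

def softmax_vectorized(logits):
    """Hypothetical NumPy variant of the loop-based softmax above."""
    logits = np.asarray(logits, dtype=float)
    exps = np.exp(logits)       # numerator for every class at once
    return exps / exps.sum()    # divide by the shared denominator

probs = softmax_vectorized([.9, .7])
print(probs, probs.sum())
```

Like the loop version, this sketch does nothing to guard against very large logits.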

### Cat/dog example

In [16]:
logits = [.9, .7]

CAT_IDX = 0
DOG_IDX = 1

probs = softmax(logits)

print(f'''
    Cat: {probs[CAT_IDX]} 
    Dog: {probs[DOG_IDX]}
    SUM: {np.array(probs).sum()}
''')


    Cat: 0.549833997312478 
    Dog: 0.4501660026875221
    SUM: 1.0



### Example with 4 logits

In [17]:
logits = [.9, .7, .6, .0001]

HOUSE_IDX = 0
TREE_IDX = 1
LIZARD_IDX = 2
SKUNK_IDX = 3

probs = softmax(logits)

print(f'''
    House : {probs[HOUSE_IDX]} 
    Tree  : {probs[TREE_IDX]}
    Lizard: {probs[LIZARD_IDX]}
    Skunk : {probs[SKUNK_IDX]}
    SUM   : {np.array(probs).sum()}
''')


    House : 0.3371363104229755 
    Tree  : 0.276023865322535
    Lizard: 0.24975672161474802
    Skunk : 0.13708310263974147
    SUM   : 1.0



### Why use softmax as opposed to standard normalization?

'Tis a good question, but not to be answered here :)

In short, softmax works better with neural networks, especially considering how it feeds into the classification loss function (cross-entropy loss): the two play well together.

For more details see [here](https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization).
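
One concrete difference can be sketched in a few lines: plain normalization (dividing each value by the sum) can return negative values when some logits are negative, whereas softmax cannot. This illustrates only one difference, and `standard_normalize` and `softmax_np` are hypothetical helper names introduced here.

```python
import numpy as np

def standard_normalize(values):
    # Hypothetical helper: divide each value by the sum of all values.
    values = np.asarray(values, dtype=float)
    return values / values.sum()

def softmax_np(logits):
    # Hypothetical NumPy softmax: exponentiate, then normalize.
    exps = np.exp(np.asarray(logits, dtype=float))
    return exps / exps.sum()

mixed_logits = [2.0, -1.0]

normalized = standard_normalize(mixed_logits)
soft = softmax_np(mixed_logits)

print(normalized)   # contains a negative entry, so it is not a valid distribution
print(soft)         # every entry is positive and the entries sum to 1
```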

### Sigmoid Activation Function

The softmax is the generalization of the sigmoid for multi-class problems.

If only a single logit is passed into the final layer's activation function, do not use softmax: the lone exponentiated value is divided by itself, so the result is always 1 :)

In [20]:
for x in np.arange(-5, 5):
    print(softmax([x])[0])

1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0


Rather, use the sigmoid function. Its output lies strictly between 0 and 1: 
- it approaches 0 as the input goes to negative infinity
- it equals 0.5 for an input of 0
- it approaches 1 as the input goes to positive infinity


![](https://drive.google.com/uc?id=17laVogv5ny9jS_QO8aXDlsI5FBOwGuMt)



In [21]:
def sigmoid(x):
    return 1 / (1 + math.pow(math.e, -x))

In [26]:
for x in np.arange(-5, 5):
    print(sigmoid(x))

0.006692850924284857
0.017986209962091562
0.04742587317756679
0.11920292202211757
0.2689414213699951
0.5
0.7310585786300049
0.8807970779778823
0.9525741268224331
0.9820137900379085
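
To see in what sense softmax generalizes sigmoid, here is a small check: a two-class softmax whose second logit is fixed at 0 gives the same value as the sigmoid of the first logit. The functions are restated compactly (using `math.exp`) so the snippet runs on its own; the test inputs are arbitrary.

```python
import math

def softmax(logits):
    # Compact restatement of the loop-based softmax (math.exp replaces math.pow(math.e, ...)).
    denominator = sum(math.exp(logit) for logit in logits)
    return [math.exp(logit) / denominator for logit in logits]

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

differences = []
for x in [-2.0, 0.0, 3.0]:
    two_class = softmax([x, 0.0])[0]   # probability assigned to the first class
    differences.append(abs(two_class - sigmoid(x)))
    print(x, two_class, sigmoid(x))
```

This follows from dividing the numerator and denominator of $e^x / (e^x + e^0)$ by $e^x$.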


### Note on Exponentiation



Exponentiation serves multiple purposes. 
- The exponential of any real number is always positive, so no output can be negative
- The exponential function is a **monotonic function**: higher inputs give higher outputs, so applying it does not change which class has the largest score
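
Both properties can be checked quickly. This is a minimal sketch with arbitrary example logits:

```python
import numpy as np

logits = np.array([-2.0, 0.5, -0.1, 3.0])
exps = np.exp(logits)

print(exps)                                 # every value is positive
print(np.argmax(logits), np.argmax(exps))   # same index both times
```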

<br>

$\large e^x$: 
- approaches 0 as the input goes to negative infinity 
- returns 1 for an input of 0 
- grows rapidly for positive inputs

![](https://drive.google.com/uc?id=1HroSgcamG-yFwdbvLTa30Ku9xdG8onx8)

<br>

In addition, in the softmax activation function, the denominator takes a sum of all of the exponentiated outputs. This **normalized exponentiation** depends on the differences between the inputs rather than on their absolute magnitudes.


### Note on Translational Invariance



This is just a side note. 

No need to worry about it too much for now. Just keep in mind that softmax is translationally invariant: adding the same constant to every logit leaves the output unchanged.
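
For the curious, the reason fits in one line. The notation is assumed here for illustration: $z_i$ are the logits and $c$ is any constant added to all of them.

$$
\frac{e^{z_i + c}}{\sum_j e^{z_j + c}} = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

The common factor $e^{c}$ cancels between numerator and denominator.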

---

In short, instead of treating the output as actual class probabilities, view it as an indication, based on the relative scores, of which class is the most likely. 

In [None]:
print(softmax([1, 4]))
print(softmax([101, 104]))
print(softmax([-101, -104]))

[0.04742587317756678, 0.9525741268224331]
[0.04742587317756678, 0.9525741268224331]
[0.9525741268224331, 0.04742587317756678]


Depending on the problem, it can make sense to add a **_non/none_** class. 

For example, imagine: 
- index 0: none
- index 1: cat
- index 2: dog

In [None]:
print(f'{np.round(softmax([0, 1, 4]), 5)}')
print(f'{np.round(softmax([0, 101, 104]), 5)}')
print(f'{np.round(softmax([0, -101, -104]), 5)}')

[0.01715 0.04661 0.93624]
[0.      0.04743 0.95257]
[1. 0. 0.]


### Exploding Values


Two pervasive challenges with neural networks are “dead” neurons and very large numbers (referred to as “exploding” values). 

“Dead” neurons and enormous numbers can wreak havoc down the line and render a network useless over time. 

---

The exponential function used in softmax activation is one of the sources of exploding values. 

We know the exponential function tends toward 0 as its input value approaches negative infinity, and the output is 1 when the input is 0. 

We can use this property to prevent the exponential function from overflowing. 

Let's subtract the maximum value from every input value. The shifted inputs then always lie in a range from some negative value up to 0: the largest value minus itself is 0, and any smaller value minus the maximum is negative. That is exactly the range discussed above, where the exponential returns values between 0 and 1. 

<br>

With Softmax, thanks to the normalization, we can subtract any
value from all of the inputs, and it will not change the output.
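
A small sketch of the problem and the fix, using example logits chosen here only to trigger the overflow: exponentiating the raw values fails, while the shifted values stay in a safe range.

```python
import math

logits = [1000, 999]

# Exponentiating the raw logits overflows a Python float.
try:
    exps = [math.pow(math.e, logit) for logit in logits]
    overflowed = False
except OverflowError:
    overflowed = True

# After subtracting the maximum, every input is <= 0.
shifted = [logit - max(logits) for logit in logits]
shifted_exps = [math.pow(math.e, s) for s in shifted]

print(overflowed, shifted, shifted_exps)
```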


In [27]:
print(f'''
    exp(-np.inf): {np.exp(-np.inf)}
    exp(-3)     : {np.exp(-3)}
    exp(-2)     : {np.exp(-2)}
    exp(-1)     : {np.exp(-1)}
    exp(-.5)    : {np.exp(-.5)}
    exp(0)      : {np.exp(0)}
'''    
)


    exp(-np.inf): 0.0
    exp(-3)     : 0.049787068367863944
    exp(-2)     : 0.1353352832366127
    exp(-1)     : 0.36787944117144233
    exp(-.5)    : 0.6065306597126334
    exp(0)      : 1.0



#### Example

In [32]:
def softmax(logits):

    # two new lines added:
    logits = np.array(logits)
    logits = logits - logits.max()

    denominator = 0
    for logit in logits:
        denominator += math.pow(math.e, logit)

    probs = []
    for logit in logits:
        probs.append(
            math.pow(math.e, logit) / denominator
        )

    return probs

In [37]:
logits = [9999999, 9999990, 9999960, 9999943]

HOUSE_IDX = 0
TREE_IDX = 1
LIZARD_IDX = 2
SKUNK_IDX = 3

probs = softmax(logits)

print(f'''
    House : {probs[HOUSE_IDX]} 
    Tree  : {probs[TREE_IDX]}
    Lizard: {probs[LIZARD_IDX]}
    Skunk : {probs[SKUNK_IDX]}
    SUM   : {np.array(probs).sum()}
''')


    House : 0.9998766054240137 
    Tree  : 0.00012339457598623178
    Lizard: 1.1546799184790585e-17
    Skunk : 4.78030294763524e-25
    SUM   : 0.9999999999999999
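
The same max-subtraction trick carries over to several samples at once. The sketch below is an assumption about how one might batch it with NumPy, with one row of logits per sample; `softmax_batch` is a hypothetical name, not part of the notebook's code.

```python
import numpy as np

def softmax_batch(logits):
    """Hypothetical batched softmax: one row of logits per sample."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # subtract each row's max
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)          # normalize each row

batch = [[.9, .7, .6, .0001],
         [9999999, 9999990, 9999960, 9999943]]

probs = softmax_batch(batch)
print(probs.sum(axis=-1))   # each row sums to (approximately) 1
```

`keepdims=True` keeps the reduced axis so that the per-row maximum and sum broadcast back against the original rows.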

