Why use softmax activation? We want to know how "wrong" the output is. Consider the following two outputs:
```python
output1 = [4.8, 1.21, 2.385]
output2 = [4.8, 4.79, 4.25]
```
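How do we decide which one is more accurate? Both have the same largest value (4.8) in the same position, but relative to the other values in each output they are very different: in `output1` the top value clearly dominates, while in `output2` the three values are nearly tied.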

We could build some kind of probability distribution from the outputs after applying the rectified linear function (ReLU). But negative values won't work in a probability distribution and will cause issues later, and clipping them to zero throws that information away, so the model can't learn from it.
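To make the clipping problem concrete, here is a minimal sketch. The `raw` values are made up for illustration and are not from the examples above; dividing by the sum is just one assumed way of turning the clipped values into a distribution.

```python
import numpy as np

# Hypothetical raw outputs for one sample (illustrative values only).
raw = np.array([4.8, -1.81, -0.2])

# ReLU clips every negative value to zero.
clipped = np.maximum(raw, 0.0)

# Normalizing the clipped values gives something that sums to 1,
# but the two negative outputs are now indistinguishable.
relu_probs = clipped / np.sum(clipped)

print(clipped)
print(relu_probs)
```

Both negative outputs end up with the same value of zero, even though -0.2 was much closer to being positive than -1.81. If every output were negative, the sum would be zero and the division would fail as well.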

Basically - we want to turn the output into something that makes sense and that we can use to teach the model.

Solution - we use exponentiation to make all the values positive without causing the issues that just squaring them would create. We do this with `y = e^x`, where `e` is Euler's number, approximately 2.718281828459045.
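A small sketch of the difference between the two approaches, using made-up input values. The notes don't spell out which squaring issue is meant; the one shown here (losing the sign and the ordering) is an assumption.

```python
import numpy as np

# Illustrative inputs: one negative, one zero, one positive.
x = np.array([-2.0, 0.0, 2.0])

# Squaring makes values non-negative, but -2 and 2 collapse to the same
# result and 0 stays 0, so the original ordering is lost.
squared = x ** 2

# Exponentiation is strictly positive and strictly increasing, so a
# larger input always gives a larger output.
exponentiated = np.exp(x)

print(squared)
print(exponentiated)
```

With `np.exp`, the smallest input still maps to the smallest output, which is what we want if the result is going to be read as a confidence score.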

In [41]:
import numpy as np
layer_outputs = [[4.8, 1.21, 2.385],
                 [8.9, -1.81, 0.2],
                 [1.41, 1.051, 0.026]]

In [42]:
exp_values = np.exp(layer_outputs)

In [45]:
exp_values

array([[1.21510418e+02, 3.35348465e+00, 1.08590627e+01],
       [7.33197354e+03, 1.63654137e-01, 1.22140276e+00],
       [4.09595540e+00, 2.86051020e+00, 1.02634095e+00]])

In [46]:
norm_values = exp_values / np.sum(exp_values, axis=1, keepdims=True)

In [47]:
print(norm_values)
print(np.sum(norm_values, axis=1))  # sum across each row; every row should sum to 1

[[8.95282664e-01 2.47083068e-02 8.00090293e-02]
 [9.99811129e-01 2.23163963e-05 1.66554348e-04]
 [5.13097164e-01 3.58333899e-01 1.28568936e-01]]
[1. 1. 1.]


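We need to subtract the largest output value in each row from that row's values before exponentiating, to avoid an overflow error if the output values are very large. The end result is the same, because the shift multiplies the numerator and the denominator by the same factor `e^(-max)`, which cancels in the division, but we prevent a potential overflow condition.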
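
The max-subtraction step could look something like the sketch below. The function name `softmax` is a helper made up for this example, and it assumes a 2D batch where each row is one sample, as in the cells above.

```python
import numpy as np

def softmax(layer_outputs):
    """Row-wise softmax with the max-subtraction trick (hypothetical helper)."""
    x = np.asarray(layer_outputs, dtype=float)
    # Shift each row so its largest value is 0; exp() then never sees a
    # large positive input and cannot overflow.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    # Normalize each row so it sums to 1.
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

layer_outputs = [[4.8, 1.21, 2.385],
                 [8.9, -1.81, 0.2],
                 [1.41, 1.051, 0.026]]

# Same calculation without the shift, for comparison.
naive_exp = np.exp(layer_outputs)
naive = naive_exp / np.sum(naive_exp, axis=1, keepdims=True)

stable = softmax(layer_outputs)
print(np.allclose(naive, stable))

# Inputs this large would overflow a plain np.exp, but the shifted
# version stays finite.
large = softmax([[1000.0, 1001.0, 1002.0]])
print(np.isfinite(large).all())
```

For the small values used earlier, both versions agree; the shift only matters once the inputs get large enough for `np.exp` to overflow.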