**Softmax Activation Function**  
We want our model to be a classifier, so we need an activation function meant for classification. One of these is the Softmax activation function. Why another activation function? The rectified linear unit is unbounded, not normalized with other units, and exclusive. Not normalized means a value can be anything, so an output of [23, 87, 220] has no context. Exclusive means each output is independent of the others.  
  
To address this lack of context, the softmax activation takes non-normalized (uncalibrated) inputs and produces a normalized distribution of probabilities for our classes. The distribution returned by the softmax activation function represents a confidence score for each class, and the scores add up to 1.

The softmax function:

$$S_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l} e^{z_{i,l}}}$$

In [1]:
layer_outputs = [4.8, 1.21, 2.385]

The first step is to exponentiate the outputs. We do this by raising Euler's number *e* to the power of each value: y = e^x

In [1]:
import math

# values from previous steps, when describing what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# math.e is the mathematical constant e (Euler's number)

# For each value in the vector, calculate the exponential value
exp_values = []
for output in layer_outputs:
    exp_values.append(math.e ** output)
print('exponentiated values:')
print(exp_values)

exponentiated values:
[121.51041751873483, 3.353484652549023, 10.859062664920513]


Exponentiation serves multiple purposes. To calculate probabilities, we need non-negative values. Take, for example, the output [4.8, 1.21, -2.385]: even after normalization, the last value would still be negative, since we just divide all of the values by their sum. A negative probability (or confidence) does not make sense. The exponential of any number is always positive: it approaches 0 as the input goes to negative infinity, equals 1 for an input of 0, and increases for positive values.
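
To make the point about negative values concrete, here is a small sketch that compares plain normalization (dividing by the sum) with exponentiating first and then normalizing, using the example output from the paragraph above. The names `naive` and `softmaxed` are illustrative only.

```python
import math

# Example output containing a negative value (from the paragraph above)
layer_outputs = [4.8, 1.21, -2.385]

# Plain normalization: divide each value by the sum of all values
total = sum(layer_outputs)
naive = [value / total for value in layer_outputs]

# Exponentiate first, then normalize
exp_values = [math.exp(value) for value in layer_outputs]
exp_total = sum(exp_values)
softmaxed = [value / exp_total for value in exp_values]

print('plain normalization:', naive)               # last entry is negative
print('exponentiate, then normalize:', softmaxed)  # all entries are positive
```

Only the second result can be read as a set of confidence scores, since every entry is positive and the entries add up to 1.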

In [9]:
# OMA - Exponential function y = e^x

# some random x values
values = [0,1,2,3,4]

# create empty list
dat = []

for value in values:
    d = [value, math.e**value]
    dat.append(d)

print(dat)

[[0, 1.0], [1, 2.718281828459045], [2, 7.3890560989306495], [3, 20.085536923187664], [4, 54.59815003314423]]
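
The cell above only uses non-negative inputs. As a quick check, using nothing beyond Python's standard `math` module, the sketch below applies the same function to negative inputs as well. The input values are arbitrary choices for illustration. The results shrink toward 0 but stay positive, and larger inputs always give larger outputs, so exponentiation keeps the ordering of the original values.

```python
import math

# Inputs spanning negative, zero, and positive values (illustrative choices)
values = [-10, -2, 0, 2, 10]

exp_results = [math.exp(value) for value in values]

for value, result in zip(values, exp_results):
    print(value, result)

# Every result is strictly positive
all_positive = all(result > 0 for result in exp_results)
# The outputs are in the same (increasing) order as the inputs
order_kept = exp_results == sorted(exp_results)
print(all_positive, order_kept)
```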


In [11]:
# Next normalize values
norm_base = sum(exp_values)
norm_values = []
for value in exp_values:
    norm_values.append(value / norm_base)
print('Normalized exponentiated values:')
print(norm_values)

print('Sum of normalized values:', sum(norm_values))

Normalized exponentiated values:
[0.8952826639572619, 0.024708306782099374, 0.0800090292606387]
Sum of normalized values: 0.9999999999999999
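
The two steps, exponentiation and normalization, can be bundled into one helper. This is a minimal sketch in plain Python. The function name `softmax` is chosen here for illustration and is not defined elsewhere in this notebook.

```python
import math

def softmax(values):
    """Exponentiate each value, then divide by the sum of the exponentials."""
    exp_values = [math.exp(value) for value in values]
    norm_base = sum(exp_values)
    return [value / norm_base for value in exp_values]

confidences = softmax([4.8, 1.21, 2.385])
print(confidences)

# The largest input receives the largest confidence score
best_class = confidences.index(max(confidences))
print(best_class)
```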


We can perform the same operation using NumPy:

In [15]:
import numpy as np

# Values from earlier, when describing what a neural network is
layer_outputs = [4.8, 1.21, 2.385]

# For each value in a vector, calculate exponential value
exp_values = np.exp(layer_outputs)
print('exponentiated values:')
print(exp_values)

# Now normalize values
norm_values = exp_values / np.sum(exp_values)
print('normalized exponentiated values:')
print(norm_values)
print('sum of normalized values:', np.sum(norm_values))

exponentiated values:
[121.51041752   3.35348465  10.85906266]
normalized exponentiated values:
[0.89528266 0.02470831 0.08000903]
sum of normalized values: 0.9999999999999999


Notice that the results are the same, but the NumPy version is faster and easier to read. We can exponentiate all of the values with a single call to np.exp(), then immediately normalize them with the sum. To train in batches, we must convert this functionality to accept layer outputs in batches. That is easy:

In [25]:
# re-create inputs first
inputs = [[1.0, 2.0, 3.0, 2.5],
          [2.0, 5.0, -1.0, 2.0],
          [-1.5, 5.0, -1.0, -0.8]]

# Get unnormalized probabilities
exp_values = np.exp(inputs)

# Normalize them for each sample
probabilities = exp_values / np.sum(exp_values, axis = 1, keepdims=True)
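# In a 2D array, summing with axis=1 adds across the columns of each row,
# giving one sum per row (one per sample)
# keepdims=True keeps those sums as a column of shape (3, 1),
# so the division is applied row by row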

print(exp_values)
print(probabilities)

[[  2.71828183   7.3890561   20.08553692  12.18249396]
 [  7.3890561  148.4131591    0.36787944   7.3890561 ]
 [  0.22313016 148.4131591    0.36787944   0.44932896]]
[[0.06414769 0.17437149 0.47399085 0.28748998]
 [0.04517666 0.90739747 0.00224921 0.04517666]
 [0.00149297 0.99303905 0.0024615  0.00300648]]
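
The batch version can be wrapped in a function in the same way. The sketch below assumes the input is a 2D array with one sample per row. `softmax_batch` is a hypothetical name used only for this example.

```python
import numpy as np

def softmax_batch(inputs):
    """Apply softmax to each row (sample) of a 2D array."""
    exp_values = np.exp(inputs)
    # Sum across each row; keepdims=True keeps shape (n_samples, 1)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

inputs = np.array([[1.0, 2.0, 3.0, 2.5],
                   [2.0, 5.0, -1.0, 2.0],
                   [-1.5, 5.0, -1.0, -0.8]])

probabilities = softmax_batch(inputs)
print(probabilities.shape)               # same shape as the input batch
print(np.sum(probabilities, axis=1))     # each row sums to (approximately) 1
print(np.argmax(probabilities, axis=1))  # index of the most confident class per sample
```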


Regarding *axis = 1* above: in a 2D array, axis 0 runs along the rows (down each column) and axis 1 runs along the columns (across each row). To see how axis affects the sum in NumPy, first use the default, which is *None*:

In [27]:
import numpy as np

layer_outputs = np.array([[4.8, 1.1, 2.385],
                         [8.0, -1.81, 0.2],
                         [1.41, 1.051, 0.026]])

print('Sum without axis')
print(np.sum(layer_outputs))

print('This will be identical to the above since default is None:')
print(np.sum(layer_outputs, axis=None))

Sum without axis
17.162
This will be identical to the above since default is None:
17.162


Note that without the axis argument, NumPy sums all of the values, regardless of how many dimensions the array has. Next, *axis=0* means we sum along axis 0, that is, down the rows. The output has one entry per position along the remaining dimension, and each entry is the sum of all values at that position. For a 2D array, the only other dimension is the columns, so the output vector holds the sum of each column. This means we compute (4.8 + 8.0 + 1.41), and so on.

In [28]:
print('Another way to think of it w/ a matrix == axis 0: columns:')
print(np.sum(layer_outputs, axis=0))

Another way to think of it w/ a matrix == axis 0: columns:
[14.21   0.341  2.611]
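
To see where those three numbers come from, the sketch below recomputes the `axis=0` result by hand, adding up each column with plain Python. It uses the same `layer_outputs` array as above. `manual_sums` is an illustrative name.

```python
import numpy as np

layer_outputs = np.array([[4.8, 1.1, 2.385],
                          [8.0, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

column_sums = np.sum(layer_outputs, axis=0)

# Manual version: for each column index, add the entries from every row
manual_sums = []
for col in range(layer_outputs.shape[1]):
    manual_sums.append(sum(row[col] for row in layer_outputs))

print(column_sums.shape)                       # one value per column
print(np.allclose(column_sums, manual_sums))   # both approaches agree
```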


This is not what we want. We want the sums of the rows. Before we show how to do it in NumPy, we do it from scratch:

In [30]:
print('But we want to sum the rows instead, like this w/ raw py:')

for i in layer_outputs:
    print(sum(i))

But we want to sum the rows instead, like this w/ raw py:
8.285
6.39
2.4869999999999997


We could append these numbers to a list in any way we want, but we will use NumPy and sum along axis 1:

In [32]:
print('So we can use sum axis 1, but note the current shape:')
print(np.sum(layer_outputs, axis=1))

So we can use sum axis 1, but note the current shape:
[8.285 6.39  2.487]


With the above we got the sums we wanted, but we want to keep the result as a single value per sample, with one row for each sample. We're trying to sum all the outputs from a layer for each sample in a batch, converting the layer's output array, whose row length equals the number of neurons in the layer, to just one value per sample. We need a column vector with these values, since it lets us normalize the entire batch of samples, sample-wise, with a single calculation.

In [34]:
print('Sum axis 1, but keep the same dimensions as input:')
print(np.sum(layer_outputs, axis=1, keepdims=True))

Sum axis 1, but keep the same dimensions as input:
[[8.285]
 [6.39 ]
 [2.487]]
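
The reason the column shape matters shows up in the division step. The following sketch reuses `layer_outputs` and introduces the illustrative names `correct` and `wrong` to compare dividing by the sums with and without `keepdims=True`. Because this array happens to be square, the version without `keepdims` runs without error, but it divides each column by a row sum, so the rows no longer add up to 1.

```python
import numpy as np

layer_outputs = np.array([[4.8, 1.1, 2.385],
                          [8.0, -1.81, 0.2],
                          [1.41, 1.051, 0.026]])

exp_values = np.exp(layer_outputs)

row_sums_column = np.sum(exp_values, axis=1, keepdims=True)  # shape (3, 1)
row_sums_flat = np.sum(exp_values, axis=1)                   # shape (3,)

# Column-shaped sums: every row is divided by its own sum
correct = exp_values / row_sums_column
# Flat sums: the sums are lined up with the columns instead of the rows
wrong = exp_values / row_sums_flat

print(row_sums_column.shape, row_sums_flat.shape)
print(np.sum(correct, axis=1))  # each row sums to (approximately) 1
print(np.sum(wrong, axis=1))    # rows do not sum to 1
```

With a non-square batch, such as the 3x4 `inputs` array used earlier, the flat version would raise a broadcasting error instead of silently producing wrong values.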
