In [3]:
import numpy as np

<p><b>Softmax</b></p>
<p>The softmax function takes a vector $z$ of $K$ real numbers and normalizes it into a probability distribution of $K$ probabilities proportional to the exponentials of the inputs. Before softmax is applied, components may be negative or greater than one and need not sum to 1. Afterwards, each component lies in the interval $(0,1)$ and the components sum to 1, so they can be interpreted as probabilities. Larger input components map to larger probabilities. In the formulas below, $p$ is the softmax of the vector $o$, $y$ is the softmax of the vector $m$, and $L$ is the cross-entropy of $p$ relative to $y$.</p>
$$ p_i = \frac{e^{o_i}}{\sum\limits_{j} e^{o_j}}$$
$$ y_i = \frac{e^{m_i}}{\sum\limits_{j} e^{m_j}}$$
$$L = -\sum\limits_{j} y_j \ln{p_j}$$

In [15]:
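def softmax(x):
    # Subtracting max(x) leaves the result unchanged but keeps np.exp from overflowing
    exp_x = np.exp(x - np.max(x))
    # Normalize over all entries (x is treated as a single vector)
    return exp_x/exp_x.sum()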

In [16]:
#Example
x = np.array([1,2,3])
softmax(x)

array([0.09003057, 0.24472847, 0.66524096])
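
<p>The subtraction of <code>np.max(x)</code> in <code>softmax</code> does not change the result, because softmax depends only on the differences between components. The following is a minimal sketch of that property. The names <code>softmax_naive</code> and <code>softmax_stable</code> are illustrative and not part of the notebook; <code>softmax_stable</code> repeats the definition above so the block runs on its own.</p>

```python
import numpy as np

def softmax_naive(x):
    # Direct transcription of the formula, without the max shift
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Same shift as the softmax defined above
    e = np.exp(x - np.max(x))
    return e / e.sum()

small = np.array([1.0, 2.0, 3.0])
large = small + 1000.0  # same differences between entries, much larger magnitude

# For moderate inputs the two versions agree
same_on_small = np.allclose(softmax_naive(small), softmax_stable(small))

# Adding a constant to every entry does not change the stable result
shift_invariant = np.allclose(softmax_stable(small), softmax_stable(large))

# np.exp(1001.0) overflows to inf in float64, so the naive version returns nan (inf / inf)
with np.errstate(over="ignore", invalid="ignore"):
    naive_on_large = softmax_naive(large)
naive_has_nan = bool(np.isnan(naive_on_large).any())
```

<p>Under these assumptions, all three flags should come out <code>True</code>: the shifted version handles inputs for which the unshifted formula breaks down.</p>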

$$ \frac{\partial p_i }{\partial o_k} = 
\frac{\delta_{ik} e^{o_i} \sum\limits_{j} e^{o_j} - e^{o_i}e^{o_k} }{\left(\sum\limits_{j} e^{o_j}\right)^2}=
\delta_{ik} p_k - p_i p_k $$
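
<p>In matrix form this Jacobian is $\mathrm{diag}(p) - p\,p^{T}$. One way to sanity-check it is to compare it with central finite differences, as in the sketch below. The helper names <code>softmax_jacobian</code> and <code>numerical_jacobian</code> and the step size <code>eps</code> are assumptions made for illustration.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(o):
    # J[i, k] = delta_ik * p_k - p_i * p_k
    p = softmax(o)
    return np.diag(p) - np.outer(p, p)

def numerical_jacobian(o, eps=1e-6):
    # Central differences: column k approximates d p / d o_k
    o = np.asarray(o, dtype=float)
    J = np.zeros((o.size, o.size))
    for k in range(o.size):
        step = np.zeros_like(o)
        step[k] = eps
        J[:, k] = (softmax(o + step) - softmax(o - step)) / (2 * eps)
    return J

o = np.array([1.0, 2.0, 3.0])
J_analytic = softmax_jacobian(o)
J_numeric = numerical_jacobian(o)
max_abs_diff = np.max(np.abs(J_analytic - J_numeric))
```

<p>Each column of the Jacobian sums to zero, because the probabilities always sum to 1 and any change in one must be offset by the others.</p>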
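$$\frac{\partial L}{\partial o_k} = -\sum\limits_{j} 
 \frac{y_j}{p_j}\left(\delta_{kj} p_k - p_j p_k \right) = 
-y_k + \sum\limits_{j} y_j p_k =
-y_k + p_k\sum\limits_{j} y_j =
-y_k+p_k$$
<p>The last step uses $\sum\limits_{j} y_j = 1$, which holds because $y$ is itself a softmax output and does not depend on $o$.</p>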
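
<p>The result $\partial L / \partial o_k = p_k - y_k$ can be checked numerically. The sketch below compares it with a central finite-difference gradient of the cross-entropy. The helpers <code>cross_entropy</code> and <code>numerical_gradient</code> and the step size <code>eps</code> are illustrative assumptions, and the input values are arbitrary.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(o, m):
    # L = -sum_j y_j ln p_j with p = softmax(o), y = softmax(m)
    p = softmax(o)
    y = softmax(m)
    return -np.dot(y, np.log(p))

def numerical_gradient(o, m, eps=1e-6):
    # Central differences of L with respect to each o_k, holding m fixed
    o = np.asarray(o, dtype=float)
    grad = np.zeros_like(o)
    for k in range(o.size):
        step = np.zeros_like(o)
        step[k] = eps
        grad[k] = (cross_entropy(o + step, m) - cross_entropy(o - step, m)) / (2 * eps)
    return grad

o = np.array([1.0, 2.0, 3.0])
m = np.array([1.5, 2.5, 3.0])
analytic = softmax(o) - softmax(m)  # p_k - y_k
numeric = numerical_gradient(o, m)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

<p>The two gradients should agree up to finite-difference error, and the components of the analytic gradient sum to zero because both $p$ and $y$ sum to 1.</p>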

In [17]:
def softmax_loss_gradient(o,m):
    # p: predicted distribution from o; y: target distribution from m
    p = softmax(o)
    y = softmax(m)
    # Cross-entropy L = -sum_j y_j ln p_j
    loss = -y.dot(np.log(p))
    # dL/do_k = p_k - y_k
    return {'gradient': p - y, 'loss': loss}

In [19]:
#Example 
o = np.array([1,2,3])
m = np.array([1.5,2.5,3])
result = softmax_loss_gradient(o,m)
result['gradient'], round(float(result['loss']), 6)

(array([-0.03192108, -0.08677049,  0.11869157]), 0.983008)