# Building CNNs from scratch
The cross-entropy loss is commonly used in CNNs. The concept comes from information theory. The cross-entropy from $P$ to $Q$, denoted $H(P,Q) \overset{\text{def}}{=} \sum_{j} -P(j) \log Q(j)$, is the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that was actually generated according to probabilities $P$. For a fixed $P$, the lowest possible cross-entropy is achieved when $Q=P$.
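
To make the definition concrete, here is a minimal sketch that evaluates $H(P,Q)$ for a few hand-picked distributions. The helper name `cross_entropy`, the `eps` guard against $\log 0$, and the example probabilities are illustrative assumptions, not part of the notebook's `Softmax` class.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-15):
    """Cross-entropy H(P, Q) = -sum_j P(j) log Q(j), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps keeps log() finite if some Q(j) is exactly zero (an assumption of this sketch).
    return float(-np.sum(p * np.log(q + eps)))

p = np.array([0.7, 0.2, 0.1])        # data-generating distribution (illustrative)
q_close = np.array([0.6, 0.3, 0.1])  # observer's beliefs, close to p
q_far = np.array([0.1, 0.2, 0.7])    # observer's beliefs, far from p

print(cross_entropy(p, p))        # entropy of p, the smallest of the three
print(cross_entropy(p, q_close))  # slightly larger
print(cross_entropy(p, q_far))    # much larger
```

The further the observer's beliefs $Q$ are from $P$, the larger the expected surprisal, and the value at $Q=P$ is the entropy of $P$.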

In [None]:
import numpy as np
from layer import Layer

In [1]:
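class Softmax(Layer):
    """Softmax output layer paired with the cross-entropy loss."""

    def softmax(self):
        # Subtract the maximum logit before exponentiating to avoid overflow.
        shifted_logits = self.input - np.max(self.input)
        tmp = np.exp(shifted_logits)
        self.output = tmp / np.sum(tmp)
        return self.output

    def cross_entropy(self, y_true):
        # y_true: one-hot label vector; 1e-15 guards against log(0).
        loss = -np.sum(np.asarray(y_true) * np.log(self.output + 1e-15))
        return np.maximum(loss, 0)

    def forward(self, input):
        self.input = input
        return self.softmax()

    def backward(self, output_gradient, learning_rate):
        # Gradient of the combined softmax + cross-entropy loss w.r.t. the logits.
        # This presumes the caller passes the one-hot label vector as output_gradient.
        # learning_rate is unused because this layer has no trainable parameters.
        return self.output - output_gradient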

Supplementary information from Prof. Mu Li  
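Since the softmax function and the corresponding cross-entropy loss are so common, it is worth understanding a bit better how they are computed. Here we consider only the case where the labels $y_{i}$ form a probability distribution (for example, a one-hot vector), so that $\sum_{i=1}^{n} y_{i} = 1$; the last step of the derivation below relies on this.  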
$$
\begin{aligned}
l(Y,\hat Y) &= - \sum_{i=1}^{n} y_{i} \log \frac{\exp(o_{i})}{\sum_{k=1}^{n} \exp(o_{k})} \\
&= \sum_{i=1}^{n} y_{i} \log \sum_{k=1}^{n} \exp(o_{k}) - \sum_{i=1}^{n} y_{i} o_{i} \\
&= \log \sum_{k=1}^{n} \exp(o_{k}) - \sum_{i=1}^{n} y_{i} o_{i}
\end{aligned}
$$  
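
The equality between the first and last lines of this derivation can be checked numerically. The sketch below compares the loss computed directly from the softmax definition with the log-sum-exp form, assuming a one-hot label so that the $y_{i}$ sum to 1. The function names and the example logits are hypothetical, chosen only for illustration.

```python
import numpy as np

def loss_direct(o, y):
    """-sum_i y_i log softmax(o)_i, computed straight from the definition."""
    probs = np.exp(o) / np.sum(np.exp(o))
    return float(-np.sum(y * np.log(probs)))

def loss_logsumexp(o, y):
    """log sum_k exp(o_k) - sum_i y_i o_i, with the max factored out for stability."""
    m = np.max(o)
    lse = m + np.log(np.sum(np.exp(o - m)))
    return float(lse - np.sum(y * o))

o = np.array([2.0, -1.0, 0.5])  # illustrative logits
y = np.array([0.0, 0.0, 1.0])   # one-hot label, sums to 1

print(loss_direct(o, y), loss_logsumexp(o, y))  # the two values agree

big = np.array([1000.0, 0.0, -1000.0])
print(loss_logsumexp(big, y))  # stays finite, whereas exp(1000) alone would overflow
```

Factoring out the maximum logit is the same trick the `Softmax` class uses with `shifted_logits`; it leaves the result unchanged mathematically but keeps every exponent at or below zero.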
To understand a bit better what is going on, consider the derivative with respect to any logit $o_{j}$.
$$
\partial_{o_{j}} l(Y,\hat Y) = \frac{\exp(o_{j})}{\sum_{k=1}^{n} \exp(o_{k})} - y_{j} = \mathrm{softmax}(o)_{j} - y_{j}
$$  
In other words, the derivative is the difference between the probability assigned by our model, as expressed by the softmax operation, and what actually happened, as expressed by elements in the one-hot label vector. In this sense, it is very similar to what we saw in regression, where the gradient was the difference between the observation $y$ and estimate $\hat y$. This is not a coincidence. In any exponential family model, the gradients of the log-likelihood are given by precisely this term. This fact makes computing gradients easy in practice.
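
As a sanity check on this result, the following sketch compares the analytic gradient $\mathrm{softmax}(o) - y$ with a central finite-difference estimate of the loss gradient. The helper names, the step size `h`, and the example values are assumptions made for illustration.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))  # shift by the max logit for numerical stability
    return e / np.sum(e)

def loss(o, y):
    return float(-np.sum(y * np.log(softmax(o))))

def numerical_grad(o, y, h=1e-6):
    """Central finite-difference estimate of d loss / d o_j for every j."""
    grad = np.zeros_like(o)
    for j in range(o.size):
        step = np.zeros_like(o)
        step[j] = h
        grad[j] = (loss(o + step, y) - loss(o - step, y)) / (2 * h)
    return grad

o = np.array([2.0, -1.0, 0.5])  # illustrative logits
y = np.array([0.0, 0.0, 1.0])   # one-hot label

analytic = softmax(o) - y
numeric = numerical_grad(o, y)
print(np.max(np.abs(analytic - numeric)))  # small: the two gradients agree
```

Because both $\mathrm{softmax}(o)$ and a one-hot $y$ sum to 1, the components of the gradient sum to zero, which is another quick check when debugging a `backward` implementation.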