### What is a deep network?

The human brain contains roughly $10^{11}$ neurons, which form an intricate network that serves as the interface between our bodies and the world around us. The job of neuroscientists is to determine the nature of that network and how it supports interactions with the external world such as perception, prediction, and action. The ultimate goal of deep learning is to reproduce in artificial models what biological neural networks can do. However, most deep learning models share few similarities with their biological counterparts, largely because we still lack an understanding of how biological networks learn and operate. That said, drastically simplified models of these networks have been developed, such as feed-forward networks and recurrent neural networks. Alongside them, learning rules such as backpropagation have been developed to train networks for a specific purpose. Unlike the general intelligence seen in biological brains, these networks are typically built to perform very specific tasks, such as the classification of handwritten digits. The search for artificial general intelligence continues.

### Artificial neural networks (ANNs)

In its most abstract form, an artificial neural network is a composite function that takes an input and produces an output. All of the details of the network structure are summarized by a single symbol $\Phi$. For classification, the output is a *probability distribution* over the possible classes of the input. That is, given an input $x$, we predict the probability that it belongs to class $y$ by computing $P_{\Phi}(y|x)$. There are a number of steps between the input $x$ and an estimate of $P_{\Phi}(y|x)$. To begin, we need some fundamental concepts from information theory.


### Information theory

Information entropy is an information-theoretic concept introduced by Claude Shannon in his 1948 paper *A Mathematical Theory of Communication*. At its core, information entropy tells us how much information is contained in the distribution of a variable. Bits are chosen as the unit of measure because information theory was originally devised to describe the novel communication systems of the mid-20th century: digital systems.

### Entropy

As in statistical mechanics, information entropy $\textbf{H}$ is a measure of uncertainty. In information theory, it is the average number of bits needed to encode the state of a "system" with possible states $\chi$, given some probability distribution $P(x)$ over those states. An example provides the quickest route to intuition, but first here is the definition:

\begin{equation*}
\textbf{H} = -\sum_{x\in \chi} P(x)\log_{2} P(x)
\end{equation*}

Note that you can use a $\log$ with whatever base you like as long as the units are noted. I use units of bits because this is intuitive but you can equivalently use a natural logarithm and units of 'nats'.
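
The definition above translates almost directly into code. What follows is a minimal sketch in plain Python, not a reference implementation; the function name `entropy`, its `base` parameter, and the two coin distributions are illustrative assumptions rather than anything from the text.

```python
import math


def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # States with P(x) = 0 are skipped (convention: 0 * log 0 = 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)


# Illustrative distributions (assumed for this sketch).
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

h_bits = entropy(fair_coin)           # base 2, so the unit is bits
h_nats = entropy(fair_coin, math.e)   # base e, so the unit is nats
h_biased = entropy(biased_coin)       # lower than the fair coin
```

Under these assumptions the fair coin comes out at one bit, and the same quantity in nats is smaller by a factor of $\ln 2$. The biased coin has lower entropy because its outcome is easier to guess.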

### An example

Consider a horse race where the horses are equally likely to win. For the sake of generality let's assume there are $N$ horses and we want to send someone a binary string that tells them which horse won the race. The entropy is 

\begin{equation*}
\textbf{H} = \log_{2}N
\end{equation*}

So, in general, it takes $\log_{2}N$ bits to describe the winner. For two horses you need only one bit, for three horses you need ~1.6 bits on average, and so on. Notice that the calculation simplified significantly under the assumption that $P(x)$ is uniform, or *flat*: since $P(x) = 1/N$ for every horse, the sum collapses to $-\log_{2}(1/N) = \log_{2}N$. You might guess that a uniform distribution provides the highest entropy, and you would be correct. Also, from an optimization perspective, the entropy defines a lower bound on the average number of bits we need to describe a state drawn from $\chi$. You simply can't do better.

Typically things are not this simple and you have to compute $\textbf{H}$ for a more complicated distribution. To deal with that more general case, it helps to realize that the space of states $\chi$ is fixed for a given scenario. Our job is to best approximate the distribution on that space (which is not always easy a priori). For a given $\chi$, **the distribution $P(x)$ is what determines the entropy**. To illustrate, consider another case where $P(x)$ is a delta function at $x_{0}$, that is, $P(x_{0}) = 1$ and $P(x) = 0$ everywhere else. Plugging that into the definition above (with the convention $0\log_{2}0 = 0$) yields $\textbf{H} = 0$, meaning there is no uncertainty in $x$ at all.
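
The two extremes discussed above, plus one case in between, can be checked numerically. This is a small sketch under assumed numbers: the eight-horse field and the skewed win probabilities are hypothetical values chosen for illustration, and `entropy_bits` is just a helper name introduced here.

```python
import math


def entropy_bits(probs):
    """Entropy in bits; states with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


n_horses = 8  # hypothetical field size

# Every horse equally likely to win.
uniform = [1 / n_horses] * n_horses

# Hypothetical skewed odds: a strong favorite and several long shots.
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

# Delta distribution: the first horse always wins.
certain = [1.0] + [0.0] * (n_horses - 1)

h_uniform = entropy_bits(uniform)   # log2(8) = 3 bits
h_skewed = entropy_bits(skewed)     # 2 bits
h_certain = entropy_bits(certain)   # 0 bits
```

The state space is the same in all three cases; only $P(x)$ changes, and the entropy moves from its maximum of $\log_{2}N$ down to zero as the distribution becomes more concentrated.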


### Cross-Entropy and KL-Divergence

Above we saw that the entropy lies somewhere between zero and the entropy of a uniform distribution. Another useful measure in information theory is the *cross-entropy*, which compares two distributions $P(x)$ and $Q(x)$ over the same variable. It is the average number of bits needed to encode states drawn from $P$ using a code optimized for $Q$.

\begin{equation*}
\textbf{H}(P,Q) = -\sum_{x\in \chi} P(x)\log_{2} Q(x)
\end{equation*}

Notice that if the two distributions are the same, the cross-entropy equals the entropy. The cross-entropy also has an optimization interpretation. Suppose $P(x)$ is a "good" distribution that we want to match, and $Q(x)$ is our current approximation to it. The cross-entropy $\textbf{H}(P,Q)$ is never smaller than the entropy $\textbf{H}(P)$, and the gap grows as $Q$ deviates from $P$. If we can transform $Q(x)$ to look like $P(x)$, we shrink that gap and have optimized $Q$. This leads us to the definition of KL-divergence:

\begin{equation*}
\textbf{KL}(P,Q) = \textbf{H}(P,Q) - \textbf{H}(P)
\end{equation*}
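
Substituting the definitions of the cross-entropy and the entropy, and assuming both sums run over the same state space $\chi$, gives the form in which the KL-divergence is more commonly written:

\begin{equation*}
\textbf{KL}(P,Q) = -\sum_{x\in \chi} P(x)\log_{2} Q(x) + \sum_{x\in \chi} P(x)\log_{2} P(x) = \sum_{x\in \chi} P(x)\log_{2}\frac{P(x)}{Q(x)}
\end{equation*}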

The KL-Divergence is simply the difference between the cross-entropy and the entropy. As our predicted distribution gets closer to the actual distribution, the KL-Divergence tends to zero. Eventually, we will use these tools to understand a very common loss function in deep learning: cross-entropy loss.
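
To see these quantities behave as described, here is a short sketch in plain Python. The function names and the three distributions `p`, `q_close`, and `q_far` are assumptions made for illustration; the code also assumes $Q(x) > 0$ wherever $P(x) > 0$, otherwise the cross-entropy is infinite and `math.log2` raises an error.

```python
import math


def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(P, Q) as the difference between cross-entropy and entropy."""
    return cross_entropy(p, q) - entropy(p)


# Illustrative distributions over three states (assumed values).
p = [0.5, 0.25, 0.25]        # the "actual" distribution
q_close = [0.4, 0.3, 0.3]    # a prediction near p
q_far = [0.1, 0.1, 0.8]      # a prediction far from p

kl_same = kl_divergence(p, p)         # zero: cross-entropy equals entropy
kl_close = kl_divergence(p, q_close)  # small positive value
kl_far = kl_divergence(p, q_far)      # larger positive value
```

Note that the divergence is not symmetric: swapping the arguments generally gives a different number, so it matters which distribution plays the role of $P$.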


### Cross-Entropy Loss

In the context of machine learning, training can be thought of as minimizing the KL-divergence. As stated above, a neural network is essentially a composite function $\Phi$ that eventually outputs a probability distribution $Q$ (see the Softmax section).

In supervised machine learning, we have a set of data called the *training set*. Having that set of data allows us to calculate the empirical distribution $P$ that the network's output $Q$ should match.

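Ultimately, we want to find the composite function $\Phi$ that maps an input $x$ to the correct output $y$, and we do that by minimizing the cross-entropy (or, equivalently, the KL-divergence, since $\textbf{H}(P)$ does not depend on $\Phi$). Essentially, that means we match the distribution predicted by our function $\Phi$ to the distribution of the training data. Here the natural logarithm is used, so the loss is measured in nats: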

\begin{equation*}
\DeclareMathOperator*{\argmin}{argmin}
\Phi^{*} = \underset{\Phi}{\argmin} -\sum_{x\in \chi} P(x)\ln Q_{\Phi}(x)
\end{equation*}
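
In classification the target distribution $P$ is commonly taken to be one-hot: all of the probability sits on the correct class. That is an assumption added here, not something stated above. Under it, the sum collapses to a single term, $-\ln Q_{\Phi}$ evaluated at the true class. The sketch below computes that loss for a small batch; the function names, the predicted probabilities, and the labels are all hypothetical.

```python
import math


def cross_entropy_loss(predicted, true_class):
    """Loss in nats for one example: -ln Q(true_class).

    With a one-hot target P, the sum -sum_x P(x) ln Q(x) keeps only
    the term belonging to the true class.
    """
    return -math.log(predicted[true_class])


def mean_loss(batch_predictions, batch_labels):
    """Average the per-example loss over a batch."""
    losses = [
        cross_entropy_loss(q, y)
        for q, y in zip(batch_predictions, batch_labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical network outputs for three examples and three classes.
predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # index of the correct class for each example

loss = mean_loss(predictions, labels)
```

The loss shrinks toward zero as the probability assigned to the correct class approaches one, and it grows without bound as that probability approaches zero, which is why confident wrong predictions are penalized so heavily.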

### Softmax

Typically the last layer of an ANN is a softmax layer. The purpose of that layer is to map the output of the preceding transformation to the probability distribution $Q_{\Phi}$ for use in computing the loss or making a prediction.

\begin{equation*}
\DeclareMathOperator*{\softmax}{softmax}
Q_{\Phi} = \softmax s_{\Phi}(x) = \frac{1}{Z}e^{s_{\Phi}(x)}
\end{equation*}

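where the function $s_{\Phi}(x)$ is the output of the last transformation in our network before the softmax. This is just an exponential probability distribution like that used in Boltzmann statistics. The normalization constant $Z$ ensures the probabilities sum to one: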

\begin{equation*}
Z = \sum_{x} e^{s_{\Phi}(x)}
\end{equation*}
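
A minimal sketch of the softmax in plain Python follows. It treats the network output as a list of scores, one per class, which is an assumption about how $s_{\Phi}$ is laid out; the example scores are made up. Subtracting the largest score before exponentiating is a common numerical-stability trick rather than part of the definition: it multiplies numerator and $Z$ by the same factor, so the result is unchanged.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to a probability distribution."""
    # Shift by the maximum so the largest exponent is exp(0) = 1,
    # which avoids overflow for large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # the normalization constant Z (after the shift)
    return [e / z for e in exps]


# Hypothetical scores for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

# Large scores would overflow a naive exp(), but the shifted version is fine.
large_probs = softmax([1000.0, 1000.0])
```

The output preserves the ordering of the scores, every entry lies between zero and one, and the entries sum to one, so it can be passed directly to a cross-entropy loss.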