<h1>CS4618: Artificial Intelligence I</h1>
<h1>Neural Networks</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Acknowledgements</h1>
<ul>
    <li>The colourful diagrams are my own invention but were improved by seeing similar diagrams in materials produced by Sebastian Raschka.</li>
</ul>

<h1>Introduction</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$
<ul>
    <li><b>Neural Networks</b> are <em>loosely</em> inspired by what we know about our brains:
        <ul>
            <li>Networks of neurons.</li>
        </ul>
    </li>
    <li>However, they are <em>not</em> models of our brains.
         <ul>
             <li>E.g. there is no evidence that the brain uses the learning algorithm that is used by neural
                 networks.
             </li>
         </ul>
    </li>
</ul>

<h1>Biological Neurons</h1>
<ul>
    <li>Your brain is a network of about 10<sup>11</sup> neurons, each connected to about 10<sup>4</sup> others:
        <figure>
            <img src="images/brain.png" />
        </figure>
    </li>
    <li>Sufficient electrical activity on a neuron’s dendrites causes an electrical pulse
        to be sent down the axon, where it may activate other neurons.
        <figure style="text-align: center">
            <img src="images/cell.png" />
            <figcaption>
                <a style="font-size: 0.8em" href="https://commons.wikimedia.org/wiki/File:Neuron.svg">https://commons.wikimedia.org/wiki/File:Neuron.svg</a>
            </figcaption>
        </figure>
    </li>
</ul>

<h1>Artificial Neurons</h1>
<ul>
    <li>A simple artificial neuron:
        <figure>
            <img src="images/ltu.png" />
        </figure>
    </li>
    <li>It has $n$ real-valued inputs, $\v{x}_1,\ldots,\v{x}_n$.</li>
    <li>The connections have real-valued weights, $\v{w}_1,\ldots,\v{w}_n$.</li>
    <li>The neuron also has a number $b$ called the <b>bias</b>.</li>
    <li>The neuron computes the weighted sum of its inputs and adds $b$:
        $$z = b + \v{w}_1\v{x}_1 + \v{w}_2\v{x}_2 + \cdots + \v{w}_n\v{x}_n$$
        or if $\v{x}$ is a row vector of the inputs and $\v{w}$ is a (column) vector of the weights
        $$z = b + \v{x}\v{w}$$
    </li>
    <li>The neuron then usually applies an <b>activation function</b>, $g$, to the weighted sum, $z$.
        Many activation functions have been proposed, including:
        <ul>
            <li><b>linear activation function</b>: $$g(z) = z$$</li>
            <li><b>step activation function</b>:
                $$g(z) = \left\{ \begin{array}{lr}
                    0 & \mbox{if } z < 0 \\
                    1 & \mbox{if } z \geq 0
                    \end{array}
                  \right.
                $$
            </li>
            <li><b>sigmoid activation function</b>: $$g(z) = \frac{1}{1 + e^{-z}}$$</li>
            <li><b>ReLU activation function</b> (ReLU stands for Rectified Linear Unit): $$g(z) = \max(0, z)$$</li>
            <li><b>tanh activation function</b> (tanh is the hyperbolic tangent): $$g(z) = \tanh(z)$$</li>
        </ul>
    </li>
    <li>Apart from the linear activation function, these activation functions are <b>non-linear</b>, which
        is important to the power of neural networks.
    </li>
</ul>
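<p>As a minimal sketch of the calculation above, a single neuron can be written in a few lines of NumPy. The function names and the input, weight and bias values are made up for illustration; they are not taken from the lecture's diagrams.</p>

```python
import numpy as np

# Activation functions from the list above (tanh is available as np.tanh)
def linear(z):
    return z

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, g):
    """Weighted sum of the inputs plus the bias, followed by activation g."""
    z = b + x @ w
    return g(z)

# Illustrative (made-up) inputs, weights and bias
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.5

# Here z = 0.5 + 0.5 - 2.0 + 0.75 = -0.25
for g in (linear, step, sigmoid, relu, np.tanh):
    print(g.__name__, neuron(x, w, b, g))
```

<p>The same weighted sum $z$ gives a different output depending on which activation function is chosen.</p>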

<h2>Dot product</h2>
<ul>
    <li>Although artificial neurons are inspired by real neurons, really all we're doing is taking the dot 
        product of two vectors, adding the bias, and applying the activation function to the result.
    </li>
</ul>
<img src="images/nn_maths1.png" />

<h2>Relationship with Linear Models</h2>
<ul>
    <li>It should be clear that a single artificial neuron that uses the linear
        activation function (which, in effect, does nothing) gives us the same linear models that
        we had in Linear Regression.
        <ul>
            <li>The only differences are in terminology. In Linear Regression, the parameters are
                referred to as coefficients (denoted $\v{\beta}$); in a neuron, the parameters
                are the weights (denoted $\v{w}$) and the bias ($b$).
            </li>
            <li>If we learn the values of the weights and bias using MSE as our loss function, then we will be doing
                OLS regression.
            </li>
        </ul>
    </li>
    <li>It should be clear that a single artificial neuron that uses the sigmoid 
        activation function gives us the same models that we had when using Logistic Regression
        for binary classification. (And we could learn the weights using the binary cross-entropy function as our
        loss function.)
    </li>
</ul>
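<p>The link with OLS regression can be checked numerically. The sketch below assumes a small synthetic dataset (made up for illustration) and trains a single neuron with the linear activation function by batch gradient descent on the MSE loss. The learning rate and number of iterations are arbitrary choices. The learned bias and weights should agree with the least-squares coefficients.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y = 3 + 2*x1 - x2 + a little noise
X = rng.normal(size=(50, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

# A single neuron with linear activation, trained by batch
# gradient descent on the MSE loss
w = np.zeros(2)
b = 0.0
learning_rate = 0.1
for _ in range(2000):
    error = (b + X @ w) - y          # predictions minus targets
    w -= learning_rate * 2.0 * (X.T @ error) / len(y)
    b -= learning_rate * 2.0 * error.mean()

# OLS solution for comparison: prepend a column of ones for the intercept
X1 = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("neuron:", b, w)
print("OLS:   ", beta)
```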

<h1>Layers of Neurons</h1>
<ul>
    <li>We don't usually have just one neuron. We have a <b>layer</b>, containing several neurons.</li>
    <li>For now let's consider what is called a <b>dense layer</b> (also a <b>fully-connected layer</b>):
        <ul>
            <li>every input is connected to every neuron in the layer.
                <figure>
                    <img src="images/layer.png" />
                </figure>
            </li>
       </ul>
   </li>
   <li>So now we have more than one output, one per neuron. But each is calculated in the same way as before:
       compute a weighted sum of the inputs; apply the activation function to the weighted sum.
    </li>
</ul>

<h2>Matrix multiplication</h2>
<ul>
    <li>Suppose there are $m$ inputs and $p$ neurons in this layer. We can put all the weights into an $m \times p$ matrix:
        <ul>
            <li>The first column contains the weights on the inputs going into the first neuron in the layer.
                There will be $m$ weights.
            </li>
            <li>The second column contains the weights going into the second neuron.</li>
            <li>And so on, one column for each of the $p$ neurons in the layer.</li>
        </ul>
        Call this matrix of weights $\v{W}$.
    </li>
    <li>We can put each neuron's bias into a (row) vector, $\v{b}$. This vector has $p$ values in it.</li>
    <li>If we assume $\v{x}$ is a row vector (as before), we can obtain all the outputs with simple calculations:
        $$\v{z} = \v{b} + \v{x}\v{W}$$
        &mdash; a matrix multiplication to get all the weighted sums to which we add the biases, giving a vector
        $\v{z}$ with $p$ cells in it. Then we apply $g$ element-wise to $\v{z}$:
        $$\v{a} = g(\v{z})$$
        $\v{a}$ is a vector with $p$ elements in it, one output per neuron in this layer.
    </li>
</ul>
<img src="images/nn_maths2.png" />
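<p>A minimal NumPy sketch of the dense layer calculation, assuming 3 inputs and 2 neurons. All the numbers are made up for illustration, and the sigmoid is just one possible choice of $g$.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (made-up) values: 3 inputs, 2 neurons
x = np.array([1.0, 2.0, 3.0])      # row vector of inputs
W = np.array([[ 0.1, -0.2],
              [ 0.0,  0.5],
              [-0.3,  0.1]])       # column j holds the weights into neuron j
b = np.array([0.5, -1.0])          # one bias per neuron

z = b + x @ W                      # weighted sums, approximately [-0.3, 0.1]
a = sigmoid(z)                     # activation applied element-wise

print(z, a)
```

<p>Each entry of <code>z</code> is what the corresponding neuron would compute on its own from its column of <code>W</code>.</p>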

<h1>Multi-Layer Neural Networks</h1>
<ul>
    <li>We don't usually have just one layer. We have multiple layers.</li>
    <li>Let's assume they are also <b>dense layers</b>, e.g.:
        <figure>
            <img src="images/layers.png" />
        </figure>
    </li>
    <li>These neural networks contain:
        <ul>
            <li>an <b>input layer</b> (although this is not a layer of neurons);</li>
            <li>one or more <b>hidden layers</b>;</li>
            <li>an <b>output layer</b>.</li>
        </ul>
        Every neuron has a bias.
    </li>
    <li>The <b>depth</b> of a multi-layer neural network is simply the number of <em>layers of neurons</em>.
        <ul>
            <li>What is the depth of the network in the diagram above?</li>
        </ul>
    </li>
    <li>We would say that the network shown in the diagram is a <b>layered</b>, <b>dense</b>, <b>feedforward</b>
        network.
        <ul>
            <li>These networks have the simplest <b>architecture</b> (structure).</li>
            <li>In later lectures, we will see networks: 
                <ul>
                    <li>where there may be 'splits' and 'merges';</li>
                    <li>where not every layer is densely-connected; and</li>
                    <li>where outputs may feed back to the earlier layers (<b>recurrent networks</b>).</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h2>Matrix multiplication again</h2>
<ul>
    <li>Suppose there are $m$ inputs and $p^{(0)}$ neurons in the first layer, layer 0. We can put all the weights into an 
    $m \times p^{(0)}$ matrix. Call this matrix of weights $\v{W}^{(0)}$. We can put the biases of the
        neurons in layer 0 into a (row) vector with $p^{(0)}$ elements; call this vector $\v{b}^{(0)}$.
    </li>
    <li>Suppose there are $p^{(1)}$ neurons in the second layer. We can put all the weights between the first
        layer and the second layer into a $p^{(0)} \times p^{(1)}$ matrix, called $\v{W}^{(1)}$.
        <ul>
            <li>Why are its dimensions $p^{(0)} \times p^{(1)}$?</li>
        </ul>
        And we can put all the biases into a (row) vector with $p^{(1)}$ elements, $\v{b}^{(1)}$.
    </li>
    <li>If we assume $\v{x}$ is a row vector (as before), we can obtain all the outputs of the first layer 
        with simple calculations:
        $$\v{z}^{(0)} = \v{b}^{(0)} + \v{x}\v{W}^{(0)}$$
        $$\v{a}^{(0)} = g(\v{z}^{(0)})$$
    </li>
    <li>Then we can obtain all the outputs of the second layer with similar calculations:
        $$\v{z}^{(1)} = \v{b}^{(1)} + \v{a}^{(0)}\v{W}^{(1)}$$
        $$\v{a}^{(1)} = g(\v{z}^{(1)})$$
    </li>
    <li>If there are more layers, then we just do more matrix multiplications followed by element-wise application
        of $g$.
    </li>
</ul>
<img src="images/nn_maths3.png" />

<ul>
    <li>In fact, as you've seen before, when we make predictions for unseen examples (inference), 
        we often want predictions, not for a single
        object $\v{x}$, but for a set of objects $\v{X}$. This is also true during training, in the case of Batch
        Gradient Descent and Mini-Batch Gradient Descent.
    </li>
    <li>This is a simple generalization of the above:
        $$\v{Z}^{(0)} = \v{b}^{(0)} + \v{X}\v{W}^{(0)} \mbox{ and } \v{A}^{(0)} = g(\v{Z}^{(0)})$$
        $$\v{Z}^{(1)} = \v{b}^{(1)} + \v{A}^{(0)}\v{W}^{(1)} \mbox{ and } \v{A}^{(1)} = g(\v{Z}^{(1)})$$
        &hellip;and so on.
    </li>
</ul>
<img src="images/nn_maths4.png" />
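<p>The batched calculation can be sketched in NumPy as follows. The layer sizes, the random weights and the choice of ReLU and sigmoid as activation functions are all assumptions made for illustration. Note that NumPy broadcasting adds each bias vector to every row of the matrix product.</p>

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative sizes: 5 examples, 4 inputs, 3 hidden neurons, 2 output neurons
n_examples, n_inputs, p0, p1 = 5, 4, 3, 2

X = rng.normal(size=(n_examples, n_inputs))   # one example per row
W0 = rng.normal(size=(n_inputs, p0))
b0 = rng.normal(size=p0)
W1 = rng.normal(size=(p0, p1))
b1 = rng.normal(size=p1)

# Layer 0, then layer 1, for all examples at once
Z0 = b0 + X @ W0          # shape (n_examples, p0)
A0 = relu(Z0)
Z1 = b1 + A0 @ W1         # shape (n_examples, p1)
A1 = sigmoid(Z1)

# Feeding a single example through gives the matching row of A1
a1_first = sigmoid(b1 + relu(b0 + X[0] @ W0) @ W1)

print(A1.shape, np.allclose(A1[0], a1_first))
```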

<ul>
    <li>This is all that a neural network consists of! They are just collections of:
        <ul>
            <li>matrix multiplications; and</li>
            <li>element-wise activation functions.</li>
        </ul>
    </li>
    <li>Actually, that's not true. We need to generalize a little bit. They are just collections of:
        <ul>
            <li><b>affine transformations</b> (matrix multiplication followed by the addition of a bias vector
                being one example of an affine transformation, i.e. a linear operation plus a translation); and
            </li>
            <li>element-wise functions (activation functions being one example).
            </li>
        </ul>
    </li>
    <li>Looking at neural networks in this way also helps us realise that a neural network simply defines
        a function as a composite of other functions. In the example above, the whole network computes the
        following:
        $$g^{(1)}( g^{(0)}( \v{X}\v{W}^{(0)} + \v{b}^{(0)} )\v{W}^{(1)} + \v{b}^{(1)} )$$
    </li>
</ul>

<h2>Why do we want more layers?</h2>
<ul>
    <li>A single neuron (or layer of neurons) gives us linear models.</li>
    <li>With linear models, there are problems we cannot solve.
        <ul>
            <li>For example, we cannot build a classifier that correctly classifies exclusive-or:
                <img src="images/xor.png" />
            </li>
            <li>Why not?</li>
            <li>(A recent paper in <i>Science Magazine</i> claims that a single layer of biological neurons
                <em>can</em> compute exclusive-or. If true, this confirms what we said earlier: artificial neural
                networks are inspired by the human brain, but they are not a model of the human brain.)
            </li>
        </ul>
    </li>
    <li>But, with multiple layers of neurons and the non-linearities of their activation functions, we eliminate
        these limitations.
        <ul>
            <li>E.g. if you search online, you'll find examples of two-layer networks that can correctly classify
                exclusive-or.
            </li>
            <li>Other things being equal, each extra hidden layer enlarges the set of hypotheses that the 
                network can represent: increasing complexity.
            </li>
            <li>In fact, the <b>universal approximation theorem</b> states that a feed-forward network with a
                finite (but arbitrarily large) single hidden layer can approximate any
                continuous function (to any desired precision), under mild assumptions on the activation function.
            </li>
        </ul>
    </li>
</ul>
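<p>One hand-crafted example of a two-layer network for exclusive-or is sketched below. The weights and biases were chosen by hand for illustration, not learned, and many other choices work. With the step activation function, the two hidden neurons compute OR and NAND of the inputs and the output neuron computes the AND of those two values.</p>

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

# The four possible inputs to exclusive-or
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Hidden layer (hand-chosen): column 0 computes OR, column 1 computes NAND
W0 = np.array([[1.0, -1.0],
               [1.0, -1.0]])
b0 = np.array([-0.5, 1.5])

# Output layer (hand-chosen): AND of the two hidden outputs
W1 = np.array([[1.0],
               [1.0]])
b1 = np.array([-1.5])

A0 = step(b0 + X @ W0)
A1 = step(b1 + A0 @ W1)
predictions = A1.ravel()

print(predictions)   # [0 1 1 0]
```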

<h1>Training a Neural Network</h1>
<ul>
    <li>Brains learn by strengthening or weakening the connections between biological neurons. Analogously,
        neural networks learn by modifying the values of the weights and biases.
    </li>
    <li>It is our job to decide on the neural network architecture.</li>
    <li>And it is our job to choose the values of numerous <b>hyperparameters</b> that we will encounter.</li>
    <li>But we use a dataset and a learning algorithm to find the values of the network's <b>parameters</b>.
        <ul>
            <li>The parameters of a neural network are its weights and biases.
            </li>
        </ul>
    </li>
    <li>A lot of this is done using <b>supervised learning</b>:
        <ul>
            <li>So we need a <b>labeled dataset</b>;</li>
            <li>a <b>loss function</b>; and</li>
            <li>a learning algorithm known as <b>backpropagation</b> (or <b>backprop</b>) that uses
                some variant of <b>Gradient Descent</b>.
            </li>
        </ul>
    </li>
</ul>
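<p>To give a flavour of these ingredients working together, here is a rough sketch that trains a tiny two-layer network with batch gradient descent, using gradients computed by backpropagation. Everything specific in it is an assumption made for illustration: the synthetic dataset, the three sigmoid hidden neurons, the linear output neuron, the MSE loss, the learning rate and the number of iterations. Real toolkits compute these gradients for us.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W0, b0, W1, b1 = params
    A0 = sigmoid(b0 + X @ W0)      # hidden layer
    A1 = b1 + A0 @ W1              # output layer, linear activation
    return A0, A1

def mse(X, y, params):
    _, A1 = forward(X, params)
    return np.mean((A1 - y) ** 2)

def backprop(X, y, params):
    """Gradients of the MSE loss with respect to each parameter."""
    W0, b0, W1, b1 = params
    A0, A1 = forward(X, params)
    dZ1 = 2.0 * (A1 - y) / y.size  # loss gradient at the (linear) output
    dW1 = A0.T @ dZ1
    db1 = dZ1.sum(axis=0)
    dA0 = dZ1 @ W1.T               # propagate back to the hidden layer
    dZ0 = dA0 * A0 * (1.0 - A0)    # derivative of the sigmoid
    dW0 = X.T @ dZ0
    db0 = dZ0.sum(axis=0)
    return dW0, db0, dW1, db1

rng = np.random.default_rng(0)

# Labeled dataset (made up): 20 examples of y = sin(x)
X = rng.uniform(-2.0, 2.0, size=(20, 1))
y = np.sin(X)

# Parameters: 1 input, 3 hidden neurons, 1 output neuron
params = (rng.normal(size=(1, 3)), np.zeros(3),
          rng.normal(size=(3, 1)), np.zeros(1))

initial_loss = mse(X, y, params)
learning_rate = 0.05
for _ in range(1000):
    grads = backprop(X, y, params)
    params = tuple(p - learning_rate * g for p, g in zip(params, grads))
final_loss = mse(X, y, params)

print(initial_loss, final_loss)
```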

<h1>Deep Learning</h1>
<ul>
    <li>The word 'deep' in 'deep learning' does not mean profound.</li>
    <li>In deep learning, we have 'lots' of layers &mdash; tens or even hundreds.
    </li>
</ul>

<h2>Drivers of Deep Learning</h2>
<ul>
    <li>Hardware:
        <ul>
            <li>Faster CPUs but then highly-parallel Graphics Processing Units (GPUs) and now specially-designed
                Tensor Processing Units (TPUs).
            </li>
        </ul>
    </li>
    <li>Data:
        <ul>
            <li>Sensors and the Internet have made vast datasets available: text, images, video, &hellip;</li>
        </ul>
    </li>
    <li>Algorithmic advances:
        <ul>
            <li>The core ideas have been around a long time: Perceptrons (1950s), backpropagation
                (1980s or earlier), convolutional networks (1980s), LSTMs (1990s), &hellip;
            </li>
            <li>But new ideas from 2010 onwards:
                better weight initialization, batch normalization, different activation functions, variants of SGD, 
                numerous ways to avoid overfitting, new architectures, &hellip;
            </li>
        </ul>
    </li>
    <li>Freeware:
        <ul>
            <li>Toolkits/APIs;</li>
            <li>Educational resources.</li>
        </ul>
    </li>
    <li>Money!
    </li>
</ul>

<h2>Applications of Deep Learning</h2>
<ul>
    <li>It is excelling at 'perceptual' tasks, e.g.
         <ul>
             <li>image classification;</li>
             <li>image segmentation;</li>
             <li>speech recognition;</li>
             <li>handwriting transcription.</li>
         </ul>
    </li>
    <li>But it is finding ever wider application:
        <ul>
             <li>video recommendation;</li>
             <li>machine translation;</li>
             <li>text-to-speech;</li>
             <li>question-answering;</li>
             <li>autonomous driving;</li>
            <li>the protein folding problem (<a href="https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology">AlphaFold</a>);</li>
             <li>super-human game playing (e.g. AlphaGo).</li>
         </ul>
     </li>
</ul>

<h1>Appendix</h1>
<h2>Representations</h2>
<ul>
    <li>One way of thinking about Machine Learning:
        <ul>
            <li>It uses guidance from a feedback signal to automatically find transformations that turn 
                input data into more useful representations.
            </li>
            <li>E.g. in the case of supervised learning, the feedback comes from the loss function and the
                algorithm seeks a representation that is closer to the target outputs.
            </li>
        </ul>
    </li>
    <li>Deep learning is about jointly finding <em>successive layers of representations</em>, usually
        in the form of the layers of a neural network.
        <ul>
            <li>The network takes in vectors (examples).</li>
            <li>The first layer in some sense transforms the input vectors into new vectors &mdash; a different representation of the input examples.
            </li>
            <li>The second layer transforms again into new vectors &mdash; another representation.</li>
        </ul>
    </li>
    <li>Since each layer produces a new representation, one way of thinking about this is, for the 
        kinds of tasks on which it is successful, deep learning <em>automates feature engineering</em>.
    </li>
</ul>