# The plot

- Neural Nets are simple mathematical models defining a function.
- A Neural Net with a fixed structure is characterized by its parameters: the unit biases and the connection weights.
- Thus a prior over biases and weights implies a prior over functions.
- However, the meaning of weights and biases in neural nets is obscure.
- Moreover, the number of hidden units (in a multilayer perceptron with a single hidden layer) limits the set of functions representable by the neural net.
- As the number of hidden units tends to infinity, and with appropriate priors on the neural net parameters, the corresponding prior over functions converges to a Gaussian Process.

This tutorial has two objectives:

1. To show how neural networks can be used to build priors over spaces of functions.
2. To show that there is a correspondence between neural networks and Gaussian processes.

# Background on Neural Nets
- [X] Mathematical formulation
- [X] Multilayer Perceptron
- [X] Representation of a single hidden-layer perceptron as a DAG, with labels for unit biases and connection weights
- [X] Mathematical expression for the corresponding perceptron with $\tanh$ activation function

Neural networks are mathematical models for defining a function mapping inputs $X$ to outputs $Y$, 

$$\,f: X \rightarrow Y$$

A neural network is composed of a number of connected **units**. The way that the units are connected to each other determines the *network structure*. In this tutorial, we will focus on **multilayer perceptrons**. The multilayer perceptron has three main properties:

1. *Feedforward connections* - connections between units are unidirectional and there are no cycles.
2. *Layered units* - the network is organized in layers, with each unit being connected only to units in the previous and next layers.
3. *Multiple layers* - there are more than two layers.

![Example Multilayer Perceptron](multilayer_perceptron.png "Example Multilayer Perceptron with a single hidden layer")

A multilayer perceptron with $I$ **input** and $O$ **output** units takes in a set of real inputs, $\mathbf{x} := \{x_i\}_{i=1}^I$, and computes the real outputs, $\mathbf{y}:=\{f_k(\mathbf{x})\}_{k=1}^O$, using one or more layers of **hidden** units. In a network with one hidden layer, as in the figure above, the computations can be summarized as 

$$
\begin{aligned}
    f_k(\mathbf{x}) &= b_k + \sum_j v_{jk} h_j(\mathbf{x})
    \\
    h_j(\mathbf{x}) &= K\left(a_j + \sum_i u_{ij}x_i\right)
\end{aligned}
$$

Here, $u_{ij}$ is the connection weight from the input unit $i$ to the hidden unit $j$, and $v_{jk}$ is the connection weight from hidden unit $j$ to output unit $k$. Each output unit $k$ and hidden unit $j$ is associated with a *unit bias*, $b_k$ and $a_j$, respectively. Each hidden unit passes its input values through an **activation function**, $K$, which is usually a nonlinear function, such as the [hyperbolic tangent](https://en.wikipedia.org/wiki/Hyperbolic_function#Standard_analytic_expressions). 
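
The two equations above translate almost line for line into NumPy. The following is a minimal sketch, assuming $K = \tanh$; the function name `mlp_forward` and the array shapes are illustrative choices, not part of the original notebook.

```python
import numpy as np


def mlp_forward(x, u, a, v, b, activation=np.tanh):
    """Evaluate a single-hidden-layer perceptron (hypothetical helper).

    x: inputs x_i, shape (I,)
    u: input-to-hidden weights u_ij, shape (I, H)
    a: hidden unit biases a_j, shape (H,)
    v: hidden-to-output weights v_jk, shape (H, O)
    b: output unit biases b_k, shape (O,)
    """
    h = activation(a + x @ u)   # h_j(x) = K(a_j + sum_i u_ij x_i)
    return b + h @ v            # f_k(x) = b_k + sum_j v_jk h_j(x)


# Tiny example with I = 2 inputs, H = 3 hidden units and O = 1 output.
x = np.array([0.5, -1.0])
u = np.ones((2, 3))
a = np.zeros(3)
v = np.ones((3, 1))
b = np.array([0.25])
y = mlp_forward(x, u, a, v, b)
```

Here every hidden unit receives the same input, $0.5 - 1.0 = -0.5$, so the single output equals $0.25 + 3\tanh(-0.5)$.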

# Putting priors on Neural Nets
- Everything is zero-mean Gaussian
- The number of hidden units increases to infinity
- Prove that the joint distribution of the values of the function at any finite number of points is multivariate Gaussian

We want to use neural networks to induce priors over spaces of functions. 

In a multilayer perceptron with a single hidden layer and a fixed number of units in each layer, the remaining parameters are the connection weights, $u_{ij}$ and $v_{jk}$, and the unit biases, $a_j$ and $b_k$. 

### Stating the assumptions 
- The activation function is bounded
- $b_k$ and $v_{jk}$ are independent and normally distributed with mean zero and fixed finite variance
- $v_{jk}$ has mean zero and 
- The $a_j$ are independent and identically distributed, and so are the $u_{ij}$ (each family with its own distribution)
- The network has only one input unit and one output unit
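
To make the assumptions above concrete, here is a rough sketch that draws one function from such a prior and evaluates it on a grid. Several things are assumptions made for illustration only: all four parameter families are taken to be zero-mean Gaussian, the activation is $\tanh$, the default standard deviations are arbitrary, and the scaling $\sigma_v = \omega_v / \sqrt{H}$ (with $H$ the number of hidden units and $\omega_v$ a constant) is one way of letting $\sigma_v$ shrink as the hidden layer grows. The name `sample_prior_function` is hypothetical.

```python
import numpy as np


def sample_prior_function(x, num_hidden, sigma_u=5.0, sigma_a=5.0,
                          sigma_b=0.1, omega_v=1.0, rng=None):
    """Draw one function from the prior and evaluate it at the 1-D inputs x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # One input unit and one output unit, so the indices i and k are dropped.
    u = rng.normal(0.0, sigma_u, size=num_hidden)   # input-to-hidden weights u_j
    a = rng.normal(0.0, sigma_a, size=num_hidden)   # hidden unit biases a_j
    # Assumption: sigma_v = omega_v / sqrt(H), so the output variance stays bounded.
    v = rng.normal(0.0, omega_v / np.sqrt(num_hidden), size=num_hidden)
    b = rng.normal(0.0, sigma_b)                    # output unit bias
    h = np.tanh(a + np.outer(x, u))                 # hidden values, shape (len(x), H)
    return b + h @ v


x_grid = np.linspace(-1.0, 1.0, 50)
f_draw = sample_prior_function(x_grid, num_hidden=1000,
                               rng=np.random.default_rng(0))
```

Calling the function repeatedly with fresh random generators gives different draws from the same prior over functions.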

### Remembering the central limit theorem

Let $\{X_1, X_2, \dots, X_n\}$ be a sequence of independent random variables, in which each $X_i$ has finite mean and variance, $\mathbb{E}[X_i] = \mu_i$ and $\text{Var}[X_i] = \sigma_i^2$.
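
For reference, the conclusion that this setup leads to can be written out as follows. This is the standard Lyapunov form of the theorem, added here as a reminder; the symbols $s_n$ and $\delta$ are notation introduced for this sketch.

$$
s_n^2 = \sum_{i=1}^n \sigma_i^2, \qquad
\frac{1}{s_n} \sum_{i=1}^n \left(X_i - \mu_i\right) \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \rightarrow \infty
$$

provided that, for some $\delta > 0$,

$$
\lim_{n \rightarrow \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^n \mathbb{E}\left[\left|X_i - \mu_i\right|^{2+\delta}\right] = 0
$$

When the $X_i$ are also identically distributed, finite variance alone is enough (the classical Lindeberg-Lévy form).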

### The expectation of each hidden unit's contribution to the output is zero
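
A short sketch of the argument for the one-input, one-output case, with the indices $i$ and $k$ dropped. It assumes, in addition to the assumptions listed earlier, that $v_j$ is independent of $a_j$ and $u_j$, and hence of $h_j(x)$:

$$
\mathbb{E}\left[v_j h_j(x)\right] = \mathbb{E}\left[v_j\right] \, \mathbb{E}\left[h_j(x)\right] = 0
$$

The factor $\mathbb{E}[h_j(x)]$ exists because the activation function is bounded, and $\mathbb{E}[v_j] = 0$ by assumption.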

### The variance of each hidden unit's contribution to the output is finite (if $\sigma_v$ decreases with the number of hidden units, $H$)
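
A sketch of the calculation for one input and one output, assuming that $v_j$ has mean zero and standard deviation $\sigma_v$, that it is independent of $h_j(x)$, and that the activation is bounded, $|K| \le M$. The symbols $M$, $V(x)$ and $\omega_v$ are introduced here for illustration.

$$
\text{Var}\left[v_j h_j(x)\right] = \mathbb{E}\left[v_j^2\right] \, \mathbb{E}\left[h_j(x)^2\right] = \sigma_v^2 V(x),
\qquad V(x) := \mathbb{E}\left[h_j(x)^2\right] \le M^2
$$

Summing the $H$ independent hidden unit contributions and the output bias gives

$$
\text{Var}\left[f(x)\right] = \sigma_b^2 + H \sigma_v^2 V(x)
$$

which grows without bound in $H$ if $\sigma_v$ is held fixed. Choosing $\sigma_v = \omega_v H^{-1/2}$ for a constant $\omega_v$ gives $\text{Var}[f(x)] = \sigma_b^2 + \omega_v^2 V(x)$, which is finite and does not depend on $H$.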

### The output tends to Gaussian through the CLT
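
One way to see this numerically is to sample many networks, evaluate each at a single fixed input, and measure how far the resulting distribution is from Gaussian. The sketch below uses the sample excess kurtosis (zero for an exact Gaussian) as a crude diagnostic. The Gaussian priors, the parameter values, the scaling $\sigma_v = \omega_v / \sqrt{H}$, and the helper names are all assumptions made for illustration.

```python
import numpy as np


def sample_outputs(x, num_hidden, num_networks, sigma_u=1.0, sigma_a=1.0,
                   sigma_b=0.1, omega_v=1.0, seed=0):
    """Evaluate f(x) at one scalar input x for many independently sampled networks."""
    rng = np.random.default_rng(seed)
    shape = (num_networks, num_hidden)
    u = rng.normal(0.0, sigma_u, size=shape)
    a = rng.normal(0.0, sigma_a, size=shape)
    # Assumed scaling: sigma_v = omega_v / sqrt(H).
    v = rng.normal(0.0, omega_v / np.sqrt(num_hidden), size=shape)
    b = rng.normal(0.0, sigma_b, size=num_networks)
    return b + np.sum(v * np.tanh(a + u * x), axis=1)


def excess_kurtosis(samples):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    centered = samples - samples.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0


kurt_small = excess_kurtosis(sample_outputs(0.5, num_hidden=1, num_networks=20000))
kurt_large = excess_kurtosis(sample_outputs(0.5, num_hidden=200, num_networks=20000))
```

With a single hidden unit the output is a Gaussian weight times a random bounded value, which is heavier-tailed than a Gaussian, so `kurt_small` should come out clearly positive. With 200 hidden units `kurt_large` should be close to zero.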

### Similarly, the outputs at several inputs tend to a multivariate Gaussian -> Gaussian Process
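
The joint behaviour at several inputs can be explored the same way: sample many networks, evaluate each one at a few input points, and look at the empirical covariance of the resulting function values. This is only a sketch under assumed zero-mean Gaussian priors with $\sigma_v = \omega_v / \sqrt{H}$; the function name and parameter values are hypothetical.

```python
import numpy as np


def sample_functions(x, num_hidden, num_networks, sigma_u=1.0, sigma_a=1.0,
                     sigma_b=0.1, omega_v=1.0, seed=0):
    """Return an array of shape (num_networks, len(x)): one sampled function per row."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    shape = (num_networks, num_hidden, 1)
    u = rng.normal(0.0, sigma_u, size=shape)
    a = rng.normal(0.0, sigma_a, size=shape)
    v = rng.normal(0.0, omega_v / np.sqrt(num_hidden), size=shape)  # assumed scaling
    b = rng.normal(0.0, sigma_b, size=(num_networks, 1))
    h = np.tanh(a + u * x)     # broadcasts to (num_networks, num_hidden, len(x))
    return b + np.sum(v * h, axis=1)


x_points = np.array([-1.0, -0.9, 1.0])
draws = sample_functions(x_points, num_hidden=100, num_networks=5000)
cov = np.cov(draws, rowvar=False)   # empirical 3 x 3 covariance of the function values
corr = cov / np.sqrt(np.outer(np.diag(cov), np.diag(cov)))
```

The function values at the two nearby inputs, $-1.0$ and $-0.9$, should be strongly correlated, while the values at $-1.0$ and $1.0$ should be only weakly correlated under these particular priors.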

### Changing the activation function

## Graph for generating smooth functions with Neural Nets
*Make it interactive: let the user pick the number of hidden units.*

In [3]:
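class MultilayerPerceptron:
    '''Fully Connected Multilayer FeedForward Artificial Neural Network.'''
    def __init__(self, num_layers, num_input, num_output, num_hidden,
                 activation_function):
        # Store the network structure; parameter initialisation is not implemented yet.
        self.num_layers = num_layers
        self.num_input = num_input
        self.num_output = num_output
        self.num_hidden = num_hidden
        self.activation_function = activation_function

    class Neuron:
        '''Computational unit of a Neural Network.'''
        def __init__(self, bias, input_weights, input_values, activation_function):
            if len(input_weights) != len(input_values):
                raise ValueError('Different number of input values and weights.')
            self.bias = bias
            self.input_weights = input_weights
            self.input_values = input_values
            self.activation_function = activation_function

        def compute_output(self):
            '''Apply the activation function to the bias plus the linear
            combination of input_values, weighted by input_weights.'''
            weighted_sum = sum(w * v for w, v in zip(self.input_weights, self.input_values))
            return self.activation_function(self.bias + weighted_sum)


class BayesianPerceptron:
    def __init__(self):
        # Placeholder: not implemented yet.
        pass

In [None]:
# Test Multilayer Perceptron

## A single neuron with zero bias, unit weight and identity activation
## outputs the same value as its input
neuron = MultilayerPerceptron.Neuron(bias=0.0, input_weights=[1.0],
                                     input_values=[3.0],
                                     activation_function=lambda x: x)
assert neuron.compute_output() == 3.0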

In [4]:
identity = lambda x: x
identity(3)

3

## Graph for generating smooth functions with Gaussian Processes
*RBF kernel, similar to the GP_tutorial_one.*
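
A minimal sketch of what such a graph could be built on: sample paths from a zero-mean Gaussian Process with an RBF (squared-exponential) kernel, drawn through a Cholesky factor of the covariance matrix. The function names, the default hyperparameters, and the jitter value are illustrative assumptions and are not taken from GP_tutorial_one.

```python
import numpy as np


def rbf_kernel(x1, x2, variance=1.0, length_scale=0.2):
    """k(x, x') = variance * exp(-(x - x')^2 / (2 * length_scale^2))."""
    diff = np.subtract.outer(x1, x2)
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def sample_gp(x, num_samples, kernel=rbf_kernel, jitter=1e-6, seed=0):
    """Draw zero-mean GP sample paths at the inputs x; one path per row."""
    rng = np.random.default_rng(seed)
    # A small diagonal jitter keeps the Cholesky factorization numerically stable.
    cov = kernel(x, x) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(cov)
    return (chol @ rng.standard_normal((len(x), num_samples))).T


x_grid = np.linspace(-1.0, 1.0, 100)
gp_paths = sample_gp(x_grid, num_samples=5)
```

Each row of `gp_paths` is one smooth function evaluated on `x_grid`; a shorter `length_scale` gives wigglier paths.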

## Show the relationship between GP and Neural Net covariance for smooth functions
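
As a starting point, the covariance function induced by the network prior can be written down directly. The sketch below is for one input and one output, and assumes zero-mean, mutually independent output parameters with $\sigma_v = \omega_v H^{-1/2}$, where $H$ is the number of hidden units; the symbols $C(x, x')$ and $\omega_v$ are introduced here for illustration.

$$
\text{Cov}\left[f(x), f(x')\right] = \sigma_b^2 + H \sigma_v^2 \, C(x, x') = \sigma_b^2 + \omega_v^2 \, C(x, x'),
\qquad C(x, x') := \mathbb{E}\left[h_j(x) \, h_j(x')\right]
$$

The expectation is taken over the prior on $a_j$ and $u_j$, and is the same for every hidden unit $j$. A zero-mean Gaussian Process with this covariance function is the candidate limit of the network prior, so comparing the two priors amounts to comparing $\sigma_b^2 + \omega_v^2 C(x, x')$ with the GP kernel. When no closed form for $C(x, x')$ is at hand, it can be estimated by Monte Carlo.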

# Extra

- Sample Brownian motion functions from a GP and from a NN (with increasing number of hidden units)
- Show relationship between covariance functions

# Possible extra

- Talk about priors for networks with more than one hidden layer?