Topics:
--------

- Connection weights
- bias
- activation function
- capacity
- decision boundary of neuron
- single hidden layer neural network
- softmax activation function
- multilayer neural network
- Universal approximation theorem



_______________________________________

## Artificial Neuron

- **Neuron pre-activation (or input activation)**
$$a(\textbf{x})= b + \sum _{i}w_i x_i = b + \textbf{W}^T\textbf{x}$$

- **Neuron (output) activation**
$$h(\textbf{x})= g (a(\textbf{x})) = g(b + \sum _{i}w_i x_i)$$
![](assets/1.png)
- $\textbf{W}$ are the connection weights
- $b$ is the neuron bias
- $g(.)$ is called the activation function
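
The two formulas above can be sketched in a few lines of NumPy. This is only an illustration: the function names `pre_activation` and `neuron`, and the example values of `x`, `w`, and `b`, are assumptions made for the sketch and do not come from the notes.

```python
import numpy as np

def pre_activation(x, w, b):
    """Neuron pre-activation: a(x) = b + sum_i w_i * x_i."""
    return b + np.dot(w, x)

def neuron(x, w, b, g):
    """Neuron output activation: h(x) = g(a(x))."""
    return g(pre_activation(x, w, b))

# Hypothetical example values, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])      # input
w = np.array([0.5, -1.0, 0.25])    # connection weights
b = 0.1                            # bias

a = pre_activation(x, w, b)        # 0.1 + 0.5 - 2.0 + 0.75 = -0.65
h = neuron(x, w, b, np.tanh)       # tanh used as the activation g
print(a, h)
```

Passing `g` as an argument keeps the neuron definition independent of the particular activation function.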

![](assets/2.png)

## Activation Function
- **Linear activation function** 
$$g(a)=a$$
![](assets/3.png)
 - Performs no input squashing
 - Not very interesting: the neuron remains a purely linear function of its input
____________________________

- **Sigmoid activation function** 
$$g(a)=\sigma(a)=\frac{1}{1+e^{-a}}$$
![](assets/4.png)
 - Squashes the neuron's pre-activation between 0 and 1.
 - Always positive
 - Bounded
 - Strictly increasing
_______________________________________

- **Hyperbolic tangent ($\tanh $) activation function** 
$$g(a)= \tanh(a) = \frac{e^{(a)}-e^{(-a)}}{e^{(a)}+e^{(-a)}}= \frac{e^{(2a)}-1}{e^{(2a)}+1}$$
![](assets/5.png)
 - Squashes the neuron's pre-activation between -1 and 1
 - Can be positive or negative
 - Bounded
 - Strictly increasing
 
_________________________________

- **Rectified linear activation function** 
$$g(a)= \operatorname{reclin}(a) = \max(0,a)$$
![](assets/6.png)
 - Bounded below by 0 (always non-negative)
 - Not upper bounded
 - Monotonically non-decreasing (strictly increasing only for $a > 0$, constant at 0 for $a \le 0$)
 - Tends to give neurons with sparse activities
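
As a rough sketch, the four activation functions above can be written with NumPy as follows. The function names mirror the notation in these notes (`reclin` for the rectified linear function); the sample pre-activations in `a` are arbitrary values chosen for illustration.

```python
import numpy as np

def linear(a):
    """g(a) = a: no squashing."""
    return a

def sigmoid(a):
    """g(a) = 1 / (1 + exp(-a)): output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """g(a) = tanh(a): output in (-1, 1)."""
    return np.tanh(a)

def reclin(a):
    """g(a) = max(0, a): zero for negative pre-activations."""
    return np.maximum(0.0, a)

# Arbitrary sample pre-activations.
a = np.array([-2.0, 0.0, 2.0])
for g in (linear, sigmoid, tanh, reclin):
    print(g.__name__, g(a))
```

Note how `reclin` maps every negative pre-activation to exactly 0, which is where the sparse activities come from.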
 
 
 
## Capacity of single neuron

- Could do binary classification:
 - with sigmoid, whose output lies in $(0,1)$, can interpret neuron as estimating $p(y=1\mid \textbf{x})$
 - also known as logistic regression classifier
 - if greater than 0.5, predict class 1
 - otherwise, predict class 0
 ![](assets/7.png)
 
- Can solve linearly separable problems
![](assets/8.png)

- Can't solve non-linearly separable problems
![](assets/9.png)
- Unless the input is transformed into a better representation
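
The points above can be illustrated with a small sketch. It assumes hand-picked (not learned) weights, uses AND as an example of a linearly separable problem, and uses XOR as an example of a problem that a single neuron cannot solve on the raw inputs. The `transform` function is a hypothetical feature map chosen for this illustration; the figures above may use a different one.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, w, b):
    """Predict class 1 if the sigmoid neuron outputs more than 0.5, else class 0."""
    return int(sigmoid(b + np.dot(w, x)) > 0.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable: one neuron with hand-picked weights is enough.
w_and, b_and = np.array([1.0, 1.0]), -1.5
and_pred = [predict(np.array(x, dtype=float), w_and, b_and) for x in inputs]

def transform(x):
    """Hypothetical feature map: (x1 AND NOT x2, NOT x1 AND x2)."""
    x1, x2 = x
    return np.array([x1 * (1 - x2), (1 - x1) * x2], dtype=float)

# XOR is not linearly separable in (x1, x2), but it is in the transformed space.
w_xor, b_xor = np.array([1.0, 1.0]), -0.5
xor_pred = [predict(transform(x), w_xor, b_xor) for x in inputs]

print(and_pred)  # [0, 0, 0, 1]
print(xor_pred)  # [0, 1, 1, 0]
```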

## Multilayer neural network

1. Single hidden layer neural network 
--------------------------------


![](assets/10.png)
- Hidden layer pre-activation:
$$\textbf{a}(\textbf{x})= \textbf{b}^{(1)}+ \textbf{W}^{(1)}\textbf{x}$$
$$\left(a(\textbf{x})_i= b^{(1)}_i +\sum _j W^{(1)}_{i,j}x_j\right)$$

- Hidden layer activation:
$$\textbf{h}^{(1)}(\textbf{x})=g(\textbf{a}(\textbf{x}))$$

- Output layer activation:

$$f(\textbf{x})= o\left(b^{(2)} + \textbf{w}^{(2)^T} \textbf{h}^{(1)}(\textbf{x})\right)$$

where $o$ is the output activation function.
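
A minimal sketch of this forward pass, assuming `tanh` as the hidden activation $g$ and a sigmoid as the output activation $o$ (the notes leave both open). The layer sizes, the random weights, and the input below are placeholders for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, w2, b2, g=np.tanh, o=sigmoid):
    """Single hidden layer network with one output unit."""
    a = b1 + W1 @ x           # hidden layer pre-activation
    h = g(a)                  # hidden layer activation
    return o(b2 + w2 @ h)     # output layer activation f(x)

# Placeholder sizes and parameters (assumed for this sketch).
rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 4
W1 = rng.normal(size=(n_hidden, n_inputs))   # hidden weights W^(1)
b1 = np.zeros(n_hidden)                      # hidden biases b^(1)
w2 = rng.normal(size=n_hidden)               # output weights w^(2)
b2 = 0.0                                     # output bias b^(2)

x = np.array([0.5, -1.0, 2.0])
y = forward(x, W1, b1, w2, b2)
print(y)
```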

2. Softmax activation function
---------------------------

- For multi-class classification:
 - we need multiple outputs (one output per class)
 - we would like to estimate the conditional probability $p(y=c\mid \textbf{x})$
 
- we use the softmax activation function at the output:
$$\textbf{o}(\textbf{a}) = \operatorname{softmax}(\textbf{a})= \begin{bmatrix}
\frac{e^{a_1}}{\sum_c e^{a_c} } &  
\cdots  & 
\frac{e^{a_C}}{\sum_c e^{a_c} }
\end{bmatrix}^T$$
 - strictly positive
 - sums to one
 
- Predicted class is the one with highest estimated probability
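
The softmax output and the class prediction might be sketched as below. Subtracting the maximum before exponentiating is an addition made here for numerical stability; it is not part of the formula above, but it leaves the result mathematically unchanged. The pre-activation vector `a` is an arbitrary example.

```python
import numpy as np

def softmax(a):
    """softmax(a)_c = exp(a_c) / sum_c' exp(a_c')."""
    shifted = a - np.max(a)    # stability shift (assumption of this sketch)
    e = np.exp(shifted)
    return e / e.sum()

# Arbitrary output-layer pre-activations for C = 3 classes.
a = np.array([1.0, 2.0, 3.0])
p = softmax(a)

# Predicted class: the one with the highest estimated probability.
predicted_class = int(np.argmax(p))
print(p, predicted_class)
```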


3. Multilayer neural network
---------------------------------

- Could have $L$ hidden layers:
![](assets/11.png)
 - layer pre-activation for $k>0$ $(\textbf{h}^{(0)}(\textbf{x})= \textbf{x})$
 $$\textbf{a}^{(k)}(\textbf{x})= \textbf{b}^{(k)}+\textbf{W}^{(k)}\textbf{h}^{(k-1)}(\textbf{x})$$
 
 - hidden layer activation (k from 1 to L):
 $$\textbf{h}^{(k)}(\textbf{x})=g(\textbf{a}^{(k)}(\textbf{x}))$$
 
 - output layer activation (k=L+1):
 $$\textbf{h}^{(L+1)}(\textbf{x})=o(\textbf{a}^{(L+1)}(\textbf{x}))=f(\textbf{x})$$
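
The recursion over layers can be sketched as a simple loop. This is an illustrative sketch, assuming `tanh` for the hidden activation $g$ and softmax for the output activation $o$; the layer sizes in `sizes` and the random parameters are placeholders.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, weights, biases, g=np.tanh, o=softmax):
    """Forward pass through L hidden layers and one output layer."""
    h = x                                        # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):  # layers k = 1, ..., L
        h = g(b + W @ h)                         # h^(k) = g(a^(k))
    return o(biases[-1] + weights[-1] @ h)       # h^(L+1) = o(a^(L+1)) = f(x)

# Placeholder architecture: 3 inputs, L = 2 hidden layers, 2 output classes.
rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

f = forward(np.array([0.5, -1.0, 2.0]), weights, biases)
print(f)
```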
 
 ## Capacity of single hidden layer neural networks
 ![](assets/12.png)
 ![](assets/13.png)
 - **Universal approximation theorem** (Hornik, 1991)
  > "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"
  
 - The result holds for sigmoid, tanh, and many other hidden layer activation functions
 - This is a good result, but it doesn't mean there is a learning algorithm that can find the necessary parameter values!
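
To give some intuition for the theorem, the sketch below builds a single hidden layer network by hand (no learning involved) that approximates $\sin$ on $[0, 2\pi]$ with a staircase of steep sigmoid units and a linear output unit. This is only an illustration, not Hornik's proof: the construction, the function names, and the `sharpness` parameter are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable form of 1 / (1 + exp(-z)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def build_staircase_network(f, lo, hi, n_hidden, sharpness=20.0):
    """Hand-pick parameters of a 1-input network with n_hidden sigmoid units."""
    grid = np.linspace(lo, hi, n_hidden + 1)     # x_0, ..., x_n
    midpoints = 0.5 * (grid[:-1] + grid[1:])     # where each unit switches on
    step = (hi - lo) / n_hidden
    w1 = np.full(n_hidden, sharpness / step)     # steep hidden weights
    b1 = -w1 * midpoints                         # hidden biases
    w2 = np.diff(f(grid))                        # output weights: jump sizes
    b2 = f(grid[0])                              # output bias
    return w1, b1, w2, b2

def network(x, w1, b1, w2, b2):
    h = sigmoid(b1 + np.outer(x, w1))            # hidden layer activations
    return b2 + h @ w2                           # linear output unit

def max_error(f, lo, hi, n_hidden):
    xs = np.linspace(lo, hi, 2001)
    params = build_staircase_network(f, lo, hi, n_hidden)
    return float(np.max(np.abs(network(xs, *params) - f(xs))))

err_10 = max_error(np.sin, 0.0, 2.0 * np.pi, 10)
err_200 = max_error(np.sin, 0.0, 2.0 * np.pi, 200)
print(err_10, err_200)   # the error should shrink as hidden units are added
```

The parameters here are chosen analytically because the target function is known. The theorem says suitable parameters exist, not that a learning algorithm will find them from data.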