# Feedforward Neural Networks

<hr>

**Gentle introduction to Neural Networks**<br>

A basic neural network with one *input* layer and one *output* layer can be represented pictorially as follows:

<img alt="Basic Neural Network" src="assets/basic_neural_network.png" width="300">

The network then computes a weighted combination of its inputs and passes it through a (generally non-linear) function:

$\hat y = f(z)$, where $z = \omega_0 + \sum_{i=1}^{d} x_i \omega_i$ and $f$ is generally a non-linear function called the activation function.
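As a minimal sketch of the formula above, the snippet below computes $\hat y$ for a single unit in plain Python. The function name `neuron` and the input and weight values are hypothetical, chosen only for illustration.

```python
import math

def neuron(x, w, w0, f):
    """Compute y_hat = f(w0 + sum_i x_i * w_i) for one unit."""
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return f(z)

# Hypothetical inputs and weights, chosen only for illustration.
x = [1.0, 2.0]
w = [0.5, -0.25]
w0 = 0.1

# Here z = 0.1 + 1.0 * 0.5 + 2.0 * (-0.25) = 0.1, so y_hat = tanh(0.1).
y_hat = neuron(x, w, w0, math.tanh)
```

Any activation function can be passed as `f`; `math.tanh` is used here only because it is available in the standard library.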

Common activation functions include:

- Rectified linear function (ReLU)

    $f(z) = \max \{0, z\}$, which maps every negative input to 0 and leaves non-negative inputs unchanged


- Hyperbolic tangent function (tanh)

    $f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 1 - \frac{2}{e^{2z} + 1}$
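The two activation functions can be sketched directly from their definitions. The helper names below are hypothetical; the two tanh variants correspond to the two expressions in the formula above, so comparing them against `math.tanh` is a quick numerical check of the identity.

```python
import math

def relu(z):
    """ReLU: max{0, z}."""
    return max(0.0, z)

def tanh_from_exp(z):
    """tanh written as (e^z - e^-z) / (e^z + e^-z)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def tanh_alt(z):
    """tanh written as 1 - 2 / (e^(2z) + 1)."""
    return 1.0 - 2.0 / (math.exp(2.0 * z) + 1.0)

# Largest disagreement with math.tanh over a few sample points.
samples = [-2.0, -0.5, 0.0, 1.5]
max_gap = max(
    max(abs(tanh_from_exp(z) - math.tanh(z)), abs(tanh_alt(z) - math.tanh(z)))
    for z in samples
)
```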
    
    

Essentially, a neural network architecture is represented by:

- Functions, $f$
- Features, $X$ as an input layer
- Weights, $W$

<img alt="Neural Network Architecture" src="assets/neural_network_architecture.png" width="500">

****

**Introduction to Deep (*Feedforward*) Neural Networks**

The depth of a neural network is the number of transformations (*layers*) from the input layer (*exclusive*) to the output layer (*inclusive*); a network with many such layers is called *deep*. The width of the network is the maximum number of nodes in a layer.
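To make depth and width concrete, here is a small sketch of a feedforward pass. The layer sizes, the constant weights, and the helper names `make_layers` and `forward` are all assumptions for illustration; in particular, the width below is taken over every layer including the input, which is one possible reading of the definition.

```python
import math

def make_layers(layer_sizes, value=0.1):
    """Build one (W, b) pair per transformation.

    Constant weights are used only to keep the sketch deterministic.
    """
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = [[value] * n_in for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((W, b))
    return layers

def forward(x, layers, f=math.tanh):
    """Apply each layer in turn: a = f(W a + b)."""
    a = x
    for W, b in layers:
        a = [f(bj + sum(w * ai for w, ai in zip(row, a))) for row, bj in zip(W, b)]
    return a

layer_sizes = [2, 4, 3, 1]       # input, two hidden layers, output
layers = make_layers(layer_sizes)

depth = len(layers)              # transformations, input layer excluded
width = max(layer_sizes)         # largest number of nodes in any layer
output = forward([1.0, -1.0], layers)
```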

A deep neural network loosely mimics the way the human brain breaks information down and processes it in stages to arrive at a final output. For example, an image is broken into pixel data, which is then processed layer by layer to produce a binary classification result.

Suppose we have a 2D dataset that is not linearly separable. We can use a hidden layer to transform its features so that they become linearly separable and the problem becomes easy to solve. Here is the transformation using *linear*, *tanh* and *ReLU* activation functions.

<img alt="Linear Activation" src="assets/feature_transformation_neural_networks.png" width="400">

<img alt="Tanh Activation" src="assets/feature_transformation_tanh.png" width="400">

<img alt="ReLU Activation" src="assets/feature_transformation_relu.png" width="400">
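A classic illustration of this idea is the XOR problem, which no straight line can separate in the original 2D space. The sketch below uses hand-picked (not learned) weights for two hidden ReLU units; the weights, the threshold, and the function names are assumptions for illustration and are not taken from the figures above.

```python
def relu(z):
    return max(0.0, z)

def hidden(x1, x2):
    """Hand-picked hidden layer: two ReLU units over the 2D input."""
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1, h2

def classify(x1, x2):
    """Linear rule in the hidden space: predict 1 when h1 - 2*h2 > 0.5."""
    h1, h2 = hidden(x1, x2)
    return 1 if h1 - 2.0 * h2 > 0.5 else 0

# XOR labels: not linearly separable in (x1, x2).
xor_points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
predictions = {p: classify(*p) for p in xor_points}
```

In the hidden space the four points map to (0, 0), (1, 0), (1, 0) and (2, 1), so a single linear threshold on `h1 - 2*h2` is enough to separate the two classes.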

****

**Backpropagation Algorithm**

To evaluate the neural network, we typically compute the model's loss against the true labels. The loss can be written as $Loss(y, f_L)$, where $y$ is the true label, $f_L$ is the final output of the neural network, and $Loss$ can be any loss function.

Then, we apply the idea of stochastic gradient descent to optimize the weights: we take the derivative of the loss with respect to each weight, $w_{ij}$, where $i$ represents the layer index and $j$ represents the node index.

For example, in a network with multiple layers and a single node per layer, we can represent the derivative this way.

<img alt="Optimizing weights" src="assets/optimizing_weights.png" width="400">

To find the derivative of the loss with respect to $f_1$, we can do the following:

<img alt="Derivative of loss with respect to first function" src="assets/backpropagation_one_step.png" width="400">

In general, we can find the derivative of the loss with respect to the function at any of the layers with the following:

<img alt="Loss Derivative against function" src="assets/loss_derivative_respect_to_function.png" width="200">

We can then start from the last layer, *propagate* backwards, and use a closed-form formula to find the derivative of the loss with respect to the first function.

<img alt="Derivative of loss at first function" src="assets/loss_derivative_at_first_function.png" width="500">

Finally, we can compute the derivative using the first formula and apply stochastic gradient descent to move $w_1$ toward its optimal value.
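The procedure can be sketched for a network with one node per layer. Everything specific here is an assumption made for illustration, since the exact formulas live in the figures: each layer is taken to be $f_i = \tanh(w_i f_{i-1})$ with $f_0 = x$, the loss is the squared error $\frac{1}{2}(y - f_L)^2$, and the learning rate `lr` and all numeric values are hypothetical.

```python
import math

def forward(x, weights):
    """Return [f_0, f_1, ..., f_L] with f_0 = x and f_i = tanh(w_i * f_{i-1})."""
    fs = [x]
    for w in weights:
        fs.append(math.tanh(w * fs[-1]))
    return fs

def loss(y, f_L):
    """Assumed squared-error loss."""
    return 0.5 * (y - f_L) ** 2

def backward(x, y, weights):
    """Return dLoss/dw_i for every layer, working from the last layer back."""
    fs = forward(x, weights)
    grads = [0.0] * len(weights)
    d_f = fs[-1] - y                        # dLoss/df_L for the squared error
    for i in reversed(range(len(weights))):
        d_z = d_f * (1.0 - fs[i + 1] ** 2)  # tanh'(z) = 1 - tanh(z)^2
        grads[i] = d_z * fs[i]              # dz/dw = output of the previous layer
        d_f = d_z * weights[i]              # propagate to the previous layer
    return grads

# Hypothetical data point, weights and learning rate.
x, y = 0.5, 0.8
weights = [0.3, -0.2, 0.7]
lr = 0.1

grads = backward(x, y, weights)
loss_before = loss(y, forward(x, weights)[-1])
new_weights = [w - lr * g for w, g in zip(weights, grads)]  # one gradient step
loss_after = loss(y, forward(x, new_weights)[-1])
```

A single training example is used for the update, which is what makes the step "stochastic"; in practice the step would be repeated over many examples.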

****

**Overly-complex models generate good performance**

- Overcapacity: adding random units to the neural network architecture helps to improve model performance
    - As long as most random units are doing something meaningful, model performance will improve
    - A harder task may not be linearly separable with two units and may require more random units
    - $\therefore$ Larger models tend to be easier to train, as units are adjusted collectively and sufficiently to solve the task


<img alt="Adding random units" src="assets/adding_random_units.png" width="500">


- Initialization plays a role in finding a good solution
    - Use random offset initialization to avoid symmetries caused by concentration around zero

<img alt="Random Initialization" src="assets/random_offset_initialization.png" width="500">
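The symmetry problem can be seen in a tiny sketch: hidden units that start with identical weights and offsets compute identical outputs, so they cannot specialize, while random initialization makes them differ from the start. The Gaussian draws, the seed, and the helper name `hidden_outputs` are assumptions for illustration, not the specific initialization scheme shown in the figure.

```python
import math
import random

def hidden_outputs(x, W, b):
    """Outputs of a tanh hidden layer: one row of W and one offset per unit."""
    return [math.tanh(bj + sum(w * xi for w, xi in zip(row, x))) for row, bj in zip(W, b)]

x = [1.0, -2.0]

# Identical (all-zero) initialization: both units behave the same.
zero_W = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
symmetric = hidden_outputs(x, zero_W, zero_b)

# Random weights and random offsets: the units start out different.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
rand_W = [[rng.gauss(0.0, 1.0) for _ in x] for _ in range(2)]
rand_b = [rng.gauss(0.0, 1.0) for _ in range(2)]
broken = hidden_outputs(x, rand_W, rand_b)
```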



<hr>

# Basic code
A `minimal, reproducible example`