# Neural Networks Summary

* Different Types of Neurons
* Hyperparameter Tuning
* Calculating Error and Adjusting Weights
* Different NN Architectures

---
## Different Types of Neurons

### Linear

Activation follows a linear function: 

$y = b + \sum{x_i w_i}$, where $b$ is the bias

### Sigmoid

Activation follows sigmoid function:

$z = b + \sum{x_i * w_i}$

$y = \frac{1}{1 + e^{-z}}$
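A minimal sketch of the linear and sigmoid neurons above in plain Python. The function names and the sample values are illustrative assumptions, not part of the lectures.

```python
import math

def weighted_input(x, w, b):
    """z = b + sum_i x_i * w_i (this is also the output of a linear neuron)."""
    return b + sum(xi * wi for xi, wi in zip(x, w))

def sigmoid_neuron(x, w, b):
    """Squash the weighted input into (0, 1) with the logistic function."""
    z = weighted_input(x, w, b)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example values
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
z = weighted_input(x, w, b)   # 0.5 - 0.5 + 0.0 = 0.0
y = sigmoid_neuron(x, w, b)   # sigmoid(0) = 0.5
```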

### Binary Threshold

The output is either on (1) or off (0). The bias $b$ can be negative, in which case the weighted input must reach $-b$ before the neuron turns on:

$z = b + \sum{x_i * w_i}$

$y =
  \begin{cases}
    1       & \quad \text{if } z \geq 0\\
    0       & \quad \text{otherwise} \\
  \end{cases}
$
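To make the role of a negative bias concrete, here is a small sketch of a binary threshold neuron. The weights and bias are made-up values chosen so the neuron behaves like a logical AND.

```python
def binary_threshold_neuron(x, w, b):
    """Output 1 if z = b + sum_i x_i * w_i is non-negative, else 0."""
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return 1 if z >= 0 else 0

# With b = -1.5 and unit weights, z only reaches 0 when both inputs are 1.
and_outputs = [binary_threshold_neuron([a, c], [1.0, 1.0], -1.5)
               for a in (0, 1) for c in (0, 1)]
```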

### Rectified Linear

Activation has threshold and linear function beyond the threshold:

$z = b + \sum{x_i * w_i}$ (a negative bias shifts the threshold, so the weighted input must exceed $-b$ before the neuron activates)

$y =
  \begin{cases}
    z       & \quad \text{if } z > 0\\
    0       & \quad \text{otherwise} \\
  \end{cases}
$
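A short sketch of the rectified linear neuron, assuming a single input with weight 1 and a hypothetical bias of -2. The output stays at zero until the input exceeds 2, then grows linearly.

```python
def rectified_linear_neuron(x, w, b):
    """Return z = b + sum_i x_i * w_i when z > 0, otherwise 0."""
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return max(0.0, z)

w = [1.0]
b = -2.0   # hypothetical bias: nothing comes out until the input exceeds 2
outputs = [rectified_linear_neuron([x], w, b) for x in (0.0, 2.0, 3.5)]
```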

### Stochastic Binary Neuron

The sigmoid of $z$ is treated as the probability that the neuron produces an output of 1, rather than as the output itself:

$p(s=1) = \frac{1}{1 + e^{-z}}$
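The following sketch samples from a stochastic binary neuron using Python's standard `random` module. The seed and sample count are arbitrary choices for illustration.

```python
import math
import random

def stochastic_binary_neuron(z, rng):
    """Emit 1 with probability sigmoid(z), otherwise 0."""
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if rng.random() < p else 0

rng = random.Random(0)   # fixed seed only so that runs are repeatable
samples = [stochastic_binary_neuron(0.0, rng) for _ in range(10_000)]
firing_rate = sum(samples) / len(samples)   # should be close to p = 0.5
```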

---

## Hyperparameter Tuning

Hyperparameters are tuned to help the model generalize to future data. Candidate settings are usually compared on the cross-validation set.

Overfitting can also be avoided by:

* weight decay
* weight sharing
* early stopping
* model averaging
* Bayesian fitting
* dropout
* generative pretraining

### Gradient Descent

Iteratively adjust the weights so that the loss function moves towards a global or local minimum. Run some data through the network, calculate the loss, update the weights, and repeat. Each update _should_ reduce the overall error.
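The loop can be sketched on a toy problem. This assumes a single weight and a made-up quadratic loss with its minimum at 3, standing in for a real network's loss.

```python
def loss(w):
    """Hypothetical loss with a single minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
history = []
for step in range(100):
    history.append(loss(w))          # calculate loss
    w -= learning_rate * grad(w)     # step against the gradient
```

A learning rate that is too large would make the updates overshoot the minimum, which is why the text says updates _should_ reduce the error rather than that they always do.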

### Grid Search

Given a set of hyperparameters, create a cartesian product of the params and run a model with each member of the set of the cartesian product.

Ex:

param a: {1,2,3}

param b: {a,b,c}

grid = {(1,a), (1,b) ... (3,c)}

```
# run one model per member of the Cartesian product
for mem in grid:
    run_model(mem)
```
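A runnable version of the same idea using `itertools.product` from the standard library. Here `run_model` is a hypothetical stand-in that returns a made-up score; a real one would train a model and evaluate it on the cross-validation set.

```python
from itertools import product

param_a = [1, 2, 3]
param_b = ["a", "b", "c"]

def run_model(params):
    """Hypothetical stand-in: a real version would train and score a model."""
    a, b = params
    return a + 0.1 * param_b.index(b)   # made-up score, for illustration only

grid = list(product(param_a, param_b))   # (1, 'a'), (1, 'b'), ..., (3, 'c')
scores = {mem: run_model(mem) for mem in grid}
best = max(scores, key=scores.get)       # setting with the highest score
```

The grid grows multiplicatively with each added hyperparameter, so the number of model runs is the product of the sizes of the parameter sets.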

## Calculating Error and Adjusting Weights

### Linear Function

[helpful source](http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#adaptive-linear-neurons-and-the-delta-rule)

Calculates a squared-error loss on a continuous output.

NOTE: target = the true value (for classification, the true class label)

Loss Function:

$J(w) = \frac{1}{2} \sum{ (target^i - output^i)^2 }$

Calculating Change in Weights:

$\Delta w_j = - \epsilon \frac{\partial J}{\partial w_j}$ where $\epsilon$ is the learning rate and $\frac{\partial J}{\partial w_j}$ is the partial derivative of the loss function with respect to weight $w_j$

Writing $t^i$ for $target^i$ and $o^i$ for $output^i$, this becomes:

$\Delta w_j = \epsilon \sum_i{(t^i - o^i)x_j^i}$
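A sketch of this batch update for a linear neuron in plain Python. The data set, learning rate and iteration count are assumptions picked for illustration. A constant 1.0 is appended to each input so the bias is learned as an ordinary weight.

```python
def delta_rule_step(w, X, t, epsilon):
    """One batch update: w_j += epsilon * sum_i (t^i - o^i) * x_j^i."""
    outputs = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
    return [
        wj + epsilon * sum((ti - oi) * x[j] for x, ti, oi in zip(X, t, outputs))
        for j, wj in enumerate(w)
    ]

def squared_error(w, X, t):
    """J(w) = 1/2 * sum_i (t^i - o^i)^2 for a linear neuron."""
    return 0.5 * sum((ti - sum(wj * xj for wj, xj in zip(w, x))) ** 2
                     for x, ti in zip(X, t))

# Hypothetical data generated by t = 2 * x1 + 1
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
t = [1.0, 3.0, 5.0, 7.0]

w = [0.0, 0.0]
initial_error = squared_error(w, X, t)
for _ in range(2000):
    w = delta_rule_step(w, X, t, epsilon=0.05)
final_error = squared_error(w, X, t)   # w should approach [2.0, 1.0]
```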

### Logistic Function

[Stanford Lecture Notes](http://cs229.stanford.edu/notes/cs229-notes1.pdf) I find these more helpful than the course lecture notes for this section.

Calculates logistic loss on a binary output. It penalizes _very_ wrong outputs very strongly and slightly wrong outputs only mildly.

Likelihood (the quantity we want to maximize):

$L(w) = \displaystyle\prod_{i=1}^{m} p(y^i \mid x^i; w)$

In practice we maximize the log likelihood. Its negative, $J(w) = -\log L(w)$, is the loss to minimize and is also known as the cross entropy function.

NOTE: $h(x) = \frac{1}{1 + e^{-z}}$

$\log L(w) = \displaystyle\sum_{i=1}^{m}{y^i \log h(x^i) + (1 - y^i) \log(1 - h(x^i))}$
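The sketch below evaluates the log likelihood for three hypothetical sets of predictions, to show how much more heavily confident mistakes are penalized. The $z$ values are made up, and the function takes each case's $z$ directly rather than computing it from weights.

```python
import math

def h(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(zs, ys):
    """sum_i y^i * log h^i + (1 - y^i) * log(1 - h^i), given each case's z."""
    return sum(y * math.log(h(z)) + (1 - y) * math.log(1.0 - h(z))
               for z, y in zip(zs, ys))

ys = [1, 0, 1]
confident_right = log_likelihood([4.0, -4.0, 4.0], ys)
slightly_wrong = log_likelihood([-0.5, 0.5, -0.5], ys)
very_wrong = log_likelihood([-4.0, 4.0, -4.0], ys)
```

The log likelihood is always negative and gets closer to zero as the predictions improve, so `very_wrong` is the lowest of the three values.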

### Softmax Function

Used for classification of _K_ number of classes. All values of the output sum to one. All outputs represent a probability distribution across discrete alternatives.

$P(y = j \mid x) = \displaystyle\frac{e^{z_j}}{\sum_{k=1}^{K}{e^{z_k}}}$

$J(w) = -\displaystyle\sum_j{t_j \log y_j}$, where $y_j = P(y = j \mid x)$ and $t_j$ is the target for class $j$

NOTE: the lectures didn't show how to update the weights for this case
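As a sketch, softmax and its cross-entropy loss can be computed as below. The gradient line uses the standard textbook result $\frac{\partial J}{\partial z_j} = y_j - t_j$, which is not taken from the lectures summarized here. The input values are made up.

```python
import math

def softmax(z):
    """Turn K scores into probabilities that sum to one."""
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(t, y):
    """J = -sum_j t_j * log y_j."""
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y))

z = [2.0, 1.0, 0.1]      # hypothetical scores for K = 3 classes
t = [1.0, 0.0, 0.0]      # one-hot target: the true class is the first one
y = softmax(z)
loss = cross_entropy(t, y)
# Standard result (an addition, not from the lectures): dJ/dz_j = y_j - t_j
grad_z = [yj - tj for yj, tj in zip(y, t)]
```

From `grad_z`, the gradient for a weight feeding output $j$ follows by the chain rule: multiply by the input that the weight connects to.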


* Hessian Multiplicative connections

---
## Different NN Architectures

## Perceptron

> A very simple network architecture. _Features_ are not learned, they're
> designed, and weights are learned.

* supervised
* linear
* binary output

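### Learning Procedure

Guaranteed to find weights that get every training case right, provided such weights exist (that is, the data is linearly separable):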

```
foreach(trainingex) {
  if output is correct, don't change weights
  if output is incorrectly 0, add input vector to weights
  if output is incorrectly 1, subtract input vector from weights
}
```
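A runnable sketch of this procedure in plain Python, assuming a binary threshold output and a constant 1.0 appended to each input to act as the bias input. The OR data set and the epoch count are illustrative choices.

```python
def predict(w, x):
    """Binary threshold unit: 1 if the weighted sum is non-negative."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0

def perceptron_train(X, targets, epochs=20):
    """Apply the perceptron rule to every training case, for a fixed number of passes."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, targets):
            output = predict(w, x)
            if output == target:
                continue                                # correct: leave weights alone
            if output == 0:
                w = [wj + xj for wj, xj in zip(w, x)]   # wrongly 0: add the input
            else:
                w = [wj - xj for wj, xj in zip(w, x)]   # wrongly 1: subtract the input
    return w

# Hypothetical linearly separable data: logical OR, last column is the bias input
X = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 1, 1]
w = perceptron_train(X, targets)
predictions = [predict(w, x) for x in X]
```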

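[From a Cross Validated post](https://stats.stackexchange.com/questions/137834/clarification-about-perceptron-rule-vs-gradient-descent-vs-stochastic-gradient), the subgradient of the perceptron loss for one training example, with labels $y^{(i)} \in \{-1, +1\}$: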

$\partial L_{\pmb{w}}(y^{(i)}) = \begin{array}{rl} 
\{ 0 \},                         &   \text{ if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} > 0 \\
\{ -y^{(i)} \pmb{x}^{(i)} \},    &   \text{ if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} < 0 \\
[-1, 0] \times y^{(i)} \pmb{x}^{(i)},   &   \text{ if } \pmb{w}^\top\pmb{x}^{(i)} = 0 \\ 
\end{array}$
 
Weights from multiple valid models can be averaged to produce another valid
model.

Can recognize patterns, but cannot discriminate patterns that are allowed to translate with "wrap-around".

## Recurrent Neural Network

> Generic structure of NN. Many special case instances follow

Good for processing sequences of data, as in speech and image recognition. They use internal memory (the hidden state) to carry information from one step to the next.

* directed graph
* forward prop
* back prop

The hidden states cannot be observed directly. At best we can infer a probability
distribution over the space of hidden states.

### Learning Procedure

This is backprop assuming SGD:

Inputs are multiplied by weights and fed into the hidden neurons. The hidden activations are then multiplied by a separate set of weights to produce either more hidden neurons or the output neurons. This completes the forward pass. The error is then computed and propagated backwards, and the weights in each layer (input => hidden, hidden => output) are updated. The process repeats for the next training case.
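The forward and backward passes described above can be sketched for a network with one hidden layer. Sigmoid units, a squared-error loss, the layer sizes, the learning rate and the OR data set are all assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, net):
    """Forward pass: input => hidden (sigmoid) => one sigmoid output."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(net["W1"], net["b1"])]
    output = sigmoid(net["b2"] + sum(w * h for w, h in zip(net["W2"], hidden)))
    return hidden, output

def sgd_step(x, target, net, lr):
    """One forward and backward pass on a single training case."""
    hidden, output = forward(x, net)
    # Squared error E = 0.5 * (target - output)^2; chain rule through the sigmoid
    delta_out = (output - target) * output * (1.0 - output)
    # Error signal for each hidden unit, using the weights before they change
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(net["W2"], hidden)]
    # Update hidden => output weights
    net["W2"] = [w - lr * delta_out * h for w, h in zip(net["W2"], hidden)]
    net["b2"] -= lr * delta_out
    # Update input => hidden weights
    for j, d in enumerate(delta_hidden):
        net["W1"][j] = [w - lr * d * xi for w, xi in zip(net["W1"][j], x)]
        net["b1"][j] -= lr * d

def total_error(data, net):
    return sum(0.5 * (t - forward(x, net)[1]) ** 2 for x, t in data)

rng = random.Random(0)   # fixed seed so runs are repeatable
net = {
    "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],
    "b1": [0.0] * 3,
    "W2": [rng.uniform(-1, 1) for _ in range(3)],
    "b2": 0.0,
}
# Hypothetical training set: logical OR
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

error_before = total_error(data, net)
for _ in range(2000):
    for x, t in data:
        sgd_step(x, t, net, lr=0.5)
error_after = total_error(data, net)
```

Because the weights change after every training case rather than after a full pass over the data, this is the stochastic (online) form of gradient descent.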


## LSTM (Long Short-Term Memory) NN

> an implementation of RNN with read gate, write gate and keep gate

* designed to mitigate the vanishing/exploding gradient problem

## Feedforward Neural Network


## Hopfield NN

* special case of RNN without any hidden units