# Recurrent Neural Networks
In this notebook, I'll be capturing all the essential information that was provided in the lectures. Each definition and
algorithm will be researched and accompanied by an example.

Many applications involve temporal dependencies. This means that the current output depends not only on the
current input but also on past inputs (unlike the feedforward approach). The neural network architectures you've seen so far were trained using the current inputs only: previous inputs were not considered when generating the current output. In other words, those systems had no memory elements.
RNNs address this basic and important issue by using memory (i.e., past inputs to the network) when producing the
current output.
* temporal dependencies: dependencies that span time, as found in sequential data (e.g., GIFs, videos)

<img src="images/ff_rnn.png" width="300" height="300"/>
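
To make the difference concrete, here is a minimal NumPy sketch comparing a feedforward step with a recurrent step. The layer sizes, the random weights, and the names `W_x`, `W_h`, `feedforward_step` and `recurrent_step` are assumptions made for illustration, not something taken from the lectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
input_size, hidden_size = 3, 4

W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden (the memory path)
b = np.zeros(hidden_size)

def feedforward_step(x):
    """Output depends on the current input only."""
    return np.tanh(W_x @ x + b)

def recurrent_step(x, h_prev):
    """Output depends on the current input and on the previous state."""
    return np.tanh(W_x @ x + W_h @ h_prev + b)

def run_rnn(xs):
    """Feed a sequence through the recurrent step, starting from a zero state."""
    h = np.zeros(hidden_size)
    for x in xs:
        h = recurrent_step(x, h)
    return h

# Two sequences that end with the same input but have different pasts.
x_now = np.array([1.0, 0.0, -1.0])
history_a = [np.array([0.5, 0.5, 0.5]), x_now]
history_b = [np.array([-2.0, 1.0, 0.0]), x_now]

ff_a, ff_b = feedforward_step(history_a[-1]), feedforward_step(history_b[-1])
rnn_a, rnn_b = run_rnn(history_a), run_rnn(history_b)

print("feedforward outputs equal:", np.allclose(ff_a, ff_b))
print("recurrent outputs equal:  ", np.allclose(rnn_a, rnn_b))
```

The feedforward step gives the same result for both sequences because it only sees `x_now`, while the recurrent step carries the earlier input forward through `h_prev`.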

### Feedforward Neural Networks
* Nonlinear function approximations
* Training (backpropagation & SGD)
* Evaluation

### Topics to be covered in Recurrent Neural Networks
* Applications
* Simple RNN (Elman Network)
* Training RNNs
* Long Short-Term Memory (LSTM)

### History of RNN
1. Time Delay Neural Networks (TDNN, 1989)
   Inputs from past time steps were added to the network input, changing the actual external inputs. Disadvantage: the temporal dependencies were limited to the chosen window of time.
2. Simple RNN / Elman Network (1990)
3. Long Short-Term Memory (LSTM, mid-1990s)
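
The fixed-window idea behind TDNNs can be sketched as follows. This is only an illustration of feeding a window of past inputs to a network, not the original TDNN architecture; the helper `make_delay_windows` and the toy signal are hypothetical.

```python
import numpy as np

def make_delay_windows(series, window):
    """Pair each time step with its (window - 1) predecessors."""
    series = np.asarray(series, dtype=float)
    return np.stack([
        series[t - window + 1 : t + 1]
        for t in range(window - 1, len(series))
    ])

signal = np.arange(6.0)  # toy sequence 0, 1, ..., 5
windows = make_delay_windows(signal, window=3)
print(windows)
```

Each row would be one input to a feedforward network. Anything older than `window` steps is invisible to the model, which is the limitation noted above.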

### RNN's biggest flaw
Simple RNNs have a key flaw: capturing relationships that span more than about 8 to 10 time steps back is practically impossible.
This flaw stems from the "vanishing gradient" problem, in which the contribution of information decays geometrically over time.
As you may recall, we train the network with backpropagation, adjusting the weight matrices with the use of a gradient. The gradient is calculated by repeatedly multiplying derivatives, one factor for each step back in time. When these derivatives are smaller than 1, their product shrinks with every additional step, so the gradient can practically "vanish".
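
A small numerical sketch of this effect, assuming a scalar RNN of the form `h_t = tanh(w * h_{t-1} + x_t)` with a recurrent weight `w = 0.5` and random inputs (all of these are illustrative choices, not values from the lectures). By the chain rule, the derivative of `h_t` with respect to the initial state is the product of the per-step factors `w * (1 - h_k**2)`.

```python
import numpy as np

def gradient_through_time(xs, w=0.5, h0=0.0):
    """Return d h_t / d h_0 for every t in a scalar tanh RNN."""
    h = h0
    grad = 1.0
    grads = []
    for x in xs:
        h = np.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h ** 2)
        grads.append(grad)
    return grads

rng = np.random.default_rng(0)
xs = rng.normal(size=20)  # hypothetical input sequence
grads = gradient_through_time(xs)

for t in (1, 5, 10, 20):
    print(f"steps back: {t:2d}  |d h_t / d h_0| = {abs(grads[t - 1]):.2e}")
```

Because each factor has magnitude at most `w`, the gradient after `t` steps is bounded by `w ** t`, which is the geometric decay described above.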

#### How to overcome that?
* LSTM
* Faster Hardware
* Residual Networks (ResNets): skip (residual) connections as part of the network architecture. They yield lower training error by reintroducing outputs from shallower layers deeper in the network to compensate for the vanishing signal. (However, this doesn't solve the fundamental problem, it only avoids it.)
* Other Activation Functions: ReLU suffers less from vanishing gradients because it saturates in only one direction
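
The last two points can be checked numerically. The sketch below compares the derivative of a sigmoid with that of ReLU, and shows the derivative of a residual block `y = x + f(x)`, which is `1 + f'(x)`. The function names and sample points are assumptions for illustration.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of max(0, z); taken as 0 at z == 0 by convention here."""
    return (z > 0).astype(float)

def residual_grad(f_grad, z):
    """Derivative of y = z + f(z): the skip path contributes a constant 1."""
    return 1.0 + f_grad(z)

z = np.array([-6.0, -1.0, 1.0, 6.0])
sig = sigmoid_grad(z)
rel = relu_grad(z)
res = residual_grad(sigmoid_grad, z)

print("sigmoid'          :", np.round(sig, 4))
print("relu'             :", rel)
print("residual(sigmoid)':", np.round(res, 4))
```

The sigmoid derivative is close to zero at both ends of the range, whereas the ReLU derivative stays at 1 for positive inputs, and the skip connection keeps the derivative of the residual block at 1 or above in this example.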

### Papers to check out
* <a href="https://arxiv.org/pdf/1404.7828.pdf">Deep Learning in NN</a>
* <a href="https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog1402_1">Elman Network original paper</a>
* <a href="http://www.bioinf.jku.at/publications/older/2604.pdf">LSTM original paper</a>
* <a href="https://arxiv.org/pdf/1511.06939.pdf">Netflix Recommendations</a>

### Applications of RNN and LSTM
- Speech Recognition, NLP & Chatbots
- Time Series prediction: traffic patterns, movie selection, stock movements
- Gesture Recognition