# Projects

Your final project should show innovation in data preprocessing, network architecture, or both. Straightforward applications of standard processing pipelines to standard network architectures are not acceptable. Possible projects include:

- Train a convolutional neural network on depth data, e.g. from an Intel RealSense, to augment object recognition
- Combine the output of a YOLO or Mask R-CNN classifier with a word embedding to understand scene context
- Classify force/torque data from assembly to estimate whether an assembly is successful or failing
- Build a pipeline for gaze detection that can fit on an embedded computer
- Participate in a competition on Kaggle or the OpenAI Gym

# Recurrent Networks: LSTMs

Recurrent networks add a time dimension to a neural network by feeding their output back to the input. This is done in a relatively straightforward fashion. First, the output of a cell is fed directly back into a special recurrent cell. Second, to capture $N$ time steps, the recurrent cell is replicated $N$ times, connecting the output of the first copy to the input of the second, and so on. This is illustrated below. Dotted lines show the flow of information during backpropagation.

<center>
    <img src="figs/rnn_backprop.svg" width="50%">
</center>
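The unrolling described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the scalar state, the `tanh` activation, the function name `rnn_forward`, and the roles given to `u` (input weight), `w` (recurrent weight), and `v` (output weight) are all assumptions made for the example.

```python
import math

def rnn_forward(xs, u, w, v, h0=0.0):
    """Unroll a scalar recurrent cell over len(xs) time steps.

    Assumed roles (hypothetical): u scales the input, w scales the
    fed-back hidden state, v maps the hidden state to the output.
    """
    h = h0
    hs, ys = [], []
    for x in xs:
        # The same u and w are reused at every time step.
        h = math.tanh(u * x + w * h)
        hs.append(h)
        ys.append(v * h)
    return hs, ys

# Four time steps, so the cell is applied four times in a row.
hs, ys = rnn_forward([1.0, 0.5, -0.5, 0.0], u=0.8, w=0.5, v=1.0)
print(hs)
print(ys)
```

Each pass through the loop corresponds to one copy of the recurrent cell in the unrolled picture.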

Although RNNs can in theory relate very new and very old information, for example two related words in a sentence that are a few words apart, such relationships are difficult to learn using backpropagation. The reason becomes clear when considering how the gradient is calculated. The gradient of the loss $L$ is calculated with respect to the weights $W$. The loss is calculated via some suitable metric that compares the output $y(t)$ with the desired output.

## The vanishing and exploding gradient problem
Note that when implementing, running, and training an RNN, all timesteps are presented at the same time. Also note that the parameters $u$, $v$, and $w$ are the same at every timestep. The gradient of the loss function $L$ with respect to the weight $w$ is computed as follows:

$$ \frac{\partial L}{\partial w}=\sum_{\forall t}\frac{\partial L_t}{\partial w}$$

In this example, $t$ runs over four timesteps. Similar equations can be written for the parameters $u$ and $v$.

Let's consider the gradient at timestep $t=4$ using the chain rule:
    
$$ \frac{\partial L_4}{\partial w}=\frac{\partial L_4}{\partial y_4}\frac{\partial y_4}{\partial h_3}\frac{\partial h_3}{\partial w} $$

Following the computation graph further reveals four branches:

$$ \frac{\partial L_4}{\partial w} = \frac{\partial L_4}{\partial y_4}\frac{\partial y_4}{\partial h_3}\frac{\partial h_3}{\partial h_3}\frac{\partial h_3}{\partial w}
+ \frac{\partial L_4}{\partial y_4}\frac{\partial y_4}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial w}
+ \frac{\partial L_4}{\partial y_4}\frac{\partial y_4}{\partial h_3}\frac{\partial h_3}{\partial h_1}\frac{\partial h_1}{\partial w}
+ \frac{\partial L_4}{\partial y_4}\frac{\partial y_4}{\partial h_3}\frac{\partial h_3}{\partial h_0}\frac{\partial h_0}{\partial w}
$$

or

$$ \frac{\partial L_4}{\partial w} = \sum_{t=0}^{3}\frac{\partial L_4}{\partial y_4}\frac{\partial y_4}{\partial h_3}\frac{\partial h_3}{\partial h_t}\frac{\partial h_t}{\partial w} $$

which is further expanded to

$$ \frac{\partial L_4}{\partial w} = \sum_{t=0}^{3}\frac{\partial L_4}{\partial y_4}\frac{\partial y_4}{\partial h_3}\prod_{j=t+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}\frac{\partial h_t}{\partial w} $$
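
As a worked example of a single factor in this product, assume a scalar hidden state that is updated as $h_j = \tanh(w h_{j-1} + u x_j)$. This cell form is an assumption for illustration and is not taken from the figure. Then

$$ \frac{\partial h_j}{\partial h_{j-1}} = w\left(1 - h_j^2\right) $$

and, because $0 < 1 - h_j^2 \le 1$, the product is bounded by

$$ \left|\prod_{j=t+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}\right| \le |w|^{3-t} $$

Under this assumption, $|w| < 1$ forces the product toward zero as the number of factors grows.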



When the factors $\frac{\partial h_j}{\partial h_{j-1}}$ are smaller than one in magnitude, the product shrinks quickly as the number of timesteps grows. This diminishes the contribution of distant timesteps and makes it difficult to capture long-term relationships. Conversely, when the factors are larger than one, the product quickly becomes very large. This is known as the <i>vanishing</i> or <i>exploding</i> gradient problem.
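
The effect can be checked numerically with a small sketch. It assumes a scalar recurrent cell with zero input, either linear ($h_j = w h_{j-1}$) or with a `tanh` activation. The function name `factor_product` and all parameter values are hypothetical choices for illustration.

```python
import math

def factor_product(w, n_steps, h0=0.5, use_tanh=False):
    """Multiply n_steps factors d h_j / d h_{j-1} for a scalar cell with zero input."""
    h = h0
    product = 1.0
    for _ in range(n_steps):
        if use_tanh:
            h = math.tanh(w * h)
            product *= w * (1.0 - h * h)  # derivative of tanh(w * h_prev) w.r.t. h_prev
        else:
            h = w * h
            product *= w                  # derivative of w * h_prev w.r.t. h_prev
    return product

vanishing = factor_product(w=0.9, n_steps=50)
exploding = factor_product(w=1.1, n_steps=50)
vanishing_tanh = factor_product(w=0.9, n_steps=50, use_tanh=True)
print(vanishing, exploding, vanishing_tanh)
```

A recurrent weight only slightly below or above one is enough to shrink or grow the product by orders of magnitude over 50 steps.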

## Long Short Term Memory (LSTM)

Other cell models exist that are designed to mitigate the vanishing/exploding gradient problem. One of these is the so-called <i>long short term memory</i> (LSTM) cell. Instead of a single network layer with a <i>tanh</i> activation, it uses four layers with non-linear activations that interact with each other. A good resource that graphically explains this is the blog post <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a>. A typical LSTM cell is shown below.

<center>
    <img src="figs/LSTM.svg" width="50%">
</center>

Inputs ($x$), outputs ($y$), and hidden states $h$ and $c$ are vectors. Thick black lines indicate the flow of data. Two lines merging, such as $x_t$ and $h_{t-1}$, indicate concatenation. Black lines diverging indicate data being copied, maintaining their dimension. The circled "x", "+" and "tanh" symbols indicate element-wise multiplication, addition, and tanh, respectively. The rectangles are neural network layers with sigmoid and tanh activation.

The LSTM seems complicated at first. The critical element of the LSTM cell is the <i>cell state</i> $c_t$. The four neural network layers in the cell affect how the cell state changes from $c_{t-1}$ to $c_t$. We note that if $f$ consists of all ones, the cell state remains unchanged. If $f$ consists of all zeros, the previous cell state will be forgotten. The sigmoid layer leading to $f$ is therefore also called the <i>forget gate layer</i>. The cell state can be further changed by adding the element-wise product of $i$ and $g$.
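
The cell state update can be sketched in plain Python as follows. This is a minimal sketch under assumptions: the text only names $f$, $i$, and $g$, so the output gate `o`, the update $h_t = o \odot \tanh(c_t)$, the helper names `layer` and `lstm_step`, and the parameter layout are taken from the common LSTM formulation rather than from the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, bias, inputs, activation):
    """One dense layer, activation(W @ inputs + b), written with plain lists."""
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, inputs)) + b_i)
            for row, b_i in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'g', 'o' to (weights, bias)."""
    z = h_prev + x_t                       # list concatenation of h_{t-1} and x_t
    f = layer(*params["f"], z, sigmoid)    # forget gate
    i = layer(*params["i"], z, sigmoid)    # input gate
    g = layer(*params["g"], z, math.tanh)  # candidate values
    o = layer(*params["o"], z, sigmoid)    # output gate (assumed, not named in the text)
    # Element-wise: keep part of the old state, then add i * g.
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

# Hypothetical parameters for one unit and one input. All weights are zero,
# so each gate is set by its bias alone.
zero_w = [[0.0, 0.0]]
params = {
    "f": (zero_w, [20.0]),   # f close to 1: keep the previous cell state
    "i": (zero_w, [-20.0]),  # i close to 0: add almost nothing new
    "g": (zero_w, [0.0]),
    "o": (zero_w, [0.0]),
}
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[0.7], params=params)
print(c_t)  # stays close to the previous cell state 0.7
```

Changing the forget bias to a large negative value drives $f$ toward zero, and the previous cell state is discarded instead.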

There exists a large variety of LSTM variants, a notable one being the <i>Gated Recurrent Unit</i> (GRU), which shows minor differences in performance for some types of data.

## Implementation

LSTMs are a drop-in replacement for simple RNN units, albeit requiring four times the number of parameters.

```python
from keras.layers import LSTM
from keras.models import Sequential
import numpy as np

time_steps = 5

model = Sequential()
# A single LSTM unit that reads sequences of time_steps scalar values
model.add(LSTM(units=1, input_shape=(time_steps, 1), activation="tanh"))
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.summary()
```

```
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 1)                 12        
Total params: 12
Trainable params: 12
Non-trainable params: 0
_________________________________________________________________
```
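
The parameter count of 12 in the summary can be reproduced by hand. The sketch below assumes the usual layout in which each of the four layers of the cell has one weight per concatenated input (hidden state plus input) and one bias per unit. The function names are hypothetical helpers, not part of Keras.

```python
def simple_rnn_param_count(units, input_dim):
    """Weights for [h_{t-1}, x_t] plus one bias per unit."""
    return units * (units + input_dim) + units

def lstm_param_count(units, input_dim):
    """An LSTM cell has four such layers."""
    return 4 * simple_rnn_param_count(units, input_dim)

print(simple_rnn_param_count(units=1, input_dim=1))  # 3
print(lstm_param_count(units=1, input_dim=1))        # 12
```

With `units=1` and one input feature, each layer has two weights and one bias, giving 4 x 3 = 12 parameters.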
