# Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

In [1]:
from IPython.display import Image

* **Image recognition** $\Rightarrow$ mostly **CNNs**
* **Speech-to-text** $\Rightarrow$ mostly **RNNs**, particularly **LSTM**

In [10]:
#to display image using IPython's display() function (no alignment):
#display(Image(filename='data/whats_for_dinner.png', width = 600, height = 300))

*Image display code in markdown cell*
<img src='data/whats_for_dinner.png' width="400" height="200" align="center"/>
<img src='data/how_nns_work.png' width="400" height="200" align="center"/>

### Recurrent Neural Networks
* On a very basic level, you can think of neural networks as a very complicated "voting process" (over-simplification)
* Say we know that our roommate has a very consistent dinner cycle: pizza, then sushi, then waffles, then back to pizza. Knowing what they made last night, we can predict tonight's dinner easily (and with 100% accuracy). But say we weren't home last night, so we don't know what they made. We would just think back to the night before and work forward from there:
<img src='data/yester_yesterday.png' width="500" height="250" align="center"/>
***
#### A vector describing the weather:
<img src='data/weather_vectors.png' width="500" height="250" align="center"/>

* Vectors are a computer's native language; everything gets reduced to a list of numbers before it goes into an algorithm.

#### A vector for "It's Tuesday":
<img src='data/its_tuesday.png' width="500" height="250" align="center"/>

#### A vector for our prediction for dinner tonight:
<img src='data/dinner_pred.png' width="500" height="250" align="center"/>

We can also group our inputs and outputs together as vectors, that is, as separate lists of numbers:
<img src='data/din_vects.png' width="500" height="250" align="center"/>
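
As a minimal sketch of how such vectors can be built, the snippet below one-hot encodes a label as a list of numbers. The label orderings and the helper names `one_hot` and `decode` are assumptions made for illustration; they are not taken from the figures.

```python
# Hypothetical label orderings, chosen only for this sketch.
DINNERS = ["pizza", "sushi", "waffles"]
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot(label, labels):
    """Return a list with 1.0 at the position of `label` and 0.0 elsewhere."""
    if label not in labels:
        raise ValueError(f"unknown label: {label!r}")
    return [1.0 if item == label else 0.0 for item in labels]


def decode(vector, labels):
    """Return the label whose entry in `vector` is the largest."""
    best_index = max(range(len(vector)), key=lambda i: vector[i])
    return labels[best_index]


its_tuesday = one_hot("Tue", DAYS)
dinner_prediction = one_hot("sushi", DINNERS)

print(its_tuesday)        # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(dinner_prediction)  # [0.0, 1.0, 0.0]
print(decode(dinner_prediction, DINNERS))  # sushi
```

`decode` also works on "soft" predictions, where every entry is between 0 and 1 and the largest entry wins.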

Here we can see how our prediction for one day gets recycled for the next day's dinner prediction:
<img src='data/rec_pred.png' width="500" height="250" align="center"/>

Above, we can see how to make a prediction about tonight's dinner even if, say, we've been out of town for two weeks. We just ignore the "new information" part and unroll this vector in time until we reach some information to base the prediction on.

When these vectors are unrolled in time, they look like this:
<img src='data/unrolled_vects.png' width="500" height="250" align="center"/>

The above charts give a nice, tidy picture of **Recurrent Neural Networks**.
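
The recurrence pictured above can be sketched in a few lines. This is an illustration under stated assumptions, not a trained model: the weights are picked by hand to encode a pizza, sushi, waffles cycle, and the names `rnn_step` and `WEIGHTS` are hypothetical. Plain Python lists are used to keep the sketch dependency-free.

```python
import math

DINNERS = ["pizza", "sushi", "waffles"]  # hypothetical ordering

# Hand-picked weights (an assumption, not learned). Each row produces one output
# entry from the concatenation [previous prediction (3), new information (3)].
WEIGHTS = [
    [0.0, 0.0, 3.0, 0.0, 0.0, 3.0],  # pizza follows waffles
    [3.0, 0.0, 0.0, 3.0, 0.0, 0.0],  # sushi follows pizza
    [0.0, 3.0, 0.0, 0.0, 3.0, 0.0],  # waffles follows sushi
]


def rnn_step(prev_prediction, new_info, weights):
    """One recurrent step: concatenate the inputs, weight them, squash with tanh."""
    combined = prev_prediction + new_info  # list concatenation
    return [math.tanh(sum(w * x for w, x in zip(row, combined))) for row in weights]


# Two nights ago we saw pizza; last night we were not home (no new information).
no_info = [0.0, 0.0, 0.0]
last_night = rnn_step(no_info, [1.0, 0.0, 0.0], WEIGHTS)
tonight = rnn_step(last_night, no_info, WEIGHTS)

best = max(range(len(tonight)), key=lambda i: tonight[i])
print(DINNERS[best])  # waffles
```

The second call receives only the recycled prediction, which is the "we weren't home last night" case from the example.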

* The **hyperbolic tangent (tanh) squashing function** helps the model to "behave":
    * Sigmoid shape ranging from -1 to 1
    * tanh squashing function **symbol**:
<img src='data/tanh_sym.png' align="center"/>    
    * For small numbers, your "squashed version" is very similar to your original number
    * As your number gets larger, it gets more and more squashed
<img src='data/tanh_squash.png' width="500" height="250" align="center"/>

* By ensuring that the output is always less than 1 and more than -1, you can process information through the model loops as many times as you want, without risking explosively large (nonsensical) or small (meaningless) outputs.
    * In a feedback loop, this is an example of **negative feedback** or **attenuating feedback**
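
To make the squashing behaviour concrete, here is a small sketch using the standard library's `math.tanh`. The loop gain of 2 and the helper name `feed_back` are assumptions chosen only to show what happens with and without squashing.

```python
import math

# Small inputs pass through almost unchanged; large inputs are squashed toward 1.
for value in [0.1, 0.5, 1.0, 5.0, 50.0]:
    print(f"tanh({value}) = {math.tanh(value):.4f}")


def feed_back(value, steps, squash):
    """Feed a value through a loop with gain 2, optionally squashing it each pass."""
    for _ in range(steps):
        value = 2.0 * value
        if squash:
            value = math.tanh(value)
    return value


print(feed_back(0.5, 20, squash=False))  # 524288.0, doubling on every pass
print(feed_back(0.5, 20, squash=True))   # stays strictly between -1 and 1
```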
    
#### Mistakes an RNN can make
* `Doug saw Doug.`
* `Jane saw Spot saw Doug saw... `
* `Spot. Doug. Jane.`
* **Because each of our predictions only looks back one time step (it has very short-term memory), it doesn't use the information from further back and it's subject to these types of mistakes.**
* In order to overcome this, we take our Recurrent Neural Network and we expand it and add some more pieces to it
* The critical part that we add to the middle is **`memory`**; we want to be able to remember what happened many time steps ago.
<img src='data/memory.png' width="500" height="250" align="center"/>

* Above, you'll notice we've added a few more symbols.
* First, the plus junction: 

### Plus Junction
<img src='data/plus.png' align="center"/>
<img src='data/plus_junc.png' width="500" height="250" align="center"/>

* Input vectors of equal length
* Output vector is of same size as each of your input vectors
* Output vector is the sum, element by element, of the two input vectors
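
A minimal sketch of the plus junction, assuming vectors are represented as plain Python lists (the function name `plus_junction` is hypothetical):

```python
def plus_junction(a, b):
    """Add two equal-length vectors element by element."""
    if len(a) != len(b):
        raise ValueError("input vectors must have equal length")
    return [x + y for x, y in zip(a, b)]


print(plus_junction([3, 4, 5], [6, 7, 8]))  # [9, 11, 13]
```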

### Times Junction
<img src='data/times_sym.png' align="center"/>
<img src='data/times_junc.png' width="500" height="250" align="center"/>

* Times junction: element by element multiplication 
* Once again, input vectors are of same size as output vector
* The times junction allows you to do something pretty cool called **Gating** (similar to applying a separate weight to each element)

### Gating
<img src='data/gating.png' width="500" height="250" align="center"/>

* Gating lets us control what passes through and what gets blocked
* To do gating, it's nice to have a value that you know is always between 0 and 1 
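
The times junction and gating can be sketched the same way. The signal and gate values below are made up for illustration, and `times_junction` is a hypothetical name.

```python
def times_junction(a, b):
    """Multiply two equal-length vectors element by element."""
    if len(a) != len(b):
        raise ValueError("input vectors must have equal length")
    return [x * y for x, y in zip(a, b)]


signal = [0.8, 0.8, 0.8]
gate = [1.0, 0.5, 0.0]  # 1.0 passes everything, 0.0 blocks everything

print(times_junction(signal, gate))  # [0.8, 0.4, 0.0]
```

A gate entry of 1.0 lets the signal through unchanged, 0.0 blocks it, and anything in between lets part of it through.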

### Logistic (Sigmoid) Function
<img src='data/sig_squash_sym.png' align="center"/>
<img src='data/sig_squash_junc.png' width="500" height="250" align="center"/>

* Minimum of 0, maximum 1
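
A short sketch of the logistic function, written in a numerically stable form so that large negative inputs do not overflow `math.exp`. The name `logistic` is chosen for this example.

```python
import math


def logistic(x):
    """Logistic (sigmoid) squashing function with output between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe for large negative x
    return z / (1.0 + z)


for value in [-1000.0, -5.0, 0.0, 5.0, 1000.0]:
    print(value, logistic(value))
```

Because the output always lies between 0 and 1, it can be used directly as a gate value in a times junction.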


When we put all of these things together, we get:
<img src='data/memory.png' width="500" height="250" align="center"/>

* We still have the combination of our previous predictions and our new information.
* Those vectors get passed below and we make predictions based upon them 
* Those predictions get passed through, but **a copy of those predictions is held onto for the next time step, the next pass through the network**
    * **Some of them are forgotten**
    * **Some of them are remembered**
        * **The ones that are remembered are added back into the prediction**
<img src='data/remembered_pred.png' width="500" height="250" align="center"/>

* So now we have not just predictions, but **predictions plus the memories that we've accumulated and haven't chosen to forget yet.**
* **There is an entirely separate neural network here to decide when to forget what:**
<img src='data/memory_nn.png' width="500" height="250" align="center"/>
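
The forgetting path can be sketched as a small gate network followed by a times junction and a plus junction. This is a sketch under stated assumptions: the weights are hand-picked rather than learned, and the names `forget_gate` and `update_memory` are hypothetical.

```python
import math


def logistic(x):
    """Numerically stable logistic function with output between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def forget_gate(prev_prediction, new_info, weights):
    """Small 'voting' network: one gate value between 0 and 1 per memory element."""
    combined = prev_prediction + new_info  # list concatenation
    return [logistic(sum(w * x for w, x in zip(row, combined))) for row in weights]


def update_memory(memory, gate, prediction):
    """Keep the gated part of the old memory and add the new prediction to it."""
    return [m * g + p for m, g, p in zip(memory, gate, prediction)]


# Hand-picked, hypothetical weights: 2 memory elements, 2 + 2 inputs.
weights = [
    [6.0, 0.0, 0.0, 0.0],   # element 0 is remembered when prev_prediction[0] is high
    [0.0, 0.0, 0.0, -6.0],  # element 1 is forgotten when new_info[1] is high
]
memory = [0.9, -0.5]
gate = forget_gate([1.0, 0.0], [0.0, 1.0], weights)
memory = update_memory(memory, gate, prediction=[0.0, 0.2])

print(gate)    # first entry close to 1 (remember), second close to 0 (forget)
print(memory)  # old value mostly kept in slot 0, mostly replaced in slot 1
```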

* Basically: "Based on what we're seeing right now, what do we want to remember? What do we want to forget?"
* This lets us hold on to things for as long as we want
* **When we are combining our predictions with our memories, we may not necessarily want to release all of those memories out as new predictions each time**
<img src='data/mem_release.png' width="500" height="250" align="center"/>
<img src='data/selection.png' width="500" height="250" align="center"/>

* So, we want a filter to keep our memories inside and let our predictions out
* We add another gate for that to do **selection** (see above)
* **Selection** has its own neural network, and so its own voting process: our new information and our previous predictions are used to vote on what the gates should be, that is, what should be kept internal and what should be released as a prediction
* We also introduce another squashing function after the plus junction to make sure that we keep our predictions within the range of -1 to 1
* Each of these decisions (when to forget and when to release things from memory) is learned by its own neural network
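
As a rough sketch of the selection step, the memory is first squashed with tanh and then gated. The gate values here are given by hand; in the network described above they would come from the selection network. The name `select` is hypothetical.

```python
import math


def select(memory, selection_gate):
    """Squash the memory into (-1, 1), then release only the gated part."""
    squashed = [math.tanh(m) for m in memory]
    return [s * g for s, g in zip(squashed, selection_gate)]


memory = [2.0, -3.0, 0.5]
selection_gate = [1.0, 0.0, 0.5]  # release, keep internal, release half

released = select(memory, selection_gate)
print(released)
```

The second memory element stays inside the cell (its released value is zero), while the memory itself is left untouched for later time steps.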

### Long short-term memory
* The only other piece we need to add to complete our picture here is yet another set of gates which lets us ignore possible predictions
* This is an **attention mechanism**
* It lets things that aren't immediately relevant be set aside so they don't cloud the predictions in memory going forward
* It has its own neural network and its own logistic squashing function and its own gating activity 

<img src='data/ignoring.png' width="500" height="250" align="center"/>

* Clearly LSTM has a lot of pieces that work together
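
Putting the pieces together, one pass through the cell can be sketched as below. This is a minimal illustration, assuming random untrained weights, no bias terms, and hypothetical names (`lstm_step`, `random_params`, and the gate labels `predict`, `ignore`, `forget`, `select`). It is not a reference implementation.

```python
import math
import random


def logistic(x):
    """Numerically stable logistic function with output between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def layer(weights, inputs, squash):
    """Weighted sum per output element, followed by a squashing function."""
    return [squash(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def lstm_step(new_info, prev_prediction, memory, params):
    """One time step: predict, ignore, forget, remember, then select."""
    combined = prev_prediction + new_info                      # list concatenation
    candidate = layer(params["predict"], combined, math.tanh)  # possible predictions
    ignore = layer(params["ignore"], combined, logistic)       # gate on candidates
    forget = layer(params["forget"], combined, logistic)       # gate on old memory
    selection = layer(params["select"], combined, logistic)    # gate on the output
    memory = [f * m + i * c
              for f, m, i, c in zip(forget, memory, ignore, candidate)]
    prediction = [s * math.tanh(m) for s, m in zip(selection, memory)]
    return prediction, memory


def random_params(input_size, hidden_size, seed=0):
    """Untrained random weights, only so the sketch can run end to end."""
    rng = random.Random(seed)
    columns = hidden_size + input_size
    return {
        name: [[rng.uniform(-1.0, 1.0) for _ in range(columns)]
               for _ in range(hidden_size)]
        for name in ("predict", "ignore", "forget", "select")
    }


params = random_params(input_size=3, hidden_size=4)
prediction, memory = [0.0] * 4, [0.0] * 4
for new_info in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    prediction, memory = lstm_step(new_info, prediction, memory, params)
    print([round(p, 3) for p in prediction])
```

In practice the four weight matrices would be learned by training, and a library implementation would normally be used instead of hand-written loops.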

#### Epoch 1:
<img src='data/epoch1.png' width="500" height="250" align="center"/>

#### Epoch 2:
<img src='data/epoch2.png' width="500" height="250" align="center"/>

* **What this shows is that LSTM can look back 2, 3, many time steps... and use that information to make good predictions about what's going to happen next.**
    * As a note: regular, vanilla recurrent neural networks can look back a couple of time steps as well, but not very many
    * **LSTM can look back many time steps**
    
    
## Sequential Patterns
* LSTMs are really useful in some surprisingly practical applications:
    * Text
        * translation; LSTM is able to represent word-to-word, phrase-to-phrase, sentence-to-sentence grammar structures
    * Speech
    * Audio
    * Video
    * Physical Processes
    * Robotics
    * Anything embedded in time (almost everything)
    
#### Mathematical structure of LSTMs
<img src='data/LSTM_math.png' width="500" height="250" align="center"/>
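
For reference, one commonly used formulation of the LSTM update is written out below. The notation is an assumption and may differ from the figure: $x_t$ is the new information, $h_{t-1}$ the previous prediction, $c_t$ the memory, $\sigma$ the logistic function, and $\odot$ element-by-element multiplication. In the terms used above, $f_t$ is the forgetting gate, $i_t$ the ignoring gate, and $o_t$ the selection gate.

$$
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$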