# Week 1 : Recurrent Neural Networks 

## Recurrent Neural Networks 

### Why Sequence models 

Examples of sequence data:
* speech recognition 
* music generation 
* sentiment classification 
* DNA sequence analysis
* machine translation 
* video activity recognition 
* named entity recognition

### Notation 

$x$ : Harry Potter and Hermione Granger invented a new spell. 

$x$ : $x^{<1>} = \text{Harry}, x^{<2>} = \text{Potter}, \ldots, x^{<9>} = \text{spell}$ 

$y$ : $y^{<1>} = 1, y^{<2>} = 1, \ldots, y^{<9>} = 0$

$x^{(i)<t>}$ : the $t$th element of the sequence in the $i$th training example

$T_x, T_y$ : the lengths of the input sequence and the output sequence, respectively. Both can vary from one training example to the next

We store all the words in a dictionary/vocabulary

Each element of every input sequence in the training set is then one-hot encoded against that vocabulary, such that: 
$$x^{(i)<1>} = \text{Harry} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\  0 \end{bmatrix}$$

Words that are not in the vocabulary are mapped to a special unknown-word token:
$$\langle\text{unk}\rangle = \text{some index (maybe the last) in the vocab/dict}$$
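
The one-hot encoding and the unknown-word fallback described above can be sketched as follows. This is a minimal illustration, not the course's code: the toy `vocab` list, the lowercasing, and the helper name `one_hot` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would hold tens of thousands of words.
# "<unk>" is placed last, as one possible choice of index.
vocab = ["a", "and", "granger", "harry", "hermione",
         "invented", "new", "potter", "spell", "<unk>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot column vector of shape (len(vocab), 1) for `word`."""
    # Words missing from the vocabulary fall back to the "<unk>" index.
    index = word_to_index.get(word.lower(), word_to_index["<unk>"])
    vector = np.zeros((len(vocab), 1))
    vector[index, 0] = 1.0
    return vector

x_1 = one_hot("Harry")        # 1 at the index of "harry", 0 elsewhere
x_oov = one_hot("Voldemort")  # not in the vocabulary, so it maps to "<unk>"
```

Each vector has exactly one non-zero entry, so its length equals the vocabulary size regardless of how long the sentence is.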

### Recurrent Neural Network Model 

Why not a standard FFNN? 
* different inputs/outputs in the training set may have different lengths
* it doesn't share features learned across different positions in the text (no notion of sequence order) 

**Recurrent Neural Network**

Similar to the explanation in the Udacity course 

* activation $a^{<t>}$ is saved across multiple time steps  
* parameters are shared between time steps 

Weakness in the current RNN example: 
* it only uses information from earlier parts of the sequence, not from later ones 
    * He said, "Teddy Roosevelt was a great President." 
    * He said, "Teddy bears are on sale!" 
    * the first three words alone cannot tell you whether "Teddy" is part of a person's name

    * a BRNN (bi-directional RNN) addresses this 
    
RNN calculations:
* for some timestep $t$ in this (weak) RNN
    * $a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$ where $g$ is typically $\tanh$ or $\text{ReLU}$
    * $\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_y)$ where $g$ here is typically a sigmoid or softmax, depending on the output

* the notation above can be simplified:
   * $a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$ is simplified to $a^{<t>} = g(W_{a}[a^{<t-1>},x^{<t>}] + b_a)$
       * $W_{aa}$ and $W_{ax}$ are stacked horizontally s.t. $\begin{bmatrix} W_{aa} & W_{ax} \end{bmatrix} = W_{a}$
       * $[a^{<t-1>},x^{<t>}] = \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix}$, i.e. the two vectors are stacked vertically
       
   * this reduces the number of parameter matrices in the notation from 2 to 1 
   
   * similarly, for $\hat{y}$, we write $\hat{y}^{<t>} = g(W_{y}a^{<t>} + b_y)$ (i.e. we drop the subscript $a$)
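
To make the stacking concrete, here is a small NumPy sketch of one forward step written both ways. The sizes `n_a` and `n_x`, the random weights, the choice of $\tanh$ for the hidden activation, and a sigmoid for the output are all assumptions for illustration; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 4, 10  # hypothetical sizes: hidden units, vocabulary size

# Randomly initialised parameters, shared across all time steps.
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
b_a = np.zeros((n_a, 1))
W_y = rng.standard_normal((1, n_a))
b_y = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(a_prev, x_t):
    """One forward step using the separate matrices W_aa and W_ax."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = sigmoid(W_y @ a_t + b_y)
    return a_t, y_hat_t

def rnn_step_stacked(a_prev, x_t):
    """The same step using the single stacked matrix W_a."""
    W_a = np.hstack([W_aa, W_ax])        # shape (n_a, n_a + n_x)
    stacked = np.vstack([a_prev, x_t])   # shape (n_a + n_x, 1)
    a_t = np.tanh(W_a @ stacked + b_a)
    y_hat_t = sigmoid(W_y @ a_t + b_y)
    return a_t, y_hat_t

a_0 = np.zeros((n_a, 1))   # initial activation
x_1 = np.zeros((n_x, 1))   # one-hot input with a 1 at an arbitrary index
x_1[3, 0] = 1.0

a_1, y_hat_1 = rnn_step(a_0, x_1)
a_1_stacked, _ = rnn_step_stacked(a_0, x_1)
same = np.allclose(a_1, a_1_stacked)  # both forms give the same activation
```

The two functions compute the same product, because multiplying the horizontally stacked matrix by the vertically stacked vector expands to $W_{aa}a^{<t-1>} + W_{ax}x^{<t>}$.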
   

### Backpropagation Through Time 

Define the loss at a single time step:
$\mathcal{L}^{<t>}(\hat{y}^{<t>},y^{<t>}) = -y^{<t>}\log{\hat{y}^{<t>}} - (1-y^{<t>})\log{(1-\hat{y}^{<t>})}$ 

$\mathcal{L}(\hat{y},y) = \sum\limits_{t=1}^{T_y}\mathcal{L}^{<t>}(\hat{y}^{<t>},y^{<t>})$
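
As a rough sketch, the per-step loss and its sum over the sequence could be computed like this. The labels and predictions are made-up numbers for illustration only, and the helper names `step_loss` and `sequence_loss` are hypothetical.

```python
import numpy as np

def step_loss(y_hat_t, y_t):
    """Binary cross-entropy loss for a single time step."""
    return -y_t * np.log(y_hat_t) - (1 - y_t) * np.log(1 - y_hat_t)

def sequence_loss(y_hat, y):
    """Total loss: the sum of the per-step losses over t = 1..T_y."""
    return sum(step_loss(y_hat_t, y_t) for y_hat_t, y_t in zip(y_hat, y))

# Made-up labels and predictions for a sequence of length T_y = 3.
y = [1, 1, 0]
y_hat = [0.9, 0.8, 0.1]
total = sequence_loss(y_hat, y)
```

Backpropagation through time then differentiates this total with respect to the shared parameters, flowing backwards through the time steps.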

### Different Types of RNNs

Many-to-many, where the input and output lengths can differ:
* Machine translation 
    * encoder and decoder 




### Language Model and Sequence Generation 

What is language modelling?
A language model assigns a probability to a sentence, i.e. how likely that particular sequence of words is to occur

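Speech recognition example (the two candidates sound identical, so the language model has to pick the more likely one):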
* $P($The apple and pair salad$)$ 
* $P($The apple and pear salad$)$
* $P(\text{sentence}) = ?$
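
One common way to answer the open question above is the chain rule of probability. The factorization below is not stated in these notes, so treat it as an assumption about how the model is set up, written in the same notation:

$$P(y^{<1>}, \ldots, y^{<T_y>}) = \prod_{t=1}^{T_y} P(y^{<t>} \mid y^{<1>}, \ldots, y^{<t-1>})$$

Under this view, an RNN language model would estimate each factor at time step $t$ from the words it has already seen.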

#### How do you build a language model? 

Training set : a large corpus of text 

**Tokenize** : build a dictionary of the distinct objects (here, words) and map each one to a one-hot vector 
* some models add an `<EOS>` token, but we won't use that here 
* `<UNK>` will be used, though, to denote words that are not in our vocabulary/dictionary
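
A minimal sketch of the tokenization step, assuming whitespace splitting, lowercasing, and a vocabulary capped at the most frequent words. None of these choices come from the lecture, and `build_vocab`, `tokenize`, and the three-sentence corpus are hypothetical.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(corpus, max_size):
    """Keep the `max_size` most frequent words and add one extra slot for UNK."""
    counts = Counter(
        word for sentence in corpus for word in sentence.lower().split()
    )
    words = [word for word, _ in counts.most_common(max_size)]
    return {word: i for i, word in enumerate(words + [UNK])}

def tokenize(sentence, vocab):
    """Map each word to its vocabulary index, falling back to the UNK index."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.lower().split()]

# Tiny made-up corpus; a real training set would be a large body of text.
corpus = ["the apple and pear salad", "the apple pie", "the pear tree"]
vocab = build_vocab(corpus, max_size=3)
ids = tokenize("the apple and pear salad", vocab)
```

Here the rarer words "and" and "salad" fall outside the capped vocabulary, so they share the `<UNK>` index; each index could then be turned into a one-hot vector.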



Ng goes through a one to many RNN example