## RNN Explained

![title](images\rnn.PNG)

## Points
* The hidden state $h$ changes at every time step

* The weight matrices (U, W, V) are the same at every time step of the forward pass -> drastic parameter reduction

* The 3 weight matrices (U, W, V) and 2 bias vectors ($b_h$, $b_y$) are the parameters updated by gradient descent (GD)

* Input is usually 3D -> (num_examples, time_steps, dimension)

* A dimension of 1 means the series is univariate

* At any step t, the computations are:
    1. concatenate $X_t$ and $h_{t-1}$
    2. feed this to an FC layer with weights $U, W$ -> apply the activation to get $h_t$
    3. feed $h_t$ to another FC output layer -> $\hat y_t$
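
The "concatenate, then FC layer" view and the formula $U X_t + W h_{t-1}$ describe the same computation. A minimal NumPy sketch of that equivalence follows; the sizes, the random values, and the choice of tanh as the activation are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 1, 4   # illustrative sizes (assumptions)

U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=hidden_dim)

# Step 1: concatenate the input and the previous state.
concat = np.concatenate([x_t, h_prev])          # shape (input_dim + hidden_dim,)

# Step 2: one FC layer whose weight matrix is [U | W], then the activation.
UW = np.hstack([U, W])                          # shape (hidden_dim, input_dim + hidden_dim)
h_concat = np.tanh(UW @ concat + b_h)

# The same state written as U x_t + W h_{t-1} + b_h.
h_formula = np.tanh(U @ x_t + W @ h_prev + b_h)

print(np.allclose(h_concat, h_formula))
```

Stacking U and W side by side is only a bookkeeping choice; both forms have the same number of parameters.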
     

## Forward Propagation

1. Predict State

$ h_t = \phi(U X_t + W h_{t-1} + b_h) $

2. Predict Output from State

$ \hat y_t = V h_t + b_y $
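
The two equations above can be sketched as a loop over time steps. This is a minimal illustration, assuming tanh for $\phi$, a zero initial state, and the 3D input layout (num_examples, time_steps, dimension); the function name `rnn_forward` and all sizes are hypothetical.

```python
import numpy as np

def rnn_forward(X, U, W, V, b_h, b_y, h0=None):
    """Run a simple RNN over X of shape (num_examples, time_steps, dimension)."""
    num_examples, time_steps, _ = X.shape
    hidden_dim = W.shape[0]
    h = np.zeros((num_examples, hidden_dim)) if h0 is None else h0
    states, outputs = [], []
    for t in range(time_steps):
        x_t = X[:, t, :]                        # (num_examples, dimension)
        h = np.tanh(x_t @ U.T + h @ W.T + b_h)  # h_t = phi(U x_t + W h_{t-1} + b_h)
        y_hat = h @ V.T + b_y                   # y_hat_t = V h_t + b_y
        states.append(h)
        outputs.append(y_hat)
    # Stack along the time axis: (num_examples, time_steps, ...)
    return np.stack(states, axis=1), np.stack(outputs, axis=1)

rng = np.random.default_rng(0)
num_examples, time_steps, dimension, hidden_dim, output_dim = 2, 5, 1, 3, 1
X = rng.normal(size=(num_examples, time_steps, dimension))
U = rng.normal(size=(hidden_dim, dimension))
W = rng.normal(size=(hidden_dim, hidden_dim))
V = rng.normal(size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

H, Y_hat = rnn_forward(X, U, W, V, b_h, b_y)
print(H.shape, Y_hat.shape)
```

The same U, W, V, $b_h$, $b_y$ are reused inside the loop, which is the weight sharing across time steps.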

## Backward Propagation

We need the partial derivatives of the loss with respect to the parameter matrices W, U, V and the biases $b_h$, $b_y$.

Assume a regression task with MSE loss. Let's derive backprop for the loss at step 3, $L_3$.

1. $ L_3 = 0.5 * (y_3 - \hat y_3)^2 $

2. $ \frac{\partial L_3}{\partial \hat y_3} = -(y_3 - \hat y_3) = (\hat y_3 - y_3) $

3. **V**:
    * $ \frac{\partial L_3}{\partial V} = \frac{\partial L_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial V} $

    * $ \frac{\partial L_3}{\partial V} = (\hat y_3 - y_3) * h_3 $


4. **by**:
    * $ \frac{\partial L_3}{\partial b_y} = \frac{\partial L_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial b_y} $

    * $ \frac{\partial L_3}{\partial b_y} = (\hat y_3 - y_3) $

5. **W**:
    * $ \frac{\partial L_3}{\partial W} = \frac{\partial L_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial h_3} \frac{\partial h_3}{\partial W} $

    * $ \frac{\partial L_3}{\partial W} = (\hat y_3 - y_3) * V * \frac{\partial h_3}{\partial W} $

    * $h_3$ depends on W directly and also indirectly through $h_2$, $h_1$, and so on

    * $ \frac{\partial h_3}{\partial W} = g'_3 * \frac {\partial z_3}{\partial W} $, where $ z_t = U x_t + W h_{t-1} + b_h $ and $ g'_t = \phi'(z_t) $

    * $ \frac{\partial h_3}{\partial W} = g'_3 * [h_2 + W \frac {\partial h_2}{\partial W}]$

    * $ \frac{\partial h_3}{\partial W} = g'_3 * [h_2 + W [g'_2 ( h_1 + W \frac {\partial h_1}{\partial W})]] $, with $ \frac{\partial h_1}{\partial W} = g'_1 * h_0 $

    * The later the time step we backpropagate from, the longer this sum becomes and the more repeated multiplications by W and g' it contains
        
6. **U**:
    * $ \frac{\partial L_3}{\partial U} = \frac{\partial L_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial h_3} \frac{\partial h_3}{\partial U} $

    * $ \frac{\partial L_3}{\partial U} = (\hat y_3 - y_3) * V * \frac{\partial h_3}{\partial U} $

    * $ \frac{\partial h_3}{\partial U} = g'_3 * [x_3 + W [g'_2 ( x_2 + W \frac {\partial h_1}{\partial U})]] $, with $ \frac{\partial h_1}{\partial U} = g'_1 * x_1 $ and $ g'_t = \phi'(z_t) $
    
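7. **bh**:
    * $ \frac{\partial L_3}{\partial b_h} = \frac{\partial L_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial h_3} \frac{\partial h_3}{\partial b_h} $

    * $ \frac{\partial L_3}{\partial b_h} = (\hat y_3 - y_3) * V * \frac{\partial h_3}{\partial b_h} $

    * $ \frac{\partial h_3}{\partial b_h} = g'_3 * [1 + W [g'_2 ( 1 + W * g'_1 )]] $, where $ g'_t = \phi'(z_t) $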
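
One way to sanity-check the derivation is to compare it with finite differences on a tiny example. The sketch below assumes a scalar RNN with tanh activation, $h_0 = 0$, and a loss at step 3 only; the parameter values and the helper names (`forward`, `analytic_grads`, `numeric_grads`) are made up for illustration. It evaluates the recursion for $\partial h_t / \partial W$ (and its U and $b_h$ counterparts) step by step rather than implementing reverse-mode backprop.

```python
import numpy as np

def forward(params, x, h0=0.0):
    """Scalar RNN: returns hidden states h_0..h_T and the last prediction."""
    U, W, V, b_h, b_y = params
    h = [h0]
    for x_t in x:
        h.append(np.tanh(U * x_t + W * h[-1] + b_h))
    return h, V * h[-1] + b_y

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y - y_hat) ** 2

def analytic_grads(params, x, y):
    U, W, V, b_h, b_y = params
    h, y_hat = forward(params, x)
    err = y_hat - y                      # dL/dy_hat
    # Accumulate dh_t/dU, dh_t/dW, dh_t/db_h forward in time with the recursion
    # dh_t/dW = phi'(z_t) * (h_{t-1} + W * dh_{t-1}/dW), and similarly for U, b_h.
    dh_dU = dh_dW = dh_db = 0.0          # h_0 is a constant
    for t, x_t in enumerate(x, start=1):
        dphi = 1.0 - h[t] ** 2           # tanh'(z_t) = 1 - tanh(z_t)^2
        dh_dU = dphi * (x_t + W * dh_dU)
        dh_dW = dphi * (h[t - 1] + W * dh_dW)
        dh_db = dphi * (1.0 + W * dh_db)
    # Order matches params: U, W, V, b_h, b_y
    return np.array([err * V * dh_dU, err * V * dh_dW, err * h[-1],
                     err * V * dh_db, err])

def numeric_grads(params, x, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grads[i] = (loss(params + step, x, y) - loss(params - step, x, y)) / (2 * eps)
    return grads

params = np.array([0.5, 0.8, 1.2, 0.1, -0.3])   # U, W, V, b_h, b_y (arbitrary values)
x = np.array([1.0, -0.5, 0.25])                 # x_1, x_2, x_3
y = 0.7                                         # target y_3

g_analytic = analytic_grads(params, x, y)
g_numeric = numeric_grads(params, x, y)
print(np.max(np.abs(g_analytic - g_numeric)))   # should be close to zero
```

If the two gradient vectors disagree, the usual suspects are the sign of the error term or a missing indirect term through the earlier hidden states.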
 

## Scenarios :

1. Gradients can explode or vanish

2. Backprop is expensive to compute for long sequences (many time steps)

## Solutions :

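1. Explode - Gradient Clipping (rescale the gradient, not the weights, when its norm is too large)
    * $g \leftarrow threshold * g / ||g||$ if $||g|| > threshold$ (with threshold = 1 this is $g / ||g||$, a unit vector)
    * g otherwise
    * The threshold is a positive hyperparameter; a range such as [-1, 1] applies to the alternative of clipping each gradient component by value
    * Clipping by norm only rescales the gradient -> its direction is preserved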
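
A minimal sketch of clipping by norm, assuming the L2 norm and a hypothetical helper named `clip_by_norm`; the example gradients are made up.

```python
import numpy as np

def clip_by_norm(g, threshold):
    """Rescale g so its L2 norm is at most threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return g * (threshold / norm)
    return g

g_large = np.array([3.0, 4.0])     # norm 5, above the threshold
g_small = np.array([0.3, 0.4])     # norm 0.5, below the threshold

clipped = clip_by_norm(g_large, threshold=1.0)
print(clipped, np.linalg.norm(clipped))        # rescaled to norm 1, same direction
print(clip_by_norm(g_small, threshold=1.0))    # returned unchanged
```

Because every component is multiplied by the same factor, the ratio between components (the direction) does not change.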

2. Vanish - Other architectures (LSTM, GRU)

3. Expensive - Truncated Back Prop
    * Run forward for k time steps, backprop through only those k steps, and update the weights

    * Carry the hidden state forward and repeat for the next k steps until the end of the sequence

    * This avoids backpropagating through the entire sequence for every update!
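
The truncated procedure can be sketched on a scalar tanh RNN. This is an illustration under assumptions, not a reference implementation: the loss is summed over the steps inside each chunk, plain gradient descent is used for the update, and the toy data, learning rate, and function names (`tbptt_chunk`, `train_tbptt`) are hypothetical.

```python
import numpy as np

def tbptt_chunk(params, x_chunk, y_chunk, h_prev):
    """Forward and backward pass over one chunk of a scalar tanh RNN.

    h_prev is treated as a constant, so no gradient flows into earlier chunks.
    """
    U, W, V, b_h, b_y = params
    h = [h_prev]
    for x_t in x_chunk:
        h.append(np.tanh(U * x_t + W * h[-1] + b_h))
    h = np.array(h)
    y_hat = V * h[1:] + b_y
    err = y_hat - y_chunk                       # dL_t/dy_hat_t for each step
    loss = 0.5 * np.sum(err ** 2)

    grads = np.zeros(5)                         # dU, dW, dV, db_h, db_y
    grads[2] = np.sum(err * h[1:])
    grads[4] = np.sum(err)
    dh_next = 0.0                               # gradient arriving from step t+1
    for t in range(len(x_chunk), 0, -1):        # walk back through this chunk only
        dh = err[t - 1] * V + dh_next
        dz = dh * (1.0 - h[t] ** 2)             # through tanh
        grads[0] += dz * x_chunk[t - 1]
        grads[1] += dz * h[t - 1]
        grads[3] += dz
        dh_next = dz * W
    return loss, grads, h[-1]

def train_tbptt(params, x, y, k, lr=0.01):
    """One pass over the sequence, updating the weights after every k steps."""
    h = 0.0
    losses = []
    for start in range(0, len(x), k):
        loss, grads, h = tbptt_chunk(params, x[start:start + k], y[start:start + k], h)
        params = params - lr * grads            # update before moving to the next chunk
        losses.append(loss)
    return params, losses

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = np.roll(x, 1)                               # toy target: the previous input (assumption)
params0 = np.array([0.5, 0.5, 0.5, 0.0, 0.0])   # U, W, V, b_h, b_y
params, losses = train_tbptt(params0, x, y, k=5)
print(len(losses))                              # one loss per chunk
```

The hidden state still carries information across chunk boundaries in the forward pass; only the backward pass is cut at k steps.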

## Why Vanishing Gradients

* We saw that $\frac{\partial L_3}{\partial W}$ depends on $\frac{\partial h_3}{\partial W}$, which depends on $\frac{\partial h_2}{\partial W}$, and so on

* Ignore tanh and Ux terms in h calculation -> $h_3 = W h_2 $

* Generally, $ h_{t+n} = W^n h_t $

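* If $h_t$ lies along an eigenvector of W with eigenvalue $\lambda$: $ || h_{t+n}|| = |\lambda|^n ||h_t|| $

* So $\frac{\partial L_3}{\partial W}$ can explode ($|\lambda| > 1$) or vanish ($|\lambda| < 1$) as the number of steps grows

* Including tanh' makes vanishing more likely: $tanh'(z) \le 1$, so every extra multiplication by it can only shrink the gradient or leave it unchanged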
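
The effect of $\lambda^n$ is easy to see numerically. The sketch below uses the simplified linear recurrence $h_{t+1} = W h_t$ with diagonal matrices, so the eigenvalues can be read off directly; the matrices and the helper name `norm_after_steps` are illustrative assumptions.

```python
import numpy as np

def norm_after_steps(W, h0, n):
    """Return ||W^n h0|| by applying W to h0 n times."""
    h = h0
    for _ in range(n):
        h = W @ h
    return np.linalg.norm(h)

h0 = np.array([1.0, 0.0])          # eigenvector of the diagonal matrices below
W_small = np.diag([0.9, 0.5])      # eigenvalue 0.9 along h0 -> norm shrinks
W_large = np.diag([1.1, 0.5])      # eigenvalue 1.1 along h0 -> norm grows

for n in (1, 10, 50):
    print(n, norm_after_steps(W_small, h0, n), norm_after_steps(W_large, h0, n))
```

Even eigenvalues this close to 1 lead to very different magnitudes after a few dozen steps, which is why long sequences are hard for a plain RNN.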
