$
\newcommand{\pdv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\ipdv}[2]{\partial #1/\partial #2}
\newcommand{\dd}[1]{\,\textit{d}#1\,}
\newcommand{\softmax}[1]{\Softmax\left(#1\right)}
\newcommand{\smax}[1]{\Smax\left(#1\right)}
\newcommand{\exp}[1]{e^{#1}}
\newcommand{\grad}{\nabla}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\idm}{\mathbb{1}}  % \idm identity matrix
\newcommand{\mean}[1]{\left\langle #1 \right\rangle}
\DeclareMathOperator{\Softmax}{softmax}
\DeclareMathOperator{\expval}{\mathbb{E}}
\DeclareMathOperator{\Smax}{smax}
\DeclareMathOperator{\relu}{ReLU}
\DeclareMathOperator{\mat}{Mat}
\DeclareMathOperator{\GL}{GL}
\DeclareMathOperator{\SL}{SL}
\DeclareMathOperator{\diag}{diag}
\DeclareMathOperator{\sgn}{sgn}
\DeclareMathOperator{\lexp}{exp}
$

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Sequence-Models" data-toc-modified-id="Sequence-Models-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Sequence Models</a></span></li><li><span><a href="#Summary-of-RNN-types" data-toc-modified-id="Summary-of-RNN-types-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Summary of RNN types</a></span></li><li><span><a href="#Gated-Recurrent-Unit" data-toc-modified-id="Gated-Recurrent-Unit-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Gated Recurrent Unit</a></span></li></ul></div>

## Sequence Models

* Given a sequence $x$, its $t$-th element (here, a word) is denoted by $x^{\mean{t}}$
* The length of a sequence $x$ is denoted by $T_x$
* Given multiple examples, we denote by $x^{(m)\mean{t}}$ the $t$-th element of the $m$-th sequence, and by $T_x^{(m)}$ the length of the $m$-th sequence

We create a dictionary, or vocabulary, of words with, say, 10 thousand entries. Dictionaries with 100 thousand words are not uncommon, and some large companies use dictionaries with 1 million words.

Each word is one-hot encoded against the dictionary: if the dictionary has 10 thousand words, each word is represented by a 10-thousand-element vector that is 1 at the word's index in the dictionary and 0 everywhere else.
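
As a minimal sketch of this encoding, the snippet below uses a hypothetical five-word vocabulary instead of a 10-thousand-word one. The names `vocabulary`, `word_to_index` and `one_hot` are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Hypothetical toy vocabulary; a realistic one would hold about 10,000 words.
vocabulary = ["a", "cat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word, word_to_index):
    """Return a vector of zeros with a single 1 at the word's dictionary index."""
    vector = np.zeros(len(word_to_index))
    vector[word_to_index[word]] = 1.0
    return vector


sentence = ["the", "cat", "sat"]
# x[t - 1] plays the role of x^<t>, since Python lists are 0-indexed.
x = [one_hot(word, word_to_index) for word in sentence]
T_x = len(x)  # sequence length
```

Each vector has as many entries as the vocabulary has words, which is why a 10-thousand-word dictionary leads to 10-thousand-dimensional inputs.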

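A basic recurrent neural network (RNN) carries an activation $a$ from one time step to the next. For the first time step:

* $a^{\mean{0}} = 0$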
* $a^{\mean{1}} = g_a\Big( w_{aa}a^{\mean{0}} + w_{ax}x^{\mean{1}} + b_a \Big)$, with $g_a$ usually a $\tanh$ or $\relu$.
* $\hat{y}^{\mean{1}} = g_y\Big( w_{ya}a^{\mean{1}} + b_y \Big)$, with $g_y$ depending on the task given: for a simple classification it could be a sigmoid, for example.

In general, for any time step $t$,
\begin{equation}
    a^{\mean{t}} = g_a\Big( w_{aa}a^{\mean{t-1}} + w_{ax}x^{\mean{t}} + b_a \Big),
\end{equation}
and
\begin{equation}
    \hat{y}^{\mean{t}} = g_y\Big( w_{ya}a^{\mean{t}} + b_y \Big)
\end{equation}
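
The two recurrences above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: $g_a$ is taken to be $\tanh$, $g_y$ is taken to be a sigmoid, the sizes are tiny, and the weights are random. The function name `rnn_forward` is hypothetical.

```python
import numpy as np


def rnn_forward(x_seq, w_aa, w_ax, w_ya, b_a, b_y):
    """Apply the basic RNN recurrences to column vectors x^<1>, ..., x^<T_x>."""
    a = np.zeros((w_aa.shape[0], 1))  # a^<0> = 0
    activations, predictions = [], []
    for x_t in x_seq:
        # a^<t> = g_a(w_aa a^<t-1> + w_ax x^<t> + b_a), with g_a = tanh (assumed)
        a = np.tanh(w_aa @ a + w_ax @ x_t + b_a)
        # y_hat^<t> = g_y(w_ya a^<t> + b_y), with g_y = sigmoid (assumed)
        y_hat = 1.0 / (1.0 + np.exp(-(w_ya @ a + b_y)))
        activations.append(a)
        predictions.append(y_hat)
    return activations, predictions


# Small illustrative sizes and random parameters (assumptions, not from the notes).
rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 4, 6, 1, 3
w_aa = 0.1 * rng.normal(size=(n_a, n_a))
w_ax = 0.1 * rng.normal(size=(n_a, n_x))
w_ya = 0.1 * rng.normal(size=(n_y, n_a))
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))
x_seq = [rng.normal(size=(n_x, 1)) for _ in range(T_x)]

activations, predictions = rnn_forward(x_seq, w_aa, w_ax, w_ya, b_a, b_y)
```

Note that the same weights are reused at every time step; only the activation `a` changes as the loop advances.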

To simplify the notation, let's denote
\begin{equation}
    w_a = [w_{aa} | w_{ax}],
\end{equation}
and then
\begin{equation}
    a^{\mean{t}} = g_a\Big( w_{a}[a^{\mean{t-1}},x^{\mean{t}}] + b_a \Big).
\end{equation}

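Here $[w_{aa} | w_{ax}]$ places the two matrices side by side, and $[a^{\mean{t-1}},x^{\mean{t}}]$ stacks the two vectors vertically. So if $a$ were 100-dimensional and $x$ were 10-thousand-dimensional, then $w_{aa}$ would be $(100,100)$, $w_{ax}$ would be $(100,10000)$, and finally $w_a$ would be $(100,100+10000)$. On the other hand, $[a^{\mean{t-1}},x^{\mean{t}}]$ has dimension $(100+10000,1)$, so the product $w_{a}[a^{\mean{t-1}},x^{\mean{t}}]$ has dimension $(100,1)$, the same as $a^{\mean{t}}$.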
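
A quick numerical check of this bookkeeping, as a sketch with random weights and an arbitrarily chosen one-hot index (both are assumptions made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 100, 10_000

w_aa = rng.normal(size=(n_a, n_a))
w_ax = rng.normal(size=(n_a, n_x))
a_prev = rng.normal(size=(n_a, 1))  # stands in for a^<t-1>
x_t = np.zeros((n_x, 1))
x_t[42] = 1.0  # one-hot input; the index 42 is arbitrary

w_a = np.hstack([w_aa, w_ax])       # side by side: shape (100, 10100)
stacked = np.vstack([a_prev, x_t])  # stacked vertically: shape (10100, 1)

separate = w_aa @ a_prev + w_ax @ x_t  # original form
combined = w_a @ stacked               # compact form

print(w_a.shape, stacked.shape, combined.shape)
print(np.allclose(separate, combined))
```

The two forms agree because multiplying a block matrix by a block vector gives the sum of the block products.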


## Backpropagation through time

## Summary of RNN types

* One to one
* One to many
* Many to one
* Many to many with $T_x = T_y$
* Many to many with $T_x \neq T_y$ 


## Gated Recurrent Unit

* Memory cell denoted by $c$, with $c^{\mean{t}} = a^{\mean{t}}$.
* Candidate value for the memory cell denoted by $\tilde{c}$, with $\tilde{c}^{\mean{t}} = \tanh\Big( w_c \big[c^{\mean{t-1}},x^{\mean{t}} \big] + b_c \Big)$
* Update gate denoted by $\Gamma_u$, with $\Gamma_u = \sigma\Big( w_u \big[c^{\mean{t-1}},x^{\mean{t}} \big] + b_u \Big)$. Its entries lie between 0 and 1 and can be read as the probability of updating the memory.
* New memory cell, where $*$ denotes element-wise multiplication:
\begin{equation}
    c^{\mean{t}} = \Gamma_u * \tilde{c}^{\mean{t}}+ (1-\Gamma_u)*c^{\mean{t-1}}
\end{equation}
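
The simplified GRU update above can be sketched as a single step in NumPy. The function name `gru_step`, the small sizes and the random parameters are assumptions for illustration. The sketch covers only the update gate described in these notes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(c_prev, x_t, w_c, b_c, w_u, b_u):
    """One step of the simplified GRU: returns c^<t>, Gamma_u and c_tilde^<t>."""
    stacked = np.vstack([c_prev, x_t])      # [c^<t-1>, x^<t>]
    c_tilde = np.tanh(w_c @ stacked + b_c)  # candidate memory
    gamma_u = sigmoid(w_u @ stacked + b_u)  # update gate, entries in (0, 1)
    # Element-wise blend of the candidate and the previous memory.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t, gamma_u, c_tilde


# Small illustrative sizes and random parameters (assumptions, not from the notes).
rng = np.random.default_rng(0)
n_c, n_x = 3, 5
w_c = rng.normal(size=(n_c, n_c + n_x))
b_c = np.zeros((n_c, 1))
w_u = rng.normal(size=(n_c, n_c + n_x))
b_u = np.zeros((n_c, 1))
c_prev = rng.normal(size=(n_c, 1))
x_t = rng.normal(size=(n_x, 1))

c_t, gamma_u, c_tilde = gru_step(c_prev, x_t, w_c, b_c, w_u, b_u)

# With a strongly negative gate bias the gate is close to 0,
# so the previous memory is carried over almost unchanged.
c_kept, gate_closed, _ = gru_step(
    c_prev, x_t, w_c, b_c, np.zeros_like(w_u), np.full((n_c, 1), -100.0)
)
```

When the gate is close to 0 the cell keeps $c^{\mean{t-1}}$, and when it is close to 1 the cell is overwritten by $\tilde{c}^{\mean{t}}$; this is what lets the unit hold information over many time steps.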