# 1. Introduction

## 1.1. Building Blocks

* **Activation Function**

>|Activation Function|$\hspace{50mm}$Formula|$\hspace{30mm}$Output|
|-|-|-|
|**Heaviside**|$\phi(z_i)=\bigg\{ \begin{matrix} 0&z_i<0 \\ 1&z_i\geq0 \end{matrix}$||
|**Sigmoid**|$\phi(z_i)=\frac{1}{1+\exp(-z_i)}$|$0\leq y_i(\mathbf{x}) \leq 1$|
|**Softmax**|$\phi(z_i)=\frac{\exp(z_i)}{\sum_j \exp(z_j)}$|$0\leq y_i(\mathbf{x}) \leq 1, \; \sum_i y_i(\mathbf{x})=1$|
|**tanh**|$\phi(z_i)=\frac{\exp(z_i)-\exp(-z_i)}{\exp(z_i)+\exp(-z_i)}$|$-1\leq y_i(\mathbf{x}) \leq 1$|
|**ReLU**|$\phi(z_i)=\max(0,z_i)$||
|**Noisy ReLU**|$\phi(z_i)= \max(0,z_i+\epsilon)$|$\epsilon\sim\mathcal{N}(0,\sigma^2)$|
|**Leaky ReLU**|$\phi(z_i)=\bigg\{ \begin{matrix} z_i&z_i\geq0 \\ \alpha z_i&z_i<0 \end{matrix}$|$\;$|
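
The table above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation; the default `alpha=0.01` and the max-subtraction in `softmax` (a common numerical-stability trick) are assumptions not stated in the notes.

```python
import numpy as np

def heaviside(z):
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting max(z) leaves the result unchanged but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):  # alpha: assumed slope for z < 0
    return np.where(z >= 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
y = softmax(z)  # non-negative entries that sum to one
```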

* **Pooling/Max-Out Functions** (reduce the number of parameters)

>|Pooling Function|$\hspace{70mm}$Formula|
|-|-|
|**maxout**|$\phi(y_1,y_2,y_3) = \max(y_1,y_2,y_3)$|
|**soft-maxout**|$\phi(y_1,y_2,y_3)=\log \left( \sum^3_{i=1} \exp(y_i) \right)$|
|**p-norm**|$\phi(y_1,y_2,y_3) = \left( \sum^3_{i=1} \lvert y_i \rvert^p \right)^{1/p}$|
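
A small NumPy sketch of the three pooling functions, written for an arbitrary number of inputs rather than exactly three. The default `p=2.0` is an assumption for illustration.

```python
import numpy as np

def maxout(y):
    return np.max(y)

def soft_maxout(y):
    # log-sum-exp, shifted by the maximum for numerical stability
    m = np.max(y)
    return m + np.log(np.sum(np.exp(y - m)))

def p_norm(y, p=2.0):
    return np.sum(np.abs(y) ** p) ** (1.0 / p)

y = np.array([1.0, -2.0, 2.0])
pooled = (maxout(y), soft_maxout(y), p_norm(y))
```

Note that soft-maxout is a smooth upper bound on maxout, which is why it is never smaller than the maximum.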



## 1.2. Examples of NN 

* **CNN**

>$$\phi(z_{ij}) = \phi \left( \sum_{kl} w_{kl} x_{(i-k)(j-l)} \right)$$

>* **Parameters:**
>  * **Number(depth)** of filters
>  * **Receptive Field** (height $\times$ width $\times$ depth)
>  * **Stride** / **Dilation** / **Zero-padding** 
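
The convolution sum and the stride / zero-padding parameters can be sketched for a single 2-D filter as below. This is an illustrative loop-based version (no depth dimension, no dilation), with ReLU chosen as $\phi$ purely as an example.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Single-channel 2-D convolution over the fully overlapping region."""
    x = np.pad(x, pad)            # zero-padding on all four sides
    wf = w[::-1, ::-1]            # convolution = correlation with the flipped kernel
    kh, kw = w.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            z[i, j] = np.sum(patch * wf)
    return z

x = np.arange(16.0).reshape(4, 4)
w = np.array([[1.0, 0.0],
              [0.0, 0.0]])
feature_map = np.maximum(0.0, conv2d(x, w))   # phi = ReLU (example choice)
```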

* **Autoencoders** (Non-linear Feature Extraction / can be used to **denoise** data)

>* Training criterion: $E(\boldsymbol{\theta}) = \sum^n_{p=1} f(\mathbf{x}_p,\hat{\mathbf{x}}_p)$

# 2. Network Training and Error Back Propagation

## 2.1. Training Criteria

* **Classification**

>$$\mathcal{D} = \{(\mathbf{x}_1,\mathbf{t}_1),...,(\mathbf{x}_N,\mathbf{t}_N)\}$$

>* **Least Squares Error:**

>$$E(\boldsymbol{\theta}) = \frac{1}{2} \sum^N_{p=1} ||\mathbf{y}(\mathbf{x}_p) - \mathbf{t}_p||^2 = \frac{1}{2} \sum^N_{p=1} \sum^K_{i=1} (y_i(\mathbf{x}_p) - t_{pi})^2$$

>* **Cross Entropy:**

>\begin{align}
E(\boldsymbol{\theta}) &= - \sum^N_{p=1} \sum^K_{i=1} t_{pi} \log (y_i(\mathbf{x}_p)) \\
\text{binary} \rightarrow &= - \sum^N_{p=1} (t_p \log (y(\mathbf{x}_p)) + (1-t_p) \log (1-y(\mathbf{x}_p)))
\end{align}
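
The classification criteria above can be written directly in NumPy. In this sketch `y` holds the network outputs and `t` the targets, one row per sample; the small `eps` inside the logarithm is an added safeguard against $\log 0$, not part of the formulas.

```python
import numpy as np

def least_squares(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    return -np.sum(t * np.log(y + eps))

def binary_cross_entropy(y, t, eps=1e-12):
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
ls, ce = least_squares(y, t), cross_entropy(y, t)
```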

* **Regression**

>$$\mathcal{D} = \{(\mathbf{x}_1,\mathbf{y}_1),...,(\mathbf{x}_N,\mathbf{y}_N)\}$$

>* **Least Squares Error:**

>$$E(\boldsymbol{\theta}) = \frac{1}{2} \sum^N_{p=1} (\mathbf{y}(\mathbf{x}_p) - \mathbf{y}_p)^T (\mathbf{y}(\mathbf{x}_p) - \mathbf{y}_p)$$

>* LS is equivalent to MLE with a single Gaussian: minimizing the least squares error is the same as minimizing the negative log-likelihood

>$$E(\boldsymbol{\theta}) = - \sum^N_{p=1} \log (p(\mathbf{y}_p|\mathbf{x}_p;\boldsymbol{\theta}))$$
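
A short derivation of this equivalence, as a sketch assuming a Gaussian with mean $\mathbf{y}(\mathbf{x}_p)$ and identity covariance over $d$-dimensional targets ($d$ and $E_{\text{LS}}$ are symbols introduced here for the least squares error above):

>\begin{align}
-\sum^N_{p=1} \log \mathcal{N} \left( \mathbf{y}_p; \mathbf{y}(\mathbf{x}_p), \mathbf{I} \right) &= \sum^N_{p=1} \left[ \frac{1}{2} (\mathbf{y}(\mathbf{x}_p) - \mathbf{y}_p)^T (\mathbf{y}(\mathbf{x}_p) - \mathbf{y}_p) + \frac{d}{2} \log (2\pi) \right] \\
&= E_{\text{LS}}(\boldsymbol{\theta}) + \frac{Nd}{2} \log (2\pi)
\end{align}

The additive constant does not depend on $\boldsymbol{\theta}$, so the negative log-likelihood and the least squares error share the same minimizer.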

## 2.2. Mixture Density NN

* **Predict a Mixture of $M$ Gaussians**

>$$\mathbf{y}_m (\mathbf{x}_p) = \begin{bmatrix} \mathcal{F}^{(c)}_m (\mathbf{x}_p) \\ \mathcal{F}^{(\mu)}_m (\mathbf{x}_p) \\ \mathcal{F}^{(\sigma)}_m (\mathbf{x}_p) \end{bmatrix} = \begin{bmatrix} \text{prior} \\ \text{mean} \\ \text{variance} \end{bmatrix}$$

>$$p(\mathbf{y}_p|\mathbf{x}_p;\boldsymbol{\theta}) = \sum^M_{m=1} \mathcal{F}_m^{(c)} (\mathbf{x}_p) \mathcal{N} \left( \mathbf{y}_p;\mathcal{F}_m^{(\mu)} (\mathbf{x}_p), \mathcal{F}_m^{(\sigma)} (\mathbf{x}_p) \right)$$
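
A minimal sketch of the mixture likelihood for a scalar target. How the raw network outputs are constrained is not specified in the notes; here a softmax for the priors and an exponential for the variances are assumed, and the names `raw_c`, `raw_mu`, `raw_log_var` are hypothetical.

```python
import numpy as np

def mdn_nll(raw_c, raw_mu, raw_log_var, y):
    """Negative log-likelihood of scalar y under an M-component Gaussian mixture."""
    c = np.exp(raw_c - np.max(raw_c))
    c = c / c.sum()                   # priors: positive, sum to one (assumed softmax)
    var = np.exp(raw_log_var)         # variances: positive (assumed exp)
    dens = np.exp(-0.5 * (y - raw_mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return -np.log(np.sum(c * dens))

# Two components centred at -1 and +1, evaluated at y = 1
nll = mdn_nll(np.zeros(2), np.array([-1.0, 1.0]), np.zeros(2), 1.0)
```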

## 2.3. Back Propagation

* **Single Layer Perceptron Training** (sigmoid activation, least squares error; bias ignored)

>\begin{align}
\frac{\partial E(\boldsymbol{\theta})}{\partial w_i} &= \left( \frac{\partial z}{\partial w_i} \right) \left( \frac{\partial y(\mathbf{x})}{\partial z} \right) \left( \frac{\partial E(\boldsymbol{\theta})}{\partial y(\mathbf{x})} \right) \\
\frac{\partial E^{(p)}(\boldsymbol{\theta})}{\partial w_i} &= x_{pi} \times y(\mathbf{x}_p) (1-y(\mathbf{x}_p)) \times (y(\mathbf{x}_p) - t_p)
\end{align}
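
The per-sample gradient above can be checked numerically. The sketch below assumes a sigmoid output and the least squares error $E^{(p)} = \frac{1}{2}(y(\mathbf{x}_p) - t_p)^2$, and compares the analytic expression against central finite differences; the example values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w, x, t):
    return 0.5 * (sigmoid(w @ x) - t) ** 2

def grad(w, x, t):
    y = sigmoid(w @ x)
    # dz/dw_i * dy/dz * dE/dy
    return x * y * (1.0 - y) * (y - t)

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
t = 1.0
h = 1e-6
numeric = np.array([(error(w + h * e, x, t) - error(w - h * e, x, t)) / (2 * h)
                    for e in np.eye(3)])
analytic = grad(w, x, t)
```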

* **Multiple Layer** $\rightarrow$ use **Backward Recursion**

>$$\frac{\partial E(\boldsymbol{\theta})}{\partial \mathbf{z}^{(k)}} = \boldsymbol{\delta}^{(k)} = \boldsymbol{\Lambda}^{(k)} \mathbf{W}^{(k+1)T} \boldsymbol{\delta}^{(k+1)} \;\;\;,\;\;\; \frac{\partial E(\boldsymbol{\theta})}{\partial \mathbf{W}^{(k)}} = \boldsymbol{\delta}^{(k)} \mathbf{x}^{(k)T}$$

>$$\boldsymbol{\Lambda}^{(k)} = \frac{\partial \mathbf{y}^{(k)} }{\partial \mathbf{z}^{(k)}}
\;\;\;,\;\;\;
\mathbf{W}^{(k+1)} = \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{y}^{(k)}} = \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{x}^{(k+1)}}
\;\;\;,\;\;\;
\boldsymbol{\delta}^{(k+1)} = \frac{\partial E(\boldsymbol{\theta})}{\partial \mathbf{z}^{(k+1)}}
$$

>* $\boldsymbol{\Lambda}^{(k)}$: activation derivative matrix
>* $\mathbf{W}^{(k+1)}$: weight matrix
>* $\boldsymbol{\delta}^{(k+1)}$: error vector
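
The backward recursion can be sketched for a small fully connected network. Assumptions made here for illustration: sigmoid activations in every layer (so $\boldsymbol{\Lambda}^{(k)}$ is diagonal with entries $y(1-y)$), least squares error, and no biases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Ws, x):
    xs, ys = [], []
    for W in Ws:
        xs.append(x)              # input x^(k) to layer k
        x = sigmoid(W @ x)        # output y^(k), which is the next layer's input
        ys.append(x)
    return xs, ys

def backprop(Ws, x, t):
    xs, ys = forward(Ws, x)
    grads = [None] * len(Ws)
    # Output layer: delta = Lambda * dE/dy
    delta = ys[-1] * (1 - ys[-1]) * (ys[-1] - t)
    for k in reversed(range(len(Ws))):
        grads[k] = np.outer(delta, xs[k])        # dE/dW^(k) = delta^(k) x^(k)T
        if k > 0:
            lam = ys[k - 1] * (1 - ys[k - 1])    # diagonal of Lambda^(k-1)
            delta = lam * (Ws[k].T @ delta)      # delta^(k-1)
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x = rng.normal(size=3)
t = np.array([0.0, 1.0])
grads = backprop(Ws, x, t)
```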

# 3. Optimization

## 3.1. Gradient Descent

* **Stochastic Gradient Descent**

>$$ \boldsymbol{\theta} [\tau+1] = \boldsymbol{\theta} [\tau] - \Delta \boldsymbol{\theta} [\tau] = \boldsymbol{\theta} [\tau] - \eta \frac{\partial E}{\partial \boldsymbol{\theta}} \bigg| _{\boldsymbol{\theta}[\tau]}$$

* **Batch/online Gradient Descent**

>$$E(\boldsymbol{\theta}) = - \sum_{p\in \tilde{\mathcal{D}}} \sum^K_{i=1} t_{pi} \log (y_i(\mathbf{x}_p))$$

>* **Batch-size:** small (poorly estimated gradient) / large (each update expensive)

## 3.2. Gradient Descent Refinements

* **Momentum**

>\begin{align}
\Delta \boldsymbol{\theta} [\tau] &= \eta \frac{\partial E (\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \bigg|_{\boldsymbol{\theta}[\tau]} + \alpha \Delta \boldsymbol{\theta} [\tau-1]\\
&= \eta \nabla (E(\boldsymbol{\theta}[\tau])) + \alpha \Delta \boldsymbol{\theta} [\tau-1]
\end{align}
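
A minimal sketch of the momentum update, using the same sign convention as the gradient descent rule ($\boldsymbol{\theta}[\tau+1] = \boldsymbol{\theta}[\tau] - \Delta\boldsymbol{\theta}[\tau]$). The quadratic test function and the values of `eta`, `alpha` and `steps` are arbitrary choices for illustration.

```python
import numpy as np

def gd_momentum(grad_fn, theta, eta=0.1, alpha=0.9, steps=300):
    delta = np.zeros_like(theta)                  # Delta theta[-1] = 0
    for _ in range(steps):
        delta = eta * grad_fn(theta) + alpha * delta
        theta = theta - delta
    return theta

# E(theta) = 0.5 * theta^T A theta, minimum at the origin
A = np.diag([1.0, 10.0])
theta = gd_momentum(lambda th: A @ th, np.array([1.0, 1.0]))
```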

* **Adaptive Learning Rates**

>$$\eta[\tau+1] = \bigg\{ \begin{matrix} 1.1 \eta[\tau] & \text{if } E(\boldsymbol{\theta}[\tau]) < E(\boldsymbol{\theta}[\tau-1]) \\ 
0.5 \eta[\tau] & \text{if } E(\boldsymbol{\theta}[\tau]) > E(\boldsymbol{\theta}[\tau-1])\end{matrix}$$

>* Increase $\eta$ when going in the **correct direction** (error decreased); halve it when the error increased
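
The rule can be sketched as a small helper. The factors 1.1 and 0.5 come from the formula above; leaving $\eta$ unchanged when the error is exactly equal is an assumption, since the rule does not cover that case.

```python
def adapt_learning_rate(eta, e_now, e_prev, up=1.1, down=0.5):
    if e_now < e_prev:
        return up * eta      # error decreased: speed up
    if e_now > e_prev:
        return down * eta    # error increased: back off
    return eta               # equal errors: unchanged (assumption)

eta = 0.1
eta = adapt_learning_rate(eta, e_now=0.8, e_prev=1.0)   # error went down
eta = adapt_learning_rate(eta, e_now=0.9, e_prev=0.8)   # error went up
```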

## 3.3. Second-order Approximation

* **Second-order Approximation**

>$$E(\boldsymbol{\theta}) = E(\boldsymbol{\theta}[\tau]) + (\boldsymbol{\theta} - \boldsymbol{\theta}[\tau])^T \mathbf{g} + \frac{1}{2} (\boldsymbol{\theta}-\boldsymbol{\theta}[\tau])^T \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}[\tau]) + \mathcal{O}(\|\boldsymbol{\theta} - \boldsymbol{\theta}[\tau]\|^3)$$

>$$\mathbf{g} = \nabla E(\boldsymbol{\theta}[\tau]) \;\;\;,\;\;\;
(\mathbf{H})_{ij} = h_{ij} = \frac{\partial^2 E(\boldsymbol{\theta})}{\partial \theta_i \partial \theta_j} \bigg|_{\boldsymbol{\theta}[\tau]}$$

* **Ignore Higher-order Terms** & **Equate to Zero**

>$$\nabla E(\boldsymbol{\theta}) = \mathbf{g} + \mathbf{H}(\boldsymbol{\theta} - \boldsymbol{\theta}[\tau]) = \mathbf{0}$$

>$$\boldsymbol{\theta}[\tau + 1] = \boldsymbol{\theta}[\tau] - \mathbf{H}^{-1} \mathbf{g} \;\;\;\rightarrow\;\;\; \Delta \boldsymbol{\theta}[\tau] = \mathbf{H}^{-1} \mathbf{g}$$

>* $\mathbf{H}^{-1} \mathbf{g}$: **Newton direction**

* **Issues**

>* **Computational cost:** Hessian evaluation: $\mathcal{O}(N^2)$ & Hessian inversion: $\mathcal{O}(N^3)$, where $N$ is here the number of parameters (not the number of training samples)
>* **Highly non-quadratic surface** $\rightarrow$ unstable optimization
>* $\mathbf{H}$ must be **positive-definite** (if not, $\mathbf{H}^{-1} \mathbf{g}$ might head towards a maximum or saddle point)

>$$\mathbf{v}^T \mathbf{Hv} > 0 \;\;\;\forall\;\mathbf{v} \;\;\; \rightarrow \;\;\; \tilde{\mathbf{H}} = \mathbf{H} + \lambda \mathbf{I}$$
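
A sketch of one Newton step with the optional $\lambda \mathbf{I}$ correction. It solves the linear system instead of forming $\mathbf{H}^{-1}$ explicitly, which is a common implementation choice rather than something the notes prescribe; the quadratic example is arbitrary.

```python
import numpy as np

def newton_step(g, H, lam=0.0):
    """Return Delta theta = (H + lam * I)^{-1} g."""
    H_tilde = H + lam * np.eye(H.shape[0])
    return np.linalg.solve(H_tilde, g)

# E(theta) = 0.5 * theta^T A theta - b^T theta, so g = A theta - b and H = A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
theta = np.array([5.0, -4.0])
theta_new = theta - newton_step(A @ theta - b, A)   # exact minimum in one step
```

On a truly quadratic surface a single step reaches the minimum; on a real error surface the step is only as good as the local quadratic approximation.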

* **QuickProp**

>* **Assumptions:** quadratic error surface & independent weight gradients (i.e. diagonal Hessian)

>\begin{align}
E(\theta) &\approx E(\theta[\tau]) + b(\theta - \theta[\tau]) + a(\theta - \theta[\tau])^2 \\
\\
\frac{\partial E(\theta)}{\partial \theta} &\approx b + 2a (\theta - \theta[\tau])
\end{align}

>* **Condition:** after update $\Delta\theta[\tau]$, the gradient should be zero

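>* With the update convention $\theta[\tau+1] = \theta[\tau] - \Delta\theta[\tau]$ (so $\theta[\tau-1] - \theta[\tau] = \Delta\theta[\tau-1]$):

>$$g[\tau-1] = b+2a\Delta\theta[\tau-1],\;\;\; 0 = b-2a\Delta\theta[\tau],\;\;\; g[\tau] = b$$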

>$$\rightarrow \Delta\theta[\tau] = \frac{g[\tau]}{g[\tau-1] - g[\tau]} \Delta\theta[\tau-1]$$
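
A scalar sketch of the QuickProp update, using the convention $\theta[\tau+1] = \theta[\tau] - \Delta\theta[\tau]$ and a plain gradient step to obtain the first $\Delta\theta$. The quadratic $E(\theta) = (\theta - 3)^2$ and the initial learning rate are arbitrary illustrative choices.

```python
def quickprop_step(g_now, g_prev, delta_prev):
    return g_now / (g_prev - g_now) * delta_prev

def grad(theta):
    return 2.0 * (theta - 3.0)        # gradient of E(theta) = (theta - 3)^2

theta0 = 0.0
delta0 = 0.1 * grad(theta0)           # first update: ordinary gradient descent
theta1 = theta0 - delta0
delta1 = quickprop_step(grad(theta1), grad(theta0), delta0)
theta2 = theta1 - delta1              # lands on the minimum of the quadratic
```

Because the error here is exactly quadratic, the second update reaches the minimum; on a non-quadratic surface the step is only an approximation and implementations usually cap its size.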

## 3.4. Optimization Refinement

* **Data Pre-processing:** subtract the mean, divide by the standard deviation
* **Dropout:** randomly de-activate $x$% of the nodes during training
* **Regularization**

>$$\tilde{E}(\boldsymbol{\theta}) = E(\boldsymbol{\theta}) + \nu \Omega (\boldsymbol{\theta}) \;\;\;,\;\;\; \Omega(\boldsymbol{\theta}) = \frac{1}{2} \sum^{L+1}_{l=1} \sum_{i,j} w_{ij}^{(l)2}$$

* **Network Initialization:** e.g. Gaussian random initialization $\rightarrow$ sigmoid

* **Xavier Initialization**

>$$\text{Var}(y_i) = \text{Var}(\mathbf{w}^T_i \mathbf{x}) = n\times \text{Var}(w_{ij}) \text{Var}(x_j)$$

>$$\text{Var}(w_{ij}) = \frac{1}{n}$$

>* Assume $n$-dim input with zero mean and unit variance, and independent zero-mean weights
>* Avoid having too small or too large weights
>* Prevents **exploding/vanishing gradients**
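
The variance argument can be checked empirically. This sketch draws Gaussian weights with $\text{Var}(w_{ij}) = 1/n$ and unit-variance inputs, and measures the variance of the resulting pre-activations; the layer sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_out, n_samples = 256, 128, 2000

W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n_out, n))   # Var(w_ij) = 1/n
X = rng.normal(0.0, 1.0, size=(n, n_samples))            # zero mean, unit variance
Y = W @ X

var_y = Y.var()   # should be close to n * (1/n) * 1 = 1
```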

* **Batch Normalization**

>$$\mathbf{y}_p^{(k-1)} \;\;\; \underset{\text{normalize}}{\rightarrow} \;\;\; \tilde{y}_{pj}^{(k-1)} = \frac{y^{(k-1)}_{pj} - \tilde{\mu}_j^{(k)}}{\tilde{\sigma}^{(k)}_j} \;\;\;\underset{\text{input to layer } k}{\rightarrow}$$

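>* Normalization over a batch $\tilde{\mathcal{D}}$ ($\tilde{\mu}_j^{(k)}$, $\tilde{\sigma}_j^{(k)}$: mean and standard deviation of feature $j$ over the batch)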

>$$\frac{\partial E(\boldsymbol{\theta})}{\partial \mathbf{y}^{(k-1)}} = \frac{\partial \tilde{\mathbf{y}}^{(k-1)}}{\partial \mathbf{y}^{(k-1)}} \frac{\partial E(\boldsymbol{\theta})}{\partial \tilde{\mathbf{y}}^{(k-1)}}$$

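>* Normalization at test time: **expected normalization** (expectation over training batches; $m$: batch size, so $\frac{m}{m-1}$ gives the unbiased variance estimate)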

>$$\bar{\mu}_j^{(k)} = \mathbb{E} \left[ \tilde{\mu}_j^{(k)} \right] \;\;\;,\;\;\;
\bar{\sigma}_j^{(k)2} = \frac{m}{m-1} \mathbb{E} \left[ \tilde{\sigma}_j^{(k)2} \right]$$
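
A sketch of the two normalization modes. The `eps` term added under the square root is a common safeguard against division by zero and is an assumption, as are the synthetic batches; the learned scale and shift used in practice are omitted because the notes do not mention them.

```python
import numpy as np

def batchnorm_train(Y, eps=1e-5):
    mu = Y.mean(axis=0)
    var = Y.var(axis=0)               # biased estimate (divides by m)
    return (Y - mu) / np.sqrt(var + eps), mu, var

def batchnorm_test(y, mu_bar, var_bar, eps=1e-5):
    return (y - mu_bar) / np.sqrt(var_bar + eps)

rng = np.random.default_rng(0)
m = 32                                                    # batch size
batches = [rng.normal(2.0, 3.0, size=(m, 4)) for _ in range(200)]
stats = [batchnorm_train(B)[1:] for B in batches]
mu_bar = np.mean([s[0] for s in stats], axis=0)           # E[mu~]
var_bar = m / (m - 1) * np.mean([s[1] for s in stats], axis=0)
y_test = batchnorm_test(rng.normal(2.0, 3.0, size=4), mu_bar, var_bar)
```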

# 4. Deep Learning for Sequence Data

## 4.1. Sequence I/O Pair Modelling

><img src='images/image12.png' width=500>

* **RNN - Recurrent Neural Networks** (i.e. Elman Networks)

>\begin{align}
\mathbf{h}_t &= \mathbf{f}^h (\mathbf{W}^f_h \mathbf{x}_t + \mathbf{W}^r_h \mathbf{h}_{t-1} + \mathbf{b}_h) \\
\mathbf{y}(\mathbf{x}_{1:t}) &= \mathbf{f}^f (\mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y)
\end{align}

>* $\mathbf{h}_t$ encodes $\mathbf{x}_{1:t}$

>$$\mathcal{F}(\mathbf{x}_{1:t}) = \mathcal{F}(\mathbf{x}_t,\mathbf{x}_{1:t-1}) \approx \mathcal{F}(\mathbf{x}_t,\mathbf{h}_{t-1}) \approx \mathcal{F}(\mathbf{h}_t) = \mathbf{y}(\mathbf{x}_{1:t}) = \mathbf{y}_t$$
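
A forward pass of the Elman network, as a sketch. The choices $\mathbf{f}^h = \tanh$, a softmax output for $\mathbf{f}^f$, and a zero initial state $\mathbf{h}_0$ are assumptions; the dimensions and random parameters are for illustration only.

```python
import numpy as np

def rnn_forward(xs, Wf, Wr, bh, Wy, by):
    h = np.zeros(Wr.shape[0])                 # h_0 = 0 (assumed)
    ys = []
    for x in xs:
        h = np.tanh(Wf @ x + Wr @ h + bh)     # history vector h_t
        z = Wy @ h + by
        e = np.exp(z - z.max())
        ys.append(e / e.sum())                # softmax output y_t
    return np.array(ys), h

rng = np.random.default_rng(0)
d, H, K, T = 3, 5, 2, 4
Wf, Wr, bh = rng.normal(size=(H, d)), rng.normal(size=(H, H)), np.zeros(H)
Wy, by = rng.normal(size=(K, H)), np.zeros(K)
xs = rng.normal(size=(T, d))
ys, h_T = rnn_forward(xs, Wf, Wr, bh, Wy, by)
```

Since $\mathbf{h}_t$ only summarises $\mathbf{x}_{1:t}$, changing a later input leaves all earlier outputs untouched.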

* **Jordan Networks**

>* $\mathbf{h}_t$ encodes $\mathbf{y}_{1:t}$

>$$\mathcal{F}(\mathbf{x}_{1:t},\mathbf{y}_{1:t-1}) \approx \mathcal{F}(\mathbf{x}_t,\mathbf{y}_{1:t-1}) \approx \mathcal{F}(\mathbf{x}_t,\mathbf{h}_{t-1}) \approx \mathcal{F}(\mathbf{h}_t) = \mathbf{y}_t$$

* **Bi-directional RNN**

>$$\mathcal{F}_t (\mathbf{x}_{1:T}) = \mathcal{F}(\mathbf{x}_{1:t},\mathbf{x}_{t:T}) \approx \mathcal{F}(\mathbf{h}_t, \tilde{\mathbf{h}}_t) = \mathbf{y}_t(\mathbf{x}_{1:T})$$

* **GRU - Gated Recurrent Unit**

><img src='images/image13.png' width=300>

>\begin{align}
\mathbf{i}_f &= \boldsymbol{\sigma} (\mathbf{W}^f_f \mathbf{x}_t + \mathbf{W}^r_f \mathbf{h}_{t-1} + \mathbf{b}_f)\\
\mathbf{i}_o &= \boldsymbol{\sigma} (\mathbf{W}^f_o \mathbf{x}_t + \mathbf{W}^r_o \mathbf{h}_{t-1} + \mathbf{b}_o)\\
\tilde{\mathbf{h}}_t &= \mathbf{f} (\mathbf{W}^f_h \mathbf{x}_t + \mathbf{W}^r_h (\mathbf{i}_f \odot \mathbf{h}_{t-1}) + \mathbf{b}_h) \\
\mathbf{h}_t &= \mathbf{i}_o \odot \mathbf{h}_{t-1} + (\mathbf{1} - \mathbf{i}_o) \odot \tilde{\mathbf{h}}_t
\end{align}

>* $\mathbf{i}_f$: forget gate (gating over time)
>* $\mathbf{i}_o$: output gate (gating over features and time)
>* $\odot$: element-wise multiplication
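
One GRU time step following the four equations above, as a sketch with $\mathbf{f} = \tanh$ assumed. The parameter dictionary and its key names (`Wf_f`, `Wr_f`, ...) are hypothetical and simply mirror the super/subscripts of the weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    i_f = sigmoid(p["Wf_f"] @ x + p["Wr_f"] @ h_prev + p["b_f"])
    i_o = sigmoid(p["Wf_o"] @ x + p["Wr_o"] @ h_prev + p["b_o"])
    h_tilde = np.tanh(p["Wf_h"] @ x + p["Wr_h"] @ (i_f * h_prev) + p["b_h"])
    return i_o * h_prev + (1.0 - i_o) * h_tilde

rng = np.random.default_rng(0)
d, H = 3, 4
p = {}
for gate in ("f", "o", "h"):
    p["Wf_" + gate] = rng.normal(size=(H, d))
    p["Wr_" + gate] = rng.normal(size=(H, H))
    p["b_" + gate] = np.zeros(H)
x = rng.normal(size=d)
h_prev = rng.uniform(-1.0, 1.0, size=H)
h = gru_step(x, h_prev, p)
```

When $\mathbf{i}_o$ saturates at one, the previous history is copied through unchanged, which is what lets information persist over many time steps.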

* **LSTM - Long Short-Term Memory Networks**

><img src='images/image14.png' width=200>

>* Three Gates - **Forget Gate** ($\mathbf{i}_f$), **Input Gate** ($\mathbf{i}_i$) and **Output Gate** ($\mathbf{i}_o$)

>\begin{align}
\mathbf{i}_f &= \boldsymbol{\sigma} (\mathbf{W}^f_f \mathbf{x}_t + \mathbf{W}^r_f \mathbf{h}_{t-1} + \mathbf{W}^m_f \mathbf{c}_{t-1} + \mathbf{b}_f)\\
\mathbf{i}_i &= \boldsymbol{\sigma} (\mathbf{W}^f_i \mathbf{x}_t + \mathbf{W}^r_i \mathbf{h}_{t-1} + \mathbf{W}^m_i \mathbf{c}_{t-1} +\mathbf{b}_i)\\
\mathbf{i}_o &= \boldsymbol{\sigma} (\mathbf{W}^f_o \mathbf{x}_t + \mathbf{W}^r_o \mathbf{h}_{t-1} + \mathbf{W}^m_o \mathbf{c}_{t} +\mathbf{b}_o)
\end{align}

>* **Memory Cell** and **History Vector**

>\begin{align}
\mathbf{c}_t &= \mathbf{i}_f \odot \mathbf{c}_{t-1} + \mathbf{i}_i \odot \mathbf{f}^m (\mathbf{W}^f_c \mathbf{x}_t + \mathbf{W}^r_c \mathbf{h}_{t-1} + \mathbf{b}_c) \\
\mathbf{h}_t &= \mathbf{i}_o \odot \mathbf{f}^h (\mathbf{c}_t)
\end{align}

>* Memory cell weight matrices $\mathbf{W}^m$: diagonal
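
One LSTM time step with the memory-cell (peephole) connections, as a sketch. Because the $\mathbf{W}^m$ matrices are diagonal, they are stored as vectors and applied element-wise; $\mathbf{f}^m = \mathbf{f}^h = \tanh$ is assumed, and the dictionary key names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    i_f = sigmoid(p["Wf_f"] @ x + p["Wr_f"] @ h_prev + p["wm_f"] * c_prev + p["b_f"])
    i_i = sigmoid(p["Wf_i"] @ x + p["Wr_i"] @ h_prev + p["wm_i"] * c_prev + p["b_i"])
    c = i_f * c_prev + i_i * np.tanh(p["Wf_c"] @ x + p["Wr_c"] @ h_prev + p["b_c"])
    # The output gate looks at the updated cell c_t, not c_{t-1}
    i_o = sigmoid(p["Wf_o"] @ x + p["Wr_o"] @ h_prev + p["wm_o"] * c + p["b_o"])
    h = i_o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d, H = 3, 4
p = {}
for g in ("f", "i", "o", "c"):
    p["Wf_" + g] = rng.normal(size=(H, d))
    p["Wr_" + g] = rng.normal(size=(H, H))
    p["b_" + g] = np.zeros(H)
for g in ("f", "i", "o"):
    p["wm_" + g] = rng.normal(size=H)      # diagonal of W^m, stored as a vector
x = rng.normal(size=d)
h_prev = rng.uniform(-1.0, 1.0, size=H)
c_prev = rng.uniform(-1.0, 1.0, size=H)
h, c = lstm_step(x, h_prev, c_prev, p)
```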

* **Residual Networks**

><img src='images/image08.png' width=200>

>$$\mathbf{y}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

* **Highway Connections**

><img src='images/image15.png' width=200>

>\begin{align}
\mathbf{i}_h &= \boldsymbol{\sigma} (\mathbf{W}^f_l \mathbf{x}_t + \mathbf{W}^r_l \mathbf{h}_{t-1} + \mathbf{b}_l) \\
\mathbf{h}_t &= \mathbf{i}_h \odot \mathbf{f} (\mathbf{W}^f_h \mathbf{x}_t + \mathbf{W}^r_h \mathbf{h}_{t-1} + \mathbf{b}_h) + (\mathbf{1} - \mathbf{i}_h) \odot \mathbf{x}_t
\end{align}

## 4.2. Input Sequence to Target

* **Averaging**

><img src='images/image16.png' width=300>

>$$\mathbf{h}_t = \mathcal{F}(\mathbf{x}_t) \;\;\;\rightarrow\;\;\;
\mathbf{c} = \frac{1}{T} \sum^T_{t=1} \mathbf{h}_t \;\;\;\rightarrow\;\;\;
\mathbf{h} = \mathcal{F}(\mathbf{c}) \;\;\;\rightarrow\;\;\;
\mathbf{y} = \mathcal{F}(\mathbf{h})$$

* **RNN Encoding** (Sequence Embedding)

><img src='images/image17.png' width=300>

>$$\mathbf{h}_t = \mathcal{F}(\mathbf{x}_t,\mathbf{h}_{t-1}) \;\;\;\rightarrow\;\;\;
\mathbf{h} = \mathcal{F}(\mathbf{h}_T)$$

>* Use bi-directional information to avoid focusing on later inputs ($\mathbf{h} = \mathcal{F}(\mathbf{h}_T,\tilde{\mathbf{h}}_1)$)

* **Attention Mechanism** (yields a **pmf** over the sequence)

><img src='images/image18.png' width=400>

>$$\tilde{\mathbf{h}} = \mathcal{F}(\mathbf{k}) \;\;\;\rightarrow\;\;\;
e_t = \mathcal{F} (\tilde{\mathbf{h}},\mathbf{h}_t) \;\;\;\rightarrow\;\;\;
\alpha_t = \frac{\exp(e_t)}{\sum^T_{i=1} \exp(e_i)} \;\;\;\rightarrow\;\;\;
\mathbf{c} = \sum^T_{t=1} \alpha_t \mathbf{h}_t$$

>* **Form of Attention Mechanism** (i.e. how relevant is that observation to the key)
>  * **Dot-product:** $e_t = \mathbf{h}^T_t \mathbf{W}_{xk} \tilde{\mathbf{h}}$
>  * **Additive:** $e_t = \mathbf{w}^T \tanh (\mathbf{W}_x \mathbf{h}_t + \mathbf{W}_k \tilde{\mathbf{h}})$
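
Both scoring functions and the resulting context vector can be sketched as below. Here `hs` stacks the $\mathbf{h}_t$ as rows and `h_key` stands for $\tilde{\mathbf{h}}$; all dimensions and random parameters are illustrative assumptions.

```python
import numpy as np

def softmax(e):
    z = np.exp(e - e.max())
    return z / z.sum()

def dot_product_attention(hs, h_key, Wxk):
    e = hs @ Wxk @ h_key                  # e_t = h_t^T W_xk h~
    alpha = softmax(e)                    # pmf over the T positions
    return alpha @ hs, alpha              # c = sum_t alpha_t h_t

def additive_attention(hs, h_key, w, Wx, Wk):
    e = np.array([w @ np.tanh(Wx @ h + Wk @ h_key) for h in hs])
    alpha = softmax(e)
    return alpha @ hs, alpha

rng = np.random.default_rng(0)
T, H, Hk, A = 5, 4, 3, 6
hs = rng.normal(size=(T, H))
h_key = rng.normal(size=Hk)
c_dot, a_dot = dot_product_attention(hs, h_key, rng.normal(size=(H, Hk)))
c_add, a_add = additive_attention(hs, h_key, rng.normal(size=A),
                                  rng.normal(size=(A, H)), rng.normal(size=(A, Hk)))
```

With all scores equal the weights become uniform and the context reduces to the plain average used in the averaging approach.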

## 4.3. Sequence to Sequence

* **Encoder-Decoder Sequence Models**

>\begin{align}
p(\mathbf{y}_{1:K}|\mathbf{x}_{1:L}) &= \prod^K_{i=1} p(\mathbf{y}_i |\mathbf{y}_{1:i-1}, \mathbf{x}_{1:L}) \\
&\approx \prod^K_{i=1} p(\mathbf{y}_i|\mathbf{y}_{i-1},\tilde{\mathbf{h}}_{i-1},\mathbf{c})
\end{align}

>* Map $\mathbf{x}_{1:L}$ to a fixed-length vector $\mathbf{c} = \phi(\mathbf{x}_{1:L})$
>* **RNN Encoder-Decoder Model:** $\mathbf{c} = \phi(\mathbf{x}_{1:L}) = \mathbf{h}_L$
>  * **Limitation:** context-dependence is global (the same $\mathbf{c}$ is used at every output position $i$)

><img src='images/image19.png' width=400>

* **Attention-Based Models** (introduce **attention layer**)

><img src='images/image20.png' width=400>

>$$p(\mathbf{y}_{1:K}|\mathbf{x}_{1:L}) \approx \prod^K_{i=1} p(\mathbf{y}_i | \mathbf{y}_{i-1}, \tilde{\mathbf{h}}_{i-1}, \mathbf{c}_i) \approx \prod^K_{i=1} p(\mathbf{y}_i | \tilde{\mathbf{h}}_i)$$

>$$e_{i\tau} = \mathcal{F}(\tilde{\mathbf{h}}_{i-1}, \mathbf{h}_\tau) \;\;\;\rightarrow\;\;\;
\alpha_{i\tau} = \frac{\exp(e_{i\tau})}{\sum^L_{j=1} \exp(e_{ij})} \;\;\;\rightarrow\;\;\;
\mathbf{c}_i = \sum^L_{\tau=1} \alpha_{i\tau} \mathbf{h}_{\tau}$$

>* $e_{i\tau}$: how well position $i-1$ in output matches position $\tau$ in input

* **Inference for NMT**

>$$P(w_i|\hat{w}_{0:i-1},w_{1:L}^{(s)}) \approx P(w_i|\hat{w}_{0:i-1},\mathbf{x}_{1:L}) \approx P(w_i|\hat{\mathbf{y}}_{0:i-1},\mathbf{c}_{1:i}) \approx P(w_i|\tilde{\mathbf{h}}_i)$$

>1. Embed previous word $\hat{w}_{i-1} \rightarrow \hat{\mathbf{y}}_{i-1}$
>1. Compute context information $\mathbf{c}_i = \mathcal{F} (\mathbf{x}_{1:L}, \tilde{\mathbf{h}}_{i-1})$
>1. Generate new history vector $\tilde{\mathbf{h}}_i = \mathcal{F} (\tilde{\mathbf{h}}_{i-1}, \hat{\mathbf{y}}_{i-1},\mathbf{c}_i)$
>1. Given $\tilde{\mathbf{h}}_i$, compute pmf over embedding space $\mathbf{y}$
>1. Draw $\hat{w}_i$ from pmf
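
The five steps can be sketched as a toy decoding loop. Everything concrete here is an assumption made for illustration: the plain dot-product attention score (which requires encoder and decoder states of equal size), the $\tanh$ recurrence, the `sos`/`eos` word ids, and the random untrained parameters. Passing `rng=None` picks the most probable word instead of sampling.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode(hs, params, max_len=10, sos=0, eos=1, rng=None):
    """hs: encoder states (L, H). Returns the list of generated word ids."""
    E, Wh, Wy, Wc, Wout = params
    h = np.zeros(Wh.shape[0])
    w_prev, words = sos, []
    for _ in range(max_len):
        y_prev = E[w_prev]                            # 1. embed previous word
        alpha = softmax(hs @ h)                       # 2. attention weights over the input
        c = alpha @ hs                                #    context vector c_i
        h = np.tanh(Wh @ h + Wy @ y_prev + Wc @ c)    # 3. new history vector
        pmf = softmax(Wout @ h)                       # 4. pmf over output words
        if rng is None:                               # 5. choose the next word
            w_prev = int(np.argmax(pmf))
        else:
            w_prev = int(rng.choice(len(pmf), p=pmf))
        if w_prev == eos:
            break
        words.append(w_prev)
    return words

rng = np.random.default_rng(0)
V, D, H, L = 6, 4, 5, 7
params = (rng.normal(size=(V, D)), rng.normal(size=(H, H)), rng.normal(size=(H, D)),
          rng.normal(size=(H, H)), rng.normal(size=(V, H)))
hs = rng.normal(size=(L, H))
greedy = decode(hs, params)
sampled = decode(hs, params, rng=rng)
```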