# Recurrent Neural Networks
## Natural Language Processing

Author: Binghen Wang

Last Updated: 8 Jan, 2023

<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Deep Learning Basics</a> |
    <a href="./Deep Learning Optimization.ipynb">Optimization</a> |
    <a href="./Convolutional Neural Networks.ipynb">Convolutional Neural Networks</a>
    <br>
    <b>RNN navigation:</b> <a href="./Recurrent Neural Networks.ipynb">Basics</a> 
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
- [Word Embeddings](#WE)
    - [One-hot Representation vs Featurized Representation](#WE-1)
    - [Transfer Learning and Word Embeddings](#WE-2)
    - [Analogical Reasoning](#WE-3)
- [Learning Word Embeddings](#LWE)
    - [Embedding Matrix](#LWE-1)
    - [Neural Language Model](#LWE-2)
    - [The Skip-gram Model](#LWE-3)
        - [Basic Model](#LWE-3-1)
        - [Hierarchical Softmax](#LWE-3-2)
        - [Negative Sampling](#LWE-3-3)
    - [GloVe (Global Vectors for Word Representation)](#LWE-4)
- [Applying Word Embeddings](#AWE)
    - [Debiasing Word Embeddings](#AWE-1)

<a name = "WE"></a>
## Word Embeddings

<a name = "WE-1"></a>
### One-hot Representation vs Featurized Representation
Earlier language models used the **one-hot representation** of words, largely because it is easy to implement. However, the one-hot representation has a drawback: it treats each word as a separate class and encodes no relationship or similarity between words.

Consider for instance the following two sentences:
<blockquote>
    I want a glass of orange <u>juice</u>. <br>
    I want a glass of apple _____.
</blockquote>

Learning that juice usually comes after orange does not easily generalize to the case of apple.<br>

From a linguistic perspective, apple and orange are closely related in that they are both sweet fruits that contain a lot of water. To make learning more efficient and generalizable, we can use a **featurized representation**.

<div style = "text-align: center;">
    <img src="./images/word embeddings.png" style="width:80%;" >
</div>
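
As a rough illustration of the difference, the sketch below compares distances between one-hot vectors with distances between featurized vectors. The three-word vocabulary and the feature values (fruit, sweetness, royalty) are made up for this example and are not learned from data.

```python
import numpy as np

vocab = ["apple", "orange", "king"]

# One-hot vectors: one row of the identity matrix per word.
one_hot = {word: row for word, row in zip(vocab, np.eye(len(vocab)))}

# Hypothetical featurized vectors; the columns are invented features
# (is_fruit, sweetness, royalty) chosen only for illustration.
featurized = {
    "apple":  np.array([0.95, 0.80, 0.01]),
    "orange": np.array([0.97, 0.75, 0.00]),
    "king":   np.array([0.01, 0.02, 0.93]),
}

def distance(rep, a, b):
    """Euclidean distance between the vectors of words a and b."""
    return float(np.linalg.norm(rep[a] - rep[b]))

# Every pair of distinct one-hot vectors is equally far apart (sqrt(2)).
one_hot_apple_orange = distance(one_hot, "apple", "orange")
one_hot_apple_king = distance(one_hot, "apple", "king")

# With features, related words end up closer than unrelated ones.
feat_apple_orange = distance(featurized, "apple", "orange")
feat_apple_king = distance(featurized, "apple", "king")
```

With one-hot vectors, apple is exactly as far from orange as it is from king, so nothing learned about orange carries over to apple. With the featurized vectors, apple and orange are close together.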

<a name = "WE-2"></a>
### Transfer Learning and Word Embeddings
Word embeddings can make learning more efficient by:
- requiring less labelled data to train the model
- dealing better with unseen words
- using more compact representations for words (each word is represented with a shorter vector)

A popular way to train language models using word embeddings is through **transfer learning**, which takes the following steps:
1. Learn word embeddings from a large text corpus (1-100 billion words) or **download** pre-trained embeddings online.
2. **Transfer** the embeddings to a new task with a smaller training set (say, 100k words).
3. (Optional) Continue to **finetune** the word embeddings with the new data (only if there is a large volume of training data).

<a name = "WE-3"></a>
### Analogical Reasoning
<blockquote>
    <b>Man</b> is to <b>woman</b>, as <b>king</b> is to <b>queen</b>.
</blockquote>
Let $e_{\text{man}}$ denote the featurized representation of the word man. We expect:
$$
e_{\text{man}} - e_{\text{woman}} \approx e_{\text{king}} - e_{\text{queen}}
$$

**Formalized Problem**: Find word $w$ that satisfies:
$$
\text{argmax}_w \text{sim}(e_w,  e_{\text{king}} - e_{\text{man}} + e_{\text{woman}})
$$
where $\text{sim}$ is a similarity function. 

A commonly used similarity function is the **cosine similarity**:
$$
\text{sim}(u,v) = \frac{u^{\mathsf{T}}v}{\Vert u \Vert_2\Vert v \Vert_2}
$$

<div style = "text-align: center;">
    <img src="./images/cosine similarity.png" style="width:80%;" >
</div>
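
A minimal sketch of the analogy search, assuming a tiny hand-made set of 2-D embeddings (the axes are meant to read as gender and royalty). The function name `complete_analogy` is hypothetical, and excluding the three input words from the candidates is a common practical choice rather than part of the formula above.

```python
import numpy as np

def cosine_similarity(u, v):
    """sim(u, v) = u^T v / (||u||_2 ||v||_2)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-D embeddings; axes: (gender, royalty).
embeddings = {
    "man":   np.array([-1.0,  0.0]),
    "woman": np.array([ 1.0,  0.0]),
    "king":  np.array([-1.0,  1.0]),
    "queen": np.array([ 1.0,  1.0]),
    "apple": np.array([ 0.0, -0.5]),
}

def complete_analogy(a, b, c, embeddings):
    """Find w such that a is to b as c is to w.

    Maximizes sim(e_w, e_c - e_a + e_b) over all words except a, b, c
    (the exclusion is an assumption made for this sketch).
    """
    target = embeddings[c] - embeddings[a] + embeddings[b]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: cosine_similarity(embeddings[w], target))

answer = complete_analogy("man", "woman", "king", embeddings)
```

Here $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} = (1, 1)$, which coincides with the toy vector for queen, so the search returns queen.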

<a name = "LWE"></a>
## Learning Word Embeddings

<a name = "LWE-1"></a>
### Embedding Matrix
Given a vocabulary of size 10,000: \[a, aaron, $\dots$, zulu, \<UNK\>\], consider a **word embedding** of length 300. The embedding matrix is $300 \times 10,000$ and is given by:

<div style = "text-align: center;">
    <img src="./images/embedding matrix.png" style="width:80%;" >
</div>

The embedding matrix could be used to convert a **one-hot representation** into a **featurized representation**. 
$$
E\,o_{6527} = e_{6527}
$$
It can also employ vectorization and convert an entire sentence in one fell swoop.
$$
E\,[o_{6527}, o_{456}, \dots, o_{271}] = [e_{6527}, e_{456}, \dots, e_{271}]
$$
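
The two identities above can be checked numerically. This is a sketch with a randomly initialized $E$ (not trained embeddings), and the word indices are treated as 0-based purely for illustration. It also shows that multiplying by a one-hot vector is the same as selecting a column, which is why a direct lookup is usually preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 300

# Random stand-in for the embedding matrix E (300 x 10,000).
E = rng.standard_normal((embed_dim, vocab_size))

def one_hot(index, size):
    """One-hot column vector o_index of the given size."""
    o = np.zeros(size)
    o[index] = 1.0
    return o

# Single word: E o_6527 equals column 6527 of E.
e_matmul = E @ one_hot(6527, vocab_size)
e_lookup = E[:, 6527]

# Whole sentence: stack the one-hot vectors as columns, multiply once.
indices = [6527, 456, 271]
O = np.stack([one_hot(i, vocab_size) for i in indices], axis=1)  # (10000, 3)
sentence_matmul = E @ O                                          # (300, 3)
sentence_lookup = E[:, indices]                                  # same result
```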

<a name = "LWE-2"></a>
### Neural Language Model

#### Basic idea 
- Use a fixed historical window to process the input (say, a four-word history)
- Train a language model and treat the embedding matrix $E$ as a parameter of the model
- Use gradient descent and backprop to update the parameters of the model

<div style = "text-align: center;">
    <img src="./images/learning word embeddings nlm.png" style="width:50%;" >
</div>

<div class = "alert alert-block alert-info"><b>Note:</b> The goal here is to learn good embeddings, so the <b>accuracy</b> of the language model itself is of <b>less importance</b> than the embedding matrix $E$ obtained by training it.</div>

#### Other context/target pairs
<blockquote>
    I want a glass of orange <u>juice</u> to go along with my cereal.
</blockquote>

**Target**: <font color= "red">juice</font>

**Context**: 
- Last 4 words: <font color = 'blue'>a glass of orange</font>
- 4 words on the left & right: <font color = 'blue'>a glass of orange to go along with</font>
- Last 1 word: <font color = 'blue'>orange</font>
- Nearby 1 word (skip-gram model): <font color = 'blue'>glass</font>
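
The four context choices can be extracted from the example sentence as follows. This is a sketch using whitespace tokenization; the window of 5 words on each side for the nearby-word case is an assumption, and the sampled word depends on the random seed.

```python
import random

tokens = "I want a glass of orange juice to go along with my cereal".split()
t = tokens.index("juice")  # position of the target word

# Last 4 words before the target.
last_four = tokens[max(0, t - 4):t]

# 4 words on the left and 4 words on the right of the target.
four_each_side = tokens[max(0, t - 4):t] + tokens[t + 1:t + 5]

# Last 1 word before the target.
last_one = tokens[t - 1:t]

# Nearby 1 word: sample one word from a window around the target
# (window size of 5 on each side is an assumption for this sketch).
random.seed(0)
window = 5
neighbours = tokens[max(0, t - window):t] + tokens[t + 1:t + 1 + window]
nearby_one = random.choice(neighbours)
```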

<a name = "LWE-3"></a>
### The Skip-gram Model
<a name = "LWE-3-1"></a>
#### Basic Model
Instead of using the last 4 words as the context, we can use a single nearby word (e.g., a word sampled from the 10 words surrounding a given word), which gives us the **Skip-gram Model**.

<div style = "text-align: center;">
    <img src="./images/word2vec.png" style="width:30%;" >
</div>

Denote the context word (input word) as $c$ and the target word (output word) as $t$, so the word embeddings for them are $e_c$ and $e_t$ respectively. Let $\theta_j$ denote the parameter vector of the softmax layer associated with output word $j$. Then the softmax probabilities are calculated as:
$$
P(t\vert c) = \frac{\exp\left(\theta_t^{\mathsf{T}}e_c\right)}{\sum_{j=1}^{10000} \exp\left(\theta_j^{\mathsf{T}}e_c\right)}
$$

Using the one-hot representation for the labels and predictions, the **loss function** is defined by:
$$
L(\hat y, y) = - \sum_{i=1}^{10000} y_i \log \hat y_i.
$$
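
A minimal NumPy sketch of the softmax probability and the loss, assuming small random values for the embedding matrix `E` and the softmax parameters `Theta` (both untrained), and two arbitrary word indices for $c$ and $t$. Because $y$ is one-hot, the sum in the loss reduces to the single term $-\log \hat y_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 300

# Untrained stand-ins: columns of E are embeddings e_c,
# rows of Theta are the softmax parameter vectors theta_j.
E = rng.standard_normal((embed_dim, vocab_size)) * 0.01
Theta = rng.standard_normal((vocab_size, embed_dim)) * 0.01

def skipgram_probs(c):
    """P(. | c): softmax over theta_j^T e_c for every word j."""
    logits = Theta @ E[:, c]
    logits -= logits.max()          # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def cross_entropy(y_hat, t):
    """-sum_i y_i log y_hat_i with y one-hot at index t."""
    return float(-np.log(y_hat[t]))

c, t = 6257, 4834                   # arbitrary context and target indices
y_hat = skipgram_probs(c)
loss = cross_entropy(y_hat, t)
```

Note that every evaluation touches all 10,000 rows of `Theta`, which is the cost that the speedups discussed next try to avoid.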

<div class = "alert alert-block alert-info"><b>Tip on sampling the context word:</b> Avoid sampling uniformly at random from the corpus (which would result in most sampled words belonging to a small class of common words) and instead try to draw a balanced sample of common and uncommon words.</div>

<a name = "LWE-3-2"></a>
#### Hierarchical Softmax

Computing and summing 10,000 exponentials for every training example makes learning slow. Several techniques have been proposed to speed up the algorithm, one of which is the **hierarchical softmax**. It replaces the original softmax layer with a binary **Huffman tree**, so that each prediction becomes a sequence of binary decisions along the path from the root to a word. For a vocabulary of $W$ words, this reduces the cost of computing a probability from the order of $W$ to the order of $\log_2 W$. In a binary Huffman tree, common words tend to appear near the root while uncommon words are buried deep down the tree.

<div style = "text-align: center;">
    <img src="./images/hierarchical softmax.png" style="width:30%;" >
</div>

<div class = "alert alert-block alert-info"><b>A speedup technique:</b> Group words together by their frequency.</div>


<a href = "https://arxiv.org/pdf/1310.4546.pdf">Mikolov et al. (2013)</a> gave a convenient mathematical formula for the probabilities (fed into the loss function) using hierarchical softmax.
<blockquote>
    Using the notation in <a href = "https://arxiv.org/pdf/1310.4546.pdf">Mikolov et al. (2013)</a>, the <b>objective</b> of the basic Skip-gram model is
    $$
    \frac{1}{T} \sum_{t = 1}^{T} \sum_{-c\leq j \leq c, j \neq 0} \log p(w_{t+j} \vert w_t)
    $$
    where
    <ul>
        <li> $c$ is the size of the training context (which can be a function of the center word $w_t$).
    </ul>
    The <b>$p(w_O \vert w_I)$ defined using the softmax function</b> is
    $$
    p(w_O \vert w_I) = \frac{\exp\left(\theta_{w_O}^{\mathsf{T}} e_{w_I}\right)}{\sum_{w=1}^{W}\exp\left(\theta_{w}^{\mathsf{T}} e_{w_I}\right)}
    $$
    where
    <ul>
        <li> $w_O$ and $w_I$ are the output and input words;
        <li> $W$ is the number of words in the vocabulary.
    </ul>
    The <b>$p(w \vert w_I)$ defined using the hierarchical softmax</b> is
    $$
    p(w \vert w_I)= \prod_{j=1}^{L(w)-1} \sigma\left( \text{Cond}\left\{n(w,j+1) = \text{ch}(n(w,j))\right\}\cdot \theta_{n(w,j)}^{\mathsf{T}}e_{w_I}\right)
    $$
    where
    <ul>
        <li> $n(w,j)$ is the $j$-th node on the path from the root to $w$;
        <li> $L(w)$ is the length of this path, so $n(w,1)$ is the root and $n(w,L(w)) = w$;
        <li> $\text{ch}(n)$ is an <b>arbitrary</b> fixed child of $n$;
        <li> $\text{Cond}\{x\}$ is 1 if $x$ is true and -1 otherwise;
        <li> $\sigma(x) = \frac{1}{1+\exp{(-x)}}$.
    </ul>
</blockquote>
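
The hierarchical softmax probability can be sketched on a hand-built tree, assuming the indicator takes the values $+1$ and $-1$ as in Mikolov et al. (2013). The four-word vocabulary, the tree shape, and the random parameters are all made up for illustration. Each word is stored as its path of (inner node, sign) pairs, where the sign is $+1$ when the path continues to the designated child $\text{ch}(n)$ (here the left child) and $-1$ otherwise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical unbalanced tree with inner nodes 0 (root), 1, 2:
#   node 0: left -> "the",    right -> node 1
#   node 1: left -> "juice",  right -> node 2
#   node 2: left -> "orange", right -> "durian"
# Frequent words sit near the root, rare words deeper down.
paths = {
    "the":    [(0, +1)],
    "juice":  [(0, -1), (1, +1)],
    "orange": [(0, -1), (1, -1), (2, +1)],
    "durian": [(0, -1), (1, -1), (2, -1)],
}

rng = np.random.default_rng(0)
embed_dim = 8
theta = rng.standard_normal((3, embed_dim))   # one vector per inner node
e_input = rng.standard_normal(embed_dim)      # embedding of the input word

def hierarchical_softmax_prob(word):
    """Product of sigmoids along the path from the root to `word`."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * (theta[node] @ e_input))
    return float(p)

probs = {w: hierarchical_softmax_prob(w) for w in paths}
total = sum(probs.values())
```

Because $\sigma(x) + \sigma(-x) = 1$ at every inner node, the probabilities of all words sum to 1 without ever computing a normalizing sum over the vocabulary. Evaluating one word only touches the nodes on its path.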

<a name = "LWE-3-3"></a>
#### Negative Sampling
<div style = "text-align: center;">
    <img src="./images/negative sampling.png" style="width:40%;" >
</div>

The **objective** of negative sampling (to be maximized) is:
$$
\log \sigma(\theta_{w_O}^{\mathsf{T}}e_{w_I}) + \sum_{i=1}^k \mathbb{E}_{w_i\sim P_n(w)}\left[\log\sigma(-\theta_{w_i}^{\mathsf{T}}e_{w_I})\right]
$$

The choice for $k$ can be:
- 5-20 for small training datasets.
- 2-5 for large datasets.

The choice for $P_n(w)$ can be:
$$
P_n(w) = \frac{{U(w)}^{3/4}}{\sum_{j=1}^W {U(w_j)}^{3/4}}
$$
where
$U(\cdot)$ is the unigram distribution (each word's sample frequency in the corpus).
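
A small sketch of the noise distribution and the objective, assuming made-up corpus counts for a four-word vocabulary and random, untrained parameters. The expectation in the objective is approximated here by a single draw of $k$ negative words, and the function name `negative_sampling_objective` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical corpus counts for a 4-word vocabulary.
counts = np.array([1000.0, 100.0, 10.0, 1.0])
unigram = counts / counts.sum()               # U(w)

# P_n(w) is proportional to U(w)^(3/4).
noise = unigram ** 0.75
noise /= noise.sum()

def negative_sampling_objective(theta, e_input, target, negatives):
    """log sigma(theta_t^T e) + sum_i log sigma(-theta_{w_i}^T e)."""
    positive = np.log(sigmoid(theta[target] @ e_input))
    negative = np.log(sigmoid(-(theta[negatives] @ e_input))).sum()
    return float(positive + negative)

rng = np.random.default_rng(0)
embed_dim, k = 8, 5
theta = rng.standard_normal((len(counts), embed_dim))
e_input = rng.standard_normal(embed_dim)

# Draw k negative words from P_n(w). A draw may coincide with the
# target word; this sketch does not filter such cases out.
negatives = rng.choice(len(counts), size=k, p=noise)
objective = negative_sampling_objective(theta, e_input, target=2,
                                        negatives=negatives)
```

Raising the frequencies to the power $3/4$ lowers the share of the most frequent word and raises the share of the rarest one, relative to the plain unigram distribution. Each update involves only $k + 1$ rows of `theta` instead of the whole vocabulary.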

<div class = "alert alert-block alert-info"><b>Tip on sampling the context word:</b> Downsampling frequent words helps speed up the learning process. <a href = "https://arxiv.org/pdf/1310.4546.pdf">Mikolov et al. (2013)</a> suggest using the following probability to discard each word $w_i$ during training: $$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold.</div>
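
The discard probability can be evaluated for a few hypothetical word frequencies as follows. The threshold $t = 10^{-5}$ is the typical value mentioned by Mikolov et al. (2013). Clipping negative values to 0 for words with $f(w_i) < t$ is an assumption of this sketch, so that the result is a valid probability.

```python
import numpy as np

def discard_probability(freq, t=1e-5):
    """P(w_i) = 1 - sqrt(t / f(w_i)), clipped to the range [0, 1]."""
    freq = np.asarray(freq, dtype=float)
    return np.clip(1.0 - np.sqrt(t / freq), 0.0, 1.0)

# Hypothetical relative frequencies, from very common to very rare.
freqs = np.array([0.05, 1e-3, 1e-5, 1e-6])
p_discard = discard_probability(freqs)
```

Very frequent words are discarded most of the time, while words at or below the threshold frequency are always kept.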


<a name = "LWE-4"></a>
### GloVe (Global Vectors for Word Representation)
Let 
$$
X_{ij} = \text{number of times word } i \text{ appears in the context of word } j.
$$

<div class = "alert alert-block alert-success"><b>Note:</b> If we use the previous definition for the context (i.e., $\pm c$ words), then it follows that $$X_{ij}= X_{ji}.$$ But this need not be the case.</div>

The **objective** of GloVe is
$$
\min \sum_{i=1}^{W} \sum_{j=1}^{W} f(X_{ij})\left(\theta_i^{\mathsf{T}}e_j +b_i + b_j^\prime - \log X_{ij}\right)^2
$$
where $f(X_{ij})$ is the weight function, with $f(X_{ij})= 0$ if $X_{ij} = 0$. $f(X_{ij})$ can assign different weights to frequent and infrequent words.
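
A sketch of the GloVe objective as a weighted least-squares loss over the co-occurrence matrix. The weight function used here, $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and 1 otherwise with $x_{\max} = 100$ and $\alpha = 3/4$, is the one proposed in the original GloVe paper and is not stated in these notes. The function names and the toy co-occurrence matrix are hypothetical.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight function from the GloVe paper; note f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, Theta, E, b, b_prime):
    """Sum of f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2.

    Theta: (W, d) with rows theta_i; E: (d, W) with columns e_j.
    """
    mask = X > 0
    # Use log(1) = 0 as a placeholder where X_ij = 0; those terms get
    # weight 0, so the placeholder never affects the result.
    log_X = np.log(np.where(mask, X, 1.0))
    residual = Theta @ E + b[:, None] + b_prime[None, :] - log_X
    weights = np.where(mask, glove_weight(X), 0.0)
    return float((weights * residual ** 2).sum())

# Toy symmetric co-occurrence counts for a 2-word vocabulary.
X = np.array([[0.0, 4.0],
              [4.0, 0.0]])

# Parameters that fit the non-zero entries exactly: b_i + b'_j = log 4.
zero_Theta, zero_E = np.zeros((2, 3)), np.zeros((3, 2))
log2 = np.full(2, np.log(2.0))
perfect_fit_loss = glove_loss(X, zero_Theta, zero_E, log2, log2)

# Random parameters give a strictly positive loss.
rng = np.random.default_rng(0)
random_loss = glove_loss(X, rng.standard_normal((2, 3)),
                         rng.standard_normal((3, 2)),
                         rng.standard_normal(2), rng.standard_normal(2))
```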

Two important **features** of GloVe:
- $\theta_i$ and $e_j$ play **symmetric** roles in the objective. Therefore, for the final output, we can take the average of the two for each word, $e_w^{\text{(final)}} = \frac{\theta_w + e_w}{2}$.
- Representations are **not unique**, so the individual features are not interpretable. For any compatible matrix $A$ such that $A^{\mathsf{T}}A = I$,
$$
{(A\theta_i)}^{\mathsf{T}}(Ae_j) = \theta_i^{\mathsf{T}} A^{\mathsf{T}}A e_j = \theta_i^{\mathsf{T}} e_j.
$$
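
The identity above is easy to verify numerically. This sketch draws random vectors and builds an orthogonal matrix $A$ from a QR decomposition; the dimension 5 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_i = rng.standard_normal(d)
e_j = rng.standard_normal(d)

# The Q factor of a QR decomposition is orthogonal: A^T A = I.
A, _ = np.linalg.qr(rng.standard_normal((d, d)))

original = theta_i @ e_j
transformed = (A @ theta_i) @ (A @ e_j)
```

The inner product is unchanged even though the individual coordinates of $A e_j$ differ from those of $e_j$, so no single coordinate of a learned embedding can be read as a fixed, meaningful feature.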

<a name = "AWE"></a>
## Applying Word Embeddings
<a name = "AWE-1"></a>
### Debiasing Word Embeddings
Word embeddings can reflect the biases of the text corpus on which they are trained. Debiasing word embeddings so that they do not discriminate on the basis of gender, ethnicity, sexual orientation, or age is an important prerequisite for trustworthy learning algorithms. Such biases can be illustrated using analogical reasoning:
<blockquote>
    <b>Man</b> is to <b>doctor</b>, as <b>woman</b> is to <b>nurse</b>.
</blockquote>

#### A Simplified Debiasing Algorithm (from Deep Learning Specialization by DeepLearning.AI)
<div style = "text-align: center;">
    <img src="./images/debiasing word embeddings.png" style="width:80%;" >
</div>

**Steps**:
1. Identify the bias direction.
$$
\text{bias}_{\text{gender}} = \text{avg}(e_{\text{he}}-e_{\text{she}}, e_{\text{male}}-e_{\text{female}}, \dots, e_{\text{grandpa}} - e_{\text{grandma}})
$$
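2. Neutralize: For every word that is not definitional (e.g., doctor), project its embedding onto the subspace orthogonal to the bias direction to remove the bias component.
3. Equalize pairs: Adjust definitional pairs (e.g., grandpa and grandma) so that both words are equidistant from every neutralized word.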
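
Steps 1 and 2 can be sketched as follows, assuming hand-made 3-D embeddings in which the first coordinate carries the gender component. The neutralize step removes the projection of an embedding onto the bias direction $g$, that is $e - \frac{e^{\mathsf{T}} g}{\Vert g \Vert_2^2}\, g$. The equalize step is not shown here.

```python
import numpy as np

# Hypothetical 3-D embeddings; values are invented for illustration.
embeddings = {
    "he":      np.array([ 1.0, 0.2,  0.1]),
    "she":     np.array([-1.0, 0.2,  0.1]),
    "grandpa": np.array([ 0.9, 0.5, -0.3]),
    "grandma": np.array([-0.9, 0.5, -0.3]),
    "doctor":  np.array([ 0.3, 0.1,  0.8]),
}

# Step 1: bias direction as the average of definitional differences.
pairs = [("he", "she"), ("grandpa", "grandma")]
g = np.mean([embeddings[a] - embeddings[b] for a, b in pairs], axis=0)

def neutralize(e, g):
    """Remove the component of e along the bias direction g."""
    return e - (e @ g) / (g @ g) * g

# Step 2: neutralize a word that is not definitional.
doctor_debiased = neutralize(embeddings["doctor"], g)
```

After neutralization the embedding of doctor has no component along $g$, while its other coordinates are left unchanged.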


## References
- <a href = "https://www.coursera.org/learn/nlp-sequence-models/">Sequence Models</a>, **Deep Learning Specialization** (Andrew Ng, DeepLearning.AI)
- <a href = "https://arxiv.org/pdf/1310.4546.pdf">Distributed Representations of Words and Phrases and their Compositionality.</a> (Mikolov et al., 2013)