# Recurrent Neural Networks
## Natural Language Processing

Author: Binghen Wang

Last Updated: 1 Jan, 2023

<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Deep Learning Basics</a> |
    <a href="./Deep Learning Optimization.ipynb">Optimization</a> |
    <a href="./Convolutional Neural Networks.ipynb">Convolutional Neural Networks</a>
    <br>
    <b>RNN navigation:</b> <a href="./Recurrent Neural Networks.ipynb">Basics</a> 
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
- Word Embeddings
- Learning Word Embeddings
- Applying Word Embeddings

## Word Embeddings
### One-hot Representation vs Featurized Representation
Previously, the language models made use of the **one-hot representation** of words, largely because it is easy to implement. However, the one-hot representation has a drawback: it treats each word as its own class and does not capture any relationship or similarity between words.

Consider for instance the following two sentences:
<blockquote>
    I want a glass of orange <u>juice</u>. <br>
    I want a glass of apple _____.
</blockquote>

With one-hot vectors, learning that *juice* usually comes after *orange* does not easily generalize to *apple*, because the vectors for the two words share no information.<br>

From a linguistic perspective, apple and orange are closely related: both are sweet fruits that contain a lot of water. To make learning more efficient and generalizable, we can use a **featurized representation**, which describes each word by its values along a set of features.

<div style = "text-align: center;">
    <img src="./images/word embeddings.png" style="width:80%;" >
</div>
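
The contrast between the two representations can be sketched numerically. The snippet below is a minimal illustration, assuming a hypothetical four-word vocabulary and made-up feature values (fruit, royal, gender); none of the numbers come from a trained model. It shows that every pair of distinct one-hot vectors has a dot product of zero, while the featurized vectors for *apple* and *orange* are far more similar to each other than *apple* is to *king*.

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only.
vocab = ["apple", "orange", "king", "queen"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Made-up feature values; the three features are (fruit, royal, gender).
featurized = {
    "apple":  np.array([0.95, 0.01,  0.00]),
    "orange": np.array([0.97, 0.00,  0.01]),
    "king":   np.array([0.01, 0.93, -0.95]),
    "queen":  np.array([0.02, 0.95,  0.97]),
}

# One-hot: any two distinct words are orthogonal, so no similarity is encoded.
one_hot_apple_orange = one_hot("apple") @ one_hot("orange")
one_hot_apple_king = one_hot("apple") @ one_hot("king")

# Featurized: related words share large values on the same features.
feat_apple_orange = featurized["apple"] @ featurized["orange"]
feat_apple_king = featurized["apple"] @ featurized["king"]

print("one-hot:   ", one_hot_apple_orange, one_hot_apple_king)
print("featurized:", feat_apple_orange, feat_apple_king)
```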

### Transfer Learning and Word Embeddings
Word embeddings can make learning more efficient by:
- requiring less labelled data to train the model
- dealing better with unseen words
- using more compact representations for words (each word is represented with a shorter vector)

A popular way to train language models using word embeddings is through **transfer learning**, which takes the following steps:
1. Learn word embeddings from a large text corpus (1-100 billion words) or **download** pre-trained embeddings online.
2. **Transfer** the embeddings to a new task with a smaller training set (say, 100k words).
3. (Optional) Continue to **finetune** the word embeddings with the new data (only if there is a large volume of training data).
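
The steps above can be sketched on a toy scale. The example below is only an illustration under strong assumptions: the "pre-trained" embedding matrix `E` is hand-made (two hypothetical features), the downstream task is a two-sentence sentiment classifier, and the model is a plain logistic regression on averaged embeddings. The embeddings stay frozen, so step 3 is skipped. Because *great* and *awful* lie close to *good* and *bad* in the embedding space, the classifier handles them even though they never appear in the labelled data.

```python
import numpy as np

# Step 1 (stand-in for a download): hand-made 2-d embeddings, one column per word.
vocab = ["good", "great", "bad", "awful", "movie", "<UNK>"]
index = {word: i for i, word in enumerate(vocab)}
E = np.array([
    [1.0, 0.9, -1.0, -0.9, 0.0, 0.0],  # hypothetical "sentiment" feature
    [0.1, 0.2,  0.1,  0.2, 1.0, 0.0],  # hypothetical "is about film" feature
])

def featurize(sentence):
    """Average the frozen embeddings of the words in `sentence`."""
    ids = [index.get(w, index["<UNK>"]) for w in sentence.lower().split()]
    return E[:, ids].mean(axis=1)

# Step 2: a tiny labelled set for the new task (1 = positive, 0 = negative).
train = [("good movie", 1), ("bad movie", 0)]
X = np.stack([featurize(s) for s, _ in train])
y = np.array([label for _, label in train], dtype=float)

# Logistic regression trained by gradient descent; E itself is never updated.
learning_rate = 1.0
w, b = np.zeros(E.shape[0]), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted probabilities
    w -= learning_rate * (X.T @ (p - y)) / len(y)  # gradient step on the weights
    b -= learning_rate * np.mean(p - y)            # gradient step on the bias

def predict(sentence):
    """Return 1 for a positive prediction and 0 otherwise."""
    return int(featurize(sentence) @ w + b > 0)

# "great" and "awful" were not in the labelled training set.
print(predict("great movie"), predict("awful movie"))
```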

### Analogy Reasoning
<blockquote>
    <b>Man</b> is to <b>woman</b>, as <b>king</b> is to <b>queen</b>.
</blockquote>
Let $e_{\text{man}}$ denote the featurized representation of the word man. We expect:
$$
e_{\text{man}} - e_{\text{woman}} \approx e_{\text{king}} - e_{\text{queen}}
$$

**Formalized Problem**: Find the word $w^{*}$ whose embedding is most similar to $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}}$:
$$
w^{*} = \underset{w}{\arg\max}\ \text{sim}(e_w,  e_{\text{king}} - e_{\text{man}} + e_{\text{woman}})
$$
where $\text{sim}$ is a similarity function. 

A commonly used similarity function is the **cosine similarity**:
$$
\text{sim}(u,v) = \frac{u^{\mathsf{T}}v}{\Vert u \Vert_2\Vert v \Vert_2}
$$

<div style = "text-align: center;">
    <img src="./images/cosine similarity.png" style="width:80%;" >
</div>
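
As a rough illustration of the analogy search, the sketch below implements cosine similarity and the argmax over a small dictionary of made-up embeddings (the three hypothetical features are gender, royalty, and age). The helper name `solve_analogy` is illustrative. Excluding the three query words from the candidates is an assumption added here, since they would otherwise often score highest.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up embeddings; the three features are (gender, royalty, age).
embeddings = {
    "man":    np.array([-1.00,  0.01,  0.03]),
    "woman":  np.array([ 1.00,  0.02,  0.02]),
    "king":   np.array([-0.95,  0.93,  0.70]),
    "queen":  np.array([ 0.97,  0.95,  0.69]),
    "apple":  np.array([ 0.00, -0.01,  0.03]),
    "orange": np.array([ 0.01,  0.00, -0.02]),
}

def solve_analogy(a, b, c):
    """Answer "a is to b as c is to ?" by maximizing cosine similarity."""
    target = embeddings[c] - embeddings[a] + embeddings[b]
    # Assumption: the query words themselves are not valid answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

print(solve_analogy("man", "woman", "king"))
```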

## Learning Word Embeddings

### Embedding Matrix
Given a vocabulary of size 10,000: \[a, aaron, $\dots$, zulu, \<UNK\>\], consider a **word embedding** of length 300. The embedding matrix $E$ is $300 \times 10{,}000$, with one column per word, and is given by:

<div style = "text-align: center;">
    <img src="./images/embedding matrix.png" style="width:80%;" >
</div>

The embedding matrix $E$ can be used to convert the **one-hot representation** $o_j$ of word $j$ into its **featurized representation** $e_j$, which is column $j$ of $E$. For example, for word 6527:
$$
E\,o_{6527} = e_{6527}
$$
Stacking the one-hot vectors as columns vectorizes this operation and converts an entire sentence in a single matrix product:
$$
E\,[o_{6527}, o_{456}, \dots, o_{271}] = [e_{6527}, e_{456}, \dots, e_{271}]
$$
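
The two identities above can be checked numerically. The sketch below assumes a randomly initialized $E$ as a stand-in for learned embeddings and uses NumPy's 0-based indexing, so the word indices are only illustrative. It also compares the matrix product with a direct column lookup, which gives the same result without multiplying by mostly-zero vectors.

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)

# Random stand-in for a learned embedding matrix, one column per word.
E = rng.standard_normal((embed_dim, vocab_size))

def one_hot(j):
    """One-hot column vector for word index j (0-based here)."""
    o = np.zeros(vocab_size)
    o[j] = 1.0
    return o

# Single word: E o_j picks out column j of E.
e_6527 = E @ one_hot(6527)

# Whole sentence: stack the one-hot vectors as columns, then multiply once.
word_ids = [6527, 456, 271]
O = np.stack([one_hot(j) for j in word_ids], axis=1)  # shape (10000, 3)
sentence = E @ O                                      # shape (300, 3)

# Equivalent direct lookup of the same columns.
lookup = E[:, word_ids]

print(sentence.shape, np.allclose(sentence, lookup))
```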

### Neural Language Model

### Word2Vec Skip-grams Model