# Word Embeddings

![](../figs/intro_nlp/embeddings/embeddings.png)

## What are word embeddings?

Word embeddings are a way of representing words as vectors. The vectors are learned from text data and capture some of the semantic and syntactic information of the words.

For example, the word `cat` is similar to `dog` because the two words appear in interchangeable contexts in the following sentences:

"The cat was lying on the floor and the dog was eating",

"The dog was lying on the floor and the cat was eating"

In a mathematical sense, a word embedding is a parameterized function of the word:

$$ f_{\theta}(w) = \theta_w $$

where the parameters $\theta$ hold one vector of real numbers per vocabulary word. The vector $\theta_w$ is the embedding of the word $w$.
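
The function above amounts to a table lookup. A minimal sketch, assuming a toy vocabulary and randomly initialized parameters (the names `vocab`, `theta`, and `embed` are illustrative, not from the text; in practice the parameters are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding size (not from the text).
vocab = ["cat", "dog", "floor"]
word2id = {w: i for i, w in enumerate(vocab)}
n_E = 4

# theta holds one n_E-dimensional row per word; here it is random, not learned.
theta = rng.normal(size=(len(vocab), n_E))

def embed(word):
    """Return the embedding of `word` by looking up its row in theta."""
    return theta[word2id[word]]

print(embed("cat").shape)  # (4,)
```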

In a broad sense, `embedding` refers to a lower-dimensional dense vector representation of a higher-dimensional object.
  - in NLP, this higher-dimensional object will be a document.
  - in computer vision, this higher-dimensional object will be an image.

Examples of embeddings and non-embeddings:

  - **Non-embeddings**:
    - one-hot encoding, bag-of-words, TF-IDF, etc.
    - counts over LIWC dictionary categories.
    - sklearn CountVectorizer count vectors
  - **Embeddings**:
    - word2vec, GloVe, BERT, ELMo, etc.
    - PCA reductions of the word count vectors
    - LDA topic shares
    - compressed encodings from an autoencoder
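
To make the distinction concrete, the sketch below builds bag-of-words count vectors (a non-embedding) and then reduces them with PCA (an embedding). It assumes a made-up three-document corpus and computes PCA through NumPy's SVD rather than any particular library's PCA class:

```python
import numpy as np

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the floor",
    "the dog sat on the floor",
    "stocks fell on the news",
]

vocab = sorted({w for d in docs for w in d.split()})
word2id = {w: i for i, w in enumerate(vocab)}

# Non-embedding: one sparse count vector per document (bag-of-words).
counts = np.zeros((len(docs), len(vocab)))
for row, d in enumerate(docs):
    for w in d.split():
        counts[row, word2id[w]] += 1

# Embedding: PCA via SVD of the centered counts, keeping k components.
k = 2
centered = counts - counts.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
doc_vectors = centered @ Vt[:k].T  # shape: (n_docs, k)

print(counts.shape, doc_vectors.shape)  # (3, 9) (3, 2)
```

The count vectors have one dimension per vocabulary word, while the PCA vectors have only `k` dimensions and are dense.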



## Categorical Embeddings

![](../figs/intro_nlp/embeddings/1.png)

Categorical embeddings are a way of representing categorical variables as vectors.

For a binary classification problem with outcome $Y$:

- If you have a high-dimensional categorical variable $X$ (e.g. 1000 categories), you can represent $X$ as a one-hot vector of length 1000.
- It is computationally expensive for an ML model to learn from a high-dimensional categorical variable.

Instead, you can represent $X$ as a lower-dimensional vector of length $k$ (e.g. 10). This is called a categorical embedding. 

Embedding approaches:

1. PCA applied to the dummy variables $X$ to get a lower-dimensional vector representation $\tilde{X}$.
2. Regress $Y$ on $X$, predict $\hat{Y}(X_i)$, use that as a feature in a new model.
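
Both approaches can be sketched with NumPy alone. The data below is randomly generated and purely hypothetical, and the sizes `n`, `n_cat`, and `k` are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, 20 categories, binary outcome.
n, n_cat, k = 200, 20, 3
x = rng.integers(0, n_cat, size=n)            # category id per observation
y = rng.integers(0, 2, size=n).astype(float)  # binary outcome

# Dummy (one-hot) matrix, shape (n, n_cat).
X = np.eye(n_cat)[x]

# Approach 1: PCA on the dummies, keeping the first k components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_tilde = Xc @ Vt[:k].T                       # shape (n, k)

# Approach 2: regress Y on the dummies and use the fitted value as a feature.
# With a full set of dummies and no intercept, OLS fits the per-category mean of Y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta                              # shape (n,)

print(X_tilde.shape, y_hat.shape)
```

In the second approach the whole category is compressed into a single number, so it is the most aggressive form of dimension reduction of the two.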



### An embedding layer is matrix multiplication:

$$
\underbrace{h_1}_{n_E \times 1} = \underbrace{\omega_E}_{n_E \times n_x} \cdot \underbrace{x}_{n_x \times 1}
$$

- $x$ = a categorical variable (e.g., representing a word)
  - One-hot vector with a single item equaling one.
  - Input to the embedding layer.
- $h_1$ = the first hidden layer of the neural net
  - The output of the embedding layer.
- The embedding matrix $\omega_E$ encodes predictive information about the categories.
- It has a spatial interpretation when projected into 2D space.
  - Each column of $\omega_E$ is a vector in $n_E$-dimensional space.
  - The columns of $\omega_E$ are the coordinates of the points in the vector space.
  - The points are the categories.
  - The distance between two points reflects how dissimilar the categories are: nearby points are similar categories.
  - The angle between two points, measured from the origin, indicates how closely the categories are related (e.g., via cosine similarity).
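
A small numerical check of the points above, assuming a randomly initialized matrix with hypothetical sizes (5 categories, 3 dimensions). Multiplying by a one-hot vector picks out one column of the matrix, and distances and angles are then computed between columns:

```python
import numpy as np

rng = np.random.default_rng(0)

n_x, n_E = 5, 3                        # hypothetical sizes
omega_E = rng.normal(size=(n_E, n_x))  # random here; learned in a real model

# One-hot input for category 2.
x = np.zeros(n_x)
x[2] = 1.0

h_1 = omega_E @ x                      # matrix product ...
assert np.allclose(h_1, omega_E[:, 2]) # ... equals picking column 2

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare category 2 with category 4 in the embedding space.
dist = np.linalg.norm(omega_E[:, 2] - omega_E[:, 4])
cos = cosine(omega_E[:, 2], omega_E[:, 4])
print(h_1.shape, dist, cos)
```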

### Embedding Layers versus Dense Layers

An embedding layer is statistically equivalent to a fully-connected dense layer with one-hot vectors as input and linear activation.

- Embedding layers are much faster for many categories (>~50)
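
The equivalence can be checked directly in PyTorch. This is a sketch with hypothetical sizes: both layers are given the same weights, and the embedding layer reaches the same output through an index lookup instead of a full matrix product, which is where the speed advantage comes from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_x, n_E = 100, 8  # hypothetical sizes: 100 categories, 8 dimensions

embedding = nn.Embedding(n_x, n_E)
dense = nn.Linear(n_x, n_E, bias=False)

# Give both layers the same parameters.
# nn.Embedding stores weights as (n_x, n_E); nn.Linear stores them as (n_E, n_x).
with torch.no_grad():
    dense.weight.copy_(embedding.weight.T)

idx = torch.tensor([3, 17, 42])
one_hot = F.one_hot(idx, num_classes=n_x).float()

out_lookup = embedding(idx)  # index lookup, shape (3, n_E)
out_dense = dense(one_hot)   # full matrix product, shape (3, n_E)

print(torch.allclose(out_lookup, out_dense))  # True
```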

## Word Embeddings

> Word embeddings are neural network layers that map words to dense vectors.


Documents are lists of word indexes $\{w_1, w_2, \ldots, w_{n_i}\}$.

- Let $w_i$ be a one-hot vector (dimensionality $n_w$ = vocab size) where the associated word’s index equals one.
- Normalize all documents to the same length $L$; shorter documents can be padded with a null token.
- This requirement can be relaxed with recurrent neural networks.

The embedding layer replaces the list of sparse one-hot vectors with a list of $n_E$-dimensional ($n_E \ll n_w$) dense vectors

$$ \mathbf{X} = [x_1 \ldots x_L ] $$

where

$$
\underbrace{x_j}_{n_E \times 1} = \underbrace{\mathbf{E}}_{n_E \times n_w} \cdot \underbrace{w_j}_{n_w \times 1}
$$

$\mathbf{E}$ is a matrix of word vectors. The column associated with the word at position $j$ is selected by the matrix product with the one-hot vector $w_j$.

$\mathbf{X}$ is flattened into an $L \cdot n_E$ vector for input to the next layer.
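
The pipeline from a document to a flattened input vector can be sketched as follows. The vocabulary, the `<pad>` token name, the sizes, and the helper `encode` are assumptions made for illustration, and $\mathbf{E}$ is random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary; index 0 is reserved for the null (padding) token.
vocab = ["<pad>", "the", "cat", "sat", "dog", "ran"]
word2id = {w: i for i, w in enumerate(vocab)}
n_w, n_E, L = len(vocab), 4, 5

E = rng.normal(size=(n_E, n_w))  # one column per word

def encode(doc):
    """Map a document to exactly L word indexes, truncating or padding."""
    ids = [word2id[w] for w in doc.split()][:L]
    return ids + [word2id["<pad>"]] * (L - len(ids))

ids = encode("the cat sat")      # [1, 2, 3, 0, 0]
X = E[:, ids]                    # columns x_1 ... x_L, shape (n_E, L)
flat = X.T.reshape(-1)           # x_1, then x_2, ..., length L * n_E

print(ids, X.shape, flat.shape)
```

Indexing the columns of `E` gives the same result as multiplying `E` by each one-hot vector, without building the one-hot vectors.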


![](../figs/intro_nlp/embeddings/4.png)


### Why do we need neural networks for word embeddings?

There are many shallow algorithms that work well for clustering and dimensionality reduction:
- k-means
- hierarchical clustering
- spectral clustering
- PCA

The reasons we use neural networks for word embeddings are:
- They are able to learn the relationships between words.
- The learned vectors can be used as input to a downstream task.
- They create a mapping of discrete words to continuous vectors.
- They mitigate the curse of dimensionality by replacing sparse, vocabulary-sized vectors with low-dimensional dense ones.

## Neural Language Models

Word embeddings were proposed by {cite}`bengio2003neural` as a way to represent words as vectors.

Bengio’s method trains a neural network such that each training sentence informs the model about a number of semantically neighboring words, an idea known as the `distributed representation of words`. The neural network preserves relationships between words in terms of their contexts (semantic and syntactic).

![](../figs/intro_nlp/embeddings/bengio.png)


This introduced a neural network architecture approach that laid the foundation for many current approaches. 

This neural network has three components:
- **Embedding layer**: maps words to vectors, the parameters are shared across the network.
- **Hidden layer**: a fully connected layer with a non-linear activation function.
- **Output layer**: produces a probability distribution over the vocabulary using a softmax function.

### Step 1: Indexing the words. 

For each word in the sentence, we assign an index.

```python
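# raw_sentence is assumed to be a list of sentence strings defined earlier.
word_list = " ".join(raw_sentence).split()
word_list = sorted(set(word_list))  # unique words, sorted so indexes are reproducible
word2id = {w: i for i, w in enumerate(word_list)}  # word -> index
id2word = {i: w for i, w in enumerate(word_list)}  # index -> word
n_class = len(word2id)  # vocabulary size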
```
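
The training code later in this section expects an `input_batch` of context word indexes and a `target_batch` of next-word indexes. One possible way to build them is sketched below; the example sentences and the helper `make_batch` are assumptions for illustration, with the first two words of each sentence used to predict the third:

```python
import torch

# Hypothetical three-word sentences; the first two words predict the third.
raw_sentence = ["i like dog", "i love coffee", "i hate milk"]

word_list = sorted(set(" ".join(raw_sentence).split()))
word2id = {w: i for i, w in enumerate(word_list)}

def make_batch(sentences):
    """Split each sentence into context word ids and a target word id."""
    inputs, targets = [], []
    for sen in sentences:
        words = sen.split()
        inputs.append([word2id[w] for w in words[:-1]])  # all but the last word
        targets.append(word2id[words[-1]])               # the last word
    return (torch.tensor(inputs, dtype=torch.long),
            torch.tensor(targets, dtype=torch.long))

input_batch, target_batch = make_batch(raw_sentence)
print(input_batch.shape, target_batch.shape)  # torch.Size([3, 2]) torch.Size([3])
```

With sentences of this shape, the number of context words per example is 2.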

### Step 2: Building the model.

```python
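import torch
import torch.nn as nn

# n_class (vocabulary size) comes from Step 1. n_step (number of context words),
# m (embedding size), and n_hidden (hidden units) are hyperparameters that must
# be set before the model is instantiated.

class NNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(n_class, m)  # embedding layer (lookup table)

        self.hidden1 = nn.Linear(n_step * m, n_hidden, bias=False)
        self.ones = nn.Parameter(torch.ones(n_hidden))  # learnable hidden-layer bias

        self.hidden2 = nn.Linear(n_hidden, n_class, bias=False)  # hidden -> output
        self.hidden3 = nn.Linear(n_step * m, n_class, bias=False)  # embeddings -> output (direct)

        self.bias = nn.Parameter(torch.ones(n_class))  # learnable output bias

    def forward(self, X):
        word_embeds = self.embeddings(X)  # [batch_size, n_step, m]
        X = word_embeds.view(-1, n_step * m)  # concatenate the context embeddings
        tanh = torch.tanh(self.ones + self.hidden1(X))  # [batch_size, n_hidden]
        # Output scores: bias + direct connection + hidden-layer connection
        output = self.bias + self.hidden3(X) + self.hidden2(tanh)  # [batch_size, n_class]
        return word_embeds, output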
```

- An embedding layer is a lookup table that maps each word to a vector.
- The embedded context words are concatenated and passed through the first hidden layer, where a bias is added.
- The output of the first hidden layer is passed through a tanh activation function.
- The concatenated embeddings are also connected directly to the output layer; that result is added to the output of the tanh layer and an output bias.


### Step 3: Loss and optimization function.

Now that we have the model, we need to define the loss function and the optimization function.

We are using the cross-entropy loss function and the Adam optimizer.

The cross-entropy loss function is made up of two parts:
- The softmax function: this is used to normalize the output of the model so that the sum of the probabilities of all the words in the vocabulary is equal to one.
- The negative log-likelihood: this is used to calculate the loss.



```python
import torch.optim as optim

model = NNLM()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
```
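
The two parts of the cross-entropy loss described above can be verified numerically. This sketch uses random scores with hypothetical shapes (3 examples, 7 vocabulary words) and compares a manual computation against `F.cross_entropy`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical scores for a batch of 3 examples over a 7-word vocabulary.
logits = torch.randn(3, 7)
target = torch.tensor([1, 0, 6])

# Part 1: softmax, computed in log space for numerical stability.
log_probs = F.log_softmax(logits, dim=1)

# Part 2: negative log-likelihood of the correct word, averaged over the batch.
manual_loss = -log_probs[torch.arange(3), target].mean()

builtin_loss = F.cross_entropy(logits, target)
print(torch.allclose(manual_loss, builtin_loss))  # True
```

Because the softmax is applied inside the loss, the model should return raw scores rather than probabilities.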


### Step 4: Training the model.

Finally, we train the model.


```python
for epoch in range(5000):
    optimizer.zero_grad()
    embeddings, output = model(input_batch)

    # output : [batch_size, n_class], target_batch : [batch_size]
    loss = criterion(output, target_batch)
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))

    loss.backward()
    optimizer.step()

# Predict: the model returns (embeddings, output), so keep only the output scores
with torch.no_grad():
    _, output = model(input_batch)
predict = output.argmax(dim=1)  # index of the highest-scoring word per example

# Test
print([sen.split()[:2] for sen in raw_sentence], '->', [id2word[n.item()] for n in predict])
```



### Summary

- Word embeddings are a way to represent words as low-dimensional dense vectors.
- The embedding vectors are learnable parameters, optimized through backpropagation together with the rest of the network.
- Essentially, the embedding layer is the first layer of a neural network.
- They try to preserve the semantic and syntactic relationships between words.

![](../figs/intro_nlp/embeddings/w2v.png)


## Word2Vec

Word2Vec is a neural network architecture that was proposed by {cite}`mikolov2013distributed` in 2013. It is a shallow, two-layer neural network that is trained to reconstruct linguistic contexts of words.

The problem with the previous neural network is that it is computationally expensive to train. Because the output layer is fully connected and followed by a softmax, it computes a probability distribution over all the words in the vocabulary for every training example.

Word2Vec solves this problem by removing hidden layers and sharing the projection layer for all the words in the vocabulary.

### Main idea

- Use a `binary classifier` to predict which words appear in the context of (i.e. near) a target word.
- The `parameters of that classifier` provide a dense vector representation of the target word (embedding).
- Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations.
- These models can be trained on large amounts of raw text (and pre-trained embeddings can be downloaded).
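
The binary-classifier idea can be sketched as below, assuming the negative-sampling formulation in which the classifier is a sigmoid over the dot product of a target vector and a context vector. The sizes, the matrix names, and the hand-picked negative samples are illustrative assumptions, and the vectors are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 10-word vocabulary, 5-dimensional vectors.
n_w, n_E = 10, 5
W_target = rng.normal(scale=0.1, size=(n_w, n_E))   # target-word vectors (the embeddings)
W_context = rng.normal(scale=0.1, size=(n_w, n_E))  # context-word vectors

def pair_probability(target_id, context_id):
    """Classifier output: probability that the context word appears near the target word."""
    return sigmoid(W_target[target_id] @ W_context[context_id])

def negative_sampling_loss(target_id, context_id, negative_ids):
    """Logistic loss for one observed pair and several sampled non-pairs."""
    pos = np.log(pair_probability(target_id, context_id))
    neg = sum(np.log(1.0 - pair_probability(target_id, j)) for j in negative_ids)
    return -(pos + neg)

loss = negative_sampling_loss(target_id=2, context_id=7, negative_ids=[1, 4, 9])
print(loss)
```

Training would adjust `W_target` and `W_context` to lower this loss; the rows of `W_target` are then used as the word embeddings.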

### Two models

- **Continuous Bag of Words (CBOW)**: predicts the target word from the context words.
- **Skip-gram**: predicts the context words from the target word.
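
The difference between the two models is easiest to see in the training examples they use. The helper `training_pairs` and the example sentence below are hypothetical, and the sketch only builds the examples, it does not train anything:

```python
def training_pairs(tokens, window=1):
    """Build (input, output) examples for CBOW and skip-gram from one sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))                 # context words -> target word
        skipgram.extend((target, c) for c in context)  # target word -> each context word
    return cbow, skipgram

cbow, skipgram = training_pairs("the cat sat".split(), window=1)
print(cbow)
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(skipgram)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```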

## References

- [Word2Vec](https://arxiv.org/abs/1301.3781)
- [A Visual Guide to FastText Word Embeddings](https://amitness.com/2020/06/fasttext-embeddings/)
- [FastText](https://fasttext.cc/)
- [Get FastText representation from pretrained embeddings with subword information](http://christopher5106.github.io/deep/learning/2020/04/02/fasttext_pretrained_embeddings_subword_word_representations.html)
- [The Ultimate Guide to Word Embeddings](https://neptune.ai/blog/word-embeddings-guide)
- [Neural Network Language Model.ipynb](https://colab.research.google.com/drive/12TQ4CmY6jUnFlQZFnKenmKL3UdTkcatx?usp=sharing#scrollTo=bxwcGfO8eI6G)