# 1 - Natural Language Processing
<hr>

In language modeling, we have seen that we can represent words from a dictionary as vectors using one-hot encoding where all components are zero except for one.

<br>

<div style="text-align:center">
    <img src="images/words-one-hot.png" width=500>
    <caption><center><font color="purple">One-hot vectors</font></center></caption>
</div>

The advantage of such an encoding is that computing the vector for a word, and looking up the word for a given vector, is easy. On the other hand, this form of encoding does not contain any information about how words relate to each other. An alternative kind of word vector is the word embedding. In such vectors, each component reflects a different feature of a word's meaning (e.g. age, sex, food/non-food, word type, etc.), so all components can have non-zero values. Words that are semantically similar have similar values in the individual components. For visualization we can also reduce the dimensionality to two (or three) dimensions, e.g. by applying the [t-SNE algorithm](https://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf). Doing so shows that words with similar meanings end up close to each other in vector space.

<br>

<div style="text-align:center">
    <img src="images/vector-space.png" width=300>
    <caption><center><font color="purple">Words in a vector space</font></center></caption>
</div>



## 2 - Properties of word embeddings
<hr>

Word embeddings have become hugely popular in NLP and can, for example, be used for named entity recognition (NER). Often an existing model can be adapted to a specific task by performing additional training on suitable training data (transfer learning). The task-specific training set can be much smaller than the corpus the embeddings were learned from, and the embeddings have far fewer dimensions than one-hot vectors. The role of a word embedding $e$ is similar to that of a face encoding in face recognition in computer vision: it is a vectorized representation of the underlying data. An important distinction, however, is that in order to get word embeddings, a model needs to learn a fixed-size vocabulary. Vectors for words outside this vocabulary cannot be calculated. In contrast, a CNN could calculate a vector for a face it has never seen before.

Word embeddings are useful to model analogies and relationships between words.

$$e_{man} - e_{woman} \approx e_{king} - e_{queen}$$

The difference between the vectors for "man" and "woman" is similar to the difference between the vectors for "king" and "queen", because those two pairs of words are related in the same way. We can also observe that a trained model has learned the relationship between these two pairs of words because their difference vectors are approximately parallel. This also applies to other kinds of word pairings, like verbs in different tenses or the relationship between a country and its capital:

<br>

<div style="text-align:center">
    <img src="images/word-embeddings.png" width=900>
</div>

Therefore we could get the following equation:

$$e_{king} - e_{man} + e_{woman} \approx e_{queen}$$

This way the word embedding for "queen" can be approximated using the embeddings of the other words. To find the word that belongs to an embedding we can use a similarity function $\text{sim}$, which measures the similarity between two embeddings $u$ and $v$. Often the **cosine similarity** is used for this function:

$$\text{sim}(u, v) = \frac{u^T v}{||u||_2 ||v||_2}$$
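As a minimal sketch of the formula above, the cosine similarity can be written in a few lines of plain Python. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return u.v / (||u||_2 * ||v||_2) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented purely for illustration.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # exactly 0
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # close to -1
```

Because the result depends only on the angle between the vectors, scaling a vector does not change its similarity to other vectors.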

With the help of the similarity function we can find the word "queen" by comparing the computed vector $e_{king} - e_{man} + e_{woman}$ against the embeddings $e_w$ of all words $w$ in the vocabulary and picking the most similar one:

$$w^{*} = \underset{w}{\text{argmax}} \ \text{sim} \left( e_w, e_{king} - e_{man} + e_{woman} \right)$$
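The search over the vocabulary could be sketched as follows. The 3-dimensional embeddings are invented toy values, `complete_analogy` is a hypothetical helper name, and excluding the three input words from the candidates is an assumption (a common practical choice, not something stated above).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def complete_analogy(a, b, c, embeddings):
    """Find w such that 'a is to b as c is to w', i.e. e_w ~ e_b - e_a + e_c."""
    query = [e_b - e_a + e_c
             for e_a, e_b, e_c in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Assumption: the three input words are excluded from the search.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], query))


# Toy embeddings (hypothetical values, not taken from a trained model).
embeddings = {
    "man":   [1.0, 0.1, 0.0],
    "woman": [-1.0, 0.1, 0.0],
    "king":  [0.9, 1.0, 0.1],
    "queen": [-0.9, 0.9, 0.1],
    "apple": [0.0, -0.2, 1.0],
}
answer = complete_analogy("man", "king", "woman", embeddings)
```

With these toy values the query vector points in almost the same direction as the vector for "queen", so `answer` is `"queen"`.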


### 2.1 - Embedding matrix

The embeddings of the words in the vocabulary can be precomputed and stored in an **embedding matrix** denoted by $E$. This is efficient because the learned embeddings don't need to be computed each time. The embedding $e_j$ of a word $j$ from the vocabulary can easily be retrieved by multiplying its one-hot encoding $O_j$ with the embedding matrix:

$$e_j = E \cdot O_j$$

For instance, given $E$ and the one-hot encoding $O_{6257}$, we can compute $e_{6257}$ by multiplying $E$ with $O_{6257}$ as:

$$
e_{6257} = E \cdot O_{6257} = 
\begin{bmatrix}
e_{11} & e_{12} & \cdots & e_{1n} \\
e_{21} & e_{22} & \cdots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \cdots & e_{mn}
\end{bmatrix} 
\cdot
\begin{bmatrix}
0 \\
0 \\
\vdots \\
1 \\
\vdots \\
0
\end{bmatrix}
=
\begin{bmatrix}
e_{1,6257} \\
e_{2,6257} \\
\vdots \\
e_{m,6257}
\end{bmatrix}
$$

If our vocabulary contains 10,000 words and each word is encoded as a 300-dimensional vector, then $E$ is a $300 \times 10,000$ matrix. Multiplying it with the 10,000-dimensional one-hot vector $O_{6257}$ yields a $300 \times 1$ embedding.

$$
e_{6257} = E \cdot O_{6257} = 
\begin{bmatrix}
\text{a} & \text{aaron} & \cdots & \text{orange}_{6257} & \cdots & \text{zulu} \\
\text{a} & \text{aaron} & \cdots & \text{orange}_{6257} & \cdots & \text{zulu} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\end{bmatrix}_{ \ 300 \times 10,000} 
\cdot
\begin{bmatrix}
0 \\
0 \\
\vdots \\
1_{6257} \\
\vdots \\
0
\end{bmatrix}_{ \ 10,000 \times 1}
=
\begin{bmatrix}
e_{1,6257} \\
e_{2,6257} \\
\vdots \\
e_{m,6257}
\end{bmatrix}_{ \ 300 \times 1}
$$

Since most of the products in this multiplication are zero, in practice we use specialized functions that look up the embedding (the corresponding column of $E$) directly.
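To illustrate why a lookup is equivalent to the matrix product, here is a small sketch with a toy $3 \times 5$ matrix stored as a list of rows. The helper names and all numbers are assumptions for illustration, and indices are 0-based as usual in Python.

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def embed_by_matmul(E, o):
    """e = E . o, computed as a full matrix-vector product."""
    return [sum(e_ij * o_j for e_ij, o_j in zip(row, o)) for row in E]


def embed_by_lookup(E, j):
    """Read column j of E directly, skipping all multiplications by zero."""
    return [row[j] for row in E]


# Toy embedding matrix: 3 features (rows) x 5 vocabulary words (columns).
E = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [-1.0, -2.0, -3.0, -4.0, -5.0],
]
j = 3
via_matmul = embed_by_matmul(E, one_hot(j, 5))
via_lookup = embed_by_lookup(E, j)
```

Both calls return the same column of `E`, but the product touches every entry of the matrix while the lookup only reads one entry per row.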

## 3 - Learning Word Embeddings
<hr>

### 3.1 - Word2Vec

Word2Vec (W2V) is a prevalent model for learning word embeddings, offering two distinct approaches:

1. Skip-Gram
2. CBOW (Continuous Bag Of Words)

#### <font color="purple">Skip-Gram</font>

The Skip-Gram model operates on the principle of predicting surrounding words given a specific word in a sentence. Here, we differentiate between two types of words:

- **Context Word:** The word based on which predictions are made.
- **Target Words:** The surrounding words within a specified window that the model attempts to predict.

For a given context word, the Skip-Gram model aims to predict the target words within a defined window size. For instance, if the window size is 5, the model predicts the 5 words before and after the context word.
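A minimal sketch of how such (context, target) training pairs could be enumerated from a tokenized sentence is shown below. The function name `skipgram_pairs` and the example sentence are assumptions; practical implementations often sample pairs rather than enumerating all of them.

```python
def skipgram_pairs(tokens, window):
    """Pair every word (context) with each neighbour within `window` positions (target)."""
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own target
                pairs.append((context, tokens[j]))
    return pairs


sentence = "i want a glass of orange juice".split()
pairs = skipgram_pairs(sentence, window=2)
orange_pairs = [p for p in pairs if p[0] == "orange"]
```

With a window of 2, the context word "orange" is paired with "glass", "of" and "juice"; there is only one target to its right because the sentence ends there.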

Consider a vocabulary of 10,000 words. For training, we select a pair of words: a context word "orange" and a target word "juice". The embeddings of these words, denoted as $e_c$ (for "orange") and $e_t$ (for "juice"), are derived using the embedding matrix $E$ and one-hot encoding $O_j$:

$$e_j = E \cdot O_j$$

In the case of predicting the target word "juice" from the context word "orange", the embedding $e_c$ is input into a softmax unit to compute the probability of each word in the vocabulary being the target word:

$$p(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000} e^{\theta_j^T e_c}}$$

Here, $\theta_t$ represents the parameters of the softmax function for the target word.

The output $\hat{y}$ is a probability distribution across the entire vocabulary. The training objective is to adjust the parameters (the embedding matrix $E$ and the softmax weights $\theta$) to maximize the likelihood of the actual target word. The loss function, based on negative log-likelihood, is defined as:

$$\mathcal{L}(\hat{y}, y) = - \sum_{i=1}^{10,000} y_i \log{\hat{y}_i}$$

where $y$ is the one-hot encoding of the actual target word.
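The softmax and the loss can be sketched for a toy vocabulary of four words and two-dimensional embeddings. All parameter values are invented, and subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not part of the formula itself.

```python
import math


def softmax_probs(theta, e_c):
    """p(t | c) for every word t: softmax over the scores theta_t . e_c."""
    scores = [sum(th * e for th, e in zip(theta_t, e_c)) for theta_t in theta]
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def cross_entropy(y_hat, target_index):
    """For one-hot y, -sum_i y_i log(y_hat_i) reduces to -log(y_hat[target])."""
    return -math.log(y_hat[target_index])


# Toy parameters: one theta vector per vocabulary word (hypothetical values).
theta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_c = [2.0, 1.0]  # embedding of the context word

y_hat = softmax_probs(theta, e_c)
loss = cross_entropy(y_hat, target_index=3)
```

Note that computing `y_hat` requires one score per vocabulary word, which is what becomes expensive for a vocabulary of 10,000 words or more.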

While effective, the Skip-Gram model faces computational challenges, particularly with the softmax function over large vocabularies. To mitigate this, hierarchical softmax can be used. This method applies a binary tree structure to reduce the complexity of the probability distribution computation, leading to faster processing and scalability.

### 3.2 - Negative Sampling

Negative Sampling offers a more efficient way to learn word embeddings than the Skip-Gram model with a full softmax, particularly for large vocabularies. It reformulates the learning problem by generating a training set comprising both valid (positive) and artificially generated invalid (negative) context-target pairs.

For each valid context-target word pair like (orange, juice), we generate $k$ negative samples, leading to a training set of $k+1$ samples. That is, the samples might look like this:

| context | word | target? |
| ---- | ---- | ---- |
| orange | juice | 1 |
| orange | king | 0 |
| orange | book | 0 |
| orange | the | 0 |
| orange | of | 0 |

Here's how it works:

- **Positive Sample Generation:** For a given context word, a target word is sampled from its surrounding window, creating a valid pair. This pair is labeled as "1", indicating a genuine context-target relationship like (orange, juice).
- **Negative Sample Generation:** Next, $k$ additional target words are randomly chosen from the entire vocabulary, irrespective of their actual context. These pairs are labeled as "0", representing false context-target pairs. The typical range for $k$ is 5-20 for smaller datasets and 2-5 for larger ones.

The resultant training set consists of $k+1$ word pairs, turning the learning problem into binary classification. In each iteration, only the $k+1$ binary classifiers for these pairs are trained rather than a softmax over the entire vocabulary as in Skip-Gram.
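The two generation steps above can be sketched as follows. The function name, the tiny vocabulary and the seed are assumptions, and the negatives are drawn uniformly here for simplicity; sampling according to word frequency is a common alternative.

```python
import random


def negative_sampling_examples(context, target, vocabulary, k, rng):
    """Return one positive example (label 1) followed by k negative examples (label 0)."""
    examples = [(context, target, 1)]
    for _ in range(k):
        # Negatives are drawn from the whole vocabulary, irrespective of context,
        # so a sampled word may occasionally be a genuine neighbour.
        examples.append((context, rng.choice(vocabulary), 0))
    return examples


vocabulary = ["orange", "juice", "king", "book", "the", "of"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
examples = negative_sampling_examples("orange", "juice", vocabulary, k=4, rng=rng)
```

Each call yields $k+1$ labelled pairs for one context word, matching the layout of the table above.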

The probability of a target word $t'$ occurring in the context of word $c$ is modeled as:

$$P(y=1 \mid c, t') = \sigma \left( \theta^{T}_{t'} e_c \right)$$

Here, $\sigma$ denotes the sigmoid function, $\theta_{t'}$ the parameter vector for the target word $t'$, and $e_c$ the embedding of the context word $c$.

The objective is to adjust $\theta_{t'}$ and $e_c$ to minimize the cost function, so that $P(y=1 \mid c,t')$ is close to 1 for true pairs and close to 0 for negative samples.
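The text does not spell out the cost function; a minimal sketch, assuming the usual binary cross-entropy (logistic loss) summed over the one positive and the $k$ negative pairs, could look like this. All vectors are invented toy values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pair_probability(theta_t, e_c):
    """P(y = 1 | c, t') = sigmoid(theta_t' . e_c)."""
    return sigmoid(sum(t * e for t, e in zip(theta_t, e_c)))


def negative_sampling_loss(e_c, theta_pos, theta_negs):
    """Logistic loss over one positive pair and the sampled negative pairs (assumed cost)."""
    loss = -math.log(pair_probability(theta_pos, e_c))
    for theta_neg in theta_negs:
        loss -= math.log(1.0 - pair_probability(theta_neg, e_c))
    return loss


e_c = [1.0, 2.0]                           # context embedding (toy values)
theta_pos = [1.0, 1.0]                     # parameters of the true target word
theta_negs = [[-1.0, -1.0], [0.0, -1.0]]   # parameters of two sampled negatives

p_pos = pair_probability(theta_pos, e_c)
good_loss = negative_sampling_loss(e_c, theta_pos, theta_negs)
# Swapping the roles makes the model confidently wrong, so the loss grows.
bad_loss = negative_sampling_loss(e_c, theta_negs[0], [theta_pos])
```

Only the $k+1$ parameter vectors involved in the sampled pairs appear in the loss, which is what makes each update cheap.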

### 3.3 - GloVe

Global Vectors for Word Representation (GloVe) is an alternative word embedding method to Word2Vec (W2V), notable for its simplicity and effectiveness. Unlike W2V, which relies on local context information, GloVe focuses on word co-occurrence statistics across the entire corpus.

<b><font color="purple">Co-Occurrence Matrix</font></b>

GloVe constructs a co-occurrence matrix where, for a given word $i$, it counts the occurrences of every other word $j$ within a defined context (like a window around word $i$). This generates a co-occurrence value $x_{ij}$ for each word pair.
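Counting co-occurrences within a symmetric window could be sketched as below. The function name and the toy token sequence are assumptions, and the counts are stored sparsely in a `Counter` keyed by word pairs instead of a dense matrix.

```python
from collections import Counter


def cooccurrence_counts(tokens, window):
    """Count x_ij: how often word j appears within `window` positions of word i."""
    counts = Counter()
    for pos, word_i in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                counts[(word_i, tokens[other])] += 1
    return counts


tokens = "orange juice apple juice orange juice".split()
x = cooccurrence_counts(tokens, window=1)
```

With a symmetric window the counts are symmetric as well, so `x[("orange", "juice")]` equals `x[("juice", "orange")]`; pairs that never co-occur, such as ("orange", "apple") here, simply have a count of zero.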

<b><font color="purple">Objective Function</font></b>

GloVe aims to minimize the following cost function, for a vocabulary of 10,000 words:

$$\text{minimize} \sum_{i=1}^{10,000} \sum_{j=1}^{10,000} f(x_{ij}) \left( \theta^T_i e_j + b_i + b'_j - \log{x_{ij}} \right)^2$$

Here, $\theta_i$ and $e_j$ are the word vectors for words $i$ and $j$, respectively, while $b_i$ and $b'_j$ are scalar biases for these words.

<b><font color="purple">Weighting Function $f(x_{ij})$</font></b>

The function $f(x_{ij})$ is a weighting term with specific characteristics:

- It is zero if $x_{ij}=0$, effectively filtering out pairs of words that never co-occur and avoiding undefined logarithms.
- It assigns higher weights to more frequent words, but not disproportionately high to prevent common words (like stop words) from dominating.
- It gives some weight to less frequent words, ensuring that they still contribute meaningfully to the embeddings.
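One weighting function with these properties is the one proposed in the original GloVe paper, $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise, with $x_{\max} = 100$ and $\alpha = 0.75$. The sketch below uses it together with the cost function above; the tiny matrices and parameter values are invented for illustration.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting term f(x): zero for x = 0, growing sub-linearly, capped at 1."""
    if x <= 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)


def glove_cost(x, theta, e, b, b_prime):
    """Sum of f(x_ij) * (theta_i . e_j + b_i + b'_j - log x_ij)^2 over all word pairs."""
    total = 0.0
    for i, row in enumerate(x):
        for j, x_ij in enumerate(row):
            if x_ij == 0:
                continue  # f(0) = 0, so the term vanishes and log(0) is never evaluated
            dot = sum(t * v for t, v in zip(theta[i], e[j]))
            total += weight(x_ij) * (dot + b[i] + b_prime[j] - math.log(x_ij)) ** 2
    return total


# Toy problem with two words that co-occur four times (hypothetical values).
x = [[0, 4], [4, 0]]
theta = [[1.0, 0.0], [0.0, 1.0]]
e = [[1.0, 0.0], [0.0, 1.0]]

cost = glove_cost(x, theta, e, b=[0.0, 0.0], b_prime=[0.0, 0.0])
# Biases chosen so that every non-zero pair is fitted exactly.
perfect_cost = glove_cost(x, theta, e, b=[math.log(4), math.log(4)], b_prime=[0.0, 0.0])
```

Skipping the zero entries also means the double sum only needs to visit word pairs that actually co-occur.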

<b><font color="purple">Symmetry in Word Vectors</font></b>

In GloVe, $\theta_i$ and $e_j$ play symmetric roles in the learning objective. Consequently, the final word vector for a word $w$ can be obtained by averaging $e_w$ and $\theta_w$.

Despite its simplicity, GloVe effectively captures complex patterns and relationships in language. Its global perspective on word co-occurrence is a distinctive feature compared to local context-focused methods like W2V. This broader view often makes GloVe a preferred choice for researchers seeking a balance between computational efficiency and representational power in word embeddings.

## 4 - Sentiment Classification
<hr>

Sentiment classification (SC) is the process of deciding from a text whether the writer likes or dislikes something. This is for example required to map textual reviews to star-ratings (1 star=bad, 5 stars=great).

The learning problem in SC is to learn a function which maps an input $x$ (e.g. a restaurant review) to a discrete output $y$ (e.g. a star-rating). The learning problem is therefore a multi-class classification problem where the predicted class is the number of stars. For such learning problems, however, labeled training data is often scarce.

A simple classifier could consist of calculating the word embeddings for each word in the review and calculating their average. This average vector could then be fed into a softmax classifier which calculates the probability for each of the target classes. This also works for long reviews because the average vector will always have the same dimensionality. However, by averaging the word vectors, we lose information about the order of the words. This is important because the word sequence "not good" should have a negative impact on the star-rating whereas when calculating the average of "not" and "good" individually, the negative meaning is lost and the star-rating would be influenced positively. An alternative would therefore be to train an RNN with the word embeddings.
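The averaging classifier, and the way it ignores word order, can be sketched as follows. The embeddings, the weight matrix `W` and the bias `b` are invented toy values rather than trained parameters, and the helper names are assumptions.

```python
import math


def average_embedding(words, embeddings):
    """Average the embeddings of all words; the result has a fixed size."""
    dim = len(embeddings[words[0]])
    total = [0.0] * dim
    for word in words:
        for d in range(dim):
            total[d] += embeddings[word][d]
    return [value / len(words) for value in total]


def softmax(scores):
    shift = max(scores)  # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def star_probabilities(words, embeddings, W, b):
    """Softmax classifier on top of the averaged embedding (one row of W per star rating)."""
    avg = average_embedding(words, embeddings)
    scores = [sum(w_d * a_d for w_d, a_d in zip(row, avg)) + b_k for row, b_k in zip(W, b)]
    return softmax(scores)


embeddings = {"food": [0.0, 1.0], "good": [1.0, 0.0], "bad": [-1.0, 0.0], "not": [0.0, -0.5]}
W = [[-2.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # 1 to 5 stars
b = [0.0] * 5

probs_a = star_probabilities("food good not bad".split(), embeddings, W, b)
probs_b = star_probabilities("food bad not good".split(), embeddings, W, b)
```

The two reviews contain the same words in a different order and express opposite opinions, yet they produce identical averages and therefore identical predictions.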


<br>

<div style="text-align:center">
    <img src="images/rnn-for-sentiment-classification.png" width=700>
</div>

## 5 - Debiasing Word Embeddings
<hr>

Word embeddings can suffer from bias depending on the training data used. Here, bias means bias with respect to gender, race, age, etc. Such biases can become a problem because they can reinforce stereotypes by learning inappropriate relationships between words (e.g. man is to computer programmer as woman is to homemaker). To neutralize such biases, we could perform the following steps:

1. **Identify bias direction:** If, for example, we want to reduce the gender bias we could define pairs of male and female forms of words and average the difference between their embeddings. The resulting vector gives us the bias direction $g$.
2. **Neutralize:** For every word that is not definitional, remove the component of its embedding that lies along the bias direction. Definitional means that the gender is important for the meaning of the word. Examples of definitional words are grandmother and grandfather, because here the gender information cannot be omitted without losing semantic meaning. We can compute the neutralized embedding as follows:

$$e^{\text{bias_component}} = \frac{e \cdot g}{||g||^2_2} \cdot g$$

$$e^{\text{debiased}} = e - e^{\text{bias_component}}$$
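Steps 1 and 2 can be sketched in plain Python. The three-dimensional embeddings and the choice of word pairs are toy assumptions, and taking the difference as male minus female is an arbitrary sign convention.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bias_direction(pairs, embeddings):
    """Average e_male - e_female over the given (male, female) word pairs."""
    dim = len(embeddings[pairs[0][0]])
    g = [0.0] * dim
    for male, female in pairs:
        for d in range(dim):
            g[d] += embeddings[male][d] - embeddings[female][d]
    return [value / len(pairs) for value in g]


def neutralize(e, g):
    """Subtract from e its projection onto the bias direction g."""
    scale = dot(e, g) / dot(g, g)
    bias_component = [scale * g_d for g_d in g]
    return [e_d - b_d for e_d, b_d in zip(e, bias_component)]


# Toy embeddings (hypothetical values).
embeddings = {
    "man": [1.0, 0.2, 0.0], "woman": [-1.0, 0.2, 0.0],
    "father": [0.8, 0.0, 0.5], "mother": [-0.8, 0.0, 0.5],
    "receptionist": [-0.4, 0.3, 0.7],
}
g = bias_direction([("man", "woman"), ("father", "mother")], embeddings)
debiased = neutralize(embeddings["receptionist"], g)
```

After neutralizing, the embedding has no component along `g`, while its components in the remaining directions are left untouched.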

The figure below should help visualize what neutralizing does. If using a 50-dimensional word embedding, the 50-dimensional space can be split into two parts: the bias direction $g$, and the remaining 49 dimensions, which we’ll call $g_\perp$. In linear algebra, we say that the 49-dimensional $g_\perp$ is perpendicular to $g$. The neutralization step takes a vector such as $e_{\text{receptionist}}$ and zeros out the component in the direction of $g$, giving us $e^{\text{debiased}}_{\text{receptionist}}$. Even though $g_\perp$ is 49-dimensional, given the limitations of what we can draw on a screen, we illustrate it using a 1-dimensional axis below.

<br>

<div style="text-align:center">
    <img src="images/debiasing-neutralize.png" width=900>
</div>


### <font color="purple">Equalize pairs</font>

Equalization is applied to pairs of words that we might want to differ only through the gender property. As a concrete example, suppose that "actress" is closer to "babysit" than "actor" is. By applying neutralization to "babysit" we can reduce the gender stereotype associated with babysitting. But this still does not guarantee that "actor" and "actress" are equidistant from "babysit". The equalization algorithm takes care of this. The key idea behind equalization is to make sure that a particular pair of words is equidistant from the 49-dimensional $g_\perp$. The equalization step also ensures that the two equalized words are now the same distance from $e^{\text{debiased}}_{\text{receptionist}}$, or from any other word that has been neutralized. In pictures, this is how equalization works:

<br>

<div style="text-align:center">
    <img src="images/debiasing-equalize.png" width=900>
</div>
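The formulas for equalization are not given here. The sketch below assumes one common formulation, following Bolukbasi et al. (2016), which requires unit-length embeddings: both words keep the shared part of their mean that is orthogonal to $g$ and receive mirrored components along $g$. All vectors are toy values and the helper names are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def subtract(u, v):
    return [a - b for a, b in zip(u, v)]


def project(e, g):
    """Projection of e onto the bias direction g."""
    scale = dot(e, g) / dot(g, g)
    return [scale * g_d for g_d in g]


def unit(u):
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]


def equalize(e1, e2, g):
    """Make a pair differ only along g (assumes unit-length inputs with different bias components)."""
    mu = [(a + b) / 2.0 for a, b in zip(e1, e2)]
    mu_b = project(mu, g)
    mu_perp = subtract(mu, mu_b)                       # shared part, orthogonal to g
    length = math.sqrt(1.0 - dot(mu_perp, mu_perp))    # keeps the results unit-length
    equalized = []
    for e in (e1, e2):
        direction = unit(subtract(project(e, g), mu_b))  # +g or -g side of the mean
        equalized.append([m + length * d for m, d in zip(mu_perp, direction)])
    return equalized[0], equalized[1]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


g = [1.0, 0.0, 0.0]                       # toy bias direction
actor = unit([0.5, 0.4, 0.2])             # hypothetical embeddings
actress = unit([-0.3, 0.5, 0.1])
neutral_word = [0.0, 0.3, 0.7]            # a word that has already been neutralized

actor_eq, actress_eq = equalize(actor, actress, g)
gap = distance(actor_eq, neutral_word) - distance(actress_eq, neutral_word)
```

With these toy values `gap` is zero up to rounding, so both equalized words are equally far from the neutralized word.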

In the above steps we used gender bias as an example, but the same steps can be applied to eliminate other types of bias too. Word embeddings will almost always suffer from bias that is intrinsically contained in the corpora they were learned from.