# Explanation

The **word2vec** model introduces the concept of embedding models, which form word representations that preserve semantic and syntactic meaning.

Specifically, the model can take in words and encode them into vectors that are meant to be _representations_ of those words. These representations have special properties, in that just like how certain words are related to each other, the representations of related words should also be related.

A cool example that indicates the power of this construction is that the following vector computation holds approximately (the result is the vector closest to that of "Queen"):

$$\textrm{Vector}(\textrm{"King"}) - \textrm{Vector}(\textrm{"Man"}) + \textrm{Vector}(\textrm{"Woman"}) \approx \textrm{Vector}(\textrm{"Queen"})$$

This suggests that the vectors contain information that preserves the semantic meaning of words in relation to each other.
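A minimal sketch of this arithmetic, assuming hand-made 3-dimensional toy vectors (the values, the `toy_vectors` table and the `analogy` helper are all hypothetical; real embeddings are learned and have hundreds of dimensions). The answer is the nearest remaining word by cosine similarity, not an exact match:

```python
import numpy as np

# Hypothetical toy embeddings, hand-picked for illustration only.
# The axes loosely mean (royalty, maleness, femaleness).
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three query words from the candidates is a common convention in analogy evaluation and is assumed here.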

### Word2Vec

This paper experiments with a variety of architectures to try to create these embedding models, and then evaluates the performance of the different approaches.

The architectures vary in their implementations, but there are several commonalities between them that define how all embedding models work.

First, the embedding models all compress the inputs into a $D$-dimensional subspace in a hidden layer, and then use further hidden layers or similarity calculations based on the vectors in the hidden layer to accomplish language-based tasks.

This hidden layer is the representation layer where the embeddings are created - embeddings are actually just the vectors created by hidden layers in neural networks where the networks are forced to learn useful representations to accomplish a task within a specific subspace. In these cases, the embeddings for each word would be $D$-dimensional.

The most effective architecture in this paper, the continuous Skip-gram model, uses the task of predicting which words appear in the same contexts to form representations, pushing the embedding vectors of words that appear together closer together in the embedding space.

This forms complex relationships between the embeddings of different words as each word gets modified by its complex relationships with many other words, until the vector for each word contains rich information about its meaning.

In general, all the embedding models work by pushing words that frequently appear together in training into similar parts of the embedding space, implying that these words relate to each other.

### Phrase2Vec

The **phrase2vec** model builds on word2vec, improving on some of its implementations and also adding the ability to embed phrases in addition to just words. This is motivated by the fact that certain phrases (for example, the name of a city or a sports team) composed of individual words may actually adopt meanings completely unrelated to the words they're made up of, creating the need for more complex embeddings.

Additionally, this implementation introduces a few optimizations on top of the original word2vec model. Most notably, it introduces the sub-sampling of frequent words (similar to removing stop-words in traditional NLP) so that embeddings of frequent words don't get pushed in random directions given their frequency.

# My Notes

## 📜 [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781)

> We propose two novel model architectures for computing continuous vector representations of words from very large data sets

> We show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

> Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary.

> This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data.

> However, the simple techniques are at their limits in many tasks.

In many domains, like transcribed speech data, there isn’t much high quality data, so more complex models may be necessary to represent the available data.

> Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities.

> It was shown for example that _vector(”King”) - vector(”Man”) + vector(”Woman”)_ results in a vector that is closest to the vector representation of the word _Queen_

> In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words.

> We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy.

This paper focuses on creating word representations that preserve syntactic and semantic regularities between different words (the representations contain meaning, and arithmetic can be performed on them).

> Moreover, we discuss how training time and accuracy depends
> on the dimensionality of the word vectors and on the amount of the training data.

### Model Architectures

> In this paper, we focus on distributed representations of words learned by neural networks.

Word vectors learned by neural networks have been shown to preserve linear regularities significantly better than LSA, while LDA becomes computationally very expensive on large data sets.

> We will try to maximize the accuracy, while minimizing the computational complexity.

Minimize the number of parameters needed for the model to fully train while making sure it’s still accurate.

**1. Feedforward Neural Net Language Model (NNLM)**

> The probabilistic feedforward neural network language model has been proposed in. It consists of input, projection, hidden and output layers.

> At the input layer, N previous words are encoded using 1-of-V coding, where V is size of the vocabulary.

> The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix.

> Moreover, the hidden layer is used to compute probability distribution over all the words in the
> vocabulary, resulting in an output layer with dimensionality V .

The model consists of:

(1) An input layer holding the N previous words (often N=10), each one-hot encoded in 1-of-V style over a vocabulary of size V.

(2) There’s a projection layer with dimensionality N × D that maps each vector in the input layer linearly onto a learned representation. This projection layer is the **embedding layer**.

(3) Then, there’s a hidden layer which computes based on the projection layer and has some non-linearities.

(4) Finally, there’s an output layer of dimension V, corresponding with the prediction of the next word.
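The four layers above can be sketched as a single forward pass. This is a minimal illustration under stated assumptions, not the paper's implementation: the sizes `V`, `N`, `D`, `H`, the random weights, the `tanh` non-linearity, the missing bias terms and the `nnlm_forward` name are all choices made here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): vocabulary, context length, embedding dim, hidden dim.
V, N, D, H = 50, 4, 8, 16

projection = rng.normal(scale=0.1, size=(V, D))    # shared projection matrix, one D-dim row per word
W_hidden = rng.normal(scale=0.1, size=(N * D, H))  # projection layer (N x D, flattened) -> hidden layer
W_out = rng.normal(scale=0.1, size=(H, V))         # hidden layer -> one score per vocabulary word

def nnlm_forward(context_ids):
    """Return a probability distribution over the next word given N previous word ids."""
    # A row lookup is equivalent to multiplying a one-hot vector by the projection matrix.
    projected = projection[context_ids].reshape(-1)  # shape (N * D,)
    hidden = np.tanh(projected @ W_hidden)           # non-linear hidden layer
    scores = hidden @ W_out                          # shape (V,)
    exp_scores = np.exp(scores - scores.max())       # numerically stable softmax
    return exp_scores / exp_scores.sum()

probs = nnlm_forward(np.array([3, 17, 42, 8]))
print(probs.shape, probs.sum())
```

The rows of `projection` are the word embeddings; everything after them exists only to make the next-word prediction possible.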

**2. Recurrent Neural Net Language Model (RNNLM)**

> Recurrent neural network based language model has been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length, and because theoretically RNNs can efficiently represent more complex patterns than the shallow neural networks.

RNNs are used for this representation task since they’re theoretically better at modeling word sequences.

> The RNN model does not have a projection layer; only input, hidden and output layer. What is special for this type of model is the recurrent matrix that connects hidden layer to itself, using time-delayed connections.

The embedding layer in this model is not a projection layer but is instead the layer passing information forward in time since it has to form short term memory of the past hidden state.

**3. Parallel Training of Neural Networks**

> To train models on huge data sets, we have implemented several models on top of a large-scale distributed framework called DistBelief, including the feedforward NNLM and the new models proposed in this paper.

### New Log-Linear Models

> In this section, we propose two new model architectures for learning distributed representations of words that try to minimize computational complexity.

> The main observation from the previous section was that most of the complexity is caused by the non-linear hidden layer in the model.

Building models optimized for simplicity. While they may not be able to have as complex representations as neural networks for the task, their computational efficiency is a big benefit.

**1. Continuous Bag-of-Words Model**

> The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged).

> We call this architecture a bag-of-words model as the order of words in the history does not influence the projection.

> Furthermore, we also use words from the future; we have obtained the best performance on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word.

In this model, the non-linear hidden layer is removed to reduce complexity, and the projected vectors of all context words are averaged into a single shared position in the projection layer rather than each keeping its own slot, meaning that there is no information about word position.
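A minimal sketch of a CBOW-style forward pass, assuming toy sizes, random weights and a full softmax output (the names `input_vectors`, `output_weights` and `cbow_forward` are hypothetical). Because the context vectors are averaged, reversing the context order gives the same prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 50, 8  # toy vocabulary size and embedding dimension (assumed)

input_vectors = rng.normal(scale=0.1, size=(V, D))   # one embedding row per word
output_weights = rng.normal(scale=0.1, size=(D, V))  # log-linear classifier on top

def cbow_forward(context_ids):
    """Predict the middle word from the average of the context word vectors."""
    averaged = input_vectors[context_ids].mean(axis=0)  # word order is lost here
    scores = averaged @ output_weights
    exp_scores = np.exp(scores - scores.max())          # numerically stable softmax
    return exp_scores / exp_scores.sum()

context = np.array([3, 17, 42, 8])
probs = cbow_forward(context)
reversed_probs = cbow_forward(context[::-1])
print(np.allclose(probs, reversed_probs))
```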

**2. Continuous Skip-gram Model**

> The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence.

> More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word.

> We found that increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity.

This model just tries to predict which words are nearby based on the current word and a certain range.
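To make the training data concrete, here is a small sketch that builds the (current word, nearby word) pairs the classifier is trained on. The fixed `window` and the `skipgram_pairs` helper are assumptions for illustration:

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
```

A larger `window` produces more pairs per word, which is where the extra computational cost of a wider range comes from.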

### Results

> ”What is the word that is similar to small in the same sense as biggest is similar to big?”

> Somewhat surprisingly, these questions can be answered by performing simple algebraic operations with the vector representation of words.

> Finally, we found that when we train high dimensional word vectors on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin.

**1. Task Description**

> To measure quality of the word vectors, we define a comprehensive test set that contains five types of semantic questions, and nine types of syntactic questions.

![Screenshot 2024-05-15 at 1.49.07 PM.png](../../images/Screenshot_2024-05-15_at_1.49.07_PM.png)

> Question is assumed to be correctly answered only if the closest word to the vector computed using the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes.

> We believe that usefulness of the word vectors for certain applications should be positively correlated with this accuracy metric.

**2. Maximization of Accuracy**

> We have used a Google News corpus for training the word vectors. This corpus contains about 6B tokens. We have restricted the vocabulary size to 1 million most frequent words.

> Increasing amount of training data twice results in about the same increase of computational complexity as increasing vector size twice.

Doubling the vector size of the word representations costs about as much extra compute as doubling the amount of training data, so the two compete for the same compute budget.

**3. Comparison of Model Architectures**

The skip-gram model performs best overall.

![Screenshot 2024-05-15 at 1.57.00 PM.png](../../images/Screenshot_2024-05-15_at_1.57.00_PM.png)

### Examples of the Learned Relationships

![Screenshot 2024-05-15 at 1.58.46 PM.png](../../images/Screenshot_2024-05-15_at_1.58.46_PM.png)

> It is also possible to apply the vector operations to solve different tasks. For example, we have observed good accuracy for selecting out-of-the-list words, by computing average vector for a list of words, and finding the most distant word vector.
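A small sketch of the out-of-the-list idea, assuming hand-made 2-dimensional toy vectors and Euclidean distance (the quote does not say which distance is used, so that choice, the vector values and the `odd_one_out` helper are all assumptions):

```python
import numpy as np

# Hypothetical toy embeddings; three animals and one vehicle.
toy_vectors = {
    "cat":   np.array([0.90, 0.10]),
    "dog":   np.array([0.80, 0.20]),
    "horse": np.array([0.85, 0.15]),
    "car":   np.array([0.10, 0.90]),
}

def odd_one_out(words, vectors):
    """Return the word whose vector is farthest from the average vector of the list."""
    mean = np.mean([vectors[w] for w in words], axis=0)
    return max(words, key=lambda w: np.linalg.norm(vectors[w] - mean))

outlier = odd_one_out(["cat", "dog", "horse", "car"], toy_vectors)
print(outlier)
```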

### Conclusion

> In this paper we studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks.

> We observed that it is possible to train high quality word vectors using very simple model architectures.

> Using the DistBelief distributed framework, it should be possible to train the CBOW and Skip-gram models even on corpora with one trillion words, for basically unlimited size of the vocabulary.

💬 **Comments**

This paper introduces the intuition for embeddings - it’s actually not a model trained to produce embeddings, but a model trained for a task, and created in a way so that the model is forced to create embeddings somewhere in order to accomplish its goal.

Each different model proposed in this paper takes a different approach to language modeling that forces the model to build its own embeddings, and the tradeoffs taken are for the sake of computational complexity.

In general, as a model is forced to model certain relationships between words & tokens more accurately to accomplish some task, words that appear together or are contextually similar will be adjusted in the representation space to cluster more closely together, and hopefully, to model relevant syntactic and semantic relationships.

It’s actually very surprising that this happens naturally in the way that the models learn.

The intuition of embeddings is to confine what the model learns to a dedicated representation space, so that once the model has optimized its model of language within that space, the space itself can be reused in practice.



## 📜 [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546)

> The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships.
>

> In this paper we present several extensions that improve both the quality of the vectors and the training speed.
>

This paper adds many optimizations to the skip-gram model introduced in the word2vec paper.

> An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases.
>

> We present a simple method for finding phrases in text, and show
that learning good vector representations for millions of phrases is possible.
>

Given the limitations of individual word representations, the paper also focuses on how more complex phrases can be represented by embeddings.

> Distributed representations of words in a vector space help learning algorithms to achieve better performance in natural language processing tasks by grouping similar words.
>

This is the core intuition behind why embeddings are useful to models, and why they have the properties they do after training.

> Unlike most of the previously used neural network architectures for learning word vectors, training of the Skip-gram model does not involve dense matrix multiplications.
>

> This makes the training extremely efficient: an optimized single-machine implementation can train on more than 100 billion words in one day.
>

> We show that subsampling of frequent words during training results in a significant speedup (around 2x - 10x), and improves accuracy of the representations of less frequent words.
>

> In addition, we present a simplified variant of Noise Contrastive Estimation (NCE) for training the Skip-gram model that results
in faster training and better vector representations for frequent words.
>

One major focus of this paper is to improve the performance and training efficiency of the existing skip-gram model.

> Using vectors to represent the whole phrases makes the Skip-gram model considerably more expressive.
>

The other major focus is in using vectors to represent phrases rather than words, which enables the embeddings to represent much more.

> The extension from word based to phrase based models is relatively simple. First we identify a large number of phrases using a data-driven approach, and then we treat the phrases as individual tokens during the training.
>

> To evaluate the quality of the phrase vectors, we developed a test set of analogical reasoning tasks that contains both words and phrases.
>

### The Skip-gram Model

> The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document.
>

This task is especially conducive to creating useful context in the word embeddings - you need to very clearly be able to derive different types of related words from the embedding of each word.

> Formally, given a sequence of training words $w_1, w_2, w_3, …, w_T$, the objective of the Skip-gram model is to maximize the average log probability
>

$$
\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j}|w_t)
$$

Meaning, for each word (summation over $T$ terms), we want to increase the total chance that the model predicts the presence of all surrounding words within context window $c$ (summation over $j$ terms, bounded by $c$).

This can be framed as a minimization by changing the main term to a $- \log$ probability.

> The basic Skip-gram formulation defines $p(w_{t+j}|w_t)$ using the softmax function:
>

$$
p(w_O|w_I) = \frac{\exp({v'_{w_O}}^Tv_{w_I})}{\sum_{w=1}^W \exp({v'_w}^Tv_{w_I})}
$$

> This formulation is impractical because the cost of computing $\nabla \log p(w_O|w_I)$ is proportional to $W$.
>

The Skip-gram cost function is meant to maximize the model’s predicted probability of each of the words that are actually in the $2c$-word context window.

The model stores an input embedding vector $v_w$ and an output vector $v'_w$ for each word in the vocabulary, and each output score is computed by taking the dot product of the target word’s embedding vector with the output vector of each word in the vocabulary.

The model works as follows:

(1) The input layer is a 1-of-V one hot encoded vector with V inputs ($V$ being the vocabulary size)

(2) The projection layer directly maps each word in the input to a $D$-dimensional embedding (linear mapping). Thus this layer has dimension $V$ by $D$. Since only one input neuron is active at a time, only one embedding row (the embedding of the target word) is active at once.

(3) The output layer then computes the dot products of the output vectors of all words in the vocabulary with the target word’s embedding, and then these scores are passed through the softmax function, indicating the probability of each word appearing in the context window of the target word.

Through this optimization, the cost function forces words that appear in similar contexts to be pushed into closer regions (increasing similarity) in the embedding space. This is the core intuition behind how embedding spaces actually develop the emergent representations that they have.

This probability is technically calculated using a softmax on the entire set of probabilities calculated by the model on the $W$ outputs (corresponding with each word in the vocabulary), which is a very computationally expensive calculation.
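A minimal sketch of the full softmax, assuming a toy vocabulary and random vectors (the names `input_vectors`, `output_vectors` and `softmax_prob` are hypothetical). It shows why the cost grows with $W$: a single probability needs a dot product with every output vector.

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 1000, 16  # toy vocabulary size and embedding dimension (assumed)

input_vectors = rng.normal(scale=0.1, size=(W, D))   # v_w, the embeddings
output_vectors = rng.normal(scale=0.1, size=(W, D))  # v'_w, the output vectors

def softmax_prob(w_out, w_in):
    """p(w_out | w_in) under the full softmax; touches all W output vectors."""
    scores = output_vectors @ input_vectors[w_in]  # W dot products for one probability
    exp_scores = np.exp(scores - scores.max())     # numerically stable softmax
    return float(exp_scores[w_out] / exp_scores.sum())

p = softmax_prob(3, 5)
total = sum(softmax_prob(w, 5) for w in range(W))
print(p, total)
```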

Instead, approximation methods are used to calculate this probability more efficiently.

**1. Hierarchical Softmax**

> A computationally efficient approximation of the full softmax is the hierarchical softmax.
>

> The main advantage is that instead of evaluating $W$ output nodes in the neural network to obtain the probability distribution, it is needed to evaluate only about $\log_2(W)$ nodes.
>

This method works by arranging the $W$ output words as the leaves of a binary tree. Each inner node makes a binary left/right decision, and the probability of a word is the product of the decisions along the path from the root to that word’s leaf. Only the roughly $\log_2(W)$ nodes on that path need to be evaluated instead of all $W$ outputs, which makes the computation far more efficient.
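A rough sketch of the idea, assuming a balanced binary tree stored heap-style over a toy vocabulary of 8 words (the paper uses a Huffman tree; the balanced layout, the random vectors and the `hs_prob` helper are simplifications made here). Each inner node has its own vector and contributes one sigmoid factor:

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 8, 4  # toy vocabulary size (a power of two) and embedding dimension (assumed)

input_vectors = rng.normal(size=(W, D))
# Heap layout: inner nodes are 1 .. W-1 (row 0 unused), leaves are W .. 2W-1.
inner_vectors = rng.normal(size=(W, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_prob(w_out, w_in):
    """p(w_out | w_in) as a product of binary decisions on the root-to-leaf path."""
    h = input_vectors[w_in]
    node = w_out + W  # leaf index of the output word
    prob, steps = 1.0, 0
    while node > 1:   # walk up to the root; the product does not depend on direction
        parent = node // 2
        sign = 1.0 if node % 2 == 0 else -1.0  # left child uses sigma(x), right child sigma(-x)
        prob *= sigmoid(sign * (inner_vectors[parent] @ h))
        node = parent
        steps += 1
    return prob, steps

prob, steps = hs_prob(5, 2)
total = sum(hs_prob(w, 2)[0] for w in range(W))
print(prob, steps, total)
```

Since $\sigma(x) + \sigma(-x) = 1$ at every inner node, the leaf probabilities sum to one without ever computing a normalizer over all $W$ words.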

**2. Negative Sampling**

> An alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE). […] NCE posits that a good model should be able to differentiate data from noise by means of logistic regression.
>

NCE is a method to approximate the softmax. Instead of taking a softmax across a large number of logits to compute a score for each output, each output becomes a sigmoid-activated score, meant to indicate whether a word is a “context” word that is actually associated with the input word, or a “noise” word.

True context words should converge toward being predicted as context words, and noise words toward being predicted as noise. For efficiency, only a handful of noise words are used per training example, randomly sampled from a noise distribution.

> While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality.
>

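Instead of NCE, the paper uses a simplified version called Negative Sampling (NEG), which replaces each $\log p(w_O|w_I)$ term in the Skip-gram objective with: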

$$
\log \sigma({v'_{w_O}}^Tv_{w_I}) + \sum_{i=1}^k \mathbb{E}_{w_i \sim P_n(w)}[\log \sigma({-v'_{w_i}}^Tv_{w_I})]
$$

Here, we want to maximize the log-probability $\log \sigma({v'_{w_O}}^Tv_{w_I})$, meaning we want to maximize the dot products (similarities) of the target word embedding vector with the embedding vectors of the correct context words.

Additionally, for $k$ randomly sampled words from the noise distribution $P_n(w)$, we want to maximize the expected $\log \sigma({-v'_{w_i}}^Tv_{w_I})$, computed from each noise word’s embedding vector and the target word’s embedding vector. This term maximizes the similarity between the *opposite* of each noise word’s embedding vector and the target word’s embedding vector, or effectively minimizes the similarity of the two embedding vectors.
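A minimal sketch of the NEG objective for a single (input word, context word) pair, where the expectation is replaced by one draw of $k$ noise vectors (the function names and the toy vectors are hypothetical). The objective is higher when the context vector aligns with the input vector and the noise vectors point away from it:

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def neg_objective(v_in, v_out, v_noise):
    """NEG objective for one pair; v_noise holds k sampled noise vectors as rows."""
    positive = log_sigmoid(v_out @ v_in)             # pull the true context word closer
    negative = log_sigmoid(-(v_noise @ v_in)).sum()  # push the k noise words away
    return float(positive + negative)

v_in = np.array([1.0, 0.0])

# Context vector aligned with the input, noise vectors pointing away: high objective.
aligned = neg_objective(v_in, np.array([2.0, 0.0]),
                        np.array([[-2.0, 0.0], [-1.0, 1.0]]))

# Context vector pointing away, noise vectors aligned: low objective.
misaligned = neg_objective(v_in, np.array([-2.0, 0.0]),
                           np.array([[2.0, 0.0], [1.0, 1.0]]))

print(aligned, misaligned)
```

Only $k + 1$ dot products are needed per training pair, regardless of the vocabulary size.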

> The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.
>

In order for NCE to effectively mirror the probability distribution that would be learned if the softmax function were used, we would have to use correct numerical probabilities from the noise distribution, but since we don’t need this level of accuracy for this case (since the goal is just to create good embedding vector representations), the NEG function is sufficient.

> Both NCE and NEG have the noise distribution $P_n(w)$ as a free parameter. We investigated a number of choices for $P_n(w)$ and found that the unigram distribution […] $U(w)^{3/4}/Z$ outperformed significantly […] on every task we tried.
>
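A short sketch of how $U(w)^{3/4}/Z$ reshapes the unigram distribution, assuming made-up word counts (the `counts` table is hypothetical). Raising to the 3/4 power shifts probability mass from very frequent words toward rare ones:

```python
import numpy as np

# Hypothetical corpus counts, for illustration only.
counts = {"the": 1000, "paris": 10, "volga": 1}
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

unigram = freq / freq.sum()  # U(w)
noise = freq ** 0.75
noise /= noise.sum()         # U(w)^(3/4) / Z, where Z is the normalizer

for w, u, n in zip(words, unigram, noise):
    print(f"{w:>6}  unigram={u:.4f}  noise={n:.4f}")

# Draw k = 5 negative samples from the noise distribution.
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=noise)
```

Raising raw counts to the power and normalizing gives the same result as raising the normalized probabilities, because the constant factor cancels in $Z$.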

**3. Subsampling of Frequent Words**

> In very large corpora, the most frequent words can easily occur hundreds of millions of times. […] Such words usually provide less information value than the rare words.
>

> While the skip-gram model benefits from observing the co-occurrences of “France” and “Paris”, it benefits much less from observing the frequent co-occurrences of “France” and “the”, as nearly every word co-occurs frequently within a sentence with “the.”
>

> The vector representations of frequent words do not change significantly after training on several million examples.
>

Frequent words in the corpus don’t add much information to the embeddings of other words, and they don’t soak up context from all the variety of words that surround them.

> To counter the imbalance between the rare and frequent words, we used a simple subsampling approach: each word in the training set is discarded with probability computed by the formula
>

$$
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
$$

> where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold, typically around $10^{-5}$
>

Words whose frequency is far above the threshold $t$ are very likely to be discarded, while the ranking of word frequencies is still preserved.

This effectively compresses the frequencies in the dataset to maintain the same order but have much smaller variance.
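A quick sketch of the discard formula, assuming $f(w_i)$ is the word's relative frequency in the corpus and clipping negative values to zero for words rarer than $t$ (the clipping, the example frequencies and the `discard_probability` name are assumptions made here):

```python
import math

def discard_probability(freq, t=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), clipped at 0 for words rarer than the threshold."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Hypothetical relative frequencies.
p_the = discard_probability(0.05)   # a very frequent word (5% of all tokens)
p_mid = discard_probability(1e-4)   # a moderately frequent word
p_rare = discard_probability(1e-6)  # a word rarer than the threshold

print(p_the, p_mid, p_rare)
```

The ordering is preserved (more frequent words are always discarded more often), but the gap between frequent and rare words shrinks considerably.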

> It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words, as will be shown in the following sections.
>

### Empirical Results

![Screenshot 2024-05-15 at 4.24.44 PM.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/d53d4b03-ec38-429d-9915-08fc298fc6e9/a45e6fc0-ef0e-4d06-b56d-657b4f752c5a/Screenshot_2024-05-15_at_4.24.44_PM.png)

> The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation.
>

> The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.
>

### Learning Phrases

> As discussed earlier, many phrases have a meaning that is not a simple composition of the meanings of its individual words. To learn vector representation for phrases, we first find words that appear frequently together, and infrequently in other contexts.
>

> Phrases are formed based on the unigram and bigram counts. The $\delta$ is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words to be formed.
>

$$
\textrm{score}(w_i, w_j) = \frac{\textrm{count}(w_iw_j) - \delta}{\textrm{count}(w_i) \times \textrm{count}(w_j)}
$$

> The bigrams with score above the chosen threshold are then used as
phrases.
>

Each phrase is then replaced with its own (new) token in the dataset.
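A minimal sketch of the bigram scoring on a tiny made-up corpus (the corpus, the `delta` value and the `phrase_scores` helper are hypothetical; real thresholds depend on the corpus size):

```python
from collections import Counter

def phrase_scores(tokens, delta):
    """score(a, b) = (count(a b) - delta) / (count(a) * count(b)) for each adjacent pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

tokens = "new york is big . new york is far . paris is old . rome is old .".split()
scores = phrase_scores(tokens, delta=1)

best = max(scores, key=scores.get)
print(best, scores[best])
```

With `delta=1`, every bigram seen only once scores zero, which illustrates how the discount keeps rare coincidental pairs from becoming phrases. "york is" occurs as often as "new york" but scores lower because "is" is common elsewhere.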

**1. Phrase Skip-Gram Results**

> Surprisingly, while we found the Hierarchical Softmax to achieve lower performance when trained without subsampling, it became the best
performing method when we downsampled the frequent words.
>

![Screenshot 2024-05-15 at 4.41.55 PM.png](../../images/Screenshot_2024-05-15_at_4.41.55_PM.png)

![Screenshot 2024-05-15 at 4.39.09 PM.png](../../images/21c8f6ba-e040-483e-9fef-836a63aed4fd/Screenshot_2024-05-15_at_4.39.09_PM.png)

### Additive Compositionality

> We demonstrated that the word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics.
>

> Interestingly, we found that the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations.
>

The learned vectors can also be meaningfully added together: the paper’s example is that vec(“Russia”) + vec(“river”) lands close to vec(“Volga River”).

> The additive property of the vectors can be explained by inspecting the training objective.
>

The training objective of an embedding model largely determines which operations are meaningful in the resulting representation space.

### Comparison to Published Word Representations

![Screenshot 2024-05-15 at 4.48.17 PM.png](../../images/Screenshot_2024-05-15_at_4.48.17_PM.png)

### Conclusion

> This work has several key contributions. We show how to train distributed representations of words and phrases with the Skip-gram model and demonstrate that these representations exhibit linear structure that makes precise analogical reasoning possible.
>

> A very interesting result of this work is that the word vectors can be somewhat meaningfully combined using just simple vector addition.
>

> Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combination of these two approaches gives a powerful yet simple way how to represent longer pieces of text, while having minimal computational complexity
>

💬 **Comments**

The quality and properties of the embedding model are a result of the specific training methods and objective functions used for the model. The relationships between words are enforced by the types of representations learned to model the training problem.

This paper mainly introduces the technical details of the skip-gram model and how its design is conducive to learning good word embeddings.

It also shows us how to add the ability to embed complex phrases into the embedding model.