# Speech and Language Processing. Daniel Jurafsky & James H. Martin

## Chapter 6 - Vector Semantics and Embeddings

https://web.stanford.edu/~jurafsky/slp3/6.pdf

**Introduction**

Words that occur in similar contexts tend to have similar meanings.

This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis. The hypothesis was first formulated in the 1950s by linguists.

Vector semantics instantiates this linguistic hypothesis by learning representations of the meaning of words, called embeddings, directly from their distributions in texts. These representations are used in every natural language processing application that makes use of meaning, and underlie the more powerful contextualized word representations like ELMo and BERT that we will introduce in Chapter 10.

Embeddings are an example of representation learning, that is, automatically learning useful representations of the input text. Finding such self-supervised ways to learn representations of the input, instead of creating representations by hand via feature engineering, is an important focus of NLP research.

**Lexical semantics**

More generally, a model of word meaning should allow us to draw useful inferences that will help us solve meaning-related tasks like question-answering, summarization, detecting paraphrases or plagiarism, and dialogue.

Here the form mouse is the lemma, also called the citation form. The form mouse would also be the lemma for the word mice; dictionaries don’t have separate definitions for inflected forms like mice. Similarly sing is the lemma for sing, sang, sung. In many languages the infinitive form is used as the lemma for the verb, so Spanish dormir “to sleep” is the lemma for duermes “you sleep”. The specific forms sung or carpets or sing or duermes are called wordforms.

We call each of these aspects of the meaning of mouse a word sense. The fact that lemmas can be polysemous (have multiple senses) can make interpretation difficult (is someone who types “mouse info” into a search engine looking for a pet or a tool?).

A more formal definition of synonymy (between words rather than senses) is that two words are synonymous if they are substitutable one for the other in any sentence without changing the truth conditions of the sentence, the situations in which the sentence would be true. We often say in this case that the two words have the same
propositional meaning.

Principle of contrast: the assumption that a difference in linguistic form is always associated with at least some difference in meaning.

The notion of word similarity is very useful in larger semantic tasks. Knowing how similar two words are can help in computing how similar the meaning of two phrases or sentences are, a very important component of natural language understanding tasks like question answering, paraphrasing, and summarization.

The meaning of two words can be related in ways other than similarity. One such class of connections is called word relatedness, also traditionally called word association in psychology. One common kind of relatedness between words is if they belong to the same semantic field. Semantic fields are also related to topic models, like Latent Dirichlet Allocation (LDA), which apply unsupervised learning on large sets of texts to induce sets of associated words from text.

Closely related to semantic fields is the idea of a semantic frame. A semantic frame is a set of words that denote perspectives or participants in a particular type of event.

Words have affective meanings or connotations. The word connotation has different meanings in different fields, but here we use it to mean the aspects of a word’s meaning that are related to a writer or reader’s emotions,
sentiment, opinions, or evaluations. For example some words have positive connotations (happy) while others have negative connotations (sad). Some words describe positive evaluation (great, love) and others negative evaluation (terrible, hate). Positive or negative evaluation expressed through language is called sentiment, as we
saw in Chapter 4, and word sentiment plays a role in important tasks like sentiment analysis, stance detection, and many applications of natural language processing to the language of politics and consumer reviews.

Early work on affective meaning (Osgood et al., 1957) found that words varied along three important dimensions of affective meaning. These are now generally called valence, arousal, and dominance, defined as follows:

- valence: the pleasantness of the stimulus
- arousal: the intensity of emotion provoked by the stimulus
- dominance: the degree of control exerted by the stimulus

Thus words like happy or satisfied are high on valence, while unhappy or annoyed are low on valence. Excited or frenzied are high on arousal, while relaxed or calm are low on arousal. Important or controlling are high on dominance, while awed or influenced are low on dominance. Each word is thus represented by three numbers, corresponding to its value on each of the three dimensions, [...]. Osgood et al. (1957) noticed that in using these 3 numbers to represent the meaning of a word, the model was representing each word as a point in a three-dimensional space, a vector whose three dimensions corresponded to the word’s rating on the three scales. This revolutionary idea that word meaning could be represented as a point in space (e.g., that part of the meaning of heartbreak can be represented as the point [2.45,5.65,3.58]) was the first expression of the vector semantics models that we introduce next.

**Vector semantics**

“The meaning of a word is its use in the language” --> Instead of using some logical language to define each word, we should define words by some representation of how the word was used by actual people in speaking and understanding.

Linguists of the period like Joos (1950), Harris (1954), and Firth (1957) (the linguistic distributionalists), came up with a specific idea for realizing Wittgenstein’s intuition: define a word by its environment or distribution in language use. A word’s distribution is the set of contexts in which it occurs, the neighboring words or grammatical environments. The idea is that two words that occur in very similar distributions (that occur together with very similar words) are likely to have the same meaning.

Vector semantics thus combines two intuitions: the distributionalist intuition (defining a word by counting what other words occur in its environment), and the vector intuition of Osgood et al. (1957) we saw in the last section on connotation: defining the meaning of a word w as a vector, a list of numbers, a point in N-dimensional space. There are various versions of vector semantics, each defining the numbers in the vector somewhat differently, but in each case the numbers are based in some way on counts of neighboring words.

The idea of vector semantics is thus to represent a word as a point in some multidimensional semantic space. Vectors for representing words are generally called *embeddings*, because the word is embedded in a particular vector space.

The sentiment analysis classifier we saw in Chapter 4 only works if enough of the important sentimental words that appear in the test set also appeared in the training set. But if words were represented as embeddings, we could assign sentiment as long as words with similar meanings as the test set words occurred in the training set.

*Vector semantic models are also extremely practical because they can be learned automatically from text without any complex labeling or supervision.*

**Vectors and documents**

In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents. The term-document matrix was first defined as part of the vector space model of information retrieval. In this model, a document is represented as a count vector, that is, a column of the matrix. In real term-document matrices, the vectors representing each document would have dimensionality $|V|$, the vocabulary size. The ordering of the numbers in a vector space is not arbitrary; each position indicates a meaningful dimension on which the documents can vary.

Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval. A real term-document matrix, of course, wouldn’t just have 4 rows and columns,
let alone 2. More generally, the term-document matrix has $|V|$ rows (one for each word type in the vocabulary) and $D$ columns (one for each document in the collection); as we’ll see, vocabulary sizes are generally in the tens of thousands, and the number of documents can be enormous (think about all the pages on the web).
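As a rough illustration of the term-document matrix, the sketch below builds one from a toy collection. The function name `term_document_matrix`, the whitespace tokenization, and the three example documents are assumptions made for this example, not part of the chapter.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the count of vocab[i] in docs[j]."""
    # One bag-of-words counter per document (naive whitespace tokenization).
    counts = [Counter(doc.lower().split()) for doc in docs]
    # The vocabulary is the sorted set of all word types seen in the collection.
    vocab = sorted(set().union(*counts))
    # |V| rows (one per word type) and D columns (one per document).
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

docs = ["the fool and the wit", "battle of the fool", "good wit good battle"]
vocab, td = term_document_matrix(docs)

# A document is a column of the matrix; a word is a row.
doc0_vector = [row[0] for row in td]
the_vector = td[vocab.index("the")]
```

Reading a column gives the count vector for a document, while reading a row gives the vector for a word, which is the view used in the next section.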



**Words and vectors**

The word vector is now a row vector rather than a column vector, and hence the dimensions of the vector are different. For documents, we saw that similar documents had similar vectors, because similar documents tend to have similar words. This same principle applies to words: similar words have similar vectors because they tend to occur in similar documents. The term-document matrix thus lets us represent the meaning of a word by the documents it tends to occur in. 

An alternative is the term-term matrix, more commonly called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality $|V| × |V|$ and each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. The context could be the document, in which case the cell represents the number of times the two words appear in the same document. It is most common, however, to use smaller contexts, generally a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a ±4 word window around the row word.

Note that $|V|$, the length of the vector, is generally the size of the vocabulary, usually between 10,000 and 50,000 words. But of course since most of these numbers are zero these are sparse vector representations, and there are efficient algorithms for storing and computing with sparse matrices.
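The windowed counting described above can be sketched as follows. This is a minimal illustration that stores only non-zero cells in nested dictionaries (one simple way to keep the representation sparse); the function name `cooccurrence_counts` and the toy token sequence are assumptions for the example.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Sparse word-word counts: counts[target][context] within +/- `window` tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # skip the target position itself
                counts[target][tokens[j]] += 1
    return counts

tokens = "a b c a b".split()
counts = cooccurrence_counts(tokens, window=1)
```

Cells that were never incremented are simply absent, so memory grows with the number of observed pairs rather than with $|V| × |V|$.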

**Cosine for measuring similarity**

By far the most common similarity metric is the cosine of the angle between the vectors. The cosine is based on the dot product operator from linear algebra, also called the inner product:

$<v, w> = v \cdot w = \sum_{i=1}^k v_i w_i $

The dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions—orthogonal vectors—will have a dot product of 0, representing their strong dissimilarity.

The cosine similarity metric between two vectors v and w thus can be computed as:

$cosine(v, w) = \frac{v \cdot w} {|v| |w|} $
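A small sketch of the two formulas above, using plain Python lists as vectors. Returning 0.0 for a zero-length vector is a convention chosen for this example, since the cosine is undefined in that case.

```python
import math

def dot(v, w):
    """Dot (inner) product of two equal-length vectors."""
    return sum(vi * wi for vi, wi in zip(v, w))

def cosine(v, w):
    """Cosine of the angle between v and w: the dot product normalized by both lengths."""
    norm_v = math.sqrt(dot(v, v))
    norm_w = math.sqrt(dot(w, w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar to everything
    return dot(v, w) / (norm_v * norm_w)

orthogonal = cosine([1, 0], [0, 1])        # vectors with no shared dimensions
parallel = cosine([1, 2, 3], [2, 4, 6])    # same direction, different length
```

Because of the normalization, the second pair scores as maximally similar even though one vector is twice as long as the other; the raw dot product would instead favor long (frequent-word) vectors.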

**TF-IDF: Weighing terms in the vector**

It’s a bit of a paradox. Words that occur nearby frequently (maybe pie nearby cherry) are more important than words that only appear once or twice. Yet words that are too frequent—ubiquitous, like the or good— are unimportant. How can we balance these two conflicting constraints?

The tf-idf algorithm (the ‘-’ here is a hyphen, not a minus sign) is the product of two terms, each term capturing one of these two intuitions:

The first is the term frequency (Luhn, 1957): the frequency of the word t in the document d. We can just use the raw count as the term frequency:

$tf_{t,d} = count(t, d)$

Alternatively we can squash the raw frequency a bit, by using the $log_{10}$ of the frequency instead. The intuition is that a word appearing 100 times in a document doesn’t make that word 100 times more likely to be relevant to the meaning of the document. Because we can’t take the $log$ of 0, we normally add 1 to the count:

$tf_{t,d} = log_{10}(count(t, d)+1)$

The second factor is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren’t as helpful. The document frequency $df_t$ of a term $t$ is the number of documents it occurs in. Document frequency is not the same as the collection frequency of a term, which is the total number of times the word appears in the whole collection in any document.

We emphasize discriminative words via the inverse document frequency or $idf$ term weight (Sparck Jones, 1972). The $idf$ is defined using the fraction $\frac{N}{df_t}$, where $N$ is the total number of documents in the collection, and $df_t$ is the number of documents in which term $t$ occurs.

Because of the large number of documents in many collections, this measure too is usually squashed with a log function. The resulting definition for inverse document frequency ($idf$) is thus:

$idf_t = log_{10} \left( \frac{N}{df_t} \right)$

The tf-idf weighted value $w_{t,d}$ for word $t$ in document $d$ thus combines term frequency $tf_{t,d}$ with the inverse document frequency $idf_t$ of term $t$:

$w_{t,d} = tf_{t,d} \times idf_t$
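To make the weighting concrete, here is a minimal sketch that combines the log-squashed term frequency with the inverse document frequency, both using base-10 logarithms. The function name `tf_idf` and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = log10(count + 1), idf = log10(N / df)
            t: math.log10(c + 1) * math.log10(n_docs / df[t])
            for t, c in counts.items()
        })
    return weights

docs = [
    ["the", "cherry", "pie"],
    ["the", "digital", "computer"],
    ["the", "cherry", "cherry"],
]
w = tf_idf(docs)
```

In this toy collection "the" occurs in every document, so its idf is $log_{10}(3/3) = 0$ and its weight vanishes, while a term confined to a single document, like "digital", keeps a comparatively high weight.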

Tf-idf is the standard way of weighting co-occurrence matrices in information retrieval, but it also plays a role in many other aspects of natural language processing. It’s also a great baseline, the simple thing to try first. We’ll look at other weightings like PPMI (Positive Pointwise Mutual Information) in Section 6.7.

*Applications*

- Finding words that are similar: we can find the $k$ most similar words to any target word $w$ by computing the cosines between $w$ and each of the $|V|-1$ other words, sorting, and looking at the top $k$.
- Finding documents that are similar: we represent a document by taking the vectors of all the words in the document and computing the centroid of all those vectors. The centroid is the multidimensional version of the mean; the centroid of a set of vectors is a single vector that has the minimum sum of squared distances to each of the vectors in the set. Given $k$ word vectors $w_1, \dots, w_k$, the centroid document vector $d$ is: $d = \frac{\sum_{i=1}^k w_i}{k}$. The similarity of two documents $d_i$ and $d_j$ is then $cos(d_i, d_j)$.
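Both applications can be sketched in a few lines. The word vectors below are made-up toy values, and the helper names `most_similar` and `centroid` are assumptions for this example.

```python
import math

def cosine(v, w):
    """Cosine similarity; 0.0 if either vector has zero length (a chosen convention)."""
    num = sum(a * b for a, b in zip(v, w))
    den = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return num / den if den else 0.0

def most_similar(target, embeddings, k=2):
    """Return the k words whose vectors have the highest cosine with the target's vector."""
    scores = [
        (cosine(embeddings[target], vec), word)
        for word, vec in embeddings.items()
        if word != target
    ]
    return [word for _, word in sorted(scores, reverse=True)[:k]]

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]

# Toy vectors (hypothetical values, not taken from a real corpus).
emb = {
    "cherry": [4, 0, 1],
    "strawberry": [3, 0, 0],
    "digital": [0, 5, 4],
    "information": [0, 6, 5],
}

neighbours = most_similar("digital", emb, k=2)
doc_a = centroid([emb["cherry"], emb["strawberry"]])
doc_b = centroid([emb["digital"], emb["information"]])
doc_similarity = cosine(doc_a, doc_b)
```

Scoring against every other word is linear in the vocabulary size, which is fine for a sketch; large vocabularies typically call for approximate nearest-neighbour search instead.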

**Pointwise Mutual Information (PMI)**

PMI is one of the most important concepts in NLP. It is a measure of how often two events $x$ and $y$ occur together, compared with what we would expect if they were independent:

$I(x,y) = log_2\frac{P(x,y)}{P(x)P(y)}$

The pointwise mutual information between a target word $w$ and a context word $c$ (Church and Hanks 1989, Church and Hanks 1990) is then defined as:

$PMI(w,c) = log_2\frac{P(w,c)}{P(w)P(c)}$

The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE). The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events. Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance. PMI is a useful tool whenever we need to find words that are strongly associated.

PMI values range from negative to positive infinity. But negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous. For this reason it is more common to use Positive PMI (called PPMI) which replaces all negative PMI values with zero:

$PPMI(w,c) = max \left(log_2\frac{P(w,c)}{P(w)P(c)}, 0\right)$

So, given a co-occurrence matrix $F$ with $W$ rows (*words*) and $C$ columns (*contexts*), where $f_{ij}$ gives the count of $w_i$ in context $c_j$, we can compute the PPMI matrix with entries $ppmi_{ij} = max \left(log_2\frac{P(w_i,c_j)}{P(w_i)P(c_j)}, 0\right)$, where:

- $P(w_i, c_j) = \frac{f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$
- $P(w_i) = \frac{\sum_{j=1}^C f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$
- $P(c_j) = \frac{\sum_{i=1}^W f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$
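The three probability estimates above translate almost directly into code. The sketch below takes the count matrix $F$ as a list of rows; the function name `ppmi_matrix` is an assumption, and cells with a zero count are mapped to 0 because their PMI would be negative infinity, which PPMI clips anyway.

```python
import math

def ppmi_matrix(F):
    """Compute the PPMI matrix from a word-by-context count matrix F (list of rows)."""
    total = sum(sum(row) for row in F)            # sum over all i, j of f_ij
    row_sums = [sum(row) for row in F]            # sum over j of f_ij, per word
    col_sums = [sum(col) for col in zip(*F)]      # sum over i of f_ij, per context
    result = []
    for i, row in enumerate(F):
        out = []
        for j, f in enumerate(row):
            if f == 0:
                out.append(0.0)                   # log2(0) = -inf, clipped to 0
                continue
            p_wc = f / total
            p_w = row_sums[i] / total
            p_c = col_sums[j] / total
            out.append(max(math.log2(p_wc / (p_w * p_c)), 0.0))
        result.append(out)
    return result

associated = ppmi_matrix([[2, 0], [0, 2]])     # each word has its own context
independent = ppmi_matrix([[1, 1], [1, 1]])    # counts spread evenly
```

In the first toy matrix each observed pair occurs twice as often as independence would predict, so the diagonal cells get a PPMI of 1 bit; in the second, every pair occurs exactly as often as expected and all cells are 0.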

PMI has the problem of being biased toward infrequent events; very rare words tend to have very high PMI values. One way to reduce this bias toward low frequency events is to slightly change the computation for $P(c)$, using a different function $P_\alpha(c)$ that raises the probability of the context word to the power of $\alpha$:

$P_\alpha(c) = \frac{count(c)^\alpha}{\sum_{c} count(c)^\alpha}$

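A good value is $\alpha = 0.75$. This works because raising the counts to a power below 1 increases the probability assigned to rare contexts, which in turn lowers their PMI.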
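The effect of the exponent is easiest to see on a tiny example. The sketch below computes $P_\alpha(c)$ from raw context counts; the function name `p_alpha` and the two-word toy counts are assumptions for illustration.

```python
def p_alpha(counts, alpha=0.75):
    """Smoothed context distribution: count(c)**alpha, renormalized to sum to 1."""
    weighted = {c: n ** alpha for c, n in counts.items()}
    z = sum(weighted.values())
    return {c: wgt / z for c, wgt in weighted.items()}

counts = {"frequent": 99, "rare": 1}
unsmoothed = p_alpha(counts, alpha=1.0)   # plain relative frequencies
smoothed = p_alpha(counts, alpha=0.75)
```

With $\alpha = 1$ the rare context has probability 0.01; with $\alpha = 0.75$ it rises to roughly 0.03, so the denominator of its PMI grows and the score is pulled down.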

Another alternative is Laplace (add-$k$) smoothing: before computing PMI, a small constant $k$ (typically between 0.1 and 3) is added to each of the counts.

**Word2Vec**

An alternative method for representing a word: the use of vectors that are short (of length perhaps 50-1000) and dense (most values are non-zero). It turns out that dense vectors work better in every NLP task than sparse vectors.

Skip-gram with negative sampling (SGNS) --> Word2Vec. The intuition of word2vec is that instead of counting how often each word $w$ occurs near, say, *apricot*, we’ll instead train a classifier on a binary prediction task: “Is
word $w$ likely to show up near *apricot*?” We don’t actually care about this prediction task; instead we’ll take the learned classifier *weights* as the word *embeddings*. First, word2vec simplifies the task (making it binary classification instead of word prediction). Second, word2vec simplifies the architecture (training a logistic regression classifier instead of a multi-layer neural network with hidden layers that demand more sophisticated training algorithms).

- Treat the target word and a neighboring context word as positive examples.
- Randomly sample other words in the lexicon to get negative samples.
- Use logistic regression to train a classifier to distinguish those two cases.
- Use the regression weights as the embeddings.

*The classifier*

Let’s start by thinking about the classification task, and then turn to how to train. Imagine a sentence with a target word apricot, and assume we’re using a window of ±2 context words. Our goal is to train a classifier such that, given a tuple $(t, c)$ of a target word $t$ paired with a candidate context word $c$, it will return the probability that $c$ is a real context word: $P(+|t,c)$

The probability that word $c$ is not a real context word for $t$ is: $P(-|t,c) = 1 - P(+|t,c)$

How does the classifier compute the probability $P$? The intuition of the skipgram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a
high dot product (*cosine*, the most popular similarity metric, is just a normalized dot product). In other words:

$Similarity(t,c) \approx t \cdot c$

And then, we convert it into a probability using a sigmoid function:

$P(+|t,c) = \frac{1}{1+e^{-t \cdot c}}$, and therefore $P(-|t,c) = 1 - P(+|t,c) = \frac{e^{-t \cdot c}}{1+e^{-t \cdot c}}$

Skip-gram makes the strong but very useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities for all the $k$ words in context $c$:

$P(+|t,c_{1:k}) = \displaystyle \prod_{i=1}^k \frac{1}{1+e^{-t \cdot c_i}}$

$log(P(+|t,c_{1:k})) = \displaystyle \sum_{i=1}^k log \left( \frac{1}{1+e^{-t \cdot c_i}} \right) $
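The classifier's probabilities can be sketched as below, with embeddings as plain Python lists. The helper names and the toy vectors are assumptions; the two-branch `sigmoid` is only there to avoid floating-point overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def p_positive(t, c):
    """P(+|t,c): probability that c is a real context word for target t."""
    return sigmoid(dot(t, c))

def p_negative(t, c):
    """P(-|t,c) = 1 - P(+|t,c), which equals sigmoid(-t.c)."""
    return sigmoid(-dot(t, c))

def log_p_window(t, contexts):
    """log P(+|t,c_1:k) under the independence assumption: a sum of log-probabilities."""
    return sum(math.log(p_positive(t, c)) for c in contexts)

t = [1.0, 0.0]                      # toy target embedding
window = [[1.0, 0.0], [2.0, 0.0]]   # toy context embeddings
score = log_p_window(t, window)
```

Summing logs rather than multiplying probabilities gives the same ranking and is numerically safer when the window contains many words.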

*Learning skipgram embeddings*

We slide over every position in the corpus with a fixed context window. At each position, $t$ is the word the algorithm is positioned on, and each word $c$ inside the window yields a tuple $(t, c)$ that is labeled as a positive ($+$) example, i.e. a pair of words that appear together. For each positive example we then sample $k$ other words, different from $t$, to generate negative ($-$) examples. These sampled words are known as *noise words*.

Noise words are sampled according to their own (unigram) probability in the corpus. However, a fractional power of the probability is used to smooth the distribution, giving rare words slightly more probability mass and frequent words slightly less. As we saw before, a power of $\alpha=0.75$ is used:

$P^\alpha (w) = \frac{count(w)^\alpha}{\sum_{w^* \in V} count(w^*)^\alpha}$
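The construction of the training data can be sketched as follows. The function name `training_pairs`, the fixed random seed, and the toy sentence are assumptions for this example; noise words are drawn with `random.choices` using the $count(w)^\alpha$ weights, after excluding the target word itself.

```python
import random
from collections import Counter

def training_pairs(tokens, window=2, k=2, alpha=0.75, seed=0):
    """Return (positives, negatives): (target, context) and (target, noise) pairs."""
    rng = random.Random(seed)          # seeded only to make the sketch reproducible
    counts = Counter(tokens)
    positives, negatives = [], []
    for i, t in enumerate(tokens):
        # Candidate noise words: everything except the target, weighted by count**alpha.
        candidates = [w for w in counts if w != t]
        weights = [counts[w] ** alpha for w in candidates]
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            positives.append((t, tokens[j]))
            # k noise words per positive example.
            for noise in rng.choices(candidates, weights=weights, k=k):
                negatives.append((t, noise))
    return positives, negatives

tokens = "a tablespoon of apricot jam".split()
pos, neg = training_pairs(tokens, window=2, k=2)
```

This sketch assumes the corpus contains at least two distinct word types; it also allows a noise word to coincide with a true context word, which is harmless for illustration.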

The objective of the learning algorithm is then to maximize the similarity (dot product) between target and context embeddings for the positive pairs, and to minimize it for the negative pairs:

$L(\theta) = \displaystyle \sum_{(t,c) \in +} log(P(+|t,c)) + \displaystyle \sum_{(t,c) \in -} log(P(-|t,c))$

For a single positive pair $(t, c)$ together with its $k$ noise words $n_1, \dots, n_k$, this becomes:

$log \frac{1}{1+e^{-t \cdot c}} + \displaystyle \sum_{i=1}^{k} log \frac{1}{1 + e^{n_i \cdot t}}$

where $n_i$ is the noise word. We can then use stochastic gradient descent to train to this objective, iteratively
modifying the parameters (the embeddings for each target word $t$ and each context word or noise word $c$ in the vocabulary) to maximize the objective.
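To illustrate the training step, here is a hedged sketch of the objective for one positive pair with its noise words, plus one gradient-ascent update. The gradients are derived from the sigmoid form of $P(+|t,c)$ and $P(-|t,n_i) = \sigma(-n_i \cdot t)$; the function names, the learning rate, and the toy vectors are assumptions for this example.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def objective(t, c, noise):
    """log sigma(t.c) + sum_i log sigma(-n_i.t) for one positive pair and its noise words."""
    return math.log(sigmoid(dot(t, c))) + sum(
        math.log(sigmoid(-dot(n, t))) for n in noise
    )

def sgd_step(t, c, noise, lr=0.1):
    """One gradient-ascent step on the objective; returns updated copies (t, c, noise)."""
    g_pos = 1.0 - sigmoid(dot(t, c))               # how far the positive pair is from P=1
    g_neg = [sigmoid(dot(n, t)) for n in noise]    # how far each noise pair is from P=0
    grad_t = [
        g_pos * c[d] - sum(g * n[d] for g, n in zip(g_neg, noise))
        for d in range(len(t))
    ]
    new_t = [ti + lr * gi for ti, gi in zip(t, grad_t)]
    new_c = [ci + lr * g_pos * ti for ci, ti in zip(c, t)]          # pull c toward t
    new_noise = [
        [ni - lr * g * ti for ni, ti in zip(n, t)]                  # push noise away from t
        for g, n in zip(g_neg, noise)
    ]
    return new_t, new_c, new_noise

t, c = [0.1, -0.2], [0.3, 0.1]
noise = [[0.2, 0.4], [-0.1, 0.3]]
before = objective(t, c, noise)
t2, c2, noise2 = sgd_step(t, c, noise)
after = objective(t2, c2, noise2)
```

The update moves the target and context vectors closer together and pushes the noise vectors away from the target, so the objective on this example should increase after the step.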

Note that the skip-gram model thus actually learns two separate embeddings for each word $w$: the target embedding $t$ and the context embedding $c$. These embeddings are stored in two matrices, the target matrix $T$ and the context matrix $C$. So each row $i$ of the target matrix $T$ is the $1 × d$ vector embedding $t_i$ for word $i$ in the vocabulary $V$, and each column $j$ of the context matrix $C$ is a $d × 1$ vector embedding $c_j$ for word $j$ in $V$.

**Semantic properties of embeddings**

Vector semantic models have a number of parameters. One parameter that is relevant to both sparse tf-idf vectors and dense word2vec vectors is the size of the context window used to collect counts. This is generally between 1 and 10 words on each side of the target word (for a total context of 2-20 words).

