# Vector Semantics
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

*Recommended Reading*:
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)

*Notebook Covers Material of*:
- [SLP](https://web.stanford.edu/~jurafsky/slp3/6.pdf) Chapter 6: Vector Semantics and Embeddings

__Requirements__

- spaCy
- [gensim](https://radimrehurek.com/gensim/)

## Words as Vectors (Embeddings)
- Word embedding is the process by which words are transformed into vectors of real numbers.
- Definition of meaning by distributional similarity / usage: similar words are close in "space"


### One-Hot Encoding
- sparse vectors
- most basic way to turn a token into a vector
- method
    - associate a unique integer index with every word in a vocabulary of size $V$
    - turn this integer index $i$ into a binary vector of size $V$ (i.e. the size of the vocabulary)
    - the vector has all values `0` except for the $i$th entry, which is `1`
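
The method above can be sketched for a single word as follows. This is a minimal illustration in plain Python; the helper name `one_hot` and the toy vocabulary are assumptions made for the example, not part of the notebook.

```python
def one_hot(word, vocab):
    """Return a one-hot list of length len(vocab) for `word`."""
    index = {w: i for i, w in enumerate(vocab)}  # word -> unique integer index
    vec = [0] * len(vocab)                       # all values are 0 ...
    vec[index[word]] = 1                         # ... except the i-th entry
    return vec

toy_vocab = ["capital", "is", "of", "the"]  # hypothetical toy vocabulary
print(one_hot("of", toy_vocab))  # [0, 0, 1, 0]
```

A word that is not in the vocabulary raises a `KeyError` here; how to treat unknown words is left open.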

## Co-Occurrence Matrices and Words as Vectors

### Term-Document Matrix
- can be used to represent words, where the dimensions are documents
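
As a rough sketch, a term-document matrix can be built by counting each term per document. The names `toy_docs` and `td_matrix` are hypothetical, and the matrix is stored as a plain dictionary of rows rather than an array to keep the example short.

```python
from collections import Counter

toy_docs = [
    "the capital of France is Paris",
    "Rome is the capital of Italy",
]
counts = [Counter(d.split()) for d in toy_docs]  # one term counter per document
terms = sorted(set().union(*counts))             # all distinct terms

# rows are terms, columns are documents
td_matrix = {t: [c[t] for c in counts] for t in terms}

print(td_matrix["Paris"])    # [1, 0]
print(td_matrix["capital"])  # [1, 1]
```

Each row is the vector of a word: `Paris` occurs only in the first document, while `capital` occurs in both.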

### TF-IDF
- sparse vectors
- generally used to represent documents, where dimensions are words

#### TF: Term Frequency
$$\text{tf}_{t,d} = \text{count}(t,d)$$
$$\text{tf}_{t,d} = \log_{10}(\text{count}(t,d) + 1)$$

The `+1` is needed because the logarithm of 0 is undefined.

Alternatively:

$$\text{tf}_{t,d} = 
\begin{cases}
1 + \log_{10}(\text{count}(t,d)), & \text{if count}(t,d) > 0\\
0, & \text{otherwise}
\end{cases}$$
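
To see how the two log-scaled variants differ, the sketch below evaluates both for a few raw counts. The function names are made up for this illustration.

```python
import math

def tf_log_plus_one(count):
    """tf = log10(count + 1)"""
    return math.log10(count + 1)

def tf_one_plus_log(count):
    """tf = 1 + log10(count) if count > 0, else 0"""
    return 1 + math.log10(count) if count > 0 else 0

for count in (0, 1, 10, 100):
    print(count, round(tf_log_plus_one(count), 3), tf_one_plus_log(count))
```

Both variants map a count of 0 to 0 and grow logarithmically, but the second one assigns a weight of 1 to a single occurrence, whereas the first assigns about 0.3.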

#### IDF: Inverse Document Frequency

$$\text{idf}_t = \frac{N}{\text{df}_t}$$

Usually in log space, like term frequency.

$$\text{idf}_t = \log_{10}(\frac{N}{\text{df}_t})$$

- $\text{df}_t$ is the number of documents in which term $t$ occurs
- $N$ is the total number of documents in the collection.


The __tf-idf__ weighted value $w_{t,d}$ for word $t$ in document $d$ is the combination of $\text{tf}_{t,d}$ and $\text{idf}_t$:

$$w_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$$
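
A minimal sketch of the weighting, assuming the $\log_{10}(\text{count} + 1)$ variant of tf and the log-scaled idf given above. The function name `tf_idf` and the toy documents are assumptions for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            t: math.log10(c + 1) * math.log10(n_docs / df[t])  # tf * idf
            for t, c in counts.items()
        })
    return weights

toy_docs = [s.split() for s in ("the capital of France is Paris",
                                "Rome is the capital of Italy")]
w = tf_idf(toy_docs)
print(w[0]["the"])              # 0.0, since "the" occurs in every document
print(round(w[0]["Paris"], 4))  # 0.0906, i.e. log10(2) * log10(2)
```

Words that appear in every document get an idf of $\log_{10}(N/N) = 0$, so their weight vanishes regardless of how frequent they are.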


### Term-Term Matrix
- a.k.a. "word-word" or "word-context" matrix
- words are represented by a function of the counts of nearby words 
- size $|V| \times |V|$, where $|V|$ is the vocabulary size
    - usually context is taken to be a document or words in a window around the target word

### Pointwise Mutual Information (PMI) and Positive Pointwise Mutual Information (PPMI)
- used for term-term matrices
- "the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance."

#### Pointwise Mutual Information (PMI)
- a measure of how often two events $x$ and $y$ occur, compared with what we would expect if they were independent:

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)}$$


The pointwise mutual information between a target word $w$ and a context word $c$ is defined as:

$$\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)P(c)}$$


#### Positive Pointwise Mutual Information (PPMI)
- PMI values range from negative to positive infinity.
- negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable
- it is more common to use Positive PMI (called PPMI) which replaces all negative PMI values with zero

$$\text{PPMI}(w, c) = \max(\log_2 \frac{P(w, c)}{P(w)P(c)}, 0)$$


#### PPMI Matrix
To get a PPMI matrix from a co-occurrence matrix $F$ with $W$ rows (words) and $C$ columns (contexts), where $f_{ij}$ is the number of times word $w_i$ appears in context $c_j$ (i.e. the value of the cell), the probabilities are estimated as follows:

$$P(w_i,c_j) = \frac{f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$$

$$P(w_i) = \frac{\sum_{j=1}^C f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$$

$$P(c_j) = \frac{\sum_{i=1}^W f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$$


- PMI has the problem of being biased toward infrequent events: very rare words tend to have very high PMI values.
- Thus, $P(c)$ is replaced by $P_{\alpha}(c)$, which raises the count of the context word to the power of $\alpha$ (e.g. $0.75$) and renormalizes; this increases the probability assigned to rare contexts and so lowers their PMI
    - An alternative is Laplace smoothing

$$\text{PPMI}_{\alpha}(w, c) = \max(\log_2 \frac{P(w, c)}{P(w)P_{\alpha}(c)}, 0)$$

$$P_{\alpha}(c) = \frac{\text{count}(c)^{\alpha}}{\sum_{c}\text{count}(c)^{\alpha}}$$
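
The formulas above can be put together roughly as follows. This is a plain-Python sketch under two assumptions: the matrix is a list of rows, and $\text{count}(c)$ is taken to be the column sum of the co-occurrence matrix. The function name `ppmi` and the toy counts are hypothetical.

```python
import math

def ppmi(F, alpha=1.0):
    """PPMI for a count matrix F (list of rows).

    alpha=1.0 gives plain PPMI; alpha < 1 smooths the context probabilities.
    """
    total = sum(sum(row) for row in F)
    row_sums = [sum(row) for row in F]        # counts of each word
    col_sums = [sum(col) for col in zip(*F)]  # counts of each context
    col_alpha = [c ** alpha for c in col_sums]
    z = sum(col_alpha)                        # normaliser of P_alpha(c)
    out = []
    for i, row in enumerate(F):
        out_row = []
        for j, f in enumerate(row):
            if f == 0:
                out_row.append(0.0)  # PMI would be -inf, which PPMI clips to 0
                continue
            p_wc = f / total
            p_w = row_sums[i] / total
            p_c = col_alpha[j] / z
            out_row.append(max(math.log2(p_wc / (p_w * p_c)), 0.0))
        out.append(out_row)
    return out

toy_F = [[2, 0],
         [1, 1]]  # hypothetical counts: 2 words x 2 contexts
print(ppmi(toy_F))
print(ppmi(toy_F, alpha=0.75))
```

In this toy matrix the second context is rare, and with `alpha=0.75` its PPMI with the second word drops below the unsmoothed value of 1.0, which is the intended effect of the smoothing.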

### Building a Co-Occurrence Matrix

In [38]:
# let's define a function to build the vocabulary
def get_vocab(samples):
    vocab = set()
    for s in samples:
        # accept both tokenized (list) and raw (string) samples
        words = s if isinstance(s, list) else s.split()
        vocab.update(words)
    return sorted(vocab)

In [39]:
import numpy as np

def cooc_matrix(samples, vocab):
    # map each word to its row/column index once, instead of searching the list in the loop
    idx = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    # co-occurrence is counted at the document (here: sentence) level
    for s in samples:
        for w1 in s:  # rows
            for w2 in s:  # columns
                m[idx[w1], idx[w2]] += 1
    return m

In [40]:
data = [
    "the capital of France is Paris", 
    "Rome is the capital of Italy",
]

vocab = get_vocab(data)
print(vocab)
cm = cooc_matrix([s.split() for s in data], vocab)
print(cm)


['France', 'Italy', 'Paris', 'Rome', 'capital', 'is', 'of', 'the']
[[1. 0. 1. 0. 1. 1. 1. 1.]
 [0. 1. 0. 1. 1. 1. 1. 1.]
 [1. 0. 1. 0. 1. 1. 1. 1.]
 [0. 1. 0. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 2. 2. 2. 2.]
 [1. 1. 1. 1. 2. 2. 2. 2.]
 [1. 1. 1. 1. 2. 2. 2. 2.]
 [1. 1. 1. 1. 2. 2. 2. 2.]]


In [41]:
print(cm[vocab.index('Rome')])
print(cm[vocab.index('Paris')])
print(cm[vocab.index('is')])

[0. 1. 0. 1. 1. 1. 1. 1.]
[1. 0. 1. 0. 1. 1. 1. 1.]
[1. 1. 1. 1. 2. 2. 2. 2.]


### Exercise
- Extend the co-occurrence matrix computation to allow specifying a context window (as a tuple giving the numbers of previous and next words)
- Define a function to compute PPMI on a co-occurrence matrix (Optional)

## Vector Similarity
- two words are similar in meaning if their context __vectors__ are similar
- __Cosine similarity__ measures the similarity between two vectors of an __inner product space__. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.

### Dot Product

- dot product (inner product)

$$\vec{v}\cdot\vec{w} = \sum^N_{i=1}v_i w_i = v_1 w_1 + v_2 w_2 + ... + v_N w_N$$

- vector length (L2 norm $||v||_2$)

$$|\vec{v}| = \sqrt{\sum^N_{i=1} v_i^2}$$ 

$$ |\vec{v}| = \sqrt{\vec{v}\cdot\vec{v}} = \sqrt{\sum^N_{i=1} v_i v_i} = \sqrt{v_1 v_1 + v_2 v_2 + ... + v_N v_N}$$
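
As a small worked example, both quantities can be computed directly from the definitions. The helper names `dot` and `norm` are chosen for this sketch only.

```python
import math

def dot(v, w):
    """Dot product: sum of element-wise products."""
    return sum(vi * wi for vi, wi in zip(v, w))

def norm(v):
    """Vector length (L2 norm): square root of the dot product of v with itself."""
    return math.sqrt(dot(v, v))

print(dot([1, 2, 3], [4, 5, 6]))  # 32  (1*4 + 2*5 + 3*6)
print(norm([3, 4]))               # 5.0 (sqrt(9 + 16))
```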

### Cosine Similarity

- L2 normalized dot product of 2 vectors
    - $\theta$ is the angle between $\vec{v}$ and $\vec{w}$

$$\vec{v}\cdot\vec{w} = |\vec{v}||\vec{w}|\cos\theta$$

$$\cos\theta = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}||\vec{w}|}$$

$$\text{CosSim}(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}||\vec{w}|} = \frac{\sum^N_{i=1}v_i w_i}{\sqrt{\sum^N_{i=1} v_i^2} \sqrt{\sum^N_{i=1} w_i^2}}$$

#### Cosine Distance
$$\text{Cosine Distance}(\vec{v}, \vec{w}) = 1 - \text{Cosine Similarity}(\vec{v}, \vec{w})$$
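
A minimal sketch of both measures in plain Python, applied to the `Rome` and `Paris` rows of the co-occurrence matrix built earlier. The function name `cos_sim` is an assumption for this example, and zero vectors are not handled (they would cause a division by zero).

```python
import math

def cos_sim(v, w):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

rome = [0, 1, 0, 1, 1, 1, 1, 1]   # co-occurrence row of 'Rome'
paris = [1, 0, 1, 0, 1, 1, 1, 1]  # co-occurrence row of 'Paris'

sim = cos_sim(rome, paris)
print(round(sim, 3))      # 0.667 (similarity: 4 / (sqrt(6) * sqrt(6)))
print(round(1 - sim, 3))  # 0.333 (distance)
```

The two words share the four function-word contexts but differ in the four country and city contexts, which gives a similarity of about two thirds.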

### Exercises
- Implement one-hot encoding (binary vectorization)
    - takes vocabulary and a sentence as arguments (lists of words)
    - outputs numpy vector (`ndarray`)
- Implement a function to compute __cosine similarity__ using `numpy` methods
    - `np.dot`
    - `np.sqrt`
- Using the defined functions
    - vectorize the sentences:
        - "the capital of France is Paris"
        - "Rome is the capital of Italy"
    - compute cosine similarity between them
    - compare the similarity values to the cosine similarity obtained from the output of `scipy.spatial.distance.cosine`
        - i.e. use *distance* to compute *similarity*


## Training Word Embeddings with gensim

### Word2Vec
- dense vectors
- representation is created by training a classifier to distinguish nearby and far-away words
- Variants
    - SKIP-GRAM
    - CBOW
- Refer to [documentation](https://radimrehurek.com/gensim/models/word2vec.html) for details
- [Tutorial](https://rare-technologies.com/word2vec-tutorial/)
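
To make the "nearby words" idea concrete, the sketch below lists the (target, context) pairs that a skip-gram style classifier would treat as positive examples for a given window size. The function name `skipgram_pairs` is hypothetical, and this is only an illustration of the idea: gensim's actual training loop differs in its details, and negative examples (words sampled from outside the window) are not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the context window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("Rome is the capital".split(), window=1))
# [('Rome', 'is'), ('is', 'Rome'), ('is', 'the'), ('the', 'is'), ('the', 'capital'), ('capital', 'the')]
```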

In [None]:
%%bash
pip install python-Levenshtein

In [None]:
# training the model
from gensim.models import Word2Vec
model = Word2Vec(sentences=[d.split() for d in data], vector_size=10, window=5, min_count=1, workers=4)
model.save("word2vec.model")

In [None]:
# loading the model
model = Word2Vec.load("word2vec.model")
print(model)

In [None]:
# getting word vectors
print(model.wv['Rome'])
# getting most similar
print(model.wv.most_similar('Rome', topn=3))

## Pre-Trained Embeddings
- Training embeddings is computationally expensive
- Many pre-trained models are available

In [None]:
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
# Download the 'word2vec-google-news-300' embeddings
# w2v = gensim.downloader.load('word2vec-google-news-300')

## Word Embeddings in spaCy

> To make them compact and fast, spaCy's small pipeline packages (all packages that end in `sm`) don't ship with word vectors, and only include context-sensitive tensors. This means you can still use the `similarity()` methods to compare documents, spans and tokens -- but the result won't be as good, and individual tokens won't have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:

> `python -m spacy download en_core_web_lg`

> Pipeline packages that come with built-in word vectors make them available as the `Token.vector` attribute. `Doc.vector` and `Span.vector` will default to an __average of their token vectors__. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.

> Each `Doc`, `Span`, `Token` and `Lexeme` comes with a `.similarity` method that lets you compare it with another object, and determine the similarity. 

In [None]:
import spacy
spacy.cli.download('en_core_web_lg')

### Accessing Embedding Vectors

In [6]:
import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')
txt = 'Rome is the capital of Italy'
doc = nlp(txt)

tok = doc[0]  # let's get Rome

print("string:", tok.text)
# print("vector:", tok.vector)
print("vector dimension:", len(tok.vector))
print("spacy vector norm:", tok.vector_norm)
print("numpy vector norm:", np.sqrt(np.dot(tok.vector, tok.vector)))
print("numpy linalg norm:", np.linalg.norm(tok.vector))

string: Rome
vector dimension: 300
spacy vector norm: 7.191654
numpy vector norm: 7.191654
numpy linalg norm: 7.191654


In [8]:
from scipy.spatial.distance import cosine

# let's get Paris & compare its vector to Rome
paris = nlp('Paris')[0]
print(paris.text)

print("spacy CosSim({}, {}):".format(tok.text, paris.text), tok.similarity(paris))
print("scipy CosSim({}, {}):".format(tok.text, paris.text), 1 - cosine(tok.vector, paris.vector))
# print("_our_ CosSim({}, {}):".format(tok.text, paris.text), cossim(tok.vector, paris.vector))

Paris
spacy CosSim(Rome, Paris): 0.58241165
scipy CosSim(Rome, Paris): 0.5824115872383118


### Evaluation: Analogy Task
In the word analogy task, we complete the sentence of the form

"$w_1$ is to $w_2$ as $w_3$ is to $w_4$", where $w_4$ is a blank. 

For instance:

"*man* is to *woman* as *king* is to **__**", and our goal is to guess the missing word (*queen*)

The task is approached using cosine similarity between vector differences: 

$$\vec{w_2} - \vec{w_1} \approx \vec{w_4} - \vec{w_3}$$

$$\vec{w_4} \approx \vec{w_3} + \vec{w_2} - \vec{w_1}$$

$$w = \arg\max_{w \in V}(\vec{w} \cdot (\vec{w_3} + \vec{w_2} - \vec{w_1}))$$


$$w = \arg\max_{w \in V}\text{CosSim}(\vec{w_2} - \vec{w_1}, \vec{w} - \vec{w_3})$$
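
The first formulation can be sketched on toy data as follows. The 2-dimensional vectors in `toy_vectors` are invented for this illustration (they are not real embeddings), and `analogy` is a hypothetical helper name. Excluding the three query words from the candidates is an assumption of this sketch; it is a common choice, since otherwise one of the query words itself is often the nearest neighbour.

```python
import math

def cos_sim(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) *
                  math.sqrt(sum(wi * wi for wi in w)))

# hypothetical toy embeddings: (royalty, gender)
toy_vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [0.1, 0.0],
}

def analogy(w1, w2, w3, vectors=toy_vectors):
    """Return the word closest (by cosine) to w3 + w2 - w1, excluding the query words."""
    target = [c + b - a for a, b, c in zip(vectors[w1], vectors[w2], vectors[w3])]
    candidates = [w for w in vectors if w not in (w1, w2, w3)]
    return max(candidates, key=lambda w: cos_sim(vectors[w], target))

print(analogy("man", "woman", "king"))  # queen
```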

#### Analogy using Most Similar
> For each of the given vectors, find the `n` most similar entries to it by cosine. 
> Queries are by vector. Results are returned as a (`keys`, `best_rows`, `scores`) tuple.

In [36]:
def analogy_spacy(w1, w2, w3):
    v1 = nlp.vocab[w1].vector
    v2 = nlp.vocab[w2].vector
    v3 = nlp.vocab[w3].vector
    
    # relation vector
    rv = v3 + v2 - v1
    
    # defaults are n=1 and sort=True; here we ask for the 10 nearest entries
    ms = nlp.vocab.vectors.most_similar(np.asarray([rv]), n=10)
    print(ms)
    
    # getting words & scores
    for i, key in enumerate(ms[0][0]):
        print(nlp.vocab.strings[key], ms[2][0][i])


In [37]:
print(analogy_spacy('man', 'woman', 'king'))
# expected output ('Queen', 0.7881)

(array([[13176088972490086564, 14826469074451677028,  7464393751932445219,
         7102492827649024548, 10168488388102651113,  4176741725343376093,
         5247273317732208552, 11742085837932180620, 12278543830867659210,
         4527521648030784477]], dtype=uint64), array([[391588,   2183,   3150,  27270,   5310,   6026,  59856,  94889,
         11900,   7474]], dtype=int32), array([[0.8024, 0.8024, 0.8024, 0.8024, 0.7881, 0.7881, 0.7881, 0.6401,
        0.6401, 0.6401]], dtype=float32))
KIng 0.8024
King 0.8024
king 0.8024
KING 0.8024
Queen 0.7881
queen 0.7881
QUEEN 0.7881
PRINCE 0.6401
prince 0.6401
Prince 0.6401
None


In [31]:
# print(analogy_spacy('Rome', 'Italy', 'Paris'))
# ('france', 0.7606)

3 0.7606 france -inf
4 0.7606 France 0.7606
5 0.7606 FRANCE 0.7606
9 0.6176 Europe 0.7606
('france', 0.7606)


#### Exercise
Implement an analogy computation function that
- takes the 3 words of the analogy task as input
- outputs the fourth word
- makes use of spaCy vectors

version 2:
- implement without using spacy's `most_similar`