# word2vec as Matrix Factorization

Let's start with a simple training set.

In [1]:
train = ['The student speaks, as the student wants to learn. We learn what the student wants.']

In [None]:
# (Later)
# There are plenty of books freely available online!
# The Adventures of Sherlock Holmes by Arthur Conan Doyle
# http://www.gutenberg.org/ebooks/1661
'''!wget http://www.gutenberg.org/files/1661/1661-0.txt
with open('1661-0.txt') as f:
    train = [f.read()]'''

In [None]:
# (Even later)
# Bigger dataset of 100 MB (17M words)
'''!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
with open('text8') as f:
    train = [f.read()]'''

In [2]:
len(train[0].split()), 'words'

(15, 'words')

First, let's standardize the text (lowercase, remove punctuation, etc.).

In [3]:
%%time
from sklearn.feature_extraction.text import CountVectorizer

transformer = CountVectorizer()
transformer.fit(train)
analyzer = transformer.build_analyzer()

tokens = analyzer(train[0])
print(tokens)

['the', 'student', 'speaks', 'as', 'the', 'student', 'wants', 'to', 'learn', 'we', 'learn', 'what', 'the', 'student', 'wants']
CPU times: user 1.12 s, sys: 998 ms, total: 2.12 s
Wall time: 391 ms


In [4]:
encoder = transformer.vocabulary_
encoder

{'the': 4,
 'student': 3,
 'speaks': 2,
 'as': 0,
 'wants': 6,
 'to': 5,
 'learn': 1,
 'we': 7,
 'what': 8}

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
decoder = transformer.get_feature_names_out().tolist()
decoder

['as', 'learn', 'speaks', 'student', 'the', 'to', 'wants', 'we', 'what']

The context is a window of size `WINDOW_SIZE` around each word of the corpus.

If the corpus is $w_0, \ldots, w_{n - 1}$, the context of a word $w_i$ is the set of its neighbors $w_{i - L}, \ldots, w_{i - 1}, w_{i + 1}, \ldots, w_{i + L}$ (the word $w_i$ itself is excluded), where $L$ is the `WINDOW_SIZE`.

Write a piece of code that builds a **word**-context count matrix. Be careful of corner cases.

**The** student $\rightarrow$ *student* is a context of ***The***, so we should increment that word-context pair  
The **student** speaks $\rightarrow$ *The* and *speaks* are contexts of ***student***  
student **speaks** as $\rightarrow$ *student* and *as* are contexts of ***speaks***

In [None]:
%%time
from collections import Counter

WINDOW_SIZE = 1  # Should be 1 as a start, then 5 for bigger corpora
counts = Counter()  # Should contain the number of word-context occurrences (keys are pairs)
nb_word = Counter()  # Should contain the number of occurrences of each word
nb_context = Counter()  # Should contain the number of occurrences of each context

for pos, word in enumerate(tokens):
    # Your code here
    pass
# Check counts
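
One possible way to fill in the loop above is sketched below; it is not the only valid solution. It assumes that `nb_word` and `nb_context` count the number of word-context pairs in which each word (respectively context) takes part, so that all three counters are consistent with each other. The corner cases are handled by clipping the window at both ends of the corpus.

```python
from collections import Counter

# Toy tokens, copied from the analyzer output shown earlier in the notebook
tokens = ['the', 'student', 'speaks', 'as', 'the', 'student', 'wants', 'to',
          'learn', 'we', 'learn', 'what', 'the', 'student', 'wants']

WINDOW_SIZE = 1
counts = Counter()      # (word, context) -> number of co-occurrences
nb_word = Counter()     # word -> number of pairs in which it is the word
nb_context = Counter()  # context -> number of pairs in which it is the context

for pos, word in enumerate(tokens):
    # Clip the window so that it never leaves the corpus (corner cases)
    start = max(0, pos - WINDOW_SIZE)
    end = min(len(tokens), pos + WINDOW_SIZE + 1)
    for context_pos in range(start, end):
        if context_pos == pos:
            continue  # a word is not its own context
        context = tokens[context_pos]
        counts[word, context] += 1
        nb_word[word] += 1
        nb_context[context] += 1

# Check counts
print(counts['the', 'student'], counts['student', 'the'])
print(sum(counts.values()), 'pairs in total')
```

With `WINDOW_SIZE = 1`, every token contributes two pairs except the first and the last, which contribute one each, so a corpus of $n$ tokens yields $2(n - 1)$ pairs.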

We will now build a word-context PMI matrix (*pointwise mutual information*), empirically given by:

$$ PMI(w, c) = \log \frac{P(w, c)}{P(w)P(c)} = \log \frac{\#(w, c) |D|}{\#(w) \#(c)} $$

where $|D|$ is the total number of word-context pairs observed in the corpus, and $\#(w), \#(c), \#(w, c)$ are respectively the number of pairs in which the word, the context and the word-context pair occur.

This matrix will be sparse: please populate the `rows`, `cols`, and `data` lists with the word indices, the context indices (both looked up in the `encoder` vocabulary), and the PMI values.

In [None]:
%%time
import numpy as np

rows = []  # Contains word indices
cols = []  # Contains context indices
data = []  # Contains values of the matrix

for (word, context), count in counts.items():
    # Your code here
    pass
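
A minimal sketch of this step follows, under two assumptions that you may want to revisit: $|D|$ is taken to be the total number of word-context pairs (as in Levy & Goldberg), and the counters are rebuilt here on the toy corpus so that the snippet runs on its own. The name `nb_pairs` is introduced for illustration only, and `encoder` is rebuilt by hand instead of coming from `CountVectorizer`.

```python
import math
from collections import Counter

tokens = ['the', 'student', 'speaks', 'as', 'the', 'student', 'wants', 'to',
          'learn', 'we', 'learn', 'what', 'the', 'student', 'wants']
WINDOW_SIZE = 1

# Rebuild the co-occurrence counters (same convention as the counting exercise)
counts, nb_word, nb_context = Counter(), Counter(), Counter()
for pos, word in enumerate(tokens):
    left = tokens[max(0, pos - WINDOW_SIZE):pos]
    right = tokens[pos + 1:pos + 1 + WINDOW_SIZE]
    for context in left + right:
        counts[word, context] += 1
        nb_word[word] += 1
        nb_context[context] += 1

# Word -> index mapping; sorting alphabetically matches the vocabulary above
encoder = {word: index for index, word in enumerate(sorted(set(tokens)))}

nb_pairs = sum(counts.values())  # |D|, assumed to be the number of pairs

rows, cols, data = [], [], []
for (word, context), count in counts.items():
    rows.append(encoder[word])
    cols.append(encoder[context])
    # PMI(w, c) = log( #(w, c) * |D| / (#(w) * #(c)) )
    data.append(math.log(count * nb_pairs / (nb_word[word] * nb_context[context])))

print(len(data), 'non-zero entries')
```

Only the pairs that were actually observed get an entry, which is what keeps the matrix sparse: unobserved pairs would have a PMI of $\log 0 = -\infty$ and are simply left out.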

In [None]:
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

pmi = csr_matrix((data, (rows, cols)), shape=(len(nb_word), len(nb_context)))
pmi

We are going to compute the SVD of this matrix to reduce the dimensionality.

The (compact) singular value decomposition of an $m \times n$ matrix $M$ of rank $r$ is $U \Sigma V^T$ where:

- $U$ is semi-unitary and of size $m \times r$
- $\Sigma$ is diagonal and of size $r \times r$
- $V^T$ is of size $r \times n$, with $V$ semi-unitary

Here, semi-unitary means that $U^T U = V^T V = I_{r \times r}$.

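We usually sort the singular values in decreasing order. Keeping only the $k$ largest ones (truncated SVD) then gives the best rank-$k$ approximation of $M$ in the Frobenius norm, which is why it retains as much of the information in $M$ as possible.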
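
These properties can be checked numerically on a small example. The sketch below uses `numpy.linalg.svd` on a random dense matrix rather than `svds` on the PMI matrix; the matrix `M`, its size and the choice `k = 2` are arbitrary and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# A 6 x 5 matrix of rank 3, built as the product of two thin random matrices
M = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 5))

# Thin SVD; numpy returns the singular values sorted in decreasing order
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

# Compact SVD: keep only the r non-zero singular values
r = np.linalg.matrix_rank(M)
U, sigma, Vt = U[:, :r], sigma[:r], Vt[:r, :]

print(np.allclose(U.T @ U, np.eye(r)))           # U is semi-unitary
print(np.allclose(Vt @ Vt.T, np.eye(r)))         # so is V
print(np.allclose(U @ np.diag(sigma) @ Vt, M))   # M is recovered exactly

# Rank-k truncation: keep the k largest singular values
k = 2
M_k = (U[:, :k] * sigma[:k]) @ Vt[:k, :]
error = np.linalg.norm(M - M_k)                  # Frobenius norm by default
# The error equals the norm of the discarded singular values
print(np.isclose(error, np.sqrt(np.sum(sigma[k:] ** 2))))
```

Note that `U[:, :k] * sigma[:k]` scales each column of `U` by the matching singular value, which is the same as multiplying by the diagonal matrix.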

In [None]:
%%time
u, sigma, vt = svds(pmi, k=3)
# 1 min 55 s on text8

In [None]:
pmi.min(), pmi.max()

In [None]:
embeddings = u * sigma
embeddings.shape

In [None]:
u.shape, sigma.shape, vt.shape

## Word similarity

Now let's play with embeddings!

Implement the cosine similarity:

$$ \cos(u, v) = \frac{\langle u, v \rangle}{\| u \|_2 \, \| v \|_2} $$

then check that you have the same results as `sklearn`.

Now is the time to move to the Sherlock Holmes corpus: go back to the first cells and load it!

Pick a few words in the vocabulary (`encoder`) and compute their 20 closest neighbors. 

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Your code here
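
A possible sketch with plain NumPy is given below. The helper names `cosine` and `closest`, as well as the toy embeddings, are made up for illustration; in the notebook you would pass the real `embeddings`, `encoder` and `decoder`. Rows with a zero norm are not handled here and would produce `nan` values.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def closest(word, embeddings, encoder, decoder, top=20):
    """Return the `top` words most similar to `word`, with their scores."""
    # Normalize every row once, so that a dot product is a cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[encoder[word]]
    ranked = np.argsort(-scores).tolist()  # most similar first
    return [(decoder[i], float(scores[i]))
            for i in ranked if i != encoder[word]][:top]

# Hypothetical toy embeddings, only there to exercise the functions
toy_decoder = ['holmes', 'watson', 'street']
toy_encoder = {word: index for index, word in enumerate(toy_decoder)}
toy_embeddings = np.array([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0]])

print(cosine(toy_embeddings[0], toy_embeddings[1]))
print(closest('holmes', toy_embeddings, toy_encoder, toy_decoder, top=2))
```

To compare with `sklearn`, `cosine_similarity(embeddings)` returns the full matrix of pairwise similarities, so its entry `[i, j]` should match `cosine(embeddings[i], embeddings[j])` up to floating-point error.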

In practice, these similarity values are noisy. Let's remove the negative values from the PMI matrix.

> *When representing words, there is some intuition behind ignoring negative values: humans can easily think of positive associations (e.g. “Canada” and “snow”) but find it much harder to invent negative ones (“Canada” and “desert”). This suggests that the perceived similarity of two words is more influenced by the positive context they share than by the negative context they share. It therefore makes some intuitive sense to discard the negatively associated contexts and mark them as “uninformative” (0) instead.* (Levy & Goldberg, 2014)

It is now a PPMI matrix: positive pointwise mutual information.

In [None]:
ppmi = pmi.copy()
ppmi.data[ppmi.data < 0] = 0
ppmi.eliminate_zeros()  # Remove the now-zero values
ppmi
# Now recompute the SVD with ppmi

## Semantic analogies

Now **let's replace logic with algebra.**

(Not everyone will be satisfied with this statement, I guess.)

Recompute the PPMI matrix and embeddings for the bigger dataset, `text8`.

Then attempt to answer some questions where we have to find $b^*$ in:

$$ a \textrm{ is to } a^* \textrm{ as } b \textrm{ is to } b^* $$

(e.g. *Paris is to France as Tokyo is to Japan*)

How can we express this in terms of embeddings?

In [None]:
# Once text8 has been trained
# encoder['paris'], encoder['france'], encoder['tokyo'], encoder['japan']
# a a* b (b*?)

Does it work? You may want to normalize the embeddings to unit length. Store the result in `embed_unit`.

Please look for nasty analogies (shortcuts that are unfair).

In [None]:
# Your code here
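
One common way to turn the analogy into algebra is to look for the word whose embedding is closest to $b - a + a^*$. The sketch below follows that idea; it is one possible approach, not the only one. The function name `analogy` and the hand-made 3-dimensional embeddings are hypothetical and only there so that the snippet runs on its own. The query words are excluded from the candidates, which is a usual convention since they often rank first otherwise.

```python
import numpy as np

def analogy(a, a_star, b, embed_unit, encoder, decoder, top=5):
    """Rank candidates for b* in 'a is to a* as b is to b*'."""
    target = (embed_unit[encoder[b]]
              - embed_unit[encoder[a]]
              + embed_unit[encoder[a_star]])
    # Rows are unit vectors, so the dot product ranks words like the cosine
    scores = embed_unit @ target
    banned = {encoder[a], encoder[a_star], encoder[b]}
    ranked = [i for i in np.argsort(-scores).tolist() if i not in banned]
    return [decoder[i] for i in ranked[:top]]

# Hypothetical toy embeddings (axes: "French", "Japanese", "capital city")
decoder = ['paris', 'france', 'tokyo', 'japan', 'river']
encoder = {word: index for index, word in enumerate(decoder)}
embeddings = np.array([[1.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 1.0],
                       [0.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0]])

# Normalize each row to unit length
embed_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

print(analogy('paris', 'france', 'tokyo', embed_unit, encoder, decoder))
```

On real embeddings, try removing the `banned` filter to see how often one of the three query words comes out on top.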

Please note that it is possible to reuse your factorization method to learn the embeddings. This is the topic of a future homework!

## References

Great reads!

Levy, O., & Goldberg, Y. (2014). [Neural word embedding as implicit matrix factorization.](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf) In Advances in neural information processing systems (pp. 2177–2185).

Doyle, A. C. (1891). [The Adventures of Sherlock Holmes: Adventure I. — A Scandal in Bohemia.](http://www.gutenberg.org/ebooks/1661) The Strand Magazine, vol. 2, pp. 61–75 (July 1891). 