# Experiments with word embeddings

In this notebook, we'll have some fun with **<font color="magenta">word embeddings</font>**: distributed representations of words. 

We'll see how such an embedding can be constructed by applying principal component analysis to a suitably transformed matrix of word co-occurrence probabilities. For computational reasons, we'll use the moderately sized **Brown corpus of present-day American English** for this.

## 1. Accessing the Brown corpus

The *Brown corpus* is available as part of the Python Natural Language Toolkit (`nltk`).

In [None]:
import numpy as np
import pickle
import nltk
nltk.download('brown')
nltk.download('stopwords')
from nltk.corpus import brown, stopwords
from scipy.cluster.vq import kmeans2
from sklearn.decomposition import PCA

The corpus consists of 500 samples of text drawn from a wide range of sources. When these are concatenated, they form a very long stream of over a million words, which is available as `brown.words()`. Let's look at the first 50 words.

In [None]:
# Slice once and print the 50 words on a single line.
print(' '.join(brown.words()[:50]))

Before doing anything else, let's remove stopwords and punctuation and make everything lowercase. The resulting sequence will be stored in `my_word_stream`.

In [None]:
my_stopwords = set(stopwords.words('english'))
word_stream = [str(w).lower() for w in brown.words() if w.lower() not in my_stopwords]
my_word_stream = [w for w in word_stream if (len(w) > 1 and w.isalnum())]

Here are the initial 20 words in `my_word_stream`.

In [None]:
my_word_stream[:20]
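To see what the filter keeps and discards, here is a minimal sketch on a toy token list. The names `toy_tokens` and `toy_stopwords` are made up for illustration; the notebook itself uses `brown.words()` and NLTK's English stopword list.

```python
# Hypothetical toy inputs, standing in for brown.words() and the NLTK stopword list.
toy_tokens = ["The", "Fulton", "County", "Grand", "Jury", "said",
              "``", "no", "evidence", "''", ".", "X", "1961", "state-owned"]
toy_stopwords = {"the", "no"}

# Same two passes as the cell above: lowercase and drop stopwords,
# then keep only alphanumeric tokens longer than one character.
lowered = [t.lower() for t in toy_tokens if t.lower() not in toy_stopwords]
cleaned = [t for t in lowered if len(t) > 1 and t.isalnum()]
print(cleaned)
```

Note that purely numeric tokens such as `1961` pass the `isalnum()` test, while hyphenated tokens such as `state-owned` are dropped.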

## 2. Computing co-occurrence probabilities

<font color="magenta">Step 1: Get a list of words and their frequencies.</font>

In [None]:
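N = len(my_word_stream)
words = []    # distinct words, in order of first appearance
totals = {}   # word -> number of occurrences
# Note: these loop bounds skip the first and last word of the stream.
for i in range(1, N-1):
    w = my_word_stream[i]
    if w not in totals:    # dictionary lookup instead of scanning the list
        words.append(w)
        totals[w] = 0
    totals[w] = totals[w] + 1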

<font color="magenta">Step 2: Decide on the vocabulary.</font> There are two potentially distinct vocabularies: the words for which we will obtain embeddings (`vocab_words`) and the words we will consider when looking at context information (`context_words`). We will take the former to be all words that occur at least 20 times, and the latter to be all words that occur at least 100 times. These choices are pretty arbitrary: by all means, play around with them and find something better.

In [None]:
vocab_words = [w for w in words if totals[w] > 19]
context_words = [w for w in words if totals[w] > 99]

How large are these two word lists? **Note down these numbers: you will need to enter them as part of this week's assignment.**

In [None]:
len(vocab_words), len(context_words)
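The counting and thresholding in Steps 1 and 2 can also be sketched with `collections.Counter`. This is only an illustration on a toy stream: the names `toy_stream`, `toy_totals`, `toy_vocab` and `toy_context` and the thresholds 2 and 3 are assumptions chosen so that the toy data gives non-empty lists.

```python
from collections import Counter

# Hypothetical toy stream; the notebook counts over my_word_stream instead.
toy_stream = ["a", "b", "a", "c", "a", "b"]

# Counter maps each word to its number of occurrences and
# remembers the order in which words were first seen.
toy_totals = Counter(toy_stream)

# Two vocabularies with different frequency cutoffs (toy thresholds).
toy_vocab = [w for w in toy_totals if toy_totals[w] >= 2]
toy_context = [w for w in toy_totals if toy_totals[w] >= 3]
print(toy_totals, toy_vocab, toy_context)
```

Unlike the loop in Step 1, this sketch counts every position in the stream, including the first and last.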

<font color="magenta">Step 3: Get co-occurrence counts.</font> These are defined as follows, for a small constant `window_size=2`.

* Let `w0` be any word in `vocab_words` and `w` any word in `context_words`.
* Each time `w0` occurs in the corpus, look at the window of `window_size` words before and after it. If `w` appears in this window, we say it appears in the context of (this particular occurrence of) `w0`.
* Define `counts[w0][w]` as the total number of times `w` occurs in the context of `w0`.

The function `get_counts` computes the `counts` array, and returns it as a dictionary (of dictionaries).

In [None]:
def get_counts(window_size=2):
    # Sets give constant-time membership tests, much faster than scanning the lists.
    vocab_set = set(vocab_words)
    context_set = set(context_words)
    counts = {w0: {} for w0 in vocab_words}
    # Offsets of the context positions relative to the center word, e.g. [-2, -1, 1, 2].
    offsets = [j for j in range(-window_size, window_size + 1) if j != 0]
    for i in range(window_size, N - window_size):
        w0 = my_word_stream[i]
        if w0 in vocab_set:
            for j in offsets:
                w = my_word_stream[i + j]
                if w in context_set:
                    counts[w0][w] = counts[w0].get(w, 0) + 1
    return counts
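A small worked example may make the window definition concrete. The sketch below reimplements the same counting rule on a seven-word toy stream; `toy_window_counts` and its arguments are hypothetical names, not part of the notebook.

```python
from collections import Counter, defaultdict

def toy_window_counts(stream, vocab, context, window_size=2):
    """Count how often each context word falls within window_size
    positions of each occurrence of a vocabulary word."""
    counts = defaultdict(Counter)
    # As in get_counts, positions too close to either end are skipped.
    for i in range(window_size, len(stream) - window_size):
        w0 = stream[i]
        if w0 not in vocab:
            continue
        for j in range(-window_size, window_size + 1):
            if j == 0:
                continue  # the center word is not part of its own context
            w = stream[i + j]
            if w in context:
                counts[w0][w] += 1
    return counts

stream = ["a", "b", "c", "a", "b", "d", "a"]
toy = toy_window_counts(stream, vocab={"a", "c"}, context={"a", "b"})
print({w0: dict(row) for w0, row in toy.items()})
```

Only positions 2, 3 and 4 are visited as centers here. The occurrence of `c` at position 2 sees `a, b, a, b`, and the occurrence of `a` at position 3 sees `b, c, b, d`, of which only the two `b`s are context words. The occurrences of `a` at positions 0 and 6 are never used as centers because they lie within `window_size` of the ends.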

Define `probs[w0][]` to be the distribution over the context of `w0`, that is:
* `probs[w0][w] = counts[w0][w] / (sum of all counts[w0][])`

This is computed by the function `get_co_occurrence_dictionary`, given `counts`.

In [None]:
def get_co_occurrence_dictionary(counts):
    probs = {}
    for w0 in counts.keys():
        # Total number of context observations for w0.
        total = 0
        for w in counts[w0].keys():
            total = total + counts[w0][w]
        # Words with no context observations get no entry in probs.
        if total > 0:
            probs[w0] = {}
            for w in counts[w0].keys():
                probs[w0][w] = float(counts[w0][w])/float(total)
    return probs
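As a quick check of the normalization, here is a sketch on a hand-written toy `counts` dictionary (the name `toy_counts` and its contents are made up for illustration).

```python
# Hypothetical counts: "z" is a vocabulary word with no context observations.
toy_counts = {"c": {"a": 2, "b": 2}, "a": {"b": 2}, "z": {}}

toy_probs = {}
for w0, row in toy_counts.items():
    total = sum(row.values())
    if total > 0:
        # Each row is divided by its own total, so it sums to 1.
        toy_probs[w0] = {w: c / total for w, c in row.items()}
print(toy_probs)
```

Each row of `toy_probs` sums to 1, and `z` is absent because its row total is zero. Code that later looks up `probs[w0]` therefore needs to allow for missing keys.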

The final piece of information we need is the frequency of different context words. The function below, `get_context_word_distribution`, takes `counts` as input and returns (again, in dictionary form) the array:

* `context_frequency[w]` = sum of all `counts[][w]` / sum of all `counts[][]` 

In [None]:
def get_context_word_distribution(counts):
    counts_context = {}
    sum_context = 0
    context_frequency = {}
    for w in context_words:
        counts_context[w] = 0
    for w0 in counts.keys():
        for w in counts[w0].keys():
            counts_context[w] = counts_context[w] + counts[w0][w]
            sum_context = sum_context + counts[w0][w]
    for w in context_words:
        context_frequency[w] = float(counts_context[w])/float(sum_context)
    return context_frequency
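The same quantity can be sketched on toy data by summing the columns of `counts` and dividing by the grand total. The names below are illustrative only.

```python
from collections import Counter

# Hypothetical counts dictionary (rows: vocabulary words, columns: context words).
toy_counts = {"c": {"a": 2, "b": 2}, "a": {"b": 2}}

# Sum each context word's count over all vocabulary words.
column_totals = Counter()
for row in toy_counts.values():
    column_totals.update(row)

grand_total = sum(column_totals.values())
toy_context_frequency = {w: c / grand_total for w, c in column_totals.items()}
print(toy_context_frequency)
```

Here `a` accounts for 2 of the 6 context observations and `b` for 4 of 6. One difference from `get_context_word_distribution`: this sketch only has entries for context words that were actually observed, whereas the function above also stores a zero for every other word in `context_words`.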

## 3. The embedding

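Based on the various pieces of information above, we compute the **positive pointwise mutual information (PMI) matrix**, in which negative values are clipped to zero and pairs that never co-occur are left at zero:
* `PMI[i,j] = MAX(0, log probs[ith vocab word][jth context word] - log context_frequency[jth context word])`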

The embedding of any word can then be taken as the corresponding row of this matrix. However, to reduce the dimension, we will apply PCA.

In [None]:
print("Computing counts and distributions")
counts = get_counts(2)
probs = get_co_occurrence_dictionary(counts)
context_frequency = get_context_word_distribution(counts)

print("Computing pointwise mutual information")
n_vocab = len(vocab_words)
n_context = len(context_words)
# Map each context word to its column once, instead of calling list.index repeatedly.
context_index = {w: j for j, w in enumerate(context_words)}
pmi = np.zeros((n_vocab, n_context))
for i in range(n_vocab):
    w0 = vocab_words[i]
    # A vocabulary word with no context counts has no entry in probs; its row stays zero.
    for w in probs.get(w0, {}):
        j = context_index[w]
        pmi[i,j] = max(0.0, np.log(probs[w0][w]) - np.log(context_frequency[w]))
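For intuition, the clipped PMI formula can be evaluated on a tiny dense example with NumPy. This is a sketch under the assumption that the probabilities are already arranged as a matrix `P` (rows: vocabulary words, columns: context words) and a vector `q` of context frequencies; both names are hypothetical.

```python
import numpy as np

# Toy co-occurrence probabilities (each row sums to 1) and context frequencies.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
q = np.array([1/3, 2/3])

# log(0) is -inf; silence the warning, since clipping at 0 removes it anyway.
with np.errstate(divide="ignore"):
    log_ratio = np.log(P) - np.log(q)

toy_pmi = np.maximum(0.0, log_ratio)
print(toy_pmi)
```

The entry for a pair that occurs more often than the context frequency predicts (0.5 versus 1/3) is positive, `log(1.5)`. The entry for a pair that occurs less often (0.5 versus 2/3) is clipped to zero, as is the pair that never co-occurs.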

Now reduce the dimension of the PMI vectors using principal component analysis. Here we bring it down to 100 dimensions, and then normalize the vectors to unit length.

In [None]:
pca = PCA(n_components=100)
vecs = pca.fit_transform(pmi)
for i in range(0,n_vocab):
    vecs[i] = vecs[i]/np.linalg.norm(vecs[i])
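The normalization step can also be written without a loop. The sketch below uses a toy array in place of the PCA output and, as an added assumption not present in the cell above, guards against rows of norm zero, which would otherwise turn into NaNs.

```python
import numpy as np

# Hypothetical stand-in for the PCA output; the middle row has zero norm.
toy_vecs = np.array([[3.0, 4.0],
                     [0.0, 0.0],
                     [1.0, 0.0]])

# One norm per row, kept as a column so that it broadcasts across each row.
norms = np.linalg.norm(toy_vecs, axis=1, keepdims=True)

# Replace zero norms by 1 so that all-zero rows stay zero instead of becoming NaN.
safe_norms = np.where(norms > 0, norms, 1.0)
toy_unit = toy_vecs / safe_norms
print(toy_unit)
```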

It is useful to save this embedding so that it doesn't need to be computed every time.

In [None]:
# The with-statement closes the file even if one of the dumps fails.
with open("embedding.pickle", "wb") as fd:
    pickle.dump(vocab_words, fd)
    pickle.dump(context_words, fd)
    pickle.dump(vecs, fd)
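To read the embedding back, the objects must be loaded in the same order in which they were dumped. The sketch below shows the round trip on toy data written to a temporary file; the toy names and the file name `toy_embedding.pickle` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-ins for vocab_words, context_words and vecs.
toy_vocab = ["alpha", "beta"]
toy_context = ["alpha"]
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])

path = os.path.join(tempfile.mkdtemp(), "toy_embedding.pickle")

# Three consecutive dumps into one file...
with open(path, "wb") as fd:
    pickle.dump(toy_vocab, fd)
    pickle.dump(toy_context, fd)
    pickle.dump(toy_vecs, fd)

# ...are read back by three consecutive loads, in the same order.
with open(path, "rb") as fd:
    loaded_vocab = pickle.load(fd)
    loaded_context = pickle.load(fd)
    loaded_vecs = pickle.load(fd)
```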

## 4. Experimenting with the embedding

We can get some insight into the embedding by looking at the nearest neighbor of different words in the embedded space.

In [None]:
def word_NN(w):
    if w not in vocab_words:
        print("Unknown word")
        return
    v = vecs[vocab_words.index(w)]
    neighbor = None
    curr_dist = np.inf
    # Scan every vector; dist > 0 skips the query word itself (and exact duplicates).
    for i in range(n_vocab):
        dist = np.linalg.norm(v - vecs[i])
        if (dist < curr_dist) and (dist > 0.0):
            neighbor = i
            curr_dist = dist
    return vocab_words[neighbor] if neighbor is not None else None
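It is often more informative to look at several neighbors rather than just one. Here is a minimal vectorized sketch on a toy two-dimensional embedding; `toy_k_nearest` and the toy words and vectors are hypothetical and are not part of the assignment.

```python
import numpy as np

def toy_k_nearest(word, vocab, vectors, k=3):
    """Return the k words whose vectors are closest (Euclidean) to word's vector."""
    idx = vocab.index(word)
    # Distances from the query vector to every row at once.
    dists = np.linalg.norm(vectors - vectors[idx], axis=1)
    dists[idx] = np.inf  # never return the query word itself
    order = np.argsort(dists)[:k]
    return [vocab[i] for i in order]

# Hand-made unit vectors, chosen only so the ordering is easy to verify.
toy_vocab = ["north", "south", "east", "piano"]
toy_vectors = np.array([[1.0, 0.0],
                        [0.8, 0.6],
                        [0.6, 0.8],
                        [0.0, 1.0]])
neighbors = toy_k_nearest("north", toy_vocab, toy_vectors, k=2)
print(neighbors)
```

Unlike `word_NN`, this sketch excludes only the query's own index, so another word with an identical vector would be returned at distance zero.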

**Note down the answers to the following queries. You will need to enter them as part of this week's assignment.**

In [None]:
word_NN('pulmonary')

In [None]:
word_NN('communism')

In [None]:
word_NN('world')

In [None]:
word_NN('london')