# Word Embeddings
Implement a word embedding approach that is a bit simpler than word2vec. The key idea is to count co-occurrences between center words and context words (somewhat like word2vec), but without learning any model parameters.

## Load the Brown Corpus

The dataset for this part is the (in)famous [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus), a collection of text samples from a wide range of sources with roughly one million words in total. Conveniently, the Brown corpus ships with nltk.

In [1]:
import nltk
import re
from nltk.corpus import brown
from nltk.corpus import stopwords
import numpy as np

In [4]:
nltk.download('brown')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/achadha7/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
brown_words = brown.words()
stop_words = set(stopwords.words('english'))

%store brown_words
%store stop_words

Stored 'brown_words' (ConcatenatedCorpusView)
Stored 'stop_words' (set)


## 1.1 Dataset Pre-processing
OK, now we need to do some basic pre-processing. For this part you should:

* Remove stopwords and punctuation.
* Make everything lowercase.

Then, count how often each word occurs. We will define the 5,000 most frequent words as our vocabulary (V) and the 1,000 most frequent words as our context (C), so C is a subset of V. Include a print statement below to show the top-20 words after pre-processing.
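To make the steps above concrete, here is a minimal sketch on a toy token list. The stop-word set `TOY_STOP_WORDS` and the helper `clean_tokens` are hypothetical names used only for this illustration; the notebook itself works with nltk's English stop-word list and the full Brown corpus.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, used only for this illustration.
TOY_STOP_WORDS = {"the", "a", "on", "and"}

def clean_tokens(tokens, stop_words):
    """Lowercase, drop stop words, strip non-word characters, drop empty tokens."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stop_words:
            continue
        stripped = re.sub(r"\W+", "", lowered)
        if stripped:  # pure-punctuation tokens become "" and are skipped
            cleaned.append(stripped)
    return cleaned

toy_tokens = ["The", "cat", "sat", "on", "the", "mat", ",", "and", "the", "cat", "slept", "."]
toy_cleaned = clean_tokens(toy_tokens, TOY_STOP_WORDS)
toy_counts = Counter(toy_cleaned)

print(toy_cleaned)
print(toy_counts.most_common(1))
```

Taking the first N entries of `most_common` is what turns the frequency table into a vocabulary or a context list.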

In [2]:
%store -r brown_words
%store -r stop_words

In [3]:
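# Lowercase each token, drop stop words, then strip non-word characters (punctuation).
# Tokens that were pure punctuation become empty strings; those are removed from the counts later.
def preprocess_data(vocab, stop_words):
    brown_tokens = [re.sub(r'\W+', '', word.lower()) for word in vocab if word.lower() not in stop_words]
    return brown_tokens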

In [4]:
brown_words_cleaned = preprocess_data(brown_words, stop_words)
corpus_length = len(brown_words_cleaned)
print("Brown corpus length: ", corpus_length)

fdist = nltk.FreqDist(brown_words_cleaned)
fdist.pop("")

context = fdist.most_common(1000)
vocabulary = fdist.most_common(5000)

Brown corpus length:  686163


In [5]:
print("Vocabulary:")
vocabulary[:14]

Vocabulary:


[('one', 3297),
 ('would', 2714),
 ('said', 1961),
 ('new', 1635),
 ('could', 1601),
 ('time', 1598),
 ('two', 1412),
 ('may', 1402),
 ('first', 1361),
 ('like', 1292),
 ('man', 1207),
 ('even', 1170),
 ('made', 1125),
 ('also', 1069)]

In [6]:
print("Context:")
context[:14]

Context:


[('one', 3297),
 ('would', 2714),
 ('said', 1961),
 ('new', 1635),
 ('could', 1601),
 ('time', 1598),
 ('two', 1412),
 ('may', 1402),
 ('first', 1361),
 ('like', 1292),
 ('man', 1207),
 ('even', 1170),
 ('made', 1125),
 ('also', 1069)]

## 1.2 Building the Co-occurrence Matrix 

For each word in the vocabulary (w), we want to calculate how often context words from C appear in its surrounding window of size 4 (two words before and two words after).

In other words, we need to define a co-occurrence matrix that has a dimension of |V|x|C| such that each cell (w,c) represents the number of times c occurs in a window around w. 
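As a rough illustration of the window counting, the sketch below tallies co-occurrences on a five-token toy corpus. The function name `count_cooccurrences` and the toy sets are hypothetical, and skipping the center position itself is an assumption of this sketch rather than something the text prescribes.

```python
from collections import defaultdict

def count_cooccurrences(tokens, vocab, context, window_size=2):
    """Count context words within window_size positions of each vocabulary word."""
    counts = defaultdict(int)
    for i, center in enumerate(tokens):
        if center not in vocab:
            continue
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # assumption: the center position is not its own context
            if tokens[j] in context:
                counts[(center, tokens[j])] += 1
    return dict(counts)

toy_tokens = ["cat", "sat", "mat", "cat", "slept"]
toy_cooc = count_cooccurrences(toy_tokens, vocab={"cat", "sat"}, context={"cat", "mat"})
print(toy_cooc)
```

Each key `(w, c)` corresponds to one cell of the |V|x|C| matrix; a dense array indexed by word ids stores the same information.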

In [11]:
# Map each word to its row/column index; `words` is a list of (word, count) pairs
def create_index_map(words):
    mmap = {}
    for w in words:
        mmap[w[0]] = len(mmap)
    return mmap

vocabulary_to_index_map = create_index_map(vocabulary)
context_to_index_map = create_index_map(context)
index_map = (vocabulary_to_index_map, context_to_index_map)

In [9]:
def create_co_occurence_matrix(words, index_map, window_size = 2):
    vocabulary_to_index_map = index_map[0]
    context_to_index_map = index_map[1]

    # Every cell starts at 1 rather than 0, which acts as add-one smoothing
    # and keeps log() in section 1.4 away from zero.
    matrix = np.ones((len(vocabulary_to_index_map), len(context_to_index_map)))
    corpus_length = len(words)

    for i in range(corpus_length):
        word1 = words[i]
        if word1 not in vocabulary_to_index_map:
            continue

        w1_id = vocabulary_to_index_map[word1]
        # Clamp the window to the corpus boundaries
        start_range = max(0, i - window_size)
        end_range = min(corpus_length - 1, i + window_size)

        # Note: the range includes j == i, so a center word that is also a
        # context word is counted once as its own context.
        for j in range(start_range, end_range + 1):
            word2 = words[j]
            if word2 in context_to_index_map:
                matrix[w1_id, context_to_index_map[word2]] += 1
    return matrix

co_occurence_matrix = create_co_occurence_matrix(brown_words_cleaned, index_map)

## 1.3 Probability Distribution

Using the co-occurrence matrix, we can compute the probability distribution Pr(c|w) of context word c around w as well as the overall probability distribution of each context word c with Pr(c).  
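A small numeric sketch may help here. The 2x2 matrix `toy_counts` is made up for illustration (rows are vocabulary words, columns are context words): Pr(c|w) normalizes each row, while Pr(c) normalizes the column totals by the grand total.

```python
import numpy as np

# Hypothetical co-occurrence counts: 2 vocabulary words x 2 context words.
toy_counts = np.array([[2.0, 2.0],
                       [1.0, 3.0]])

# Pr(c|w): divide each row by its row total, so every row sums to 1.
pr_c_given_w = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# Pr(c): column totals divided by the total of all counts.
pr_c = toy_counts.sum(axis=0) / toy_counts.sum()

print(pr_c_given_w)
print(pr_c)
```

Here the second context word is the more frequent one overall (5 of 8 counts), which is exactly what Pr(c) is meant to capture.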

In [13]:
# Returns a matrix where each cell (w, c) holds Pr(c|w):
# Pr(c|w) = count(w, c) / sum over c' of count(w, c'), so each row sums to 1
def create_probablity_distribution_matrix(co_occurence_matrix):
    return co_occurence_matrix / co_occurence_matrix.sum(axis = 1)[:, None]

# Returns a vector of context word probabilities Pr(c):
# column totals divided by the grand total of the matrix
def create_probabilities_context_words(co_occurence_matrix):
    context_words_counts = co_occurence_matrix.sum(axis = 0)
    return context_words_counts/context_words_counts.sum()

In [14]:
print("Calculating Pr(c|w) for each c,w combination in co-occurence matrix")
pmatrix = create_probablity_distribution_matrix(co_occurence_matrix)
print(pmatrix)

Calculating Pr(c|w) for each c,w combination in co-occurence matrix
[[  3.58512804e-01   6.85738776e-03   2.57152041e-03 ...,   2.14293368e-04
    2.14293368e-04   5.35733419e-04]
 [  8.03313669e-03   3.42286934e-01   6.77795908e-03 ...,   1.25517761e-04
    1.25517761e-04   2.51035522e-04]
 [  5.43109301e-03   1.22199593e-02   4.46254809e-01 ...,   2.26295542e-04
    2.26295542e-04   2.26295542e-04]
 ..., 
 [  9.87166831e-04   9.87166831e-04   9.87166831e-04 ...,   9.87166831e-04
    9.87166831e-04   9.87166831e-04]
 [  9.84251969e-04   9.84251969e-04   9.84251969e-04 ...,   9.84251969e-04
    9.84251969e-04   9.84251969e-04]
 [  1.95312500e-03   9.76562500e-04   9.76562500e-04 ...,   9.76562500e-04
    9.76562500e-04   9.76562500e-04]]


In [16]:
print("Calculating Pr(c) for each context word")
print()
print("Top 5 context words probabilities:")
context_probablities = create_probabilities_context_words(co_occurence_matrix)[:, None]
print(context_probablities[0:5])
print(context_probablities.shape)

Calculating Pr(c) for each context word

Top 5 context words probabilities:
[[ 0.00279049]
 [ 0.00251259]
 [ 0.00164824]
 [ 0.0018586 ]
 [ 0.00183776]]
(1000, 1)


## 1.4 Embedding Representation

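Now we can represent each vocabulary word w as a |C|-dimensional vector whose component for context word c is given by this equation:

Vector(w)[c] = max(0, log(Pr(c|w) / Pr(c)))

This is a traditional approach called *positive pointwise mutual information* (PPMI) that pre-dates word2vec by some time. A component is positive only when c appears around w more often than its overall frequency would suggest.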
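The equation can be sketched end to end on a tiny count matrix. The helper name `ppmi_from_counts` and the toy numbers are assumptions made for illustration only; the point is the broadcasting (each column c is divided by Pr(c)) and the clipping at zero.

```python
import numpy as np

def ppmi_from_counts(counts):
    """Return max(0, log(Pr(c|w) / Pr(c))) for a (words x contexts) count matrix."""
    pr_c_given_w = counts / counts.sum(axis=1, keepdims=True)
    pr_c = counts.sum(axis=0) / counts.sum()
    # Zero counts would give log(0) = -inf; silence the warning, the clip maps it to 0.
    with np.errstate(divide="ignore"):
        pmi = np.log(pr_c_given_w / pr_c)
    return np.maximum(0.0, pmi)

# Hypothetical counts: 2 vocabulary words x 2 context words.
toy_counts = np.array([[2.0, 2.0],
                       [1.0, 3.0]])
toy_ppmi = ppmi_from_counts(toy_counts)
print(toy_ppmi)
```

Only the cells where a context word is over-represented for that row keep a nonzero value, which is why these vectors tend to be sparse.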

In [17]:
# Divide each column c of the probability matrix by the context word probability:
# Pr(c|w) / Pr(c)
ppmi_matrix = pmatrix / context_probablities.T

# Clip negative log-ratios to zero (positive PMI)
ppmi_matrix = np.maximum(0, np.log(ppmi_matrix))
ppmi_matrix.shape

(5000, 1000)

In [18]:
%store ppmi_matrix

Stored 'ppmi_matrix' (ndarray)


## 1.5 Analysis

So now we have some embeddings for each word. But are they meaningful? For this part, let's:

- First, cluster the vocabulary into 100 clusters using k-means. Look over the words in each cluster: can you see any relation between the words? Discuss your observations.

- Second, for the top-20 most frequent words, find the nearest neighbors using cosine distance (1 - cosine similarity). Do the findings make sense? Discuss.
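For reference, cosine distance between two embedding vectors can be sketched as follows. The function name `cosine_distance` is a hypothetical helper for illustration; the distance is undefined when either vector is all zeros, which this sketch does not handle.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the measure ignores vector length, a frequent word and a rare word can still be close as long as their context profiles point in a similar direction.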

In [19]:
index_vocabulary_map = {v:k for k,v in vocabulary_to_index_map.items()}
index_context_map = {v:k for k, v in context_to_index_map.items()}

In [20]:
%store -r ppmi_matrix
from sklearn.cluster import KMeans
kmean_model = KMeans(n_clusters=100, init='k-means++', max_iter=100)
kmean_model.fit(ppmi_matrix)
%store kmean_model

Stored 'kmean_model' (KMeans)


In [21]:
%store -r kmean_model
order_centroids = kmean_model.cluster_centers_.shape

In [22]:
# Each centroid lives in the |C|-dimensional context space, so sorting its
# coordinates ranks context words (not cluster members) by weight.
order_centroids = kmean_model.cluster_centers_.argsort(axis=1)[:, ::-1]

print("Printing the 10 highest-weighted context words for the first 3 cluster centroids")
for i in range(3):
    print()
    print("Cluster %d:" % i)
    print("==========")
    for ind in order_centroids[i, :10]:
        print(' %s' % index_context_map[ind])

Printing the 10 highest-weighted context words for the first 3 cluster centroids

Cluster 0:
 washington
 last
 25
 george
 week
 state
 city
 square
 kennedy
 congress

Cluster 1:
 development
 program
 public
 national
 economic
 state
 research
 government
 education
 planning

Cluster 2:
 god
 christ
 son
 born
 faith
 love
 mother
 life
 death
 man


### K-Means code reference
https://pythonprogramminglanguage.com/kmeans-text-clustering/
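To look over the vocabulary words that fall into each cluster, one option is to group row indices by their assigned cluster label. The sketch below uses a hypothetical helper `words_by_cluster` and toy inputs; in this notebook the corresponding inputs would presumably be `kmean_model.labels_` (one label per row of the fitted matrix) and `index_vocabulary_map`.

```python
from collections import defaultdict

def words_by_cluster(labels, index_to_word):
    """Group words by cluster label; labels[i] is the cluster of row i."""
    clusters = defaultdict(list)
    for row_index, label in enumerate(labels):
        clusters[int(label)].append(index_to_word[row_index])
    return dict(clusters)

# Toy inputs for illustration only.
toy_labels = [0, 1, 0, 1]
toy_index_to_word = {0: "apple", 1: "car", 2: "pear", 3: "truck"}
toy_clusters = words_by_cluster(toy_labels, toy_index_to_word)
print(toy_clusters)
```

Since rows are ordered by frequency, the words inside each cluster come out with the most frequent members first.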

In [23]:
from sklearn.neighbors import NearestNeighbors

neighbours_model = NearestNeighbors(n_neighbors=7, metric = 'cosine')
neighbours_model.fit(ppmi_matrix)
%store neighbours_model

Stored 'neighbours_model' (NearestNeighbors)


In [24]:
%store -r neighbours_model

# Query the 10 most frequent words (rows 0-9 of the matrix).
# Each query word is its own closest match, so it appears first in its list.
top_10_nn = neighbours_model.kneighbors(ppmi_matrix[:10], n_neighbors=7)
top_10_nn_indices = top_10_nn[1]

for i in range(10):
    print("")
    print("Top word", str(i+1), "- ", index_vocabulary_map[i])
    print("===========")
    for j in top_10_nn_indices[i, :]:
        print(index_vocabulary_map[j])


Top word 1 -  one
one
another
thing
day
least
good
man

Top word 2 -  would
would
like
say
never
things
could
let

Top word 3 -  said
said
mr
maggie
hal
skyros
borden
smiling

Top word 4 -  new
new
york
yankees
city
orleans
jersey
central

Top word 5 -  could
could
see
hear
never
would
anything
way

Top word 6 -  time
time
long
first
period
short
place
spent

Top word 7 -  two
two
three
ago
hundred
years
weeks
four

Top word 8 -  may
may
also
seem
desirable
well
find
might

Top word 9 -  first
first
time
second
last
two
place
day

Top word 10 -  like
like
look
felt
would
know
looked
think
