## Loading libraries

The `nltk` module (Natural Language Toolkit) provides the resources we need. From it we import the English stopword list and a corpus, here Brown, which contains 500 text samples. Joined together, these samples form the word stream that supplies our contexts.

In [152]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
%matplotlib inline

In [153]:
nltk.download('stopwords')
nltk.download('brown')
from nltk.corpus import stopwords,brown

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\deshw\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\deshw\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [154]:
from sklearn.decomposition import PCA
import scipy.cluster.vq as mod

In [155]:
sw=set(stopwords.words('english'))
# Keep lowercase alphanumeric tokens longer than one character that are not stopwords
corpus=[str(word).lower() for word in brown.words() if word.lower() not in sw and len(word)>1 and word.isalnum()]

With the libraries and data loaded, we have our corpus: the stream of words. Now it's time to define the vocabulary and the context.<br>
<br>
vocab  : The words we want embeddings for. Its size determines how many words we take into consideration.<br>
<br>
context: The features. These are the most frequent words, and each vocab word is described by how likely every context word is to occur near it. Restricting the context to frequent words keeps the number of features manageable.

In [156]:
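N=len(corpus)
count={}
for word in corpus:
    count[word]=count.get(word,0)+1
# Dicts keep insertion order (Python 3.7+), so this lists the unique words
# in order of first appearance without a slow list-membership test
words=list(count)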

In [6]:
vocab=[word for word in words if count[word]>19]
context=[word for word in words if count[word]>99]
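To make the two thresholds concrete, here is a minimal sketch on a toy word stream. The names `toy_stream`, `toy_vocab` and `toy_context` and the thresholds 1 and 2 are made up for illustration; the notebook itself uses 19 and 99 on the Brown corpus.

```python
from collections import Counter

# Hypothetical toy word stream, standing in for the real corpus
toy_stream = ['a', 'b', 'a', 'c', 'a', 'b']
freq = Counter(toy_stream)

# Same idea as above, with tiny thresholds so the effect is visible
toy_vocab = [w for w in freq if freq[w] > 1]
toy_context = [w for w in freq if freq[w] > 2]

print(toy_vocab)    # ['a', 'b']
print(toy_context)  # ['a']
```

Because the context threshold is higher than the vocab threshold, every context word is also a vocab word.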

To avoid recomputing them later, we can save the two lists with pickle. Pickling serializes Python objects into a byte stream, which here is written to a file on disk.

In [256]:
import pickle
file=open('embeddings.pickle','wb')
string='The file contains vocabulary words, context words and the word distribution matrix repectively'
pickle.dump(string,file)
pickle.dump(vocab,file)
pickle.dump(context,file)
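Several objects dumped into the same file are read back one at a time, in the order they were written. The sketch below shows that round trip on stand-in data; the file name `demo.pickle` and the `demo_*` lists are hypothetical and written to a temporary directory so the real `embeddings.pickle` is untouched.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the real vocab and context lists
demo_vocab = ['time', 'man', 'world']
demo_context = ['time', 'man']

path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')

# Each dump appends one pickled object to the file
with open(path, 'wb') as f:
    pickle.dump('header text', f)
    pickle.dump(demo_vocab, f)
    pickle.dump(demo_context, f)

# Each load returns the next object, in write order
with open(path, 'rb') as f:
    header = pickle.load(f)
    loaded_vocab = pickle.load(f)
    loaded_context = pickle.load(f)
```

Using `with` closes the file automatically, which also guarantees the data is flushed to disk.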

## Process of embedding

First we count how often each context word appears near each vocab word, using a user-defined window. The window is the number of words on each side of a vocab word that are treated as its context.

In [44]:
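def context_counts(window=2):
    # Sets give fast membership tests; the lists are still used for ordering elsewhere
    vocab_set,context_set=set(vocab),set(context)
    c={w0:{} for w0 in vocab}

    for i in range(window,N-window):
        w0=corpus[i]
        if w0 in vocab_set:
            # Visit `window` positions on each side of i, skipping i itself
            for j in list(range(i-window,i))+list(range(i+1,i+window+1)):
                w=corpus[j]
                if w in context_set:
                    c[w0][w]=c[w0].get(w,0)+1
    return c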
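The windowed counting is easier to follow on a tiny input. The sketch below repeats the same logic as a standalone function that takes its inputs as parameters; `window_counts` and the toy lists are hypothetical names used only for this illustration.

```python
def window_counts(tokens, vocab_words, context_words, window=2):
    """Count context words within `window` positions of each vocab word."""
    vocab_set, context_set = set(vocab_words), set(context_words)
    counts = {w: {} for w in vocab_words}
    # As in the notebook, positions closer than `window` to either end are skipped
    for i in range(window, len(tokens) - window):
        centre = tokens[i]
        if centre not in vocab_set:
            continue
        neighbours = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        for w in neighbours:
            if w in context_set:
                counts[centre][w] = counts[centre].get(w, 0) + 1
    return counts

toy_tokens = ['cat', 'sat', 'mat', 'cat', 'ate', 'fish', 'cat', 'sat']
toy_counts = window_counts(toy_tokens, ['mat', 'cat', 'fish'], ['cat', 'sat'])
print(toy_counts)
# {'mat': {'cat': 2, 'sat': 1}, 'cat': {'sat': 1}, 'fish': {'cat': 2, 'sat': 1}}
```

Only the occurrence of `cat` at position 3 is counted as a centre word, because the others lie within two positions of the ends of the stream.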

The second step is to compute the probability of each context word appearing near a given vocab word: the number of times the context word appeared in the window around that vocab word, divided by the total number of context words counted around it.

In [45]:
def co_occurence(vector):
    co={}
    for w0 in vector:
        co[w0]={}
        add=sum(vector[w0].values())
        if add>0:
            for w in vector[w0]:
                co[w0][w]=float(vector[w0][w]/add)
    return co  
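As a quick check of the normalization, here is a minimal sketch on hand-made counts. `normalise_rows` and `sample_counts` are hypothetical names; the logic mirrors `co_occurence`, including leaving a word with no counted context as an empty row.

```python
def normalise_rows(counts):
    """Turn each word's context counts into probabilities that sum to 1."""
    probs = {}
    for w0, row in counts.items():
        total = sum(row.values())
        probs[w0] = {w: n / total for w, n in row.items()} if total > 0 else {}
    return probs

sample_counts = {'mat': {'cat': 2, 'sat': 1}, 'cat': {'sat': 1}, 'dog': {}}
sample_probs = normalise_rows(sample_counts)
# 'mat' maps to cat = 2/3 and sat = 1/3; 'cat' maps to sat = 1.0; 'dog' stays empty
```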

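Next we calculate the overall probability of each context word: how much it appeared in the context of ANY vocab word, relative to the TOTAL over all contexts of all vocab words. Note that `word_distribution` passes the normalized output of `co_occurence` to this function, so the totals are sums of per-word probabilities rather than raw counts.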

In [46]:
def context_prob(vector):
    con_prob={}
    con_tot={}
    total=0
    for w0 in vector:
        for w in vector[w0]:
            total=total+vector[w0][w]
            con_tot[w]=con_tot.get(w,0)+vector[w0][w]
    for w in con_tot:
        con_prob[w]=con_tot[w]/total
    return con_prob        

The last step combines the two. For each vocab word and context word we take the log of the ratio between the conditional probability and the overall context probability, and replace negative values with 0 (these are context words that appear less often around this vocab word than they do overall). Context words never seen near the vocab word also get 0. This measure is known as positive pointwise mutual information (PPMI), and it gives every vocab word a vector of scores over the context words.

In [100]:
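def word_distribution(counts):
    co=co_occurence(counts)
    probab=context_prob(co)
    n,d=len(vocab),len(context)
    distribution=np.zeros((n,d))
    # Map each context word to its column once instead of calling context.index repeatedly
    col={w:j for j,w in enumerate(context)}

    for i in range(n):
        w0=vocab[i]
        for w in co[w0]:
            # Log ratio of conditional to overall probability, clipped at 0
            distribution[i,col[w]]=max(0.0,np.log(co[w0][w])-np.log(probab[w]))
    return distribution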
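For comparison, the same clipped log ratio can be written with array operations. This is only a sketch under one assumption that differs from the notebook: the overall context probability is taken from raw counts (the textbook form), whereas `word_distribution` derives it from the normalized rows, so the two can give different values. `ppmi_from_counts` and `toy_matrix` are hypothetical names.

```python
import numpy as np

def ppmi_from_counts(count_matrix):
    """Rows are vocab words, columns are context words; returns max(0, log P(c|w)/P(c))."""
    counts = np.asarray(count_matrix, dtype=float)
    row_totals = counts.sum(axis=1, keepdims=True)
    # P(c | w): rows without any counts stay all zero
    cond = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)
    with np.errstate(divide='ignore', invalid='ignore'):
        # P(c), computed from raw counts (assumption, see the text above)
        marginal = counts.sum(axis=0) / counts.sum()
        pmi = np.log(cond) - np.log(marginal)
    # log(0) gives -inf or nan; unseen pairs get a score of 0
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

toy_matrix = [[2, 1],
              [0, 1]]
toy_ppmi = ppmi_from_counts(toy_matrix)
```

In this example the first word sees the first context word more often than average, so that entry is positive (log of 4/3), while its entry for the second context word is clipped to 0.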

## Final Working

Calling the functions to get the final distribution of every vocab word over the context words.

In [259]:
counts=context_counts()
dis=word_distribution(counts)
pickle.dump(dis,file)
file.close()

However, this representation has one feature per context word, which is a lot. Principal component analysis (PCA) can narrow them down: it rotates the data into a new coordinate system, the natural directions of the data given by the eigenvectors of its covariance matrix, in which most of the variability is captured by the first few principal components.

In [260]:
from sklearn.decomposition import PCA

# Objects come back in the order they were dumped; `with` closes the file afterwards
with open('embeddings.pickle','rb') as file:
    print(pickle.load(file))
    vocab=pickle.load(file)
    context=pickle.load(file)
    dis=pickle.load(file)

model=PCA(100)
distribution=model.fit_transform(dis)
# Scale every row to unit length
distribution/=np.linalg.norm(distribution,axis=1,keepdims=True)

The file contains vocabulary words, context words and the word distribution matrix repectively
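To show what the reduction step does internally, here is a rough sketch that performs the projection with NumPy's SVD instead of scikit-learn's `PCA` class. It runs on random stand-in data, and `pca_project`, `demo` and the sizes are assumptions made for the example, not part of the notebook.

```python
import numpy as np

def pca_project(matrix, k):
    """Project rows onto the k directions of largest variance."""
    X = np.asarray(matrix, dtype=float)
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T

rng = np.random.default_rng(0)
demo = rng.random((50, 20))          # stand-in for the word distribution matrix
reduced = pca_project(demo, 5)

# Same post-processing as above: scale each row to unit length
unit = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
```

After the row scaling, comparing two words by Euclidean distance ranks them the same way as comparing by the angle between their vectors.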


Now it's time to try it out.

In [261]:
def word_extract(word):
    if word in vocab:
        diff=np.zeros(len(vocab))
        i=vocab.index(word)
        # Euclidean distance from the query word to every vocab word
        for j in range(len(vocab)):
            diff[j]=np.linalg.norm(distribution[i]-distribution[j])
        # Exclude the query word itself from the search
        diff[i]=np.inf
        dist=np.argmin(diff)
        return vocab[dist]

    else:
        print('Word not found!')
        return

In [262]:
word_extract('human')

'experience'
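The same lookup can return several neighbours at once. The sketch below uses hand-made two-dimensional vectors so the result is easy to verify; `nearest_words`, `demo_words` and `demo_vecs` are hypothetical and unrelated to the trained embeddings.

```python
import numpy as np

def nearest_words(word, vocab_words, vectors, k=3):
    """Return the k vocab words whose vectors are closest to `word`."""
    if word not in vocab_words:
        return []
    i = vocab_words.index(word)
    # Distances to all rows at once
    dists = np.linalg.norm(vectors - vectors[i], axis=1)
    dists[i] = np.inf  # never return the query word itself
    order = np.argsort(dists)[:k]
    return [vocab_words[j] for j in order]

demo_words = ['king', 'queen', 'apple', 'pear']
demo_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
demo_vecs /= np.linalg.norm(demo_vecs, axis=1, keepdims=True)

print(nearest_words('king', demo_words, demo_vecs, k=2))   # ['queen', 'pear']
```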

With bag-of-words, text is converted into a matrix in which each row refers to a line and each column corresponds to a word in the vocab. An entry is 1 if the word is present in that line and 0 otherwise, a binary representation.<br>
Word embedding, in contrast, gives a distributed representation: every word is represented as a dense vector derived from a probability distribution over its contexts. Two similar words have only a small difference between their vectors, since they point in similar directions.
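For contrast with the embedding vectors, a binary bag-of-words matrix can be sketched in a few lines. The function `bag_of_words` and the sample lines are made up for this illustration, and splitting on whitespace is a simplifying assumption.

```python
def bag_of_words(lines, vocab_words):
    """One row per line, one column per vocab word; 1 if the word occurs, else 0."""
    rows = []
    for line in lines:
        present = set(line.lower().split())
        rows.append([1 if w in present else 0 for w in vocab_words])
    return rows

sample_lines = ['The cat sat', 'The dog sat down']
bow = bag_of_words(sample_lines, ['cat', 'dog', 'sat'])
print(bow)   # [[1, 0, 1], [0, 1, 1]]
```

Here `cat` and `dog` never share a column, so the representation says nothing about how similar they are, which is what the embedding vectors add.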