# INTRODUCTION

Word2Vec, a word embedding algorithm for NLP introduced in 2013, creates a vector for each word in a corpus vocabulary based on the conditional probability of other "context" words appearing next to the target word. One-hot vectors with many thousands of dimensions, each marking an individual word's index in the vocabulary, are condensed into dense vectors of typically 10 to 500 dimensions. Interestingly, the directions of these condensed vectors have been shown to capture meaning. 
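
To make the condensing step concrete, here is a minimal sketch showing that multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The toy vocabulary, the random matrix `E`, and the dimension of 4 are illustrative assumptions, not values from the model used later.

```python
import numpy as np

# Hypothetical toy vocabulary; a real vocabulary holds many thousands of words.
vocab = ['king', 'queen', 'man', 'woman', 'car']
dim = 4  # illustrative only; the text cites roughly 10 to 500 dimensions

rng = np.random.RandomState(0)
E = rng.randn(len(vocab), dim)  # embedding matrix: one dense row per word

def one_hot(word):
    """Return the one-hot vector marking the word's index in the vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# A one-hot vector times the embedding matrix is the same as a row lookup.
dense = one_hot('queen') @ E
lookup = E[vocab.index('queen')]
same = np.allclose(dense, lookup)
```

In practice only the row lookup is performed; the one-hot product is shown here purely to connect the two representations.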

Words are atomic, and their conditional probabilities are straightforward to calculate, so they translate readily into vectors. Vectorizing phrases and sentences poses a much greater challenge, owing to their compositionality and to the grammar rules that govern the more complex meanings they can convey. In addition, the phrase vectors should ideally be commensurable with the word vectors, i.e., it should be possible to cast them into the same overall vector space, if only through some sort of transformation. 

The following is an attempt to capture the semantics of phrases and sentences through use of PyDictionary (https://github.com/geekpradd/PyDictionary/tree/master/PyDictionary). A dictionary provides a list of words with equivalent groups of sentences/phrases, i.e., definitions. This list of keys and values could serve as an interesting test case for understanding the effectiveness of strategies for capturing higher meaning. 

Doc2Vec is one such attempt to vectorize sentences. The algorithm can be performed via two different methods, one of which, "Distributed Memory" (DM), mimics the Continuous Bag of Words (CBoW) model of Word2Vec in that it predicts a target word from either an average or a concatenation of context word vectors. In DM, an additional "paragraph vector" is included in the average or concatenation. Each sentence has its own paragraph vector, and that same vector is applied to every context window within the sentence. The paragraph vector thus aids in predicting the words of its sentence.
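
The DM input described above can be sketched in a few lines. This is a toy illustration under assumed sizes and random vectors, not Gensim's implementation; `dm_input` and the example sentence are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 5  # illustrative vector size

# Hypothetical word vectors and a single paragraph vector for one sentence.
words = ['the', 'cat', 'sat', 'on', 'mat']
word_vecs = {w: rng.randn(dim) for w in words}
paragraph_vec = rng.randn(dim)

def dm_input(context, paragraph_vec, mode='mean'):
    """Combine context word vectors with the paragraph vector by averaging or concatenating."""
    vecs = [word_vecs[w] for w in context] + [paragraph_vec]
    if mode == 'mean':
        return np.mean(vecs, axis=0)
    return np.concatenate(vecs)

# Two context windows from the same sentence share the same paragraph vector.
h1 = dm_input(['the', 'sat'], paragraph_vec)            # input used to predict 'cat'
h2 = dm_input(['sat', 'mat'], paragraph_vec, 'concat')  # input used to predict 'on'
```

Averaging keeps the input at the word-vector size, while concatenation grows it with the window width, which is why the two modes lead to differently shaped prediction layers.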

Another method, known as the Recursive Neural Network (RecNN), can be seen as a generalization of the better-known Recurrent Neural Network. Recurrent Neural Networks are often used in sequence prediction tasks, where they process the input as a linear chain of time steps. Recursive Neural Networks, on the other hand, break sequences (here, sentences) into the branches of a tree and perform prediction tasks on all of the branches. Constituency parsing breaks sentences into branches that reflect their grammar, which makes RecNNs a good candidate for use in Natural Language Processing. 
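
As a rough sketch of the tree-structured composition, the snippet below combines child vectors bottom-up with a single shared weight matrix and a tanh nonlinearity. The composition function, the hand-written parse tree, and all sizes are assumptions chosen for illustration; they are not taken from this project's model.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 4  # illustrative vector size

# Hypothetical word vectors for a four-word sentence.
word_vecs = {w: rng.randn(dim) for w in ['the', 'dog', 'barked', 'loudly']}

# One composition matrix and bias shared by every node of the tree.
W = rng.randn(dim, 2 * dim) * 0.1
b = np.zeros(dim)

def compose(tree):
    """Return the vector of a leaf (a word) or of an internal node (a pair of subtrees)."""
    if isinstance(tree, str):
        return word_vecs[tree]
    left, right = tree
    children = np.concatenate([compose(left), compose(right)])
    return np.tanh(W @ children + b)

# A hand-written binary parse: (the dog) (barked loudly)
tree = (('the', 'dog'), ('barked', 'loudly'))
sentence_vec = compose(tree)
```

Because every node produces a vector of the same size as a word vector, each phrase in the tree lives in the same space as the words themselves.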

The project is organized as follows: 

1) An overview of the underlying word2vec model, trained on the Wikipedia Corpus. 
2) An attempt with a Doc2Vec modification to make the paragraph vectors of definitions commensurable with the vectors of the defined words. 
3) An attempt with Recursive Neural Networks to generate embeddings for phrases and definitions, using cosine similarity with the definitional word as the prediction task. 

# Wikipedia Corpus Word Vector Overview

The word vectors and paragraph vectors used in this project come from a child class of Gensim's Doc2Vec (https://radimrehurek.com/gensim/models/doc2vec.html), called "doc2vecwordfixed." This model allows the word vectors to be held fixed during paragraph vector training, as explained in the next section. First we load the model:

In [4]:
import numpy as np
import os
os.chdir('C:\\Users\\InfiniteJest\\Documents\\Python_Scripts')
import doc2vecwordfixed
#import matplotlib.pyplot as plt

wikimodel = doc2vecwordfixed.Doc2VecWordFixed.load('wiki100dmnolbls001samp3000mc')



Some basic statistics on the corpus:

In [8]:
vocabcount = {}
totalwordcount = 0
for word in wikimodel.vocab.keys():
    wordcount = wikimodel.vocab[word].count
    totalwordcount += wordcount
    vocabcount[word] = wordcount
mostfrequent = sorted(vocabcount, key=lambda key: vocabcount[key], reverse=True)
print('Number of Documents: ', wikimodel.corpus_count)
print('Total Word Count: ', totalwordcount)
print('Top 100 Words')
print(mostfrequent[:100])

Number of Documents:  4240287
Total Word Count:  2202030161
Top 100 Words
['the', 'of', 'and', 'in', 'to', 'was', 'is', 'for', 'as', 'on', 'by', 'with', 'he', 'at', 'that', 'from', 'his', 'it', 'an', 'are', 'were', 'which', 'also', 'this', 'or', 'be', 'first', 'has', 'new', 'had', 'one', 'their', 'not', 'after', 'its', 'who', 'but', 'two', 'her', 'they', 'have', 'she', 'references', 'th', 'all', 'other', 'been', 'time', 'when', 'school', 'during', 'may', 'year', 'into', 'there', 'world', 'city', 'up', 'more', 'no', 'university', 'de', 'state', 'years', 'national', 'united', 'american', 'only', 'over', 'external', 'links', 'most', 'team', 'three', 'film', 'between', 'can', 'would', 'out', 'some', 'later', 'where', 'about', 'used', 'st', 'south', 'states', 'season', 'born', 'such', 'under', 'him', 'then', 'part', 'made', 'second', 'war', 'john', 'known', 'while']


We can look at the words with the highest cosine similarity to various query words: 

In [6]:
print("Man: ",wikimodel.most_similar('man', topn=10))
print()
print("Code: ", wikimodel.most_similar('code', topn=10))
print()
print("Jump: ", wikimodel.most_similar('jump', topn=10))
print()
print("Dirty: ", wikimodel.most_similar('dirty', topn=10))
print()
print("Physics: ", wikimodel.most_similar('physics', topn=10))
print()
print("Happy: ", wikimodel.most_similar('happy', topn=10))
print()
print("Betrothed: ", wikimodel.most_similar('betrothed', topn=10))

Man:  [('boy', 0.7870639562606812), ('girl', 0.7832261323928833), ('woman', 0.7731713056564331), ('lad', 0.7303745746612549), ('thief', 0.6730341911315918), ('person', 0.6582547426223755), ('swordsman', 0.6563944220542908), ('gambler', 0.6371715664863586), ('gentleman', 0.6117371320724487), ('thug', 0.6019390225410461)]

Code:  [('codes', 0.6878383755683899), ('specification', 0.6829293370246887), ('registration', 0.6401716470718384), ('identification', 0.6284632682800293), ('procedure', 0.6190042495727539), ('type', 0.6061720252037048), ('identifier', 0.6001975536346436), ('standard', 0.5941685438156128), ('protocol', 0.5930469036102295), ('prefix', 0.5893802642822266)]

Jump:  [('jumper', 0.7318518161773682), ('hurdles', 0.5937206745147705), ('metre', 0.550831139087677), ('metres', 0.5491101741790771), ('jumpers', 0.5344479084014893), ('meter', 0.5268185138702393), ('discus', 0.5235909223556519), ('javelin', 0.5028536915779114), ('throw', 0.498761385679245), ('speed', 0.4979646801948

In some cases, the closely aligned vectors make perfect sense, as in Physics with Chemistry, Mathematics, and Biochemistry. The model also captures multiple meanings, as in Dirty with both nasty and sexy. Yet these vectors are trained not to represent meaning, but rather to model the conditional probability of a word appearing next to other words. They therefore show which words occur in similar contexts and can sometimes miss the underlying semantics. For example, Betrothed matches most closely with unbeknownst and unfaithful, which actually run contrary to its meaning. 
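
For reference, the ranking by cosine similarity can be reproduced by hand. The sketch below uses three made-up 3-dimensional vectors (not vectors from the Wikipedia model) and a hypothetical `most_similar` helper that mirrors the idea, not Gensim's actual code.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, vectors, topn=3):
    """Rank every other word by its cosine similarity to the query word."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, chosen so that 'physics' and 'chemistry' point the same way.
toy_vectors = {
    'physics': np.array([1.0, 0.9, 0.0]),
    'chemistry': np.array([0.9, 1.0, 0.1]),
    'dirty': np.array([0.0, 0.1, 1.0]),
}
ranked = most_similar('physics', toy_vectors, topn=2)
```

Since cosine similarity depends only on direction, two words score highly whenever their vectors point the same way, regardless of why their contexts overlap.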

Next, we can visualize the vectors. To do that, we must first reduce their dimensionality from 100 to 2. We accomplish this using the t-SNE algorithm as implemented in the scikit-learn package (http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html): 

In [9]:
from sklearn.manifold import TSNE
tsne = TSNE()
Y = tsne.fit_transform(wikimodel.syn0)

In [None]:
import matplotlib.pyplot as plt

def pickwords(model1, wordvectransform, testwords):   #wordvectransform is the tsne vectors
    # row index of each word in model1.syn0, which is also its row in the t-SNE output
    modelindices = [model1.vocab[word].index for word in testwords]
    mag = 3*len(testwords)   # scale factor that spreads the points apart
    xs = mag*wordvectransform[modelindices, 0]
    ys = mag*wordvectransform[modelindices, 1]
    plt.scatter(xs, ys)
    for label, x, y in zip(testwords, xs, ys):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    plt.show()

Choose a set of words and look up the rows of their t-SNE-transformed vectors.

In [13]:
testwords = ['up', 'down', 'man', 'woman', 'king', 'queen', 'happy', 'sad', 'emotions', 'car', 'drive', 'bike', 'ride']

testvocabvectors = [wikimodel[word] for word in testwords]  #retrieves the vector of each word from the model
# vocab[word].index is the word's row in wikimodel.syn0, and therefore its row in the t-SNE output Y
modelindices = [wikimodel.vocab[word].index for word in testwords]
mag = 3*len(testwords) 
#plt.scatter(mag*wordvectransform[modelindices[:], 0], mag*wordvectransform[modelindices[:], 1])
#for label, x, y in zip(testwords, mag*wordvectransform[modelindices[:], 0], mag*wordvectransform[modelindices[:], 1]):
#    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
#plt.show()

In [14]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
import numpy as np

output_notebook()

# t-SNE coordinates of the test words; rows of Y follow the row order of wikimodel.syn0
idx = np.array(modelindices)
x = Y[idx, 0]
y = Y[idx, 1]

TOOLS="crosshair,pan,wheel_zoom,box_zoom,reset,box_select,lasso_select"

# create a new plot with the tools above; axis ranges are fitted to the data
p = figure(tools=TOOLS)

# add one marker per test word, labelled with the word itself
p.scatter(x, y, size=10, fill_alpha=0.6, line_color=None)
p.text(x, y, text=testwords)

# show the results
show(p)

In [None]:
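from sklearn.decomposition import PCA

# NOTE: `model` is not defined earlier in this notebook; it is assumed to be a loaded
# Doc2VecWordFixed model that holds both word vectors and paragraph vectors.
pca = PCA()
pca.fit(model.syn0)   # components_ is only available after fitting
firstcomponent = pca.components_[0]
model.similar_by_vector(firstcomponent)

from sklearn.metrics.pairwise import cosine_similarity
pca2 = PCA(n_components=1)   # assumed setting; pca2 was never defined in the original cell
onecomponent = pca2.fit_transform(model.syn0)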

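def getmodelindices(model, wordlist, docindices=False):
    """Return the row index of each word among the paragraph vectors or the word vectors."""
    if docindices:
        # position of each tag in doctags, computed once rather than once per word
        tagpositions = {tag: i for i, tag in enumerate(model.docvecs.doctags)}
        return [tagpositions[word] for word in wordlist]
    # vocab[word].index is the word's row in model.syn0
    return [model.vocab[word].index for word in wordlist]
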
def concatenatedocandwordvectors(model):
    """Concatenate each word vector with the paragraph vector tagged by the same word."""
    combinedsyn0 = []
    combinedwordlist = []
    doctags = set(model.docvecs.doctags)
    for word in model.vocab:
        if word in doctags:
            combinedsyn0.append(np.concatenate([model[word], model.docvecs[word]]))
            combinedwordlist.append(word)   # record the word directly instead of matching rows afterwards
    return np.vstack(combinedsyn0), combinedwordlist

finalvecs, wordlist = concatenatedocandwordvectors(model)

from sklearn.manifold import TSNE
tsne = TSNE()
tsne2 = TSNE()
Y = tsne.fit_transform(model.syn0)
#Y2 = tsne.fit_transform(finalvecs)
Y2 = tsne2.fit_transform(model.docvecs.doctag_syn0)   # fit on the paragraph-vector array, not the docvecs object

import matplotlib.pyplot as plt

def pickwords(model1, docvectransform, wordvectransform):
    """Plot user-chosen words in the word-vector and paragraph-vector t-SNE spaces."""
    tagpositions = {tag: i for i, tag in enumerate(model1.docvecs.doctags)}
    testinput = input("Type a list of words you would like to compare:")
    # keep only words that have both a paragraph vector and a word vector;
    # building a new list avoids removing items from a list while iterating over it
    testwords = [word for word in testinput.split(" ")
                 if word in tagpositions and word in model1.vocab]
    modelindices = [model1.vocab[word].index for word in testwords]
    modeldocindices = [tagpositions[word] for word in testwords]
    mag = 3*len(testwords)
    # first plot: word vectors; second plot: paragraph vectors
    for transform, indices in ((wordvectransform, modelindices), (docvectransform, modeldocindices)):
        xs, ys = mag*transform[indices, 0], mag*transform[indices, 1]
        plt.scatter(xs, ys)
        for label, x, y in zip(testwords, xs, ys):
            plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
        plt.show()
    

#testwords = list(model.docvecs.doctags.keys())

testwords = ['king', 'queen', 'man', 'woman', 'poor', 'rich', 'garbage', 'beautiful', 'argue', 'love', 'hate', 'life', 'bother', 'exam', 'human', 'animal', 'hint', 'fear', 'anxiety', 'lose', 'win', 'dog', 'cat', 'bird', 'mouse', 'big', 'large', 'boring', 'war', 'weapon', 'peace', 'prosperity']

import matplotlib.pyplot as plt 
mag = 1000000
combinedmodelindices = []
controlindices = []


####Do if not concatenating the wordvecs
if "wordlist" not in globals():
    wordlist = list(set(list(model.docvecs.doctags)).intersection(model.vocab.keys()))
docvecindices = getmodelindices(model, wordlist, docindices=True)
wordvecindices = getmodelindices(model, wordlist)

testwordindices = []
testdocindices = []
plottedwords = []   # test words actually found in wordlist, so labels stay aligned with points
wordpositions = {word: i for i, word in enumerate(wordlist)}
for word in testwords:
    if word in wordpositions:
        plottedwords.append(word)
        testwordindices.append(wordvecindices[wordpositions[word]])
        testdocindices.append(docvecindices[wordpositions[word]])

plt.scatter(mag*Y[testwordindices, 0], mag*Y[testwordindices, 1])
for label, x, y in zip(plottedwords, mag*Y[testwordindices, 0], mag*Y[testwordindices, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
plt.show()

plt.scatter(mag*Y2[testdocindices, 0], mag*Y2[testdocindices, 1])
for label, x, y in zip(plottedwords, mag*Y2[testdocindices, 0], mag*Y2[testdocindices, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
plt.show()
####
    
    
combinedwords = []   # test words present in wordlist, kept so labels stay aligned with points
for word in testwords:
    if word in wordlist:
        # position in wordlist; this matches a row of Y2 only when Y2 was fit on finalvecs
        combinedmodelindices.append(wordlist.index(word))
        combinedwords.append(word)
    controlindices.append(model.vocab[word].index)   # row of the word in model.syn0 and in Y
plt.scatter(mag*Y[controlindices, 0], mag*Y[controlindices, 1])
for label, x, y in zip(testwords, mag*Y[controlindices, 0], mag*Y[controlindices, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
plt.show()

plt.scatter(mag*Y2[combinedmodelindices, 0], mag*Y2[combinedmodelindices, 1])
for label, x, y in zip(combinedwords, mag*Y2[combinedmodelindices, 0], mag*Y2[combinedmodelindices, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
plt.show()

# Deriving Embeddings from a PyDictionary Recursive Neural Network

Recursive Neural Networks have primarily been used in NLP for parsing and sentiment analysis. Their usefulness resides in their ability to break sentences down into parts that can be individually trained and assigned a vector in the same space as the word vectors. Unlike word2vec, which trains the vector space by predicting conditional probability distributions over neighboring words, RecNNs have traditionally been trained on sentiment labels, such as a number 0 through 5 indicating how positive or negative a word or phrase is, or on parsing labels. 

For the current PyDictionary corpus, the training process is modified in the following ways:
1) The word2vec embeddings trained from the Wikipedia corpus are used and kept fixed. 
2) The prediction task is to predict the definitional word from its parsed definition. That is, the labels are the word2vec embeddings of the definitional word. 
3) The objective is now to maximize the dot product of the proposed word/phrase/sentence vectors with the label vectors.
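
The modified objective can be illustrated on the smallest possible case: a two-word definition composed by a single node. This is a minimal sketch under stated assumptions, not the training code of this project. The random vectors stand in for the fixed Wikipedia embeddings, and the tanh composition, the matrix `W`, and the learning rate are hypothetical choices.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 4  # illustrative vector size

# Hypothetical fixed word vectors standing in for the Wikipedia-trained embeddings.
left, right = rng.randn(dim), rng.randn(dim)  # the two words of a parsed definition
label = rng.randn(dim)                        # embedding of the word being defined

W = rng.randn(dim, 2 * dim) * 0.1             # trainable composition matrix
x = np.concatenate([left, right])             # children of the single tree node

def objective(W):
    """Dot product between the composed phrase vector and the label vector."""
    phrase = np.tanh(W @ x)
    return float(label @ phrase)

before = objective(W)

# Gradient of the objective with respect to W, using d tanh(z)/dz = 1 - tanh(z)^2.
phrase = np.tanh(W @ x)
grad = np.outer(label * (1.0 - phrase ** 2), x)

# One gradient ascent step: only W moves, the word vectors stay fixed.
W_new = W + 0.01 * grad
after = objective(W_new)
```

After the update the dot product with the label is larger than before, which is the direction the training process pushes every phrase and sentence vector of a definition.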
