<center>
<h1>Learning Word Embeddings with Stan Lee </h1>
<h2>A practitioner's exploration into what’s possible using word embeddings.</h2>
<h3>Arian Pasquali</h3>
<h4>heavily based on Daniel Loureiro's work (https://github.com/danlou/dspt8) </h4>
<h3>Tomar, November 2018</h3>
</center>

<img src="https://images-na.ssl-images-amazon.com/images/I/91YWN2-mI6L._SL1500_.jpg">






# Plan for this lecture


1. Why are Word Embeddings Relevant?
2. Key Concept 1: Context Windows
3. Key Concept 2: Two Shallow Neural Networks Approaches
4. Selecting a Corpus (Marvel Comics)
5. Training a Word Embedding Model
6. Computing Similarities
7. Looking up Similar Words
8. Reasoning by Analogy
9. Evaluating Model Quality
10. Visualizing the Vector Space
11. Sentence Embeddings
12. Looking up Similar Sentences

# Dependencies

* Python 3.5.x
* numpy
* sklearn (0.18.1)
* scipy
* bokeh
    * I recommend you use the [Anaconda distribution of Python (4.2.x)](https://repo.continuum.io/archive/index.html) which includes all of the above
* gensim (1.0.1)
* tensorflow (1.0.1)

In [20]:
!pip install numpy
!pip install scikit-learn==0.18.2
!pip install scipy
!pip install bokeh
!pip install gensim==1.0.1
!pip install tensorflow==1.0.1

## Pretrained models
Download all pretrained models from https://www.dropbox.com/sh/3bmd0ktqzg2bki6/AAAKdjbv3Gq3WP_2baxIP20ha?dl=0

Download the Wikipedia embedding model from http://magnitude.plasticity.ai/fasttext/light/wiki-news-300d-1M.magnitude

<center><h1>Why are Word Embeddings Relevant?</h1></center>
<img src="images/sparse_words.png">
<p style='text-align:right'>Source: TensorFlow</p>


> The use of word representations... has become a "secret sauce" for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling.

[Luong, Socher, Manning (2013)](https://nlp.stanford.edu/~lmthang/data/papers/conll13_morpho.pdf)

<center>
<h1>Context Windows</h1>
<h3>"You shall know a word by the company it keeps" (Firth, 1957)</h3></center>
<img src="images/window.png">
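
A context window can be sketched in a few lines. The helper below is hypothetical (`context_pairs` is not part of gensim); it lists, for each token, the neighbours that fall within `window` positions on either side. The real training code is more involved, so read this only as an illustration of the idea.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for each token and every neighbour within `window`."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

sentence = "only his foe the_hulk has proven immune".split()
pairs = list(context_pairs(sentence, window=2))
print(pairs[:4])
```

Words at the edges of a sentence simply get fewer context words, which is why the loop clips the window to the sentence boundaries.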

<center><h1>Two Shallow Neural Networks Approaches</h1></center>
<img src="images/cbow_sg.png">
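
The two architectures differ in what is predicted from what: CBOW predicts a word from its surrounding context, while skip-gram predicts each context word from the word in the middle. The sketch below only shows the shape of the training examples each approach would see, not the networks themselves; `training_examples` is a hypothetical helper.

```python
def training_examples(tokens, window=2):
    """Return (cbow, skipgram) training examples for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, target))                  # CBOW: whole context -> target word
        skipgram.extend((target, c) for c in context)   # skip-gram: target -> each context word
    return cbow, skipgram

cbow, skipgram = training_examples("iron_man is tony_stark".split(), window=1)
print(cbow[1])
print(skipgram)
```

Skip-gram produces one example per (word, neighbour) pair, so it sees many more, smaller training examples than CBOW for the same text.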

# Dataset

### Marvel Wikia Corpus (http://marvel.wikia.com)

* Brand new, compiled for a DSPT session
* 27,045 text documents with details about characters, items, locations, etc. (one file)
* Demonstrates that word vectors can be useful with smaller corpora (<< 1B tokens)
* Pre-processed: tokenized, with multi-word expressions joined
* Released under CC BY-NC-SA 4.0 license (free of infringements)
* Code for compiling this corpus - https://gist.github.com/danlou/532f761b6f568e20ee10a613aacea716

<img src="http://www.unkind.pt/fotos/familias/marvelcomics_1463995066.jpg">

In [3]:
!tail -15 corpus/marvel.txt # blank line separates documents, docs are composed of article_id, title, sentences.

However , only in the case of Thunderbolt_Ross had another_being 's consciousness supplanted Zzzax 's own as the dominant one controlling Zzzax 's physical form Electricity Physiology : Zzzax is a creature of pure electricity ( or to be more precise a psionically charged electromagnetic field in humanoid form ) .
It can generate electricity ( usually in the form of lightning bolts ) , manipulate nearby electrical fields , and fly .
Zzzax absorbs electricity from the human brain to survive ; this usually kills its victims , and usually gives Zzzax temporary personality traits similar to those of the_person it has absorbed .
Only his foe the_Hulk has proven immune to this ability .
Although Zzzax is a being composed of energy , what passes as its corporeal form is able to lift matter as a normal humanoid body ( as though it had hands , muscles , and a skeletal structure ) .
Zzzax possesses the ability to lift in excess of 100_tons or more based on its level of energy it has absorbed

In [5]:
# let's load the corpus
from collections import OrderedDict
marvel_articles = OrderedDict()

article_lines = []
with open('corpus/marvel.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()

        if len(line) > 0:
            article_lines.append(line)
        else:
            article_id = int(article_lines[0])
            article_name = article_lines[1]
            article_text = article_lines[2:]
            marvel_articles[article_id] = dict(name=article_name, text=article_text)

            article_lines = []  # reset for next article/document
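
To make the file layout concrete, here is a minimal sketch of the same parsing logic run on an in-memory sample. The two documents are made up for illustration (ids, titles and sentences are not from the corpus), and `parse_articles` is a hypothetical helper. Unlike the loop above, it also keeps a final document that is not followed by a blank line and ignores repeated blank lines.

```python
import io
from collections import OrderedDict

def parse_articles(lines):
    """Parse blank-line separated documents: id, title, then one sentence per line."""
    articles = OrderedDict()
    block = []
    for raw in lines:
        line = raw.strip()
        if line:
            block.append(line)
            continue
        if block:  # a blank line closes the current document
            articles[int(block[0])] = dict(name=block[1], text=block[2:])
            block = []
    if block:  # flush a last document that has no trailing blank line
        articles[int(block[0])] = dict(name=block[1], text=block[2:])
    return articles

# made-up sample in the corpus layout (illustrative only)
sample = io.StringIO(
    "1\nZzzax\nIt can fly .\nIt absorbs electricity .\n\n"
    "2\nTony_Stark\nHe builds armor ."
)
articles = parse_articles(sample)
print(articles[2]["name"], len(articles[1]["text"]))
```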

In [129]:
len(marvel_articles)

27045

In [8]:
tony_stark = marvel_articles[1868]
print(tony_stark['text'][:5])
print(len(tony_stark['text']))

['The biological parents of Tony_Stark were two S.H.I.E.L.D agents , Amanda_Armstrong and Jude , who met during a courier mission .', 'After Jude saved Amanda from an assassin , they got to know each other and fell in love .', 'Following a two-year relationship , Amanda became pregnant .', 'A week before giving birth to the baby , Jude revealed to have been a Hydra double-agent with little regard for anybody but Amanda and himself who sold out fellow S.H.I.E.L.D soldiers , and was even responsible for the incident that had almost cost Amanda her life .', "During a discussion when he was trying to convince Amanda to accept Hydra 's protection , she attacked Jude and killed him ."]
452


# Training a Word Embedding Model

We'll be using the highly praised [gensim](https://github.com/RaRe-Technologies/gensim) package and their implementation of word2vec.

In [9]:
# we need to isolate the sentence tokens from the corpus
marvel_sents = [sent for e in marvel_articles.values() for sent in e['text']]
marvel_sents = [sent.lower().split() for sent in marvel_sents]
len([w for sent in marvel_sents for w in sent]) # our total number of tokens, 6M, very small

6226813

In [10]:
# we also want to know the number of cores on the machine to take advantage of multithreading
import multiprocessing
cores = multiprocessing.cpu_count()

In [49]:
!mkdir -p pretrained


In [12]:
from gensim.models import Word2Vec

# creating models will always take a few minutes, load pretrained whenever available
import os.path
if os.path.isfile('pretrained/w2v_sg_marvel'):
    word_model = Word2Vec.load('pretrained/w2v_sg_marvel')

else:
    # params are standard 300 dimensions and flag to use skip-gram instead of cbow
    word_model = Word2Vec(marvel_sents, size=300, min_count=2, sg=1, workers=cores)
    word_model.save('pretrained/w2v_sg_marvel')
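
The `min_count=2` argument above tells gensim to ignore words that appear fewer than twice in the corpus. A rough sketch of that filtering step, using a hypothetical `filter_vocab` helper and toy sentences (this is not gensim's code, only the idea behind the parameter):

```python
from collections import Counter

def filter_vocab(sentences, min_count=2):
    """Keep only words whose corpus frequency is at least `min_count`."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w: c for w, c in counts.items() if c >= min_count}

toy_sents = [
    ["the_hulk", "smashes"],
    ["the_hulk", "is", "angry"],
    ["zzzax", "is", "electricity"],
]
vocab = filter_vocab(toy_sents, min_count=2)
print(sorted(vocab))
```

Rare words have too few contexts to learn a reliable vector from, so dropping them shrinks the model at little cost.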

In [13]:
# voila, we have word vectors!
word_model.wv['iron_man']

array([-0.43522027,  0.11205714,  0.19221838, -0.05546402,  0.38778165,
        0.1834785 ,  0.62625   ,  0.01586861,  0.26890317,  0.13237992,
        0.36268172, -0.12491541,  0.0723379 ,  0.16060433,  0.26925954,
       -0.13638765,  0.23534346,  0.08005994,  0.16897567, -0.04062948,
        0.10002907, -0.03739918,  0.20564765, -0.05345098,  0.36546943,
       -0.23307633,  0.29653943, -0.19866562,  0.374645  ,  0.24621761,
        0.30302477,  0.32009596, -0.22005185, -0.09522678, -0.41295645,
       -0.14005767, -0.11456599,  0.05049056,  0.13850957, -0.28785157,
       -0.2615202 , -0.03051927, -0.378367  ,  0.27678624, -0.27755815,
       -0.01003014, -0.3273174 ,  0.07882054, -0.2675577 ,  0.31397986,
       -0.50161934,  0.53130054, -0.12663077, -0.06605022, -0.06589571,
        0.03520188,  0.24786277,  0.12557603, -0.3209612 , -0.20897345,
        0.11799174, -0.5177514 ,  0.10159639, -0.0465021 , -0.5377606 ,
       -0.28742516, -0.11256847,  0.18536718, -0.05616513,  0.60

# Computing Similarities

<img src="images/w2v_espresso.jpg">

To measure how similar two words are, we need a way to measure the degree of similarity between their two embedding vectors. Given two vectors A and B, the cosine similarity is defined as follows:

<img src="images/cos.png">

where A · B is the dot product (or inner product) of the two vectors, ||A|| is the norm (or length) of the vector A, and θ is the angle between A and B. The similarity depends only on that angle, not on the lengths of the vectors. It is close to 1 when A and B point in nearly the same direction, near 0 when they are orthogonal, and negative (down to -1) when they point in opposite directions.
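
As a quick sanity check of the definition, cos θ = (A · B) / (||A|| ||B||), here is a dependency-free sketch on toy 2D vectors. The `cosine` helper is written for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 0.0], [2.0, 0.0])   # length differs, angle is 0
diagonal = cosine([1.0, 0.0], [1.0, 1.0])         # 45 degrees apart
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # 90 degrees apart
print(round(same_direction, 3), round(diagonal, 3), round(orthogonal, 3))
```

Note that scaling a vector does not change the result, which is why normalized vectors can be compared with a plain dot product.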



### Let's compute some similarities by hand

In [14]:
# let's compute some similarities by hand
import numpy as np

v1 = word_model.wv['iron_man']
v2 = word_model.wv['tony_stark']
v3 = word_model.wv['matt_murdock']

print(np.dot(v1, v2)/(np.linalg.norm(v1)*np.linalg.norm(v2)))
print(np.dot(v1, v3)/(np.linalg.norm(v1)*np.linalg.norm(v3)))

0.70316815
0.3982348


As you can see, **v1** (`iron_man`) and **v2** (`tony_stark`) are quite similar (about 0.70), while **v1** and **v3** (`matt_murdock`) are much less so (about 0.40).

The figure below illustrates this intuition.  

<img src="https://datascience-enthusiast.com/figures/cosine_sim.png"/>

### An easier way to compute similarities with the gensim API

In [51]:
# making sure we've computed vector normalization
word_model.init_sims()  

print(word_model.similarity('iron_man', 'tony_stark'))
print(word_model.similarity('iron_man', 'matt_murdock'))

0.7031681981850164
0.39823477658647377


# Looking up Similar Words

### fetch the normalized vector (faster computation)

In [16]:
# fetch the normalized vector id (simplifies computation)
w1_idx = word_model.wv.index2word.index('avengers')

#model.wv.syn0norm stores the normalized vectors
v1_norm = word_model.wv.syn0norm[w1_idx]

sims = np.dot(word_model.wv.syn0norm, v1_norm)  # cosine sims for ALL vecs
sims = [word_model.wv.index2word[idx] for idx in np.argsort(sims)]  # corresponding words
sims = sims[::-1]  # argsort is ascending, so reverse the list (higher is best)
sims[:10]

['avengers',
 'the_avengers',
 'new_avengers',
 'x-men',
 'mighty_avengers',
 'avengers_team',
 'the_new_avengers',
 'west_coast_avengers',
 'force_works',
 'secret_avengers']

### or, once more, with gensim

In [17]:
# or, once more, with gensim
word_model.most_similar('avengers', topn=10)

[('the_avengers', 0.7016434669494629),
 ('new_avengers', 0.6698393225669861),
 ('x-men', 0.6675894260406494),
 ('mighty_avengers', 0.6660788059234619),
 ('avengers_team', 0.6639204025268555),
 ('the_new_avengers', 0.653827965259552),
 ('west_coast_avengers', 0.6492305994033813),
 ('force_works', 0.648532509803772),
 ('secret_avengers', 0.6462224125862122),
 ('ultimates', 0.6440812349319458)]

# Reasoning by Analogy


<img src="https://developers.google.com/machine-learning/crash-course/images/linear-relationships.svg"/>

### Can we discover the true identity of super-heroes?

In [18]:
identities = [('iron_man', 'tony_stark'), ('the_hulk', 'bruce_banner'), ('captain_america', 'steve_rogers'), ('falcon', 'sam_wilson')]

def normvec(label):  # aux method
    return word_model.wv.syn0norm[word_model.wv.index2word.index(label)]

# algebraic operation (tony_stark vector - iron_man + captain_america)
v = normvec('tony_stark') - normvec('iron_man') + normvec('captain_america')

word_model.similar_by_vector(v)

[('captain_america', 0.7874563932418823),
 ('steve_rogers', 0.5517191886901855),
 ('cap', 0.5504655838012695),
 ('rogers', 0.5347710251808167),
 ('tony_stark', 0.5166157484054565),
 ('barnes', 0.5102876424789429),
 ('sharon_carter', 0.5017931461334229),
 ('fred_davis', 0.48994091153144836),
 ('bucky_barnes', 0.4847586750984192),
 ('brian_falsworth', 0.48364752531051636)]

In [22]:
# the equivalent with gensim api
word_model.most_similar(positive=['tony_stark', 'captain_america'], negative=['iron_man'])

[('steve_rogers', 0.5517191886901855),
 ('cap', 0.5504656434059143),
 ('rogers', 0.5347710251808167),
 ('barnes', 0.5102876424789429),
 ('sharon_carter', 0.5017931461334229),
 ('fred_davis', 0.48994091153144836),
 ('bucky_barnes', 0.4847586750984192),
 ('brian_falsworth', 0.48364752531051636),
 ('anthony_stark', 0.4830740988254547),
 ('nick_fury', 0.48017236590385437)]

<img src="images/pair_relations.png">
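
The arithmetic is easier to see on hand-made vectors. In the toy space below (entirely made up for illustration), the first coordinate encodes which character we are talking about and the second encodes "civilian identity". The hypothetical `analogy` helper excludes the three input words from the candidates, mirroring the fact that gensim's `most_similar` does not return `captain_america` while the raw `similar_by_vector` call does. Unlike gensim, this sketch does not normalize the vectors first.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# toy vectors: [character axis, civilian-identity axis] (illustrative only)
toy = {
    "iron_man": [1.0, 0.0],        "tony_stark": [1.0, 1.0],
    "captain_america": [3.0, 0.0], "steve_rogers": [3.0, 1.0],
    "the_hulk": [-2.0, 0.0],       "bruce_banner": [-2.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("iron_man", "tony_stark", "captain_america", toy))
print(analogy("iron_man", "tony_stark", "the_hulk", toy))
```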

# Evaluating - Analogies (syntactic and semantic)

This is the standard way to assess the quality of a word embedding model.
Given an input pair, we try to guess the corresponding analogous pair.

Example: Lisbon is to Portugal as London is to England.
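
The evaluation loop can be sketched as follows, assuming a question file where each line holds four words `a b c d` and lines starting with `:` are section headers. The toy vectors and the `analogy_accuracy` helper are made up for illustration; this is not gensim's implementation. Questions containing a word that is missing from the vocabulary are skipped, which is why the totals in an accuracy report can be smaller than the number of lines in the file.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy_accuracy(questions, vectors):
    """Return (correct, attempted) over 'a b c d' lines, skipping headers and OOV questions."""
    correct = attempted = 0
    for line in questions:
        if line.startswith(":"):  # section header
            continue
        a, b, c, expected = line.lower().split()
        if any(w not in vectors for w in (a, b, c, expected)):
            continue  # out-of-vocabulary question, not counted
        target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
        guess = max((w for w in vectors if w not in (a, b, c)),
                    key=lambda w: cosine(vectors[w], target))
        attempted += 1
        correct += guess == expected
    return correct, attempted

# toy vectors: [city/country group, is-country flag] (illustrative only)
toy = {
    "lisbon": [1.0, 0.0], "portugal": [1.0, 1.0],
    "london": [3.0, 0.0], "england": [3.0, 1.0],
    "paris": [-2.0, 0.0], "france": [-2.0, 1.0],
}
questions = [
    ": capital-common-countries",
    "Lisbon Portugal London England",
    "Lisbon Portugal Paris France",
    "Athens Greece Paris France",  # skipped: athens and greece are not in the toy vocabulary
]
print(analogy_accuracy(questions, toy))
```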

In [19]:
!head eval/questions-words.txt

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba


In [15]:
import logging
logging.basicConfig(level=logging.INFO)
results = word_model.accuracy('eval/questions-words.txt')

INFO:gensim.models.keyedvectors:capital-common-countries: 9.6% (15/156)
INFO:gensim.models.keyedvectors:capital-world: 4.3% (6/139)
INFO:gensim.models.keyedvectors:currency: 0.0% (0/18)
INFO:gensim.models.keyedvectors:city-in-state: 3.3% (18/544)
INFO:gensim.models.keyedvectors:family: 39.5% (135/342)
INFO:gensim.models.keyedvectors:gram1-adjective-to-adverb: 1.6% (13/812)
INFO:gensim.models.keyedvectors:gram2-opposite: 1.0% (2/210)
INFO:gensim.models.keyedvectors:gram3-comparative: 13.0% (121/930)
INFO:gensim.models.keyedvectors:gram4-superlative: 5.9% (16/272)
INFO:gensim.models.keyedvectors:gram5-present-participle: 14.8% (120/812)
INFO:gensim.models.keyedvectors:gram6-nationality-adjective: 3.0% (19/633)
INFO:gensim.models.keyedvectors:gram7-past-tense: 13.9% (206/1482)
INFO:gensim.models.keyedvectors:gram8-plural: 11.8% (103/870)
INFO:gensim.models.keyedvectors:gram9-plural-verbs: 12.1% (51/420)
INFO:gensim.models.keyedvectors:total: 10.8% (825/7640)


In [35]:
!ls pretrained

exemplo.txt				     w2v_sg_marvel.syn1neg.npy
pv_dm_concat_marvel			     w2v_sg_marvel_text8
pv_dm_concat_marvel.docvecs.doctag_syn0.npy  w2v_sg_marvel_text8.syn1neg.npy
pv_dm_concat_marvel.syn1neg.npy		     w2v_sg_marvel_text8.wv.syn0.npy
skip_thoughts_uni_2017_02_02.tar.gz	     w2v_sg_marvel.wv.syn0.npy
w2v_sg_marvel				     wiki-news-300d-1M.magnitude


Not so good... maybe we can do better with more data

## Expanding Vector Space

Learning from a Wikipedia corpus: with an extended dataset we can learn better word embeddings.

In [None]:
!wget http://mattmahoney.net/dc/text8.zip -P ./corpus
!unzip ./corpus/text8.zip -d ./corpus

In [142]:
# the text8 corpus is 100MB of tokenized text from wikipedia (N tokens)
from gensim.models.word2vec import Text8Corpus

if os.path.isfile('pretrained/w2v_sg_marvel_text8'):
    # load from pretrained
    word_model = Word2Vec.load('pretrained/w2v_sg_marvel_text8')

else:
    # continue training
    word_model.train(Text8Corpus('corpus/text8'))
    word_model.save('pretrained/w2v_sg_marvel_text8')

results = word_model.accuracy('eval/questions-words.txt')

INFO:gensim.utils:loading Word2Vec object from pretrained/w2v_sg_marvel_text8
INFO:gensim.utils:loading wv recursively from pretrained/w2v_sg_marvel_text8.wv.* with mmap=None
INFO:gensim.utils:loading syn0 from pretrained/w2v_sg_marvel_text8.wv.syn0.npy with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:loading syn1neg from pretrained/w2v_sg_marvel_text8.syn1neg.npy with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:loaded pretrained/w2v_sg_marvel_text8
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:gensim.models.keyedvectors:capital-common-countries: 39.1% (61/156)
INFO:gensim.models.keyedvectors:capital-world: 43.2% (60/139)
INFO:gensim.models.keyedvectors:currency: 0.0% (0/18)
INFO:gensim.models.keyedvectors:city-in-state: 23.7% (129/544)
INFO:gensim.models.keyedvectors:family: 36.8% (126/342)
INFO:gensim.models.keyedvectors:gram1-adjective-to-adverb: 2.2% (18/81

# Visualizing the Vector Space

In [28]:
# select random n vectors (normalized)
from random import sample
words = sample(list(word_model.wv.vocab.keys()), 1000)
vecs = [normvec(n) for n in words]

from sklearn.manifold import TSNE
# reduce dimensionality to 2 dimensions so we can plot in 2 D 
vecs_2d = TSNE(n_components=2).fit_transform(vecs)

from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

source = ColumnDataSource(data=dict(x=vecs_2d[:, 0], y=vecs_2d[:, 1], labels=words))

p = figure()
p.scatter('x', 'y', size=8, source=source)

# include labels
labels = LabelSet(x='x', y='y', text='labels', level='glyph',
                  x_offset=5, y_offset=5, source=source, render_mode='canvas')
p.add_layout(labels)

show(p)

Find any clusters? We may need a larger corpus for this...

# Visualizing Embeddings with TensorFlow (TensorBoard)
<img src="images/tensorboard_ironman.png">

[Tutorial](https://www.tensorflow.org/get_started/embedding_viz)
http://projector.tensorflow.org/

In [None]:
#save model into w2v format
word_model.wv.save_word2vec_format("./pretrained/w2v_sg_marvel_text8_w2v_format")

In [151]:
#Convert gensim word2vec format to TensorFlow format
!python -m gensim.scripts.word2vec2tensor -i ./pretrained/w2v_sg_marvel_text8_w2v_format -o pretrained/w2v_sg_marvel_text8_tensorflow_format

2018-11-26 18:16:23,124 : MainThread : INFO : running /home/arian/Developer/workspace/notebook-labs/dspt8/dev/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py -i ./pretrained/w2v_sg_marvel_text8_w2v_format -o pretrained/w2v_sg_marvel_text8_tensorflow_format
2018-11-26 18:16:23,125 : MainThread : INFO : loading projection weights from ./pretrained/w2v_sg_marvel_text8_w2v_format
2018-11-26 18:16:33,200 : MainThread : INFO : loaded (59368, 300) matrix from ./pretrained/w2v_sg_marvel_text8_w2v_format
2018-11-26 18:16:41,379 : MainThread : INFO : 2D tensor file saved to pretrained/w2v_sg_marvel_text8_tensorflow_format_tensor.tsv
2018-11-26 18:16:41,379 : MainThread : INFO : Tensor metadata file saved to pretrained/w2v_sg_marvel_text8_tensorflow_format_metadata.tsv
2018-11-26 18:16:41,388 : MainThread : INFO : finished running word2vec2tensor.py


In [156]:
!ls ./pretrained/w2v_sg_marvel_text8_tensorflow_format_metadata.tsv
!ls ./pretrained/w2v_sg_marvel_text8_tensorflow_format_tensor.tsv

./pretrained/w2v_sg_marvel_text8_tensorflow_format_metadata.tsv
./pretrained/w2v_sg_marvel_text8_tensorflow_format_tensor.tsv


Load the tensor and metadata files generated above at http://projector.tensorflow.org/.
In TensorBoard, use the **Embeddings** tab.

This concludes the overview of what's possible with dense representations of words.

# Sentence embeddings

How to use word embeddings to build a sentence representation.

<center><h1>Paragraph Vectors (doc2vec)</h1></center>
<img src="images/doc2vec.png">

# Searching Similar Sentences

Install dependencies

In [None]:
!pip install nltk
!pip install annoy
!pip install pymagnitude --no-cache-dir
!pip install pandas

### Indexing models using PyMagnitude
Sentence embeddings can easily use a lot of RAM.
In this example we use a pretrained word embedding model trained on Wikipedia, loaded with a library called *pymagnitude* (https://pypi.org/project/pymagnitude/).


This library lets us keep the model on disk and still run efficient searches with a very low memory footprint.

In [30]:
# download pretrained word embedding model trained with the entire english wikipedia
# 1.6Gb
!wget http://magnitude.plasticity.ai/fasttext/light/wiki-news-300d-1M.magnitude -O pretrained/wiki-news-300d-1M.magnitude

In [33]:
from pymagnitude import Magnitude

#loading pretrained model
wiki_vectors = Magnitude("./pretrained/wiki-news-300d-1M.magnitude")

In [38]:
#check vector size 
wiki_vectors.dim

300

In [34]:
def avg_sentence_vector(words, model):
    '''Generate a sentence vector by averaging the vectors of
    all the words in a given sentence (a list of tokens).'''

    # accept a raw string as well, splitting it on whitespace
    if isinstance(words, str):
        words = words.split()

    # get word embedding vector size
    num_features = model.dim

    # create a corresponding empty vector to store our new sentence vector
    featureVec = np.zeros((num_features,), dtype="float32")

    nwords = 0

    # query the model passed in (not the global one) for each word vector
    for word_vec in model.query(words):
        nwords = nwords + 1
        # accumulate every word vector
        featureVec = np.add(featureVec, word_vec)

    if nwords > 0:
        # divide the sum by the number of words in the sentence
        featureVec = np.divide(featureVec, nwords)

    # sentence representation is the average of its word vectors
    return featureVec

def embed_sentence(words, model):
    return avg_sentence_vector(words, model)
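
The averaging idea is easy to check on toy data. In the sketch below a plain dictionary stands in for the embedding model, and `average_vector` is a hypothetical helper. One assumption to flag: unknown tokens are skipped here, which may not match how pymagnitude treats out-of-vocabulary words.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the tokens found in `vectors`; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # nothing to average
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

# toy 2D "embeddings" (illustrative only)
toy_vectors = {"hulk": [1.0, 0.0], "smash": [0.0, 1.0], "angry": [1.0, 1.0]}
sentence_vec = average_vector("hulk smash puny".split(), toy_vectors)
print(sentence_vec)
```

The input must be a list of tokens. Passing a raw string would make Python iterate over its characters, so the sentence has to be split first.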


In [35]:
embed_sentence("he mutant rendered Bruce_Banner unconscious without draining him of all his life energy".split(), wiki_vectors)

The result is a 300-dimensional vector (output not shown). Note the `.split()`: `query` needs a list of tokens to return one vector per word. The original run passed the raw string, which `query` treats as a single key, so the loop iterated over the 300 components of one vector and every entry of the result collapsed to the same value (-0.00164389).

### Saving and loading model

Here we encode all sentences in our dataset and create an index file. 

In [43]:
os.path.isfile('./pretrained/marvel.sentences.annoy.index')

True

In [44]:
from annoy import AnnoyIndex

# load from pretrained
embedding_vector_size = wiki_vectors.dim

#Define the vector size of the embedding
a = AnnoyIndex(embedding_vector_size)

if os.path.isfile('./pretrained/marvel.sentences.annoy.index'):
    
    print("loading pretrained sentence model index")
    a.load('./pretrained/marvel.sentences.annoy.index')
    
else:
    print("Total of sentences to encode: ",len(marvel_sents))
    print("Encoding sentences. This will take a while ...")

    for idx, sent in enumerate(marvel_sents):
        sent_embedding = embed_sentence(sent, wiki_vectors)
        if(idx % 10000 == 0):
            print(idx, "processed sentences")

        a.add_item(idx, sent_embedding)

    print("building search index ...")
    a.build(-1)
    print("saving index file ...")
    a.save('./pretrained/marvel.sentences.annoy.index')
    
    
    


loading pretrained sentence model index


In [46]:
def print_similar_sentences(sent, top_NN=10):
    '''
    Encode the input sentence into a vector representation, then
    query the Annoy index with it to find the nearest neighbors.
    '''

    # encode the input sentence
    encoding = embed_sentence(sent.split(), wiki_vectors)

    # query the Annoy index to find the nearest neighbors
    result = a.get_nns_by_vector(encoding, top_NN, include_distances=True)

    sorted_ids = result[0]
    scores = result[1]  # distances: lower means more similar

    num = len(sorted_ids)

    print("Sentence:")
    print("", sent)
    print("\nNearest neighbors:")
    # index 0 is skipped (for a query taken from the corpus it is the query itself),
    # so top_NN=10 prints 9 results
    for i in range(1, num):
        print(" %d. %s (%.3f)" % (i, ' '.join(marvel_sents[sorted_ids[i]]), scores[i]))
           

In [47]:
print_similar_sentences('professor gilbert is an accomplished scientist with extensive knowledge of robotics .', 10)

Sentence:
 professor gilbert is an accomplished scientist with extensive knowledge of robotics .

Nearest neighbors:
 1. walker is an accomplished inventor with skill in electronics and engineering . (0.345)
 2. chandra is a gifted geneticist and an expert in nanotechnology . (0.359)
 3. garrett achieved international respect as a brilliant research scientist and inventor whose particular specialty was genetic manipulation . (0.372)
 4. luther is a capable fighter and an has an understanding of electronics . (0.375)
 5. he was an archaeologist with an advanced knowledge of archaeology . (0.387)
 6. he is an accomplished thief and smuggler with extensive military training . (0.390)
 7. he is an expert in genetic engineering and mutation . (0.397)
 8. accomplishments in philosophy : warlock also is an accomplished self-taught philosopher . (0.397)
 9. dr._tempest_bell is an intelligent scientist specializing in astrobiology . (0.408)
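
The numbers in parentheses are distances, not similarities, so smaller means closer. Assuming the index uses Annoy's default angular metric, which its documentation describes as sqrt(2(1 - cos(u, v))), a distance d corresponds to a cosine similarity of 1 - d²/2 (so 0.345 is roughly 0.94). The sketch below computes that distance by hand and does an exact, brute-force search over toy vectors, which is what the index approximates. Both helpers are hypothetical.

```python
import math

def angular_distance(u, v):
    """Annoy-style angular distance: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return math.sqrt(max(0.0, 2.0 * (1.0 - dot / norm)))  # clamp tiny negative rounding errors

def brute_force_nn(query, items, top_n=2):
    """Exact nearest neighbours found by scanning every item."""
    scored = sorted((angular_distance(query, vec), idx) for idx, vec in enumerate(items))
    return [(idx, round(dist, 3)) for dist, idx in scored[:top_n]]

# toy 2D vectors standing in for sentence embeddings (illustrative only)
items = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
neighbours = brute_force_nn([1.0, 0.1], items)
print(neighbours)
```

A full scan costs one distance computation per indexed sentence, which is why an approximate index is attractive for a corpus of this size.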


In [122]:
print_similar_sentences('he mutant rendered Bruce_Banner unconscious without draining him of all his life energy')

Sentence:
 he mutant rendered Bruce_Banner unconscious without draining him of all his life energy

Nearest neighbors:
 1. mentor blocked all of drax 's memories of his old life , instilling in him monomaniacal hate for thanos . (0.376)
 2. mentor blocked all of drax 's memories of his old life , instilling in him monomaniacal hate for thanos . (0.376)
 3. he created a suit_of_armor which would siphon off his wards ' mutant energies for his own benefit , granting him incredible powers . (0.377)
 4. during his life of crime , he encountered venom in his spider-man alias and had his leg eaten by him . (0.379)
 5. for his vengeance he transported the suns near our galaxy into his omnipotent brain . (0.382)
 6. this incurable disease was slowing robbing maas of his mobility and eventually robbed him of his life . (0.382)
 7. david would lash out with his nascent mutant psychic powers , killing their attackers and absorbing karami’s mind into his subconscious . (0.386)
 8. whiphand possesse

In [123]:
print_similar_sentences("hector told logan where to find rojas , and was killed by felix .")

Sentence:
 hector told logan where to find rojas , and was killed by felix .

Nearest neighbors:
 1. taras then told natasha to kill logan , but logan instead killed him and allowed natasha to depart . (0.313)
 2. taras then told natasha to kill logan , but logan instead killed him and allowed natasha to depart . (0.313)
 3. after he was told by blackie , spider-man came here to free martha and billy_connors , defeating man-mountain_marko and silvermane . (0.319)
 4. nestor was asked by logan where to find ritter . (0.336)
 5. as soon as preston left , ellie was told by joshua that they were going to leave . (0.345)
 6. chapman encountered mister_fantastic , invisible_woman and black_panther and traveled to hong_kong where he was killed by dolph . (0.358)
 7. natasha and melissa are then ordered by norman_osborn to be interrogated and disposed of . (0.364)
 8. while escaping , luther and the others would be attacked by nazis and be rescued by the_sub-mariner who came to recapture luthe

In [111]:
print_similar_sentences("they were surrounded , with no escape in sight")

Sentence:
 they were surrounded , with no escape in sight

Nearest neighbors:
 1. they were successful in stopping both foes , but because they fought them instead of fleeing the scene , they were captured by freedom_force a second time . (0.358)
 2. not knowing that these people were mutants , davis became enamored of storm , and managed to get in a dance with her before they were all attacked by tong assassins . (0.358)
 3. none of them were glad to see the others , but with each having their reasons for getting back together , they formed the nightstalkers ; by day , they were private investigators , by night , they fought a number of supernatural villains . (0.364)
 4. along with the other students , they were pulled into limbo , where they were rescued by prodigy . (0.365)
 5. along with the other students , they were pulled into limbo , where they were rescued by prodigy . (0.365)
 6. along with the other students , they were pulled into limbo , where they were rescued by prodigy

# Other Relevant Embeddings

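* GloVe
    * learns vectors from global word co-occurrence counts, a more explicit training objective
* fastText
    * from the original author of word2vec; adds subword (character n-gram) vectors and a text classifier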
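
The subword idea behind fastText can be sketched briefly: each word is wrapped in boundary markers and broken into character n-grams, and the word vector is built from the n-gram vectors. This lets the model produce a vector even for words it never saw in training. The helper below is a hypothetical illustration of the n-gram extraction only, using the 3 to 6 character range that I believe is fastText's default.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

trigrams = char_ngrams("hulk", n_min=3, n_max=3)
print(trigrams)
```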

# Recommended Reading

* [Deep Learning, NLP, and Representations](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)
* [Meanings are Vectors](http://sanjaymeena.io/tech/word-embeddings/)
* [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)
* [A Word is Worth a Thousand Vectors](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/#footnote1)
* [On word embeddings](http://sebastianruder.com/word-embeddings-1/)
* [Operations on word vectors](https://datascience-enthusiast.com/DL/Operations_on_word_vectors.html)


> You now have the word embedding superpower!
STAN LEE, 2018

> With great power there must also come ... great responsibility! 
STAN LEE, Amazing Fantasy, 15, August 1962


<img src="https://www.emaisgoias.com.br/wp-content/uploads/2018/04/stan-lee.jpg">