<center>
<h1>Have we cracked semantics?</h1>
<h2>A practitioner's exploration into what’s possible.</h2>
<h2>Data Science Portugal Meetup (DSPT) #8</h2>
<h3>Daniel Loureiro (danlou.github.io)</h3>
<h3>Braga, April 5th 2017</h3>
<img src="images/logo_no_shadow.png">
</center>

# Plan for this talk

1. Why are Word Embeddings Relevant?
2. Key Concept 1: Context Windows
3. Key Concept 2: Two Shallow NN Approaches
4. Selecting a Corpus
5. Training a Word Model
6. Computing Similarities
7. Looking up Similar Words
8. Reasoning by Analogy
9. Evaluating - Analogies
10. Visualizing the Vector Space
11. Recommendation with doc2vec - Related Pages
12. Search with Skip-Thought Vectors - Similar Sentences

# Dependencies

* Python 3.5.x
* numpy
* sklearn (0.18.1)
* scipy
* bokeh
    * I recommend you use the [Anaconda distribution of Python (4.2.x)](https://repo.continuum.io/archive/index.html) which includes all of the above
* gensim (1.0.1)
* tensorflow (1.0.1)

### A bit about my path

* *2011* - BIC at GECAD on Sentiment Analysis, EPIA


* *2012* - Startup Pirates, Mini Seedcamp with AskPepito


* *2013* - BSc Computer Science from FCUP, LxMLS


* *2014* - Started PepFeed, incubated Startup Braga


* *2015* - Raised seed round for PepFeed, more CEO/CTO


* *2016* - Joined Followprice, back to full-time DS


* *2017* - Left Followprice, TBD

<center><h1>Why are Word Embeddings Relevant?</h1></center>
<img src="images/sparse_words.png">
<p style='text-align:right'>Source: TensorFlow</p>


> The use of word representations... has become a "secret sauce" for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling.

[Luong, Socher, Manning (2013)](https://nlp.stanford.edu/~lmthang/data/papers/conll13_morpho.pdf)

<center>
<h1>Context Windows</h1>
<h3>"You shall know a word by the company it keeps" (Firth, 1957)</h3></center>
<img src="images/window.png">
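As a rough illustration of the idea in the figure, the sketch below slides a fixed-size window over a tokenized sentence and lists each word together with its neighbours. The function name `context_windows`, the example sentence, and the window size are assumptions made for this example only.

```python
def context_windows(tokens, window=2):
    """Return (target, context words) for every position in the sentence."""
    windows = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + window + 1]      # up to `window` words after
        windows.append((target, left + right))
    return windows

sentence = "iron_man is the alias of tony_stark".split()
windows = context_windows(sentence, window=2)
for target, context in windows:
    print(target, '<-', context)
```

Words that keep showing up in similar windows end up with similar vectors, which is all the "company it keeps" quote asks for.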

<center><h1>Two Shallow NN Approaches</h1></center>
<img src="images/cbow_sg.png">
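To make the difference between the two architectures concrete, here is a minimal numpy sketch of one forward pass for each. It assumes a toy four-word vocabulary, random weights, and a full softmax output; real implementations avoid the full softmax (for example with negative sampling), so treat this as a picture of the data flow rather than of any library's internals.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab = ["the_hulk", "smashed", "the", "robot"]   # toy vocabulary (assumption)
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 5

# Two weight matrices and no non-linearity in between: that is the whole "shallow" network.
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input (word) vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # output (context) vectors

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def cbow_forward(context_words):
    """CBOW: average the context vectors, then score every word as the centre word."""
    ids = [word_to_id[w] for w in context_words]
    hidden = W_in[ids].mean(axis=0)
    return softmax(np.dot(W_out, hidden))

def skipgram_forward(center_word):
    """Skip-gram: use the centre word's vector to score every word as a context word."""
    hidden = W_in[word_to_id[center_word]]
    return softmax(np.dot(W_out, hidden))

p_cbow = cbow_forward(["the_hulk", "the"])   # P(centre word | context)
p_sg = skipgram_forward("smashed")           # P(context word | centre word)
print(p_cbow, p_sg)
```

After training, the rows of `W_in` are the word vectors that the rest of this talk works with.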

# Selecting a Corpus

### Marvel Wikia Corpus (http://marvel.wikia.com)

* Brand new, compiled for this DSPT session
* 27,045 text documents with details about characters, items, locations, etc. (one file)
* Demonstrates that word vectors can be useful with smaller corpora (<< 1B tokens)
* Pre-processed: tokenized, with multi-word expressions joined (e.g. `Tony_Stark`)
* Released under CC BY-NC-SA 4.0 license (free of infringements)
* Code for compiling this corpus - https://gist.github.com/danlou/532f761b6f568e20ee10a613aacea716
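The joining of multi-word expressions can be pictured with the sketch below. It is not the code from the gist above: the function `join_expressions`, the greedy longest-match strategy, and the small expression set are all assumptions used to show what the underscore-joined tokens in the corpus look like.

```python
def join_expressions(tokens, expressions):
    """Greedily merge known multi-word expressions into underscore-joined tokens."""
    max_len = max(len(e) for e in expressions)
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):            # prefer the longest match
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in expressions:
                out.append("_".join(candidate))
                i += n
                break
        else:                                      # no expression starts here
            out.append(tokens[i])
            i += 1
    return out

# hypothetical expression list, e.g. derived from page titles
expressions = {("Tony", "Stark"), ("New", "Mexico"), ("West", "Coast", "Avengers")}
tokens = "Tony Stark flew to New Mexico .".split()
joined = join_expressions(tokens, expressions)
print(joined)
```

Joining matters because word2vec learns one vector per token: without it, `Tony` and `Stark` would be trained as two unrelated words.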

In [1]:
!tail -50 corpus/marvel.txt # blank line separates documents, docs are composed of article_id, title, sentences.

As a result Zzzax was grounded and its field was disrupted .
Somehow Zzzax was imprisoned by S.H.I.E.L.D ( Earth-616 ) soon after Hawkeye and Wonder_Man defeated it .
SHIELD kept Zzzax , which was still in humanoid form , confined within a large insulated vacuum tube , and then transferred it into another such tube at Gamma_Base , New_Mexico .
There former General `` Thunderbolt '' Ross , who was obsessed with a desire to destroy the_Hulk , submitted to a SHIELD experiment to transform him into a superhuman being by infusing some of Zzzax 's `` living electricity '' into Ross 's body .
But the experiment went awry , and Ross 's psychic_energy -- his mind , in effect -- was absorbed by Zzzax , and Zzzax broke free .
But , strangely , Ross 's mind , perhaps because of the strength of its hatred for the_Hulk , took control of Zzzax , submerging Zzzax 's own personality .
Meanwhile , Ross ' original physical body remained alive , but all of that body 's independent thought processes 

In [2]:
# let's load the corpus
from collections import OrderedDict
marvel_articles = OrderedDict()

article_lines = []
with open('corpus/marvel.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()

        if len(line) > 0:
            article_lines.append(line)
        else:
            article_id = int(article_lines[0])
            article_name = article_lines[1]
            article_text = article_lines[2:]
            marvel_articles[article_id] = dict(name=article_name, text=article_text)

            article_lines = []  # reset for next article/document

In [3]:
len(marvel_articles)

27045
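For reference, the file layout the loader expects can be exercised on an in-memory sample. This is a sketch under assumptions: the ids, titles, and sentences are made up, and the function `parse_articles` is a hypothetical variant of the loop above that also keeps a final article when the file does not end with a blank line.

```python
import io
from collections import OrderedDict

def parse_articles(lines):
    """Parse blank-line-separated articles: id line, title line, then sentences."""
    articles = OrderedDict()
    block = []
    for raw in lines:
        line = raw.strip()
        if line:
            block.append(line)
            continue
        if block:
            articles[int(block[0])] = dict(name=block[1], text=block[2:])
            block = []
    if block:  # flush a last article that has no blank line after it
        articles[int(block[0])] = dict(name=block[1], text=block[2:])
    return articles

sample = io.StringIO(
    "1\nFirst Title\nSentence one .\nSentence two .\n"
    "\n"
    "2\nSecond Title\nOnly sentence .\n"   # note: no trailing blank line
)
articles = parse_articles(sample)
print(list(articles.keys()), articles[1]['name'])
```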

In [4]:
tony_stark = marvel_articles[1868]
print(tony_stark['text'][:25])
print(len(tony_stark['text']))

['The biological parents of Tony_Stark were two S.H.I.E.L.D agents , Amanda_Armstrong and Jude , who met during a courier mission .', 'After Jude saved Amanda from an assassin , they got to know each other and fell in love .', 'Following a two-year relationship , Amanda became pregnant .', 'A week before giving birth to the baby , Jude revealed to have been a Hydra double-agent with little regard for anybody but Amanda and himself who sold out fellow S.H.I.E.L.D soldiers , and was even responsible for the incident that had almost cost Amanda her life .', "During a discussion when he was trying to convince Amanda to accept Hydra 's protection , she attacked Jude and killed him .", 'Traumatized by this development , Amanda asked S.H.I.E.L.D to ensure her future baby would find a safe and happy home .', 'However , director Nick_Fury followed the same procedure used for unwanted pregnancies in the agency , and the baby was left in an orphanage in Sofia , Bulgaria after Amanda birthed him i

# Training a Word Model

We'll be using the highly praised [gensim](https://github.com/RaRe-Technologies/gensim) package and their implementation of word2vec.

In [5]:
# we need to isolate the sentence tokens from the corpus
marvel_sents = [sent for e in marvel_articles.values() for sent in e['text']]
marvel_sents = [sent.lower().split() for sent in marvel_sents]
len([w for sent in marvel_sents for w in sent]) # our total number of tokens, 6M, very small

6226813

In [6]:
# we also want to know the number of cores on the machine to take advantage of multithreading
import multiprocessing
cores = multiprocessing.cpu_count()

In [7]:
from gensim.models import Word2Vec

# creating models will always take a few minutes, load pretrained whenever available
import os.path
if os.path.isfile('pretrained/w2v_sg_marvel'):
    word_model = Word2Vec.load('pretrained/w2v_sg_marvel')

else:
    # params are standard 300 dimensions and flag to use skip-gram instead of cbow
    word_model = Word2Vec(marvel_sents, size=300, min_count=2, sg=1, workers=cores)
    word_model.save('pretrained/w2v_sg_marvel')

In [8]:
# voila, we have word vectors!
word_model.wv['iron_man']

array([ 0.23367834,  0.29375732, -0.45883524, -0.17062214,  0.06527832,
        0.03812009, -0.295131  ,  0.01019869, -0.12619233, -0.04480179,
       -0.59440571,  0.31791714, -0.28028834, -0.23229396,  0.13952047,
       -0.31170291, -0.27238059, -0.07909798,  0.20126045, -0.08187994,
        0.00599575, -0.1405274 , -0.11241629,  0.16259389, -0.20842955,
       -0.05150102,  0.13531914, -0.14348233, -0.33023039, -0.30291417,
       -0.61079025,  0.17066939,  0.26637861, -0.09950258,  0.12790087,
        0.37538701, -0.35994497,  0.20974636,  0.00361202,  0.20891951,
       -0.16384113, -0.22473757,  0.07013547, -0.09483541,  0.05740248,
        0.09562255,  0.31106865,  0.02298987, -0.01928948, -0.00395847,
       -0.1754694 ,  0.34890848, -0.06538704, -0.42149439, -0.29546908,
        0.00601043,  0.0114274 , -0.18908688,  0.01014967, -0.33985901,
       -0.18500525, -0.84454107, -0.11101522, -0.30179295,  0.06611467,
        0.07937766, -0.03002396, -0.2946749 , -0.0373145 , -0.11

# Computing Similarities

<img src="images/cos.png">

In [9]:
# let's compute some similarities then
import numpy as np

v1 = word_model.wv['iron_man']
v2 = word_model.wv['tony_stark']
v3 = word_model.wv['matt_murdock']

print(np.dot(v1, v2)/(np.linalg.norm(v1)*np.linalg.norm(v2)))
print(np.dot(v1, v3)/(np.linalg.norm(v1)*np.linalg.norm(v3)))

0.701744
0.408435


In [10]:
# or more simply using gensim
word_model.init_sims()  # making sure we've computed norms, may be unnecessary
print(word_model.similarity('iron_man', 'tony_stark'))
print(word_model.similarity('iron_man', 'matt_murdock'))

0.701743639756
0.408435017743
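A small self-contained check of the formula, using made-up 3-dimensional vectors instead of the model's: cosine similarity ignores vector length, and once vectors are scaled to unit length it reduces to a plain dot product. The helper names `cosine` and `unit` are assumptions for this sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)

u = np.array([3.0, 4.0, 0.0])
v = np.array([4.0, 3.0, 0.0])

print(cosine(u, v))                # angle-based similarity
print(np.dot(unit(u), unit(v)))    # same value from unit vectors
print(cosine(u, 10 * u))           # rescaling a vector does not change the result
```

This is why precomputing normalized vectors pays off when one word has to be compared against the whole vocabulary: every comparison becomes a single dot product.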


# Looking up Similar Words

<img src="images/w2v_espresso.jpg">

In [11]:
# fetch the normalized vector (simplifies computation)
w1_idx = word_model.wv.index2word.index('avengers')
v1_norm = word_model.wv.syn0norm[w1_idx]

sims = np.dot(word_model.wv.syn0norm, v1_norm)  # cosine sims for ALL vecs
sims = [word_model.wv.index2word[idx] for idx in np.argsort(sims)]  # corresponding words
sims = sims[::-1]  # reverse the list so the highest similarity comes first
sims[:10]

['avengers',
 'the_avengers',
 'mighty_avengers',
 'avengers_team',
 'secret_avengers',
 'force_works',
 'new_avengers',
 'unity_division',
 'west_coast_avengers',
 'x-men']

In [12]:
# or, once more, with gensim
word_model.most_similar('avengers', topn=10)

[('the_avengers', 0.700478196144104),
 ('mighty_avengers', 0.6753344535827637),
 ('avengers_team', 0.6537984609603882),
 ('secret_avengers', 0.65278559923172),
 ('force_works', 0.647817850112915),
 ('new_avengers', 0.6450065970420837),
 ('unity_division', 0.6299926042556763),
 ('west_coast_avengers', 0.6266465187072754),
 ('x-men', 0.6255587339401245),
 ('avengers_west_coast', 0.6206583976745605)]

# Reasoning by Analogy

In [13]:
# can we discover the true identity of super-heroes?
identities = [('iron_man', 'tony_stark'), ('the_hulk', 'bruce_banner'), ('captain_america', 'steve_rogers'), ('falcon', 'sam_wilson')]

def normvec(label):  # aux method
    return word_model.wv.syn0norm[word_model.wv.index2word.index(label)]

v = normvec('tony_stark') - normvec('iron_man') + normvec('captain_america')

word_model.similar_by_vector(v)

[('captain_america', 0.772327184677124),
 ('rogers', 0.56960129737854),
 ('steve_rogers', 0.5688270330429077),
 ('cap', 0.562447190284729),
 ('tony_stark', 0.5591169595718384),
 ('sharon_carter', 0.5284056663513184),
 ('bucky_barnes', 0.5162870287895203),
 ('james_barnes', 0.504021167755127),
 ('barnes', 0.4948274791240692),
 ('bucky', 0.4910045266151428)]

In [14]:
# the equivalent with gensim
word_model.most_similar(positive=['tony_stark', 'captain_america'], negative=['iron_man'])

[('rogers', 0.56960129737854),
 ('steve_rogers', 0.5688270330429077),
 ('cap', 0.562447190284729),
 ('sharon_carter', 0.5284056663513184),
 ('bucky_barnes', 0.5162870287895203),
 ('james_barnes', 0.504021167755127),
 ('barnes', 0.4948274791240692),
 ('bucky', 0.4910045564174652),
 ('nick_fury', 0.489604651927948),
 ('the_red_skull', 0.4890736937522888)]

<img src="images/pair_relations.png">
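The two result lists above differ in one respect: the manual lookup returns `captain_america` itself, while `most_similar` leaves out the words that were part of the query. The sketch below reproduces that behaviour on hand-made 3-dimensional vectors (invented for illustration, not taken from the trained model); the function `analogy` and its `exclude_inputs` flag are assumptions.

```python
import numpy as np

# Toy vectors: first two dimensions encode "who", the third encodes hero vs. civilian name.
toy = {
    "iron_man":        np.array([1.0, 0.0,  0.2]),
    "tony_stark":      np.array([1.0, 0.0, -0.2]),
    "captain_america": np.array([0.0, 1.0,  0.2]),
    "steve_rogers":    np.array([0.0, 1.0, -0.2]),
    "the_hulk":        np.array([1.0, 1.0,  0.2]),
    "bruce_banner":    np.array([1.0, 1.0, -0.2]),
}
unit = {w: v / np.linalg.norm(v) for w, v in toy.items()}

def analogy(positive, negative, exclude_inputs=True):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit[w] for w in positive) - sum(unit[w] for w in negative)
    query = query / np.linalg.norm(query)
    inputs = set(positive) | set(negative)
    scored = [(w, float(np.dot(v, query))) for w, v in unit.items()
              if not (exclude_inputs and w in inputs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

raw = analogy(["tony_stark", "captain_america"], ["iron_man"], exclude_inputs=False)
filtered = analogy(["tony_stark", "captain_america"], ["iron_man"])
print(raw)
print(filtered)
```

Dropping the query words is the usual convention, since a word that was added to the query vector tends to stay close to it and would otherwise crowd out the answer.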

# Evaluating - Analogies (syntactic and semantic)

In [15]:
import logging
logging.basicConfig(level=logging.INFO)
results = word_model.accuracy('eval/questions-words.txt')

INFO:gensim.models.keyedvectors:capital-common-countries: 10.4% (19/182)
INFO:gensim.models.keyedvectors:capital-world: 5.9% (9/152)
INFO:gensim.models.keyedvectors:currency: 0.0% (0/28)
INFO:gensim.models.keyedvectors:city-in-state: 3.3% (18/544)
INFO:gensim.models.keyedvectors:family: 38.6% (132/342)
INFO:gensim.models.keyedvectors:gram1-adjective-to-adverb: 2.3% (20/870)
INFO:gensim.models.keyedvectors:gram2-opposite: 0.5% (1/210)
INFO:gensim.models.keyedvectors:gram3-comparative: 16.3% (123/756)
INFO:gensim.models.keyedvectors:gram4-superlative: 8.3% (20/240)
INFO:gensim.models.keyedvectors:gram5-present-participle: 13.7% (111/812)
INFO:gensim.models.keyedvectors:gram6-nationality-adjective: 3.0% (19/633)
INFO:gensim.models.keyedvectors:gram7-past-tense: 12.6% (187/1482)
INFO:gensim.models.keyedvectors:gram8-plural: 11.7% (102/870)
INFO:gensim.models.keyedvectors:gram9-plural-verbs: 15.8% (60/380)
INFO:gensim.models.keyedvectors:total: 10.9% (821/7501)
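To show what these percentages measure, here is a minimal sketch of an analogy evaluation over the same file layout: a `: section` header followed by lines of four words `a b c expected`. The toy vectors, the three sample questions, and the functions `predict` and `evaluate` are assumptions; the sketch is not gensim's implementation, but it shares the key behaviour that questions containing out-of-vocabulary words are skipped, which is why the totals per section can be smaller than the number of questions in the file.

```python
import numpy as np

toy = {  # invented 2-d vectors, for illustration only
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "the":   np.array([1.0, 0.0]),
}
questions = """: family
man woman king queen
king queen man woman
man woman prince princess
"""

def predict(a, b, c, vectors):
    """Return the word closest to b - a + c, ignoring the three question words."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    query = unit[b] - unit[a] + unit[c]
    candidates = [w for w in unit if w not in (a, b, c)]
    return max(candidates, key=lambda w: np.dot(unit[w], query))

def evaluate(text, vectors):
    """Count [correct, attempted] per section, skipping out-of-vocabulary questions."""
    results, section = {}, None
    for line in text.splitlines():
        if line.startswith(":"):
            section = line[1:].strip()
            results[section] = [0, 0]
        elif line.strip():
            a, b, c, expected = line.lower().split()
            if not all(w in vectors for w in (a, b, c, expected)):
                continue  # skipped, so it is never counted as attempted
            results[section][1] += 1
            results[section][0] += int(predict(a, b, c, vectors) == expected)
    return results

scores = evaluate(questions, toy)
print(scores)
```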


Not so good... maybe we can do better with more data

## Expanding Vector Space

In [16]:
# the text8 corpus is 100MB of tokenized text from wikipedia (N tokens)
from gensim.models.word2vec import Text8Corpus

if os.path.isfile('pretrained/w2v_sg_marvel_text8'):
    # load from pretrained
    word_model = Word2Vec.load('pretrained/w2v_sg_marvel_text8')

else:
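    # continue training: add text8's words to the vocabulary first, otherwise new words are ignored
    text8 = Text8Corpus('corpus/text8')
    word_model.build_vocab(text8, update=True)
    word_model.train(text8)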
    word_model.save('pretrained/w2v_sg_marvel_text8')

results = word_model.accuracy('eval/questions-words.txt')

INFO:gensim.utils:loading Word2Vec object from pretrained/w2v_sg_marvel_text8
INFO:gensim.utils:loading wv recursively from pretrained/w2v_sg_marvel_text8.wv.* with mmap=None
INFO:gensim.utils:loading syn0 from pretrained/w2v_sg_marvel_text8.wv.syn0.npy with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:loading syn1neg from pretrained/w2v_sg_marvel_text8.syn1neg.npy with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:loaded pretrained/w2v_sg_marvel_text8
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:gensim.models.keyedvectors:capital-common-countries: 35.9% (56/156)
INFO:gensim.models.keyedvectors:capital-world: 28.1% (39/139)
INFO:gensim.models.keyedvectors:currency: 0.0% (0/18)
INFO:gensim.models.keyedvectors:city-in-state: 18.8% (108/576)
INFO:gensim.models.keyedvectors:family: 32.5% (111/342)
INFO:gensim.models.keyedvectors:gram1-adjective-to-adverb: 3.1% (27/87

# Visualizing the Vector Space

In [17]:
# select random n vectors (normalized)
from random import sample
words = sample(list(word_model.wv.vocab.keys()), 1000)
vecs = [normvec(n) for n in words]

from sklearn.manifold import TSNE
vecs_2d = TSNE(n_components=2).fit_transform(vecs)

from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

source = ColumnDataSource(data=dict(x=vecs_2d[:, 0], y=vecs_2d[:, 1], labels=words))

p = figure()
p.scatter('x', 'y', size=8, source=source)

# include labels
labels = LabelSet(x='x', y='y', text='labels', level='glyph',
                  x_offset=5, y_offset=5, source=source, render_mode='canvas')
p.add_layout(labels)

show(p)

Find any clusters? We may need a larger corpus for this...

# TSNE with TensorFlow (TensorBoard)
<img src="images/tsne.gif">
[Tutorial](https://www.tensorflow.org/get_started/embedding_viz)

This concludes the overview of what's possible with dense representations of words.

Next we'll look at derivative methods for recommendation and search.

# Recommendation - Related Pages

<center><h1>Paragraph Vectors (doc2vec)</h1></center>
<img src="images/doc2vec.png">

In [18]:
from gensim.models.doc2vec import Doc2Vec
from collections import namedtuple

if os.path.isfile('pretrained/pv_dm_concat_marvel'):

    # load pretrained
    doc_model = Doc2Vec.load('pretrained/pv_dm_concat_marvel')
    
else:

    docs = []
    Document = namedtuple('Document', 'words tags id')
    for doc_id in marvel_articles:
        doc_name = marvel_articles[doc_id]['name']
        doc_text = marvel_articles[doc_id]['text']
        doc_words = [word.lower() for sent in doc_text for word in sent.split()]
        docs.append(Document(doc_words, [doc_name], doc_id))

    # PV-DM with concatenation (preserves ordering, recommended by authors)
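    doc_model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores, iter=20)
    doc_model.build_vocab(docs)  # the model was created without a corpus, so build the vocabulary before training

    alpha, min_alpha, passes = (0.025, 0.001, 20)
    alpha_delta = (alpha - min_alpha) / passes

    from random import shuffle  # only needed on this training path
    for epoch in range(passes):
        shuffle(docs)  # shuffling gets best results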

        doc_model.alpha, doc_model.min_alpha = alpha, alpha
        doc_model.train(docs, total_examples=doc_model.corpus_count)

        print('completed pass %i at alpha %f' % (epoch + 1, alpha))
        alpha -= alpha_delta

    doc_model.save('pretrained/pv_dm_concat_marvel')

INFO:gensim.utils:loading Doc2Vec object from pretrained/pv_dm_concat_marvel
INFO:gensim.utils:loading wv recursively from pretrained/pv_dm_concat_marvel.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:loading docvecs recursively from pretrained/pv_dm_concat_marvel.docvecs.* with mmap=None
INFO:gensim.utils:loading syn1neg from pretrained/pv_dm_concat_marvel.syn1neg.npy with mmap=None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:loaded pretrained/pv_dm_concat_marvel


In [19]:
# get semantically related pages with a simple similarity lookup among all pages
# would be perfect for http://marvel.wikia.com/wiki/Peter_Parker_(Earth-616)
doc_model.docvecs.most_similar('Peter Parker')

INFO:gensim.models.doc2vec:precomputing L2-norms of doc weight vectors


[('Mary Jane Watson', 0.6484307050704956),
 ('Otto Octavius', 0.6395471096038818),
 ('Doris Urich', 0.5727440118789673),
 ('May Reilly', 0.5726701021194458),
 ('Sally Green', 0.5616764426231384),
 ('Nancy Stacy', 0.5563001036643982),
 ('Kaine Parker', 0.55009925365448),
 ('Ben Reilly', 0.5442812442779541),
 ('Norman Osborn', 0.544121265411377),
 ('John Jonah Jameson', 0.5329504013061523)]

In [20]:
# another example
doc_model.docvecs.most_similar('Avengers')

[('Illuminati', 0.5736404657363892),
 ('Steven Rogers', 0.5377088785171509),
 ('Avengers vs. X-Men (Event)', 0.5318738222122192),
 ('Anthony Stark', 0.5049740672111511),
 ('Time Runs Out', 0.4869905114173889),
 ('Infinity (Event)', 0.4693588614463806),
 ('Avengers (Heroes Reborn)', 0.4689474403858185),
 ('Civil War (Event)', 0.46763166785240173),
 ('Carol Danvers', 0.4660409092903137),
 ('Canarsie', 0.4616173207759857)]

In [21]:
# TODO: compare results with tf-idf
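One way the tf-idf comparison from the TODO could be set up is sketched below, in plain Python and numpy. Everything here is an assumption made for illustration: the three tiny documents stand in for wiki pages, the weighting is raw term count times `log(N / df)`, and `most_similar` is a local helper, not the gensim method of the same name.

```python
import math
from collections import Counter
import numpy as np

docs = {  # tiny invented documents standing in for wiki pages
    "Peter Parker": "peter is spider-man and works at the daily bugle".split(),
    "Mary Jane Watson": "mary jane is an actress who married peter".split(),
    "Tony Stark": "tony is iron_man who runs stark industries".split(),
}

vocab = sorted({w for words in docs.values() for w in words})
col = {w: i for i, w in enumerate(vocab)}
df = Counter(w for words in docs.values() for w in set(words))  # document frequency

def tfidf(words):
    """Return an L2-normalised tf-idf vector (raw count times log(N / df))."""
    vec = np.zeros(len(vocab))
    for w, count in Counter(words).items():
        vec[col[w]] = count * math.log(len(docs) / df[w])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

matrix = {name: tfidf(words) for name, words in docs.items()}

def most_similar(name, topn=2):
    """Rank the other documents by cosine similarity of their tf-idf vectors."""
    scores = [(other, float(np.dot(matrix[name], vec)))
              for other, vec in matrix.items() if other != name]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("Peter Parker")
print(ranking)
```

Unlike paragraph vectors, this baseline only rewards exact word overlap: two pages that describe the same thing in different words score zero.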

# Search - Similar Sentences

<center><h1>Skip-Thought Vectors (sent2vec)</h1></center>
<img src="images/skipthoughts.png">

In [22]:
from skip_thoughts import configuration
from skip_thoughts import encoder_manager

# based on https://github.com/tensorflow/models/tree/master/skip_thoughts#encoding-sentences
PRETRAINED_UNI_DIR = 'pretrained/skip_thoughts_uni_2017_02_02/'
VOCAB_FILE = PRETRAINED_UNI_DIR+"vocab.txt"
EMBEDDING_MATRIX_FILE = PRETRAINED_UNI_DIR+"embeddings.npy"
CHECKPOINT_PATH = PRETRAINED_UNI_DIR+"model.ckpt-501424"

encoder = encoder_manager.EncoderManager()
encoder.load_model(configuration.model_config(),
                   vocabulary_file=VOCAB_FILE,
                   embedding_matrix_file=EMBEDDING_MATRIX_FILE,
                   checkpoint_path=CHECKPOINT_PATH)

# TODO: use vocabulary expansion with previously defined word_model

INFO:tensorflow:Reading vocabulary from pretrained/skip_thoughts_uni_2017_02_02/vocab.txt


INFO:tensorflow:Loaded vocabulary with 930914 words.


INFO:tensorflow:Loading embedding matrix from pretrained/skip_thoughts_uni_2017_02_02/embeddings.npy


INFO:tensorflow:Loaded embedding matrix with shape (930914, 620)


INFO:tensorflow:Building model.


INFO:tensorflow:Loading model from checkpoint: pretrained/skip_thoughts_uni_2017_02_02/model.ckpt-501424


INFO:tensorflow:Successfully loaded checkpoint: model.ckpt-501424


In [23]:
# this is a workaround for a bug in python for Mac OS X for files > 2GB
# issue: http://bugs.python.org/issue24658
# workaround: http://stackoverflow.com/a/41613221/5641547
class MacOSFile(object):
    def __init__(self, f):
        self.f = f
    def __getattr__(self, item):
        return getattr(self.f, item)
    def read(self, n):
        if n >= (1 << 31):
            buffer = bytearray(n)
            pos = 0
            while pos < n:
                size = min(n - pos, 1 << 31 - 1)
                chunk = self.f.read(size)
                buffer[pos:pos + size] = chunk
                pos += size
            return buffer
        return self.f.read(n)

import pickle
import platform
    
if os.path.isfile('pretrained/marvel_sent_encodings.p'):
    # this isn't as much pretrained as it's precomputed
    
    if platform.system() == 'Darwin':
        encodings = pickle.load(MacOSFile(open('pretrained/marvel_sent_encodings.p', 'rb')))
    else:
        encodings = pickle.load(open('pretrained/marvel_sent_encodings.p', 'rb'))
    
else:
    data = [' '.join(sent) for sent in marvel_sents]
    # generate skip-thought vectors for each sentence in the dataset
    encodings = encoder.encode(data)
    
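    # pickle.dump takes the object to store as its first argument, then the file
    if platform.system() == 'Darwin':
        with open('pretrained/marvel_sent_encodings.p', 'wb') as f:
            pickle.dump(encodings, MacOSFile(f))  # untested
    else:
        with open('pretrained/marvel_sent_encodings.p', 'wb') as f:
            pickle.dump(encodings, f)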

In [24]:
import scipy.spatial.distance as sd

# helper function to generate nearest neighbors / similar sentences.
def get_sim_sents(sent, num=10):
    encoding = encoder.encode([sent])[0]
    scores = sd.cdist([encoding], encodings, "cosine")[0]
    sorted_ids = np.argsort(scores)
    print("Sentence:")
    print("", sent)
    print("\nNearest neighbors:")
    for i in range(1, num + 1):
        print(" %d. %s (%.3f)" %
              (i, ' '.join(marvel_sents[sorted_ids[i]]), scores[sorted_ids[i]]))

# some example sentences from the corpus
get_sim_sents('professor gilbert is an accomplished scientist with extensive knowledge of robotics .')

Sentence:
 professor gilbert is an accomplished scientist with extensive knowledge of robotics .

Nearest neighbors:
 1. chandra is a gifted geneticist and an expert in nanotechnology . (0.240)
 2. dr. suki is a skilled scientist in the area of chemistry . (0.255)
 3. dr._tempest_bell is an intelligent scientist specializing in astrobiology . (0.262)
 4. the master is a highly proficient engineer and scientist , specializing in the use of an_alien technology of undetermined origin . (0.272)
 5. walker is an accomplished inventor with skill in electronics and engineering . (0.277)
 6. meranno is a highly intelligent and gifted research scientist . (0.288)
 7. lord kofi_whitemane is a member of the kymellian race , who have built a peaceful , benevolent civilization with highly advanced technology . (0.289)
 8. dr. howard is a licensed to practice medicine and is a skilled surgeon . (0.289)
 9. kaga is an extremely intelligent and wealthy individual with access to vast resources . (0.297

In [25]:
# more subtle example
get_sim_sents('hector told logan where to find rojas , and was killed by felix .')

Sentence:
 hector told logan where to find rojas , and was killed by felix .

Nearest neighbors:
 1. chapman encountered mister_fantastic , invisible_woman and black_panther and traveled to hong_kong where he was killed by dolph . (0.284)
 2. by order of bastion , pierce blew up the black birds , and was killed by cyclops . (0.285)
 3. eventually , wild_child investigated nemesis 's disappearance and met up with the children_of_the_night , and was captured by rok . (0.294)
 4. masters managed to convince bobby that the black_rider he encountered was an_impostor and left to rescue marie . (0.314)
 5. during the attack his_sister was injured and essex logan stepped into believing he was dead and prompting him to kill essex . (0.317)
 6. the_vision broke bauer out of the camp and assisted him in fleeing to portugal . (0.318)
 7. clea and doctor_strange escaped dormammu , but met with umar , who wanted to kill doctor_strange . (0.319)
 8. micro tried to recruit punisher but was killed by f

In [26]:
# an unseen example
get_sim_sents('they were surrounded , with no escape in sight')

Sentence:
 they were surrounded , with no escape in sight

Nearest neighbors:
 1. they were pinned down by enemy fire . (0.739)
 2. they all seemed to have suffered in a battle they had no memory of . (0.751)
 3. hovering unnamed in their only appearance (0.751)
 4. each was armed with a drill on the front , allowing them to bore into anything in their path . (0.754)
 5. the vdbn were notorious for attacking anything in sight and then fighting among themselves when there were no enemies in sight . (0.757)
 6. a mob formed , attacking and destroying everything in sight , with only sheldon helping the injured . (0.757)
 7. there were numerous tunnels stretching out of sight , many unexplored . (0.757)
 8. there , they would save the inhabitants from the savage hairy_ones and continue on through the valley_of_the_mists . (0.762)
 9. the x-men themselves had set a trap around to them , bringing the phoenix-powered namor and a squad of x-men . (0.762)
 10. together with the leatherneck_raid
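A note on the helper above: `get_sim_sents` starts listing at rank 1. That drops the trivial match when the query sentence is itself in the corpus, but for an unseen sentence the same offset discards the closest neighbour. The sketch below makes that choice explicit. It uses random vectors as a stand-in for the sentence encodings, and the function `nearest` with its `skip_self` flag is an assumption, not part of the skip-thoughts package.

```python
import numpy as np

def nearest(query_vec, matrix, num=3, skip_self=False):
    """Return (row index, cosine distance) for the rows closest to query_vec."""
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    dists = 1.0 - rows.dot(q)        # cosine distance, 0 means same direction
    order = np.argsort(dists)
    if skip_self:                    # query is one of the rows: drop the trivial match
        order = order[1:]
    return [(int(i), float(dists[i])) for i in order[:num]]

rng = np.random.RandomState(0)
fake_encodings = rng.normal(size=(100, 16))   # stand-in for the sentence encodings

seen = nearest(fake_encodings[7], fake_encodings, skip_self=True)   # sentence from the corpus
unseen = nearest(rng.normal(size=16), fake_encodings)               # sentence not in the corpus
print(seen)
print(unseen)
```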

# Other Relevant Embeddings

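* GloVe
    * trained explicitly on global word co-occurrence counts rather than on sliding context windows
* fastText
    * from word2vec's lead author (Tomas Mikolov) and colleagues; adds subword (character n-gram) vectors and a text classifier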
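The subword idea behind fastText can be sketched in a few lines: a word is wrapped in boundary markers and broken into character n-grams, and its vector is built from the vectors of those n-grams, so rare or unseen words still get a representation. The function `char_ngrams` is an assumption written for this example; the 3 to 6 character range mirrors fastText's defaults.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)
```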

# Recommended Reading

* [Deep Learning, NLP, and Representations](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)
* [Meanings are Vectors](http://sanjaymeena.io/tech/word-embeddings/)
* [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)
* [A Word is Worth a Thousand Vectors](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/#footnote1)
* [On word embeddings](http://sebastianruder.com/word-embeddings-1/)