# Word vectors - word2vec

Word vectors let you represent words as dense vectors when you feed them to your model. This is an alternative to the sparse vectors we have been creating so far with `CountVectorizer` or `TfidfVectorizer`.


## Why word vectors?

One hot encoded words don't let us learn anything about how similar two words are. Say we have a vocabulary of three: green, red, dog. Assume red is encoded as `[1, 0, 0]`, green as `[0, 1, 0]` and dog as `[0, 0, 1]`. Any distance metric you choose will not be able to show you that red and green are more similar to each other than to dog.

With word vectors we could represent red as `[0.1, 0.2]`, green as `[0.15, 0.21]` and dog as `[0.8, 0.87]`. Now the distance (cosine or euclidean) between red and green is small, and the distance to dog is large.
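To make this concrete, here is a small sketch that measures the Euclidean distances for the toy vectors from the two paragraphs above. The helper name `euclidean` and the two dictionaries are ours, introduced only for this illustration.

```python
import numpy as np

# One-hot encodings from the example above.
one_hot = {
    "red": np.array([1, 0, 0]),
    "green": np.array([0, 1, 0]),
    "dog": np.array([0, 0, 1]),
}

# Dense two-dimensional vectors from the example above.
dense = {
    "red": np.array([0.1, 0.2]),
    "green": np.array([0.15, 0.21]),
    "dog": np.array([0.8, 0.87]),
}


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(a - b)


# With one-hot vectors every pair of distinct words is equally far apart.
print(euclidean(one_hot["red"], one_hot["green"]))
print(euclidean(one_hot["red"], one_hot["dog"]))

# With dense vectors red is much closer to green than to dog.
print(euclidean(dense["red"], dense["green"]))
print(euclidean(dense["red"], dense["dog"]))
```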

This notebook is about building some intuition for vector spaces and answering the crucial question: how do we come up with a new vector space?

## A first example: colours

Let's start with a vector space that we are all familiar with: the vector space of RGB colours.

Colours can be represented as vectors with three dimensions: red, green, and blue. We can use these vectors to answer questions like: which colours are similar? What's the most likely colour name for an arbitrarily chosen set of values for red, green and blue? Given the names of two colours, what's the name of those colours' "average"?

We will be using this [colour data](https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json) from the [xkcd colour survey](https://blog.xkcd.com/2010/05/03/color-survey-results/). The data relates a colour name to the RGB value associated with that colour. [Here's a page that shows what the colors look like](https://xkcd.com/color/rgb/). Download the colour data and put it in a directory near to this notebook (I used `../data/`).

---

Aside on colour spaces:
* If you're interested in perceptually accurate colour math in Python, consider using the [colormath library](http://python-colormath.readthedocs.io/en/latest/).
* A fantastic video about how to design colourmaps for plots: https://www.youtube.com/watch?v=xAoljeRJ3lU - it shows how much time people spent designing matplotlib's new default colourmap.
* ["Make it pop"](https://predictablynoisy.com/makeitpop-intro), if you can't use Jet as a colourmap anymore, just distort your data instead! This piece is not seriously suggesting you distort the data, but it explains how you could.

---

Let's go!

In [2]:
import json

import numpy as np


# use a context manager so the file is closed after reading
with open("../data/xkcd.json") as f:
    colour_data = json.load(f)

The following function converts colours from hex format (`#1a2b3c`) to a NumPy array of three integers:

In [3]:
def hex_to_int(s):
    s = s.lstrip("#")
    return np.array([int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)])

In [4]:
colours = dict()
for item in colour_data['colors']:
    colours[item["color"]] = hex_to_int(item["hex"])

Testing it out:

In [5]:
colours['olive']

array([110, 117,  14])

In [6]:
colours['red']

array([229,   0,   0])

In [7]:
colours['black']

array([0, 0, 0])

In [8]:
colours['cyan']

array([  0, 255, 255])

### Vector math

A useful helper for computing the distance between two vectors:

In [9]:
def distance(coord1, coord2):
    # do you know what these two lines do?
    coord1 = np.asarray(coord1)
    coord2 = np.asarray(coord2)
    return np.linalg.norm(coord1 - coord2)

assert np.allclose(distance([10, 1], [5, 2]), 5.0990195135927845)
distance([10, 1], [5, 2])

5.0990195135927845

And the `meanv` function takes a list of vectors and finds their mean or average:

In [10]:
def meanv(coords):
    coords = np.atleast_2d(coords)
    return np.sum(coords, axis=0) / float(coords.shape[0])
    

m = meanv([[0, 1], [2, 2], [4, 3]])
assert np.allclose(m, [2.0, 2.0]), m
meanv([[0, 1], [2, 2], [4, 3]])

array([2., 2.])

Just as a test, the following cell shows that the distance from "red" to "green" is greater than the distance from "red" to "pink":

In [11]:
distance(colours['red'], colours['green']) > distance(colours['red'], colours['pink'])

True

### Finding the closest item

The `closest()` function below is a helper to find the `n` nearest neighbours for a query colour.

> Note: Calculating "nearest neighbours" like this is fine for the examples in this notebook, but please don't use it for anything serious. There are grown-up versions of this:
* SciPy's [kdtree](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html)
* [KDtree in scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html)
* [Annoy](https://pypi.python.org/pypi/annoy) by Spotify for approximate nearest neighbours

In [12]:
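# Create your own `closest` to make sure you understand
# how this works. The version below makes use of several
# advanced ideas from Python.
# Bonus:
# Can you write a version that uses numpy constructs?
# This would likely be much faster (benchmark it to be sure).
def closest(space, coord, n=10):
    # sort the names by the distance of their vector to `coord`
    # and keep the first `n`
    return sorted(space.keys(),
                  key=lambda name: distance(coord, space[name]))[:n]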
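One possible answer to the bonus question is sketched below. It assumes `space` maps names to equal-length numeric vectors, as `colours` does; the name `closest_np` and the toy data are ours. Whether it is actually faster is something to benchmark rather than take on trust.

```python
import numpy as np


def closest_np(space, coord, n=10):
    """Return the names of the `n` entries of `space` nearest to `coord`.

    `space` is assumed to map names to equal-length numeric vectors.
    """
    names = list(space.keys())
    # Stack all vectors into one (num_items, num_dims) matrix.
    matrix = np.array([space[name] for name in names], dtype=float)
    # Broadcasting subtracts `coord` from every row at once.
    dists = np.linalg.norm(matrix - np.asarray(coord, dtype=float), axis=1)
    # A stable sort keeps insertion order among ties.
    order = np.argsort(dists, kind="stable")[:n]
    return [names[i] for i in order]


# Made-up data, purely for illustration.
toy_space = {"a": [0, 0], "b": [1, 1], "c": [5, 5]}
print(closest_np(toy_space, [0.9, 0.9], n=2))  # ['b', 'a']
```

This is still a brute-force search over every entry; it only moves the loop from Python into NumPy.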

Testing it out, we can find the ten colours closest to "green":

In [13]:
closest(colours, colours['green'])

['green',
 'kelly green',
 'irish green',
 'true green',
 'emerald green',
 'kelley green',
 'grass green',
 'vibrant green',
 'grassy green',
 'emerald']

... or the ten colours closest to (150, 60, 20):

In [14]:
closest(colours, [150, 60, 20])

['burnt umber',
 'red brown',
 'brownish red',
 'russet',
 'brick',
 'rust',
 'auburn',
 'brown red',
 'rust brown',
 'warm brown']

In [15]:
from IPython.display import HTML

def int_to_hex(colour):
    r, g, b = colour
    return "#{:02x}{:02x}{:02x}".format(r,g,b)

HTML("""<div style='background: {};
                    height: 100px;
                    width: 100px' />""".format(int_to_hex([150, 60, 20])))

What would you call this colour?

### Colour magic

The interesting thing about representing words as vectors is that the vector operations we defined earlier appear to operate on language the same way they operate on numbers. For example, if we find the word closest to the vector resulting from subtracting "red" from "purple," we get a series of "blue" colors:

In [16]:
closest(colours, colours['purple'] - colours['red'])

['cobalt blue',
 'royal blue',
 'darkish blue',
 'true blue',
 'royal',
 'prussian blue',
 'dark royal blue',
 'deep blue',
 'marine blue',
 'deep sea blue']

This matches our intuition about RGB colors, which is that purple is a combination of red and blue. Take away the red, and blue is all you have left.

You can do something similar with addition. What's blue plus green?

In [17]:
closest(colours, colours['blue'] + colours['green'])

['bright turquoise',
 'bright light blue',
 'bright aqua',
 'cyan',
 'neon blue',
 'aqua blue',
 'bright cyan',
 'bright sky blue',
 'aqua',
 'bright teal']

That's right, something like turquoise or cyan! What if we find the average of black and white? Predictably, we get grey:

In [18]:
# the average of black and white: medium grey
closest(colours, meanv([colours['black'], colours['white']]))

['medium grey',
 'purple grey',
 'steel grey',
 'battleship grey',
 'grey purple',
 'purplish grey',
 'greyish purple',
 'steel',
 'warm grey',
 'green grey']

We can use colour vectors to reason about relationships between colours. In the cell below, finding the difference between "pink" and "red" then adding it to "blue" seems to give us a list of colours that are to blue what pink is to red (a slightly lighter, less saturated shade):

In [19]:
# an analogy: pink is to red as X is to blue
pink_to_red = colours['pink'] - colours['red']
closest(colours, pink_to_red + colours['blue'])

['neon blue',
 'bright sky blue',
 'bright light blue',
 'cyan',
 'bright cyan',
 'bright turquoise',
 'clear blue',
 'azure',
 'dodger blue',
 'lightish blue']

The examples above are fairly simple from a mathematical perspective, but they demonstrate that it is possible to use math to reason about how people use language.

## Distributional semantics

In the previous section, the examples are interesting because of a simple fact: colours that we think of as similar are "closer" to each other in RGB vector space. In our colour vector space, you can think of the words identified by vectors close to each other as being *synonyms*, in a sense: they sort of "mean" the same thing. They are also, for many purposes, *functionally identical*. Think of this in terms of writing, say, a search engine. If someone searches for "mauve trousers," then it is probably also okay to show them results for, say,

In [20]:
for cname in closest(colours, colours['mauve']):
    print(cname + " trousers")

mauve trousers
dusty rose trousers
dusky rose trousers
brownish pink trousers
old pink trousers
reddish grey trousers
dirty pink trousers
old rose trousers
light plum trousers
ugly pink trousers


Is it possible to create a vector space for all English words that has this same "closer in space is closer in meaning" property? How do we build such a vector space for all words in a language? We learn it!

To understand how that works, we have to back up a bit and ask the question: what does *meaning* mean? No one really knows, but one theory popular among computational linguists, computer scientists and other people who make search engines is the [Distributional Hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics), which states that:

    Linguistic items with similar distributions have similar meanings.
    
What's meant by "similar distributions" is *similar contexts*. Take for example the following sentences:

    It was really cold yesterday.
    It will be really warm today, though.
    It'll be really hot tomorrow!
    Will it be really cool Tuesday?
    
According to the Distributional Hypothesis, the words `cold`, `warm`, `hot` and `cool` must be related in some way (i.e., be close in meaning) because they occur in a similar context, i.e., between the word "really" and a word indicating a particular day. (Likewise, the words `yesterday`, `today`, `tomorrow` and `Tuesday` must be related, since they occur in the context of a word indicating a temperature.)

In other words, according to the Distributional Hypothesis, a word's meaning is just a big list of all the contexts it occurs in. Two words are closer in meaning if they share contexts.
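As a rough illustration, the sketch below collects the immediate neighbours of every word in the four example sentences and checks which words share a left-hand neighbour. Treating "context" as just the previous and next word is a simplifying assumption of ours, and the names `example_sentences`, `contexts` and `shared_left_neighbours` are made up for this example.

```python
from collections import defaultdict

example_sentences = [
    "It was really cold yesterday",
    "It will be really warm today though",
    "It'll be really hot tomorrow",
    "Will it be really cool Tuesday",
]

# Map each word to the set of (previous word, next word) pairs it appears between.
contexts = defaultdict(set)
for sentence in example_sentences:
    words = sentence.lower().split()
    for i in range(1, len(words) - 1):
        contexts[words[i]].add((words[i - 1], words[i + 1]))


def shared_left_neighbours(word_a, word_b):
    """Words that appear immediately before both `word_a` and `word_b`."""
    left_a = {left for left, _ in contexts[word_a]}
    left_b = {left for left, _ in contexts[word_b]}
    return left_a & left_b


print(shared_left_neighbours("cold", "hot"))     # {'really'}
print(shared_left_neighbours("cold", "really"))  # set()
```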

## Word vectors from predicting contexts

For us to be able to learn such "contexts" from data, we need to frame this as an unsupervised task. We will need a lot of text to learn about rare words and their contexts. There is a nearly infinite amount of text data on the internet (Wikipedia, news, books, ...), but it is not labelled. Or is it?

Actually every sentence provides many examples of the context in which a word is being used.
We can take each word, plus some context words around it and construct a task that tries to predict the central word from the context. Or tries to predict the context from the central word. This is an auxiliary task that we are using to learn our continuous representation (or word vectors)!


### Skip-Gram models

Task: given a word, predict surrounding words.
This is a supervised task, but we do not need a "labelled" dataset. Each document yields many examples for free! We are not interested in performance on this task; we just want to learn representations.
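To see how a sentence yields training examples "for free", here is a minimal sketch that turns a list of words into (centre word, context word) pairs. The function name `skipgram_pairs` and the `window` parameter are our own choices for illustration, not part of any library.

```python
def skipgram_pairs(words, window=2):
    """Yield (centre, context) pairs for every word in `words`.

    `window` is the number of words considered on each side of the centre word.
    """
    for i, centre in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield centre, words[j]


words = "it was really cold yesterday".split()
for pair in skipgram_pairs(words, window=1):
    print(pair)
```

Each pair is one training example: the first element is the input and the second is the "label" to predict.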

(This idea of constructing a "self-supervised" task to learn embeddings will come back again later. For example if you want to learn an embedding for documents.)

This is based on the idea that neural networks are first and foremost feature transformers. All the intermediate layers of a neural network transform features from one representation to a new one that is (slightly) more useful. We will input our words as one-hot encoded vectors that form a bag of words, and as part of solving the task of predicting the surrounding words, the neural network will learn a better representation!


<img src="cbow_skipgram_1.png" style="height: 300px;" />

In fact this will be a very simple neural network with just one hidden layer. In this figure `V` is the size of the vocabulary, `N` the dimensionality of the embedding space, and `C` the number of context words. The input and output are represented as `V` dimensional one-hot encoded vectors. It is the intermediate layer of size `N` that will hold our continuous, dense vector embedding.
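The forward pass of such a network can be sketched in a few lines of NumPy. This is a toy with random, untrained weights and made-up sizes; the names `W_in`, `W_out`, `one_hot` and `forward` are ours, and training (adjusting the weights so that the predicted probabilities match the observed context words) is left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 5  # vocabulary size (toy value)
N = 3  # embedding dimensionality (toy value)

# Input-to-hidden weights: row i is the embedding of word i.
W_in = rng.normal(size=(V, N))
# Hidden-to-output weights, used to score every vocabulary word.
W_out = rng.normal(size=(N, V))


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    x = np.zeros(size)
    x[index] = 1.0
    return x


def forward(word_index):
    """Return the hidden vector and predicted context-word probabilities."""
    x = one_hot(word_index, V)
    hidden = x @ W_in        # selects row `word_index` of W_in
    scores = hidden @ W_out  # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
    return hidden, exp_scores / exp_scores.sum()


hidden, probs = forward(2)
print(hidden)       # identical to W_in[2]
print(probs.sum())
```

Because the input is one-hot, multiplying by `W_in` simply picks out one row, which is why the rows of that matrix can be read off as the word vectors.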

### Implementations

* [Gensim](https://radimrehurek.com/gensim/)
* spaCy
* Word2vec
* TensorFlow
* Mostly not something you train yourself, just use pre-made word vectors


### GloVe vectors

In practice you do not have to create your own word vectors from scratch! Many researchers have made downloadable databases of pre-trained vectors. One such project is Stanford's [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). These 300-dimensional vectors are included with spaCy.

## Word vectors in spaCy

Okay, let's have some fun with real word vectors. We're going to use the GloVe vectors that come with spaCy to creatively analyze and manipulate the text of *Frankenstein*.

> The default model in spaCy does not have word vectors. You have to install
> at least the "medium" sized model with `python -m spacy download en_core_web_md`

First, make sure you've got `spacy` imported:

In [21]:
import spacy

> Download the text of Frankenstein from http://www.gutenberg.org/ebooks/84
> I placed my copy in `../data/`.

The following cell loads the language model and parses the input text:

In [22]:
# load spacy's `en_core_web_md` model into a variable called `nlp`
# process all of Frankenstein through spacy and place the result
# in a variable called `doc`
# we want to use `en_core_web_md` because it contains good word
# vectors for english

The cell below creates a list of the unique words (or tokens) in the text, stored as strings.

In [24]:
# get all of the words in the text file that aren't punctuation
# and store them in `tokens`
tokens = ...

In [25]:
tokens[:10]

['file',
 'farewell',
 'image',
 'equal',
 'startled',
 'clear',
 'task',
 'files',
 'voyages',
 'besieged']

You can see the vector of any word in spaCy's vocabulary using the `vocab` attribute, like so:

In [26]:
nlp.vocab['football'].vector

array([ 4.3758e-01,  4.8054e-01, -6.6233e-03, -1.7448e-01,  4.4043e-01,
       -3.1442e-01,  5.3397e-01,  4.6729e-01,  1.0159e-01,  2.7244e+00,
       -7.6771e-01,  1.7797e-01,  7.5324e-01,  2.2318e-01,  2.2972e-01,
        1.3957e-02,  1.1651e-01,  1.5249e-01,  5.4813e-01,  1.5076e-02,
       -2.6553e-01, -6.2747e-01,  2.1325e-01, -2.2996e-01, -1.8529e-01,
        5.5864e-01, -8.4094e-01,  7.0990e-01, -2.9615e-01,  2.5878e-01,
       -8.8265e-02, -3.4606e-01, -2.5755e-01, -8.2202e-02,  6.2264e-02,
        1.8445e-01, -1.5417e-04, -2.3137e-01, -5.9262e-03,  3.9931e-01,
        1.4636e-01,  2.2448e-01,  6.0828e-01,  6.3412e-01, -3.2131e-01,
       -6.8973e-01, -1.6703e-01,  5.3806e-01, -4.4930e-01,  6.4328e-02,
       -2.3113e-01,  2.0405e-01,  1.1714e-01, -7.9628e-01, -9.4335e-02,
       -1.1125e-01,  2.9469e-01,  3.9459e-01,  3.8567e-01,  2.5055e-01,
       -5.9500e-03,  8.5024e-01,  2.1191e-02,  1.9802e-01, -2.0702e-01,
        2.5742e-01,  2.9258e-01,  4.8071e-01, -2.7758e-01,  3.70

For the sake of convenience, the following function gets the vector of a given string from spaCy's vocabulary:

In [27]:
def vec(s):
    return nlp.vocab[s].vector

### Cosine similarity and finding closest neighbors

The cell below defines a function `cosine()`, which returns the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of two vectors. Cosine similarity is another way of determining how similar two vectors are, which is more suited to high-dimensional spaces. [See the Encyclopedia of Distances for more information and even more ways of determining vector similarity.](http://www.uco.es/users/ma1fegan/Comunes/asignaturas/vision/Encyclopedia-of-distances-2009.pdf)

In [28]:
# Define a function `cosine(vec1, vec2)` that computes the
# cosine similarity between two vectors. Use numpy constructs
# from `numpy.linalg` where needed.

from numpy.linalg import norm

def cosine(v1, v2):
    pass
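One way to fill in this exercise is sketched below. Returning `0.0` when either vector has zero length is a convention we chose so that words without a vector (which typically come back as all zeros) do not cause a division by zero; the name `cosine_similarity` is ours, to keep it distinct from the exercise stub.

```python
import numpy as np
from numpy.linalg import norm


def cosine_similarity(v1, v2):
    """Cosine of the angle between `v1` and `v2`; 0.0 if either has zero length."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denominator = norm(v1) * norm(v2)
    if denominator == 0:
        return 0.0
    return float(np.dot(v1, v2) / denominator)


print(cosine_similarity([1, 2], [2, 4]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))  # orthogonal
```

Note that this is a similarity, not a distance: larger values mean the vectors point in more similar directions.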

The following cell shows that the cosine similarity between `football` and `barbeque` is larger than the similarity between `trousers` and `octopus`.

The vectors are working how we expect them to:

In [29]:
cosine(vec('football'), vec('barbeque')) > cosine(vec('trousers'), vec('octopus'))

True

Define a function that iterates through a list of tokens and returns the `n` tokens whose vectors are most similar to a given vector.

In [30]:
# Define a function `spacy_closest(tokens, vector, n=10)` that finds
# the `n` tokens closest to `vector` by checking against the vectors
# of the tokens in the list `tokens`.

def spacy_closest(list_of_all_tokens, vector, n=10):
    pass
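The shape of a possible solution is sketched below on made-up vectors, so that it runs without a spaCy model. The `lookup` argument is a hypothetical stand-in for the `vec` function defined earlier, and `closest_by_cosine` and `toy_vectors` are names invented for this example.

```python
import numpy as np


def closest_by_cosine(tokens, vector, lookup, n=10):
    """Return the `n` tokens whose vectors are most similar to `vector`.

    `lookup` is a hypothetical stand-in for `vec`: a function mapping a
    token string to its vector.
    """
    vector = np.asarray(vector, dtype=float)

    def similarity(token):
        other = np.asarray(lookup(token), dtype=float)
        denominator = np.linalg.norm(other) * np.linalg.norm(vector)
        # Treat zero-length vectors as having no similarity to anything.
        return 0.0 if denominator == 0 else float(np.dot(other, vector) / denominator)

    # Highest similarity first.
    return sorted(tokens, key=similarity, reverse=True)[:n]


# Made-up two-dimensional vectors, purely for illustration.
toy_vectors = {
    "cat": [1.0, 0.1],
    "kitten": [0.9, 0.2],
    "car": [0.1, 1.0],
}
print(closest_by_cosine(list(toy_vectors), [1.0, 0.0], toy_vectors.get, n=2))  # ['cat', 'kitten']
```

Unlike the colour example, the sort is in descending order, because a higher cosine similarity means a closer match.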

Using this function, get a list of synonyms, or words closest in meaning (or distribution, depending on how you look at it), to any arbitrary word in spaCy's vocabulary.

In [31]:
# what's the closest equivalent of basketball?
# remember these tokens came from Frankenstein!
spacy_closest(tokens, vec("basketball"))

['coach',
 'league',
 'college',
 'game',
 'junior',
 'sport',
 'school',
 'leagues',
 'play',
 'ball']

In [None]:
# Does basketball even come up in Frankenstein?
"basketball" in tokens

### Fun with spaCy, Frankenstein, and vector arithmetic

Now we can start doing vector arithmetic and finding the closest words to the resulting vectors. For example, what word is closest to the halfway point between day and night?

Remember that `tokens` only contains words that spaCy recognises and that are present in the text of _Frankenstein_.

In [32]:
# halfway between day and night
spacy_closest(tokens, meanv([vec("day"), vec("night")]))

['night',
 'Night',
 'Day',
 'day',
 'evening',
 'Morning',
 'morning',
 'afternoon',
 'nights',
 'week']

Variations of `night` and `day` are still closest, but after that we get words like `evening` and `morning`, which are indeed halfway between day and night!

What words are close to "wine" and "cheese" in *Frankenstein*?

In [33]:
spacy_closest(tokens, vec("wine"))

['wine',
 'vineyards',
 'drink',
 'moonshine',
 'taste',
 'draught',
 'tasted',
 'cheese',
 'dinner',
 'delicious']

In [34]:
spacy_closest(tokens, vec("cheese"))

['cheese',
 'bread',
 'soup',
 'milk',
 'roasted',
 'delicious',
 'Vegetables',
 'vegetables',
 'loaf',
 'softened']

If you subtract "alcohol" from "wine" and find the closest words to the resulting vector, you're left with simply a lovely dinner:

In [35]:
spacy_closest(tokens, vec("wine") - vec("alcohol"))

['wine',
 'vineyards',
 'graceful',
 'exquisite',
 'marvellous',
 'magnificent',
 'delightful',
 'dinner',
 'lovely',
 'leaved']

What are the closest words to "water"? What about adding "frozen" to "water"?

In [36]:
spacy_closest(tokens, vec("water"))

['water',
 'waters',
 'Salt',
 'dry',
 'Ocean',
 'ocean',
 'heat',
 'pebble',
 'sands',
 'Sea']

In [37]:
spacy_closest(tokens, vec("water") + vec("frozen"))

['water',
 'Frozen',
 'frozen',
 'cold',
 'Cold',
 'chilly',
 'ices',
 'ice',
 'freezing',
 'Salt']

You can even do analogies! For example, the words most similar to "grass":

In [38]:
spacy_closest(tokens, vec("grass"))

['grass',
 'sod',
 'herbage',
 'foliage',
 'trees',
 'gardener',
 'garden',
 'brambles',
 'bushes',
 'pebble']

If you take the difference of "blue" and "sky" and add it to grass, you get the analogous word ("green"):

In [39]:
# analogy: blue is to sky as X is to grass
blue_to_sky = vec("blue") - vec("sky")
spacy_closest(tokens, blue_to_sky + vec("grass"))

['grass',
 'sod',
 'green',
 'yellow',
 'red',
 'pink',
 'blue',
 'leaf',
 'herbage',
 'white']

Hopefully you are now convinced that the vector space in which these words are embedded is useful for finding similar words. The definition of "similar" roughly corresponds to what humans think of as similar words as well, but it is not perfect.

## Sentence similarity

How about computing the similarity of sentences? The simplest thing to do to get the vector for a sentence is to average its component vectors, like so:

In [40]:
def sent2vec(s):
    sent = nlp(s)
    return meanv([w.vector for w in sent])

Let's find the sentence in our text file that is closest in "meaning" to an arbitrary input sentence. First, we'll get the list of sentences:

In [41]:
sentences = list(doc.sents)

The following function, `spacy_closest_sent(space, input_str, n=10)`, takes a list of sentences from a spaCy parse in `space` and compares them to the input sentence `input_str`. It sorts them by cosine similarity and returns the `n` closest sentences.

In [42]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sent2vec(input_str)
    return sorted(space,
                  key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vec),
                  reverse=True)[:n]

Here are the sentences in *Frankenstein* closest in meaning to "My favourite food is strawberry ice cream." (Extra linebreaks are present because we didn't strip them out when we originally read in the source text.)

In [43]:
for sent in spacy_closest_sent(sentences, "My favourite food is strawberry ice cream."):
    print(sent.text)
    print("---")

The vegetables in the gardens, the milk and cheese that I saw
placed at the windows of some of the cottages, allured my appetite.
---
I greedily devoured the
remnants of the shepherd’s breakfast, which consisted of bread, cheese,
milk, and wine; the latter, however, I did not like.  
---
I
had first, however, provided for my sustenance for that day by a loaf
of coarse bread, which I purloined, and a cup with which I could drink
more conveniently than from my hand of the pure water which flowed by
my retreat.  
---
My food is not
that of man; I do not destroy the lamb and the kid to glut my appetite;
acorns and berries afford me sufficient nourishment.  
---
For some
time I sat upon the rock that overlooks the sea of ice.  
---
Their nourishment
consisted entirely of the vegetables of their garden and the milk of
one cow, which gave very little during the winter, when its masters
could scarcely procure food to support it.  
---
we
saw some dogs drawing a sledge, with a man in it, across

## In production

spaCy (unsurprisingly) has all this built in. For a production system you would not code your own cosine similarity function. And please do not use our toy "nearest neighbour" functions; use a kdtree or another efficient implementation instead.

In [44]:
strawberry = nlp("My favourite food is strawberry ice cream.")
chocolate = nlp("My favourite dessert is chocolate cake.")
running = nlp("My favourite hobby is running marathons.")

for base in (strawberry, chocolate, running):
    print("Base:", base)
    for other in (strawberry, chocolate, running):
        print("Other:", other)
        print("Similarity:", base.similarity(other))
        
    print()

Base: My favourite food is strawberry ice cream.
Other: My favourite food is strawberry ice cream.
Similarity: 1.0
Other: My favourite dessert is chocolate cake.
Similarity: 0.9139987084798473
Other: My favourite hobby is running marathons.
Similarity: 0.7208354513782261
Base: My favourite dessert is chocolate cake.
Other: My favourite food is strawberry ice cream.
Similarity: 0.9139987084798473
Other: My favourite dessert is chocolate cake.
Similarity: 1.0
Other: My favourite hobby is running marathons.
Similarity: 0.7291110517011878
Base: My favourite hobby is running marathons.
Other: My favourite food is strawberry ice cream.
Similarity: 0.7208354513782261
Other: My favourite dessert is chocolate cake.
Similarity: 0.7291110517011878
Other: My favourite hobby is running marathons.
Similarity: 1.0


To compute the vector of a sentence spaCy averages the vectors of the words in a sentence. This works surprisingly well for short sentences. After about 10-15 words the vectors become too noisy to be useful, so this is not a good option for (long) documents.
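One consequence of averaging is easy to check on made-up vectors: the average does not depend on word order, so two sentences built from the same words get exactly the same vector. The vectors and the helper `average_vector` below are invented for this illustration and are not spaCy's.

```python
import numpy as np

# Made-up word vectors, purely for illustration.
toy_vectors = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "bites": np.array([0.2, 0.8, 0.1]),
    "man": np.array([0.1, 0.2, 0.9]),
}


def average_vector(sentence):
    """Average the toy vectors of the whitespace-separated words in `sentence`."""
    return np.mean([toy_vectors[word] for word in sentence.split()], axis=0)


a = average_vector("dog bites man")
b = average_vector("man bites dog")
print(np.allclose(a, b))  # True: averaging discards word order
```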

In [45]:
# Use spacy's builtin methods to compute the similarity between sentences and
# create a `spacy_native_closest_sent(space, input_str, n=10)` based on that.

def spacy_native_closest_sent(space, input_str, n=10):
    # named `input_doc` to avoid shadowing Python's builtin `input`
    input_doc = nlp(input_str)
    return sorted(space,
                  key=lambda x: input_doc.similarity(x),
                  reverse=True)[:n]

In [46]:
for sent in spacy_native_closest_sent(sentences,
                                      "My favourite food is strawberry ice cream."):
    print(sent.text)
    print("---")

The vegetables in the gardens, the milk and cheese that I saw
placed at the windows of some of the cottages, allured my appetite.
---
I greedily devoured the
remnants of the shepherd’s breakfast, which consisted of bread, cheese,
milk, and wine; the latter, however, I did not like.  
---
I
had first, however, provided for my sustenance for that day by a loaf
of coarse bread, which I purloined, and a cup with which I could drink
more conveniently than from my hand of the pure water which flowed by
my retreat.  
---
My food is not
that of man; I do not destroy the lamb and the kid to glut my appetite;
acorns and berries afford me sufficient nourishment.  
---
For some
time I sat upon the rock that overlooks the sea of ice.  
---
Their nourishment
consisted entirely of the vegetables of their garden and the milk of
one cow, which gave very little during the winter, when its masters
could scarcely procure food to support it.  
---
we
saw some dogs drawing a sledge, with a man in it, across

## Bonus

Use spacy's vocab to create the list of tokens together with an efficient
lookup method.

* Can you replicate some of the word-math from the slides?
* What is the most similar word to "king" - "man" + "woman"?
* Can you fix spelling mistakes using word vectors?
* Unfortunately the only German model available for spacy does not
  include word vectors. If you have time, download the fastText vectors
  for German and see how things go.

## Further resources

* [Word2vec](https://en.wikipedia.org/wiki/Word2vec) produces word vectors with a predictive approach (as in the skip-gram model above) rather than a context-counting approach. [This paper](http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf) compares and contrasts the two approaches. (Spoiler: not much difference.)
* If you want to train your own word vectors on a particular corpus, the popular Python library [gensim](https://radimrehurek.com/gensim/) has an implementation of Word2Vec that is relatively easy to use. [There's a good tutorial here.](https://rare-technologies.com/word2vec-tutorial/)
* When you're working with vector spaces with high dimensionality and millions of vectors, iterating through your entire space calculating cosine similarities is slow. Use [Annoy](https://pypi.python.org/pypi/annoy) to make the lookup faster.