<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Word2vec-(Skipgram)" data-toc-modified-id="Word2vec-(Skipgram)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Word2vec (Skipgram)</a></span><ul class="toc-item"><li><span><a href="#Model-Details" data-toc-modified-id="Model-Details-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Model Details</a></span></li><li><span><a href="#The-Hidden-Layer" data-toc-modified-id="The-Hidden-Layer-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>The Hidden Layer</a></span></li><li><span><a href="#The-Output-Layer" data-toc-modified-id="The-Output-Layer-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>The Output Layer</a></span></li></ul></li><li><span><a href="#Improving-Word2vec" data-toc-modified-id="Improving-Word2vec-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Improving Word2vec</a></span><ul class="toc-item"><li><span><a href="#Negative-Sampling" data-toc-modified-id="Negative-Sampling-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Negative Sampling</a></span><ul class="toc-item"><li><span><a href="#Mathematical-Notation" data-toc-modified-id="Mathematical-Notation-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Mathematical Notation</a></span></li><li><span><a href="#Selecting-Negative-Samples" data-toc-modified-id="Selecting-Negative-Samples-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Selecting Negative Samples</a></span></li></ul></li><li><span><a href="#Subsampling-Frequenct-Words" data-toc-modified-id="Subsampling-Frequenct-Words-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Subsampling Frequenct Words</a></span></li><li><span><a href="#Detecting-Phrases" data-toc-modified-id="Detecting-Phrases-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Detecting Phrases</a></span></li></ul></li><li><span><a href="#Spacy" data-toc-modified-id="Spacy-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Spacy</a></span><ul class="toc-item"><li><span><a 
href="#Tokenization" data-toc-modified-id="Tokenization-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Part-of-Speech-Tagging" data-toc-modified-id="Part-of-Speech-Tagging-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Part of Speech Tagging</a></span></li><li><span><a href="#Named-Entity-Recognition" data-toc-modified-id="Named-Entity-Recognition-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Named Entity Recognition</a></span></li><li><span><a href="#Token-Level-Attribute" data-toc-modified-id="Token-Level-Attribute-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Token Level Attribute</a></span></li><li><span><a href="#Dependency-Parsing" data-toc-modified-id="Dependency-Parsing-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Dependency Parsing</a></span></li></ul></li><li><span><a href="#Implementation" data-toc-modified-id="Implementation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Implementation</a></span></li><li><span><a href="#Final-Thoughts" data-toc-modified-id="Final-Thoughts-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Final Thoughts</a></span><ul class="toc-item"><li><span><a href="#Hyperparameters" data-toc-modified-id="Hyperparameters-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Hyperparameters</a></span></li><li><span><a href="#Applications" data-toc-modified-id="Applications-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Applications</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Resources</a></span></li></ul></li><li><span><a href="#Reference" data-toc-modified-id="Reference-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))

from formats import load_style
load_style(plot_style = False)

In [2]:
os.chdir(path)

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import spacy
import numpy as np
import pandas as pd
import tensorflow as tf
from time import time
from joblib import cpu_count
from collections import Counter
from gensim.models import Word2Vec
from sklearn.datasets import fetch_20newsgroups

%watermark -a 'Ethen' -d -t -v -p numpy,scipy,pandas,sklearn,tensorflow,gensim,spacy

Using TensorFlow backend.


Ethen 2017-08-11 17:56:35 

CPython 3.5.2
IPython 6.1.0

numpy 1.13.1
scipy 0.19.1
pandas 0.19.2
sklearn 0.18.1
tensorflow 1.2.1
gensim 2.3.0
spacy 1.9.0


# Word2vec (Skipgram)

At a high level, `Word2Vec` is an unsupervised learning algorithm that uses a shallow neural network (with one hidden layer) to learn vector representations of all the unique words/phrases in a given corpus. The advantage word2vec offers is that it tries to preserve the semantic meaning behind those terms. For example, a document may use the words "dog" and "canine" to mean the same thing, but never use them together in a sentence. Ideally, the word2vec algorithm would learn from the context and place them close together in the vector space.

We'll start off by using Gensim's implementation of the algorithm to build a high-level intuition.

In [3]:
# the .data attribute will access the raw data, where
# each element is a document
newsgroups_train = fetch_20newsgroups(subset = 'train')

# gensim’s Word2vec expects a sequence of sentences as its input,
# where each sentence is a list of words. We'll be lazy for now
# and not perform any sort of text preprocessing
sentences = [doc.strip().split() for doc in newsgroups_train.data]

# example output of the data
print('raw data:\n\n', newsgroups_train.data[0])
print('example input:\n', sentences[0])

raw data:

 From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





example input:
 ['From:', 'lerxst@wam.umd.edu', "(where's", 'my', 'thing)', 'Subject:', 'WHAT', 'car', 'is', 'this!?', 'Nntp-Posting-Host:', 'rac3.wam.umd.edu', 'Organization:', 'University', 'of', 'Maryland,', 'College', 'Park', 'Lines:', '15', 'I', 'was', 'wonderi

In [4]:
# apart from the input sentences, the only additional parameter
# we'll set tells the model to use all available cpu cores for training
workers = cpu_count()

start = time()
word2vec = Word2Vec(sentences, workers = workers)
elapse = time() - start
print('elapse time:', elapse)

# obtain the learned word vectors (.wv.syn0)
# and the vocabulary/word that corresponds to each word vector
word_vectors = pd.DataFrame(word2vec.wv.syn0, index = word2vec.wv.index2word)
print('word vector dimension: ', word_vectors.shape)
word_vectors.head()

elapse time: 8.487048149108887
word vector dimension:  (44593, 100)


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| the | 0.875965 | -1.072996 | 0.651279 | 0.969601 | -1.239173 | -6e-06 | -0.026741 | -0.494909 | -1.147683 | -0.673804 | ... | -0.058065 | -2.117051 | -1.844739 | -3.408572 | -0.452634 | 1.05992 | -0.580822 | -0.359145 | -1.448598 | 1.7704 |
| to | 0.517399 | -1.550736 | 0.561461 | -0.410522 | -0.834121 | -0.747776 | -3.933393 | 2.577003 | -0.962767 | -1.607448 | ... | 0.324609 | -0.802488 | 0.560674 | -0.493972 | -0.855662 | -4.484555 | 0.664931 | -0.638924 | -0.567126 | 3.08296 |
| of | 0.739635 | -0.995467 | -1.177334 | 0.167292 | -0.138621 | -0.094872 | -0.55302 | -0.909594 | -1.468312 | -1.804325 | ... | -1.890942 | 1.142688 | 1.20502 | -2.180235 | -3.122204 | -1.757854 | -2.372113 | -3.707302 | -1.036076 | -0.625225 |
| a | -1.265011 | -2.267246 | 0.982152 | -2.038259 | -3.091936 | 0.127317 | -2.154119 | 0.969919 | -1.305455 | 1.667451 | ... | 2.273709 | -0.930035 | -1.947077 | -1.215089 | 0.988526 | -0.069218 | -0.330214 | -0.889438 | -0.013739 | 0.19229 |
| and | -0.607925 | -2.396938 | -0.081813 | -0.21372 | -0.008861 | -0.413868 | -0.706492 | -0.05903 | 0.028941 | -2.246091 | ... | 1.299553 | 0.082098 | -0.096058 | -1.223995 | -0.801336 | -0.227632 | 0.429083 | -0.770775 | -1.180003 | 0.248578 |


After the model has learned the word vectors, we can use them to look up related words and phrases (words with similar semantic meaning) for a given term of interest by comparing the vectors with a similarity measure such as cosine similarity.

In [5]:
word2vec.wv.most_similar(positive = ['computer'], topn = 5)

[('machine', 0.8475774526596069),
 ('keyboard', 0.8243974447250366),
 ('application', 0.8197864294052124),
 ('network', 0.813165009021759),
 ('modem', 0.8092001676559448)]
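For intuition, the scores returned above are cosine similarities: the cosine of the angle between two word vectors, which is 1 when they point in the same direction and 0 when they are orthogonal. The following is a minimal NumPy sketch; the `cosine_similarity` helper and the three toy vectors are made up for illustration and are not part of gensim.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# hypothetical 3-dimensional "word vectors", made up for illustration
vec_a = np.array([1.0, 2.0, 0.0])
vec_b = np.array([2.0, 4.0, 0.0])   # same direction as vec_a, twice the length
vec_c = np.array([-2.0, 1.0, 0.0])  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print('similarity a-b:', sim_ab)
print('similarity a-c:', sim_ac)
```

Because the measure only depends on direction, `vec_a` and `vec_b` are treated as identical even though their lengths differ.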

Apart from finding similar words using a measure such as cosine similarity, word vectors have the remarkable property that analogies between words seem to be encoded in the differences between word vectors. For example, there seems to be a constant male-female difference vector:

<img src="img/word_vectors.png" width="40%" height="40%">

$$
\begin{align}
W(woman) - W(man)
&\approx W(aunt) - W(uncle) \\
&\approx W(queen) - W(king)
\end{align}
$$

This property means we can also perform vector arithmetic (e.g. addition and subtraction) with our word vectors. For example, you might have heard of or seen the famous example: king - man + woman = queen. In the next section, we'll take a look at the model's inner workings. The explanation borrows heavily from the following blog posts: [Blog: Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [Blog: Word2Vec Tutorial Part 2 - Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/).
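To make the analogy arithmetic concrete, here is a toy sketch with hand-crafted 2-dimensional vectors. Everything in it is an assumption for illustration: the `toy_vectors` values are made up (first axis loosely "royalty", second axis loosely "gender") and `nearest_word` is a hypothetical helper, not a gensim function. Real learned vectors only satisfy such relations approximately.

```python
import numpy as np

# hypothetical embeddings, hand-crafted so the analogy holds exactly
toy_vectors = {
    'king': np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man': np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'prince': np.array([0.8, 1.0]),
    'apple': np.array([-1.0, 0.0]),
}


def nearest_word(target, vectors, exclude=()):
    """Return the word whose vector is closest (Euclidean distance) to target."""
    candidates = [word for word in vectors if word not in exclude]
    return min(candidates, key=lambda word: np.linalg.norm(vectors[word] - target))


# king - man + woman should land on (or near) queen
target = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']
answer = nearest_word(target, toy_vectors, exclude=('king', 'man', 'woman'))
print(answer)
```

With a trained gensim model, the analogous query would be along the lines of `word2vec.wv.most_similar(positive = ['king', 'woman'], negative = ['man'])`, which ranks candidates by cosine similarity rather than Euclidean distance.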

## Model Details

Under the hood, the model trains a neural network to do the following: given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. 

A **window size** hyperparameter quantifies what "nearby" means. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total). Some implementations do something even fancier: instead of using a fixed window of size $K$ around each word, the window size is sampled uniformly from $1, 2, ..., K$, where $K$ is the max window size we specify.

The diagram below shows some of the training samples (word pairs) we would take from the sentence "The quick brown fox jumps over the lazy dog." We'll use a small window size of 2 just for the example. The word highlighted in blue is the input word.

<img src="img/skipgram.png" width="60%" height="60%">
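The pair generation described above can be sketched in a few lines of plain Python. The function name `skipgram_pairs` is hypothetical, the sentence is lowercased with punctuation dropped for simplicity, and a fixed window is used rather than the randomly sized window some implementations apply.

```python
def skipgram_pairs(tokens, window_size=2):
    """Generate (input word, nearby word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # clip the window at the sentence boundaries
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = 'the quick brown fox jumps over the lazy dog'.split()
pairs = skipgram_pairs(tokens, window_size=2)
print('number of pairs:', len(pairs))
print(pairs[:5])
```

Words near the start or end of the sentence contribute fewer pairs, since part of their window falls outside the sentence.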

After we feed a bunch of word pairs to the network, it is going to tell us, for every word in our vocabulary, the probability of that word being the "nearby" word we chose. The output probabilities relate to how likely it is to find each vocabulary word near our input word. For example, if we gave the trained network the input word "Soviet", the output probabilities should be much higher for words like "Union" and "Russia" than for unrelated words like "watermelon" and "kangaroo".


So how is this all represented? First of all, we know we can't feed a word just as a text string to a neural network (or probably any machine learning model), i.e. we need a way to represent the words to the network. To do this, we first build a vocabulary of words from our training documents. We'll assume that our corpus has a vocabulary size of 10,000.

We're going to represent an input word like "ants" as a one-hot vector. This vector will have 10,000 components (one for every unique word in our vocabulary) and we'll place a "1" in the position corresponding to the word "ants", and 0s in all of the other positions. The output of the network is a single vector (also with 10,000 components) containing, for every word in our vocabulary, the probability that a randomly selected nearby word is that vocabulary word. Here's the architecture of our single-layer neural network.

<img src="img/word2vec_architecture.png" width="70%" height="70%">

There is no activation function on the hidden layer neurons, but the output neurons use softmax. We’ll come back to this later.

When training this network on word pairs, the input is a one-hot vector representing the input word and the training output is also a one-hot vector representing the output word. But when we evaluate the trained network on an input word, the output vector will actually be a probability distribution (i.e., a bunch of floating point values, not a one-hot vector).

An alternative diagram that depicts that Skip-gram model architecture well:

<img src="img/skipgram_architecture.png" width="30%" height="30%">

Here we're using the centre word $w_{(t)}$ to predict the surrounding words, and the training objective is to learn word vector representations, a.k.a. projections, that are good at predicting the nearby words.

## The Hidden Layer

Let's say that we wish to learn word vectors with 300 features. The number of features is a hyperparameter that we would have to tune to our application to see which one yields the best result. So the hidden layer is going to be represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron).

Now if we look at what would happen when we multiply the 1 x 10,000 one-hot vector representation of the word with a 10,000 x 300 matrix that represents the hidden layer's weight, it will effectively just select the matrix row corresponding to the "1". The following figure is a small example that does a matrix multiplication of a 1 x 5 one hot vector with a 5 x 3 hidden layer's weight to give you a visual. 

<img src="img/hidden_layer.png" width="70%" height="70%">

This means that the hidden layer of this model is really just operating as a lookup table and the output of the hidden layer is essentially the "word vector" for the input word.
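A quick NumPy check of this lookup-table behaviour, using the same 5-word, 3-feature dimensions as the figure. The weight values are random placeholders and the variable names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, n_features = 5, 3

# stand-in for the hidden layer's weight matrix (random placeholder values)
hidden_weights = rng.rand(vocab_size, n_features)

# one-hot vector for the word at (hypothetical) vocabulary index 3
word_index = 3
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot.dot(hidden_weights)  # full matrix multiplication
via_lookup = hidden_weights[word_index]   # plain row selection

print(via_matmul)
print(via_lookup)
```

Both approaches give the same row, which is why practical implementations skip the multiplication and index into the weight matrix directly.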

## The Output Layer

The 1 x 300 word vector for "ants" then gets fed to the output layer. The output layer is a softmax regression classifier. There is separate documentation on softmax regression [here](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/deep_learning/softmax.ipynb), but the gist of it is that each output neuron (one per word in our vocabulary) produces an output between 0 and 1, and these output values sum to 1.

Specifically, each output neuron has a weight vector which it multiplies against the word vector from the hidden layer, then it applies the function `exp(x)` to the result. Finally, in order to get the outputs to sum up to 1, we divide this result by the sum of the results from all 10,000 output nodes. Here's an illustration of calculating the output probability for the word "car".

<img src="img/output_layer.png" width="70%" height="70%">
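The output-layer computation can be sketched with NumPy as follows. The dimensions are shrunk to a 10-word vocabulary and 4 features, the weights are random placeholders, and subtracting the maximum score before exponentiating is a standard numerical-stability step that is not part of the description above (it does not change the resulting probabilities).

```python
import numpy as np


def softmax(scores):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # numerical stability only
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


rng = np.random.RandomState(0)
vocab_size, n_features = 10, 4

word_vector = rng.randn(n_features)                 # output of the hidden layer
output_weights = rng.randn(n_features, vocab_size)  # one weight column per vocabulary word

scores = word_vector.dot(output_weights)  # one raw score per vocabulary word
probs = softmax(scores)

print('probabilities:', probs.round(3))
print('sum:', probs.sum())
```

Note that the denominator touches every word in the vocabulary, which is the cost that the techniques in the next section try to avoid.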

Note that the neural network does not know anything about the offset of the output word relative to the input word. In other words, it does not learn a different set of probabilities for the word before the input versus the word after.

Recall that at the beginning of this documentation, we mentioned that the goal of word2vec is to represent each word in the corpus as a vector while trying to preserve semantic meaning. This means that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then our model needs to output very similar results for these two words. One way for the network to output similar context predictions for these two words is for their word vectors to be similar. So to drive the point home, if two words have similar contexts, then word2vec is motivated to learn similar word vectors for these two words!

# Improving Word2vec

You may have noticed that the skip-gram neural network contains a huge number of weights ... For our example with 300 features and a vocab of 10,000 words, that's 3M weights in the hidden layer and output layer each! Training this on a large dataset would be slow and prone to overfitting, so the word2vec authors introduced a number of tweaks to make training feasible.

- Modifying the optimization objective with a technique they called "Negative Sampling", which causes each training sample to update only a small percentage of the model's weights.
- Subsampling frequent words to decrease the number of training examples.
- Treating common word pairs or phrases as a single word in the model.

It's worth noting that subsampling frequent words and applying Negative Sampling not only reduced the compute burden of the training process, but also improved the quality of their resulting word vectors as well.

## Negative Sampling

Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately. In other words, each training sample will tweak all of the weights in the neural network. As we discussed above, the size of our word vocabulary means that our skip-gram neural network has a tremendous number of weights, all of which would be updated slightly by every one of our billions of training samples! Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them. Here's how it works:

When training the network on the word pair ("fox", "quick"), we want the "correct output" of the network, that is the output neuron corresponding to "quick" to output a 1, and for all of the other thousands of output neurons to output a 0.

With negative sampling, we are instead going to randomly select just a small number of "negative" words (let's say 5) to update the weights for. (In this context, a "negative" word is one for which we want the network to output a 0 for). We will also still update the weights for our "positive" word (which is the word "quick" in our current example).

> The paper says that selecting 5-20 words works well for smaller datasets, and we can get away with only 2-5 words for large datasets.

Recall that the output layer of our model has a weight matrix that's 300 x 10,000. So we will just be updating the weights for our positive word ("quick"), plus the weights for 5 other words that we want to output 0. That's a total of 6 output neurons, and 1,800 weight values total. That's only 0.06% of the 3M weights in the output layer!

In the hidden layer, only the weights for the input word are updated (this is true whether you're using Negative Sampling or not).

### Mathematical Notation

This section revisits the mathematical notation for the Word2vec skipgram objective function. Hopefully, the math won't look so daunting after having an understanding of the model from a non-mathematical standpoint.

In this model, we are given a corpus of words $w$ and their contexts $c$ (the word pair of the targeted word that we've sampled). Our goal is to set the parameters $\theta$ of $p(c | w; \theta)$ to maximize our objective function $J_\theta$, i.e. the corpus probability, with regards to our model parameters $\theta$:

$$
\begin{align}
J_{\theta} &= 
\underset{\theta}{\arg\max} \prod_{w \in Text} \bigg[ \prod_{c \in C(w)} p(c | w; \theta) \bigg]
\end{align}
$$

Here $C(w)$ is word $w$'s set of contexts.

One approach for parameterizing the $p(c | w; \theta)$ part of the skip-gram model is the classic softmax objective function:

$$
\begin{align}
p(c | w; \theta) &= \dfrac{\exp(v_c \cdot v_w)}{ \sum_{c' \in C} \exp(v_{c'} \cdot v_w)}
\end{align}
$$

Where:

- $v_c$ and $v_w$ are the vector representations for context $c$ and word $w$ respectively
- $C$ is the set of all available contexts

While the objective function above can be computed, it is computationally expensive to do so because the summation $\sum_{c' \in C} \exp(v_{c'} \cdot v_w)$ runs over all contexts, and there can be thousands if not millions of them. This is where negative sampling comes in. 

Instead of trying to estimate the probability of the word pair directly, we train a model to differentiate the target word from noise (negative sample). We can thus reduce the problem of predicting the correct word to a binary classification task, where the model tries to distinguish positive/genuine data from noise/negative samples.

We know for the word pair we generated from the data, we want to maximize its probability, i.e.

$$
\begin{align}
\underset{\theta}{\arg\max} \prod_{(w, c) \in D} p(D = 1 \big| c, w; \theta)
\end{align}
$$

Here:

- $D$ is the set of all word and context pairs we extract from the text
- $p(D = 1 \big| c, w; \theta)$ is the probability that the word pair $(w, c)$ came from the corpus data

We then generate the set $D'$ of random $(w, c)$ pairs, assuming they are all incorrect. The name "negative sampling" stems from the set $D'$ of randomly sampled negative examples. Note that when we pick one random word from the vocabulary, there is some tiny chance that the picked word is actually a valid context. Given how large the vocabulary is, we can argue that this probability is really tiny, but a lot of packages still take care of removing these "accidents".

Since this is a binary classification problem, we can model $p(D = 1 \big| c, w; \theta)$ with logistic regression and maximize the likelihood of the data (equivalently, minimize the negative log-likelihood). Taking the log turns the products into sums, leading to the objective function:

$$
\begin{align}
J_{\theta} 
&= \underset{\theta}{\arg\max} 
\prod_{(w, c) \in D} p(D = 1 \big| c, w; \theta) \prod_{(w, c) \in D'} p(D = 0 \big| c, w; \theta) \\
&= \underset{\theta}{\arg\max} 
\prod_{(w, c) \in D} p(D = 1 \big| c, w; \theta) \prod_{(w, c) \in D'} \big(1 - p(D = 1 \big| c, w; \theta)\big) \\
&= \underset{\theta}{\arg\max} 
\sum_{(w, c) \in D} \log \big(p(D = 1 \big| c, w; \theta)\big) + \sum_{(w, c) \in D'} \log \big(1 - p(D = 1 \big| c, w; \theta)\big) \\
&= \underset{\theta}{\arg\max} 
\sum_{(w, c) \in D} \log \bigg( \dfrac{1}{1 + \exp(-v_c \cdot v_w)} \bigg) + \sum_{(w, c) \in D'} \log \bigg(1 - \dfrac{1}{1 + \exp(-v_c \cdot v_w)} \bigg) \\
&= \underset{\theta}{\arg\max}
\sum_{(w, c) \in D} \log \bigg( \dfrac{1}{1 + \exp(-v_c \cdot v_w)} \bigg) + \sum_{(w, c) \in D'} \log \bigg(\dfrac{1}{1 + \exp(v_c \cdot v_w)} \bigg)
\end{align}
$$

Note that there are various other approaches to approximating the expensive softmax objective function. Negative sampling is the only one discussed here because it is fast and works really well. If you would like to go deeper into this topic, the following link might be a good place to start. [Blog: On word embeddings - Part 2: Approximating the Softmax](http://ruder.io/word-embeddings-softmax/index.html)
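As a rough sketch of the final form of the objective, the snippet below computes the negative log-likelihood for a single positive pair together with a handful of negative samples, using the identity that $\frac{1}{1 + \exp(-x)}$ is the sigmoid function. The function names and toy vectors are made up for illustration, and a real implementation would also compute gradients and update the vectors.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(v_w, v_c, v_negatives):
    """
    Negative log-likelihood for one (word, context) pair from the corpus
    and a matrix of negative-sample context vectors (one per row).
    """
    positive_term = np.log(sigmoid(np.dot(v_c, v_w)))
    negative_term = np.sum(np.log(sigmoid(-v_negatives.dot(v_w))))
    return -(positive_term + negative_term)


# toy vectors, made up for illustration
v_w = np.array([1.0, 0.0])

# true context aligned with the word, negatives pointing the other way
good_loss = negative_sampling_loss(
    v_w, np.array([2.0, 0.0]), np.array([[-2.0, 0.0], [-2.0, 0.0]]))

# the opposite arrangement: the model "gets everything wrong"
bad_loss = negative_sampling_loss(
    v_w, np.array([-2.0, 0.0]), np.array([[2.0, 0.0], [2.0, 0.0]]))

print('loss when vectors agree with the data:', good_loss)
print('loss when vectors disagree with the data:', bad_loss)
```

The loss only involves the positive context and the sampled negatives, so its cost no longer grows with the vocabulary size.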

### Selecting Negative Samples

One little detail that's missing from the description above is how we select the negative samples.

The negative samples are chosen using the unigram distribution. Essentially, the probability of selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples. Instead of using the raw frequency for $w_i$, $\text{freq}(w_i)$, the original word2vec paper gives each word a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of weights for all words.

$$
\begin{align}
P(w_i) = \frac{ {\text{freq}(w_i)}^{3/4} }{\sum_{j=0}^{n} \left({\text{freq}(w_j)}^{3/4} \right) }
\end{align}
$$

This decision to raise the frequency to the 3/4 power appears to be empirical: the authors report that it outperformed other choices (e.g. using the raw unigram distribution).

Side note: The way this selection is implemented in the original word2vec C code is interesting. They have a large array with 100M elements (which they refer to as the unigram table). They fill this table with the index of each word in the vocabulary multiple times, and the number of times a word’s index appears in the table is given by $P(w_i) \times \text{table_size}$. Then, to actually select a negative sample, we just generate a random integer between 0 and 100M, and use the word at that index in the table. Since the higher probability words occur more times in the table, we're more likely to pick those.
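A small sketch of the 3/4-power weighting follows. The word counts are made up, `negative_sampling_probs` is a hypothetical helper, and the sampling step uses NumPy's `choice` instead of the unigram table from the C code, which is simpler but serves the same purpose.

```python
import numpy as np
from collections import Counter


def negative_sampling_probs(word_counts, power=0.75):
    """Return the words and their sampling probabilities (count ** power, normalized)."""
    words = list(word_counts)
    weights = np.array([word_counts[word] for word in words], dtype=float) ** power
    return words, weights / weights.sum()


# made-up word counts for illustration
word_counts = Counter({'the': 1000, 'fox': 10, 'kangaroo': 1})
words, probs = negative_sampling_probs(word_counts)
prob_lookup = dict(zip(words, probs))

total = sum(word_counts.values())
for word in words:
    print(word, 'raw frequency:', word_counts[word] / total,
          'sampling probability:', prob_lookup[word])

# draw 5 negative samples according to the smoothed distribution
rng = np.random.RandomState(0)
negative_samples = rng.choice(words, size=5, p=probs)
print(negative_samples)
```

Compared to the raw frequencies, the 3/4 power shifts some probability mass from very frequent words towards rare ones.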

## Subsampling Frequent Words

Word2vec has two additional parameters for discarding some of the input words: words appearing less than `min-count` times are not considered as either words or contexts, and in addition frequent words are down-sampled as defined by the `sample` parameter.

There are two potential issues with frequently occurring words like "the":

- When looking at word pairs that include "the", e.g. ("fox", "the"), "the" doesn't tell us much about the meaning of "fox", since it appears in the context of pretty much every word.
- We will have far more samples of the form ("the", some other word) than we need to learn a good vector for "the".

Word2Vec implements a "subsampling" scheme to address this. For each word we encounter in our training text, there is a chance that we will discard it from the text. The probability that we cut the word is related to the word's frequency.

$$
\begin{align}
\text{probability of keeping the word } w_i 
&= (\sqrt{\frac{z(w_i)}{0.001}} + 1) \cdot \frac{0.001}{z(w_i)}
\end{align}
$$

Where:

- $z(w_i)$ is the fraction of the total words in the corpus that are that word. For example, if the word "peanut" occurs 1,000 times in a 1 billion word corpus, then z("peanut") = 1E-6.
- There is also a parameter called `sample` which controls how much subsampling occurs, and the default value is 0.001. Smaller values of `sample` mean words are less likely to be kept

Here are some interesting observations of this subsampling function (again this is using the default sample value of 0.001).

- $\text{probability of keeping the word } w_i = 1$ (100% chance of being kept) when $z(w_i) \leq 0.0026$. This means that only words which represent more than 0.26% of the total words will be subsampled
- $\text{probability of keeping the word } w_i = 0.5$ (50% chance of being kept) when $z(w_i) = 0.00746$
- $\text{probability of keeping the word } w_i = 0.033$ (3.3% chance of being kept) when $z(w_i) = 1.0$. That is, if the corpus consisted entirely of word $w_i$, which of course is ridiculous
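These observations can be checked numerically by coding up the formula directly. The function name `keep_probability` is illustrative; note that for rare words the formula yields values above 1, which simply means the word is always kept (an implementation could clip the result at 1).

```python
import numpy as np


def keep_probability(z, sample=0.001):
    """Probability of keeping a word that makes up a fraction z of the corpus."""
    return (np.sqrt(z / sample) + 1) * sample / z


for z in [1e-6, 0.0026, 0.00746, 1.0]:
    print('z = {:<8} keep probability = {:.4f}'.format(z, keep_probability(z)))
```

The first value in the loop corresponds to the "peanut" example, whose keep probability is far above 1, so it is never discarded.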

## Detecting Phrases

A word pair like "Boston Globe" (a newspaper) has a much different meaning than the individual words "Boston" and "Globe". So it makes sense to treat "Boston Globe", wherever it occurs in the text, as a single word with its own word vector representation.

The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$
\begin{align}
\frac{count(A B) - count_{min}}{count(A) \cdot count(B)} \cdot N > threshold
\end{align}
$$

Where:

- $count(A)$ is the number of times token $A$ appears in the corpus
- $count(B)$ is the number of times token $B$ appears in the corpus
- $count(A B)$ is the number of times the tokens $A B$ appear in the corpus in this specific order
- $N$ is the total size of the corpus vocabulary
- $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
- $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
    
As we can infer, the formula is designed to make phrases out of words that occur together often relative to their individual occurrences. A higher threshold value favors phrases made of infrequent words, which avoids making phrases out of common words like "and the" or "this is".
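To see how the formula separates the two kinds of word pairs, here is a minimal sketch that plugs in made-up counts. The `phrase_score` helper, all of the counts, the vocabulary size, and the threshold of 10 are hypothetical values chosen for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score for treating the token pair "A B" as a phrase, per the formula above."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


vocab_size = 10000
min_count = 5
threshold = 10

# two rare words that almost always appear together
score_boston_globe = phrase_score(
    count_a=50, count_b=40, count_ab=30, vocab_size=vocab_size, min_count=min_count)

# two very common words that sometimes appear next to each other
score_and_the = phrase_score(
    count_a=5000, count_b=8000, count_ab=300, vocab_size=vocab_size, min_count=min_count)

print('"boston globe":', score_boston_globe, score_boston_globe > threshold)
print('"and the":', score_and_the, score_and_the > threshold)
```

Even though "and the" co-occurs ten times more often in this toy setup, the large individual counts in the denominator keep its score well below the threshold.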

# Spacy

After covering a bit of theory about the Word2vec skipgram model, we'll take a step back and perform some text preprocessing using [**spaCy**](https://spacy.io).

<img src="img/spaCy.png" width="80%" height="80%">

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. According to the author, spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

Not only does it handle many tasks commonly associated with building an end-to-end natural language processing pipeline:

- Tokenization
- Text normalization, such as lowercasing, stemming/lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

But it is also written in optimized Cython, which means it's _fast_. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the _GIL_).

The first step in using `spaCy` is to construct a language processing pipeline. Here we're:

- Loading the pre-trained English model
- Grabbing a sample text and handing it over to spaCy, and be prepared to wait...

```bash
# download the package first
pip install spacy

# after that download the trained english model
python -m spacy download en
```

In [6]:
# we'll be using reviews of a hotel obtained from tripadvisor’s website
# to showcase the general process
reviews = pd.read_table('hotelreviews.txt', names = ['text'])
reviews.head()

| | text |
|---|---|
| 0 | Nice place Better than some reviews give it cr... |
| 1 | what a surprise What a surprise the Sheraton w... |
| 2 | Good location Boston from 17th Floor of ... |
| 3 | Find an alternative to the Sheraton We stayed ... |
| 4 | Barely Tolerable If it were possible to give o... |


In [7]:
# load the model/pipeline, once we have
# loaded the object, we can call it as
# though it were a function
nlp = spacy.load('en')

In [8]:
# grab a single document, hand it over to spacy
doc = reviews.loc[0, 'text']
parsed_doc = nlp(doc)
parsed_doc

Nice place Better than some reviews give it credit for. Overall, the rooms were a bit small but nice. Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city). Overall, it was a good experience and the staff was quite friendly. 

...1/20th of a second or so. Although the text looks exactly the same as before, a lot has actually happened under the hood. Let's take a look at what we got during that time. From here, we'll start to look at the functionalities/properties that spaCy provided us out of the box.

## Tokenization

The first one is sentence detection/segmentation (note that all of these features have already been computed; all we're doing now is accessing them via attributes). Every spaCy document is split into sentences and further into tokens, which can be accessed by iterating over the document.

In [9]:
# access the sents attribute, which is a
# generator that we can loop through
for num, sentence in enumerate(parsed_doc.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print()

# access the first token
print('tokens:')
print(parsed_doc[0])

Sentence 1:
Nice place Better than some reviews give it credit for.

Sentence 2:
Overall, the rooms were a bit small but nice.

Sentence 3:
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).

Sentence 4:
Overall, it was a good experience and the staff was quite friendly.

tokens:
Nice


## Part of Speech Tagging

Part-of-speech tags (POST) are the properties of the word that are defined by the usage of the word in a grammatically correct sentence. These tags can be used as the text features in information filtering, statistical models and rule based parsing.

In [10]:
# .orth_ accesses the raw string and
# .pos_ accesses the part of speech tag
token_text = [token.orth_ for token in parsed_doc]
token_pos = [token.pos_ for token in parsed_doc]

post = pd.DataFrame(list(zip(token_text, token_pos)),
                    columns = ['token_text', 'part_of_speech'])
post.head(5)

Unnamed: 0,token_text,part_of_speech
0,Nice,ADJ
1,place,NOUN
2,Better,PROPN
3,than,ADP
4,some,DET


From the table above, we can see that the word "Nice" is an adjective and so on.

## Named Entity Recognition

spaCy also includes a fast entity recognition model that is capable of identifying entity phrases in the document. Entities can be of different types, such as person, location, organization, date, numeric value, etc.

In [11]:
# For a given document, the standard way to access entity is
# to use the .ents attribute; for each entity, we can then
# access the .label_ attribute to check the entity type for
# the entity that got flagged;

# please check the documentation to see what the label for
# entity means
# https://spacy.io/docs/usage/entity-recognition#entity-types
for num, entity in enumerate(parsed_doc.ents):
    print('Entity {}:'.format(num + 1), entity.orth_, '-', entity.label_)

Entity 1: Better - FAC
Entity 2: the Prudential Center - ORG


We can also perform token-level entity analysis. This is basically the named entity recognition that we've already looked at, but at the token-by-token level. It also provides an inside-outside-begin (IOB) indicator. E.g. here "the Prudential Center" represents one single entity, so "the" is the beginning of the entity (B), while "Prudential" and "Center" are both inside that entity (I). Any token that does not belong to an entity gets labeled as outside (O).

In [12]:
token_entity_type = [token.ent_type_ for token in parsed_doc]
token_entity_iob = [token.ent_iob_ for token in parsed_doc]

entity = pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
                      columns = ['token_text', 'entity_type', 'inside_outside_begin'])
entity.iloc[37:41]

Unnamed: 0,token_text,entity_type,inside_outside_begin
37,the,ORG,B
38,Prudential,ORG,I
39,Center,ORG,I
40,makes,,O
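
To make the IOB scheme concrete, here is a minimal pure-Python sketch that collapses token-level tags back into entity spans. The helper `group_entities` is a hypothetical name for illustration (it is not part of spaCy), and the inputs are rows 37-40 of the table above typed in by hand.

```python
def group_entities(tokens, ent_types, iob_tags):
    """Collapse token-level IOB tags into (entity_text, entity_type) spans"""
    entities, current, current_type = [], [], None
    for token, ent_type, iob in zip(tokens, ent_types, iob_tags):
        if iob == 'B':
            # a new entity starts; flush the one in progress, if any
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [token], ent_type
        elif iob == 'I' and current:
            # still inside the entity that was opened by a 'B'
            current.append(token)
        else:
            # 'O' (or a stray 'I') closes the entity in progress
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [], None

    if current:
        entities.append((' '.join(current), current_type))
    return entities


# rows 37-40 of the table above, typed in by hand
tokens = ['the', 'Prudential', 'Center', 'makes']
ent_types = ['ORG', 'ORG', 'ORG', '']
iob_tags = ['B', 'I', 'I', 'O']
print(group_entities(tokens, ent_types, iob_tags))
# [('the Prudential Center', 'ORG')]
```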


## Token Level Attribute

What about other token-level attributes, such as the relative frequency of tokens (how frequently each token/word appears in the English vocabulary), and whether or not a token matches any of the following categories?

- stopword (grammatically functional words that don't contribute too much to the context)
- punctuation
- whitespace
- number
- out of vocabulary, i.e. whether the token is missing from spaCy's default vocabulary

In terms of the token's relative frequency, spaCy expresses it as a log probability, so a negative number closer to 0 (a smaller absolute value) means the token appears more often.

Please refer to the [documentation page](https://spacy.io/docs/api/token) to see all the available attributes at the token level.

In [13]:
token_attrs = [(token.orth_,
                token.lemma_,
                token.prob,
                token.is_stop,
                token.is_punct,
                token.is_space,
                token.like_num,
                token.is_oov)
                for token in parsed_doc]

df = pd.DataFrame(token_attrs,
                  columns = ['text',
                             'lemma',
                             'log_probability',
                             'stop',
                             'punctuation',
                             'whitespace',
                             'number',
                             'out_of_vocab'])

# we convert the boolean columns to only showing Yes for True
# and a blank string for False for a cleaner output
df.loc[:, 'stop':'out_of_vocab'] = (df.loc[:, 'stop':'out_of_vocab']
                                      .applymap(lambda x: 'Yes' if x else ''))
df.tail()

Unnamed: 0,text,lemma,log_probability,stop,punctuation,whitespace,number,out_of_vocab
68,staff,staff,-10.720455,,,,,
69,was,be,-5.404201,Yes,,,,
70,quite,quite,-8.2562,Yes,,,,
71,friendly,friendly,-10.44458,,,,,
72,.,.,-3.072948,,Yes,,,
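
As a quick sanity check on how to read the `log_probability` column, the sketch below exponentiates a few values copied from the table above. It assumes the values are natural-log probabilities, which is our reading of the attribute rather than something verified here.

```python
import math

# log probabilities copied by hand from the table above
log_probs = {'staff': -10.720455, 'was': -5.404201, '.': -3.072948}

# assumption: the values are natural logs, so exp() recovers
# an estimate of the relative frequency of each token
relative_frequency = {text: math.exp(log_prob)
                      for text, log_prob in log_probs.items()}

for text, freq in relative_frequency.items():
    print('{:>6}: {:.6f}'.format(text, freq))
```

The closer the log probability is to 0, the larger the recovered frequency, which is why the period and "was" rank far above "staff".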


## Dependency Parsing

spaCy also offers a fast and accurate dependency parser. Let's parse the dependency tree of all the sentences that contain a target term that we specify, and check which adjectives are commonly used with that term.

In [14]:
# toy example of how to get the dependency
token = parsed_doc[1]
print('target word:', token)
print('dependency:')

# to get the dependency for a token
# we can access the .children attribute
# and iterate through them
for child in token.children:
    print(child)

target word: place
dependency:
Nice
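
To see what a children lookup amounts to, here is a small sketch that does not need spaCy. The parse below is hand-written for illustration (it is not spaCy output): each token stores the index of its head, and a token's children are simply the tokens that point back at it.

```python
from collections import defaultdict

# a hand-written toy parse (not spaCy output): heads[i] is the index
# of the token that token i attaches to; the root points to itself
tokens = ['the', 'view', 'was', 'wonderful']
heads = [1, 2, 2, 2]

# invert the head pointers to get each token's children
children = defaultdict(list)
for idx, head in enumerate(heads):
    if idx != head:
        children[head].append(tokens[idx])

for idx, token in enumerate(tokens):
    print(token, '->', children[idx])
```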


In [15]:
def valid_word(token):
    """
    Returns False if the spacy token is either
    a punctuation, whitespace, number or a pronoun
    (indicated by the '-PRON-' flag)
    """
    pron_flag = token.lemma_ != '-PRON-'
    word_flag = not (token.is_punct or token.is_space or token.like_num)
    valid = word_flag and pron_flag
    return valid


def post_words(document, target_token, post, topn = 5):
    """
    Given a parsed document/corpus, look at every sentence that
    contains the target token and return the topn most common
    lemmas, among the dependency children in those sentences,
    whose part of speech tag matches `post`
    """
    target_sents = [sent for sent in document.sents if target_token in sent.lower_]    
    words = []
    for sentence in target_sents:
        for token in sentence: 
            words.extend([child.lemma_
                          for child in token.children
                          if child.pos_ == post and valid_word(child)])

    common_words = Counter(words).most_common(topn)
    return common_words

In [16]:
# lump all the documents into one giant document
start = time()
corpus = ' '.join(reviews['text'])
document = nlp(corpus)
elapse = time() - start
print('elapse:', elapse)

elapse: 3.2930688858032227


In [17]:
common_words = post_words(document, target_token = 'view', post = 'ADJ')
common_words

[('great', 41), ('good', 28), ('small', 13), ('which', 13), ('fantastic', 11)]

If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box. It will probably become a core part of the Python data science ecosystem — it will do for natural language computing what other great libraries have done for numerical computing.

# Implementation

The following code chunks 1) preprocess the raw text using spaCy, 2) train a Phrases model to glue words that commonly appear next to each other into bigrams, and 3) train the Word2vec model.

There is also a pure Python script [here](https://github.com/ethen8181/machine-learning/tree/master/deep_learning/word2vec/word2vec_workflow.py) if you're interested.

In [18]:
import os
import spacy
from joblib import cpu_count
from string import punctuation
from gensim.models import Phrases
from gensim.models import Word2Vec
from gensim.models.phrases import Phraser
from gensim.models.word2vec import LineSentence
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [19]:
def export_unigrams(unigram_path, texts, parser, stopwords,
                    batch_size = 10000, n_jobs = -1):
    """
    Preprocess the raw text and export it to a .txt file,
    where each line is one document. For what sort of preprocessing
    is done, please refer to the `clean_corpus` function

    Parameters
    ----------
    unigram_path : str
        output file path of the preprocessed unigram text

    texts : iterable
        iterable can be simply a list, but for larger corpora,
        consider an iterable that streams the sentences directly from
        disk/network using Gensim's LineSentence or something along
        those lines

    parser : spacy model object
        e.g. parser = spacy.load('en')

    stopwords : set
        stopword set that will be excluded from the corpus

    batch_size : int, default 10000
        batch size for the spacy preprocessing

    n_jobs : int, default -1
        number of jobs/cores/threads to use for the spacy preprocessing
    """
    with open(unigram_path, 'w', encoding = 'utf_8') as f:
        for cleaned_text in clean_corpus(texts, parser, stopwords, batch_size, n_jobs):
            f.write(cleaned_text + '\n')


def clean_corpus(texts, parser, stopwords, batch_size, n_jobs):
    """
    Generator function using spaCy to parse reviews:
    - lemmatize the text
    - remove punctuation, whitespace and numbers
    - remove pronouns, e.g. 'it'
    - remove stopwords and single-character tokens
    """
    n_threads = cpu_count()
    if n_jobs > 0 and n_jobs < n_threads:
        n_threads = n_jobs

    # use .pipe to process texts as a stream;
    # this functionality supports using multiple threads
    for parsed_text in parser.pipe(texts, n_threads = n_threads, batch_size = batch_size):
        tokens = []
        for token in parsed_text:
            if valid_word(token) and token.lemma_ not in stopwords:
                tokens.append(token.lemma_)

        cleaned_text = ' '.join(tokens)
        yield cleaned_text


def valid_word(token):
    """
    Returns False if the spacy token is either
    a punctuation, whitespace, number, a pronoun
    (indicated by the 'PRON' part of speech tag)
    or shorter than two characters
    """
    pron_flag = token.pos_ != 'PRON'
    word_flag = not (token.is_punct or token.is_space or token.like_num)
    word_len_flag = len(token) >= 2
    valid = pron_flag and word_flag and word_len_flag
    return valid

In [20]:
# start from spacy's built-in stopword set; we can always
# expand this set for the problem that we are working on,
# here we add python's built-in string punctuation marks
# and scikit-learn's English stopword list
nlp = spacy.load('en')
STOPWORDS = spacy.en.STOP_WORDS | set(punctuation) | set(ENGLISH_STOP_WORDS)

# create a directory called 'model' to
# store all outputs in later section
MODEL_DIR = 'model'
if not os.path.isdir(MODEL_DIR):
    os.mkdir(MODEL_DIR)

UNIGRAM_PATH = os.path.join(MODEL_DIR, 'unigram.txt')
if not os.path.exists(UNIGRAM_PATH):
    start = time()
    export_unigrams(UNIGRAM_PATH, texts = newsgroups_train.data,
                    parser = nlp, stopwords = STOPWORDS)
    elapse = time() - start
    print('text preprocessing, elapse', elapse)

text preprocessing, elapse 226.76951503753662


In [21]:
PHRASE_MODEL_CHECKPOINT = os.path.join(MODEL_DIR, 'phrase_model')
if os.path.exists(PHRASE_MODEL_CHECKPOINT):
    phrase_model = Phrases.load(PHRASE_MODEL_CHECKPOINT)
else:
    # use LineSentence to stream text as opposed to
    # loading it all into memory
    unigram_sentences = LineSentence(UNIGRAM_PATH)
    start = time()
    phrase_model = Phrases(unigram_sentences)
    elapse = time() - start
    print('training phrase model, elapse', elapse)
    phrase_model.save(PHRASE_MODEL_CHECKPOINT)

training phrase model, elapse 2.6187009811401367
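
To give a feel for what the Phrases model learns, here is a rough sketch of the kind of score used to decide whether two adjacent words should be glued together. The formula follows our reading of Gensim's default scorer (with `min_count` acting as the discount and `threshold` as the cutoff); treat the exact form and the defaults of 5 and 10.0 as assumptions. The counts are made up purely for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count = 5):
    """
    Score for gluing words a and b into one token: high when the pair
    co-occurs far more often than the individual counts would suggest
    """
    return (count_ab - min_count) / count_a / count_b * vocab_size


# made-up counts for illustration only
vocab_size = 10000
threshold = 10.0
candidates = {
    # pair: (count_a, count_b, count_ab)
    ('new', 'york'): (120, 100, 90),
    ('of', 'the'): (5000, 8000, 900),
}
for pair, (count_a, count_b, count_ab) in candidates.items():
    score = phrase_score(count_a, count_b, count_ab, vocab_size)
    decision = 'glue' if score > threshold else 'keep separate'
    print(pair, round(score, 2), decision)
```

Even though "of the" co-occurs ten times more often than "new york" in this toy setting, it scores far lower because both of its words are extremely common on their own.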


In [22]:
def export_bigrams(unigram_path, bigram_path, phrase_model):
    """
    Use the learned phrase model to create (potential) bigrams,
    and output the text that contains bigrams to disk

    Parameters
    ----------
    unigram_path : str
        input file path of the preprocessed unigram text

    bigram_path : str
        output file path of the transformed bigram text

    phrase_model : gensim's Phrases model object

    References
    ----------
    Gensim Phrase Detection
    - https://radimrehurek.com/gensim/models/phrases.html
    """

    # after training the Phrase model, create a performant
    # Phraser object to transform any sentence (list of
    # token strings) and glue unigrams together into bigrams
    phraser = Phraser(phrase_model)
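    # read and write with the same encoding that was used
    # when exporting the unigram text
    with open(bigram_path, 'w', encoding = 'utf_8') as fout, \
         open(unigram_path, encoding = 'utf_8') as fin: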
        for text in fin:
            unigram = text.split()
            bigram = phraser[unigram]
            bigram_sentence = ' '.join(bigram)
            fout.write(bigram_sentence + '\n')

In [23]:
BIGRAM_PATH = os.path.join(MODEL_DIR, 'bigram.txt')
if not os.path.exists(BIGRAM_PATH):
    start = time()
    export_bigrams(UNIGRAM_PATH, BIGRAM_PATH, phrase_model)
    elapse = time() - start
    print('converting words to phrases, elapse', elapse)

converting words to phrases, elapse 10.983206987380981


The next two code chunks are a slight digression from the overall process. They are simply used to assess the amount of memory that is potentially saved by doing this preprocessing, i.e. by loading all the raw text and the preprocessed text into memory and comparing the two.

In [24]:
import sys


def get_size(obj, seen = None):
    """
    Recursively finds size of objects, shamelessly "borrowed"
    from link listed in reference
    
    References
    ----------
    https://goshippo.com/blog/measure-real-size-any-python-object/
    """
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()

    obj_id = id(obj)
    if obj_id in seen:
        return 0

    # Important: mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])

    return size

In [25]:
X = []
with open(BIGRAM_PATH) as f:
    for line in f:
        X.append(line)

original_size = get_size(newsgroups_train.data)
preprocessed_size = get_size(X)
ratio = original_size / preprocessed_size
print('original size uses {} times the amount of memory'.format(ratio))
del X

original size uses 1.6859290484964748 times the amount of memory
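
The reason for recursing through the object, instead of calling `sys.getsizeof` once, is that `sys.getsizeof` on a container only counts the container itself and not the objects it references. A minimal sketch with made-up strings:

```python
import sys

# made-up documents, each a distinct string object
documents = ['review number {} '.format(i) * 10 for i in range(3)]

# size of the list object alone (pointers to the strings, not the strings)
shallow = sys.getsizeof(documents)

# adding the size of each string gives a more realistic footprint
deep = shallow + sum(sys.getsizeof(doc) for doc in documents)

print('container only:', shallow, 'bytes')
print('container plus strings:', deep, 'bytes')
```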


The Word2vec model also accepts several key parameters that affect both training speed and quality. Hopefully these will make more sense now that we have covered some of the theory behind the algorithm.

- `iter`: Number of iterations (epochs) over the corpus when training the model.
- `min_count`: Used for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there's not enough data to learn anything meaningful about those words, so it's best to ignore them. A reasonable value for `min_count` is between 0 and 100, depending on the size of the dataset.
- `size`: The size of the hidden layer, i.e. the dimensionality of the word vectors. Bigger values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.
- `workers`: Number of cores/threads used for training.
- `window`: Only terms that occur within a window-neighbourhood of a term in a sentence are associated with it during training. The default value is 5. Unless your text contains very long sentences, it is probably safe to leave it at that.
- `sg`: Defines the training algorithm. If equal to 1, skip-gram is used; otherwise, CBOW is used. The word2vec framework we introduced here is skip-gram, where we use the target word to predict the context words; CBOW simply flips it the other way around, i.e. it uses the context words to predict the target word. Based on the original authors' [paper](https://arxiv.org/abs/1310.4546), skip-gram is reported to be the better framework.

You can refer to the full list of parameters [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).
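
To tie `window` and `sg` back to the theory, the sketch below generates the (target, context) training pairs that skip-gram would see for one sentence. It is a simplification under stated assumptions: it uses a fixed window, whereas real implementations typically vary the effective window per target word, and `skipgram_pairs` is a hypothetical helper, not a Gensim function.

```python
def skipgram_pairs(sentence, window = 2):
    """Yield (target, context) pairs within a fixed window of each word"""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, sentence[j]


sentence = ['the', 'view', 'was', 'wonderful']
pairs = list(skipgram_pairs(sentence, window = 1))
print(pairs)
# [('the', 'view'), ('view', 'the'), ('view', 'was'),
#  ('was', 'view'), ('was', 'wonderful'), ('wonderful', 'was')]
```

CBOW would use the same windows, but group all the context words of a position together to predict the single target word.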

In [26]:
WORD2VEC_CHECKPOINT = os.path.join(MODEL_DIR, 'word2vec')
if os.path.exists(WORD2VEC_CHECKPOINT):
    word2vec = Word2Vec.load(WORD2VEC_CHECKPOINT)
else:
    sentences = LineSentence(BIGRAM_PATH)
    start = time()
    word2vec = Word2Vec(sentences, workers = cpu_count())
    elapse = time() - start
    print('training word2vec, elapse', elapse)
    word2vec.save(WORD2VEC_CHECKPOINT)

training word2vec, elapse 5.97011399269104


After that, we can print out the words most similar to a target word that we specify and use our own judgement to decide whether the learned embedding makes sense. The next code chunk simply satisfies a personal interest in how the `most_similar` function is implemented in gensim.

In [27]:
# once we’re finished training the model (i.e. no more updates, only querying)
# we can store the word vectors and delete the model to trim unneeded model memory
vocab = word2vec.wv.vocab
word_vectors = word2vec.wv.syn0
index2word = word2vec.wv.index2word
del word2vec

In [28]:
from sklearn.preprocessing import normalize


def most_similar(word_vectors, vocab, index2word,
                 positive = None, negative = None, topn = 5):
    """
    No-frills version of the function `word2vec.wv.most_similar`;
    Find the top-N most similar words. Positive words contribute
    positively towards the similarity, negative words negatively
    
    Parameters
    ----------
    word_vectors : 2d ndarray
        the learned word vectors
        
    vocab : word2vec.wv.vocab
        dictionary like object where the key is the word
        and the value has a .index attribute that allows
        us to look up the index for a given word
        
    index2word : word2vec.wv.index2word
        list like object that lets us look up the word
        for a given index
        
    positive/negative : list, default None
        list of positive or negative words to look for similar words,
        positive words are assigned a +1 weight and negative words
        are assigned a -1 weight when combining the vectors
    
    topn : int
        top-n similar words to find
    """
    # normalizing the word vectors up front makes the cosine
    # similarity calculation later easier
    normed = normalize(word_vectors)
    
    # assign weight to positive and negative words
    if positive is None:
        positive = []
    else:
        positive = [(word, 1.0) if isinstance(word, str) else word
                    for word in positive]
    if negative is None:
        negative = []
    else:
        negative = [(word, -1.0) if isinstance(word, str) else word
                    for word in negative]

    # compute the weighted average of all words
    all_words, mean = set(), []
    for word, weight in positive + negative:
        # gensim's Word2vec word2vec.wv.vocab
        # stores the index of a given word
        word_idx = vocab[word].index
        word_vector = normed[word_idx]
        mean.append(weight * word_vector)
        all_words.add(word_idx) 
    
    # find topn most similar words measured by cosine similarity
    mean_vector = np.mean(mean, axis = 0)
    mean_vector /= np.sqrt(np.sum(mean_vector ** 2))
    dists = np.dot(normed, mean_vector)
    best = np.argsort(dists)[::-1][:(topn + len(all_words))]
    result = [(index2word[sim], float(dists[sim]))
              for sim in best if sim not in all_words]

    return result

In [29]:
word = 'computer'
most_similar(word_vectors, vocab, index2word, positive = [word])

[('modem', 0.9702850580215454),
 ('board', 0.9628041386604309),
 ('upgrade', 0.9603150486946106),
 ('printer', 0.9596825838088989),
 ('cd_rom', 0.9565122127532959)]
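
The scores above are cosine similarities: once the vectors are scaled to unit length, the similarity is just a dot product. A tiny pure-Python sketch with made-up 2-d "embeddings" (not vectors from the trained model) shows the ranking logic in isolation:

```python
import math


def unit(vector):
    """Scale a vector to unit length"""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine similarity: the dot product of the two unit vectors"""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


# made-up 2-d 'embeddings' for illustration only
vectors = {
    'computer': [0.9, 0.1],
    'modem': [0.8, 0.2],
    'banana': [0.1, 0.9],
}

query = 'computer'
ranked = sorted(((word, cosine(vectors[query], vector))
                 for word, vector in vectors.items() if word != query),
                key = lambda pair: pair[1], reverse = True)
print(ranked)
```

Here "modem" points in nearly the same direction as "computer" and ranks first, while "banana" is close to orthogonal and ranks last.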

# Final Thoughts

## Hyperparameters

The most crucial decisions that affect performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window. Because Word2vec training is an unsupervised task, there's no good way to objectively evaluate the result; it all depends on the end application.

## Applications

Word2Vec and the concept of word embeddings originated in the domain of NLP. However, the idea of words in the context of a sentence or a surrounding word window can be generalized to other problem domains that deal with sequences or sets of related data points. For example:

- A direct application of Word2Vec to a classical engineering task was presented by Spotify. They abstracted the ideas behind Word2Vec to apply them not simply to words in sentences but to any object in any sequence, in this case to songs in a playlist. Songs are treated as words and the other songs in a playlist as their surrounding context; the vocabulary encompasses the songs in the playlist collection, which may be genre specific. To recommend songs to a user, one then merely has to examine the neighborhood of the 'song embeddings' of songs the user already likes.
- Similarly, we could recommend who a user should connect to in a social network by examining the graph of relationships: the nodes play the role of words and a path through the graph plays the role of a sentence, so nodes that occur in the context of similar other nodes end up close together in the vector space.
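
As a rough sketch of the graph idea, the snippet below turns a small, entirely hypothetical friendship graph into "sentences" by taking random walks. The graph, the walk length and the number of walks per node are all made-up choices for illustration.

```python
import random

# hypothetical friendship graph stored as an adjacency list
graph = {
    'ann': ['bob', 'cat'],
    'bob': ['ann', 'cat'],
    'cat': ['ann', 'bob', 'dan'],
    'dan': ['cat'],
}


def random_walk(graph, start, length, rng):
    """Walk through `length` nodes of the graph, starting from `start`"""
    walk = [start]
    while len(walk) < length:
        # step to a uniformly chosen neighbour of the current node
        walk.append(rng.choice(graph[walk[-1]]))
    return walk


rng = random.Random(1234)

# each walk plays the role of one 'sentence' made of node 'words'
sentences = [random_walk(graph, node, length = 5, rng = rng)
             for node in graph for _ in range(2)]
print(sentences[:2])
```

The resulting list of token lists has the same shape as the tokenized sentences that Word2vec is trained on, so nodes that share neighbourhoods would tend to end up with similar vectors.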

These examples show that the general applicability of Word2Vec based algorithms is very rich and that it behooves practitioners to examine their problem domain with respect to sequences of objects that occur in some meaningful context.

## Resources

A lot of the non-mathematical notes were taken from Chris McCormick's blog post. If you wish to learn more about the topic, the following link also contains different resources that Chris has curated (including a commented version of word2vec's original C code, derivation of word2vec's gradient update, etc.). [Blog: Word2Vec Resources](http://mccormickml.com/2016/04/27/word2vec-resources/)

If you would like to practice using a deep learning framework such as Tensorflow to implement the word2vec model, the following resources are good places to start. The reason it's not included in this documentation is that it was an order of magnitude slower than Gensim's Word2vec and the results weren't as good.

- [Note: CS 20SI Lecture note 4: How to structure your model in TensorFlow](http://web.stanford.edu/class/cs20si/lectures/notes_04.pdf)
- [Blog: Word2Vec word embedding tutorial in Python and TensorFlow](http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/)

# Reference

- [Blog: Word2vec Tutorial](http://rare-technologies.com/word2vec-tutorial/)
- [Blog: Demystifying Word2Vec](http://www.deeplearningweekly.com/blog/demystifying-word2vec)
- [Blog: Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
- [Blog: Word2Vec Tutorial Part 2 - Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)
- [Blog: Natural Language Processing Made Easy – using SpaCy (in Python)](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/)
- [Paper: Yoav Goldberg, Omer Levy (2014) word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method](https://arxiv.org/abs/1402.3722)
- [Paper: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013) Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)
- [Youtube: PyData DC 2016: Modern NLP in Python](https://www.youtube.com/watch?v=6zm9NC9uRkk)
- [Notebook: PyData DC 2016: Modern NLP in Python](http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb)