In [1]:
%matplotlib inline

Word Mover's Distance
=====================
* Word Mover's Distance (WMD) lets us submit a query and retrieve the most relevant documents. See ``wmdistance``.

* WMD assesses the distance between two documents even when they have no words in common. It uses [word2vec](http://rare-technologies.com/word2vec-tutorial/) vector embeddings of words, and it has been shown to outperform many state-of-the-art methods in k-nearest-neighbors classification [3].

* Below: WMD for two very similar sentences - [see Vlad Niculae's blog](http://vene.ro/blog/word-movers-distance-in-python.html).
* The sentences have no words in common, but by matching the relevant words, WMD is able to measure the (dis)similarity between them.
* The method uses a bag-of-words (BoW) representation of the documents. The idea is to find the minimum "traveling distance" between documents, i.e. the most efficient way to "move" the word distribution of document 1 onto the word distribution of document 2.

In [3]:
# Image from https://vene.ro/images/wmd-obama.png
#import matplotlib.pyplot as plt
#import matplotlib.image as mpimg
#img = mpimg.imread('wmd-obama.png')
#imgplot = plt.imshow(img)
#plt.axis('off')
#plt.show()

* WMD was introduced in ["From Word Embeddings To Document Distances" by Matt Kusner et al.](http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf) [3]. It uses a "transportation problem" solver.
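
For reference, the transportation problem can be written as follows. This is a sketch of the formulation in [3]; the symbols are chosen here for illustration. Let $d_i$ and $d'_j$ be the normalized BoW weights of word $i$ in document 1 and word $j$ in document 2, let $x_i$ be the embedding of word $i$, and let $T_{ij} \ge 0$ be the amount of word $i$ that "travels" to word $j$:

$$
\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i,j} T_{ij}\, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i, \qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j .
$$

The first constraint says all of word $i$'s weight must leave document 1; the second says word $j$ in document 2 must receive exactly its own weight.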

* Gensim's WMD functions: the ``wmdistance`` method for distance computation, and the ``WmdSimilarity`` class for corpus-based similarity queries.

* If you use Gensim's WMD functionality, please consider citing [1], [2] and [3].

### Computation

* To use WMD, you need an existing word embedding. Let's use an existing Word2Vec model.

In [4]:
# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentence_obama     = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'

* These sentences have similar content, so their WMD should be low.
* Remove stopwords ("the", "to") before proceeding.

In [5]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download

download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)

[nltk_data] Downloading package stopwords to /home/bjpcjp/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


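* Load pre-trained embeddings (here, the Google News word2vec vectors) with Gensim. ``api.load`` returns the word vectors as a ``KeyedVectors`` object rather than a full ``Word2Vec`` model.
* Note: these embeddings require a lot of memory.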

In [17]:
!ls ~/projects/*datasets*/google-news

GoogleNews-vectors-negative300.bin  GoogleNews-vectors-negative300.bin.gz


In [26]:
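import gensim.downloader as api

# Alternative: load a local copy stored in the word2vec binary format
# (expand '~' and the glob in the path before use).
#from gensim.models import KeyedVectors
#path = "~/projects/*datasets*/google-news/"
#file = "GoogleNews-vectors-negative300.bin.gz"
#model = KeyedVectors.load_word2vec_format(path+file, binary=True)

# Download (on first use) and load the Google News vectors.
model = api.load('word2vec-google-news-300')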




* Find the WMD using ``wmdistance``.




In [None]:
distance = model.wmdistance(sentence_obama, sentence_president)
print('distance = %.4f' % distance)

* Try again with two unrelated sentences. The distance should be larger.




In [None]:
sentence_orange = preprocess('Oranges are my favorite fruit')
distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance = %.4f' % distance)

### Normalizing word2vec vectors

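* When using ``wmdistance`` you should first normalize the word2vec vectors so they all have the same (unit) length. Call ``model.init_sims(replace=True)`` to do this. (In Gensim 4.x this method is deprecated, and ``wmdistance`` normalizes by default through its ``norm`` argument.)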

* We often use **[cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity)** when comparing word2vec vectors - it measures the **angle between vectors**. 
* WMD uses **Euclidean distance** instead. The Euclidean distance between two vectors can be large simply because their lengths differ, even when the angle between them (and so the cosine distance) is small.
* Note: normalizing the vectors can take some time, especially if you have a large vocabulary and/or large vectors.
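
The effect of vector length can be checked with a small standalone sketch. The two vectors below are made up for illustration, and the helper names (`euclidean`, `cosine_distance`, `unit`) are hypothetical, not part of Gensim.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.dist(u, v)

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

def unit(u):
    """Rescale a vector to length 1."""
    norm = math.hypot(*u)
    return tuple(a / norm for a in u)

# Same direction, very different lengths (illustrative values).
short = (1.0, 1.0)
long_ = (10.0, 10.0)

print(euclidean(short, long_))              # large, driven purely by length
print(cosine_distance(short, long_))        # essentially zero
print(euclidean(unit(short), unit(long_)))  # essentially zero after normalizing
```

For unit-length vectors the squared Euclidean distance equals twice the cosine distance, so after normalization the two measures rank word pairs in the same order.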




In [None]:
model.init_sims(replace=True)  # Normalizes the vectors in the word2vec class.

distance = model.wmdistance(sentence_obama, sentence_president)  # Compute WMD as normal.
print('distance = %.4f' % distance)

distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance = %.4f' % distance)

References
----------

1. Ofir Pele and Michael Werman, *A linear time histogram metric for improved SIFT matching*, 2008.
2. Ofir Pele and Michael Werman, *Fast and robust earth mover's distances*, 2009.
3. Matt Kusner et al., *From Word Embeddings To Document Distances*, 2015.
4. Tomas Mikolov et al., *Efficient Estimation of Word Representations in Vector Space*, 2013.


