In [1]:
%matplotlib inline

Word Mover's Distance
=====================
* Word Mover's Distance (WMD) measures how far apart two documents are, which lets us submit a query and return the most relevant documents. See `wmdistance`.

* WMD assesses the distance between two documents even when they have no words in common. It uses [word2vec](http://rare-technologies.com/word2vec-tutorial/) vector embeddings of words. It has been shown to outperform many state-of-the-art methods in k-nearest-neighbors classification [3].

* Below: WMD for two very similar sentences - [see Vlad Niculae's blog](http://vene.ro/blog/word-movers-distance-in-python.html).
* The sentences have no words in common, but by matching the relevant words, WMD is able to measure the (dis)similarity between them.
* The method uses a bag-of-words (BoW) representation of the documents. The idea is to find the minimum "traveling distance" between documents, i.e. the most efficient way to "move" the word distribution of document 1 onto the word distribution of document 2.

* WMD was introduced in ["From Word Embeddings To Document Distances" by Matt Kusner et al.](http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf) [3]. Computing it requires a "transportation problem" solver.

* Gensim's WMD functions: the ``wmdistance`` method for distance computation, and the ``WmdSimilarity`` class for corpus-based similarity queries.

* If you use Gensim's WMD functionality, please consider citing [1], [2] and [3].
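
The "minimum traveling distance" idea can be made concrete with a small sketch. It is not how Gensim computes WMD (Gensim relies on a general transportation solver); it only covers the special case where both documents contain the same number of distinct words, each carrying equal weight. In that case an optimal flow is a one-to-one matching of words, so a brute-force search over permutations is enough. The 2-D vectors and the names `toy_vectors`, `euclidean` and `toy_wmd` are invented for illustration and are not part of Gensim.

```python
import math
from itertools import permutations

# Hypothetical 2-D "embeddings", made up for this sketch only.
toy_vectors = {
    'obama': (1.0, 0.0), 'president': (1.0, 0.2),
    'speaks': (0.0, 1.0), 'greets': (0.1, 1.0),
    'media': (2.0, 2.0), 'press': (2.0, 2.1),
    'illinois': (3.0, 0.0), 'chicago': (3.0, 0.3),
    'oranges': (-3.0, -3.0), 'favorite': (-2.0, -3.0),
    'fruit': (-3.0, -2.0), 'juice': (-2.0, -2.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def toy_wmd(doc1, doc2, vectors):
    """Brute-force WMD for the equal-size, uniform-weight special case.

    Assumes each document is a list of n distinct words, so every word
    carries weight 1/n and an optimal transport plan is a permutation.
    """
    if len(doc1) != len(doc2):
        raise ValueError('this sketch needs documents of equal length')
    if len(set(doc1)) != len(doc1) or len(set(doc2)) != len(doc2):
        raise ValueError('this sketch needs distinct words per document')
    n = len(doc1)
    # Try every one-to-one matching and keep the cheapest total distance.
    best_total = min(
        sum(euclidean(vectors[a], vectors[b]) for a, b in zip(doc1, perm))
        for perm in permutations(doc2)
    )
    return best_total / n  # each matched pair moves mass 1/n


doc_obama = ['obama', 'speaks', 'media', 'illinois']
doc_president = ['president', 'greets', 'press', 'chicago']
doc_fruit = ['oranges', 'favorite', 'fruit', 'juice']  # hypothetical unrelated document

print('related:   %.4f' % toy_wmd(doc_obama, doc_president, toy_vectors))
print('unrelated: %.4f' % toy_wmd(doc_obama, doc_fruit, toy_vectors))
```

With these made-up vectors the related pair scores lower than the unrelated pair even though no words are shared. The search is factorial in document length, which is why real implementations use a dedicated solver.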

### Computation

* To use WMD, you need existing word embeddings. Let's use a pre-trained Word2Vec model.

In [2]:
# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentence_obama     = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'

* These sentences have similar content - WMD should be low.
* Remove stopwords ("the", "to") before proceeding.

In [3]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download

download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)

[nltk_data] Downloading package stopwords to /home/bjpcjp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


* Load the pre-trained Google News word2vec embeddings with Gensim's downloader API.
* Note: these embeddings require a lot of memory.

In [4]:
!ls ~/projects/*datasets*/google-news

GoogleNews-vectors-negative300.bin  GoogleNews-vectors-negative300.bin.gz


In [5]:
# if downloading this model for the first time, grab a coffee. It's 1.6GB.
import gensim.downloader as api
model = api.load('word2vec-google-news-300')

2020-05-05 14:32:15,285 : INFO : loading projection weights from /home/bjpcjp/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2020-05-05 14:35:34,362 : INFO : loaded (3000000, 300) matrix from /home/bjpcjp/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz


In [6]:
distance = model.wmdistance(sentence_obama, sentence_president)
print('distance = %.4f' % distance)

2020-05-05 14:35:34,368 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-05-05 14:35:34,370 : INFO : built Dictionary(8 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...) from 2 documents (total 8 corpus positions)


distance = 3.3741


* Try again with two unrelated sentences. The distance should be larger.




In [7]:
sentence_orange = preprocess('Oranges are my favorite fruit')
distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance = %.4f' % distance)

2020-05-05 14:35:34,519 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-05-05 14:35:34,521 : INFO : built Dictionary(7 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'favorite']...) from 2 documents (total 7 corpus positions)


distance = 4.3802


### Normalizing word2vec vectors

* You should first normalize the word2vec vectors to unit length. Call **model.init_sims(replace=True)** to do this.
* We often use **[cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity)** when comparing word2vec vectors - it measures the **angle between vectors**.
* WMD instead uses **Euclidean distance**. The Euclidean distance between two vectors can be large simply because their lengths differ, even when the angle between them (and hence the cosine distance) is small.
* Note: normalizing the vectors can take some time, especially if you have a large vocabulary and/or large vectors.
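
To see why normalization matters, here is a minimal sketch with hand-picked 2-D vectors (not real word2vec output). The helper names `unit`, `euclidean` and `cosine_similarity` are introduced here for illustration. It shows two vectors that point in the same direction but differ in length, and checks that for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so after normalization the two measures rank pairs the same way.

```python
import math


def norm(v):
    """L2 norm (length) of a vector."""
    return math.sqrt(sum(x * x for x in v))


def unit(v):
    """Return v scaled to unit length."""
    n = norm(v)
    return [x / n for x in v]


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))


a = [1.0, 2.0]
b = [10.0, 20.0]  # same direction as a, ten times longer
c = [2.0, 1.0]    # different direction from a

# Before normalization: identical direction, yet a large Euclidean distance.
print('cosine(a, b)        = %.4f' % cosine_similarity(a, b))
print('euclidean(a, b)     = %.4f' % euclidean(a, b))

# After normalization the length difference no longer contributes.
print('euclidean(a^, b^)   = %.4f' % euclidean(unit(a), unit(b)))

# For unit vectors: squared Euclidean distance == 2 - 2 * cosine similarity.
lhs = euclidean(unit(a), unit(c)) ** 2
rhs = 2 - 2 * cosine_similarity(a, c)
print('identity holds:', math.isclose(lhs, rhs))
```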

In [8]:
model.init_sims(replace=True)  # Normalizes the vectors in the word2vec class.

distance = model.wmdistance(sentence_obama, sentence_president)
print('distance(obama:president): %r' % distance)

distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance(obama:orange) = %.4f' % distance)

2020-05-05 14:39:04,982 : INFO : precomputing L2-norms of word weight vectors
2020-05-05 14:39:07,220 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-05-05 14:39:07,221 : INFO : built Dictionary(8 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...) from 2 documents (total 8 corpus positions)
2020-05-05 14:39:07,222 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-05-05 14:39:07,223 : INFO : built Dictionary(7 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'favorite']...) from 2 documents (total 7 corpus positions)


distance(obama:president): 1.0174646259300113
distance(obama:orange) = 1.3663


References
----------

1. Ofir Pele and Michael Werman, *A linear time histogram metric for improved SIFT matching*, 2008.
2. Ofir Pele and Michael Werman, *Fast and robust earth mover's distances*, 2009.
3. Matt Kusner et al., *From Word Embeddings To Document Distances*, 2015.
4. Tomas Mikolov et al., *Efficient Estimation of Word Representations in Vector Space*, 2013.


