# Simple example of Word Mover's Distance
Here we use the gensim library to show how easy it is to compute the Word Mover's Distance (WMD) in Python.

### Import the libraries needed

In [1]:
import pandas as pd
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cathalhoran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Download a pre-trained embedding model
For this example we don't want to train our own embeddings, so let's use a pre-trained model. <br>
There are many available, and you can check them out [here](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html). <br>
Note that, in general, larger models tend to give more accurate results. <br>
So be sure to try some different ones and see what impact they have on the accuracy. <br>
We use `word2vec-google-news-300` here, which is one of the largest models available. <br>
But you can test it with a much smaller one like `glove-twitter-25`.<br>

In [2]:
import gensim.downloader as api
model = api.load('word2vec-google-news-300')

### Clean the text
You want to remove differences due to things like capitalisation. <br>
You also want to remove stop words, which carry very little information. <br>

In [3]:
def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]
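
Note that `split()` only breaks on whitespace, so punctuation stays attached to the word, and a token such as `restaurant?` may not match any entry in the embedding vocabulary. A minimal sketch of a variant that strips punctuation before tokenising is given here. The name `preprocess_no_punct` and the small `demo_stop_words` set are illustrative assumptions; the set only stands in for the NLTK list so the snippet runs on its own.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess_no_punct(sentence, stop_words):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    words = sentence.lower().translate(_PUNCT_TABLE).split()
    return [w for w in words if w not in stop_words]

# Hypothetical stand-in for the NLTK stop word list (assumption, for illustration only).
demo_stop_words = {'where', 'is', 'the', 'i', 'my', 'can'}

print(preprocess_no_punct("Where is the nearest restaurant?", demo_stop_words))
```

Stripping all punctuation also turns contractions such as `don't` into `dont`, so they would no longer match the stop word list; whether that matters depends on your data.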

### Create your test sentences
You can use any sentences you like here.

In [4]:
sentences = ["I would like to reset my password",
             "Can I change my password", 
             "can I edit my password",
             "Where is the nearest restaurant?", 
             "Can I change my clothes"]

In [5]:
cleaned_sentences = []
for s in sentences:
    print(f'Original Sentence: {s}')
    cleaned_sentences.append(preprocess(s))
    print(f'Cleaned Sentence: {preprocess(s)}')

Original Sentence: I would like to reset my password
Cleaned Sentence: ['would', 'like', 'reset', 'password']
Original Sentence: Can I change my password
Cleaned Sentence: ['change', 'password']
Original Sentence: can I edit my password
Cleaned Sentence: ['edit', 'password']
Original Sentence: Where is the nearest restaurant?
Cleaned Sentence: ['nearest', 'restaurant?']
Original Sentence: Can I change my clothes
Cleaned Sentence: ['change', 'clothes']


### Display results in a grid format
One thing to always do when using similarity measures is to compare each sentence with itself. <br>
This will ensure: <br>
<ol>
  <li>The metric works: any metric should be able to identify identical sentences (for WMD, a distance of 0)</li>
  <li>The sorting works: you need to ensure you are associating the correct score with the relevant sentences</li>
</ol>
We will compare all the sentences with each other here and display the results in a grid. <br>

In [7]:
# One column of WMD scores per sentence, keyed by the original sentence.
res = {sen: [] for sen in sentences}

# Compare every cleaned sentence with every other one (including itself).
for s1, tokens1 in zip(sentences, cleaned_sentences):
    for tokens2 in cleaned_sentences:
        wmd_score = model.wmdistance(tokens1, tokens2)
        res[s1].append(round(wmd_score, 1))

pd.DataFrame(res, index=sentences)

| | I would like to reset my password | Can I change my password | can I edit my password | Where is the nearest restaurant? | Can I change my clothes |
|---|---|---|---|---|---|
| I would like to reset my password | 0.0 | 2.7 | 3.4 | 4.2 | 3.8 |
| Can I change my password | 2.7 | 0.0 | 2.2 | 4.7 | 2.8 |
| can I edit my password | 3.4 | 2.2 | 0.0 | 5.3 | 5.0 |
| Where is the nearest restaurant? | 4.2 | 4.7 | 5.3 | 0.0 | 4.0 |
| Can I change my clothes | 3.8 | 2.8 | 5.0 | 4.0 | 0.0 |
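
The two sanity checks described earlier can also be run programmatically. The sketch below copies the rounded scores from the grid above into a plain list of lists, then checks that every sentence has a distance of 0 to itself and that the grid is symmetric. It also looks up the closest other sentence for each row. The helper names `check_grid` and `nearest_neighbour` are illustrative, not part of gensim or pandas.

```python
sentences = ["I would like to reset my password",
             "Can I change my password",
             "can I edit my password",
             "Where is the nearest restaurant?",
             "Can I change my clothes"]

# Scores copied from the grid above (rounded to one decimal place).
grid = [
    [0.0, 2.7, 3.4, 4.2, 3.8],
    [2.7, 0.0, 2.2, 4.7, 2.8],
    [3.4, 2.2, 0.0, 5.3, 5.0],
    [4.2, 4.7, 5.3, 0.0, 4.0],
    [3.8, 2.8, 5.0, 4.0, 0.0],
]

def check_grid(grid):
    """Return (diagonal_is_zero, is_symmetric) for a square grid of distances."""
    n = len(grid)
    diagonal_ok = all(grid[i][i] == 0.0 for i in range(n))
    symmetric = all(grid[i][j] == grid[j][i] for i in range(n) for j in range(n))
    return diagonal_ok, symmetric

def nearest_neighbour(grid, labels):
    """Map each label to the other label with the smallest distance."""
    nearest = {}
    for i, label in enumerate(labels):
        others = [(score, labels[j]) for j, score in enumerate(grid[i]) if j != i]
        nearest[label] = min(others)[1]
    return nearest

print(check_grid(grid))
for sentence, closest in nearest_neighbour(grid, sentences).items():
    print(f'{sentence!r} -> {closest!r}')
```

In this grid each of the three password sentences is closest to another password sentence, while "Can I change my clothes" is closest to "Can I change my password", with which it shares the word "change".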
