# Latent Semantic Analysis

Latent Semantic Analysis (LSA) was the state of the art for word vectors before the word2vec model [cite] was developed. In short, you build a term/document matrix and apply a Singular Value Decomposition (SVD) [cite] to it. The resulting low-rank representation smooths over variation in the surface forms of words.

## Techniques and Their Results

### Word by Word

This is somewhat like Levenshtein matching: each word in utterance_a is compared to each word in utterance_b, and if the cosine similarity of their vectors is high enough, the pair counts as a match. Despite raising the similarity threshold to 0.999, the F1 score never exceeded 32, no matter how many SVD components we used. Precision hovered around 20, while recall was around 92.
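The counting rule can be sketched on made-up 2-dimensional vectors rather than real LSA output. The helper names `cosine` and `count_word_matches` and the toy vectors are illustrative assumptions, not part of the notebook's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat all-zero vectors as matching nothing
    return dot / (norm_u * norm_v)

def count_word_matches(vectors_a, vectors_b, threshold):
    """Count word pairs whose cosine similarity exceeds the threshold."""
    return sum(1 for u in vectors_a for v in vectors_b if cosine(u, v) > threshold)

# Toy word vectors (hypothetical, not taken from the corpus).
utterance_a = [[1.0, 0.0], [0.0, 1.0]]
utterance_b = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]

matches = count_word_matches(utterance_a, utterance_b, threshold=0.99)
is_variation = matches >= 2  # minimum number of matching pairs
```

Because every word in one utterance is compared with every word in the other, a single repeated word can contribute several matches, which may be part of why recall is so high with this method.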

**TODO:** add `pyemd` to `requirements.txt`.

### Sentence Mean

In this case, we average all the word vectors in a sentence and compare the result to the average of another sentence's vectors. This method does increase our F1 score: 49 with 300 SVD components and a similarity threshold of 0.95. Precision is 46 and recall is 54, so the two are much more balanced.
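A minimal sketch of the sentence-mean comparison, again on toy vectors. The names `mean_vector` and `cosine` are assumptions made for illustration.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = float(len(vectors))
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy word vectors (hypothetical).
sentence_a = [[1.0, 0.0], [0.0, 1.0]]  # mean is [0.5, 0.5]
sentence_b = [[1.0, 1.0], [3.0, 3.0]]  # mean is [2.0, 2.0]

similarity = cosine(mean_vector(sentence_a), mean_vector(sentence_b))
is_match = similarity > 0.95
```

The two sentences share no individual word vector, yet their means point in the same direction, so they match. Averaging discards word order and sentence length, which is the trade-off of this method.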

### Word Mover's Distance

This is a novel concept, based on the 'Earth Mover's Distance'. From Wikipedia:

>In statistics, the earth mover's distance (EMD) is a measure of the distance between two probability distributions over a region D. In mathematics, this is known as the Wasserstein metric. Informally, if the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D, the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be amount of dirt moved times the distance by which it is moved.

In our case, the cost of moving dirt is the distance between the vectors of two words (the implementation in this notebook uses Euclidean distance). A distance matrix is formed between the words of the two utterances, and the method finds the cheapest way to map the words of one utterance onto the words of the other. When both utterances have the same number of words, each word is paired with exactly one word in the other utterance; when the numbers are unequal, several paths can lead to the same word.
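For the equal-length case, the idea can be sketched by brute force: with uniform word weights, an optimal solution pairs each word with exactly one word in the other utterance, so trying every pairing is enough. This is only a toy illustration under that assumption; the function `equal_length_wmd` and the vectors are hypothetical, and the general case (unequal lengths, repeated words) needs a real EMD solver such as `pyemd`.

```python
import itertools
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def equal_length_wmd(vectors_a, vectors_b):
    """Brute-force WMD for two utterances with the same number of distinct words.

    Each word carries weight 1/n, so the cost of a pairing is the mean
    distance over its pairs. Only practical for short utterances (n! pairings).
    """
    n = len(vectors_a)
    if n != len(vectors_b):
        raise ValueError("this sketch only handles equal-length utterances")
    best_total = min(
        sum(euclidean(vectors_a[i], vectors_b[j]) for i, j in enumerate(pairing))
        for pairing in itertools.permutations(range(n))
    )
    return best_total / n

# Toy word vectors (hypothetical).
utterance_a = [[0.0, 0.0], [1.0, 0.0]]
utterance_b = [[1.0, 1.0], [0.0, 1.0]]

distance = equal_length_wmd(utterance_a, utterance_b)
```

Here the cheapest pairing moves each word straight up by one unit, rather than along the longer diagonals.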

**Note:** This is a computationally intensive method.

**References:**

Kusner, Matt, et al. *"From word embeddings to document distances."* International Conference on Machine Learning. 2015.

Pele, Ofir, and Michael Werman. *"Fast and robust Earth Mover's Distances."* ICCV. Vol. 9. 2009.

Pele, Ofir, and Michael Werman. *"A linear time histogram metric for improved sift matching."* European conference on computer vision. Springer, Berlin, Heidelberg, 2008.

## Baseline

Right now it doesn't make sense to compare it to anything with a concrete minimum matching. But if we hit an F1 score of 60, then I will consider this a viable option for detecting variation sets.

In [1]:
import utterances
import evaluation
import sys
import difflib
import collections
import codecs
import numpy as np
from math import log
from itertools import islice
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from gensim import corpora
from pyemd import emd

In [2]:
import varseta_accuracy_tester as vat

In [3]:
args = ("anch", 3, 2)

to_dos = [
        ("DATA/Swedish_MINGLE_dataset/plain/1", "DATA/Swedish_MINGLE_dataset/GOLD/1"),
        ("DATA/Swedish_MINGLE_dataset/plain/2", "DATA/Swedish_MINGLE_dataset/GOLD/2"),
        ("DATA/Swedish_MINGLE_dataset/plain/3", "DATA/Swedish_MINGLE_dataset/GOLD/3"),
        ("DATA/Swedish_MINGLE_dataset/plain/4", "DATA/Swedish_MINGLE_dataset/GOLD/4")]


### Reading in data

Things to note: I'm using only the lowercased versions of the utterances because of the small corpus size. This can change.

In [4]:
all_utterances = []
for to_do in to_dos:
    print("Reading in: " + to_do[0])
    u = utterances.Utterances(to_do[0], to_do[1])
    gold_utterances = u._goldutterances

    utterances_reformatted = []
    ids = []

    for utterance in u._utterances:
        new_utt = utterance[2].split()
        new_utt = [i.lower() for i in new_utt]
        utterances_reformatted.append(new_utt)
        ids.append((utterance[0], utterance[1]))
        
    all_utterances = all_utterances + utterances_reformatted

Reading in: DATA/Swedish_MINGLE_dataset/plain/1
Reading in: DATA/Swedish_MINGLE_dataset/plain/2
Reading in: DATA/Swedish_MINGLE_dataset/plain/3
Reading in: DATA/Swedish_MINGLE_dataset/plain/4


## Tf-idf

The first step is to build a tf-idf matrix. For our purposes, each utterance will be a document. This may change down the line.
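To make the weighting concrete, here is a small pure-Python sketch of tf-idf on three toy utterances. It assumes scikit-learn's documented defaults (`smooth_idf=True`, so `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the function `tfidf_rows` and the toy tokens are illustrative only.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0 for t, df in doc_freq.items()}

    rows = []
    for doc in documents:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Each utterance is one document, already split into tokens.
toy_utterances = [["ja", "ja"], ["ja", "nej"], ["kanske"]]
rows = tfidf_rows(toy_utterances)
```

In the second utterance, the rarer token `nej` ends up with a larger weight than `ja`, which occurs in two of the three documents.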

In [5]:
def _dummy_preprocessor(to_return):
    """Identity function: the utterances are already tokenized, so this bypasses the TfidfVectorizer's own preprocessing and tokenization"""
    return to_return

In [6]:
tf_idf = TfidfVectorizer(analyzer='word',
                         tokenizer=_dummy_preprocessor,
                         preprocessor=_dummy_preprocessor,
                         token_pattern=None)
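One consequence of using identity functions here is worth spelling out. With `analyzer='word'` and no n-grams, the vectorizer appears to iterate directly over whatever the tokenizer returns, so each document should be a list of tokens. The sketch below uses a stand-in `identity` function (an assumption mirroring `_dummy_preprocessor`) to show what iteration yields in each case.

```python
def identity(tokens):
    """Stand-in for `_dummy_preprocessor`: return the input unchanged."""
    return tokens

# A pre-tokenized utterance is iterated token by token.
tokens_from_list = list(identity(["ja", "kanske"]))

# A bare string is iterated character by character.
tokens_from_string = list(identity("ja"))
```

If this reading of the vectorizer is right, a call such as `tf_idf.transform(["ja"])` treats the document as the characters `j` and `a`, whereas `tf_idf.transform([["ja"]])` looks up the word `ja` itself.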

In [7]:
tf_idf_features = tf_idf.fit_transform(all_utterances)
tf_idf_features

<4660x1101 sparse matrix of type '<type 'numpy.float64'>'
	with 16818 stored elements in Compressed Sparse Row format>

### Create the SVD

Note: we could put all of this in a pipeline, but I think it's a little more explicit to go through each step separately.

In [8]:
lsa = TruncatedSVD(n_components=400, 
                   algorithm='randomized',
                   n_iter=7, random_state=69)

test_dense_corpus = lsa.fit_transform(tf_idf_features)
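For comparison, the pipeline variant mentioned above might look like the following sketch. It assumes scikit-learn is installed and uses `make_pipeline` with a toy corpus and only 2 components; the names `identity`, `toy_utterances` and `lsa_pipeline` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def identity(tokens):
    """Pass pre-tokenized utterances through unchanged."""
    return tokens

# Toy pre-tokenized utterances (placeholder tokens, not corpus data).
toy_utterances = [["a", "b", "c"], ["a", "b"], ["c", "d"], ["d", "e", "a"]]

lsa_pipeline = make_pipeline(
    TfidfVectorizer(analyzer="word",
                    tokenizer=identity,
                    preprocessor=identity,
                    token_pattern=None),
    TruncatedSVD(n_components=2, algorithm="randomized", n_iter=7, random_state=0),
)

# One dense row per utterance, one column per SVD component.
dense = lsa_pipeline.fit_transform(toy_utterances)
```

The trade-off is that the intermediate tf-idf matrix is no longer directly at hand; it can still be reached through `lsa_pipeline.steps` if needed.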

In [9]:

test_dense_corpus

array([[ 9.51364892e-01, -2.31301547e-01,  1.16082492e-01, ...,
        -4.08167856e-04,  1.04575443e-04,  2.07873373e-04],
       [ 1.46198434e-01,  2.32317490e-01,  7.58477293e-03, ...,
        -4.50809077e-04, -9.40323514e-04,  2.30051817e-04],
       [ 4.49652796e-02,  7.92418302e-02,  9.63618728e-03, ...,
         4.70855383e-03,  5.37170895e-03,  2.18235174e-03],
       ...,
       [ 3.59692288e-01,  1.90604597e-02, -8.16234413e-02, ...,
         3.95780257e-03,  2.01106691e-04,  3.70454178e-03],
       [ 4.20428058e-02,  1.31244813e-01, -1.60543821e-01, ...,
        -2.35938225e-02, -2.88579328e-02,  1.29561837e-02],
       [ 5.82134770e-02,  2.08614156e-01, -2.59514807e-01, ...,
         3.07040632e-03, -3.30577309e-03, -1.97216009e-04]])

In [10]:
test_a = lsa.transform(tf_idf.transform(["ja"]))
test_a

array([[ 1.29011349e-03,  2.52727211e-03, -6.34374583e-04,
         5.55326881e-03, -2.39245892e-04, -1.48188405e-03,
        -8.16205523e-04,  8.42298315e-04, -5.04169499e-03,
        -2.30353937e-06, -2.36536407e-03, -2.89080734e-03,
        -1.18691833e-03, -3.04637112e-03,  2.09389575e-03,
        -2.18931324e-04, -3.96857868e-03,  1.26094287e-02,
         6.75262110e-03,  3.08682847e-03,  3.75317972e-03,
        -2.40704326e-03, -1.56736219e-03,  2.42793778e-03,
         8.20005555e-04, -1.04234354e-03,  8.62725451e-03,
         2.07788970e-03,  2.97357074e-03, -5.56152645e-03,
        -4.86703384e-03, -2.18052368e-03, -8.87272118e-04,
        -7.30859899e-04, -1.10543256e-03,  2.74002328e-04,
        -4.39753718e-03, -2.51085806e-04, -1.23747915e-03,
        -1.52308022e-03, -4.99930786e-03, -7.37726700e-03,
        -1.84185587e-03,  3.44898674e-03,  3.63050591e-03,
        -2.83417619e-03, -2.31814295e-03, -2.04832542e-03,
         2.62849610e-02, -6.32743844e-03,  8.94120616e-0

In [11]:
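# NOTE: tf_idf uses an identity tokenizer, so a bare string such as "ja" appears
# to be iterated character by character ('j', 'a'); [["ja"]] would look up the
# word itself. The calls and outputs here are left as originally run.
# yes
test_a = lsa.transform(tf_idf.transform(["ja"]))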
print(test_a.shape)

# no
test_b = lsa.transform(tf_idf.transform([u'n\xe4']))

# maybe (according to an online dictionary)
test_c = lsa.transform(tf_idf.transform([u'kanske']))

# yes -> no
print(cosine_similarity(test_a, test_b))

# yes -> maybe
print(cosine_similarity(test_a, test_c))

# no -> maybe (interesting!)
print(cosine_similarity(test_b, test_c))

# no -> no
print(cosine_similarity(test_b, test_b))

(1, 400)
[[-0.00012748]]
[[0.68189472]]
[[-0.00308386]]
[[1.]]


## Word by Word


In [12]:
args = ("anch", 3, 2)

### Accuracy functions

These functions track the precision, recall, and F1 scores of each method.

In [13]:
def confusion_dict_init():
    """Initialize the precision, recall, f1 tracker"""
    confusion = {"fuzzy_precisions" : list(),
                 "strict_precisions" : list(),
                 "fuzzy_recalls" : list(),
                 "strict_recalls" : list(),
                 "fuzzy_f1s" : list(),
                 "strict_f1s" : list()
                 }
    
    return confusion

In [14]:
def update_and_print(confusion_dict, evaluation_stats):
    """Update precision, recall, f1 stats and print their values"""
    confusion_dict["fuzzy_precisions"].append(evaluation_stats.fuzzy_precision)
    confusion_dict["strict_precisions"].append(evaluation_stats.strict_precision)
    confusion_dict["fuzzy_recalls"].append(evaluation_stats.fuzzy_recall)
    confusion_dict["strict_recalls"].append(evaluation_stats.strict_recall)
    confusion_dict["fuzzy_f1s"].append(evaluation_stats.fuzzy_f1)
    confusion_dict["strict_f1s"].append(evaluation_stats.strict_f1)
    
    

    print('\tFuzzy Precision: {:0.2f}'.format(evaluation_stats.fuzzy_precision))
    print('\tFuzzy Recall: {:0.2f}'.format(evaluation_stats.fuzzy_recall))
    print('\tFuzzy F1: {:0.2f}'.format(evaluation_stats.fuzzy_f1))
    print('')
    print('\tStrict Precision: {:0.2f}'.format(evaluation_stats.strict_precision))
    print('\tStrict Recall: {:0.2f}'.format(evaluation_stats.strict_recall))
    print('\tStrict F1: {:0.2f}'.format(evaluation_stats.strict_f1))
    print('\n')
    
    return confusion_dict

In [15]:
def final_stats_print(confusion_dict):
    """Calculate final precision, recall and f1"""
    
    def _average(key):
        """Mean of the per-file scores stored under key"""
        values = confusion_dict[key]
        return sum(values) / float(len(values))

    avg_fuzzy_precision = _average("fuzzy_precisions")
    avg_fuzzy_recall = _average("fuzzy_recalls")
    avg_fuzzy_f1 = _average("fuzzy_f1s")
    avg_strict_precision = _average("strict_precisions")
    avg_strict_recall = _average("strict_recalls")
    avg_strict_f1 = _average("strict_f1s")


    print("\n-------------------")
    print('\nAverage Scores:')
    print('\tAverage Fuzzy Precision: {:0.2f}'.format(avg_fuzzy_precision))
    print('\tAverage Fuzzy Recall: {:0.2f}'.format(avg_fuzzy_recall))
    print('\tAverage Fuzzy F1: {:0.2f}'.format(avg_fuzzy_f1))
    print('')
    print('\tAverage Strict Precision: {:0.2f}'.format(avg_strict_precision))
    print('\tAverage Strict Recall: {:0.2f}'.format(avg_strict_recall))
    print('\tAverage Strict F1: {:0.2f}'.format(avg_strict_f1))
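Note that `final_stats_print` averages the per-file F1 scores, which is not the same as computing F1 from the averaged precision and recall. The sketch below illustrates the difference using the per-file fuzzy precision and recall reported for the word-by-word run in this notebook. It assumes F1 is the usual harmonic mean of precision and recall; how `evaluation.Evaluation` computes it is not shown here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Per-file fuzzy scores copied from the word-by-word results (rounded values).
precisions = [0.33, 0.23, 0.17, 0.08]
recalls = [0.92, 0.85, 0.93, 0.97]

# What the notebook reports: the mean of the per-file F1 scores.
per_file_f1 = [f1_score(p, r) for p, r in zip(precisions, recalls)]
mean_of_f1s = sum(per_file_f1) / len(per_file_f1)

# The alternative: F1 computed from the averaged precision and recall.
f1_of_means = f1_score(sum(precisions) / len(precisions),
                       sum(recalls) / len(recalls))
```

For these numbers the two aggregates differ only slightly, but the gap can grow when the per-file scores vary more.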

### Matcher

In [16]:
def cosine_similarity_matcher(a_vectors, b_vectors, similarity, minimum_matches):
    """Return True if enough word pairs across two sentences are similar
    
    Parameters:
    a_vectors:
        List of vectors, semantic representation of a sentence
    b_vectors:
        List of vectors, semantic representation of a sentence
    similarity:
        Cosine similarity threshold for a word pair to count as a match
    minimum_matches:
        Number of matching pairs needed to be considered a variation set"""
    
    matches = 0
    
    # Compare every word in sentence a with every word in sentence b
    for vector_a in a_vectors:
        for vector_b in b_vectors:
            cos_similarity = cosine_similarity(vector_a.reshape(1, -1), vector_b.reshape(1, -1))[0][0]
            if cos_similarity > similarity:
                matches += 1
                
    return matches >= minimum_matches
                

### Matches wrapper

This is similar to the `matches_anchor` function found in `../variation_sets/processing/utils.py`, but edited to work with cosine similarity.

In [17]:
def matches_anchor_lsa(it, minimum_matches, match_type, overlap, return_count=True, ids=None):
    """Return variation set matches using anchor method"""

    matches = 0
    matches_list = []

    for count, i in enumerate(it):
        utterances = iter(i)
        first = next(utterances)
        first = [j.lower() for j in first]
        first_vector = lsa.transform(tf_idf.transform(first))
        
        for utterance in utterances:
            utterance = [j.lower() for j in utterance]
            utterance_vector = lsa.transform(tf_idf.transform(utterance))
            # Use the minimum_matches argument rather than the global args tuple
            if cosine_similarity_matcher(first_vector, utterance_vector, overlap, minimum_matches):
                matches += 1
                if ids:
                    matches_list.append((ids[count], i))
                else:
                    matches_list.append(i)

    if return_count:
        return matches
    else:
        return matches_list
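The anchor method relies on a sliding window over the utterances: within each window, the first utterance is the anchor and the remaining ones are compared to it. The sketch below shows one way such a window could work. It assumes `vat.window` behaves like the classic sliding-window recipe; its actual implementation is not shown in this notebook, and `sliding_window` is a hypothetical stand-in.

```python
from collections import deque
from itertools import islice

def sliding_window(sequence, size):
    """Yield consecutive overlapping tuples of length `size`."""
    iterator = iter(sequence)
    window = deque(islice(iterator, size), maxlen=size)
    if len(window) == size:
        yield tuple(window)
    for item in iterator:
        window.append(item)  # the oldest item drops out automatically
        yield tuple(window)

utterance_ids = ["u1", "u2", "u3", "u4"]
windows = list(sliding_window(utterance_ids, 3))

# The first element of each window is the anchor; the rest are compared to it.
anchor_pairs = [(w[0], other) for w in windows for other in w[1:]]
```

With a window size of 3, each utterance is compared only with the two utterances that follow it.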

In [18]:
confusion_stats = confusion_dict_init()

similarity = 0.9999

for to_do in to_dos:
    print("Finding variation sets in: " + to_do[0])
    u = utterances.Utterances(to_do[0], to_do[1])
    gold_utterances = u._goldutterances

    utterances_reformatted = []
    ids = []

    for utterance in u._utterances:
        new_utt = utterance[2].split()
        
        utterances_reformatted.append(new_utt)
        ids.append((utterance[0], utterance[1]))

    utt_iter = vat.window(utterances_reformatted, args[1])
    id_iter = vat.window(ids, args[1])
    ids = [i for i in id_iter]
    ids_and_matches = matches_anchor_lsa(utt_iter, args[2], None, similarity, False, ids=ids)
    combined = vat.convert_varseta_format(ids_and_matches)

    varseta_eval = evaluation.Evaluation(combined, gold_utterances)
    
    confusion_stats = update_and_print(confusion_stats, varseta_eval)
    
final_stats_print(confusion_stats)

Finding variation sets in: DATA/Swedish_MINGLE_dataset/plain/1
	Fuzzy Precision: 0.33
	Fuzzy Recall: 0.92
	Fuzzy F1: 0.49

	Strict Precision: 0.07
	Strict Recall: 0.19
	Strict F1: 0.10


Finding variation sets in: DATA/Swedish_MINGLE_dataset/plain/2
	Fuzzy Precision: 0.23
	Fuzzy Recall: 0.85
	Fuzzy F1: 0.36

	Strict Precision: 0.05
	Strict Recall: 0.18
	Strict F1: 0.08


Finding variation sets in: DATA/Swedish_MINGLE_dataset/plain/3
	Fuzzy Precision: 0.17
	Fuzzy Recall: 0.93
	Fuzzy F1: 0.28

	Strict Precision: 0.04
	Strict Recall: 0.22
	Strict F1: 0.07


Finding variation sets in: DATA/Swedish_MINGLE_dataset/plain/4
	Fuzzy Precision: 0.08
	Fuzzy Recall: 0.97
	Fuzzy F1: 0.14

	Strict Precision: 0.00
	Strict Recall: 0.05
	Strict F1: 0.01



-------------------

Average Scores:
	Average Fuzzy Precision: 0.20
	Average Fuzzy Recall: 0.92
	Average Fuzzy F1: 0.32

	Average Strict Precision: 0.04
	Average Strict Recall: 0.16
	Average Strict F1: 0.06


## Average Sentences

For each utterance, this technique averages the vectors of its words into a single sentence vector, then compares sentences by the cosine similarity of those vectors.

In [19]:
def vector_averager(vectors):
    """Average a series of vectors into one vector"""
    
    return np.mean(vectors, axis=0)

In [20]:
def matches_anchor_lsa_sentences(it, minimum_matches, match_type, overlap, return_count=True, ids=None):
    """Return variation set matches using anchor method"""


    
    matches = 0
    matches_list = []

    for count, i in enumerate(it):
        utterances = iter(i)
        first = next(utterances)
        first = [j.lower() for j in first]
        first_vector = lsa.transform(tf_idf.transform(first))
        first_vector = vector_averager(first_vector)
        
        for utterance in utterances:
            utterance = [j.lower() for j in utterance]
            utterance_vector = lsa.transform(tf_idf.transform(utterance))
            utterance_vector = vector_averager(utterance_vector)
            if cosine_similarity(first_vector.reshape(1, -1), utterance_vector.reshape(1, -1)) > overlap:
                matches += 1
                if ids:
                    matches_list.append((ids[count], i))
                else:
                    matches_list.append(i)


    if return_count:
        return matches
    else:
        return matches_list

In [21]:
confusion_stats = confusion_dict_init()

similarity = 0.95

for to_do in to_dos:
    print("Finding variation sets in" + to_do[0])
    u = utterances.Utterances(to_do[0], to_do[1])
    gold_utterances = u._goldutterances
    utterances_reformatted = []
    ids = []

    for utterance in u._utterances:
        new_utt = utterance[2].split()
        utterances_reformatted.append(new_utt)
        ids.append((utterance[0], utterance[1]))
        
    utt_iter = vat.window(utterances_reformatted, args[1])
    id_iter = vat.window(ids, args[1])
    ids = [i for i in id_iter]
    ids_and_matches = matches_anchor_lsa_sentences(utt_iter, args[2], None, similarity, False, ids=ids)
    
    combined = vat.convert_varseta_format(ids_and_matches)

    varseta_eval = evaluation.Evaluation(combined, gold_utterances)
    
    confusion_stats = update_and_print(confusion_stats, varseta_eval)
    
final_stats_print(confusion_stats)

Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/1
	Fuzzy Precision: 0.63
	Fuzzy Recall: 0.60
	Fuzzy F1: 0.62

	Strict Precision: 0.08
	Strict Recall: 0.07
	Strict F1: 0.07


Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/2
	Fuzzy Precision: 0.51
	Fuzzy Recall: 0.50
	Fuzzy F1: 0.50

	Strict Precision: 0.10
	Strict Recall: 0.10
	Strict F1: 0.10


Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/3
	Fuzzy Precision: 0.45
	Fuzzy Recall: 0.47
	Fuzzy F1: 0.46

	Strict Precision: 0.08
	Strict Recall: 0.09
	Strict F1: 0.09


Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/4
	Fuzzy Precision: 0.27
	Fuzzy Recall: 0.38
	Fuzzy F1: 0.32

	Strict Precision: 0.00
	Strict Recall: 0.00
	Strict F1: 0.00



-------------------

Average Scores:
	Average Fuzzy Precision: 0.46
	Average Fuzzy Recall: 0.49
	Average Fuzzy F1: 0.47

	Average Strict Precision: 0.06
	Average Strict Recall: 0.06
	Average Strict F1: 0.06


## Averages with smoothing

This is an implementation of (Arora et al. 2016). In short, you smooth based on the frequency of each word. I'm skeptical, as the LSA was done on tf-idf, but the method has performed well in semantic matching (in English and with word2vec vectors).

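Before smoothing, the sentence vector is built from the sum of its word vectors (taking the mean instead only rescales it, which does not affect cosine similarity):
$$\sum_w v_w$$

With smoothing, each word gets a weight $\alpha_w$:

$$\alpha_w = \frac{a}{a + p_w}$$

where $p_w$ is the frequency of *w* in the corpus and *a* is a hyperparameter (0.0001 is used in the paper; the implementation in this notebook uses 0.01 with raw counts). The smoothed sentence vector is then:

$$\sum_w \alpha_w v_w$$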
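A small worked example of the weights, assuming $p_w$ is the relative frequency of a word in the corpus and using the paper's value of *a*. The function `smoothing_weights` and the toy corpus are illustrative assumptions.

```python
from collections import Counter

def smoothing_weights(tokens, counts, a=1e-4):
    """Return {token: a / (a + p_w)} with p_w the relative corpus frequency."""
    total = float(sum(counts.values()))
    return {t: a / (a + counts[t] / total) for t in set(tokens)}

# Toy corpus (hypothetical): "ja" is frequent, "kanske" is rare.
toy_corpus = [["ja", "ja", "ja", "kanske"], ["ja", "titta"]]
counts = Counter(token for utterance in toy_corpus for token in utterance)

weights = smoothing_weights(["ja", "kanske"], counts)
```

The rarer word receives the larger weight, so frequent words contribute less to the sentence vector. This overlaps with what tf-idf already does, which is one reason to be cautious about the benefit here.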

### Results

The F1 score is the same as for the plain sentence average (0.47), but recall is much higher (0.68 versus 0.49).

>Average Scores:

>	Average Fuzzy Precision: 0.37

>	Average Fuzzy Recall: 0.68

>	Average Fuzzy F1: 0.47
>
>	Average Strict Precision: 0.06

>	Average Strict Recall: 0.10

>	Average Strict F1: 0.07

Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. *"A simple but tough-to-beat baseline for sentence embeddings."* (2016).

In [22]:
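def smooth_vector(words, vectors, counter):
    """Scale each word vector by its smoothing weight a / (a + count)
    
    words and vectors must be aligned; counter maps words to corpus counts"""
    
    a = 0.01
    
    alpha_list = []
    
    for word in words:
        # Raw corpus counts are used here, so frequent words get small weights
        value = a/(a + float(counter[word]))
        alpha_list.append([value])
        
    # Shape (n_words, 1), so each row of vectors is scaled by its own alpha
    alphas = np.array(alpha_list)
    
    return alphas * vectors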

In [23]:
def matches_anchor_smoothed_lsa_sentences(it, minimum_matches, match_type, overlap, word_counts, return_count=True, ids=None):
    """Return variation set matches using anchor method"""
    
    matches = 0
    matches_list = []

    for count, i in enumerate(it):
        utterances = iter(i)
        first = next(utterances)
        first = [j.lower() for j in first]
        first_vector = lsa.transform(tf_idf.transform(first))
        
        first_vector = smooth_vector(first, first_vector, word_counts)
        first_vector = vector_averager(first_vector)
        
        for utterance in utterances:
            utterance = [j.lower() for j in utterance]
            utterance_vector = lsa.transform(tf_idf.transform(utterance))
            utterance_vector = smooth_vector(utterance, utterance_vector, word_counts)
            utterance_vector = vector_averager(utterance_vector)
            if cosine_similarity(first_vector.reshape(1, -1), utterance_vector.reshape(1, -1)) > overlap:
                matches += 1
                if ids:
                    matches_list.append((ids[count], i))
                else:
                    matches_list.append(i)


    if return_count:
        return matches
    else:
        return matches_list

In [24]:
confusion_stats = confusion_dict_init()

similarity = 0.96


# get word counts
word_counts = collections.Counter()
for utterance_tokens in all_utterances:
    word_counts.update(utterance_tokens)

for to_do in to_dos:
    print("Finding variation sets in" + to_do[0])
    u = utterances.Utterances(to_do[0], to_do[1])
    gold_utterances = u._goldutterances
    utterances_reformatted = []
    ids = []

    for utterance in u._utterances:
        new_utt = utterance[2].split()
        utterances_reformatted.append(new_utt)
        ids.append((utterance[0], utterance[1]))
        
    utt_iter = vat.window(utterances_reformatted, args[1])
    id_iter = vat.window(ids, args[1])
    ids = [i for i in id_iter]
    ids_and_matches = matches_anchor_smoothed_lsa_sentences(utt_iter,
                                                            args[2],
                                                            None,
                                                            similarity,
                                                            word_counts,
                                                            False,
                                                            ids=ids)
    
    combined = vat.convert_varseta_format(ids_and_matches)

    varseta_eval = evaluation.Evaluation(combined, gold_utterances)
    
    confusion_stats = update_and_print(confusion_stats, varseta_eval)
    
final_stats_print(confusion_stats)

Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/1
	Fuzzy Precision: 0.48
	Fuzzy Recall: 0.79
	Fuzzy F1: 0.60

	Strict Precision: 0.08
	Strict Recall: 0.13
	Strict F1: 0.10


Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/2
	Fuzzy Precision: 0.41
	Fuzzy Recall: 0.64
	Fuzzy F1: 0.50

	Strict Precision: 0.07
	Strict Recall: 0.12
	Strict F1: 0.09


Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/3
	Fuzzy Precision: 0.35
	Fuzzy Recall: 0.69
	Fuzzy F1: 0.46

	Strict Precision: 0.06
	Strict Recall: 0.11
	Strict F1: 0.08


Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/4
	Fuzzy Precision: 0.22
	Fuzzy Recall: 0.59
	Fuzzy F1: 0.32

	Strict Precision: 0.02
	Strict Recall: 0.05
	Strict F1: 0.03



-------------------

Average Scores:
	Average Fuzzy Precision: 0.36
	Average Fuzzy Recall: 0.68
	Average Fuzzy F1: 0.47

	Average Strict Precision: 0.06
	Average Strict Recall: 0.10
	Average Strict F1: 0.07


## Word Mover's Distance



In [25]:
def wmdistance(lsa, document1, document2):
    """Compute the Word Mover's Distance between two documents.

    Parameters
    ----------
    lsa : sklearn.decomposition.TruncatedSVD
        Fitted model used to generate word vectors.
    document1 : list of str
        Input document.
    document2 : list of str
        Input document.

    Returns
    -------
    float
        Word Mover's distance between `document1` and `document2`.

    Warnings
    --------
    This method only works if `pyemd <https://pypi.org/project/pyemd/>`_ is installed.
    If the distance matrix contains only zeros (for example, when both documents
    consist of the same single word), `float('inf')` (i.e. infinity) is returned.
    """
        
    dictionary = corpora.Dictionary(documents=[document1, document2])
    vocab_len = len(dictionary)

    # Sets for faster look-up.
    docset1 = set(document1)
    docset2 = set(document2)

    # Compute distance matrix.
    distance_matrix = np.zeros((vocab_len, vocab_len), dtype=np.double)
    for i, t1 in dictionary.items():
        for j, t2 in dictionary.items():
            if t1 not in docset1 or t2 not in docset2:
                continue
            # Compute Euclidean distance between word vectors.
            distance_matrix[i, j] = np.sqrt(np.sum((lsa.transform(tf_idf.transform([t1])) - lsa.transform(tf_idf.transform([t2])))**2))
            
    if np.sum(distance_matrix) == 0.0:
        # `emd` gets stuck if the distance matrix contains only zeros.
        return float('inf')

    def nbow(document):
        d = np.zeros(vocab_len, dtype=np.double)
        nbow = dictionary.doc2bow(document)  # Word frequencies.
        doc_len = len(document)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len)  # Normalized word frequencies.
        return d

    # Compute nBOW representation of documents.
    d1 = nbow(document1)
    d2 = nbow(document2)

    # Compute WMD.
    return emd(d1, d2, distance_matrix)

In [26]:
def matches_anchor_wmd(it, minimum_matches, match_type, overlap, lsa, return_count=True, ids=None):
    """Return variation set matches using anchor method"""
    matches = 0
    matches_list = []

    for count, i in enumerate(it):
        utterances = iter(i)
        first = next(utterances)
        first = [j.lower() for j in first]
        
        for utterance in utterances:
            utterance = [j.lower() for j in utterance]

            if wmdistance(lsa, first, utterance) < overlap:
                matches += 1
                if ids:
                    matches_list.append((ids[count], i))
                else:
                    matches_list.append(i)


    if return_count:
        return matches
    else:
        return matches_list

In [27]:
print(wmdistance(lsa, [u'ja'], [u'ja', u'ja']))
print(all_utterances[0])
print(all_utterances[1])

print(wmdistance(lsa, all_utterances[0], all_utterances[1]))

inf
[u'ja']
[u's\xe5', u'!']
1.3976866697
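The `inf` in the first line of output follows from the normalized bag-of-words (nBOW) step combined with the all-zeros check in `wmdistance`. The sketch below reproduces the nBOW computation with a plain `Counter`; the standalone `nbow` function here is a simplified stand-in for the nested helper, returning a dict rather than a dense array.

```python
from collections import Counter

def nbow(document):
    """Normalized bag of words: each token's share of the document."""
    length = float(len(document))
    return {token: count / length for token, count in Counter(document).items()}

short = nbow(["ja"])
repeated = nbow(["ja", "ja"])
mixed = nbow(["ja", "ja", "kanske", "titta"])
```

`["ja"]` and `["ja", "ja"]` have the same nBOW representation, and the only distance involved is between `ja` and itself, which is zero. The distance matrix therefore sums to zero and the function returns `inf`. Since `inf < overlap` is never true, exact repetitions of this kind would not count as matches under this method, which may be worth handling separately.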


In [None]:
confusion_stats = confusion_dict_init()

similarity = 0.40

for to_do in to_dos:
    print("Finding variation sets in" + to_do[0])
    u = utterances.Utterances(to_do[0], to_do[1])
    gold_utterances = u._goldutterances
    utterances_reformatted = []
    ids = []

    for utterance in u._utterances:
        new_utt = utterance[2].split()
        utterances_reformatted.append(new_utt)
        ids.append((utterance[0], utterance[1]))
        
    utt_iter = vat.window(utterances_reformatted, args[1])
    id_iter = vat.window(ids, args[1])
    ids = [i for i in id_iter]
    ids_and_matches = matches_anchor_wmd(utt_iter, args[2], None, similarity, lsa, False, ids=ids)
    combined = vat.convert_varseta_format(ids_and_matches)

    varseta_eval = evaluation.Evaluation(combined, gold_utterances)

    confusion_stats = update_and_print(confusion_stats, varseta_eval)
    
final_stats_print(confusion_stats)

Finding variation sets inDATA/Swedish_MINGLE_dataset/plain/1
