## Baselines
The goal of this notebook is to have working implementations of two standard models for learning distributed representations of words: word2vec and GloVe.

### word2vec
There are a few existing implementations of word2vec. The [original code](https://code.google.com/archive/p/word2vec/) is available in C. The TensorFlow docs have a [good tutorial](https://www.tensorflow.org/tutorials/word2vec) with two versions of word2vec. However, I'm going with [gensim's](http://radimrehurek.com/gensim/models/word2vec.html) version, for the following reasons: 1) I'm confident it's correct, because the website for the original version lists it as a Python implementation, 2) it's [fast](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), 3) it fits nicely into my existing Python workflow in a way that the other options don't, and 4) it's used by other researchers. An easy tutorial showing how to use it is [here](https://rare-technologies.com/word2vec-tutorial/). To start, I'm going to train word2vec with default hyperparameters on an easy-to-use corpus.

In [1]:
import gensim
import nltk

#### Training corpus

Word2vec in gensim comes with a bunch of different corpus objects to iterate over large corpora. It's straightforward to use your own corpus, but at first I'll use the pre-canned Brown corpus. The data of the Brown corpus come from NLTK but the pre-canned bit is gensim's class for iterating over it.

In [2]:
path_to_brown = nltk.data.find('corpora/brown').path
training_corpus = gensim.models.word2vec.BrownCorpus(path_to_brown)
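
For reference, here is a minimal sketch of what "your own corpus" could look like. gensim expects an iterable of token lists that can be iterated more than once; the class name `LineCorpus`, the one-sentence-per-line layout and the whitespace tokenization are all assumptions made for illustration, not part of gensim.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable yielding one tokenized sentence per line of a text file.

    Hypothetical helper (not part of gensim): assumes one sentence per line
    and plain whitespace tokenization.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        # (once to build the vocabulary, then once per training epoch).
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demonstration on a throwaway file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('The cat sat on the mat\n\nDogs bark\n')
    demo_path = tmp.name

demo_corpus = LineCorpus(demo_path)
first_pass = list(demo_corpus)
second_pass = list(demo_corpus)  # a plain generator would be exhausted by now
os.remove(demo_path)
```

An object like `demo_corpus` could then be passed to `gensim.models.Word2Vec` in place of `training_corpus`.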

#### Options for training word2vec in gensim
- `sg` 0 for CBOW (default), 1 for skip-gram
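- `size` dimensionality of the vectors (renamed `vector_size` in gensim 4.0)
- `window` window size
- `alpha` is the initial learning rate (will linearly drop to `min_alpha` as training progresses)
- `seed` for the random number generator; in Python 3, reproducible runs also need the `PYTHONHASHSEED` environment variable set and a single worker thread (`workers=1`)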
- `min_count` lower bound on word frequency
- `max_vocab_size` used to limit RAM usage
- `workers` number of threads
- `hs` 1 for hierarchical softmax, 0 for negative sampling (default)
- `negative` number of negative words to sample (default 5)
- `iter` number of epochs (default 5; renamed `epochs` in gensim 4.0)
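
To make the `alpha` / `min_alpha` entry concrete, here is a small sketch of a linear learning-rate decay. It only illustrates the idea and is not gensim's actual code; the function name `linear_alpha` is made up, and the values 0.025 and 0.0001 are what I believe gensim's defaults to be, so treat them as assumptions.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate once a fraction `progress` (0.0 to 1.0) of training is done.

    Illustrative only: interpolates linearly from `alpha` down to `min_alpha`.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return alpha - (alpha - min_alpha) * progress


# Learning rate at 0%, 25%, 50%, 75% and 100% of training.
schedule = [linear_alpha(step / 4) for step in range(5)]
```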

In [3]:
model = gensim.models.Word2Vec(training_corpus, sg=0, size=100, window=5, min_count=1)

#### Accessing trained word embeddings
Gensim has a class, `KeyedVectors`, for storing word vectors in a read-only way. This is where gensim keeps all its functionality for assessing the vectors, such as accuracy on similarity/analogy datasets, most similar words, etc. Because it's a little restrictive, I don't know how much I'll use it. At the moment, I'd rather have the vectors as a pandas dataframe and work with custom assessment methods from there. Most importantly, I want to save them in a human-readable format.

In [4]:
embeddings = model.wv

In [5]:
embeddings['the/at']

array([ 0.37326825, -1.77543032,  1.01418495,  0.76122272,  0.6014545 ,
        0.61599427, -0.40848544, -0.27209073,  1.14151788,  0.20488951,
        0.67067975, -0.3072879 , -0.66000891, -0.51147223,  0.95049095,
        0.86448979, -0.07529252,  0.12546216,  0.41390449,  0.79765427,
       -0.51364458,  1.4868288 , -0.41729179,  1.49503493,  0.55480272,
       -0.52688456,  0.26556841,  0.75362861, -0.09337724, -0.17674325,
        1.00097585,  1.36793303,  1.11379051, -0.30979261, -0.01855088,
       -0.03304428,  0.6429615 , -0.52750677, -1.76178443, -0.06411998,
        0.24928731,  0.93973953, -0.19085188,  1.68263257, -0.24911675,
       -0.44579032, -1.3018446 ,  0.08730339,  0.15827051,  1.31775475,
       -0.38957056,  1.14586544,  0.29244775,  0.5834164 , -0.29266146,
        0.20015037, -1.34383428,  1.21257615,  1.03930342, -0.99131125,
       -0.89497703, -0.06676644,  1.34059489, -0.91830373,  0.23483844,
        0.47910669, -0.39294741, -0.32439584,  0.5079456 ,  0.31

#### Saving the model
You can either save the whole model, which is good if you want to continue training it later, or just the word embeddings, which is best for my current purposes.

In [6]:
# To save the whole model:
#outfile = 'word2vec_model' # If model is large enough, gensim will actually write to multiple files
#model.save(outfile, pickle_protocol=3)

In [7]:
# To save just the embeddings
outfile = 'word2vec_embeddings'
embeddings.save_word2vec_format(outfile, binary=False)

In [8]:
!head -n 3 word2vec_embeddings

54294 100
the/at 0.373268 -1.775430 1.014185 0.761223 0.601454 0.615994 -0.408485 -0.272091 1.141518 0.204890 0.670680 -0.307288 -0.660009 -0.511472 0.950491 0.864490 -0.075293 0.125462 0.413904 0.797654 -0.513645 1.486829 -0.417292 1.495035 0.554803 -0.526885 0.265568 0.753629 -0.093377 -0.176743 1.000976 1.367933 1.113791 -0.309793 -0.018551 -0.033044 0.642962 -0.527507 -1.761784 -0.064120 0.249287 0.939740 -0.190852 1.682633 -0.249117 -0.445790 -1.301845 0.087303 0.158271 1.317755 -0.389571 1.145865 0.292448 0.583416 -0.292661 0.200150 -1.343834 1.212576 1.039303 -0.991311 -0.894977 -0.066766 1.340595 -0.918304 0.234838 0.479107 -0.392947 -0.324396 0.507946 0.311040 -0.204842 1.311464 -0.419551 -0.091043 -0.195750 -1.434508 1.314521 0.853064 -0.086493 -0.984213 0.494367 1.082618 -0.539906 -0.320106 -1.068538 0.422812 0.318158 0.095050 -0.379815 -0.531776 -0.966955 -0.989597 1.189013 -0.500338 0.316590 0.482409 -0.234608 0.749378 0.028366 0.590650
of/in -1.529803 -1.978377 0.310563

The format of the saved word embeddings is as follows: the first line is a header of the form "vocabulary_size embedding_dimension" (here, 54294 words with 100 dimensions each). Every subsequent line is "word_form w1 w2 ... wn". The format I want is a pandas dataframe with the word forms as column labels and n rows for the n dimensions. I want it this way because it's easier to access columns in pandas than rows.

In [9]:
import pandas as pd

In [10]:
df = pd.read_csv('word2vec_embeddings', sep=' ', skiprows=[0], header=None, index_col=0).T

In [11]:
df.head()

Unnamed: 0,the/at,of/in,and/cc,a/at,in/in,to/to,to/in,is/be,was/be,he/pp,...,fluke/nn,bilharziasis/nn,perelman/np,exhaling/vb,aviary/nn,olive-flushed/jj,cherokee/np,coral-colored/jj,boucle/nn,stupefying/vb
1,0.373268,-1.529803,-0.015312,0.544234,-0.73863,1.645025,-0.253535,-2.694053,-1.40652,0.779479,...,-0.004336,-0.002706,0.001036,0.006756,0.008,-0.001969,-0.007027,-0.003263,0.002693,0.001828
2,-1.77543,-1.978377,-1.727973,-1.099333,-1.759403,-1.29359,-2.046214,-1.536931,-0.656673,0.824978,...,-0.031166,-0.019416,-0.002642,-0.024017,-0.02049,-0.030188,-0.023442,-0.030332,-0.03319,-0.013628
3,1.014185,0.310563,-0.302466,1.74175,-0.141368,-0.543366,-0.544071,1.3065,0.801294,-1.17524,...,-0.010961,-0.007187,-0.002249,-0.00993,-0.00178,-0.020248,-0.021888,-0.026032,-0.022188,-0.006811
4,0.761223,0.773821,-0.519353,1.287695,1.225874,-1.391787,0.163469,1.704131,1.78851,1.458925,...,0.017467,0.014442,0.004862,0.002261,0.013614,0.006257,-0.003881,0.006945,0.010526,0.003633
5,0.601454,0.270909,0.459512,0.9239,1.131693,-0.192218,0.880312,-0.955793,0.594803,0.884147,...,0.013015,0.012107,0.00829,0.009965,0.012712,0.008782,0.015151,0.008668,0.01136,0.003573
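
The text format described above is simple enough to parse without pandas, which can be handy for checking a file by hand. The following is a minimal sketch; the function name `read_word2vec_text` and the toy values in `sample` are made up for illustration.

```python
import io


def read_word2vec_text(lines):
    """Parse the word2vec text format into a dict mapping word -> list of floats."""
    lines = iter(lines)
    # Header line: "<vocabulary size> <embedding dimension>"
    n_words, n_dims = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate trailing blank lines
        parts = line.rstrip('\n').split(' ')
        word = parts[0]
        values = [float(x) for x in parts[1:] if x]
        if len(values) != n_dims:
            raise ValueError('expected {} values for {!r}, got {}'.format(
                n_dims, word, len(values)))
        vectors[word] = values
    if len(vectors) != n_words:
        raise ValueError('header promised {} words, found {}'.format(
            n_words, len(vectors)))
    return vectors


# Toy file with two 3-dimensional vectors (values invented for the example).
sample = io.StringIO('2 3\nthe/at 0.5 -1.0 0.25\nof/in -1.5 0.0 2.0\n')
vectors = read_word2vec_text(sample)
```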


#### Evaluating embeddings
As mentioned above, gensim's word2vec implementation has built-in functionality for assessing the embeddings. Although it won't always suit my purposes, I'm testing it here.

OK, so there's a mismatch between the word forms stored in the `KeyedVectors` object and the way the words are stored in the 'wordsim353.tsv' file included with gensim. In particular, the training data has POS tags attached to each word, whereas the wordsim dataset has just the bare word. I could find a workaround, but given how much other customization I want for evaluating embeddings, it's not worth it. Moreover, the sample test data included with gensim is in a particular format that the evaluation routines expect, and it's too restrictive for my purposes.

In [12]:
#import os
#embeddings.evaluate_word_pairs(os.path.join(gensim.__path__[0], 'test', 'test_data', 'wordsim353.tsv'))

Now I evaluate against the ws-353 data myself. I want one dataframe with word1, word2, empirical_similarity and model_similarity. Then it should be easy to use the pandas `corr` method, `scipy.stats.spearmanr` or plot the data. The similarity data already has the first three columns, so I just add the model_similarity to it.

In [21]:
path_to_ws353 = '../evaluate/data/ws-353/ws-353.csv'
ws353 = pd.read_csv(path_to_ws353)
ws353.head()

Unnamed: 0,word1,word2,similarity,which_set?,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,love,sex,6.77,set1,9.0,6.0,8.0,8.0,7.0,8.0,8.0,4.0,7.0,2.0,6.0,7.0,8.0,,,
1,tiger,cat,7.35,set1,9.0,7.0,8.0,7.0,8.0,9.0,8.5,5.0,6.0,9.0,7.0,5.0,7.0,,,
2,tiger,tiger,10.0,set1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,,,
3,book,paper,7.46,set1,8.0,8.0,7.0,7.0,8.0,9.0,7.0,6.0,7.0,8.0,9.0,4.0,9.0,,,
4,computer,keyboard,7.62,set1,8.0,7.0,9.0,9.0,8.0,8.0,7.0,7.0,6.0,8.0,10.0,3.0,9.0,,,


The problem is that the Brown corpus I trained on has POS on the end of the words, so that 'love' isn't just 'love', but 'love/nn' and 'love/vb'. There are two different approaches I see to handling this:
1. Always prefer one of the POSs (e.g. pretend 'love' always is 'love/nn').
2. Calculate similarity for each POS, then average.

The problem with 2 is that if word1 and word2 both have multiple POSs, then I have to calculate all possible pairwise similarities. For the time being, I'm going to go with 1. The first function `find_embedding` takes a word (without POS) and finds the column label for it in an embedding dataframe. Then `model_similarity` actually calculates the similarity.

In [52]:
def find_embedding(embeddings, w):
    """Helper function for finding the column label of w in embeddings.
    
    Implements the logic of the discussion above, namely that the Brown corpus
    has POS tags while the similarity data doesn't. This is hacky.
    """
    # Split on the last '/' only, so a token that itself contains '/' keeps its word form.
    relevant_columns = [c for c in embeddings.columns if c.rsplit('/', 1)[0] == w]
    assert len(relevant_columns) > 0, 'no embedding for {}'.format(w)
    if len(relevant_columns) == 1:
        column = relevant_columns[0]
    else:
        pos = [c.rsplit('/', 1)[-1] for c in relevant_columns]
        if 'nn' in pos:  # prefer the noun reading
            column = w + '/nn'
        elif 'vb' in pos:  # then the verb reading
            column = w + '/vb'
        else:  # otherwise fall back to the first match
            column = relevant_columns[0]
    return column

In [47]:
find_embedding(df, 'sugar')

'sugar/nn'
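
For comparison, option 2 from the list above (average over all POS pairings) could be sketched as follows. This is not what I use in the rest of the notebook; the function names are made up, the embeddings are assumed to be a plain dict from 'word/pos' labels to vectors rather than a dataframe, and the toy vectors are invented.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_pos_similarity(vectors, word1, word2):
    """Mean cosine similarity over every POS pairing of word1 and word2.

    Returns NaN if either word has no 'word/pos' entry at all.
    """
    keys1 = [k for k in vectors if k.rsplit('/', 1)[0] == word1]
    keys2 = [k for k in vectors if k.rsplit('/', 1)[0] == word2]
    if not keys1 or not keys2:
        return float('nan')
    sims = [cosine_similarity(vectors[a], vectors[b])
            for a, b in product(keys1, keys2)]
    return sum(sims) / len(sims)


# Toy example: 'love' has two POS entries, 'sex' has one.
toy = {'love/nn': [1.0, 0.0], 'love/vb': [0.0, 1.0], 'sex/nn': [1.0, 0.0]}
avg = average_pos_similarity(toy, 'love', 'sex')  # mean of 1.0 and 0.0
```

The number of pairings grows as the product of the two words' POS counts, which is the cost mentioned above.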

In [59]:
import numpy as np
from scipy.spatial.distance import cosine as cosine_dist

In [53]:
def model_similarity(embeddings, word1, word2):
    """
    Return the model's estimated similarity of word1 and word2.
    
    We can't use sklearn's pairwise cosine similarity because we 
    only want certain entries of that giant pairwise similarity matrix.
    
    Parameters
    ----------
    embeddings : pandas.DataFrame
        Of shape (num_dim, num_words)
    word1, word2 : str
    
    Returns
    -------
    float
        Cosine similarity, between -1 and 1
    """
    word1, word2 = find_embedding(embeddings, word1), find_embedding(embeddings, word2)
    v1, v2 = embeddings[word1], embeddings[word2]
    return 1 - cosine_dist(v1, v2)

In [57]:
model_similarity(df, 'love', 'sex')

0.93076247876265705

If either of the words from the similarity dataset does not appear in the training data (or was dropped for frequency reasons), then it won't have an embedding and we can't calculate the model's similarity. In these cases, I'll leave the model_similarity as NaN.

In [65]:
def evaluate_model(row):
    """
    Helper function for applying model_similarity to the dataframe with empirical judgements.
    
    Note that the estimated embedding matrix `df` is baked into this function, because `apply`
    expects a function of one argument (the row).
    """
    try:
        word1, word2 = row['word1'], row['word2']
        return model_similarity(df, word1, word2)
    except AssertionError: # either or both of the words are missing
        return np.nan

In [69]:
ws353['estimate_word2vec'] = ws353.apply(evaluate_model, axis=1)
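
One possible way to avoid baking `df` into the helper is to build the one-argument function with a closure. The sketch below is an alternative design, not what the cells above do; `make_evaluator` and `toy_similarity` are hypothetical names, and plain dicts stand in for dataframe rows so the example runs without pandas.

```python
import math


def make_evaluator(embeddings, similarity_fn):
    """Return a one-argument function suitable for `DataFrame.apply(..., axis=1)`."""
    def evaluate(row):
        try:
            return similarity_fn(embeddings, row['word1'], row['word2'])
        except (AssertionError, KeyError):  # one or both words are missing
            return float('nan')
    return evaluate


def toy_similarity(embeddings, word1, word2):
    """Stand-in for `model_similarity`: a plain dot product on toy vectors."""
    v1, v2 = embeddings[word1], embeddings[word2]  # KeyError if a word is absent
    return sum(a * b for a, b in zip(v1, v2))


toy_embeddings = {'tiger': [1.0, 2.0], 'cat': [3.0, 4.0]}
rows = [{'word1': 'tiger', 'word2': 'cat'},
        {'word1': 'tiger', 'word2': 'unicorn'}]

evaluate = make_evaluator(toy_embeddings, toy_similarity)
scores = [evaluate(row) for row in rows]
```

With the real data, the same pattern would read `ws353.apply(make_evaluator(df, model_similarity), axis=1)`.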

In [70]:
ws353.head()

Unnamed: 0,word1,word2,similarity,which_set?,1,2,3,4,5,6,...,8,9,10,11,12,13,14,15,16,estimate_word2vec
0,love,sex,6.77,set1,9.0,6.0,8.0,8.0,7.0,8.0,...,4.0,7.0,2.0,6.0,7.0,8.0,,,,0.930762
1,tiger,cat,7.35,set1,9.0,7.0,8.0,7.0,8.0,9.0,...,5.0,6.0,9.0,7.0,5.0,7.0,,,,0.797021
2,tiger,tiger,10.0,set1,10.0,10.0,10.0,10.0,10.0,10.0,...,10.0,10.0,10.0,10.0,10.0,10.0,,,,1.0
3,book,paper,7.46,set1,8.0,8.0,7.0,7.0,8.0,9.0,...,6.0,7.0,8.0,9.0,4.0,9.0,,,,0.934245
4,computer,keyboard,7.62,set1,8.0,7.0,9.0,9.0,8.0,8.0,...,7.0,6.0,8.0,10.0,3.0,9.0,,,,0.85897


Good, now I have the ws353 data with an additional column for the similarity estimated by word2vec (as trained above). Pandas's `corr` method gives the Spearman correlation but not the p-value, whereas `scipy.stats.spearmanr` gives both. `corr` skips rows with NaN on its own, while for `spearmanr` I drop them explicitly first. The two agree on the correlation.

In [81]:
ws353[['similarity', 'estimate_word2vec']].corr('spearman')

Unnamed: 0,similarity,estimate_word2vec
similarity,1.0,-0.035086
estimate_word2vec,-0.035086,1.0


In [82]:
from scipy.stats import spearmanr

In [83]:
ws353_without_nan = ws353[['similarity', 'estimate_word2vec']].dropna()
spearmanr(ws353_without_nan['similarity'], ws353_without_nan['estimate_word2vec'])

SpearmanrResult(correlation=-0.035086466354430301, pvalue=0.53693587748827087)
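
As a reminder of what both calls compute, Spearman's correlation is the Pearson correlation of the ranks. Here is a from-scratch sketch (function names are my own) applied to just the five rows shown in `ws353.head()` above, so the number it produces is only an illustration and not the result for the full dataset.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# The five pairs from ws353.head(): human ratings and word2vec estimates.
human = [6.77, 7.35, 10.0, 7.46, 7.62]
model = [0.930762, 0.797021, 1.0, 0.934245, 0.85897]
rho = spearman(human, model)  # 0.5 for these five pairs, up to floating point
```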

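With a correlation of about -0.035 (p ≈ 0.54), the model's similarities are essentially uncorrelated with the human judgements, so this baseline is not yet useful as is.

TODO: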
- Logging
- Training time
- Train on bigger data