## Baselines
The goal of this notebook is to have working implementations of two standard models for learning distributed representations of words: word2vec and GloVe.

### word2vec
There are a few existing implementations of word2vec. The [original code](https://code.google.com/archive/p/word2vec/) is available in C. The TensorFlow docs have a [good tutorial](https://www.tensorflow.org/tutorials/word2vec) with two versions of word2vec. However, I'm going with [gensim's](http://radimrehurek.com/gensim/models/word2vec.html) version, for the following reasons: 1) I'm confident it's correct, because the website for the original version lists it as a Python implementation, 2) it's [fast](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), 3) it fits nicely into my existing Python workflow in a way that the other options don't, and 4) it's used by other researchers. An easy tutorial showing how to use it is [here](https://rare-technologies.com/word2vec-tutorial/). To start, I'm going to train word2vec with default hyperparameters on an easy-to-use corpus.

In [87]:
import gensim
import nltk

#### Training corpus

Word2vec in gensim comes with several corpus classes for iterating over large corpora. It's straightforward to use your own corpus, but at first I'll use the pre-canned Brown corpus. The Brown corpus data come from NLTK; the pre-canned bit is gensim's class for iterating over it. Note that this class yields lowercased tokens with a shortened POS tag attached (e.g. `the/at`), which matters later when looking words up.

In [88]:
path_to_brown = nltk.data.find('corpora/brown').path
training_corpus = gensim.models.word2vec.BrownCorpus(path_to_brown)
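For the "use your own corpus" case, here is a minimal sketch of what such a corpus object could look like. gensim iterates over the corpus more than once (once to build the vocabulary, then once per epoch), so a plain generator isn't enough; the object has to be restartable. `LineCorpus` is a hypothetical helper of my own, not a gensim class, and it assumes a plain-text file with one whitespace-tokenized sentence per line.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable yielding one list of lowercased tokens per line.

    Hypothetical helper, not part of gensim.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("The cat sat\n\nOn the mat\n")
    demo_path = tmp.name

corpus = LineCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same sentences
os.remove(demo_path)
```

Something like `gensim.models.Word2Vec(LineCorpus('my_corpus.txt'))` should then train on it, where `my_corpus.txt` is a placeholder path.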

#### Options for training word2vec in gensim
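- `sg`: 0 for CBOW (default), 1 for skip-gram
- `size`: dimensionality of the word vectors (default 100; renamed `vector_size` in gensim 4)
- `window`: maximum distance between the current and the predicted word within a sentence (default 5)
- `alpha`: the initial learning rate (drops linearly to `min_alpha` as training progresses)
- `seed`: random seed. Fully reproducible runs also need `workers=1` and, in Python 3, a fixed `PYTHONHASHSEED`, because of thread scheduling and hash randomization.
- `min_count`: words with a total frequency below this are ignored (default 5)
- `max_vocab_size`: caps the vocabulary during vocabulary building, to limit RAM usage
- `workers`: number of worker threads
- `hs`: 1 for hierarchical softmax, 0 for negative sampling (default)
- `negative`: number of negative words to sample (default 5)
- `iter`: number of epochs (default 5; renamed `epochs` in gensim 4)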

In [28]:
model = gensim.models.Word2Vec(training_corpus, sg=0, size=100, window=5, min_count=1)
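To make the `alpha` / `min_alpha` option concrete, here is a small sketch of a linear learning-rate schedule. It illustrates the idea only and is not gensim's actual code; `linear_alpha` is a hypothetical name, and the defaults of 0.025 and 0.0001 are, to my knowledge, gensim's documented defaults for `alpha` and `min_alpha`.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training.

    Illustrative sketch of linear decay, not gensim's implementation.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return alpha - (alpha - min_alpha) * progress


# Learning rate at 0%, 25%, 50%, 75% and 100% of training.
schedule = [linear_alpha(step / 4) for step in range(5)]
```

The rate starts at `alpha`, ends at `min_alpha`, and decreases by the same amount at each equal step of progress.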

#### Accessing trained word embeddings
Gensim has a class, `KeyedVectors`, for storing word vectors in a read-only way. This is where gensim keeps all its functionality for assessing the vectors, such as accuracy on similarity/analogy datasets, most similar words, etc. Because it's a little restrictive, I don't know how much I'll use it. At the moment, I'd rather have the vectors as a pandas dataframe and work with custom assessment methods from there. Most importantly, I want to save them in a human-readable format.

In [69]:
embeddings = model.wv

In [70]:
embeddings['the/at']

array([ 1.11607516,  0.59781128, -0.69757777, -0.65440011, -0.95228773,
       -1.0239532 ,  0.24127391,  0.09813931, -0.07847317, -0.70193112,
        1.45279276, -0.531106  , -0.61621416, -0.69193256,  0.39697152,
        1.18247819, -0.13903154,  0.70073628, -0.71717876, -0.46977347,
       -0.29542673,  0.43563986,  1.03392553, -1.25428498, -1.47405887,
       -0.34138727, -0.07101707, -1.11960971,  0.54950446, -0.19559342,
       -0.60808772,  0.15087596,  0.71020287, -0.10059313,  1.01193798,
       -1.13246524, -1.53708541,  0.12542216, -0.09604352,  0.36703017,
       -0.72496635, -0.31052244, -0.15956955, -0.33531192, -0.01227573,
        0.97107279,  0.94628531, -0.28239197, -0.67198718,  0.04592762,
       -1.00949335,  0.00818446, -0.11548857,  1.28247559, -1.2748512 ,
       -0.99491751, -0.41967675,  0.90822142,  0.12440035,  0.36123386,
        1.02955735, -1.03806973, -1.23069584, -0.46346748, -1.26205599,
       -0.07961008,  0.42273575, -0.87002492,  0.47268146, -0.11

#### Saving the model
You can either save the whole model, which is good if you want to continue training it later, or just the word embeddings, which is best for my current purposes.

In [67]:
# To save the whole model:
#outfile = 'word2vec_model' # If model is large enough, gensim will actually write to multiple files
#model.save(outfile, pickle_protocol=3)

In [74]:
# To save just the embeddings
outfile = 'word2vec_embeddings'
embeddings.save_word2vec_format(outfile, binary=False)

In [76]:
!head -n 3 word2vec_embeddings

54294 100
the/at 1.116075 0.597811 -0.697578 -0.654400 -0.952288 -1.023953 0.241274 0.098139 -0.078473 -0.701931 1.452793 -0.531106 -0.616214 -0.691933 0.396972 1.182478 -0.139032 0.700736 -0.717179 -0.469773 -0.295427 0.435640 1.033926 -1.254285 -1.474059 -0.341387 -0.071017 -1.119610 0.549504 -0.195593 -0.608088 0.150876 0.710203 -0.100593 1.011938 -1.132465 -1.537085 0.125422 -0.096044 0.367030 -0.724966 -0.310522 -0.159570 -0.335312 -0.012276 0.971073 0.946285 -0.282392 -0.671987 0.045928 -1.009493 0.008184 -0.115489 1.282476 -1.274851 -0.994918 -0.419677 0.908221 0.124400 0.361234 1.029557 -1.038070 -1.230696 -0.463467 -1.262056 -0.079610 0.422736 -0.870025 0.472681 -0.110274 -0.076770 2.011114 0.029614 -1.823512 -0.534548 -0.650134 -0.535846 -0.229043 -0.600128 0.014348 0.585814 0.782936 0.048614 -0.340242 0.612882 1.461219 -0.196426 -0.186624 -0.442012 1.000661 -0.419634 0.692868 1.293931 -0.715964 0.220837 0.743135 0.809243 1.144204 0.413419 1.144291
of/in -0.909348 1.174378 

The format of the saved word embeddings is as follows: the first line is "vocabulary size, embedding dimension" (here `54294 100`). Every subsequent line is "word form w1 w2 ... wn", separated by single spaces. The format I want is a pandas dataframe with word forms as column labels and n rows for the n dimensions. I want it this way because it's easier to access columns in pandas than rows.
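To make the file format explicit, here is a minimal standard-library sketch of a parser for it. `read_word2vec_text` is a hypothetical helper of my own, not a gensim function, and the two-word sample is made up for illustration.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format into a dict mapping word -> list of floats.

    Hypothetical helper; expects a header line "vocab_size dim" followed by
    one "word v1 ... vn" line per word.
    """
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from the right so the last `dim` fields are the vector values.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError("malformed line: %r" % line)
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header promised %d words, found %d" % (vocab_size, len(vectors)))
    return vectors


# Made-up sample in the same layout as the saved file.
sample = io.StringIO("2 3\nthe/at 1.0 0.5 -0.25\nof/in -1.0 0.0 2.0\n")
vectors = read_word2vec_text(sample)
```

For the real file, pandas does the same job in one line, so this is mainly useful as a check on my understanding of the layout.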

In [78]:
import pandas as pd

In [85]:
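# Skip the "vocab_size dim" header line, index rows by word form,
# then transpose so that each word form becomes a column.
df = pd.read_csv('word2vec_embeddings', sep=' ', skiprows=[0], header=None, index_col=0).T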

In [86]:
df.head()

Unnamed: 0,the/at,of/in,and/cc,a/at,in/in,to/to,to/in,is/be,was/be,he/pp,...,fluke/nn,bilharziasis/nn,perelman/np,exhaling/vb,aviary/nn,olive-flushed/jj,cherokee/np,coral-colored/jj,boucle/nn,stupefying/vb
1,1.116075,-0.909348,0.485562,1.940079,0.352573,-0.850728,1.071784,-0.141232,1.891459,0.756157,...,0.023827,0.008465,0.012234,0.018724,0.029173,0.020771,0.011677,0.016597,0.027274,0.012702
2,0.597811,1.174378,1.462785,-0.302618,0.767421,1.633128,1.123505,0.046088,-0.11843,-0.727789,...,0.032008,0.006665,0.001928,0.01107,0.009561,0.025823,0.016011,0.028123,0.028053,0.010462
3,-0.697578,-0.276268,-0.440426,-0.409385,-0.687834,1.524905,0.076487,-1.336647,-0.718631,0.5079,...,0.005776,0.000674,0.003894,0.009084,0.000713,0.002589,0.013787,0.005831,0.009301,0.005674
4,-0.6544,-0.70025,0.178239,-0.839388,-0.088267,-0.513129,0.247282,-0.902483,0.033156,1.277415,...,0.003178,0.004775,0.001158,0.010147,-0.003418,0.019345,0.014266,0.00962,0.012125,0.00098
5,-0.952288,1.956809,-0.119881,1.773903,0.899235,-2.538697,-0.04815,2.393934,1.606475,-1.662192,...,0.005961,-0.001794,0.009305,-0.002214,-0.004434,0.015522,0.014296,0.012371,-0.003555,-0.004596


#### Evaluating embeddings
As mentioned above, gensim's word2vec implementation has built-in functionality for assessing the embeddings. Although it won't always suit my purposes, I'm testing it here.

OK, so there's a mismatch between the word forms stored in the `KeyedVectors` object and the way the words are stored in the 'wordsim353.tsv' file included with gensim. In particular, the training tokens have POS tags attached (e.g. `love/nn`), whereas the wordsim dataset has just the bare word. I could find a workaround, but given how much other customization I want for evaluating embeddings, it's not worth it. Moreover, the sample test data included with gensim is in a particular format that the evaluation routines expect, and that is too restrictive for my purposes.

In [108]:
#import os
#embeddings.evaluate_word_pairs(os.path.join(gensim.__path__[0], 'test', 'test_data', 'wordsim353.tsv'))

Now I evaluate against the ws-353 data myself.

In [111]:
path_to_ws353 = '../evaluate/data/ws-353/ws-353.csv'
ws353 = pd.read_csv(path_to_ws353)

In [112]:
ws353

Unnamed: 0,word1,word2,similarity,which_set?,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,love,sex,6.77,set1,9.0,6.00,8.0,8.0,7.0,8.0,8.0,4.0,7.0,2.0,6.0,7.0,8.0,,,
1,tiger,cat,7.35,set1,9.0,7.00,8.0,7.0,8.0,9.0,8.5,5.0,6.0,9.0,7.0,5.0,7.0,,,
2,tiger,tiger,10.00,set1,10.0,10.00,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,,,
3,book,paper,7.46,set1,8.0,8.00,7.0,7.0,8.0,9.0,7.0,6.0,7.0,8.0,9.0,4.0,9.0,,,
4,computer,keyboard,7.62,set1,8.0,7.00,9.0,9.0,8.0,8.0,7.0,7.0,6.0,8.0,10.0,3.0,9.0,,,
5,computer,internet,7.58,set1,8.0,6.00,9.0,8.0,8.0,8.0,7.5,7.0,7.0,7.0,9.0,5.0,9.0,,,
6,plane,car,5.77,set1,6.0,6.00,7.0,5.0,3.0,6.0,7.0,6.0,6.0,6.0,7.0,3.0,7.0,,,
7,train,car,6.31,set1,7.0,7.50,7.5,5.0,3.0,6.0,7.0,6.0,6.0,6.0,9.0,4.0,8.0,,,
8,telephone,communication,7.50,set1,7.0,6.50,8.0,8.0,6.0,8.0,8.0,7.0,5.0,9.0,9.0,8.0,8.0,,,
9,television,radio,6.77,set1,7.0,7.50,9.0,7.0,3.0,6.0,7.0,8.0,5.5,6.0,8.0,6.0,8.0,,,


I want one dataframe with the columns word1, word2, empirical_similarity and model_similarity. Then I can easily use `.corr`, scipy's Spearman correlation, and plotting.

In [119]:
[c for c in df.columns if c.split('/')[0] == 'love']

['love/nn', 'love/vb']
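Scanning all columns for every lookup gets slow with a vocabulary of this size, so one option is to build the bare-word index once. This is a sketch only: `index_by_bare_word` is a hypothetical helper, and it assumes every column label has the form `word/tag`.

```python
from collections import defaultdict


def index_by_bare_word(tagged_words):
    """Map each bare word to the list of its tagged forms, e.g. 'love' -> ['love/nn', 'love/vb']."""
    index = defaultdict(list)
    for tagged in tagged_words:
        # Everything before the last '/' is the word; fall back to the whole
        # label if there is no tag at all.
        bare = tagged.rpartition("/")[0] or tagged
        index[bare].append(tagged)
    return dict(index)


# With the real data this would be index_by_bare_word(df.columns).
tag_index = index_by_bare_word(["love/nn", "love/vb", "tiger/nn"])
love_forms = tag_index.get("love", [])
missing_forms = tag_index.get("keyboard", [])  # out-of-vocabulary word
```

Returning a plain dict means an out-of-vocabulary word has to be handled explicitly, here with `.get(word, [])`.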

In [120]:
from scipy.spatial.distance import cosine as cosine_dist

In [121]:
def model_similarity(embeddings, word1, word2):
    """Return the model's estimated similarity of word1 and word2"""
    v1, v2 = embeddings[word1], embeddings[word2]
    return 1 - cosine_dist(v1, v2)

In [125]:
model_similarity(df, 'sugar/nn', 'approach/nn')

0.83831189699831432
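Putting the pieces together, here is a rough sketch of the comparison table I have in mind. Several things in it are assumptions rather than settled choices: a bare word is scored by taking the maximum similarity over all of its tagged forms, out-of-vocabulary pairs are left as NaN and dropped before correlating, and cosine similarity is computed directly with numpy instead of `scipy.spatial.distance.cosine`. The toy embedding values are invented; the three in-vocabulary ratings are copied from the ws-353 rows shown above.

```python
import numpy as np
import pandas as pd


def cosine_similarity(v1, v2):
    """Cosine similarity of two vectors (numpy stand-in for 1 - scipy cosine distance)."""
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


def best_tagged_similarity(embeddings, word1, word2):
    """Max similarity over all tagged forms of both words; NaN if either is missing.

    Taking the maximum is one possible policy, assumed here for illustration.
    """
    forms1 = [c for c in embeddings.columns if c.split("/")[0] == word1]
    forms2 = [c for c in embeddings.columns if c.split("/")[0] == word2]
    scores = [cosine_similarity(embeddings[a], embeddings[b])
              for a in forms1 for b in forms2]
    return max(scores) if scores else float("nan")


# Toy stand-in for `df`: invented 3-dimensional vectors, one column per tagged form.
toy_embeddings = pd.DataFrame({
    "tiger/nn": [1.0, 0.0, 0.0],
    "cat/nn": [0.9, 0.1, 0.0],
    "book/nn": [0.0, 1.0, 0.0],
    "book/vb": [0.0, 0.8, 0.6],
    "paper/nn": [0.1, 0.9, 0.2],
    "plane/nn": [0.5, 0.5, 0.7],
    "car/nn": [0.7, 0.2, 0.6],
})

# Toy stand-in for `ws353`; computer/keyboard is deliberately out of vocabulary.
toy_pairs = pd.DataFrame({
    "word1": ["tiger", "book", "plane", "computer"],
    "word2": ["cat", "paper", "car", "keyboard"],
    "similarity": [7.35, 7.46, 5.77, 7.62],
})

results = toy_pairs.rename(columns={"similarity": "empirical_similarity"})
results["model_similarity"] = [
    best_tagged_similarity(toy_embeddings, w1, w2)
    for w1, w2 in zip(results["word1"], results["word2"])
]

# Spearman correlation is the Pearson correlation of the ranks.
scored = results.dropna(subset=["model_similarity"])
spearman = scored["empirical_similarity"].rank().corr(scored["model_similarity"].rank())
```

With the real data, `toy_embeddings` would be `df` and `toy_pairs` would be the relevant columns of `ws353`. How many pairs get dropped as out-of-vocabulary is worth reporting alongside the correlation.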