# Evaluating Pre-trained Word Embeddings - Extended results

This notebook contains extended results for the evaluation of pre-trained word embeddings.

In [1]:
from __future__ import print_function

import pandas as pd
pd.options.display.max_rows = 999
pd.set_option('display.width', 1000)
import glob

# Column names of the headerless, tab-separated result files.
header = ["evaluation_type", "dataset", "kwargs", "evaluation", "value", "num_skipped"]

# Load one DataFrame per similarity result file.
similarity_dfs = []
similarity_names = []
similarity_glob = './results/similarity*'
for similarity_file in glob.glob(similarity_glob):
    df = pd.read_table(similarity_file, header=None, names=header).set_index(["dataset", "kwargs"]).drop(["evaluation_type"], axis=1)
    similarity_dfs.append(df)
    # Strip the prefix to get the embedding name. len() also counts the '*'
    # of the pattern, so the one character following the prefix is dropped too.
    similarity_names.append(similarity_file[len(similarity_glob):])

# Load one DataFrame per analogy result file; "evaluation" is part of the index here.
analogy_dfs = []
analogy_names = []
analogy_glob = './results/analogy*'
for analogy_file in glob.glob(analogy_glob):
    df = pd.read_table(analogy_file, header=None, names=header).set_index(["dataset", "kwargs", "evaluation"]).drop(["evaluation_type"], axis=1)
    analogy_dfs.append(df)
    analogy_names.append(analogy_file[len(analogy_glob):])

# Stack the per-embedding frames, adding the embedding name as an index level.
similarity_df = pd.concat(similarity_dfs, keys=similarity_names, names=['embedding']).reorder_levels(["dataset", "kwargs", "embedding"]).sort_index()
analogy_df = pd.concat(analogy_dfs, keys=analogy_names, names=['embedding']).reorder_levels(["dataset", "evaluation", "kwargs", "embedding"]).sort_index()
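
To make the index manipulation above easier to follow, here is a minimal sketch on toy data. The dataset name, embedding names, and values are made up for illustration; only the `pd.concat(..., keys=..., names=...)` and `reorder_levels` pattern mirrors the cell above.

```python
import pandas as pd


def toy_results(value):
    """Build a one-row frame indexed like a similarity result file (hypothetical data)."""
    index = pd.MultiIndex.from_tuples([("ToyDataset", "{}")], names=["dataset", "kwargs"])
    return pd.DataFrame(
        {"evaluation": ["cosinesimilarity"], "value": [value], "num_skipped": [0]},
        index=index,
    )


frames = [toy_results(0.4), toy_results(0.6)]
names = ["emb-a.tsv", "emb-b.tsv"]  # made-up embedding names

# keys= adds an outer "embedding" level; reorder_levels moves it innermost,
# so that all embeddings for one (dataset, kwargs) pair end up next to each other.
combined = (
    pd.concat(frames, keys=names, names=["embedding"])
    .reorder_levels(["dataset", "kwargs", "embedding"])
    .sort_index()
)
print(combined)
```

After this step, each `(dataset, kwargs)` group contains one row per embedding, which is what the `groupby` calls in the following cells rely on.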

## Similarity task
We can see that performance varies across the different embeddings and datasets.

Please see the [API page](http://gluon-nlp.mxnet.io/api/data.html#word-embedding-evaluation-datasets) for more information about the respective datasets.
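
As a rough illustration of what a similarity evaluation computes, the sketch below scores word pairs by the cosine similarity of their vectors and compares the scores with human ratings. We assume here that the reported `value` is a rank correlation (Spearman) between the two; that assumption, the helper names, the vectors, and the ratings are all illustrative and not taken from the evaluation script.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def spearman(x, y):
    """Spearman rank correlation for inputs without ties (Pearson on the ranks)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Made-up 3-d vectors and human ratings, for illustration only.
vectors = {
    "cat": np.array([1.0, 0.9, 0.0]),
    "dog": np.array([0.9, 1.0, 0.1]),
    "car": np.array([0.0, 0.1, 1.0]),
    "truck": np.array([0.1, 0.0, 0.9]),
}
pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.5), ("dog", "truck", 2.0)]

predicted = [cosine_similarity(vectors[a], vectors[b]) for a, b, _ in pairs]
human = [score for _, _, score in pairs]
print(spearman(predicted, human))
```

A rank correlation only compares the ordering of the pairs, so the embedding is not penalized for producing similarities on a different scale than the human ratings.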

In [2]:
for (dataset, kwargs), df in similarity_df.groupby(level=[0,1]):
    print('Performance on', dataset, kwargs)
    print(df.loc[dataset, kwargs].sort_values(by='value', ascending=False))
    print()
    print()

Performance on BakerVerb143 {}
                                              evaluation     value  num_skipped
embedding                                                                      
fasttext-wiki-news-300d-1M.tsv          cosinesimilarity  0.462748            0
fasttext-crawl-300d-2M.tsv              cosinesimilarity  0.447303            0
fasttext-wiki-news-300d-1M-subword.tsv  cosinesimilarity  0.424932            0
fasttext-wiki.en.tsv                    cosinesimilarity  0.397307            0
fasttext-wiki.simple.tsv                cosinesimilarity  0.343885            0
glove-glove.840B.300d.tsv               cosinesimilarity  0.341463            0
glove-glove.42B.300d.tsv                cosinesimilarity  0.327522            0
glove-glove.6B.300d.tsv                 cosinesimilarity  0.305143            0
glove-glove.6B.100d.tsv                 cosinesimilarity  0.302301            0
glove-glove.6B.200d.tsv                 cosinesimilarity  0.284473            0
glove-glo

## Analogy task
For the analogy task, we report the results per category in the dataset.
Note that the analogy task is an open-vocabulary task: given a query of three words, the model must select a fourth word from the whole vocabulary. Different pre-trained embeddings have vocabularies of different sizes. In general, embeddings pre-trained on more tokens (indicated by a larger number before the **B** in the embedding source name) have a larger vocabulary. While training embeddings on more tokens improves their quality, the larger vocabulary also makes the analogy task harder.

In this experiment, **all results are reported after reducing the vocabulary to the 300k most frequent tokens**. Questions containing out-of-vocabulary (OOV) words are ignored.
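
The vocabulary restriction and the handling of OOV questions can be sketched as follows. This is a minimal illustration, assuming the token list is already sorted by descending frequency; the function names and the toy data are hypothetical and not part of the evaluation script.

```python
def restrict_vocabulary(tokens_by_frequency, max_size):
    """Keep the max_size most frequent tokens (input assumed sorted by frequency)."""
    return set(tokens_by_frequency[:max_size])


def split_questions(questions, vocabulary):
    """Return (kept, num_skipped), dropping every question that contains an OOV word."""
    kept = [q for q in questions if all(word in vocabulary for word in q)]
    return kept, len(questions) - len(kept)


# Toy data: the last token is the rarest and falls outside a vocabulary of size 5.
tokens = ["the", "king", "queen", "man", "woman", "zyzzyva"]
vocabulary = restrict_vocabulary(tokens, max_size=5)

questions = [
    ("man", "king", "woman", "queen"),
    ("man", "king", "woman", "zyzzyva"),
]
kept, num_skipped = split_questions(questions, vocabulary)
print(len(kept), num_skipped)
```

Accuracy is then computed over the kept questions only, so the number of skipped questions should be read alongside the score when comparing embeddings.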

### Google Analogy Test Set
We first display the results on the **Google Analogy Test Set**.

- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient
  estimation of word representations in vector space. In Proceedings of
  the International Conference on Learning Representations (ICLR).

The Google Analogy Test Set is split into the categories listed in the output below.
All analogy questions in a category follow the pattern specified by the category name.

We first present the results using the `threecosmul` analogy function.
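
For reference, here is a minimal sketch of the multiplicative analogy objective commonly called 3CosMul, for a question of the form "a is to a* as b is to ?". It assumes cosine similarities are shifted to the range [0, 1] and that the three query words are excluded from the candidates; the function name `three_cos_mul`, the `eps` value, and the toy vectors are illustrative assumptions, not the implementation used to produce the results.

```python
import numpy as np


def three_cos_mul(embeddings, words, a, a_star, b, eps=1e-3):
    """Answer 'a : a_star :: b : ?' by maximizing cos(x, b) * cos(x, a_star) / (cos(x, a) + eps)."""
    index = {word: i for i, word in enumerate(words)}
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    def shifted_cos(word):
        # Cosine similarity of every vocabulary word to `word`, mapped to [0, 1].
        return (unit @ unit[index[word]] + 1.0) / 2.0

    scores = shifted_cos(b) * shifted_cos(a_star) / (shifted_cos(a) + eps)
    # Assumption: the query words themselves are not valid answers.
    for word in (a, a_star, b):
        scores[index[word]] = -np.inf
    return words[int(np.argmax(scores))]


# Made-up 3-d vectors: (constant, royalty, gender).
words = ["man", "woman", "king", "queen", "apple"]
E = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, -1.0],
    [1.0, -1.0, 0.0],
])
print(three_cos_mul(E, words, "man", "king", "woman"))
```

Because the candidate set is the whole vocabulary, every additional token is a potential distractor, which is why a larger vocabulary makes the task harder.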

In [3]:
for kwargs, df in analogy_df.loc['GoogleAnalogyTestSet', 'threecosmul'].groupby(level=0):
    print(kwargs)
    print(df.loc[kwargs].sort_values(by='value', ascending=False))
    print()
    print()

{"category": "capital-common-countries"}
                                           value  num_skipped
embedding                                                    
glove-glove.6B.300d.tsv                 0.954545            0
fasttext-wiki.en.tsv                    0.950593            0
glove-glove.6B.200d.tsv                 0.944664            0
glove-glove.42B.300d.tsv                0.942688            0
glove-glove.6B.100d.tsv                 0.922925            0
fasttext-crawl-300d-2M.tsv              0.821429           86
glove-glove.840B.300d.tsv               0.820158            0
fasttext-wiki-news-300d-1M.tsv          0.747619          296
glove-glove.6B.50d.tsv                  0.707510            0
glove-glove.twitter.27B.200d.tsv        0.695652            0
fasttext-wiki-news-300d-1M-subword.tsv  0.685714          296
fasttext-wiki.simple.tsv                0.527668            0
glove-glove.twitter.27B.100d.tsv        0.474308            0
glove-glove.twitter.27B.50d.t

We then present the results using the `threecosadd` analogy function.
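
For comparison, the additive objective commonly called 3CosAdd can be sketched as below: for "a is to a* as b is to ?", it picks the word whose vector is closest in cosine similarity to `b - a + a*`. As before, the function name `three_cos_add`, the exclusion of the query words, and the toy vectors are illustrative assumptions rather than the code behind the reported numbers.

```python
import numpy as np


def three_cos_add(embeddings, words, a, a_star, b):
    """Answer 'a : a_star :: b : ?' by maximizing cos(x, b - a + a_star)."""
    index = {word: i for i, word in enumerate(words)}
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target = unit[index[b]] - unit[index[a]] + unit[index[a_star]]
    # Rows of `unit` have length 1, so the dot product ranks candidates
    # exactly like the cosine similarity to `target` would.
    scores = unit @ target
    # Assumption: the query words themselves are not valid answers.
    for word in (a, a_star, b):
        scores[index[word]] = -np.inf
    return words[int(np.argmax(scores))]


# Made-up 3-d vectors: (constant, royalty, gender).
words = ["man", "woman", "king", "queen", "apple"]
E = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, -1.0],
    [1.0, -1.0, 0.0],
])
print(three_cos_add(E, words, "man", "king", "woman"))
```

The additive form sums the three similarity terms, so a single large term can dominate the score; the multiplicative form balances them, which is one reason the two functions can rank embeddings differently.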

In [4]:
for kwargs, df in analogy_df.loc['GoogleAnalogyTestSet', 'threecosadd'].groupby(level=0):
    print(kwargs)
    print(df.loc[kwargs].sort_values(by='value', ascending=False))
    print()
    print()

{"category": "capital-common-countries"}
                                           value  num_skipped
embedding                                                    
glove-glove.42B.300d.tsv                0.950593            0
glove-glove.6B.300d.tsv                 0.948617            0
glove-glove.6B.200d.tsv                 0.946640            0
fasttext-wiki.en.tsv                    0.944664            0
glove-glove.6B.100d.tsv                 0.938735            0
fasttext-crawl-300d-2M.tsv              0.804762           86
glove-glove.840B.300d.tsv               0.800395            0
glove-glove.6B.50d.tsv                  0.792490            0
fasttext-wiki-news-300d-1M.tsv          0.761905          296
glove-glove.twitter.27B.200d.tsv        0.705534            0
fasttext-wiki-news-300d-1M-subword.tsv  0.704762          296
glove-glove.twitter.27B.100d.tsv        0.533597            0
fasttext-wiki.simple.tsv                0.397233            0
glove-glove.twitter.27B.50d.t

### Bigger Analogy Test Set
We then display the results on the **Bigger Analogy Test Set (BATS)**.

- Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection
  of morphological and semantic relations with word embeddings: what works
  and what doesn’t. In Proceedings of the NAACL-HLT SRW (pp. 47–54). San
  Diego, California, June 12-17, 2016: ACL. Retrieved from
  https://www.aclweb.org/anthology/N/N16/N16-2002.pdf


Unlike the Google Analogy Test Set, BATS is balanced across 4 types of relations (inflectional morphology, derivational morphology, lexicographic semantics, encyclopedic semantics).

We first present the results for the `threecosadd` analogy function.

In [5]:
for kwargs, df in analogy_df.loc['BiggerAnalogyTestSet', 'threecosadd'].groupby(level=0):
    print(kwargs)
    print(df.loc[kwargs].sort_values(by='value', ascending=False))
    print()
    print()

{"category": "D01"}
                                           value  num_skipped
embedding                                                    
fasttext-wiki.simple.tsv                0.452569         1944
fasttext-wiki-news-300d-1M-subword.tsv  0.121212          470
fasttext-wiki.en.tsv                    0.055497          558
fasttext-crawl-300d-2M.tsv              0.030127          558
glove-glove.twitter.27B.200d.tsv        0.021739         1944
fasttext-wiki-news-300d-1M.tsv          0.019697          470
glove-glove.840B.300d.tsv               0.018272          644
glove-glove.twitter.27B.100d.tsv        0.013834         1944
glove-glove.twitter.27B.50d.tsv         0.013834         1944
glove-glove.42B.300d.tsv                0.013527          380
glove-glove.6B.200d.tsv                 0.004831          380
glove-glove.6B.300d.tsv                 0.004831          380
glove-glove.6B.100d.tsv                 0.002415          380
glove-glove.twitter.27B.25d.tsv         0.001976  

We then present the results for the `threecosmul` analogy function.

In [6]:
for kwargs, df in analogy_df.loc['BiggerAnalogyTestSet', 'threecosmul'].groupby(level=0):
    print(kwargs)
    print(df.loc[kwargs].sort_values(by='value', ascending=False))
    print()
    print()

{"category": "D01"}
                                           value  num_skipped
embedding                                                    
fasttext-wiki.simple.tsv                0.501976         1944
fasttext-wiki-news-300d-1M-subword.tsv  0.161111          470
fasttext-wiki.en.tsv                    0.085624          558
fasttext-crawl-300d-2M.tsv              0.072410          558
fasttext-wiki-news-300d-1M.tsv          0.035859          470
glove-glove.840B.300d.tsv               0.032669          644
glove-glove.42B.300d.tsv                0.020773          380
glove-glove.twitter.27B.200d.tsv        0.019763         1944
glove-glove.twitter.27B.100d.tsv        0.009881         1944
glove-glove.6B.200d.tsv                 0.002899          380
glove-glove.6B.300d.tsv                 0.002899          380
glove-glove.6B.100d.tsv                 0.000966          380
glove-glove.6B.50d.tsv                  0.000000          380
glove-glove.twitter.27B.25d.tsv         0.000000  