# Evaluating Pre-trained Word Embeddings - Extended results

This notebook contains the extended results on word embeddings evaluation.
Please see the `word_embeddings_evaluation.ipynb` notebook in the `docs/examples` directory for aggregated results.

In [1]:
import pandas as pd
pd.options.display.max_rows = 999

df = pd.read_table("results-vocablimit.csv", header=None, names=[
    "evaluation_type", "dataset", "kwargs", "embedding_name",
    "embedding_source", "evaluation", "value", "num_samples"
])


def get_multi_index_highlighter(levels=[0, 1]):
    '''Return a pandas DataFrame highlighter function for MultiIndices.
    
    The multi_index_highlighter returned will operate independently
    on all subsets of rows per unique index along the specified levels.
    
    '''
    def multi_index_highlighter(s):
        colors = []
        for key, _ in s.groupby(level=levels):
            is_max = s.loc[key] == s.loc[key].max()
            colors += ['background-color: yellow' if v else '' for v in is_max]
        return colors
    return multi_index_highlighter

## Similarity task
We can see that performance varies between the different embeddings on the different datasets.

Please see the [API page](http://gluon-nlp.mxnet.io/api/data.html#word-embedding-evaluation-datasets) for more information about the respective datasets.
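The `value` column is not defined in this notebook. Assuming it is a rank correlation between the model's cosine similarities and the human similarity ratings (the usual metric for word similarity datasets), the computation can be sketched as follows. The helper names, toy vectors, and toy ratings are made up for illustration and are not taken from the evaluated embeddings or datasets.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical 2-d embeddings and human ratings, for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.2, 0.8],
}
toy_pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5),
             ("cat", "car", 1.0), ("dog", "truck", 2.0)]

model_scores = [cosine_similarity(toy_vectors[w1], toy_vectors[w2])
                for w1, w2, _ in toy_pairs]
human_scores = [rating for _, _, rating in toy_pairs]
toy_correlation = spearman(model_scores, human_scores)
```

In this toy setup the model orders the four pairs exactly as the human ratings do, so the correlation is at its maximum. Real embeddings disagree with the human ordering on some pairs, which lowers the score.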

In [2]:
dfs = df[~df["dataset"].isin(["BiggerAnalogyTestSet", "GoogleAnalogyTestSet"])].drop(["evaluation_type", "evaluation", "num_samples"], axis=1)
dfs = dfs[dfs["embedding_source"].isin([
    "glove.42B.300d",
    "glove.6B.100d",
    "glove.6B.200d",
    "glove.6B.300d",
    "glove.6B.50d",
    "glove.840B.300d",
    "glove.twitter.27B.100d",
    "glove.twitter.27B.200d",
    "glove.twitter.27B.25d",
    "glove.twitter.27B.50d",
    "wiki.en",
    "wiki.simple",
    "crawl-300d-2M",
    "wiki-news-300d-1M"
])]

dfsi = dfs.set_index(["dataset", "kwargs", "embedding_name", "embedding_source"])
dfsi = dfsi.sort_values(by='value', ascending=False).sort_index(level=[0,1], sort_remaining=False)
dfsi.style.apply(get_multi_index_highlighter(levels=[0,1]))
# To get the html representation of the rendered table, call `render()` on the Styler

dataset,kwargs,embedding_name,embedding_source,value
BakerVerb143,{},fasttext,wiki-news-300d-1M,0.462748
BakerVerb143,{},fasttext,crawl-300d-2M,0.447303
BakerVerb143,{},fasttext,wiki.en,0.397307
BakerVerb143,{},fasttext,wiki.simple,0.343885
BakerVerb143,{},glove,glove.840B.300d,0.341463
BakerVerb143,{},glove,glove.42B.300d,0.327522
BakerVerb143,{},glove,glove.6B.300d,0.305143
BakerVerb143,{},glove,glove.6B.100d,0.302301
BakerVerb143,{},glove,glove.6B.200d,0.284473
BakerVerb143,{},glove,glove.6B.50d,0.250317


## Analogy task
For the analogy task, we report the results per category in the dataset.
Note that the analogy task is an open-vocabulary task: given a query of 3 words, we ask the model to select a 4th word from the whole vocabulary. Different pre-trained embeddings have vocabularies of different sizes. In general, embeddings pre-trained on more tokens (indicated by a bigger number before the **B** in the embedding source name) have larger vocabularies. While training embeddings on more tokens improves their quality, the larger vocabulary also makes the analogy task harder.
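The `evaluation` column in the analogy tables below takes the values `threecosadd` and `threecosmul`. Assuming these refer to the 3CosAdd and 3CosMul objectives described by Levy and Goldberg (2014), the selection of the 4th word can be sketched as follows. The function names, the toy vectors, and the `eps` value are illustrative assumptions; this is not the GluonNLP implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def three_cos_add(vectors, a, a_star, b):
    """Answer a : a_star :: b : ? by maximizing
    cos(c, a_star) - cos(c, a) + cos(c, b) over the vocabulary."""
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(
        candidates,
        key=lambda c: (cosine(vectors[c], vectors[a_star])
                       - cosine(vectors[c], vectors[a])
                       + cosine(vectors[c], vectors[b])))


def three_cos_mul(vectors, a, a_star, b, eps=1e-3):
    """Answer a : a_star :: b : ? by maximizing
    cos(c, a_star) * cos(c, b) / (cos(c, a) + eps),
    with cosines shifted to [0, 1] so the ratio stays non-negative."""
    def shifted(u, v):
        return (cosine(u, v) + 1) / 2

    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(
        candidates,
        key=lambda c: (shifted(vectors[c], vectors[a_star])
                       * shifted(vectors[c], vectors[b])
                       / (shifted(vectors[c], vectors[a]) + eps)))


# Hypothetical 3-d embeddings, for illustration only.
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [0.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.1, 0.1, -1.0],
}

answer_add = three_cos_add(toy_vectors, "man", "woman", "king")
answer_mul = three_cos_mul(toy_vectors, "man", "woman", "king")
```

Both functions search every remaining word in the vocabulary, which is why a larger vocabulary adds more distractors and makes the task harder.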

In this experiment, **all results are reported after reducing the vocabulary to the 300k most frequent tokens**. Questions containing out-of-vocabulary words are ignored.
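The vocabulary restriction and the handling of out-of-vocabulary questions can be sketched as follows. This is a minimal illustration that assumes the tokens are available as a list sorted from most to least frequent; the function names and the toy data are hypothetical and do not come from the evaluation script.

```python
def limit_vocabulary(tokens_by_frequency, max_size):
    """Keep only the `max_size` most frequent tokens.

    Assumes `tokens_by_frequency` is sorted from most to least frequent.
    """
    return set(tokens_by_frequency[:max_size])


def filter_questions(questions, vocabulary):
    """Drop every analogy question that contains an out-of-vocabulary word."""
    return [q for q in questions if all(word in vocabulary for word in q)]


# Toy data: the experiment uses a limit of 300k tokens, here we use 5.
toy_tokens = ["the", "king", "queen", "man", "woman", "zyzzyva"]
toy_questions = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "king", "zyzzyva"),  # contains a word beyond the limit
]

toy_vocabulary = limit_vocabulary(toy_tokens, max_size=5)
kept_questions = filter_questions(toy_questions, toy_vocabulary)
```

Because ignored questions do not count towards the score, embeddings are compared on the questions they can answer in principle rather than penalized for vocabulary coverage.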

### Google Analogy Test Set
We first display the results on the **Google Analogy Test Set**.

- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient
  estimation of word representations in vector space. In Proceedings of
  the International Conference on Learning Representations (ICLR).

The Google Analogy Test Set contains the following categories.
All analogy questions per category follow the pattern specified by the category name.

In [3]:
import json
pd.Series(df[df["dataset"] == "GoogleAnalogyTestSet"]["kwargs"].unique()).apply(json.loads).apply(lambda x: x['category'])

0        capital-common-countries
1                   capital-world
2                        currency
3                   city-in-state
4                          family
5       gram1-adjective-to-adverb
6                  gram2-opposite
7               gram3-comparative
8               gram4-superlative
9        gram5-present-participle
10    gram6-nationality-adjective
11               gram7-past-tense
12                   gram8-plural
13             gram9-plural-verbs
dtype: object

We first load the results and then present a table of the performance that the different embeddings achieved on each category of the dataset. Performance varies widely between categories, and different embeddings perform best on different categories. This is due to the different training objectives and training datasets used, which induce different properties in the embeddings.

In [4]:
dfa_google = df[df["dataset"] == "GoogleAnalogyTestSet"].drop(["evaluation_type", "num_samples", "dataset"], axis=1)
dfa_google = dfa_google[dfa_google["embedding_source"].isin([
    "glove.42B.300d",
    "glove.6B.100d",
    "glove.6B.200d",
    "glove.6B.300d",
    "glove.6B.50d",
    "glove.840B.300d",
    "glove.twitter.27B.100d",
    "glove.twitter.27B.200d",
    "glove.twitter.27B.25d",
    "glove.twitter.27B.50d",
    "wiki.en",
    "wiki.simple",
    "crawl-300d-2M",
    "wiki-news-300d-1M"
])]
dfa_google["category"] = dfa_google["kwargs"].apply(json.loads).apply(lambda x: str(x['category']))
dfa_google.drop("kwargs", axis=1, inplace=True)

In [5]:
dfai = dfa_google.set_index(
    ["category", "embedding_name", "embedding_source", "evaluation"])
dfai = dfai.sort_values(by='value', ascending=False).sort_index(level=[0], sort_remaining=False)
dfai.style.apply(get_multi_index_highlighter(levels=[0]))

category,embedding_name,embedding_source,evaluation,value
capital-common-countries,glove,glove.6B.300d,threecosmul,0.954545
capital-common-countries,fasttext,wiki.en,threecosmul,0.950593
capital-common-countries,glove,glove.42B.300d,threecosadd,0.950593
capital-common-countries,glove,glove.6B.300d,threecosadd,0.948617
capital-common-countries,glove,glove.6B.200d,threecosadd,0.94664
capital-common-countries,fasttext,wiki.en,threecosadd,0.944664
capital-common-countries,glove,glove.6B.200d,threecosmul,0.944664
capital-common-countries,glove,glove.42B.300d,threecosmul,0.942688
capital-common-countries,glove,glove.6B.100d,threecosadd,0.938735
capital-common-countries,glove,glove.6B.100d,threecosmul,0.922925


### Bigger Analogy Test Set
We then display the results on the **Bigger Analogy Test Set (BATS)**.

- Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection
  of morphological and semantic relations with word embeddings: what works
  and what doesn’t. In Proceedings of the NAACL-HLT SRW (pp. 47–54). San
  Diego, California, June 12-17, 2016: ACL. Retrieved from
  https://www.aclweb.org/anthology/N/N16/N16-2002.pdf


Unlike the Google Analogy Test Set, BATS is balanced across 4 types of relations (inflectional morphology, derivational morphology, lexicographic semantics, encyclopedic semantics).
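Since BATS groups its categories into four relation types, per-category scores can be averaged per type. The sketch below assumes that the first letter of a category code (such as `D01`) identifies the relation type, following the BATS naming scheme. The mapping, the function name, and the accuracy values are illustrative assumptions and are not results from this notebook.

```python
from collections import defaultdict

# Assumed mapping from the first letter of a BATS category code
# to its relation type.
RELATION_TYPES = {
    "I": "inflectional morphology",
    "D": "derivational morphology",
    "L": "lexicographic semantics",
    "E": "encyclopedic semantics",
}


def mean_accuracy_by_relation_type(rows):
    """Average accuracies per relation type.

    `rows` is an iterable of (category_code, accuracy) pairs.
    """
    grouped = defaultdict(list)
    for code, accuracy in rows:
        grouped[RELATION_TYPES[code[0]]].append(accuracy)
    return {name: sum(values) / len(values)
            for name, values in grouped.items()}


# Made-up accuracies, for illustration only.
toy_rows = [("I01", 0.5), ("I02", 0.75), ("D01", 0.25),
            ("D02", 0.25), ("E01", 0.5), ("L01", 0.125)]
toy_means = mean_accuracy_by_relation_type(toy_rows)
```

Averaging per relation type makes it easier to see whether an embedding is stronger on morphological or on semantic relations than scanning the full per-category table.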

We first load the results for the BATS dataset:

In [6]:
dfa_bats = df[df["dataset"] == "BiggerAnalogyTestSet"].drop(["evaluation_type", "num_samples", "dataset"], axis=1)
dfa_bats = dfa_bats[dfa_bats["embedding_source"].isin([
    "glove.42B.300d",
    "glove.6B.100d",
    "glove.6B.200d",
    "glove.6B.300d",
    "glove.6B.50d",
    "glove.840B.300d",
    "glove.twitter.27B.100d",
    "glove.twitter.27B.200d",
    "glove.twitter.27B.25d",
    "glove.twitter.27B.50d",
    "wiki.en",
    "wiki.simple",
    "crawl-300d-2M",
    "wiki-news-300d-1M",
])]
dfa_bats["category"] = dfa_bats["kwargs"].apply(json.loads).apply(lambda x: str(x['category']))
dfa_bats.drop("kwargs", axis=1, inplace=True)

In [7]:
import gluonnlp as nlp
dfa_bats["category"] = dfa_bats["category"].apply(lambda x: str(x) + ": " + nlp.data.BiggerAnalogyTestSet._categories[x])
dfai = dfa_bats.set_index(
    ["category", "embedding_name", "embedding_source", "evaluation"])
dfai = dfai.sort_values(by='value', ascending=False).sort_index(level=[0], sort_remaining=False)
dfai.style.apply(get_multi_index_highlighter(levels=[0]))


category,embedding_name,embedding_source,evaluation,value
D01: [noun+less_reg],fasttext,wiki.simple,threecosmul,0.106531
D01: [noun+less_reg],fasttext,wiki.simple,threecosadd,0.0963265
D01: [noun+less_reg],fasttext,wiki.en,threecosmul,0.0669388
D01: [noun+less_reg],fasttext,crawl-300d-2M,threecosmul,0.0559184
D01: [noun+less_reg],fasttext,wiki.en,threecosadd,0.0428571
D01: [noun+less_reg],fasttext,wiki-news-300d-1M,threecosmul,0.0289796
D01: [noun+less_reg],fasttext,crawl-300d-2M,threecosadd,0.0232653
D01: [noun+less_reg],glove,glove.42B.300d,threecosmul,0.017551
D01: [noun+less_reg],fasttext,wiki-news-300d-1M,threecosadd,0.0159184
D01: [noun+less_reg],glove,glove.42B.300d,threecosadd,0.0114286
