# Evaluation of unbiased word embeddings on similarity and analogy benchmarks

1. [1] [GloVe](https://nlp.stanford.edu/projects/glove/): Global Vectors for Word Representation, from Stanford.
1. [2] [GN-GloVe](https://arxiv.org/abs/1809.01496): Learning Gender-Neutral Word Embeddings, trained on Wikipedia. The embeddings contain a gender neutral 
1. [3] [Word2Vec](https://code.google.com/archive/p/word2vec/): the original Word2Vec embeddings (loaded below as `word2vec-google-news-300`).
1. [4] [Debiaswe](https://github.com/tolga-b/debiaswe): hard-debiased Word2Vec.

## Benchmarks
We will evaluate the embeddings using some of the datasets listed in this paper: [A Survey of Word Embeddings Evaluation Methods](https://arxiv.org/abs/1801.09536)

### Word similarities
1. [SimVerb-3500](https://arxiv.org/abs/1608.00869): 3,500 verb pairs rated for semantic similarity; the scores in the distributed file are on a 0 to 10 scale
1. [SimLex-999](https://fh295.github.io/simlex.html): 999 pairs rated strictly for semantic similarity (as opposed to relatedness), on a scale from 0 to 10
1. [MEN](https://staff.fnwi.uva.nl/e.bruni/MEN): 3,000 pairs rated for semantic relatedness on a discrete scale from 0 to 50
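
A similarity benchmark is typically scored by correlating the model's cosine similarities with the human ratings. The sketch below illustrates that idea with a rank (Spearman) correlation in plain Python. The 2-d vectors and the ratings are made up for illustration, and the helper names (`cosine`, `ranks`, `spearman`) are ours, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy vectors and human ratings, for illustration only.
toy_vectors = {
    "happy": [0.9, 0.1], "cheerful": [0.8, 0.2],
    "hard": [0.1, 0.9], "easy": [0.3, 0.7],
    "old": [0.5, -0.5], "new": [0.4, 0.6],
}
toy_pairs = [("happy", "cheerful", 9.5), ("hard", "easy", 4.0), ("old", "new", 1.5)]

model_scores = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in toy_pairs]
human_scores = [score for _, _, score in toy_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation on the toy pairs: {rho:.3f}")
```

Because only the ranking matters for Spearman correlation, benchmarks with different rating scales (0 to 10, 0 to 50) can be compared without rescaling.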

### Word analogies
1. [Google Analogy](https://arxiv.org/abs/1301.3781): Contains 19,544 question pairs (8,869 semantic and 10,675 syntactic questions)
1. [MSR](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rvecs.pdf) (Microsoft Research Syntactic Analogies): 8,000 questions divided into 16 morphological classes
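
An analogy question "a is to b as c is to ?" is commonly answered by taking the word whose vector is closest to `b - a + c`, excluding the three query words, and the benchmark score is the fraction of questions answered correctly. The following is a minimal sketch of that procedure on made-up 3-d vectors; `solve_analogy` and the toy vocabulary are hypothetical and only meant to show the mechanics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding a, b and c themselves."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], query))

# Hypothetical toy vectors: the second dimension encodes the degree of comparison.
toy_vectors = {
    "good": [1.0, 0.0, 0.0], "better": [1.0, 1.0, 0.0], "best": [1.0, 2.0, 0.0],
    "rough": [0.0, 0.0, 1.0], "rougher": [0.0, 1.0, 1.0], "roughest": [0.0, 2.0, 1.0],
}

# Each question is (a, b, c, expected answer), as in the MSR and Google files.
questions = [
    ("good", "better", "rough", "rougher"),
    ("good", "best", "rough", "roughest"),
    ("best", "better", "roughest", "rougher"),
]

n_correct = sum(solve_analogy(toy_vectors, a, b, c) == target
                for a, b, c, target in questions)
accuracy = n_correct / len(questions)
print(f"Accuracy on the toy questions: {accuracy:.2f}")
```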

## Reproducibility
During our research, some embeddings and datasets became unavailable at their original URLs. To keep our experiments reproducible, we stored copies on [Renku](https://renkulab.io/projects/samuel.betrisey/embedding-classical-evaluation/datasets). The original URLs are kept as comments in the code below.
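
Mirrored files can additionally be checked against a digest recorded when the copy was made. This is not something the notebook does; the sketch below only shows one way it could be done with the standard library. `sha256_of` and `verify` are hypothetical helper names, and the demo file stands in for a downloaded dataset.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    """Return True when the file at `path` hashes to `expected_hex`."""
    return sha256_of(path) == expected_hex.lower()

# Demo on a temporary file standing in for a downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.txt"
    demo_file.write_bytes(b"take\tremove\t6.81\n")
    recorded_digest = sha256_of(demo_file)   # would be stored alongside the copy
    demo_ok = verify(demo_file, recorded_digest)
    demo_tampered = verify(demo_file, "0" * 64)

print(demo_ok, demo_tampered)
```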


In [None]:
!pip install gensim==4.1.2 gdown



In [None]:
import gensim.downloader
from gensim.models import KeyedVectors
from pandas import read_csv, DataFrame
import requests
from io import BytesIO
from zipfile import ZipFile
from pathlib import Path
import gdown
import os

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

In [None]:
paths = {
    "GloVe_Stanford" : "glove.6B.300d.txt",
    "Unbiased_GloVe" : "1b-vectors300-0.8-0.8.txt",
    "Unbiased_Word2Vec" : "GoogleNews-vectors-negative300-hard-debiased.bin.gz", 
    'Word2Vec' : None # Downloaded directly from gensim
}

if not (data_dir / paths['GloVe_Stanford']).exists():
    #zip_file = gdown.download("http://nlp.stanford.edu/data/glove.6B.zip")
    #ZipFile(zip_file).extract(paths['GloVe_Stanford'], "data")
    #os.remove(zip_file)
    gdown.download("https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/embeddings/glove.6B.300d.txt",
                   output=str(data_dir / paths["GloVe_Stanford"]))
glove_vectors = KeyedVectors.load_word2vec_format(data_dir / paths['GloVe_Stanford'], binary=False, no_header=True)

# gender neutral glove from https://github.com/uclanlp/gn_glove
if not (data_dir / paths['Unbiased_GloVe']).exists():
    #zip_file = gdown.download("https://drive.google.com/uc?id=1g1QPqbIlQorwlfGShtPbZVk6mfwodQgE")
    #ZipFile(zip_file).extract(paths['Unbiased_GloVe'], "data")
    #os.remove(zip_file)
    gdown.download("https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/embeddings/1b-vectors300-0.8-0.8.txt",
                   output=str(data_dir / paths["Unbiased_GloVe"]))
unbiased_glove_vectors = KeyedVectors.load_word2vec_format(data_dir / paths['Unbiased_GloVe'], binary=False, no_header=True)

gensim.downloader.DOWNLOAD_BASE_URL = "https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/embeddings"
word2vec_vectors = gensim.downloader.load("word2vec-google-news-300")

# https://github.com/tolga-b/debiaswe
if not (data_dir / paths['Unbiased_Word2Vec']).exists():
    #gdown.download("https://drive.google.com/uc?id=0B5vZVlu2WoS5ZTBSekpUX0RSNDg", output=str(data_dir / paths["Unbiased_Word2Vec"]))
    gdown.download("https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/embeddings/GoogleNews-vectors-negative300-hard-debiased.bin.gz",
                   output=str(data_dir / paths["Unbiased_Word2Vec"]))
unbiased_word2vec_vectors = KeyedVectors.load_word2vec_format(data_dir / paths['Unbiased_Word2Vec'], binary=True)
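
The two GloVe files are loaded with `no_header=True` because GloVe text files start directly with `word v1 v2 ...` rows, whereas the word2vec text format begins with a `vocab_size dimension` line. As a small illustration of the difference, the sketch below derives that header line from a GloVe-style file. `glove_header` is a hypothetical helper and the two-word file is made up.

```python
import tempfile
from pathlib import Path

def glove_header(path):
    """Return the 'vocab_size dimension' line that word2vec text format expects."""
    count = 0
    dimension = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            if dimension is None:
                # The first token is the word, the rest are vector components.
                dimension = len(parts) - 1
            count += 1
    return f"{count} {dimension}"

# Demo on a tiny GloVe-style file with two 3-d vectors (made-up values).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "toy_glove.txt"
    demo_file.write_text("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n", encoding="utf-8")
    header = glove_header(demo_file)

print(header)
```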


In [None]:
MODELS = {"GloVe_Stanford": glove_vectors,
          "Unbiased_GloVe": unbiased_glove_vectors,
          "Word2Vec": word2vec_vectors,
          "Unbiased_Word2Vec": unbiased_word2vec_vectors}

In [None]:
# https://arxiv.org/pdf/1608.00869.pdf
#url = "https://www.repository.cam.ac.uk/bitstream/handle/1810/264124/simverb-3500-data.zip"
url = "https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/evaluation/simverb-3500-data.zip"
simverb = read_csv(ZipFile(BytesIO(requests.get(url).content)).open("data/SimVerb-3500.txt"), sep="\t", header=None)
simverb_path = "data/SimVerb-3500.txt"
# Keep word1, word2 and the similarity rating (columns 0, 1 and 3)
simverb = simverb[[0, 1, 3]]
simverb.to_csv(simverb_path, sep="\t", header=False, index=False)
simverb

Unnamed: 0,0,1,3
0,take,remove,6.81
1,walk,trail,4.81
2,feed,starve,1.49
3,shine,polish,7.80
4,calculate,add,5.98
...,...,...,...
3495,impose,cheat,1.16
3496,rebel,protest,7.64
3497,collaborate,conspire,4.23
3498,conspire,protest,1.83


In [None]:
# https://fh295.github.io/simlex.html
#url = "https://fh295.github.io/SimLex-999.zip"
url = "https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/evaluation/SimLex-999.zip"
simlex = read_csv(ZipFile(BytesIO(requests.get(url).content)).open("SimLex-999/SimLex-999.txt"), sep="\t")
simlex = simlex[["word1", "word2", "SimLex999"]]
simlex_path = "data/SimLex-999.txt"
simlex.to_csv(simlex_path, sep="\t", header=False, index=False)
simlex

Unnamed: 0,word1,word2,SimLex999
0,old,new,1.58
1,smart,intelligent,9.20
2,hard,difficult,8.77
3,happy,cheerful,9.55
4,hard,easy,0.95
...,...,...,...
994,join,acquire,2.85
995,send,attend,1.67
996,gather,attend,4.80
997,absorb,withdraw,2.97


In [None]:
# https://staff.fnwi.uva.nl/e.bruni/MEN
#url = "https://staff.fnwi.uva.nl/e.bruni/resources/MEN.zip"
url = "https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/evaluation/MEN.zip"
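# Note: without header=None, pandas uses the first pair (sun, sunlight, 50.0) as column names,
# so that pair is dropped from the saved file; pass header=None to keep all 3,000 pairs.
men = read_csv(ZipFile(BytesIO(requests.get(url).content)).open("MEN/MEN_dataset_natural_form_full"), sep=" ")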
men_path = "data/MEN.txt"
men.to_csv(men_path, sep="\t", header=False, index=False)
men

Unnamed: 0,sun,sunlight,50.000000
0,automobile,car,50.0
1,river,water,49.0
2,stairs,staircase,49.0
3,morning,sunrise,49.0
4,rain,storm,49.0
...,...,...,...
2994,feathers,truck,1.0
2995,festival,whiskers,1.0
2996,muscle,tulip,1.0
2997,bikini,pizza,1.0


In [None]:
TESTS = {"simverb": simverb_path,
         "simlex": simlex_path,
         "men": men_path}

for model_name, model in MODELS.items():
    print(f"Model: {model_name}")
    for test_name, test in TESTS.items():
        print(f"  Test: {test_name}")
        print("    " + str(model.evaluate_word_pairs(test)))

# Output format: ((Pearson r, Pearson p-value), SpearmanrResult(correlation, pvalue), percentage of pairs skipped because a word is out of vocabulary)

Model: GloVe_Stanford
  Test: simverb
    ((0.23014955028065393, 2.822815350386805e-43), SpearmanrResult(correlation=0.226666174150116, pvalue=5.3917592196100066e-42), 0.05714285714285715)
  Test: simlex
    ((0.3877827890006341, 3.687650450670567e-37), SpearmanrResult(correlation=0.3692121013345344, pvalue=1.371018956747063e-33), 0.10010010010010009)
  Test: men
    ((0.7431576592641206, 0.0), SpearmanrResult(correlation=0.7484863201719878, pvalue=0.0), 0.0)
Model: Unbiased_GloVe
  Test: simverb
    ((0.21966858392073654, 2.2873693364746173e-39), SpearmanrResult(correlation=0.21820033173276157, pvalue=7.468453915772925e-39), 0.37142857142857144)
  Test: simlex
    ((0.34692400883694824, 1.3308688685445175e-29), SpearmanrResult(correlation=0.33936603261749826, pvalue=2.541683378519855e-28), 0.10010010010010009)
  Test: men
    ((0.5967259999587126, 8.321626556920017e-289), SpearmanrResult(correlation=0.5944851269439396, pvalue=4.1124826205567506e-286), 0.0)
Model: Unbiased_Word2Vec
  T
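
Each printed result is a 3-tuple: the Pearson pair, the Spearman result, and the percentage of pairs skipped because a word is missing from the vocabulary. The sketch below unpacks one such tuple into a flat dictionary, which is easier to tabulate. The values are copied from the GloVe_Stanford / simlex line above; the local `SpearmanrResult` namedtuple is a stand-in for SciPy's class so that the example has no dependencies, and `summarize` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for scipy's SpearmanrResult (same two fields).
SpearmanrResult = namedtuple("SpearmanrResult", ["correlation", "pvalue"])

# Copied from the GloVe_Stanford / simlex output.
result = (
    (0.3877827890006341, 3.687650450670567e-37),
    SpearmanrResult(correlation=0.3692121013345344, pvalue=1.371018956747063e-33),
    0.10010010010010009,
)

def summarize(result):
    """Flatten an evaluate_word_pairs-style tuple into rounded, named fields."""
    (pearson_r, _pearson_p), (spearman_r, _spearman_p), oov_percent = result
    return {
        "pearson_r": round(pearson_r, 3),
        "spearman_r": round(spearman_r, 3),
        "oov_percent": round(oov_percent, 2),
    }

summary = summarize(result)
print(summary)
```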

In [None]:
# url = "http://download.tensorflow.org/data/questions-words.txt"
url = "https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/evaluation/questions-words.txt"
google_analogies_path = gdown.download(url, output="data/google_analogies.txt")

Downloading...
From: https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/evaluation/questions-words.txt
To: /content/data/google_analogies.txt
100%|██████████| 604k/604k [00:00<00:00, 10.4MB/s]


In [None]:
#url = "https://github.com/vecto-ai/word-benchmarks/raw/master/word-analogy/monolingual/en/msr.csv"
url = "https://renkulab.io/gitlab/samuel.betrisey/embedding-classical-evaluation/-/raw/master/data/evaluation/msr.csv"
msr = read_csv(url, sep=",")
msr_path = "data/msr.txt"
msr = msr[["word1", "word2", "word3", "target"]]
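# evaluate_word_analogies expects a section header line starting with ": ",
# so the header row is written as ":" followed by empty column names.
msr.to_csv(msr_path, sep=" ", header=[':', '', '', ''], index=False)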
msr

Unnamed: 0,word1,word2,word3,target
0,good,better,rough,rougher
1,better,good,rougher,rough
2,good,best,rough,roughest
3,best,good,roughest,rough
4,best,better,roughest,rougher
...,...,...,...,...
7995,sent,send,avoided,avoid
7996,send,sends,avoid,avoids
7997,sends,send,avoids,avoid
7998,sends,sent,avoids,avoided
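
The analogy file format used here is the one of `questions-words.txt`: a line starting with `:` opens a section, and every following line holds four space-separated words. As a rough sketch of that layout, the hypothetical helpers below write and read such a file with the standard library only; the section name `comparatives` is made up.

```python
import tempfile
from pathlib import Path

def write_analogy_file(path, sections):
    """Write {section name: [(a, b, c, target), ...]} in questions-words style."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, questions in sections.items():
            handle.write(f": {name}\n")
            for question in questions:
                handle.write(" ".join(question) + "\n")

def read_analogy_file(path):
    """Parse a questions-words style file back into a dictionary of sections."""
    sections = {}
    current = ""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(":"):
                # Section header; the name may be empty, as in the MSR file above.
                current = line[1:].strip()
                sections.setdefault(current, [])
            else:
                sections.setdefault(current, []).append(tuple(line.split()))
    return sections

toy_sections = {
    "comparatives": [("good", "better", "rough", "rougher"),
                     ("better", "good", "rougher", "rough")],
}

with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "toy_analogies.txt"
    write_analogy_file(demo_file, toy_sections)
    first_line = demo_file.read_text(encoding="utf-8").splitlines()[0]
    parsed = read_analogy_file(demo_file)

print(first_line)
print(parsed)
```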


In [None]:
TESTS = {"msr": msr_path, "google_analogies": google_analogies_path}

scores_by_model_test = {}
details_by_model_test = {}

for model_name, model in MODELS.items():
    print(f"Model: {model_name}")
    scores_by_model_test[model_name] = {}
    details_by_model_test[model_name] = {}
    for test_name, test in TESTS.items():
        print(f"  Test: {test_name}")
        score, details = model.evaluate_word_analogies(test)
        scores_by_model_test[model_name][test_name] = score
        details_by_model_test[model_name][test_name] = details
        print("    " + str(score))

Model: GloVe_Stanford
  Test: msr
    0.6427142857142857
  Test: google_analogies
    0.7195422354510931
Model: Unbiased_GloVe
  Test: msr
    0.5343731946851531
  Test: google_analogies
    0.618961577297125
Model: Word2Vec
  Test: msr
    0.7362857142857143
  Test: google_analogies
    0.7401448525607863
Model: Unbiased_Word2Vec
  Test: msr
    0.7365714285714285
  Test: google_analogies
    0.7373512674599069
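
The overall score hides differences between question categories. Assuming each entry of `details` is a dictionary with `section`, `correct`, and `incorrect` keys, a per-section accuracy can be derived as sketched below. The `toy_details` list is made up to mimic that structure, and `section_accuracy` is a hypothetical helper.

```python
def section_accuracy(details):
    """Map each section name to correct / (correct + incorrect)."""
    accuracy = {}
    for section in details:
        n_correct = len(section["correct"])
        n_total = n_correct + len(section["incorrect"])
        if n_total:  # skip sections where every question was out of vocabulary
            accuracy[section["section"]] = n_correct / n_total
    return accuracy

# Made-up example with the assumed structure.
toy_details = [
    {"section": "gram3-comparative",
     "correct": [("GOOD", "BETTER", "ROUGH", "ROUGHER"),
                 ("SLOW", "SLOWER", "NICE", "NICER"),
                 ("MILD", "MILDER", "WISE", "WISER")],
     "incorrect": [("TASTY", "TASTIER", "STICKY", "STICKIER")]},
    {"section": "gram8-plural",
     "correct": [("SYSTEM", "SYSTEMS", "CITIZEN", "CITIZENS")],
     "incorrect": [("CHILD", "CHILDREN", "MOUSE", "MICE")]},
    {"section": "empty-section", "correct": [], "incorrect": []},
]

per_section = section_accuracy(toy_details)
for name, value in per_section.items():
    print(f"{name}: {value:.2f}")
```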


## Let's explore the results

In [None]:
def get_correct_incorrect(details):
    """Collect the correct and incorrect analogy tuples across all sections."""
    correct = set()
    incorrect = set()
    for section in details:
        correct.update(section['correct'])
        incorrect.update(section['incorrect'])
    return correct, incorrect

original_unbiased_pairs = [("GloVe_Stanford", "Unbiased_GloVe"), ("Word2Vec", "Unbiased_Word2Vec")]
for original, unbiased in original_unbiased_pairs:
    print(f"{original} -> {unbiased}")
    for test_name, test in TESTS.items():
        print(test_name)
        print(f"Accuracy: {scores_by_model_test[original][test_name]} -> {scores_by_model_test[unbiased][test_name]}")
        details_original = details_by_model_test[original][test_name]
        details_unbiased = details_by_model_test[unbiased][test_name]
        correct_original, incorrect_original = get_correct_incorrect(details_original)
        correct_unbiased, incorrect_unbiased = get_correct_incorrect(details_unbiased)
        # show results that were correct with original and are now wrong and the opposite
        print("correct -> incorrect")
        print(correct_original.intersection(incorrect_unbiased))
        print("incorrect -> correct")
        print(incorrect_original.intersection(correct_unbiased))
        print()
    print()

GloVe_Stanford -> Unbiased_GloVe
msr
Accuracy: 0.6427142857142857 -> 0.5343731946851531
correct -> incorrect
{('PRETTIER', 'PRETTY', 'SHORTER', 'SHORT'), ('LARGEST', 'LARGE', 'WEAKEST', 'WEAK'), ('SLOW', 'SLOWER', 'NICE', 'NICER'), ('CHANGES', 'CHANGE', 'SAVES', 'SAVE'), ('TASTY', 'TASTIER', 'STICKY', 'STICKIER'), ('RAISE', 'RAISED', 'UNDERSTAND', 'UNDERSTOOD'), ('FUNNIER', 'FUNNIEST', 'LARGER', 'LARGEST'), ('SELL', 'SELLS', 'LIKE', 'LIKES'), ('NARROWEST', 'NARROWER', 'STEADIEST', 'STEADIER'), ('MILD', 'MILDEST', 'RISKY', 'RISKIEST'), ('STICKY', 'STICKIER', 'WISE', 'WISER'), ('SYSTEM', 'SYSTEMS', 'CITIZEN', 'CITIZENS'), ('SLIMMER', 'SLIMMEST', 'MILDER', 'MILDEST'), ('SLIMMER', 'SLIMMEST', 'WISER', 'WISEST'), ('FAIREST', 'FAIRER', 'MILDEST', 'MILDER'), ('GENTLE', 'GENTLEST', 'CLOSE', 'CLOSEST'), ('SMALLEST', 'SMALLER', 'NICEST', 'NICER'), ('PREVENTS', 'PREVENT', 'CONSIDERS', 'CONSIDER'), ('SMALLER', 'SMALLEST', 'BROADER', 'BROADEST'), ('MILDEST', 'MILDER', 'EASIEST', 'EASIER'), ('CLEARE

## Experiment with your own analogies

Change the variables `a`, `b`, `c` to see the results of each embedding:

In [None]:
a = 'male'
# is to
b = 'female'
# as
c = 'doctor'
# is to ...

print(f"{a} is to {b} as {c} is to...")
for model_name, model in MODELS.items():
    print(model_name + ': ', end='')
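    # Query with the raw vector b - a + c. Unlike most_similar(positive=[b, c], negative=[a]),
    # this does not exclude the input words, so c itself usually ranks first.
    v = model.get_vector(b) - model.get_vector(a) + model.get_vector(c)
    d = model.most_similar(v, topn=3)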
    print(d)

male is to female as doctor is to...
GloVe_Stanford: [('doctor', 0.9061827063560486), ('physician', 0.6464920043945312), ('nurse', 0.5715017914772034)]
Unbiased_GloVe: [('doctor', 0.8750905394554138), ('doctors', 0.5951226949691772), ('physician', 0.5413261651992798)]
Word2Vec: [('doctor', 0.8938288688659668), ('physician', 0.6915890574455261), ('doctors', 0.6790844202041626)]
Unbiased_Word2Vec: [('doctor', 0.8294335007667542), ('gynecologist', 0.6915062069892883), ('physician', 0.6491522789001465)]
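
In all four outputs the top result is the query word `doctor` itself, because a lookup by raw vector does not filter out the input words. The toy sketch below shows how the ranking changes once the three query words are excluded. The 3-d vectors are invented for illustration and say nothing about the real embeddings; `nearest` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(vectors, query, topn=3, exclude=()):
    """Return the topn words closest to `query`, skipping the excluded words."""
    candidates = [word for word in vectors if word not in exclude]
    ranked = sorted(candidates, key=lambda word: cosine(vectors[word], query), reverse=True)
    return ranked[:topn]

# Invented toy vectors, for illustration only.
toy_vectors = {
    "male": [0.3, 0.0, 0.0],
    "female": [0.0, 0.3, 0.0],
    "doctor": [0.5, 0.5, 3.0],
    "physician": [0.6, 0.4, 2.8],
    "nurse": [0.1, 1.5, 2.0],
}

a, b, c = "male", "female", "doctor"
query = [vb - va + vc for va, vb, vc in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]

with_inputs = nearest(toy_vectors, query, topn=3)
without_inputs = nearest(toy_vectors, query, topn=2, exclude={a, b, c})
print("including query words:", with_inputs)
print("excluding query words:", without_inputs)
```

The offset `b - a` is small compared with the vector of `c`, so the query stays closest to `c` unless it is excluded.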


## References

[1] Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).

[2] Zhao, J., Zhou, Y., Li, Z., Wang, W., & Chang, K. W. (2018). Learning gender-neutral word embeddings. arXiv preprint arXiv:1809.01496.

[3] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[4] Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29.
