ICML-2019 Supporting Code for Submission #960 (Model Comparison for Semantic Grouping)
NOTE: This repo is a fork of SentEval. Commits made before 2019 are not associated with publication #960 at ICML 2019.
This code is written in Python. The dependencies are:
To get all the transfer task datasets, run the following in data/ (requires Bash >= 4.0):
This will automatically download and preprocess the datasets and store them in data/senteval_data (warning: macOS users may have to use p7zip instead of unzip). Note: we provide PTB or MOSES tokenization.
This will also download glove.840B.300d.txt and enwiki_vocab_min200.txt (the SIF word frequencies from Arora et al. 2016).
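As a rough illustration of how a frequency file such as enwiki_vocab_min200.txt is typically used, the sketch below computes SIF-style weights a / (a + p(w)) following Arora et al. It assumes each line holds a word and its count separated by whitespace; the function name `sif_weights` and the default `a=1e-3` are illustrative choices and are not taken from this repository.

```python
def sif_weights(count_lines, a=1e-3):
    """Map each word to the SIF weight a / (a + p(w)).

    Assumption: every line looks like "<word> <count>"; other lines are skipped.
    """
    counts = {}
    for line in count_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        counts[parts[0]] = float(parts[1])
    total = sum(counts.values())
    # p(w) is the relative frequency of w in the corpus the counts came from.
    return {word: a / (a + count / total) for word, count in counts.items()}


if __name__ == "__main__":
    # Toy counts, purely for illustration.
    weights = sif_weights(["the 900", "cat 100"], a=0.1)
    for word, weight in weights.items():
        print(word, round(weight, 3))
```

Frequent words receive small weights and rare words receive weights close to 1, which is why the weighted average of word vectors down-weights function words.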
To use the other word vectors, download them from GoogleNews-word2vec and FastText, convert any binary files into the same text file format as glove.840B.300d.txt, and place them in /data/word_vectors. We could not upload them to GitHub because they exceed the allowed disk quota.
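The conversion step can be sketched as follows for a word2vec-style binary file (a text header line with the vocabulary size and dimension, then each word followed by a space and its little-endian float32 values). This is a minimal sketch under that assumption, not the script used for the paper; the function name `word2vec_bin_to_glove_txt` and the file names in the usage note are hypothetical. It does not apply to fastText's own `.bin` model files, which use a different layout.

```python
import struct


def word2vec_bin_to_glove_txt(bin_path, txt_path):
    """Rewrite a word2vec-style binary file as GloVe-style text.

    Output has one "word v1 v2 ... vd" line per word and no header line.
    Returns (vocab_size, dim) as read from the binary header.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        vocab_size, dim = (int(x) for x in src.readline().split())
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = src.read(1)
                if ch == b"":
                    raise ValueError("unexpected end of file while reading a word")
                if ch == b" ":
                    break  # a space terminates the word
                if ch != b"\n":  # some writers emit a newline before each word
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="replace")
            raw = src.read(4 * dim)
            if len(raw) != 4 * dim:
                raise ValueError("unexpected end of file while reading a vector")
            vector = struct.unpack("<%df" % dim, raw)
            dst.write(word + " " + " ".join("%.6f" % v for v in vector) + "\n")
    return vocab_size, dim
```

A call such as `word2vec_bin_to_glove_txt("vectors.bin", "vectors.txt")` (hypothetical file names) would then produce a text file that can be placed alongside glove.840B.300d.txt.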
Reproduce Results for Submission #960
In order to reproduce results please run:
    cd examples
    python arora.py     # To reproduce Arora et al. (2016)'s SIF+PCA results
    python gaussian.py  # To reproduce our Gaussian-AIC/TIC results
    python vmf.py       # To reproduce our vMF-AIC/TIC results
Each script first reproduces the results for glove.840B.300d.txt and may crash afterwards if you have not downloaded the other word vectors.
All code for the similarity metrics described in the paper lives in the similarity folder. This is where the core contributions of our work are.