ICML-2019 Supporting Code for Submission #960 (Model Comparison for Semantic Grouping)

Note: This repository is no longer actively maintained by Babylon Health. For further assistance, contact the paper authors.

This repo consists of a small set of modifications to SentEval that reproduce the results of accepted submission #960 @ICML2019. The manuscript can be found here.

NOTE: This repo is a fork of SentEval. Any commits made before 2019 are not associated with publication #960 @ICML2019.

Dependencies

This code is written in Python. The dependencies are:

Python 2 with NumPy/SciPy
PyTorch
scikit-learn>=0.18.0
Autograd

Download datasets

To get all the transfer task datasets, run the following in data/ (requires Bash >= 4.0):

./get_transfer_data_ptb.bash

This will automatically download and preprocess the datasets and store them in data/senteval_data (warning: macOS users may have to use p7zip instead of unzip). Note: we provide PTB or MOSES tokenization.

WARNING: Extracting the MRPC MSI file requires the "cabextract" command-line tool (e.g., apt-get/yum install cabextract).

This will also download glove.840B.300d.txt and enwiki_vocab_min200.txt (the SIF frequencies from Arora et al. 2016).

To download the other word vectors, go to GoogleNews-word2vec and FastText, convert the binary files into the same text file format as glove.840B.300d.txt, and place them in /data/word_vectors. We could not upload them to GitHub because they exceed the allowed disk quota.
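The conversion step is not scripted in this repo, so here is a minimal sketch of one way to do it. It assumes the input uses the original word2vec binary layout (an ASCII header line "vocab_size dim", then for each entry the word, a space, and dim little-endian float32 values) and that the target format is one "word v1 v2 ... vd" line per entry with no header. The function name word2vec_bin_to_text is hypothetical, and the sketch targets Python 3; files in other binary formats would need different handling.

```python
import io
import struct


def word2vec_bin_to_text(src_path, dst_path, encoding="utf-8"):
    """Convert a word2vec-style binary file to headerless GloVe-style text.

    Hypothetical helper (not part of this repo). Returns the number of
    vectors written.
    """
    with io.open(src_path, "rb") as src, \
            io.open(dst_path, "w", encoding=encoding) as dst:
        # Header line: "<vocab_size> <dim>\n" in ASCII.
        vocab_size, dim = (int(x) for x in src.readline().split())
        fmt = "<%df" % dim              # dim little-endian float32 values
        nbytes = struct.calcsize(fmt)
        written = 0
        for _ in range(vocab_size):
            # Read the word byte by byte up to the separating space,
            # skipping any newline left over from the previous record.
            chars = bytearray()
            while True:
                ch = src.read(1)
                if not ch:
                    raise ValueError("unexpected end of file in a word")
                if ch == b" ":
                    break
                if ch != b"\n":
                    chars += ch
            buf = src.read(nbytes)
            if len(buf) != nbytes:
                raise ValueError("unexpected end of file in a vector")
            vector = struct.unpack(fmt, buf)
            word = chars.decode(encoding, errors="replace")
            values = " ".join("%.6g" % v for v in vector)
            dst.write(word + " " + values + "\n")
            written += 1
    return written
```

A call such as word2vec_bin_to_text("vectors.bin", "vectors.txt") would write the text file next to the binary one; both paths here are placeholders. The "%.6g" format trades a little precision for smaller files, so raise the digit count if exact float32 round-tripping matters.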

Reproduce Results for Submission #960

To reproduce the results, run:

cd examples
python arora.py # To reproduce Arora et al. (2016)'s SIF+PCA results
python gaussian.py # To reproduce our Gaussian-AIC/TIC results
python vmf.py  # To reproduce our vMF-AIC/TIC results

These scripts reproduce the results for glove.840B.300d.txt first and may then crash if you have not downloaded the other word vectors.
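To avoid a crash partway through a run, it can help to check which word vector files are present before launching the scripts. The following is a minimal sketch, not part of this repo: the helper name missing_vector_files is hypothetical, and both the directory and the list of expected file names are assumptions you would adapt to your own setup.

```python
import os


def missing_vector_files(vector_dir, expected_names):
    """Return the names from expected_names that are not files in vector_dir.

    Hypothetical pre-flight helper; it only checks for existence and does
    not validate the contents of the files.
    """
    return [name for name in expected_names
            if not os.path.isfile(os.path.join(vector_dir, name))]


# Example call (directory and list are assumptions; extend the list with
# whatever names you gave the converted word2vec and FastText files):
#   missing_vector_files("data/word_vectors", ["glove.840B.300d.txt"])
```

An empty list means every expected file was found; any returned names are the files to download or convert before running the examples.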

Similarity code

All code for the similarity metrics described in the paper lives in the similarity folder, which contains the core contributions of our work.

